https://www.youtube.com/watch?v=OYcdw5vOgIc&list=PLUl4u3cNGP60uVBMaoNERc6knT_MgPKS0&index=21 Transcript 00:00 00:00 The following content is provided under a Creative 00:02 Commons license. 00:03 Your support will help MIT OpenCourseWare 00:06 continue to offer high quality educational resources for free. 00:10 To make a donation or to view additional materials 00:12 from hundreds of MIT courses, visit MIT OpenCourseWare 00:16 at ocw.mit.edu. 00:17 00:32 PHILIPPE RIGOLLET: All right. 00:34 So let's continue talking about maximum likelihood estimation 00:41 in the context of generalized linear models, all right? 00:43 So in those generalized linear models, 00:45 what we spent most of the past lectures working on 00:49 is the conditional distribution of Y given 00:55 X. And we're going to assume that this 00:59 follows some distribution in the exponential family. 01:08 01:12 OK. 01:13 And so what it means is that if we look at the density, say-- 01:18 or the PMF, but let's talk about density 01:21 to make things clearer-- 01:23 we're going to assume that Y given X has a distribution. 01:29 So X is now fixed, because we're conditioning on it. 01:31 And it has a density, which is of this form, c of Yi phi. 01:50 OK. 01:50 So this c, again, we don't really need to think about it. 01:53 This is something that's going to come up naturally 01:55 as soon as you need a normalization factor. 01:58 And so here what it means, if this is the distribution of Y 02:02 given Xi, so that's the density of Yi 02:08 given Xi is equal to little xi. 02:11 So if it's the conditional distribution of Yi given Xi, 02:15 it should depend on xi somehow. 02:18 And it does not appear to depend on Xi. 02:20 And here, the model is going to be on theta i, which is just 02:26 a function, theta i of Xi. 02:29 And we're going to take a very specific one. 02:30 It's going to be a function of a linear form of the Xi. 02:38 So really we're going to take something which is of the form 02:40 theta i, which is really just-- the function theta does not depend 02:43 on i-- 02:44 a function of Xi transpose beta. 02:47 OK? 02:47 So all these parts here, this is really some modeling 02:50 assumptions that we're making once we've agreed 02:54 on what distribution we want. 02:55 OK. 02:56 So to do that, our goal, of course, 02:58 is going to try to understand what this beta is. 03:01 There's one beta here. 03:02 What's important is that this beta does not depend on i. 03:12 So if I observe pairs Xi, Yi-- 03:17 let's say I observe n of them, i equals 1 to n-- 03:22 the hope is that as I accumulate more and more pairs 03:25 of this form where there's always the same parameter that 03:29 links Xi to Yi, that's this parameter beta, 03:32 that I should have a better and better estimation of this beta. 03:35 Because it's always the same. 03:37 And that's essentially what couples 03:38 all of our distributions. 03:40 If I did not assume this, then I could 03:42 have a different distribution for each pair Xi, Yi. 03:46 And I would not be able to do any statistics. 03:48 Nothing would average in the end. 03:50 But here I have the same beta, which 03:52 means that I can hope to do statistics and average 03:54 errors in the end. 03:56 OK. 03:56 So I'm going to collect, so I'll come back to this. 04:00 But as usual in the linear regression model, 04:02 we're going to collect all our observations Yi. 04:05 So I'm going to assume that they're real valued 04:08 and that my Xi's take values in Rp 04:10 just like in the regression model. 
04:12 And I'm going to collect all my Yi's into one big vector of Y 04:16 in our n and all my X's into one big matrix in Rn times p 04:23 just like for the linear regression model. 04:26 All right, so, again, what I'm interested in here 04:30 is the conditional distribution of Yi given Xi. 04:36 OK. 04:36 I said this is this distribution. 04:39 When we're talking about regression, 04:40 I defined last time what the definition 04:42 of regression function was. 04:44 It's just one particular aspect of 04:46 this conventional distribution. 04:48 It's the conditional expectation of Yi given Xi. 04:51 OK. 04:52 And so this conditional expectation, 04:53 I will denote it by-- 04:57 so I talk about the conditional, I'm 05:06 going to call it, say, mu i, which 05:08 is the conditional expectation of Yi given 05:11 Xi equals some little xi, say. 05:15 You can forget about this part if you find it confusing. 05:17 It really doesn't matter. 05:18 It's just that this means that this 05:21 is a function of little xi. 05:25 But if I only had the expectation of Yi given big Xi, 05:29 this would be just a function of big Xi. 05:32 So it really doesn't change anything. 05:34 It's just a matter of notation. 05:36 OK. 05:36 So just forget about this part. 05:39 But I'll just do it like that here. 05:41 OK. 05:42 So this is just the conditional expectation of Yi given Xi. 05:46 It just depends on Xi, so I think it depends on i, 05:50 and so I will call it mu i. 05:52 But I know that since in a canonical exponential family, 05:55 then I know that mu i is actually B prime of theta i. 06:00 OK. 06:01 So there's a 1 to 1 link between the canonical parameter 06:04 of my exponential family and the mean mu 06:07 i, the conditional expectation. 06:09 And the modeling assumption we're going to make 06:12 is not directly-- 06:14 remember, that was the second aspect 06:15 of the generalized linear model. 06:17 We're not going to assume that theta i itself directly 06:20 depends on Xi. 06:22 We're going to assume that mu i has 06:23 a particular dependence on Xi through the link function. 06:28 So, again, we're back to modeling. 06:31 So we have a link function g. 06:33 06:38 And we assume that mu i depends on Xi as follows. 06:51 06:56 g of mu i-- 06:58 and remember, all g does for us is really 07:02 map the space in which mu i lives, which 07:05 could be just the interval 0, 1 to the entire real line, 07:08 all right? 07:09 And we're going to assume that this thing that 07:11 lives in the real line is just Xi transpose beta. 07:14 I should maybe put a small one, Xi transpose beta. 07:17 OK? 07:17 07:20 So we're making, indeed, some modeling assumption. 07:23 But compared to in the linear regression model, 07:26 we only assume that mu i was Xi transpose beta. 07:29 So if you want to make a parallel 07:30 between generalized linear models 07:32 and linear model is the only difference 07:35 is that g is not the identity necessarily in this case. 07:39 And all the g does for us is to just 07:42 make this thing compatible, that those two 07:45 things on the left and the right of equality 07:48 live in the same space. 07:50 So in a way, we're not making a much bigger leap of faith 07:56 by assuming a linear model. 07:57 The linear link is already here. 07:59 We're just making things compatible, all right? 08:04 And so it's always the same link function. 
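To keep the pieces of the model straight, here is a compact restatement, in LaTeX, of what has been described so far; the notation follows the lecture (b, phi, c are the exponential family ingredients, g is a generic link function).

```latex
% Conditional density of Y_i given X_i = x_i (canonical exponential family):
f_{\theta_i}(y_i) \;=\; \exp\!\Big(\frac{y_i\,\theta_i - b(\theta_i)}{\phi} + c(y_i,\phi)\Big),
\qquad \theta_i = \theta(x_i).

% Mean (regression function) and link function:
\mu_i \;=\; \mathbb{E}[Y_i \mid X_i = x_i] \;=\; b'(\theta_i),
\qquad g(\mu_i) \;=\; x_i^\top \beta .
```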
08:09 So now if I want to go back to beta-- 08:12 right, because I'm going to want to express my likelihood-- 08:16 if I were to express my likelihood from this, 08:18 it would just be a function of theta, right? 08:21 And so if I want to maximize my likelihood, 08:22 I don't want to maximize it in theta. 08:24 I want to maximize it in beta. 08:26 So if I can write my density as a function of beta, 08:29 then I will be able to write my likelihood 08:31 as a function of beta, and then talk 08:33 about my maximum likelihood estimator. 08:35 And so all they need to do is to just say, OK, 08:37 how do I replace theta by-- 08:39 I know that theta is a function of beta, right? 08:42 I wrote it here. 08:45 So the question is, what is this function? 08:46 And I actually have access to all of this. 08:49 So what I know is that theta-- 08:50 right, so mu is b prime of theta, 08:54 which means that theta i is b prime inverse of mu i. 09:02 OK. 09:02 So that's what we've got from this derivative of the log 09:06 likelihood equal to 0. 09:07 That give us this guy inverted. 09:09 And now I know that mu i is g inverse of Xi beta. 09:13 09:23 So this composition of b prime inverse and g inverse 09:27 is actually just the composition of g with b prime. 09:30 09:38 Everybody's comfortable with this notation, 09:40 the little circle? 09:42 Any question about this? 09:43 It just means that I first applied b prime. 09:45 Well, actually, it's D inverse. 09:47 But if I look at a function g composed with b prime, 09:49 I first applied the g b prime of x, is just g of b prime of x. 09:59 OK. 10:00 And then I take the inverse of this function, 10:02 which is first take g inverse, and then take b prime inverse. 10:05 10:09 OK. 10:09 So now I have everywhere I saw theta, 10:12 now I see this function of beta. 10:16 So I could technically plug that in. 10:18 Of course, it's a little painful to have 10:20 to write g circle beta prime all the time. 10:22 So I'm going to give this guy a name. 10:24 And so you're just going to define 10:29 h, which is g b prime inverse so that theta i is simply 10:39 h of Xi transpose beta. 10:42 OK. 10:43 I could give it a name, you know. 10:46 But let's just call that the h function. 10:47 10:52 And something which is nice about this h function 10:54 is that if g is the canonical link-- 11:01 11:04 what is the canonical link? 11:05 11:12 So what is it canonical to? 11:15 A canonical link, it's canonical to a particular distribution 11:19 in the canonical exponential family, right? 11:22 A canonical exponential family is completely characterized 11:26 by the function b. 11:28 Which means that if I want to talk about the canonical link, 11:31 all I need to tell you is how it depends on b. 11:35 So what is g as a function of b? 11:37 11:40 AUDIENCE: [INAUDIBLE] 11:41 PHILIPPE RIGOLLET: b inverse. 11:42 11:45 b prime inverse, right? 11:47 So this is g is equal to b prime inverse, which 11:52 means that if g is composed with b prime that means 11:56 that this is just the identity. 11:58 12:02 So h is the identity. 12:04 12:11 So h of Xi transpose beta is simply Xi transpose beta. 12:18 And it's true that the way we introduce the canonical link 12:22 was just the function for which we model directly theta i 12:26 as Xi transpose beta, which we can read off here, right? 12:30 So theta i is simply Xi transpose beta. 12:34 So now, for example, if I go back to my log-likelihood, 12:43 so if I look log-likelihood, the log-likelihood 12:51 is sum of the log of the densities. 
12:56 So it's sum from i equal 1 to n of log of exponential Yi theta 13:06 i minus b theta i divided by phi plus c of Yi phi. 13:14 So this term does not depend on theta. 13:19 So I have two things. 13:19 First of all, the log and the exponential 13:22 are going to cancel each other. 13:23 And second, I actually know that theta is just 13:28 a function of beta. 13:29 And it has this form. 13:30 Theta i is h of Xi transpose beta. 13:32 And that's my modeling assumption. 13:34 So this is actually equal to the sum from i equal 1 to n of Yi. 13:41 And then here I'm going to write h of Xi transpose beta 13:45 minus b of h of Xi transpose beta divided by phi. 13:54 And then I have, again, this function 13:56 c of Yi phi, which again won't matter. 14:00 Because when I'm going to try to maximize this thing, 14:03 this is just playing the role of a constant 14:05 that's shifting the entire function. 14:07 In particular, your max is going to be exactly what it was. 14:10 OK? 14:12 So this thing is really not going to matter for me. 14:14 I'm keeping track of it. 14:16 And actually, if you look here, it's gone, right? 14:20 It's gone, because it does not matter. 14:23 So let's just pretend it's not here, 14:26 because it won't matter when I'm trying 14:28 to maximize the likelihood. 14:31 OK? 14:33 While it's here up to constant term, it says. 14:36 That's the constant term. 14:37 14:40 All right, any question? 14:41 14:44 All I'm doing here is replacing my likelihood as a function 14:49 of theta i's. 14:50 So if I had one theta i per observation, 14:52 again, this would not help me very much. 14:56 But if I assume that they are all linked together 14:58 by saying that theta i is of the form Xi transpose beta 15:02 or h of Xi transpose beta if I'm not using the canonical link, 15:07 then I can hope to make some estimation. 15:10 And so, again, if I have the canonical link, 15:14 h is the identity. 15:15 So I'm left only with Yi times Xi transpose beta. 15:19 And then I have b of Xi transpose beta 15:21 and not b composed with h, because h is the identity, 15:26 which is fairly simple, right? 15:28 Why is it simple? 15:30 Well, let's actually focus on this guy for one second. 15:35 So let me write it down, so we know what we're talking about. 15:38 15:47 So we just showed that the log-likelihood when I use 15:52 the canonical link-- 15:54 so that h is equal to the identity, 15:57 the log-likelihood actually takes the form ln. 16:00 And it depends on a bunch of stuff. 16:01 But let's just make it depend only on the parameter 16:04 that we care about, which is beta, all right? 16:06 So this is of the form l of beta and that's equal to what? 16:09 It's the sum from i equal 1 to n of Yi Xi transpose beta minus-- 16:16 let me put the phi here. 16:18 And then I'm going to have minus b of Xi transpose beta. 16:24 16:28 OK. 16:29 And phi we know is some known positive term. 16:33 So again, optimizing a function plus some constant 16:37 or optimizing of function time as a constant, 16:39 that's not going to change much either. 16:41 So it won't really matter to think about whether this phi is 16:44 here or not. 16:45 But let's just think about what this function looks like. 16:48 I'm trying to maximize a function. 16:50 I'm trying to maximize a log-likelihood. 16:52 If it looked like this, that would be a serious problem. 16:57 But we can do like a basic, you know, back 16:59 of the envelope guess of what the variations of this function 17:03 is. 
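Written out, the log-likelihood just derived looks as follows, dropping the c(Yi, phi) terms that do not affect the maximization; the first display uses a general link through h = (g composed with b prime) inverse, the second uses the canonical link, for which h is the identity.

```latex
% General link, with h = (g \circ b')^{-1}, so that \theta_i = h(x_i^\top\beta):
\ell_n(\beta) \;=\; \sum_{i=1}^n \frac{Y_i\, h(x_i^\top\beta) \;-\; b\big(h(x_i^\top\beta)\big)}{\phi} \;+\; \text{const}.

% Canonical link (h = \mathrm{id}):
\ell_n(\beta) \;=\; \sum_{i=1}^n \frac{Y_i\, x_i^\top\beta \;-\; b(x_i^\top\beta)}{\phi} \;+\; \text{const}.
```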
17:04 This first term here is-- 17:08 as a function of beta, what kind of function is it? 17:11 AUDIENCE: Linear. 17:12 PHILIPPE RIGOLLET: It's linear, right? 17:13 This is just Xi transpose beta. 17:16 If I multiply beta by 2, I get twice. 17:18 If I add something to beta, it just gets added, 17:21 so it's a linear function of beta. 17:23 And so this thing is both convex and concave. 17:25 In the one-dimensional case-- so think about p 17:28 as being one-dimensional-- so if beta 17:29 is a one-dimensional thing, those 17:31 are just the function that looks like this, right? 17:34 Those are linear functions. 17:35 They are both convex and concave. 17:40 So this is not going to matter when 17:42 it comes to the convexity of my overall function, 17:44 because I'm just adding something which is just a line. 17:46 And so if I started with convex, it's going to stay convex. 17:49 If I started with concave, it's going to stay concave. 17:51 And if I started with something which is both, 17:53 it's going to stay both, meaning neither. 17:56 It cannot be both. 17:57 Yeah. 17:58 So if you're neither convex or concave, adding this linear-- 18:00 so this will not really matter. 18:02 If I want to understand when my function looks like, 18:04 I need to understand would b of Xi transpose beta does. 18:07 Begin, the Xi transpose beta-- 18:10 no impact. 18:10 It's a linear function. 18:12 In terms of convexity, it's not going to play any role. 18:15 So I really need to understand what my function b looks like. 18:18 What do we know about b again? 18:19 18:22 So we know that b prime of theta is equal to mu, right? 18:30 Well, the mean of a random variable 18:34 in a canonical exponential family 18:36 can be a positive or negative number. 18:37 This really does not tell me anything. 18:40 That can be really anything. 18:41 However, if I look at the second the derivative of b, 18:48 I know that this is what? 18:49 This is the variance of Y divided by phi. 18:59 That was my dispersion parameter. 19:00 The variance was equal to phi times b prime prime. 19:03 So we know that if theta is not degenerate, 19:06 meaning that the density does not 19:08 take value infinity at only one point, 19:10 this thing is actually positive. 19:12 And clearly, when you have something 19:14 that looks like this, unless you have some crazy stuff happening 19:16 with phi being equal to 0 or anything that's not normal, 19:20 then you will see that you're not degenerate. 19:22 So this think is strictly positive. 19:24 And we've said several times that if b prime prime is 19:27 positive, then that means that's the derivative of b prime, 19:32 meaning that b prime is increasing. 19:34 And b prime is increasing is just 19:36 the same thing as saying that b is convex, All right? 19:39 So that implies that b is strictly convex. 19:44 And the strictly comes from the fact 19:47 that this is a strict sign. 19:50 Well, I should not do that, because now it's no longer. 19:52 So it's just a strict sign, meaning 19:54 that the function, this is not strictly convex, 19:56 because it's linear. 19:57 Strictly convex means there's always 19:59 some curvature everywhere. 20:02 So now I have this thing that's linear minus something that's 20:05 convex. 20:07 Something that's negative, something convex, is concave. 20:10 So this thing is linear plus concave. 20:12 So it is concave. 20:13 So I know just by looking at this that ln of beta, which, 20:18 of course, is something that lives in Rp, 20:20 but if I saw it living in R1 it would look like this. 
20:23 And if I saw it living in R2, it would 20:25 look like a dome like this. 20:28 And the fact that it's strict is also 20:30 telling me that it is actually a unique maximizer. 20:34 So there's a unique maximizer in Xi transpose beta, 20:37 but not in beta necessarily. 20:38 We're going to need extra assumptions for this. 20:41 OK. 20:41 So this is what I say here. 20:44 The log-likelihood is strictly concave. 20:46 And so as a consequence, under extra assumptions on the Xi-- 20:49 and that's because, of course, if the Xi's are all the same, right? 20:55 So if the entries of Xi's-- 20:58 so if Xi is equal to 1, 1, 1, 1, 1, 21:01 then Xi transpose beta is just the sum of the betas. 21:05 And in the sum of the beta i's, it will be strictly concave, 21:10 but certainly not in the individual entries. 21:13 OK. 21:13 So I need extra things on my Xi, so that this happens, 21:17 just like we needed the matrix capital 21:19 X in the linear regression case to be of full rank, 21:23 so we could actually identify what beta was. 21:25 OK. 21:25 It's going to be exactly the same thing. 21:27 21:32 So here, this is when we have this very specific 21:35 parametrization. 21:36 And the question is-- 21:39 but it may not be the case if we change the parameter 21:41 beta into something else. 21:42 OK. 21:43 So here, the fact that we use the canonical link, et cetera, 21:46 everything actually works really to our advantage, 21:49 so that everything becomes strictly concave. 21:51 And we know exactly what's happening. 21:53 21:56 All right, so I understand I went a bit fast 21:58 on playing with convex and concave functions. 22:00 This is not the purpose. 22:02 You know, I could spend a lecture 22:04 telling you, oh, if I add two concave functions, 22:06 then the result remains concave. 22:08 If I add a concave and a strictly concave, 22:11 then the result still remains strictly concave. 22:13 And we could spend time proving this. 22:16 This was just for you to get an intuition 22:17 as to why this is correct. 22:19 But we don't really have time to go into too much detail. 22:22 One thing you can do-- 22:23 a strictly concave function, if it's in one dimension, 22:26 all I need to have is that the second derivative is strictly 22:32 negative, right? 22:32 That's a strictly concave function. 22:34 That was the analytic definition we had for strict concavity. 22:38 So if this was in one dimension, it 22:40 would look like this, Yi times Xi times beta. 22:45 Now, beta is just one number. 22:47 And then I would have minus b of Xi times beta. 22:52 And this is all over phi. 22:54 You take second derivatives. 22:56 The fact that this is linear in beta, this is going to go away. 22:59 And here, I'm just going to be left with minus-- 23:03 so if I take the second derivative with respect 23:07 to beta, this is going to be equal to minus b prime prime Xi 23:13 beta times Xi squared divided by phi. 23:18 So this Xi squared is clearly positive. 23:21 If Xi is 0, this is degenerate, so I would not get it. 23:25 Then I have the second derivative 23:26 of b, b prime prime, which I know is positive, 23:28 because of the variance thing that I have here, 23:30 divided by phi. 23:32 And so that would all be fine. 23:34 That's for one dimension. 23:35 If I wanted to do this in higher dimensions, 23:37 I would have to say that the Hessian is 23:39 a negative definite matrix. 23:41 And that's maybe a bit beyond what this course is. 23:44 23:50 So in the rest of this chapter, I 23:54 will do what I did not do when we talked 23:56 about maximum likelihood. 
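As a side note on the higher-dimensional statement, here is a sketch, in LaTeX, of the gradient and Hessian under the canonical link, matching the one-dimensional calculation above; the claim about the x_i spanning R^p is the "extra assumption on the Xi" mentioned in the lecture.

```latex
% Gradient and Hessian of the log-likelihood under the canonical link:
\nabla \ell_n(\beta) \;=\; \frac{1}{\phi}\sum_{i=1}^n \big(Y_i - b'(x_i^\top\beta)\big)\, x_i ,
\qquad
H_{\ell_n}(\beta) \;=\; -\frac{1}{\phi}\sum_{i=1}^n b''(x_i^\top\beta)\, x_i x_i^\top .

% Since b'' > 0 (it is a variance up to \phi), H_{\ell_n}(\beta) is negative semidefinite,
% and negative definite (strict concavity in \beta) as soon as the x_i span \mathbb{R}^p.
```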
23:58 And what we're going to do is we're 24:00 going to actually show how to do this maximization, right? 24:03 So here, we know that the function is concave. 24:06 But what it looks like specifically 24:08 depends on what b is. 24:11 And for different b's, I'm going to have different things to do, 24:15 Just like when I was talking about maximum likelihood 24:17 estimation, if it had a concave log-likelihood function, 24:22 I could optimize it. 24:23 But depending on what the function is, 24:24 I would actually need some algorithms 24:26 that may be working better on some functions than others. 24:29 Now, here I don't have random things. 24:32 I have the b is the cumulant generating 24:34 function of a canonical exponential family. 24:38 And there is a way for me to sort of leverage that. 24:42 So not only is there the b part, but there's also 24:45 the linear part. 24:46 And if I start trying to use that, 24:49 I'm actually going to be able to devise very specific 24:51 optimization algorithms. 24:53 And the way I'm going to be able to do this 24:54 is by thinking of simple black box 24:57 optimization to which I can actually technically feed 24:59 any function. 25:00 But it's going to turn out that the iterations 25:02 of this iterative algorithms are going 25:05 to look very familiar when we just 25:10 plug in the particular values of b, of the log-likelihood 25:13 that we have for this problem. 25:15 And so the three methods we're going to talk about going from 25:19 more black box-- meaning you can basically stuff in any function 25:22 that's going to work, any concave function that's going 25:24 to work, all the way to this is working specifically 25:27 for generalized linear models-- 25:29 are Newton-Raphson method. 25:31 Who's already heard about the Newton-Raphson method? 25:35 So there's probably some people actually learned this algorithm 25:40 without even knowing the word algorithm, right? 25:43 It's a function. 25:44 Typically, it's supposed to be finding roots of functions. 25:46 But finding the root of a function of the derivative 25:49 is the same as finding the minimum of a function. 25:53 So that's the first black box method. 25:55 I mean, it's pretty old. 25:57 And then there's something that's 25:59 very specific to what we're doing, which is called-- 26:01 so this Newton-Raphson method is going 26:03 to involve the Hessian of our log-likelihood. 26:06 And since we know something about the Hessian 26:09 for a particular problem, we're going to be able move one 26:11 to Fisher-scoring. 26:13 And the word Fisher here is actually exactly coming 26:15 from Fisher information. 26:17 So the Hessian is going to involve the Fisher information. 26:20 And finally, we will talk about iteratively re-weighted least 26:25 squares. 26:26 And that's not for any function. 26:28 It's really when we're trying to use 26:29 the fact that there is this linear dependence on the Xi. 26:35 And this is essentially going to tell us, well, you know, 26:37 you can use least squares for linear regression. 26:39 Here, you can use least squares, but locally, 26:41 and you have to iterate. 26:42 OK. 26:43 And this last part is essentially a trick 26:46 by statisticians to be able to solve 26:49 the Newton-Raphson updates without actually 26:52 having a dedicated software for this, 26:54 but just being able to reuse some least squares software. 26:59 OK. 26:59 So you know, we've talked about this many times. 27:02 I just want to make sure that we're all on the same page 27:05 here. 
27:07 We have a function f. 27:09 We're going to assume that it has two derivatives. 27:12 And it's a function from Rm to R. 27:14 So it's fist derivative is called gradient. 27:16 That's the vector that collects all the partial derivatives 27:20 with respect to each of the coordinates. 27:23 It's dimension m, of course. 27:25 And the second derivative is an m by m matrix. 27:28 It's called the Hessian. 27:29 And ith row and jth column, you see 27:34 the second partial derivative with respect 27:37 to the ith component and the jth component. 27:39 OK. 27:40 We've seen that several times. 27:41 This is just multi-variable calculus. 27:44 But really the point here is to maybe the notation 27:48 is slightly different, because I want to keep track of f. 27:51 So when I write the gradient, I write nabla sub f. 27:55 And when I write Hessian, I write nabla H sub f. 28:00 And as I said, if f is strictly concave, 28:03 then Hf of x is negative definite. 28:05 What it means is that if I take any x in Rm, then x 28:14 transpose Hf, well, that's for any X0 X, 28:20 this is actually strictly negative. 28:23 That's what it means to be negative definite. 28:25 OK? 28:26 So every time I do x transpose-- 28:30 so this is like a quadratic form. 28:31 And I want it to be negative for all values of X0 and X, 28:35 both of them. 28:36 That's very strong, clearly. 28:39 But for us, actually, this is what happens just 28:41 because of the properties of b. 28:42 28:45 Well, at least the fact that it's 28:48 negative, less than or equal to, if I 28:50 want it to be strictly less I need some properties on X. 28:54 And then I will call the Hessian map the function 28:56 that maps X to this matrix Hf of X. 29:02 So that's just the second derivative at x. 29:04 Yeah. 29:04 AUDIENCE: When you what are [INAUDIBLE]?? 29:09 PHILIPPE RIGOLLET: Where do [INAUDIBLE]?? 29:11 Oh, yeah. 29:13 I mean, you know, you need to be able to apply Schwarz lemma. 29:16 Let's say two continue derivatives that's smooth. 29:19 AUDIENCE: [INAUDIBLE] 29:21 PHILIPPE RIGOLLET: No, that's fine. 29:23 29:29 OK. 29:29 So how does the Newton-Raphson method work? 29:32 Well, what it does is that it forms a quadratic approximation 29:35 to your function. 29:37 And that's the one it optimizes at every single point. 29:39 OK. 29:40 And the reason is because we have 29:41 a closed-form solution to defining the minimum 29:44 of a quadratic function. 29:46 So if I give you a function that's 29:47 of the form ax squared plus b x plus c, 29:51 you know exactly a closed form for its minimum. 29:54 But if I give you any function or, let's say-- 29:57 yeah, yeah. 29:58 So here, it's all about maximum. 29:59 I'm sorry. 30:01 If you're confused with me using the word minimum, 30:03 just assume that it was the word maximum. 30:06 So this is how it works. 30:07 OK. 30:08 If I give you a function which is concave, that's quadratic. 30:14 OK. 30:14 So it's going to look like this. 30:15 30:19 So that's of the form ax squared-- 30:21 where a is negative, of course-- 30:23 plus bx plus c. 30:25 Then you can solve your whatever. 30:28 You can take the derivative of this guy, set it equal to 0, 30:31 and you will have an exact equation 30:33 into what the value of x is that it realizes this maximum. 30:37 If I give you any function that's concave, 30:42 that's all clear, right? 30:43 I mean, if I tell you the function that we have here 30:46 is that the form ax minus b of x, 30:50 then I'm just going to have something that inverts b prime. 
30:53 But how do I do that exactly? 30:56 It's not clear. 30:57 And so what we do is we do a quadratic approximation, which 31:00 should be true approximately everywhere, right? 31:03 So I'm at this point here, I'm going to say, 31:05 oh, I'm close to being that function. 31:08 And if I'm at this point here, I'm 31:10 going to be close to being that function. 31:12 And for this function, I can actually optimize. 31:14 And so if I'm not moving too far from one to the other, 31:17 I should actually get something. 31:19 So here's how the quadratic approximation works. 31:21 I'm going to write the second order Taylor expansion. 31:27 31:31 OK. 31:32 And so that's just going to be my quadratic approximation. 31:35 It's going to say, oh, f of x, when x is close 31:38 to some point x0, is going to close 31:40 to f of x0 plus the gradient of f at x0 transpose x minus x0. 31:48 And then I'm going to have plus 1/2x minus x0 transpose 31:53 Hf at 0x x minus x transpose, right-- 31:58 x minus x0. 31:59 So that's just my second order Taylor expansion 32:02 multi-variate 1. 32:03 And let's say x0 is this guy. 32:06 32:08 Now, what I'm going to do is say, OK, 32:13 if I wanted to set this derivative of this guy 32:15 equal to 0, I would just have to solve, 32:17 well, you know, f prime of x equals 0, 32:21 meaning that X has to be f prime inverse of 0. 32:26 And really apart from like being some notation manipulation, 32:29 this is really not helping me. 32:31 OK. 32:32 Because I don't know what f prime inverse of 0 32:35 is in many instances. 32:36 However if f has a very specific form which 32:39 is something that depends on x in a very specific way, 32:43 there's just a linear term and then a quadratic term, 32:45 then I can actually do something. 32:47 So let's forget about this approach. 32:49 And rather than minimizing f, let's just 32:51 minimize the right-hand side. 32:54 OK. 32:55 So sorry- maximize. 32:59 So maximize the right-hand side. 33:06 And so how do I get this? 33:08 Well, I just set the gradient equal to 0. 33:11 So what is the gradient? 33:12 The first term does not depend on x. 33:15 So that means that this is going to be 0 plus-- 33:21 what is the gradient of this thing, of the gradient of f 33:25 at x0 transpose x minus x0? 33:27 What is the gradient of this guy? 33:28 33:40 So I have a function of the form b transpose x. 33:43 What is the gradient of this thing? 33:46 AUDIENCE: [INAUDIBLE] 33:47 PHILIPPE RIGOLLET: I'm sorry? 33:49 AUDIENCE: [INAUDIBLE] 33:50 PHILIPPE RIGOLLET: I'm writing everything in two-column form, 33:52 right? 33:53 So it's just b. 33:55 OK. 33:55 So here, what is b? 33:57 Well, it's gradient of f at x0. 33:59 34:03 OK. 34:04 And this term here gradient of f at x0 transpose x0 34:07 is just a constant. 34:07 This thing is going away as well. 34:10 And then I'm looking at the derivative of this guy here. 34:13 And this is like a quadratic term. 34:15 It's like H times x minus x0 squared. 34:18 So when I'm going to take the derivative, 34:20 I'm going to have a factor 2 that's 34:22 going to pop out and cancel this one half. 34:23 And then I'm going to be left only 34:25 with this part times this part. 34:28 OK. 34:28 So that's plus Hf x minus x0. 34:36 OK. 34:36 So that's just a gradient. 34:39 And I want it to be equal to 0. 34:40 So I'm just going to solve this equal to 0. 34:44 OK? 34:46 So that means that if I want to find the minimum, 34:49 this is just going to be the x* that satisfies this. 
34:53 So that's actually equivalent to Hf times x* is equal to Hf x0 35:07 minus gradient f at x0. 35:14 Now, this is a much easier thing to solve. 35:16 What is this? 35:16 35:21 This is just a system of linear equations, right? 35:25 I just need to find the x* such that when I pre-multiply it 35:29 by a matrix I get this vector on the right-hand side. 35:33 This is just something of the form ax equals b. 35:37 And I have many ways I can do this. 35:39 I could do Gaussian elimination, or I 35:41 could use Spielman's fast Laplacian solvers 35:46 if I had some particular properties of H. I mean, 35:49 there's huge activity in terms of how to solve those systems. 35:54 But let's say I have some time. 35:56 It's not a huge problem. 35:58 I can actually just use linear algebra. 35:59 And linear algebra just tells me that x* is equal to Hf inverse 36:06 times this guy, which those two guys are going to cancel. 36:16 So this is actually equal to x0 minus Hf inverse gradient f 36:23 at x0. 36:26 And that's just what's called a Newton iteration. 36:28 I started at some x0. 36:31 I'm at some x0, where I make my approximation. 36:34 And it's telling me starting from this x0, 36:37 I wanted to fully optimize a quadratic approximation, 36:40 I would just have to take the x*. 36:43 That's this guy. 36:44 And then I could just use this guy as my x0 and do it again, 36:48 and again, and again, and again. 36:50 And those are called Newton iterations. 36:52 And they're basically the workhorse 36:54 of interior point methods, for example, a lot 36:57 of optimization algorithms. 36:59 And that's what you can see here. 37:01 x* is equal to x0 minus the inverse Hessian times 37:05 the gradient. 37:07 We briefly mentioned gradient descent. 37:10 We briefly mentioned gradient decent, at some point, 37:13 to optimize the convex function, right? 37:15 And if I wanted to use gradient descent, again, H is a matrix. 37:22 But if I wanted to think of H as being a scalar, 37:24 would it be a positive or negative number? 37:26 37:31 Yeah. 37:33 AUDIENCE: [INAUDIBLE] 37:34 PHILIPPE RIGOLLET: Why? 37:36 AUDIENCE: [INAUDIBLE] 37:39 PHILIPPE RIGOLLET: Yeah. 37:40 So that would be this. 37:42 So I want to move against the gradient to do what? 37:44 AUDIENCE: [INAUDIBLE] 37:45 PHILIPPE RIGOLLET: To minimize. 37:46 But I'm maximizing here, right? 37:47 Everything is maximized, right? 37:49 So I know that H is actually negative definite. 37:52 So it's a negative number. 37:55 So you have the same confusions they do. 37:59 We're maximizing a concave function here. 38:01 So H is negative. 38:02 So this is something of the form x0 plus something 38:07 times the gradient. 38:10 And this is what your gradient ascent, rather than descent, 38:13 would look like. 38:15 And all it's saying, Newton is telling 38:17 you don't take the gradient for granted 38:19 as a direction in which you want to go. 38:21 It says, do a slight change of coordinates before you 38:25 do this according to what your Hessian looks like, all right? 38:30 And those are called second order methods that require 38:34 knowing what the Hessian is. 38:36 But those are actually much more powerful 38:39 than the gradient descent, because they're 38:41 using all of the local geometry of the problem. 38:43 All of the local geometry of your function 38:45 is completely encoded in this Hessian. 38:48 And in particular it implies that it tells you 38:50 where to switch and not to go slower in some places 38:53 or go faster in other places. 
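Summarizing the derivation just carried out on the board, in LaTeX:

```latex
% Quadratic (second-order Taylor) approximation of f around x_0:
f(x) \;\approx\; f(x_0) + \nabla f(x_0)^\top (x - x_0)
      + \tfrac{1}{2}(x - x_0)^\top H_f(x_0)\,(x - x_0).

% Setting the gradient of the right-hand side to zero:
\nabla f(x_0) + H_f(x_0)\,(x^* - x_0) = 0
\;\;\Longleftrightarrow\;\;
x^* \;=\; x_0 - H_f(x_0)^{-1}\,\nabla f(x_0).
```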
38:55 Now, this in practice for, say, modern large scale 38:58 machine learning problems, inverting this matrix H 39:02 is extremely painful. 39:03 It takes too much time. 39:05 The matrix is too big, and computers cannot do it. 39:08 And people resort to what's called 39:10 pseudo-Newton method, which essentially tries 39:13 to emulate what this guy is. 39:15 And there's many ways you can do this. 39:17 Some of them are by using gradients 39:20 that you've collected in the past. 39:23 Some of them just say, well, let's just pretend H 39:27 is diagonal. 39:29 There's a lot of things you can do 39:30 to just play around with this and not actually have 39:32 to invert this matrix. 39:34 OK? 39:36 So once you have this, you started from x0. 39:38 It tells you which x* you can get as a maximizer of the local 39:44 quadratic approximation to your function. 39:47 You can actually just iterate that, all right? 39:49 So you start at some x0 somewhere. 39:52 And then once you get to some xk, 39:55 you just do the iteration which is described, 39:57 which is just find x k plus 1, which 39:59 is the maximizer of the local quadratic approximation 40:03 to your function at xk and repeat until convergence. 40:09 OK. 40:09 So if this was an optimization class, 40:13 we would prove that convergence actually, eventually, 40:16 happens for a strictly concave function. 40:20 This is a stats class, so you're just 40:22 going to have to trust me that this is the case. 40:24 And it's globally convergent, meaning 40:26 that you can start wherever you want, and it's 40:30 going to work under minor conditions on f. 40:35 And in particular, those conditions 40:36 are satisfied for the log-likelihood functions 40:40 we have in mind. 40:41 OK. 40:42 And it converges at an extremely fast rate. 40:44 Usually it's quadratic convergence, 40:46 which means that every time you make one step, 40:49 you roughly double the number of accurate digits of your solution. 40:52 40:56 If that's something you're vaguely interested in, 40:58 I highly recommend that you take a class 41:01 on optimization. 41:02 It's a fascinating topic. 41:04 Unfortunately, we don't have much time, 41:05 but it starts being more and more intertwined 41:08 with high dimensional statistics and machine learning. 41:11 41:14 I mean, it's an algorithms class, typically. 41:17 But it's very much more principled. 41:22 It's not a bunch of algorithms that solve a bunch of problems. 41:24 There's basically one basic idea, 41:26 which is if I have a convex function, 41:28 I can actually minimize it. 41:30 If I have a concave function, I can maximize it. 41:32 And it revolves around a similar thing. 41:35 So let's stare at this iterative step for a second and pause. 41:40 And let me know if you have any questions. 41:43 41:46 OK. 41:47 So, of course, in a second we will plug in 41:50 for the log-likelihood. 41:51 This is just a general thing for a general function f. 41:54 But then in a second, f is going to be ln. 41:57 OK. 41:58 So if I wanted to implement that for real, 42:00 I would have to compute the gradient of ln at a point xk. 42:04 And I would have to compute the Hessian at a given point 42:06 and invert it. 42:08 OK. 42:08 So this is just the basic algorithm. 42:11 And this, as you can tell, used in no place 42:14 the fact that ln was the log-likelihood associated 42:17 to some canonical exponential family 42:19 in a generalized linear model. 42:21 This never showed up. 42:23 So can we use that somehow? 
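Before the lecture turns to exploiting the GLM structure, here is a minimal sketch of the generic Newton-Raphson iterations just described, for maximizing a user-supplied strictly concave function. The function names (grad, hess, newton_raphson), the tolerance, and the stopping rule are illustrative choices, not part of the lecture; in this chapter, grad and hess would be the gradient and Hessian of the log-likelihood evaluated at the current beta.

```python
import numpy as np

def newton_raphson(grad, hess, x0, tol=1e-8, max_iter=100):
    """Maximize a strictly concave function given its gradient and Hessian.

    grad(x) returns the gradient (shape (m,)); hess(x) returns the Hessian
    (shape (m, m)).  Each step solves the linear system H(x_k) d = -grad(x_k)
    rather than inverting the Hessian explicitly.
    """
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) < tol:        # stop when the gradient is ~0
            break
        step = np.linalg.solve(hess(x), -g)  # Newton direction
        x = x + step                          # x_{k+1} = x_k - H^{-1} grad
    return x
```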
42:26 Optimization for the longest time was about making your problems 42:28 as general as possible, culminating maybe 42:31 in the interior point method theory and conic programming 42:34 in the mid-'90s. 42:35 And now what optimization is doing 42:37 is that it's [INAUDIBLE] very general. 42:38 It says, OK, if I want to start to go fast, 42:40 I need to exploit as much structure about my problem 42:42 as I can. 42:43 And the beauty is that as statisticians 42:45 or machine learning people, we 42:47 do have a bunch of very specific problems 42:49 that we want optimizers to solve. 42:50 And they can make things run much faster. 42:53 But this did not require waiting until the 21st century. 42:56 Problems with very specific structure 42:58 arose already in this generalized linear model. 43:01 So what do we know? 43:03 Well, we know that this log-likelihood is really 43:07 one thing that comes when we're trying 43:09 to replace an expectation by an average, 43:12 and then doing something fancy, right? 43:14 That was our statistical hammer. 43:16 And remember when we introduced likelihood maximization we just 43:19 said, what we really want to do 43:21 is to minimize the KL, right? 43:23 That's the thing we wanted to minimize, 43:25 the KL divergence between two distributions, the true one 43:29 and the one that's parameterized by some unknown theta. 43:32 And we're trying to minimize that over theta. 43:34 And we said, well, I don't know what 43:36 this is, because it's an expectation with respect 43:38 to some unknown distribution. 43:40 So let me just replace the expectation with respect 43:42 to my unknown distribution by an average over my data points. 43:46 And that's how we justified the existence 43:49 of the log-likelihood maximization problem. 43:54 But here, actually, I might be able to compute 44:00 this expectation, at least partially where I need it. 44:04 And what we're going to do is we're going to say, OK, 44:07 since at a given point xk, say, let me call it here theta, 44:12 I'm trying to find the inverse of the Hessian 44:16 of my log-likelihood, right? 44:17 So if you look at the previous one, as I said, 44:19 we're going to have to compute the Hessian H sub l n of xk, 44:23 and then invert it. 44:24 But let's forget about the inversion step for a second. 44:27 We have to compute the Hessian. 44:30 This is the Hessian of the function 44:31 we're trying to maximize. 44:32 But if I could actually replace it, 44:34 not by the function I'm trying to minimize-- to maximize, 44:37 sorry-- the log-likelihood, but really 44:38 by the function I wish I was actually minimizing, 44:42 which is the KL, right? 44:45 Then that would be really nice. 44:46 And what happens is that since I'm actually 44:48 trying to find this at a given xk, 44:51 I can always pretend that this xk that I have 44:53 in my current iteration is the true one 44:55 and compute my expectation with respect to that guy. 44:58 And what happens is that I know that when 45:00 I compute the expectation of the Hessian of the log-likelihood 45:05 at a given theta and when I take the expectation with respect 45:08 to the same theta, what I get out 45:10 is negative Fisher information. 45:15 The Fisher information was defined in two ways-- 45:18 as the expectation of the square of the derivative or negative 45:25 of the expectation of the second derivative 45:27 of the log-likelihood. 45:30 And so now, I'm doing some sort of a leap of faith here. 
45:35 Because there's no way the theta, which is the current xk, 45:40 that's the current theta at which I'm actually 45:42 doing this optimization-- 45:44 I'm actually pretending that this is the right one. 45:46 45:48 But what's going to change by doing this 45:51 is that it's going to make my life easier. 45:52 Because when I take expectations, 45:56 we'll see that when we look at the Hessian, 46:00 the Hessian as essentially the derivative of, say, a product 46:07 is going to be the sum of two terms, right? 46:09 The derivative of u times v is u prime v plus uv prime. 46:13 One of those two terms is actually 46:14 going to have expectation 0. 46:16 And that's going to make my life very easy when 46:18 I take expectations and basically just 46:20 have one term that's going to go away. 46:22 And so in particular, my formula, 46:24 just by the virtue of taking this expectation 46:27 before inverting the Hessian, is going 46:29 to just shrink the size of my formulas by half. 46:33 OK. 46:34 So let's see how this works. 46:35 You don't have to believe me. 46:37 Is there any question about this slide? 46:39 You guys remember when we were doing 46:41 maximum estimation and Fisher information 46:43 and the KL divergence, et cetera? 46:47 Yeah. 46:48 AUDIENCE: [INAUDIBLE] 46:51 PHILIPPE RIGOLLET: Because that's what we're really 46:53 trying to minimize. 46:55 AUDIENCE: [INAUDIBLE] 46:59 PHILIPPE RIGOLLET: Yeah. 47:00 So there's something you need to trust me 47:05 with, which is that the expectation of H of ln 47:08 is actually H of the expectation of ln, all right? 47:14 Yeah, it's true, right? 47:16 Because taking derivative is a linear operator. 47:20 And we actually used that several times when 47:22 we said expectation of partial of l with respect to theta 47:30 is equal to 0. 47:31 Remember we did that? 47:32 That's basically what we used, right? 47:34 47:39 AUDIENCE: ln is the likelihood. 47:40 PHILIPPE RIGOLLET: It's the log-likelihood. 47:42 AUDIENCE: Log-likelihood, [INAUDIBLE] OK. 47:45 When we did Fisher [INAUDIBLE],, we 47:47 did the likelihood of [INAUDIBLE] observation. 47:49 PHILIPPE RIGOLLET: Yeah. 47:51 AUDIENCE: Why is it ln in this case? 47:54 PHILIPPE RIGOLLET: So actually, ln is typically not normalized. 48:01 So I really should talk about ln over n. 48:04 OK. 48:04 But let's see that, OK? 48:05 So if I have IID observations, that should be pretty obvious. 48:10 OK. 48:11 So if I have IID x1, xn, with density f theta and if I look 48:22 at log f theta of Xi, sum from i equal 1 to n, as I said, 48:28 I need to actually have a 1 over n here. 48:30 When I look at the expectation, they all have 48:32 the same expectation, right? 48:34 So this is actually, indeed, equal to negative KL 48:41 plus a constant. 48:42 OK? 48:43 And negative KL is because this-- 48:45 sorry, if I look at the expectation. 48:46 So the expectation of this guy is just the expectation of one 48:50 of them, all right? 48:56 So I just do expectation theta. 48:57 OK? 48:59 Agree? 49:02 Remember, the KL was expectation theta log f theta divided by f. 49:09 So that's between p theta and p theta prime. 49:11 49:14 Well, no, sorry. 49:15 That's the true p. 49:17 And let's call it f. 49:19 p theta, right? 49:20 So that's what showed up, which is, indeed, 49:23 equal to minus expectation theta log 49:27 f theta plus log of f, which is just a constant with respect 49:33 to theta. 49:34 It's just the thing that's up doesn't matter. 49:36 OK? 49:38 So this is what shows up here. 
49:40 And just the fact that I have this 1 over n 49:42 doesn't change, because they're IID. 49:44 Now, when I have things that are not IID-- because what I really 49:47 had was Y1 Yn, and Yi at density f theta i, which is just 49:54 the conditional density given Xi, 49:56 then I could still write this. 49:59 And now when I look at the expectation of this guy, what 50:02 I'm going to be left with is just 1 over n sum from i 50:08 equal 1 to n of the expectation of log f theta i of Yi. 50:17 50:19 And it's basically the same thing, except that I have a 1 50:22 over n expectation in front. 50:24 And I didn't tell you this, because I only showed you 50:26 what the KL divergence was for between two distributions. 50:32 But here, I'm telling you what the KL 50:35 is between two products of distributions 50:39 that are independent, but not necessarily 50:40 identically distributed. 50:42 50:48 But that's what's going to show up, just because it's 50:50 a product of things. 50:51 So when you have the log, it's just going to be a sum. 50:53 50:58 Other questions? 51:01 All right, so what do we do here? 51:05 Well, as I said, now we know that the expectation of H 51:08 is negative Fisher information. 51:10 So rather than putting H inverse in my iterates 51:13 for Newton-Raphson, I'm just going to put the inverse Fisher 51:19 information. 51:20 And remember, it had a minus sign in front. 51:23 So I'm just going to pick up a plus sign now, 51:25 just because i is negative, the expectation of the Hessian. 51:31 And this guy has, essentially, the same convergence 51:33 properties. 51:35 And it just happens that it's easier 51:37 to compute the i than H Ln. 51:39 And that's it. 51:40 That's really why you want to do this. 51:43 Now, you might say that, well, if I use more information, 51:50 I should do better, right? 51:51 But it's actually not necessarily 51:53 true for several reasons. 51:54 But let's say that one is probably the fact that I 51:56 did not use more information. 51:58 Every step when I was computing this thing at xk, 52:02 I actually pretended that at theta k 52:04 the true distribution was the one distributed 52:07 according to theta k. 52:09 And that was not true. 52:10 This is only true when theta k becomes 52:12 close to the true theta. 52:13 52:14 And so in a way, what I gained I lost again 52:18 by making this thing. 52:20 It's just really a matter of simple computation. 52:23 So let's just see it on a particular example. 52:25 Actually, in this example, it's not going to look much simpler. 52:28 It's actually going to be the same. 52:30 All right, so I'm going to have the Bernoulli example. 52:35 All right, so we know that Bernoulli 52:36 belongs to the canonical exponential family. 52:41 And essentially, all I need to tell you what b is. 52:46 And b of theta for Bernoulli is log 1 plus e theta, right? 52:53 We computed that. 52:56 OK. 52:57 And so when I look at my log-likelihood, 53:02 it is going to look like the sum from i equal 1 to n of Yi of-- 53:10 OK, so here I'm going to actually use 53:11 the canonical link. 53:12 So it's going to be Xi transpose beta 53:15 minus log 1 plus exponential Xi transpose beta. 53:20 53:24 And phi for this guy is equal to 1. 53:27 Is it clear for everyone what I did? 53:29 OK. 53:29 So remember the density, so that was really just-- 53:35 so the PMF was exponential Y theta minus log 1 plus e theta. 53:45 There was actually no normalization. 53:47 That's just the density of a Bernoulli. 53:51 And the theta is actually log p over 1 minus p. 
54:00 And so that's what actually gives me what my-- 54:03 since p is the expectation, this is actually 54:05 giving me also my canonical link, 54:06 which is the logit link. We saw that last time. 54:09 And so if I start taking the log of this guy and summing over n 54:12 and replacing theta by Xi transpose beta, which 54:16 is what the canonical link tells me to do, I get this guy. 54:19 Is that clear for everyone? 54:21 If it's not, please redo this step on your own. 54:25 54:27 OK. 54:28 54:31 So I want to maximize this function. 54:33 Sorry. 54:34 So I want to maximize this function over there 54:36 on the first line as a function of beta. 54:39 And so to do this, I want to use either Newton-Raphson 54:44 or what I call Fisher-scoring. 54:45 So Fisher-scoring is the second one, 54:47 when you replace the Hessian by negative Fisher information. 54:51 So I replace these two things. 54:53 And so I first take the gradient. 54:55 OK. 54:55 So let's take the gradient of ln. 54:57 55:04 So the gradient of ln is going to be, well, sum-- 55:08 so here, this is of the form Yi, which 55:11 is a scalar, times Xi transpose beta. 55:14 That's what I erased from here. 55:18 The gradient of b transpose x is just b. 55:20 So here, I have just Yi Xi. 55:23 So that's of the form Yi, which is a scalar, times Xi, 55:27 which is a vector. 55:30 Now, what about this guy? 55:31 Well, here I have a function. 55:32 So I'm going to have just the usual rule, the chain rule, 55:34 right? 55:35 So that's just going to be 1 over this guy. 55:38 55:40 And then I need to find the derivative of this thing. 55:43 So the 1 is going away. 55:44 And then I apply the chain rule again. 55:46 So I get e of Xi transpose beta, and then 55:49 the gradient of this thing, which is Xi. 55:53 55:57 So my Hessian-- my gradient, sorry, 56:00 I can actually factor out all my Xi's. 56:02 And it's going to look like this. 56:03 56:15 My gradient is a weighted average or weighted sum 56:21 of the Xi's. 56:22 56:25 This will always happen when you have 56:30 a generalized linear model. 56:32 And that's pretty clear. 56:33 Where did the Xi show up? 56:35 Whether it's from this guy or that guy, 56:37 the Xi came from the fact that when 56:39 I take the gradient of Xi transpose beta, 56:41 I have this vector Xi that comes out. 56:43 It's always going to be the thing that comes out. 56:46 So I will always have something that looks like some sum 56:49 with some weights here of the Xi's. 56:53 Now, when I look at the second derivative-- 56:55 57:04 so same thing, I'm just going to take the derivative of this guy. 57:08 Since nothing depends on beta here or here, 57:11 I'm just going to have to take the derivative of this thing. 57:13 And so it's going to be equal. 57:15 So if I look now at the Hessian of ln as a function of beta, 57:21 I'm going to have sum from i equal 1 to n of, well, Yi-- 57:25 what is the derivative of Yi with respect to beta? 57:28 57:31 AUDIENCE: [INAUDIBLE] 57:32 PHILIPPE RIGOLLET: What? 57:34 AUDIENCE: [INAUDIBLE] 57:37 PHILIPPE RIGOLLET: Yeah. 57:39 0. 57:39 OK? 57:40 It doesn't depend on beta. 57:41 I mean, its distribution does. 57:42 But Y itself is just a number, right? 57:46 So this is 0. 57:47 So I'm going to get the minus. 57:49 And then I'm going to have, again, the chain 57:51 rule that shows up. 57:52 So I need to find the derivative of x over 1 plus x. 57:56 What is the derivative of x over 1 plus x? 57:59 I actually don't even know. 58:00 58:03 So that gives me-- 58:04 58:12 OK. 
58:12 So that's 1 over 1 plus x squared. 58:15 So that's minus e Xi transpose beta-- 58:19 sorry, 1 divided by 1 plus e Xi transpose beta squared times 58:27 the derivative of the exponential, which is e Xi 58:31 transpose beta and again, Xi. 58:35 58:38 And then I have this Xi that shows up. 58:39 But since I'm looking for a matrix, 58:41 I'm going to have Xi, Xi transpose, right? 58:42 58:53 OK? 58:55 AUDIENCE: [INAUDIBLE] 58:58 PHILIPPE RIGOLLET: So I know I'm going to need something that 59:02 looks like a matrix in the end. 59:04 And so one way you want to think about it is this 59:07 is going to spit out an Xi. 59:09 There's already an Xi here. 59:11 So I'm going to have something that looks like Xi. 59:14 And I'm going to have to multiply by it another vector 59:15 Xi. 59:16 And I want it to form a matrix. 59:18 And so what you need to do is to take an outer product. 59:22 And that's it. 59:23 59:36 So now as a result, the updating rule is this. 59:42 Honestly, this is not a result of anything. 59:44 I actually rewrote everything that I had before with a theta 59:47 replaced by beta, because it's just 59:50 painful to rewrite this entire thing, put some big parenthesis 59:54 and put minus 1 here. 59:55 59:58 And then I would have to put the gradient, which 60:00 is this thing here. 60:03 So as you can imagine, this is not super nice. 60:07 Actually, what's interesting is at some point 60:11 I mentioned there's a pseudo-Newton method. 60:15 They're actually doing exactly this. 60:16 They're saying, oh, at each iteration, 60:22 I'm actually going to just take those guys. 60:24 If I'm at iteration k, I'm actually 60:26 just going to sum those guys up to k 60:28 rather than going all the way to n and look at every one. 60:30 So you're just looking at your observations one at a time 60:33 based on where you were before. 60:35 OK. 60:36 So you have a matrix. 60:38 You need to invert it. 60:39 So if you want to be able to invert it, 60:41 you need to make sure that the sum 60:43 with those weights of Xi outer, Xi, or Xi, Xi transpose 60:46 is invertible. 60:47 So that's a condition that you need to have. 60:50 And well, you don't have to, because technically you 60:53 don't need to invert. 60:54 You just need to solve the linear system. 60:57 But that's actually guaranteed in most of the cases 61:00 if n is large enough. 61:03 All right, so everybody sees what we're doing here? 61:05 OK. 61:06 So that's for the Newton-Raphson. 61:11 If I wanted to actually do the Fisher-scoring, 61:17 all I would need to do is to replace the Hessian here 61:22 by its expectation when I pretend that the beta have, 61:25 iteration k, is the true one. 61:28 What is the expectation of this thing? 61:31 And when I say expectation here, I'm 61:34 always talking about conditional expectation of Y given X. 61:37 The only distributions that matter, that have mattered 61:40 in this entire chapter, are conditional expectation of Y 61:44 given X. 61:45 The conditional expectation of this thing given X is what? 61:50 61:55 It's itself. 61:57 It does not depend on Y. It only depends on the X's. 61:59 So conditionally on x, this thing 62:01 as far as we're concerned, is completely deterministic. 62:04 So it's actually equal to it's expectation. 62:07 And so in this particular example, 62:11 there's no difference between Fisher-scoring 62:16 and Newton-Raphson. 62:19 And the reason is because the gradient no longer 62:24 depends on Yi-- 62:26 I'm sorry. 62:26 The Hessian no longer depends on Yi. 62:30 OK? 
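Here is a minimal sketch, in Python, of the updating rule just written for the Bernoulli example with the canonical (logit) link, where Newton-Raphson and Fisher-scoring coincide. The function name, the fixed number of iterations, and the absence of any step-size safeguard are illustrative choices, not part of the lecture.

```python
import numpy as np

def logistic_newton(X, y, n_iter=25):
    """Newton-Raphson (= Fisher scoring here) for logistic regression.

    X has shape (n, p), y has entries in {0, 1}.  With the canonical link,
      grad = sum_i (y_i - mu_i) x_i,   mu_i = e^{x_i'beta} / (1 + e^{x_i'beta})
      Hess = -sum_i mu_i (1 - mu_i) x_i x_i'
    """
    n, p = X.shape
    beta = np.zeros(p)
    for _ in range(n_iter):
        mu = 1.0 / (1.0 + np.exp(-X @ beta))    # b'(x_i' beta)
        grad = X.T @ (y - mu)
        W = mu * (1.0 - mu)                      # b''(x_i' beta), the weights
        hess = -(X.T * W) @ X                    # -sum_i W_i x_i x_i'
        # beta_{k+1} = beta_k - H^{-1} grad; requires sum_i W_i x_i x_i'
        # to be invertible, as noted above
        beta = beta - np.linalg.solve(hess, grad)
    return beta
```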
62:30 62:41 This slide is just repeating some stuff that I've said. 62:44 62:49 OK. 62:49 62:54 So I think this is probably-- 62:55 62:58 OK, let's go through this actually. 63:00 63:04 At some point, I said that Newton-Raphson-- 63:07 do you have a question? 63:08 AUDIENCE: Yeah. 63:08 When would the gradient-- sorry, the Hessian ever depend on Yi? 63:12 Because it seems like Yi is just-- or at least 63:15 when you have a canonical link, that the log-likelihood is just 63:22 [INAUDIBLE] to Yi Xi [INAUDIBLE] theta and that's the only place 63:27 Y shows up. 63:28 So [INAUDIBLE] derivative [INAUDIBLE] never depend on Y? 63:30 PHILIPPE RIGOLLET: Not when you have a canonical link. 63:33 AUDIENCE: So if it's not a [INAUDIBLE] there's is no 63:35 difference between-- 63:35 PHILIPPE RIGOLLET: No. 63:36 AUDIENCE: OK. 63:37 PHILIPPE RIGOLLET: Yeah. 63:38 So yeah, maybe I wanted you to figure that out for yourself. 63:45 OK. 63:45 So Yi times Xi transpose beta. 63:49 So essentially, when I have a general family, what 63:55 he's referring to is that this is just b of Xi transpose beta. 64:00 So I'm going to take some derivatives. 64:01 And there's going to be something 64:02 complicated coming out of this. 64:04 But I'm certainly not going to have some Yi showing up. 64:06 The only place where Yi shows up is here. 64:09 Now, if I take two derivatives, this thing 64:11 is gone, because it's linear. 64:13 The first one is going to keep on like this guy. 64:15 And the second one is going to make it go on. 64:17 The only way this actually shows up is when to have an H here. 64:21 And if I have an H, then I can take second derivatives. 64:24 And this thing is not going to be 64:26 completely independent of beta. 64:30 Sorry. 64:31 Yeah, this thing is still going to depend 64:33 on beta, which means that this Yi term is not 64:35 going to disappear. 64:36 64:39 I believe we'll see an example of that, or maybe I removed it. 64:41 I'm not sure, actually. 64:42 I think we will see an example. 64:44 64:49 So let us do a Iteratively Re-weighted Least 64:55 Squares, or IRLS, which I've actually recently learned 65:00 is a term that even though it was defined in the '50s, 65:03 people still feel free to use to define 65:06 to call their new algorithms which have nothing 65:08 to do with this. 65:09 This is really something where you actually do iteratively 65:12 re-weighted least squares. 65:13 65:18 OK. 65:18 Let's just actually go through this quickly 65:20 what is going to be iteratively re-weighted least squares. 65:23 The way the steps that we had here showed up-- 65:28 let's say those guys, x*-- 65:32 is this, is when were actually solving this linear system, 65:37 right? 65:37 That was the linear system we were trying to solve. 65:43 But solving a linear system can be 65:45 done by just trying to minimize, right? 65:48 If I have x a and b, it's the same 65:50 as minimizing the norm of ax minus b squared over x. 65:58 If I can actually find an x for which it's 0, 66:01 it means that I've actually solved my problem. 66:04 And so that means that I can solve linear systems by solving 66:10 least square problems. 66:11 And least square problems are things that statisticians 66:14 are comfortable solving. 66:15 And so all I have to do is to rephrase this 66:18 as at least square problem. 66:19 OK? 66:20 And you know, I could just write it directly like this. 66:23 But there's a way to streamline it a little bit. 66:25 And that's actually by using weights. 66:30 OK. 
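The equivalence invoked a moment ago, written out in LaTeX: solving a linear system can be cast as a least squares problem.

```latex
% Solving A x = b (A invertible) is the same as solving a least squares problem:
\hat{x} \;=\; \operatorname*{argmin}_{x} \;\|A x - b\|_2^2 ,
\qquad \text{and the minimum value is } 0 \text{ exactly when } A\hat{x} = b .
```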
66:30 So I've come in the weights-- 66:32 well, not today, actually, but very soon, all right? 66:37 So this is just a reminder of what we had. 66:39 We have that's Yi give Xi as a distribution distributed 66:43 according to some distribution in the canonical exponential 66:47 family. 66:48 So that means that the log-likelihood looks like this. 66:50 Again, this does not matter to us. 66:52 This is the form that matters. 66:54 And we have a bunch of relationships 66:55 that we actually spent some time computing. 66:58 The first one is that mu is b prime of theta i. 67:01 The second one is that if I take g of mu i, 67:04 I get this systematic component, Xi transpose beta that's 67:08 modeling. 67:09 Now, if I look at the derivative mu i with respect to theta i, 67:13 this is the derivative of b prime of theta i 67:15 with respect to theta i. 67:16 So that's the second derivative. 67:18 And I'm going to call it Vi. 67:20 If phi is equal to 1, this is actually the variance. 67:24 And then I have this function H, which 67:27 allows me to bypass altogether the existence of this parameter 67:30 mu, which says if I want to go from Xi transpose beta 67:33 all the way to theta i, I have to first do g inverse, 67:37 and then b prime inverse. 67:39 If I stopped here, I would just have mu. 67:41 OK? 67:42 67:46 OK. 67:46 So now what I'm going to do is I'm 67:48 going to apply the chain rule. 67:50 And I'm going to try to compute the derivative 67:53 of my log-likelihood with respect to beta. 67:56 So, again, the log-likelihood is much nicer 68:00 when I read it as a function of theta than a function of beta, 68:03 but it's basically what we've been doing by hand. 68:05 You can write it as a derivative with respect 68:08 to theta first, and then multiply by the derivative 68:11 of theta with respect to beta. 68:13 OK. 68:13 And we know that theta depends on beta 68:16 as H of Xi transpose beta. 68:20 OK? 68:20 I mean, that's basically what we've been 68:22 doing for the Bernoulli case. 68:26 I mean, we used the chain rule without actually saying it. 68:28 But this is going to be convenient to actually make 68:30 it explicitly show up. 68:32 OK. 68:32 So when I first take the derivative of my log-likelihood 68:35 with respect to theta, I'm going to use the fact 68:39 that my canonical family is super simple. 68:42 OK. 68:42 So what I have is that my log-likelihood ln 68:50 is the sum from i equal 1 to n of Yi theta i minus b 68:56 of theta i divided by phi plus some constant, which 69:00 will go away as soon as I'm going 69:01 to take my first derivative. 69:03 So if I take the derivative with respect to theta i of this guy, 69:08 this is actually going to be equal to Yi minus b prime theta 69:12 i divided by phi. 69:16 And then I need to multiply by the derivative of theta i 69:20 with respect to beta. 69:21 Remember, theta is H of Xi transpose beta. 69:26 So the derivative of theta i with respect to beta j, 69:31 this is equal to H prime of Xi transpose beta. 69:36 And then I have the derivative of this guy. 69:40 Actually, let me just do the gradient of theta I 69:46 at beta, right? 69:48 That's what we did. 69:50 I'm just thinking of theta i as being a function of theta. 69:53 So what should I add here? 69:55 It was just the vector Xi, which is just the chain rule again. 70:01 That's Hi prime, right? 70:02 You don't see it, but there's a prime here that's derivative. 70:06 OK. 70:06 We've done that without saying it explicitly. 
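For reference, the chain-rule computation in this passage can be collected in a single display, writing h for the function the lecture calls H, so that theta_i = h(X_i^T beta) and mu_i = b'(theta_i):

```latex
% Chain rule for the gradient of the log-likelihood in beta
\frac{\partial \ell_n}{\partial \theta_i} = \frac{Y_i - b'(\theta_i)}{\phi},
\qquad
\nabla_\beta \theta_i = h'(X_i^\top \beta)\, X_i
\quad\Longrightarrow\quad
\nabla_\beta \ell_n(\beta)
  = \frac{1}{\phi}\sum_{i=1}^n \bigl(Y_i - \mu_i\bigr)\, h'(X_i^\top \beta)\, X_i .
```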
70:10 So now if I multiply those two things 70:11 I have this Yi minus b prime of the theta 70:14 i, which I call by its good name, which is mu i. 70:19 b prime of theta i is the expectation 70:20 of Yi conditionally on Xi. 70:22 And then I multiply by this thing here. 70:24 So here, this thing is written coordinate by coordinate. 70:28 But I can write it as a big vector 70:30 when I stack them together. 70:33 And so what I claim is that this thing here 70:36 is of the form Y minus mu. 70:39 But here I put some tildes. 70:40 Because what I did is that first I multiplied everything 70:46 by g prime of mu for each mu. 70:50 OK. 70:50 So why not? 70:52 OK. 70:54 Actually, on this slide it will make no sense why I do this. 70:58 I basically multiply by g prime on one side 71:01 and divide by g prime on the other side. 71:04 So what I write so far is that the gradient of ln with respect 71:12 to beta is the sum from i equal 1 to n of Yi minus mu i, 71:19 let's call it, divide by phi times 71:23 H prime of Xi transpose beta Xi. 71:28 OK. 71:29 So I just stacked everything that's here. 71:31 And now I'm going to start calling things. 71:33 The first thing I'm going to do is I'm going to divide. 71:36 So this guy here I'm going to push here. 71:38 71:40 Now, this guy here I'm actually going 71:42 to multiply by g prime of mu i. 71:48 And this guy I'm going to divide by g prime of mu i. 71:51 So there's really nothing that happened here. 71:54 I just took g prime and multiply and divide it by g prime. 71:59 Why do I do this? 72:00 Well, that's actually going to be clear 72:02 when we talk about iteratively re-weighted least squares. 72:06 But now, essentially I have a new mu, a Y which is-- 72:11 so this thing now is going to be Y tilde minus mu 72:14 tilde, so i, i. 72:19 Now, this guy here I'm going to call Wi. 72:23 72:27 And I have the Xi that's there, which 72:31 means that now the thing that I have here I can write as 72:38 follows. 72:39 Gradient ln of beta is equal to what? 72:46 Well, I'm going to write it in matrix forms. 72:49 So I have the sum over i of something multiplied by Xi. 72:53 So I'm going to write it as x transpose. 72:57 Then I'm going to have this matrix W1 Wn, and then 73:02 0 elsewhere. 73:05 And then I'm going to have my Y tilde minus mu. 73:09 And remember, X is the matrix with-- sorry, 73:15 it should be a bit [INAUDIBLE]. 73:16 I have n, and then p. 73:19 And here I have my Xi j in this matrix on row i and column j. 73:26 And this is just a matrix that has the Wi's on the diagonal. 73:30 And then I have Y tilde minus mu. 73:32 So this is just the matrix we're writing of this formula. 73:36 All right. 73:37 So it's just saying that if I look 73:38 at the sum of weighted things of my columns of Xi, 73:42 it's basically the same thing. 73:44 When I'm going to multiply this by my matrix, 73:46 I'm going to get exactly those terms, right? 73:48 Yi minus mu i tilde times Wi. 73:52 And then when I actually take this Xi transpose 73:56 times this guy, I'm really just getting 73:58 the sum of the columns with the weights, right? 74:04 Agree? 74:05 If I look at this thing here, this 74:08 is a vector that has S coordinates, Wi times Yi 74:15 tilde minus mu i tilde. 74:19 And I have n of them. 74:21 So when I multiply X transpose by this guy, 74:24 I'm just looking at a weighted sum of the columns of X 74:28 transpose, which is a weighted sum of the rows of X, 74:31 which are exactly my Xi's. 74:34 All right, and that's this weighted sum of the Xi's. 74:36 74:40 OK. 
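The stacking step here, rewriting the sum of W_i (Ytilde_i - mutilde_i) X_i as X^T W (Ytilde - mutilde) with W = diag(W_1, ..., W_n), can be sanity-checked numerically. The numbers below are arbitrary placeholders; only the identity between the sum form and the matrix form matters.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 6, 3
X = rng.normal(size=(n, p))            # rows are X_1^T, ..., X_n^T
w = rng.uniform(0.1, 1.0, size=n)      # the weights W_1, ..., W_n
resid = rng.normal(size=n)             # stands in for Ytilde_i - mutilde_i

# Sum form:  sum_i  W_i (Ytilde_i - mutilde_i) X_i
grad_sum = sum(w[i] * resid[i] * X[i] for i in range(n))

# Matrix form:  X^T  diag(W_1, ..., W_n)  (Ytilde - mutilde)
grad_mat = X.T @ np.diag(w) @ resid

assert np.allclose(grad_sum, grad_mat)
```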
74:40 So here, as I said, the fact that we 74:43 decided to put this g prime of mu i here and g prime of mu i 74:48 here, we could have not done this, right? 74:50 We could have just said, I forget about the tilde 74:54 and just call it Yi minus mu i. 74:56 And here, I just put everything I don't know into some Wi. 75:00 And so why do I do this? 75:01 Well, it's because when I actually 75:03 start looking at the Hessian, what's going to happen? 75:06 AUDIENCE: [INAUDIBLE]. 75:08 PHILIPPE RIGOLLET: Yeah. 75:09 We'll do that next time. 75:10 But let's just look quickly at the outcome of the computation 75:14 of my Hessian. 75:16 So I compute a bunch of second derivatives. 75:20 And here, I have two terms, right? 75:21 Well, he's gone. 75:22 So I have two terms. 75:23 And when I take the expectation now, 75:26 it's going to actually change, right? 75:28 This thing is actually going to depend on Yi. 75:31 Because I have an H which is not the identity. 75:34 Oh, no, you're here, sorry. 75:37 So when I start looking at the expectation, 75:39 so I look at the conditional expectation given Xi. 75:42 The first term here has a Yi minus expectation. 75:46 So when I take the conditional expectation, 75:48 this is going to be 0. 75:49 The first term is going away when I take 75:50 the conditional expectation. 75:52 But this was actually gone already 75:54 if we had the canonical term, because the second derivative 75:56 of H when H is the identity is 0. 76:00 But if H is not the identity, H prime prime may not be 0. 76:03 And so I need that part to remove that term. 76:07 And so now, you know, I work a little bit, 76:09 and I get this term. 76:10 That's not very surprising. 76:11 In the second derivative, I see I have terms in b prime prime. 76:15 I have term in H prime, but squared. 76:18 And then I have my Xi outer Xi, Xi, Xi transpose, 76:21 which we know we would see. 76:23 OK. 76:23 So we'll go through those things next time. 76:26 But what I want to show you is that now once I compute this, 76:31 I can actually show that if I look at this product that 76:35 showed up, I had b prime prime times H prime squared. 76:40 One of those terms is actually 1 over g prime. 76:44 And so I can rewrite it as one of the H primes, 76:46 because I had a square, divided by g prime. 76:49 And now, I have this Xi Xi transpose. 76:51 So if I did not put the g prime in the W 76:57 that I put here completely artificially, 77:00 I would not be able to call this guy Wi, which is exactly 77:04 what it is from this board. 77:06 And now that this guy is Wi, I can actually write this thing 77:09 here as X transpose WX. 77:14 OK? 77:15 And that's why I really wanted my W 77:17 to have this g prime of mu i in the denominator. 77:20 Because now I can actually write a term that depends on W. 77:24 Now, you might say, how do I reconcile those two things? 77:26 What the hell are you doing? 77:28 And what the hell I'm doing is essentially 77:29 that I'm saying that if you write beta k according 77:35 to the Fisher-scoring iterations, 77:37 you can actually write it as just this term here, 77:41 which is of the form X transpose X inverse X transpose Y. 77:46 But I actually squeezed in these W's. 77:50 And that's actually a weighted least square. 77:52 And it's applied to this particular guy. 77:53 So we'll talk about those weighted least squares. 77:56 But remember, least squares is of the form-- 77:58 beta hat is X transpose X inverse X transpose Y. 
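Putting the two boards together, each Fisher-scoring step is a weighted least-squares fit, which is what "iteratively re-weighted least squares" means in practice. Below is a minimal sketch for Poisson regression with the log link, so the link is canonical, h' = 1 and phi = 1; the adjusted response z and the function name irls_poisson are my own notation for the step, not something stated verbatim in the lecture.

```python
import numpy as np

def irls_poisson(X, y, n_iter=25, tol=1e-10):
    """Iteratively re-weighted least squares for Poisson regression
    with the log link (canonical link, phi = 1).

    Each step is the weighted least-squares fit
        beta <- (X^T W X)^{-1} X^T W z,
    with weights W_i = 1 / g'(mu_i) and adjusted response
        z_i = X_i^T beta + g'(mu_i) (Y_i - mu_i).
    """
    n, p = X.shape
    beta = np.zeros(p)
    for _ in range(n_iter):
        eta = X @ beta
        mu = np.exp(eta)                # g(mu) = log mu  =>  mu = e^{eta}
        gprime = 1.0 / mu               # g'(mu) = 1 / mu
        W = 1.0 / gprime                # = mu in the Poisson/log case
        z = eta + gprime * (y - mu)     # adjusted ("working") response
        XtW = X.T * W                   # X^T with column i scaled by W_i
        beta_new = np.linalg.solve(XtW @ X, XtW @ z)
        if np.linalg.norm(beta_new - beta) < tol:
            return beta_new
        beta = beta_new
    return beta
```

On data simulated from the model, a handful of these weighted least-squares steps is typically enough for convergence.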
78:02 And here it's basically the same thing, 78:04 except that I squeeze in some W after my X transpose. 78:09 OK. 78:09 So that's how we're going to solve it. 78:12 I don't want to go into the details 78:14 now, mostly because we're running out of time. 78:18 Are there any questions?
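To make that last remark concrete: the weighted least-squares estimator is the ordinary least-squares formula with a diagonal weight matrix W squeezed in after each X transpose. A two-line comparison, with arbitrary placeholder data, looks like this.

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 50, 2
X = rng.normal(size=(n, p))
y = rng.normal(size=n)
W = np.diag(rng.uniform(0.5, 2.0, size=n))    # diagonal weight matrix

beta_ols = np.linalg.solve(X.T @ X, X.T @ y)           # (X^T X)^{-1} X^T y
beta_wls = np.linalg.solve(X.T @ W @ X, X.T @ W @ y)   # (X^T W X)^{-1} X^T W y
```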