https://www.youtube.com/watch?v=mc1y8m9-hOM&list=PLUl4u3cNGP60uVBMaoNERc6knT_MgPKS0&index=20

Transcript

The following content is provided under a Creative Commons license. Your support will help MIT OpenCourseWare continue to offer high quality educational resources for free. To make a donation or to view additional materials from hundreds of MIT courses, visit MIT OpenCourseWare at ocw.mit.edu.

PHILIPPE RIGOLLET: We're talking about generalized linear models. And in generalized linear models, we generalize linear models in two ways. The first one is to allow for a different distribution for the response variables. And the distributions that we wanted were the exponential family. This is a family that can be defined for random variables that take values in R^q in general, with parameters in R^k. But we're going to focus on a very specific case, when y is a real-valued response variable, which is the one you're used to when you're doing linear regression, and the parameter theta also lives in R. And so we're going to talk about the canonical case. So that's the canonical exponential family, where you have a density f theta of y which is of the form: the exponential of y, which interacts with theta only by taking a product; then a term that depends only on theta; some dispersion parameter phi; and then some normalization factor-- let's call it c of y, phi. It really should not matter too much, so it's c of y, phi, and that's really just the normalization factor. And here, we're going to assume that phi is known.

I have no idea what I write. I don't know if you guys can read. I don't know what chalk has been used today, but I just can't see it. That's not my fault. All right, so we're going to assume that phi is known. And so we saw that several distributions that we know well, including the Gaussian for example, belong to this family. And there are other ones, such as Poisson and Bernoulli. So if the PMF has this form, if you have a discrete random variable, this is also valid. And the reason why we introduced this family is because there are going to be some properties that we know: this thing here, this function, b of theta, is essentially what completely characterizes your distribution. So if phi is fixed, we know that the interaction is of this form, and this really just comes from the fact that we want the function to integrate to one. So this b here in the canonical form encodes everything we want to know. If I tell you what b of theta is-- and of course, I tell you what phi is, but let's say for a second that phi is equal to one-- if I tell you this b of theta, you know exactly what distribution I'm talking about. So it should encode everything that's specific to this distribution, such as the mean, the variance, all the moments that you would want. And we'll see how we can compute the mean and the variance, for example, from this thing. So today, we're going to talk about likelihood, and we're going to start with the likelihood function, or the log likelihood, for one observation.
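[Note: in symbols, the canonical exponential family density described above, with y the response, \theta the canonical parameter, and \phi the known dispersion:

f_\theta(y) = \exp\!\left( \frac{y\,\theta - b(\theta)}{\phi} + c(y, \phi) \right). ]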
From this, we're going to do some computations, and then we'll move on to the actual log likelihood based on n independent observations. And here, as we will see, the observations are not going to be identically distributed, because we're going to want each of them, conditionally on x, to have its own theta, where theta is just a different function of x for each of the observations.

So remember, the log likelihood-- and this is for one observation-- is just the log of the density, right? And we have this identity that I mentioned at the end of the class on Tuesday. And this identity is just that the expectation of the derivative of this guy with respect to theta is equal to 0. So let's see why. If I take the derivative with respect to theta of log f theta of x, what I get is the derivative with respect to theta of f theta of x, divided by f theta of x. Now, if I take the expectation of this guy, with respect to this theta as well, what I get-- well, what is the expectation? It's just the integral against f theta. Or if I'm in a discrete case, I just have the sum against f theta, if it's a PMF. That's just the definition: the expectation of, say, h of x is either the integral of h of x times f theta of x, if x is continuous, or just the sum of h of x times f theta of x, if x is discrete. So if it's continuous, you put this integral instead of the sum. This guy is the same thing, right? So I'm just going to illustrate the case when it's continuous. So this is what? Well, this is the integral of the partial derivative with respect to theta of f theta of x, divided by f theta of x, all times f theta of x, dx. And now, this f theta cancels, so I'm actually left with the integral of the derivative, which I'm going to write as the derivative of the integral. But f theta being a density for any value of theta that I can take, this function, as a function of theta, is constantly equal to 1. For any theta that I take, it takes the value 1. So this is constantly equal to 1. I put three bars to say that for any value of theta, this is 1, which actually tells me that the derivative is equal to 0. OK, yes?

AUDIENCE: What is the first [INAUDIBLE] that you wrote on the board?

PHILIPPE RIGOLLET: That's just the definition of the derivative of the log of a function.

AUDIENCE: OK.

PHILIPPE RIGOLLET: Log of f, prime, is f prime over f. That's a log, yeah. Just by elimination.

AUDIENCE: [INAUDIBLE]

PHILIPPE RIGOLLET: I'm sorry?

AUDIENCE: When you write a squiggle that starts with an l, I assume it's a lambda.

PHILIPPE RIGOLLET: And you do good, because that's probably how my mind processes it. And so I'm like, yeah, l. Here is enough information. OK, everybody is good with this? So that was convenient. It just said that the expectation of the derivative of the log likelihood is equal to 0. That's going to be our first identity.
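[Note: the first identity, written out for the continuous case the lecture works with; the discrete case replaces the integral by a sum:

E_\theta\!\left[ \frac{\partial}{\partial\theta} \log f_\theta(Y) \right]
= \int \frac{\partial_\theta f_\theta(y)}{f_\theta(y)} \, f_\theta(y) \, dy
= \frac{\partial}{\partial\theta} \int f_\theta(y) \, dy
= \frac{\partial}{\partial\theta}\, 1 = 0. ]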
Let's move on to the second identity, using exactly the same trick, which is: let's hope that at some point, we have the integral of this function that's constantly equal to 1 as a function of theta, and then use the fact that its derivative is equal to 0. So if I start taking the second derivative of the log of f theta, what is this? Well, it's the derivative of this guy here, so I'm going to go straight to it. So it's the second derivative of f theta of x, times f theta of x, minus the first derivative of f theta of x, times the first derivative of f theta of x.

Here is some super important stuff-- no, I'm kidding. So you can still see that guy over there? So it's just the square. And then, I divide by f theta of x squared. So here I have the second derivative times f itself, and here, I have the product of the first derivative with itself, so that's the square. So now, I'm going to integrate this guy. So if I take the expectation of this thing here, what I get is the integral. The only thing that's going to happen when I take my integral is that one of the f thetas in the denominator is going to cancel against f theta, right? So I'm going to get the second derivative minus the first derivative squared, divided by f theta. And I know that this thing is equal to 0.

Now, one of these guys here-- sorry, why do I have-- so I have this guy here. So this guy here is going to cancel. So this is what this is equal to: the integral of the second derivative of f theta of x, because those two guys cancel, minus the integral of the first derivative squared, divided by f theta. And this is telling me what?

Yeah, I'm losing one, because I have some weird sequences. Thank you. OK, this is still positive. I want to say that this thing is actually equal to 0. But then, it gives me some weird things, which is that I have an integral of a positive function which is equal to 0. Yeah, that's what I'm thinking of doing. But I'm going to get 0 for this entire integral, which means that I have the integral of a positive function which is equal to 0, which means that this function is equal to 0, which sounds a little bad-- it basically tells me that this function, f theta, is linear. So I went a little too far, I believe, because I only want to prove that the expectation of the second derivative--

Yes, so I want to pull this out. So let's see, if I keep rolling with this, I'm going to get-- well, no, because the fact that it's divided by f theta means that, indeed, the second derivative is equal to 0. So it cannot do this here.

AUDIENCE: [INAUDIBLE]

PHILIPPE RIGOLLET: OK, but let's write it like this. You're right, so this is what? This is the expectation of the square of the partial derivative with respect to theta of f theta of x, divided by f theta of x. And this is exactly the derivative of the log, right? So indeed, this thing is equal to the expectation with respect to theta of the square of the partial of log f theta with respect to theta.
All right, so this is the guy that I want, squared. This is one of the guys that I want. And this is actually equal-- so this will be equal to the expectation--

AUDIENCE: [INAUDIBLE]

PHILIPPE RIGOLLET: Oh, right, so this term should be equal to 0. This was not 0. You're absolutely right. So at some point, I got confused, because I thought putting this equal to 0 would mean that this is 0. But this thing is not equal to 0. So this term, you're right-- I take the same trick as before, and this term is actually equal to 0, which means that now I have what's on the left-hand side equal to what's on the right-hand side. And if I recap, I get that the expectation with respect to theta of the second derivative of the log of f theta is equal to minus-- because I had a minus sign here-- the expectation with respect to theta of the square of the derivative of log f theta with respect to theta. Thank you for being on watch when I'm falling apart.

All right, so this is exactly what you have here, except that both terms have been put on the same side. So those things are going to be useful to us, so maybe we should write them somewhere here. And so we have that the expectation of the second derivative of the log is equal to minus the expectation of the square of the first derivative. And this is, indeed, my Fisher information. This is just telling me what the second derivative of my log likelihood at theta is, right? So everything is with respect to theta when I take these expectations. And so it tells me that the expectation of the second derivative-- at least, first of all, what it's telling me is that it's concave, because the second derivative of this thing, which is the second derivative of the KL divergence, is actually minus something which must be non-negative. And so it's telling me that it's concave here at this [INAUDIBLE]. And in particular, it's also telling me that it has to be strictly positive, unless the derivative of f is equal to 0-- unless f is constant, then it's not going to change. All right, do you have a question?

So now, let's use this. What does my log likelihood look like when I actually compute it for this canonical exponential family? We have this exponential function, so taking the log should make my life much easier, and indeed, it does. So if I look at the canonical case, what I have is that the log of f theta of x is equal simply to y theta, minus b of theta, divided by phi, plus this function that does not depend on theta. So let's see what this tells me. Let's just plug those equalities in there. I can take the derivative of the right-hand side and just say that, in expectation, it's equal to 0. So if I start looking at the derivative, this is equal to what? Well, here I'm going to pick up only y. Sorry, this is a function of y-- I was talking about likelihood, so I actually need to put the random variable here. So I get y, minus the derivative of b of theta. Since it's only a function of theta, I'm just going to write b prime-- is that OK-- rather than having the partial with respect to theta.
Then, this is a constant. This does not depend on theta, so it goes away. So if I start taking the expectation of this guy, I get the expectation of this guy, which is the expectation of y, minus-- well, this does not depend on y, so it's just itself-- b prime of theta. And the whole thing is divided by phi. But from my first equality over there, I know that this thing is actually equal to 0. We just proved that. So in particular, it means that since phi is non-zero-- or, phi is not infinity-- this guy must be equal to this guy. And so that implies that the expectation with respect to theta of y is equal to b prime of theta.

I'm sorry, you're not registered in this class. I'm going to have to ask you to leave. I'm not kidding.

AUDIENCE: [INAUDIBLE]

PHILIPPE RIGOLLET: You are? I've never seen you here. I saw you for the first lecture. OK.

All right, so the expectation with respect to theta of y is equal to b prime of theta. Everybody agrees with that? So this is actually nice, because if I give you an exponential family, the only thing I really need to tell you is what b of theta is. And if I give you b of theta, then computing a derivative is actually much easier than having to integrate y against the density itself. You could really have fun and try to compute this, which you would be able to do, right? And then, there's the plus c of y, phi, blah, blah, blah, dy. And that's the way you would actually compute this thing. Sorry, this guy is here. That would be painful. I don't know what this normalization looks like, so I would have to also make that explicit so I can actually compute this thing. And you know, just the same way, if you want to compute the expectation of a Gaussian-- well, the expectation of a Gaussian is not the most difficult one. But even if you compute the expectation of a Poisson, you start to have to work a little bit. There are a few things that you have to work through. Here, I'm just telling you: all you have to know is what b of theta is, and then you can just take the derivative.

Let's see what the second equality is going to give us. OK, so what is the second equality? It's telling me that if I look at the second derivative, and then I take its expectation, I'm going to have something which is equal to negative this guy squared. Sorry, that was the log, right? We've already computed this first derivative of the log likelihood. It's just the expectation of the square of this thing here. So the expectation of the square of the derivative with respect to theta of log f theta of x is equal to the expectation of the square of y minus b prime of theta, divided by phi squared. OK, sorry, I'm actually going to move on with the-- so if I start computing, what is this thing? Well, we just agreed that this was what? So that's just the expectation of y. We just computed it here.

AUDIENCE: [INAUDIBLE]

PHILIPPE RIGOLLET: Yeah, that's b prime. There's a derivative here.
So now, this is what? This is simply-- anyone? I'm sorry? The variance of y, but scaled by phi squared. OK, so this is the negative of the right-hand side of our identity. And now, I just have to take one more derivative of this guy. So now, if I look at the left-hand side, I have the second derivative of log f theta of y with respect to theta. So this thing is equal to-- well, I'm not left with much. The y part is going to go away, and I'm left only with minus the second derivative of b of theta, divided by phi. So if I take the expectation-- well, it just doesn't change; this is deterministic. So now, what I've established is that this guy is equal to negative this guy. So for those two things, the signs are going to go away. And so this implies that the variance of y is equal to b prime prime of theta-- and then, I have a phi squared in the denominator that cancels against only one power of phi, so it's times phi.

So now, I have that my second derivative-- since I know phi-- completely determines the variance. So basically, that's why b is called the cumulant generating function. It's not generating moments, but cumulants. But the cumulants, in this case, correspond, basically, to the moments, at least for the first two. If I start going farther, I'm going to have more combinations of the expectation of y cubed, y squared, and y itself. But as we know, those first two are the ones that are usually the most useful, at least if we're interested in asymptotic performance. The central limit theorem tells us that all that matters are the first two moments, and then the rest-- it doesn't matter; it's all going to [INAUDIBLE] anyway.

So let's go to a Poisson, for example. So if I had a Poisson distribution-- this is a discrete distribution. And what I know is that f-- let me call mu the parameter-- f mu of y is mu to the y, divided by y factorial, times exponential of minus mu. OK, so mu is usually called lambda, and y is usually called x; that's why it takes me a little bit of time. But usually it's lambda to the x over factorial of x, times exponential of minus lambda. Since this is just the series expansion of the exponential when I sum those things from 0 to infinity, this thing sums to 1. But then, if I wanted to start understanding what the expectation of this thing is-- so if I want to compute the expectation with respect to mu of y-- then I would have to compute the sum from k equals 0 to infinity of k, times mu to the k, over factorial of k, times exponential of minus mu, which means that I would, essentially, have to take the derivative of my series in the end. So I can do this. This is a standard exercise. You've probably done it when you took probability. But let's see if we can actually just read it off from the first derivative of b.
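[Note: collecting what has been derived so far, before it gets applied to the Poisson example below:

E_\theta\!\left[ \partial_\theta \log f_\theta(Y) \right] = 0, \qquad
E_\theta\!\left[ \partial^2_\theta \log f_\theta(Y) \right] = - E_\theta\!\left[ \left( \partial_\theta \log f_\theta(Y) \right)^2 \right],

and, for the canonical exponential family,

E_\theta[Y] = b'(\theta), \qquad \mathrm{Var}_\theta(Y) = \phi \, b''(\theta). ]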
So to do that, we need to write this in the form of an exponential, where there is one parameter that captures mu, which interacts with y just by taking this parameter times y; then something that depends only on y; and then something that depends only on mu. That's the important one-- that's going to be our b. And then, there's going to be something that depends only on y. So let's write this and check that this f mu, indeed, belongs to this canonical exponential family. So I definitely have an exponential that comes from this guy. So I have minus mu. And then, this thing is going to give me what? It's going to give me plus y log mu. And then, I'm going to have minus log of y factorial. So clearly, I have a term that depends only on mu, a term that depends only on y, and I have a product of y and something that depends on mu. If I want to be canonical, I must have this to be exactly the parameter theta itself. So I'm going to call this guy theta. So theta is log mu, which means that mu is equal to e to the theta. And so wherever I see mu, I'm going to replace it by [INAUDIBLE] theta, because my new parameter now is theta. So this is what? This is equal to the exponential of y times theta, and then I'm going to have minus e to the theta, and then-- who cares-- something that depends only on y. So this is my c of y, and phi is equal to 1 in this case. So that's all I care about.

So let's use it. So this is my canonical exponential family. Y interacts with theta exactly like this. And then, I have this function. So this function here must be b of theta. So from this function, exponential of theta, I'm supposed to be able to read off what the mean is.

AUDIENCE: [INAUDIBLE]

PHILIPPE RIGOLLET: Because since in this course I always know what the dispersion is, I can actually always absorb it into theta. But here, it's really of the form y times something, divided by 1, right? If it was like log of mu divided by phi, that would be the question of whether I want to call phi my dispersion, or if I want to just have it in there. This makes no difference in practice. But the real thing is: it's never going to happen that this dispersion is an exact number. If it's an actual numerical number, this just means that this number should be absorbed in the definition of theta. But if it's something that is called sigma, say, and I will assume that sigma is known, then it's probably preferable to keep it in the dispersion, so you can see that there's this parameter here that you can, essentially, play with. It doesn't make any difference when you know phi.

So now, if I look at the expectation of some y-- so now, I have y, which follows my Poisson with parameter mu. I'm going to look at the expectation, and I know that the expectation is b prime of theta. Agreed? That's what I just erased, I think. Agreed with this, the derivative? So what is this? Well, it's the derivative of e to the theta, which is e to the theta, which is mu. So my Poisson is parametrized by its mean.
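[Note: a quick numerical sanity check of this, as a sketch; the use of scipy here is my own choice, not something from the lecture. For the Poisson, b(\theta) = e^\theta with \phi = 1 and \theta = \log\mu, so b'(\theta) and b''(\theta) should both equal \mu; the variance identity is the one read off just below.]

```python
import numpy as np
from scipy.stats import poisson

mu = 3.7
theta = np.log(mu)          # canonical parameter theta = log(mu)
b_prime = np.exp(theta)     # b'(theta) = e^theta, should be the mean
b_second = np.exp(theta)    # b''(theta) = e^theta, should be the variance (phi = 1)

print(b_prime, poisson.mean(mu))   # 3.7  3.7
print(b_second, poisson.var(mu))   # 3.7  3.7
```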
I can also compute the variance, which is equal to minus the second derivative of-- no, it's equal to the second derivative of b, and the dispersion is equal to 1. Again, if I took phi elsewhere, I would see it here as well. So if I just absorbed phi here, I would see it divided here, so it would not make any difference. And what is the second derivative of the exponential? Still the exponential-- so it's still equal to mu. So that certainly makes our life easier.

Just one quick remark-- here's a function; I am giving you a problem. Can the b function ever be equal to log of theta? Who says yes? Who says no? Why?

AUDIENCE: [INAUDIBLE]

PHILIPPE RIGOLLET: Yeah. So what we've learned from this-- it's sort of completely analytic, right? We just took derivatives, and this thing just happened. This thing actually allowed us to relate the second derivative of b to the variance. And one thing that we know about a variance is that it is non-negative. And in particular, here it's always positive. If they give you a canonical exponential family that has zero variance, trust me, you will see it. That means that this thing is not going to look like something that's finite; it's going to have a point mass. It's going to take the value infinity at one point. So this will, basically, never happen. This thing is, actually, strictly positive, which means that b is always strictly convex. It means that the second derivative of this function, b, has to be strictly positive, and so the function is convex. So this, log of theta, is concave, so it's definitely not working. I need to have something that looks like this when I talk about my b. So theta squared, e to the theta-- we'll see a bunch of them. And there's a bunch of them. But if you start writing something, and you find b-- try to think of the plot of b in your mind-- and you find that b looks like it's going to become concave, you've made a sign mistake somewhere.

All right, so we've done a pretty big parenthesis to try to characterize what the distribution of y was going to be. We wanted to extend from, say, Gaussian to something else. But when we're doing regression, which means generalized linear models, we are not interested in the distribution of y, but really the conditional distribution of y given x. So I need now to couple those back together. So what I know is that this same mu, in this case, which is the expectation-- what I want to say is that the conditional expectation of y given x is some mu of x. When we did linear models, we said: well, this thing is some x transpose beta, for linear models. And the whole premise of this chapter is to say: well, this might make no sense, because x transpose beta can take the entire range of real values, whereas this mu can take only a partial range. So even if you focus on the Poisson, for example, we know that the expectation of a Poisson has to be a non-negative number-- actually, a positive number as soon as you have a little bit of variance. It's mu itself-- mu is a positive number.
And so it's not going to make any sense to assume that mu of x is equal to x transpose beta, because you might find some x's for which this value ends up being negative. And so we're going to need what we call the link function: something that transforms mu, maps it onto the real line, so that you can now express it in the form x transpose beta. So we're going to take not this; we're going to assume that g of mu of x is now equal to x transpose beta, and that's the generalized linear model.

So as I said, it's weird to transform mu to make it take values on the real line. At least to me, it feels a bit more natural to take x transpose beta and make it fit the particular distribution that I want. And so I'm going to want to talk about g and g inverse at the same time. So I'm going to actually always take g. So g is my link function, and I'm going to want g to be continuously differentiable-- OK, let's say that it has a derivative, and its derivative is continuous. And I'm going to want g to be strictly increasing. And that actually implies that g inverse exists. Actually, that's not true. What I'm also going to want is that g of mu spans-- how do I do this?-- so I want g, as mu ranges over all possible values, whether they're all positive values, or whether they're values that are limited to the interval 0, 1-- I want those to span the entire real line, so that when I want to talk about g inverse defined over the entire real line, I know where I started. So this implies that g inverse exists.

What else does it imply about g inverse? So for a function to be invertible, I only need it to be strictly monotone. I don't need it to be strictly increasing. So in particular, the fact that I picked increasing implies that this guy, g inverse, is actually increasing.

AUDIENCE: [INAUDIBLE]

PHILIPPE RIGOLLET: That's the image. So this is my link function, and this slide is just telling me I want my function to be invertible, so I can talk about g inverse. I'm going to switch between the two. So what link functions am I going to get? For linear models, we just said there's no link function, which is the same as saying that the link function is the identity, which certainly satisfies all these conditions. It's invertible, it has all these nice properties, but we might as well not talk about it. For Poisson data, when we assume that the conditional distribution of y given x is Poisson, the mu, as I just said, is required to be positive. So I need a g that goes from the interval 0, infinity to the entire real line. I need a function that takes the positive half-line and spreads it over both positive and negative values. And here, for example, I could take the log link. The log is defined on this entire interval. And as I range from 0 to plus infinity, the log ranges from negative infinity to plus infinity. You can probably think of other functions that do that, like 2 times log. That's another one. But there are many others you can think of.
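[Note: a tiny illustration of this point, with made-up numbers and assuming numpy: the linear predictor x transpose beta can be negative, which is not a valid Poisson mean, but its image under the inverse of the log link, the exponential, always is.]

```python
import numpy as np

beta = np.array([0.5, -2.0])   # hypothetical coefficients
x = np.array([1.0, 1.3])       # hypothetical covariates
eta = x @ beta                 # x^T beta

print(eta)          # -2.1: not a valid Poisson mean under the identity link
print(np.exp(eta))  # about 0.12: positive, so a valid mean under the log link
```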
But let's say the log is one of them that you might want to think about. It is natural in the sense that it's one of the first functions we can think of. We will see, also, that it has another canonical property that makes it a natural choice.

The other one is the other example, where we had an even stronger condition on what mu could be. Mu could only be a number between 0 and 1: that was the probability of success of a coin flip-- the probability of success of a Bernoulli random variable. And now, I need g to map 0, 1 to the entire real line. And so here are a bunch of things that you can come up with, because now you start to have maybe-- I will soon claim that this one, log of mu divided by 1 minus mu, is the most natural one. But maybe, if you had never thought of this, that might not be the first function you would come up with, right? You mentioned trigonometric functions, for example, so maybe you can come up with something that comes from hyperbolic trigonometry or something. So what does this function do? Well, we'll see a picture, but this function does map the interval 0, 1 to the entire real line.

We also discussed the fact that, if we think reciprocally-- what I want, if I want to think about g inverse, is a function that maps the entire real line into the unit interval. And as we said, if I'm not a very creative statistician or probabilist, I can just pick my favorite continuous, strictly increasing cumulative distribution function, which, as we know, will arise as soon as I have a density that has support on the entire real line. If I have support everywhere-- if the density is strictly positive everywhere-- then it means that my cumulative distribution function has to be strictly increasing. And of course, it has to go from 0 to 1, because that's just the nature of those things. And so for example, I can take the Gaussian; that's one such function. But I could also take the double exponential, which looks like an exponential on one end and then an exponential on the other end. And basically, if you take capital Phi, which is the standard Gaussian cumulative distribution function, it does work for you, and you can take its inverse. And in this case-- we haven't talked about it-- this guy is called logit, or logit, and this guy is called probit. And you see it, usually, every time you have a package on generalized linear models that you are trying to implement: you have this choice. And for what's called logistic regression-- so it's funny that it's called logistic regression, but you can actually use the probit link, which in this case is called probit regression. But those things are essentially equivalent, and it's really a matter of taste-- maybe of communities; some communities might prefer one over the other. We'll see that, as I claimed before, the logistic one, the logit one, has a slightly more compelling argument for its reason to exist. I guess for this one, the probit, the compelling argument is that it involves the standard Gaussian, which, of course, is something that should show up everywhere.
And then, you can think about crazy stuff. Even crazy gets a name: the complementary log-log, which is the log of minus the log of 1 minus mu. Why not? So I guess you can iterate that thing. You can just put a log of 1 minus in front of this thing, and it's still going to go. So-- that's not true. I have to put a minus and take-- no, that's not true. So you can think of whatever you want.

So I claimed that the logit link is the natural choice, so here's a picture. I should have actually plotted the other one, so we can actually compare it. To be fair, I don't even remember how it would actually fit between those two functions. So the blue one, which is this one, for those of you who don't see the difference between blue and red-- sorry about that-- the blue one is the logistic one. So this guy is the function e to the x, over 1 plus e to the x. As you can see, this is a function that's supposed to map the entire real line into the interval 0, 1. So that's supposed to be the inverse of your link function, and I claim that this is the inverse of the logistic, of the logit function. And the red one-- well, this is the Gaussian CDF, so it's clearly the inverse of the probit link, the inverse of the inverse of the Gaussian CDF. That's the red one; that's the one that goes here. I would guess that the complementary log-log is something that probably goes above here, and for which the slope is actually even a little flatter as you cross 0.

So of course, these are not our link functions. These are the inverses of our link functions. So what do they look like when I actually, basically, flip my picture like this? So this is what I see. And so I can see that, in blue, this is my logistic link. It crosses 0 with a slightly faster rate. Remember, if we could use the identity, that would be very nice for us. We would just want to take the identity. The problem is that if I start having the identity that goes here, it's going to start being a problem. And this is the probit link, the Phi inverse, that you see here. It's a little flatter.

You can compute the derivative at the center of those guys. What is the derivative of the-- so I'm taking the derivative of log of x over 1 minus x. So it's 1 over x, minus 1 over 1 minus x. So if I look at 0.5-- sorry, this is the interval 0, 1, so I'm interested in the slope at 0.5. Yes, it's plus, thank you. So at 0.5, what I get is 2 plus 2. Yeah, so that's the slope that we get. And if you compute the derivative of the other one-- what is the derivative of Phi inverse? Well, it's 1 divided by little phi of capital Phi inverse of x. So little phi at 1/2-- I don't know. Yeah, I guess I can probably compute the derivative of the capital Phi at 0, which is going to be just 1 over square root of 2 pi, and then just say: well, the slope has to be 1 over that-- square root of 2 pi. So that's just a comparison. But again, so far, we do not have any reason to prefer one to the other. And so now, I'm going to start giving you some reasons to prefer one to the other.
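[Note: a quick check of the two slopes just computed, as a sketch assuming scipy: the derivative of the logit at \mu = 1/2 is 1/\mu + 1/(1-\mu) = 4, and the derivative of \Phi^{-1} at 1/2 is 1/\varphi(0) = \sqrt{2\pi} \approx 2.51, so the probit link is indeed a little flatter at the center.]

```python
import numpy as np
from scipy.stats import norm
from scipy.special import expit

# Both inverse links map the real line into (0, 1).
z = np.array([-3.0, 0.0, 3.0])
print(expit(z))      # logistic: e^z / (1 + e^z)
print(norm.cdf(z))   # inverse probit: standard Gaussian CDF

# Slopes of the two links at mu = 1/2.
mu = 0.5
print(1 / mu + 1 / (1 - mu))          # logit'(1/2) = 4
print(1 / norm.pdf(norm.ppf(mu)))     # (Phi^{-1})'(1/2) = sqrt(2*pi) ~ 2.5066
```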
And one of those two-- actually, for each canonical family, there is something which is called the canonical link. And when you don't have any other reason to choose anything else, why not choose the canonical one? And the canonical link is the one that says: OK, what I want is g to map mu onto the real line. But mu is not the parameter of my canonical family. Here, for example, mu is e to the theta, but the canonical parameter is theta. But the parameter of a canonical exponential family is something that lives on the entire real line. It was defined for all thetas. And so in particular, I can just take theta to be the thing that's x transpose beta. And so in particular, I'm just going to try to find the link that just says: OK, when I take g of mu, I'm going to map it-- so that's what it's going to be. So I know that g of mu is going to be equal to x transpose beta. And now, what I'm going to say is: OK, let's just take the g that makes this guy equal to theta, so that it is theta that I actually model as x transpose beta. Feels pretty canonical, right? What else? What other natural, easy choice would you take? This was pretty easy. There is a natural parameter for this canonical family, and it takes values on the entire real line. I have a function that maps mu onto the entire real line, so let's just map it to the actual parameter.

So now, OK, why do I have this? Well, we've already figured that out. The canonical link function is strictly increasing. Sorry-- so I said that now I want this guy-- I want g of mu to be equal to theta, which is equivalent to saying that I want mu to be equal to g inverse of theta. But we know that mu is what? b prime of theta. So that means that b prime is the same function as g inverse. And I claim that this is actually giving me, indeed, a function that has the properties that I want. Because before, I said: just pick any function that has these properties. And now, I'm giving you a very hard rule to pick this, though you still need to check that it satisfies those conditions-- in particular, that it's increasing and invertible. And for this to be strictly increasing and invertible, really what I need is that the inverse is strictly increasing and invertible, which is the case here, because b prime, as we said-- well, b prime is the derivative of a strictly convex function. A strictly convex function has a second derivative that's strictly positive. We just figured that out using the fact that the variance was strictly positive. And if phi is strictly positive, then this thing has to be strictly positive. So b prime prime is strictly positive-- this is the derivative of a function called b prime-- and if your derivative is strictly positive, you are strictly increasing. And so we know that b prime is, indeed, strictly increasing. And what I need also to check-- well, I guess this is already checked on its own, because b prime is actually mapping all of R into the possible values.
When theta ranges over the entire real line, then b prime ranges over the entire interval of the mean values that it can take. And so now, I have this thing that's completely defined. B prime inverse is a valid link, and it's called the canonical link.

OK, so again, if I give you an exponential family-- which is another way of saying I give you a convex function, b, which gives you some exponential family-- then if you just take b prime inverse, this gives you the associated canonical link for this canonical exponential family. So clearly, there's an advantage to doing this, which is that I don't have to actually think about which one to pick, if I don't want to think about it. But there are other advantages that come with it, and we'll see that in the representations. There are, basically, going to be some nice cancellations that show up.

So before we go there, let's just compute the canonical link for the Bernoulli distribution. So remember, the Bernoulli distribution has a PMF which is part of the canonical exponential family. So the PMF of the Bernoulli is f theta of x-- let me just write it like this. So it's p to the y, let's say, times 1 minus p to the 1 minus y, which I will write as the exponential of y log p, plus 1 minus y times log of 1 minus p. OK, we did that last time. Now, I'm going to group my terms in y to see how y interacts with this parameter p. And what I'm getting is y times log of p divided by 1 minus p. And then, the only term that remains is log of 1 minus p. Now, I want this to be a canonical exponential family, which means that I just need to call this guy-- so it is part of the exponential family; you can read that. If I want it to be canonical, this guy must be theta itself. So I have that theta is equal to log of p over 1 minus p. If I invert this thing, it tells me that p is e to the theta, divided by 1 plus e to the theta. It's just inverting this function. In particular, it means that log of 1 minus p is equal to the log of 1 minus this thing. So the exponential thetas go away in the numerator-- this is what I get-- and the log of 1 minus this guy is equal to minus log of 1 plus e to the theta. So I'm going a bit too fast, but these are very elementary manipulations-- maybe it requires one more line to convince yourself. But just do it in the comfort of your room.

And then, what you have is the exponential of y times theta, and then I have minus log of 1 plus e to the theta. So this is the representation of the PMF of a Bernoulli distribution as a member of the canonical exponential family. And it tells me that b of theta is equal to log of 1 plus e to the theta. That's what I have there. From there, I can compute the expectation, which, hopefully-- I'm going to get p as the mean and p times 1 minus p as the variance. Otherwise, that would be weird. So let's just do this. B prime of theta should give me the mean. And indeed, b prime of theta is e to the theta, divided by 1 plus e to the theta, which is exactly this p that I had there. OK, just for fun-- well, I don't know. Maybe that's not part of it.
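[Note: a symbolic check of the computation just done, as a sketch assuming sympy: starting from b(\theta) = \log(1 + e^\theta), differentiating and then inverting should recover the logit.]

```python
import math
import sympy as sp

theta, p = sp.symbols('theta p', positive=True)
b = sp.log(1 + sp.exp(theta))              # Bernoulli cumulant function b(theta)

b_prime = sp.simplify(sp.diff(b, theta))   # e^theta / (1 + e^theta), i.e. p
sol = sp.solve(sp.Eq(p, b_prime), theta)   # invert b'
print(b_prime)
print(sol[0])                              # log(p/(1 - p)) up to rewriting: the logit

# Numerical spot check at p = 0.3.
print(float(sol[0].subs(p, 0.3)), math.log(0.3 / 0.7))   # both about -0.847
```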
Yeah, let's not compute the second derivative. That's probably going to be on your homework at some point-- if not, on the final.

So b prime inverse now-- oh, I erased it, of course. G, the canonical link, is b prime inverse. And I claim that this is going to give me the logit function: log of mu over 1 minus mu. So let's check that. So b prime is this thing, so now I want to find the inverse. Well, I should really call my inverse a function of p. And I've done it before-- all I have to do is to solve this equation, which I've actually just done; that's where I'm actually coming from. So it's actually telling me that the solution of this thing is equal to log of p over 1 minus p. We just solved this thing both ways. And this is, indeed, logit of p, by definition of the logit. So b prime inverse, this function that seemed to come out of nowhere, is really just the inverse of b prime, which we know is the canonical link. And canonical is some sort of ad hoc choice that we've made by saying: let's just take the link such that g of mu gives me the actual canonical parameter theta. Yeah?

AUDIENCE: [INAUDIBLE]

PHILIPPE RIGOLLET: You're right. Now, of course, I'm going through all this trouble, but you could see it immediately. I know this is going to be theta. We also have prior knowledge, hopefully, that the expectation of a Bernoulli is p itself. So right at this step, when I said that I was going to take theta to be this guy, I already knew that the canonical link was the logit-- because I just said: oh, here's theta, and it's just this function of mu [INAUDIBLE].

OK, so you can do that for a bunch of examples, and this is what they're going to give you. So in the Gaussian case, b of theta-- we've actually computed it once-- is theta squared over 2. So the derivative of this thing is really just theta, which means that g, or g inverse, is actually equal to the identity. And again, sanity check: when I'm in the Gaussian case, there's nothing generalized about generalized linear models if you don't have a link. The Poisson case-- you can actually check. Did we do this, actually? Yes, we did. So that's when we had this e to the theta. And so b is e to the theta, which means that the natural link is the inverse of its derivative, which is the log, the inverse of the exponential. And so that's the logarithm link-- and as I said, I used the word natural; you can also use the word canonical if you want to describe this function as being the right function to map the positive real line to the entire real line. The Bernoulli-- we just did it. So b, the cumulant generating function, is log of 1 plus e to the theta, and the canonical link is log of mu over 1 minus mu. And the gamma-- where the thing you're going to see is minus log of minus [INAUDIBLE]-- you see the reciprocal link is the link that actually shows up, so minus 1 over mu. That maps.

So are there any questions about the canonical links, canonical families? I use the word canonical a lot. But is everything fitting together right now? So we have this function.
We have a canonical exponential family, by assumption. It has a function, b, which contains all the information we want. At the beginning of the lecture, we established that it has information about the mean in the first derivative, about the variance in the second derivative, and it's also giving us a canonical link. So just cherish this b once you've found it, because it's everything you need. Yeah?

AUDIENCE: [INAUDIBLE]

PHILIPPE RIGOLLET: I don't know-- a political preference? I don't know, honestly. If I were a serious practitioner, I probably would have a better answer for you. At this point, I just don't. I think it's a matter of practice and actual preferences. You can also try both. We didn't mention it, but there's this idea of cross-validation-- well, we mentioned it without going too much into detail. But you could try both and see which one performs best on a yet unseen data set. In terms of prediction, just say: I prefer this one of the two. Because this actually comes as part of your modeling assumption, right? Not only did you decide to model the image of mu through the link function as a linear model, but really, what your model is saying is: well, you have two pieces of [INAUDIBLE]-- the distribution of y, but you also have the fact that mu is modeled as g inverse of x transpose beta. And for different g's, these are just different modeling assumptions, right? So why should this be linear-- I don't know. My authority, as a person who has not examined the [INAUDIBLE] data sets for both things, would be that the changes are fairly minor.

OK, so this was all for one observation. We just, basically, did probability. We described some density, some properties of the densities, how to compute expectations. That was really just probability. There was no data involved at any point. We did a bit of modeling, but it was all for one observation. What we're going to try to do now is the reverse engineering of probability that is statistics: given data, what can I infer about my model? Now remember, there are three parameters floating around in this model. There is one that is theta, there is one that is mu, and there is one that is beta. OK, so those are the three parameters that are floating around. What we said is that the expectation of y, given x, is mu of x. So if I estimate mu, I know the conditional expectation of y given x, which definitely gives me theta of x. How do I go from mu of x to theta of x? The inverse of what-- of the arrow? Yeah, sure, but how do I go from this guy to this guy? So theta, as a function of mu, is?

AUDIENCE: [INAUDIBLE]

PHILIPPE RIGOLLET: Yeah, so we just computed that mu was b prime of theta. So it means that theta is just b prime inverse of mu. So those two things are the same as far as we're concerned, because we know that b prime is strictly increasing, so it's invertible. So it's just a matter of re-parametrization, and we can switch from one to the other whenever we want.
But why do we go through mu? Because so far, for the entire semester, I told you there was one parameter, theta. It does not have to be the mean, and that's the parameter that we care about. It's the one on which we want to do inference. That's the one for which we're going to compute the Fisher information. This was the parameter that was our object of worship. And now, I'm saying: oh, I'm going to have mu coming around. And why do we have mu? Because this is the mu that we use to go to beta. So I can go freely from theta to mu using b prime or b prime inverse. And now, I can go from mu to beta, because I have that g of mu of x is beta transpose x. So in the end, now, this is going to be my object of worship. This is going to be the parameter that matters. Because once I set beta, I set everything else through this chain.

So the question is: if I start stacking up this pile of parameters-- so I start with my beta, which in turn gives me a mu, which in turn gives me a theta-- can I just have a long, streamlined-- what is the outcome when I actually start writing my likelihood, not as a function of theta, not as a function of mu, but as a function of beta, which is the one at the end of the chain? And hopefully, things are going to happen nicely-- and they might not. Yeah?

AUDIENCE: [INAUDIBLE]

PHILIPPE RIGOLLET: Is g-- that's my link. G of mu of x-- now, mu is a function of x, because it's conditional on x. So this is really theta of x, mu of x; but g is not a function of x, because it's just something that tells me what the function of x is.

AUDIENCE: [INAUDIBLE]

PHILIPPE RIGOLLET: Mu is the conditional expectation of y, given x. It has, actually, a fancy name in the statistics literature. It's called-- anybody know the name of the function mu of x, which is the conditional expectation of y given x?

AUDIENCE: [INAUDIBLE]

PHILIPPE RIGOLLET: That's the regression function. That's the actual definition. If I tell you what the definition of the regression function is, it's just the conditional expectation of y, given x. And I could look at any property of the conditional distribution of y given x. I could look at the conditional 95th percentile. I can look at the conditional median. I can look at the conditional [INAUDIBLE] range. I can look at the conditional variance. But I decide to look at the conditional expectation, which is called the regression function. Yes?

AUDIENCE: [INAUDIBLE]

PHILIPPE RIGOLLET: Oh, there's no transpose here. Actually, only Victor-Emmanuel used this prime for transpose, and I found it confusing with the derivatives. So the prime here is only a derivative.

AUDIENCE: [INAUDIBLE]

PHILIPPE RIGOLLET: Oh, yeah, sorry, beta transpose x. So you said what? I said that g of mu of x is beta transpose x?

AUDIENCE: [INAUDIBLE]

PHILIPPE RIGOLLET: Isn't that the same thing? X is a vector here, right?

AUDIENCE: Yeah.

PHILIPPE RIGOLLET: So x transpose beta and beta transpose x are the same thing.
AUDIENCE: [INAUDIBLE]

PHILIPPE RIGOLLET: So beta looks like this. X looks like this. It's just a simple number. Yeah, you're right-- once I start to look at matrices, I'm going to have to be slightly more careful when I do this.

OK, so let's do the reverse engineering. I'm giving you data. From this data, hopefully, you should be able to get what the conditional-- if I gave you an infinite amount of data, of pairs x, y, you would know exactly what the conditional distribution of y given x is. And in particular, you would know what the conditional expectation of y given x is, which means that you would know mu, which means that you would know theta, which means that you would know beta. Now, when I have a finite number of observations, I'm going to try to estimate mu of x. But really, I'm going to go the other way around. Because the fact that I assume, specifically, that mu of x is of this form-- that g of mu of x is x transpose beta-- means that I only have to estimate beta, which is a much simpler object than the entire regression function. So that's what I'm going to go for. I'm going to try to represent the likelihood, the log likelihood, of my data as a function, not of theta, not of mu, but of beta-- and then maximize that guy.

So now, rather than thinking of just one observation, I'm going to have a bunch of observations. So this might actually look a little confusing, but let's just make sure that we understand each other before we go any further. So I'm going to have observations x1, y1, all the way to xn, yn, just like in an ordinary regression problem, except that here my y's might be 0/1 valued. They might be positive valued. They might be exponential. They might be anything in the canonical exponential family. OK, so I have this thing, and now, what I have is that my observations are x1, y1, up to xn, yn. And what I want is-- I'm going to assume that the conditional expectation of yi, given-- the conditional distribution of yi, given xi, is something that has this density. Did I put an i on y? Yeah. I'm not going to deal with the phi and the c now. And why do I have theta i, and not theta? Because theta i is really a function of xi. So it's really theta i of xi. But what do I know about theta i of xi? It's actually equal to b-- I did this error twice-- b prime inverse of mu of xi. And I'm going to assume that this is of the form beta transpose xi. And this is why I have theta i-- because this theta i is a function of xi, and I'm going to assume a very simple form for this thing. Sorry, sorry, sorry, sorry-- I should not write it like this. That is only when I have the canonical link. So this is actually equal to b prime inverse of g inverse of xi transpose beta. And with the canonical link, those two things actually cancel each other.

So as before, I'm going to stack everything into some-- well, actually, I'm not going to stack anything for the moment. I'm just going to give you a peek at what's happening next week, rather than just manipulating the data.
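[Note: in symbols, the model for the n observations just described, with nothing added beyond notation:

Y_i \mid X_i = x_i \;\sim\; f_{\theta_i}(y) = \exp\!\left( \frac{y\,\theta_i - b(\theta_i)}{\phi} + c(y, \phi) \right),
\qquad
\theta_i = (b')^{-1}\!\big( \mu(x_i) \big) = (b')^{-1}\!\big( g^{-1}(x_i^\top \beta) \big),

and with the canonical link g = (b')^{-1}, this reduces to \theta_i = x_i^\top \beta. ]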
72:28 So here is how we're going to proceed at this point. 72:33 Well, now I want to write my likelihood function 72:36 not as a function of theta, but as a function of beta, 72:39 because that's the parameter over which I'm actually trying to maximize the likelihood. 72:44 So if I have a link-- 72:47 so this thing that matters here, I'm going to call h. 72:50 72:53 By definition, this is going to be h of xi transpose beta. 72:58 Helena, you have a question? 73:00 AUDIENCE: Uh, no [INAUDIBLE] 73:02 PHILIPPE RIGOLLET: So this is just all the things 73:04 that we know. 73:04 By definition of the fact that mu 73:09 is b prime of theta-- the mean is b prime of theta-- 73:11 it means that theta is b prime inverse of mu. 73:14 And then, mu is modeled from the systematic component. 73:19 G of mu is xi transpose beta, so this is 73:21 g inverse of xi transpose beta. 73:23 So I want to have b prime inverse of g inverse. 73:27 This function is a bit annoying to say, 73:30 so I'm just going to call it h. 73:32 And when I do the composition of two inverses, 73:34 I get the inverse of the composition of those two things 73:36 in the reverse order-- 73:38 so h is really the inverse of g composed with b 73:42 prime, that is, (g of b prime) inverse. 73:46 And now, if I have the canonical link, 73:48 since I know that g is b prime inverse, 73:51 this is really just the identity. 73:54 As you can imagine, this entire thing, 73:58 which is actually quite complicated-- 73:59 you would just say, oh, this thing actually does not show up 74:01 when I have the canonical link. 74:03 I really just have that theta can be replaced by xi transpose beta. 74:06 So think about going back to this guy here. 74:09 Now, theta becomes only xi transpose beta. 74:15 That's going to be much simpler to optimize, 74:18 because remember, when I go to the log likelihood, 74:20 this thing is going to go away. 74:21 I'm going to sum those guys. 74:23 And so what I'm going to have is something which 74:24 is essentially linear in beta. 74:26 And then, I'm going to have this minus b, 74:28 which is just minus the sum of convex functions of beta. 74:31 And so I'm going to have to bring in the tools of convex 74:34 optimization. 74:34 Now, it's not just going to be take the gradient, set it to 0. 74:37 It's going to be more complicated to do that. 74:39 I'm going to have to do that in an iterative fashion. 74:42 And so that's what I'm telling you, 74:43 when you look at your log likelihood for all 74:46 those functions. 74:47 You sum, the exponential goes away because you had the log, 74:50 and then, you have all these things here. 74:51 I kept the b. 74:52 I kept the h. 74:53 But if h is the identity, this is the linear function, 74:56 the linear part, yi times xi transpose 74:59 beta, minus b of my theta, which is now only xi transpose beta. 75:03 And that's the function I want to maximize in beta. 75:05 75:10 It's a concave function, so maximizing it is a convex optimization problem. 75:11 When I know what b is, I have an explicit formula for this, 75:15 and I want to just bring in some optimization. 75:18 And that's what we're going to do, 75:19 and we're going to see three different methods, which 75:21 are really, basically, the same method. 75:24 It's just an adaptation or specialization 75:28 of the so-called Newton-Raphson method, which is essentially 75:31 telling you to do iterative local quadratic approximations 75:34 to your function-- so take a second order 75:36 [INAUDIBLE] expansion, minimize this guy, 75:38 and then do it again from where you were.
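Since the board work is not reproduced in the transcript, here is a minimal Python sketch of this maximization for the Bernoulli (logistic) case with the canonical link, where the log likelihood is sum_i [ y_i x_i' beta - b(x_i' beta) ] with b(theta) = log(1 + e^theta). The function names and the plain Newton iteration below are illustrative assumptions, not the lecture's implementation:

```python
import numpy as np

def b(theta):
    # Cumulant function for the Bernoulli case: b(theta) = log(1 + e^theta).
    return np.logaddexp(0.0, theta)

def log_likelihood(beta, X, y):
    # Canonical-link log likelihood: sum_i [ y_i * (x_i' beta) - b(x_i' beta) ].
    eta = X @ beta
    return np.sum(y * eta - b(eta))

def newton_raphson(X, y, n_iter=25):
    # Illustrative Newton-Raphson ascent on the concave log likelihood above.
    n, p = X.shape
    beta = np.zeros(p)
    for _ in range(n_iter):
        eta = X @ beta
        mu = 1.0 / (1.0 + np.exp(-eta))    # mu = b'(theta), the fitted means
        W = mu * (1.0 - mu)                # b''(theta), the variance function
        grad = X.T @ (y - mu)              # gradient of the log likelihood
        hess = X.T @ (X * W[:, None])      # X' W X = minus the Hessian
        beta = beta + np.linalg.solve(hess, grad)
    return beta
```

The Hessian here is minus X' W X with W = diag(mu_i (1 - mu_i)), which is what makes the weighted least squares reformulation mentioned next possible.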
75:41 And we'll see that this can actually be 75:43 implemented using what's called iteratively re-weighted least 75:47 squares, which means that every step-- 75:49 since it's just a quadratic, it's 75:51 going to be just squares in there-- 75:53 can actually be solved by using a weighted least 75:56 squares version of the problem. 75:59 So I'm going to stop here for today. 76:02 So we'll continue and probably not finish this chapter today, 76:05 but finish next week. 76:07 And then, I think there's only one lecture left. 76:10 Actually, for the last lecture, what do you guys want to do? 76:13 76:16 Do you want to have doughnuts and cider? 76:18 Do you want to just have a more outlook-style lecture 76:25 on what's happening post-1975 in statistics? 76:31 Do you want to have a review for the final exam-- 76:36 pragmatic people. 76:38 AUDIENCE: [INAUDIBLE] interesting, advanced topics. 76:43 PHILIPPE RIGOLLET: You want to do interesting, advanced-- 76:46 for the last lecture? 76:48 AUDIENCE: Something that we haven't thought of yet. 76:50 PHILIPPE RIGOLLET: Yeah, that's basically what I'm asking, 76:53 right-- interesting, advanced topics, 76:55 versus ask me any question you want. 77:00 Those questions can be about interesting, advanced topics, 77:03 though. 77:03 Like, what are interesting, advanced topics? 77:06 I'm sorry? 77:06 AUDIENCE: Interesting with doughnuts-- is that OK? 77:08 PHILIPPE RIGOLLET: Yeah, we can always do the doughnuts. 77:10 [LAUGHTER] 77:11 AUDIENCE: As long as there are doughnuts. 77:14 PHILIPPE RIGOLLET: All right, so we'll do that. 77:16 So you guys have a good weekend.
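As a closing illustration of the iteratively re-weighted least squares idea mentioned above, here is a minimal sketch for the same Bernoulli (logistic) case. The working response z and the weights W are the standard ingredients; the specific function is an added example, not part of the lecture:

```python
import numpy as np

def irls(X, y, n_iter=25):
    # Iteratively re-weighted least squares for the logistic case.
    # Each pass solves a weighted least squares problem in beta and is
    # algebraically the same update as the Newton-Raphson step sketched earlier.
    n, p = X.shape
    beta = np.zeros(p)
    for _ in range(n_iter):
        eta = X @ beta
        mu = 1.0 / (1.0 + np.exp(-eta))
        mu = np.clip(mu, 1e-10, 1 - 1e-10)  # keep the weights strictly positive
        W = mu * (1.0 - mu)                 # weights = variance function
        z = eta + (y - mu) / W              # "working response"
        XtW = X.T * W                       # X' diag(W)
        # Weighted least squares: beta = (X' W X)^{-1} X' W z
        beta = np.linalg.solve(XtW @ X, XtW @ z)
    return beta
```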