https://www.youtube.com/watch?v=X-ix97pw0xY&list=PLUl4u3cNGP60uVBMaoNERc6knT_MgPKS0&index=19 Transcript 00:00 00:00 The following content is provided under a Creative 00:02 Commons license. 00:03 Your support will help MIT OpenCourseWare 00:06 continue to offer high quality educational resources for free. 00:10 To make a donation or to view additional materials 00:12 from hundreds of MIT courses, visit MIT OpenCourseWare 00:16 at ocw.mit.edu. 00:19 PHILIPPE RIGOLLET: This chapter is a natural capstone 00:22 chapter for this entire course. 00:24 We'll see some of the things we've 00:26 seen during maximum likelihood and some of the things 00:29 we've seen during linear regression, some of the things 00:34 we've seen in terms of the basic modeling that we've had before. 00:37 We're not going to go back much to inference questions. 00:39 It's really going to be about modeling. 00:41 And in a way, generalized linear models, as the word says, 00:44 are just a generalization of linear models. 00:47 And they're actually extremely useful. 00:49 They're often forgotten about and people just 00:51 jump onto machine learning and sophisticated techniques. 00:54 But those things do the job quite well. 00:57 So let's see in what sense they are a generalization 00:59 of the linear models. 01:02 So remember, the linear model looked like this. 01:05 We said that y was equal to x transpose beta plus epsilon, 01:13 right? 01:13 That was our linear regression model. 01:15 And it's-- another way to say this is that if-- 01:19 and let's assume that those were, say, 01:20 Gaussian with mean 0 and identity covariance matrix. 01:25 Then another way to say this is that 01:26 the conditional distribution of y given x is equal to-- 01:32 sorry, is a Gaussian with mean x transpose beta and variance-- 01:39 well, we had a sigma squared, which I will forget as usual-- 01:43 x transpose beta and then sigma squared. 01:46 OK, so here, we just assumed that-- so what is regression? 01:50 It's just saying I'm trying to explain y as a function of x. 01:54 Given x, I'm assuming a distribution for the y. 01:57 And this x is just going to be here 01:59 to help me model what the mean of this Gaussian is, right? 02:05 I mean, I could have something crazy. 02:07 I could have something that looks like y given 02:13 x is normal with mean x transpose beta. 02:17 And then this covariance could be some other thing 02:19 which looks like, I don't know, some x transpose 02:22 gamma squared times, I don't know, 02:26 x, x transpose plus identity-- 02:30 some crazy thing that depends on x here, right? 02:33 And we deliberately assumed that all the thing that depends on x 02:37 shows up in the mean, OK? 02:39 And so what I have here is that y 02:42 given x is a Gaussian with a mean that 02:45 depends on x and covariance matrix sigma square identity. 02:51 Now the linear model assumed a very specific form 02:54 for the mean. 02:55 It said I want the mean to be equal to x 02:59 transpose beta which, remember, was 03:01 the sum from, say, j equals 1 to p of beta j xj, right? 03:10 where the xj's are the coordinates of x. 03:13 But I could do something also more complicated, right? 03:16 I could have something that looks like, instead, 03:19 replace this by, I don't know, sum of beta j log of x to the j 03:28 divided by x to the j squared or something like this, right? 03:34 I could do this as well. 03:37 So there's two things that we have assumed. 03:39 The first one is that when I look 03:41 at the conditional distribution of y given x, 03:43 x affects only the mean.
03:45 I also assume that it was Gaussian 03:47 and that it affects only the mean. 03:48 And the mean is affected in a very specific way, 03:51 which is linear in x, right? 03:53 So this is essentially the things 03:56 we're going to try to relax. 03:58 So the first thing that we assume, 03:59 the fact that y was Gaussian and had only its mean [INAUDIBLE] 04:03 dependant no x is what's called the random component. 04:07 It just says that the response variables, you know, 04:09 it sort of makes sense to assume that they're Gaussian. 04:13 And everything was essentially captured, right? 04:17 So there's this property of Gaussians 04:18 that if you tell me-- if the variance is known, 04:22 all you need to tell me to understand 04:23 exactly what the distribution of a Gaussian is, 04:25 all you need to tell me is its expected value. 04:29 All right, so that's this mu of x. 04:31 And the second thing is that we have this link that says, 04:35 well, I need to find a way to use my x's to explain 04:38 this mu you and the link was exactly 04:40 mu of x was equal to x transpose beta. 04:42 04:45 Now we are talking about generalized linear models. 04:51 So this part here where mu of x is of the form-- the way 04:56 I want my beta, my x, to show up is linear, 05:00 this will never be a question. 05:03 In principle, I could add a third point, 05:06 which is just question this part, the fact that mu of x 05:10 is x transpose beta. 05:11 I could have some more complicated, nonlinear function 05:13 of x. 05:14 And then we'll never do that because we're talking 05:15 about generalized linear model. 05:17 The only thing with generalize are the random component, 05:20 the conditional distribution of y given x, 05:23 and the link that just says, well, once you actually tell me 05:26 that the only thing I need to figure out is the mean, 05:29 I'm just going to slap it exactly these x transpose beta 05:32 thing without any transformation of x transpose beta. 05:36 So those are the two things. 05:37 05:40 It will become clear what I mean. 05:42 This sounds like a tautology, but let's just 05:44 see how we could extend that. 05:46 So what we're going to do in generalized linear models-- 05:50 right, so when I talk about GLNs, 05:55 the first thing I'm going to do with my x 05:57 is turn it into some x transpose beta. 05:59 And that's just the l part, right? 06:02 I'm not going to be able to change. 06:03 That's the way it works. 06:05 I'm not going to do anything non-linear. 06:07 But the two things I'm going to change 06:09 is this random component, which is 06:16 that y, which used to be some Gaussian with mean mu of x 06:21 here in sigma squared-- 06:24 so y given x, sorry-- 06:26 this is going to become y given x follows some distribution. 06:35 And I'm not going to allow any distribution. 06:37 I want something that comes from the exponential family. 06:40 06:49 Who knows what the exponential family of distribution is? 06:52 This is not the same thing as the exponential distribution. 06:55 It's a family of distributions. 06:58 All right, so we'll see that. 07:00 It's-- wow. 07:01 07:04 What can that be? 07:06 Oh yeah, that's actually [INAUDIBLE].. 07:08 07:11 So-- I'm sorry? 07:17 AUDIENCE: [INAUDIBLE] 07:19 PHILIPPE RIGOLLET: I'm in presentation mode. 07:21 That should not happen. 07:23 OK, so hopefully, this is muted. 07:25 07:29 So essentially, this is going to be a family of distributions. 
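In symbols, the two modifications just described amount to the following, with g the function just introduced (so mu of x equals g inverse of x transpose beta):

\[ \text{Linear model:}\quad Y \mid X = x \;\sim\; \mathcal{N}\big(\mu(x),\,\sigma^2\big), \qquad \mu(x) = x^\top \beta \]
\[ \text{Generalized linear model:}\quad Y \mid X = x \;\sim\; \text{exponential family}, \qquad g\big(\mu(x)\big) = x^\top \beta, \qquad \mu(x) = \mathbb{E}[Y \mid X = x] \]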
07:32 And what makes them exponential typically 07:34 is that there's an exponential that 07:35 shows up in the definition of the density, all right? 07:39 We'll see that the Gaussian belongs 07:41 to the exponential family. 07:42 But they're slightly less expected ones 07:44 because there's this crazy thing that a to the x 07:48 is exponential x log a, which makes the potential show up 07:52 without being there. 07:53 So if there's an exponential of some power, 07:54 it's going to show up. 07:55 But it's more than that. 07:56 So we'll actually come to this particular family 07:58 of distribution. 07:59 Why this particular family? 08:00 Because in a way, everything we've 08:02 done for the linear model with Gaussian 08:04 is going to extend fairly naturally to this family. 08:08 All right, and it actually also, because it encompasses 08:11 pretty much everything, all the distributions 08:13 we've discussed before. 08:15 All right, so the second thing that I want to question-- 08:19 right, so before, we just said, well, 08:22 mu of x was directly equal to this thing. 08:28 08:31 Mu of x was directly x transpose beta. 08:34 So I knew I was going to have an x transpose beta 08:36 and I said, well, I could do something with this x transpose 08:39 beta before I used it to explain the expected value. 08:42 But I'm actually taking it like that. 08:44 Here, we're going to say, let's extend this to some function 08:52 is equal to this thing. 08:54 Now admittedly, this is not the most natural way 08:56 to think about it. 08:57 What you would probably feel more comfortable doing 08:59 is write something like mu of x is a function. 09:03 Let's call it f of x transpose beta. 09:08 But here, I decide to call f g inverse. 09:12 OK, let's just my g inverse. 09:14 Yes. 09:15 AUDIENCE: Is this different then just [INAUDIBLE] 09:18 PHILIPPE RIGOLLET: Yeah. 09:19 09:22 I mean, what transformation you want to put on your x's? 09:26 AUDIENCE: [INAUDIBLE] 09:35 PHILIPPE RIGOLLET: Oh no, certainly not, right? 09:37 I mean, if I give you-- if I force you to work with x1 plus 09:40 x2, you cannot work with any function of x1 plus any 09:44 function of x2, right? 09:46 So this is different. 09:48 09:51 All right, so-- yeah. 09:55 The transformation would be just the simple part 09:57 of your linear regression problem 09:59 where you would take your exes, transform them, 10:01 and then just apply another linear regression. 10:03 This is genuinely new. 10:04 10:07 Any other question? 10:08 10:11 All right, so this function g and the reason 10:13 why I sort of have to, like, stick to this slightly less 10:16 natural way of defining it is because that's 10:18 g that gets a name, not g inverse that gets a name. 10:21 And the name of g is the link function. 10:23 10:29 So if I want to give you a generalized linear model, 10:33 I need to give you two ingredients. 10:35 The first one is the random component, 10:37 which is the distribution of y given x. 10:40 And it can be anything in what's called the exponential family 10:44 of distributions. 10:45 So for example, I could say, y given 10:47 x is Gaussian with mean mu x sigma identity. 10:50 But I can also tell you y given x 10:53 is gamma with shared parameter equal to alpha of x, OK? 10:57 I could do some weird things like this. 11:00 And the second thing is I need to give you a link function. 11:03 And the link function is going to become very clear 11:08 how you pick a link function. 
11:09 And the only reason that you actually pick a link function 11:12 is because of compatibility. 11:15 This mu of x, I call it mu because mu of x 11:18 is always the conditional expectation of y given x, 11:21 always, which means that let's think 11:25 of y as being a Bernoulli random variable. 11:27 11:31 Where does mu of x live? 11:32 11:37 AUDIENCE: [INAUDIBLE] 11:38 PHILIPPE RIGOLLET: 0, 1, right? 11:39 That's the expectation of a Bernoulli. 11:40 It's just the probability that my coin flip gives me 1. 11:43 So it's a number between 0 and 1. 11:45 But this guy right here, if my x's are anything, right-- 11:49 think of any body measurements plus [INAUDIBLE] 11:52 linear combinations with arbitrarily large coefficients. 11:55 This thing can be any real number. 11:57 So the link function, what it's effectively going to do 12:01 is make those two things compatible. 12:03 It's going to take my number which, 12:04 for example, is constrained to be between 0 and 1 12:07 and map it into the entire real line. 12:11 If I have mu which is forced to be positive, for example, 12:13 in an exponential distribution, the mean is positive, right? 12:16 That's the, say, don't know, inter-arrival time 12:20 for Poisson process. 12:22 This thing is known to be positive for an exponential. 12:25 I need to map something that's exponential 12:27 to the entire real line. 12:28 I need a function that takes something positive 12:30 and [INAUDIBLE] everywhere. 12:31 So we'll see. 12:32 By the end of this chapter, you will 12:34 have 100 ways of doing this, but there are some more traditional 12:36 ones [INAUDIBLE]. 12:38 So before we go any further, I gave you the example 12:41 of a Bernoulli random variable. 12:46 Let's see a few examples that actually fit there. 12:48 Yes. 12:49 12:51 AUDIENCE: Will it come up later [INAUDIBLE] already know 12:53 why do we need the transformer [INAUDIBLE] why 12:56 don't [INAUDIBLE] 12:59 PHILIPPE RIGOLLET: Well actually, this 13:01 will not come up later. 13:02 It should be very clear from here 13:04 because if I actually have a model, 13:06 I just want it to be plausible, right? 13:08 I mean, what happens if I suddenly decide that my-- 13:11 so this is what's going to happen. 13:12 You're going to have only data to fit this model. 13:14 Let's say you actually forget about this thing here. 13:17 You can always do this, right? 13:19 You can always say I'm going to pretend my y's just 13:23 happen to be the realizations of said Gaussians that 13:26 happen to be 0 or 1 only. 13:28 You can always, like, stuff that in some linear model, right? 13:32 You will have some least squares estimated for beta. 13:35 And it's going to be fine. 13:36 For all the points that you see, it 13:38 will definitely put some number that's 13:40 actually between 0 and 1. 13:42 So this is what your picture is going to look like. 13:44 You're going to have a bunch of values for x. 13:48 This is your y. 13:50 And for different-- so these are the values 13:51 of x that you will get. 13:53 And for a y, you will see either a 0 or a 1, right? 13:55 13:59 Right, that's what your Bernoulli dataset would look 14:02 like with a one dimensional x. 14:05 Now if you do least squares on this, you will find this. 14:09 And for this guy, this line certainly 14:11 takes values between 0 and 1. 14:14 But let's say now you get an x here. 14:16 You're going to actually start pretending 14:17 that the probability it spits out one conditionally in x 14:20 is like 1.2, and that's going to be weird. 
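A minimal numerical version of the picture just described, using made-up 0/1 data and plain least squares (the numpy package is assumed): the fitted line produces values outside the interval [0, 1] as soon as you move away from the bulk of the data, so it cannot be read as a probability.

```python
import numpy as np

# Hypothetical 0/1 responses observed at a few values of a one-dimensional x.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([0, 0, 0, 1, 1, 1])

# Plain least squares: fit y ~ beta0 + beta1 * x, pretending the y's are Gaussian.
X = np.column_stack([np.ones_like(x), x])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)

print(X @ beta)                  # fitted values at the observed x's
print(beta[0] + beta[1] * 7.0)   # at a new x = 7, the fitted "probability" exceeds 1
```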
14:22 14:28 Any other questions? 14:31 All right, so let's start with some examples. 14:34 Right, I mean, you get so used to them through this course. 14:38 So the first one is-- 14:41 so all these things are taken. 14:42 So there's a few books on generalizing, 14:44 your models, generalize [INAUDIBLE] models. 14:45 And there's tons of applications that you can see. 14:48 Those are extremely versatile, and as soon 14:50 as you want to do modeling to explain some y given x, 14:53 you sort of need to do that if you want 14:55 to go beyond linear models. 14:58 So this was in the disease occurring rate. 15:00 So you have a disease epidemic and you 15:04 want to basically model the expected number 15:08 of new cases given-- 15:11 at a certain time, OK? 15:13 So you have time that progresses for each of your reservation. 15:16 Each of your reservation is a time stamp-- 15:18 say, I don't know, 20th day. 15:21 And your response is the number of new cases. 15:26 And you're going to actually put your model directly 15:28 on mu, right? 15:29 When I looked at this, everything here 15:31 was on mu itself, on the expected, right? 15:34 Mu of x is always the expected-- 15:36 15:39 the conditional expectation of y given x. 15:42 15:45 right? 15:45 So all I need to model is this expected value. 15:51 So this mu I'm going to actually say-- 15:54 so I look at some parameters, and it says, well, 15:57 it increases exponentially. 16:00 So I want to say I have some sort of exponential trend. 16:02 I can parametrize that in several ways. 16:04 And the two parameters I want to slap in 16:06 is, like, some sort of gamma, which is just the coefficient. 16:10 And then there's some rate delta that's in the exponential. 16:13 So if I tell you it's exponential, 16:15 that's a nice family of functions you 16:17 might want to think about, OK? 16:18 So here, mu of x, if I want to keep the notation, x 16:24 is gamma exponential delta x, right? 16:30 Except that here, my x are t1, t2, t3, et cetera. 16:34 And I want to find what the parameters gamma and delta are 16:37 because I want to be able to maybe compare 16:40 different epidemics and see if they have the same parameter 16:42 or maybe just do some prediction based on the data 16:46 that I have without-- to extrapolate in the future. 16:49 16:52 So here, clearly mu of x is not of the form 16:58 x transpose beta, right? 17:01 That's not x transpose beta at all. 17:04 And it's actually not even a function of x transpose data, 17:07 right? 17:08 There's two parameters, gamma and delta, 17:09 and it's not of the form. 17:11 So here we have x, which is 1 and x, right? 17:14 I have two parameters. 17:16 So what I do here is that I say, well, 17:17 first, let me transform mu in such a way 17:20 that I can hope to see something that's linear. 17:23 So if I transform mu, I'm going to have log of mu, which 17:26 is log of this thing, right? 17:28 So log of mu of x is equal, well, 17:33 to log of gamma plus log of exponential delta 17:36 x, which is delta x. 17:39 17:42 And now this thing is actually linear in x. 17:46 So I have that this guy is my first beta 1. 17:49 And so that's beta 1 finds 1. 17:50 And this guy is beta 2-- 17:53 times, sorry that said beta 0-- times 1, and this guy 17:55 is beta 1 times x. 17:58 OK, so that looks like a linear model. 18:00 I just have to change my parameters-- 18:02 my parameters beta 1 becomes the log of gamma and beta 2 18:05 becomes delta itself. 
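With the log link just derived, the model says that log of mu i is linear in the time stamp, with intercept log gamma and slope delta. As a sketch of how this would be fit in practice, on made-up counts and assuming the statsmodels package (the Poisson choice for the random component is the one made just below):

```python
import numpy as np
import statsmodels.api as sm

# Hypothetical epidemic data: day t_i and number of new cases y_i.
t = np.arange(1, 11)
y = np.array([2, 3, 5, 6, 10, 14, 22, 30, 45, 66])

X = sm.add_constant(t)            # columns [1, t], so the linear part is beta0 + beta1 * t
# Poisson family; its default link in statsmodels is the log, so log mu_i = beta0 + beta1 * t_i.
fit = sm.GLM(y, X, family=sm.families.Poisson()).fit()

beta0, beta1 = fit.params
print(np.exp(beta0), beta1)       # estimates of gamma = exp(beta0) and delta = beta1
```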
18:08 And the reason why we do this is because, well, the way 18:11 we put those gamma and those delta was just so that we 18:13 have some parametrization. 18:14 It just so happens that if we want this to be linear, 18:17 we need to just change the parametrization itself. 18:20 This is going to have some effects. 18:21 We know that it's going to have some effect 18:23 in the Fisher information. 18:24 It's going to have a bunch of effect to change those things. 18:27 But that's what needs to be done to have 18:29 a generalized linear model. 18:32 Now here, the function that I took 18:35 to turn it into something that's linear is simple. 18:37 It came directly from some natural thing I would do here, 18:41 which is taking the log. 18:42 And so the function g, the link that I take, 18:44 is called the log link very creatively. 18:47 And it's just the function that I 18:49 apply to mu so that I see something that's linear 18:52 and that looks like this. 18:53 18:59 So now this only tells me how to deal with the link function. 19:03 But I still have to deal with point 1, the random component. 19:06 And this, again, is just some modeling. 19:08 Given some data, some random data, 19:11 what distribution do you choose to explain the randomness? 19:14 And this-- I mean, unless there's no choice, 19:17 you know, it's just a matter of practice, right? 19:19 I mean, why would it be Gaussian and not, you know, 19:22 doubly exponential? 19:23 This is-- there's matters of convenience that 19:25 come into this, and there's just matter of experience 19:27 that come into this. 19:29 You know, I remember when you chat with engineers, 19:32 they have a very good notion of what 19:34 the distribution should be. 19:35 They have Weibull distributions. 19:37 You know, they do optics and things like this. 19:39 So there's some distributions that just come up but sometimes 19:42 just have to work. 19:43 Now here what do we have? 19:45 The thing we're trying to measure, y-- 19:47 as we said, so mu is the expectation, 19:49 the conditional expectation, of y given x. 19:52 But y is the number of new cases, right? 19:56 Well it's a number of. 19:57 And the first thing you should think 19:59 of when you think about number of, 20:00 if it were bounded above, you would think binomial, maybe. 20:03 But here, it's just a number. 20:05 So you think Poisson. 20:06 That's how insurers think. 20:08 I have a number of, you know, claims per year. 20:13 This is a Poisson distribution. 20:15 And hopefully they can model the conditional distribution 20:18 of the number of claims given everything that they actually 20:20 ask you in the surveys that I hear 20:24 you now fill in 15 minutes. 20:26 All right, so now you have this Poisson distribution. 20:31 And that's just the modeling assumption. 20:33 There's no particular reason why you 20:34 should do this except that, you know, 20:36 that might be a good idea. 20:38 And the expected value of your Poisson 20:39 has to be this mu i, OK? 20:42 At time i. 20:46 Any question about this slide? 20:48 OK, so let's switch to another example. 20:51 Another example is the so-called prey capture rate. 20:54 So here, what you're interested in 20:58 is the rate of capture of prey yi for a given predator. 21:05 And you have xi, which is your explanatory variable. 21:10 And this is just the density of prey. 21:12 So you're trying to explain the rate of capture of prey given 21:17 the density of the prey, OK? 21:20 And so you need to find some sort of relationship 21:22 between the two.
21:23 And here again, you talk to experts 21:25 and what they tell you is that, well, it's 21:27 going to be increasing, right? 21:28 I mean, animals like predators are going to just eat more 21:32 if there's more preys. 21:34 But at some point, they're just going 21:35 to level off because they're going to be [INAUDIBLE] full 21:38 and they're going to stop capturing those prays. 21:42 And you're just going to have some phenomenon that 21:44 looks like this. 21:45 So here is a curve that sort of makes sense, right? 21:47 As your capture rate goes from 0 to 1, you're increasing, 21:52 and then you see you have this like [INAUDIBLE] function 21:54 that says, you know, at some point it levels up. 21:57 OK, so here, one way I could-- 21:59 I mean, there's again many ways I could just 22:01 model a function that looks like this. 22:03 But a simple one that has only two parameters 22:05 is this one, where mu i is this a function of xi where 22:09 I have some parameter alpha here and some parameter h here. 22:13 OK, so there's clearly-- 22:15 so this function, there's one that essentially tells you-- 22:21 so this thing starts at 0 for sure. 22:23 And essentially, alpha tells you how 22:25 sharp this thing is, and h tells you 22:28 at which points you end here. 22:30 Well, it's not exactly what those values are equal to, 22:32 but that tells you this. 22:35 OK, so, you know-- simple, and-- 22:41 well, no, OK. 22:41 Sorry, that's actually alpha, which is the maximum capture. 22:44 The rate and h represent the pre-density 22:46 at which the capture weight is. 22:47 So that's the half time. 22:49 OK, so there's actual value [INAUDIBLE].. 22:52 All right, so now I have this function. 22:54 It's certainly not a function. 22:56 There's no-- I don't see it as a function of x. 22:59 So I need to find something that looks like a function of x, OK? 23:06 So then here, there's no log. 23:08 There's no-- well, I could actually take a log here. 23:13 But I would have log of x and log of x plus h. 23:15 So that would be weird. 23:17 So what we propose to do here is to look, 23:19 rather than looking at mu i, we look 1 over mu i. 23:23 Right, and so since your function 23:24 was mu i, when you take 1 over mu i, 23:37 you get h plus xi divided by alpha xi, which 23:42 is h over alpha times one over xi plus 1 over alpha. 23:49 And now if I'm willing to make this transformation 23:52 of variables and say, actually, I don't-- 23:54 my x, whether it's the density of prey 23:57 or the inverse density of prey, it really doesn't matter. 24:00 I can always make this transformation 24:02 when the data comes. 24:03 Then I'm actually just going to think of this 24:06 as being some linear function beta 0 plus beta 1, 24:11 which is this guy, times 1 over xi. 24:17 And now my new variable becomes 1 over xi. 24:20 And now it's linear. 24:21 And the transformation I had to take 24:23 was this 1 over x, which is called the reciprocal link, OK? 24:34 You can probably guess what the exponential link is going to be 24:37 and things like this, all right? 24:38 So we'll talk about other links that have slightly less 24:41 obvious names. 24:43 Now again, modeling, right? 24:45 So this was the random component. 24:46 This was the easy part. 24:47 Now I need to just poor in some domain knowledge 24:50 about how do I think this function, this y, which 24:55 is which is the rate of capture of praise, 25:01 I want to understand how this thing is actually 25:05 changing what is the randomness of the thing around its mean. 
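A quick numerical check of the reciprocal-link algebra above, with made-up values of alpha and h (numpy assumed): the reciprocal of mu of x is a linear function of 1 over x, exactly as claimed.

```python
import numpy as np

alpha, h = 2.0, 5.0              # hypothetical maximum capture rate and half-saturation density
x = np.linspace(0.5, 20.0, 40)   # prey densities

mu = alpha * x / (h + x)         # the saturating mean curve from the slide

# Claimed identity: 1/mu = (h/alpha) * (1/x) + 1/alpha, i.e. linear in 1/x.
lhs = 1.0 / mu
rhs = (h / alpha) * (1.0 / x) + 1.0 / alpha
print(np.allclose(lhs, rhs))     # True: the reciprocal link makes the mean model linear
```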
25:09 And you know, something that-- so that 25:11 comes from this textbook. 25:12 The standard deviation of capture rate 25:14 might be approximately proportional to the mean rate. 25:16 You need to find a distribution that 25:18 actually has this property. 25:19 And it turns out that this happens 25:21 for gamma distributions, right? 25:23 In gamma distributions, just like, say, 25:26 for Poisson distribution, the-- 25:29 well, for Poisson, the variance and mean are of the same order. 25:32 Here it is the standard deviation that's 25:34 of the same order as the [INAUDIBLE] for gammas. 25:39 And it's a positive distribution as well. 25:42 So here is a candidate. 25:43 Now since we're sort of constrained 25:45 to work under the exponential family of distributions, 25:48 then you can just go through your list 25:50 and just decide which one works best for you. 25:52 25:55 All right, third example-- 25:56 so here we have binary response. 25:59 Here, essentially the binary response variable 26:01 indicates the presence or absence 26:02 of postoperative deformity, kyphosis, in children. 26:07 And here, rather than having one covariate which, before, 26:10 in the first example, was time, in the second example 26:12 was the density, here there are three variables 26:15 that you measure on children. 26:17 The first one is age of the child 26:19 and the second one is the number of vertebrae 26:21 involved in the operation. 26:23 And the third one is the start of the range, 26:25 right-- so where it is on the spine. 26:29 OK, so the response variable here is, you know, 26:35 did it work or not, right? 26:36 I mean, that's very simple. 26:37 And so here, it's nice because the random component 26:41 is the easiest one. 26:42 As I said, any random variable that takes only two outcomes 26:45 must be a Bernoulli, right? 26:49 So that's nice; there's no modeling going on here. 26:52 So you know that y given x is going to be Bernoulli, 26:54 but of course, all your efforts are 26:55 going to try to understand what the conditional mean 26:58 of your Bernoulli, what the conditional probability 27:00 of being 1 is going to be, OK? 27:02 And so in particular-- so I'm just-- here, 27:05 I'm spelling it out before we close those examples. 27:08 I cannot say that mu of x is x transpose beta for exactly this 27:12 picture that I drew for you here, right? 27:15 There's just no way here-- the goal 27:17 of doing this is certainly to be able to extrapolate 27:20 for yet unseen children whether this is something 27:23 that we should be doing. 27:24 And maybe the range of x is actually 27:27 going to be slightly out. 27:28 And so, OK, I don't want to see it have 27:30 a negative probability of outcome or a positive one-- 27:34 sorry, or one that's larger than one. 27:38 So I need to make this transformation. 27:40 So what I need to do is to transform mu, which 27:43 is, we know, only a number. 27:44 All we know is it's a number between 0 and 1. 27:46 And we need to transform it in such a way 27:48 that it maps to the entire real line 27:50 or reciprocally to say that-- 27:57 or inversely, I should say-- 27:58 that f of x transpose beta should 28:00 be a number between 0 and 1. 28:02 I need to find a function that takes any real number 28:05 and maps it into 0 and 1. 28:06 And we'll see that again, but you 28:10 have an army of functions that do that for you. 28:12 What are those functions? 28:13 28:16 AUDIENCE: [INAUDIBLE] 28:17 PHILIPPE RIGOLLET: I'm sorry? 28:19 AUDIENCE: [INAUDIBLE] 28:20 PHILIPPE RIGOLLET: Trait?
28:21 AUDIENCE: [INAUDIBLE] 28:22 PHILIPPE RIGOLLET: Oh. 28:23 AUDIENCE: [INAUDIBLE] 28:25 PHILIPPE RIGOLLET: Yeah, I want them to be invertible, right? 28:28 AUDIENCE: [INAUDIBLE] 28:34 PHILIPPE RIGOLLET: I have an army of function. 28:35 I'm not asking for one soldier in this army. 28:39 I want the name of this army. 28:41 AUDIENCE: [INAUDIBLE] 28:44 PHILIPPE RIGOLLET: Well, they're not really invertible either, 28:46 right? 28:48 So they're actually in [INAUDIBLE] textbook. 28:53 Because remember, statisticians don't 28:55 know how to integrate functions, but they 28:56 know how to turn a function into a Gaussian integral. 28:59 So we know it integrates to 1 and things like this. 29:01 Same thing here-- we don't know how 29:03 to build functions that are invertible and map 29:06 the entire real line to 0, 1, but there's 29:08 all the cumulative distribution functions that do that for us. 29:11 So I can you any of those guys, and that's 29:13 what I'm going to be doing, actually. 29:16 All right, so just to recap what I just 29:19 said as we were speaking, so normal linear model is not 29:23 appropriate for these examples if only because the response 29:30 variable is not necessarily Gaussian 29:34 and also because the linear model has to be-- 29:37 the mean has to be transformed before I can actually 29:39 apply a linear model for all these plausible nonlinear 29:42 models that I actually came up with. 29:44 OK, so the family we're going to go for 29:48 is the exponential family of distributions. 29:50 And we're going to be able to show-- 29:54 so one of the nice part of this is 29:56 to actually compute maximum likelihood 29:58 estimaters for those right? 29:59 In the linear model, maximum-- like, in the Gauss 30:02 linear model, maximum likelihood was as nice as it gets, right? 30:05 This actually was the least squares estimator. 30:08 We had a close form. 30:10 x transpose x inverse x transpose y, 30:12 and that was it, OK? 30:14 We had to just take one derivative. 30:15 Here, we're going to have a generally concave likelihood. 30:19 We're not going to be able to actually 30:21 solve this thing directly in close form 30:23 unless it's Gaussian, but we will have-- 30:26 we'll see actually how this is not just 30:30 a black box optimization of a concave function. 30:32 We have a lot of properties of this concave function, 30:35 and we will be able to show some iterative algorithms. 30:38 We'll basically see how, when you opened the box of convex 30:42 optimization, you will actually be able to see how things work 30:46 and actually implement it using least squares. 30:49 So each iteration of this iterative algorithm 30:51 will essentially be a least squares, 30:52 and that's actually quite [INAUDIBLE].. 30:54 So, very demonstrative of statisticians 30:56 being pretty ingenious so that they 30:59 don't have to call in some statistical software 31:01 but just can repeatedly call their least squares 31:06 Oracle within a statistical software. 31:09 OK, so what is the exponential family, right? 31:12 I promised to do the exponential family. 31:14 Before we go into this, let me just 31:17 tell you something about exponential families, 31:19 and what's the only thing to differentiate 31:22 an exponential family from all possible distributions? 31:25 An exponential family has two parameters, right? 31:28 And those are not really parameters, 31:30 but there's this theta parameter of my distribution, OK? 31:33 So it's going to be indexed by some parameter. 
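The remark about cumulative distribution functions is easy to check numerically: any continuous, strictly increasing CDF maps the whole real line into the interval (0, 1) and can be inverted, so it is a candidate for g inverse when the response is Bernoulli. A small sketch with the standard Gaussian CDF (the scipy package is assumed):

```python
import numpy as np
from scipy.stats import norm

# Values of x^T beta anywhere on the real line ...
linear_part = np.array([-8.0, -2.0, 0.0, 2.0, 8.0])

# ... are mapped into (0, 1) by a CDF (here the standard Gaussian CDF),
# so the result can be read as a conditional probability, i.e. a Bernoulli mean.
mu = norm.cdf(linear_part)
print(mu)

# The link g is then the inverse of the CDF (here the Gaussian quantile function).
print(norm.ppf(mu))   # recovers the original values of x^T beta
```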
31:35 Here, I'm only talking about the distribution 31:37 of, say, some random variable or some random vector, OK? 31:40 So here in this slide, you see that the parameter theta that 31:44 indexed those distribution is k dimensional 31:48 and the space of the x's that I'm looking at-- so 31:53 that should really be y, right? 31:55 What I'm going to plug in here is 31:57 the conditional distribution of y given x and theta is 31:59 going to depend on x. 32:00 But this really is the y. 32:02 That's their distribution of the response variable. 32:04 And so this is on q, right? 32:06 So I'm going to assume that y takes-- 32:09 q dimensional-- is q dimensional. 32:12 Clearly soon, q is going to be equal to 1, 32:14 but I can define those things generally. 32:16 OK, so I have this. 32:17 I have to tell you what this looks like. 32:19 And let's assume that this is a probability density function. 32:23 So this, right this notation, the fact that I just 32:26 put my theta in subscript, is just 32:28 for me to remember that this is the variable that 32:31 indicates the random variable, and this is just the parameter. 32:34 But I could just write it as a function of theta and x, right? 32:37 This is just going to be-- right, if you were in calc, 32:39 in multivariable calc, you would have 32:41 two parameter of theta and x and you would 32:43 need to give me a function. 32:45 Now think of all-- 32:46 think of x and theta as being one dimensional at this point. 32:50 Think of all the functions that can 32:51 be depending on theta and x. 32:54 There's many of them. 32:56 And in particular, there's many ways theta and x can interact. 33:01 What the exponential family does for you 33:03 is that it restricts the way these things 33:05 can actually interact with each other. 33:07 It's essentially saying the following. 33:09 It's saying this is going to be of the form exponential-- 33:15 so this exponential is really not much because I 33:18 could put a log next to it. 33:20 But what I want is that the way theta and x 33:24 interact has to be of the form theta times x 33:30 in an exponential, OK? 33:32 So that's the simplest-- that's one 33:34 of the ways you can think of them interacting is you just 33:36 the product of the two. 33:37 Now clearly, this is not a very rich family. 33:40 So what I'm allowing myself is to just slap 33:43 on some terms that depend only on theta and depend only on x. 33:46 So let's just call this thing, I don't know, f of x, g of theta. 33:52 OK, so here, I've restricted the way theta and x can interact. 33:56 So I have something that depends only 33:58 on x, something that depends only on theta. 33:59 And here, I have this very specific interaction. 34:02 And that's all that exponential families are doing for you, OK? 34:06 So if we go back to this slide, this is much more general, 34:09 right? if I want to go from theta and x in r to theta 34:14 and x theta in r-- 34:16 34:19 to theta in r k and x in rq, I cannot take the product 34:26 of theta and x. 34:27 I cannot even take the inner product between theta and x 34:29 because they're not even of compatible dimensions. 34:32 But what I can do is to first map my theta into something 34:37 and map my x into something so that I actually end up 34:40 having the same dimensions. 34:42 And then I can take the inner product. 34:43 That's the natural generalization 34:44 of this simple product. 
34:45 34:59 OK, so what I have is-- 35:03 right, so if I want to go from theta 35:05 to x, when I'm going to first do is I'm going to take theta, 35:10 eta of theta-- 35:11 so let's say eta1 of theta to eta k of theta. 35:16 35:20 And then I'm going to actually take 35:22 x becomes t1 of x all the way to tk of x. 35:29 And what I'm going to do is take the inner product-- 35:32 so let's call this eta and let's call this t. 35:35 And I'm going to take the inner product of eta and t, which 35:39 is just the sum from j equal 1 to k of eta j of theta times 35:49 tj of x. 35:52 OK, so that's just a way to say I want this simple interaction 35:57 but in higher dimension. 35:58 The simplest way I can actually make those things happen 36:00 is just by taking inner product. 36:02 36:05 OK, and so now what it's telling me 36:07 is that the distribution-- so I want the exponential times 36:09 something that depends only on theta and something that 36:11 depends only on x. 36:12 And so what it tells me is that when 36:14 I'm going to take p of theta x, it's 36:16 just going to be something which is exponential 36:19 times the sum from j equal 1 to k of eta j theta tj of x. 36:30 And then I'm going to have a function that depends only-- 36:32 so let me read it for now like c of theta and then 36:36 a function that depends only on x. 36:37 Let me call it h of x. 36:39 And for convenience, there's no particular reason 36:42 why I do that. 36:43 I'm taking this function c of theta 36:45 and I'm just actually pushing it in there. 36:47 So I can write c of theta as exponential minus log of 1 36:57 over c of theta, right? 36:58 37:01 And now I have exponential times exponential. 37:03 So I push it in, and this thing actually 37:04 looks like exponential sum from j equal 1 to k of eta 37:10 j theta tj of x minus log 1 over c of theta times h of x. 37:22 And this thing here, log 1 over c of theta, I call actually 37:26 b of theta Because c, I called it c. 37:32 But I can actually directly call this guy b, 37:35 and I don't actually care about c itself. 37:38 Now why don't I put back also h of x in there? 37:43 Because h of x is really here to just-- 37:48 how to put it-- 37:50 37:54 OK, h of x and b of theta don't play the same role. 38:00 B of theta in many ways is a normalizing constant, right? 38:03 I want this density to integrate to 1. 38:06 If I did not have this guy, I'm not 38:09 guaranteed that this thing integrates to 1. 38:11 But by tweaking this function b of theta or c of theta-- 38:14 they're equivalent-- 38:16 I can actually ensure that this thing integrates to 1. 38:18 So b of theta is just a normalizing constant. 38:22 H of x is something that's going to be funny for us. 38:25 It's going to be something that allows 38:26 us to be able to treat both discrete and continuous 38:29 variables within the framework of exponential families. 38:38 So for those that are familiar with this, 38:40 this is essentially saying that that h of x 38:41 is really just a change of measure. 38:44 When I actually look at the density of p of theta-- 38:48 this is with respect to some measure-- 38:50 the fact that I just multiplied by a function of x just 38:52 means that I'm not looking-- 38:53 that this guy here without h of theta 38:56 is not the density with respect to the original measure, 38:59 but it's the density with respect to the distribution 39:01 that has h as a density. 39:04 That's all I'm saying, right? 39:05 So I can first transform my x's and then take the density 39:08 with respect to that. 
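Collecting the pieces from the board, the k-parameter exponential family consists of densities (or PMFs) of the form

\[ p_\theta(x) \;=\; h(x)\,\exp\!\Big(\sum_{j=1}^{k} \eta_j(\theta)\, T_j(x) \;-\; B(\theta)\Big), \qquad B(\theta) = \log \frac{1}{c(\theta)}, \]

and in the simplest case, with theta and x one-dimensional, eta of theta equal to theta and T of x equal to x, this is just h(x) exp(theta x minus B(theta)), with B chosen so that the density integrates (or the PMF sums) to one.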
39:10 If you don't want to think about densities or measures, 39:12 you don't have to. 39:13 This is just the way-- 39:14 this is just the definition. 39:16 Is there any question about this definition? 39:19 All right, so it looks complicated, 39:21 but it's actually essentially the simplest 39:23 way you could think about it. 39:25 You want to be able to have x and theta interact 39:29 and you just say, I want the interaction 39:30 to be of the form exponential x times theta. 39:34 And if they're higher dimensions, 39:35 I'm going to take the exponential 39:36 of the function of x inner product 39:38 with a function of theta. 39:39 39:43 All right, so I claimed since the beginning 39:45 that the Gaussian was such an example. 39:47 So let's just do it. 39:48 So is the Gaussian of the-- is the interaction between theta 39:51 and x in a Gaussian of the form in the product? 39:55 And the answer is yes. 39:58 Actually, whether I know or not what the variance is, OK? 40:03 So let's start for the case where I actually do not 40:06 know what the variance is. 40:07 So here, I have x is n mu sigma squared. 40:13 This is all one dimensional. 40:14 And here, I'm going to assume that my parameter is both mu 40:17 and sigma square. 40:19 OK, so what I need to do is to have some function of mu, 40:22 some function of stigma square, and take an inner product 40:24 of some function of x and some other function of x. 40:26 So I want to show that-- 40:29 so p theta of x is what? 40:32 Well, it's one over square root sigma 2 pi 40:36 exponential minus x minus mu squared over 2 sigma squared, 40:42 right? 40:44 So that's just my Gaussian density. 40:45 And I want to say that this thing here-- so 40:49 clearly, the exponential shows up already. 40:51 I want to show that this is something that looks 40:53 like, you know, eta 1 of-- 41:01 sorry, so that was-- yeah, eta 1 of, say, mu sigma squared. 41:08 So I have only two of those guys, 41:09 so I'm going to need only two etas, right? 41:11 So I want it to be eta 1 of mu and sigma times t1 41:16 of x plus eta 2 mu 1 mu sigma squared times t2 of x, right? 41:22 So I want to have something like that that shows up, 41:26 and the only things that are left, 41:27 I want them to depend either only on theta or only on x. 41:32 So to find that out, we just need to expand. 41:37 OK, so I'm going to first put everything into my exponential 41:42 and expand this guy. 41:43 So the first term here is going to be minus x 41:46 squared over 2 sigma square. 41:47 The second term is going to be minus mu 41:49 squared over two sigma squared. 41:51 And then the cross term is going to be plus x mu divided 41:55 by sigma squared. 41:57 And then I'm going to put this guy here. 41:58 So I have a minus log sigma over 2 pi, OK? 42:05 42:09 OK, is this-- so this term here contains an interaction 42:13 between X and the parameters. 42:15 This term here contains an interaction 42:17 between X and the parameters. 42:18 So let me try to write them in a way that I want. 42:21 This guy only depends on the parameters, 42:22 this guy only depends on the parameter. 42:25 So I'm going to rearrange things. 42:28 And so I claim that this is of the form x squared. 42:34 Well, let's say-- do-- 42:36 42:43 who's getting the minus? 42:44 Eta, OK. 42:46 So it's x squared times minus 1 over 2 sigma 42:52 squared plus x times mu over sigma squared, right? 42:58 So that's this term here. 42:59 That's this term here. 43:01 Now I need to get this guy here, and that's minus. 
43:04 So I'm going to write it like this-- minus, 43:05 and now I have mu squared over 2 sigma 43:09 squared plus log sigma square root 2 pi. 43:15 43:22 And now this thing is definitely of the form t of x times-- 43:31 did I call them the right way or not? 43:34 Of course not. 43:36 OK, so that's going to be t2 of x times eta 43:39 2 of x eta 2 of theta. 43:41 This guy is going to be t1 of x times eta 1 of theta. 43:48 All right, so just a function of theta times a function of x-- 43:50 just a function of theta times a function of x. 43:52 And the way combined is just by sending them. 43:55 And this is going to be my d of theta. 43:58 44:01 What is h of x? 44:04 AUDIENCE: 1. 44:06 PHILIPPE RIGOLLET: 1. 44:07 There's one thing I can actually play with, 44:09 and this is something you're going to have some three 44:13 choices, right? 44:14 This is not actually completely determined here is that-- 44:19 for example, so when I write the log sigma square root 2 pi, 44:27 this is just log of sigma plus log square root 2 pi. 44:32 So I have two choices here. 44:34 Either my b becomes this guy, or-- 44:37 so either I have b of theta, which 44:41 is mu squared over 2 sigma squared plus log sigma 44:45 square root 2 pi and h of x is equal to 1, or I have 44:51 that b of theta is mu square over 2 sigma squared 44:56 plus log sigma. 44:58 And h of x is equal to what? 44:59 45:08 Well, I can just push this guy out, right? 45:10 I can push it out of the exponential. 45:12 And so it's just square root of 2 pi, which is 45:15 a function of x, technically. 45:16 I mean, it's a constant function of x, but it's a function. 45:19 So you can see that it's not completely clear 45:22 how you're going to do the trade off, right? 45:25 So the constant terms can go either in b or in h. 45:28 But you know, why bother with tracking down b and h when 45:33 you can actually stuff everything into one 45:35 and just call h one and call it a day? 45:38 Right, so you can just forget about h. 45:40 You know it's one and think about the right. 45:43 H won't matter actually for estimation purposes or anything 45:46 like this. 45:48 All right, so that's basically everything that's written. 45:50 When stigma square is known, what's 45:55 happening is that this guy here is no longer 46:00 a function of theta, right? 46:03 Agreed? 46:05 This is no longer a parameter. 46:06 When sigma square is known, then theta is equal to mu only. 46:14 There's no sigma square going on. 46:17 So this-- everything depends on sigma square 46:19 can be thought of as a constant. 46:20 Think one. 46:23 So in particular, this term here does not 46:26 belong in the interaction between x and theta. 46:30 It belongs to h, right? 46:37 So if sigma is known, then this guy is only a function of h-- 46:49 of x. 46:50 So h of x becomes exponential x squared minus x squared 47:01 over 2 sigma squared, right? 47:05 That's just a function of x. 47:06 47:11 Is that clear? 47:11 47:16 So if you complete this computation, what you're 47:18 going to get is that your new one parameter thing is that p 47:28 theta x is not equal to exponential x times mu 47:35 over sigma squared minus-- 47:39 well, it's still the same thing. 47:40 47:49 And then you have your h of x that comes out-- 47:51 47:54 x squared over 2 sigma squared. 47:58 OK, so that's my h of x. 48:02 That's still my b of theta. 48:05 And this is my t1 of x. 48:11 And this is my eta one of theta. 48:15 And remember, theta is just equal to mu in this case. 
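For reference, the identification just carried out on the board can be written in one line. With both parameters unknown, so theta = (mu, sigma squared),

\[ p_\theta(x) = \exp\!\Big( \underbrace{-\tfrac{1}{2\sigma^2}}_{\eta_2(\theta)} \underbrace{x^2}_{T_2(x)} + \underbrace{\tfrac{\mu}{\sigma^2}}_{\eta_1(\theta)} \underbrace{x}_{T_1(x)} - \underbrace{\big(\tfrac{\mu^2}{2\sigma^2} + \log(\sigma\sqrt{2\pi})\big)}_{B(\theta)} \Big), \qquad h(x) = 1, \]

and with sigma squared known, so theta = mu,

\[ p_\theta(x) = \exp\!\Big( \tfrac{\mu}{\sigma^2}\,x - \tfrac{\mu^2}{2\sigma^2} \Big)\, \underbrace{\tfrac{1}{\sigma\sqrt{2\pi}}\, e^{-x^2/(2\sigma^2)}}_{h(x)}, \]

so that in the known-variance case h is itself the density of a centered Gaussian with variance sigma squared, which is exactly the change-of-measure remark made earlier.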
48:18 48:22 So if I ask you to prove that this distribution belongs 48:26 to an exponential family, you just have to work it out. 48:29 Typically, it's expanding what's in the exponential and seeing 48:32 what's there-- 48:33 and just writing it in this form and identifying 48:35 all the components, right? 48:36 So here, notice those guys don't even get an index anymore 48:39 because there's just one of them. 48:40 So I wrote eta 1 and t1, but it's really just eta and t. 48:45 48:50 Oh sorry, this guy also goes. 48:54 This is also a constant, right? 48:56 So I can actually just put in the 1 divided 49:01 by sigma square root 2 pi. 49:03 So h of x is what, actually? 49:04 49:08 Is it the density of-- 49:12 AUDIENCE: Standard [INAUDIBLE]. 49:13 PHILIPPE RIGOLLET: It's not standard. 49:14 It's centered. 49:15 It has mean 0. 49:16 But its variance is sigma squared, right? 49:18 But it's the density of a Gaussian. 49:21 And this is what I meant when I said 49:23 h of x is really just telling you with respect to which 49:27 distribution, which measure, you're taking the density. 49:30 And so this thing here is really telling you 49:33 the density of my Gaussian with mean mu 49:37 is equal to-- is this with respect to a centered Gaussian-- 49:41 is this guy, right? 49:43 That's what it means. 49:44 If this thing ends up being a density, 49:46 it just means that now you just have a new measure, which 49:49 is this density. 49:51 So it's just saying that the density 49:53 of the Gaussian with mean mu with respect 49:57 to the Gaussian with mean 0 is just this [INAUDIBLE] here. 50:00 50:05 All right, so let's move on. 50:07 So here, as I said, you could actually 50:11 do all these computations and forget about the fact 50:13 that x is continuous. 50:16 You can actually do it with PMFs and do it for x discrete. 50:20 This actually also tells you that if you can actually 50:23 get the same form for your PMF, which 50:26 is of the form exponential-- where the interaction between theta 50:29 and x is just taking this product-- 50:32 times a function only of theta and a function only of x, 50:34 then for the PMF, 50:36 it also works. 50:40 OK, so I claim that the Bernoulli 50:42 belongs to this family. 50:44 So the PMF of a Bernoulli-- 50:49 we say parameter p is p to the x, 1 minus p to the 1 minus x, 50:54 right? 50:55 Because, we know, that's only for x equals 0 or 1. 51:00 And the reason is because when x is equal to 0, 51:03 this is 1 minus p. 51:04 When x is equal to 1, this is p. 51:06 OK, we've seen that when we're looking 51:08 at likelihoods for Bernoullis. 51:11 OK, it's not clear this is going to look like this at all. 51:16 But let's do it. 51:19 OK, so what does this thing look like? 51:21 Well, the first thing I want to do 51:23 is to make an exponential show up. 51:24 So what I'm going to write is I'm 51:26 going to write p to the x as exponential x log p, right? 51:31 51:33 And so I'm going to do that for the other one. 51:35 So this thing here-- 51:37 so I'm going to get exponential x log 51:43 p plus 1 minus x log 1 minus p. 51:47 51:51 So what I need to do is to collect my terms in x 51:54 and my terms in whatever parameters I have, 51:56 and here theta is equal to p. 51:59 52:03 So if I do this, what I end up having 52:05 is equal to exponential-- 52:08 so the term in x is log p minus log 1 minus p. 52:12 So that's x times log p over 1 minus p. 52:18 And then the term that's left-- 52:20 the term that stays-- is just 1 times log 1 minus p.
52:23 But I want to see this as a minus something, right? 52:25 It was minus b of theta. 52:27 So I'm going to write it as minus-- 52:28 52:32 well, I can just keep the plus, and I'm going to do-- 52:35 52:41 and that's all [INAUDIBLE]. 52:44 A-ha! 52:46 Well, this is of the form exponential-- 52:48 something that depends only on x times something that depends 52:50 only on theta-- 52:52 minus a function that depends only on theta. 52:56 And then h of x is equal to 1 again. 52:59 OK, so let's see. 53:00 So I have t1 of x is equal to x. 53:03 That's this guy. 53:04 Eta 1 of theta is equal to log p1 minus p. 53:11 And b of theta is equal to log 1 over 1 minus p, OK? 53:20 And h of x is equal to 1, all right? 53:26 53:29 You guys want to do Poisson, or do you 53:31 want to have any homework? 53:32 53:35 It's a dilemma because that's an easy homework versus 53:37 no homework at all but maybe something more difficult. OK, 53:41 who wants to do it now? 53:43 Who does not want to raise their hand now? 53:46 Who wants to raise their hand now? 53:47 All right, so let's move on. 53:57 I'll just do-- do you want to do the gammas instead 53:59 in the homework? 54:00 That's going to be fun. 54:02 I'm not even going to propose to do the gammas. 54:04 And so this is the gamma distribution. 54:08 It's brilliantly called gamma because it 54:10 has the gamma function just like the beta distribution had 54:14 the beta function in there. 54:16 They look very similar. 54:17 One is defined over r plus, the positive real line. 54:20 And remember, the beta was defined over the interval 0, 1. 54:24 And it's of the form x to some power times exponential 54:28 of minus x to some-- 54:30 times something, right? 54:32 So there's a function of polynomial [INAUDIBLE] 54:34 x where the exponent depends on the parameter. 54:38 And then there's the exponential minus x times something depends 54:40 on the parameters. 54:41 So this is going to also look like some function of x-- 54:47 sorry, like some exponential distribution. 54:49 Can somebody guess what is going to be t2 of x? 54:52 54:58 Oh, those are the functions of x that show up in this product, 55:01 right? 55:01 Remember when we have this-- 55:03 we just need to take some transformations 55:05 of x so it looks linear in those things and not in x itself. 55:08 Remember, we had x squared and x, for example, 55:11 in the Gaussian case. 55:12 I don't know if it's still there. 55:14 Yeah, it's still there, right? 55:15 t2 was x squared. 55:17 What do you think x is going-- t2 of x here. 55:20 So here's a hint. t1 is going to be x. 55:23 AUDIENCE: [INAUDIBLE] 55:24 PHILIPPE RIGOLLET: Yeah, [INAUDIBLE],, 55:25 what is going to be t1? 55:26 Yeah, you can-- this one is taken. 55:27 This one is taken. 55:28 55:31 What? 55:32 Log x, right? 55:33 Because this x to the a minus 1, I'm 55:35 going to write that as exponential a minus 1 log x. 55:39 So basically, eta 1 is going to be a minus 1. 55:43 Eta 2 is going to be minus 1 over b-- 55:47 well, actually the opposite. 55:48 And then you're going to have-- 55:50 but this is actually not too complicated. 55:52 All right, then those parameters get names. 55:55 a is the shape parameter, b is the scale parameter. 55:58 It doesn't really matter. 56:00 You have other things that are called the inverse gamma 56:02 distribution, which has this form. 56:05 The difference is that the parameter alpha 56:09 shows negatively there and then the inverse Gaussian 56:14 distribution. 
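Carrying the gamma computation through in the same way, with shape a and scale b and writing x to the a minus 1 as exp of (a minus 1) log x as suggested above, gives

\[ f_{a,b}(x) \;=\; \frac{x^{a-1} e^{-x/b}}{\Gamma(a)\, b^{a}} \;=\; \exp\!\Big( (a-1)\,\log x \;-\; \tfrac{1}{b}\, x \;-\; \big(\log\Gamma(a) + a\log b\big) \Big), \qquad x > 0, \]

so the two statistics are log x and x, with coefficients a minus 1 and minus 1 over b respectively, the last parenthesis plays the role of B(theta), and h(x) = 1 for x positive.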
56:15 56:18 You know, just densities you can come up with 56:20 and they just happened to fall in this family. 56:23 And there's other ones that you can actually put in there 56:25 that we've seen before. 56:26 The chi-square is actually part of this family. 56:28 The beta distribution is part of this family. 56:30 The binomial distribution is part of this family. 56:32 Well, that's easy because the Bernoulli was. 56:35 The negative binomial, which is some stopping time-- 56:39 the first time you hit a certain number of successes 56:42 when you flip some Bernoulli coins. 56:46 So you can check for all of those, 56:47 and you will see that you can actually write them as part 56:50 of the exponential family. 56:51 So the main goal of this slide is 56:53 to convince you that this is actually 56:54 a pretty broad range of distributions 56:56 because it basically includes everything we've seen 57:00 but not anything there-- 57:03 sorry, plus more, OK? 57:06 Yeah. 57:07 AUDIENCE: Is there any example of a distribution 57:09 that comes up pretty often that's 57:10 not in the exponential family? 57:11 PHILIPPE RIGOLLET: Yeah, like uniform. 57:13 AUDIENCE: Oh, OK, so maybe a bit more complicated than 57:16 [INAUDIBLE]. 57:17 Anything Anything that has a support that 57:19 depends on the parameter is not going to fall-- 57:21 is not going to fit in there. 57:24 Right, and you can actually convince yourself 57:26 why anything that has the support that 57:31 does not-- that depends on the parameter 57:33 is not going to be part of this guy. 57:35 It's kind of a hard thing to-- 57:37 in fact, you proved that it's not and you prove this rule. 57:42 That's kind of a little difficult, 57:43 but the way you can convince yourself is that remember, 57:46 the only interaction between x and theta that I allowed 57:49 was taking the product of those guys 57:51 and then the exponential, right? 57:54 If you have something that depends on some parameter-- 57:56 let's say you're going to see something that looks like this. 57:59 Right, for uniform, it looks like this. 58:01 58:04 Well, this is not of the form exponential x times theta. 58:08 There's an interaction between x and theta here, 58:10 but it's actually certainly not of the form 58:12 x exponential x times theta. 58:14 So this is definitely not going to be 58:16 part of the exponential family. 58:18 And every time you start doing things like that, 58:20 it's just not going to happen. 58:21 58:25 Actually, to be fair, I'm not even sure 58:28 that all these guys, when you allow 58:30 them to have all their parameters free, 58:32 are actually going to be part of this. 58:34 For example-- the beta probably is, 58:36 but I'm not actually entirely convinced. 58:38 58:43 There's books on experiential families. 58:47 All right, so let's go back. 58:48 So here, we've put a lot of effort understanding 58:52 how big, how much wider than the Gaussian distribution 58:57 can we think of for the conditional distribution 59:01 of our response y given x. 59:04 So let's go back to the generalized linear models, 59:06 right? 59:07 So [INAUDIBLE] said, OK, the random component? 59:09 y has to be part of some exponential family 59:11 distribution-- check. 59:13 We know what this means. 59:14 So now I have to understand two things. 59:16 I have to understand what is the expectation, right? 59:20 Because that's actually what I model, right? 59:21 I take the expectation, the conditional expectation, 59:24 of y given x. 
59:24 So I need to understand given this guy, 59:27 it would be nice if you had some simple rules that would tell me 59:30 exactly what the expectation is rather than having to do it 59:32 over and over again, right? 59:34 If I told you, here's a Gaussian, 59:36 compute the expectation, every time 59:37 you had to use that would be slightly painful. 59:40 So hopefully, this thing being simple enough-- 59:43 we've actually selected a class that's 59:45 simple enough so that we can have rules. 59:47 Whereas as soon as they give you those parameters t1, t2, eta 1, 59:52 eta 2, b and h, you can actually have some simple rules 59:55 to compute the mean and variance and all those things. 60:00 And so in particular, I'm interested in the mean, 60:03 and I'm going to have to actually say, well, you know, 60:05 this mean has to be mapped into the whole real line. 60:09 So I can actually talk about modeling this function 60:12 of the mean as x transpose beta. 60:14 And we saw that for the [INAUDIBLE] dataset 60:17 or whatever other data sets. 60:21 You actually can-- you can actually do this using the log 60:24 of the reciprocal or for the-- 60:27 oh, actually, we didn't do it for the Bernoulli. 60:30 We'll come to this. 60:30 This is the most important one, and that's called 60:32 a logit it or a logistic link. 60:34 60:37 But before we go there, this was actually 60:39 a very broad family, right? 60:42 When I wrote this thing on the bottom board-- it's gone now, 60:44 but when I wrote it in the first place, 60:46 the only thing that I wrote is I wanted x times theta. 60:48 Wouldn't it be nice if you have some distribution that 60:51 was just x times theta, not some function of x 60:53 times some function of theta? 60:54 The functions seem to be here so that they actually 60:58 make things a little-- 61:02 so the functions were here so that I can actually 61:05 put a lot of functions there. 61:06 But first of all, if I actually decide 61:08 to re-parametrize my problem, I can always 61:10 assume-- if I'm one dimensional, I 61:12 can always assume that eta 1 of theta 61:14 becomes my new theta, right? 61:17 So this thing-- here for example, 61:20 I could say, well, this is actually 61:22 the parameter of my Bernoulli. 61:23 Let me call this guy theta, right? 61:25 I could do that. 61:28 Then I could say, well, here I have x that shows up here. 61:31 And here since I'm talking about the response, 61:33 I cannot really make any transformations. 61:35 So here, I'm going to actually talk about a specific family 61:38 for which this guy is not x square or square root of x 61:41 or log of x or anything I want. 61:43 I'm just going to actually look at distributions 61:45 for which this is x. 61:46 This exponential families are called 61:48 a canonical exponential family. 61:51 So in the canonical exponential family, what I have 61:55 is that I have my x times theta. 61:57 I'm going to allow myself some normalization factor phi, 61:59 and we'll see, for example, that it's 62:01 very convenient when I talk about the Gaussian, right? 62:05 Because even if I know-- 62:07 62:11 yeah, even if I know this guy, which I actually pull into my-- 62:15 oh, that's over here, right? 62:16 62:20 Right, I know sigma squared. 62:23 But I don't want to change my parameter 62:24 to be mu over sigma squared. 62:26 It's kind of painful. 62:27 So I just take mu, and I'm going to keep this guy 62:30 as being this phi over there. 62:31 And it's called the dispersion parameter 62:34 from a clear analogy with the Gaussian, right? 
62:38 That's the variance and that's measuring dispersion. 62:41 OK, so here, what I want is I'm going 62:45 to think throughout this class-- so phi may be known or not. 62:49 And depending-- when it's not known, 62:51 this actually might turn into some exponential family 62:54 or it might not. 62:55 And the main reason is because this b of theta over phi 63:01 is not necessarily a function of theta over phi, right? 63:04 If I actually have phi unknown, then theta over phi 63:09 has to be-- 63:10 this guy has to be my new parameter. 63:13 And b might not be a function of this new parameter. 63:17 OK, so in a way, it may or may not, 63:21 but this is not really a concern that we're going to have 63:24 because throughout this class, we're 63:26 going to assume that phi is known, OK? 63:29 Phi is going to be known all the time, which means that this is 63:31 always an exponential family. 63:34 And it's just the simplest one you 63:35 could think of-- one dimensional parameter, one 63:38 dimensional response, and I just have-- the product is just y 63:42 times-- or, we used to call it x. 63:45 Now I've switched to y, but y times theta divided by phi, OK? 63:49 63:52 Should I write this, or is this clear to everyone what this is? 63:56 Let me write it somewhere so we actually keep track of it 63:58 toward the [INAUDIBLE]. 64:00 64:05 OK, so this is-- 64:07 remember, we had all the distributions. 64:11 And then here we had the exponential family. 64:15 And now we have the canonical exponential family. 64:18 64:21 It's actually much, much smaller. 64:24 Well, actually, it's probably sort of a good picture. 64:26 And what I have is that my density or my PMF 64:32 is just exponential of y times theta minus b 64:37 of theta, divided by phi. 64:41 And I have plus c of-- 64:46 oh, yeah, plus c of y, phi, which 64:53 means that this is really-- if phi is known, h of y 64:58 is just exponential of c of y, phi, agreed? 65:05 Actually, this is the reason why it's not necessarily 65:07 an exponential family. 65:10 It might not be that this depends only on y. 65:12 It could depend on y and phi in some annoying way 65:15 and I may not be able to break it. 65:18 OK, but if phi is known, this is just a function 65:21 that depends only on y, agreed? 65:23 65:28 In particular, I think you need-- 65:29 I hope you can convince yourself that this is just 65:31 a subcase of everything we've seen before. 65:33 65:41 So for example, the Gaussian when the variance is known 65:44 is indeed of this form, right? 65:47 So we still have it on the board. 65:49 So here is my y, right? 65:51 So then let me write this as f theta of y. 65:53 So every x is replaceable with y, blah, blah, blah. 65:59 This is this guy. 66:01 And now what I have is that this is going to be my phi. 66:07 This is my parameter theta. 66:10 So I'm definitely of the form y times theta divided by phi. 66:14 And then here I have a function b 66:16 that depends only on theta, divided by phi again. 66:20 So b of theta is mu squared divided by 2. 66:27 66:31 OK, and then it's divided by this sigma squared. 66:33 And then I have this extra stuff. 66:35 But I really don't care what it is for now. 66:37 It's just something that depends only on y and known stuff. 66:42 So it was just a function of y just like my h. 66:44 I stuff everything in there.
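For reference, here is one way to write out the Gaussian computation being pointed at on the board, with sigma squared known; this is the standard algebra rather than a transcription of the slide.

\[ f_\mu(y) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\Big( -\frac{(y-\mu)^2}{2\sigma^2} \Big) = \exp\Big( \frac{y\mu - \mu^2/2}{\sigma^2} - \frac{y^2}{2\sigma^2} - \frac{1}{2}\log(2\pi\sigma^2) \Big), \]

so theta = mu, phi = sigma squared, b(theta) = theta squared over 2, and c(y, phi) = -y^2/(2 phi) - (1/2) log(2 pi phi) collects everything that depends only on y and known quantities.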
66:47 The b, though, this thing here, this 66:50 is actually what's important because 66:52 in the canonical family, if you think 66:53 about it, when you know phi-- 66:57 sorry-- right, this is just y times theta 67:03 scaled by a known constant-- sorry, y times 67:05 theta scaled by a known constant is the first term. 67:08 The second term is b of theta scaled by some known constant. 67:12 But b of theta is what's going to make 67:13 the difference between the Gaussian and Bernoullis 67:17 and gammas and betas-- 67:19 this is all in this b of theta. b of theta 67:21 contains everything that's idiosyncratic to 67:25 this particular distribution. 67:27 And so this is going to be important. 67:29 And we will see that b of theta is going to capture information 67:32 about the mean, about the variance, 67:34 about likelihood, about everything. 67:37 67:44 Should I go through this computation? 67:46 I mean, it's the same. 67:47 We've just done it, right? 67:48 So maybe it's better if you redo it on your own. 67:53 All right, so the canonical exponential family also 67:56 has other distributions, right? 67:58 So there's the Gaussian and there's the Poisson 68:00 and there's the Bernoulli. 68:02 But the other ones may not be part of this, right? 68:05 In particular, think about the gamma distribution. 68:07 We had this-- log x was one of the things that showed up. 68:13 I mean, I cannot get rid of this log x. 68:15 I mean, that's part of it except if a is equal to 1 68:18 and I know it for sure, right? 68:20 So if a is equal to 1, then I'm going to have a minus 1, 68:23 which is equal to 0. 68:25 So I'm going to have a minus 1 times log 68:27 x, which is going to be just 0. 68:28 So log x is going to vanish from here. 68:30 But if a is equal to 1, then this distribution 68:33 is actually much nicer, and it actually does not even 68:36 deserve the name gamma. 68:37 What is it if a is equal to 1? 68:38 68:42 It's an exponential, right? 68:43 Gamma of 1 is equal to 1. x to the a minus 1 is equal to 1. 68:47 So I have exponential of minus x over b, divided by b. 68:51 So 1 over b-- call it lambda. 68:53 And this is just an exponential distribution. 68:56 And so every time you're going to see something-- 68:58 so all these guys that don't make it to this table, 69:02 they could be part of those guys, but they're just-- 69:06 they're just-- 69:09 they just have another name in this thing. 69:10 All right, so you could compute the value of theta 69:13 for different values, right? 69:15 So again, you still have some continuous or discrete ones. 69:18 This is my b of theta. 69:19 And I said this is actually really what captures my distribution. 69:22 This b is actually called the cumulant generating function, 69:26 OK? 69:27 I don't have time. 69:28 I could write five slides to explain to you, 69:30 but it would just tell you why it's called 69:32 the cumulant generating function. 69:34 It's also known as the log of the moment generating function. 69:38 And the reason it's called the cumulant generating function 69:42 is because if I start taking successive derivatives 69:44 and evaluating them at 0, I get the successive cumulants 69:47 of this distribution, which are some transformation 69:50 of the moments. 69:51 AUDIENCE: What are you talking about again? 69:53 PHILIPPE RIGOLLET: The function b. 69:55 AUDIENCE: [INAUDIBLE] 69:55 PHILIPPE RIGOLLET: So this is just normalization. 69:57 So this is just to tell you I can compute this, 70:00 but I really don't care.
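For reference, here is a hedged reconstruction of the kind of table being described: the canonical parameter theta, the dispersion phi, and the cumulant generating function b(theta) for a few members. These are the standard expressions, not copied from the slide.

Gaussian (sigma^2 known): theta = mu, phi = sigma^2, b(theta) = theta^2 / 2
Poisson(lambda): theta = log(lambda), phi = 1, b(theta) = exp(theta)
Bernoulli(p): theta = log(p / (1 - p)), phi = 1, b(theta) = log(1 + exp(theta))
Exponential(lambda): theta = -lambda (so theta < 0), phi = 1, b(theta) = -log(-theta)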
70:01 And obviously I don't care about stuff that's complicated. 70:04 This is actually cute, and this is what captures everything. 70:07 And the rest is just like some general description. 70:09 It only tells you that the range of y 70:11 is 0 to infinity, right? 70:14 And that is essentially telling me 70:16 this is going to give me some hints as to which link function 70:19 I should be using, right? 70:20 Because the range of y tells me what 70:21 the range of the expectation of y is going to be. 70:23 All right, so here, it tells me that the range of y 70:25 is between 0 and 1. 70:28 OK, so what I want to show you is 70:30 that this captures a variety of different ranges 70:33 that you can have. 70:34 70:40 OK, so I'm going to want to go into the likelihood. 70:46 And the likelihood I'm actually going 70:48 to use to compute the expectations. 70:50 But since I actually don't have time 70:52 to do this now, let's just go quickly through this 70:55 and give you a spoiler alert to make sure that you all wake up 70:59 on Thursday and really, really want 71:01 to think about coming here immediately. 71:03 All right, so the thing I'm going to want to do, 71:05 as I said, is it would be nice if, at least 71:07 for this canonical family, when I give you b, 71:11 you would be able to say, oh, here 71:12 is a simple computation of b that would actually give me 71:16 the mean and the variance. 71:17 The mean and the variance are also known as moments. 71:20 b is called the cumulant generating function. 71:22 So it sounds like, moments being related 71:24 to cumulants, I might have a path to finding those, right? 71:28 And it might involve taking derivatives of b, as we'll see. 71:31 The way we're going to prove this 71:33 is by using this thing that we've used several times. 71:36 So this is a property we used when we were computing, 71:39 remember, the Fisher information, right? 71:41 We had two formulas for the Fisher information. 71:43 One was the expectation of the second derivative of the log 71:49 likelihood, and one was negative expectation of the square-- 71:53 sorry, one was the expectation of the square, and the other one 71:55 was negative the expectation of the second derivative, right? 71:57 The log likelihood is concave, so this number is negative, 72:00 this number is positive. 72:02 And the way we did this is by just permuting some derivative 72:04 and integral here. 72:06 And there was just-- we used the fact that something 72:08 that looked like this, right? 72:09 The log likelihood is log of f theta. 72:13 And when I take the derivative of this guy with respect 72:20 to theta, then I have something that 72:24 looks like the derivative divided by f theta. 72:30 And if I start taking the integral against f theta 72:34 of this thing, so the expectation of this thing, 72:39 those things would cancel. 72:42 And then I had just the integral of a derivative, which 72:45 I would make a leap of faith and say that it's actually 72:48 the derivative of the integral. 72:49 72:53 But this integral was equal to 1. 72:56 So this derivative was actually equal to 0. 72:58 And so that's how you got that the expectation 73:00 of the derivative of the log likelihood is equal to 0. 73:02 And you do it once again and you get this guy. 73:04 It's just some nice things that happen 73:06 with the [INAUDIBLE] taking derivative of the log. 73:08 We've done that, we'll do that again. 73:10 But once you do this, you can actually apply it. 73:13 And-- missing a parenthesis over there.
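Written out, the two identities being appealed to are the following; the usual regularity conditions that allow swapping derivative and integral are assumed.

\[ \mathbb{E}_\theta\Big[\tfrac{\partial}{\partial\theta}\log f_\theta(Y)\Big] = \int \frac{\partial_\theta f_\theta(y)}{f_\theta(y)}\, f_\theta(y)\, dy = \partial_\theta \int f_\theta(y)\, dy = \partial_\theta\, 1 = 0, \]

and differentiating that identity once more in theta gives

\[ \mathbb{E}_\theta\Big[\tfrac{\partial^2}{\partial\theta^2}\log f_\theta(Y)\Big] + \mathbb{E}_\theta\Big[\Big(\tfrac{\partial}{\partial\theta}\log f_\theta(Y)\Big)^2\Big] = 0, \]

which is exactly the pair of formulas for the Fisher information recalled above.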
73:17 So when you write the log likelihood, 73:19 it's just log of an exponential. 73:21 Huh, that's actually pretty nice. 73:22 Just like the least squares came naturally, the least 73:25 squares [INAUDIBLE] came naturally 73:26 when we took the log likelihood of the Gaussians, 73:29 we're going to have the same thing happen when 73:31 I take the log of the density. 73:33 The exponential is going to go away, 73:35 and then I'm going to use this formula. 73:36 But this formula is going to actually give me 73:39 an equation directly-- oh, that's where it was. 73:43 So that's the one that's missing up there. 73:44 And so the expectation minus this thing 73:49 is going to be equal to 0, which tells me 73:50 that the expectation is just the derivative. 73:53 Right, so it's still a function of theta, 73:55 but it's just the derivative of b. 73:57 And the variance is just going to be 73:59 the second derivative of b. 74:01 But remember, this was some sort of a scaling, right? 74:03 It's called the dispersion parameter. 74:05 So if I had a Gaussian and the variance of the Gaussian 74:09 did not depend on the sigma squared 74:12 which I stuffed in this phi, that would be certainly weird. 74:15 And it cannot depend only on mu, and so this will-- 74:17 for the Gaussian, this is definitely going to be equal 74:19 to 1. 74:20 And this is just going to be equal to my variance. 74:24 So this is just by taking the second derivative. 74:28 So basically, the take-home message is that this function b 74:33 captures-- 74:35 by taking one derivative it captures the expectation, 74:37 and by taking two derivatives it captures the variance. 74:39 Another thing that's actually cool-- 74:41 and we'll come back to this and I 74:42 want you to think about it-- is if this second derivative is 74:45 the variance, what can I say about this thing? 74:49 74:52 What do I know about a variance? 74:53 AUDIENCE: [INAUDIBLE] 74:54 PHILIPPE RIGOLLET: Yeah, that's positive. 74:56 So I know that this is positive. 74:58 So what does that tell me? 75:00 Positive? 75:03 That's convex, right? 75:03 A function that has a positive second derivative is convex. 75:07 So we're going to use that as well, all right? 75:09 So yeah, I'll see you on Thursday. 75:12 I have your homework.
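As a quick sanity check of the two take-home formulas above, the mean equal to b'(theta) and the variance equal to phi times b''(theta), here is a small sketch in Python using sympy. The b's for the Bernoulli and Poisson are the standard ones listed earlier; this is illustrative code, not course material.

import sympy as sp

theta = sp.symbols('theta', real=True)

# Bernoulli: theta = log(p / (1 - p)), phi = 1, b(theta) = log(1 + exp(theta))
b_bern = sp.log(1 + sp.exp(theta))
p = sp.exp(theta) / (1 + sp.exp(theta))                      # mean parameter p
print(sp.simplify(sp.diff(b_bern, theta) - p))               # 0: b'(theta) equals the mean p
print(sp.simplify(sp.diff(b_bern, theta, 2) - p * (1 - p)))  # 0: b''(theta) equals the variance p(1 - p)

# Poisson: theta = log(lambda), phi = 1, b(theta) = exp(theta)
b_pois = sp.exp(theta)
print(sp.diff(b_pois, theta))      # exp(theta) = lambda, the Poisson mean
print(sp.diff(b_pois, theta, 2))   # exp(theta) = lambda, the Poisson variance

Running this prints 0 for the two Bernoulli checks and exp(theta) twice for the Poisson, matching the claim that one derivative of b gives the mean and two derivatives, times phi, give the variance.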