https://www.youtube.com/watch?v=JTbZP0yt9qc&list=PLUl4u3cNGP60uVBMaoNERc6knT_MgPKS0&index=6 字幕記錄 00:00 00:00 The following content is provided under a Creative 00:02 Commons license. 00:03 Your support will help MIT OpenCourseWare 00:06 continue to offer high quality educational resources for free. 00:10 To make a donation or to view additional materials 00:12 from hundreds of MIT courses, visit MIT OpenCourseWare 00:16 at ocw.mit.edu. 00:17 00:21 PHILIPPE RIGOLLET: So welcome back. 00:23 We're going to finish this chapter on maximum likelihood 00:27 estimation. 00:28 And last time, I briefly mentioned something that 00:30 was called Fisher information. 00:33 So Fisher information, in general, 00:35 is actually a matrix when you have a multivariate parameter 00:40 theta. 00:41 So if theta, for example, is of dimension d, 00:44 then the Fisher information matrix 00:46 is going to be a d by d matrix. 00:48 You can see that, because it's the outer product. 00:51 So it's of the form gradient gradient transpose. 00:55 So if it's gradient gradient transpose, 00:57 the gradient is the d dimensional. 00:59 And so gradient times gradient transpose is a d by d matrix. 01:03 And so this matrix actually contains-- 01:07 well, tells you it's called Fisher information matrix. 01:09 So it's basically telling you how much 01:11 information about the theta is in your model. 01:14 So for example, if your model is very well-parameterized, 01:17 then you will have a lot of information. 01:20 You will have a higher-- 01:21 so let's think of it as being a scalar number, 01:23 just one number now-- so you're going 01:25 to have a larger information about your parameter 01:27 in the same probability distribution. 01:30 But if start having a weird way to parameterize your model, 01:35 then the Fisher information is actually going to drop. 01:38 So as a concrete example think of, for example, 01:40 a parameter of interest in a Gaussian model, 01:44 where the mean is known to be zero. 01:45 But what you're interested in is the variance, sigma squared. 01:48 If I'm interested in sigma square, 01:50 I could parameterize my model by sigma, sigma squared, sigma 01:53 to the fourth, sigma to 24th. 01:55 I could parameterize it by whatever I want, 01:57 then I would have a simple transformation. 01:59 Then you could say that some of them 02:01 are actually more or less informative, 02:02 and you're going to have different values 02:04 for your Fisher information. 02:06 So let's just review a few well-known computations. 02:10 So I will focus primarily on the one dimensional case as usual. 02:17 And I claim that there's two definitions. 02:19 So if theta is a real valued parameter, 02:24 then there's basically two definitions 02:26 that you can think of for your Fisher information. 02:28 One involves the first derivative 02:30 of your log likelihood. 02:31 And the second one involves the second derivative. 02:34 So the log likelihood here, we're 02:36 actually going to define it as l of theta. 02:39 And what is it? 02:40 Well, it's simply the likelihood function for one observation. 02:43 So it's l-- and I'm going to write 1 just to make sure that 02:46 we all know what we're talking about one observation-- 02:49 of-- which is the order again, I think it's X and theta. 02:52 02:55 So that's the log likelihood, remember? 02:57 03:05 So for example, if I have a density, 03:07 what is it going to be? 03:08 It's going to be log of f sub theta of X. 
03:12 So this guy is a random variable, 03:15 because it's a function of a random variable. 03:17 And that's what you see expectations of this thing. 03:19 It's a random function of theta. 03:21 If I view this as a function of theta, 03:23 the function becomes random, because it 03:25 depends on this random X. 03:27 And so I of theta is actually defined as the variance 03:35 of l prime of theta-- 03:40 so the variance of the derivative of this function. 03:43 And I also claim that it's equal to negative the expectation 03:50 of the second derivative of theta. 03:54 03:57 And here, the expectation and the variance 04:00 are computed, because this function, remember, is random. 04:03 So I need to tell you what is the distribution 04:05 of the X with respect to which I'm 04:06 computing the expectation and the variance. 04:08 And it's the theta itself. 04:09 04:13 So typically, the theta we're going 04:15 to be interested in-- so there's a Fisher 04:18 information for all values of the parameter, 04:20 but the one typically we're interested in 04:22 is the true parameter, theta star. 04:25 But view this as a function of theta right now. 04:29 So now, I need to prove to you-- and this is not 04:31 a trivial statement-- the variance of the derivative 04:34 is equal to negative the expectation 04:36 of the second derivative. 04:37 I mean, there's really quite a bit that comes into this right. 04:40 And it comes from the fact that this is a log not of anything. 04:44 It's the log of a density. 04:45 So let's just prove that without having 04:48 to bother too much ourselves with 04:51 some technical assumptions. 04:54 And the technical assumptions are the assumptions 04:56 that allow me to permute derivative and integral. 04:59 Because when I compute the variances and expectations, 05:01 I'm actually integrating against the density. 05:04 And what I want to do is to make sure that I can always do that. 05:09 So my technical assumptions are I can always permute 05:13 integral and derivatives. 05:15 So let's just prove this. 05:19 So what I'm going to do is I'm going to assume 05:21 that X has density f theta. 05:32 05:35 And I'm actually just going to write-- well, let 05:37 me write it f theta right now. 05:39 Let me try to not be lazy about writing. 05:42 And so the thing I'm going to use 05:43 is the fact that the integral of this density is equal to what? 05:50 1. 05:51 And this is where I'm going to start doing weird things. 05:54 That means that if I take the derivative of this guy, 05:56 it's equal to 0. 05:59 So that means that if I look at the derivative with respect 06:03 to theta of integral f theta of X dX, this is equal to 0. 06:11 And this is where I'm actually making this switch, 06:14 is that I'm going to say that this is actually 06:16 equal to the integral of the derivative. 06:19 06:27 So that's going to be the first thing I'm going to use. 06:30 And of course, if it's true for the first derivative, 06:32 it's going to be true for the second derivative. 06:34 So I'm going to actually do it a second time. 06:36 And the second thing I'm going to use 06:38 is the fact the integral of the second derivative 06:46 is equal to 0. 06:47 06:50 So let's start from here. 06:51 06:59 And let me start from, say, the expectation 07:01 of the second derivative of l prime theta. 07:06 So what is l prime prime theta? 07:08 Well, it's the second derivative of log of f theta of X. 
07:21 And we know that the derivative of the log-- 07:24 sorry-- yeah, so the derivative of the log is 1 over-- 07:30 well, it's the derivative of f divided by f itself. 07:34 07:49 Everybody's with me? 07:50 07:53 Just log of f prime is f prime over f. 07:58 Here, it's just that f, I view this as a function of theta 08:01 and not as a function of X. 08:04 So now, I need to take another derivative of this thing. 08:08 So that's going to be equal to-- 08:09 well, so we all know the formula for the derivative 08:13 of the ratio. 08:14 So I pick up the second derivative times f theta 08:22 minus the first derivative squared 08:30 divided by f theta squared-- 08:33 08:38 basic calculus. 08:40 And now, I need to check that negative the expectation 08:43 of this guy is giving me back what I want. 08:47 Well what is negative the expectation of l prime prime 08:52 of theta? 08:54 Well, what we need to do is to do negative integral 08:56 of this guy against f theta. 08:59 So it's minus the integral of-- 09:01 09:28 That's just the definition of the expectation. 09:31 I take an integral against f theta. 09:34 But here, I have something nice. 09:36 What's happening is that those guys are canceling. 09:38 09:41 And now that those guys are canceling, 09:43 those guys are canceling too. 09:44 09:51 So what I have is that the first term-- 09:53 I'm going to break this difference here. 09:56 So I'm going to say that integral of this difference 09:58 is the difference of the integrals. 10:00 So the first term is going to be the integral 10:03 of d over d theta squared of f theta. 10:09 10:13 And the second one, the negative signs are going to cancel, 10:17 and I'm going to be left with this. 10:32 10:37 Everybody's following? 10:39 Anybody found the mistake? 10:40 10:44 How about the other mistake? 10:46 I don't know if there's a mistake. 10:48 I'm just trying to get you to check what I'm doing. 10:50 10:54 With me so far? 10:56 So this guy here is the integral of the second the derivative 10:59 of f of X dX. 11:02 What is this? 11:04 AUDIENCE: It's 0. 11:06 PHILIPPE RIGOLLET: It's 0. 11:08 And that's because of this guy, which I will call frowny face. 11:16 So frowny face tells me this. 11:20 And let's call this guy monkey that hides his eyes. 11:26 No, let's just do something simpler. 11:28 Let's call it star. 11:29 And this guy, we will use later on. 11:32 11:35 So now, I have to prove that this guy, which 11:37 I have proved is equal to this, is now 11:40 equal to the variance of l prime theta. 11:46 So now, let's go back to the other way. 11:48 We're going to meet halfway. 11:49 I'm going to have a series-- 11:51 I want to prove that this guy is equal to this guy. 11:56 And I'm going to have a series of equalities 11:58 that I'm going to meet halfway. 12:01 So let's start from the other end. 12:03 We started from the negative l prime prime theta. 12:05 Let's start with the variance part. 12:06 12:10 Variance of l prime of theta, so that's the variance-- 12:17 12:22 so that's the expectation of l prime 12:29 of theta squared minus the square of the expectation of l 12:35 prime of theta. 12:35 12:41 Now, what is the square of the expectation 12:43 of l prime of theta? 12:44 Well, l prime of theta is equal to the partial with respect 12:50 to theta of log f theta of X, which we know from the first 12:57 line over there-- that's what's in the bracket on the second 12:59 line there-- 13:01 is actually equal to the partial over theta of f 13:05 theta X divided by f theta X. 
That's 13:09 the derivative of the log. 13:11 So when I look at the expectation of this guy, 13:16 I'm going to have the integral of this against f theta. 13:18 And the f thetas are going to cancel again, 13:20 just like I did here. 13:23 So this thing is actually equal to the integral 13:26 of partial over theta of f theta of X dX. 13:33 And what does this equal to? 13:34 13:37 0, by the monkey hiding is eyes. 13:42 So that's star-- tells me that this is equal to 0. 13:46 13:50 So basically, when I compute the variance, this term is not. 13:52 Going to matter. 13:53 I only have to complete the first one. 13:54 14:10 So what is the first one? 14:12 Well, the first one is the expectation of l prime squared. 14:21 14:24 And so that guy is the integral of-- well, what is l prime? 14:29 Again, it's partial over partial theta 14:33 f theta of X divided by f theta of X. Now, this time, this guy 14:37 is squared against the density. 14:40 14:44 So one of the f thetas cancel. 14:45 15:07 But now, I'm back to what I had before for this guy. 15:11 15:16 So this guy is now equal to this guy. 15:20 There's just the same formula. 15:21 So they're the same thing. 15:23 And so I've moved both ways. 15:25 Starting from the expression that 15:27 involves the expectation of the second derivative, 15:30 I've come to this guy. 15:31 And starting from the expression that tells me 15:34 about the variance of the first derivative, 15:36 I've come to the same guy. 15:38 So that completes my proof. 15:41 Are there any questions about the proof? 15:43 15:47 We also have on our way found an explicit formula for the Fisher 15:53 information as well. 15:55 So now that I have this thing, I could actually 15:57 add that if X has a density, for example, 16:02 this is also equal to the integral of-- 16:07 16:09 well, the partial over theta of f theta of X 16:15 squared divided by f theta of X, because I just 16:24 proved that those two things were actually 16:26 equal to the same thing, which was this guy. 16:28 16:31 Now in practice, this is really going to be the useful one. 16:34 The other two are going to be useful depending 16:35 on what case you're in. 16:37 So if I ask you to compute the Fisher information, 16:40 you have now three ways to pick from. 16:43 And basically, practice will tell you 16:44 which one to choose if you want to save five minutes when 16:47 you're doing your computations. 16:48 Maybe you're the guy who likes to take derivatives. 16:50 And then you're going to go with the second derivative one. 16:52 Maybe you're the guy who likes to extend squares, 16:55 so you're going to take the one that 16:56 involves the square of the squared prime. 16:59 And maybe you're just a normal person, 17:01 and you want to use that guy. 17:02 17:06 Why do I care? 17:07 This is the Fisher information. 17:09 And I could have defined the [? Hilbert ?] information 17:11 by taking the square root of this guy 17:13 plus the sine of this thing and just be super happy 17:16 and have my name in textbooks. 17:18 But this thing has a very particular meaning. 17:22 When we're doing the maximum likelihood estimation-- 17:24 17:28 so remember the maximum likelihood estimation is just 17:32 an empirical version of trying to minimize the KL divergence. 17:36 So what we're trying to do, maximum likelihood, 17:39 is really trying to minimize the KL divergence. 17:42 17:54 And we're trying to minimize this function, remember? 17:57 So now what we're going to do is we're 18:01 going to plot this function. 
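As a quick check of the three equivalent formulas for the Fisher information before the picture is drawn, here is a short sketch using the Bernoulli(p) model (the model choice and the numbers are illustrative, not from the lecture). For Bernoulli(p), l(p) = X log p + (1 - X) log(1 - p) and the closed form is I(p) = 1/(p(1 - p)); since this is a PMF rather than a density, the integral in the third formula becomes a sum over the two outcomes.

```python
import numpy as np

rng = np.random.default_rng(0)
p, n = 0.3, 1_000_000            # illustrative true parameter and sample size
X = rng.binomial(1, p, size=n)

# First derivative (score) and second derivative of l(p) = X log p + (1-X) log(1-p)
score  = X / p - (1 - X) / (1 - p)
second = -X / p**2 - (1 - X) / (1 - p)**2

I_from_variance  = score.var()        # Var[l'(p)]
I_from_curvature = -second.mean()     # -E[l''(p)]
# Third formula, PMF version: sum over x of (d/dp f_p(x))^2 / f_p(x),
# with f_p(1) = p and f_p(0) = 1 - p, whose derivatives in p are +1 and -1.
I_from_square = 1 / p + 1 / (1 - p)

print(I_from_variance, I_from_curvature, I_from_square, 1 / (p * (1 - p)))
# all four numbers should be close to 1/(p(1-p)), about 4.76 here
```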
18:02 We said that, let's place ourselves 18:04 in cases where this KL is convex, 18:06 so that the inverse is concave. 18:09 So it's going to look like this-- 18:11 U-shaped, that's convex. 18:13 So that's the true thing I'm trying to minimize. 18:16 And what I said is that I'm going to actually try 18:18 to estimate this guy. 18:19 So in practice, I'm going to have 18:20 something that looks like this, but it's not really this. 18:22 18:26 And we're not going to do this, but you 18:28 can show that you can control this 18:30 uniformly over the entire space, that there is no space where 18:33 it just becomes huge. 18:35 In particular, this is not the space 18:37 where it just becomes super huge, 18:38 and the minimum of the dotted line 18:39 becomes really far from this guy. 18:42 So if those two functions are close to each other, 18:45 then this implies that the minimum here of the dotted line 18:48 is close to the minimum of the solid line. 18:53 So we know that this is theta star. 18:56 And this is our MLE estimator, theta hat ML. 19:00 So that's basically the principle-- 19:01 the more data we have, the closer the dotted line 19:05 is to the solid line. 19:07 And so the minimum is closer to the minimum. 19:10 But now, this is just one example, 19:12 where I drew a picture for you. 19:14 But there could be some really nasty examples. 19:17 Think of this example, where I have 19:20 a function, which is convex, but it looks more like this. 19:23 19:30 That's convex, it's U-shaped. 19:31 It's just a professional U. 19:36 Now, I'm going to put a dotted line around it that has pretty 19:41 much the same fluctuations. 19:42 The band around it is of this size. 19:44 19:52 So do we agree that the distance between the solid line 19:56 and the dotted line is pretty much the same 19:58 in those two pictures? 20:01 Now, here, depending on how I tilt this guy, 20:04 basically, I can put the minimum theta star wherever I want. 20:07 And let's say that here, I actually put it here. 20:11 That's pretty much the minimum of this line. 20:13 And now, the minimum of the dotted line is this guy. 20:16 20:20 So they're very far. 20:22 The fact that I'm very flat at the bottom 20:25 makes my requirements for being close 20:28 to the U-shaped solid curve much more stringent, 20:31 if I want to stay close. 20:34 And so this is the canonical case. 20:38 This is the annoying case. 20:39 And of course, you have the awesome case-- 20:43 looks like this. 20:45 And then however you deviate, you 20:47 can have something that moves pretty far. 20:50 It doesn't matter, it's always going to stay close. 20:53 Now, what is the quantity that measures how 20:57 curved I am at a given point-- 20:59 how curved the function is at a given point? 21:03 The second derivative. 21:05 And so the Fisher information is negative the second derivative. 21:11 Why the negative? 21:12 21:17 Well here-- Yeah, we're looking for a minimum, 21:18 and this guy is really the-- you should view 21:20 this as a reverted function. 21:23 This is we're trying to maximize the likelihood, which 21:26 is basically maximizing the negative KL. 21:28 So the picture I'm showing you is trying to minimize the KL. 21:31 So the true picture that you should see for this guy 21:33 is the same, except that it's just flipped over. 21:37 But the curvature is the same, whether I flip my sheet or not. 21:40 So it's the same thing.
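A compact way to see why the curvature at the bottom of this U is exactly the Fisher information (a sketch under the same interchange-of-derivative-and-integral assumptions as the proof above): write the KL divergence as an expectation under theta star and differentiate twice in the second argument.

```latex
\mathrm{KL}(P_{\theta^*}, P_\theta)
  = \mathbb{E}_{\theta^*}\!\left[\log f_{\theta^*}(X)\right]
  - \mathbb{E}_{\theta^*}\!\left[\log f_{\theta}(X)\right],
\qquad
\frac{\partial^2}{\partial \theta^2}\,\mathrm{KL}(P_{\theta^*}, P_\theta)
  = -\,\mathbb{E}_{\theta^*}\!\left[\ell''(\theta)\right].
```

Evaluating at theta = theta star gives -E[l''(theta star)] = I(theta star): a flat bottom for the KL means a small Fisher information.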
21:42 So apart from this negative sign, 21:44 which is just coming from the fact 21:45 that we're maximizing instead of minimizing, 21:47 this is just telling me how curved my likelihood is 21:50 around the maximum. 21:51 And therefore, it's actually telling me how good, 21:55 how robust my maximum likelihood estimator is. 21:58 It's going to tell me how close, actually, my likelihood 22:01 estimator is going to be-- 22:03 maximum likelihood is going to be to the true parameter. 22:06 So I should be able to see that somewhere. 22:09 There should be some statement that tells me 22:11 that this Fisher information will 22:13 play a role when assessing the precision of this estimator. 22:17 And remember, how do we characterize a good estimator? 22:20 Well, we look at its bias, or we look its variance. 22:24 And we can combine the two and form the quadratic risk. 22:26 So essentially, what we're going to try to say 22:29 is that one of those guys-- either the bias 22:31 or the variance or the quadratic risk-- 22:33 is going to be worse if my function is flatter, 22:35 meaning that my Fisher information is smaller. 22:37 22:40 And this is exactly the point of this last theorem. 22:44 So let's look at a couple of conditions. 22:46 So this is your typical 1950s statistics 22:51 paper that has like one page of assumptions. 22:54 And this was like that in the early days, 22:56 because people were trying to make 22:58 theories that would be valid for as many models as possible. 23:01 And now, people are sort of abusing this, 23:03 and they're just making all this lists of assumptions 23:05 so that their particular method works 23:06 for their particular problem, because they just 23:08 want to take shortcuts. 23:10 But really, the maximum likelihood estimator 23:13 is basically as old as modern statistics. 23:15 And so this was really necessary conditions. 23:18 And we'll just parse that. 23:19 The model is identified. 23:21 Well, better be, because I'm trying to estimate 23:24 theta and not P theta. 23:25 So this one is good. 23:26 For all theta, the support of P theta does not depend on theta. 23:32 So that's just something that we need to have. 23:34 Otherwise, things become really messy. 23:36 And in particular, I'm not going to be 23:38 able to define likelihood-- 23:40 Kullback-Leibler divergences. 23:42 Then why can I not do that? 23:44 Well, because the Kullback-Leibler divergence 23:46 has a log of the ratio of two densities. 23:49 And if one of the support is changing with theta 23:51 is it might be they have the log of something that's 23:53 0 or something that's not 0. 23:55 And the log of 0 is a slightly annoying quantity to play with. 23:59 And so we're just removing that case. 24:01 Nothing depends on theta-- 24:02 think of it as being basically the entire real line 24:05 as a support for Gaussian, for example. 24:08 Theta star is not on the boundary of theta. 24:10 Can anybody tell me why this is important? 24:13 24:17 We're talking about derivatives. 24:19 So when I want to talk about derivatives, 24:20 I'm talking about fluctuations around a certain point. 24:23 And if I'm at the boundary, it's actually really annoying. 24:26 I might have the derivative-- remember, 24:27 I give you this example-- 24:28 where the maximum likelihood is just obtained at the boundary, 24:31 because the function cannot grow anymore at the boundary. 24:34 But it does not mean that the first order 24:36 derivative is equal to 0. 24:38 It does not mean anything. 
24:39 So all this picture here is valid 24:42 only if I'm actually achieving the minimum inside. 24:46 Because if my theta space stops here and it's just this guy, 24:52 I'm going to be here. 24:53 And there's no questions about curvature or anything 24:55 that comes into play. 24:56 It's completely different. 24:58 So here, it's inside. 25:00 Again, think of theta as being the entire real line. 25:02 Then everything is inside. 25:05 I is invertible. 25:08 What does it mean for a positive number, a 1 by 1 matrix 25:11 to be invertible? 25:12 25:16 Yep. 25:17 AUDIENCE: It'd be equal to its [INAUDIBLE].. 25:20 PHILIPPE RIGOLLET: A 1 by 1 matrix, that's a number, right? 25:24 What is a characteristic-- if I give you a matrix with numbers 25:26 and ask you if it's invertible, what 25:28 are you going to do with it? 25:31 AUDIENCE: Check if the determinant is 0. 25:33 PHILIPPE RIGOLLET: Check if the determinant is 0. 25:35 What is the determinant of the 1 by 1 matrix? 25:37 It's just the number itself. 25:38 So that's basically, you want to check if this number is 0 25:41 or not. 25:42 So we're going to think in the one dimensional case here. 25:44 And in the one dimensional case, that just 25:46 means that the curvature is not 0. 25:51 Well, it better be not 0, because then I'm 25:53 going to have no guarantees. 25:54 If I'm totally flat, if I have no curvature, 25:56 I'm basically totally flat at the bottom. 25:58 And then I'm going to get nasty things. 26:00 Now, this is not true. 26:02 I could have the curvature which grows like-- so here, it's 26:05 basically-- the second derivative is telling me-- 26:08 if I do the Taylor expansion, it's 26:09 telling me how I grow as a function of, say, x squared. 26:13 It's the quadratic term that I'm controlling. 26:15 It could be that this guy is 0, and then the term of order, 26:19 x to the fourth, is picking up. 26:20 That could be the first one that's non-zero. 26:22 26:23 But that would mean that my rate of convergence 26:25 would not be square root of n. 26:26 When I'm actually playing central limit theorem, 26:28 it would become n to the 1/4th. 26:30 And if I have all a bunch of 0 until the 16th order, 26:33 I would have n to the 1/16th, because that's really 26:36 telling me how flat I am. 26:39 So we're going to focus on the case 26:41 where it's only quadratic terms, and the rates 26:43 of the central limit theorems kick in. 26:46 And then a few other technical conditions-- 26:48 we just used a couple of them. 26:49 So I permuted limit and integral. 26:52 And you can check that really what you want 26:54 is that the integral of a derivative is equal to 0. 26:58 Well, it just means that the values at the two ends 27:00 are actually the same. 27:01 So those are slightly different things. 27:05 So now, what we have is that the maximum likelihood estimator 27:08 has the following two properties. 27:10 The first one, if I were to say that in words, what would 27:13 I say, that theta hat is-- 27:15 27:18 Is what? 27:20 Yeah, that's what I would say when I-- 27:22 that's for mathematicians. 27:23 But if I'm a statistician, what am I going to say? 27:27 It's consistent. 27:28 It's a consistent estimator of theta star. 27:30 It converges in probability to theta star. 27:33 And then we have this sort of central limit theorem 27:35 statement. 
27:36 The central limit theorem statement tells me that if this 27:39 was an average and I remove the expectation of the average-- 27:44 let's say it's 0, for example-- 27:45 then square root of n times the average blah 27:47 goes through some normal distribution. 27:49 This is telling me that this is actually true, 27:52 even if theta hat has nothing to do with an average. 27:55 That's remarkable. 27:56 Theta hat might not even have a closed form, 27:59 and I'm still having basically the same properties 28:02 as an average that would be given to me 28:04 by a central limit theorem. 28:08 And what is the asymptotic variance? 28:10 So that's the variance in the n. 28:12 So here, I'm thinking of having those guys being multivariate. 28:15 And so I have the inverse of the covariance matrix 28:18 that shows up as the variance-covariance matrix 28:21 asymptotically. 28:22 But if you think of just being a one dimensional parameter, 28:25 it's one over this Fisher information, 28:27 one over the curvature. 28:29 So the curvature is really flat, the variance 28:31 becomes really big. 28:33 If the function is really flat, curvature is low, 28:36 variance is big. 28:37 If the curvature is very high, the variance becomes very low. 28:41 And so that illustrates everything 28:42 that's happening with the pictures that we have. 28:45 And if you look, what's amazing here, 28:48 there is no square root 2 pi, there's no fudge factors 28:51 going on here. 28:52 This is the asymptotic variance, nothing else. 28:56 It's all in there, all in the curvature. 28:58 29:03 Are there any questions about this? 29:05 29:07 So you can see here that theta star is the true parameter. 29:11 And the information matrix is evaluated at theta star. 29:17 That's the point that matters. 29:18 When I drew this picture, the point 29:20 that was at the very bottom was always theta star. 29:22 It's the one that minimizes the KL divergence, 29:26 am as long as I'm identified. 29:32 Yes. 29:33 AUDIENCE: So the higher the curvature, 29:35 the higher the inverse of the Fisher information? 29:38 PHILIPPE RIGOLLET: No, the higher 29:39 the Fisher information itself. 29:42 So the inverse is going to be smaller. 29:46 So small variance is good. 29:48 So now what it means, actually, if I 29:50 look at what is the quadratic risk 29:51 of this guy, asymptotically-- what 29:54 is asymptotic quadratic risk? 29:56 Well, it's 0 actually. 29:57 But if I assume that this thing is true, 30:01 that this thing is pretty much Gaussian, 30:03 if I look at the quadratic risk, well, it's 30:05 the expectation of the square of this thing. 30:08 And so it's going to scale like the variance divided by n. 30:12 30:14 The bias goes to 0, just by this. 30:18 And then the quadratic risk is going 30:20 to scale like one over Fisher information divided by n. 30:23 30:28 So here, the-- I'm not mentioning the constants. 30:30 There must be constants, because everything is asymptotic. 30:32 So for each finite n, I'm going to have 30:33 some constants that show up. 30:35 30:39 Everybody just got their mind blown by this amazing theorem? 30:43 So I mean, if you think about it, the MLE can be anything. 30:48 I'm sorry to say to you, in many instances, 30:50 the MLE is just going to be an average, which is just 30:52 going to be slightly annoying. 30:54 But there are some cases where it's not. 30:57 And we have to resort to this theorem 30:59 rather than actually resorting to the central limit theorem 31:01 to prove this thing. 
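Here is a small simulation sketch of that statement (the model, sample size, and number of repetitions are illustrative choices, not from the lecture). It uses the Cauchy location model, where the MLE has no closed form and a standard computation gives I(theta) = 1/2, so the theorem predicts that sqrt(n)(theta hat - theta star) is approximately N(0, 2) for large n.

```python
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(1)
theta_star, n, reps = 0.0, 200, 2000

def neg_log_lik(theta, x):
    # Cauchy location model, f_theta(x) = 1 / (pi * (1 + (x - theta)^2));
    # the constant log(pi) term is dropped since it does not affect the argmax.
    return np.sum(np.log1p((x - theta) ** 2))

estimates = []
for _ in range(reps):
    x = theta_star + rng.standard_cauchy(n)
    # No closed form for the MLE here, so maximize the likelihood numerically.
    res = minimize_scalar(neg_log_lik, args=(x,), bounds=(-5.0, 5.0), method="bounded")
    estimates.append(res.x)

scaled = np.sqrt(n) * (np.array(estimates) - theta_star)
print(scaled.var())   # should be close to 1 / I(theta_star) = 2
```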
31:03 And more importantly, even if this was an average, 31:05 you don't have to even know how to compute 31:07 the covariance matrix-- 31:09 sorry, the variance of this thing 31:11 to plug it into the central limit theorem. 31:14 I'm telling you, it's actually given by the Fisher information 31:17 matrix. 31:18 So if it's an average, between you and me, 31:22 you probably want to go the central limit theorem route 31:24 if you want to prove this kind of stuff. 31:26 But if it's not, then that's your best shot. 31:28 But you have to check those conditions. 31:31 I will give you for granted the 0.5. 31:38 Ready? 31:39 Any questions? 31:40 We're going to wrap up this chapter four. 31:41 So if you have questions, that's the time. 31:43 Yes. 31:43 AUDIENCE: What was the quadratic risk up there? 31:45 PHILIPPE RIGOLLET: You mean the definition? 31:47 AUDIENCE: No, the-- what is was for this. 31:49 PHILIPPE RIGOLLET: Well, you see the quadratic risk, 31:51 if I think of it as being one dimensional, 31:53 the quadratic risk is the expectation 31:55 of the square of the difference between theta hat and theta 31:57 star. 31:58 32:01 So that means that if I think of this as having a normal 0, 1, 32:05 that's basically computing the expectation of the square 32:09 of this Gaussian divided by n. 32:13 I just divided by square root of n on both sides. 32:15 So it's the expectation of the square of this Gaussian. 32:18 The Gaussian is mean 0, so the expectation of the square 32:20 is just a variance. 32:23 And so I'm left with 1 over the Fisher information divided 32:25 by n. 32:26 AUDIENCE: I see. 32:26 OK. 32:27 32:34 PHILIPPE RIGOLLET: So let's move on to chapter four. 32:36 And this is the method of moments. 32:38 So the method of moments is actually maybe a bit older 32:42 than maximum likelihood. 32:44 And maximum likelihood is dated, say, early 20th century, 32:48 I mean as a systematic thing, because as I said, 32:50 many of those guys are going to be averages. 32:52 So finding an average is probably a little older. 32:56 The method of moments, there's some really nice uses. 32:58 There's a paper by Pearson in 1904, I believe, or maybe 1894. 33:03 I don't know. 33:04 33:06 And this paper, he was actually studying some species 33:10 of crab in an island, and he was trying 33:12 to make some measurements. 33:13 That's how he came up with this model of mixtures of Gaussians, 33:16 because there was actually two different populations 33:18 in this populations of crab. 33:20 And the way he actually fitted the parameters 33:23 was by doing the method of moments, 33:25 except that since there were a lot of parameters, 33:27 he actually had to basically solve six equations with six 33:33 unknowns. 33:34 And that was a complete nightmare. 33:35 And the guy did it by hand. 33:36 And we don't know how he did it actually. 33:40 But that is pretty impressive. 33:43 So I want to start-- 33:44 and this first part is a little brutal. 33:48 But this is a Course 18 class, and I do not want to give you-- 33:52 So let's all agree that this course might be slightly more 33:54 challenging than AP statistics. 33:56 And that means that it's going to be challenging just 34:00 during class. 34:01 I'm not going to ask you about the Weierstrass Approximation 34:04 Theorem during the exams. 34:05 But what I want is to give you mathematical motivations 34:08 for what we're doing. 
34:10 And I can promise you that maybe you 34:12 will have a slightly higher body temperature during the lecture, 34:17 but you will come out smarter of this class. 34:20 And I'm trying to motivate to you for using mathematical tool 34:24 and show you where interesting mathematical things that you 34:27 might find dry elsewhere actually work very beautifully 34:31 in the stats literature. 34:33 And one that we saw was using Kullback-Leibler divergence out 34:35 of motivation for maximum likelihood estimation, 34:38 for example. 34:39 So the Weierstrass Approximation Theorem 34:42 is something that comes from pure analysis. 34:45 So maybe-- I mean, it took me a while before I saw that. 34:49 And essentially, what it's telling you 34:51 is that if you look at a function that 34:52 is continuous on an interval a, b-- 34:55 on a segment a, b-- 34:57 then you can actually approximate it 35:02 uniformly well by polynomials as long 35:05 as you're willing to take the degree 35:06 of this polynomials large enough. 35:09 So the formal statement is, for any epsilon, 35:11 there exists the d that depends on epsilon in a1 to ad-- 35:16 so if you insist on having an accuracy which is 1/10,000, 35:20 maybe you're going to need a polynomial of degree 100,000, 35:23 who knows. 35:24 It doesn't tell you anything about this. 35:26 But it's telling you that at least you 35:28 have only a finite number of parameters 35:29 to approximate those functions that typically 35:31 require an infinite number of parameters to be described. 35:35 So that's actually quite nice. 35:36 And that's the basis for many things 35:39 and many polynomial methods typically. 35:43 And so here, it's uniform, so there's 35:45 this max over x that shows up that's actually nice as well. 35:50 That's Weierstrass Approximation Theorem. 35:52 Why is that useful to us? 35:54 Well, in statistics, I have a sample of X1 to Xn. 35:58 I have, say, a unified statistical model. 36:00 I'm not always going to remind you 36:01 that it's identified-- not unified-- identified 36:04 statistical model. 36:05 And I'm going to assume that it has a density. 36:08 You could think of it as having a PMF, 36:10 but think of it as having a density for one second. 36:13 Now, what I want is to find the distribution. 36:16 I want to find theta. 36:18 And finding theta, since it's identified 36:20 as equivalent to finding P theta, which 36:22 is equivalent to finding f theta, and knowing a function 36:26 is the same-- 36:28 knowing a density is the same as knowing a density 36:30 against any test function h. 36:33 So that means that if I want to make sure I know a density-- 36:38 if I want to check if two densities are the same, 36:40 all I have to do is to compute their integral 36:42 against all bounded continuous functions. 36:46 You already know that it would be true 36:48 if I checked for all functions h. 36:50 But since f is a density, I can actually 36:53 look only at functions h that are bounded, 36:56 say, between minus 1 and 1, and that are continuous. 37:04 That's enough. 37:06 Agreed? 37:06 Well, just trust me on this. 37:08 Yes, you have a question? 37:11 AUDIENCE: Why is this-- like, why shouldn't you 37:14 just say that [INAUDIBLE]? 37:17 37:20 PHILIPPE RIGOLLET: Yeah, I can do that. 37:21 I'm just finding a characterization 37:23 that's going to be useful for me later on. 37:25 I can find a bunch of them. 37:26 But here, this one is going to be useful. 
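For reference, the Weierstrass Approximation Theorem just invoked, written out (indexing the constant coefficient as a_0): for every continuous f on [a, b] and every epsilon > 0,

```latex
\exists\, d \in \mathbb{N},\ a_0, \dots, a_d \in \mathbb{R}
\quad\text{such that}\quad
\max_{x \in [a, b]}\;\Bigl|\, f(x) - \sum_{k=0}^{d} a_k x^k \,\Bigr| \;\le\; \varepsilon .
```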
37:28 So all I need to say is that f theta star integrated against 37:32 X, h of x-- so this implies that f-- 37:35 if theta is equal to f theta star not everywhere, 37:38 but almost everywhere. 37:41 And that's only true if I guarantee to you that f theta 37:44 and f theta stars are densities. 37:46 This is not true for any function. 37:49 So now, that means that, well, if I wanted to estimate theta 37:53 hat, all I would have to do is to compute the average, right-- 37:56 so this guy here, the integral-- 38:01 let me clean up a bit my board. 38:02 38:22 So my goal is to find theta such that, if I look at f theta 38:30 and now I integrate it against h of x, then 38:34 this gives me the same thing as if I were 38:36 to do it against f theta star. 38:42 And I want this for any h, which is continuous and bounded. 38:45 38:48 So of course, I don't know what this quantity is. 38:51 It depends on my unknown theta star. 38:53 But I have theta from this. 38:54 And I'm going to do the usual-- 38:56 the good old statistical trick, which is, 38:57 well, this I can write as the expectation with respect 39:01 to P theta star of h theta of x. 39:06 That's just the integral of a function against something. 39:08 And so what I can do is say, well, now I 39:11 don't know this guy. 39:12 But my good old trick from the book 39:14 is replace expectations by averages. 39:15 And what I get-- 39:16 39:23 And that's approximately by the law of large numbers. 39:29 So if I can actually find a function f theta such 39:33 that when I integrate it against h 39:36 it gives me pretty much the average of the evaluations 39:39 of h over my data points for all h, 39:42 then that should be a good candidate. 39:44 39:47 The problem is that's a lot of functions to try. 39:52 Even if we reduced that from all possible functions 39:54 to bounded and continuous ones, that's 39:56 still a pretty large infinite number of them. 40:01 And so what we can do is to use our Weierstrass Approximation 40:05 Theorem. 40:06 And it says, well, maybe I don't need to test it against all h. 40:09 Maybe the polynomials are enough for me. 40:11 So what I'm going to do is I'm going to look only at functions 40:14 h that are of the form sum of ak-- 40:20 so h of x is sum of ak X to the k-th for k 40:29 equals 0 to d-- only polynomials of degree d. 40:34 So when I look at the average of my h's, I'm 40:37 going to get a term like the first one. 40:40 So the first one here, this guy, becomes 1/n sum from i equal 1 40:47 to n sum from k equal 0 to d of ak Xi to the k-th. 40:54 That's just the average of the values of h of Xi. 40:58 And now, what I need to do is to check 41:00 that it's the same thing when I integrate 41:03 h of this form as well. 41:06 I want this to hold for all polynomials of degree d. 41:10 That's still a lot of them. 41:12 There's still an infinite number of polynomials, 41:14 because there's an infinite number of numbers a0 to ad 41:17 that describe those polynomials. 41:21 But since those guys are polynomials, 41:23 it's actually enough for me to look only at the terms 41:26 of the form X to the k-th-- 41:28 no linear combination, no nothing. 41:30 So actually, it's enough to look only 41:32 at h of x, which is equal to X to the k-th for k equal 0 to d. 41:40 41:43 And now, how many of those guys are there? 41:46 Just d plus 1, 0 to d. 41:49 So that's actually a much easier thing for me to solve. 
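Summarizing the chain of reductions just made in one display (same notation as the board): the requirement against all bounded continuous test functions is replaced, via Weierstrass, by the d + 1 monomials, and the unknown right-hand sides are replaced by averages.

```latex
\int f_\theta(x)\, h(x)\, dx \;\approx\; \frac{1}{n}\sum_{i=1}^{n} h(X_i)
\quad \text{for all bounded continuous } h
\;\;\leadsto\;\;
\int f_\theta(x)\, x^k\, dx \;\approx\; \frac{1}{n}\sum_{i=1}^{n} X_i^k,
\quad k = 0, 1, \dots, d .
```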
41:51 41:54 Now, this quantity, which is the integral of f theta against X 42:01 to the k-th-- so that the expectation of X to the k-th 42:05 here-- 42:06 it's called moment of order k, or k-th moment of P theta. 42:12 That's a moment. 42:13 A moment is just the expectation of the power. 42:16 The mean is which moment? 42:19 The first moment. 42:21 And variance is not exactly the second moment. 42:24 It's the second moment minus the first moment squared. 42:27 42:29 That's the variance. 42:30 It's E of X squared minus E of X squared. 42:34 So those are things that you already know. 42:36 And then you can go higher. 42:37 You can go to E of X cube, E of X blah, blah. 42:40 Here, I say go to E of X to the d-th. 42:43 Now, as you can see, this is not something 42:44 you can really put in action right now, 42:47 because the Weierstrass Approximation Theorem does not 42:50 tell you what d should be. 42:52 Actually, we totally lost track of the epsilon 42:54 I was even looking for. 42:54 I just said approximately equal, approximately equal. 42:57 And so all this thing is really just motivation. 42:59 But it's essentially telling you that if you 43:02 go to d large enough, technically 43:05 you should be able to identify exactly your distribution up 43:08 to epsilon. 43:11 So I should be pretty good, if I go to d large enough. 43:16 Now in practice, actually there should be much 43:19 less than arbitrarily large d. 43:23 Typically, we are going to need d which is 1 or 2. 43:25 43:28 So there are some limitations to the Weierstrass Approximation 43:31 Theorem. 43:32 And there's a few. 43:33 The first one is that it only works 43:35 for continuous functions, which is not so much of a problem. 43:39 That can be fixed. 43:42 Well, we need bounded continuous functions. 43:44 It works only on intervals. 43:45 That's annoying, because we're going 43:47 to have random variables that are defined beyond intervals. 43:51 So we need something that just goes beyond the intervals. 43:53 And you can imagine that if you let your functions be huge, 43:55 it's going to be very hard for you 43:57 to have a polynomial approximately [INAUDIBLE] well. 44:00 Things are going to start going up and down at the boundary, 44:02 and it's going to be very hard. 44:05 And again, as I said several times, 44:07 it doesn't tell us what d should be. 44:09 And as statisticians, we're looking for methods, not 44:11 like principles of existence of a method that exists. 44:15 So if E is discrete, I can actually 44:21 get a handle on this d. 44:23 If E is discrete and actually finite-- 44:26 I'm going to actually look at a finite E, 44:29 meaning that I have a PMF on, say, r possible values, x1 44:33 and xr. 44:34 My random variable, capital X, can 44:35 take only r possible values. 44:37 Let's think of them as being the integer numbers 1 to r. 44:41 That's the number of success out of r trials 44:44 that I get, for example. 44:46 Binomial rp, that's exactly something like this. 44:51 So now, clearly this entire distribution 44:55 is defined by the PMF, which gives me exactly r numbers. 45:00 So it can completely describe this distribution 45:02 with r numbers. 45:03 The question is, do I have an enormous amount of redundancy 45:08 if I try to describe this distribution using moments? 45:12 It might be that I need-- say, r is equal to 10, 45:14 maybe I have only 10 numbers to describe this thing, 45:18 but I actually need to compute moments up to the order of 100 45:20 before I actually recover entirely the distribution. 
45:23 Maybe I need to go infinite. 45:25 Maybe the Weierstrass Theorem is the only thing 45:27 that actually saves me here. 45:28 And I just cannot recover it exactly. 45:30 I can go to epsilon if I'm willing to go to higher 45:33 and higher polynomials. 45:34 Oh, by the way, in the Weierstrass Approximation 45:36 Theorem, I can promise you that as epsilon goes to 0, 45:39 d goes to infinity. 45:41 So now, really I don't even have actually r parameters. 45:46 I have only r minus 1 parameters, because the last one-- 45:50 because they sum up to 1. 45:51 So the last one I can always get by doing 45:53 1 minus the sum of the first r minus 1. 45:56 Agreed? 45:58 So each distribution on r numbers is described 46:01 by r minus 1 parameters. 46:04 The question is, can I use only r minus 1 moments 46:07 to describe this guy? 46:08 46:12 This is something called Gaussian quadrature. 46:16 The Gaussian quadrature tells you, yes, moments 46:18 are actually a good way to reparametrize your distribution 46:22 in the sense that if I give you the moments 46:24 or if I give you the probability mass function, 46:27 I'm basically giving you exactly the same information. 46:29 You can recover all the probabilities from there. 46:32 So here, I'm going to denote by-- 46:34 I'm going to drop the notation in theta. 46:37 I don't have theta. 46:38 Here, I'm talking about any generic distribution. 46:41 And so I'm going to call mk the k-th moment. 46:44 46:49 And since I have a PMF, this is really the sum for j 46:54 equals 1 to r of xj to the k-th times p of xj. 47:06 And this is the PMF. 47:10 So that's my k-th moment. 47:12 So the k-th moment is a linear combination of the numbers 47:15 that I am interested in. 47:16 47:19 So that's one equation. 47:22 47:25 And I have as many equations as I'm actually 47:27 willing to look at moments. 47:28 So if I'm looking at 25 moments, I have 25 equations. 47:34 m1 equals blah with this to the power of 1, 47:36 m2 equals blah with this to the power of 2, et cetera. 47:40 And then I also have the equation 47:41 that 1 is equal to the sum of the p of xj. 47:51 That's just the definition of PMF. 47:55 So this is r's. 47:56 They're ugly, but those are r's. 47:58 48:00 So now, this is a system of linear equations in p, 48:04 and I can actually write it in its canonical form, which 48:07 is of the form a matrix of those guys 48:11 times my parameters of interest is equal to a right hand side. 48:15 The right hand side is the moments. 48:17 That means, if I give you the moments, 48:20 can you come back and find what the PMF is, 48:22 because we know already from probability 48:24 that the PMF is all I need to know to fully describe 48:27 my distribution. 48:29 Given the moments, that's unclear. 48:32 Now, here, I'm going to actually take exactly r minus 1 moments 48:37 and this extra condition that the sum of those guys 48:39 should be 1. 48:41 So that gives me r equations based on r minus 1 moments. 48:45 And how many unknowns do I have? 48:47 Well, I have my r unknown parameters 48:54 for the PMF, the r values of the PMF. 48:59 Now, of course, this is going to play a huge role 49:02 in whether there are many p's that give me the same. 49:06 The goal is to find if there are several p's that can give me 49:09 the same moments. 49:10 But if there's only one p that can give me a set of moments, 49:13 that means that I have a one-to-one correspondence 49:15 between PMF and moments. 49:17 And so if you give me the moments, 49:18 I can just go back to the PMF. 49:23 Now, how do I go back?
49:23 Well, by inverting this matrix. 49:26 If I multiply this matrix by its inverse, 49:28 I'm going to get the identity times the vector of p's equal 49:32 the inverse of the matrix times the m's. 49:36 So what we want to do is to say that p 49:41 is equal to the inverse of this big matrix times the moments 49:45 that you give me. 49:47 And if I can actually talk about the inverse, 49:49 then I have basically a one-to-one mapping 49:52 between the m's, the moments, and the matrix. 49:55 So what I need to show is that this matrix is invertible. 49:58 And we just decided that the way to check 50:01 if a matrix is invertible is by computing its determinant. 50:05 Who has computed a determinant before? 50:10 Who was supposed to compute a determinant at least than just 50:12 to say, no, you know how to do it. 50:15 So you know how to compute determinants. 50:16 And if you've seen any determinant in class, 50:19 there's one that shows up in the exercises that professors love. 50:22 And it's called the Vandermonde determinant. 50:25 And it's the determinant of a matrix 50:26 that have a very specific form. 50:28 It looks like-- so there's basically only r parameters 50:31 to this r by r matrix. 50:33 The first row, or the first column-- sometimes, 50:36 it's presented like that-- 50:37 is this vector where each entry is to the power of 1. 50:41 And the second one is each entry is to the power of 2, 50:43 and to the power of 3, and to the power 4, et cetera. 50:46 So that's exactly what we have-- x1 to the first, x2 50:49 to the first, all the way to xr to the first, 50:51 and then same thing to the power of 2, 50:53 all the way to the last one. 50:54 And I also need to add the row of all 1's, which 50:58 you can think of those guys are to the power of 0, if you want. 51:01 So I should really put it on top, 51:02 if I wanted to have a nice ordering. 51:05 So that was the matrix that I had. 51:07 And I'm not asking you to check it. 51:09 You can prove that by induction actually, 51:10 typically by doing the usual let's eliminate 51:14 some rows and columns type of tricks 51:15 that you do for matrices. 51:17 So you basically start from the whole matrix. 51:19 And then you move onto a matrix that has only one 1's and then 51:21 0's here. 51:22 And then you have Vandermonde that's just slightly smaller. 51:25 And then you just iterate. 51:26 Yeah. 51:27 AUDIENCE: I feel like there's a loss to either the supra index, 51:31 or the sub index should have a k somewhere [INAUDIBLE].. 51:35 51:38 [INAUDIBLE] the one I'm talking about? 51:39 PHILIPPE RIGOLLET: Yeah, I know, but I 51:41 don't think the answer to your question is yes. 51:45 So k is the general index, right? 51:48 So there's no k. k does not exist. k just is here for me 51:51 to tell me for k equals 1 to r. 51:53 So this is an r by r matrix. 51:56 And so there is no k there. 51:58 So if you wanted the generic term, 52:00 if I wanted to put 1 in the middle on the j-th row and k-th 52:03 column, that would be x-- 52:07 so j-th row would be x sub k to the power of j. 52:13 That would be the-- 52:15 And so now, this is basically the sum-- 52:19 well, that should not be strictly-- 52:20 So that would be for j and k between 1 and r. 52:25 So this is the formula that get when 52:26 you try to expand this Vandermonde determinant. 52:29 You have to do it only once when you're a sophomore typically. 52:32 And then you can just go on Wikipedia to do it. 52:34 That's what I did. 52:34 I actually made a mistake copying it. 
52:36 The first one should be 1 less than or equal to j. 52:39 And the last one should be k less than or equal to r. 52:42 And now what you have is the product 52:43 of the differences of xj and xk. 52:45 52:47 And for this thing to be non-zero, 52:48 you need all the terms to be non-zero. 52:51 And for all the terms to be non-zero, 52:52 you need to have no xj and xk that are identical. 52:58 If all those are different numbers, 52:59 then this product is going to be different from 0. 53:03 And those are different numbers, because those 53:05 are r possible values that your random variable takes. 53:09 You're not going to say that it takes two 53:11 with probability 1.5-- 53:14 sorry, two with probability 0.5 and two with probability 0.25. 53:18 You're going to say it takes two with probability 0.75 directly. 53:22 So those xj's are different. 53:24 These are the different values that your random variable 53:27 can take. 53:28 53:32 Remember, xj, xk was just the different values x1 to xr-- 53:37 sorry-- was the different values that your random variable 53:39 can take. 53:41 Nobody in their right mind would write twice the same value 53:43 in this list. 53:44 53:47 So my Vandermonde is non-zero. 53:49 So I can invert it. 53:49 And I have a one-to-one correspondence 53:51 between my entire PMF and the first r minus 1 moments 53:55 to which I append the number 1, which is really 54:00 the moment of order 0 again. 54:02 It's E of X to the 0-th, which is 1. 54:05 So good news, I only need r minus 1 parameters 54:10 to describe r minus 1 parameters. 54:12 And I can choose either the values of my PMF. 54:14 Or I can choose the r minus 1 first moments. 54:16 54:20 So the moments tell me something. 54:22 Here, it tells me that if I have a discrete distribution 54:26 with r possible values, I only need 54:28 to compute r minus 1 moments. 54:30 So this is better than the Weierstrass Approximation 54:34 Theorem. 54:34 This tells me exactly how many moments I need to consider. 54:37 And this is for any distribution. 54:39 This is not a distribution that's 54:41 parametrized by one parameter, like the Poisson 54:43 or the binomial or all this stuff. 54:47 This is for any distribution on a finite number of values. 54:50 So hopefully, if I reduce the family of PMFs 54:53 that I'm looking at to a one-parameter family, 54:55 I'm actually going to need to compute much less than r 54:58 minus 1 values. 55:01 But this is actually hopeful. 55:02 It tells you that the method of moments 55:04 is going to work for any distribution. 55:06 You just have to invert a Vandermonde matrix. 55:09 55:13 So just the conclusion-- the statistical conclusion-- 55:17 is that moments contain important information 55:20 about the PMF and the PDF. 55:24 If we can estimate these moments accurately, 55:26 we can solve for the parameters of the distribution 55:30 and recover the distribution. 55:32 And in a parametric setting, where 55:34 knowing P theta amounts to knowing theta, which 55:36 is identifiability-- this is not innocuous-- 55:41 it is often the case that even less moments are needed. 55:44 After all, if theta is a one dimensional parameter, 55:46 I have one parameter to estimate. 55:48 Why would I go and get 25 moments 55:51 to get this one parameter? 55:52 Typically, there is actually-- we 55:54 will see that the method of moments 55:55 just says if you have a d dimensional parameter, 55:58 just compute d moments, and that's it. 56:02 But this is only on a case-by-case basis.
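A small numerical illustration of the inversion argument just made (the support points and PMF below are arbitrary illustrative choices, not from the lecture): build the system whose rows are the moments of orders 0 through r - 1 and solve it back for the PMF.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])     # the r = 4 distinct values the variable can take
p = np.array([0.1, 0.2, 0.3, 0.4])     # some PMF on those values (sums to 1)

# Row k of the system is sum_j x_j^k * p(x_j) = m_k, for k = 0, ..., r-1 (with m_0 = 1).
A = np.vander(x, increasing=True).T    # A[k, j] = x_j ** k, a transposed Vandermonde matrix
m = A @ p                              # moments of order 0, 1, ..., r-1

# The Vandermonde determinant is a product of differences of the x_j's,
# nonzero because the x_j are distinct, so the system can be solved back.
p_recovered = np.linalg.solve(A, m)
print(p_recovered)                     # [0.1 0.2 0.3 0.4]
print(np.linalg.cond(A))               # how much errors in the moments get amplified
```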
56:04 I mean, maybe your model will totally screw up its parameters 56:07 and you actually need to get them. 56:09 I mean, think about it, if the function is parameterized just 56:16 by its 27th moment-- 56:19 like, that's the only thing that matters in this distribution, 56:21 I just describe the function, it's just a density, 56:24 and the only thing that can change from one distribution 56:26 to another is this 27th moment-- 56:28 well, then you're going to have to go get the 27th moment. 56:30 And that probably means that your modeling step was actually 56:33 pretty bad. 56:34 56:37 So the rule of thumb, if theta is in Rd, we need d moments. 56:40 56:46 So what is the method of moments? 56:48 56:52 That's just a good old trick. 56:55 Replace the expectation by averages. 56:58 That's the beauty. 56:59 The moments are expectations. 57:02 So let's just replace the expectations by averages 57:04 and then do it with the average version, 57:07 as if it was the true one. 57:10 So for example, I'm going to talk about population moments, 57:14 when I'm computing them with the true distribution, 57:16 and I'm going to talk about them empirical moments, when 57:18 I talk about averages. 57:22 So those are the two quantities that I have. 57:24 And now, what I hope is that there is. 57:28 So this is basically-- 57:30 everything is here. 57:32 That's where all the money is. 57:33 I'm going to assume there's a function psi that maps 57:36 my parameters-- let's say they're in Rd-- 57:40 to the set of the first d moments. 57:42 57:45 Well, what I want to do is to come from this guy 57:48 back to theta. 57:49 So it better be that this function is-- 57:50 57:54 invertible. 57:55 I want this function to be invertible. 57:57 In the Vandermonde case, this function 57:59 with just a linear function-- multiply a matrix by theta. 58:03 Then inverting a linear function is inverting the matrix. 58:06 Then this is the same thing. 58:08 So now what I'm going to assume-- 58:09 and that's key for this method to work-- is that this theta-- 58:14 so this function psi is one to one. 58:16 There's only one theta that gets only one set of moments. 58:24 And so if it's one to one, I can talk about its inverse. 58:26 And so now, I'm going to be able to define theta 58:28 as the inverse of the moments-- 58:32 the reciprocal of the moments. 58:33 And so now, what I get is that the moment estimator is just 58:37 the thing where rather than taking the true guys in there, 58:42 I'm actually going to take the empirical moments in there. 58:44 58:48 Before we go any further, I'd like 58:50 to just go back and tell you that this is not 58:53 completely free. 58:56 How well-behaved your function psi 58:58 is going to play a huge role. 58:59 59:02 Can somebody tell me what the typical distance-- 59:05 if I have a sample of size n, what 59:06 is the typical distance between an average and the expectation? 59:10 59:12 What is the typical distance? 59:14 What is the order of magnitude as a function of n between xn 59:18 bar and its expectation. 59:23 AUDIENCE: 1 over square root of n. 59:25 PHILIPPE RIGOLLET: 1 over square root n. 59:25 That's what the central limit theorem tells us, right? 59:28 The central limit theorem tells us 59:29 that those things are basically a Gaussian, which 59:31 is of order of 1 divided by its square of n. 59:34 And so basically, I start with something 59:37 which is 1 over square root of n away from the true thing. 
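As a concrete instance of the recipe just defined (replace population moments by empirical moments, then apply psi inverse), here is a sketch with a model of my own choosing, the Gaussian with unknown mean and variance: psi(mu, sigma^2) = (m1, m2) = (mu, mu^2 + sigma^2), which is one-to-one and easy to invert.

```python
import numpy as np

rng = np.random.default_rng(2)
mu, sigma2, n = 1.5, 4.0, 10_000          # illustrative true parameters and sample size
X = rng.normal(mu, np.sqrt(sigma2), size=n)

# Empirical moments: the good old trick, expectations replaced by averages.
m1_hat = X.mean()
m2_hat = (X ** 2).mean()

# Invert psi: mu = m1, sigma^2 = m2 - m1^2.
mu_hat, sigma2_hat = m1_hat, m2_hat - m1_hat ** 2
print(mu_hat, sigma2_hat)   # close to 1.5 and 4.0, up to O(1/sqrt(n)) fluctuations
```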
59:41 Now, if my function psi inverse is super steep like this-- 59:49 that's psi inverse-- then just small fluctuations, even 59:54 if they're of order 1 square root of n, 59:57 can translate into giant fluctuations in the y-axis. 60:04 And that's going to be controlled 60:06 by how steep psi inverse is, which is the same 60:09 as saying how flat psi is-- 60:14 how flat is psi. 60:15 So if you go back to this Vandermonde inverse, 60:20 what it's telling you is that if this inverse matrix blows up 60:26 this guy a lot-- 60:29 so if I start from a small fluctuation of this thing 60:32 and then they're blowing up by applying 60:34 the inverse of this matrix, things 60:36 are not going to go well. 60:37 Anybody knows what is the number that I should be looking for? 60:41 So that's from, say, numerical linear algebra 60:45 numerical methods. 60:47 When I have a system of linear equations, 60:49 what is the actual number I should 60:50 be looking at to know how much I'm 60:53 blowing up the fluctuations? 60:54 Yeah. 60:55 AUDIENCE: Condition number? 60:55 PHILIPPE RIGOLLET: The condition number, right. 60:57 So what's important here is the condition number 60:59 of this matrix. 61:00 If the condition number of this matrix is small, 61:03 then it's good. 61:04 It's not going to blow up much. 61:05 But if the condition number is very large, 61:07 it's just going to blow up a lot. 61:08 And the condition number is the ratio 61:10 of the largest and the smallest eigenvalues. 61:13 So you'll have to know what it is. 61:14 But this is how all these things get together. 61:17 So the numerical stability translates 61:21 into statistical stability here. 61:24 And numerical means just if I had 61:26 errors in measuring the right hand side, 61:28 how much would they translate into errors on the left hand 61:30 side. 61:31 So the error here is intrinsic to statistical questions. 61:33 61:38 So that's my estimator, provided that it exists. 61:42 And I said it's a one to one, so it should exist, 61:45 if I assume that psi is invertible. 61:48 So how good is this guy? 61:51 That's going to be definitely our question-- 61:53 how good is this thing. 61:54 And as I said, there's chances that if psi is really steep, 62:00 then it should be not very good-- 62:02 if psi inverse is very steep, it should not be very good, 62:06 which means that it's-- 62:07 well, let's just leave it to that. 62:11 So that means that I should probably 62:13 see the derivative of psi showing up somewhere. 62:16 If the derivative of psi inverse, say, is very large, 62:19 then I should actually have a larger variance 62:21 in my estimator. 62:22 So hopefully, just like we had a theorem that told us 62:26 that the Fisher information was key in the variance 62:29 of the maximum likelihood estimator, 62:30 we should have a theorem that tells us 62:32 that the derivative of psi inverse 62:33 is going to have a key role in the method of moments. 62:37 So let's do it. 62:38 62:57 So I'm going to talk to you about matrices. 63:01 So now, I have-- 63:02 63:10 So since I have to manipulate d numbers at any given time, 63:15 I'm just going to concatenate them into a vector. 63:17 So I'm going to call capital M theta-- 63:19 so that's basically the population moment. 63:24 And I have M hat, which is just m hat 1 to m hat d. 63:31 And that's my empirical moment. 63:32 63:39 And what's going to play a role is 63:41 what is the variance-covariance of the random vector. 63:45 So I have this vector 1-- 63:49 do I have 1? 63:50 No, I don't have 1. 
63:51 63:59 So that's a d dimensional vector. 64:02 And here, I take the successive powers. 64:04 Remember, that looks very much like a column of my Vandermonde 64:08 matrix. 64:10 So now, I have this random vector. 64:12 It's just the successive powers of some random variable X. 64:15 And the variance-covariance matrix is the expectation-- 64:19 so sigma-- 64:20 of theta. 64:21 The theta just means I'm going to take expectations 64:23 with respect to theta. 64:26 That's the expectation with respect 64:28 to theta of this guy times this guy 64:31 transpose minus the same thing but with the expectation 64:40 inside. 64:41 64:45 Why do I write it like this? 64:46 I have X, X squared, X cubed-- 64:48 64:50 so the expectation of X, X squared, up to X to the d, times the expectation of X, X squared, up to X to the d, transpose. 65:04 Everybody sees what this is? 65:05 So this is a matrix where if I look at the ij-th term of this 65:11 matrix-- 65:13 or let's say, jk-th term, so on row j and column k, 65:20 I have sigma jk of theta. 65:26 And it's simply the expectation of X to the j 65:30 plus k-- well, X to the j times X to the k-- minus the expectation of X to the j times the expectation 65:40 of X to the k. 65:41 65:45 So I can write this as m j plus k of theta minus mj of theta 65:53 times mk of theta. 65:55 66:00 So that's my covariance matrix of this particular vector 66:04 that I defined. 66:06 And now, I'm going to assume that psi inverse-- 66:09 well, if I want to talk about the slope 66:11 in an analytic fashion, I have to assume 66:14 that psi inverse is differentiable. 66:16 And I will talk about the gradient 66:18 of psi inverse, which is, if it's one dimensional, 66:20 just the derivative. 66:22 And here, that's where notation becomes annoying. 66:24 And I'm going to actually just assume 66:26 that-- so now I have a vector. 66:28 But it's a vector of functions, and I 66:30 want to compute those functions at a particular value. 66:32 And the value I'm actually interested in 66:34 is the M of theta parameter. 66:37 So psi inverse goes from the set of moments 66:41 to the set of parameters. 66:43 So when I look at the gradient of this guy, 66:45 it should be a function that takes moments as inputs. 66:48 And where do I want this function to be evaluated? 66:51 At the true moment-- 66:54 at the population moment vector. 66:58 Just like when I computed my Fisher information, 67:00 I was computing it at the true parameter. 67:04 So now, once I compute this guy-- 67:08 so now, why is this a d by d gradient matrix? 67:11 67:15 So I have a gradient vector when I have a function from Rd to R. 67:19 That's the vector of partial derivatives. 67:22 But now, I have a function from Rd to Rd. 67:25 So I have to take the derivative with respect 67:28 to the arrival coordinate and the departure coordinate. 67:32 67:35 And so that's the gradient matrix. 67:39 And now, I have the following properties. 67:41 The first one is that the law of large numbers 67:46 tells me that theta hat is a weakly or strongly consistent 67:52 estimator. 67:54 So either I use the strong law of large numbers 67:56 or the weak law of large numbers, 67:57 and I get strong or weak consistency. 68:01 So what does that mean? 68:02 Why is that true? 68:03 Well, because now I really have the function-- 68:12 so what is my estimator? 68:13 Theta hat is psi inverse of m hat 1 to m hat d. 68:23 Now, by the law of large numbers-- 68:26 let's look only at the weak one. 68:28 The law of large numbers tells me that each of the m hat j's 68:35 is going to converge in probability, as n 68:38 goes to infinity, to the-- so the empirical moments 68:40 converge to the population moments.
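As a quick sanity check on the identity sigma jk = m of j plus k minus mj times mk, here is a sketch that estimates Sigma from empirical moments and compares it with the empirical covariance of the vector of powers. The standard Gaussian sample, the choice d = 3, and the use of numpy are assumptions made only for illustration.

import numpy as np

# Sanity check of sigma_jk = m_{j+k} - m_j * m_k on a simulated sample.
# The N(0,1) sample and d = 3 are arbitrary choices for illustration.
rng = np.random.default_rng(1)
d = 3
x = rng.normal(size=100_000)

# Empirical moments m_1, ..., m_{2d}; entry m[i] stands for m_{i+1}.
m = np.array([np.mean(x**j) for j in range(1, 2 * d + 1)])

# Sigma[j, k] = m_{(j+1)+(k+1)} - m_{j+1} * m_{k+1}, in 0-based indexing.
Sigma = np.array([[m[j + k + 1] - m[j] * m[k] for k in range(d)] for j in range(d)])

# It matches the empirical covariance of the vector (X, X^2, ..., X^d).
powers = np.column_stack([x**j for j in range(1, d + 1)])
print(np.allclose(Sigma, np.cov(powers, rowvar=False, bias=True)))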
68:44 That's what the good old trick is using, 68:48 the fact that the empirical moments 68:49 are close to the true moments as n becomes larger. 68:52 And that's because, well, just because the m hat j's 68:55 are averages, and the law of large numbers 68:57 works for averages. 68:59 So now, on top of that, if I use the continuous mapping theorem-- 69:04 I have that psi inverse is continuously differentiable. 69:10 So it's definitely continuous. 69:12 And so what I have is that psi inverse of m hat 69:16 1 to m hat d converges to psi inverse of m1 to md, which 69:28 is equal to theta star. 69:33 So that's theta star. 69:35 By definition, we assumed that that was the unique one that 69:37 was actually doing this. 69:40 Again, this is a very strong assumption. 69:43 I mean, it's basically saying, if the method of moments works, 69:46 it works. 69:47 So the fact that psi is one to one 69:51 is really the key to making this guy work. 69:55 And then I also have a central limit theorem. 69:57 And the central limit theorem is basically 70:00 telling me that M hat is converging to M even 70:04 in the multivariate sense. 70:06 So if I look at the vector M hat and the true vector M, 70:09 then I actually make them go-- 70:11 I look at the difference, scaled by square root of n. 70:14 It goes to some Gaussian. 70:15 And usually, we would see-- if it was one dimensional, 70:18 we would see the variance. 70:19 Here, we see the variance-covariance matrix. 70:22 Who has never seen the-- well, nobody answers this question. 70:25 Who has already seen the multivariate central limit 70:28 theorem? 70:28 70:31 Who has never seen the multivariate central limit 70:33 theorem? 70:35 So the multivariate central limit theorem 70:37 is basically just a slight extension 70:41 of the univariate one. 70:43 It just says that if I want to think-- 70:48 so the univariate one would tell me something like this-- 70:51 71:05 square root of n times m hat j minus mj goes to some Gaussian with mean 0. 71:06 And then I would have basically the variance of X to the j-th. 71:18 So that's what the central limit theorem tells me. 71:22 This is an average. 71:23 71:29 So this is just for averages. 71:31 The central limit theorem tells me this. 71:33 Just think of X to the j-th as being Y. 71:36 And that would be true. 71:37 Everybody agrees with me? 71:40 So now, this is actually telling me 71:41 what's happening for all these guys individually. 71:45 But what happens when those guys start to correlate together? 71:48 I'd like to know if they actually correlate 71:51 the same way asymptotically. 71:53 And so if I actually looked at the covariance matrix 71:56 of this vector-- 71:57 72:03 so now, I need to look at a matrix which is d by d-- 72:07 then would those univariate central limit theorems 72:10 tell me-- 72:12 so let me write it like this, double bar. 72:16 So that's just the covariance matrix. 72:19 This notation, V double bar, is the variance-covariance matrix. 72:23 So what this thing tells me-- so I know this thing 72:26 is a matrix, d by d. 72:30 Those univariate central limit theorems only 72:31 give me information about the diagonal terms. 72:36 But here, I have no idea what the rest of the covariance matrix is. 72:40 This guy is telling me, for example, that this diagonal term is 72:46 like the variance of X to the j-th. 72:49 But what if I want to find off-diagonal elements 72:51 of this matrix? 72:53 Well, I need to use a multivariate central limit 72:55 theorem.
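Here is a small simulation sketch of that multivariate central limit theorem for empirical moments. The standard Gaussian model, d = 2, and the use of numpy are assumptions for illustration: the empirical covariance of square root of n times (M hat minus M), over many replications, should be close to Sigma of theta, which for this model is diag(1, 2).

import numpy as np

# Simulation sketch of the multivariate CLT for empirical moments.
# Assumed model: X ~ N(0,1) with d = 2, so M = (E[X], E[X^2]) = (0, 1) and
# Sigma has entries m_{j+k} - m_j m_k, i.e. [[1, 0], [0, 2]].
rng = np.random.default_rng(2)
d, n, reps = 2, 1_000, 5_000
M = np.array([0.0, 1.0])

errors = np.empty((reps, d))
for r in range(reps):
    x = rng.normal(size=n)
    M_hat = np.array([np.mean(x**j) for j in range(1, d + 1)])
    errors[r] = np.sqrt(n) * (M_hat - M)

# Empirical covariance of sqrt(n) (M_hat - M); should be close to [[1, 0], [0, 2]].
print(np.cov(errors, rowvar=False))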
72:56 And really what it's telling me is that you can actually 72:58 replace this guy here-- 73:00 73:10 so that goes in distribution to some normal with mean 0, again. 73:14 And now, what I have is just sigma 73:17 of theta, which is just the covariance matrix 73:22 of this vector X, X squared, X cubed, all the way to X to the d. 73:26 And that's it. 73:27 So that's a multivariate Gaussian. 73:28 Who has never seen a multivariate Gaussian? 73:33 Please, just go on Wikipedia or something. 73:35 There's not much to know about it. 73:37 But I don't have time to redo probability here. 73:40 So we're going to have to live with it. 73:43 Now, to be fair, if your goal is not 73:46 to become a statistical savant, we 73:48 will stick to univariate questions 73:52 in the scope of homework and exams. 74:01 So now, what was the delta method telling me? 74:09 It was telling me that if I had a central limit theorem that 74:13 told me that theta hat was going to theta, 74:16 or square root of n times theta hat minus theta 74:17 was going to some Gaussian, then I 74:19 could look at square root of n times g of theta hat minus g of theta. 74:23 And this thing was also going to a Gaussian. 74:25 But what had to be there is the square 74:27 of the derivative of g in the variance. 74:32 So the delta method, it was just a way 74:35 to go from square root of n times theta 74:38 hat n minus theta goes to some N, say (0, sigma squared), to-- 74:46 so the delta method was telling me that square root 74:50 of n times g of theta hat n minus g of theta 74:56 was going in distribution to N(0, sigma squared 75:01 g prime squared of theta). 75:04 75:07 That was the delta method. 75:09 Now, here, we have a function of those guys. 75:12 The central limit theorem, even the multivariate one, 75:15 is only guaranteeing something for me regarding the moments. 75:20 But now, I need to map the moments back into some theta, 75:23 so I have a function of the moments. 75:26 And there is something called the multivariate delta 75:31 method, where derivatives are replaced by gradients. 75:35 Like, they always are in multivariate calculus. 75:39 And rather than multiplying, since things do not commute, 75:43 rather than choosing which side I want to put the square on, 75:46 I'm actually just going to take half of the square on one side 75:49 and the other half of the square on the other side. 75:51 So the way you should view this, you 75:53 should think of sigma squared times g prime squared 75:59 as being g prime of theta times sigma 76:02 squared times g prime of theta. 76:06 And now, this is completely symmetric. 76:08 And the multivariate delta method 76:14 is basically telling you that you get the gradient here. 76:20 So you start from something that's 76:21 like that over there, a sigma-- 76:24 so that's my sigma squared, think of sigma squared. 76:26 And then I premultiply by the gradient and postmultiply 76:29 by the gradient. 76:30 The first one is transposed. 76:31 The second one is not. 76:33 But that's a very straightforward extension. 76:36 You don't even have to understand it. 76:37 Just think of what would be the natural generalization. 76:41 Here, by the way, I wrote explicitly 76:44 what the gradient of a multivariate function is. 76:48 So that's a function that goes from Rd to Rk. 76:53 So now, the gradient is a d by k matrix. 76:56 76:58 And so now, for this guy, we can do it 77:00 for the method of moments. 77:01 And we can see that basically we're 77:03 going to have this scaling that depends 77:04 on the gradient of the reciprocal of psi, which 77:08 is natural.
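Here is a one-dimensional sketch of this delta method applied to the method of moments, continuing the assumed Exponential(lambda) example from earlier (again an assumption, not the lecture's example): psi inverse is m -> 1/m, its derivative at m1 is minus lambda squared, and the variance of X is 1 over lambda squared, so the asymptotic variance works out to lambda squared.

import numpy as np

# One-dimensional delta method for the method of moments, on the assumed
# Exponential(lambda) example: psi^{-1}(m) = 1/m, (psi^{-1})'(m1) = -lambda^2,
# Var(X) = 1/lambda^2, so sqrt(n) (lambda_hat - lambda) -> N(0, lambda^2).
rng = np.random.default_rng(3)
lam, n, reps = 2.0, 2_000, 10_000

lam_hats = np.array([1.0 / rng.exponential(scale=1.0 / lam, size=n).mean()
                     for _ in range(reps)])

# Empirical variance of sqrt(n) (lambda_hat - lambda); should be close to lam**2 = 4.
print(np.var(np.sqrt(n) * (lam_hats - lam)))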
77:08 Because if psi is super steep-- if psi inverse is super steep, 77:13 then the gradient is going to be huge, 77:14 which is going to translate into having a huge variance 77:17 for the method of moments. 77:18 77:21 So this is actually the end. 77:24 I would like to encourage you-- and we'll probably do it 77:26 on Thursday just to start. 77:27 But I encourage you to do it in one dimension, 77:30 so that you know how to use the method of moments, 77:35 you know how to do a bunch of things. 77:37 Do it in one dimension and see how you can check those things. 77:40 So just as a quick comparison, in terms of the quadratic risk, 77:43 the maximum likelihood estimator is typically 77:46 more accurate than the method of moments. 77:50 What is pretty good to do is, when 77:51 you have a non-concave likelihood 77:54 function, what people like to do is 77:56 to start with the method of moments as an initialization 77:58 and then run some algorithm that locally optimizes 78:01 the likelihood starting from this point, 78:03 because it's actually likely to be closer. 78:05 And then the MLE is going to improve it 78:07 a little bit by pushing the likelihood a little higher. 78:12 So of course, the maximum likelihood 78:13 is sometimes intractable. 78:14 Whereas, computing moments is fairly doable. 78:18 If the likelihood is concave, as I said, 78:20 we can use optimization algorithms, 78:21 such as interior-point methods or gradient descent, 78:24 I guess, to maximize it. 78:25 And if the likelihood is non-concave, 78:26 we only have local heuristics. 78:28 And that's what I meant-- 78:29 you only have local maxima. 78:31 And one trick you can do-- 78:32 so your likelihood looks like this, 78:37 and it might be the case that if you have a lot of those peaks, 78:42 you basically have to start your algorithm in each 78:44 of those peaks. 78:46 But the method of moments can actually 78:48 start you in the right peak, and then you 78:50 just move up by doing some local algorithm 78:53 for maximum likelihood. 78:55 So that's not key. 78:56 But that's just if you want to think about, algorithmically, 78:59 how I would end up doing this and how I can combine the two. 79:03 So I'll see you on Thursday. 79:04 Thank you. 79:06
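As a rough sketch of the "method of moments as initialization, then local likelihood optimization" recipe mentioned at the end, here is one possible way to do it on an assumed Gamma(shape, scale) model; scipy, the Nelder-Mead optimizer, and the example itself are assumptions, not something from the lecture.

import numpy as np
from scipy import optimize, stats

# Sketch: method-of-moments start, then local maximization of the likelihood.
# The Gamma(shape=a, scale=s) model and all names here are assumptions for
# illustration, not the lecture's example.
rng = np.random.default_rng(4)
x = rng.gamma(shape=3.0, scale=2.0, size=5_000)

# Method of moments: mean = a * s and variance = a * s^2, so
# a0 = mean^2 / var and s0 = var / mean.
mean, var = x.mean(), x.var()
a0, s0 = mean**2 / var, var / mean

def neg_log_lik(params):
    a, s = params
    if a <= 0 or s <= 0:
        return np.inf                       # keep the search in the valid region
    return -np.sum(stats.gamma.logpdf(x, a, scale=s))

# Local optimization of the likelihood, started at the method-of-moments point.
res = optimize.minimize(neg_log_lik, x0=[a0, s0], method="Nelder-Mead")
print((a0, s0), res.x)                      # moment start vs. locally refined MLE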