https://www.youtube.com/watch?v=QXkOaifVfW4&list=PLUl4u3cNGP60uVBMaoNERc6knT_MgPKS0&index=9 Transcript 00:00 The following content is provided under a Creative 00:02 Commons license. 00:03 Your support will help MIT OpenCourseWare 00:06 continue to offer high quality educational resources for free. 00:10 To make a donation or to view additional materials 00:12 from hundreds of MIT courses, visit MIT OpenCourseWare 00:16 at ocw.mit.edu. 00:17 00:21 PHILIPPE RIGOLLET: So again, before we start, 00:23 there is a survey online if you haven't done so. 00:27 I would guess at least one of you has not. 00:30 Some of you have entered their answers and their thoughts, 00:33 and I really appreciate this. 00:35 It's actually very helpful. 00:36 So it seems that the course is going fairly well 00:40 from what I've read so far. 00:42 So if you don't think this is the case, 00:43 please enter your opinion and tell us 00:45 how we can make it better. 00:47 One of the things that was said is 00:48 that I speak too fast, which is absolutely true. 00:53 I just can't help it. 00:54 I get so excited, but I will really do my best. 00:59 I will try to. 01:02 I think I always start OK. 01:04 I just end not so well. 01:07 So last time we talked about this chi squared distribution, 01:10 which is just another distribution that's 01:13 so common that it deserves its own name. 01:16 And this is something that arises 01:17 when we sum the squares of independent standard Gaussian 01:22 random variables. 01:23 And in particular, why is that relevant? 01:25 It's because if I look at the sample variance, 01:27 then, properly rescaled, it has a chi square distribution, 01:29 and the parameter that shows up, also 01:32 known as the degrees of freedom, is the number 01:35 of observations minus one. 01:37 And so as I said, this chi squared 01:39 distribution has an explicit probability density function, 01:43 and I tried to draw it. 01:44 And one of the comments was also about my handwriting, 01:47 so I will actually not rely on it for detailed things. 01:52 So this is what the chi squared with one degree of freedom 01:54 would look like. 01:55 And really, what this is is just the distribution of the square 01:57 of a standard Gaussian. 01:58 I'm summing only one, so that's what it is. 02:01 Then when I go to 2, this is what it is-- 02:03 3, 4, 5, 6, and 10. 02:07 And as I move, you can see this thing 02:08 is becoming flatter and flatter, and it's pushing to the right. 02:11 And that's because I'm summing more and more squares, 02:14 and in expectation we just get one every time. 02:18 So it really means that the mass is moving to infinity. 02:23 In particular, a chi squared distribution 02:26 with n degrees of freedom is going to infinity 02:29 as n goes to infinity. 02:32 Another distribution that I asked 02:35 you to think about-- anybody looked around 02:38 about the student t-distribution, what 02:39 the history of this thing was? 02:42 So I'll tell you a little bit. 02:44 I understand if you didn't have time. 02:46 So the t-distribution is another common distribution 02:50 that is so common that it will be used 02:53 and will have its table of quantiles that are 02:56 drawn at the back of the book. 02:59 Now, remember, when I mentioned the Gaussian, I said, 03:02 well, there are several values for alpha 03:04 that we're interested in. 03:06 And so I wanted to draw a table for the Gaussian. 03:11 We had something that looked like this, 03:13 and I said, well, q alpha over 2 to get alpha over 2 03:21 to the right of this number.
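A minimal sketch of that little Gaussian table, assuming SciPy is available: for each common value of alpha, the quantile q alpha over 2 that leaves mass alpha over 2 to its right under N(0, 1).

```python
# Minimal sketch, assuming SciPy: the Gaussian "table" is just one short list,
# the quantile q_{alpha/2} leaving mass alpha/2 to its right under N(0, 1).
from scipy import stats

for alpha in [0.01, 0.05, 0.10]:
    q = stats.norm.ppf(1 - alpha / 2)
    print(f"alpha = {alpha:.2f}   q_(alpha/2) = {q:.3f}")
# alpha = 0.01 -> 2.576, alpha = 0.05 -> 1.960, alpha = 0.10 -> 1.645
```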
03:22 And we said that there is a table for these things, 03:25 for common values of alpha. 03:28 Well, if you try to envision what this table will look like, 03:31 it's actually a pretty sad table, 03:34 because it's basically one list of numbers. 03:35 Why would I call it a table? 03:37 Because all I need to tell you is 03:38 something that looks like this. 03:40 If I tell you this is alpha and this is q alpha over 2 03:43 and then I say, OK, basically the three alphas 03:47 that I told you I care about are something like 1%, 5%, and 10%, 03:54 then my table will just give me q alpha over 2. 03:57 So that's alpha, and that's q alpha over 2. 03:59 And that's going to tell me that-- 04:01 I don't remember this one, but this guy is 1.96. 04:04 This guy is something like 2.45. 04:08 I think this one is like 1.65 maybe. 04:11 And maybe you can be a little finer, 04:15 but it's not going to be an entire page 04:16 at the back of the book. 04:18 And the reason is because I only need 04:19 to draw these things for the standard Gaussian 04:22 when the parameters are 0 for the mean 04:24 and 1 for the variance. 04:26 Now, if I'm actually doing this for the chi squared, 04:30 I basically have to give you one table per value 04:34 of the degrees of freedom, because those things 04:37 are different. 04:38 There is no way I can take-- 04:41 for Gaussians, if you give me a different mean, 04:43 I can subtract it and make it back to be a standard Gaussian. 04:46 For the chi squared, there is no such thing. 04:49 There is nothing that just takes 04:50 the chi squared with d degrees of freedom 04:53 and turns it into, say, a chi square 04:54 with one degree of freedom. 04:56 This just does not happen. 04:58 So the word is standardize-- 05:01 make it a standard chi squared. 05:02 There is no such thing as a standard chi squared. 05:04 So what it means is that I'm going 05:05 to need one row like that for each value of the number 05:09 of degrees of freedom. 05:11 So that will certainly fill a page at the back of a book-- 05:14 maybe even more. 05:16 I need one per sample size. 05:18 So if I want to go from sample size 1 to 1,000, 05:21 I need 1,000 rows. 05:24 So now the student distribution is 05:26 one that arises where it looks very much like the Gaussian 05:30 distribution, and there's a very simple reason for that, is 05:33 that I take a standard Gaussian and I divide it by something. 05:37 That's how I get the student. 05:39 What do I divide it with? 05:40 Well, I take an independent chi square-- 05:42 I'm going to call it v-- 05:44 and I want it to be independent from z. 05:47 And I'm going to divide z by root v over d. 05:52 So I start with a chi squared, v. 05:55 So this guy is chi squared d. 05:58 I start with z, which is n 0, 1. 06:02 I'm going to assume that those guys are independent. 06:06 In my t-distribution, I'm going to write 06:08 a T. Capital T is z divided by the square root of v over d. 06:17 Why would I want to do this? 06:18 Well, because this is exactly what 06:20 happens when I divide a Gaussian not by the true variance, 06:25 but by its empirical variance. 06:28 So let's see why in a second. 06:30 So I know that if you give me some random variable-- 06:34 let's call it x, which is N mu sigma squared-- 06:38 then I can do this. 06:40 x minus mu divided by sigma. 06:45 I'm going to call this thing z, because this thing actually 06:47 has some standard Gaussian distribution.
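A hedged sketch of those two points, assuming NumPy and SciPy: the chi squared quantile genuinely depends on the degrees of freedom d, and simulating T = Z / sqrt(V / d), with Z standard Gaussian independent of V chi squared with d degrees of freedom, reproduces the t-distribution with d degrees of freedom.

```python
# Sketch, assuming NumPy/SciPy: chi-squared quantiles need d, and the ratio
# Z / sqrt(V / d) matches scipy's t-distribution with d degrees of freedom.
import numpy as np
from scipy import stats

for d in [1, 2, 5, 10, 100]:
    print(d, stats.chi2.ppf(0.95, df=d))      # no standardization: depends on d

rng = np.random.default_rng(0)
d, reps = 5, 200_000
z = rng.standard_normal(reps)                                 # Z ~ N(0, 1)
v = stats.chi2.rvs(df=d, size=reps, random_state=rng)         # V ~ chi2_d, independent of Z
t = z / np.sqrt(v / d)
print(stats.kstest(t, stats.t(df=d).cdf).statistic)           # small: the ratio is t_d
```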
06:51 I have standardized x into something 06:54 that I can read the quintiles at the back of the book. 06:58 So that's this process that I want to do. 07:00 Now, to be able to do this, I need to know what mu is, 07:03 and I need to know what sigma is. 07:05 Otherwise I'm not going to be able to make this operation. 07:09 mu I can sort of get away with, because remember, 07:13 when we're doing confidence intervals 07:15 we're actually solving for mu. 07:17 So it was good that mu was there. 07:20 When we're doing hypothesis testing, 07:22 we're actually plugging in here the mu that shows up in h0. 07:26 So that was good. 07:27 We had this thing. 07:28 Think of mu as being p, for example. 07:31 But this guy here, we don't necessarily know what it is. 07:36 I just had to tell you for the entire first chapter, 07:40 assume you have Gaussian random variables 07:41 and that you know what the variance is. 07:44 And the reason why I said assume you 07:45 know it-- and I said sometimes you can read it 07:47 on the side of the box of measuring equipment in the lab. 07:52 That was just the way I justified it, 07:54 but the real reason why I did this is because I would not 07:57 be able to perform this operation if I actually did not 08:00 know what sigma was. 08:02 But from data, we know that we can form this estimator 08:07 Sn, which is 1 over n, sum from i equals 1 to n 08:11 of Xi, minus X bar squared. 08:15 And this thing is approximately equal to sigma squared. 08:18 That's the sample variance, and it's actually 08:21 a good estimator just by the law of large number, actually. 08:25 This thing, by the law of large number, as n goes to infinity-- 08:29 08:32 well, let's say it in probability 08:34 goes to sigma squared by the law of large number. 08:36 So it's a consistent estimator of sigma squared. 08:40 So now, what I want to do is to be 08:43 able to use this estimator rather than using sigma. 08:46 And the way I'm going to do it is 08:47 I'm going to say, OK, what I want to form 08:50 is x minus mu divided by Sn this time. 08:58 I don't know what the distribution of this guy is. 09:01 Sorry, it's square root of Sn. 09:02 This is sigma squared. 09:05 So this is what I would take. 09:07 And I could think of Slutsky, maybe, 09:10 something like this that would tell me, well, just use that 09:14 and pretend it's a Gaussian. 09:15 And we'll see how actually it's sort 09:18 of valid to do that, because Slutsky tells us 09:20 it is valid to do that. 09:22 But what we can also do is to say, 09:24 well, this is actually equal to x minus mu, divided by sigma, 09:28 which I knew what the distribution of this guy is. 09:31 And then what I'm going to do is I'm going to just-- 09:33 well, I'm going to cancel this effect, sigma over square root 09:38 Sn. 09:39 So I didn't change anything. 09:41 I just put the sigma here. 09:43 So now what I know what I know is that this is some z, 09:47 and it has some standard Gaussian distribution. 09:51 What is this guy? 09:54 Well, I know that Sn-- 09:57 we wrote this here. 09:59 Maybe I shouldn't have put those pictures, 10:01 because now I keep on skipping before and after. 10:04 We know that Sn times n divided by sigma squared 10:14 is actually chi squared n minus 1. 10:18 10:22 So what do I have here? 10:23 I have that chi squared-- 10:25 so here I have something that looks like 1 over square root 10:29 of Sn divided by sigma squared. 10:32 10:35 This is what this guy is if I just do some more writing. 
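Before continuing the algebra, a quick numerical check of that last fact, assuming NumPy and SciPy and Gaussian data with made-up mean and variance: n times S_n over sigma squared should behave like a chi squared with n minus 1 degrees of freedom.

```python
# Sketch, assuming NumPy/SciPy: for Gaussian data, n * S_n / sigma^2 ~ chi2_{n-1},
# where S_n is the (biased) sample variance that divides by n.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n, mu, sigma, reps = 8, 2.0, 3.0, 100_000
x = rng.normal(mu, sigma, size=(reps, n))
s_n = ((x - x.mean(axis=1, keepdims=True)) ** 2).mean(axis=1)   # biased sample variance S_n
stat = n * s_n / sigma ** 2
print(stat.mean())                                              # close to n - 1 = 7
print(stats.kstest(stat, stats.chi2(df=n - 1).cdf).statistic)   # small: chi2_{n-1} fits
```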
10:38 And maybe I actually want to make my life a little easier. 10:41 I'm actually going to plug in my n here, 10:45 and so I'm going to have to multiply by square root of n 10:48 here. 10:49 10:56 Everybody's with me? 10:59 So now what I end up with is something that 11:01 looks like this, where I have-- 11:06 here I started with x. 11:07 11:15 I should really start with Xn bar minus mu times 11:19 square root of n. 11:21 That's what the central limit theorem would tell me. 11:24 I need to work with the average rather than just one 11:26 observation. 11:27 So if I start with this, then I pick up a square root of n 11:30 here. 11:30 11:43 So if I had the sigma here, I would know 11:45 that this thing is actually-- 11:47 Xn bar minus mu divided by sigma times the square root of n 11:54 would be a standard Gaussian. 11:56 So if I put Xn bar here, I really 11:58 need to put this thing that goes around the Xn bar. 12:00 12:04 That's just my central limit theorem 12:06 that says if I average, then my variance has shrunk by a factor 12:10 1 over n. 12:12 Now, I can still do this. 12:15 That was still fine. 12:16 And now I said that this thing is basically this guy. 12:26 So what I know is that this thing 12:28 is a chi squared with n minus 1 degrees of freedom, 12:32 so this guy here is chi squared with n 12:37 minus 1 degrees of freedom. 12:40 Let me call this thing v in the spirit of what was used there 12:44 and in the spirit of what is written here. 12:49 So this guy was called v, so I'm going to call this v. 12:53 So what I can write is that square root of n Xn 12:57 bar minus mu divided by square root of Sn 13:02 is equal to z times square root of n 13:10 divided by square root of v. Everybody's with me here? 13:20 13:23 Which I can rewrite as z times square root of v divided by n 13:37 And if you look at what the definition of this thing is, 13:40 I'm almost there. 13:41 What is the only thing that's wrong here? 13:45 This is a student distribution, right? 13:48 So there's two things. 13:49 The first one was that they should be independent, 13:51 and they actually are independent. 13:53 That's what Cochran's theorem tells me, 13:55 and you just have to count on me for this. 13:57 I told you already that Sn was independent of Xn bar. 14:01 So those two guys are independent, 14:04 which implies that the numerator and denominator here 14:07 are independent. 14:08 That's what Cochran's theorem tells us. 14:12 But is this exactly what I should 14:14 be seeing if I wanted to have my sample variance, if I 14:17 want to have to write this? 14:19 Is this actually the definition of a student distribution? 14:23 Yes? 14:25 No. 14:25 14:28 So we see z divided by square root of v over d. 14:33 That looks pretty much like it, except there's 14:35 a small discrepancy. 14:36 What is the discrepancy? 14:38 14:47 There's just the square root of n minus 1 thing. 14:50 So here, v has n minus 1 degrees of freedom. 14:55 And in the definition, if the v has d degrees of freedom, 14:58 I divide it by d, not by d minus 1 or not by d plus 1, actually, 15:04 in this case. 15:06 So I have this extra thing. 15:07 Well, there's two ways I can address this. 15:09 15:13 The first one is by saying, well, 15:14 this is actually equal to z over square root 15:18 of v divided by n minus 1 times square root of n 15:27 over n minus 1. 15:28 15:32 I can always do that and say for n large enough 15:35 this thing is actually going to be pretty small, 15:37 or I can take account for it. 
15:39 Or for any n you give me, I can compute this number. 15:43 And so rather than having a t-distribution, 15:45 I'm going to have a t-distribution times 15:47 this deterministic number, which is just 15:49 a function of my number of observations. 15:52 But what I actually want to do instead 15:55 is probably use a slightly different normalization, 16:00 which is just to say, well, why do I have to define Sn-- 16:04 16:10 where was my Sn? 16:11 Yeah, why do I have to define Sn to be divided by n? 16:14 Actually, this is a biased estimator, 16:17 and if I wanted to be unbiased, I can actually just 16:20 put an n minus 1 here. 16:22 You can check that. 16:23 You can expand this thing and compute the expectation. 16:25 You will see that it's actually not sigma squared, 16:27 but n minus 1 over n sigma squared. 16:31 So you can actually just make it unbiased. 16:33 Let's call this guy tilde, and then 16:35 when I put this tilde here what I actually get is s tilde here 16:43 and s tilde here. 16:46 I need actually to have n minus 1 here 16:49 to have this s tilde be a chi squared distribution. 16:55 Yes? 16:56 AUDIENCE: [INAUDIBLE] defined this way so that you-- 17:02 PHILIPPE RIGOLLET: So basically, this is what the story did. 17:04 So the story was, well, rather than using always 17:08 the central limit theorem and just pretending 17:10 that my Sn is actually the true sigma squared, 17:13 since this is something I'm going to do a lot, 17:16 I might as well just compute the distribution, 17:19 like the quantiles for this particular distribution, 17:21 which clearly does not depend on any unknown parameter. 17:24 d is the only parameter that shows up here, 17:27 and it's completely characterized 17:28 by the number of observations that you have, 17:30 which you definitely know. 17:32 And so people said, let's just be slightly more accurate. 17:35 And in a second, I'll show you how the distribution of the T-- 17:38 so we know that if the sample size is large enough, 17:41 this should not have any difference with the Gaussian 17:43 distribution. 17:44 I mean, those two things should be 17:45 the same because we've actually not paid 17:48 attention to this discrepancy by using empirical variance rather 17:51 than true so far. 17:52 And so we'll see what the difference is, 17:55 and this difference actually manifests itself only 17:57 in small sample sizes. 17:59 So those are things that matter mostly 18:02 if you have less than, say, 50 observations. 18:04 Then you might want to be slightly more precise 18:06 and use t-distribution rather than Gaussian. 18:08 So this is just a matter of being slightly more precise. 18:12 If you have more than 50 observations, 18:14 just drop everything and just pretend 18:15 that this is the true one. 18:17 18:19 Any other questions? 18:22 So now I have this thing, and so I'm 18:25 on my way to changing this guy. 18:27 So here now, I have not root n but root n minus 1. 18:31 18:47 So I have a z. 18:48 So this guy here is S. Wait, where did I get my root 18:55 n from in the first place? 18:56 19:00 Yeah, because I wanted this guy. 19:02 And so now what I am left with is Xn bar minus mu 19:05 divided by the square root of Sn tilde, which is the new one, which is now 19:08 indeed of the form z divided by the square root of v over n minus 1, which is 19:14 exactly the form z over square root of v over d with d equals n minus 1. 19:16 And so now I have exactly what I want, 19:22 and so this guy is n 0, 1. 19:25 And this guy is chi squared with n minus 1 degrees of freedom. 19:30 And so now I'm back to what I want.
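A small simulation sketch of where this lands, assuming NumPy and SciPy and made-up Gaussian data: sqrt(n) times (Xn bar minus mu) over the square root of the unbiased sample variance matches the t-distribution with n minus 1 degrees of freedom, and for small n its quantiles are visibly heavier-tailed than the Gaussian ones.

```python
# Sketch, assuming NumPy/SciPy: the studentized mean of Gaussian data follows
# t_{n-1}, and t quantiles exceed the Gaussian 1.96 for small degrees of freedom.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
n, mu, sigma, reps = 10, 5.0, 2.0, 100_000
x = rng.normal(mu, sigma, size=(reps, n))
t_stat = np.sqrt(n) * (x.mean(axis=1) - mu) / np.sqrt(x.var(axis=1, ddof=1))  # ddof=1: divide by n-1
print(stats.kstest(t_stat, stats.t(df=n - 1).cdf).statistic)    # small: matches t_{n-1}

# Heavier tails than the Gaussian, hence wider confidence intervals for small n:
for d in [1, 2, 5, 10, 40, 50]:
    print(d, round(stats.t.ppf(0.975, df=d), 3), round(stats.norm.ppf(0.975), 3))
```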
19:33 So rather than using Sn to be the empirical variance where 19:37 I just divide my normalization by n, if I use n minus 1, 19:41 I'm perfect. 19:42 Of course, I can still use n and do this multiplying 19:45 by root n minus 1 over n at the end. 19:47 But that just doesn't make as much sense. 19:49 19:52 Everybody's fine with what this T n distribution is doing 19:54 and why this last line is correct? 19:58 So that's just basically because it's 20:01 been defined so that this is actually happening. 20:04 That was your question, and that's really what happened. 20:07 So what is this student t-distribution? 20:11 Where does the name come from? 20:13 Well, it does not come from Mr. T. And if you know who Mr. 20:18 T was-- you're probably too young for that-- 20:20 he was our hero in the 80s. 20:23 And it comes from this guy. 20:26 His name is William Sealy Gosset-- 20:29 1908. 20:29 So that was back in the day. 20:31 And this guy actually worked at the Guinness Brewery 20:33 in Dublin, Ireland. 20:35 And Mr. Guinness back then was a bit of a fascist, 20:38 and he didn't want him to actually publish papers. 20:41 And so what he had to do is to use a fake name to do that. 20:45 And he was not very creative, and he used the name "Student." 20:50 Because I guess he was a student of life. 20:52 And so here's the guy, actually. 20:55 So back in 1908, it was actually not 20:57 difficult to put your name or your pen name 21:01 on a distribution. 21:03 So what does this thing look like? 21:05 How does it compare to the standard normal distribution? 21:09 You think it's going to have heavier or lighter tails 21:12 compared to the standard distribution, 21:13 the Gaussian distribution? 21:17 Yeah, because they have extra uncertainty in the denominator, 21:21 so it's actually going to make things wiggle a little wider. 21:25 So let's start with a reference, which 21:26 is the standard normal distribution. 21:29 So that's my usual bell-shaped curve. 21:31 And this is actually the t-distribution 21:33 with 50 degrees of freedom. 21:35 So right now, that's probably where you should just 21:37 stand up and leave, because you're like, 21:39 why are we wasting our time? 21:40 Those are actually pretty much the same thing, and it is true. 21:43 If you have 50 observations, both the central limit 21:46 theorem-- so here one of the things that you need to know 21:49 is that if I want to talk about t-distribution for, say, eight 21:54 observations, I need those observations to be Gaussian 21:57 for real. 21:57 There's no central limit theorem happening 21:59 at eight observations. 22:00 But really, what this is telling me 22:02 is not that the central limit theorem kicks in. 22:04 It's telling me what are the asymptotics that kick in? 22:07 22:13 The law of large numbers, right? 22:15 This is exactly this guy. 22:19 That's here. 22:21 When I write this statement, what this picture is really 22:24 telling us is that for n is equal to 50, I'm at the limit 22:28 already almost. 22:29 There's virtually no difference between using 22:32 the left-hand side or using sigma squared. 22:36 And now I start reducing. 22:38 40, I'm still pretty good. 22:39 We can start seeing that this thing is actually 22:41 losing some mass on top, and that's 22:43 because it's actually pushing it to the left 22:44 and to the right in the tails. 22:46 And then we keep going, keep going, keep going. 22:49 So that's at 10. 22:50 When you're at 10, there's not much of a difference.
22:53 And so you can start seeing difference 22:54 when you're at five, for example. 22:57 You can see the tails become heavier. 22:59 And the effect of this is that when I'm going to build, 23:01 for example, a confidence interval to put the same amount 23:05 of mass to the right of some number-- 23:07 let's say I'm going to look at this q alpha over 2-- 23:09 I'm going to have to go much farther, which 23:11 is going to result in much wider confidence intervals 23:17 to 4, 3, 2, 1. 23:20 So that's the t1. 23:22 Obviously that's the worst. 23:24 And if you ever use the t1 distribution, 23:30 please ask yourself, why in the world are you doing statistics 23:33 based on one observation? 23:35 23:38 But that's basically what it is. 23:41 So now that we have this t-distribution, 23:44 we can define a more sophisticated test 23:48 than just take your favorite estimator 23:50 and see if it's far from the value you're currently testing. 23:53 That was our rationale to build a test before. 23:57 And the first test that's non-trivial 24:00 is a test that exploits the fact that the maximum likelihood 24:04 estimator, under some technical condition, 24:07 has a limit distribution which is Gaussian with mean 0 24:12 when properly centered and a covariance matrix given 24:18 by the Fisher information matrix. 24:19 Remember this Fisher information matrix? 24:21 24:26 And so this is the setup that we have. 24:29 So we have, again, an i.i.d. 24:31 sample. 24:32 Now I'm going to assume that I have a d-dimensional parameter 24:35 space, theta. 24:36 And that's why I talk about Fisher information matrix-- 24:39 and not just Fisher information. 24:41 It's a number. 24:42 And I'm going to consider two hypotheses. 24:45 So you're going to have h0, theta is equal to theta 0. 24:52 h1, theta is not equal to theta 0. 24:56 And this is basically what we had in mind 25:00 when we said, are we testing if a coin is fair or unfair. 25:05 So fair was p equals 1/2, and unfair was p different from 1/2. 25:09 And here I'm just making my life a bit easier. 25:13 So now, I have this maximum likelihood estimate 25:16 that I can construct. 25:17 Because let's say I know what p theta is, 25:20 and so I can build a maximum likelihood estimator. 25:23 And I'm going to assume that these technical conditions that 25:26 ensure that this maximum likelihood properly 25:29 standardized converges to some Gaussian are actually satisfied, 25:35 and so this thing is actually true. 25:38 So the theorem, the way I stated it-- 25:41 if you're a little puzzled, this is not the way I stated it. 25:44 And the first time, the way we stated it was that theta hat 25:47 mle minus theta not-- so here I'm 25:51 going to place myself under the null hypothesis, 25:53 so here I'm going to say under h0. 25:58 And honestly, if you have any exercise on tests, 26:01 that's the way that it should start. 26:03 What is the distribution under h0? 26:05 Because otherwise you don't know what this guy should be. 26:08 So you have this, and what we showed 26:10 is that this thing was going in distribution as n goes 26:12 to infinity to some normal with mean 0 26:15 and covariance matrix, which was i of theta, 26:19 which was here for the true parameter. 26:21 But here I'm under h0, so there's 26:22 only one true parameter, which is theta 0. 26:24 26:32 This was our central limit theorem for-- 26:36 I mean, it's not really a central limit theorem; 26:38 a limit theorem for the maximum likelihood estimator. 26:43 Everybody remembers that part?
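A hedged illustration of this theorem in the Bernoulli(p) model, assuming NumPy and SciPy, where the MLE is the sample mean and the Fisher information is I(p) = 1 / (p(1 - p)); the choice of p0 and the sample size below are made up.

```python
# Sketch, assuming NumPy/SciPy: under H0 (p = p0), sqrt(n) (p_hat - p0) is
# approximately N(0, 1 / I(p0)) with I(p0) = 1 / (p0 (1 - p0)) for Bernoulli.
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
p0, n, reps = 0.5, 400, 50_000
p_hat = rng.binomial(n, p0, size=reps) / n          # Bernoulli MLE = sample mean
stat = np.sqrt(n) * (p_hat - p0)
print(stat.var())                                   # ~ p0 (1 - p0) = 1 / I(p0) = 0.25
print(np.quantile(stat, 0.975), stats.norm.ppf(0.975, scale=0.5))   # both ~ 0.98
```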
26:47 The line before said, under technical conditions, I guess. 26:50 So now, it's not really stated in the same way. 26:53 If you look at what's on the slide, 26:54 here I don't have the Fisher information matrix, 26:57 but I really have the identity of rd. 26:59 27:02 If I have a random variable x, which 27:05 has some covariance matrix sigma, 27:10 how do I turn this thing into something that 27:12 has covariance matrix identity? 27:15 So if this was a sigma squared, well, the thing I would do 27:20 would be divide by sigma, and then I 27:21 would have a 1, which is also known 27:24 as the identity matrix of r1. 27:28 Now, what is this? 27:30 This was root of sigma squared. 27:32 So what I'm looking for is the equivalent 27:35 of taking sigma and dividing by the square root of sigma, 27:40 which-- 27:40 obviously those are matrices-- 27:42 I'm certainly not allowed to do. 27:43 And so what I'm going to do is I'm actually 27:45 going to do the following. 27:48 So 1 over root of sigma squared 27:51 can be written as sigma to the negative 1/2. 27:55 And this is actually the same thing here. 27:58 So I'm going to write it as sigma to the negative 1/2, 28:02 and now this guy is actually well-defined. 28:06 So this is a positive symmetric matrix, 28:08 and you can actually define the square root 28:10 by just taking the square root of its eigenvalues, 28:16 for example. 28:17 And so you get sigma to the negative 1/2 times x, and that follows n 0, identity. 28:23 28:26 And in general, I'm going to see something 28:30 that looks like sigma to the negative 1/2 times sigma 28:34 times sigma to the negative 1/2. 28:37 And I have minus 1/2 plus 1 minus 1/2. 28:40 This whole thing collapses to 0, and it's actually the identity. 28:45 So that's the actual rule. 28:47 So if you're not familiar, this is basic multivariate Gaussian 28:52 distribution computations. 28:54 Take a look at it. 28:57 If you feel like you don't need to look at it 28:59 but you know the basic maneuver, it's fine as well. 29:03 We're not going to go much deeper into that, 29:05 but those are part of the things that 29:07 are sort of standard manipulations 29:09 about standard Gaussian vectors. 29:11 Because obviously, standard Gaussian vectors 29:13 arise from this theorem a lot. 29:17 So now I pre-multiply by my sigma to the minus 1/2. 29:22 Now of course, I'm doing all of this in the asymptotics, 29:24 and so I have this effect. 29:26 So if I pre-multiply everything by sigma to the 1/2, 29:29 sigma being the Fisher information matrix at theta 0, 29:34 then this is actually equivalent to saying that square root 29:38 of n-- 29:39 29:43 so now i of theta now plays the role of sigma-- 29:51 times theta hat mle minus theta not goes in distribution 29:59 as n goes to infinity to some multivariate standard Gaussian, 30:06 n 0, identity of rd. 30:09 And here, to make sure that we're 30:10 talking about a multivariate distribution, 30:13 I can put a d here-- 30:16 so just so we know we're talking about the multivariate, 30:18 though it's pretty clear from the context, 30:20 since the covariance matrix is actually a matrix and not 30:23 a number. 30:23 Michael? 30:24 AUDIENCE: [INAUDIBLE]. 30:26 30:29 PHILIPPE RIGOLLET: Oh, yeah. 30:30 Right. 30:31 Thanks. 30:31 30:34 So yeah, you're right. 30:35 So that's a minus and that's a plus. 30:39 Thanks. 30:40 So yeah, anybody has a way to remember 30:47 whether it's inverse Fisher information or Fisher 30:49 information as a variance other than just learning it?
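A small sketch of this standardization, assuming NumPy: build sigma to the negative 1/2 from the eigendecomposition (square roots of the eigenvalues, as described above), check that sigma to the negative 1/2 times sigma times sigma to the negative 1/2 collapses to the identity, and check that it standardizes Gaussian vectors; the covariance matrix below is made up.

```python
# Sketch, assuming NumPy: Sigma^(-1/2) via eigendecomposition of a symmetric
# positive definite matrix, and Sigma^(-1/2) x ~ N(0, I) when x ~ N(0, Sigma).
import numpy as np

rng = np.random.default_rng(3)
a = rng.standard_normal((3, 3))
sigma = a @ a.T + 3 * np.eye(3)          # some made-up positive definite covariance

eigvals, eigvecs = np.linalg.eigh(sigma)
sigma_inv_half = eigvecs @ np.diag(eigvals ** -0.5) @ eigvecs.T

print(np.allclose(sigma_inv_half @ sigma @ sigma_inv_half, np.eye(3)))   # True

x = rng.multivariate_normal(np.zeros(3), sigma, size=200_000)
z = x @ sigma_inv_half.T                  # standardized vectors
print(np.cov(z, rowvar=False).round(2))   # ~ identity
```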
30:54 It is called information, so it's really telling me 30:58 how much information I have. 31:00 So when a variance increases, I'm 31:02 getting less and less information, 31:04 and so this thing should actually be 1 over a variance. 31:08 The notion of information is 1 over a notion of variance. 31:10 31:13 So now I just wrote this guy like this, and the reason 31:19 why I did this is because now everything 31:21 on the right-hand side does not depend on any known parameter. 31:26 There's 0 and identity. 31:30 Those two things are just absolute numbers 31:33 or absolute quantities, which means that this thing-- 31:38 I call this quantity here-- 31:42 what was the name that I used? 31:44 Started with a "p." 31:47 Pivotal. 31:47 So this is a pivotal quantity, meaning 31:50 that its distribution, at least asymptotic distribution, 31:53 does not depend on any unknown parameter. 31:56 Moreover, it is indeed a statistic, 32:00 because I can actually compute it. 32:03 I know theta 0 and I know theta hat mle. 32:05 One thing that I did, and you should actually 32:08 complain about this, is on the board 32:11 I actually used i of theta not. 32:15 And on the slides, it says i of theta hat. 32:20 And it's exactly the same thing that we did before. 32:22 Do I want to use the variance as a way for me 32:26 to check whether I'm under the right assumption or not? 32:29 Or do I actually want to leave that part 32:31 and just plug in the theta hat mle, which should 32:33 go to the true one eventually? 32:36 Or do I actually want to just plug in the theta 0? 32:39 So this is exactly playing the same role 32:41 as whether I wanted to see square root of Xn bar 32:45 1 minus Xn bar in the denominator of my test 32:48 statistic for p, or if I wanted to see square root of 0.5, 32:55 1 minus 0.5 when I was testing if p was equal to 0.5. 32:59 So this is really a choice that's left up to you, 33:03 and that's something you can really choose the two. 33:06 And as we said, maybe this guy is slightly more precise, 33:09 but it's not going to extend to the case 33:11 where theta 0 is not reduced to one single number. 33:15 33:20 Any questions? 33:22 So now we have our pivotal distribution, so from there 33:26 this is going to be my test statistic. 33:29 I'm going to use this as a test statistic 33:31 and declare that if this thing is too large, 33:35 n absolute value-- 33:36 because this is really a way to quantify how far theta hat is 33:41 from theta 0. 33:41 And since theta hat should be close to the true one, when 33:44 this thing is large in absolute value, 33:45 it means that the true theta should be far from theta 0. 33:50 So this is my new test statistic. 33:56 Now, I said it should be far, but this is a vector. 33:59 So if I want a vector to be far, two vectors to be far, 34:02 I measure their norm. 34:04 And so I'm going to form the Euclidean norm of this guy. 34:07 So if I look at the Euclidean norm of n-- 34:10 34:14 and Euclidean norm is the one you know-- 34:16 34:22 I'm going to take its square. 34:25 Let me now put a 2 here. 34:26 So that's just the Euclidean norm, 34:28 and so the norm of vector x is just x transpose x. 34:36 In the slides, the transpose is denoted by prime. 34:40 Wow, that's hard to say. 34:41 Put prime in quotes. 34:42 34:48 That's a statistic standard that people do. 34:50 They put prime for transpose. 34:53 Everybody knows what the transpose is? 
34:56 So I just make it flat and I do it like this, 34:58 and then that means that's actually 34:59 equal to the sum of the coordinates Xi squared. 35:03 35:06 And that's what you know. 35:08 But here, I'm just writing it in terms of vectors. 35:10 And so when I want to write this, this is equivalent, 35:13 this is equal to-- 35:14 well, the square root of n is going to pick up the square. 35:17 So I get square root of n times square root of n. 35:20 So this guy is just 1/2. 35:23 So 1/2 times 1/2 is going to give me 1, 35:25 and so I get theta hat mle minus theta. 35:29 And then I have i of theta not. 35:32 And then I get theta hat mle minus theta not. 35:37 And so by definition, I'm going to say that this 35:41 is my test statistic Tn. 35:45 And now I'm going to have a test that rejects if Tn is large, 35:50 because Tn is really measuring the distance between theta hat 35:53 and theta 0. 35:55 So my test now is going to be psi, which rejects. 36:20 So it says 1 if Tn is larger than some threshold T. 36:27 And how do I pick this T? 36:30 Well, by controlling my type I error-- 36:32 sorry, the c by controlling my type I error. 36:35 So to choose c, what we have to check 36:44 is that p under theta not-- 36:47 so here it's theta not-- 36:49 that I reject so that psi is equal to 1. 36:55 I want this to be equal to alpha, right? 36:58 That's how I maximize my type I error 37:01 under the budget that's actually given to me, which is alpha. 37:04 So that's actually equivalent to checking whether p not of Tn 37:12 is larger than c. 37:13 37:19 And so if I want to find the c, all I need to know 37:23 is what is the distribution of Tn when 37:25 theta is equal to theta not? 37:28 Whatever this distribution is-- maybe it has some weird density 37:31 like this-- 37:32 whatever this distribution is, I'm 37:35 just going to be able to pick this number, 37:37 and I'm going to take this quantile alpha, here alpha, 37:41 and I'm going to reject if I'm larger than alpha-- 37:44 whatever this guy is. 37:45 So to be able to do that, I need to know 37:47 what is the distribution of Tn when theta is equal to theta 0. 37:56 What is this distribution? 38:00 What is Tn? 38:02 It's the norm squared of this vector. 38:08 What is this vector? 38:09 What is the asymptotic distribution of this vector? 38:12 38:17 Yes? 38:18 AUDIENCE: [INAUDIBLE]. 38:21 PHILIPPE RIGOLLET: Just look one board up. 38:23 What is this asymptotic distribution 38:24 of the vector for which we're taking the norm squared? 38:27 It's right here. 38:30 It's a standard Gaussian multivariate. 38:33 So when I look at the norm squared-- 38:36 so if z is a standard Gaussian multivariate, 38:45 then the norm of z squared, by definition of the norm squared, 38:51 is the sum of the Zi squared. 38:54 39:01 That's just the definition of the norm. 39:04 But what is this distribution? 39:06 AUDIENCE: Chi-squared. 39:07 PHILIPPE RIGOLLET: That's a chi-square, 39:09 because those guys are all of variance 1. 39:12 That's what the diagonal tells me-- 39:15 only ones. 39:15 And they're independent because they have all these zeros 39:18 outside of the diagonal. 39:20 So really, this follows some chi-squared distribution. 39:23 How many degrees of freedom? 39:25 Well, the number of them that I sum, d. 39:30 So now I have found the distribution of Tn 39:33 under this guy. 39:35 And that's true because this is true under h0. 39:41 If I was not under h0, again, I would 39:44 need to take another guy here.
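A hedged sketch of Wald's test as just described, assuming NumPy and SciPy; the estimate, the Fisher information matrix, and the sample size in the usage line are placeholders, not numbers from the lecture.

```python
# Sketch, assuming NumPy/SciPy: Wald statistic and the chi-squared threshold.
import numpy as np
from scipy import stats

def wald_test(theta_hat, theta_0, fisher_info, n, alpha=0.05):
    """T_n = n (theta_hat - theta_0)' I(theta_0) (theta_hat - theta_0); reject if T_n > q_alpha."""
    diff = np.asarray(theta_hat) - np.asarray(theta_0)
    t_n = n * diff @ np.asarray(fisher_info) @ diff
    q_alpha = stats.chi2.ppf(1 - alpha, df=len(diff))   # 1 - alpha quantile of chi2_d
    return t_n, q_alpha, t_n > q_alpha

# Toy usage with made-up numbers:
print(wald_test(theta_hat=[0.48, 1.10], theta_0=[0.5, 1.0],
                fisher_info=np.eye(2), n=200))
```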
39:46 39:49 How did I use the fact that theta is equal to theta 0 39:52 when I centered by theta 0? 39:54 And that was very important. 39:57 So now what I know is that this is really equal-- 40:01 why did I put 0 here? 40:02 40:05 So this here is actually equal. 40:10 So in the end, I need c such that the probability-- 40:23 and here I'm not going to put a theta 0. 40:25 I'm just talking about the distribution 40:26 of the random variable that I'm going to put in there. 40:29 It's a chi-square with d degrees of freedom [INAUDIBLE] 40:31 is equal to alpha. 40:32 40:35 I just replaced the fact that this guy, Tn, 40:39 under the distribution was just a chi-square. 40:41 And this distribution here is just 40:42 really referring to the distribution of a chi-square. 40:44 There's no parameters here. 40:46 And now, that means that I look at my chi-square distribution. 40:51 It sort of looks like this. 40:55 And I'm going to pick some alpha here, 40:59 and I need to read this number q alpha. 41:02 41:04 And so here what I need to do is to pick this q alpha here, 41:09 for c. 41:11 So take c to be q alpha, the quantile of order 1 minus 41:28 alpha of a chi-squared distribution 41:31 with d degrees of freedom. 41:32 And why do I say 1 minus alpha? 41:33 Because again, the quantiles are usually 41:36 referring to the area that's to the left of them by-- 41:41 well, actually, it's by a convention. 41:47 However, in statistics, we only care about the right tail 41:52 usually, so it's not very convenient for us. 41:55 And that's why rather than calling 41:56 this guy q sub 1 minus alpha all the time, I write it q alpha. 42:01 So now you have this q alpha, which 42:03 is the 1 minus alpha quantile, or quantile of order 1 minus 42:08 alpha of chi squared d. 42:10 And so now I need to use a table. 42:12 For each d, this thing is going to take a different value, 42:15 and this is why I cannot just spit out a number to you 42:18 like I spit out 1.96. 42:21 Because if I were able to do that, 42:24 that would mean that I would remember 42:25 an entire column of this table for each possible value of d, 42:30 and that I just don't know. 42:32 So you need just to look at tables, 42:34 and this is what it will tell you. 42:36 Often software will do that, too. 42:38 You don't have to search through tables. 42:41 And so just as a remark is that this test, Wald's test, 42:46 is also valid when I have this sort of other alternative 42:50 that I could see quite a lot-- 42:51 if I actually have what's called a one-sided alternative. 42:55 By the way, this is called Wald's test-- 42:58 so taking Tn to be this thing. 43:01 43:09 So this is Wald's test. 43:12 Abraham Wald was a famous statistician 43:15 in the early 20th century, who actually was at Columbia 43:22 for quite some time. 43:26 And that was actually at the time 43:27 where statistics were getting very popular in India, 43:33 so he was actually traveling all over India 43:35 in some dinky planes. 43:37 And one of them crashed, and that's how he died-- 43:41 pretty young. 43:42 But actually, there's a huge school of statistics 43:45 now in India thanks to him. 43:47 There's the Indian Statistical Institute, 43:49 which is actually a pretty big thing 43:51 and trains the best statisticians. 43:53 So this is called Wald's test, and it's actually 43:55 a pretty popular test. 43:56 Let's just look back a second.
43:59 So you can do the other alternatives, 44:01 as I said, and for the other alternatives 44:03 you can actually do this trick where you put theta 0 as 44:06 well, as long as you take the theta 0 that's 44:08 the closest to the alternative. 44:10 You just basically take the one that's the least favorable 44:13 to you-- 44:13 44:16 the alternative, I mean. 44:18 So what is this thing doing? 44:21 If you did not know anything about statistics and I told 44:25 you here's a vector-- 44:26 that's the mle vector, theta hat mle. 44:29 44:32 So let's say this theta hat mle takes the values, say-- 44:36 44:44 so let's say theta hat mle takes values, say, 1.2, 0.9, and 2.1. 44:57 And then testing h0, theta is equal to 1, 1, 2, versus theta 45:06 is not equal to the same number. 45:08 That's what I'm testing. 45:11 So you compute this thing and you find this. 45:13 If you don't know any statistics, 45:14 what are you going to do? 45:15 45:18 You're just going to check if this guy goes to that guy, 45:21 and probably what you're going to do is compute something that 45:24 looks like the norm squared between those guys-- so 45:27 the sum. 45:28 So you're going to do 1.2 minus 1 squared 45:31 plus 0.9 minus 1 squared plus 2.1 minus 2 squared 45:38 and check if this number is large or not. 45:41 Maybe you are going to apply some stats to try to understand 45:44 how those things are, but this is basically 45:46 what you are going to want to do. 45:49 What Wald's test is telling you is 45:52 that this average is actually not what you should be doing. 45:56 It's telling you that you should have some sort 45:59 of a weighted average. 46:00 Actually, it would be a weighted average 46:01 if I was guaranteed that my Fisher information 46:06 matrix was diagonal. 46:08 If my Fisher information matrix is diagonal, 46:10 looking at this number minus this guy, 46:13 transpose i, and then this guy minus this, 46:16 that would look like I have some weight here, some weight here, 46:19 and some weight here. 46:19 46:25 Sorry, that's only three. 46:29 So if it has non-zero numbers on all of its nine entries, 46:32 then what I'm going to see is weird cross-terms. 46:36 If I look at some vector pre-multiplying this thing 46:41 and post-multiplying this thing-- 46:42 so if I look at something that looks like this, 46:44 x transpose i of theta not, x transpose-- 46:51 think of x as being theta hat mle minus theta-- 46:56 so if I look at what this guy looks like, 46:58 it's basically a sum over i and j of Xi, Xj, i, theta not Ij. 47:08 And so if none of those things are 0, 47:11 you're not going to see a sum of three terms that are squares, 47:14 but you're going to see a sum of nine cross-products. 47:18 And it's just weird. 47:20 This is not something standard. 47:21 So what is Wald's test doing for you? 47:26 Well, it's saying, I'm actually going 47:29 to look at all the directions all at once. 47:32 Some of those directions are going 47:33 to have more or less variance, i.e., less or more information. 47:41 And so for those guys, I'm actually 47:43 going to use a different weight. 47:45 So what you're really doing is putting a weight 47:47 on all directions of the space at once. 47:51 So what this Wald's test is doing-- 47:53 by squeezing in the Fisher information matrix, 47:56 it's placing your problem into the right geometry. 
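A hedged illustration of that comparison with the toy numbers above, assuming NumPy; the non-diagonal Fisher information matrix below is made up purely to show how the cross-terms enter.

```python
# Sketch, assuming NumPy: naive squared distance versus the Wald quadratic form,
# using the lecture's toy estimate and a hypothetical non-diagonal I(theta_0).
import numpy as np

theta_hat = np.array([1.2, 0.9, 2.1])
theta_0 = np.array([1.0, 1.0, 2.0])
diff = theta_hat - theta_0

naive = diff @ diff                          # (1.2-1)^2 + (0.9-1)^2 + (2.1-2)^2 = 0.06
fisher = np.array([[4.0, 1.0, 0.5],          # hypothetical I(theta_0), not from the lecture
                   [1.0, 2.0, 0.3],
                   [0.5, 0.3, 1.0]])
wald_form = diff @ fisher @ diff             # sum over i, j of x_i x_j I(theta_0)_{ij}
print(naive, wald_form)
```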
48:00 It's a geometry that's distorted and where balls become ellipses 48:05 that are distorted in some directions 48:07 and shrunk in others, or depending 48:10 on if you have more variance or less variance in those 48:12 directions. 48:13 Those directions don't have to be 48:14 aligned with the axes of your coordinate system. 48:18 And if they were, then that would 48:19 mean you would have a diagonal information matrix, 48:24 but they might not be. 48:25 And so there's this weird geometry that shows up. 48:28 There is actually an entire field, 48:31 admittedly a bit dormant these days, 48:34 that's called information geometry, 48:36 and it's really doing differential geometry 48:39 on spaces that are defined by Fisher information matrices. 48:44 And so you can do some pretty hardcore-- 48:46 something that I certainly cannot do-- 48:50 differential geometry , just by playing around with statistical 48:53 models and trying to understand with the geometry of those 48:55 models are. 48:56 What does it mean for two points to be 48:58 close in some curved space? 49:01 So that's basically the idea. 49:02 So this thing is basically curving your space. 49:06 So again, I always feel satisfied 49:10 when my estimator on my test does not 49:12 involve just computing an average 49:14 and checking if it's big or not. 49:16 And that's not what we're doing here. 49:18 We know that this theta hat mle can be complicated-- 49:23 CF problem set, too, I believe. 49:26 And we know that this Fisher information matrix can also 49:29 be pretty complicated. 49:30 So here, your test is not going to be trivial at all, 49:33 and that requires understanding the mathematics behind it. 49:37 I mean, it all built upon this theorem 49:40 that I just erased, I believe, which 49:43 was that this guy here inside this norm 49:45 was actually converging to some standard Gaussian. 49:47 49:52 So there's another test that you can actually use. 49:55 So Wald's test is one option, and there's another option. 50:00 And just like maximum likelihood estimation and method 50:05 of moments would sometimes agree and sometimes disagree, 50:09 those guys are going to sometimes agree and sometimes 50:12 disagree. 50:13 And this test is called the likelihood ratio test. 50:17 So let's parse those words-- 50:21 "likelihood," "ratio," "test." 50:25 So at some point, I'm going to have 50:26 to take the likelihood of something divided 50:29 by the likelihood of some other thing and then work with this. 50:33 And this test is just saying the following. 50:36 Here's the simplest principle you can think of. 50:39 50:44 You're going to have to understand 50:45 the notion of likelihood in the context of statistics. 50:51 You just have to understand the meaning of the word 50:53 "likelihood." 50:54 This test is just saying if I want to test h0, 51:03 theta is equal to theta 0, versus theta is equal to theta 51:07 1, all I have to look at is whether theta 0 is more or less 51:13 likely than theta 1. 51:14 And I have an exact number that spits out. 51:18 Given a theta 0 or a theta 1 and given data, 51:24 I can put in this function called the likelihood, 51:26 and they tell me exactly how likely those things are. 51:31 And so all I have to check is whether one 51:33 is more likely than the other, and so what I can do 51:35 is form the likelihood of theta, say, 51:41 1 divided by the likelihood of theta 0 51:50 and check if this thing is larger than 1. 51:52 That would mean that this guy is more likely than that guy. 
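A minimal sketch of this naive likelihood ratio for two simple hypotheses about a Bernoulli parameter, assuming NumPy and SciPy; theta 0, theta 1, and the data below are made up, and c = 1 is the uncalibrated threshold discussed next.

```python
# Sketch, assuming NumPy/SciPy: is theta_1 more likely than theta_0 given the data?
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
theta_0, theta_1 = 0.5, 0.6
x = rng.binomial(1, 0.6, size=100)                 # toy data

def likelihood(theta, x):
    return np.prod(stats.bernoulli.pmf(x, theta))

ratio = likelihood(theta_1, x) / likelihood(theta_0, x)
print(ratio, ratio > 1)                            # reject H0 when the ratio exceeds c = 1
```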
51:57 That's a natural way to proceed. 52:00 Now, there's one caveat here, which 52:03 is that when I do hypothesis testing 52:05 and I have this asymmetry between h0 and h1, 52:10 I still need to be able to control what 52:13 my probably of type I error is. 52:15 And here I basically have no knob. 52:19 This is something if you give me data in theta 0 52:21 and theta 1 I can compute to you and spit out the yes/no answer. 52:24 But I have no way of controlling the type II and type I error, 52:29 so what we do is that we replace this 1 by some number c. 52:33 And then we calibrate c in such a way 52:35 that the type I error is exactly at level alpha. 52:37 52:40 So for example, if I want to make sure 52:44 that my type I error is always 0, all I have to do 52:50 is to say that this guy is actually never 52:52 more likely than that guy, meaning never reject. 52:55 And so if I let c go to infinity, 52:57 then this is actually going to make 52:59 my type I error go to zero. 53:02 But if I let c go to negative infinity, 53:05 then I'm always going to conclude 53:12 that h1 is the right one. 53:14 So I have this straight off, and I 53:16 can turn this knob by changing the values of c 53:19 and get different results. 53:22 And I'm going to be interested in the one that maximizes 53:25 my chances of rejecting the null hypothesis while staying 53:29 under my alpha budget of type I error. 53:33 So this is nice when I have two very simple hypotheses, 53:37 but to be fair, we've actually not seen 53:40 any tests that correspond to real-life example. 53:45 Where theta 0 was of the form am I equal to, say, 0.5 53:49 or am I equal to 0.41, we actually 53:51 sort of suspected that if somebody 53:53 asked you to perform this test, they've 53:54 sort of seen the data before and they're sort of cheating. 53:57 So it's typically something am I equal to 0.5 54:00 or not equal to 0.5 or am I equal to 0.5 54:02 or larger than 0.5. 54:03 But it's very rare that you actually get only two points 54:06 to test-- 54:07 am I this guy or that guy? 54:09 Now, I could go on. 54:11 There's actually a nice mathematical theory, 54:13 something called the Neyman-Pearson lemma 54:15 that actually tells me that this test, the likelihood ratio 54:18 test, is the test, given the constraint of type I error, 54:22 that will have the smallest type II error. 54:25 So this is the ultimate test. 54:27 No one should ever use anything different. 54:29 And we could go on and do this, but in a way, 54:32 it's completely irrelevant to practice because you will never 54:35 encounter such tests. 54:37 And I actually find students that they took my class 54:41 as sophomores and then they're still around a couple of years 54:44 later and they're doing, and they're like, 54:46 I have this testing problem and I want to use likelihood ratio 54:50 test, the Neyman-Pearson one, but I just can't because it 54:54 just never occurs. 54:56 This just does not happen. 54:57 So here, rather than going into details, 54:59 let's just look at what building on this principle 55:02 we can actually make a test that will work. 55:05 So now, for simplicity, I'm going 55:08 to assume that my alternatives-- so now, I still 55:11 have a d dimensional vector theta. 55:16 And what I'm going to assume is that the null hypothesis 55:20 is actually only testing if the last coefficients from r 55:26 plus 1 to d are fixed numbers. 55:31 So in this example, where I have theta was equal-- 55:35 so if I have d equals 3, here's an example. 
55:38 55:42 h0 is theta 2 equals 1, and theta 3 equals 2. 55:53 That's my h0, but I say I don't actually 55:56 care about what theta 1 is going to be. 55:58 56:02 So that's my null hypothesis. 56:04 I'm not going to specify right now what the alternative is. 56:07 That's what the null is. 56:08 And in particular, this null is actually not of this form. 56:13 It's not restricting it to one point. 56:15 It's actually restricting it to an infinite amount of points. 56:18 Those are all the vectors of the form theta 1 1, 56:22 2 for all theta 1 in, say, r. 56:29 That's a lot of vectors, and so it's certainly 56:31 not like it's equal to one specific vector. 56:34 56:36 So now, what I'm going to do is I'm actually 56:39 going to look at the maximum likelihood estimator, 56:43 and I'm going to say, well, the maximum likelihood estimator, 56:45 regardless of anything, is going to be close to. reality. 56:50 Now, if you actually tell me ahead of time 56:53 that the true parameter is of this form, 56:56 I'm not going to maximize over all three coordinates of theta. 56:59 I'm just going to say, well, I might as well just 57:01 set the second one to 1, the third one to 2, 57:06 and just optimize for this guy. 57:09 So effectively, you can say if you're telling me 57:11 that this is the reality, I can compute 57:14 a constrained maximum likelihood estimator 57:17 which is constrained to look like what you think reality is. 57:21 So this is what the maximum likelihood estimator is. 57:24 That's the one that's maximizing, say, 57:26 here the log likelihood over the entire space of candidate 57:30 vectors, of candidate parameters. 57:32 But this partial one, this is the constraint mle. 57:36 That's the one that's actually not maximizing our real thetas, 57:38 but only over the thetas that are plausible 57:41 under the null hypothesis. 57:44 So in particular, if I look at ln of this constraint thing 57:52 theta hat n c compared to ln, theta hat-- 57:59 let's say n mle, so we know which one-- 58:04 which one is bigger? 58:05 58:13 The first one is bigger. 58:15 So why? 58:17 AUDIENCE: [INAUDIBLE]. 58:18 58:20 PHILIPPE RIGOLLET: So the second one 58:22 is maximized over a larger space. 58:25 Right. 58:25 So I have this all of theta, which 58:28 are all the parameters I can take, 58:30 and let's say theta 0 is this guy. 58:32 I'm maximizing a function over all these things. 58:35 So if the true maximum is this here, 58:38 then the two things are equal, but if the maximum 58:41 is on this side, then the one on the right 58:43 is actually going to be larger. 58:45 They're maximizing over a bigger space, 58:48 so this guy has to be less than this guy. 58:51 So maybe it's not easy to see. 58:53 So let's say that this is theta and this is theta 0 59:01 and now I have a function. 59:04 The maximum over theta 0 is this guy here, 59:09 but the maximum over the entire space is here. 59:12 59:15 So the maximum over a larger space 59:17 has to be larger than the maximum over a smaller space. 59:20 It can be equal, but the one in the bigger space 59:26 can be even bigger. 59:28 However, if my true theta actually 59:33 did belong to theta 0-- 59:35 if h0 was true-- 59:38 what would happen? 59:39 Well, if theta 0 is true, then theta isn't theta 0, 59:45 and since the maximum likelihood should be close to theta, 59:49 it should be the case that those two things should 59:51 be pretty similar. 
59:52 I should be in a case not in this kind of thing, 59:56 but more in this kind of position, 59:58 where the true maximum is actually attained at theta 0. 60:00 And in this case, they're actually 60:02 of the same size, those two things. 60:05 If it's not true, then I'm going to see a discrepancy 60:08 between the two guys. 60:09 60:12 So my test is going to be built on this intuition 60:15 that if h0 is true, the values of the likelihood at theta hat 60:20 mle and at the constrained mle should be pretty much the same. 60:24 But if theta hat-- 60:25 if it's not true, then the likelihood of the mle 60:29 should be much larger than the likelihood 60:33 of the constrained mle. 60:34 60:37 And this is exactly what this test is doing. 60:40 So that's the likelihood ratio test. 60:42 So rather than looking at the ratio of the likelihoods, 60:46 we look at the difference of the log likelihoods, which 60:48 is really the same thing. 60:51 And there is some weird normalization factor, too, 60:54 that shows up here. 60:55 61:04 And this is what we get. 61:06 So if I look at the likelihood ratio test, 61:18 so it's looking at two times ln of theta hat mle 61:25 minus ln of theta hat mle constrained. 61:32 And this is actually the test statistic. 61:34 So we've actually decided that this statistic is what? 61:39 61:42 It's non-negative, right? 61:44 We've also decided that it should 61:45 be close to zero if h0 is true and of course 61:49 then maybe far from zero if h0 is not true. 61:52 So what should be the natural test based on Tn? 62:00 Let me just check that it's-- 62:03 well, it's already there. 62:05 So the natural test is something that looks like indicator 62:08 that Tn is larger than c. 62:12 And you should say, well, again? 62:13 I mean, we just did that. 62:15 I mean, it is basically the same thing that we just did. 62:19 Agreed? 62:20 But the Tn now is different. 62:22 The Tn is the difference of log likelihoods, 62:24 whereas before the Tn was this theta hat minus theta 62:29 not transpose, times the Fisher information matrix, times theta 62:35 hat minus theta not. 62:37 And this, there's no reason why this guy 62:39 should be of the same form. 62:41 Now, if I have a Gaussian model, you 62:43 can check that those two things are actually exactly the same. 62:45 62:49 But otherwise, they don't have any reason to be. 62:52 And now, what's happening is that 62:54 under some technical conditions-- 62:57 if h0 is true, now what happens is 62:59 that if I want to calibrate c, what I need to do 63:02 is to look at what is the c such that this guy is 63:08 equal to alpha? 63:10 And that's for the distribution of T under the null. 63:15 63:20 But there's not only one. 63:22 The null hypothesis here was actually 63:26 just a family of things. 63:28 It was not just one vector. 63:29 It was an entire family of vectors, 63:31 just like in this example. 63:33 So if I want my type I error to be constrained 63:35 over the entire space, what I need to make sure of 63:39 is that the maximum over all theta in theta not 63:44 is actually equal to alpha. 63:45 63:53 Agreed? 63:53 Yeah? 63:54 AUDIENCE: [INAUDIBLE]. 63:55 63:59 PHILIPPE RIGOLLET: So not equal. 64:04 In this case, it's going to be not equal. 64:06 I mean, it can really be anything you want. 64:08 It's just you're going to have a different type II error. 64:12 I guess here we're sort of stuck in a corner. 64:15 We built this T, and it has to be small under the null. 64:18 And whatever not the null is, we just 64:21 hope that it's going to be large.
64:22 64:25 So even if I tell you what the alternative is, 64:27 you're not going to change anything about the procedure. 64:31 So here, q alpha-- so what I need to know 64:33 is that if h0 is true, then Tn in this case 64:37 actually converges to some chi-square distribution. 64:41 And now here, the number of degrees of freedom 64:44 is kind of weird. 64:45 64:58 But actually, what it should tell you is, oh, finally, I 65:02 know why you call this parameter degrees of freedom 65:05 rather than dimension or just d parameter. 65:08 It's because here what we did is we actually pinned down 65:13 everything, but r-- 65:19 sorry, we pinned down everything but r 65:23 coordinates of this thing. 65:24 65:26 And so now I'm actually wondering why-- 65:30 65:34 did I make a mistake here? 65:36 65:40 I think this should be chi square 65:41 with r degrees of freedom. 65:43 65:46 Let me check and send you an update about this, 65:48 because the number of degrees of freedom, 65:53 if you talk to normal people they will tell you 65:55 that here the number of degrees of freedom is r. 65:59 This is what's allowed to move, and that's 66:01 what's called degrees of freedom. 66:03 The rest is pinned down to being something. 66:06 So here, this chi-square should be a chi-squared r. 66:10 And that's something you just have to believe me. 66:12 Anybody guess what theorem is going to tell me this? 66:15 66:19 In some cases, it's going to be Cochran's theorem-- 66:21 just something that tells me that thing's [INAUDIBLE]. 66:23 Now, here, I use the very specific form 66:27 of the null hypothesis. 66:29 And so for those of you who are sort 66:31 of familiar with linear algebra, what I did here is h0 66:35 consists in saying that theta belongs 66:39 to an r dimensional linear space. 66:43 It's actually here, the r dimensional linear space 66:45 of vectors, that have the first r coordinates that can move 66:49 and the last coordinates that are fixed to some number. 66:54 Actually, it's an affine space because it doesn't necessarily 66:57 go through zero. 66:58 And so I have this affine space that 67:00 has dimension r, and if I were to constrain it to any other r 67:05 dimensional space, that would be exactly the same thing. 67:08 And so to do that, essentially what you need to do is to say, 67:10 if I take any matrix that's, say, invertible-- let's call it u-- 67:15 and then so h0 is going to be something like of the form u 67:21 times theta and now I look only at the coordinates r plus 1 to d, 67:33 then I want to fix those guys to some numbers. 67:35 I don't want to call them theta, so let's call them tau. 67:39 So it's going to be tau r plus 1, all the way to tau d. 67:44 So this is not part of the requirements, 67:47 but just so you know, it's really not a matter 67:50 of keeping only some coordinates. 67:51 Really, what matters is the dimension 67:54 in the sense of linear subspaces of the problem, 67:56 and that's what determines what your degrees of freedom are. 67:59 68:03 So now that we know what the asymptotic distribution is 68:06 under the null, then we know basically 68:10 which table we need to pick our q alpha from. 68:17 And here, again, the table is a chi-squared table, 68:20 but here, the number of degrees of freedom 68:22 is this weird d minus r degrees of freedom thing. 68:26 68:29 I just said it was r. 68:31 68:34 I'm just checking, actually, if I'm-- 68:36 68:41 it's r. 68:42 It's definitely r. 68:42 68:51 So here we've made tests.
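A hedged numerical check in a toy model where everything is explicit, assuming NumPy and SciPy: X_i drawn from N(theta, I_3) with d = 3, and h0 pins theta 2 = 1 and theta 3 = 2 while theta 1 is free, so r = 1. For this model the statistic 2(l_n(MLE) minus l_n(constrained MLE)) reduces to n(X_2 bar minus 1) squared plus n(X_3 bar minus 2) squared, and the simulation comes out consistent with the chi squared with d minus r = 2 degrees of freedom quoted on the slide.

```python
# Sketch, assuming NumPy/SciPy: likelihood ratio statistic in a Gaussian toy model.
import numpy as np
from scipy import stats

rng = np.random.default_rng(6)
n, reps = 50, 20_000
theta_true = np.array([0.3, 1.0, 2.0])            # H0 holds: last two coordinates pinned
x = rng.standard_normal((reps, n, 3)) + theta_true
xbar = x.mean(axis=1)

# 2 * (log-lik at unconstrained MLE - log-lik at constrained MLE), simplified for N(theta, I_3):
t_n = n * (xbar[:, 1] - 1.0) ** 2 + n * (xbar[:, 2] - 2.0) ** 2
print(t_n.mean())                                              # ~ 2 = d - r
print(stats.kstest(t_n, stats.chi2(df=2).cdf).statistic)       # small: consistent with chi2_2
```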
68:54 We're testing if our parameter theta was explicitly 68:57 in some set or not. 69:00 By explicitly, I mean we're saying, is theta like this 69:03 or is theta not like this? 69:04 Is theta equal to theta naught, or is theta 69:06 not equal to theta naught? 69:07 Are the last coordinates of theta 69:10 equal to those fixed numbers, or are they not? 69:12 Those were things I was stating directly about theta. 69:15 But there's going to be some instances where you actually 69:17 want to test something about a function of theta, 69:21 not theta itself. 69:22 For example, is the difference between the first coordinate 69:27 of theta and the second coordinate of theta positive? 69:30 That's definitely something you might want to test, 69:32 because maybe theta 1 is-- 69:37 let me try to think of some good example. 69:39 69:44 I don't know. 69:45 Maybe theta 1 is your drawing accuracy with the right hand 69:49 and theta 2 is the drawing accuracy with the left hand, 69:52 and I'm actually collecting data on young children 69:56 to be able to test early on whether they're 69:58 going to be left-handed or right-handed, for example. 70:01 And so I want to just compare those two with respect 70:04 to each other, but I don't necessarily 70:06 need to know what the absolute scores for these handwriting 70:10 skills are. 70:12 So sometimes it's just interesting to look 70:14 at the difference of things or maybe the sum, 70:17 say the combined effect. 70:18 Maybe these are my two measurements of blood pressure, 70:22 and I just want to talk about the average blood pressure. 70:25 And so I can make a linear combination of those two, 70:28 and so those things implicitly depend on theta. 70:30 And so I can generically encapsulate them 70:36 in some test of the form g of theta is equal to 0 70:39 versus g of theta is not equal to 0. 70:42 And sometimes, in the first test that we saw, g of theta 70:46 was just the identity, or maybe the identity minus 0.5. 70:53 If g of theta is theta minus 0.5, 70:55 that's exactly what we've been testing. 70:57 If g of theta is theta minus 0.5 and theta 71:01 is p, the parameter of a coin, this is exactly of this form. 71:06 So this is a simple one, but then there's 71:08 more complicated ones we can think of, as sketched below. 71:11 71:14 So how can I do this? 71:20 Well, let's just follow a recipe. 71:22 71:24 So we traced back. 71:26 We were trying to build a test statistic which was pivotal. 71:31 We wanted to have this thing that 71:33 had nothing that depended on the parameter, 71:37 and the only thing we had for that, 71:39 the one we built our chi-square test on, 71:41 is basically some form of central limit theorem. 71:44 Maybe it's for the maximum likelihood estimator. 71:46 Maybe it's for the average, but it's basically 71:48 some form of asymptotic normality of the estimator. 71:52 And that's what we started from every single time. 71:55 So let's assume that I have this, 71:58 and I'm going to talk very abstractly. 72:00 Let's assume that I start with an estimator. 72:03 Doesn't have to be the mle. 72:04 It doesn't have to be the average, 72:06 but it's just something. 72:08 And I know that I have the estimator such that this guy 72:11 converges in distribution to some N(0, Sigma of theta), 72:15 for some covariance matrix Sigma of theta. 72:17 Maybe it's not the Fisher information. 72:20 Maybe that's something that's not as good as the mle, 72:23 meaning that this is going to give me 72:25 less information than the Fisher information, less accuracy.
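Just to ground the g(theta) = 0 reformulation, here is how the spoken examples fit the template; the particular functions below are my paraphrase of the examples in the lecture, each mapping into R, so k = 1.

```python
import numpy as np

# Each hypothesis about theta is rephrased as a statement about g(theta).

def g_difference(theta):            # right-hand vs left-hand accuracy: theta_1 - theta_2
    return theta[0] - theta[1]

def g_average(theta):               # combined effect, e.g. average of two blood pressures
    return 0.5 * (theta[0] + theta[1])

def g_coin(p):                      # coin example: H0 says p = 1/2, i.e. g(p) = p - 0.5 = 0
    return p - 0.5

theta = np.array([0.7, 0.4])
print(g_difference(theta), g_average(theta), g_coin(0.5))
```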
72:29 And now I can actually just say, OK, if I know this about theta, 72:34 I can apply the multivariate delta method, which tells me 72:43 that square root of n, g of theta hat, minus g of theta 72:50 goes in distribution to some N(0, ...). 72:56 And then the price to pay in one dimension 72:58 was multiplying by the square of the derivative, 73:01 and we know that in higher dimensions it's pre-multiplying 73:03 by the gradient and post-multiplying 73:05 by the gradient. 73:06 So I'm going to write nabla g of theta transpose, sigma, 73:14 nabla g of theta-- sorry, not delta; nabla-- 73:15 so, gradient. 73:19 And here, I assume that g takes values in R k. 73:25 That's what's written here. g goes from R d to R k, 73:28 but think of k as being 1 for now. 73:30 So the gradient is really just a vector and not a matrix. 73:33 That's your usual gradient for real-valued functions. 73:40 So effectively, if g takes values in dimension 1, 73:45 what is the size of this matrix? 73:47 73:58 I only ask trivial questions. 73:59 Remember, that's rule number one. 74:02 It's one by one, right? 74:04 And you can check it, because on this side 74:06 those are just differences between numbers. 74:08 And it would be kind of weird if they had 74:10 a covariance matrix at the end. 74:11 I mean, this is a random variable, not a random vector. 74:15 So I know that this thing happens. 74:17 And now, if I basically divide by the square root 74:21 of this thing-- 74:22 74:30 so on the board I'm working with k equal to 1-- divided by the square 74:35 root of nabla g of theta transpose, sigma, 74:41 nabla g of theta-- 74:43 74:45 then this thing should go to some standard normal random 74:51 variable, standard normal distribution. 74:56 I just divided by the square root of the variance here, 74:59 which is the usual thing. 75:01 Now, if you do not have a univariate thing, 75:05 you do the same thing we did before, 75:07 which is pre-multiplying by the covariance matrix 75:11 to the negative 1/2-- 75:12 so before this role was played by the inverse Fisher 75:16 information matrix. 75:18 That's why we ended up having I of theta to the 1/2, 75:22 and now we just have this gamma, which is just this function 75:25 that I wrote up there. 75:26 That could potentially be k by k if g takes values in R k. 75:31 Yes? 75:32 AUDIENCE: [INAUDIBLE]. 75:35 PHILIPPE RIGOLLET: Yeah, the gradient of a vector 75:37 is just the vector with all the derivatives with respect 75:41 to each component, yes. 75:42 75:45 So you know the word for the vector of derivatives, but not 75:48 the word gradient? 75:49 I mean, which word do you use in one dimension? 75:54 Yes, derivative in one dimension. 75:57 76:01 Now, of course, here, you notice there's something-- 76:03 I actually have a little caveat here. 76:06 I want this to have rank k. 76:08 I want this to be invertible. 76:10 I want this matrix to be invertible. 76:11 Even for the Fisher information matrix, 76:13 I sort of need it to be invertible. 76:15 Even for the original theorem, that 76:16 was part of my technical conditions, 76:18 just so that I could actually write the Fisher information matrix 76:21 inverse. 76:22 And so here, you can make your life easy and just assume 76:26 that it's true all the time, because I'm actually writing 76:28 in a fairly abstract way. 76:29 But in practice, we're going to have 76:31 to check whether this is going to be 76:33 true for specific distributions.
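A small sketch of the delta-method bookkeeping in the k = 1 case; the covariance matrix Sigma, the function g, and the numbers here are placeholders of mine, only to show where the gradient enters.

```python
import numpy as np

# Suppose sqrt(n) * (theta_hat - theta) -> N(0, Sigma) in R^2, and g: R^2 -> R (so k = 1).
# Delta method: sqrt(n) * (g(theta_hat) - g(theta)) -> N(0, grad_g(theta)' Sigma grad_g(theta)).

Sigma = np.array([[2.0, 0.5],
                  [0.5, 1.0]])           # assumed asymptotic covariance of theta_hat

def g(theta):
    return theta[0] - theta[1]           # example: difference of the two coordinates

def grad_g(theta):
    return np.array([1.0, -1.0])         # gradient of g (constant in this example)

theta_hat = np.array([0.9, 0.6])         # some estimate
v = grad_g(theta_hat) @ Sigma @ grad_g(theta_hat)   # the "one by one matrix": just a number
print("asymptotic variance of g(theta_hat):", v)    # 2 - 0.5 - 0.5 + 1 = 2.0
```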
76:35 And we will see an example towards the end 76:37 of the chapter, the multinomial, where 76:39 it's actually not the case that the Fisher information 76:42 matrix exists. 76:43 76:46 The asymptotic covariance matrix is not invertible, 76:49 so it's not the inverse of a Fisher information matrix. 76:52 Because to be the inverse of someone, 76:54 you need to be invertible yourself. 76:55 76:58 And so now what I can do is apply Slutsky. 77:01 So here, what I needed to have is theta, the true theta. 77:06 So what I can do is just put some theta hat in there, 77:10 and so that's the gamma of theta hat that I see there. 77:16 And if h0 is true, then g of theta is equal to 0. 77:19 That's what we assume. 77:20 That was our h0: under h0, g of theta is equal to 0. 77:25 So the number I need to plug in here-- 77:29 I don't need to replace theta here. 77:31 What I need to replace here is 0. 77:33 77:36 Now let's go back to what you were saying. 77:38 Here you could say, let me try to replace 0 here, 77:41 but there is no such thing. 77:42 There is no g here. 77:43 It's only the gradient of g. 77:45 So this thing that says replace theta by theta 0 77:50 wherever you see it could not work here. 77:53 If g was invertible, I could just 77:57 say that theta is equal to g inverse of 0 under the null 78:02 and then I could plug in that value. 78:05 But in general, it doesn't have to be invertible. 78:08 And it might be a pain to invert g, even. 78:11 I mean, it's not clear how you can 78:13 invert all functions like that. 78:15 And so here you just go with Slutsky, and you say, 78:17 OK, I'm just going to put theta hat in there. 78:20 But this guy, I know I need to check whether it's 0 or not. 78:24 Same recipe we did for theta, except we do it for g of theta 78:27 now. 78:28 78:30 And now I have my asymptotic thing. 78:34 I know this is a pivotal distribution. 78:36 This might be a vector. 78:38 So rather than looking at the vector itself, 78:41 I'm going to actually look at the norm-- 78:43 rather than looking at the vectors, 78:44 I'm going to look at their square norm. 78:46 That gives me a chi square, and I 78:47 reject when my test statistic, which is the norm squared, 78:51 exceeds the quantile of a chi square-- 78:53 same as before, just do it on your own. 78:56 Before we part ways, I wanted to just mention one thing, which 79:00 is, look at this thing. 79:02 If g was of dimension 1, the Euclidean norm in dimension 1 79:08 is just the absolute value of the number, right? 79:10 79:13 Which means that when I am actually computing this, 79:19 I'm looking at the square, so it's the square of something. 79:22 So it means that this is the square of a Gaussian. 79:25 And it's true that, indeed, the chi 79:26 squared 1 is just the square of a Gaussian. 79:28 79:31 Sure, this is a tautology, but let's look at this test now. 79:36 This test was built using Wald's theory and some pretty heavy 79:40 stuff. 79:42 But now if I start looking at Tn and I think of it 79:44 as being just the absolute value of this quantity over there, 79:47 squared, what I'm really doing is 79:50 I'm looking at whether the square of some Gaussian 79:54 exceeds the quantile of a chi squared with 1 degree of freedom, 80:00 which means that this thing is actually equivalent-- 80:02 completely equivalent-- to the test.
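Putting the recipe together as just described: plug theta hat into the gradient and the covariance (the Slutsky step), replace g(theta) by its value 0 under H0, and compare the squared, standardized statistic to a chi-squared quantile. This is a sketch in a toy setting of my own; Sigma, g, and the simulated sample are not the lecture's.

```python
import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(1)

# Toy setting (my choice): theta_hat is a sample mean, so sqrt(n) (theta_hat - theta) -> N(0, Sigma).
# Test H0: g(theta) = theta_1 - theta_2 = 0, so k = 1.
n = 500
Sigma = np.array([[2.0, 0.5],
                  [0.5, 1.0]])
theta_true = np.array([1.0, 1.0])                        # H0 holds here
X = rng.multivariate_normal(theta_true, Sigma, size=n)
theta_hat = X.mean(axis=0)
Sigma_hat = np.cov(X, rowvar=False)                      # plug-in for Sigma (Slutsky step)

def g(theta):
    return np.array([theta[0] - theta[1]])               # values in R^k, k = 1

def grad_g(theta):
    return np.array([[1.0, -1.0]])                       # k x d Jacobian

Gamma_hat = grad_g(theta_hat) @ Sigma_hat @ grad_g(theta_hat).T   # k x k, must be invertible
# Squared norm of Gamma_hat^{-1/2} * sqrt(n) * g(theta_hat), i.e. n * g' Gamma_hat^{-1} g:
Tn = float(n * g(theta_hat) @ np.linalg.inv(Gamma_hat) @ g(theta_hat))

alpha = 0.05
print("T_n =", round(Tn, 3), "reject H0:", Tn > chi2.ppf(1 - alpha, df=1))
```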
80:04 So if k is equal to 1, this is completely 80:10 equivalent to looking at the absolute value of something 80:15 and checking whether it's larger than, say, q over 2-- 80:19 well, than q alpha-- 80:22 well, that's q alpha over 2-- 80:24 so that the probability of this thing 80:26 is actually equal to alpha. 80:27 And that's exactly what we've been doing before. 80:29 When we introduced tests in the first place, 80:31 we just took absolute values and said, well, 80:33 this is the absolute value of a Gaussian in the limit. 80:36 And so it's the same thing. 80:37 So this is actually equivalent to checking 80:40 whether the norm squared-- which is the square 80:44 of some normal-- 80:45 is larger than the q alpha of some chi squared 80:52 with one degree of freedom. 80:53 Those are exactly the same two tests. 80:58 So in one dimension, those things just 81:00 collapse into being one little thing, 81:03 and that's because there's no geometry in one dimension. 81:05 It's just one dimension, whereas if I'm in a higher dimension, 81:08 then things get distorted and things can become weird. 81:12
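A quick numerical check of the equivalence just stated: squaring the two-sided Gaussian threshold q alpha over 2 gives exactly the level-alpha quantile of a chi squared with one degree of freedom.

```python
from scipy.stats import chi2, norm

alpha = 0.05
q_gauss = norm.ppf(1 - alpha / 2)       # two-sided Gaussian threshold, about 1.96
q_chi2 = chi2.ppf(1 - alpha, df=1)      # chi-squared(1) quantile, about 3.84
print(q_gauss ** 2, q_chi2)             # equal: |Z| > q_gauss  <=>  Z^2 > q_chi2
```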