https://www.youtube.com/watch?v=V4xOdtqic3o&list=PLUl4u3cNGP60uVBMaoNERc6knT_MgPKS0&index=14

Transcript

The following content is provided under a Creative Commons license. Your support will help MIT OpenCourseWare continue to offer high-quality educational resources for free. To make a donation or to view additional materials from hundreds of MIT courses, visit MIT OpenCourseWare at ocw.mit.edu.

PHILIPPE RIGOLLET: So yes, before we start, this chapter will not be part of the midterm. Everything else will be, all the way up to goodness of fit tests. There will be some practice exams posted in the recitation section of the course that you will be working on, and the recitation tomorrow will be a review session for the midterm. I'll send an announcement by email.

So going back to our estimator: we studied the least squares estimator in the case where we had Gaussian observations. We had something that looked like this: y was equal to some matrix X times beta plus some epsilon. This was an equation in R^n, for n observations. And then we wrote the least squares estimator, beta hat. From here on, you see that you have this normal distribution, this p-variate Gaussian distribution. That means that, at some point, we made the assumption that epsilon was N_n(0, sigma^2 I_n), with the sigma squared being the factor I kept forgetting last time. I will try not to do that this time.

From this we derived a bunch of properties of the least squares estimator, beta hat. In particular, the key fact that everything was built on was that we could write beta hat as the true unknown beta plus some multivariate Gaussian that was centered but had a particular covariance structure: it was p-dimensional, with covariance sigma squared times (X^T X)^{-1}. And the way we derived that was by having at least one cancellation between X^T X and (X^T X)^{-1}. So this is the basis for inference in linear regression.

In a way, that's correct, because what happened is that we used the fact that, once we have this beta hat, X beta hat is really just the projection of y onto the linear span of the columns of X, the column span of X. And in particular, the quantities y minus X beta hat are called residuals. So that's the vector of residuals. What's the dimension of this vector?

AUDIENCE: n by 1.

PHILIPPE RIGOLLET: n by 1. So those things we can write as epsilon hat. It's an estimate for epsilon, because we just put a hat on beta. And from this one we could actually build an unbiased estimator of sigma squared, sigma hat squared, and that was this guy. And we showed that the right normalization for this was indeed n minus p, because the squared norm of y minus X beta hat, up to the scaling by sigma squared, is actually a chi squared with n minus p degrees of freedom. So that's what we came up with.
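To make the objects above concrete, here is a minimal numpy sketch of the setup: a simulated design (the data and dimensions are illustrative assumptions, not anything from the lecture), the least squares estimator, and the unbiased variance estimator with the n - p normalization.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 100, 4
sigma = 2.0

# Simulated deterministic design (first column all ones) and simulated response
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
beta_true = np.array([1.0, -2.0, 0.0, 0.5])
y = X @ beta_true + sigma * rng.normal(size=n)

# Least squares estimator: beta_hat = (X^T X)^{-1} X^T y
XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y

# Residuals and the unbiased variance estimator sigma_hat^2 = ||y - X beta_hat||^2 / (n - p)
residuals = y - X @ beta_hat
sigma2_hat = residuals @ residuals / (n - p)

print(beta_hat, sigma2_hat)
```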
PHILIPPE RIGOLLET: And something I told you, which follows from Cochran's theorem--we did not go into the details about this. But essentially, one of them corresponds to the projection onto the linear span of the columns of X, and the other one corresponds to the projection onto the orthogonal complement of this span, and we're in a Gaussian case: things that are orthogonal are actually independent in the Gaussian case. So from a geometric point of view, you can sort of understand everything. You think of the subspace spanned by the columns of X: sometimes you project onto this subspace, sometimes you project onto its orthogonal complement. Beta hat corresponds to the projection onto the linear span; the epsilon hats correspond to the projection onto the orthogonal complement. And those things are independent, and that's why beta hat is independent of sigma hat squared. So it's really just a statement about two linear spaces being orthogonal to each other.

So we left off on this slide last time. That's good for beta hat. But since we don't know what sigma squared is--if we knew sigma squared, that would totally be enough for us--we also need this extra thing: that (n minus p) times sigma hat squared over sigma squared follows a chi squared with n minus p degrees of freedom, and that sigma hat squared is independent of beta hat. So that's going to be something we need. It's useful if sigma squared is unknown. And again, sometimes it might be known, if you're using some sort of measurement device for which it's written on the side of the box.

So from these two things we're going to be able to do inference. And we said there are three pillars to inference. The first one is estimation, and we've been doing that so far: we've constructed this least squares estimator, which happens to be the maximum likelihood estimator in the Gaussian case. The two other things we do in inference are confidence intervals--and we can do confidence intervals, but we're not going to do much with them because we're going to talk about their cousin, which is tests. And that's really where the statistical inference comes in. Here, we're going to be interested in a very specific kind of test for linear regression. Those are tests of the form: the j-th coefficient of beta is equal to 0--and that's going to be our null hypothesis--versus H1, where beta j is, say, not equal to 0. For the purpose of regression, unless you have lots of domain-specific knowledge, it won't be beta j positive or beta j negative. It's really non-zero that's interesting to you.

So why would I want to do this test? Well, if I expand this equation y = X beta + epsilon and look, for example, at the i-th coordinate, I get that y_i is equal to beta_0 plus beta_1 x_{i,1} plus ... plus beta_{p-1} x_{i,p-1} plus epsilon_i.
PHILIPPE RIGOLLET: And that's true for all i's. The first term is the intercept times 1--that was our first coordinate. So that's just expanding this, going back to the scalar form rather than the matrix-vector form. When I write y = X beta + epsilon, I assume that each of my y's can be represented as a linear combination of the x's, the first one being 1, plus some epsilon_i. Everybody agrees with this? What does it mean for beta_j to be equal to 0?

AUDIENCE: That x_j is not important.

PHILIPPE RIGOLLET: Yeah, that x_j doesn't even show up in this thing. So if beta_j is equal to 0, that means that, essentially, we can remove the j-th coordinate, x_j, from all observations.

So for example, I'm a banker, and I'm trying to predict some score--let's call it y--without the noise. I'm trying to predict what your score is going to be, something that should tell me how likely you are to reimburse your loan on time, or whether you have late payments. Or actually, maybe these days bankers are looking at how much in late fees they will be collecting from you; maybe that's what they're after rather than making sure you reimburse everything. So they're trying to maximize this number of late fees. And they collect a bunch of things about you--definitely your credit score, but maybe your zip code, profession, years of education, family status, a bunch of things. One might be your shoe size. And they want to know--maybe shoe size is actually a good explanation for how much in fees they're going to collect from you. But as you can imagine, this would be a controversial thing to bring in, and people might want to test whether including shoe size is a good idea. So they would just look at the j corresponding to shoe size and test whether shoe size should appear or not in this formula. That's essentially the kind of thing people are going to do.

Now, if I do genomics and I'm trying to predict the size, the girth, of a pumpkin for a competition based on some available genomic data, then I can test whether gene j, which is called--I don't know--pea snap 24--they always have these crazy names--appears or not in this formula. Is the gene pea snap 24 going to be important or not for the size of the final pumpkin? So those are definitely the important questions. And we definitely want to put beta_j not equal to 0 as the alternative, because that's where scientific discovery shows up.

And to do that, well, we're in a Gaussian setup, so we know that even if we don't know what sigma is, we can actually call for a t-test. So how did we build the t-test before? Well, what we had was something that looked like: theta hat was equal to theta plus some N(0, something that depended on n)--something like sigma squared over n. That's what it looked like. Now what we have is that beta hat is equal to beta plus some N, but this time it's p-variate: N_p(0, sigma squared (X^T X)^{-1}).
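A quick Monte Carlo sanity check of this last statement, under a simulated deterministic design (all numbers here are illustrative assumptions): the empirical covariance of beta hat across repeated draws of the noise should be close to sigma^2 (X^T X)^{-1}.

```python
import numpy as np

rng = np.random.default_rng(2)
n, p, sigma = 100, 4, 2.0
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
beta = np.array([1.0, -2.0, 0.0, 0.5])
XtX_inv = np.linalg.inv(X.T @ X)

# Draw many data sets y = X beta + eps with the SAME fixed design X, and record beta_hat.
draws = []
for _ in range(20000):
    y = X @ beta + sigma * rng.normal(size=n)
    draws.append(XtX_inv @ X.T @ y)
draws = np.asarray(draws)

# Empirical covariance of beta_hat versus the theoretical sigma^2 (X^T X)^{-1}
print(np.cov(draws.T))
print(sigma ** 2 * XtX_inv)
```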
PHILIPPE RIGOLLET: So the new statement is actually very similar to the old one, except that the matrix (X^T X)^{-1} is now replacing the number 1/n, but it's playing the same role. In particular, this implies something for every j from 1 to p: what is the distribution of beta hat j? Well, this is a system of p equations, and all I have to do is read off the j-th row. It tells me: here I'm going to read beta hat j, here I'm going to read beta j, and here I need to read off the distribution of the j-th coordinate of this Gaussian vector.

So how do I do this? Well, the observation that's useful here--maybe I shouldn't use the word observation in a stats class, so let's call it a claim. The claim is that if I have a vector v, then v_j is equal to v transpose e_j, where e_j is the vector with 0's everywhere and a 1 in the j-th coordinate. That's the j-th vector of the canonical basis of R^p. So now that I have this form, I can see that beta hat j is just e_j transpose times this N_p(0, sigma squared (X^T X)^{-1}).

And I know what the distribution of the inner product between a Gaussian vector and a deterministic vector is. What is it? It's a Gaussian. So all I have to check is what e_j transpose N_p(0, sigma squared (X^T X)^{-1}) is equal to in distribution. This is going to be one-dimensional--an inner product is just a real number--so it's going to be some Gaussian. The mean is going to be 0: the inner product of e_j with 0 is 0. What is the variance of this guy? We actually used this rule before, except that there e_j was not a vector but a matrix. The rule is that v transpose N(mu, Sigma) is some N(v transpose mu, v transpose Sigma v). That's the rule for Gaussian vectors; it's just a property of Gaussian vectors.

So what do we have here? Well, e_j plays the role of v, and sigma squared (X^T X)^{-1} plays the role of Sigma. So, pulling out the sigma squared, I'm left with e_j transpose (X^T X)^{-1} e_j. What happens if I take a matrix, premultiply it by this vector e_j transposed, and postmultiply it by e_j? I'm claiming this corresponds to only one single element of the matrix. Which one is it?

AUDIENCE: The j-th.

PHILIPPE RIGOLLET: The j-th diagonal element. So this thing here is nothing but the j-th diagonal element of (X^T X)^{-1}, which I'll write as [(X^T X)^{-1}]_{jj}. Now, I cannot go any further. (X^T X)^{-1} can be a complicated matrix, and I do not know how to express its j-th diagonal element much better than this--it involves basically all the coefficients.
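A one-line numerical illustration of the claim, reusing the simulated X from the first sketch (an assumption carried over, not lecture data): pre- and postmultiplying by e_j picks out the j-th diagonal entry of (X^T X)^{-1}, which is the gamma_j used below.

```python
import numpy as np

# X is assumed to be in scope from the first sketch.
M = np.linalg.inv(X.T @ X)

j = 2                        # pick a coordinate (0-indexed here)
e_j = np.zeros(M.shape[0])
e_j[j] = 1.0

gamma_j = e_j @ M @ e_j              # e_j^T (X^T X)^{-1} e_j
assert np.isclose(gamma_j, M[j, j])  # it is exactly the j-th diagonal element
```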
Yeah?

AUDIENCE: [INAUDIBLE] where does the second e_j come from? I get why e_j transpose [INAUDIBLE]. Where did the--

PHILIPPE RIGOLLET: From this rule?

AUDIENCE: [INAUDIBLE]

PHILIPPE RIGOLLET: So you always pre- and postmultiply when you compute the covariance, because if you did not, it would be a vector and not a scalar, for one. But in general, think of v as possibly a matrix; the rule is still true when v is a matrix that is compatible for multiplication with the Gaussian. Any other question? Yeah?

AUDIENCE: When you say "claim, a vector v," what is the vector v?

PHILIPPE RIGOLLET: It's for any vector v.

AUDIENCE: OK.

PHILIPPE RIGOLLET: Any other question? So now we've identified that the j-th coefficient of this Gaussian, which I can represent from the claim as e_j transpose this guy, is also a Gaussian, that it's centered, and that its variance is sigma squared times the j-th diagonal element of (X^T X)^{-1}. So the conclusion is that beta hat j is equal to beta j plus some N--and I emphasize that it's now one-dimensional--with mean 0 and variance sigma squared [(X^T X)^{-1}]_{jj}.

Now, if you look at the last line of the second board and the first line of the first board, those are basically the same thing. Beta hat j is my theta hat, beta j is my theta, and the variance sigma squared over n is now sigma squared times this jj-th element. The inverse suggests that it behaves like the inverse of n, so we're going to want to think of this quantity as being some sort of 1/n kind of statement.

So the fact that those two things have the same form leads us to believe that we are now equipped to perform the test we're trying to do, because under the null hypothesis, beta j is known--it's equal to 0--so I can remove it. And then I have to deal with the sigma squared. If sigma squared is known, I can just perform a regular Gaussian test using Gaussian quantiles. And if sigma squared is unknown, I'm going to divide by sigma and multiply by sigma hat, and then I'm going to basically get my t-test.

Actually, for the purpose of your exam, I really suggest that you understand every single word I'm going to be saying now, because this is exactly the same thing you're expected to know from other courses; I'm just going to apply exactly the same technique that we used for single-parameter estimation. So what we have now is that under H0, beta j is equal to 0. Therefore, beta hat j follows N(0, sigma squared gamma_j), where, just like in the slide, gamma_j is this j-th diagonal element of (X^T X)^{-1}. That implies that beta hat j over sigma square root of gamma_j follows N(0, 1). So I can form my test, which is to reject if the absolute value of beta hat j divided by sigma square root of gamma_j is larger than what? Can somebody tell me what I want this to be larger than to reject?

AUDIENCE: q alpha.

PHILIPPE RIGOLLET: q alpha. Everybody agrees? Of what? Of this guy, the standard Gaussian, where the standard notation is that this is the quantile. Everybody agrees?
AUDIENCE: It's alpha over 2, I think.

PHILIPPE RIGOLLET: Alpha over 2. So not everybody should be agreeing. Thank you--you're the first one to disagree with yourself, which is probably good. It's alpha over 2 because of the absolute value. I want to be away from this guy on either side, and the sanity check should be that H1 is beta j not equal to 0.

So that works if sigma is known, because I need to know sigma to be able to build my test. If sigma is unknown, I can tell you, use this test, but you're going to say, OK, when I have to plug in some numbers, I'm going to be stuck. But if sigma is unknown, we have sigma hat squared as an estimator. So in particular, beta hat j divided by sigma hat times square root of gamma_j is something I can compute. Agreed? Now I need to work out the distribution of this thing.

So I know the distribution of beta hat j over sigma square root of gamma_j: that's some Gaussian N(0, 1). I don't know directly what the distribution with sigma hat in it is, but what I know--it was actually written, maybe, here--is that (n minus p) sigma hat squared over sigma squared follows a chi squared with n minus p degrees of freedom, and that it's independent of beta hat. It's independent of beta hat, so it's independent of each of its coordinates. That was part of your homework--some of you were confused by the fact that if you're independent of some big thing, you're independent of all the smaller components of this big thing. That's basically what you need to know.

And so now I can rewrite this: I want to make that chi squared appear, so I write it in terms of beta hat j, sigma hat squared over sigma squared times (n minus p), and the square root of gamma_j. Yeah?

AUDIENCE: Why do you have the hat in the denominator? Shouldn't it be sigma?

PHILIPPE RIGOLLET: Yeah, so I write this. I decide to write this. I could have put a Mickey Mouse here; it just wouldn't make sense. I just decided to take this thing.

AUDIENCE: OK.

PHILIPPE RIGOLLET: OK. So now I take this guy, and I'm going to rewrite it as something I want, because if you don't know what sigma is--sorry, you mean the square?

AUDIENCE: Yeah.

PHILIPPE RIGOLLET: Oh, thank you. Yes, that's correct. [LAUGHS] OK, so if you don't know what sigma is, you replace it by sigma hat. That's the most natural thing to do. You just now want to find out what the distribution of this guy is. So this is not exactly what I had. To be able to get this, I need to divide by--

AUDIENCE: Square root.

PHILIPPE RIGOLLET: I'm sorry?

AUDIENCE: Do we need a square root of the sigma hat [INAUDIBLE].
PHILIPPE RIGOLLET: That's correct now. And now I have--sorry, I should not write it like that; that's not what I want. What I want is this. And to be able to get this guy, what I need is sigma over sigma hat, under a square root. And then I need to make this chi squared show up, so I need to have this n minus p appear in the denominator. To get it, I multiply the entire thing by square root of n minus p over square root of n minus p. So this is just a tautology; I just squeezed in what I wanted. But now this whole thing is of the form: beta hat j divided by sigma square root of gamma_j, all divided by the square root of (sigma hat squared over sigma squared) times (n minus p), divided by (n minus p).

And what is the distribution of this thing? Well, this numerator--what is its distribution?

AUDIENCE: [INAUDIBLE]

PHILIPPE RIGOLLET: Yeah, N(0, 1). It's actually still written over there. What is the distribution of this other guy? This is still written on the board.

AUDIENCE: Chi squared.

PHILIPPE RIGOLLET: It's the chi squared that I have right here. So that's a chi squared with n minus p degrees of freedom, divided by n minus p. The only thing I need to check is that those two guys are independent, which is also what I have from here. And so that implies that beta hat j divided by sigma hat square root of gamma_j--what is the distribution of this guy?

[INTERPOSING VOICES]

PHILIPPE RIGOLLET: t with n minus p degrees of freedom. Was that crystal clear for everyone? Was it so simple that it was boring to everyone? OK, good. That's the point you should be at. So now that I have this, I can read off the quantiles of this distribution. So my rejection region: I reject if the absolute value of this new statistic exceeds the quantile of order alpha over 2--but this time of a t_{n-p}. And now you can actually see that the only difference between this test and that test, apart from replacing sigma by sigma hat, is that I've moved from the quantiles of a Gaussian to the quantiles of a t_{n-p}.

What's actually interesting, from this perspective, is that the t_{n-p}, we know, has heavier tails than the Gaussian, but once the number of degrees of freedom reaches, maybe, 30 or 40, they're virtually the same. And here the number of degrees of freedom is not given by n alone; it's n minus p. So if I have more and more parameters to estimate, this results in heavier and heavier tails, and that's just to account for the fact that it's harder and harder to estimate the variance when I have a lot of parameters. That's basically where it's coming from.

So now let's move on to--well, I don't know what, because this is not working anymore. So this is the simplest test.
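Here is a sketch of this single-coefficient t-test, reusing X, y, beta_hat and sigma2_hat from the first simulated sketch (again, assumptions for illustration); the quantile and p-value calls come from scipy.

```python
import numpy as np
from scipy import stats

# Assumed in scope from the first sketch: X, y, beta_hat, sigma2_hat.
n, p = X.shape
alpha = 0.05
j = 2                                    # coefficient under test (0-indexed)

gamma_j = np.linalg.inv(X.T @ X)[j, j]

# If sigma^2 were known: reject when |beta_hat_j / (sigma * sqrt(gamma_j))| > q_{alpha/2} of N(0,1).
# With sigma^2 unknown, plug in sigma_hat and use t_{n-p} quantiles instead:
t_stat = beta_hat[j] / np.sqrt(sigma2_hat * gamma_j)
q = stats.t.ppf(1 - alpha / 2, df=n - p)     # quantile of order alpha/2 of t_{n-p}
reject = abs(t_stat) > q

# Two-sided p-value, the number a software package would print next to beta_hat_j
p_value = 2 * stats.t.sf(abs(t_stat), df=n - p)
print(t_stat, q, reject, p_value)
```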
PHILIPPE RIGOLLET: And actually, if you run any statistical software for least squares, the output will look like this. You will have a sequence of rows, and you're going to have an estimate for beta 0, an estimate for beta 1, et cetera. On each row you're going to have the value here--that's what's estimated by least squares. Then the next entry is going to be the value of this statistic--let's call it t--and then there's going to be the p-value corresponding to this t. This is something that just routinely comes out. Oh, and then there's, of course, the last column, for people who cannot read numbers, that's really just giving you little stars. They're not stickers, but that's close to it. And that's just saying: if I have three stars, I'm very significantly different from 0; if I have two stars, I'm moderately significantly different from 0; and if I have one star, it means, well, just give me another $1,000 and I will sign that it's actually different from 0. So that's basically the kind of output. Everybody sees what I mean by that?

What I'm trying to emphasize here is that those things are completely routine when you run linear regression. People stuff in a lot of variables--even if you have 200 observations, you're going to stuff in maybe 20 variables, p equals 20. That's still a big number when you try to interpret what's going on, and it's nice if you can actually trim some fat out. And the problem is that when you start doing this test, and then this one, and then this one, and then this one, the probability that you make a mistake in each test--the probability that you erroneously reject the null here--is 5%. Here it's 5%. Here it's 5%. Here it's 5%. And at some point, if things happen with a 5% chance and you keep doing them over and over again, they're going to start to happen. So you can see that if you start repeating those tests, you might not be at a 5% error overall.

And what do you do to prevent that? If you want to test all those beta j's simultaneously, you have to do what's called the Bonferroni correction. The Bonferroni correction follows from what's called a union bound. If you're a computer scientist, you're very familiar with it. If you're a mathematician, that's essentially the third axiom of probability: the probability of a union is less than or equal to the sum of the probabilities. That's the union bound, and of course you can generalize it to more than 2 events. And that's exactly what we're doing here.

So let's see how we would perform the Bonferroni correction to control the probability of error when we test that they're all equal to 0 at the same time. So recall, I want to test H0: beta j is equal to 0 for all j in some subset S--think of S included in {1, ..., p}. You can take it to be all of {1, ..., p} if you want; it really doesn't matter.
PHILIPPE RIGOLLET: S is something that's given to you. Maybe you want to test a subset of them, maybe you want to test all of them. Versus H1: beta j is not equal to 0 for some j in S. That's a test that tests all of these things at once. And if you actually look at this table all at once, implicitly you're performing this test for all of the rows, for S equal to {1, ..., p}. You will do that--whether you like it or not, you will.

So now let's look at what the probability of type I error looks like. So I want the probability of type I error, that is, the probability computed when H0 is true. Let me call psi_j the indicator that the absolute value of beta hat j over sigma hat square root of gamma_j exceeds q_{alpha/2} of t_{n-p}. We know those are the tests I perform; I just add this extra index j to say that I'm testing the j-th coefficient. So what I want is the probability, under the null--so when those coefficients are all equal to 0--that I reject in favor of the alternative for at least one of them. So that's: psi_1 is equal to 1, or psi_2 is equal to 1, all the way to psi_p is equal to 1--let's just say S is the entire set, because the general case is annoying to write; you can check the slide if you want it done more generally. Everybody agrees that this is the probability of type I error? Either I reject this one, or this one, or this one, or this one; and that's exactly the event that I reject at least one of them. So this is the probability of type I error, and what I want is to keep this guy less than alpha.

But what I know how to control is the probability that this guy makes an error, that this guy makes an error, that this guy makes an error--individually, each less than alpha. In particular, if all these events were disjoint, then this probability would really be the sum of all the individual probabilities. So in the worst case--if the event psi_j equals 1 intersected with the event psi_k equals 1 is the empty set, so those are disjoint sets; you've seen this terminology in probability--if those sets are disjoint for all j different from k, then this probability, call it star, is equal to the probability under H0 that psi_1 is equal to 1, plus ..., plus the probability under H0 that psi_p is equal to 1. Now, if I run each test with this alpha, then each of these probabilities is equal to alpha. So the probability of type I error is actually not equal to alpha. It's equal to?

AUDIENCE: p alpha.

PHILIPPE RIGOLLET: p alpha. So what is the solution here? Well, it's to run those tests not with alpha, but with alpha over p. And if I do this, then this guy is equal to alpha over p, this guy is equal to alpha over p, and when I sum those things, I get p times alpha over p, which is just alpha. So all I do is, rather than running each of the tests with probability of error alpha, I run each test at level alpha over p. That's actually very stringent.
PHILIPPE RIGOLLET: If you think about it for one second: even if you have only 5 variables--p equals 5--and you wanted to do your tests at 5%, it forces you to do the test at 1% for each of those variables. If you have 10 variables, that starts to be very stringent. So it's going to be harder and harder for you to conclude in favor of the alternative.

Now, one thing I need to tell you is that here I said: if the events are disjoint, then those probabilities are equal. But if they are not disjoint, the union bound tells me that the probability of the union is less than the sum of the probabilities. And so then I'm not exactly equal to alpha, but I'm bounded by alpha. And that's why people are not super comfortable with the Bonferroni correction: in reality, you never think that those tests are going to give you completely disjoint events. I mean, why would it be that if this guy is equal to 1, then all the other ones are equal to 0? Why would that make any sense? So this is definitely conservative, but the problem is that we don't know how to do much better. There is a formula that gives you the probability of the union as some crazy sum over all the intersections--it's the generalization of P(A or B) = P(A) + P(B) minus the probability of the intersection--but if you start doing this for more than 2 events, it's super complicated. The number of terms grows really fast. And most importantly, even if you go that route, you still need to control the probabilities of the intersections, and those tests are not necessarily independent. If they were independent, that would be easy--the probability of the intersection would be the product of the probabilities--but those things are super correlated, so it doesn't really help.

And so we'll see, when we talk about high-dimensional statistics towards the end, that there's something called the false discovery rate, which is essentially saying: listen, if I really define my probability of type I error as this--I want to make sure I never make this kind of error--I'm doomed; it's just not going to happen. But I can revise what my goals are in terms of the errors I make, and then I will actually be able to do something. What people look at there is the false discovery rate. What we're controlling here is called the family-wise error rate, which is a stronger thing to control.

So this trick, which consists of replacing alpha by alpha over the number of times you're going to perform your test--alpha over the number of terms in your union--is called the Bonferroni correction. And that's something you use when you have what's called--another keyword here--multiple testing, when you're trying to do multiple tests simultaneously. And if S is not all of {1, ..., p}, you just divide by the number of tests you're actually making. So if S is of size k for some k less than p, you divide alpha by k and not by p, of course. I mean, you can always divide by p, but you'd be making your life harder for no reason.
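A minimal sketch of the Bonferroni-corrected version of the same per-coefficient tests, with the simulated quantities from the earlier sketches assumed in scope; the choice of S here (all coefficients) is just for illustration.

```python
import numpy as np
from scipy import stats

# Assumed in scope from the earlier sketches: X, y, beta_hat, sigma2_hat.
n, p = X.shape
alpha = 0.05
S = range(p)                                   # test every coefficient simultaneously

gamma = np.diag(np.linalg.inv(X.T @ X))
t_stats = beta_hat / np.sqrt(sigma2_hat * gamma)

# Each individual test is run at level alpha / |S|, so the family-wise error rate stays <= alpha.
alpha_corrected = alpha / len(S)
q = stats.t.ppf(1 - alpha_corrected / 2, df=n - p)
reject = {j: abs(t_stats[j]) > q for j in S}
print(reject)
```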
PHILIPPE RIGOLLET: Any question about the Bonferroni correction?

So one thing that is maybe not as obvious as the test of beta j equal to 0 versus beta j not equal to 0--and in particular, it's not going to come up as a software output without you requesting it, whereas the previous one is so standard that it just comes out--is that there are other tests you might think of that are more complicated and more tailored to your particular problem. Those tests are of the form: G times beta is equal to some lambda.

So let's see: the test we've just done, beta j equals 0 versus beta j not equal to 0, is actually equivalent to e_j transpose beta equals 0 versus e_j transpose beta not equal to 0. That was our claim. But now I don't have to stop here. I don't have to multiply by a vector and test whether it's equal to 0. I can replace this by some general matrix G, and replace the right-hand side by some general vector lambda. And I'm not telling you what the dimensions are, because they're general. Take your favorite matrix, as long as the right side of the matrix can multiply beta, and take lambda with as many entries as the number of rows of G, and then you can do that. I can always formulate this test.

What will this test encompass? Well, those can be kind of weird tests. You can think of things like: I want to test whether beta 2 plus beta 3 is equal to 0, for example. Maybe I want to test whether beta 5 minus 2 beta 6 is equal to 23. Well, that's weird. But why would you want to test whether beta 2 plus beta 3 is equal to 0? Maybe you already know that the effect of some gene is not 0--you know this gene affects this trait--but you want to know whether the effect of this gene is canceled by the effect of that gene. That's the kind of thing you would be testing with this. Now, the other one is much more artificial, and I don't have a bedtime story to tell you around it. So those things can happen and can be much more complicated.

Now, notice that the matrix G has one row in both of those examples. But if I want to test whether those two things happen at the same time, then I can take a matrix with both rows. Another matrix that can be useful is G equal to the identity of R^p and lambda equal to 0. What am I doing in this case? What is this test testing? Yeah?

AUDIENCE: Whether or not beta is 0.

PHILIPPE RIGOLLET: Yeah, we're testing whether the entire vector beta is equal to 0, because G times beta is equal to beta, and we're asking whether it's equal to 0. The thing is, when you test whether beta is equal to 0, you're testing whether your entire model--everything you're doing in life--is just junk. This is telling you: actually, forget about y is X beta plus epsilon; y is really just epsilon. There's nothing. There's just some big noise with some big variance, and there's nothing else. So it turns out that the statistical software output I wrote here spits out an answer to this question. The last line, usually, is doing this test.
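For concreteness, this is how the constraint pairs (G, lambda) from these examples might be written down; the coefficient indices and the value p = 8 are illustrative assumptions, with coefficients indexed from 0 here.

```python
import numpy as np

p = 8  # illustrative number of coefficients, indexed 0..p-1 here

# "beta_2 + beta_3 = 0" written as one row g with g @ beta = lambda, lambda = 0
g_sum = np.zeros(p)
g_sum[[2, 3]] = 1.0

# "beta_5 - 2 beta_6 = 23" as another single row, with lambda = 23
g_diff = np.zeros(p)
g_diff[5], g_diff[6] = 1.0, -2.0

# Testing both constraints at once: stack the rows into a matrix G, and the right sides into lambda
G_both = np.vstack([g_sum, g_diff])
lam_both = np.array([0.0, 23.0])

# Testing whether the whole model is "junk": G is the identity of R^p and lambda = 0
G_all = np.eye(p)
lam_all = np.zeros(p)
```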
PHILIPPE RIGOLLET: So that last test asks: does your model even make sense? And it's probably there for people to check whether they actually just mixed up their two data sets. Maybe they're trying to predict--I don't know--some credit score from genomic data, and they just want to make sure that's not the wrong thing.

So it turns out that the machinery is exactly the same as the one we've just gone through. We start from here: beta hat was equal to beta plus this Gaussian. And the first thing we did before was to say, well, beta hat j is equal to this thing, because beta j is just e_j transpose beta. So rather than taking e_j here, let me just take G. Now, we said that for any vector--well, that part was trivial. The thing we need to know is: what is this thing? Well, this thing here--what is this guy? It's also normal, and the mean is 0. Again, that's just using properties of Gaussian vectors. And what is the covariance matrix? Let's call that covariance Sigma, so that you can formulate an answer. What is the covariance of G times some Gaussian N(0, Sigma)?

AUDIENCE: G Sigma G transpose.

PHILIPPE RIGOLLET: G Sigma G transpose, right. So here that's G (X^T X)^{-1} G transpose, times sigma squared. Now, I'm not going to be able to go much further. Before, I made this very acute observation that e_j transpose a matrix times e_j is the j-th diagonal element. Now, if I have a general matrix G, the price to pay is that I cannot shrink this thing any further, because I'm trying to be abstract. And so I'm almost there. The only thing that happened last time is that when this was e_j, we knew that e_j transpose beta was equal to 0 under the null. But under the null, what is G beta equal to?

AUDIENCE: Lambda.

PHILIPPE RIGOLLET: Lambda, which I know. I mean, I wrote my hypothesis. In the couple of instances I just showed you, including the one on top, lambda was equal to 0, but in general it can be any lambda. What's key about this lambda is that I actually know it--that's the hypothesis I'm formulating. So now I have to be a little more careful when I build the distribution of G beta hat: I need to subtract this lambda. So we go from this and say: G beta hat minus lambda follows some N(0, sigma squared G (X^T X)^{-1} G transpose).

So that's true. Let's go straight to the case where we don't know what sigma is. So what I'm going to be interested in is G beta hat minus lambda divided by sigma hat, and that's going to follow some Gaussian that has this G (X^T X)^{-1} G transpose in it. So now, what did I do last time? So clearly, the quantiles of this distribution are--well, OK, what is the size of this distribution? I need to tell you what G is--what did I take here?

AUDIENCE: It should be divided by sigma, not sigma hat.

PHILIPPE RIGOLLET: Oh yeah, you're right. So let me write it like this--with sigma hat over sigma. So let's forget about the size of G for now.
PHILIPPE RIGOLLET: Let's just think of any general G. When G was a vector, what was nice is that this covariance was just a scalar, just one number--we called it gamma_j--and if I wanted to get rid of it on the right-hand side, all I had to do was divide by square root of gamma_j, and it would be gone. Now I have a matrix. So I need to get rid of this matrix somehow, because clearly the quantiles of this distribution are not going to be written in the back of a book for any value of G and any value of X. I need to standardize before I can read anything out of a table.

So how do we do it? Well, we form this quantity. So here's the claim--again, another claim about Gaussian vectors. If Z follows some N(0, Sigma), then Z transpose Sigma inverse Z follows a chi squared. And the number of degrees of freedom depends on the dimension here: if this is a k-dimensional Gaussian vector, this is a chi squared with k degrees of freedom. Where have we used that before? Yeah?

AUDIENCE: Wald's test.

PHILIPPE RIGOLLET: Wald's test, that's exactly where we used it. Wald's test had a chi squared that was showing up, and the way we made it show up was by taking the asymptotic variance and taking its inverse--which, in that framework, was called--

AUDIENCE: Fisher.

PHILIPPE RIGOLLET: Fisher information. And then we pre- and postmultiplied by this thing. So this is the key. And now it tells me exactly, when I start from this guy that has this multivariate Gaussian distribution, how to turn it into something that has a distribution which is pivotal. Chi squared with k degrees of freedom is completely pivotal; it does not depend on anything I don't know.

The way I go from here is by saying: now I look at (G beta hat minus lambda) transpose, times the inverse of the matrix over there--so that's [G (X^T X)^{-1} G transpose]^{-1}--times (G beta hat minus lambda), and here I need to divide by sigma squared. This guy follows a chi squared with k degrees of freedom, if G is k times p. What I mean here is that it's the same k: the k that shows up is the number of constraints that I have in my test.

So now, if I go from here to using sigma hat, the key thing to observe is that this guy is no longer a Gaussian, so I'm not going to have a Student's t-distribution showing up. What happens is that if I take the same thing and just go from sigma to sigma hat, then this thing is of the form: this chi squared with k degrees of freedom, divided by the chi squared that shows up in the denominator of the t-distribution, which is a chi squared with n minus p degrees of freedom divided by n minus p. So that's the same denominator that I saw in my t-test. The numerator has changed, though: the numerator is now this chi squared and no longer a Gaussian.
PHILIPPE RIGOLLET: But this distribution is actually pivotal, as long as we can guarantee that there's no hidden parameter in the correlation between the two chi squareds. So again, as with all statements of independence in this class, I will just give it to you for free. Those two things, I claim--OK, let's say admit--are independent.

We're almost there. This could be a distribution that's pivotal, but there's something a little unbalanced about it: this guy is divided by its number of degrees of freedom, but that guy is not divided by its number of degrees of freedom. So we just make the extra step: divide this guy by k, so that both the numerator and the denominator are chi squareds divided by their degrees of freedom. It doesn't change anything--I've just divided by a fixed number--but it just looks more elegant: it is the ratio of two independent chi squareds that are individually divided by their numbers of degrees of freedom.

And this has a name: it's called a Fisher or F-distribution. So unlike William Gosset, who was not allowed to use his own name and used the name Student, Fisher was allowed to use his own name, and that's called the Fisher distribution. The Fisher distribution has 2 parameters, a pair of degrees of freedom--one for the numerator and one for the denominator. So F, for Fisher distribution: F is equal to the ratio of a chi squared p over p and a chi squared q over q, and that's F_{p,q}, where the two chi squareds are independent. Is it clear what I'm defining here? So this is basically what plays the role of the t-distribution when you're testing more than one parameter at a time. You basically replace the normal that was in the numerator by a chi squared, because you're testing whether two vectors are simultaneously close, and the way you do it is by looking at their squared norm. That's how the chi squared shows up.

Quick remark--are those things really very different? How can I relate a chi squared with a t-distribution? Well, if t follows, say, a t with q degrees of freedom, that means that t is some N(0, 1) divided by the square root of a chi squared q over q. That's the distribution of t. So if I look at the distribution of t squared, that's the square of some N(0, 1) divided by a chi squared q over q. Agreed? I just removed the square root and took the square of the Gaussian. But what is the distribution of the square of a standard Gaussian?

AUDIENCE: Chi squared with 1 degree.

PHILIPPE RIGOLLET: Chi squared with 1 degree of freedom. And in particular, it's also a chi squared with 1 degree of freedom divided by 1. So t squared, in the end, has an F-distribution with 1 and q degrees of freedom. So those two things are actually very similar.
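Putting the pieces together, here is a sketch of the resulting test of H0: G beta = lambda, with the simulated X, y, beta_hat, sigma2_hat from the first sketch assumed in scope and a hypothetical pair of constraints; the rejection rule anticipates the one-sided q_alpha threshold discussed just below, and the last line checks numerically that the squared t quantile matches the F(1, q) quantile.

```python
import numpy as np
from scipy import stats

# Assumed in scope from the first sketch (all simulated): X, y, beta_hat, sigma2_hat.
n, p = X.shape
alpha = 0.05

# Hypothetical constraints, just for illustration: beta_1 = 0 and beta_2 + beta_3 = 0.
G = np.array([[0.0, 1.0, 0.0, 0.0],
              [0.0, 0.0, 1.0, 1.0]])
lam = np.zeros(2)
k = G.shape[0]                                 # number of constraints

# Quadratic form (G beta_hat - lam)^T [G (X^T X)^{-1} G^T]^{-1} (G beta_hat - lam)
diff = G @ beta_hat - lam
cov = G @ np.linalg.inv(X.T @ X) @ G.T
quad = diff @ np.linalg.solve(cov, diff)

# F-statistic: (chi^2_k / k) over (chi^2_{n-p} / (n-p)); sigma2_hat already carries the 1/(n-p).
F = quad / (k * sigma2_hat)

# Reject for large values only (the statistic is nonnegative), using the alpha quantile of F_{k, n-p}.
reject = F > stats.f.ppf(1 - alpha, dfn=k, dfd=n - p)
print(F, reject)

# Sanity check of the remark above: the squared alpha/2 quantile of t_q equals the alpha quantile of F_{1,q}.
q_df = n - p
print(stats.t.ppf(1 - alpha / 2, df=q_df) ** 2, stats.f.ppf(1 - alpha, dfn=1, dfd=q_df))
```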
PHILIPPE RIGOLLET: The only other thing that changes is that, since we typically look at absolute values of t when we do our tests, it's going to be exactly the same thing: the quantiles of one are essentially the square roots of the quantiles of the other. So if my test is: psi is equal to the indicator that |t| exceeds q_{alpha/2} of t_q, for example, then it's equal to the indicator that t squared exceeds the square of q_{alpha/2}--because I had the absolute value--which is equal to the indicator that t squared is greater than q_alpha of an F_{1,q}. So in a way, those two things belong to the same family. They really are a natural generalization of each other--at least, the F-test is a generalization of the t-test.

And so now I can perform my test just as it's written here: I form this statistic, and then I compare it against the quantile of an F-distribution. Notice there's no absolute value--oh yeah, I forgot, this is actually q_alpha, not q_{alpha/2}, because the F-statistic is already positive. So I'm not going to look at left and right; I'm just going to look at whether it's too large or not. That's by definition. And you can check: if you look at a table for Student and a table for F_{1,q}, you're going to have to move from one column to the other, from alpha over 2 to alpha, but one is going to be the square root of the other, just like the chi squared is the square of the Gaussian. If you look at the chi squared with 1 degree of freedom, you will see the same relation to the Gaussian.

So I'm actually going to start with the last remark on this slide, because you've been asking a few questions about why my design is deterministic. There are many answers; some are philosophical. But one of them is everything you cannot do if you don't condition on X: all of the statements that we made here--for example, just the fact that this is a chi squared--if those x's start to be random variables, then it's clearly not going to be a chi squared. It cannot be a chi squared both when those guys are deterministic and when they are random; things change. So that's maybe a [INAUDIBLE] check statement. But I think the one that really matters is this: remember, when we did the t-test, we had this gamma_j that showed up. Gamma_j was playing the role of the variance. And so far we haven't thought of the variance as a random variable--we'll talk about this in the Bayesian setup--so here, your x's really are the parameters of your data. The diagonal elements of (X^T X)^{-1} actually tell you what the variance is. That's also one reason why you should think of your x's as deterministic numbers. They are, in a way, things that change the geometry of your problem. They just say: let me look at the problem from the perspective of X.
PHILIPPE RIGOLLET: Actually, for that matter, we didn't really spend much time commenting on the effect of X on gamma. So remember, gamma_j was the variance parameter. So we should try to understand which X's lead to big variance and which X's lead to small variance. That would be nice. Well, if this is the identity matrix--let's say the identity over n, which is the natural thing to look at, because we want this thing to scale like 1/n--then this is just 1/n. We're back to the original case. Yes?

AUDIENCE: Shouldn't that be the inverse?

PHILIPPE RIGOLLET: Yeah, thank you--the inverse, yes. So if X transpose X is n times the identity, then the inverse is the identity over n. In that case gamma_j is equal to 1/n, and we're back to the theta hat, theta case--the basic one-dimensional thing.

Now, what does it mean for a matrix--forget about the scaling by n right now; that's just a matter of scaling, and I can always rescale my x's so that this factor shows up--what does it mean when X transpose X is the identity? How do I call such a matrix?

AUDIENCE: Orthonormal?

PHILIPPE RIGOLLET: Orthogonal, yeah--orthonormal or orthogonal. So you call this an orthogonal matrix. And when it's an orthogonal matrix, what it means is--so the entries of the matrix X transpose X are the inner products between the columns of X. That's what's happening; you can write it out, and you will see that the entries of this matrix are these inner products. If it's the identity, that means you get some 1's on the diagonal and a bunch of 0's, which means that the inner product between any two different columns is actually 0. What it means is that the columns of X form an orthonormal basis for your space. So they're basically as far from each other as they can be.

Now, if I start making those columns closer and closer to each other, then I'm starting to have some issues. X transpose X is not going to be the identity; I'm going to start to see some non-zero off-diagonal entries. And even if the columns all remain of norm 1, when I take the inverse of this matrix, the diagonal elements of the inverse are going to start to blow up. And so what that means is that the variance is going to blow up. That's essentially telling you that if you get to choose your x's, you want to take them as orthogonal as you can. But if you don't, then you just have to deal with it, and it will have a significant impact on your estimation performance.
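A small simulation of this effect (all values are illustrative assumptions): two unit-variance columns whose correlation rho increases, and the diagonal of (X^T X)^{-1}, i.e. the gamma_j's, blowing up as the columns stop being orthogonal.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200

def gamma_diag(rho):
    """Diagonal of (X^T X)^{-1} for a design whose two columns have correlation about rho."""
    z = rng.normal(size=(n, 2))
    x1 = z[:, 0]
    x2 = rho * z[:, 0] + np.sqrt(1 - rho ** 2) * z[:, 1]   # correlated second column
    X = np.column_stack([x1, x2])
    return np.diag(np.linalg.inv(X.T @ X))

for rho in [0.0, 0.5, 0.9, 0.99]:
    print(rho, gamma_diag(rho))   # the gamma_j's are near 1/n when rho = 0 and blow up as rho -> 1
```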
PHILIPPE RIGOLLET: And that's also why, routinely, statistical software is going to spit out this value for you--well, actually, the square root of this value. It's going to tell you, essentially, how much randomness, how much variation you have in this particular parameter that you're estimating. So if gamma_j is large, you're going to have wide confidence intervals and your tests are not going to reject very much. And that's all captured by X. That's what's important--all of this is completely captured by X. Then, of course, there was the sigma squared that showed up here; actually, it was here, even in the definition of gamma_j, and I forgot it--what is the sigma squared police doing? So this thing was there as well, and that's just exogenous; it comes from the noise itself. But there was this huge factor that came from the x's themselves.

So let's go back now to reading this list in a linear fashion. You're MIT students; you've probably heard that correlation does not imply causation many times. Maybe you don't know what it means--if you don't, that's OK, you just have to know the sentence. No--what it means is that it's not because I decided that something was going to be the x and something else was going to be the y that whatever coefficient I'm getting means that x implies y. For example, even if I do genetics, or genomics, or whatever, I implicitly assume that my genes are going to have an effect on my outward look. It could be the opposite--I mean, who am I to say? I'm not a biologist; I haven't opened a biology book in 20 years. So maybe, if I start hitting my head with a hammer, I'm going to change my genetic material. Probably not--but causation definitely does not come from statistics. It's not coming from there.

So actually, I remember, once, I put an exam to students, and there was an old data set on crime--from Chicago in the '60s, I think. So the y variable was just the rate of crime, and the x's were a bunch of things, and one of them was police expenditures. And if you ran the regression, you would find that the coefficient in front of police expenditure was a positive number, which means that if you increase police expenditures, that increases the crime--I mean, that's what it means to have a positive coefficient. Everybody agrees with this reading? If beta_j is 10, then it means that if I increase my police expenditure by $1, I [INAUDIBLE] my crime by 10, everything else being kept equal. Well, there were, I think, about 80% of the students who were able to explain to me why, if you give more money to the police, the crime is going to rise. Some people were like, well, the police are making too much money, and they don't think about their work, and they become lazy. I mean, people were really coming up with some crazy things. And what it just meant is that, no, it's not causation. It's just that if you have more crime, you give more money to your police. That's what's happening, and that's all there is. So just be careful when you draw conclusions: causation is a very important thing to keep in mind. And in practice, you need external sources of reasoning for causality--for example, with genetic material and physical traits we agree on what the direction of the arrow of causality is, but there are places where you might not.

Now, finally, the normality of the noise. Everything we did today required a normal, Gaussian distribution on the noise. I mean, it's everywhere: there's some Gaussian, there's some chi squared; everything came out of the Gaussian. And for that, we needed this basic formula for inference, which we derived from the fact that the noise was Gaussian itself. If we did not have that, the only thing we could write is that beta hat is this number, or this vector. We would not be able to say what the fluctuations of beta hat are. We would not be able to do tests. We would not be able to build, say, confidence regions or anything. And so this is an important condition that we need, and that's what statistical software assumes by default. But we now have a recipe for how to test it. We can do it visually, if we really want to conclude that, yes, this is Gaussian, using our normal Q-Q plots. And we can also do it using our favorite tests. What test should I be using to test that? With two names? Yeah?

AUDIENCE: Normal [INAUDIBLE].

PHILIPPE RIGOLLET: Not the two Russians. I want a Russian and a Scandinavian person for this one. What's that?

AUDIENCE: Lillie-something?

PHILIPPE RIGOLLET: Yeah, Lillie-something. So Kolmogorov Lillie-something test. [LAUGHS] It's the Kolmogorov-Lilliefors test. Because I'm testing whether the noise is Gaussian, and I'm actually not making any assumption about the variance--I don't need to know what the variance is. The mean is 0; we saw that at the beginning. It's 0 by construction, so we don't actually need to think about the mean being 0 itself--it just happens to be 0. So we know that it's 0, but the variance we don't know. So we just want to know whether the noise belongs to the family of Gaussians, and so we need Kolmogorov-Lilliefors for that. And that's also one of the things that statistical software spits out by default. When you run a linear regression, it actually spits out both Kolmogorov-Smirnov and Kolmogorov-Lilliefors, probably contributing to the widespread use of Kolmogorov-Smirnov when you really shouldn't use it.

So next time, we will talk about more advanced topics on regression, but I think I'm going to stop here for today. So again, tomorrow, sometime during the day, at least before the recitation, you will have a list of practice exercises that will be posted. And if you go to the optional recitation, you will have someone solving them.
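Finally, a sketch of the two diagnostics just mentioned, applied to the residuals from the first simulated sketch; the Lilliefors routine here is assumed to come from statsmodels (statsmodels.stats.diagnostic.lilliefors), which is an assumption about the environment rather than anything from the lecture.

```python
import numpy as np
from scipy import stats

# residuals is assumed in scope from the first sketch.

# Visual check: the points of a normal Q-Q plot should lie close to a straight line.
(osm, osr), (slope, intercept, r) = stats.probplot(residuals, dist="norm")

# Formal check: Kolmogorov-Lilliefors test for normality with estimated variance.
# Note: a plain Kolmogorov-Smirnov test against a fitted normal is not a substitute,
# since its critical values assume the parameters were not estimated from the data.
from statsmodels.stats.diagnostic import lilliefors
stat, p_value = lilliefors(residuals, dist="norm")
print(stat, p_value)
```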
72:19 And I mean, people were really coming up 72:20 with some crazy things. 72:22 And what it just meant is that, no, it's not causation. 72:26 It's just, if you have more crime, 72:28 you give more money to your police. 72:29 That's what's happening. 72:31 And that's all there is. 72:33 So just be careful when you actually 72:35 draw some conclusions that causation is a very important 72:38 thing to keep in mind. 72:39 And in practice, unless you have external sources of reason 72:43 for causality-- for example, genetic material 72:45 and physical traits, we agree upon what 72:52 the direction of the arrow of causality is here. 72:54 There's places where you might not. 72:57 Now, finally, the normality on the noise-- 72:59 everything we did today required normal Gaussian distribution 73:04 on the noise. 73:05 I mean, it's everywhere. 73:07 There's some Gaussian, there's some chi squared. 73:09 Everything came out of Gaussian. 73:11 And for that, we needed this basic formula 73:13 for inference, which we derived from the fact 73:15 that the noise was Gaussian itself. 73:18 If we did not have that, the only thing we could write 73:20 is, beta hat is this number, or this vector. 73:24 We would not be able to say, the fluctuations of beta hat 73:27 are this guy. 73:28 We would not be able to do tests. 73:30 We would not be able to build, say, 73:31 confidence regions or anything. 73:34 And so this is an important condition that we need, 73:38 and that's what statistical software assumes by default. 73:40 But we now have a recipe on how to do those tests. 73:44 We can do it either visually, if we really 73:47 want to conclude that, yes, this is Gaussian, 73:49 using our normal Q-Q plots. 73:51 And we can also do it using our favorite tests. 73:54 What test should I be using to test that? 73:56 74:01 With two names? 74:03 Yeah? 74:04 AUDIENCE: Normal [INAUDIBLE]. 74:06 PHILIPPE RIGOLLET: Not the 2 Russians. 74:08 So I want a Russian and a Scandinavian person 74:10 for this one. 74:12 What's that? 74:13 AUDIENCE: Lillie-something? 74:14 PHILIPPE RIGOLLET: Yeah, Lillie-something. 74:16 So Kolmogorov Lillie-something test. 74:18 And [LAUGHS] so it's the Kolmogorov Lilliefors test. 74:23 And because I'm testing if there Gaussian, and I'm actually 74:26 not really making any-- 74:28 I don't need to know what the variance is. 74:30 The mean is 0. 74:31 We saw that at the beginning. 74:32 It's 0 by construction, so we actually 74:34 don't need to think about the mean being 0 itself. 74:37 This just happens to be 0. 74:38 So we know that it's 0, but the variance, we don't know. 74:41 So we just want to know if it belongs 74:42 to the family of Gaussians, and so we need to Kolmogorov 74:45 Lilliefors for that. 74:46 And that's also one of the thing that's spit out by statistical 74:49 software by default. When you run a linear regression, 74:52 actually, it spits out both Kolmogorov-Smirnov 74:54 and Kolmogorov Lilliefors, probably contributing 74:59 to the widespread use of Kolmogorov-Smirnov when you 75:01 really shouldn't. 75:03 So next time, we will talk about more advanced topics 75:08 on regression. 75:09 But I think I'm going to stop here for today. 75:11 So again, tomorrow, sometime during the day, 75:14 at least before the recitation, you 75:16 will have a list of practice exercises that will be posted. 75:20 And if you go to the optional recitation, 75:23 you will have someone solving them 75:26