https://www.youtube.com/watch?v=4HRhg4eUiMo&list=PLUl4u3cNGP60uVBMaoNERc6knT_MgPKS0&index=8 Transcript 00:00 The following content is provided under a Creative 00:02 Commons license. 00:03 Your support will help MIT OpenCourseWare 00:06 continue to offer high-quality educational resources for free. 00:10 To make a donation or to view additional materials 00:12 from hundreds of MIT courses, visit MIT OpenCourseWare 00:16 at ocw.mit.edu. 00:19 PHILIPPE RIGOLLET: We're talking about tests. 00:21 And to be fair, we spend most of our time 00:24 talking about new jargon that we're using. 00:28 The main goal is to take a binary decision, yes or no. 00:31 So just so that we're clear and we make sure that we all 00:36 speak the same language, let me just 00:37 remind you what the key words are for tests. 00:42 So the first thing is that we split theta 00:48 into theta 0 and theta 1. 00:53 Both are included in theta, and they are disjoint. 00:57 01:00 So I have my set of possible parameters. 01:04 And then I have theta 0 is here, theta 1 is here. 01:10 And there might be something that I leave out. 01:14 And so what we're doing is, we have two hypotheses. 01:18 So here's our hypothesis testing problem. 01:20 And it's h0, theta belongs to theta 0, versus h1, theta 01:27 belongs to theta 1. 01:29 This guy was called the null, and this guy 01:32 was called the alternative. 01:36 And why we give them special names 01:37 is because we saw that they have an asymmetric role. 01:41 The null represents the status quo, 01:43 and data is here to bring evidence against this guy. 01:46 And we can really never conclude that h0 is true 01:49 because all we could conclude is that h1 is not true, or may not 01:56 be true. 01:59 So that was the first thing. 02:00 The second thing was the hypotheses. 02:03 The third thing is, what is a test? 02:05 Well, psi, it's a statistic, and it takes the data, 02:17 and it maps it into 0 or 1. 02:21 And I didn't really mention it, but there's something 02:24 called randomized tests, which is, well, 02:27 if I cannot really make a decision, 02:28 I might as well flip a coin. 02:30 That tends to be biased, but that's really-- 02:32 I mean, think about it in practice. 02:34 You probably don't want to make decisions 02:36 based on flipping a coin. 02:37 And so what people typically do-- 02:39 this is happening, typically, at one specific value. 02:41 So rather than flipping a coin for this very specific value, 02:44 what people typically do is they say, 02:45 OK, I'm going to side with h0 because that's the most 02:48 conservative choice I can make. 02:50 So in a way, they think of flipping this coin, 02:52 but always falling on heads, say. 02:55 So associated to this test was something called, well, 02:58 the rejection region r psi, which 03:05 is just the set of data x1 xn such that psi of x1 xn 03:15 is equal to 1. 03:16 So that means we rejected h0 when the test is 1. 03:19 And those are the set of data points 03:21 that actually are going to lead me to reject the null. 03:25 03:28 And then the things that are actually, slightly, 03:30 a little more important and really peculiar to tests, 03:35 specific to tests, were the type I and type II errors. 03:40 So the type I error arises when-- 03:44 so type I error is when you reject, whereas h0 is correct. 04:01 And the type II error is the opposite, 04:06 so it's failing to reject, whereas h1 is correct-- 04:17 h1 is correct, yeah. 04:20 So those are the two types of errors you can make.
04:23 And we quantified the probability of type I error. 04:26 So alpha psi is the probability-- 04:31 so that's the probability of type I error. 04:38 04:41 So alpha psi is just the probability under theta that psi rejects, 04:49 and that's defined for theta in theta 0, 04:54 so for different values of theta in theta 0. 04:56 So h0 being correct means there exists a theta in theta 0 05:00 for which that actually is the right distribution. 05:03 So for different values of theta, 05:05 I might make different errors. 05:07 So if you think, for example, about the coin example, 05:12 I'm testing if the coin is biased towards heads 05:16 or biased towards tails. 05:18 So if I'm testing whether p is larger 05:21 than 1/2 or less than 1/2, then when the true p-- let's 05:25 say our h0 is larger than 1/2. 05:27 When p is equal to 1, it's actually very difficult for me 05:29 to make a mistake, because I only see heads. 05:33 So when p is getting closer to 1/2, 05:35 I'm going to start making more and more probability of error. 05:38 And so the type II error-- so that's the probability of type 05:42 II-- 05:43 is denoted by beta psi. 05:46 And it's the function, well, that does the opposite 05:50 and, this time, is defined for theta in theta 1. 05:58 And finally, we define something called the power, pi of psi. 06:13 And this time, this is actually a number. 06:16 And so this number is equal to the maximum over theta in theta 06:23 0. 06:23 I mean, that could be a supremum, but think of it 06:25 as being a maximum of p theta of psi is equal to 1-- 06:32 sorry, that's theta 0, right? 06:37 Give me one sec. 06:39 No, sorry, that's the min, over theta in theta 1. 06:42 06:46 So this is not making a mistake. 06:48 Theta is in theta 1. So if theta is in theta 1 06:52 and I conclude 1, so this is a good thing. 06:55 I want this number to be large. 06:56 And I'm looking at the worst case-- 06:58 what is the smallest value this number can be? 07:02 So what I want to show you a little bit is a picture. 07:06 07:09 So now I'm going to take theta, and think of it as being a p. 07:12 So I'm going to take p as the parameter in the coin experiment. 07:18 So p can range between 0 and 1, that's for sure. 07:20 07:23 And what I'm going to try to test 07:24 is whether p is less than 1/2 or larger than 1/2. 07:30 So this is going to be, let's say, theta 0. 07:34 And this guy here is theta 1. 07:37 Just trying to give you a picture of what those guys are. 07:40 So I have my y-axis, and now I'm going to start drawing numbers. 07:46 All these things-- this function, 07:48 this function, and this number-- are 07:51 all numbers between 0 and 1. 07:52 07:56 So now I'm claiming that-- 07:59 so when I move from left to right, 08:03 what is my probability of rejecting going to do? 08:08 So what I'm going to plot is the probability under theta. 08:11 The first thing I want to plot is the probability under theta 08:14 that psi is equal to 1. 08:19 And let's say psi-- 08:20 think of psi as being just this indicator 08:25 that square root of n times xn bar minus p over square root of xn 08:35 bar 1 minus xn bar is larger than some constant c, 08:40 for an appropriately chosen c. 08:43 So we choose c in such a way that, at 1/2, 08:48 when we're testing for 1/2, what we 08:50 wanted was this number to be equal to alpha, basically. 08:56 So we fix this alpha number so that this guy-- 09:00 so if I want alpha of psi of theta less than alpha 09:09 given in advance-- 09:12 so think of it as being equal to, say, 5%.
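For readers following along, here is a small simulation sketch of the curve being described, the probability of rejection as a function of the true p, for the one-sided coin test. It is not part of the lecture; the sample size n = 100, the number of Monte Carlo repetitions, and the one-sided 5% threshold 1.645 are all assumptions made for illustration.

```python
# Sketch: estimate p |-> P_p(psi = 1) for the one-sided coin test H0: p <= 1/2 vs H1: p > 1/2.
# On theta 0 this curve is alpha_psi(p); on theta 1 it is 1 - beta_psi(p).
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
n, alpha = 100, 0.05                    # assumed sample size and type I error budget
c = norm.ppf(1 - alpha)                 # ~1.645, chosen so that alpha_psi(1/2) ~ 5%

def psi(x):
    """Reject H0 when the studentized mean exceeds c."""
    xbar = x.mean()
    if xbar == 0.0 or xbar == 1.0:      # degenerate denominator: all tails / all heads
        return xbar == 1.0
    return np.sqrt(n) * (xbar - 0.5) / np.sqrt(xbar * (1 - xbar)) > c

for p in [0.0, 0.3, 0.45, 0.5, 0.55, 0.7, 1.0]:
    freq = np.mean([psi(rng.binomial(1, p, size=n)) for _ in range(2000)])
    label = "alpha_psi(p)" if p <= 0.5 else "1 - beta_psi(p)"
    print(f"p = {p:4.2f}   P_p(psi = 1) ~ {freq:5.3f}   ({label})")
```

The printout should show a curve that starts near 0, stays below roughly 5% on theta 0, sits at about 5% at p = 1/2, and climbs toward 1 on theta 1, which is exactly the white curve drawn on the board.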
09:15 So I'm fixing this number, and I want 09:16 this to be controlled for all theta and theta 0. 09:19 09:23 So if you're going to give me this budget, 09:26 well, I'm actually going to make it equal where I can. 09:29 If you're telling me you can make it equal to alpha, 09:31 we know that if I increase my type I error, 09:34 I'm going to decrease my type II error. 09:36 If I start putting everyone in jail 09:39 or if I start letting everyone go free, 09:41 that's what we were discussing last time. 09:43 So since we have this trade-off and you're 09:45 giving me a budget for one guy, I'm just going to max it out. 09:49 And where am I going to max it out? 09:50 Exactly at 1/2 at the boundary. 09:53 So this is going to be 5%. 09:54 10:00 So what I know is that since alpha of theta 10:03 is less than alpha for all theta in theta 10:06 0-- sorry, that's for theta 0, that's where alpha is defined. 10:12 So for theta and theta 0, I knew that my function 10:14 is going to look like this. 10:15 It's going to be somewhere in this rectangle. 10:18 Everybody agrees? 10:20 So this function for this guy is going to look like this. 10:22 When I'm at 0, when p is equal to 0, 10:25 which means I only observe 0's, then I 10:29 know that p is going to be 0, and I will certainly not 10:32 conclude that p is equal to 1. 10:34 This test will never conclude that p is equal to 1-- 10:39 10:42 that p is larger than 1/2, just because xn bar 10:44 is going to be equal to 0. 10:46 Well, this is actually not well-defined, 10:48 so maybe I need to do something-- put it equal to 0 10:51 if xn bar is equal to 0. 10:52 So I guess, basically, I get something which is negative, 10:55 and so it's never going to be larger than what I want. 10:58 And so here, I'm actually starting at 0. 11:00 So now, this is this function here that increases-- 11:04 I mean, it should increase smoothly. 11:06 This function here is alpha psi of theta-- 11:11 or alpha psi of p, let's say, because we're talking about p. 11:15 Then it reaches alpha here. 11:17 Now, when I go on the other side, 11:19 I'm actually looking at beta. 11:21 When I'm on theta 1, the function that matters 11:23 is the probability of type II error, which is beta of psi. 11:28 And this beta of psi is actually going to increase. 11:30 11:34 So beta of psi is what? 11:35 Well, beta of psi should also-- 11:37 sorry, that's the probability of being equal to alpha. 11:39 So what I'm going to do is I'm going 11:41 to look at the probability of rejecting. 11:43 So let me draw this functional all the way. 11:46 It's going to look like this. 11:48 Now here, if I look at this function here or here, 11:52 this is the probability under theta that psi is equal to 1. 11:57 And we just said that, in this region, 11:59 this function is called alpha of psi. 12:02 In that region, it's not called alpha of psi. 12:06 It's not called anything. 12:08 It's just the probability of rejection. 12:11 So it's not any error, it's actually 12:12 what you should be doing. 12:14 What we're looking at in this region is 1 minus this guy. 12:19 We're looking at the probability of not rejecting. 12:21 So I need to actually, basically, look at the 1 12:23 minus this thing, which here is going to be 95%. 12:27 So I'm going to do 95%. 12:31 12:34 And this is my probability. 12:36 Ability And I'm just basically drawing 12:38 the symmetric of this guy. 12:40 So this here is the probability under theta 12:44 that psi is equal to 0, which is 1 minus p theta 12:50 that psi is equal to 1. 
12:52 So it's just 1 minus the white curve. 12:56 And it's actually, by definition, equal 12:59 to beta of psi of theta. 13:00 13:06 Now, where do I read pi psi? 13:09 13:20 What is pi psi on this picture? 13:22 13:26 Is pi psi a number or a function? 13:28 13:32 AUDIENCE: Number. 13:32 PHILIPPE RIGOLLET: It's a number, right? 13:33 It's the minimum of a function. 13:35 What is this function? 13:36 It's the probability under theta that psi is equal to 1. 13:39 I drew this entire function for theta between theta 0 and theta 1. 13:44 I drew-- this is this entire white curve. 13:46 This is this probability. 13:48 Now I'm saying, look at the smallest value this probability 13:50 can take on the set theta 1. 13:54 What is this? 13:55 14:00 This guy. 14:02 This is where my pi-- 14:03 this thing here is pi psi, and so it's equal to 5%. 14:08 14:11 So that's for this particular test, 14:13 because this test has a continuous curve for this psi. 14:19 And so if I want to make sure that I'm 14:20 at 5% when I come to the right of the theta 0, 14:24 if it touches theta 1, then I'd better 14:26 have 5% on the other side if the function is continuous. 14:30 So basically, if this function is 14:33 increasing, which will be the case for most tests, 14:38 and continuous, then what's going to happen 14:39 is that the level of the test, which is alpha, 14:42 is actually going to be equal to the power of the test. 14:44 14:48 Now, there's something I didn't mention, 14:50 and I'm just mentioning it in passing. 14:52 Here, I defined the power itself. 14:55 This function, this entire white curve here, 14:59 is actually called the power function-- 15:01 15:06 this thing. 15:07 That's the entire white curve. 15:09 And what you could have is tests that 15:12 have the entire curve which is dominated by another test. 15:16 So here, if I look at this test-- 15:18 and let's assume I can build another test that 15:21 has this curve. 15:23 Let's say it's the same here, but then here, it 15:29 looks like this. 15:29 15:34 What is the power of this test? 15:38 AUDIENCE: It's the same. 15:39 PHILIPPE RIGOLLET: It's the same. 15:40 It's 5%, because this point touches here exactly 15:43 at the same point. 15:44 However, for any other value than the worst possible, 15:48 this guy is doing better than this guy. 15:51 Can you see that? 15:52 Having a curve higher on the right-hand side 15:55 is a good thing because it means that you 15:57 tend to reject more when you're actually in h1. 16:03 So this guy is definitely better than this guy. 16:06 And so what we say, in this case, 16:07 is that the test with the dashed line 16:09 is uniformly more powerful than the other test. 16:13 But we're not going to go into those details 16:15 because, basically, all the tests that we will describe 16:18 are already the most powerful ones. 16:22 In particular, this guy is-- 16:24 there's no such thing. 16:25 All the other guys you can come up with 16:26 are going to actually be below. 16:27 16:33 So we saw a couple of tests, then we 16:36 saw how to pick this threshold, and we defined those two 16:40 things. 16:41 AUDIENCE: Question. 16:42 PHILIPPE RIGOLLET: Yes? 16:43 AUDIENCE: But in that case, the dashed line, 16:45 if it were also higher in the region of theta 0, 16:48 do you still consider it better? 16:50 PHILIPPE RIGOLLET: Yeah. 16:51 AUDIENCE: OK. 16:52 PHILIPPE RIGOLLET: Because you're given this budget of 5%.
16:55 So in this paradigm where you're given the-- 16:58 actually, if the dashed line was this dashed line, 17:01 I would still be happy. 17:03 I mean, I don't care what this thing does here, 17:05 as long as it's below 5%. 17:06 But here, I'm going to try to discover. 17:08 Think about, again, the drug discovery example. 17:11 You're trying to find-- let's say you're a scientist 17:14 and you're trying to prove that your drug works. 17:17 What do you want to see? 17:18 Well, FDA puts on you this constraint 17:22 that your probability of type I error should never exceed 5%. 17:26 You're going to work under this assumption. 17:28 But what you're going to do is, you're 17:30 going to try to find a test that will make you find something 17:33 as often as possible. 17:35 And so you're going to max this constraint of 5%. 17:38 And then you're going to try to make this curve, that means-- 17:41 this is, basically, this number here, for any point 17:45 here, is the probability that you publish your paper. 17:47 That's the probability that you can 17:50 release to market your drug. 17:51 That's the probability that it works. 17:53 And so you want this curve to be as high as possible. 17:56 You want to make sure that if there's evidence in the data 18:02 that h1 is the truth, you want to squeeze as much 18:05 of this evidence as possible. 18:07 And the test that has the highest possible curve 18:09 is the most powerful one. 18:11 Now, you have to also understand that having two curves that 18:15 are on top of each other completely, everywhere, 18:19 is a rare phenomenon. 18:22 It's not always the case that there 18:24 is a test that's uniformly more powerful than any other test. 18:27 It might be that you have some trade-off, 18:29 that it might be better here, but then you're 18:31 losing power here. 18:32 Maybe it's-- I mean, things like this. 18:33 Well, actually, maybe it should not go down. 18:35 But let's say it goes like this, and then, maybe, this guy 18:37 goes like this. 18:39 Then you have to, basically, make an educated guess 18:43 whether you think that the theta you're going to find is here 18:46 or is here, and then you pick your test. 18:47 18:51 Any other question? 18:51 Yes? 18:52 AUDIENCE: Can you explain the green curve again? 18:53 That's just the type II error? 18:55 PHILIPPE RIGOLLET: So the green curve is-- exactly. 18:57 So that's beta psi of theta. 18:58 So it's really the type II error. 19:00 And it's defined only here. 19:02 So here, it's not a definition, it's 19:05 really I'm just mapping it to this point. 19:08 So it's defined only here, and it's the probability 19:10 of type II error. 19:11 19:15 So here, it's pretty large. 19:17 I'm making it, basically, as large 19:19 as I could because I'm at the boundary, 19:22 and that means, at the boundary, since the status quo is h0, 19:26 I'm always going to go for h0 if I 19:29 don't have any evidence, which means that what's going to pay 19:31 is the type II error that's going to basically pay this. 19:34 19:38 Any other question? 19:38 19:41 So let's move on. 19:43 So did we do this? 19:47 No, I think we stopped here, right? 19:50 I didn't cover that part. 19:53 So as I said, in this paradigm, we're 19:55 going to actually fix this guy to be something. 19:58 And this thing is actually called the level of the test. 20:01 I'm sorry, this is, again, more words. 20:03 Actually, the good news is that we split it into two lectures. 20:06 So we have, what is a test? 20:09 What is a hypothesis? 20:11 What is the null? 
20:11 What is the alternative? 20:14 What is the type I error? 20:15 What is the type II error? 20:16 And now, I'm telling you there's another thing. 20:18 So we define the power, which was some sort of a lower bound 20:22 on the-- 20:24 or it's 1 minus the upper bound on the type II 20:26 error, basically. 20:28 And so in the alternative-- so the power 20:32 is the smallest probability of rejecting 20:34 when you're in the alternative, 20:36 when you're in theta 1, so that's my power. 20:41 I looked here, and I looked at the smallest value. 20:43 And I can look at this side and say, well, 20:45 what is the largest probability that I make a type I error? 20:48 Again, this largest probability is the level of the test. 20:51 20:58 So this is alpha equal, by definition, 21:03 to the maximum for theta in theta 0 of alpha psi of theta. 21:15 So here, I just put the level itself. 21:18 As you can see, here, it essentially says 21:20 that if I'm of level of 5%, I'm also of level 10%, 21:23 I'm also of level 15%. 21:25 So here, it's really an upper bound. 21:27 Whatever you guys want to take, this is what it is. 21:29 But as we said, if this number is 4.5%, 21:34 you're losing in your type II error. 21:36 So if you're allowed to have-- 21:38 if this maximum here is 4.5% and FDA told you you can go to 5%, 21:43 you're losing in your type II error. 21:44 So you actually want to make sure 21:46 that this is the 5% that's given to you. 21:48 So the way it works is that you give me the alpha, 21:51 then I'm going to go back, pick c that depends on alpha here, 21:56 so that this thing is actually equal to 5%. 21:58 22:01 And so of course, in many instances, 22:04 we do not know the probability. 22:06 We do not know how to compute the probability of type I 22:09 error. 22:10 This is a maximum value for the probability of type I error. 22:12 We don't know how to compute it. 22:13 I mean, it might be a very complicated random variable. 22:15 Maybe it's a weird binomial. 22:17 We could compute it, but it would be painful. 22:19 But what we know how to compute is its asymptotic value. 22:21 Just because of the central limit theorem, convergence 22:24 in distribution tells me that the probability of type I error 22:28 is basically going towards the probability 22:30 that some Gaussian is in some region. 22:33 And so we're going to compute, not the level itself, 22:36 but the asymptotic level. 22:37 22:43 And that's basically the limit as n 22:48 goes to infinity of alpha psi of theta. 22:56 And then I'm going to take the max here. 22:58 23:06 So how am I going to compute this? 23:08 Well, if I take a test that has rejection region of the form 23:13 tn-- 23:14 because it depends on the data, that's tn of x1 xn-- 23:17 my observations-- larger than some number c. 23:23 Of course, I can almost always write 23:26 tests like that, except that sometimes, 23:28 it's going to be an absolute value, which essentially means 23:30 I'm going away from some value. 23:32 Maybe, actually, I'm less than something, 23:34 but I can always put a negative sign in front of everything. 23:37 So this is without much loss of generality. 23:39 So this includes something that looks like-- 23:47 23:51 something is larger than the constant, so that means-- 23:56 which is equivalent to-- well, let me write that as tn, 24:02 because then that means that-- 24:05 so that's tn. 24:07 But this actually encompasses the fact 24:10 that qn is larger than c or qn is less than minus c. 24:21 So that includes this guy.
24:22 That also includes qn less than c, 24:26 because this is equivalent to qn is larger than minus c. 24:32 And minus qn is-- 24:33 and so that's going to be my tn. 24:35 24:37 So I can actually encode several type of things-- 24:42 rejection regions. 24:44 So here, in this case, I have a rejection region 24:47 that looks like this, or a rejection region 24:50 that looks like this, or a rejection 24:53 region that looks like this. 24:54 24:57 And here, I don't really represent it 24:58 for the whole data, but maybe for the average, for example, 25:02 or the normalized average. 25:04 25:17 So if I write this, then-- 25:23 yeah. 25:25 And in this case, this tn that shows up 25:32 is called test statistic. 25:35 25:41 I mean, this is not set in stone. 25:43 Here, for example, q could be the test statistic. 25:46 It doesn't have to be minus q itself 25:48 that's the test statistic. 25:50 So what is the test statistic? 25:52 Well, it's what you're going to build from your data 25:55 and then compare to some fixed value. 25:57 So in the example we had here, what is our test statistic? 26:01 Well, it's this guy. 26:02 26:05 This was our test statistic. 26:09 And is this thing a statistic? 26:12 What are the criteria for a statistic? 26:14 What is the statistic? 26:15 26:21 I know you know the answer. 26:23 AUDIENCE: Measurable function. 26:25 PHILIPPE RIGOLLET: Yeah, it's a measurable function 26:26 of the data that does not depend on the parameter. 26:29 Is this guy a statistic? 26:32 AUDIENCE: It's not. 26:33 26:35 PHILIPPE RIGOLLET: Let's think again. 26:37 26:40 When I implemented the test, what did I do? 26:45 I was able to compute my test. 26:47 My test did not depend on some unknown parameter. 26:49 How did we do it? 26:52 We just plugged in 0.5 here, remember? 26:57 That was the value for which we computed it, 26:59 because under h0, that was the value we're seeing. 27:02 And if theta 0 is actually an entire set, 27:05 I'm just going to take the value that's the closest to h1. 27:09 We'll see that in a second. 27:11 I mean, I did not guarantee that to you. 27:13 But just taking the worst type I error and bounded by alpha 27:18 is equivalent to taking p and taking the value of p that's 27:22 the closest to theta 1, which is completely intuitive. 27:26 The worst type I error is going to be attained for the p that's 27:29 the closest to the alternative. 27:32 So even if the null is actually just an entire set, 27:36 it's as if it was just the point that's 27:38 the closest to the alternative. 27:41 So now we can compute this, because there's 27:44 no unknown parameters that shows up. 27:46 We replace p by 0.5. 27:48 And so that was our test statistic. 27:50 27:53 So when you're building a test, you 27:55 want to first build a test statistic, 27:58 and then see what threshold you should be getting. 28:01 So now, let's go back to our example where we want to have-- 28:08 we have x1 xn, their IID [INAUDIBLE] p. 28:16 And I want to test if p is 1/2 versus p not equal to 1/2, 28:25 which, as I said, is what you want to do if you 28:27 want to test if a coin is fair. 28:33 And so here, I'm going to build a test statistic. 28:36 And we concluded last time that-- 28:39 what do we want for this statistic? 28:41 We want it to have a distribution which, 28:44 under the null, does not depend on the parameters, 28:49 a distribution that I can actually compute quintiles of. 
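As a sanity check of this "plug in 0.5 under the null" idea, here is a short simulation sketch (not from the lecture; n = 50 and the number of repetitions are arbitrary choices) showing that once p is replaced by 0.5, the statistic is computable from the data alone, and under the null its distribution is close to the standard Gaussian whose quantiles we can look up.

```python
# Sketch: under H0 (p = 0.5), Tn = sqrt(n)(Xbar - 0.5)/sqrt(Xbar(1 - Xbar)) depends on
# nothing unknown, and by the CLT (plus Slutsky) it is approximately N(0, 1).
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
n = 50

def t_stat(x):
    xbar = x.mean()
    return np.sqrt(n) * (xbar - 0.5) / np.sqrt(xbar * (1 - xbar))

draws = np.array([t_stat(rng.binomial(1, 0.5, size=n)) for _ in range(20000)])
for q in [0.90, 0.95, 0.975]:
    print(f"{q:.3f}-quantile: simulated {np.quantile(draws, q):5.2f},  N(0,1) {norm.ppf(q):5.2f}")
```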
28:54 So what we did is, we said, well, 28:56 if I look at-- the central limit theorem tells me that square 28:59 root of n xn bar minus p divided by-- 29:03 so if I do central limit theorem plus Slutsky, for example, 29:06 I'm going to have square root. 29:08 29:12 And we've had this discussion whether we want 29:13 to use Slutsky or not here. 29:15 But let's assume we're taking Slutsky wherever we can. 29:17 So this thing tells me that, by the central limit 29:20 theorem, as n goes to infinity, this thing converges 29:23 in distribution to some n01. 29:25 29:28 Now, as we said, this guy is not something we know. 29:31 But under the null, we actually know it. 29:34 And we can actually replace it by 1/2. 29:37 So this thing holds under h0. 29:41 When I write under h0, it means when this is the truth. 29:44 29:47 So now I have something that converges 29:49 to something that has no dependence on anything I 29:52 don't know. 29:53 And in particular, if you have any statistics textbook, which 29:56 you don't because I didn't require one-- 29:59 and you should be thankful, because these things cost $350. 30:04 Actually, if you look at the back, 30:05 you actually have a table for a standard Gaussian. 30:12 I could have anything else here. 30:13 I could have an exponential distribution. 30:15 I could have a-- 30:17 I don't know-- well, we'll see the chi squared 30:20 distribution in a minute. 30:22 Any distribution from which you can actually 30:24 see a table that somebody actually 30:25 computed this thing for which you can actually 30:27 draw the pdf and start computing whatever probability you want 30:30 on them, then this is what you want 30:32 to see at the right-hand side. 30:35 This is any distribution. 30:36 It's called pivotal. 30:38 I think we've mentioned that before. 30:39 Pivotal means it does not depend on anything 30:41 that you don't know. 30:43 And maybe it's easy to compute those things. 30:45 Probably, typically, you need a computer to simulate them 30:47 for you because computing probabilities for Gaussians 30:50 is not an easy thing. 30:51 We don't know how to solve those integrals exactly, 30:53 we have to do it numerically. 30:56 So now I want to do this test. 31:08 My test statistic will be declared to be what? 31:12 Well, I'm going to reject if what 31:17 is larger than some number? 31:18 31:24 The absolute value of this guy. 31:27 So my test statistic is going to be 31:29 square root of n minus 0.5 divided by square root of xn 31:35 bar 1 minus xn bar. 31:38 31:41 That's my test statistic, absolute value of this guy, 31:43 because I want to reject either when this guy is too large 31:45 or when this guy is too small. 31:47 31:50 I don't know ahead whether I'm going 31:51 to see p larger than 1/2 or less than 1/2. 31:55 So now I need to compute c such that the probability 31:59 that tn is larger than c. 32:05 So that's the probability under p, which is unknown. 32:11 I want this probability to be less than some level alpha, 32:17 asymptotically. 32:18 So I want the limit of this guy to be less than alpha, 32:24 and that's the level of my test. 32:26 So that's the given level. 32:32 So I want this thing to happen. 32:33 Now, what I know is that this limit-- 32:35 32:38 actually, I should say given asymptotic level. 32:40 32:48 So what is this thing? 32:50 32:54 Well, OK, that's the probability that something 33:00 that looks like under p. 
33:03 So under p, this guy-- 33:05 so what I know is that tn is square root of n 33:08 minus xn bar minus 0.5 divided by square root of xn bar 33:15 1 minus xn bar exceeds. 33:18 33:23 Is this true that as n to infinity, 33:26 this probability is the same as the probability 33:28 that the absolute value of a Gaussian 33:30 exceeds c of a standard Gaussian? 33:33 Is this true? 33:34 33:37 AUDIENCE: The absolute value of the standard Gaussian. 33:39 PHILIPPE RIGOLLET: Yeah, the absolute. 33:41 So you're saying that this, as n becomes large enough, this 33:43 should be the probability that some absolute value of n01 33:48 exceeds c, right? 33:49 AUDIENCE: Yes. 33:51 PHILIPPE RIGOLLET: So I claim that this is not correct. 33:54 Somebody tell me why. 33:56 AUDIENCE: Even in the limit it's not correct? 33:57 PHILIPPE RIGOLLET: Even in the limit, it's not correct. 33:59 34:03 AUDIENCE: OK. 34:04 PHILIPPE RIGOLLET: So what do you see? 34:05 AUDIENCE: It's because, at the beginning, 34:07 we picked the worst possible true parameter, 0.5. 34:11 So we don't actually know that this 0.5 is the mean. 34:13 PHILIPPE RIGOLLET: Exactly. 34:15 So we pick this 0.5 here, but this is for any p. 34:19 But what is the only p I can get? 34:21 So what I want is that this is true for all p in theta 0. 34:26 But the only p that's in theta 0 is actually p is equal to 0.5. 34:31 So yes, what you said was true, but it 34:33 required to specify p to be equal to 0.5. 34:38 So this, in general, is not true. 34:40 But it happens to be true if p belongs to theta 0, which 34:47 is strictly equivalent to p is equal to 0.5, 34:53 because theta 0 is really just this one point, 0.5. 34:59 So now, this becomes true. 35:01 And so what I need to do is to find c such 35:03 that this guy is equal to what? 35:05 35:11 I mean, let's just follow. 35:14 So I want this to be less than alpha. 35:16 But then we said that this was equal to this, 35:19 which is equal to this. 35:21 So all I want is that this guy is less than alpha. 35:24 But we said we might as well just make it equal to alpha 35:28 if you allow me to make it as big as I want, 35:30 as long as it's less than alpha. 35:32 AUDIENCE: So this is a true statement. 35:33 PHILIPPE RIGOLLET: So this is a true statement. 35:35 But it's under this condition. 35:38 AUDIENCE: Exactly. 35:39 35:43 PHILIPPE RIGOLLET: So I'm going to set it equal to alpha, 35:48 and then I'm going to try to solve for c. 35:52 36:10 So what I'm looking for is a c such that 36:13 if I draw a standard Gaussian-- 36:17 so that's pdf of some n01-- 36:20 I want the probability that the absolute value of my Gaussian 36:23 exceeding this guy-- 36:25 so that means being either here or here. 36:29 So that's minus c and c. 36:31 I want the sum of those two things to be equal to alpha. 36:36 So I want the sum of these areas to equal alpha. 36:53 So by symmetry, each of them should 36:56 be equal to alpha over 2. 36:58 37:02 And so what I'm looking for is c such that the probability 37:08 that my n01 exceeds c, which is just this area to the right, 37:15 now, equals alpha, which is equivalent to taking c, which 37:20 is q equals alpha over 2, and that's q alpha over 2 37:26 by definition of q alpha over 2. 37:28 That's just what q alpha over 2 is. 37:30 And that's what the tables at the back of the book give you. 37:34 Who has already seen a table for Gaussian probabilities? 37:42 What it does, it's just a table. 37:44 I mean, it's pretty ancient. 
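Today the table at the back of the book is one line of code. A sketch, assuming scipy is available: q alpha over 2 is just the (1 - alpha/2)-quantile of the standard Gaussian.

```python
# The Gaussian "table", computed numerically.
from scipy.stats import norm

for alpha in [0.10, 0.05, 0.01]:
    print(f"alpha = {alpha:4.2f}   q_(alpha/2) = {norm.ppf(1 - alpha / 2):.3f}")
# alpha = 0.05 gives 1.960, the threshold used in the examples below.
```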
37:45 I mean, of course, you can actually ask 37:47 Google to do it for you now. 37:49 I mean, it's basically standard issue. 37:52 But back in the day, they actually had to look at tables. 37:56 And since the values alphas were pretty standard, 37:59 the values alpha that people were requesting 38:01 were typically 1%, 5%, 10%, all you 38:04 could do is to compute these different values 38:07 for different values of alpha. 38:08 That was it. 38:10 So there's really not much to give you. 38:13 So for the Gaussian, I can tell you 38:15 that alpha is equal to-- if alpha is equal to 5%, 38:20 then q alpha over 2, q 2.5% is equal to 1.96, for example. 38:27 So those are just fixed numbers that 38:28 are functions of the Gaussian. 38:31 So everybody agrees? 38:32 We've done that before for our confidence intervals. 38:37 38:40 And so now we know that if I actually 38:42 plug in this guy to be q alpha over 2, then 38:48 this limit is actually equal to alpha. 38:51 And so now I've actually constrained this. 38:53 39:01 So q alpha over 2 here for alpha equals 5%, as I said, is 1.96. 39:07 So in the example 1, the number that we found was 3.54, 39:13 I think, or something like that, 3.55 for t. 39:18 So if we scroll back very quickly, 3.45-- 39:29 that was example 1. 39:30 Example two-- negative 0.77. 39:33 So if I look at tn in example 1, tn 39:40 was just the absolute value of 3.45, which-- 39:46 don't pull out your calculators-- is equal to 3.45. 39:50 Example 2, absolute value of negative 0.77 39:54 was equal to 0.77. 39:57 And so all I need to check is, is this number 39:59 larger or smaller than 1.96? 40:01 That's what my test ends up being. 40:06 So in example 1, 3.45 being larger 40:12 than 1.96, that means that I reject. 40:18 Fairness of my coins, in example 2, 40:22 0.77 being smaller than 1.96-- 40:27 what do I do? 40:29 I fail to reject. 40:30 40:44 So here is a question. 40:45 40:47 In example 1, for what level alpha would psi alpha-- 40:54 40:57 OK, so here, what's going to happen 41:00 if I start decreasing my level? 41:04 When I decrease my level, I'm actually 41:07 making this area smaller and smaller, 41:09 which means that I push this c to the right. 41:13 So now I'm asking, what is the smallest c 41:17 I should pick so that now, I actually do not reject h0? 41:22 What is the smallest c I should be taking here? 41:29 What is the smallest c? 41:30 41:37 So c here, in the example I gave you for 5%, was 1.96. 41:43 What is the smallest c I should be taking so that now, 41:49 this inequality is reversed? 41:50 41:54 3.45. 41:55 I ask only trivial questions, don't be worried. 41:58 So 3.45 is the smallest c that I'm actually 42:02 willing to tolerate. 42:04 So let's say this was my 5%. 42:07 If this was 2.5-- 42:09 if here, let's say, in this picture, 42:11 alpha is 5%, that means maybe I need to push here. 42:16 And this number should be what? 42:18 So this is going to be 1.96. 42:20 And this number here is going to be 3.45, clearly to scale. 42:26 And so now, what I want to ask you is, 42:30 well, there's two ways I can understand this number 3.45. 42:33 It is the number 3.45, but I can also 42:36 try to understand what is the area to the right of this guy. 42:40 And if I understand what the area to the right of this guy 42:42 is, this is actually some alpha prime over 2. 42:47 And that means that if I actually 42:49 fix this level alpha prime, that would 42:53 be exactly the tipping point at which I would 42:57 go from accepting to rejecting. 
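Here is the whole decision rule applied to the two realized statistics quoted in the lecture, 3.45 for example 1 and negative 0.77 for example 2; the only assumption is that we compare against the asymptotic 5% threshold q 2.5% = 1.96 as above.

```python
# Reject H0 at asymptotic level 5% iff |Tn| > q_(2.5%) ~ 1.96.
from scipy.stats import norm

q = norm.ppf(0.975)
for name, tn in [("example 1", 3.45), ("example 2", -0.77)]:
    decision = "reject H0" if abs(tn) > q else "fail to reject H0"
    print(f"{name}: |Tn| = {abs(tn):.2f} vs {q:.2f} -> {decision}")
```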
43:01 So I knew, in terms of absolute thresholds, 43:04 3.45 is the trivial answer to the question. 43:07 That's the tipping point, because I'm 43:09 comparing a number to 3.45. 43:11 But now, if I try to map this back 43:13 and understand what level would have been giving me 43:16 this particular tipping point, that's 43:18 a number between 0 and 1. 43:21 The smaller the number, the larger this number here, 43:25 which means the more evidence I have in my data 43:28 against h0. 43:30 And so this number is actually something called the p-value. 43:36 And the same for example 2: there's 43:38 the tipping point alpha at which I 43:40 go from failing to reject to rejecting. 43:44 And that's exactly the number, the area under the curve, 43:47 such that here, I see 0.77. 43:53 And this is this alpha prime prime over 2. 43:56 43:59 Alpha prime prime is clearly larger than 5%. 44:04 So what's the advantage of thinking and mapping back 44:06 these numbers? 44:08 Well, now, I'm actually going to spit out some number which 44:11 is between 0 and 1. 44:12 And that should be the only scale you should have in mind. 44:18 Remember, we discussed that last time. 44:20 I was like, well, if I actually spit out 44:22 a number which is 3.45, maybe you can try to think, 44:26 is 3.45 a large number for a Gaussian? 44:29 That's a number. 44:29 But if I had another random variable that was not Gaussian, 44:32 maybe it was a double exponential, 44:33 you would have to have another scale in your mind. 44:36 Is 3.45 so large that it's unlikely for it 44:42 to come from a double exponential? 44:44 If I had a gamma distribution-- 44:46 I can think of any distribution, and then that means, 44:48 for each distribution, you would have to have a scale in mind. 44:51 So of course, you can have the Gaussian scale in mind. 44:53 I mean, I have the Gaussian scale in mind. 44:55 But then, if I map it back into this number between 0 and 1, 44:59 all the distributions play the same role. 45:02 So whether I'm talking about if my limiting distribution is 45:05 normal or exponential or gamma, or whatever you want, 45:09 for all these guys, I'm just going 45:11 to map it into one number between 0 and 1. 45:13 Small number means lots of evidence against h0. 45:16 Large number means very little evidence against h0. 45:25 And this is the only number you need to keep in mind. 45:27 And the question is, am I willing 45:29 to tolerate this number between 5%, 6%, or maybe 10%, 12%? 45:34 And this is the only scale you have to have in mind. 45:37 And this scale is the scale of p-values. 45:41 So the p-value is the tipping point in terms of alpha. 45:48 In words, I can make it formal, because tipping point, 45:52 as far as I know, is not a mathematical term. 45:54 So a p-value of a test is the smallest, 45:58 potentially asymptotic level if I talk about an asymptotic 46:01 p-value-- 46:02 and that's what we do when we talk about the central limit 46:05 theorem-- at which the test rejects h0. 46:09 If I were to go any smaller-- 46:10 46:14 sorry, it's the smallest level-- 46:17 yeah, if I were to go any smaller, 46:19 I would fail to reject. 46:21 The smaller the level, the less likely it is for me to reject. 46:25 And if I were to go any smaller, I 46:26 would start failing to reject. 46:31 And so it is a random number. 46:33 It depends on what I actually observe. 46:35 So here, of course, I instantiated those two numbers, 46:39 3.45 and 0.77, as realizations of random variables.
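For this two-sided Gaussian test, the tipping point just described has a closed form: the asymptotic p-value is the probability that the absolute value of a standard Gaussian exceeds the observed |Tn|, that is 2(1 - Phi(|Tn|)). A sketch applying it to the two examples from the lecture:

```python
# p-value of the two-sided test: P(|N(0,1)| > |Tn|) = 2 * (1 - Phi(|Tn|)).
# Reject at level alpha exactly when the p-value is below alpha.
from scipy.stats import norm

def p_value(tn):
    return 2 * (1 - norm.cdf(abs(tn)))

print(f"example 1: p-value ~ {p_value(3.45):.5f}")   # ~ 0.00056, reject at 5% (and at 1%)
print(f"example 2: p-value ~ {p_value(-0.77):.5f}")  # ~ 0.44, fail to reject at 5%
```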
46:44 But if you think of those as being the random numbers 46:46 before I see my data, this was a random number, 46:50 and therefore, the area under the curve to the right of it 46:53 is also a random area. 46:55 If this thing fluctuates, then the area under the curve 46:58 fluctuates. 47:00 And that's what the p-value is. 47:02 That's what-- what is his name? 47:05 I forget. 47:06 John Oliver talks about when he talks about p-hacking. 47:10 And so we talked about this in the first lecture. 47:14 So p-hacking is, how do I do-- oh, if I'm a scientist, 47:18 do I want to see a small p-value or a large p-value? 47:20 AUDIENCE: Small. 47:21 PHILIPPE RIGOLLET: Small, right? 47:22 Scientists want to see small p-values because small p-values 47:24 equals rejecting, which equals discovery, 47:28 which equals publications, which equals promotion. 47:31 So that's what people want to see. 47:34 So people are tempted to see small p-values. 47:37 And what's called p-hacking is, well, find a way to cheat. 47:41 Maybe look at your data, formulate your hypothesis 47:44 in such a way that you will actually have a smaller 47:49 p-value than you should have. 47:51 So here, for example, there's one thing 47:53 I did not insist on because, again, this is not 47:54 a particular course on statistical thinking, 47:57 but one thing that we implicitly did 47:59 was set those theta 0 and theta 1 ahead of time. 48:04 I fixed them, and I'm trying to test this. 48:08 This is to be contrasted with the following approach. 48:11 I draw my data. 48:13 So I draw-- 48:15 I run this experiment, which is probably 48:16 going to get me a publication in nature. 48:18 I'm trying to test if a coin is fair. 48:23 And I draw my data, and I see that there's 48:24 13 out of 30 of my observations that are heads. 48:31 That means that, from this data, it 48:32 looks like p is less than 1/2. 48:36 So if I look at this data and then 48:38 decide that my alternative is not p not equal to 1/2, 48:42 but rather p less than 1/2, that's p-hacking. 48:47 I'm actually making my p-value strictly smaller 48:50 by first looking at the data, and then deciding what 48:53 my alternative is going to be. 48:54 And that's cheating, because all the things we did, 48:58 we're assuming that this 0.5, or the alternative, 49:02 was actually a fixed-- everything was deterministic. 49:05 The only randomness came from the data. 49:07 But if I start looking at the data 49:08 and designing my experiment or my alternatives 49:11 and null hypothesis based on the data, 49:13 it's as if I started putting randomness all over the place. 49:15 And then I cannot control it because I don't know how it 49:18 just intermingles with each other. 49:22 So that was for the John Oliver moment. 49:26 49:29 So the p-value is nice. 49:32 So maybe I mentioned that, before, my wife 49:35 works in market research. 49:36 And maybe every two years, she seems 49:40 to run into a statistician in the hallway, 49:42 and she comes home and says, what is a p-value again? 49:45 And for her, a p-value is just the number 49:48 in an Excel spreadsheet. 49:50 And actually, small equals good and large equals bad. 49:55 And that's all she needs to know at this point. 49:57 Actually, they do the job for her-- small is green, 50:01 large is red. 50:02 And so for her, a p-value is just green or red. 50:06 But so what she's really implicitly doing 50:08 with this color code is just applying the golden rule. 
50:12 What the statisticians do for her in the Excel spreadsheet 50:16 is that they take the numbers for the p-values that 50:18 are less than some fixed level. 50:20 So depending on the field in which she works-- 50:22 so she works for pharmaceutical companies-- 50:24 so the p-values are typically compared-- 50:26 the tests are usually performed at level 1%, rather than 5%. 50:31 So 5% is maybe your gold standard 50:33 if you're doing sociology or trying to-- 50:36 I don't know-- release a new blueberry flavor 50:39 for your toothpaste. 50:40 Something that's not going to change the life of people, 50:43 maybe you're going to run at 5%. 50:45 It's OK to make a mistake. 50:46 See, people are just going to feel gross, 50:47 but that's about it, whereas here, 50:50 if you have this p-value which is less than 1%, 50:53 it might be more important for some drug discovery, 50:55 for example. 50:56 And so let's say you run at 1%. 50:59 And so what they do in this Excel spreadsheet is 51:02 that all the numbers that are below 1% show up in green 51:05 and all the numbers that are above 1% show up in red. 51:09 And that's it. 51:09 That's just applying the golden rule. 51:11 If the number is green, reject. 51:13 If the number is red, fail to reject. 51:18 Yeah? 51:18 AUDIENCE: So going back to example 2 51:20 where the prior example where you 51:23 want to cheat by looking after beta 51:26 and then formulating, say, theta 1 to be p less than 1/2. 51:32 PHILIPPE RIGOLLET: Yeah. 51:33 AUDIENCE: So how would you achieve your goal 51:38 by changing the theta-- 51:40 PHILIPPE RIGOLLET: By achieving my goal, 51:42 you mean letting ethics aside, right? 51:45 AUDIENCE: Yeah, yeah. 51:46 PHILIPPE RIGOLLET: Ah, you want to be published. 51:47 AUDIENCE: Yeah. 51:48 PHILIPPE RIGOLLET: [LAUGHS] So let me teach you how, then. 51:54 So well, here, what do you do? 51:58 You want to-- at the end of the day, 52:03 a test is only telling you whether you found evidence 52:06 in your data that h1 was more likely than h0, basically. 52:11 How do you make h1 more likely? 52:12 Well, you just basically target h1 to be what it is-- 52:18 what the data is going to make it more likely to be. 52:21 So if, for example, I say h1 can be on both sides, 52:26 then my data is going to have to take into account fluctuations 52:29 on both sides, and I'm going to lose a factor or two somewhere 52:31 because things are not symmetric. 52:33 Here is the ultimate way of making this work. 52:38 I'm going back to my example of flipping coins. 52:42 And now, so here, what I did is, I said, 52:45 oh, this number 0.43 is actually smaller than 0.5, 52:54 so I'm just going to test whether I'm 0.5 52:56 or I'm less than 0.5. 52:58 But here is something that I can promise you 53:01 I did not make the computation will reject. 53:04 So here, this one actually-- 53:06 yeah, this one fails to reject. 53:08 So here is one that will certainly reject. 53:11 h0 is 0.5, p is 0.5, h1p is 0.43. 53:24 Now, you can try, but I can promise you 53:27 that your data will tell you that h1 is the right one. 53:32 I mean, you can check very quickly that this is really 53:36 extremely likely to happen. 53:37 53:40 Actually, what am I-- 53:41 53:45 no, actually, that's not true, because here, 53:52 the test that I derive that's based on this kind of stuff, 53:56 here at some point, somewhere under some layers, 53:59 I assume that all our tests are going to have this form. 
54:04 But here, this is only when you're 54:06 trying to test one region versus another region next to it, 54:09 or one point versus a region around it, 54:11 or something like this, whereas for this guy, 54:13 there's another test that could come up with, 54:15 which is, what is the probability that I get 0.43, 54:18 and what is the probability that I get 0.5? 54:21 Now, what I'm going to do is, I'm 54:23 going to just conclude it's whichever 54:25 has the largest probability. 54:27 Then maybe I'm going to have to make some adjustments so 54:29 that the level is actually 5%. 54:32 But I can make this happen. 54:33 I can make the level be 5% and always conclude this guy, 54:36 but I would have to use a different test. 54:38 Now, the test that I described, again, 54:40 those tn larger than c are built in 54:42 to be tests that are resilient to these kind of manipulations 54:46 because they're oblivious towards what 54:48 the alternative looks like. 54:50 I mean, they're just saying it's either to the left 54:51 or to the right, but whether it's 54:53 a point or an entire half-line doesn't matter. 54:55 54:59 So if you try to look at your data 55:01 and just put the data itself into your hypothesis testing 55:05 problem, then you're failing the statistical principle. 55:10 And that's what people are doing. 55:12 I mean, how can I check? 55:13 I mean, of course, here, it's going 55:15 to be pretty blatant if you publish 55:16 a paper that looks like this. 55:17 But there's ways to do it differently. 55:19 For example, one way to do it is to just do mult-- 55:21 so typically, what people do is they 55:23 do multiple hypothesis testing. 55:24 They're doing 100 tests at a time. 55:27 Then you have random fluctuations every time. 55:30 And so they just pick the one that 55:32 has the random fluctuations that go their way. 55:34 I mean, sometimes it's going in your way, 55:36 and sometimes it's going the opposite way, 55:37 so you just pick the one that works for you. 55:39 We'll talk about multiple hypothesis testing soon 55:41 if you want to increase your publication count. 55:44 55:49 There's actually papers-- 55:50 I think it was a big news that some papers, 55:53 I think, in psychology or psychometrics 55:54 papers that actually refused to publish p-values now. 55:57 56:03 Where were we? 56:05 Here's the golden rule. 56:07 So one thing that I like to show is this thing, 56:11 just so you know how you apply the golden rule 56:14 and how you apply the standard tests. 56:16 So the standard paradigm is the following. 56:25 You have a black box, which is your test. 56:29 For my wife, this is the 4th floor of the building. 56:32 That's where the statisticians sit. 56:33 What she sends there is data-- 56:35 56:38 let's say x1 xn. 56:41 And she says, well, this one is about toothpaste, 56:43 so here's a level-- 56:45 let's say 5%. 56:47 What the 4th floor brings back is that answer-- yes, 56:50 no, green, red, just an answer. 56:53 56:58 So that's the standard testing. 56:59 You just feed it the data and the level at which you 57:02 want to perform the test, maybe asymptotic, 57:04 and it spits out a yes, no answer. 57:06 What p-value does, you just feed it the data itself. 57:15 57:18 And what it spits out is the p-value. 57:22 And now it's just up to you. 57:23 I mean, hopefully your brain has the computational power 57:27 of deciding whether a number is larger or smaller than 5% 57:31 without having to call a statistician for this. 57:33 And that's what it does. 57:35 So now we're on 1 scale. 
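The two black boxes just described can be written down explicitly for the coin example. This is only an illustration; the function names and the sample data are made up.

```python
# Two interfaces to the same coin test (hypothetical helper names).
import numpy as np
from scipy.stats import norm

def t_stat(x):
    xbar = x.mean()
    return np.sqrt(len(x)) * (xbar - 0.5) / np.sqrt(xbar * (1 - xbar))

def test_at_level(x, alpha):
    """Standard paradigm: data and a level go in, a yes/no answer comes out."""
    return abs(t_stat(x)) > norm.ppf(1 - alpha / 2)

def p_value(x):
    """p-value paradigm: only data goes in, a number between 0 and 1 comes out."""
    return 2 * (1 - norm.cdf(abs(t_stat(x))))

x = np.random.default_rng(2).binomial(1, 0.5, size=200)   # some made-up coin flips
print("reject at 5%?", test_at_level(x, 0.05), "   p-value:", round(p_value(x), 3))
```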
57:37 Now, I see some of you nodding when I talk about p-hacking, 57:41 so that means you've seen p-values. 57:43 If you've seen more than 100 p-values in your life, 57:45 you have an entire scale. 57:47 A good p-value is less than 10 to the minus 4. 57:50 That's the ultimate sweet spot. 57:53 Actually, statistical software spits out 57:56 an output which says less than 10 to the minus 4. 58:01 But then maybe you want a p-val-- 58:02 58:05 if you tell me my p-value was 4.65%, then I will say, 58:08 you've been doing some p-hacking until you found 58:10 a number that was below 5%. 58:12 That's typically what people will do. 58:14 But if you tell me-- 58:16 if you're doing the test, if you're saying, 58:18 I published my result, my test at 5% 58:21 said yes, that means that maybe your p-value was 4.99%, 58:27 or your p-value was 10 to the minus 4, I will never know. 58:29 I will never know how much evidence 58:31 you had against the null. 58:34 But if you tell me what the p-value is, 58:36 I can make my own decision. 58:37 You don't have to tell me whether it's a yes or a no. 58:39 You tell me it's 4.99%, I'm going to say, well, maybe yes, 58:42 but I'm going to take it with a grain of salt. 58:45 And so that's why p-values are good numbers to have in mind. 58:48 Now, I said it as if it was like an old trick 58:51 that you start mastering when you're 45 years old. 58:54 No, it's just, how small is the number between 0 and 1? 58:57 That's really what you need to know. 59:00 Maybe on the log scale-- if it's 10 to the minus 1, 59:03 10 to the minus 2, 10 to the minus 3, et cetera-- 59:07 that's probably the extent of the mastery here. 59:09 59:12 So this traditional standard paradigm that I showed 59:16 is actually commonly referred to as the Neyman-Pearson paradigm. 59:21 So here, it says Neyman-Pearson's theory, 59:23 so there's an entire theory that comes with it. 59:25 But it's really a paradigm. 59:27 It's a way of thinking about hypothesis testing that 59:29 says, well, if I'm not going to be able to optimize both 59:32 my type I and type II error, I'm actually 59:34 going to lock in my type I error below some level 59:37 and just minimize the type II error under this constraint. 59:42 That's what the Neyman-Pearson paradigm is. 59:45 And it sort of makes sense for hypothesis testing problems. 59:48 Now, if you were doing some other applications 59:50 with multi-objective optimization, 59:52 you would maybe come up with something different. 59:54 For example, machine learning is typically not performing 59:58 under the Neyman-Pearson paradigm. 60:01 So if you do spam filtering, you could say, well, 60:05 I want to constrain the probability as much as I can 60:08 of taking somebody's important emails 60:10 and throwing them out as spam, and under this constraint, 60:14 not send too much spam to that person. 60:17 That sort of makes sense for spam. 60:19 Now, if you're labeling cats versus dogs, it's probably 60:23 not like you want to make sure that no more than 5% 60:27 of the dogs are labeled cat because, I mean, 60:30 it doesn't matter. 60:31 So what you typically do is, you just 60:33 sum up the two types of errors you can make, 60:34 and you minimize the sum without putting any more 60:36 weight on one or the other. 60:38 So here's an example where, making a binary decision with two types 60:42 of errors you can make, you don't 60:45 have to actually be like that. 60:47 So this example here, I did not.
60:50 The trivial test psi is equal to 0, what was it 60:55 in the US trial court example? 61:00 What is psi equals 0? 61:03 That was concluding always to the null. 61:05 What was the null? 61:08 AUDIENCE: Innocent. 61:08 PHILIPPE RIGOLLET: Innocent, right? 61:10 That's the status quo. 61:11 So that means that this guy never rejects h0. 61:14 Everybody's going away free. 61:16 So you're sure you're not actually 61:18 going against the constitution because alpha is 0%, which 61:25 is certainly less than 5%. 61:26 But the power, the fact that a lot of criminals 61:30 go back outside in the free world 61:34 is actually formulated in terms of low power, which, 61:37 in this case, is actually 0. 61:39 Again, the power is a number between 0 and 1. 61:41 Close to 1, good. 61:43 Close to 0, bad. 61:45 Now, what is the definition of the p-value? 61:51 That's going to be something-- it's a mouthful. 61:54 The definition of the p-value is a mouthful. 61:58 It's the tipping point. 62:00 It is the smallest level at which blah, blah, blah, blah, 62:02 blah. 62:03 It's complicated to remember it. 62:05 Now, I think that by my 6th explanation to my wife, 62:09 after saying, oh, so it's the probability of making an error, 62:12 I said, yeah, that's the probability of making 62:14 an error because, of course, she can 62:16 think probability of making an error small, good, large, bad. 62:22 So that's actually a good way to remember. 62:24 I'm pretty sure that at least 50% 62:26 of people who are using p-values out there 62:28 think that the p-value is the probability of making an error. 62:31 Now, for all intents and purposes, 62:33 if your goal is to just threshold the p-value, 62:35 this is OK to have in mind. 62:37 But when it comes, at least until December 22, 62:42 I would recommend trying to actually memorize 62:44 the right definition of the p-value. 62:46 62:53 So the idea, again, is fix the level 62:55 and try to optimize the power. 62:57 63:01 So we're going to try to compute some p-values from now on. 63:05 How do you compute the p-value? 63:06 Well, you can actually see it from this picture over there. 63:10 63:14 One thing I didn't show on this picture-- so here, 63:16 it was my q alpha over 2 that had alpha over 2 here, 63:19 alpha over 2 here. 63:21 That was my q alpha over 2. 63:22 And I said, if tn is to the right of this guy, 63:26 I'm going to reject. 63:27 If tn is to the left of this guy, 63:29 I'm going to fail to reject. 63:31 Pictorially, you can actually represent the p-value. 63:34 It's when I replace this guy by tn itself. 63:36 63:41 Sorry, that's p-value over 2. 63:44 No, actually, that's the p-value. 63:47 So let me just keep it like that and put the absolute value 63:51 here. 63:51 63:54 So if you replace the role of q alpha over 2 by your test 63:58 statistic, the area under the curve 64:01 is actually the p-value itself, up 64:03 to a scale because of the symmetric thing. 64:06 So there's a good way to see, pictorially, 64:09 what the p-value is. 64:10 It's just the probability that some Gaussians-- 64:13 it's just the probability that some absolute value of n01 64:17 exceeds tn. 64:18 64:22 That's what the p-value is. 64:24 Now, this guy has nothing to do with this guy, 64:26 so this is really just 1 minus the Gaussian cdf of tn, 64:32 and that's it. 64:34 So that's how I would compute p-values. 64:36 Now, as I said, the p-value is a beauty 64:40 because you don't have to understand 64:43 the fact that your limiting distribution is a Gaussian.
64:47 It's already factored in this construction. 64:49 The fact that I'm actually looking 64:50 at this cumulative distribution function of a standard Gaussian 64:54 makes my p-value automatically adjust to what 64:57 the limiting distribution is. 64:58 And if this was the cumulative distribution 65:00 function of a exponential, I would just 65:03 have a different function here denoted by f, for example, 65:06 and I would just compute a different value. 65:07 But in the end, regardless of what the limiting value is, 65:10 my p-value would still be a number between 0 and 1. 65:13 And so to illustrate that, let's look 65:16 at other weird distributions that we could get in place 65:20 of the standard Gaussian. 65:22 And we're not going to see many, but we'll see one. 65:24 And it's not called the chi squared distribution. 65:27 It's actually called the Student's distribution, 65:29 but it involves the chi squared distribution 65:31 as a building block. 65:34 So I don't know if my phonetics are not really right there, 65:38 so I try to say, well, it's chi squared. 65:43 Maybe it's "kee" squared above, in Canada, who knows. 65:47 So for a positive integer, so there's only 1 parameter. 65:50 So for the Gaussian, you have 2 parameters, 65:52 which are mu and sigma squared. 65:54 Those are real numbers. 65:55 Sigma squared's positive. 65:57 Here, I have 1 integer parameter. 65:59 66:03 Then the chi squared distribution 66:05 with d degrees of freedom-- 66:07 so the parameter is called a degree of freedom, 66:09 just like mu is called the expected value and sigma 66:11 squared is called the variance. 66:12 Here, we call it degrees of freedom. 66:14 You don't have to really understand why. 66:17 So that's the law that you would get-- 66:19 that's the random variable you would 66:21 get if you were to sum d squares of independent standard 66:26 Gaussians. 66:26 66:29 So I take the square of an independent random Gaussian. 66:33 I take another one. 66:34 I sum them, and that's a chi squared 66:36 with 2 degrees of freedom. 66:39 That's how you get it. 66:40 Now, I could define it using its probability density function. 66:46 I mean, after all, this is the sum 66:49 of positive random variables, so it 66:51 is a positive random variable. 66:53 It has a density on the positive real line. 66:56 And the pdf of chi squared with d degrees of freedom is what? 67:03 Well, it's fd of x is-- 67:07 what is it?-- x to the d/2 minus 1 e to the minus x/2. 67:13 And then here, I have a gamma of d/2. 67:16 And the other one is, I think, 2 to the d/2 minus 1. 67:20 67:23 No, 2 to the d/2. 67:26 That's what it is. 67:28 That's the density. 67:30 If you are very good at probability, 67:32 you can make the change of variable 67:33 and write your Jacobian and do all this stuff 67:35 and actually check that this is true. 67:37 I do not recommend doing that. 67:40 So this is the density, but it's better understood like that. 67:44 I think it was just something that you 67:46 built from standard Gaussian. 67:48 So for example, an example of a chi 67:50 squared with 2 degrees of freedom 67:52 is actually the following thing. 67:54 Let's assume I have a target like this. 67:56 68:00 And I don't aim very well. 68:02 And I'm trying to hit the center. 68:05 And I'm not going to have, maybe, 68:07 a deviation, which is standard Gaussian left, right 68:10 and standard Gaussian north, south. 
68:16 So I'm throwing, and then I'm here, 68:18 and I'm claiming that this number here, by Pythagoras' 68:22 theorem, the square distance here 68:24 is the sum of this square distance 68:25 here, which is the square of a Gaussian by assumption, 68:30 plus the square of this distance, 68:31 which is the square of another independent Gaussian. 68:34 I assume those are independent. 68:35 And so the square distance from this point to this point 68:37 is a chi squared with 2 degrees of freedom. 68:40 So this guy here is n01 squared. 68:45 This is n01 squared. 68:48 And so this guy here, this distance here, 68:50 is chi squared with 2 degrees of freedom. 68:53 I mean the square distance. 68:54 I'm talking about square distances here. 68:58 So now you can see that, actually, Pythagoras 69:02 is basically why the chi squared arises. 69:05 That's why it has its own name. 69:07 I mean, I could define this random variable. 69:10 I mean, it's actually a gamma distribution. 69:13 It's a special case of something called the gamma distribution. 69:15 The fact that the special case has its own name 69:17 is because there are many times when 69:19 we're going to take sums of squares 69:20 of independent Gaussians, because for Gaussians, the sum of squares 69:23 is really the norm, the Euclidean norm squared, 69:25 just by Pythagoras' theorem. 69:26 If I'm in higher dimension, I can 69:28 start to sum more squared coordinates, 69:30 and I'm going to measure the norm squared. 69:32 69:34 So if you want to draw this picture, it looks like this. 69:37 Again, it's the sum of positive numbers, 69:39 so it's going to be on 0 to plus infinity. 69:43 That's fd. 69:44 And so f1 looks like this, f2 looks like this. 69:52 So the tails become heavier and heavier as d increases. 69:57 And then at d equal to 3, it starts 70:00 to have a different shape. 70:01 It starts from 0 and it looks like this. 70:04 And then, as d increases, it's basically 70:06 as if you were to push this thing to the right. 70:09 It's just like, psh, so it's just falling like a big blob. 70:14 Everybody sees what's going on? 70:16 So there's just this fat thing that's just going there. 70:19 What is the expected value of a chi squared? 70:21 70:28 So it's the expected value of the sum 70:30 of squared Gaussian random variables. 70:37 I know I said that. 70:40 AUDIENCE: So it's the sum of their second moments, right? 70:42 PHILIPPE RIGOLLET: Which is? 70:43 70:46 Those are n01. 70:47 AUDIENCE: It's like-- oh, I see, 1. 70:50 PHILIPPE RIGOLLET: Yeah. 70:51 AUDIENCE: So n times 1 or d times 1. 70:53 PHILIPPE RIGOLLET: Yeah, which is d. 70:55 So one thing you can check quickly 70:56 is that the expected value of a chi squared is d. 71:00 And so you see, that's why the mass is shifting to the right 71:04 as d increases. 71:05 It's just going there. 71:06 Actually, the variance is also increasing. 71:08 The variance is 2d. 71:10 71:14 So this is one thing. 71:16 And so why do we care about this? 71:19 In basic statistics, it's not like we actually 71:22 do much statistics about throwing darts 71:25 at high-dimensional boards. 71:28 So what's happening is that if I look at the sample variance, 71:31 the average of the squared observations centered by their mean, 71:36 then I can actually expand this as the average 71:38 of the squares minus the average, squared. 71:42 It's just the same trick that we have 71:44 for the variance-- second moment minus first moment squared.
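A minimal sketch, assuming Python with numpy, of the dart picture and of the moments just computed: the squared distance of a throw whose left-right and north-south errors are independent standard Gaussians is a chi squared with d = 2 degrees of freedom, so its mean should come out near d = 2 and its variance near 2d = 4.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200_000

# Dart throws: independent standard Gaussian errors left-right and north-south.
x = rng.standard_normal(n)
y = rng.standard_normal(n)
sq_dist = x**2 + y**2          # squared distance to the bullseye: chi squared with d = 2

print(sq_dist.mean())          # ~2, the expected value d
print(sq_dist.var())           # ~4, the variance 2d
```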
71:49 And then I claim that Cochran's theorem-- 71:53 and I will tell you in a second what Cochran's theorem tells me-- 71:56 is that this sample variance is actually-- 71:58 so if I had only this-- 72:01 look at those guys. 72:04 Those guys are Gaussian with mean mu and variance 72:07 sigma squared. 72:08 Think for one second of mu being 0 and sigma squared being 1. 72:13 Now, this part would be a chi squared with n degrees 72:16 of freedom divided by n. 72:19 Now I get another thing here, which 72:21 is the square of something that looks like a Gaussian as well. 72:24 So it looks like I have something else here, which 72:27 looks also like a chi squared. 72:29 Now, Cochran's theorem is essentially telling you 72:31 that those things are independent, 72:35 and so that, in a way, you can think of those guys as being, 72:39 here, n degrees of freedom minus 1 degree of freedom. 72:43 Now, here, as I said, this does not have mean 0 and variance 1. 72:47 The fact that it's not mean 0 is not a problem, 72:50 because I can remove the mean here and remove the mean here. 72:54 And so this thing has the same distribution, 72:57 regardless of what the actual mean is. 72:59 So without loss of generality, I can 73:00 assume that mu is equal to 0. 73:02 Now, the variance, I'm going to have to pay, 73:03 because if I multiply all these numbers by 10, 73:06 then this sn is going to be multiplied by 100. 73:09 So this thing is going to scale with the variance. 73:11 And not surprisingly, it's scaling like the square 73:13 of the scale-- like sigma squared. 73:15 So if I look at sn, it's distributed 73:18 as sigma squared times a chi squared 73:21 with n minus 1 degrees of freedom, divided by n. 73:25 And we don't really write that, because a chi squared 73:28 times sigma squared divided by n is not a distribution, 73:30 so we put everything to the left, 73:32 and we say that n times sn over sigma squared is actually a chi squared with n 73:34 minus 1 degrees of freedom. 73:36 So here, I'm actually dropping a fact on you, 73:40 but you can see the building block. 73:43 What is the thing that's fuzzy at this point, 73:46 but the rest should be crystal clear to you? 73:48 The thing that's fuzzy is that removing this squared guy 73:52 here is actually removing 1 degree of freedom. 73:55 That might seem weird, but that's what Cochran's theorem tells us. 73:59 It's essentially stating something 74:00 about orthogonality of subspaces with the span 74:04 of the constant vector, something like that. 74:07 So you don't have to think about it too much, 74:09 but that's what it's telling me. 74:11 But the rest, if you plug in-- so the scaling in sigma squared 74:15 and in n-- that should be completely clear to you. 74:18 So in particular, if I remove that part, 74:20 it should be clear to you that this thing, if the mean is 0, 74:24 this thing is actually distributed-- 74:27 well, if mu is 0, what is the distribution of this guy? 74:30 74:35 So I remove that part, just this part. 74:37 74:46 So I have xi, which are n0 sigma squared. 74:50 And I'm asking, what is the distribution of 1/n sum from i 74:53 equal 1 to n of xi squared? 74:57 So these are IID. 75:00 So it's the sum of squares of independent Gaussians, but not standard ones. 75:03 So the first thing to make them standard 75:05 is that I divide all of these squares by sigma squared. 75:07 75:10 Now, this guy is of the form zi squared where zi is n01. 75:17 75:20 So now, this thing here has what distribution? 75:25 AUDIENCE: Chi squared n. 75:27 PHILIPPE RIGOLLET: Chi squared n.
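A minimal simulation sketch, assuming Python with numpy and that the data really are Gaussian as stated: rescaling the sample variance by n over sigma squared should produce something distributed like a chi squared with n minus 1 degrees of freedom, hence with mean n minus 1 and variance 2(n minus 1). The values of mu, sigma, and n below are just illustrative.

```python
import numpy as np

rng = np.random.default_rng(2)
mu, sigma, n, reps = 3.0, 2.0, 10, 100_000

x = rng.normal(mu, sigma, size=(reps, n))
sn = ((x - x.mean(axis=1, keepdims=True)) ** 2).mean(axis=1)  # sample variance with 1/n
scaled = n * sn / sigma**2                                     # claimed: chi squared, n - 1 df

print(scaled.mean())  # ~ n - 1 = 9
print(scaled.var())   # ~ 2(n - 1) = 18
```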
75:30 And now, sigma squared over n times chi squared n-- 75:33 so if I have sigma squared divided by n times chi 75:35 squared-- 75:37 sorry, so times n divided by sigma squared. 75:41 So if I take this thing and I multiply it 75:45 by n divided by sigma squared, it means I remove this term, 75:48 and now I am left with a chi squared 75:49 with n degrees of freedom. 75:51 Now, the effect of centering with the sample mean here 75:55 is only to lose 1 degree of freedom. 75:57 That's it. 75:58 76:01 So if I want to do a test about the variance, since this 76:05 is supposedly a good estimator of the variance, 76:08 this could be my pivotal distribution. 76:10 This could play the role of a Gaussian. 76:12 If I want to know if my variance is equal to 1 or larger than 1, 76:16 I could actually build a test based on this statement alone 76:21 and test if the variance is larger than 1 or not. 76:23 Now, this is not asymptotic, because I 76:25 started with the very assumption that my data was 76:28 Gaussian itself. 76:29 76:32 Now, just a side remark-- you can 76:33 check that this chi squared with 2 degrees of freedom is an exponential with parameter 76:37 1/2, which is certainly not 76:38 clear from the fact that z1 squared plus z2 squared 76:42 is a chi squared with 2 degrees of freedom. 76:44 If I give you the sum of the squares 76:46 of 2 independent Gaussians, this is actually an exponential. 76:50 That's not super clear, right? 76:53 But if you look at what was here-- 77:00 I don't know if you took notes, but let me rewrite it for you. 77:03 So it was x to the d/2 minus 1, e to the minus x/2, divided 77:08 by 2 to the d/2 gamma of d/2. 77:14 So if I plug in d is equal to 2, gamma of 2/2 77:18 is gamma of 1, which is 1. 77:21 It's factorial of 0. 77:23 So it's 1, so this guy goes away. 77:26 2 to the d/2 is 2 to the 1, so that's just 1. 77:33 No, that's just 2. 77:36 Then x to the d/2 minus 1 is x to the 0, goes away. 77:40 And so I have 1/2 e to the minus x/2, which is really, indeed, 77:47 of the form lambda e to the minus lambda 77:50 x for lambda equal to 1/2, which was 77:53 our exponential distribution. 77:54 77:59 Well, next week is, well, Columbus Day? 78:05 So not next Monday-- 78:08 so next week, we'll talk about Student's distribution. 78:12 And so that was discovered by a guy 78:15 who pretended his name was Student, but was not Student. 78:19 And I challenge you to find out why in the meantime. 78:23 So I'll see you next week. 78:24 Your homework is going to be outside 78:28 so we can release the room. 78:31
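A quick numerical companion, assuming Python with numpy and scipy, to the side remark above: the chi squared density with 2 degrees of freedom coincides with the exponential density with parameter lambda = 1/2, and a sample of z1 squared plus z2 squared has mean 1/lambda = 2.

```python
import numpy as np
from scipy.stats import chi2, expon

# Density check: chi squared with 2 df vs exponential with rate 1/2 (scale = 2 in scipy).
x = np.linspace(0.1, 10, 100)
print(np.allclose(chi2.pdf(x, df=2), expon.pdf(x, scale=2)))  # True

# Sample check: z1^2 + z2^2 has mean 2 = 1/lambda for lambda = 1/2.
rng = np.random.default_rng(3)
z = rng.standard_normal((100_000, 2))
print((z**2).sum(axis=1).mean())  # ~2
```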