https://www.youtube.com/watch?v=a66tfLdr6oY&list=PLUl4u3cNGP60uVBMaoNERc6knT_MgPKS0&index=10 Transcript 00:00 00:00 The following content is provided under a Creative 00:02 Commons license. 00:03 Your support will help MIT OpenCourseWare 00:06 continue to offer high quality educational resources for free. 00:10 To make a donation or to view additional materials 00:12 from hundreds of MIT courses, visit MIT OpenCourseWare 00:16 at ocw.mit.edu. 00:17 00:20 PROFESSOR: So we've been talking about this chi square test. 00:22 And the name chi square comes from the fact 00:26 that we build a test statistic that 00:28 has an asymptotic distribution given 00:31 by the chi square distribution. 00:36 Let's just give it another shot. 00:37 00:44 OK. 00:47 This test. 00:48 Who has actually ever encountered the chi square test 00:50 outside of a stats classroom? 00:54 All right. 00:54 So some people have. 00:55 It's a fairly common test that you might encounter. 00:59 And it was essentially used to test, given 01:01 some data with a fixed probability mass function, so 01:06 a discrete distribution, whether 01:08 the PMF was equal to a set value, p0, 01:12 or whether it was different from p0. 01:15 And the way the chi square arose here 01:18 was by looking at Wald's test. 01:22 And essentially if you write-- so Wald's is the one that 01:25 has the chi square as the limiting distribution, 01:27 and if you invert the covariance matrix, 01:31 the asymptotic covariance matrix, so you compute 01:33 the Fisher information, which in this particular case 01:36 does not exist for the multinomial distribution, 01:39 but we found the trick on how to do this. 01:41 We removed the part that prevented it from being invertible, 01:44 and then we found this chi square distribution. 01:46 In a way we have this test statistic, 01:47 which you might have learned as a black box, a laundry-list recipe, 01:50 but going through the math, which might have been slightly 01:53 unpleasant, I acknowledge, really told you 01:56 why you should do this particular normalization. 01:59 So since some of you requested a few more practical examples 02:04 of how those things work, let me show you a couple. 02:07 The first one is you want to answer the question, well, 02:12 you know, when should I be born to be successful? 02:16 Some people believe in the zodiac, and so Fortune magazine 02:20 actually collected the signs of 256 heads of the Fortune 500. 02:24 Those were taken randomly. 02:26 And they were collected there, and you 02:27 can see the count of the number of CEOs that 02:31 have a particular zodiac sign. 02:33 And if this was completely uniformly distributed, 02:35 you should actually get a number that's 02:37 around 256 divided by 12, which in this case is 21.33. 02:42 And you can see that there are numbers 02:45 that are probably in the vicinity, but look at this guy. 02:49 Pisces, that's 29. 02:51 So who's Pisces here? 02:53 All right. 02:55 All right, so give me your information 02:57 and we'll meet again in 10 years. 02:59 And so basically you might want to test 03:02 whether the fact that it's uniformly distributed 03:04 is a valid assumption. 03:06 Now this is clearly a random variable. 03:09 I pick a random CEO and I measure 03:13 what his zodiac sign is. 03:16 And I want to know, so it's a probability over, I don't know, 03:19 12 zodiac signs. 03:20 And I want to know if it's uniform or not. 03:23 Uniform sounds like it should be the status 03:25 quo, if you're reasonable. 03:27 And maybe there's actually something that moves away from it.
03:31 So we could do this, in view of these data is there evidence 03:34 that one is different. 03:36 Here is another example where you might want 03:38 to apply the chi square test. 03:40 So as I said, the benchmark distribution 03:44 was the uniform distribution for the zodiac sign, 03:46 and that's usually the one I give you. 03:47 1 over k, 1 over k, because well that's 03:49 sort of the zero, the central point for all distributions. 03:53 That's the point, the center of what we call the simplex. 03:57 But you can have another benchmark 03:58 that sort of makes sense. 03:59 So for example this is an actual dataset where 275 jurors were 04:04 identified, racial group were collected, 04:09 and you actually might want to know 04:10 if you know juries in this country 04:12 are actually representative of the actual population. 04:17 And so here of course, the population 04:19 is not uniformly distributed according to racial group. 04:23 And the way you actually do it is you 04:24 actually go on Wikipedia, for example, 04:26 and you look at the demographics of the United States, 04:28 and you find that the proportion of white is 72%, black is 7%, 04:33 Hispanic is 12, and other is about 9%. 04:41 So that's a total of 1. 04:43 And this is what we actually measured for some jurors. 04:46 So for this guy, you can actually 04:48 run the chi square test. 04:49 You have the estimated proportion, which 04:51 comes from this first line. 04:53 You have the tested proportion, p0, 04:55 that comes from the second line, and you 04:56 might want to check if those things actually 04:58 correspond to each other. 04:59 OK, so I'm not going to do it for you, 05:01 but I sort of invite you to do it 05:03 and test, and see how this compares 05:05 to the quantiles of the appropriate chi 05:07 square distribution and see what you can conclude from those two 05:10 things. 05:12 All right. 05:12 So this was the multinomial case. 05:15 So this is essentially what we did. 05:17 We computed the MLE under the right constraint, 05:19 and that was our test statistic that converges 05:20 to the chi square distribution. 05:22 So if you've seen it before, that's 05:23 all that was given to you. 05:24 Now we know why the normalization here 05:27 is p0 j and not p0 j squared or square root of p0 j, or even 1. 05:33 I mean it's not clear that this should 05:35 be the right normalization, but we 05:36 know that's what comes from taking 05:38 the right normalization, which comes from the Fisher 05:41 information. 05:42 All right? 05:43 OK. 05:43 05:47 The thing I wanted to move onto, so we've basically covered 05:50 chi square test. 05:51 Are there any questions about chi square test? 05:53 And for those of you who were not here on Thursday, 05:56 I'm really just-- 05:57 do not pretend I just did it. 05:59 That's something we did last Thursday. 06:01 But are there any questions that arose 06:03 when you were reading your notes, things 06:04 that you didn't understand? 06:06 Yes. 06:06 AUDIENCE: Is there like a formal name? 06:09 Before we had talked about how what we call the Fisher 06:12 information [INAUDIBLE],, still has the same [INAUDIBLE] 06:17 because it's the same number. 06:21 PROFESSOR: So it's not the Fisher. 06:23 The Fisher information does not exist in this case. 06:25 And so there's no appropriate name for this. 06:27 It's the pseudoinverse of the asymptotic covariance matrix, 06:30 and that's what it is. 
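To make the juror example concrete, here is a minimal Python sketch of the statistic the lecture builds, T_n = n Σ_j (p̂_j − p0_j)² / p0_j, compared against a chi square quantile with K − 1 degrees of freedom. The null proportions are the ones quoted above; the observed counts below are hypothetical placeholders, since the counts from the slide are not reproduced in this transcript.

```python
# A minimal sketch of the chi-square goodness-of-fit test for the juror
# example. The null proportions p0 are the ones quoted in the lecture;
# the observed counts are hypothetical placeholders (the slide with the
# actual counts is not reproduced in this transcript).
import numpy as np
from scipy.stats import chi2

p0 = np.array([0.72, 0.07, 0.12, 0.09])   # white, black, Hispanic, other
counts = np.array([205, 26, 25, 19])      # hypothetical counts, n = 275
n = counts.sum()
p_hat = counts / n

# T_n = n * sum_j (p_hat_j - p0_j)^2 / p0_j, the normalization from the lecture
T_n = n * np.sum((p_hat - p0) ** 2 / p0)

alpha = 0.05
df = len(p0) - 1                          # K - 1 degrees of freedom
q = chi2.ppf(1 - alpha, df)
print(f"T_n = {T_n:.2f}, chi2({df}) 95% quantile = {q:.2f}, p-value = {chi2.sf(T_n, df):.3f}")
print("reject H0" if T_n > q else "fail to reject H0")
```

Written with counts instead of proportions, this is the familiar Pearson statistic Σ (observed − expected)² / expected.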
06:32 I don't know if I mentioned it last time, 06:34 but there's this entire field that uses-- 06:36 you know, for people who really aspire to differential geometry 06:39 but are stuck in the stats department, 06:41 and there's this thing called information geometry, which 06:43 is essentially studying the manifolds associated 06:47 to the Fisher information metric, the metric that's 06:50 associated to Fisher information. 06:52 And so those of course can be lower dimensional manifolds, 06:55 not only distorts the geometry but forces everything 06:58 to live on a lower dimension, which 06:59 is what happens when your Fisher information does not exist. 07:02 And so there's a bunch of things that you 07:04 can study, what this manifold looks like, et cetera. 07:06 But no, there's no particular terminology here 07:09 about going here. 07:12 To be fair, within the scope of this class, 07:14 this is the only case where you-- 07:18 multinomial case is the only case 07:19 where you typically see a lack of a Fisher information matrix. 07:26 And that's just because we have these extra constraints 07:28 that the sum of the parameters should be 1. 07:30 And if you have an extra constraint that 07:31 seems like it's actually remove one degree of freedom, 07:34 this will happen inevitably. 07:36 And so maybe what you can do is reparameterize. 07:40 So if I actually reparameterize everything function of p1 07:44 to p k minus 1, and then 1 minus the sum, 07:46 this would not have happened. 07:48 Because I have only a k-dimensional space. 07:51 So there's tricks around this to make it 07:53 exist if you want it to exist. 07:56 Any other question? 07:58 All right. 07:59 So let's move on to Student's t-test. 08:02 We mentioned it last time. 08:03 So essentially you've probably done it 08:06 more even in the homework than you've done it in lectures, 08:09 but just quickly this is essentially the test. 08:12 That's the test when we have an actual data that comes 08:15 from a normal distribution. 08:16 There is no Central Limit Theorem that exists. 08:18 This is really to account for the fact 08:21 that for smaller sample sizes, it 08:24 might be the case that it's not exactly true that when 08:27 I look at xn bar minus mu divided by-- so if I look 08:33 at xn bar minus mu divided by sigma times square root of n, 08:37 then this thing should have N 0, 1 distribution approximately. 08:41 Right? 08:42 By the Central Limit Theorem. 08:45 So that's for n large. 08:47 But if n is small, then it's still true 08:53 when the data is N mu, sigma squared, 09:00 then it's true that square root of n-- 09:02 09:09 so here it's approximately. 09:12 And this is always true. 09:14 But I don't know sigma in practice, right? 09:16 Maybe mu, it comes from my, maybe mu comes from my mu 09:20 0, maybe something from the test statistic 09:23 where mu actually is here. 09:25 But for this guy I'm going to have inevitably 09:27 to find an estimator. 09:29 And now in this case, for small n, this is no longer true. 09:32 And what the t statistic is doing 09:34 is essentially telling you what the distribution of this guy 09:36 is. 09:37 So what you should say is that now this guy 09:41 has a t distribution with n minus 1 degrees of freedom. 09:44 That's basically the laundry list stats 09:47 that you would learn. 09:48 It says just look at a different table, that's what it is. 09:50 But we actually defined what a t distribution was. 
09:55 And a t distribution is basically 09:58 something that has the same distribution as some N 0, 1, 10:03 divided by the square root of a chi square 10:06 with d degrees of freedom divided by d. 10:08 And that's a t distribution with d degrees of freedom. 10:12 And those two have to be independent. 10:14 10:20 And so what I need to check is that this guy over there 10:24 is of this form. 10:25 10:37 OK? 10:39 So let's look at the numerator. 10:41 Well, square root of n, xn bar minus mu. 10:45 What is the distribution of this thing? 10:47 Is it an N 0, 1? 10:50 AUDIENCE: N 0, sigma squared? 10:52 PROFESSOR: N 0, sigma squared, right. 10:54 10:58 So I'm not going to put it here. 11:00 So if I want this guy to be N 0, 1, 11:01 I need to divide by sigma, that's what we have over there. 11:04 11:06 So that's my N 0, 1 that's going to play the role of this guy 11:09 here. 11:11 So if I want to go a little further, 11:13 I need to just say, OK, now I need to have square root of n, 11:21 and I need to find something here 11:23 that looks like my square root of chi square divided 11:27 by-- yeah? 11:28 AUDIENCE: Really quick question. 11:29 The equals sign with the d on top, that's just defined as? 11:32 PROFESSOR: No, that's just the distribution. 11:35 So, I don't know. 11:37 AUDIENCE: Then never mind. 11:38 PROFESSOR: Let's just write it like that, if you want. 11:41 I mean, that's not really appropriate to have. 11:44 Usually you write only one distribution 11:46 on the right-hand inside of this little thing. 11:48 So not just this complicated function of distributions. 11:51 This is more like to explain. 11:53 OK, and so usually the thing you should 11:54 say that t is equal to this X divided by square root of Z 11:58 divided by d where X has normal distribution, 12:01 Z has chi square distribution with d degrees of freedom. 12:06 So what do we need here? 12:07 Well I need to have something which looks like my sigma hat, 12:10 right? 12:10 So somehow inevitably I'm going to need to have sigma hat. 12:13 12:16 Now of course I need to divide this by my sigma 12:18 so that my sigma goes away. 12:19 12:22 And so now this thing here-- 12:25 sorry, I should move on to the right, OK. 12:27 And so this thing here, so sigma hat is square root of Sn. 12:33 And now I'm almost there. 12:35 So this thing is actually equal to square root of n. 12:38 12:47 But this thing here is actually not a-- 12:51 12:55 so this thing here follows a distribution 12:57 which is actually a chi square, square root 13:00 of a chi square distribution divided by n. 13:11 13:15 Yeah, that's the square root chi square distribution 13:18 with n minus 1 degrees of freedom divided 13:20 by n, because sigma hat is equal to 1 over n sum 13:25 from i equal 1 to n, xi minus x bar squared. 13:30 And we just said that this part here 13:32 was a chi square distribution. 13:34 We didn't just say it, we said it a few lectures years back, 13:36 that this thing was a chi square distribution, and the fact 13:39 that the presence of this x bar here 13:42 was actually removing one degree of freedom from this sum. 13:46 OK, so this guy here has the same distribution 13:48 as a chi square n minus 1 divided by n. 13:52 So I need to actually still arrange this thing a little bit 13:56 to have a t distribution. 13:58 I should not see n here, but I should n minus 1. 14:01 The d is the same as this d here. 14:06 And so let me make the correction 14:07 so that this actually happens. 
14:09 Well, if I actually write this to be equal to-- 14:14 so if I write square root of n minus 1, as on the slide, 14:19 times xn bar minus mu divided by-- 14:25 well let me write it as square root of Sn, 14:27 which is my sigma hat. 14:29 Then what this thing is actually equal to, 14:33 it follows a N 0, 1, divided by the square root 14:39 of my chi square distribution with n 14:40 minus 1 degrees of freedom. 14:42 And here the fact that I multiply 14:43 by square root of n minus 1, and I 14:45 have the square root of n here, is essentially the same 14:47 as dividing here by n minus 1. 14:51 And that's my tn distribution. 14:54 My t distribution with n minus 1 degrees of freedom. 14:58 Just by definition of what this thing is. 15:00 OK? 15:00 15:22 All right. 15:22 Yes? 15:23 AUDIENCE: Where'd you get the square root from? 15:26 PROFESSOR: This guy? 15:27 Oh sorry, that's sigma squared. 15:28 Thank you. 15:30 That's the estimator of the variance, not the estimator 15:32 of the standard deviation. 15:33 And when I want to divide it I divide by standard deviation. 15:35 Thank you. 15:38 Any other question or remark? 15:40 AUDIENCE: Shouldn't you divide by sigma squared? 15:42 The actual. 15:45 The estimator for the variance is 15:47 equal to sigma squared times chi square, right? 15:52 PROFESSOR: The estimator for the variance. 15:55 Oh yes, you're right. 15:56 So there's a sigma squared here. 15:59 Is that what you're asking? 16:00 AUDIENCE: Yeah. 16:00 PROFESSOR: Yes, absolutely. 16:01 And that's where, it get cancels here. 16:03 It gets canceled here. 16:04 16:10 OK? 16:10 16:13 So this is really a sigma squared times chi square. 16:15 16:20 OK. 16:21 So the fact that it's sigma squared 16:22 is just because I can pull out sigma 16:24 squared and just think those guys N 0, 1. 16:26 16:32 All right. 16:33 So that's my t distribution. 16:34 Now that I actually have a pivotal distribution, what I do 16:37 is that I form the statistic. 16:40 Here I called it Tn tilde. 16:42 16:52 OK. 16:53 And what is this thing? 16:54 I know that this has a pivotal distribution. 16:56 So for example, I know that the probability 16:59 that Tn tilde in absolute value exceeds some number that I'm 17:05 going to call q alpha over 2 for the t n minus 1, 17:11 is equal to alpha. 17:13 So that's basically, remember the t distribution 17:16 has the same shape as the Gaussian distribution. 17:19 What I'm finding is, for this t distribution, 17:21 some number q alpha over 2 of t n minus 1 17:26 and minus q alpha over 2 of t minus 1. 17:29 So those are different from the Gaussian one. 17:31 Such that the area under the curve 17:33 here is alpha over 2 on each side 17:36 so that the probability that my absolute value exceeds 17:39 this number is equal to alpha. 17:43 And that's what I'm going to use to reject the test. 17:46 So now my test becomes, for H0, say mu is equal to some mu 0, 17:59 versus H1, mu is not equal to mu 0. 18:05 18:08 The rejection region is going to be equal to the set on which 18:13 square root of n minus 1 times xn bar minus mu 0 this time, 18:19 divided by square root of Sn exceeds, in absolute value, 18:25 exceeds q-- sorry that's already here-- 18:28 exceeds q alpha over 2 of t n minus 1. 18:34 So I reject when this thing increases. 18:36 The same as the Gaussian case, except that rather than reading 18:39 my quantiles from the Gaussian table 18:41 I read them from the Student table. 18:44 It's just the same thing. 18:45 So they're just going to be a little bit farther. 
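As a concrete sketch (not from the lecture itself), here is a small implementation of this two-sided test, using exactly the normalization on the board, where S_n is the biased (1/n) sample variance. Since √(n−1)/√S_n equals √n divided by the unbiased sample standard deviation, the statistic and p-value should agree with scipy.stats.ttest_1samp.

```python
# A minimal sketch of the two-sided one-sample t test as written on the
# board: reject when |sqrt(n-1) (xbar - mu0) / sqrt(S_n)| exceeds the
# alpha/2 quantile of t_{n-1}, where S_n is the biased (1/n) variance.
import numpy as np
from scipy.stats import t

def t_test(x, mu0, alpha=0.05):
    x = np.asarray(x, dtype=float)
    n = len(x)
    xbar = x.mean()
    S_n = np.mean((x - xbar) ** 2)          # (1/n) * sum (x_i - xbar)^2
    T = np.sqrt(n - 1) * (xbar - mu0) / np.sqrt(S_n)
    q = t.ppf(1 - alpha / 2, df=n - 1)      # q_{alpha/2} of t_{n-1}
    p_value = 2 * t.sf(abs(T), df=n - 1)
    return T, q, p_value, abs(T) > q        # reject if the last entry is True

# Example with a small, hypothetical Gaussian sample of size 15
rng = np.random.default_rng(1)
x = rng.normal(loc=0.3, scale=1.0, size=15)
print(t_test(x, mu0=0.0))
```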
18:48 So this guy here is just going to be a little bigger 18:52 than the one for the Gaussian one, 18:54 because it's going to require me a little more evidence 18:57 in my data to be able to reject because I 18:59 have to account for the fluctuations of sigma hat. 19:01 19:09 So of course Student's test is used everywhere. 19:12 People use only t tests, right? 19:15 If you look at any data point, any output, 19:19 even if you had 500 observations, 19:21 if you look at the statistical software output 19:23 it's going to say t test. 19:25 And the reason why you see t test 19:26 is because somehow it's felt like it's not asymptotic. 19:29 You don't need to actually do, you 19:31 know, to be particularly careful. 19:33 And anyway, if n is equal to 500, 19:35 since the two curves are above each other 19:37 it's basically the same thing. 19:39 So it doesn't really change anything. 19:40 So why not use the t test? 19:43 So it's not asymptotic. 19:44 It doesn't require Central Limit Theorem to kick in. 19:47 And so in particular it be run if you have 15 observations. 19:50 Of course, the drawback of the Student test 19:52 is that it relies on the assumption 19:54 that the sample is Gaussian, and that's something 19:56 we really need to keep in mind. 19:57 If you have a small sample size, there is no magic going on. 20:01 It's not like Student t test allows you to get rid 20:04 of this asymptotic normality. 20:06 It sort of assumes that it's built in. 20:08 It assumes that your data has a Gaussian distribution. 20:14 So if you have 15 observations, what are you going to do? 20:18 You want to test if the mean is equal to 0 or not equal to 0, 20:21 but you have only 15 observations. 20:24 You have to somehow assume that your data is Gaussian. 20:27 But if the data is given to you, this is not math, 20:30 you actually have to check that it's Gaussian. 20:32 And so we're going to have to find 20:33 a test that, given some data, tells us whether it's Gaussian 20:38 or not. 20:39 If I have 15 observations, 8 of them 20:42 are equal to plus 1 and 7 of them are equal to minus 1, 20:46 then it's pretty unlikely that you're 20:47 going to be able to conclude that your data has a Gaussian 20:50 distribution. 20:51 However, if you see some sort of spread around some value, 20:54 you form a histogram maybe and it sort of 20:56 looks like it's a Gaussian, you might 20:57 want to say it's Gaussian. 20:59 And so how do we make this more quantitative? 21:01 Well, the sad answer to this question 21:05 is that there will be some tests that make it quantitative, 21:08 but here, if you think about it for one second, what is going 21:11 to be your null hypothesis? 21:13 Your null hypothesis, since it's one point, 21:15 it's going to be that it's Gaussian, 21:17 and then the alternative is going 21:19 to be that it's not Gaussian. 21:21 So what it means is that, for the first time 21:23 in your statistician life, you're 21:26 going to want to conclude that H0 is the true one. 21:30 You're definitely not going to want 21:31 to say that it's not Gaussian, because then everything you 21:34 know is sort of falling apart. 21:36 And so it's kind of a weird thing where 21:39 you're sort of going to be seeking tests 21:41 that have no power basically. 21:43 You're going to want to test that, and that's the nature. 21:46 The amount of alternatives, the number 21:49 of ways you can be not Gaussian, is so huge 21:52 that all tests are sort of bound to have very low power. 
21:56 And so that's why people are pretty happy with the idea 21:58 that things are Gaussian, because it's 22:00 very hard to find a test that's going 22:01 to reject this hypothesis. 22:04 And so we're even going to find some tests that are visual, 22:08 where you're going to be able to say, 22:10 well, it sort of looks Gaussian to me. 22:12 That allows you to deal with the borderline cases 22:16 pretty efficiently. 22:17 We'll actually see a particular example. 22:19 All right, so this theory of testing 22:22 whether data comes from a particular distribution 22:24 is called goodness of fit. 22:26 Is this distribution a good fit for my data? 22:31 That's the goodness of fit test. 22:33 We have just seen a goodness of fit test. 22:36 What was it? 22:36 22:41 Yeah. 22:44 The chi square test, right? 22:46 In the chi square test, we were given a candidate PMF 22:49 and we were testing if this was a good fit for our data. 22:52 That was a goodness of fit test. 22:54 So of course the multinomial is one example, 22:57 but really what we have in the back of our mind is 22:59 I want to test if my data is Gaussian. 23:01 That's basically the usual thing. 23:03 And just like you always see the t test as the standard output 23:06 from statistical software whether you ask for it or not, 23:09 there will be a test for normality, 23:11 whether you ask for it or not, from any statistical software package. 23:16 All right. 23:17 So a goodness of fit test looks as follows. 23:19 There's a random variable X and you're 23:21 given i.i.d. copies of X, X1 to Xn; 23:23 they come from the same distribution. 23:25 And you're going to ask the following question: does X have 23:28 a standard normal distribution? 23:31 So for the t test, that's definitely 23:33 the kind of question you may want to ask. 23:35 Does X have a uniform distribution on 0, 1? 23:39 That's different from the distribution 1 23:41 over k, 1 over k; it's the continuous notion 23:44 of uniformity. 23:47 And for example, you might want to test that-- 23:49 so there's actually a nice exercise, which 23:51 is if you look at the p-values. 23:53 So we've defined what the p-values were. 23:55 And the p-value's a number between 0 and 1, right? 23:59 And you could actually ask yourself, 24:01 what is the distribution of the p-value under the null? 24:04 So the p-value is a random number. 24:08 It's the probability-- so the p-value-- let's look 24:10 at the following test. 24:13 24:17 H0, mu is equal to 0, versus H1, mu is not equal to 0. 24:25 And I know that the p-value is-- 24:28 so I'm going to form what? 24:29 I'm going to look at Xn bar minus mu 24:34 times square root of n divided by-- let's say that we 24:37 know sigma for one second. 24:40 Then the p-value is the probability 24:43 that this is larger than square root of n little xn 24:48 bar minus mu, minus 0 actually in this case, 24:54 divided by sigma, where this guy is the observed value. 24:59 25:04 OK. 25:05 So now you could say, well, how is that a random variable? 25:09 It's just a number. 25:11 It's just a probability of something. 25:13 But then I can view this as a function of this guy 25:17 here when I plug a random variable back in. 25:23 So what I mean by this is that if I look at this value 25:26 here, if I say that phi is the CDF of N 0, 1, 25:34 then the p-value is the probability 25:36 that it exceeds this. 25:37 So that's the probability that I'm either here or here. 25:41 25:44 AUDIENCE: [INAUDIBLE] 25:47 PROFESSOR: No, it's not, right?
25:49 AUDIENCE: [INAUDIBLE] 25:52 PROFESSOR: This is a big X and this is a small x. 25:55 This is just where you plug in your data. 25:57 The p-value is the probability that you 25:59 have more evidence against your null 26:03 than what you already have. 26:05 OK, so now I can write it in terms 26:06 of cumulative distribution functions. 26:09 So this is what? 26:09 This is phi of this guy, which is minus this thing here. 26:14 26:17 Well it's basically 2 times this guy, 26:19 phi of minus square root of n, Xn bar divided by sigma. 26:27 26:30 That's my p-value. 26:31 If you give me data, I'm going to compute the average 26:33 and plug it in there, and it can spit out the p-value. 26:36 Everybody agrees? 26:37 26:39 So now I can view this, if I start now looking back I say, 26:42 well, where does this data come from? 26:45 Well, it could be a random variable. 26:48 It came from the realization of this thing. 26:51 So I can try to, I can think of this value, 26:54 where now this is a random variable because I just plugged 26:57 in a random variable in here. 26:59 So now I view my p-value as a random variable. 27:04 So I keep switching from small x to large X. Everybody 27:06 agrees what I'm doing here? 27:08 So I just wrote it as a deterministic function 27:11 of some deterministic number, and now the function 27:14 stays deterministic but the number becomes random. 27:17 And so I can think of this as some statistic of my data. 27:21 And I could say, well, what is the distribution 27:23 of this random variable? 27:26 Now if my data is actually normally distributed, 27:29 so I'm actually under the null, so 27:31 under the null, that means that Xn bar times square root of n 27:37 divided by sigma has what distribution? 27:40 27:48 Normal? 27:48 27:56 Well it was sigma, I assume I knew it. 27:59 So it's N 0, 1, right? 28:00 I divided by sigma here. 28:02 OK? 28:03 So now I have this random variable. 28:04 28:15 And so my random variable is now 2 phi of minus absolute value 28:24 of a Gaussian. 28:24 28:34 And I'm actually interested in the distribution of this thing. 28:40 I could ask that. 28:41 Anybody has an idea of how you would 28:43 want to tackle this thing? 28:45 If I ask you, what is the distribution 28:46 of a random variable, how do you tackle this question? 28:48 28:53 There's basically two ways. 28:54 One is to try to find something that 28:55 looks like the expectation of h of x for all h. 29:02 And you try to write this using change of variables 29:04 and something that looks like integral of h of x p of x dx. 29:09 And then you say, well, that's the density. 29:12 If you can read this for any h, then that's 29:15 the way you would do it. 29:16 But there's a simpler way that does not 29:19 involve changing variables, et cetera, 29:21 you just try to compute the cumulative distribution 29:23 function. 29:25 So let's try to compute the probability 29:26 that 2 phi minus N 0, 1, is less than t. 29:34 And maybe we can find something we know. 29:38 OK. 29:38 Well that's equal to what? 29:39 That's the probability that a minus N 0, 29:43 well let's say that an N 0, 1-- 29:45 sorry, N 0, 1 absolute value is greater than minus phi inverse 29:57 of t over 2. 29:58 30:04 And that's what? 30:05 Well, it's just the same thing that we had before. 30:07 It's equal to-- so if I look again, 30:12 this is the probability that I'm actually on this side 30:15 or that side of this number. 30:17 And this number is what? 30:18 It's minus phi of t over 2. 30:25 Why do I have a minus here? 
30:27 30:32 That's fine, OK. 30:33 So it's actually not this, it's actually the probability 30:36 that my absolute value-- 30:39 oh, because phi inverse. 30:41 OK. 30:42 Because phi inverse is-- 30:44 so I'm going to look at t between 0 30:48 and-- so this number is ranging between 0 and 1. 30:52 So it means that this number is ranging between 0-- 30:55 well, the probability that something is less than t 30:58 should be ranging between the numbers that this guy takes, 31:03 so that's between 0 and 2. 31:04 31:11 Because this thing takes values between 0 and 2. 31:14 I want to see 0 and 1, though. 31:16 31:21 AUDIENCE: Negative absolute value is always less 31:23 than [INAUDIBLE]. 31:24 31:29 PROFESSOR: Yeah. 31:29 You're right, thank you. 31:30 So this is always some number which is less than 0, 31:34 so the probability that the Gaussian is less 31:36 than this number is always less than the probability 31:38 it's less than 0, which is 1/2, so t only 31:40 has to be between 0 and 1. 31:41 Thank you. 31:43 And so now for t between 0 and 1, then 31:47 this guy is actually becoming something which is positive, 31:50 for the same reason as before. 31:52 And so that's what? 31:53 That's just basically 2 times phi of phi inverse of t over 2. 32:04 32:07 That's just playing with the symmetry a little bit. 32:09 You can look at the areas under the curve. 32:11 And so what it means is that those two guys cancel. 32:13 This is the identity. 32:15 And so this is equal to t. 32:18 So which distribution has a density-- 32:23 sorry, which distribution has a cumulative distribution 32:27 function which is equal to t for t between 0 and 1? 32:32 That's the uniform distribution, right? 32:34 So it means that this guy follows a uniform distribution 32:37 on the interval 0, 1. 32:39 32:44 And you could actually check that. 32:45 For any test you're going to come up with, 32:47 this is going to be the case. 32:48 Your p-value under the null will have a distribution 32:52 which is uniform. 32:54 So now if somebody shows up and says, here's my test, 32:58 it's awesome, it just works great. 33:00 I'm not going to explain to you how I built it, 33:02 it's a complicated statistics that 33:03 involve moments of order 27. 33:06 And I'm like, OK, you know, how am I 33:08 going to test that your test statistic actually makes sense? 33:11 Well one thing I can do is to run a bunch of data, 33:16 draw a bunch of samples, compute your test statistic, 33:18 compute the p-value, and check if my p-value has 33:22 a uniform distribution on the interval 0, 1. 33:27 But for that I need to have a test that, 33:29 given a bunch of observations, can tell me 33:31 whether they're actually distributed uniformly 33:33 on the interval 0, 1. 33:34 And again one thing I could do is build a histogram 33:36 and see if it looks like that of a uniform, 33:40 but I could also try to be slightly more quantitative 33:42 about this. 33:43 AUDIENCE: Why does the [INAUDIBLE] have 33:44 to be for a [INAUDIBLE]? 33:47 PROFESSOR: For two tests? 33:48 AUDIENCE: For each test. 33:51 Why does the p-value have to be normal? 33:54 I mean, uniform. 33:55 PROFESSOR: It's uniform under the null. 33:57 So because my test statistic was built under the null, 34:00 and so I have to be able to plug in the right value in there, 34:03 otherwise it's going to shift everything 34:04 for this particular test. 34:06 AUDIENCE: At the beginning while your probabilities 34:08 were of big Xn, that thing. 34:09 That thing is the p-value. 
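Here is a minimal simulation sketch of that idea, with arbitrary illustrative choices of n and the number of replications: draw many datasets under the null with known sigma, compute the p-value 2Φ(−√n |X̄_n| / σ) for each, and check that the resulting sample of p-values looks uniform on [0, 1].

```python
# A minimal simulation sketch of the claim just derived: under H0 (mu = 0,
# known sigma), the p-value 2*Phi(-sqrt(n)|xbar|/sigma), viewed as a random
# variable, is uniform on [0, 1]. Sample size and number of replications
# below are arbitrary illustrative choices.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
n, reps, sigma = 50, 10_000, 1.0

xbar = rng.normal(0.0, sigma, size=(reps, n)).mean(axis=1)   # data drawn under H0
pvals = 2 * norm.cdf(-np.sqrt(n) * np.abs(xbar) / sigma)

# Empirical quantiles of the p-values should sit close to the quantile
# levels themselves if the distribution is uniform on [0, 1].
levels = np.arange(0.1, 1.0, 0.1)
print(np.round(np.quantile(pvals, levels), 3))
print(np.round(levels, 3))
```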
34:11 PROFESSOR: That's the p-value, right? 34:13 That's the definition of the p-value. 34:15 AUDIENCE: OK. 34:15 34:17 PROFESSOR: So it's the probability 34:19 that my test statistic exceeds what I've actually observed. 34:23 AUDIENCE: So how you run the test is basically 34:26 you have your observations and plug them 34:29 into the cumulative distribution function for a normal, 34:33 and then see if it falls under the given-- 34:35 PROFESSOR: Yeah. 34:36 So my p-value is just this number 34:39 when I just plug in the values that I observe here. 34:42 That's one number. 34:43 For every dataset you're going to give me, 34:45 it's going to be one number. 34:46 Now what I can do is generate a bunch of datasets of size n, 34:51 like 200 of them. 34:53 And then I'm going to have a new sample 34:55 of say 200, which is just the sample of 200 p-values. 34:59 And I want to test if those p-values have 35:00 a uniform distribution. 35:02 OK? 35:02 Because that's the distribution they should be having. 35:05 All right? 35:06 35:11 OK. 35:12 This one we've already seen. 35:13 Does x have a PMF with 30%, 50%, and 20%? 35:18 That's something I could try to test. 35:21 That looks like your grade point distribution for this class. 35:27 Well not exactly, but that looks like it. 35:30 So all these things are known as goodness of fit tests. 35:33 The goodness of fit test is something 35:34 that you want to know if the data that you have at hand 35:38 follows the hypothesized distribution. 35:41 So it's not a parametric test. 35:43 It's not a test that says, is my mean equal to 25 or not. 35:46 Is my proportion of heads larger than 1/2 or not? 35:51 It's something that says, my distribution 35:53 this particular thing. 35:54 35:57 So I'm going to write them as goodness of fit, G-O-F here. 36:00 You don't need to have parametric modeling to do that. 36:02 36:05 So how do I work? 36:06 So if I don't have any parametric modeling, 36:09 I need to have something which is somewhat non-parametric, 36:12 something that goes beyond computing the mean 36:14 and the standard deviation, something 36:16 that computes some intrinsic non-parametric aspect 36:19 of my data. 36:21 And just like here we made this computation, what we did 36:24 is we said well, if I actually check 36:28 that the CDF of my data, that my p-value is uniform, 36:34 then I know it's uniform. 36:35 So it means that the cumulative distribution function 36:37 has an intrinsic value about it that captures 36:39 the entire distribution. 36:41 Everything I need to know about my distribution 36:44 is captured by the cumulative distribution function. 36:47 Now I have an empirical way of computing, 36:49 I have a data-driven way of computing 36:52 an estimate for the cumulative distribution function, which 36:54 is using the old statistical trick which 36:57 consists of replacing expectations by averages. 37:00 So as I said, the cumulative distribution function 37:04 for any distribution, for any random variable, is-- 37:08 37:12 so F of t is the probability that X 37:17 is less than or equal to t, which 37:19 is equal to the expectation of the indicator 37:22 that X is less than or equal to t. 37:26 That's the definition of a probability. 37:28 And so here I'm just going to replace expectation 37:31 by the average. 37:34 That's my usual statistical trick. 37:37 And so my estimator Fn for-- 37:42 the distribution is going to be 1 over n sum from i 37:45 equal 1 to n of these indicators. 37:48 37:53 And this is called the empirical CDF. 
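Here is a small sketch of the empirical CDF just defined; the helper name ecdf is mine, not notation from the lecture.

```python
# The empirical CDF: F_n(t) is the fraction of observations <= t,
# i.e. an average of indicators 1{X_i <= t}.
import numpy as np

def ecdf(x):
    """Return a function t -> F_n(t) built from the sample x."""
    x = np.sort(np.asarray(x, dtype=float))
    n = len(x)
    def F_n(t):
        # number of observations <= t, divided by n
        return np.searchsorted(x, t, side="right") / n
    return F_n

rng = np.random.default_rng(0)
sample = rng.standard_normal(20)
F_n = ecdf(sample)
for t in (-1.0, 0.0, 1.0):
    print(t, F_n(t))   # proportion of the 20 points that are <= t
```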
37:58 It's just the data version of the CDF. 38:01 38:04 So I just replaced this expectation here by an average. 38:08 38:13 Now when I sum indicators, I'm actually 38:17 counting the number of them that satisfy something. 38:20 So if you look at what this guy is, 38:24 this is the number of X i's that is less than t, right? 38:32 And so if I divide by n, it's the proportion of observations 38:35 I have that are less than t. 38:36 38:41 That's what the empirical distribution is. 38:43 38:46 That's what's written here, the number of data points 38:50 that are less than t. 38:52 And so this is going to be something 38:53 that's sort of trying to estimate one or the other. 38:57 And the law of large number actually 38:59 tells me that for any given t, if n is large enough, Fn of t 39:03 should be close to F of t. 39:05 Because it's an average. 39:07 And this entire thing, this entire statistical trick, 39:10 which consists of replacing expectations by averages, 39:13 is justified by the law of large number. 39:16 Every time we used it, that was because the law of large number 39:19 sort of guaranteed to us that the average was 39:21 close to the expectation. 39:23 39:26 OK. 39:27 So law of large numbers tell me that Fn of t converges, 39:30 so that's the strong law, says that almost surely actually 39:34 Fn of t goes to F of t. 39:35 39:40 And that's just for any given t. 39:43 Is there any question about this? 39:46 That averages converge to expectation, 39:48 that's the law of large number. 39:49 39:52 And almost surely we could say in probability 39:54 it's the same, that would be the weak law of large number. 39:57 40:00 Now this is fine. 40:01 For any given t, the average converges to the true. 40:05 It just happens that this random variable is indexed by t, 40:09 and I could do it for t equals 1 or 2 or 25, 40:12 and just check it again. 40:14 But I might want to check it for all t's at once. 40:18 And that's actually a different result. 40:19 That's called a uniform result. I 40:21 want this to hold for all t at the same time. 40:25 And it may be the case that it works for each t individually 40:28 but not for all t's at the same time. 40:31 What could happen is that for t equals 1 40:33 it converges at a certain rate, and for t equals 2 40:36 it converges at a bit of a slower rate, 40:37 and for t equals 3 at a slower rate and slower rate. 40:41 And so as t goes to infinity, the rate is going to vanish 40:43 and nothing is going to converge. 40:45 That could happen. 40:46 I could make this happen at a finite point. 40:48 There's many ways where it could make this happen. 40:50 Let's see how that could work. 40:52 I could say, well, actually no. 40:54 I still need to have this at infinity for some reason. 40:59 It turns out that this is still true uniformly, 41:01 and this is actually a much more complicated result 41:03 than the law of large number. 41:05 It's called Glivenko-Cantelli Theorem. 41:07 And the Glivenko-Cantelli Theorem 41:09 tells me that, for all t's at once, Fn converges to F. 41:14 So let me just show you quickly why 41:18 this is just a little bit stronger than the one 41:22 that we had. 41:25 If sup is confusing you, think of max. 41:29 It's just the max over an infinite set. 41:31 And so what we know is that Fn of t goes to F of t 41:40 as n goes to infinity. 41:43 And that's almost surely. 41:45 And that's the law of large numbers. 41:48 Which is equivalent to saying that Fn of t minus F of t as n 41:54 goes to infinity converges almost surely to 0, right? 
41:59 This is the same thing. 42:01 Now I want this to happen for all t's at once. 42:07 So what I'm going to do-- oh, and this is actually 42:09 equivalent to this. 42:11 And so what I'm going to do is I'm going 42:12 to make it a little stronger. 42:14 So here the arrow only goes one way. 42:16 And this is where the sup for t in R of Fn of t. 42:20 42:26 And you could actually show that this happens also 42:28 almost surely. 42:29 42:35 Now maybe almost surely is a bit more 42:37 difficult to get a grasp on. 42:39 42:43 Does anybody want to see, like why this statement for this sup 42:48 is strictly stronger than the one that holds individually 42:51 for all t's? 42:52 You want to see that? 42:54 OK, so let's do that. 42:54 So forget about it almost surely for one second. 42:57 Let's just do it in probability. 42:59 The fact that Fn of t converges to F of t for all t, 43:09 in probability means that this goes to 0 as n goes 43:12 to infinity for any epsilon. 43:13 43:17 For any epsilon in t we know we have this. 43:19 That's the convergence in probability. 43:22 Now what I want is to put a sup here. 43:24 43:28 The probability that the sup is lower than epsilon, 43:32 might be actually always larger than, never go to 0 43:38 in some cases. 43:39 It could be the case that for each given t, 43:42 I can make n large enough so that this probability becomes 43:46 small. 43:47 But then maybe it's an n of t. 43:49 So this here means that for any-- 43:53 maybe I shouldn't put, let me put a delta here. 43:56 So for any epsilon, for any t and for any epsilon, 44:02 there exists n, which could depend on both epsilon 44:09 and t, such that the probability that Fn t 44:15 minus F of t exceeding delta is less than epsilon t. 44:25 There exists an n and a delta. 44:29 No, that's for all delta, sorry. 44:30 44:34 So this is true. 44:36 That's what this limit statement actually means. 44:40 But it could be the case that now when I take the sup over t, 44:43 maybe that n of t is something that looks like t. 44:47 44:50 Or maybe, well, integer part of t. 44:54 It could be, right? 44:56 I don't say anything. 44:57 It's just an n that depends on t. 44:59 So if this n is just t, maybe t over epsilon, 45:04 because I want epsilon. 45:05 Something like this. 45:07 Well that means that if I want this 45:09 to hold for all t's at once, I'm going 45:11 to have to go for the n that works for all t's at once. 45:15 But there's no such n that works for all t's at once. 45:19 The only n that works is infinity. 45:21 And so I cannot make this happen for all of them. 45:24 What Glivenko-Cantelli tells you, 45:26 it's actually this is not something that holds like this. 45:29 That the n that depends on t, there's actually one largest n 45:33 that works for all the t's at once, and that's it. 45:37 45:39 OK. 45:39 So just so you know why this is actually a stronger statement, 45:44 and that's basically how it works. 45:48 Any other question? 45:50 Yeah. 45:51 AUDIENCE: So what's the position for this 45:53 to have, because the random variable have 45:54 a finite mean, finite variance? 45:57 PROFESSOR: No. 45:58 Well the random variable does have finite mean 46:00 and finite variance, because the random variable 46:02 is an indicator. 46:03 So it has everything you want. 46:04 This is one of the nicest random variables, 46:06 this is a Bernoulli random variable. 46:08 So here when I say law of large number, that this holds. 46:11 Where did I write this? 46:12 I think I erased it. 46:14 Yeah, the one over there. 
46:15 This is actually the law of large numbers 46:16 for Bernoulli random variables. 46:17 They have everything you want. 46:18 They're bounded. 46:21 Yes. 46:21 AUDIENCE: So I'm having trouble understanding 46:23 the first statement. 46:25 So it says, for all epsilon and all t, 46:27 the probability of that-- 46:29 PROFESSOR: So you mean this one? 46:31 AUDIENCE: Yeah. 46:31 PROFESSOR: For all epsilon and all t. 46:34 So you fix them now. 46:36 Then the probability that, sorry, that was delta. 46:39 I changed this epsilon to delta at some point. 46:41 AUDIENCE: And then what's the second line? 46:44 PROFESSOR: Oh, so then the second line says that, 46:49 so I'm just rewriting in terms of epsilon delta 46:53 what this n goes to infinity means. 46:56 So it means that for any a t and delta, 47:01 so that's the same as this guy here, 47:04 then here I'm just going back to rewriting this. 47:06 It says that for any epsilon there exists an n large 47:08 enough such that, well, n larger than this thing 47:11 basically, such that this thing is less than epsilon. 47:14 47:18 So Glivenko-Cantelli tells us that not only is this thing 47:21 a good idea pointwise, but it's also a good idea uniformly. 47:25 And all it's saying is if you actually 47:27 were happy with just this result, you should 47:30 be even happier with that result. 47:32 And both of those results only tell you one thing. 47:34 They're just telling you that the empirical CDF 47:36 is a good estimator of the CDF. 47:38 47:41 Now since those indicators are Bernoulli distributions, 47:47 I can actually do even more. 47:50 So let me get this guy here. 47:52 48:00 OK so, those guys, Fn of t, this guy 48:14 is a Bernoulli distribution. 48:16 What is the parameter of this Bernoulli distribution? 48:20 What is the probability that it takes value 1? 48:22 48:26 AUDIENCE: F of t. 48:26 PROFESSOR: F of t, right? 48:28 It's just the probability that this thing happens, 48:30 which is F of t. 48:31 48:34 So in particular the variance of this guy 48:40 is the variance of this Bernoulli. 48:42 So it's F of t 1 minus F of t. 48:46 And I can use that in my Central Limit Theorem. 48:50 And Central Limit Theorem is just 48:51 going to tell me that if I look at the average 48:53 of random variables, I remove their mean, 48:56 so I look at square root of n Fn of t, 49:01 which I could really write as xn bar, right? 49:04 That's really just an xn bar. 49:06 Minus the expectation, which is F 49:08 of t, that comes from this guy. 49:11 Now if I divide by square root of the variance, that's 49:16 my square root p1 minus p. 49:18 Then this guy, by the Central Limit Theorem, 49:22 goes to some N 0, 1. 49:23 49:27 Which is the same thing as you see there, 49:28 except that the variance was put on the other side. 49:30 49:34 OK. 49:36 Do I have the same thing uniformly in t? 49:42 49:46 Can I write something that holds uniformly in t? 49:48 Well, if you think about it for one second 49:50 it's unlikely it's going to go too well. 49:53 In the sense that it's unlikely that the supremum 49:55 of those random variables over t is going to also be a Gaussian. 49:58 50:02 And the reason is that, well actually the reason 50:08 is that this thing is actually a stochastic process indexed 50:10 by t. 50:11 A stochastic process is just a sequence in random variables 50:14 that's indexed by, let's say time. 50:17 The one that's the most famous is Brownian motion, 50:20 and it's basically a bunch of Gaussian increments. 
50:24 So when you go from t to just t a little after that, 50:27 you have add some Gaussian into the thing. 50:30 And here it's basically the same thing that's happening. 50:33 And you would sort of expect, since each of this guy 50:35 is Gaussian, you would expect to see 50:37 something that looks like a Brownian motion at the end. 50:40 But it's not exactly a Brownian motion, 50:41 it's something that's called the Brownian bridge. 50:43 So if you've seen the Brownian motion, if I make 50:45 it start at 0 for example, so this is the value 50:49 of my Brownian motion. 50:50 Let's write it. 50:52 So this is one path, one realization of Brownian motion. 50:56 Let's call it w of t as t increases. 50:59 So let's say it starts at 0 and looks like something like this. 51:04 So that's what Brownian motion looks like. 51:06 It's just something that's pretty nasty. 51:11 I mean it looks pretty nasty, it's not continuous et cetera, 51:13 but it's actually very benign in some average way. 51:19 So Brownian motion is just something, 51:21 you should view this as if I sum some random variable that 51:25 are Gaussian, and then I look at this from farther and farther, 51:29 it's going to look like this. 51:31 And so here I cannot have a Brownian motion in the n, 51:34 because what is the variance of Fn of t minus F of t at t is 51:40 equal to 1? 51:40 51:43 Sorry, at t is equal to infinity. 51:47 AUDIENCE: 0. 51:48 PROFESSOR: It's 0, right? 51:49 The variance goes from 0 at t is negative infinity, 51:52 because at negative infinity F of t is going to 0. 51:56 And as t goes to plus infinity, F of t 51:59 is going to 1, which means that the variance of this guy as t 52:03 goes from negative infinity to plus infinity 52:06 is pinned to be 0 on each side. 52:09 And so my Brownian motion cannot, 52:12 when I describe a Brownian motion I'm just adding more 52:14 and more entropy to the thing and it's going all over 52:16 the place, but here what I want is that as I go back it should 52:20 go back to essentially 0. 52:21 It should be pinned down to a specific value at the n. 52:25 And that's actually called the Brownian bridge. 52:27 It's a Brownian motion that's conditioned 52:29 to come back to where it started essentially. 52:32 Now you don't need to understand Brownian bridges to understand 52:35 what I'm going to be telling you. 52:36 The only thing I want to communicate to you 52:39 is that this guy here, when I say a Brownian bridge, 52:42 I can go to any probabilist and they can tell you 52:45 all the probability properties of this stochastic process. 52:51 It can tell me the probability that it 52:52 takes any value at any point. 52:55 In particular, it can tell me-- 52:57 the supremum between 0 and 1 of this guy, 53:01 it could tell me what the cumulative distribution 53:03 function of this thing is, can tell me 53:04 what the density of this thing is, can tell me everything. 53:07 So it means that if I want to compute probabilities 53:09 on this object here, which is the maximum value that this guy 53:14 can take over a certain period of time, which is basically 53:17 this random variable. 53:18 So if I look at the value here, it's 53:20 a random variable that fluctuates. 53:22 It can tell me where it is with hyperability, can tell me 53:25 the quantiles of this thing, which is useful 53:28 because I can build a table and use it to compute my quantiles 53:31 and form tests from it. 53:34 So that's what actually is quite nice. 
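To make that picture concrete, here is a simulation sketch, purely illustrative, of the empirical process √n (F_n(t) − F(t)) for standard normal data: its pointwise standard deviation matches √(F(t)(1 − F(t))), the Bernoulli variance from a few minutes ago, so the fluctuations are pinned down near zero far out in the tails, which is exactly the Brownian bridge behavior being described.

```python
# Pointwise fluctuations of sqrt(n) (F_n(t) - F(t)) for N(0,1) data:
# the empirical standard deviation should match sqrt(F(t)(1 - F(t))),
# which vanishes as t goes to minus or plus infinity.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
n, reps = 200, 5_000
t_grid = np.array([-3.0, -1.0, 0.0, 1.0, 3.0])
F = norm.cdf(t_grid)

x = rng.standard_normal((reps, n))
F_n = (x[:, :, None] <= t_grid).mean(axis=1)   # F_n(t) for each replication, each t
process = np.sqrt(n) * (F_n - F)

print("empirical std dev :", np.round(process.std(axis=0), 3))
print("sqrt(F(t)(1-F(t))):", np.round(np.sqrt(F * (1 - F)), 3))
```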
53:36 It says that if I look at the square root of n Fn 53:38 hat minus sup over t, I get something 53:40 that looks like the sup of these Gaussians, 53:42 but it's not really sup of Gaussian, 53:44 it's sup of a Brownian motion. 53:46 Now there's something you should be very careful here. 53:48 I cheated a little bit. 53:49 I mean, I didn't cheat, I can do whatever I want. 53:51 But my notation might be a little confusing. 53:55 Everybody sees that this t here is not the same as this t here? 54:01 Can somebody see that? 54:03 Just because, first of all, this guy's between 0 and 1. 54:05 And this guy is in all of R. 54:09 What is this t here? 54:12 As a function of this t here? 54:14 54:21 This guy is F of this guy. 54:23 So really, if I want it to be completely transparent 54:27 and not save the keys of my keyboard, 54:32 I would read this as sup over t of Fn t minus F of t 54:42 goes to N distribution as n goes to infinity. 54:46 The supremum over t, again in R, so this guy is 54:50 for t in the entire real line, this guy 54:52 is for t in the entire real line. 54:54 But now I should write b of what? 54:58 F of t, exactly. 55:00 So really the t here is F of the original one. 55:04 And so that's a Brownian bridge, where 55:06 when t goes to infinity the Brownian bridge 55:09 goes from 0 to 1 and it looks like this. 55:11 A Brownian bridge at 0 is 0, at 1 it's 0. 55:16 And it does this. 55:18 But it doesn't stray too far because I condition 55:20 it to come back to this point. 55:22 That's what a Brownian bridge is. 55:26 OK. 55:28 So in particular, I can find a distribution for this guy. 55:33 And I can use this to build a test which is called 55:35 the Kolmogorov-Smirnov test. 55:37 55:39 The idea is the following. 55:40 It says, if I want to test some distribution 55:44 F0, some distribution that has a particular CDF F0, 55:49 and I plug it in under the null, then 55:52 this guy should have pretty much the same distribution 55:55 as the supremum of Brownian bridge. 55:58 And so if I see this to be much larger than it should 56:00 be when it's the supremum of a Brownian bridge, 56:02 I'm actually going to reject my hypothesis. 56:05 56:08 So here's the test. 56:09 I want to test whether H0, F is equal to F0, 56:17 and you will see that most of the goodness of fit tests 56:22 are formulated mathematically in terms 56:24 of the cumulative distribution function. 56:26 I could formulate them in terms of personality density 56:29 function, or just write x follows N 0, 1, 56:33 but that's the way we write it. 56:34 We formulate them in terms of cumulative distribution 56:37 function because that's what we have 56:39 a handle on through the empirical cumulative 56:42 distribution function. 56:44 And then it's versus H1, F is not equal to F0. 56:50 So now I have my empirical CDF. 56:52 And I hope that for all t's, Fn of t 56:54 should be close to F0 of t. 56:57 Let me write it like this. 57:00 I put it on the exponent because otherwise that 57:03 would be the empirical distribution function based 57:06 on zero observations. 57:07 57:11 Now I form the following test statistic. 57:14 57:21 So my test statistic is tn, which 57:24 is the supremum over t in the real line of square root 57:28 of n Fn of t minus F of t, sorry, F0 of t. 57:34 So I can compute everything. 57:35 I know this from the data, and this 57:37 is the one that comes from my null hypothesis. 57:39 As I can compute this thing. 57:41 And I know that if this is true, this 57:43 should actually be the supremum of a Brownian bridge. 
57:46 Pretty much. 57:48 And so the Kolmogorov-Smirnov test is simply, 58:01 reject if this guy, tn, in absolute value, 58:09 no actually not in absolute value. 58:10 This is just already absolute valued. 58:13 Then this guy should be what? 58:14 It should be larger than the q alpha over 2 distribution 58:20 that I have. 58:21 But now rather than putting N 0, 1, or Tn, 58:24 this is here whatever notation I have for supremum 58:30 of Brownian bridge. 58:31 58:40 Just like I did for any pivotal distribution. 58:43 That was the same recipe every single time. 58:45 I formed the test statistic such that 58:47 the asymptotic distribution did not depend on anything I know, 58:51 and then I would just reject when this pivotal distribution 58:54 was larger than something. 58:56 Yes? 58:56 AUDIENCE: I'm not really sure why Brownian bridge appears. 58:59 59:02 PROFESSOR: Do you know what a Brownian bridge is, or? 59:05 AUDIENCE: Only vaguely. 59:06 PROFESSOR: OK. 59:07 So this thing here, think of it as being a Gaussian. 59:14 So for all t you have a Gaussian distribution. 59:18 Now a Brownian motion, so if I had a Brownian motion 59:27 I need to tell you what the-- 59:28 so it's basically a Brownian motion 59:30 is something that looks like this. 59:31 It's some random variable that's indexed by t. 59:34 I want, say, the expectation of Xt could be equal to 0 59:38 for all t. 59:40 And what I want is that the increments 59:42 have a certain distribution. 59:44 So what I want is that the expectation of Xt minus Xs 59:53 follows some distribution which is N 0, t minus s. 59:58 So the increments are bigger as I go farther, 60:00 in terms of variability. 60:02 And I also want some covariance structure between the two. 60:05 So what I want is that the covariance between Xs and Xt 60:10 is actually equal to the minimum of s and t. 60:12 60:18 Yeah, maybe. 60:21 Yeah, that should be there. 60:23 So this is, you open a probability book, that's 60:26 what it's going to look like. 60:27 So in particular, you can see, if I put 0 here 60:31 and X0 is equal to 0, it has 0 variance. 60:34 So in particular, it means that Xt, 60:38 if I look only at the t-th one, it 60:39 has some normal distribution with variance t. 60:43 So this is something that just blows up. 60:46 So this guy here looks like it's going 60:49 to be a Brownian motion because when 60:50 I look at the left-hand side it has a normal distribution. 60:53 Now there's a bunch of other things you need to check. 60:55 It's the fact that you have this covariance, for example, 60:58 which I did not tell you. 61:00 But it sure look somewhat like that. 61:03 And in particular, when I look at the normal with mean 0 61:07 and variance here, then it's clear 61:10 that this guy does not have a variance that's 61:12 going to go to infinity just like the variance of this guy. 61:16 We know that the variance is forced to be back to 0. 61:21 And so in particular we have something 61:23 that has mean 0 always, whose variance has to be 0 at 0, 61:28 and variance-- sorry, at t equals negative infinity, 61:31 and variance 1 at t equals plus infinity. 61:34 So a variance 0 at t equals plus infinity, 61:36 and so I have to basically force it to be equal to 0 at each n. 61:40 So the Brownian motion here tends 61:42 to just go to infinity somewhere, 61:44 whereas this guy forces it to come back. 
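Here is a quick simulation sketch contrasting the two processes. The bridge is built as B(t) = W(t) − t·W(1), one standard construction of a Brownian motion conditioned to return to 0 at time 1; that particular construction is mine, not something stated in the lecture.

```python
# Standard Brownian motion W, whose variance t keeps growing, versus a
# Brownian bridge B(t) = W(t) - t*W(1), whose variance t(1 - t) is forced
# back to 0 at both endpoints.
import numpy as np

rng = np.random.default_rng(0)
reps, steps = 20_000, 100
dt = 1.0 / steps
t = np.linspace(dt, 1.0, steps)

# W(t): cumulative sums of independent N(0, dt) increments
W = np.cumsum(rng.normal(0.0, np.sqrt(dt), size=(reps, steps)), axis=1)
B = W - t * W[:, -1][:, None]        # pin the path back to 0 at t = 1

for u in (0.1, 0.5, 0.9):
    i = int(u * steps) - 1
    print(f"t={u}:  Var W ~ {W[:, i].var():.3f} (exact {u:.2f}),"
          f"  Var B ~ {B[:, i].var():.3f} (exact {u*(1-u):.2f})")
```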
61:47 Now everything I described to you 61:48 is on the scale negative infinity to plus infinity, 61:52 but since everything depends on F of t, 61:56 I can actually just put that back 61:58 into a scale, which is 0 and 1 by a simple change of variable. 62:02 It's called change of time for the Brownian motion. 62:06 OK? 62:07 Yeah. 62:08 AUDIENCE: So does a Brownian bridge 62:09 have a variance at each point that's proportional? 62:13 Like it starts at 0 variance and then 62:15 goes to 1/4 variance in the middle 62:17 and then goes back to 0 variance? 62:21 Like in the same parabolic shape? 62:23 PROFESSOR: Yeah. 62:24 I mean, definitely. 62:26 I mean by symmetry you can probably infer all the things. 62:29 AUDIENCE: Well I can imagine Brownian bridge 62:31 with a variance that starts at 0 and stays, like, 62:34 the shape of the variance as you move along. 62:38 PROFESSOR: Yeah, so I don't know if-- there 62:40 is an explicit formula for this, and it's simple. 62:43 That's what I can tell you, but I don't know what the explicit, 62:45 off the top of my head what the explicit formula is. 62:47 AUDIENCE: But would it have to match this F 62:49 of t 1 minus F of t structure? 62:53 Or not? 62:53 PROFESSOR: Yeah. 62:54 62:56 AUDIENCE: Or does the fact that we're taking the supremum-- 62:58 PROFESSOR: No. 62:59 Well the Brownian bridge, this is the supremum-- you're right. 63:03 So this will be this form for the variance for sure, 63:06 because this is only marginal distributions that 63:08 don't take-- right, the process is not just 63:10 what is the distribution at each instant t. 63:13 It's also how do those distributions interact 63:15 with each other in terms of covariance. 63:17 For the marginal distributions at each instance t, 63:19 you're right, the variance is F of t 1 minus F of t. 63:22 We're not going to escape that. 63:25 But then the covariance structure between those guys 63:27 is a little more complicated. 63:29 But yes, you're right. 63:30 For marginal that's enough. 63:32 Yeah? 63:32 AUDIENCE: So the supremum of the Brownian bridge 63:34 is a number between 0 and 10, let's just say. 63:38 PROFESSOR: Yeah, it could be infinity. 63:40 AUDIENCE: So it's not symmetrical with respect to 0, 63:43 so why are we doing all over 2? 63:45 63:56 PROFESSOR: OK. 63:57 Did say raise it? 63:58 Yeah. 63:59 Because here I didn't say the supremum of the absolute value 64:01 of a Brownian bridge, I just said the supremum 64:03 of a Brownian bridge. 64:04 But you're right, let's just do this like that. 64:08 And then it's probably cleaner. 64:11 64:14 So yeah, actually well it should be q alpha. 64:17 So this is basically, you're right. 64:19 So think of it as being one-sided. 64:22 And there's actually no symmetry for the supremum. 64:25 I mean the supremum is not symmetric around 0, 64:29 so you're right. 64:29 I should not use alpha over 2, thank you. 64:33 Any other question? 64:35 This should be alpha. 64:36 Yeah. 64:37 I mean those slides were written with 1 minus alpha 64:39 and I have not replaced all instances of 1 minus alpha 64:42 by alpha. 64:43 I mean, except this guy, tilde. 64:45 Well, depends on how you want to call it. 64:47 But this is still, the probability that Z exceeds 64:50 this guy should be alpha. 64:53 OK? 64:54 And this can be found in tables. 64:55 And we can compute the p-value just like we did before. 
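A simulation sketch, again my own illustration, of where such tables and p-values come from: draw samples under H0, compute T_n = sup_t √n |F_n(t) − F0(t)| for each, and read off the 95% quantile. For large n this approaches the corresponding quantile of the supremum of the absolute Brownian bridge, which scipy exposes as scipy.stats.kstwobign; evaluating the supremum at the jump points anticipates the computational trick discussed just below.

```python
# Null distribution of T_n = sup_t sqrt(n) |F_n(t) - F0(t)| by simulation,
# for F0 the standard normal CDF, compared with the asymptotic quantile of
# the supremum of the absolute Brownian bridge (scipy.stats.kstwobign).
import numpy as np
from scipy.stats import norm, kstwobign

rng = np.random.default_rng(0)

def ks_null_quantile(n, reps=10_000, alpha=0.05):
    x = np.sort(rng.standard_normal((reps, n)), axis=1)
    F0 = norm.cdf(x)
    i = np.arange(1, n + 1)
    # the sup is attained just before or at one of the jumps of F_n
    D = np.maximum(i / n - F0, F0 - (i - 1) / n).max(axis=1)
    return np.quantile(np.sqrt(n) * D, 1 - alpha)

print("n = 9   :", ks_null_quantile(9))      # compare with the table row for n = 9
print("n = 500 :", ks_null_quantile(500))
print("limit   :", kstwobign.ppf(0.95))      # ~1.358, sup of |Brownian bridge|
```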
65:00 But we have to simulate it because it's not 65:02 going to depend on the cumulative distribution 65:04 function of a Gaussian, like it did for the usual Gaussian 65:06 test. 65:07 That's something that's more complicated, 65:09 and typically you don't even try. 65:11 You get the statistical software to do it for you. 65:14 So just let me skip a few lines. 65:17 This is what the table looks like for the Kolmogorov-Smirnov 65:20 test. 65:21 So it just tells you, what is your number of observations, n. 65:25 Then you want alpha to be equal to 5%, say. 65:28 Let's say you have nine observations. 65:30 So if the supremum over t of the absolute value of Fn of t minus F of t 65:34 exceeds this number, you reject. 65:36 65:46 Well, what's pretty clear about this test 65:47 is that it looks very nice, and I tell 65:49 you this is how you build it. 65:50 But if you think about it for one second, 65:52 it's actually really an annoying thing 65:54 to build because you have to take the supremum over t. 65:57 This depends on computing a supremum, which in practice 66:01 might be super cumbersome. 66:03 I don't want to have to compute this for all values t 66:05 and then take the maximum of those guys. 66:07 It turns out, quite nicely, that we 66:09 don't actually have to do this. 66:11 What does the empirical distribution function 66:14 look like? 66:15 Well, this thing, remember, Fn of t by definition was-- 66:23 so let me go to the slide that's relevant. 66:25 So Fn of t looks like this. 66:27 66:38 So what it means is that when t is between two observations, 66:41 this guy keeps the same value. 66:44 So let me put my observations on the real line here. 66:48 So let's say I have one observation here, 66:49 one observation here, one observation here, 66:51 one observation here, and one observation here, 66:53 for simplicity. 66:55 Then this guy is basically, up to this normalization, 66:57 counting how many observations I have that are less than or equal to t. 67:01 So since I normalize by n, I know that the smallest value 67:05 here is going to be 0, and the largest value here 67:10 is going to be 1. 67:13 So let's say this looks like this. 67:14 This is the value 1. 67:18 At each observation, since I take it less than or equal to, 67:21 when I'm at Xi, I'm actually counting it. 67:24 So the jump happens at Xi. 67:26 So that's the first observation, and then I jump. 67:29 By how much do I jump? 67:30 67:33 Yeah? 67:35 One over n, right? 67:38 And then this value belongs to the piece on the right. 67:41 And then I do it again. 67:42 67:50 I know it's not going to work out for me, but we'll see. 67:54 Oh no actually, I did pretty well. 67:55 68:00 This is what my cumulative distribution function looks like. 68:04 Now if you look on this slide, there 68:05 is this weird notation where I start putting 68:07 my indices in parentheses. 68:10 X parenthesis 1, X parenthesis 2, et cetera. 68:13 Those are called the order statistics. 68:15 It's just because, when my data is given 68:18 to me, I just call the first observation 68:20 the one that's on top of the table, 68:21 but it doesn't have to be the smallest value. 68:24 So it might be that this is X1 and that this is X2, 68:28 and then this is X3, X4, and X5. 68:31 These might be my observations. 68:33 So what I do is I relabel them in such a way 68:35 that I call this guy X parenthesis 1, 68:38 even though it's really just X3. 68:40 This is X parenthesis 2, then 3, 4, and 5.
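Here is a minimal sketch of the empirical CDF just described, in Python; the sample values and variable names are made up purely for illustration.

import numpy as np

def empirical_cdf(x, t):
    """Value of F_n at the point t: the fraction of observations less than or equal to t."""
    x = np.asarray(x)
    return np.mean(x <= t)

# five observations, as in the picture on the board (values invented for the example)
x = np.array([0.3, -1.2, 0.7, 2.1, -0.4])

# F_n is a step function: it is 0 to the left of the smallest observation,
# 1 to the right of the largest one, and it jumps by 1/n at each observation.
for t in [-2.0, -0.4, 0.0, 0.7, 3.0]:
    print(t, empirical_cdf(x, t))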
68:46 These are my reordered observations, 68:48 relabeled in such a way that the smallest one is indexed by 1 68:52 and the largest one is indexed by n. 68:54 68:58 So now this is actually quite nice, 69:01 because what I'm trying to do is to find the largest 69:04 deviation from this guy to the true cumulative distribution 69:07 function. 69:07 The true cumulative distribution function, 69:09 let's say it's Gaussian, looks like this. 69:11 69:15 It's something continuous; for a symmetric distribution 69:19 it crosses this axis at 1/2, and that's what it looks like. 69:22 And the Kolmogorov-Smirnov test is just 69:25 telling me, how far do those two curves get 69:31 in the worst possible case? 69:35 So in particular here, where are they the farthest? 69:37 Clearly it's this point. 69:40 And so up to rescaling, this is the value 69:42 I'm going to be interested in. 69:44 That's how far they get from each other. 69:49 Here, something just happened, right? 69:52 The farthest distance that I got was exactly 69:54 at one of those dots. 69:55 69:58 It turns out it is enough to look at those dots. 70:01 And the reason is, well, because after this dot 70:04 and until the next jump, this guy does not change, 70:08 but this guy increases. 70:11 And so the only point where they can be the farthest apart 70:15 is either to the left of a jump or to the right of a jump. 70:19 That's the only place where they can be far from each other. 70:22 And that means it has to be at one of the observations. 70:24 Everybody sees that? 70:26 The points at which those two curves are 70:29 the farthest from each other have 70:31 to be at one of the observations. 70:34 And so rather than looking at a sup over all possible t's, 70:37 really all I need to do is to look at a maximum 70:40 only at my observations. 70:43 70:46 I just need to check at each of those points 70:48 whether they're far. 70:51 Now here, notice that 70:53 this is not written Fn of Xi. 70:57 The reason is because I actually know what Fn of Xi is. 71:01 Fn of the i-th ordered observation is just 71:05 the number of jumps I've had up to this observation, divided by n. 71:08 So here, I know that the value of Fn is 1 over n, 71:11 here it's 2 over n, 3 over n, 4 over n, 5 over n. 71:15 So I know that the values of Fn at my observations, 71:19 and those are actually the only values that Fn can take, 71:22 are an integer divided by n. 71:25 And that's why you see i minus 1 over n, or i over n. 71:29 This is the difference just before the jump, 71:32 and this is the difference at the jump. 71:34 71:38 So here the key message is that this is no longer 71:42 a supremum over all t's, but just 71:44 a maximum over i from 1 to n. 71:46 So I really have only 2n values to compute. 71:49 This value and this value for each observation, that's 2n 71:51 total. 71:52 I look at the maximum and that's the value. 71:55 And it's actually equal to Tn. 71:58 It's not an approximation. 71:59 Those things are equal. 72:00 Those are just the only places where 72:02 those guys can attain the maximum. 72:03 72:09 Yes? 72:10 AUDIENCE: It seems like since the null hypothesis [INAUDIBLE] 72:15 the entire distribution of theta, 72:17 this is like strictly more powerful than just 72:19 doing it [INAUDIBLE]. 72:23 PROFESSOR: It's strictly less powerful. 72:24 AUDIENCE: Strictly less powerful. 72:27 But is there, is that like a big trade-off 72:30 that we're making when we do that?
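The reduction just described translates directly into code. A small sketch, assuming for illustration that F0 is the standard normal CDF from scipy; the data and the sample size are arbitrary.

import numpy as np
from scipy.stats import norm

def ks_statistic(x, cdf):
    """sup_t |F_n(t) - F_0(t)|, computed only at the order statistics.

    At the i-th ordered observation the two candidate gaps are
    |i/n - F_0(x_(i))| (just after the jump) and |(i-1)/n - F_0(x_(i))| (just before it).
    """
    x = np.sort(np.asarray(x))      # order statistics x_(1) <= ... <= x_(n)
    n = len(x)
    f0 = cdf(x)                     # F_0 evaluated at the order statistics
    i = np.arange(1, n + 1)
    return np.max(np.maximum(np.abs(i / n - f0), np.abs((i - 1) / n - f0)))

rng = np.random.default_rng(0)
x = rng.normal(size=20)
d_n = ks_statistic(x, norm.cdf)            # H_0: the data are N(0, 1)
print(d_n, np.sqrt(len(x)) * d_n)          # unscaled and sqrt(n)-scaled versions

The maximum over these 2n numbers is exactly the supremum over all t; no grid over t is ever needed.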
72:32 Obviously we're not certain in the first place 72:33 that we want to assume normality. 72:35 Does it make sense to [INAUDIBLE], 72:37 the Gaussian [INAUDIBLE]. 72:39 72:48 PROFESSOR: So, sorry, I'm not sure what 72:50 question you're asking. 72:51 AUDIENCE: So when we're doing a normal test, 72:53 we're just asking questions about the mus, 72:55 the means of our distribution. 72:57 [INAUDIBLE] This one, it seems like it 73:00 would be both at the same time. 73:02 [INAUDIBLE] Is this decreasing power [INAUDIBLE]? 73:11 PROFESSOR: So remember, here in this test 73:13 we want to conclude to H0; in the other test we typically 73:16 want to conclude to H1. 73:17 So here we actually don't want power, in a way. 73:21 And you have to also keep in mind that doing a test on the mean 73:24 is probably not the only thing you're 73:26 going to end up doing on your data 73:27 after you actually establish that it's normally distributed. 73:31 Then you have the dataset, you've sort of 73:33 established it's normally distributed, 73:34 and then you can just run the whole arsenal of statistical tools. 73:38 And we're going to see regression 73:39 and all sorts of predictive things, which are not just 73:42 tests of whether the mean is equal to something. 73:44 Maybe you want to build a confidence interval 73:45 for the mean. 73:46 A confidence interval is not a test. 73:50 So you're going to have to first test if it's normal, 73:52 and then see if you can actually use 73:53 the quantiles of a Gaussian distribution or a t 73:55 distribution to build this confidence interval. 73:59 So in a way you should see this as, like, the flat fee 74:03 to enter the Gaussian world, and then you 74:05 can do whatever you want to do in the Gaussian world. 74:09 We'll see actually that your question goes back 74:11 to something that's a little important, which is that here 74:14 I said F0 is fully specified. 74:17 It's like an N(1, 5). 74:21 But I didn't say, is it normally distributed, 74:24 which is the question that everybody asks. 74:26 You're not asking, is it this particular normal distribution 74:29 with this particular mean and this particular variance. 74:31 So how would you do it in practice? 74:32 Well you would say, I'm just going 74:34 to replace the mean by the empirical mean and the variance 74:36 by the empirical variance. 74:38 But by doing that you're making a huge mistake, because you 74:41 are sort of depriving your test of the possibility 74:45 to reject the Gaussian hypothesis just 74:46 based on the fact that the mean is wrong or the variance 74:49 is wrong. 74:49 You've already stuck pretty close to your data. 74:52 And so you're sort of already 74:55 tilting the game in favor of H0 big time. 74:59 So there's actually a way to adjust for this. 75:01 75:03 OK, so this is about the pivotal statistic. 75:05 We've used this word many times. 75:06 75:09 And so that's how it goes. 75:12 I'm not going to go into this in detail. 75:13 It's really a recipe for how you would actually 75:16 build the table that I showed you, this table. 75:20 This is basically the recipe on how to build it. 75:23 There's another recipe to build it, which is just 75:25 to open a book at this page. 75:27 That's a little faster. 75:29 Or use software. 75:32 I just wanted to show you. 75:34 So let's just keep in mind-- anybody have a good memory? 75:36 Let's just keep in mind this number. 75:38 This is the threshold for the Kolmogorov-Smirnov statistic. 75:44 If I have 10 observations and I want to do it at 5%, 75:47 it's about 41%.
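The recipe for building that table can be carried out with a short simulation: under H0 with a continuous F0, the values F0(Xi) are uniform on [0, 1], so the null distribution of the statistic can be simulated once and for all from uniforms. A sketch, with the sample size and number of replications chosen arbitrarily:

import numpy as np

def ks_stat_uniform(u):
    """sup_t |F_n(t) - t| for a uniform sample, computed via the order statistics."""
    u = np.sort(u)
    n = len(u)
    i = np.arange(1, n + 1)
    return np.max(np.maximum(np.abs(i / n - u), np.abs((i - 1) / n - u)))

rng = np.random.default_rng(0)
n, reps = 10, 100_000
sims = np.array([ks_stat_uniform(rng.uniform(size=n)) for _ in range(reps)])
print(np.quantile(sims, 0.95))    # should land close to the 0.41 threshold quoted here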
75:50 So that's the number that it should be larger than. 75:52 So it turns out that if you want to test if it's normal, and not 75:56 just a specific normal, this number 75:59 is going to be different. 76:00 Do you think the number I'm going 76:01 to read in a table that's appropriate for this is 76:03 going to be larger or smaller? 76:05 Who says larger? 76:07 AUDIENCE: Sorry, what was the question? 76:09 PROFESSOR: So the question is, this 76:10 is the number I should use if my test was, is X, say, N(0, 5). 76:20 Right? 76:20 That's a specific distribution with a specific F0. 76:25 So that's the number. I would build 76:27 the Kolmogorov-Smirnov statistic from this F0, 76:29 I would perform the test and check if my Kolmogorov-Smirnov 76:32 statistic Tn is larger than this number or not. 76:34 If it's larger I'm going to reject. 76:36 Now I say, actually, I don't want to test if H0 is N(0, 5), 76:40 but just whether it's N(mu, sigma squared) for some mu and sigma squared. 76:47 And in particular I'm just going to plug in mu hat and sigma 76:50 hat into my F0, run the same statistic, 76:52 but compare it to a different number. 76:56 So the larger the number, the more or less 77:00 likely am I to reject? 77:03 The less likely I am to reject, right? 77:05 So if I just use that number, let's say 77:09 this is a large number, I would be more 77:12 tempted to say it's Gaussian. 77:14 And if you look at the table, you would 77:15 see that if you make the appropriate correction, 77:18 with the same number of observations, 10, 77:21 and the same level, you get 25% as opposed to 41%. 77:26 (A small simulation sketch of this gap appears at the end of this passage.) That means that you're actually much more likely, if you 77:28 use the appropriate test, to reject the hypothesis that it's 77:32 normal, which is bad news, because that means 77:34 you don't have access to the Gaussian arsenal, 77:36 and nobody wants to do this. 77:38 So this is actually a mistake that people make a lot. 77:40 They use the Kolmogorov-Smirnov test 77:42 to test for normality without adjusting for the fact 77:45 that they've plugged in the estimated mean 77:48 and the estimated variance. 77:50 This leads to rejecting less often, right? 77:53 I mean this is almost half of the number that we had. 77:58 And then they can be happy and walk home 78:00 and say, well, I did the test and it was normal. 78:03 So this is actually a mistake that I 78:04 believe genuinely at least a quarter of people 78:07 make on purpose. 78:09 They just say, well, I want it to be Gaussian so I'm just 78:11 going to make my life easier. 78:13 So this is the so-called Kolmogorov-Lilliefors test. 78:17 We'll talk about it, well, not today for sure. 78:20 There are other statistics that you can use. 78:24 And the idea is to say, well, we want 78:26 to know if the empirical distribution 78:28 function, the empirical CDF, is close to the true CDF. 78:31 The way we did it is by forming the difference 78:33 and looking at the worst possible distance they can be apart. 78:36 That's called a sup norm, or L infinity norm, 78:39 in functional analysis. 78:42 So here, this is what it looked like. 78:44 The distance between Fn and F that we measured was just 78:46 the supremum distance over all t's. 78:48 That's one way to measure distance between two functions. 78:51 But there are infinitely many ways 78:53 to measure the distance between functions. 78:54 One is something we're much more familiar with, 78:56 which is the squared L2-norm. 78:59 This is nice because it comes with an inner product; 79:02 it has some nice properties.
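Going back for a moment to the plug-in issue: the same simulation recipe, with the mean and variance re-estimated from each simulated sample, shows how much the threshold shrinks. This is only a sketch, and the exact critical value depends on the conventions of the table, but for n = 10 at 5% it should come out in the vicinity of the 0.25 quoted above rather than 0.41.

import numpy as np
from scipy.stats import norm

def ks_stat_plugin(x):
    """sup_t |F_n(t) - Phi((t - xbar)/s)|, with mean and standard deviation estimated from the data."""
    x = np.sort(x)
    n = len(x)
    f0 = norm.cdf(x, loc=x.mean(), scale=x.std(ddof=1))
    i = np.arange(1, n + 1)
    return np.max(np.maximum(np.abs(i / n - f0), np.abs((i - 1) / n - f0)))

rng = np.random.default_rng(0)
n, reps = 10, 100_000
sims = np.array([ks_stat_plugin(rng.normal(size=n)) for _ in range(reps)])
print(np.quantile(sims, 0.95))    # noticeably smaller than 0.41, around the 0.25-0.26 range

Comparing the plug-in statistic to the 0.41 threshold instead of this smaller one is exactly the mistake described above: it makes rejecting normality much harder than it should be.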
79:04 And rather than taking the sup, 79:06 you could just integrate the squared distance. 79:10 And this is what leads to the Cramér-von Mises test. 79:14 And then there's another one that 79:15 says, well, maybe I don't want to integrate without weights. 79:18 Maybe I want to put weights that account for the variance. 79:22 And this guy is called Anderson-Darling. 79:24 For each of these tests you can check 79:26 that the asymptotic distribution is going to be pivotal, 79:29 which means that there will be a table at the back of some book 79:32 that tells you what the quantiles 79:37 of square root of n times this guy 79:38 are, asymptotically, basically. 79:40 Yeah? 79:41 AUDIENCE: For the Kolmogorov-Smirnov test, 79:44 for the table that shows the value it has, 79:48 it has the value for different n. 79:51 But I thought we [INAUDIBLE]-- 79:53 PROFESSOR: Yeah. 79:54 So that's just to show you that asymptotically it's pivotal, 79:56 and I can point you to one specific thing. 79:59 But it turns out that this thing is actually pivotal for each n as well. 80:02 And that's why you have this recipe to construct the entire 80:05 table: the asymptotic distribution is not accurate for all possible n's, 80:08 and there's also the n that shows up here. 80:10 So actually, this is something 80:13 you should have in mind. 80:14 80:18 For any particular n, this distribution 80:20 will not depend on F0. 80:24 It's just not going to be a Brownian bridge 80:25 but a finite-sample approximation of a Brownian 80:28 bridge, and you can simulate that by just drawing samples 80:31 from it, building a histogram, and constructing 80:33 the quantiles for this guy. 80:35 AUDIENCE: No one has actually developed 80:36 a table for the Brownian-- 80:38 PROFESSOR: Oh, there is one. 80:39 That's the table, maybe. 80:42 Let's see if we see it at the bottom of the other table. 80:46 Yeah. 80:47 See? 80:47 Over 40, over 30. 80:48 So this is not the Kolmogorov-Smirnov, 80:50 but that's the Kolmogorov-Lilliefors. 80:52 Those numbers that you see here, they 80:54 are the numbers for the asymptotic thing, which is 80:57 some sort of Brownian bridge. 80:59 Yeah? 81:00 AUDIENCE: Two questions. 81:01 If I want to build the Kolmogorov-Smirnov test, 81:03 it says that F0 is required to be continuous. 81:08 PROFESSOR: Yeah. 81:10 AUDIENCE: [INAUDIBLE] If we have, like, probability 81:13 mass at a particular value. 81:15 Like some sort of data. 81:18 PROFESSOR: So then you won't have this nice picture, right? 81:20 This can happen at any point because you're 81:22 going to have discontinuities in F 81:24 and those things can happen anywhere. 81:26 And then-- 81:27 AUDIENCE: Would the supremum still work? 81:29 PROFESSOR: You mean the Brownian bridge? 81:30 AUDIENCE: Yeah. 81:32 The Kolmogorov test doesn't say that you 81:35 have to be able to easily calculate the supremum. 81:37 PROFESSOR: No, no, no, but you still need it. 81:39 You still need it for-- 81:40 so there are some finite-sample versions of it 81:42 that you can use that are slightly more conservative, 81:45 which is in a way good news because you're 81:47 going to conclude more often to H0. 81:50 And there is one, I forget the name, 81:52 it's Kiefer-Wolfowitz, no, the Dvoretzky-Kiefer-Wolfowitz 81:57 inequality, which is basically like Hoeffding's inequality.
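For reference, the usual written forms of the two statistics and of the inequality just named; the notation here is not taken from the slides.

\[
\text{Cram\'er-von Mises:} \quad n \int \bigl(F_n(t) - F_0(t)\bigr)^2 \, dF_0(t),
\qquad
\text{Anderson-Darling:} \quad n \int \frac{\bigl(F_n(t) - F_0(t)\bigr)^2}{F_0(t)\bigl(1 - F_0(t)\bigr)} \, dF_0(t),
\]
\[
\text{Dvoretzky-Kiefer-Wolfowitz:} \quad
\mathbb{P}\Bigl(\sup_t \bigl|F_n(t) - F(t)\bigr| > \varepsilon\Bigr) \le 2\, e^{-2 n \varepsilon^2}
\quad \text{for every } n \text{ and every } \varepsilon > 0 .
\]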
81:59 So it basically tells you, up to bad constants, 82:01 the same result as the Brownian bridge result, 82:04 and it is true for every n. 82:06 But for the exact asymptotic distribution, 82:08 you need continuity. 82:11 Yes. 82:12 AUDIENCE: So just a clarification. 82:13 So when we are testing with Kolmogorov, 82:15 we shouldn't test a particular mu and sigma squared? 82:19 PROFESSOR: Well, if you know what they are you can use 82:22 Kolmogorov-Smirnov, but if you don't know what they are 82:25 you're going to plug in-- 82:26 as soon as you estimate 82:27 the mean and the variance from the data, 82:29 you should use the one we'll see next time, which is 82:31 called Kolmogorov-Lilliefors. 82:33 You don't have to think about it too much. 82:34 We'll talk about it on Thursday. 82:38 Any other questions? 82:39 So we're out of time. 82:40 So I think we should stop here, and we'll resume on Thursday. 82:45