Transcript: https://www.youtube.com/watch?v=C_W1adH-NVE&list=PLUl4u3cNGP60uVBMaoNERc6knT_MgPKS0&index=2

The following content is provided under a Creative Commons license. Your support will help MIT OpenCourseWare continue to offer high quality educational resources for free. To make a donation or to view additional materials from hundreds of MIT courses, visit MIT OpenCourseWare at ocw.mit.edu.

PHILIPPE RIGOLLET: --of our limiting distribution, which happened to be Gaussian. But if the central limit theorem told us that the limiting distribution of some average was something that looked like a Poisson or an [? exponential, ?] then we would just have, in the same way, taken the quantiles of the exponential distribution.

So let's go back to what we had. Generically, you have a set of observations X1 to Xn. Remember, for the kiss example they were denoted by R1 to Rn, because they indicated turning the head to the right, but let's just go back to X1 to Xn. In this case I'm going to assume they're IID, and I'm going to make them Bernoulli with [INAUDIBLE] p, where p is unknown.

So what did we do from here? Well, we said p is the expectation of Xi, and actually we didn't even think about it too much. We said, well, if I need to estimate the proportion of people who turn their head to the right when they kiss, I'm basically going to compute the average. So our p hat was just Xn bar, which was 1 over n times the sum from i equals 1 to n of the Xi. The average of the observations was our estimate. And then we wanted to build some confidence intervals around this. So what we wanted to understand is how much this p hat fluctuates. This is a random variable. It's an average of random variables.
It's a random variable, so we want to know what its distribution is. And if we know what the distribution is, then we actually know where it fluctuates: what the expectation is, around which value it tends to fluctuate, et cetera. And what the central limit theorem told us was: if I take square root of n times Xn bar minus p, which is its mean, and then divide by the standard deviation, then this thing converges as n goes to infinity -- and we will say a little bit more about what "in distribution" means -- to some standard normal random variable. So that was the central limit theorem.

So what it means is that when I think of this as a random variable, when n is large enough it's going to look like a standard Gaussian. And so I understand perfectly its fluctuations. I know the probability of being in any given zone. I know that its mean is 0. I know a bunch of things. And then, in particular, what I was interested in was the probability that the absolute value of a Gaussian random variable exceeds q alpha over 2. We said that this was equal to what? Anybody? What was that?

AUDIENCE: [INAUDIBLE]

PHILIPPE RIGOLLET: Alpha, right? So that's the probability. That's my random variable. By definition, q alpha over 2 is the number such that the area to the right of it is alpha over 2, and the point to its left, by symmetry, is negative q alpha over 2. And so the probability that it exceeds this value q alpha over 2 in absolute value is just the sum of the two gray areas. All right?
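As a sanity check on this use of the CLT, here is a minimal simulation sketch (Python standard library only; the values p = 0.3, n = 200, alpha = 0.05, and 20,000 repetitions are illustrative choices, not numbers from the lecture): the standardized average should exceed q alpha over 2 in absolute value with probability close to alpha.

```python
import math
import random
from statistics import NormalDist

def standardized_mean(p, n, rng):
    """sqrt(n) * (Xn bar - p) / sqrt(p(1-p)) for n iid Bernoulli(p) draws."""
    xbar = sum(rng.random() < p for _ in range(n)) / n
    return math.sqrt(n) * (xbar - p) / math.sqrt(p * (1 - p))

def tail_frequency(p=0.3, n=200, alpha=0.05, reps=20000, seed=0):
    """Empirical P(|pivot| > q_{alpha/2}); the CLT says this is close to alpha."""
    rng = random.Random(seed)
    q = NormalDist().inv_cdf(1 - alpha / 2)  # area alpha/2 to the right of q
    return sum(abs(standardized_mean(p, n, rng)) > q for _ in range(reps)) / reps
```

With these illustrative values the empirical frequency lands near 0.05, matching the two-gray-areas picture on the board.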
So then I said that this was approximately equal, due to the central limit theorem, to the probability that the absolute value of square root of n times Xn bar minus p, divided by square root of p times 1 minus p, is larger than q alpha over 2. Well, then this thing, by virtue of the central limit theorem, is approximately equal to alpha. And then we just said, well, let's solve for p. Has anyone attempted to solve the degree-two equation for p in the homework? Everybody has tried it?

So essentially, this is going to be an equation in p. Sometimes we don't want to solve it, and some of the p's we will replace by their worst possible value. For example, we said one of the tricks we had was that this value here, square root of p times 1 minus p, is always less than one half. That way we could actually get a confidence interval that was larger than all possible confidence intervals for all possible values of p. But we could also solve for p. Do we all agree on the principle of what we did? So that's how you build confidence intervals.

Now let's step back for a second and see what was important in the building of this confidence interval. The really key thing is that I didn't tell you why I formed this thing, right? We started from X bar, and then I took some weird function of X bar that depended on p and n. And the reason is that when I take this function, the central limit theorem tells me that it converges to something that I know. But the very important thing about the something that I know is that it does not depend on anything that I don't know. For example, if I forgot to divide by square root of p times 1 minus p, then this thing would have had a variance, which is p times 1 minus p.
If I didn't remove this p here, the mean would have been affected by p. And there's no table for a normal with mean p and variance 1. Yes?

AUDIENCE: [INAUDIBLE]

PHILIPPE RIGOLLET: Oh, where the square root of n terms come from. So really you should view it like this. There's a sort of quiet rule in math that you don't write a divided by b over c; you write c times a divided by b, because it looks nicer. But the way you want to think about this is as Xn bar minus p, divided by the square root of p times 1 minus p over n. And the reason is that this is actually the standard deviation of this guy -- oh sorry, of Xn bar. This is actually the standard deviation of this guy, and the square root of n comes from the [INAUDIBLE] average.

So the key thing was that this limiting distribution did not depend on anything I don't know. And this is actually called a pivotal distribution. It's pivotal. I don't need to know anything, and I can read it in a table. Sometimes there are going to be complicated things, but now we have computers. The beauty about Gaussians is that people have studied them to death, and you can open any stats textbook and you will find a table that tells you, for each value of alpha you're interested in, what q alpha over 2 is. But there might be some crazy distributions, and as long as they don't depend on anything unknown, we might actually be able to simulate from them, and in particular compute what q alpha over 2 is for any possible value [INAUDIBLE]. And so that's what we're going to be trying to do: finding pivotal distributions.
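The ways of handling the unknown p discussed above -- bounding sqrt(p(1-p)) by 1/2, plugging in p hat, or actually solving the degree-two equation in p (whose exact solution is what is usually called the Wilson interval) -- can be sketched as follows. This is a hedged illustration in Python, not code from the course; the example values xbar = 0.6, n = 100 at the end are purely hypothetical.

```python
import math
from statistics import NormalDist

def bernoulli_cis(xbar, n, alpha=0.05):
    """Three 1 - alpha confidence intervals for a Bernoulli parameter p."""
    q = NormalDist().inv_cdf(1 - alpha / 2)
    # (a) conservative: bound sqrt(p(1-p)) by its maximum value 1/2
    cons = (xbar - q / (2 * math.sqrt(n)), xbar + q / (2 * math.sqrt(n)))
    # (b) plug-in: replace p by p-hat inside the standard deviation
    s = math.sqrt(xbar * (1 - xbar) / n)
    plug = (xbar - q * s, xbar + q * s)
    # (c) solve the degree-two equation in p exactly (Wilson interval)
    center = (xbar + q**2 / (2 * n)) / (1 + q**2 / n)
    half = q * math.sqrt(xbar * (1 - xbar) / n + q**2 / (4 * n**2)) / (1 + q**2 / n)
    solve = (center - half, center + half)
    return cons, plug, solve

cons, plug, solve = bernoulli_cis(0.6, 100)  # hypothetical data
```

For moderate n the three intervals are close; the conservative one is always the widest, which is exactly the price of the worst-case bound.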
How do we take this Xn bar, which is a good estimator, and turn it into something whose distribution, exactly or asymptotically, does not depend on any unknown parameter? So here is one way we can actually-- so that's what we did for the kiss example, right? And I mentioned, for example, the extreme case when n was equal to 3: we would get a different thing, because there the CLT would not be valid. And what that means is that my pivotal distribution is actually not the normal distribution; it might be something else. And I said we can make exact computations. Well, let's see what it is. If I have three observations, X1, X2, X3, I take the average of those guys. OK, so that's my estimator. How many values can this guy take?

It's a little bit of counting. Four values. How did you get to that number? OK, so each of these guys can take value 0 or 1, right? So to count the number of values the average can take -- it's a little annoying, because I have to sum them, right? So basically, I have to count the number of 1's. So how many 1's can I get? Let's look at that. We get 0, 0, 0. Then 0, 0, 1 -- and there are basically three outcomes that have just one 1 in there, right? So there's three of them. How many have exactly two 1's? Three, right? It's just this guy where I swap the 0's and the 1's. OK, so in terms of the sum: one outcome has sum 0, three have sum 1, and three have sum 2.
OK, so everybody sees what I'm missing here: just the outcome where I replace all the 0's by 1's, which has sum 3. So the number of values that this thing can take is 1, 2, 3, 4. So someone is counting much faster than me. And those counts, you've probably seen them before, right? 1, 3, 3, 1, remember? And so essentially this guy takes four possible values: 0, 1/3, 2/3, and 1 -- which is probably much easier to count like that. And so now all I have to tell you, if I want to describe the distribution of this random variable, is the probability that it takes each of these values: the probability that X bar 3 takes the value 0, the probability that X bar 3 takes the value 1/3, et cetera. If I give you each of these probabilities, then you will know exactly what the distribution is, and hopefully maybe be able to turn it into something you can compute.

Now the thing is that those probabilities will actually depend on the unknown p. What is the probability that X bar 3 is equal to 0, for example? I'm sorry?

AUDIENCE: [INAUDIBLE]

PHILIPPE RIGOLLET: Yeah, OK. So let's write it without making the computation. So 1/8 is probably not the right answer, right? For example, if p is equal to 0, what is this probability? 1. If p is 1, what is this probability? 0. So it will depend on p. The probability that this thing is equal to 0 is just the probability that all three of those guys are equal to 0: the probability that X1 is equal to 0, and X2 is equal to 0, and X3 is equal to 0.
Now my variables are independent, so I can do what I actually want to do, which is to say the probability of the intersection is the product of the probabilities, right? So it's just the probability that any one of them is equal to 0, to the power of 3. And the probability that one of them is equal to 0 is just 1 minus p. And then for the next value it's more complicated, because I have to decide which one it is. But those things are just the probabilities of some binomial random variable, right? If I look at X bar 3 and multiply it by 3, it's just the sum of independent Bernoulli's with parameter p. So this is actually a binomial with parameters 3 and p. And there are tables for binomials that tell you all this.

Now, the thing is I want to invert this guy somehow. This thing depends on p. I don't like it, so I'm going to have to find ways to deal with this dependence on p, and I could make all these nasty computations and spend hours doing it. But there are tricks to get around this. There are upper bounds. Just like we said: maybe I don't want to solve the second-degree equation in p, because it's just going to capture maybe smaller-order terms, right? Things that maybe won't make a huge difference numerically. You can check that in your problem set one. Does it make a huge difference numerically to solve the second-degree equation, or to just use the [INAUDIBLE] p times 1 minus p, or even to plug in p hat instead of p? Problem set one is there to make sure that you see what magnitude of change you get by switching from one method to the other.
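The counting argument above -- 1, 3, 3, 1 outcomes, with 3 times X bar 3 distributed as Binomial(3, p) -- can be written out directly. A short Python sketch (an illustration, not anything assigned in the course):

```python
from math import comb

def pmf_xbar3(p):
    """P(Xbar_3 = k/3), keyed by k = 0, 1, 2, 3, since 3 * Xbar_3 ~ Binomial(3, p)."""
    return {k: comb(3, k) * p**k * (1 - p)**(3 - k) for k in range(4)}
```

Here comb(3, k) gives exactly the counts 1, 3, 3, 1 from the board, and the edge cases discussed come out right: p = 0 gives probability 1 to the all-zeros outcome, and p = 1 gives it probability 0.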
So what I wanted to get to is something where we can use a tool that's just a little more brute force. So here is Hoeffding's inequality. We saw that; it's what we finished on last time. Hoeffding's inequality is actually one of the most useful inequalities. If any one of you is doing anything related to algorithms, you've seen this inequality before. It's extremely convenient: it tells you something about bounded random variables, and in algorithms things are typically bounded. And that's the case for Bernoulli random variables, right? They're bounded between 0 and 1. So when I apply Hoeffding's inequality, what it's telling me is: for any given epsilon, what is the probability that Xn bar goes away from its expectation by more than epsilon? And we saw that this probability decreases somewhat similarly to a Gaussian tail.

So essentially what Hoeffding's inequality is telling me is this picture: when I have a Gaussian with mean mu, I know what it looks like. Hoeffding's inequality says that if I take the average of some bounded random variables, then their probability density function -- or maybe mass function; this thing might not even have [INAUDIBLE] a density, but let's think of it as a density just for simplicity -- is going to be something that looks like this. Well, sometimes it will have to escape a little, just for the sake of having integral 1. But it's essentially telling me that those guys stay below those guys.
The probability that Xn bar exceeds mu is bounded by something that decays like the tail of a Gaussian. So really that's the picture you should have in mind. When I average bounded random variables, I actually get something that might be really rugged. It might not be smooth like a Gaussian, but I know that it's always bounded by a Gaussian. And what's nice is that when I start computing the probability of exceeding some number -- say with probability alpha over 2 -- then I can actually get a number. So this one here is the q alpha over 2 for the Gaussian, and that one is -- I don't know, the r alpha over 2 for this [? Bernoulli ?] random variable. Like a q prime, a different q.

So I can actually do this without taking any limits, right? This is valid for any n. I don't need to go to infinity. Now this seems a bit magical. We discussed last time that we wanted n to be larger than 30 for the central limit theorem to kick in, and this one seems to tell me I can do it for any n. But there is a price to pay: I pick up this 2 over b minus a squared in the exponent. That's like the variance of the Gaussian that I have, right? Sort of. It's telling me what the variance should be, and it's actually not as nice: I pick up a factor 4 compared to the Gaussian that I would get otherwise. So let's try to solve it for our case. So I just told you: try it. Did anybody try to do it? So we started from this last time, right?
And the reason was that we could say that the probability that this thing exceeds q alpha over 2 is alpha. So that was using the CLT; let's just keep it here, and see what we would do differently.

What Hoeffding tells me is about the probability that Xn bar minus -- well, what is mu in this case? It's p, right? It's just notation: mu was the mean, but we call it p in the case of Bernoulli's -- exceeds, let's just call it epsilon for a second. So Hoeffding tells me that this is bounded by 2 times exponential of minus 2n epsilon squared. The nice thing is that I pick up a factor n here. And what is b minus a squared for Bernoulli's? 1. So I don't have a denominator here. And I'm going to do exactly what I did before: I'm going to set this guy equal to alpha. If I get alpha here, then just solving for epsilon, I get some number which will play the role of q alpha over 2, and then I'll be able to say that p is between Xn bar minus epsilon and Xn bar plus epsilon.

OK, so let's do it. We have to solve the equation: 2 times exponential of minus 2n epsilon squared equals alpha. There's a 2 right here, so that means I get alpha over 2 here. Then I take logs on both sides, and I solve for epsilon. So epsilon is equal to the square root of log of 2 over alpha, divided by 2n. Yes?

AUDIENCE: [INAUDIBLE]

PHILIPPE RIGOLLET: Why is b minus a equal to 1? Well, let's just look. X lives in an interval of length b minus a. So I could take b to be 25, and a to be negative 42.
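The derivation just completed -- solving 2 exp(-2 n epsilon^2) = alpha to get epsilon = sqrt(log(2/alpha) / (2n)) -- can be sketched in code, with a coverage check by simulation. This is an illustrative Python sketch (standard library only); p = 0.5 and n = 20 are arbitrary choices, deliberately small since the point is that no CLT is needed.

```python
import math
import random

def hoeffding_ci(xbar, n, alpha=0.05):
    """Xn bar +/- sqrt(log(2/alpha)/(2n)): valid for every n, no limit required."""
    eps = math.sqrt(math.log(2 / alpha) / (2 * n))
    return xbar - eps, xbar + eps

def coverage(p=0.5, n=20, alpha=0.05, reps=5000, seed=1):
    """Fraction of simulated samples whose Hoeffding interval contains the true p."""
    rng = random.Random(seed)
    hit = 0
    for _ in range(reps):
        xbar = sum(rng.random() < p for _ in range(n)) / n
        lo, hi = hoeffding_ci(xbar, n, alpha)
        hit += lo <= p <= hi
    return hit / reps
```

Because the guarantee is non-asymptotic, the empirical coverage comes out at least 1 - alpha even at n = 20 -- in fact well above it, since the bound is conservative.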
But I'm going to try to be as sharp as I can. All right, so what is the tightest interval you can think of that a Bernoulli random variable is guaranteed to live in? What values does a Bernoulli random variable take? 0 and 1. So it takes values between 0 and 1, and it actually attains those values. In fact, Bernoulli is the worst possible case for Hoeffding's inequality.

So now I just get this one. So when I solve this guy over there, combining this thing and this thing implies that the probability that p lives between Xn bar minus square root of log 2 over alpha divided by 2n, and Xn bar plus square root of log 2 over alpha divided by 2n, is equal to -- I mean, is at least -- what is it at least equal to?

This here controls the probability of being outside of the interval, right? It tells me the probability that Xn bar is far from p by more than epsilon. So that's the probability of being outside of the interval that I just wrote. And the probability of being in the interval is 1 minus the probability of being outside. So this is at least 1 minus alpha. I just used the fact that the probability of the complement is 1 minus the probability of the set. And since I have an upper bound on the probability of the set, I have a lower bound on the probability of the complement.

So now it's a bit different. Before, we actually wrote something that was -- let me get it back. If we go back to the example where we took the [INAUDIBLE] over p, we got this guy: Xn bar plus or minus q alpha over 2, divided by 2 square root of n.
And so now we have something that replaces this q alpha over 2, and it's essentially the square root of 2 log 2 over alpha. Because if I replace q alpha over 2 by the square root of 2 log 2 over alpha, I get exactly this thing here. And so the question is, what would you guess? Is this margin, square root of log 2 over alpha divided by 2n, smaller or larger than this guy, q alpha over 2 divided by 2 square root of n? Yes? Larger. Everybody agrees with this? Just qualitatively? Right, because we just made a very conservative statement. We did not use any assumptions. This is true always, so it can only be less sharp. The reason in statistics why you use those assumptions -- that n is large enough, that you have this independence that you like so much, so that the central limit theorem can kick in -- all these things are there so that you have enough assumptions to make sharper and sharper decisions, more and more confident statements. And that's why there's all this junk science out there: people make too many assumptions for their own good. They say, well, let's assume that everything is the way I love it, so that for sure, for any minor change, I will be able to say that's because I made an important scientific discovery -- rather than, well, that was just [INAUDIBLE]. OK?

So now here's the fun moment. And actually, let me tell you why we look at this. Who has seen different types of convergence in a probability or statistics class? [INAUDIBLE] students. So there are different types of convergence. For real numbers it's very simple: there's one convergence, xn tends to x.
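Back to the comparison made a moment ago: the guess that the Hoeffding margin is the larger one amounts to comparing q alpha over 2 with sqrt(2 log(2/alpha)). A two-line numerical check, as an illustration (Python standard library):

```python
import math
from statistics import NormalDist

def clt_margin(n, alpha=0.05):
    """Conservative CLT margin: q_{alpha/2} / (2 sqrt(n))."""
    return NormalDist().inv_cdf(1 - alpha / 2) / (2 * math.sqrt(n))

def hoeffding_margin(n, alpha=0.05):
    """Hoeffding margin: sqrt(log(2/alpha) / (2n))."""
    return math.sqrt(math.log(2 / alpha) / (2 * n))
```

For alpha = 0.05, sqrt(2 log(2/alpha)) is about 2.72 against q alpha over 2 of about 1.96, so the assumption-free interval is roughly 40% wider at every n.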
Once you start thinking about functions, well, maybe you have uniform convergence, you have pointwise convergence. So if you've done some real analysis, you know there are different types of convergence you can think of. And for convergence of random variables, there are also different types, but for different reasons. The question is: what do you do with the randomness? When you say that something converges to something, it probably means that you're willing to tolerate low-probability events on which it doesn't happen, and how you handle those creates the different types of convergence.

So, to be fair, in statistics the only convergence we care about is convergence in distribution. That's this one -- the one that comes from the central limit theorem. It's actually the weakest one you could ask for. Which is good, because that means it's going to happen more often. And why is this enough? Because the only thing we really need is to say that when I start computing probabilities on this random variable, they're going to look like probabilities on that random variable.

All right, so for example, think of the following two random variables: X and minus X. So this is the same random variable, and this one is its negative. When I look at those two random variables -- think of them as constant sequences -- these two constant sequences do not go to the same limit, right? One is X, the other one is minus X. So unless X is the random variable always equal to 0, those two things are different. However, when I compute probabilities on this guy, and when I compute probabilities on that guy, they're the same.
Because X and minus X have the same distribution, just by symmetry of the Gaussian random variable. And so you can see this is very weak. I'm not saying anything about the two random variables being close to each other each time I flip my coin, right? Maybe I query my computer and ask, what is X? Well, it's 1.2. Then negative X is negative 1.2. Those things are far apart, and it doesn't matter, because those two random variables assign the same probabilities to everything that can happen. And that's all we care about in statistics. You need to realize that this is what's important -- and you have it really good, if all you need is convergence in distribution rather than, say, convergence almost surely, which is probably the strongest you can think of.

So we're going to talk about different types of convergence, and not just to reflect on how good our life is. The problem is that convergence in distribution is so weak that I cannot do everything I want with it. In particular, I cannot say that if Xn converges in distribution, and Yn converges in distribution, then Xn plus Yn converges in distribution to the sum of their limits. I cannot do that; it's just too weak. Think of this example: this converges in distribution to some N(0, 1); this also converges in distribution to some N(0, 1); but their sum is 0, and that certainly doesn't look like the sum of two independent Gaussian random variables, right? And so what we need is stronger conditions here and there, so that we can actually put things together. And we're going to have more complicated formulas.
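The X versus minus X example can be simulated directly. A Python sketch (standard library only; the sample size and the threshold 1.0 are arbitrary illustrative choices):

```python
import random

rng = random.Random(0)
xs = [rng.gauss(0, 1) for _ in range(50000)]  # X ~ N(0, 1)
neg = [-x for x in xs]                        # -X: same distribution, by symmetry

# Probabilities computed on X and on -X agree, up to simulation error...
p_x = sum(x > 1.0 for x in xs) / len(xs)
p_neg = sum(x > 1.0 for x in neg) / len(neg)

# ...but the sum X + (-X) is identically 0, not N(0, 2):
max_sum = max(abs(a + b) for a, b in zip(xs, neg))
```

Here p_x and p_neg both land near P(Z > 1), about 0.159, while max_sum is exactly 0: convergence in distribution says nothing about joint behavior, which is why the limits cannot simply be added.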
One of the formulas, for example, is what we get if I replace p by p hat in this denominator. We mentioned doing this at some point. For that I would need p hat to go to p, but I need this to happen in a sense stronger than convergence in distribution.

So here are the first two, strongest senses in which random variables can converge. The first one is almost surely. Who has already seen this notation, little omega, when talking about random variables? All right, so very few. So what is a random variable? A random variable is something that you measure on something that's random. The example I like to think of is: take a ball of snow and put it in the sun for some time. You come back; it's going to have a random shape, a random blob of something. But there's still a bunch of things you can measure on it. You can measure its volume. You can measure its inner temperature. You can measure its surface area. All these things are random variables, but the ball itself is omega. That's the thing on which you make your measurements. And so a random variable is just a function of those omegas.

Now why do we make all these things fancy? Because you cannot take just any function. The function has to be what's called measurable, and there are entire courses on measure theory; not everything is measurable. And that's why you have to be a little careful: you need some sort of nice property, so that, for instance, the measure of the union of two things is less than the sum of the measures, things like that.
And so almost sure convergence is telling you that for most of the balls, for most of the omegas -- that's the right-hand side -- the probability of the set of omegas such that those things converge to each other is actually equal to 1. So it tells me that if I put together almost all the omegas, I get something that has probability 1. There might be other omegas where convergence fails, but they have probability 0. What it's telling you is that this thing happens for essentially all possible realizations of the underlying randomness. That's very strong. It essentially says randomness does not matter, because it's happening always.

Now, convergence in probability allows you to squeeze a little bit of probability under the rug. It tells you: I want the convergence to hold, but I'm willing to let go of some little epsilon. So I'm willing to allow Tn minus T to be larger than epsilon, as long as the probability of that goes to 0 as n goes to infinity. For each n this probability does not have to be 0, which is different from before, right? So it's a little weaker, a slightly different statement. I'm not going to ask you to show that one is weaker than the other, but just know that these are two different types. This one is actually much easier to check than that one.

Then there's something called convergence in Lp. It embodies the following fact: if I give you a sequence of random variables with mean 0, and I tell you that their variances go to 0, then they converge to 0. So think of Gaussian random variables with mean 0 and a variance that shrinks to 0.
Such a random variable converges to a spike at 0, so it converges to 0, right? And to get this convergence, all I had to tell you was that the variance goes to 0. That's really what convergence in L2 is telling you. It holds for any random variables Tn and T -- here what I described was for a deterministic limit -- Tn goes to T in L2 if the expectation of the squared distance between them goes to 0. But you don't have to limit yourself to the square. You can take a power of 3. You can take power 67.6, power 9 pi. You can take whatever power you want; it can even be fractional. It has to be at least 1, and that's convergence in Lp. But we mostly care about integer p.

And then here's our star, convergence in distribution, and that's just the one that tells you that when I start computing probabilities on the Tn, they're going to look very close to the probabilities on the T. So Tn was this guy, for example, and T was the standard Gaussian distribution. Now here, this is not any probability: it's just the probability of being less than or equal to x. But if you remember your probability class, if you can compute those probabilities, you can compute any probabilities, just by subtracting and building things together. And I need this for each x: you fix x, and then you make n go to infinity. And I only want this at the points x where the cumulative distribution function of T is continuous. There might be jumps, and at those I don't actually care.

All right, so here I mentioned it for random variables.
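The shrinking-variance Gaussian example lends itself to a quick numerical look at both senses of convergence just defined. An illustrative Python sketch, taking Tn to be N(0, 1/n) (my choice of example, consistent with the lecture's "variance shrinks to 0" picture):

```python
import random

def mean_square(n, reps=20000, seed=0):
    """Monte Carlo E[Tn^2] for Tn ~ N(0, 1/n): convergence in L2 means this -> 0."""
    rng = random.Random(seed)
    return sum(rng.gauss(0, n ** -0.5) ** 2 for _ in range(reps)) / reps

def prob_exceeds(n, eps=0.1, reps=20000, seed=0):
    """Monte Carlo P(|Tn| > eps): convergence in probability means this -> 0."""
    rng = random.Random(seed)
    return sum(abs(rng.gauss(0, n ** -0.5)) > eps for _ in range(reps)) / reps
```

Both quantities shrink as n grows, matching the "spike at 0" picture: the mean square tracks the variance 1/n, and the exceedance probability collapses once the standard deviation drops below epsilon.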
32:17 If you're interested, there's also random vectors. 32:19 A random vector is just a table of random variables. 32:23 You can talk about random matrices. 32:25 And you can talk about random whatever you want. 32:27 Every time you have an object that's 32:28 just collecting real numbers, you can just 32:31 plug random variables in there. 32:33 And so all these definitions extend. 32:37 So where you see an absolute value, 32:39 we'll see a norm. 32:40 Things like this. 32:43 So I'm sure this might look scary a little bit, 32:46 but really what we are going to use is only the last one, which 32:49 as you can see is just telling you 32:50 that the probabilities converge to the probabilities. 32:52 But I'm going to need the other ones every once in a while. 32:55 And the reason is, well, OK, so here I'm 32:59 actually going to the important characterizations 33:02 of the convergence in distribution, 33:05 which is our star convergence. 33:08 So Tn converges in distribution if and only 33:10 if for any function that's continuous and bounded, 33:14 when I look at the expectation of f of Tn, 33:16 this converges to the expectation of f of T. OK, 33:19 so those two things are actually equivalent. 33:25 Sometimes it's easier to check one, easier to check the other, 33:27 but in this class you won't have to prove that something 33:30 converges in distribution other than just combining 33:33 our existing convergence results. 33:37 And then the last one which is equivalent to the above two 33:40 is, anybody knows what the name of this quantity is? 33:42 This expectation here? 33:45 What is it called? 33:47 The characteristic function, right? 33:49 And so this i is the complex i, a complex number. 
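As a small numerical aside (my own sketch, with arbitrary parameters), the empirical characteristic function of a standardized average does approach e to the minus x squared over 2. It uses the shortcut that a mean of n Exp(1) variables has a Gamma(n, 1/n) distribution:

```python
import numpy as np

# Sketch (mine): the empirical characteristic function E[e^{ixZ_n}] of a
# standardized average approaches exp(-x^2/2).  Shortcut: a mean of n
# Exp(1) draws has a Gamma(n, 1/n) distribution, so we sample it directly.
rng = np.random.default_rng(2)
n, reps = 400, 100_000
z = np.sqrt(n) * (rng.gamma(n, 1.0 / n, reps) - 1.0)   # sigma = 1 here

diffs = []
for x in (0.5, 1.0):
    emp = np.mean(np.exp(1j * x * z))                  # empirical E[e^{ixZ}]
    diffs.append(abs(emp - np.exp(-x * x / 2)))
print(diffs)                                           # both small
```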
33:52 And so it's essentially telling me 33:54 that, well, rather than actually looking 33:56 at all bounded and continuous but real functions, 33:58 I can actually look at one specific family 34:03 of complex functions, which are the functions that map 34:08 T to e to the ixT for x in R. That's 34:12 a much smaller family of functions. 34:14 The set of all continuous and bounded functions 34:17 has many more elements than just this family. 34:21 And so now I can show that if I limit myself to this family, 34:24 it's actually sufficient. 34:25 34:28 So those three things are used all over the literature just 34:32 to show things. 34:33 In particular, if you're interested in digging 34:37 a little deeper mathematically, the central limit theorem 34:39 is going to be so important. 34:40 Maybe you want to read about how to prove it. 34:42 We're not going to prove it in this class. 34:43 There's probably at least five different ways of proving it, 34:49 but the most canonical one, the one that you find in textbooks, 34:52 is the one that actually uses the third point. 34:55 So you just look at the characteristic function 34:59 of square root of n times Xn bar minus say mu, 35:04 and you just expand the thing, and this is what you get. 35:07 And you will see that in the end, 35:09 you will get the characteristic function of a Gaussian. 35:13 Why a Gaussian? 35:14 Why does it kick in? 35:15 Well, because what is the characteristic function 35:17 of a Gaussian? 35:17 Does anybody remember the characteristic function 35:19 of a standard Gaussian? 35:20 AUDIENCE: [INAUDIBLE] 35:21 PHILIPPE RIGOLLET: Yeah, well, I mean 35:23 there's two pi's and stuff that goes away, right? 35:27 A Gaussian is a random variable. 35:29 A characteristic function is a function, 35:31 and so it's not really itself. 35:33 It looks like itself. 35:34 Anybody knows what the actual formula is? 35:37 Yeah. 35:37 AUDIENCE: [INAUDIBLE] 35:39 PHILIPPE RIGOLLET: E to the minus? 
35:41 AUDIENCE: E to the minus x squared over 2. 35:42 PHILIPPE RIGOLLET: Exactly. 35:43 E to the minus x squared over 2. 35:44 But this x squared over 2 is actually 35:46 just the second order expansion in the Taylor expansion. 35:49 And that's why the Gaussian is so important. 35:51 It's just the second order Taylor expansion. 35:54 And so you can check it out. 35:56 I think Terry Tao has some stuff on his blog, 35:58 and there's a bunch of different proofs. 36:00 But if you want to prove convergence in distribution, 36:02 you very likely are going to use one of these three right here. 36:07 So let's move on. 36:09 36:13 This is when I said that this convergence is 36:15 weaker than that convergence. 36:17 This is what I meant. 36:18 If you have convergence in one style, 36:20 it implies convergence in the other style. 36:23 So the first statement is that if Tn converges almost surely, 36:26 this a dot s dot means almost surely, 36:28 then it also converges in probability 36:31 and actually the two limits, which 36:32 are this random variable T, are equal almost surely. 36:37 Basically what it means is that whatever you measure on one 36:39 is going to be the same that you measure on the other one. 36:42 So that's very strong. 36:44 So that means that convergence almost surely 36:47 is stronger than convergence in probability. 36:50 If you converge in Lp then you also converge 36:53 in Lq for some q less than p. 36:56 So if you converge in L2, you'll also converge in L1. 36:59 If you converge in L67, you converge in L2. 37:03 If you converge in L infinity, 37:04 you converge in Lp for anything. 37:09 And so, again, limits are equal. 37:12 And then, 37:14 when you converge in probability, 37:15 you also converge in distribution. 37:18 OK, so almost surely implies probability. 37:22 Lp implies probability. 37:24 Probability implies distribution. 37:26 And here note that I did not write, 37:28 and the limits are equal almost surely. 
37:30 Why? 37:31 37:35 Because the convergence in distribution 37:37 is actually not telling you that your random variable 37:38 is converging to another random variable. 37:40 It's telling you that the distribution 37:42 of your random variable is converging to a distribution. 37:45 And think of this, guys. 37:47 X and minus X. 37:49 The central limit theorem tells me 37:50 that I'm converging to some standard Gaussian distribution, 37:53 but am I converging to X or am I converging to minus X? 37:57 It's not well identified. 37:58 It's any random variable that has this distribution. 38:01 So there's no way the limits are equal. 38:04 Their distributions are going to be the same, 38:06 but they're not the same limit. 38:07 Is that clear for everyone? 38:09 So in a way, convergence in distribution 38:12 is really not a convergence of a random variable 38:15 towards another random variable. 38:16 It's just telling you the limiting distribution 38:18 of your random variable, 38:20 which is enough for us. 38:22 And one thing that's actually really nice 38:24 is this continuous mapping theorem, which 38:28 essentially tells you that-- 38:30 so this is one of the theorems that we like, 38:32 because they tell us you can do what 38:33 you feel like you want to do. 38:35 So if I have Tn that goes to T, f of Tn goes to f of T, 38:39 and this is true for any of those convergences 38:42 except for Lp. 38:45 38:48 But you need f to be continuous, otherwise 38:51 weird stuff can happen. 38:54 So this is going to be convenient, because here I 38:58 don't have Xn bar minus p. 39:00 I have a continuous function. 39:01 It's basically a linear function of Xn bar minus p, 39:03 but I could think of like even crazier stuff to do, 39:05 and it would still be true. 39:07 If I took the square, it would converge to something that 39:10 looks like its distribution. 39:11 It's the same as the distribution 39:12 of a squared Gaussian. 
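The continuous mapping theorem can be illustrated with the square function just mentioned: if Zn goes to a standard Gaussian in distribution, Zn squared should behave like a chi-squared with 1 degree of freedom. A sketch of mine:

```python
import numpy as np

# Sketch (mine): continuous mapping with f(x) = x^2.  Z_n is a
# standardized Bernoulli average, so Z_n^2 should look chi-squared
# with 1 degree of freedom (mean 1, P(X <= 1) about 0.683).
rng = np.random.default_rng(3)
n, reps, p = 300, 40_000, 0.4
z = np.sqrt(n) * (rng.binomial(n, p, reps) / n - p) / np.sqrt(p * (1 - p))
sq = z ** 2                        # continuous function of Z_n

mean_sq = sq.mean()                # chi-squared_1 has mean 1
frac_below = np.mean(sq <= 1.0)    # chi-squared_1: P(X <= 1) ~ 0.683
print(mean_sq, frac_below)
```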
39:16 So this is a mouthful, these two slides-- 39:18 actually this particular slide is a mouthful. 39:20 What I have in my head since I was pretty much where you're 39:24 sitting, is this diagram. 39:27 So what it tells me-- so it's actually voluntarily cropped, 39:32 so you can start from any Lq you want, q large. 39:35 And then as you decrease the index, 39:38 you are actually implying, implying, 39:39 implying until you imply convergence in probability. 39:42 Convergence almost surely implies convergence 39:44 in probability, and everything goes to the sink, 39:49 that is convergence in distribution. 39:52 So everything implies convergence in distribution. 39:55 So that's basically rather than remembering those formulas, 39:57 this is really the diagram you want to remember. 39:59 40:02 All right, so why do we bother learning about those things? 40:06 That's because of this: limits and operations. 40:09 Operations and limits. 40:10 If I have a sequence of real numbers, 40:13 and I know that Xn converges to X and Yn converges to Y, 40:17 then I can start doing all my manipulations and things 40:20 are happy. 40:20 I can add stuff. 40:21 I can multiply stuff. 40:23 But it's not always true for convergence in distribution. 40:28 But, what's nice, it's actually 40:29 true for convergence almost surely. 40:32 With convergence almost surely, everything is true. 40:35 It's just impossible to make it fail. 40:38 But convergence in probability does not give you everything, 40:41 but at least you can actually add stuff and multiply stuff. 40:43 And it will still give you the sum of the limits, 40:46 and the product of the limits. 40:49 You can even take the ratio if V is not 0 of course. 40:55 If the limit is not 0, then actually 40:57 you need Vn to be not 0 as well. 40:58 41:01 You can actually prove this last statement, right? 41:05 Because it's a combination of the first statement, 41:08 the second one, and the continuous mapping theorem. 
41:11 Because the function that maps x to 1 41:14 over x on everything but 0, is continuous. 41:19 And so 1 over Vn converges to 1 over V, 41:24 and then I can multiply those two things. 41:26 So you actually knew that one. 41:28 But really this is not what matters, 41:30 because this is something that you will do whatever happens. 41:35 If I don't tell you you cannot do it, well, you will do it. 41:37 But in general those things don't 41:39 apply to convergence in distribution 41:40 unless the pair itself is known to converge in distribution. 41:44 Remember when I said that these things apply to vectors, 41:48 then you need to actually say that the vector converges 41:51 in distribution to the limiting vector. 41:53 Now this tells you in particular, 41:55 since the cumulative distribution function is not 41:57 defined for vectors, I would have 41:59 to actually use one of the other definitions, one 42:02 of the other criteria, which is convergence 42:04 of characteristic functions or convergence 42:07 of expectations of bounded continuous functions 42:11 of the random variable. 42:12 Point two or point three, but point one is not going to get you anywhere. 42:17 But this is something that's going 42:18 to be too hard for us to deal with, so we're actually 42:20 going to rely on the fact that we have 42:23 something that's even better. 42:24 There's something that is waiting for us 42:26 at the end of this lecture, which is called Slutsky's theorem, and it says 42:29 that if V, in this case, converges in probability 42:33 but U converges in distribution, I can actually still do that. 42:36 I actually don't need both of them 42:37 to converge in probability. 42:38 I actually need only one of them to converge in probability 42:41 to make this statement. 42:42 But to a constant. 42:45 So let's go to another example. 42:47 So I just want to make sure that we keep on doing statistics. 
42:49 And every time we're going to just do a little bit 42:51 too much probability, I'm going to release the pressure, 42:54 and start doing statistics again. 42:56 All right, so assume you observe 42:59 the inter-arrival times of the T at Kendall. 43:04 So this is not the arrival time. 43:06 It's not like 7:56, 8:15. 43:09 No, it's really the inter-arrival time, right? 43:12 So say the next T is arriving in six minutes. 43:17 So let's say [INAUDIBLE] bound. 43:20 And so you have this inter-arrival time. 43:23 So those are numbers say, 3, 4, 5, 4, 3, et cetera. 43:27 So I have this sequence of numbers. 43:29 So I'm going to observe this, and I'm 43:31 going to try to infer what is the rate of T's going out 43:36 of the station from this. 43:38 So I'm going to assume that these things are 43:40 mutually independent. 43:43 That's probably not completely true. 43:44 Again, what it would mean 43:46 is that two consecutive inter-arrival times are 43:49 independent. 43:50 I mean, you can make it independent if you want, 43:52 but again, this independence assumption 43:53 is for us to be happy and safe. 43:56 Unless someone comes with overwhelming proof 43:58 that it's not independent and far from being independent, 44:01 then yes, you have a problem. 44:03 But it might be the fact that it's actually-- if you 44:06 have a T that's one hour late. 44:09 If an inter-arrival time is one hour, then the other T, 44:12 either they fixed it, and it's going 44:14 to be just 30 seconds behind, or they haven't fixed it, 44:17 then it's going to be another hour behind. 44:18 So they're not exactly independent, 44:20 but they approximately are when things work well. 44:24 And so now I need to model a random variable that's 44:27 positive, maybe not upper bounded. 44:29 I mean, people complain enough that this thing 44:31 can be really large. 44:32 And so one thing that people like for inter-arrival times 44:34 is the exponential distribution. 
44:36 So that's a positive random variable. 44:38 Looks like an exponential on the right-hand side, 44:40 on the positive line. 44:41 And so it decays very fast towards 0. 44:43 The probability that you have very large 44:45 values is exponentially small, and there's a parameter lambda 44:49 that controls how the exponential is defined. 44:50 It's exponential minus lambda times something. 44:53 And so we're going to assume that they 44:56 have the same distribution, the same random variable. 44:58 So they're IID, because they are independent, 45:00 and they're identically distributed. 45:01 They all have this exponential with parameter lambda, 45:04 and I'm going to try to learn something about lambda. 45:06 What is the estimated value of lambda, 45:08 and can I build a confidence interval for lambda? 45:12 So we observe n arrival times. 45:16 So as I said, the mutual independence 45:20 is plausible, but not completely justified. 45:24 The fact that they're exponential 45:25 is actually something that people like in all this what's 45:27 called queuing theory. 45:29 So exponentials arise a lot when you 45:31 talk about inter-arrival times. 45:32 It's not about the bus, but where 45:34 it's very important is call centers, servers where 45:41 tasks come, and people want to know how long it's 45:45 going to take to serve a task. 45:47 So when I call at a center, nobody 45:50 knows how long I'm going to stay on the phone with this person. 45:52 But it turns out that empirically exponential 45:54 distributions have been very good at modeling this. 45:56 And what it means is that they're actually-- 45:58 you have this memoryless property. 46:01 It's kind of crazy if you think about it. 46:03 What does that thing say? 46:04 Let's parse it. 46:06 That's the probability. 46:08 So this is conditioned on the fact that T1 is larger than t. 46:12 So T1 is just, say, the first arrival time. 
46:14 That means that conditionally on the fact 46:16 that I've been waiting for the first T, well, 46:19 the first subway, 46:23 for more than t-- 46:27 so I've been there t minutes already. 46:30 Then the probability that I wait for s more minutes. 46:33 So that's the probability that T1 is larger than t, 46:35 the time that we've already waited, plus s. 46:38 Given that I've been waiting for t minutes, 46:40 the probability that I wait for s more minutes 46:42 is actually the probability that I wait for s minutes total. 46:46 It's completely memoryless. 46:47 It doesn't remember how long you have been waiting. 46:49 The probability does not change. 46:51 You can have waited for two hours, the probability 46:53 that it takes another 10 minutes is 46:55 going to be the same as if you had 46:56 been waiting for zero minutes. 46:59 And that's something that's actually 47:00 part of your problem set. 47:02 Very easy to compute. 47:03 This is just an analytical property. 47:05 And you just manipulate functions, 47:07 and you see that this thing just happens to be true, 47:09 and that's something that people like. 47:11 Because that's also something that's convenient. 47:15 And also what we like is that this thing is positive 47:17 almost surely, which is good when you model arrival times. 47:21 To be fair, we're not going to be that careful. 47:23 Because sometimes we are just going 47:24 to assume that something follows a normal distribution. 47:29 And in particular, I mean, I don't 47:30 know if we're going to go into the details, 47:32 but a good thing that you can model with a Gaussian 47:34 distribution is heights of students. 47:38 But technically with positive probability, 47:40 you can have a negative Gaussian random variable, right? 47:44 And that probability is probably 10 to the minus 25, 47:48 but it's positive. 47:49 But it's good enough for us for our modeling. 
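The memoryless property being parsed here is easy to check by simulation. A sketch of mine, with an arbitrary lambda = 0.5:

```python
import numpy as np

# Sketch (mine, arbitrary lambda): memorylessness of the exponential,
# P(T > t + s | T > t) = P(T > s) = e^{-lambda*s}.
rng = np.random.default_rng(5)
lam, t, s = 0.5, 2.0, 1.0
T = rng.exponential(scale=1.0 / lam, size=1_000_000)

cond = np.mean(T[T > t] > t + s)   # P(T > t+s | T > t)
uncond = np.mean(T > s)            # P(T > s)
exact = np.exp(-lam * s)
print(cond, uncond, exact)         # all three should agree
```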
47:51 So this thing is nice, but this is not going to be required. 47:54 When you're modeling positive random variables, 47:56 you don't always have to use 47:59 distributions that are supported on positive numbers. 48:01 You can use distributions like the Gaussian. 48:03 48:06 So now this exponential distribution of T1 to Tn-- 48:09 they have the same parameter, and that 48:11 means that on average they have the same inter-arrival time. 48:14 So this lambda is actually the expectation. 48:16 And what I'm just saying is that they're identically 48:19 distributed means that I'm in some sort 48:21 of a stationary regime, and it's not always true. 48:24 I have to look at a shorter period of time, 48:26 because at rush hour and 11:00 PM 48:28 clearly those average inter-arrival times 48:31 are going to be different. So it means that I am really 48:33 focusing maybe on rush hour. 48:35 48:38 Sorry, I said it's lambda. 48:39 It's actually 1 over lambda. 48:40 I always mix the two. 48:42 All right, so you have the density of T1. 48:44 So f of t is this. 48:46 So it's on the positive real line. 48:49 The fact that I have strictly positive or larger than 48:52 or equal to 0 doesn't make any difference. 48:54 So this is the density. 48:55 So it's lambda e to the minus lambda t. The lambda in front 48:58 just ensures that when I integrate 48:59 this function between 0 and infinity, I get 1. 49:03 And you can see, it decays like exponential minus lambda t. 49:06 So if I were to draw it, it would just look like this. 49:09 49:13 So at 0, what value does it take? 49:17 Lambda. 49:19 And then it decays like exponential minus lambda t. 49:23 So this is 0, and this is f of t. 49:30 So very small probability of being very large. 49:33 Of course, it depends on lambda. 49:35 Now the expectation, you can compute the expectation 49:37 of this thing, right? 49:38 So you integrate t times f of t. This 49:41 is part of the little sheet that I gave you last time. 
49:44 This is one of the things you should 49:45 be able to do blindfolded. 49:47 And then you get the expectation of T1 is 1 over lambda. 49:51 That's what comes out. 49:53 So as I actually tell many of my students, 99% of statistics 49:57 is replacing expectations by averages. 50:00 And so what you're tempted to do is say, well, if on average I'm 50:02 supposed to see 1 over lambda, I have 15 observations. 50:05 I'm just going to average those observations, 50:07 and I'm going to see something that should be close to 1 50:10 over lambda. 50:11 So statistics is about replacing 50:14 expectations with averages, and that's what we do. 50:17 So Tn bar here, which is the average of the Ti's, is 50:21 a pretty good estimator for 1 over lambda. 50:25 So if I want an estimate for lambda, 50:27 then I need to take 1 over Tn bar. 50:30 So here is one estimator. 50:32 I did it without much principle except that I just 50:36 want to replace expectations by averages, 50:38 and then I fixed the problem that I was actually estimating 50:41 1 over lambda rather than lambda. 50:43 But you could come up with other estimators, right? 50:45 But let's say this is my way of getting to that estimator. 50:49 Just like I didn't give you any principled way of getting p 50:52 hat, which is Xn bar in the kiss example. 50:54 But that's the natural way to do it. 50:57 Everybody is completely shocked by this approach? 51:01 All right, so let's do this. 51:03 So what can I say about the properties of this estimator 51:06 lambda hat? 51:08 Well, I know that Tn bar is going to 1 over lambda 51:12 by the law of large numbers. 51:14 It's an average. 51:14 It converges to the expectation both almost surely, 51:18 and in probability. 51:19 So the first one is the strong law of large numbers, 51:21 the second one is the weak law of large numbers. 51:23 I can apply the strong one. 51:24 I have enough conditions. 51:26 And hence, what do I apply so that 1 over Tn bar 51:31 actually goes to lambda? 
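The two-step convergence being set up here, Tn bar to 1 over lambda and then 1 over Tn bar to lambda, can be watched numerically. A sketch of mine, with an arbitrary lambda = 2:

```python
import numpy as np

# Sketch (mine, arbitrary lambda = 2): Tbar_n -> 1/lambda by the law of
# large numbers, so lambda_hat = 1/Tbar_n should approach lambda.
rng = np.random.default_rng(6)
lam = 2.0

for n in (100, 10_000, 1_000_000):
    lam_hat = 1.0 / rng.exponential(scale=1.0 / lam, size=n).mean()
    print(n, lam_hat)              # approaches 2.0 as n grows
```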
51:35 So I said hence. 51:36 What is hence? 51:37 What is it based on? 51:37 AUDIENCE: [INAUDIBLE] 51:43 PHILIPPE RIGOLLET: Yeah, continuous mapping theorem, 51:45 right? 51:45 So I have this function 1 over x. 51:47 I just apply this function. 51:49 So if it was 1 over lambda squared, 51:51 I would have the same thing that would 51:52 happen just because the function 1 over x 51:54 is continuous away from 0. 51:58 And now the central limit theorem 52:00 is also telling me something about lambda. 52:02 About Tn bar, right? 52:03 It's telling me that if I look at my average, 52:05 I remove the expectation here. 52:08 So if I do Tn bar minus my expectation, 52:11 rescale by this guy here, then this thing is going 52:15 to converge to some Gaussian random variable, 52:18 but here I have this lambda to the negative 1-- 52:21 to the negative 2 here, and that's 52:23 because I did not tell you that if you 52:25 compute the variance-- 52:28 so from this, you can probably extract. 52:30 52:34 So if I have X that follows some exponential distribution 52:39 with parameter lambda. 52:40 Well, let's call it T. 52:42 So we know that the expectation 52:46 of T is 1 over lambda. 52:48 What is the variance of T? 52:49 52:56 You should be able to read it from the thing here. 53:00 53:09 1 over lambda squared. 53:10 That's what you actually read in the variance, 53:12 because the central limit theorem is really telling you 53:16 the distribution goes through this n. 53:19 But this number and this number you can read, right? 53:23 If you look at the expectation of this guy-- 53:26 it comes out. 53:26 This is 1 over lambda minus 1 over lambda. 53:28 That's why you read the 0. 53:30 And if you look at the variance of the dot, 53:32 you get n times the variance of this average. 53:36 Variance of the average is picking up a factor 1 over n. 53:39 So the n cancels. 
53:40 And then I'm left with only one of the variances, which 53:42 is 1 over lambda squared. 53:45 OK, so we're not going to do that in detail, 53:48 because, again, this is just a pure calculus exercise. 53:50 But this is if you compute the integral of lambda e 53:54 to the minus lambda t times, 53:58 actually, t minus 1 over lambda, squared, 54:01 dt between 0 and infinity. 54:05 You will see that this thing is 1 over lambda squared. 54:07 54:14 How would I do this? 54:15 54:20 Integration by parts, or you know it. 54:24 All right. 54:26 So this is what the central limit theorem tells me. 54:29 So this gives me, if I solve this, 54:31 and I plug in, so I can multiply by lambda and solve, 54:35 it would give me somewhat of a confidence interval for 1 54:40 over lambda. 54:42 If we just think of 1 over lambda 54:44 as being the p that I had before, 54:46 this would give me a central limit theorem for-- 54:48 54:51 sorry, a confidence interval for 1 over lambda. 54:54 So I'm hiding a little bit under the rug 54:56 the fact that I have to still define it. 54:58 Let's just actually go through this. 55:00 I see some of you are uncomfortable with this, 55:02 so let's just do it. 55:04 So what we've just proved by the central limit 55:06 theorem is that the probability that 55:09 square root of n times Tn bar minus 1 over lambda exceeds q alpha over 2 55:21 is approximately equal to alpha, right? 55:24 That's just the statement of the central limit theorem, 55:27 and by approximately equal I mean as n goes to infinity. 55:30 55:34 Sorry, I did not write it correctly. 55:36 I still have to divide by square root of 1 55:39 over lambda squared, which is the standard deviation, right? 55:43 And we said that this is a bit ugly. 55:44 So let's just do it the way it should be. 55:46 So multiply all these things by lambda. 
55:50 So that means now that, 55:56 with probability 1 minus alpha asymptotically, 55:59 I have that square root of n times lambda Tn bar minus 1, 56:07 in absolute value, is less than or equal to q alpha over 2. 56:11 56:14 So what it means is that, oh, I have negative q alpha over 2 56:20 less than square root of n-- 56:22 let me divide by square root of n here-- 56:25 lambda Tn bar minus 1, less than q alpha over 2. 56:34 And so now what I have is that I get that lambda is between-- 56:41 that's Tn bar-- is between 1 plus q alpha over 2 56:50 divided by root n, 56:53 and the whole thing is divided by Tn bar, 56:57 and same thing on the other side except I have 1 minus q alpha 57:04 over 2 divided by root n, divided by Tn bar. 57:08 57:12 So it's kind of a weird shape, but it's still 57:16 of the form 1 over Tn bar plus or minus something. 57:20 But this something depends on Tn bar itself. 57:23 And that's actually normal, because Tn bar is not only 57:26 giving me information about the mean, 57:29 but it's also giving me information about the variance. 57:31 So it should definitely come in the size of my error bars. 57:37 And that's the way it comes in this fairly natural way. 57:41 Everybody agrees? 57:43 So now I have actually built a confidence interval. 57:46 But what I want to show you with this example is, 57:50 can I translate this into a central limit 57:52 theorem for something that converges to lambda, right? 57:57 I know that Tn bar converges to 1 over lambda, 58:00 but I also know that 1 over Tn bar converges to lambda. 58:05 So do I have a central limit theorem for 1 over Tn bar? 58:09 Technically no, right? 58:11 Central limit theorems are about averages, and 1 over an average 58:14 is not an average. 58:16 But there's something that statisticians like a lot, 58:20 and it's called the Delta method. 
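The interval just derived can be sanity-checked by simulation: it should contain the true lambda in roughly 95% of repetitions when alpha = 5%. A sketch of mine, with arbitrary lambda and n:

```python
import numpy as np

# Sketch (mine, arbitrary lambda and n): the interval
# [(1 - q/sqrt(n))/Tbar, (1 + q/sqrt(n))/Tbar] with q = 1.96 should
# contain lambda in roughly 95% of repetitions.
rng = np.random.default_rng(7)
lam, n, reps, q = 2.0, 400, 10_000, 1.96

t_bar = rng.exponential(1.0 / lam, (reps, n)).mean(axis=1)
lo = (1 - q / np.sqrt(n)) / t_bar
hi = (1 + q / np.sqrt(n)) / t_bar
coverage = np.mean((lo <= lam) & (lam <= hi))
print(coverage)                    # close to 0.95
```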
58:23 The Delta method is really something 58:24 that's telling you that you can actually 58:27 take a function of an average, and let 58:30 it go to the function of the limit, 58:32 and you still have a central limit theorem. 58:34 And the factor, or the price to pay for this, 58:37 is something which depends on the derivative of the function. 58:44 And so let's just go through this, 58:46 and it's, again, just like the proof of the central limit 58:48 theorem. 58:49 And actually in many of those asymptotic statistics results, 58:53 this is actually just a Taylor expansion, 58:55 and here it's not even the second order, 58:57 it's actually the first order, all right? 58:59 So I'm just going to do a linear approximation of this function. 59:02 59:04 So let's do it. 59:05 So I have that g of Tn bar-- 59:12 actually let's use the notation of this slide, 59:15 which is Zn and theta. 59:17 So what I know is that Zn minus theta times square root of n 59:24 goes to some Gaussian, this standard Gaussian. 59:29 No, not standard. 59:32 OK, so that's the assumption. 59:34 And what I want to show is some convergence of g of Zn 59:40 to g of theta. 59:43 So I'm not going to multiply by root n just yet. 59:46 So I'm going to do a first order Taylor expansion. 59:49 So what it is telling me is that this is equal to Zn minus theta 59:57 times g prime of, let's call it theta bar, 60:01 where theta bar is somewhere between, say, 60:06 Zn and theta. 60:11 60:13 OK, so if theta is less than Zn you just permute those two. 60:17 So that's what the first order Taylor 60:21 expansion tells me. 60:21 There exists a theta bar that's between the two 60:23 values at which I'm expanding so that those two things are 60:26 equal. 60:29 Is everybody shocked? 60:31 No? 60:31 So that's a standard Taylor expansion. 60:36 Now I'm going to multiply by root n. 60:38 60:44 And so that's going to be what? 60:45 That's going to be root n Zn minus theta. 60:50 Ah-ha, that's something I like. 
60:51 Times g prime of theta bar. 60:57 60:59 Now the central limit theorem tells me 61:01 that this goes to what? 61:02 61:06 Well, this goes to some N(0, sigma squared), right? 61:12 That was the first line over there. 61:15 This guy here, well, it's not clear, right? 61:20 Actually it is. 61:21 Let's start with this guy. 61:24 What does theta bar go to? 61:28 Well, I know that Zn is going to theta. 61:30 61:33 Just because, well, that's my law of large numbers. 61:37 Zn is going to theta, which means 61:41 that theta bar is sandwiched between two values that 61:44 converge to theta. 61:46 So that means that theta bar converges to theta itself 61:49 as n goes to infinity. 61:51 That's just the law of large numbers. 61:54 Everybody agrees? 61:57 Just because it's sandwiched, right? 61:58 So I have Zn. 62:01 I have theta, and theta bar is somewhere here. 62:05 The picture might be reversed. 62:06 It might be that Zn is larger than theta. 62:08 But the law of large numbers tells me 62:10 that this guy is not moving, but this guy is moving that way. 62:14 So you know, when n is large, 62:16 there's very little wiggle room for theta bar, 62:18 and it can only get to theta. 62:19 62:23 And I call it the sandwich theorem, 62:25 or just find your favorite food in there. 62:29 So this guy goes to theta, and now I 62:31 need to make an extra assumption, which 62:33 is that g prime is continuous. 62:38 And if g prime is continuous, then g prime of theta bar 62:42 goes to g prime of theta. 62:44 So this thing goes to g prime of theta. 62:49 62:52 But I have an issue here. 62:54 It's that now I have something that 62:56 converges in distribution and something 62:57 that converges in, say-- 63:01 I mean, this converges almost surely or, say, in probability 63:04 just to be safe. 63:06 And this one converges in distribution. 63:09 And I want to combine them. 
63:11 But I don't have a slide that tells me 63:12 I'm allowed to take the product of something that converges 63:15 in distribution, and something that converges in probability. 63:18 This does not exist. 63:19 Actually, if anything it told me, 63:21 do not do anything with things that converge in distribution. 63:25 And so that gets us to our-- 63:32 OK, so I'll come back to this in a second. 63:36 And that gets us to something called Slutsky's theorem. 63:39 And Slutsky's theorem tells us that in very specific cases, 63:42 you can do just that. 63:44 So you have two sequences of random variables, 63:49 Xn that converges to X, and Yn that converges to Y, 63:53 but Y is not anything. 63:55 Y is not any random variable. 63:57 So Xn converges-- 63:59 sorry, I forgot to mention, this is very important. 64:01 Xn converges in distribution, Yn converges in probability. 64:04 And we know that in full generality we cannot combine those two 64:07 things, but Slutsky tells us that if the limit of Yn is 64:11 a constant, meaning it's not a random variable, 64:13 but it's a deterministic number, 2, 64:16 just a fixed number that's not a random variable, 64:18 then you can combine them. 64:21 Then you can sum them, and then you can multiply them. 64:24 64:28 I mean, actually you can do whatever combination you want, 64:31 because it actually implies that the vector Xn, Yn 64:34 converges to the vector X, c. 64:39 OK, so here I just took two combinations. 64:41 They are very convenient for us, the sum and the product, 64:44 so I could do other stuff like the ratio 64:45 if c is not 0, things like that. 64:47 64:51 So that's what Slutsky does for us. 64:53 So what you're going to have to write a lot in your homework, 64:56 in your mid-terms, is "by Slutsky." 64:58 I know some people are very generous with their by Slutsky. 65:03 They just do numerical applications, 65:05 mu is equal to 6, and therefore by Slutsky 65:08 mu squared is equal to 36. 
65:10 All right, so don't do that. 65:11 Just write "by Slutsky" when you're actually using Slutsky. 65:15 But this is something that's very important for us, 65:17 and it turns out that you're going 65:18 to feel like you can write "by Slutsky" all the time, 65:20 because that's going to work for us all the time. 65:23 Everything we're going to see is actually 65:25 going to be where we're going to have to combine stuff. 65:27 Since we only rely on convergence in distribution 65:30 arising from the central limit theorem, 65:32 we're actually going to have to rely on something that 65:34 allows us to combine them, and the only thing we know 65:36 is Slutsky. 65:37 So we better hope that this thing works. 65:40 So why does Slutsky work for us? 65:41 Can somebody tell me why Slutsky works 65:43 to combine those two guys? 65:46 So this one is converging in distribution. 65:48 This one is converging in probability, 65:51 but to a deterministic number. 65:54 g prime of theta is a deterministic number. 65:57 I don't know what theta is, but it's certainly deterministic. 66:02 All right, so I can combine them, multiply them. 66:04 So that's just the second line of that in particular. 66:08 All right, everybody is with me? 66:12 So now I'm allowed to do this. 66:13 You can actually-- you will see some 66:15 counterexample questions in your problem 66:16 set just so that you can convince yourself. 66:18 It's always a good thing. 66:19 I don't like to give them, because I 66:21 think it's much better for you to actually come 66:23 to the counterexample yourself. 66:24 Like, what can go wrong if 66:35 c is not a constant, 66:38 but a random variable? 66:42 You can figure that out. 66:45 All right, so let's go back. 66:46 So we have now this Delta method that tells us 66:49 that now I have a central limit theorem 66:51 for functions of averages, and not just for averages.
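For reference, the delta method argument sketched on the board can be written out compactly; this is only a restatement of the derivation above in the same notation, nothing new.

```latex
% CLT for the average Z_n of the sample, with mean \theta:
\sqrt{n}\,(Z_n - \theta) \xrightarrow{(d)} \mathcal{N}(0,\sigma^2)
% Mean value theorem, with \bar\theta between Z_n and \theta:
g(Z_n) - g(\theta) = g'(\bar\theta)\,(Z_n - \theta)
% \bar\theta \to \theta (sandwich + law of large numbers), g' continuous,
% so g'(\bar\theta) \to g'(\theta) in probability; then, by Slutsky,
\sqrt{n}\,\bigl(g(Z_n) - g(\theta)\bigr)
  = g'(\bar\theta)\,\sqrt{n}\,(Z_n - \theta)
  \xrightarrow{(d)} \mathcal{N}\bigl(0,\; g'(\theta)^2\,\sigma^2\bigr)
```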
66:55 So the only price to pay is this derivative there. 66:57 67:00 So, for example, if g is just a linear function, 67:05 then I'm going to have a constant multiplication. 67:07 If g is a quadratic function, then I'm 67:10 going to have theta squared that shows up there. 67:13 Things like that. 67:14 So just think of what kind of applications 67:16 you could have for this. 67:17 Here the function that we're interested in 67:19 is x maps to 1 over x. 67:21 What is the derivative of this guy? 67:23 67:25 What is the derivative of 1 over x? 67:29 Negative 1 over x squared, right? 67:31 That's the thing we're going to have to put in there. 67:33 And so this is what we get. 67:37 So now when I'm actually going to write this, 67:44 so if I want to show square root of n lambda hat minus lambda. 67:51 That's my application, right? 67:52 This is actually 1 over Tn, and this is 1 over 1 over lambda. 67:59 So the function g of x is 1 over x in this case. 68:05 So now I have this thing. 68:06 So I know that by the Delta method-- 68:08 oh, and I knew that, remember, 68:11 square root of n times Tn minus 1 over lambda 68:16 was going to some normal with mean 0 68:19 and variance 1 over lambda squared, right? 68:21 So the sigma squared over there is 1 over lambda squared. 68:26 So now this thing goes to what? 68:27 Some normal. 68:28 What is going to be the mean? 68:32 0. 68:32 68:35 And what is the variance? 68:37 So the variance is going-- 68:38 I'm going to pick up this guy, 1 over lambda 68:40 squared, and then I'm going to have to take g prime of what? 68:46 Of 1 over lambda, right? 68:48 That's my theta. 68:49 68:52 So I have g of theta, which is 1 over theta. 68:55 So I'm going to have g prime of 1 over lambda. 68:58 And what is g prime of 1 over lambda? 69:00 69:05 So we said that g prime of x is negative 1 over x squared. 69:09 So it's negative 1 over, 1 over lambda, 69:13 69:17 squared. 69:18 69:21 Which gets squared, which is nice, because g can be decreasing.
69:24 So that would be annoying, to have a negative variance. 69:26 And so g prime of 1 over lambda is negative lambda squared, and so 69:29 what I get eventually is lambda squared up here, 69:33 but then I square it again. 69:36 So this whole thing here becomes what? 69:39 Can somebody tell me what the final result is? 69:41 69:44 Lambda squared, right? 69:45 So it's lambda to the fourth divided by lambda squared. 69:47 69:55 So that's what's written there. 69:59 And now I can just do my good old computation for a-- 70:04 70:10 I can do a good computation for a confidence interval. 70:14 All right, so let's just go from the second line. 70:17 So we know that the absolute value of lambda hat minus lambda 70:21 is less than-- we've done that several times already-- 70:23 q alpha over 2-- 70:25 sorry, I should put alpha over 2 on this thing, right? 70:28 So that's really the quantile of order alpha over 2, times 70:31 lambda divided by square root of n. 70:34 All right, and so that means that my confidence interval 70:39 should be this: 70:42 lambda belongs to lambda hat plus or minus q alpha 70:47 over 2 times lambda divided by root n, right? 70:51 So that's my confidence interval. 70:53 But again, it's not very usable, because-- 70:56 there's still this lambda in there, 70:59 and we don't know how to compute it. 71:02 So now I'm going to request from the audience 71:04 some remedies for this. 71:06 What do you suggest we do? 71:07 71:12 What is the laziest thing I can do? 71:14 71:18 Anybody? 71:19 Yeah. 71:19 AUDIENCE: [INAUDIBLE] 71:21 PHILIPPE RIGOLLET: Replace lambda by lambda hat. 71:23 What justifies me doing this? 71:25 AUDIENCE: [INAUDIBLE] 71:27 PHILIPPE RIGOLLET: Yeah, and Slutsky 71:29 tells me I can actually do it, because Slutsky tells me-- 71:32 where does this lambda come from, right? 71:35 This lambda comes from here. 71:37 That's the one that's here. 71:39 So actually I could rewrite this entire thing 71:41 as square root of n lambda hat minus lambda divided by lambda 71:47 converges to some N(0, 1).
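The convergence just stated, square root of n times (lambda hat minus lambda) over lambda tending to a standard Gaussian, can be checked numerically. A minimal sketch, not part of the lecture; the true lambda, the sample size, and the seed are arbitrary choices made for the sketch.

```python
import numpy as np

rng = np.random.default_rng(1)
lam = 3.0     # true rate lambda (hypothetical value chosen for the sketch)
n = 2_000     # number of observations per experiment
reps = 5_000  # number of repeated experiments

# X_i ~ Exp(lambda), so Tn = Xn bar -> 1/lambda and lambda hat = 1/Tn
x = rng.exponential(scale=1.0 / lam, size=(reps, n))
lam_hat = 1.0 / x.mean(axis=1)

# Delta method claim: sqrt(n) * (lambda hat - lambda) / lambda ~ N(0, 1)
z = np.sqrt(n) * (lam_hat - lam) / lam
print(round(z.mean(), 3), round(z.std(), 3))  # both close to 0 and 1
```

Equivalently, without dividing by lambda, the histogram of sqrt(n) * (lambda hat - lambda) matches N(0, lambda squared), which is the variance computed on the board.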
71:51 Now if I replace this by lambda hat, what I have is 71:55 that this is actually really the original one times 72:01 lambda divided by lambda hat. 72:04 And this converges to N(0, 1), right? 72:07 And now what you're telling me is, well, this guy, 72:10 I know it converges to N(0, 1), and this guy is converging to 1 72:15 by the law of large numbers. 72:16 But this one is converging to 1, which happens to be a constant. 72:19 It converges in probability, so by Slutsky I can actually 72:22 take the product and still maintain my convergence 72:25 in distribution to a standard Gaussian. 72:29 So you can always do this. 72:30 Every time you replace some p by p hat, 72:34 as long as their ratio goes to 1, 72:35 which is going to be guaranteed by the law of large numbers, 72:38 you're actually going to be fine. 72:40 And that's where we're going to use Slutsky a lot. 72:42 When we do plug-in, Slutsky is going to be our friend. 72:46 OK, so we can do this. 72:47 72:51 And that's one way. 72:52 And the other way is to just solve 72:53 for lambda like we did before. 72:56 So the first one we got is actually-- 72:58 I don't know if I still have it somewhere. 73:00 Yeah, that was the one, right? 73:03 So we had 1 over Tn times q, and that's exactly the same 73:08 as what we have here. 73:09 So your solution is actually giving us exactly this guy when 73:12 we actually solve for lambda. 73:14 73:17 So this is what we get. 73:20 Lambda hat. 73:21 We replace lambda by lambda hat, and we 73:24 have our asymptotic confidence interval. 73:27 And that's exactly what we did, and Slutsky's theorem 73:30 at this point is just telling us 73:32 that we can actually do this. 73:36 Are there any questions about what we did here? 73:39 So this derivation right here is exactly what I 73:42 did on the board, as I showed you. 73:44 So let me just show you with a little more space 73:46 just so that we all understand, right?
73:49 So we know that square root of n lambda hat minus lambda divided 73:58 by lambda-- the true lambda-- 74:00 converges to some N(0, 1). 74:04 So that was CLT plus Delta method. 74:07 74:11 Applying those two, we got to here. 74:13 And we know that lambda hat converges 74:17 to lambda in probability and almost surely, and that's what? 74:21 That was the law of large numbers plus the continuous mapping theorem, 74:24 right? 74:25 Because we only knew that 1 over lambda hat 74:27 converges to 1 over lambda. 74:29 So we had to flip those things around. 74:31 And now what I said is that I apply Slutsky, 74:33 so I write square root of n lambda hat minus lambda divided 74:38 by lambda hat, which is the suggestion that was made to me. 74:42 They said, I want this, but I would 74:44 want to show that it converges to some N(0, 74:45 1) so I can legitimately use q alpha over 2 with this one, 74:49 though. 74:50 And the way we said it is like, well, this thing is actually 74:53 really square root of n lambda hat minus lambda divided by lambda, times lambda divided by lambda hat. 75:00 So this thing that was proposed to me, 75:02 I can decompose it into the product 75:03 of those two random variables. 75:05 The first one here converges to the Gaussian 75:09 from the central limit theorem. 75:10 And the second one converges to 1 from this guy, 75:14 but in probability this time. 75:17 75:20 That was the ratio of two things converging in probability, 75:23 so we can actually get it. 75:25 And so now I apply Slutsky. 75:26 75:31 And Slutsky tells me that I can actually do that. 75:34 But when I take the product of this thing that converges 75:36 to some standard Gaussian, and this thing that converges 75:40 in probability to 1, then their product actually 75:43 converges to still this standard Gaussian [INAUDIBLE] 75:48 75:55 Well, that's exactly what's done here, 75:58 and I think I'm getting there. 76:02 So in our case, OK, so just a remark for Slutsky's theorem. 76:07 So that's the last line.
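The plug-in interval that this argument justifies, lambda hat plus or minus q alpha over 2 times lambda hat over root n, can be sanity-checked by simulation: at level alpha = 5%, it should contain the true lambda roughly 95% of the time. A minimal sketch, not part of the lecture; the true lambda, sample size, and repetition count are arbitrary choices, and 1.96 is the familiar Gaussian quantile of order 2.5%.

```python
import numpy as np

rng = np.random.default_rng(2)
lam = 2.0     # true rate (unknown in practice; fixed here to measure coverage)
n = 1_000     # observations per experiment
reps = 5_000  # repeated experiments
q = 1.96      # q alpha over 2 for alpha = 5%

# X_i ~ Exp(lambda); lambda hat = 1 / Tn with Tn the sample mean
x = rng.exponential(scale=1.0 / lam, size=(reps, n))
lam_hat = 1.0 / x.mean(axis=1)

# Plug-in interval justified by Slutsky: lambda hat +/- q * lambda hat / sqrt(n)
half = q * lam_hat / np.sqrt(n)
covered = (lam_hat - half <= lam) & (lam <= lam_hat + half)
print(round(covered.mean(), 3))  # empirical coverage, close to 0.95
```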
76:09 So in the first example we used the problem-dependent trick, 76:11 which was to say, well, it turns out 76:13 that we knew that p is between 0 and 1. 76:16 So we had this p times 1 minus p that was annoying to us. 76:18 We just said, let's just bound it by 1/4, 76:21 because that's going to be true for any value of p. 76:23 But here, lambda takes any value between 0 and infinity, 76:26 so we didn't have such a trick. 76:27 Unless we could see that lambda was less 76:29 than something. 76:30 Maybe we know it, in which case we could use that. 76:34 But then in this case, we could actually also 76:36 have used Slutsky's theorem by doing plug-in, right? 76:39 So here this is my p times 1 minus p that's replaced by p hat times 1 76:41 minus p hat. 76:43 And Slutsky justifies it-- we did that 76:45 without really thinking last time. 76:46 But Slutsky actually justifies the fact 76:48 that this is valid, and still allows me to use 76:51 this q alpha over 2 here. 76:52 76:56 All right, so that's the end of this lecture. 76:58 Tonight I will post the next set of slides, chapter two. 77:01 And, well, hopefully the video. 77:04 I'm not sure when it's going to come out. 77:06