Transcript 00:00 All right, so let's get started. 00:02 So today, we're gonna talk about what are probably 00:06 the two most famous theorems in the entire history of probability. 00:11 They're called the law of large numbers and the central limit theorem. 00:15 They're closely related, so it makes sense to do them together, kind of compare and 00:20 contrast them. 00:22 I can't think of a more famous probability theorem than these two. 00:27 So the setup for today is that we have i.i.d. random variables. 00:34 Let's just call them X1, X2, and so on, i.i.d. 00:39 Since they're i.i.d., they have the same mean and variance, 00:44 if the mean and variance exist, but we'll assume they do. 00:49 So the mean, we'll just call it mu. 00:50 And the variance, sigma squared. 00:56 So we're assuming that these are finite for now. 00:59 The mean and variance exist. 01:00 And both of these theorems tell us 01:05 what happens to the sample mean as n gets large. 01:09 So, the sample mean is just defined as Xn bar, the average (X1 + ... + Xn)/n. 01:14 Standard notation in statistics is to put a bar to denote averages, and 01:17 that's just the average of the first n. 01:21 So take the first n random variables and average them; 01:24 that's just called the sample mean. 01:32 So the question is, what can we say about Xn bar as n gets large? 01:37 So the way we would interpret this or use this is, we get to observe 01:44 these X's. They're random variables, but after we observe them they become data. 01:47 We're never going to have an infinite amount of data, so 01:51 at some point we stop it at n. 01:52 We can think of that as the sample size, and hopefully we get a large sample size. 01:57 Of course, it depends on the problem. 01:58 In some problems, you may not be able to get large n. 02:01 Well, we assume n is large, and 02:03 just take the average; the question is, what can we say? 02:07 All right, so first, here's what the law of large numbers says. 02:16 It's a very simple statement. 02:19 And hopefully pretty intuitive, too. 02:22 The law of large numbers says that Xn bar 02:27 converges to mu, as n goes to infinity, 02:36 with probability 1. 02:39 That's the fine print, probability 1. 02:44 With probability 0, something really crazy could happen. 02:47 But we don't worry too much about it, because it has probability 0. 02:50 With probability 1, this is the sample mean, and 02:54 it says that the sample mean converges to the true mean. 03:03 So, that is a pretty nice, intuitive, easy to remember result. 03:11 By true, I mean the theoretical mean. 03:14 That is, the expected value of Xj for any j is the true expected value. 03:21 Whereas this is a random variable. 03:24 Right? We're taking an average of 03:25 random variables. 03:25 That's a random variable. 03:26 So this is just a constant, but this is a random variable. 03:30 But it's gonna converge, and I should say a little bit more about 03:35 what this convergence statement actually means. 03:39 You've all seen limits of sequences, but when we are talking about limits of random 03:43 variables we have to be a little more careful. 03:45 How do we actually define this? 03:47 The definition of this statement is just pointwise, which means, 03:54 remember Xn bar is a random variable. 03:56 A random variable, mathematically speaking, is a function. 03:58 So if you evaluate this at some 04:02 specific outcome of the experiment, then you'll get a sequence of numbers.
04:06 That is, if you actually observe the values, this kind of crystallizes into 04:11 numbers when you evaluate it at the outcome of the experiment. 04:14 And so those numbers converge to mu. 04:20 In other words, this is an event. 04:23 Either these random variables converge or they don't. 04:27 And we say that event has probability 1. 04:31 That's what the statement of the theorem is. 04:34 So to just give a simple example, 04:41 let's think about what happens if we have Bernoulli p. 04:45 So if Xj is Bernoulli p, then intuitively we're 04:50 just imagining an infinite sequence of coin tosses, 04:56 where the probability of heads is p, and 05:00 then this says that if we add up all of these Bernoullis up to n, 05:06 that is, in the first n coin flips, how many times did the coin land heads, 05:13 divided by the number of flips, that should converge to p with probability 1. 05:25 So for example, this is a very intuitive statement. 05:28 If it's a fair coin and you flip the coin a million times, well, 05:33 you're not really expecting that it will be exactly 500,000 heads and 500,000 tails. 05:39 But you do think that, in the long run, it should be the case 05:44 that it's going to be essentially half heads, half tails. 05:48 Not exactly, but essentially. 05:50 And the proportion should get closer and closer to the true value. 05:56 This qualification "with probability 1" is needed because, mathematically speaking, 05:59 even if you have a fair coin, there's nothing in the math that says 06:04 it's impossible that the coin would land heads, heads, heads, heads, heads forever. 06:09 You know that that's never actually gonna happen in reality. 06:13 It's just not gonna happen. 06:15 It's a fair coin. 06:16 It might land heads, heads, heads for a time if you're very lucky or 06:20 unlucky or whatever. 06:21 But it's not gonna be heads, heads, heads forever. 06:26 But there's nothing in the math that says that's an invalid sequence. 06:31 So there are some weird pathological cases like that. 06:35 But with probability one, we get what we expect. 06:39 If we didn't have this result, how would we ever even estimate p? 06:45 You might imagine, if you didn't know what p was, 06:48 kind of the obvious thing to do is flip the coin a lot of times and 06:51 take the proportion of heads and use that as your approximation for p. 06:54 But what justification could you have for 06:57 doing that approximation if you didn't have this? 07:00 So this is a very, very necessary result. 07:06 But I guess to comment a little bit more about what it actually says for 07:09 the coin, because this is kind of related to the gambler's fallacy, and 07:13 things like that. 07:15 The gambler's fallacy is the idea that, let's say you're gambling and 07:19 you lose like ten times in a row, and then it's the feeling that you're due to win. 07:27 You lost all these times, and you might try to justify that using the law 07:32 of large numbers and say, well, you know, let's say 07:36 heads you win money, tails you lose money, and you just lost money ten times in a row. 07:41 But the law of large numbers says, in the long run, 07:44 it's gonna go back to one-half if it's fair. 07:47 So somehow you need to start winning a lot to compensate. 07:51 That's not the way it works. 07:54 The coin is memoryless. 07:56 The coin does not care how many failures or how many losses you had before. 08:00 So the way it works is not that, if you're unlucky at the beginning, 08:04 it somehow gets offset later by an increase in heads.
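Here is a minimal simulation sketch of the coin example above: flip a Bernoulli(p) coin many times and watch the running proportion of heads settle toward p. The particular p, number of flips, and seed are arbitrary choices for illustration, not from the lecture.

```python
import numpy as np

# A minimal sketch of the coin example: flip a Bernoulli(p) coin n times and
# track the running proportion of heads. The values of p, n, and the seed are
# arbitrary choices, not from the lecture.
rng = np.random.default_rng(seed=0)
p = 0.5
n = 1_000_000

flips = rng.binomial(1, p, size=n)                     # X_1, ..., X_n i.i.d. Bernoulli(p)
running_mean = np.cumsum(flips) / np.arange(1, n + 1)  # Xbar_1, Xbar_2, ..., Xbar_n

for k in (100, 10_000, 1_000_000):
    print(f"Xbar after {k:>9,} flips: {running_mean[k - 1]:.5f}")
# The printed proportions drift toward p = 0.5, even though nothing forces the
# raw counts of heads and tails to balance out exactly.
```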
08:09 The way the law of large numbers works is through what we might call swamping. 08:14 Let's say the coin landed tails 100 times in a row. 08:19 It doesn't mean that the probability has changed for the 101st flip. 08:24 What it means, though, is that we're letting n go to infinity here, okay? 08:29 So no matter how unlucky you were in the first 100 or 08:32 the first million trials, that's nothing compared to infinity, right? 08:37 So those first million just get swamped out by the entire infinite future, 08:42 so that's what's going on here. 08:50 Yeah, so to tell you one little story about the law 08:55 of large numbers, a colleague of mine told me this story. 09:01 He had a student once who said he hated statistics. 09:06 And of course, my colleague was very shocked, 09:08 like how can anyone hate statistics? 09:11 And so he asked, why? 09:12 How is it possible that you hate statistics? 09:15 And the student, who was an athlete, was training every day, and 09:19 he had just learned the law of large numbers. 09:22 And he was very, very depressed by this, because he said, the law of large numbers 09:26 says in the long run, I'm gonna only be average and I can't improve. 09:30 So well, of course, the fallacy there is that we assumed i.i.d. right now. 09:38 Now there are generalizations of this theorem beyond i.i.d., but 09:41 we can't just get rid of i.i.d. 09:43 So the i.i.d. is saying that the distribution is not changing with time. 09:48 That doesn't mean that you can't actually improve; if your own distribution changes, then it 09:53 would not be i.i.d. 09:54 So don't be depressed by this, and in fact this theorem, 09:59 I think, is crucial in order for science to actually be possible. 10:05 Because imagine a kind of hypothetical, 10:08 counterfactual world where this theorem was actually false. 10:13 That would make it really depressing to try to ever learn about the world, right? 10:18 Cuz this is saying, you're collecting more and more data. 10:21 You're letting your sample size go to infinity. 10:24 And this says, you converge to the truth, right? 10:28 And it would be some weird setting, where you get more and more data, and more and 10:31 more data, and yet you're not able to converge to the truth, right? 10:35 So that would be really bad. 10:36 So this is very intuitive, very important. 10:40 Okay, so let's prove this, or at least a similar version. 10:47 So this is actually sometimes called the strong law of large numbers. 10:53 And we're actually gonna prove what's sometimes called 10:56 the weak law of large numbers. 10:57 I don't really like the terminology strong and weak here, but 11:02 that's kind of standard. 11:04 The strong law of large numbers is what I just said, 11:07 where it's converging pointwise with probability 1. 11:13 That is, these random variables converge to this constant, 11:19 except on some bad event that has probability 0. 11:23 The weak law of large numbers says that for 11:26 any c greater than 0, 11:32 the probability that Xn bar is more than c away from the mean, 11:37 that is, P(|Xn bar - mu| > c), goes to 0. 11:42 So it's a very similar looking statement. 11:47 It's not exactly equivalent. 11:49 It's possible to show, you have to go through some real analysis for 11:53 this that is not necessary for our purposes, 11:55 but it turns out that 11:57 once you've proven the strong law, it implies this form of convergence. 12:01 This is called convergence in probability, but 12:06 the intuition is very similar.
12:09 So just to interpret this statement in words, we can choose, 12:14 we should interpret c as being some small number. 12:17 So let's say we choose c to be 0.001, okay? 12:21 And then it says that this thing goes to 0, 12:26 as n goes to infinity again. 12:29 So this says that if n is large enough, then 12:34 it's extremely unlikely that these are more than 0.001 apart. 12:38 In other words, if n is large, 12:41 it's extremely likely that this is extremely close to this, right? 12:46 So it's a very similar statement: if n is large, 12:49 it's extremely likely that the sample mean is very close to the true mean. 12:54 Okay, so that's what it says. 12:55 So we'll prove this one, 12:58 because to prove that one takes a lot of work and a lot of time. 13:03 This one, it looks like a nice-looking theorem. 13:06 And it is a nice theorem, but 13:07 we can prove it very easily using Chebyshev's inequality. 13:15 Okay, so let's prove the weak law of large numbers. 13:23 So all we need to do is show that this goes to 0, right? 13:26 That's what the statement is. 13:28 So let's just bound it; this looks pretty similar to what we were doing last 13:32 time, where we did Markov's inequality, Chebyshev's inequality. 13:36 This looks similar to that kind of stuff from last time, 13:39 which is why I did that, well, one reason for doing that last time. 13:42 We need the inequalities anyway, but they're especially useful here. 13:46 So we just need to show this thing goes to 0: 13:48 P(|Xn bar - mu| > c) goes to 0. 13:55 By Chebyshev's inequality, this is less than or 13:59 equal to the variance of Xn bar divided by c squared; 14:03 that's just exactly Chebyshev from last time. 14:09 Now we just need the variance of Xn bar. The variance of Xn bar, 14:15 well, just stare at the definition of Xn bar for a second. 14:18 There's a 1 over n in front, and that comes out as 1 over n squared. 14:25 And then since I'm assuming they're i.i.d., and therefore independent, 14:28 the variance of the sum is just n times the variance of one term. 14:32 So that's 1 over n squared, times n sigma squared, divided by c squared, 14:36 which is sigma squared over n c squared. 14:41 Sigma is a constant, c is a constant, n goes to infinity, so this goes to 0. 14:48 So that proved the weak law of large numbers, just a one-line thing. 14:59 Okay, so that tells us what happens pointwise when we average a bunch 15:06 of i.i.d. random variables: it converges to the mean. 15:11 So let me just rewrite that statement. 15:14 Then we'll write the central limit theorem and kind of compare them. 15:17 So another way to write what we just showed 15:22 is that Xn bar minus mu goes to 0 as n goes to 15:27 infinity, which is a good thing to know. 15:33 However, it doesn't tell us what the distribution of Xn bar looks like. 15:40 So this is true with probability one, but what is the distribution? 15:52 What does the distribution of Xn bar look like? 16:00 So this says it's getting closer; Xn bar is getting closer and 16:05 closer to this constant mu. 16:07 Okay, but that's not really telling us the shape, and 16:10 it's not really telling us the rate. 16:12 This goes to 0, but at what rate? 16:16 So one way to think about problems like that, when you have something going to 0, 16:22 and you wanna study how fast it goes to 0: 16:27 one might, not just here, but 16:30 just as a general approach to that kind of problem, say, 16:33 we know this goes to 0, but we don't know how fast.
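Before moving on to the rate question, here is a minimal numerical sketch of the weak law and the bound just derived: it compares the simulated frequency of |Xbar_n - mu| > c with the Chebyshev bound sigma^2 / (n c^2). The Exponential(1) distribution, the cutoff c, and the sample sizes are arbitrary choices for illustration, not from the lecture.

```python
import numpy as np

# A minimal sketch of the weak law: compare the simulated frequency of
# |Xbar_n - mu| > c with the Chebyshev bound sigma^2 / (n c^2) used in the
# proof. Exponential(1) has mu = 1 and sigma^2 = 1; the choices of
# distribution, c, and the n values are arbitrary.
rng = np.random.default_rng(seed=1)
mu, sigma2, c, n_trials = 1.0, 1.0, 0.1, 1_000

for n in (100, 1_000, 10_000):
    xbar = rng.exponential(scale=1.0, size=(n_trials, n)).mean(axis=1)
    empirical = np.mean(np.abs(xbar - mu) > c)
    bound = sigma2 / (n * c**2)
    print(f"n = {n:>6}: simulated P(|Xbar - mu| > c) = {empirical:.3f}, "
          f"Chebyshev bound = {bound:.3f}")
# Both columns shrink toward 0 as n grows; the bound is loose, but it is all
# the proof of the weak law needs.
```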
16:37 So one way to study how fast this goes to 0 would be to multiply it by something that goes to infinity, right? 16:42 Now, if we multiply it by something that goes to infinity, 16:47 such that this times this goes to infinity, 16:50 then we know that the part that blows up is dominating over this part. 16:55 And if we multiply by something that goes to infinity, but 16:58 this whole thing still goes to 0, then that's more informative, right? 17:02 So what's gonna happen is that we can imagine multiplying here by 17:08 n to some power, and we're gonna show that there's a power here, 17:12 n to some power, fill in the blank. 17:15 What we're gonna show is that 17:18 if the power here is above some threshold, n to a big power 17:24 goes to infinity fast, and this thing will just blow up. 17:29 And if we put a smaller power than the threshold here, then this is still going 17:34 to infinity, as long as it's a positive power of n; this part is still going to 17:39 infinity, this part's going to 0, but this part's dominating, right? 17:43 So this term is competing with this term. 17:46 This one goes to infinity, this one goes to 0, okay? 17:49 So then the question is, what's that magic threshold value? 17:53 And the answer is one-half. 17:57 So that's what we're gonna study right now. 17:58 So we're gonna take the square root of n times Xn bar minus mu. 18:04 This is kind of the happy medium, 18:06 where we're gonna get a non-degenerate distribution; this is gonna converge 18:11 in distribution to an actual distribution. It's not gonna just get killed to 0 or 18:16 blow up to infinity, it's actually gonna give us a nice distribution. 18:22 Okay, and I'm also gonna divide by the sigma here, makes it a little bit cleaner. 18:28 So this is the central limit theorem now. 18:31 I'm stating it, then we'll prove it. 18:37 The central limit theorem says, if you take this and 18:40 look at what happens as n goes to infinity, 18:47 it converges to standard normal in distribution. 18:55 By convergence in distribution, what we mean is that 19:00 the distribution of this converges to the standard normal distribution. 19:06 In other words, you could take the CDF. 19:09 I mean, these may be discrete or continuous or a mixture of discrete and continuous. 19:14 So it doesn't necessarily have a PDF, but every random variable has a CDF. 19:20 So it says if you take the CDF of this, 19:22 it's gonna converge to capital Phi, the standard normal CDF. 19:27 So I think this is kind of an amazing result, that this holds in such generality, 19:33 right? Because I mean the standard normal is just this 19:38 one particular distribution, it's a nice looking bell curve, but it's just one distribution. 19:44 And those X's, they could be discrete, they could be continuous, 19:48 they could be extremely nasty looking distributions, right? 19:52 They could look like anything; 19:54 the only thing we assumed was that there was a finite variance. 19:59 Other than that, they could have an incredibly complicated, 20:03 messy distribution. 20:06 But it's always gonna go to standard normal.
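Here is a minimal simulation sketch of the statement just given: start from a decidedly non-normal distribution, form sqrt(n) * (Xbar_n - mu) / sigma, and compare a few values of its simulated CDF with capital Phi. The Exponential(1) distribution, n, the number of repetitions, and the cutoffs are arbitrary choices for illustration, not from the lecture.

```python
import numpy as np
from scipy.stats import norm

# A minimal sketch of the CLT: standardize the sample mean of Exponential(1)
# data (mu = sigma = 1) and compare its simulated CDF with Phi. The choices of
# distribution, n, number of repetitions, and cutoffs are arbitrary.
rng = np.random.default_rng(seed=2)
mu, sigma, n, n_reps = 1.0, 1.0, 1_000, 10_000

xbar = rng.exponential(scale=1.0, size=(n_reps, n)).mean(axis=1)
z = np.sqrt(n) * (xbar - mu) / sigma      # the standardized sample mean

for t in (-2.0, -1.0, 0.0, 1.0, 2.0):
    print(f"P(Z <= {t:+.0f}): simulated {np.mean(z <= t):.3f}, "
          f"Phi({t:+.0f}) = {norm.cdf(t):.3f}")
# The simulated CDF values line up closely with Phi(t), even though the
# underlying data are exponential rather than normal.
```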
20:09 So this is one of the reasons why the standard normal distribution is so 20:14 important on the one hand, and so widely used. Because this is a theorem about 20:19 what happens as n goes to infinity, but the way it's used in practice is that 20:24 people use normal approximations all the time, and a lot of the justification for 20:30 normal approximations is coming from this, because this says that if n is large, 20:36 then the sample mean will approximately have a normal distribution. 20:44 Even if the original data did not look like they came from a normal distribution, 20:50 when you average lots and lots of them, it looks normal, okay? 20:55 So this in a sense is a better theorem than the law of large numbers, 20:59 because it's kind of more informative to know the distribution, 21:03 know something about the rate. And you know, it's interesting that 21:07 square root of n is kind of the power of n that's just right, right? 21:11 A larger power, it's gonna blow up; a smaller power, it's gonna go to 0. 21:15 n to the one-half is the compromise, and then you always get a normal distribution. 21:20 It's more informative in some sense, but 21:22 you should also keep in mind, it is a different sense of convergence. 21:27 Up here, we're talking about the random variables actually converging, 21:32 literally the random variables converge: the sample mean converges 21:36 pointwise, with probability 1, to the true mean. 21:41 Here, we're talking about convergence in distribution. 21:43 So we're not talking about convergence of random variables. 21:47 We're just saying the distribution of this converges to the Normal(0, 1) distribution. 21:52 So that's a different sense of convergence, but anyway, 21:57 both of them are telling us what's gonna happen to Xn bar when n is large, okay? 22:04 So well, let's prove this theorem. 22:07 Here's another way to write this, by the way; 22:11 it's good to be familiar with both ways. 22:15 It's just algebra to go from one to the other, but 22:18 they're both useful enough to be worth mentioning. 22:21 Let's just write the central limit theorem in terms of the sum of the X's 22:26 rather than in terms of the sample mean. 22:29 So I'm just gonna take the sum of Xj, j equals 1 to n. 22:34 And so, we can either think of the central limit theorem as 22:38 telling us what happens to the sample mean, or we 22:41 can think of it as telling us what happens to the sum, or the convolution, okay? 22:46 It's equivalent; 22:50 we just have to be careful not to mess up the factor of n, 22:53 but we can go from one to the other cuz it's just a factor of n. 22:57 So the claim is that this is approximately normal when n is large, 23:02 but if we just have this thing, this could easily just blow up. 23:08 You're just adding more and more terms. 23:10 But somehow we wanna standardize this first. 23:15 So if we take this thing, well, this thing has mean 23:20 n mu, right? So let's subtract n mu, 23:26 because then it has zero mean, because I just want to match. 23:30 I wanna make the mean 0 and the variance 1, so 23:32 that it kind of matches up with that, rather than just letting it blow up. 23:37 So this is called centering; by linearity, 23:41 the mean is n mu, so just subtract n mu. 23:43 And then let's divide by the standard deviation; 23:47 this is just how we standardized before. 23:50 So over there we showed that the variance of Xn bar is sigma squared over n.
23:57 And the variance of this sum is just n sigma squared. 24:02 So let's just divide by the standard deviation, right, 24:07 which is square root of n times sigma, okay? 24:12 Cuz the variance is n sigma squared. 24:15 So that's just the standardized version. 24:17 And the statement is again that this converges to the standard normal 24:22 in distribution. 24:23 So if we take this sum and standardize it, then it's gonna go standard normal. 24:33 Okay, so, all right, so now we're ready to prove this theorem. 24:41 And it's sort of just a calculation, but it's kind of a nice 24:46 calculation in some ways. So we're gonna prove it, well, 24:53 this theorem is always true as long as the variance exists. 24:57 We don't need to assume that the third moment or the fourth moment exists. 25:01 But the proof is much more complicated to do in that generality. 25:05 So we're gonna assume that the MGF exists; then we can actually work with the MGFs. 25:11 Because when you see this thing, a sum of independent random variables, 25:15 then we know the MGF is gonna be something useful, if it exists. 25:20 And there are ways to extend this proof to cases where the MGF doesn't exist. 25:23 But for our purposes, we may as well just assume the MGF exists. 25:30 So assuming the MGF exists, let's call it M(t), 25:38 the MGF of Xj. They're i.i.d., so if one of them has an MGF, they all have the same MGF. 25:44 We'll just assume that that exists. 25:54 Once we have MGFs, then our strategy is to show that the MGFs converge. 26:01 So that's a theorem about MGFs: if the MGFs converge to some other MGF, 26:07 then the random variables converge in distribution, right? 26:12 We had a homework problem related to that, where you found that the MGFs converged 26:18 to some MGF, and that implies convergence of the distributions, right? 26:22 Okay, so that's the whole strategy. 26:24 So that means all we need to do is find the MGF of this and 26:28 then take the limit, okay? 26:30 So basically at this point, it's just like, write down the MGF, 26:35 take the limit, and use a few facts about MGFs, okay? 26:40 So first of all, we can assume, 26:50 let's just assume, mu = 0 and 26:54 sigma = 1, just to simplify the notation. 27:00 This is without loss of generality, 27:04 because, well, all we have to do is consider this: 27:10 I wrote the standardized thing this way, but 27:14 I could've just written it as standardizing each X separately. 27:19 I could've written Xj minus mu, over sigma. 27:24 So this would be standardizing each of them separately, j = 1 to n, and 27:29 then we have a 1 over root n. 27:34 That will be the same thing that we're looking at. 27:36 This just says standardize them separately first. 27:39 But then you could just, I mean if you want, just call this thing Yj. 27:43 And once you have the central limit theorem for Yj, then you know that that's true. 27:47 So you might as well just assume that they've already been standardized. 27:51 And so just to have some notation, let's just let Sn equal the sum, 27:57 S for sum, of the first n terms. 28:00 And what we wanna show is that the MGF 28:04 of Sn over root n, that's what we're looking at, right? 28:08 We let mu equal zero, sigma equal one, so we're looking at Sn over root n. 28:12 And we wanna show that that goes to the standard normal MGF. 28:22 Right, so we just need to find this MGF, take a limit. 28:27 Okay, so let's just find the MGF. 28:30 So by definition, that's the expected value of e to the t times Sn over root n. 28:42 And Sn is just the sum.
28:44 So, and we're assuming independence, which means that you can 28:50 write this as e to the t X1 over root n, times e to the t X2 over root n, blah, blah, blah. 28:56 All of those factors are independent, therefore they're uncorrelated. 29:02 So we can just split it up as a product: expected value of e to the t X1 over root n, 29:09 blah, blah, blah, same thing, 29:13 e to the t Xj over root n is the general term, right? 29:18 I'm just using the fact that those are uncorrelated, so 29:23 we can write the expectation of the product as the product of the expectations. 29:28 But since these X's are i.i.d., 29:30 these are really just the same thing written n times. 29:33 So really, this is just this thing to the nth power. 29:40 And this thing, that should remind you of an MGF, right? 29:44 That's just the MGF of X1, 29:46 except that instead of being evaluated at t, it's evaluated at t over root n. 29:51 So really, that's just the MGF 29:54 evaluated at t over root n, raised to the nth power. 30:00 So that's what we have. 30:04 Now we need to take the limit as n goes to infinity. 30:06 So let's just look at what's gonna happen here: n is going to infinity. 30:11 This thing on the inside becomes M of 0. 30:16 M of 0 is 1 for any MGF, right? 30:19 Cuz e to the 0 is 1. 30:21 So this is of the form 1 to the infinity, which is an indeterminate form, right? 30:28 It could evaluate to anything. 30:31 So going back to calculus, how do you deal with 1 to the infinity, 30:35 or 0 over 0, or whatever? 30:37 Usually we try to reduce it to something where we can use L'Hopital's rule for 30:41 those problems, right? 30:42 Or we can use a Taylor series type of thing. 30:45 So, how do we get into that form? 30:51 Take the log, because this looks like 1 to the infinity. 30:56 If we take the log, it'll look like infinity times log of 1. 31:00 So it'll look like infinity times 0. Take logs. 31:04 Then we just have to remember to exponentiate at the end to undo the log. 31:10 Okay, so let's write down then what we have. 31:18 After taking the log, and we're trying to do a limit, so 31:22 we're doing the limit as n goes to infinity, and we take the log. 31:26 It's n log M(t 31:31 over root n). 31:36 So that's of the form infinity times 0. 31:41 If we want 0 over 0 or infinity over infinity, 31:44 we can just write it as 1 over n in the denominator. 31:54 Okay, and now it's of the form 0 over 0. 31:57 So we can almost use L'Hopital's rule, but not quite. 32:00 We have to be a little bit careful. 32:02 Because first of all, I'm assuming n is an integer, 32:05 and you can't do calculus on integers. 32:10 Secondly, even if we pretended that n is a real number, 32:14 the derivative of 1 over n would be minus 1 over n squared, and 32:18 that's kind of annoying to deal with. 32:20 And it's kind of annoying to deal with this square root here. 32:23 So let's first make a change of variables. 32:26 Let's just let y = 1 over root n, and also let y be real, 32:37 not necessarily of the form 1 over the square root of an integer, okay? 32:42 So it's the same limit, just written in terms of y instead of in terms of n. 32:48 So as n goes to infinity, y goes to 0, and 1 over n is y squared, 32:54 so the denominator is just y squared. 32:58 The reason I do it this way is that 1 over root n is just y 33:02 by definition, but then the numerator is just log M of yt. 33:07 That's a lot easier to deal with because we got rid of the square roots. 33:13 So it's still of the form 0 over 0. 33:16 So we're gonna use L'Hopital's rule. 33:21 So limit, y goes to 0.
33:23 Take the derivative of the numerator and the denominator separately. 33:28 The derivative of the denominator is 2y. 33:31 The derivative of the numerator, 33:32 well, we're just going to have to use the chain rule. 33:35 The derivative of log of something is 1 over that thing. 33:39 So that's 1 over M of yt, times the derivative of that thing, which again 33:44 by the chain rule is M prime of yt times the derivative of yt. 33:50 We're treating t as a constant; we're differentiating with respect to y. 33:55 So t comes out. 34:00 And now let's see what we have. 34:02 Let's just summarize a couple facts about MGFs. 34:08 So M of t is the expected value of e to the t X1. 34:14 So M of 0 = 1, okay? 34:22 And when we first started doing MGFs, we said that if we take derivatives of the MGF 34:26 and evaluate them at 0, 34:28 we get the moments; that is why it's called the moment generating function. 34:31 So the first derivative at 0 is the mean, but we assumed that mu is 0. 34:36 So this is 0, here. 34:38 And the second derivative, while we're doing this: 34:42 the second derivative at 0 is the second moment, but since we assumed that the variance is 34:46 1 and the mean is 0, the second moment is 1, okay? 34:51 So over here, as we let y go to 0, the denominator's still going to 0. 34:56 The numerator's also going to 0, because M prime of 0 is 0, 35:01 so it's still of the form 0 over 0, so let's just do the same thing again. 35:08 So first I can simplify it a little bit: this t can come out, 35:13 because that's acting as a constant, and the 2 can come out. 35:17 And limit y goes to 0, and this M of yt part, 35:24 that's just going to 1. 35:28 So we can write that as part of a separate limit, but 35:31 that other limit is just going to 1. 35:34 You can think of it as just the limit of this part times 35:36 the limit of the rest of it. 35:37 But that part's just going to 1, so we can get rid of that. 35:41 So really, what's left is just 35:47 the limit of M prime of yt divided by y. 35:53 Everything else is gone, so it's actually pretty nicely simplified. 35:59 Now, using L'Hopital's rule a second time, 36:02 the derivative of the denominator is just 1, okay? 36:07 And for the numerator, chain rule, M double prime of yt. 36:16 That was a t, not a t squared, but now it's a t squared, 36:19 because by the chain rule, the derivative of yt is t, so we have a t squared over 2. 36:25 Now when we let y go to 0, M double prime of 0 is 1, so 36:29 now this limit is just 1. 36:31 So we get t squared over 2, 36:35 and that's what we wanted, 36:39 because t squared over 2 is the log 36:45 of e to the t squared over 2, and 36:48 e to the t squared over 2 is exactly the Normal(0, 1) MGF. 37:02 Okay so, 37:03 that's the end of the proof of the central limit theorem. 37:08 All we had to do was use basic facts about MGFs and L'Hopital's rule twice. 37:14 And there we have one of the most famous, important theorems in statistics. 37:20 Now, there are more general versions of this; 37:23 you can extend this in various ways where it's not i.i.d., 37:28 but it still has to satisfy some assumptions, right. 37:32 But anyway, this is the basic central limit theorem.
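For reference, here is the argument just carried out at the board, collected into one LaTeX display: the statement in both forms, and then the MGF limit under the standing assumptions that the MGF exists and mu = 0, sigma = 1. This is only a transcription of the steps above, nothing new.

```latex
% Statement in both forms (general mu, sigma):
\[
\frac{\sqrt{n}\,\bigl(\bar{X}_n-\mu\bigr)}{\sigma}
  \;=\;\frac{X_1+\cdots+X_n-n\mu}{\sqrt{n}\,\sigma}
  \;\longrightarrow\;\mathcal{N}(0,1)\quad\text{in distribution.}
\]
% MGF calculation with mu = 0, sigma^2 = 1, and S_n = X_1 + ... + X_n:
\[
E\bigl[e^{tS_n/\sqrt{n}}\bigr]
  =\prod_{j=1}^{n}E\bigl[e^{tX_j/\sqrt{n}}\bigr]
  =M\!\left(\tfrac{t}{\sqrt{n}}\right)^{\!n},
\]
\[
\lim_{n\to\infty} n\log M\!\left(\tfrac{t}{\sqrt{n}}\right)
  =\lim_{y\to 0}\frac{\log M(yt)}{y^{2}}
  =\lim_{y\to 0}\frac{t\,M'(yt)}{2y\,M(yt)}
  =\frac{t}{2}\lim_{y\to 0}\frac{M'(yt)}{y}
  =\frac{t^{2}}{2}\lim_{y\to 0}M''(yt)
  =\frac{t^{2}}{2},
\]
% using M(0) = 1, M'(0) = mu = 0, M''(0) = E[X_1^2] = 1, so the MGF of
% S_n / sqrt(n) converges to e^{t^2/2}, the N(0,1) MGF.
```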
37:36 Okay, so that's pretty good. 37:40 Let's do an example, like how do we actually use this, 37:45 for the sake of approximations, things like that. 37:50 Last time I was talking about the difference between inequalities and 37:53 approximations, right? 37:54 And we talked about Poisson approximation before. 37:57 We haven't really talked about normal approximation. 38:01 This result is giving us the ability to use normal approximations 38:06 when we're studying the sample mean and n is large, okay? 38:11 So historically, though, the first version of 38:16 the central limit theorem that was ever proven, 38:21 I think, was for binomials, okay? 38:24 So what we're saying is that 38:28 a Binomial(n, p), under some conditions, will be approximately normal. 38:33 And well, in the old days that was an incredibly important fact, because 38:38 they didn't have computers; for binomials you have to deal with 38:42 n choose k, and n is large, and you have all these factorials. 38:47 You can't do these things by hand. 38:49 Now we have fast computers, so it's a little bit better. 38:53 But it's still a lot easier working with normal distributions than 38:57 binomial distributions most of the time, right? 39:00 And even now, factorials still grow so fast that even with 39:05 a fast computer with large memory and everything, you may quickly 39:09 exceed its ability when you're doing some big, complicated binomial problem. 39:13 And normals have a lot of nice properties, as we've seen, okay? 39:18 The question is, when can we approximate a binomial 39:24 using a normal, and how do we do that, okay? 39:29 So this is just the binomial approximation 39:34 to the normal, or, the other way around: 39:38 I'll say the binomial approximated by the normal, 39:41 the normal approximation to the binomial. 39:48 When is that valid? 39:52 To contrast it with the Poisson approximation 39:55 that we've seen before, okay? 39:58 So, let X be Binomial(n, p). 40:05 And as we've done many times before, 40:10 we can represent X as a sum of i.i.d. Bernoullis, 40:18 right? Well, these are just 1 if success on the jth 40:23 trial, 0 otherwise, so the Xj are i.i.d. Bernoulli(p). 40:33 So this does fit into the framework of the central limit 40:37 theorem; that is, we are adding up i.i.d. random variables. 40:40 So the central limit theorem says that if n is large, this will be 40:45 approximately normal, at least after we have standardized it, okay? 40:50 So suppose we wanted to approximate, suppose we're 40:55 interested in the probability that X is between a and b. 41:04 And I want to approximate that; 41:07 first we'll write an equality, then we'll approximate it. 41:10 So, I mean, if you had to do this on a computer, or by hand, 41:15 which you wouldn't want to, what you would do would be to take the PMF and 41:19 sum up all the values of the PMF from a to b, right? 41:23 So okay, you would not want to do that by hand most of the time. 41:28 But suppose we just want an approximation for this, not the exact thing. 41:32 So first, the strategy is just gonna be to take X and standardize it first. 41:38 So we're gonna subtract the mean, so we know that the mean is np, 41:44 and we're gonna divide by the standard deviation, 41:48 which we know is the square root of npq, where q is 1 minus p. 41:53 So, I'm just standardizing it right now. 41:55 So this is still an equality, we haven't done any approximations yet. 42:02 And then, now that we've standardized it, 42:05 we can apply the central limit theorem, if n is large enough, right? 42:10 The central limit theorem said n goes to infinity; 42:12 that doesn't answer the question of how large n has to be. 42:16 And for that, there are various theorems and various rules of thumb. 42:20 A lot of books will say, how large does n have to be?
42:23 And some books at least will say 30, and that's just a rule of thumb. 42:30 That's not always gonna work; there are separate rules of thumb for 42:36 the binomial, like you want n times p to be reasonably large and 42:41 n times 1 minus p to be large. There are different rules of thumb. 42:46 But anyway, if n is large enough, 42:49 then what we've just proven is that this is gonna look like it has 42:53 a normal distribution, because that's a sum of i.i.d. things. 42:57 And we standardized it correctly, because we already knew the mean and the variance, 43:02 so we just standardized it. 43:03 Okay, so this is approximately... 43:08 Now we're going to use the normal approximation; 43:10 we're going to say this is approximately normal. 43:13 And if I want the probability that the normal is between something and 43:17 something, that's just the CDF here minus the CDF here, right? 43:22 Because for the normal, I mean, this is discrete but we're approximating it 43:27 using something continuous, and we just say, integrate the PDF from here to here. 43:34 But by the fundamental theorem of calculus, that just says take the CDF difference, okay? 43:37 So we're just gonna do Phi of b minus 43:43 np over square root of npq, minus Phi 43:48 of a minus np over square root of npq. 43:53 So that would be the basic normal approximation; 43:57 I'll talk a little bit about how to improve this approximation. 44:02 But to contrast it with the Poisson approximation: 44:11 we talked before about the fact, and we proved the fact, 44:16 that if n goes to infinity, and p goes to 0, and n times p is fixed, 44:22 then the binomial distribution converges to the Poisson distribution; 44:26 we proved that before. 44:28 So in the Poisson approximation, 44:31 what we had was n is large, but p was very small, right? 44:38 And we let lambda equal np, and x is moderate. 44:46 And the most important thing is that p is small here, p is close to 0. 44:51 We proved it in the case where this goes to infinity and this goes to 0, okay? 44:55 So Poisson is relevant when we're dealing with a large number of very 45:00 rare, unlikely things. 45:02 That's really in contrast to this, 45:06 in this case, for the normal approximation. 45:10 Then, while we still want n to be large, 45:14 if you kind of think intuitively about when this is gonna work well, 45:19 we actually want p to be close to one half. 45:25 Because think about the symmetry: if you have a binomial with p equals one half, 45:30 that's a symmetric distribution. 45:33 The normal is symmetric; every normal distribution is symmetric. 45:39 If p is far from one half, then the binomial is very, very skewed, and in that 45:44 case it kind of doesn't make that much sense to approximate it using a normal. 45:52 So the normal approximation is gonna work well 45:58 when p is close to one half; if p is very small, the Poisson makes a lot more sense than the normal. 46:04 However, think about the statement of the central limit theorem. 46:08 In that theorem I never said p was close to one half; 46:12 in fact, that was just a general theorem, we didn't even have p in the statement 46:17 of the central limit theorem, but somehow this still has to eventually work. 46:22 But as a practical matter, as an approximation, 46:26 if p is close to one half this is going to work quite well: 46:30 if n is like 30 or 50 or 100, it will work fine. 46:34 But if p is .001, the central limit theorem is still true, 46:38 that as n goes to infinity it's gonna work, okay. 46:41 But if n is kind of not that enormous a number, 46:46 then it's gonna be a pretty bad approximation.
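Here is a minimal numerical sketch of the approximation just written down: compare the exact P(a <= X <= b) for X ~ Binomial(n, p) with Phi((b - np)/sqrt(npq)) - Phi((a - np)/sqrt(npq)). The particular (n, p, a, b) values are arbitrary choices for illustration, not from the lecture.

```python
import numpy as np
from scipy.stats import binom, norm

# A minimal sketch of the normal approximation to the binomial: compare the
# exact P(a <= X <= b) with Phi((b - np)/sqrt(npq)) - Phi((a - np)/sqrt(npq)).
# The particular (n, p, a, b) values are arbitrary choices.
def normal_approx(n, p, a, b):
    mu, sd = n * p, np.sqrt(n * p * (1 - p))
    return norm.cdf((b - mu) / sd) - norm.cdf((a - mu) / sd)

cases = [(100, 0.5, 40, 60),    # p near one half: the approximation works well
         (100, 0.01, 0, 1)]     # p tiny and skewed: Poisson territory
for n, p, a, b in cases:
    exact = binom.cdf(b, n, p) - binom.cdf(a - 1, n, p)  # sum of the PMF from a to b
    print(f"n={n}, p={p}: exact {exact:.4f}, "
          f"normal approx {normal_approx(n, p, a, b):.4f}")
# With p = 0.5 the two numbers are close; with p = 0.01 the plain normal
# approximation is noticeably off, matching the discussion above.
```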
46:50 And let's just try to reconcile these statements, though: is there a case? 46:58 If we let n go to infinity and p be very small, 47:02 I still said, if n is going to infinity, 47:05 it's still gonna converge to normal, just much slower, right? 47:11 So, how could the binomial look both normal and Poisson? 47:18 Well, the answer is that the Poisson also looks normal. 47:21 So if you have a Poisson(lambda) where lambda's very large, 47:24 that's also gonna look normal, so there is a case where those come together. 47:30 Okay, one last thing about this is that there is something kind 47:35 of weird about this, in the sense that we're approximating 47:40 a discrete distribution using something continuous. 47:44 And what if we wanted to 47:47 approximate something else in the same problem? 47:53 I just wanna add something to this. 47:54 Well, let's just look at that, just to see what could go wrong 47:59 with this. 47:59 What if we look at the case a equals b? 48:02 So then we're just saying the probability that X equals a, 48:06 that is, approximate the binomial PMF. 48:10 And one kind of weird thing about this is, this thing would change if 48:14 we changed these to strict inequalities, but this part would not. 48:18 As soon as we say that this is approximately normal, then we don't care 48:22 about that anymore. 48:24 So there's something called the continuity correction, which I just wanted to 48:27 briefly mention, 48:28 which is an improvement to deal with the fact that you're using something 48:31 continuous to approximate something discrete. 48:34 And it's often not explained very well, but if you understand what 48:39 it does in this simple case, then it's not hard to see the idea. 48:44 The idea is that if you just said this is approximately normal, then you would just 48:49 say zero, right? 48:50 Because it would be zero for a continuous distribution; that's not very useful, right? 48:53 We want something more useful than zero. 48:56 So the idea is just simply to write this as, 49:00 here let's assume a is an integer, X is discrete, well, 49:05 X equals a is the same thing as saying that X is between 49:10 a minus one-half and a plus one-half, 49:17 right? 49:19 So just use this first. 49:24 So for each value in this range, 49:26 replace it by an interval of length 1 centered there; 49:30 that's exactly the same thing, because X is an integer anyway, so that's true. 49:35 But here at least we're giving it an interval to work with instead of 49:40 just saying zero, so that improves the approximation. 49:44 Anyway, that's the central limit theorem. 49:45 All right, so see you next time.
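For reference, here is a minimal numerical check of the continuity correction described above: approximate the Binomial(n, p) PMF at a by treating {X = a} as {a - 1/2 <= X <= a + 1/2}. The particular n, p, and a are arbitrary choices for illustration, not from the lecture.

```python
import numpy as np
from scipy.stats import binom, norm

# A minimal sketch of the continuity correction: approximate P(X = a) for
# X ~ Binomial(n, p) by the normal probability of the interval
# [a - 1/2, a + 1/2]. The particular n, p, and a are arbitrary choices.
n, p, a = 100, 0.5, 50
mu, sd = n * p, np.sqrt(n * p * (1 - p))

exact = binom.pmf(a, n, p)
without_correction = 0.0   # a continuous distribution puts zero mass on a single point
with_correction = norm.cdf((a + 0.5 - mu) / sd) - norm.cdf((a - 0.5 - mu) / sd)

print(f"P(X = {a}) exact:             {exact:.4f}")
print(f"normal, no correction:        {without_correction:.4f}")
print(f"normal, continuity corrected: {with_correction:.4f}")
# The corrected value is very close to the exact PMF, while the uncorrected
# point probability is just zero, matching the discussion above.
```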