00:00 The following content is provided under a Creative 00:02 Commons license. 00:03 Your support will help MIT OpenCourseWare 00:06 continue to offer high quality educational resources for free. 00:10 To make a donation or to view additional materials 00:12 from hundreds of MIT courses, visit MIT OpenCourseWare 00:16 at ocw.mit.edu. 00:17 00:20 PHILIPPE RIGOLLET: --124. 00:22 If I were to repeat this 1,000 times, 00:24 so every one of those 1,000 times 00:26 they collect 124 data points and then 00:29 I'd do it again and do it again and again, 00:31 then on average, the number I should get 00:34 should be close to the true parameter that I'm looking for. 00:37 The fluctuations that are due to the fact 00:38 that I get different samples every time 00:40 should somewhat vanish. 00:42 And so what I want is to have a small bias, hopefully a 0 bias. 00:46 If this thing is 0, then we say that the estimator is unbiased. 00:50 01:06 So this is definitely a property that we 01:08 are going to be looking for in an estimator, 01:10 trying to find them to be unbiased. 01:11 But we'll see that it's actually maybe not enough. 01:14 So unbiasedness should not be something 01:16 you lose your sleep over. 01:18 Something that's slightly better is the risk, really 01:21 the quadratic risk, which is expectation of-- 01:33 so if I have an estimator, theta hat, 01:35 I'm going to look at the expectation of theta hat n 01:38 minus theta squared. 01:41 And what we showed last time is that we can actually-- 01:44 by inserting in there, adding and removing 01:46 the expectation of theta hat, we actually 01:49 get something where this thing can 01:50 be decomposed as the square of the bias plus the variance, 01:59 which is just the expectation of theta hat minus its expectation 02:04 squared. 02:06 That came from the fact that when 02:08 I added and removed the expectation of theta hat 02:10 in there, the cross-terms cancel. 02:13 All right.
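The decomposition just described, quadratic risk = bias squared + variance, can be checked numerically. A minimal Python sketch (not part of the lecture; all names and numbers are illustrative), using a deliberately biased estimator, Xn bar over 2, so that both terms are nonzero:

```python
import random

# Monte Carlo check of E[(theta_hat - theta)^2] = bias^2 + variance,
# for the (deliberately biased) estimator theta_hat = mean(X) / 2
# of a Bernoulli parameter theta. Illustrative values only.
random.seed(0)
theta, n, reps = 0.6, 20, 100_000

estimates = []
for _ in range(reps):
    sample = [1 if random.random() < theta else 0 for _ in range(n)]
    estimates.append(sum(sample) / n / 2)   # biased on purpose

mean_est = sum(estimates) / reps
risk = sum((e - theta) ** 2 for e in estimates) / reps
bias2 = (mean_est - theta) ** 2
var = sum((e - mean_est) ** 2 for e in estimates) / reps

# The cross-term cancels, so the identity holds exactly
# (up to floating-point error) for the empirical moments too.
assert abs(risk - (bias2 + var)) < 1e-9
assert abs(mean_est - theta / 2) < 0.005    # bias is theta/2 - theta here
```

The identity holds exactly for the empirical moments, which mirrors the add-and-subtract argument from the lecture: the cross-term vanishes.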
02:14 So that was the bias squared, and this is the variance. 02:19 02:25 And so for example, if the quadratic risk goes to 0, 02:29 then that means that theta hat converges 02:31 to theta in the L2 sense. 02:34 And here we know that if we want this to go to 0, 02:38 since it's the sum of two positive terms, 02:40 we need to have both the bias that goes to 0 02:42 and the variance that goes to 0, so we 02:44 need to control both of those things. 02:46 And so there is usually an inherent trade-off 02:49 between getting a small bias and getting a small variance. 02:53 If you reduce one too much, then the variance of the other one 02:56 is going to-- 02:57 then the other one is going to increase, or the opposite. 02:59 That happens a lot, but not so much, actually, in this class. 03:03 So let's just look at a couple of examples. 03:07 So am I planning-- 03:10 yeah. 03:11 So examples. 03:19 So if I do, for example, X1, ..., Xn, they are iid Bernoulli. 03:26 And I'm going to write it as theta so 03:27 that we keep the same notation. 03:29 Then theta hat, what is the theta hat 03:32 that we proposed many times? 03:33 It's just X bar, Xn bar, the average of the Xi's. 03:38 So what is the bias of this guy? 03:40 Well, to know the bias, I just have to remove theta 03:44 from the expectation. 03:46 What is the expectation of Xn bar? 03:49 Well, by linearity of the expectation, 03:51 it's just the average of the expectations. 03:53 03:57 But since all my Xi's are Bernoulli with the same theta, 04:00 then each of these guys is actually equal to theta. 04:03 So this thing is actually theta, which means 04:06 that this is unbiased, right? 04:07 04:14 Now, what is the variance of this guy? 04:16 04:22 So if you forgot the properties of the variance 04:27 for a sum of independent random variables, 04:29 now it's time to wake up. 04:30 So we have the variance of something 04:34 that looks like 1 over n, the sum from i equal 1 to n of Xi.
04:38 04:41 So it's of the form variance of a constant times 04:45 a random variable. 04:46 So the first thing I'm going to do is pull out the constant. 04:49 But we know that the variance lives on the square scale, 04:52 so when I pull out a constant outside of the variance, 04:54 it comes out with a square. 04:56 The variance of a times X is a-squared 04:59 times the variance of X, so this is equal to 1 05:02 over n squared times the variance of the sum. 05:06 05:10 So now we do what we always want to do. 05:13 So we have the variance of the sum. 05:16 We would like somehow to say that this 05:17 is the sum of the variances. 05:19 And in general, we are not allowed to say that, 05:22 but we are because my Xi's are actually independent. 05:26 So this is actually equal to 1 over n squared sum from i equal 05:30 1 to n of the variance of each of the Xi's. 05:36 And that's by independence, so this is basic probability. 05:42 And now, what is the variance of the Xi's where again they 05:45 all have the same distribution, so the variance of Xi 05:47 is the same as the variance of X1. 05:49 And so each of those guys has variance what? 05:51 What is the variance of a Bernoulli? 05:53 We've said it once. 05:54 It's theta times 1 minus theta. 05:55 06:00 And so now I'm going to have the sum of n times a constant, 06:03 so I get n times the constant divided by n squared, 06:05 so one of the n's is going to cancel. 06:07 And so the whole thing here is actually 06:10 equal to theta(1 minus theta) divided by n. 06:15 06:18 So if I'm interested in the quadratic risk-- 06:20 06:27 and again, I should just say risk, 06:28 because this is the only risk we're going 06:30 to be actually looking at. 06:32 Yeah. 06:32 This parenthesis should really stop here. 06:34 06:38 I really wanted to put quadratic in parentheses. 06:41 So the risk of this guy is what? 06:43 Well, it's the expectation of Xn bar minus theta squared.
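The expectation and variance just derived for Xn bar can be sanity-checked by simulation. A minimal Python sketch (not part of the lecture; the parameter values are illustrative):

```python
import random

# Simulation check for iid Bernoulli(theta):
#   E[Xn bar] = theta   (unbiased)
#   Var(Xn bar) = theta * (1 - theta) / n
random.seed(1)
theta, n, reps = 0.3, 25, 100_000

means = []
for _ in range(reps):
    means.append(sum(1 if random.random() < theta else 0 for _ in range(n)) / n)

m = sum(means) / reps
var_hat = sum((x - m) ** 2 for x in means) / reps

assert abs(m - theta) < 0.005                          # unbiasedness
assert abs(var_hat - theta * (1 - theta) / n) < 5e-4   # variance formula
```

Here theta(1 - theta)/n = 0.21/25 = 0.0084, and the empirical variance of the sample means lands on it up to Monte Carlo error.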
06:50 And we know it's the square of the bias plus the variance, 06:54 so it's the square of the bias, which 06:56 we know is 0, so it's 0 squared plus the variance, which 07:00 is theta(1 minus theta) 07:03 divided by n. 07:07 So it's just theta(1 minus theta) divided by n. 07:14 So this is just summarizing the performance of an estimator, 07:17 which is a random variable. 07:18 I mean, it's complicated. 07:19 If I really wanted to describe it, 07:22 I would just tell you the entire distribution 07:25 of this random variable. 07:27 But now what I'm doing is I'm saying, well, 07:29 let's just take this random variable, remove theta from it, 07:32 and see how small the fluctuations around theta-- 07:36 the squared fluctuations around theta are in expectation. 07:41 So that's what the quadratic risk is doing. 07:43 And in a way, this decomposition, 07:44 as the sum of the bias squared and the variance, 07:46 is really telling you that-- 07:47 it is really accounting for the bias, which is, well, 07:50 even if I had an infinite amount of observations, 07:52 is this thing doing the right thing? 07:54 And the other thing is actually the variance, 07:56 so for a finite number of observations, 07:57 what are the fluctuations? 07:59 All right. 08:00 Then you can see that those things, bias and variance, 08:02 are actually very different. 08:05 So I don't have any colors here, so you're 08:08 going to have to really follow the speed-- 08:12 the order in which I draw those curves. 08:14 All right. 08:14 So let's find-- 08:15 I'm going to give you three candidate estimators, so-- 08:19 08:29 estimators for theta. 08:31 08:35 So the first one is definitely Xn bar. 08:38 That will be a good candidate estimator. 08:40 The second one is going to be 0.5, because after all, 08:45 why should I bother if it's actually going to be-- 08:47 right?
08:47 So for example, if I ask you to predict 08:51 the score of some candidate in some election, 08:54 then since you know it's going to be very close to 0.5, 08:57 you might as well just throw out 0.5 and you're not going 08:59 to be very far from reality. 09:00 And it's actually going to cost you 0 time and $0 09:02 to come up with that. 09:03 So sometimes maybe just a good old guess 09:06 is actually doing the job for you. 09:08 Of course, for presidential elections 09:10 or something like this, it's not very helpful 09:12 if your prediction is telling you this. 09:14 But if it was something different, 09:17 that would be a good way to generate something close to 1/2. 09:21 For a coin, for example, if I give you a coin, 09:23 you never know. 09:24 Maybe it's slightly biased. 09:25 But a good guess, just looking at it, inspecting it, 09:27 maybe there's something crazy happening 09:29 with the structure of it, you're going 09:31 to guess that it's 0.5 without trying to collect information. 09:34 And let's find another one, which is, well, you know, 09:36 I have a lot of observations. 09:38 But I'm recording couples kissing, but I'm on a budget. 09:43 I don't have time to travel all around the world 09:46 and collect some people. 09:47 So really, I'm just going to look at the first couple 09:49 and go home. 09:49 So my other estimator is just going to be X1. 09:53 I just take the first observation, 0 or 1, 09:55 and that's it. 09:57 So now I'm going-- 09:57 I want to actually understand what the behavior of those guys 10:01 is. 10:01 All right. 10:02 So we know-- and so we know that for this guy, the bias is 0 10:09 and the variance is equal to theta(1 minus theta) 10:14 divided by n. 10:19 What is the bias of this guy, 0.5? 10:22 10:28 AUDIENCE: 0.5. 10:29 AUDIENCE: 0.5 minus theta? 10:31 PHILIPPE RIGOLLET: 0.5 minus theta, right. 10:32 10:35 So the bias is 0.5 minus theta. 10:39 What is the variance of this guy? 10:40 10:44 What is the variance of 0.5?
10:46 AUDIENCE: It's 0. 10:47 PHILIPPE RIGOLLET: 0. 10:48 Right. 10:49 It's just a deterministic number, 10:50 so there's no fluctuation for this guy. 10:53 What is the bias? 10:54 Well, X1 is actually-- 10:56 just for simplicity, I can think of it 10:58 as being X1 bar, the average of itself, 11:00 so that wherever I saw an n for this guy, I can replace it by 1 11:03 and that will give me my formula. 11:05 So the bias is still going to be 0. 11:07 And the variance is going to be equal to theta(1 minus theta). 11:10 11:13 So now I have those three estimators. 11:16 Well, if I compare X1 and Xn bar, then 11:19 clearly I have 0 bias in both cases. 11:22 That's good. 11:23 And I have the variance that's actually n times smaller when I 11:27 use my n observations than when I don't. 11:29 So those two guys, on these two fronts, 11:31 you can actually look at the two numbers 11:32 and say, well, the first number is the same. 11:35 The second number is better for the other guy, 11:37 so I will definitely go for this guy compared to this guy. 11:40 So this guy is gone. 11:42 But not this guy. 11:43 Well, if I look at the variance, it's 0. 11:47 It's always beating the variance of this guy. 11:49 And if I look at the bias, it's actually really not that bad. 11:52 It's 0.5 minus theta. 11:53 In particular, if theta is 0.5, then this guy 11:55 is strictly better. 11:57 And so you can actually now look at what 12:00 the quadratic risk looks like. 12:05 So here, what I'm going to do is I'm 12:06 going to take my true theta-- so it's 12:08 going to range between 0 and 1. 12:09 And we know that those two things are functions of theta, 12:12 so I can only understand them if I plot them 12:13 as functions of theta. 12:16 And so now I'm going to actually plot-- 12:18 the y-axis is going to be the risk. 12:20 12:23 So what is the risk of the estimator 0.5? 12:26 This one is easy. 12:27 Well, it's 0 plus the square of 0.5 minus theta.
12:33 So we know that at theta equals 0.5, it's actually going to be 0. 12:37 And then it's going to be a square. 12:39 So at 0, it's going to be 0.25. 12:44 And at 1, it's going to be 0.25 as well. 12:49 So it looks like this. 12:49 Well, actually, sorry. 12:50 Let me put the 0.5 where it should be. 12:52 12:56 OK. 12:57 So this here is the risk of 0.5. 13:03 And we'll write it like this. 13:06 So when theta is very close to 0.5, I'm very happy. 13:09 When theta gets farther, it's a little bit annoying. 13:13 And then here, I want to plot the risk of this guy. 13:16 So now the thing with the risk of this guy 13:18 is that it will depend on n. 13:20 So I will just pick some n that I'm happy with just 13:24 so that I can actually draw a curve. 13:26 Otherwise, I'm going to have to plot one curve per value of n. 13:29 So let's just say, for example, that n is equal to 10. 13:31 And so now I need to plot the function theta(1 minus 13:35 theta) divided by 10. 13:37 We know that theta(1 minus theta) 13:39 is a curve that goes like this. 13:40 At 1/2, 13:42 it takes the value 1/4. 13:43 That's the maximum. 13:44 And then it's 0 at the ends. 13:46 So really, if n is equal to 1, this 13:52 is what the variance looks like. 13:53 The bias doesn't count in the risk. 13:56 Yeah. 13:57 AUDIENCE: [INAUDIBLE] 14:00 PHILIPPE RIGOLLET: Sure. 14:01 Can you move? 14:03 All right. 14:04 Are you guys good? 14:05 14:08 All right. 14:08 So now I have this picture. 14:10 And I know I'm going up to 0.25. 14:12 And there's a place where those curves cross. 14:15 So if you're sure-- 14:16 let's say you're talking about a presidential election, 14:18 you know that those things are going to be really close. 14:20 Maybe you're actually better off predicting 0.5 14:23 if you know it's not going to go too far. 14:25 But that's for one observation, so that's the risk of X1. 14:32 But if I look at the risk of Xn bar, all I'm doing 14:34 is just crushing this curve down to 0.
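The curves being drawn here can also be written down exactly. A short Python sketch (not part of the lecture) encoding the three risks as functions of theta for n = 10, including where the constant 0.5 beats Xn bar; solving (0.5 - theta)^2 < theta(1 - theta)/10 gives the window |theta - 0.5| < 1/(2 sqrt(11)):

```python
import math

# Exact quadratic risks of the three candidate estimators, n = 10.
n = 10

def risk_xbar(theta):   # bias 0, variance theta(1-theta)/n
    return theta * (1 - theta) / n

def risk_half(theta):   # bias 0.5 - theta, variance 0
    return (0.5 - theta) ** 2

def risk_x1(theta):     # bias 0, variance theta(1-theta)
    return theta * (1 - theta)

# Xn bar always (weakly) beats X1:
assert all(risk_xbar(t / 100) <= risk_x1(t / 100) for t in range(101))

# The constant 0.5 wins only on a window around theta = 0.5:
# (0.5 - theta)^2 < theta(1-theta)/10  iff  |theta - 0.5| < 1/(2*sqrt(11)).
half_width = 1 / (2 * math.sqrt(11))
inside = 0.5 + 0.9 * half_width
outside = 0.5 + 1.1 * half_width
assert risk_half(inside) < risk_xbar(inside)
assert risk_half(outside) > risk_xbar(outside)
```

As n grows, half_width shrinks like 1/sqrt(n + 1), which is the "crushing the curve down to 0" picture: the window where the naive guess 0.5 wins gets narrower and narrower.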
14:38 So as n increases, it's going to look more and more like this. 14:42 It's the same curve divided by n. 14:44 14:48 And so now I can just start to understand 14:50 that for different values of thetas, 14:52 now I'm going to have to be very close to theta is equal to 1/2 14:56 if I want to start saying that Xn bar is worse 14:58 than the naive estimator 0.5. 15:03 Yeah. 15:04 AUDIENCE: Sorry. 15:04 I know you explained a little bit before, but can you just-- 15:08 what is an intuitive definition of risk? 15:11 What is it actually describing? 15:13 PHILIPPE RIGOLLET: So either you can-- 15:16 well, when you have an unbiased estimator, it's simple. 15:18 It's just telling you it's the variance, 15:20 because the theta that you have over there is really-- so 15:23 in the definition of the risk, the theta 15:26 that you have here if you're unbiased 15:28 is really the expectation of theta hat. 15:31 So that's really just the variance. 15:33 So the risk is really telling you 15:35 how much fluctuations I have around my expectation 15:39 if unbiased. 15:39 But actually here, it's telling you how much fluctuations 15:42 I have in average around theta. 15:44 So if you understand the notion of variance as being-- 15:47 AUDIENCE: [INAUDIBLE] 15:47 PHILIPPE RIGOLLET: What? 15:48 AUDIENCE: Like variance on average. 15:49 PHILIPPE RIGOLLET: No. 15:49 AUDIENCE: No. 15:50 PHILIPPE RIGOLLET: It's just like variance. 15:51 AUDIENCE: Oh, OK. 15:52 PHILIPPE RIGOLLET: So when you-- 15:53 I mean, if you claim you understand what variance is, 15:56 it's telling you what is the expected 15:58 squared fluctuation around the expectation 16:00 of my random variable. 16:01 It's just telling you on average how far I'm going to be. 16:04 And you take the square because you want to cancel the signs. 16:06 Otherwise, you're going to get 0. 16:07 AUDIENCE: Oh, OK. 16:07 PHILIPPE RIGOLLET: And here it's saying, well, 16:08 I really don't care what the expectation of theta hat is. 
16:11 What I want to get to is theta, so I'm 16:13 looking at the expectation of the squared fluctuations 16:15 around theta itself. 16:16 If I'm unbiased, it coincides with the variance. 16:19 But if I'm biased, then I have to account for the fact 16:21 that I'm really not computing the-- 16:23 AUDIENCE: OK. 16:23 OK. 16:24 Thanks. 16:25 PHILIPPE RIGOLLET: OK? 16:27 All right. 16:28 Are there any questions? 16:29 So here, what I really want to illustrate 16:31 is that the risk itself is a function 16:33 of theta most of the time. 16:34 And so for different thetas, some estimators 16:35 are going to be better than others. 16:37 But there's also the entire range 16:38 of estimators, those that are really biased, 16:41 but the bias can completely vanish. 16:44 And so here, you see you have no bias, 16:47 but the variance can be large. 16:48 Or you have 0 bias-- 16:50 you have a bias, but the variance is 0. 16:52 So you can actually have this trade-off 16:54 and you can find things that are in the entire range in general. 16:58 17:01 So those things are actually-- those trade-offs 17:05 between bias and variance are usually much better illustrated 17:10 if we're talking about multivariate parameters. 17:12 If I actually look at a parameter which 17:14 is the mean of some multivariate Gaussian, so an entire vector, 17:19 then the bias is going to-- 17:20 I can make the bias bigger by, for example, 17:23 forcing all the coordinates of my estimator to be the same. 17:26 So here, I'm going to get some bias, 17:27 but the variance is actually going 17:29 to be much better, because I get to average all 17:31 the coordinates for this guy. 17:32 And so really, the bias/variance trade-off 17:35 is when you have multiple parameters to estimate, 17:38 so you have a vector of parameters, 17:40 a multivariate parameter, the bias 17:42 increases when you're trying to pool more information 17:45 across the different components to actually have 17:49 a lower variance.
17:50 So the more you average, the lower the variance. 17:53 That's exactly what we've illustrated. 17:54 As n increases, the variance decreases, 17:56 like 1 over n, or theta(1 minus theta) over n. 17:59 And so this is how it happens in general. 18:01 In this class, it's mostly one-dimensional parameter 18:03 estimation, so it's going to be a little harder to illustrate 18:06 that. 18:06 But if you do, for example, non-parametric estimation, 18:09 that's all you do. 18:10 There's just bias/variance trade-offs all the time. 18:14 And in between, when you have high-dimensional parametric 18:16 estimation, that happens a lot as well. 18:20 OK. 18:21 So I'm just going to go quickly through those two remaining 18:25 slides, because we've actually seen them. 18:26 But I just wanted you to have somewhere a formal definition 18:29 of what a confidence interval is. 18:32 And so we fix a statistical model for n observations, X1 18:37 to Xn. 18:38 The parameter theta here is one-dimensional. 18:42 Theta is a subset of the real line, 18:44 and that's why I talk about intervals. 18:47 An interval is a subset of the line. 18:48 If I had a subset of R2, for example, 18:51 that would no longer be called an interval, but a region, 18:54 just because-- well, we could just say a set, 18:57 a confidence set. 18:59 But people like to say confidence region. 19:01 So an interval is just a one-dimensional confidence 19:04 region. 19:04 And it has to be an interval as well. 19:07 So a confidence interval of level 1 minus alpha-- 19:11 so the quality of a confidence interval 19:16 is referred to as its level. 19:18 It takes value 1 minus alpha for some positive alpha. 19:21 And so the confidence level-- 19:23 the level of the confidence interval is between 0 and 1. 19:26 The closer to 1 it is, the better the confidence interval. 19:29 The closer to 0, the worse it is. 19:32 And so for any random interval-- so 19:34 a confidence interval is a random interval.
19:37 The bounds of this interval depend on random data. 19:41 Just like we had X bar plus/minus 19:44 1 over square root of n, for example, or 2 19:46 over square root of n, this X bar 19:49 was the random thing that would make those guys fluctuate. 19:53 And so now I have an interval. 19:54 And now I have its boundaries, but now the boundaries 19:56 are not allowed to depend on my unknown parameter. 19:58 Otherwise, it's not a confidence interval, 20:00 just like an estimator that depends 20:02 on the unknown parameter is not an estimator. 20:04 The confidence interval has to be something 20:06 that I can compute once I collect data. 20:10 And so what I want is that-- so there's this weird notation. 20:14 The fact that I write theta-- 20:17 that's the probability that I contains theta. 20:19 You're used to seeing theta belongs to I. 20:23 But here, I really want to emphasize 20:24 that the randomness is in I. And so the way 20:26 you actually say it when you read 20:28 this formula is the probability that I contains theta 20:32 is at least 1 minus alpha. 20:36 So it better be close to 1. 20:39 You want 1 minus alpha to be very close to 1, 20:41 because it's really telling you that whatever 20:43 random variable I'm giving you, my error bars are actually 20:46 covering the right theta. 20:49 And I want this to be true. 20:50 But I want this-- since I don't know 20:52 what my confidence-- my parameter theta 20:54 is, I want this to hold true for all possible values 20:58 of the parameter that nature may have come up with. 21:02 So I want this-- so there's theta that changes here, 21:05 so the distribution of the interval 21:06 is actually changing with theta, hopefully. 21:08 And theta is changing with this guy. 21:11 So regardless of the value of theta that I'm getting, 21:13 I want that the probability that it contains the theta 21:17 is actually larger than 1 minus alpha. 21:20 So I'll come back to it in a second.
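The coverage statement "the probability that I contains theta is at least 1 minus alpha" can be estimated by simulation: draw many samples, build the interval each time, and count how often it contains the true theta. A Python sketch (not from the lecture), using the conservative 95% interval for a Bernoulli mean with illustrative numbers:

```python
import math
import random

# Empirical coverage of the conservative level-95% interval
#   Xn bar +/- 1.96 * sqrt(1/4) / sqrt(n)
# for Bernoulli(theta) data. theta, n, reps are illustrative.
random.seed(2)
theta, n, reps = 0.1, 100, 20_000
half = 1.96 * 0.5 / math.sqrt(n)   # half-width, using p(1-p) <= 1/4

covered = 0
for _ in range(reps):
    xbar = sum(1 if random.random() < theta else 0 for _ in range(n)) / n
    if xbar - half <= theta <= xbar + half:   # did I contain theta?
        covered += 1

coverage = covered / reps
# Conservative interval: for theta far from 1/2, empirical coverage
# sits well above the nominal 95%.
assert coverage >= 0.95
```

Note the probability is over the random interval, not over theta: each repetition gives a different realized interval, and "coverage" is the fraction of those intervals that caught the fixed true theta.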
21:22 I just want to say that here, we can 21:23 talk about asymptotic level. 21:25 And that's typically when you use the central limit 21:27 theorem to compute this guy. 21:29 Then you're not guaranteed that the value is 21:32 at least 1 minus alpha for every n, 21:35 but it's actually in the limit larger than 1 minus alpha. 21:40 So maybe for each fixed n it's going to be not true. 21:43 But as n goes to infinity, it's 21:44 actually going to become true. 21:46 If you want this to hold for every n, 21:49 you actually need to use things such as Hoeffding's inequality 21:51 that we described at some point, which holds for every n. 21:55 So as a rule of thumb, if you use the central limit theorem, 22:00 you're dealing with a confidence interval 22:01 with asymptotic level 1 minus alpha. 22:04 And the reason is because you actually 22:05 want to get the quantiles of the normal-- the Gaussian 22:10 distribution that comes from the central limit theorem. 22:13 And if you want to use Hoeffding's, for example, 22:15 you might actually get away with a confidence interval that's 22:18 actually true even non-asymptotically. 22:20 It's just the regular confidence interval. 22:22 22:24 So this is the formal definition. 22:26 It's a bit of a mouthful. 22:28 But we actually-- the best way to understand them 22:30 is to build them. 22:31 Now, at some point I said-- 22:33 and I think it was part of the homework-- 22:35 22:38 so here, I really say the probability 22:39 the true parameter belongs to the confidence interval 22:42 is actually 1 minus alpha. 22:44 And so that's because here, this confidence interval 22:47 is still a random variable. 22:48 Now, if I start plugging in numbers instead 22:50 of the random variables X1 to Xn, 22:52 I start putting 1, 0, 0, 1, 0, 0, 1, 22:55 like I did for the kiss example, then in this case, 22:58 the random interval is actually going to be 0.42, 0.65. 23:03 And this guy, the probability that theta belongs to it 23:05 is not 1 minus alpha.
23:07 It's either 0 if it's not in there 23:10 or it's 1 if it's in there. 23:11 23:16 So here is the example that we had. 23:19 So let's just look back into our favorite example, which 23:24 is the average of Bernoulli random variables, 23:26 so we've studied that maybe that's the third time already. 23:30 So the sample average, Xn bar, is a strongly consistent 23:34 estimator of p. 23:35 That was one of the properties that we wanted. 23:37 Strongly consistent means that as n goes to infinity, 23:40 it converges almost surely to the true parameter. 23:42 That's the strong law of large numbers. 23:44 It is consistent also, because it's strongly consistent, 23:47 so it also converges in probability, 23:49 which makes it consistent. 23:52 It's unbiased. 23:53 We've seen that. 23:53 We've actually computed its quadratic risk. 23:57 And now what I have is that if I look at-- 24:00 thanks to the central limit theorem, we actually did this. 24:02 We built a confidence interval at level 1 minus alpha-- 24:08 asymptotic level, sorry, asymptotic level 1 minus alpha. 24:12 And so here, this is how we did it. 24:15 Let me just go through it again. 24:17 So we know from the central limit theorem-- 24:19 24:28 so the central limit theorem tells us 24:31 that square root of n times Xn bar minus p, divided 24:38 by square root of p(1 minus p), converges in distribution as n 24:41 goes to infinity to some standard normal distribution. 24:47 So what it means is that if I look at the probability 24:49 under the true p that square root of n times Xn bar 24:53 minus p, divided by square root of p(1 minus p), 25:03 is less than q alpha over 2 in absolute value, where this is 25:06 the definition of the quantile. 25:07 Then this guy-- and I'm actually going to use the same notation, 25:11 limit as n goes to infinity, this is the same thing. 25:17 So this is actually going to be equal to 1 minus alpha. 25:22 That's exactly what I did last time.
25:25 This is by definition of the quantile of a standard Gaussian 25:28 and of a limit in distribution. 25:32 So the probability computed on this guy in the limit converges 25:36 to the probability computed on this guy. 25:38 And we know that this is just the probability 25:40 that the absolute value of some N(0, 1) 25:42 is less than q alpha over 2. 25:44 25:47 And so in particular, if it's equal, 25:50 then I can put some larger than or equal to, 25:54 which guarantees my asymptotic confidence level. 25:57 And I just solve for p. 25:59 So this is equivalent to the limit 26:03 as n goes to infinity of the probability 26:07 that p is between Xn bar minus q 26:15 alpha over 2 26:21 times square root of p(1 minus p) divided by square root of n, and Xn 26:26 bar plus q alpha over 2 times square root of p(1 minus p) 26:33 divided by square root of n, being larger than or equal 26:37 to 1 minus alpha. 26:39 And so there you go. 26:39 I have my confidence interval. 26:43 Except that it's not, right? 26:45 We just said that the bounds of a confidence interval 26:48 may not depend on the unknown parameter. 26:50 And here, they do. 26:52 And so we actually came up with two ways 26:54 of getting rid of this. 26:55 Since we only need this thing-- so this thing, as we said, 26:58 is really equal. 26:59 Every time I make this guy smaller 27:01 and this guy larger, I'm only going 27:03 to increase the probability. 27:05 And so what we do is we actually just take 27:06 the largest possible value for p(1 minus 27:08 p), which makes the interval as large as possible. 27:13 And so now I have this. 27:15 I just do one of the two tricks. 27:17 I replace p(1 minus p) by its upper bound, which is 1/4. 27:22 27:25 As we said, p(1 minus p), the function looks like this. 27:28 So I just take the value here at 1/2. 27:31 Or, I can use Slutsky and say that if I replace p by Xn bar, 27:37 that's the same as just replacing p by Xn bar here.
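The two tricks just described can be written down directly. A Python sketch (not from the lecture; the sample-mean value below is illustrative, not data from the course):

```python
import math

# Two ways to remove the unknown p from the interval
#   Xn bar +/- q_{alpha/2} * sqrt(p(1-p)/n):
# (a) bound p(1-p) by its maximum 1/4 (conservative),
# (b) plug in Xn bar for p (justified by Slutsky).
# q = 1.96 is the standard Gaussian quantile for alpha = 0.05.

def conservative_ci(xbar, n, q=1.96):
    half = q * math.sqrt(0.25 / n)
    return xbar - half, xbar + half

def plugin_ci(xbar, n, q=1.96):
    half = q * math.sqrt(xbar * (1 - xbar) / n)
    return xbar - half, xbar + half

# Illustrative numbers: n = 124 observations, sample mean 0.645.
lo_c, hi_c = conservative_ci(0.645, 124)
lo_p, hi_p = plugin_ci(0.645, 124)

# The plug-in interval is never wider, since xbar(1 - xbar) <= 1/4.
assert hi_p - lo_p <= hi_c - lo_c
```

Both intervals are computable from the data alone, which is exactly the requirement that the bounds not depend on the unknown parameter.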
27:40 27:45 And by Slutsky, we know that this is actually converging 27:48 also to some standard Gaussian. 27:50 27:59 We've seen that when we saw Slutsky as an example. 28:04 And so those two things-- actually, 28:05 just because I'm taking the limit 28:07 and I'm only caring about the asymptotic confidence level, 28:10 I can actually just plug in consistent quantities in there, 28:13 such as Xn bar where I don't have a p. 28:15 And that gives me another confidence interval. 28:18 All right. 28:19 So this by now, hopefully after doing it three times, 28:24 you should really, really be comfortable with just creating 28:28 this confidence interval. 28:29 We did it three times in class. 28:31 I think you probably did it another couple times 28:33 in your homework. 28:34 So just make sure you're comfortable with this. 28:36 All right. 28:37 That's one of the basic things you would want to know. 28:39 Are there any questions? 28:41 Yes. 28:42 AUDIENCE: So Slutsky holds for any single response set p. 28:46 But Xn converges [INAUDIBLE]. 28:48 28:52 PHILIPPE RIGOLLET: So that's not Slutsky, right? 28:55 AUDIENCE: That's [INAUDIBLE]. 28:58 PHILIPPE RIGOLLET: So Slutsky tells you that if you-- 29:04 Slutsky's about combining two types of convergence. 29:06 So Slutsky tells you that if you actually 29:08 have one Xn that converges to X in distribution and Yn 29:13 that converges to Y in probability, then 29:16 you can actually multiply Xn and Yn 29:18 and get that the limit in distribution 29:20 is the product of X and Y, where X is now a constant. 29:28 And here we have the constant, which is 1. 29:32 But I did that already, right? 29:35 Using Slutsky to replace it for the-- 29:37 to replace P by Xn bar, we've done 29:40 that last time, maybe a couple of times ago, actually. 29:44 Yeah. 29:45 AUDIENCE: So I guess these statements are [INAUDIBLE].. 29:49 PHILIPPE RIGOLLET: That's correct. 
29:51 AUDIENCE: So could we like figure out [INAUDIBLE] 29:53 can we set a finite [INAUDIBLE]. 29:58 PHILIPPE RIGOLLET: So of course, the short answer is no. 30:00 30:04 So here's how you would go about thinking 30:06 about which method is better. 30:08 So there's always the more conservative method. 30:10 With the first one, the only thing you're losing 30:13 is the rate of convergence of the central limit theorem. 30:16 So if n is large enough so that the central limit theorem 30:19 approximation is very good, then that's all you're 30:22 going to be losing. 30:24 Of course, the price you pay is that your confidence interval 30:27 is wider than it would be if you were 30:28 to use Slutsky for this particular problem, 30:31 typically wider. 30:32 Actually, it is always wider, because Xn bar times 30:37 (1 minus Xn bar) is always less than 1/4 as well. 30:41 And so that's the first thing you-- 30:45 so with Slutsky, basically, you're relying on the central limit-- 30:51 you're relying on the asymptotics again. 30:53 Now of course, you don't want to be conservative, 30:56 because you actually want to squeeze as much from your data 30:59 as you can. 30:59 So it depends on how comfortable you are and how critical it is for you 31:04 to put valid error bars. 31:06 If they're valid in the asymptotics, 31:07 then maybe you're actually going to go with Slutsky 31:09 so it actually gives you slightly narrower confidence 31:11 intervals and so you feel like you're a little more-- 31:16 you have a more precise answer. 31:17 Now, if you really need to be super-conservative, 31:19 then you're actually going to go with the p(1 minus p) bound. 31:23 Actually, if you need to be even more conservative, 31:25 you are going to go with Hoeffding's so you don't even 31:28 have to rely on the asymptotic level at all. 31:31 But then your confidence interval 31:32 becomes twice as wide, and it becomes wider and wider as you go.
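The "wider and wider" ordering can be made concrete by comparing half-widths at level 95%. A Python sketch (not from the lecture; n and xbar are illustrative), using the standard Hoeffding half-width sqrt(log(2/alpha)/(2n)):

```python
import math

# Half-widths of the three 95% intervals for a Bernoulli mean:
#   plug-in (Slutsky):  1.96 * sqrt(xbar(1-xbar)/n)   (asymptotic)
#   conservative CLT:   1.96 * sqrt(1/(4n))           (asymptotic)
#   Hoeffding:          sqrt(log(2/alpha) / (2n))     (valid for every n)
n, alpha, xbar = 100, 0.05, 0.3

w_plugin = 1.96 * math.sqrt(xbar * (1 - xbar) / n)
w_conserv = 1.96 * math.sqrt(0.25 / n)
w_hoeffding = math.sqrt(math.log(2 / alpha) / (2 * n))

# Each step of extra safety costs width.
assert w_plugin <= w_conserv <= w_hoeffding
```

With these numbers the three half-widths come out near 0.090, 0.098, and 0.136: the more guarantees you insist on, and the less you lean on the asymptotics, the wider the interval you have to report.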
31:37 So it depends on-- 31:39 I mean, a lot of the art in statistics 31:41 is gauging how critical it is for you to output 31:46 valid error bounds or if they're really just here 31:48 to be indicative of the precision of the estimator you 31:51 gave from a more qualitative perspective. 31:55 AUDIENCE: So the error there is [INAUDIBLE]?? 31:57 PHILIPPE RIGOLLET: Yeah. 31:58 So here, there's basically a bunch of errors. 32:01 There's one that's-- so there's a theorem called Berry-Esseen 32:04 that quantifies how far this probability is from 1 minus 32:09 alpha, but the constants are terrible. 32:12 So it's not very helpful, but it tells you 32:14 as n grows how this thing 32:17 becomes smaller. 32:18 And then for Slutsky, again you're 32:20 multiplying something that converges by something that 32:22 fluctuates around 1, so you need to understand 32:24 how this thing fluctuates. 32:25 Now, there's something that shows up. 32:28 Basically, what is the slope of the function 1 32:31 over square root of x(1 minus x) around the value 32:36 you're interested in? 32:37 And so if this function is super-sharp, 32:39 then small fluctuations of Xn bar around its expectation 32:43 are going to lead to really high fluctuations 32:45 of the function itself. 32:47 So if you're looking at-- 32:49 if you have f of Xn bar and f around, say, the true p, 32:55 if f is really sharp like that, then 32:58 if you move a little bit here, then you're 33:00 going to move really a lot on the y-axis. 33:03 So that's what the function here-- the function 33:05 you're interested in is 1 over square root of x(1 minus x). 33:09 So what does this function look like around the point where you 33:11 think p, the true parameter, is? 33:14 33:17 Its derivative really is what matters. 33:19 OK? 33:21 Any other questions? 33:22 33:24 OK. 33:25 So it's important, because now we're 33:26 going to switch to the real let's-do-some-hardcore 33:29 computation type of things. 33:31 All right.
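The slope point made just above can be checked directly: f(x) = 1/sqrt(x(1 - x)) is flat at x = 1/2 and steep near the edges, so the same small fluctuation in Xn bar moves the plug-in quantity far more when p is close to 0 or 1. A quick Python sketch (not from the lecture), comparing numerical slopes:

```python
import math

# Numerical slope of f(x) = 1 / sqrt(x(1-x)) at two points:
# near x = 1/2 (flat, by symmetry) and near x = 0.1 (steep).
def f(x):
    return 1 / math.sqrt(x * (1 - x))

eps = 0.01
slope_mid = abs(f(0.5 + eps) - f(0.5 - eps)) / (2 * eps)
slope_edge = abs(f(0.1 + eps) - f(0.1 - eps)) / (2 * eps)

# Fluctuations of Xn bar are amplified by f much more near the edge.
assert slope_mid < 0.1
assert slope_edge > 10
```

This is the delta-method intuition behind the remark: the derivative of the function you plug Xn bar into controls how its fluctuations propagate.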
33:32 33:36 So in this chapter, we're going to talk about maximum 33:39 likelihood estimation. 33:40 33:44 Who has already seen maximum likelihood estimation? 33:49 OK. 33:50 And who knows what a convex function is? 33:55 OK. 33:56 So we'll do a little bit of reminders on those things. 34:00 So here's the thing: when we do maximum likelihood estimation, 34:04 the likelihood is a function, so we need to maximize a function. 34:07 That's basically what we need to do. 34:09 And if I give you a function, you 34:10 need to know how to maximize this function. 34:12 Sometimes, you have closed-form solutions. 34:14 You can take the derivative and set it equal to 0 and solve it. 34:18 But sometimes, you actually need to resort to algorithms 34:21 to do that. 34:21 And there's an entire industry doing that. 34:25 And we'll briefly touch upon it, but this is definitely 34:27 not the focus of this class. 34:30 OK. 34:31 So before diving directly into the definition 34:34 of the likelihood and what is the definition 34:36 of the maximum likelihood estimator, what 34:38 I'm going to try to do is to give you 34:41 an insight for what we're actually doing when we do 34:45 maximum likelihood estimation. 34:48 So remember, we have a model on a sample space E 34:53 and some candidate distributions P theta. 34:57 And really, your goal is to estimate a true theta 35:00 star, the one that generated some data, X1 to Xn, 35:04 in an iid fashion. 35:06 But this theta star is really a proxy for us 35:08 to know that we actually understand 35:10 the distribution itself. 35:12 The goal of knowing theta star is so that you can actually 35:15 know what P theta star is. 35:17 Otherwise, it has-- well, sometimes we 35:19 said it has some meaning itself, but really you 35:21 want to know what the distribution is. 35:23 And so your goal is to actually come up with a distribution-- 35:27 hopefully one that comes from the family P theta-- 35:30 that's close to P theta star. 
35:33 So in a way, what does it mean to have two distributions that 35:38 are close? 35:39 It means that when you compute probabilities 35:41 on one distribution, you should have 35:43 the same probability on the other distribution pretty much. 35:46 So what we can do is say, well, now I 35:49 have two candidate distributions. 35:51 35:59 So if theta hat leads to a candidate distribution P theta 36:03 hat, and this is the true theta star, 36:06 it leads to the true distribution P theta star 36:08 according to which my data was drawn. 36:11 That's my candidate. 36:12 36:16 As a statistician, I'm supposed to come up 36:18 with a good candidate, and this is the truth. 36:20 36:23 And what I want is that if you actually give me 36:26 the distribution, then I want, when 36:30 I'm computing probabilities for this guy, 36:31 to know what the probabilities for the other guy are. 36:34 And so really what I want is that if I compute a probability 36:40 under theta hat of some interval a, b, 36:44 it should be pretty close to the probability 36:46 under theta star of a, b. 36:51 And more generally, if I want to take 36:53 the union of two intervals, I want this to be true. 36:55 If I take just half-lines, I want this to be true from 0 36:58 to infinity, for example, things like this. 37:00 I want this to be true for all of them at once. 37:03 And so what I do is that I write A for a probability event. 37:07 And I want that P hat of A is close to P star of A 37:11 for any event A in the sample space. 37:15 Does that sound like a reasonable goal 37:17 for a statistician? 37:18 So in particular, if I want those to be close, 37:20 I want the absolute value of their difference 37:22 to be close to 0. 37:23 37:26 And this turns out to be-- 37:28 if I want this to hold for all possible A's, I 37:31 have all possible events, so I'm going to actually maximize over 37:35 these events. 
37:36 And I'm going to look at the worst 37:37 possible event on which theta hat can depart from theta star. 37:41 And so rather than defining it specifically 37:43 for theta hat and theta star, I'm 37:44 just going to say, well, if you give me two probability 37:47 measures, P theta and P theta prime, 37:51 I want to know how close they are. 37:53 Well, if I want to measure how close they 37:55 are by how they can differ when I measure 37:58 the probability of some event, I'm 38:01 just looking at the absolute value of the difference 38:04 of the probabilities and I'm just 38:06 maximizing over the worst possible event that might 38:09 actually make them differ. 38:11 Agreed? 38:13 That's a pretty strong notion. 38:14 So if the total variation between theta and theta prime 38:17 is small, it means that for all possible A's that you give me, 38:22 then P theta of A is going to be close to P 38:25 theta prime of A, because if-- 38:30 let's say I just found the bound on the total variation 38:33 distance, which is 0.01. 38:41 All right. 38:42 So that means that this is going to be larger 38:46 than the max over A of P theta minus P theta prime of A, 39:00 which means that for any A-- 39:04 actually, let me write P theta hat and P theta star, 39:06 like we said, theta hat and theta star. 39:10 And so if I have a bound, say, on the total variation, 39:12 which is 0.01, that means that P theta hat-- 39:19 every time I compute a probability on P theta hat, 39:23 it's basically in the interval P theta star of A, 39:29 the one that I really wanted to compute, plus or minus 0.01. 39:34 This has nothing to do with confidence interval. 39:36 This is just telling me how far I 39:38 am from the value of actually trying to compute. 39:41 And that's true for all A. And that's key. 39:44 That's where this max comes into play. 39:47 It just says, I want this bound to hold 39:49 for all possible A's at once. 
39:50 39:55 So this is actually a very well-known distance 39:58 between probability measures. 39:59 It's the total variation distance. 40:00 It's extremely central to probabilistic analysis. 40:04 And it essentially tells you that every time-- 40:07 if two probability distributions are close, 40:09 then it means that every time I compute a probability 40:11 under P theta but I really actually 40:15 have data from P theta prime, then 40:17 the error is no larger than the total variation. 40:21 OK. 40:23 So this is maybe not the most convenient way 40:29 of finding a distance. 40:30 I mean, how are you going-- 40:32 in reality, how are you to compute this maximum 40:34 over all possible events? 40:35 I mean, it's just crazy, right? 40:36 There's an infinite number of them. 40:38 It's much larger than the number of intervals, for example, 40:41 so it's a bit annoying. 40:43 And so there's actually a way to compress it 40:46 by just looking at the basically function distance or vector 40:50 distance between probability mass functions or probability 40:53 density functions. 40:55 So I'm going to start with the discrete version 40:58 of the total variation. 40:59 So throughout this chapter, I will 41:03 make the difference between discrete random variables 41:05 and continuous random variables. 41:07 It really doesn't matter. 41:08 All it means is that when I talk about discrete, 41:10 I will talk about probability mass functions. 41:12 And when I talk about continuous, 41:13 I will talk about probability density functions. 41:16 When I talk about probability mass functions, 41:20 I talk about sums. 41:21 When I talk about probability density functions, 41:24 I talk about integrals. 41:26 But they're all the same thing, really. 41:30 So let's start with the probability mass function. 41:32 Everybody remembers what the probability mass 41:34 function of a discrete random variable is. 
41:37 This is the function that tells me for each possible value 41:42 that it can take, the probability 41:43 that it takes this value. 41:46 So the Probability Mass Function, PMF, 41:53 is just the function that, for all x in the sample space, 41:57 tells me the probability that my random variable is 42:01 equal to this little value. 42:03 And I will denote it by P sub theta of X. 42:09 So what I want is, of course, that the sum 42:10 of the probabilities is 1. 42:12 42:17 And I want them to be non-negative. 42:20 Actually, typically we will assume that they are positive. 42:23 Otherwise, we can just remove this x from the sample space. 42:27 And so then for the total variation distance, I mean, 42:31 it's supposed to be the maximum over 42:35 all subsets A of E of the probability 42:39 under theta of A minus the probability under theta prime of A-- 42:43 it's complicated, but really there's 42:44 this beautiful formula that tells me 42:46 that if I look at the total variation between P theta 42:50 and P theta prime, it's actually equal to just 1/2 42:54 of the sum for all x in E of the absolute difference between P 43:04 theta of x and P theta prime of x. 43:12 So that's something you can compute. 43:13 If I give you two probability mass functions, 43:16 you can compute this immediately. 43:19 But if I give you just the densities 43:24 and the original definition, 43:26 where you have to max over all possible events, 43:28 it's not clear you're going to be 43:29 able to do that very quickly. 43:31 So this is really the one you can work with. 43:35 But the other one is really telling you 43:36 what it is doing for you. 43:37 It's controlling the difference of probabilities 43:39 you can compute on any event. 43:41 But here, it's just telling you, well, 43:42 just do it for each simple event, little x. 43:46 It's actually simple. 
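The 1/2-sum formula on the slide translates directly into code. A minimal sketch, using two hypothetical Bernoulli PMFs (for two Bernoullis the formula collapses to the absolute difference of the parameters):

```python
def tv_discrete(p, q):
    """Total variation between two PMFs given as dicts {x: prob}:
    TV(P, Q) = (1/2) * sum over x of |p(x) - q(x)|."""
    support = set(p) | set(q)
    return 0.5 * sum(abs(p.get(x, 0.0) - q.get(x, 0.0)) for x in support)

# Two Bernoulli PMFs with illustrative parameters.
theta, theta_prime = 0.3, 0.5
p = {0: 1 - theta, 1: theta}
q = {0: 1 - theta_prime, 1: theta_prime}
print(round(tv_discrete(p, q), 10))  # 0.2, i.e. |theta - theta_prime|
```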
43:49 Now, if we have continuous random variables-- so 43:53 by the way, I didn't mention, but discrete means Bernoulli. 43:56 Binomial, but not only those that have finite support, 43:59 like Bernoulli has support of size 2, 44:02 binomial NP has support of size n-- 44:05 there's n possible values it can take-- but also Poisson. 44:08 Poisson distribution can take an infinite number 44:10 of values, all the positive integers, 44:13 non-negative integers. 44:16 And so now we have also the continuous ones, 44:18 such as Gaussian, exponential. 44:19 And what characterizes those guys is that they 44:21 have a probability density. 44:24 So the density, remember the way I 44:26 use my density is when I want to compute 44:28 the probability of belonging to some event A. 44:31 The probability of X falling to some subset of the real line A 44:37 is simply the integral of the density on this set. 44:40 That's the famous area under the curve thing. 44:43 So since for each possible value, the probability at X-- 44:49 so I hope you remember that stuff. 44:51 That's just probably something that you 44:57 must remember from probability. 44:59 But essentially, we know that the probability that X is equal 45:02 to little x is 0 for a continuous random variable, 45:04 for all possible X's. 45:06 There's just none of them that actually gets weight. 45:09 So what we have to do is to describe the fact that it's 45:11 in some little region. 45:12 So the probability that it's in some interval, say, a, b, this 45:18 is the integral between A and B of f theta of X, dx. 45:25 So I have this density, such as the Gaussian one. 45:28 And the probability that I belong to the interval a, 45:30 b is just the area under the curve between A and B. 45:36 If you don't remember that, please take immediate remedy. 45:43 So this function f, just like P, is non-negative. 45:48 And rather than summing to 1, it integrates to 1 45:51 when I integrate it over the entire sample space E. 
45:55 And now the total variation, well, it 45:56 takes basically the same form. 45:58 I said that you essentially replace sums 46:00 by integrals when you're dealing with densities. 46:03 And here, it's just saying, rather than having 46:05 1/2 of the sum of the absolute values, 46:07 you have 1/2 of the integral of the absolute value 46:09 of the difference. 46:11 Again, if I give you two densities 46:15 and if you're not too bad at calculus, which you will often 46:18 be, because there's lots of them you can actually not compute. 46:21 But if I gave you, for example, two Gaussian densities, 46:24 exponential minus x squared, blah, blah, blah, and I say, 46:27 just compute the total variation distance, 46:29 you could actually write it as an integral. 46:30 Now, whether you can actually reduce this integral 46:33 to some particular number is another story. 46:35 But you could technically do it. 46:38 So now, you have actually a handle on this thing 46:41 and you could technically ask Mathematica, 46:43 whereas asking Mathematica to take 46:45 the max over all possible events is going to be difficult. 46:48 All right. 46:48 So the total variation has some properties. 46:55 So let's keep on the board the definition that 46:59 involves, say, the densities. 47:05 So think Gaussian in your mind. 47:06 And you have two Gaussians, one with mean theta 47:09 and one with mean theta prime. 47:10 And I'm looking at the total variation between those two 47:13 guys. 47:14 So if I look at P theta minus-- 47:20 sorry. 47:20 TV between P theta and P theta prime, this 47:25 is equal to 1/2 of the integral between f theta, f theta prime. 47:31 And when I don't write it-- 47:32 so I don't write the X, dx but it's there. 47:34 And then I integrate over E. 47:38 So what is this thing doing for me? 47:39 It's just saying, well, if I have-- so 47:41 think of two Gaussians. 47:42 For example, I have one that's here and one that's here. 
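"Asking Mathematica" here just means numerical integration. A sketch for two unit-variance Gaussians; the integration grid is an arbitrary choice, and the closed form used as a sanity check, TV = 2*Phi(|mu1 - mu2|/2) - 1 with Phi the standard normal CDF, comes from integrating f - g over the set where f > g:

```python
import math

def gauss_pdf(x, mu):
    # Unit-variance Gaussian density with mean mu.
    return math.exp(-0.5 * (x - mu) ** 2) / math.sqrt(2 * math.pi)

def tv_gaussians(mu1, mu2, lo=-10.0, hi=10.0, steps=100_000):
    # Midpoint-rule approximation of (1/2) * integral of |f - g|.
    dx = (hi - lo) / steps
    total = 0.0
    for i in range(steps):
        x = lo + (i + 0.5) * dx
        total += abs(gauss_pdf(x, mu1) - gauss_pdf(x, mu2)) * dx
    return 0.5 * total

mu1, mu2 = 0.0, 1.0
# Closed form for this special case: 2 * Phi(|mu1 - mu2| / 2) - 1.
phi = 0.5 * (1 + math.erf((abs(mu1 - mu2) / 2) / math.sqrt(2)))
print(tv_gaussians(mu1, mu2), 2 * phi - 1)  # both ≈ 0.3829
```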
47:44 47:47 So this is let's say f theta, f theta prime. 47:51 This guy is doing what? 47:52 It's computing the absolute value of the difference 47:55 between f and f theta prime. 47:57 You can check for yourself that graphically, this I 48:01 can represent as an area not under the curve, 48:05 but between the curves. 48:10 So this is this guy. 48:11 48:16 Now, this guy is really the integral of the absolute value. 48:20 So this thing here, this area, this 48:22 is 2 times the total variation. 48:25 48:28 The scaling 1/2 really doesn't matter. 48:29 It's just if I want to have an actual correspondence 48:32 between the maximum and the other guy, I have to do this. 48:36 48:39 So this is what it looks like. 48:41 So we have this definition. 48:42 And so we have a couple of properties that come into this. 48:48 The first one is that it's symmetric. 48:49 TV of P theta and P theta prime is 48:51 the same as the TV between P theta prime and P theta. 48:55 Well, that's pretty obvious from this definition. 48:59 I just flip those two, I get the same number. 49:02 It's actually also true if I take the maximum. 49:05 Those things are completely symmetric in theta and theta 49:07 prime. 49:08 You can just flip them. 49:10 It's non-negative. 49:11 Is that clear to everyone that this thing is non-negative? 49:15 I integrate an absolute value, so this thing 49:20 is going to give me some non-negative number. 49:22 And so if I integrate this non-negative number, 49:24 it's going to be a non-negative number. 49:26 The fact also that it's an area tells me 49:29 that it's going to be non-negative. 49:32 The nice thing is that if TV is equal to zero, then 49:36 the two distributions, the two probabilities are the same. 49:42 That means that for every A, P theta of A 49:46 is equal to P theta prime of A. Now, 49:49 there's two ways to see that. 
49:50 The first one is to say that if this integral is 49:53 equal to 0, that means that for almost all X, 49:56 f theta is equal to f theta prime. 49:58 The only way I can integrate a non-negative and get 0 50:01 is that it's 0 pretty much everywhere. 50:05 And so what it means is that the two densities 50:07 have to be the same pretty much everywhere, 50:09 which means that the distributions are the same. 50:11 But this is not really the way you want to do this, 50:13 because you have to understand what 50:15 pretty much everywhere means-- 50:16 which I should really say almost everywhere. 50:18 That's the formal way of saying it. 50:20 But let's go to this definition-- 50:22 50:24 which is gone. 50:26 Yeah. 50:26 That's the one here. 50:28 The max of those two guys, if this maximum is equal to 0-- 50:35 I have a maximum of non-negative numbers, their absolute values. 50:39 Their maximum is equal to 0, well, 50:42 they better be all equal to 0, because if one is not 50:44 equal to 0, then the maximum is not equal to 0. 50:47 So those two guys, for those two things 50:50 to be-- for the maximum to be equal to 0, 50:52 then each of the individual absolute values 50:54 have to be equal to 0, which means that the probability here 50:57 is equal to this probability here for every event A. 51:03 So those two things-- 51:04 this is nice, right? 51:06 That's called definiteness. 51:08 The total variation equal to 0 implies that P theta 51:10 is equal to P theta prime. 51:12 So that's really some notion of distance, right? 51:14 That's what we want. 51:16 If this thing being small implied 51:17 that P theta could be all over the place compared 51:20 to P theta prime, that would not help very much. 51:24 Now, there's also the triangle inequality 51:26 that follows immediately from the triangle 51:28 inequality inside this guy. 
51:32 If I squeeze in some f theta prime prime in there, 51:35 I'm going to use the triangle inequality 51:37 and get the triangle inequality for the whole thing. 51:39 51:42 Yeah? 51:42 AUDIENCE: The fact that you need two definitions 51:45 of the [INAUDIBLE], is it something 51:48 obvious or is it complicated? 51:50 PHILIPPE RIGOLLET: I'll do it for you now. 51:52 So let's just prove that those two things are actually 51:56 giving me the same definition. 51:58 52:00 So what I'm going to do is I'm actually going 52:02 to start with the second one. 52:04 And I'm going to write-- 52:05 I'm going to start with the density version. 52:07 But as an exercise, you can do it for the PMF version 52:10 if you prefer. 52:11 So I'm going to start with the fact that f-- 52:13 52:20 so I'm going to write f and g so I don't have to write f theta and f theta prime. 52:23 So think of this as being f sub theta, and think of this guy 52:27 as being f sub theta prime. 52:29 I just don't want to have to write indices all the time. 52:32 So I'm going to start with this thing, the integral of the absolute value of f 52:34 of X minus g of X, dx. 52:38 The first thing I'm going to do is this is an absolute value, 52:41 so either the number in the absolute value is positive 52:45 and I actually kept it like that, or it's negative 52:47 and I flipped its sign. 52:48 So let's just split between those two cases. 52:51 So this thing is equal to 1/2 the integral of-- 52:55 so let me actually write the set A star as 53:00 being the set of X's such that f of X is larger than g of X. 53:09 So that's the set on which the difference is 53:11 going to be positive; elsewhere the difference is 53:13 going to be negative. 53:14 So this, again, is equivalent to f 53:17 of X minus g of X is positive. 53:23 OK. 53:23 Everybody agrees? 53:24 So this is the set I'm interested in. 53:26 53:29 So now I'm going to split my integral into two parts, 53:31 on A star and its complement. So on A star, f is larger than g, 53:38 so the absolute value is just the difference itself. 
53:40 53:45 So here I put parentheses rather than absolute values. 53:48 And then I have plus 1/2 of the integral on the complement. 53:54 What are you guys used to, to write the complement-- the C 53:57 or the bar? 54:01 The C? 54:01 54:05 And so here on the complement, then f is less than g, 54:08 so this is actually really g of X minus f of X, dx. 54:17 Everybody's with me here? 54:19 So I just said-- 54:20 I mean, those are just rewriting what the definition 54:23 of the absolute value is. 54:24 54:33 OK. 54:33 So now there's nice things that I know about f and g. 54:38 And the two nice things are that the integral of f is equal to 1 54:40 and the integral of g is equal to 1. 54:42 54:46 This implies that the integral of f minus g is equal to what? 54:53 AUDIENCE: 0. 54:54 PHILIPPE RIGOLLET: 0. 54:56 And so now that means that if I want 54:59 to just go from the integral here on A complement 55:04 to the integral on A-- 55:05 or on A star complement to the integral on A star, 55:08 I just have to flip the sign. 55:11 So that implies that the integral on A star 55:14 complement of g of X minus f of X, 55:21 dx, this is simply equal to the integral on A star 55:25 of f of X minus g of X, dx. 55:30 55:40 All right. 55:41 So now this guy becomes this guy over there. 55:46 So I have 1/2 of this plus 1/2 of the same guy, 55:50 so that means that 1/2 of the integral of the absolute value of f 55:55 minus g-- 55:57 so that was my original definition, 55:59 this thing is actually equal to the integral on A star 56:03 of f of X minus g of X, dx. 56:10 56:14 And this is simply equal to P of A star-- 56:21 so say Pf of A star minus Pg of A star. 56:26 56:34 Which one is larger than the other one? 56:36 56:41 AUDIENCE: [INAUDIBLE] 56:43 PHILIPPE RIGOLLET: It is. 56:44 Just look at this board. 56:45 AUDIENCE: [INAUDIBLE] 56:47 PHILIPPE RIGOLLET: What? 
56:48 AUDIENCE: [INAUDIBLE] 56:49 PHILIPPE RIGOLLET: The first one has 56:50 to be larger, because this thing is actually 56:51 equal to a non-negative number. 56:53 56:59 So now I have this absolute value of two things, 57:01 and so I'm closer to the actual definition. 57:04 But I still need to show you that this thing is 57:06 the maximum value. 57:09 So this is definitely at most the maximum over A of Pf 57:17 of A minus Pg of A. 57:21 That's certainly true. 57:24 Right? 57:24 We agree with this? 57:27 Because this is just for one specific A, 57:30 and I'm bounding it by the maximum over all possible A. 57:34 So that's clearly true. 57:36 So now I have to go the other way around. 57:38 I have to show you that the max is actually this guy, A star. 57:44 So why would that be true? 57:45 Well, let's just inspect this thing over there. 57:49 So we want to show that if I take 57:50 any other A in this integral than this guy A star, 57:53 it's actually got to decrease its value. 57:56 So we have this function. 57:57 I'm going to call this function delta. 57:59 58:02 And what we have is-- so let's say 58:03 this function looks like this. 58:04 Now it's the difference between two densities. 58:06 It doesn't have to integrate-- it doesn't 58:09 have to be non-negative. 58:10 But it certainly has to integrate to 0. 58:12 58:15 And so now I take this thing. 58:18 And the A star, what is the set A star here? 58:22 The set A star is the set over which the function 58:25 delta is non-negative. 58:27 58:36 So that's just the definition. 58:37 A star was the set over which f minus g was positive, 58:41 and f minus g was just called delta. 58:44 So what it means is that what I'm really integrating 58:47 is delta on this set. 58:50 So it's this area under the curve, 58:53 just on the positive things. 58:55 Agreed? 58:57 So now let's just make some tiny variations around this guy. 59:03 If I take A to be larger than A star-- 59:08 so let me add, for example, this part here. 
59:10 59:12 That means that when I compute my integral, 59:15 I'm removing this area under the curve. 59:18 It's negative. 59:18 The integral here is negative. 59:20 So if I start adding something to A, the value goes lower. 59:25 If I start removing something from A, like say this guy, 59:29 I'm actually removing this value from the integral. 59:32 So there's no way. 59:33 I'm actually stuck. 59:34 This A star is the one that actually maximizes 59:37 the integral of this function. 59:39 So we used the fact that for any function, 59:49 say delta, the integral over A of delta 59:59 is less than the integral over the set of X's 60:02 such that delta of X is non-negative of delta of X, dx. 60:07 60:10 And that's an obvious fact, just by picture, say. 60:13 60:18 And that's true for all A. Yeah? 60:24 AUDIENCE: [INAUDIBLE] could you use 60:28 like a portion under the axis as like less than 60:33 or equal to the portion above the axis? 60:34 PHILIPPE RIGOLLET: It's actually equal. 60:36 We know that the integral of f minus g-- 60:39 the integral of delta is 0. 60:41 So there's actually exactly the same area above and below. 60:47 But yeah, you're right. 60:49 You could go to the extreme cases. 60:51 You're right. 60:51 60:57 No. 60:57 It would actually still be true, even if there was-- 61:00 if this was a constant, that would still be true. 61:02 Here, I never use the fact that the integral is equal to 0. 61:05 61:11 I could shift this function by 1 so that the integral of delta 61:15 is equal to 1, and it would still 61:18 be true that it's maximized when I take A to be 61:21 the set where it's positive. 61:24 I just need to make sure that there is someplace where it is positive, 61:27 but that's about it. 61:28 61:33 Of course, we used this before, when we made this claim. 61:36 But just the last argument, this last fact 61:38 does not require that. 61:39 61:43 All right. 61:44 So now we have this notion of-- 61:47 I need the-- 61:48 61:52 OK. 
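The maximization argument above can be checked by brute force on a small discrete example: enumerate every event A, and confirm the best one is A* = {x : p(x) > q(x)}, with value matching the 1/2-sum formula. The PMFs below are made up for illustration:

```python
from itertools import combinations

# Two hypothetical PMFs on the support {0, 1, 2, 3}.
p = {0: 0.1, 1: 0.4, 2: 0.3, 3: 0.2}
q = {0: 0.3, 1: 0.2, 2: 0.4, 3: 0.1}
support = list(p)

def prob(dist, event):
    return sum(dist[x] for x in event)

# Brute force: P(A) - Q(A) over ALL 2^4 possible events A.
best = max(
    prob(p, A) - prob(q, A)
    for r in range(len(support) + 1)
    for A in combinations(support, r)
)
a_star = [x for x in support if p[x] > q[x]]          # the claimed maximizer
tv_formula = 0.5 * sum(abs(p[x] - q[x]) for x in support)
print(round(best, 10), round(prob(p, a_star) - prob(q, a_star), 10),
      round(tv_formula, 10))  # all three agree
```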
61:53 So we have this notion of distance 61:57 between probability measures. 61:58 I mean, these things are exactly what-- 62:00 if I were to be in a formal math class and I said, 62:03 here are the axioms that a distance should satisfy, 62:06 those are exactly those things. 62:08 If it's not satisfying these things, 62:10 it's called a pseudo-distance or quasi-distance or just metric 62:13 or nothing at all, honestly. 62:15 So it's a distance. 62:16 It's symmetric, non-negative, equal to 0 62:18 if and only if the two arguments are equal, and 62:21 it satisfies the triangle inequality. 62:25 And so that means that we have this actual total variation 62:28 distance between probability distributions. 62:31 And here is now a statistical strategy to implement our goal. 62:36 Remember, our goal was to spit out 62:38 a theta hat such that P theta 62:41 hat was close to P theta star. 62:45 So hopefully, we were trying to minimize the total variation 62:48 distance between P theta hat and P theta star. 62:51 Now, we cannot do that, because just by this fact, this slide, 62:55 if we wanted to do that directly, we would just take-- 62:57 well, let's take theta hat equals theta star and that will 62:59 give me the value 0. 63:00 And that's the minimum possible value we can take. 63:03 The problem is that we cannot compute 63:04 the total variation to something that we don't know. 63:07 We know how to compute total variations if I give you 63:09 the two arguments. 63:10 But here, one of the arguments is not known. 63:12 P theta star is not known to us, so we need to estimate it. 63:16 And so here is the strategy. 63:18 Just build an estimator of the total variation 63:21 distance between P theta and P theta star 63:24 for all candidate theta, all possible theta 63:27 in capital theta. 63:30 Now, if this is a good estimate, then when I minimize it, 63:33 I should get something that's close to P theta star. 63:37 So here's the strategy. 
63:38 This is my function that maps theta 63:40 to the total variation between P theta and P theta star. 63:44 I know it's minimized at theta star. 63:47 That's definitely TV of P-- and the value here, the y-axis 63:51 should say 0. 63:53 And so I don't know this guy, so I'm 63:54 going to estimate it by some estimator that 63:56 comes from my data. 63:57 Hopefully, the more data I have, the better this estimator is. 64:00 And I'm going to try to minimize this estimator now. 64:03 And if the two things are close, then the minima 64:05 should be close. 64:07 That's a pretty good estimation strategy. 64:09 The problem is that it's very unclear 64:11 how you would build this estimator of TV, 64:13 of the Total Variation. 64:18 So building estimators, as I said, 64:21 typically consists in replacing expectations by averages. 64:25 But there's no simple way of expressing the total variation 64:29 distance as the expectations with respect 64:31 to theta star of anything. 64:33 So what we're going to do is we're 64:36 going to move from total variation distance 64:38 to another notion of distance that sort of has 64:41 the same properties and the same feeling 64:43 and the same motivations as the total variation distance. 64:47 But for this guy, we will be able to build 64:49 an estimate for it, because it's actually 64:51 going to be of the form expectation of something. 64:53 And we're going to be able to replace 64:55 the expectation by an average and then minimize this average. 65:00 So this surrogate for total variation distance 65:04 is actually called the Kullback-Leibler divergence. 65:07 And why we call it divergence is because it's actually 65:09 not a distance. 65:11 It's not going to be symmetric to start with. 65:14 So this Kullback-Leibler or even KL divergence-- 65:17 I will just refer to it as KL-- 65:20 is actually just more convenient. 65:22 But it has some roots coming from information theory, which 65:27 I will not delve into. 
65:29 But if any of you is actually a Course 6 student, 65:31 I'm sure you've seen that in some-- 65:32 I don't know-- course that has any content on information 65:37 theory. 65:39 All right. 65:39 So the KL divergence between two probability measures, P theta 65:42 and P theta prime-- 65:43 and here, as I said, it's not going to be symmetric, 65:47 so it's very important for you to specify 65:49 which order you say it in, between P theta and P theta 65:51 prime. 65:52 It's different from saying between P theta prime and P 65:55 theta. 65:56 And so we denote it by KL. 65:58 And so remember, before we had either the sum or the integral 66:04 of 1/2 of the absolute value of the distance 66:07 between the PMFs, or 1/2 of the absolute values 66:10 of the distances between the probability density functions. 66:17 And then we replace this absolute value 66:19 of the distance divided by 2 by this weird function. 66:24 This function is P theta, log P theta, 66:28 divided by P theta prime. 66:30 That's the function. 66:31 That's a weird function. 66:34 OK. 66:35 So this was what we had. 66:38 66:40 That's the TV. 66:41 66:44 And the KL, if I use the same notation, f and g, 66:48 is the integral of f of X, log of f of X over g of X, dx. 66:57 67:01 It's a bit different. 67:04 And I go from discrete to continuous using an integral. 67:09 Everybody can read this. 67:10 Everybody's fine with this. 67:11 Is there any uncertainty about the actual definition here? 67:15 So here I go straight to the definition, 67:17 which is just plugging the functions 67:19 into some integral and computing. 67:22 So I don't bother with maxima or anything. 67:24 I mean, there is something like that, 67:26 but it's certainly not as natural as the total variation. 67:29 Yes? 67:30 AUDIENCE: The total variation, [INAUDIBLE]. 67:33 67:38 PHILIPPE RIGOLLET: Yes, just because it's 67:40 hard to build anything from total variation, 67:42 because I don't know it. 67:43 So it's very difficult. 
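The definition just written down, coded for the discrete case with two hypothetical Bernoulli distributions; swapping the arguments gives a different number, which is the asymmetry just mentioned:

```python
import math

def kl_discrete(p, q):
    """KL(P || Q) = sum over x of p(x) * log(p(x) / q(x)),
    for PMFs given as dicts {x: prob}."""
    return sum(px * math.log(px / q[x]) for x, px in p.items() if px > 0)

theta, theta_prime = 0.3, 0.5  # illustrative parameters
p = {0: 1 - theta, 1: theta}
q = {0: 1 - theta_prime, 1: theta_prime}
print(kl_discrete(p, q))  # ≈ 0.0823
print(kl_discrete(q, p))  # ≈ 0.0872 -- different order, different value
```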
But if you can actually-- 67:45 and even computing it between two Gaussians, 67:47 just try it for yourself. 67:49 And please stop doing it after at most six minutes, 67:52 because you won't be able to do it. 67:54 And so it's just very hard to manipulate, 67:56 like this integral of absolute values of differences 67:59 between probability density function, at least 68:01 for the probability density functions 68:02 we're used to manipulate is actually a nightmare. 68:04 And so people prefer KL, because for the Gaussian, 68:08 this is going to be theta minus theta prime squared. 68:10 And then we're going to be happy. 68:12 And so those things are much easier to manipulate. 68:15 But it's really-- the total variation 68:18 is telling you how far in the worst case 68:20 the two probabilities can be. 68:21 This is really the intrinsic notion 68:23 of closeness between probabilities. 68:25 So that's really the one-- if we could, 68:27 that's the one we would go after. 68:30 Sometimes people will compute them numerically, 68:32 so that they can say, oh, here's the total variation distance I 68:34 have between those two things. 68:36 And then you actually know that that 68:38 means they are close, because the absolute value-- if I tell 68:41 you total variation is 0.01, like we did here, 68:44 it has a very specific meaning. 68:46 If I tell you the KL divergence is 0.01, 68:49 it's not clear what it means. 68:50 68:55 OK. 68:55 So what are the properties? 68:58 The KL divergence between P theta and P theta prime 69:00 is different from the KL divergence between P theta 69:03 prime and P theta in general. 69:05 Of course, in general, because if theta 69:07 is equal to theta prime, then this certainly is true. 69:11 So there's cases when it's not true. 69:14 The KL divergence is non-negative. 69:17 Who knows the Jensen's inequality here? 69:19 That should be a subset of the people who 69:21 raised their hand when I asked what a convex function is. 69:25 All right. 
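The Gaussian fact the professor quotes-- that for unit-variance Gaussians the KL is "theta minus theta prime squared," up to a factor 1/2-- can be checked numerically. The integration grid below is an arbitrary choice:

```python
import math

def gauss_pdf(x, mu):
    # Unit-variance Gaussian density with mean mu.
    return math.exp(-0.5 * (x - mu) ** 2) / math.sqrt(2 * math.pi)

def kl_gauss_numeric(mu1, mu2, lo=-12.0, hi=12.0, steps=100_000):
    # Midpoint-rule approximation of the integral of f * log(f / g).
    dx = (hi - lo) / steps
    total = 0.0
    for i in range(steps):
        x = lo + (i + 0.5) * dx
        f = gauss_pdf(x, mu1)
        total += f * math.log(f / gauss_pdf(x, mu2)) * dx
    return total

mu1, mu2 = 0.3, 1.5  # illustrative means
# Closed form for this case: (mu1 - mu2)^2 / 2.
print(kl_gauss_numeric(mu1, mu2), (mu1 - mu2) ** 2 / 2)  # both ≈ 0.72
```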
69:26 So you know what Jensen's inequality is. 69:27 This is Jensen's-- the proof is just a one-step 69:30 Jensen's inequality, which we will not go into in detail. 69:33 But that's basically an inequality 69:35 involving the expectation of a convex function 69:38 of a random variable compared to the convex function 69:40 of the expectation of a random variable. 69:42 69:45 If you know Jensen, have fun and prove it. 69:48 What's really nice is that if the KL is equal to 0, 69:51 then the two distributions are the same. 69:55 And that's something we're looking for. 69:57 Everything else we're happy to throw out. 69:59 And actually, if you pay attention, 70:00 we're actually really throwing out everything else. 70:03 So they're not symmetric. 70:05 It does not satisfy the triangle inequality in general. 70:08 But it's non-negative and it's 0 if and only if the two 70:12 distributions are the same. 70:13 And that's all we care about. 70:15 And that's what we call a divergence rather than 70:17 a distance, and a divergence will be enough for our purposes. 70:21 And actually, this asymmetry, the fact 70:24 that it's not flipping-- the first time I saw it, 70:26 I was just annoyed. 70:27 I was like, can we just, I don't 70:29 know, take the average of the KL between P theta 70:31 and P theta prime and the KL between P theta prime and P theta? 70:34 You would think maybe you could do this. 70:36 You just symmetrize it by taking the average of the two 70:39 possible values it can take. 70:41 The problem is that this will still not satisfy the triangle 70:44 inequality. 70:45 And there's no way basically to turn it into something 70:48 that is a distance. 70:49 But the divergence is doing a pretty good thing for us. 70:52 And this is what will allow us to estimate it and basically 70:55 overcome what we could not do with the total variation. 
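[The remark about symmetrizing can be checked numerically. A sketch, assuming Bernoulli distributions; the three parameter values 0.1, 0.5, 0.9 are just an illustration I picked, and they make the symmetrized KL violate the triangle inequality:]

```python
import math

def kl_bern(a, b):
    # KL divergence between Bernoulli(a) and Bernoulli(b).
    return a * math.log(a / b) + (1 - a) * math.log((1 - a) / (1 - b))

def sym_kl(a, b):
    # Symmetrized KL: the average of the two possible orderings.
    return 0.5 * (kl_bern(a, b) + kl_bern(b, a))

# A distance would require sym_kl(0.1, 0.9) <= sym_kl(0.1, 0.5) + sym_kl(0.5, 0.9).
direct = sym_kl(0.1, 0.9)
via_middle = sym_kl(0.1, 0.5) + sym_kl(0.5, 0.9)
```

[Here `direct` comes out larger than `via_middle`, so averaging the two orderings still does not give a metric, exactly as claimed.]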
71:03 So the first thing that you want to notice 71:06 is the total variation distance-- 71:08 the KL divergence, sorry, is actually 71:10 an expectation of something. 71:12 Look at what it is here. 71:15 It's the integral of some function against a density. 71:20 That's exactly the definition of an expectation, right? 71:25 So this is the expectation of this particular function 71:29 with respect to this density f. 71:31 So in particular, if I call this density f-- if I say, 71:35 I want the true distribution to be the first argument, 71:38 this is an expectation with respect 71:39 to the true distribution from which my data is actually 71:42 drawn of the log of this ratio. 71:45 So ha ha. 71:46 I'm a statistician. 71:47 Now I have an expectation. 71:49 I can replace it by an average, because I have data 71:51 from this distribution. 71:52 And I could actually replace the expectation by an average 71:54 and try to minimize here. 71:56 The problem is that-- 71:57 actually the star here should be in front of the theta, 72:00 not of the P, right? 72:01 That's P theta star, not P star theta. 72:04 But here, I still cannot compute it, 72:05 because I have this P theta star that shows up. 72:08 I don't know what it is. 72:10 And that's now where the log plays a role. 72:13 If you actually pay attention, I said 72:15 you can use Jensen to prove all this stuff. 72:16 You could actually replace the log by any concave function. 72:21 That would be a different divergence. 72:22 That's called an f-divergence. 72:24 But the log itself has a very, very specific property, 72:26 which allows us to say that the log of the ratio 72:29 is the difference of the logs. 72:33 Now, this thing here does not depend on theta. 72:38 If I think of this KL divergence as a function of theta, 72:43 then the first part is actually a constant. 72:45 If I change theta, this thing is never going to change. 72:47 It depends only on theta star. 
72:49 So if I look at this function KL-- 72:51 73:03 so if I look at the function, theta maps 73:05 to KL P theta star, P theta, it's 73:11 of the form expectation with respect to theta star 73:15 of log of P theta star of X. And then I 73:23 have minus expectation with respect to theta star of log 73:29 of P theta of X. 73:33 Now as I said, this thing here, this second expectation 73:38 is a function of theta. 73:39 When theta changes, this thing is going to change. 73:42 And that's a good thing. 73:43 We want something that reflects how close theta and theta 73:45 star are. 73:46 But this thing is not going to change. 73:48 This is a fixed value. 73:49 Actually, it's the negative entropy of P theta star. 73:53 And if you've heard of KL, you've 73:54 probably heard of entropy. 73:55 And that's what-- it's basically minus the entropy. 73:58 And that's a quantity that just depends on theta star. 74:01 But it's just a number. 74:03 I could compute this number if I told 74:05 you this is N(theta star, 1). 74:07 You could compute this. 74:09 So now I'm going to try to minimize 74:11 the estimate of this function. 74:14 And minimizing a function or a function plus a constant 74:16 is the same thing. 74:18 I'm just shifting the function here or here, 74:20 but it's the same minimizer. 74:23 OK. 74:24 So the function that maps theta to KL of P theta star 74:28 to P theta is of the form constant minus this expectation 74:32 of a log of P theta. 74:35 Everybody agrees? 74:38 Are there any questions about this? 74:40 Are there any remarks, including I 74:42 have no idea what's happening right now? 74:46 OK. 74:46 We're good? 74:47 Yeah. 74:48 AUDIENCE: So when you're actually employing this method, 74:50 how do you know which theta to use as theta star and which 74:52 isn't? 74:53 PHILIPPE RIGOLLET: So this is not a method just yet, right? 74:55 I'm just describing to you what the KL divergence 74:57 between two distributions is. 
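[The decomposition can be verified on the Bernoulli model from earlier in the lecture. A sketch with my own helper names: the KL splits into a constant (the negative entropy of P theta star) minus the expected log-likelihood, and only the second piece moves with theta:]

```python
import math

def kl_bern(theta_star, theta):
    # KL between Bernoulli(theta_star) and Bernoulli(theta), computed directly.
    return (theta_star * math.log(theta_star / theta)
            + (1 - theta_star) * math.log((1 - theta_star) / (1 - theta)))

def neg_entropy(theta_star):
    # E_{theta*}[log p_{theta*}(X)]: the constant term, depending only on theta_star.
    return (theta_star * math.log(theta_star)
            + (1 - theta_star) * math.log(1 - theta_star))

def expected_log_lik(theta_star, theta):
    # E_{theta*}[log p_theta(X)]: the only part that varies with theta.
    return theta_star * math.log(theta) + (1 - theta_star) * math.log(1 - theta)
```

[Since KL is non-negative and zero only at theta equal to theta star, `expected_log_lik` is maximized at theta star itself, which is the whole point of the method being set up.]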
74:58 If you really wanted to compute it, 75:00 you would need to know what P theta star is 75:01 and what P theta is. 75:02 AUDIENCE: Right. 75:03 PHILIPPE RIGOLLET: And so here, I'm just saying at some point, 75:06 we still-- so here, you see-- 75:07 so now let's move on one step. 75:09 I don't know the expectation with respect to theta star. 75:12 But I have data that comes from the distribution P theta star. 75:15 So the expectation, by the law of large numbers, 75:17 should be close to the average. 75:19 And so what I'm doing is I'm replacing any-- 75:23 I can actually-- this is a very standard estimation method. 75:27 You write something as an expectation with respect 75:30 to the data-generating process of some function. 75:34 And then you replace this by the average of this function. 75:37 And the law of large numbers tells me 75:38 that those two quantities should actually be close. 75:41 Now, it doesn't mean that's going to be the end of the day, 75:43 right. 75:44 When we did Xn bar, that was the end of the day. 75:46 We had an expectation. 75:47 We replaced it by an average. 75:49 And then we were done. 75:51 But here, we still have to do something, 75:53 because this is not telling me what theta is. 75:55 Now I still have to minimize this average. 75:58 So this is now my candidate estimator for KL, KL hat. 76:04 And that's the one where I said, well, it's 76:06 going to be of the form of a constant. 76:07 And this constant, I don't know. 76:09 You're right. 76:09 I have no idea what this constant is. 76:11 It depends on P theta star. 76:13 But then I have minus something that I can completely compute. 76:16 If you give me data and theta, I can compute this entire thing. 76:20 And now what I claim is that the minimizer of f or f plus-- 76:25 f of x or f of x plus 4 are the same thing, 76:28 or say 4 plus f of x. I'm just shifting 76:32 the plot of my function up and down, 76:34 but the minimizer stays exactly where it is. 
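[The replace-expectation-by-average step can be sketched end to end on Bernoulli data. Everything below (the seed, the sample size, the grid resolution) is an arbitrary choice of mine; the point is only that minimizing the empirical average lands near theta star:]

```python
import math
import random

def avg_neg_log_lik(theta, xs):
    # Empirical average of -log p_theta(X_i) for Bernoulli data.
    # The unknown constant (negative entropy of P_theta*) is simply dropped:
    # it shifts the whole curve up or down but not the minimizer.
    n = len(xs)
    return -sum(x * math.log(theta) + (1 - x) * math.log(1 - theta)
                for x in xs) / n

random.seed(0)                     # fixed seed, for reproducibility
theta_star = 0.3                   # the "true" parameter, unknown in practice
data = [1 if random.random() < theta_star else 0 for _ in range(2000)]

# Minimize the empirical average over a grid of candidate thetas.
grid = [i / 1000 for i in range(1, 1000)]
theta_hat = min(grid, key=lambda t: avg_neg_log_lik(t, data))
```

[For Bernoulli, the minimizer is (up to the grid resolution) the sample mean, so the circle closes: minimizing the estimated KL recovers Xn bar, the estimator from the beginning of the lecture.]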
76:36 76:39 If I have a function-- 76:41 76:43 so now I have a function of theta. 76:45 76:51 This is KL hat of P theta star, P theta. 76:56 And it's of the form-- it's a function like this. 76:58 I don't know where this function is. 77:00 It might very well be this function or this function. 77:06 Every time it's a translation on the y-axis of all these guys. 77:10 And the value that I translated by depends on theta star. 77:14 I don't know what it is. 77:15 But what I claim is that the minimizer is always this guy, 77:19 regardless of what the value is. 77:22 OK? 77:25 So when I say constant, it's a constant with respect to theta. 77:28 It's an unknown constant. 77:29 But it's with respect to theta, so without loss of generality, 77:32 I can assume that this constant is 0 for my purposes, 77:36 or 25 if you prefer. 77:38 77:41 All right. 77:41 So we'll just keep going on this property next time. 77:46 And we'll see how from here we can move on to-- 77:49 the likelihood is actually going to come out of this formula. 77:51 Thanks. 77:53
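[The picture being drawn, that a vertical shift never moves the minimizer, is a one-liner to check. A throwaway sketch with an arbitrary convex function and an arbitrary constant standing in for the unknown entropy term:]

```python
def argmin_on_grid(f, grid):
    # Return the grid point where f is smallest.
    return min(grid, key=f)

grid = [i / 100 for i in range(101)]
f = lambda t: (t - 0.4) ** 2          # some function of theta
g = lambda t: (t - 0.4) ** 2 + 25.0   # the same function shifted by an unknown constant
```

[Whatever the shift, `argmin_on_grid` returns the same point, which is why the constant can be taken to be 0 (or 25) without loss of generality.]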