Transcript

The following content is provided under a Creative Commons license. Your support will help MIT OpenCourseWare continue to offer high quality educational resources for free. To make a donation or to view additional materials from hundreds of MIT courses, visit MIT OpenCourseWare at ocw.mit.edu.

PHILIPPE RIGOLLET: It doesn't want to run Flash Player, so I had to run them on Chrome. All right, so let's move on to our second chapter. And hopefully, in this chapter, you will feel a little better if you felt like it was going a bit fast in the first chapter. And the main reason-- we actually went fast, especially in terms of confidence intervals. Some of you came and asked me, what do you mean, this is a confidence interval? What does it mean that it's in there with probability 95%, et cetera? I just went really fast because I didn't want to give you a first week doing probability only, without understanding what the statistical context for it was. So hopefully, all these things that we've done in terms of probability, you actually know why we've been doing them. And so we're basically going to go back to what we were doing, maybe start with some statistical setup. But the goal of this lecture is really going to go back again to what we've seen, from a purely statistical perspective. All right?

So the first thing we're going to do is explain why we're doing statistical modeling, right? So in practice, if you have data, if you observe a bunch of points-- here, I gave you some numbers, for example. So here's a partial data set with the number of siblings, including self, that was collected from college students a few years back. So I was teaching a class like yours and actually asked students to go and fill out some Google form and tell me a bunch of things. And one of the questions was, including yourself, how many siblings do you have? And so they gave me this list of numbers, right? And there's many ways I can think of this list of numbers, right? I could think of it as being just a discrete distribution on the set of numbers between 1-- I know there's not going to be an answer which is less than 1, unless, well, someone doesn't understand the question. But all the answers I should get are positive integers-- 1, 2, 3, et cetera. And there probably is an upper bound, but I don't know it off the top of my head. So maybe I should say 100. Maybe I should say 15. It depends, right? And so I think the largest number I got for this was 6. All right? So here you can see you have pretty standard families, you know, lots of 1s, 2s, and 3s.

What statistical modeling is doing is trying to compress this information that I could actually describe in a very naive way. So let's start with the basic, usual statistical setup, right? So many of the boards will start with something that looks like X1, ..., Xn, random variables. And what I'm going to assume, as we said, typically, is that those guys are IID. And they have some distribution, all right? So they all share the same distribution.
And the fact that they're IID is so that I can actually do statistics. Statistics means looking at the global averaging thing, so that I can actually get a sense of what the global behavior is for the population, right? If I start assuming that those things are not identically distributed-- they all live on their own-- that my sequence of numbers is your number of siblings, the shoe size of this person, the depth of the Charles River, and I start measuring a bunch of stuff-- there's nothing I can actually get together. I need to have something that's cohesive. And so here, I collected some data that was cohesive.

And so the goal here-- the first thing is to say, what is the distribution that I actually have here, right? So I could actually be very general. I could just say it's some distribution P. And let's say those are random variables, not random vectors, right? I could collect entire vectors about students, but let's say those are just random variables. And so now I can start making assumptions on this distribution P, right? What can I say about a distribution? Well, maybe if those numbers are continuous, for example, I could assume they have a density-- a probability density function. That's already an assumption. Maybe I could start to assume that their probability density function is smooth. That's another assumption. Maybe I could actually assume that it's piecewise constant. That's even better, right? And those things make my life simpler and simpler, because what I do by making these successive assumptions is reducing the degrees of freedom of the space in which I am actually searching for the distribution. And so what we actually want is to have something which is small enough so that we can actually have some averaging going on. But we also want something which is big enough that it can actually express-- that it has a chance of actually containing a distribution that makes sense for us.

So let's start with the simplest possible example, which is when the Xi's belong to {0, 1}. And as I said, here, we don't have a choice. The distribution of those guys has to be Bernoulli. And since they are IID, they all share the same p. So that's definitely the simplest possible thing I could think of. They are just Bernoulli p. And so all I would have to figure out in this case is p. And this is the simplest case. And unsurprisingly, it has the simplest answer, right? We will come back to this example when we study maximum likelihood estimators or estimators by the method of moments. But at the end of the day, the thing that we did-- the thing that we will do-- is always the naive estimator you would come up with, which is the proportion of 1s. And this will be, in pretty much all respects, the best estimator you can think of. All right? So then we're going to try to assess its performance. And we saw how to do that in the first chapter as well. So this problem here somehow is completely understood. We'll come back to it. But there's nothing fancy that is going to happen. But now, I could have some more complicated things.
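As a quick aside, here is a minimal sketch of that "proportion of 1s" estimator for the Bernoulli model; the simulated sample and the function name are made up for illustration and are not from the lecture.

import random

def proportion_of_ones(xs):
    # Naive estimator of p in the Bernoulli(p) model:
    # the proportion of 1s, i.e. the sample mean of 0/1 data.
    return sum(xs) / len(xs)

# Hypothetical IID Bernoulli(p) sample with true p = 0.3.
random.seed(0)
true_p = 0.3
sample = [1 if random.random() < true_p else 0 for _ in range(200)]
print(proportion_of_ones(sample))  # should come out close to 0.3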
For example, in the example of the students now, my Xi's belong to the sequence of integers 1, 2, 3, et cetera, OK, which is also denoted by N-- maybe without 0, or with 0 if you want to put 0 in there, right? So the positive integers. Or I could actually just maybe put in some prior knowledge about how humans have time to have families. But maybe some people thought of their college mates as being their brothers and sisters, and one student would actually put 465 siblings, because we're all good friends. Or maybe they actually think that all their Facebook contacts are actually their siblings. And so you never know what's going to happen. So maybe you want to account for this, but maybe you know that people are reasonable, and they will actually give you something like this.

Now intuitively, maybe you would say, well, why would you bother doing this if you're not really sure about the 20? But I think that probably all of you intuitively guess that it is probably a good idea to start putting in this kind of assumption rather than allowing for any number in the first place, because this eventually will be injected into the precision of our estimators. If I allow anything, it's going to be more complicated for me to get an accurate estimator. If I know that the numbers are either 1 or 2, then I'm actually going to be slightly more accurate as well. All right? Because I know that if, for example, somebody put a 5, I can remove it. Then it's not going to actually corrupt my estimator.

All right, so now, let's say we actually agree that we have numbers. And here I put seven numbers, OK? So I just said, well, let's assume that the numbers I'm going to get are going to be 1 all the way to, say, this number that I denote by "larger than or equal to 7," which is a placeholder for any number that is 7 or more, OK? Because I know maybe I don't want to distinguish between people that have 9 or 25 siblings. OK, and so now, this is a distribution on seven possible values-- a discrete distribution. And you know from your probability class that the way you describe this distribution is using the probability mass function, OK, or PMF. So that's how we describe a discrete distribution. And the PMF is just a list of numbers, right? So as I wrote here, you have a list of numbers. And here, you write the possible values that your random variable can take. And here you write the probability that your random variable takes this value. So the possible values being 1, 2, 3, all the way to larger than or equal to 7. And then I'm trying to estimate those numbers. Right? If I give you those numbers, at least up to this, you know, compression of all numbers that are larger than or equal to 7, you have the full description of your distribution. And that is the ultimate goal of statistics, right? The ultimate goal of statistics is to say what distribution your data came from, because that's basically the best you're going to be able to do.
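To make that "list of numbers" concrete, here is a minimal sketch of estimating the PMF (p1, ..., p>=7) by the observed proportions; the sibling counts below are invented for illustration and are not the actual class data.

from collections import Counter

# Hypothetical sibling counts (not the real class data).
data = [2, 3, 2, 1, 2, 3, 2, 1, 1, 2, 2, 3, 1, 6, 2, 1, 2, 3, 2, 2]

n = len(data)
counts = Counter(min(x, 7) for x in data)  # lump everything >= 7 together
pmf_hat = {k: counts.get(k, 0) / n for k in range(1, 8)}
print(pmf_hat)  # values never observed (like 4 and 5) get the estimate 0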
Now admittedly, if I started looking at the fraction of 1s, and the fraction of 2s, and the fraction of 3s, et cetera, I would actually eventually get those numbers-- just like looking at the fraction of 1s gave me a good estimate for p in the Bernoulli case, it would do the same in this case, right? It's a pretty intuitive idea. It's just the law of large numbers. Everybody agrees with that? If I look at the proportion of 1s, the proportion of 2s, the proportion of 3s, that should actually give me something that gets closer and closer, as my sample size increases, to what I want.

But the problem is, if my sample size is not huge-- here I have seven numbers to estimate. And if I have 20 observations, the ratio is not really in my favor-- 20 observations to estimate seven parameters-- some of them are going to be pretty off, typically the ones with the large values. If you have only 20 students, look at the list of numbers. I don't know how many numbers I have, but it probably is close to 20-- maybe 15 or something. And so if you look at this list, nobody's actually-- nobody has four or more siblings, right? There's no such person. So that means that eventually, from this data set, my estimates-- so those numbers I denote by, say, p1, p2, p3, et cetera-- those estimates-- p4 hat would be equal to what from this data? 0, right? And p5 hat would be equal to 0, and p6 hat would be equal to 0. And p larger than or equal to 7 hat would be equal to 0. That would be my estimate from this data set.

So maybe this is not-- maybe I want to actually pull some information from the people who have fewer siblings to try to make a guess, which is probably slightly better for the larger values, right? It's pretty clear that on average, there is more than 0-- the proportion of the population of households that have four children or more is definitely more than 0, all right? So it means that my data set is not representative. So what I'm going to try to do is to find a model that tries to use the data I have for the smaller values that I can observe and just push it up to the other ones. And so what we can do is to just reduce those parameters into something that's understood. And this is part of the modeling that I talked about in the first place.

Now, how do you succinctly describe a number of something? Well, one thing that you do is the Poisson distribution, right? Why do Poisson? There's many reasons. Again, that's part of statistical modeling. But once you know that you have a number of something that can be modeled by a Poisson, why not try a Poisson, right? You could just fit a Poisson. And the Poisson is something that looks like this. And I guess you've all seen it. But if X follows a Poisson distribution with parameter lambda, then the probability that X is equal to little x is equal to lambda to the x, over x factorial, e to the minus lambda. OK? And if you did the sheet that I gave you on the first day, you can check those numbers. So this is, of course, for x equals 0, 1, et cetera, right? So x is in the natural integers.
And if you sum from x equals 0 to infinity, this thing you get is e to the lambda. And so they cancel, and you have a sum which is equal to 1, which is indeed a PMF. But what's key about this PMF is that it never takes value 0. Like, this thing is always strictly positive. So whatever value of lambda I find from this data will give me something that's certainly more interesting than just putting the value 0. But more importantly, rather than having to estimate seven parameters and, as a consequence, to actually have to estimate 1, 2, 3, 4 of them as being equal to 0, I have only one parameter to estimate, which is lambda. The problem with doing this is that now lambda may not be just something as simple as computing the average number. Right? In this case, it will be. But in many instances, it's actually not clear that, with this parametrization by lambda that I chose, I'm going to be able to estimate lambda just by computing the average number that I get. It will be the case here. But if it's not, remember this example of the exponential we did in the last lecture-- we could use the delta method and things like that to estimate them.

All right, so here's modeling 101. So the purpose of modeling is to restrict the space of possible distributions to a subspace that's actually plausible, but much simpler for me to estimate. So we went from all distributions on seven values, which is a large space-- that's a lot of things-- to something which is just one number. This number is positive. Any question about the purpose of doing this?

OK, so we're going to have to do a little bit of formalism now. And so if we want to talk-- this is a statistics classroom. I'm not going to want to talk about the Poisson model specifically every single time. I'm going to want to talk about generic models. And then you're going to be able to plug in your favorite word-- Poisson, binomial, exponential, uniform-- all these words that you've seen, you're going to be able to plug in there. But we're just going to have some generic notation and some generic terminology for a statistical model. All right? So here is the formal definition. So I'm going to go through it with you.

OK, so the definition is that of a statistical model. Sorry, that's a statistical experiment, I should say. So a statistical experiment is actually just a pair-- E, and that's a set-- and a family of distributions P theta, where theta ranges in some set capital Theta. OK? So I hope you're up to date with your Greek letters. So the small theta is in the capital Theta. And I know I don't have the best handwriting, so if you don't see something, just ask me. And so this thing now-- so each of these guys is a probability distribution. All right? So for example, this could be a Poisson with parameter theta, or a Bernoulli with parameter theta-- OK, or an exponential with parameter-- I don't know-- 1 over theta squared if you want. OK, but they're just indexed by theta. But for each theta, this completely describes the distribution. It could be more complicated.
This theta could be a pair-- a (mu, sigma squared). And that could actually give you some N(mu, sigma squared). OK, so anything where, rather than actually giving you a full distribution, I can compress it into a parameter. But it could be worse. It could be this guy here. Right? Theta could be (p1, ..., p larger than or equal to 7). And my distribution could just be something that has PMF p1, ..., p larger than or equal to 7. That's another parameter. This one is seven-dimensional. This one is two-dimensional. And all these guys are just one-dimensional. All these guys are parameters. Is that clear? What's important here is that once they give you theta, you know exactly all the probabilities associated with this random variable. You know its distribution perfectly.

So this is the definition. Is that clear? Is there a question about this distribution-- about this definition, sorry? All right. So really, the key thing is the statistical model associated to a statistical experiment. OK? So let's just see some examples. It's probably just better, because, again, the formalism is never really clear. Actually, that's the next slide.

OK, so there's two things we need to assume. OK, so the purpose of a statistical model is, once I estimate the parameter, I actually know exactly what distribution it has, OK? So it means that I could potentially have several parameters that give me the same distribution, and that would still be fine, because I could estimate one guy, or I could estimate the other guy, and I would still recover the underlying distribution of my data. The problem is that this creates really annoying theoretical problems-- like, things don't work, the algorithms won't work, the guarantees won't work. And so what we typically assume is that the model is so-called well-specified. Sorry, that's not well-specified. I'm jumping ahead of myself.

OK, well-specified means that your data-- the distribution of your data-- is actually one of those guys. OK? So some vocabulary-- well-specified means that, for my observations X, there exists a theta in capital Theta such that X follows P sub theta. I should put a double bar. OK, so that's what well-specified means. So that means that the distribution of your actual data is just one of those guys. This is a bit strong of an assumption. It's strong in the sense that-- I don't know if you've heard of this sentence, which-- I can tell you who it's attributed to, but that probably means that this person did not come up with it. But it says that all models are wrong, but some of them are useful.

All right, so "all models are wrong" means that maybe it's not true that this Poisson distribution that I assume for the number of siblings of college students-- maybe that's not perfectly correct. Maybe there's a spike at three, right? Maybe there's a spike at one, because, you know, maybe those are slightly more educated families. They have fewer children. Maybe this is actually not exactly perfect.
But it's probably good enough for our purposes. And when we make this assumption, we're actually assuming that the data really comes from a Poisson model. There is a lot of research that goes on about misspecified models, and that tells you how well you're doing in the model that's the closest to the actual distribution. So that's pretty much it. Yeah?

AUDIENCE: [INAUDIBLE].

PHILIPPE RIGOLLET: So my data-- so X is always the way I denote one of the generic observations, right? So my observations are X1, ..., Xn. And they're IID with distribution P-- always. So X is just one of those guys. I don't want to write X5 or X4. They're IID, so they all have the same distribution. So OK-- no, no, no. They're all IID, so they all have the same P theta. They all have the same P, which means they all have the same P theta. So I can pick any one of them. So I just remove the index, just so we're clear. OK? So when I write X, I just mean, think of X1. Right, they're IID. I can pick whichever one I want. I'm not going to write X1. It's going to be weird. OK? Is that clear? OK.

So this particular theta is called the true parameter. Sometimes, since we're going to want to use theta as a running variable, we might denote it by theta star, as opposed to theta hat, which is always our estimator. But I'll keep it to be theta for now. And so the aim of this statistical experiment is to estimate theta, so that once I actually plug in theta in the form of my distribution-- for example, I could plug in theta here. So theta here was actually lambda. So once I estimate this guy, I would plug it in, and I would know the probability that my random variable takes any value, by just putting the lambda hat and the lambda hat here. OK? So my goal is going to be to estimate this guy so that I can actually compute those distributions.

But actually, we'll see, for example, when we talk about regression, that this parameter actually has a meaning in many instances. And so just knowing the parameter itself-- intuitively, or let's say more so than just computing probabilities-- will actually tell us something about the process. For example, we're going to run linear regression. And when we do linear regression, there's going to be some coefficients in the linear regression. And the value of this coefficient is actually telling me what is the sensitivity of the response that I'm looking at to this particular input. All right? So just knowing if this number is large or if this number is small is actually going to be useful for us-- just looking at this guy. All right? So there's going to be some instances where it's going to be important. Sometimes we're going to want to know if this parameter is larger or smaller than something, or if it's equal to something or not equal to something. And those things are also important-- for example, if theta actually measures the true-- right? So theta is the true unknown parameter-- the true efficacy of a drug. OK? Let's say I want to know what the true efficacy of a drug is.
And what I'm going to want to know is-- maybe it's a score. Maybe I'm going to want to know if theta is larger than 2. Maybe I want to know, if theta is the average number of siblings, is this true number larger than 2 or not? Right? Maybe I am interested in knowing if college students come from-- so maybe, from a sociological perspective, I'm interested in knowing if college students come from households with more than two children. All right, so those can be the questions that I may ask myself. I'm going to want to know, maybe, if theta is going to be equal to 1/2 or not. So maybe for a drug efficacy, is it completely standard-- maybe for elections. Is the proportion of the population that is going to vote for this particular candidate equal to 0.5? Or is it different from 0.5? OK, and I can think of different things. When I'm talking about regression, I'm going to want to test if this coefficient is actually 0 or not, because if it's 0, it means that the variable that's in front of it actually goes out. And so those are things we're testing. Actually having this very specific yes/no answer is going to give me a huge intuition, or huge understanding, of what's going on in the phenomenon that I observe. But actually, since the questions are so precise, I'm going to be much better at answering them rather than giving you an estimate for theta with some confidence around it.

All right, it's sort of the same principle as trying to reduce. What you're trying to do as a statistician is to inject as much knowledge about the question and about the problem as you can, so that the data has to do a minimal job. And henceforth, you actually need less data.

So from now on, we will always assume-- and this is because this is an intro stats class-- we will always assume that Theta, the set of parameters, is a subset of R to the d. That means that theta is a vector with at most a finite number of coordinates. Why do I say this? Well, this is called a parametric model. So it's called a parametric model, or sometimes parametric statistics. Actually, we don't really talk about parametric statistics. But we talk a lot about non-parametric statistics, or a non-parametric model. Can somebody think of a model which is non-parametric?

For example, in the siblings example, if I did not cap the number of siblings at 7, but I let this list go to infinity, I would have an infinite number of parameters to estimate. Very likely, the last ones would be 0. But still, I would have an infinite number of parameters to estimate. So this would not be a parametric model if I just let this list of things to be estimated be infinite. But there's other classes that are actually infinite and cannot be represented by vectors. For example, functions-- right? If I tell you my model, P_f, is just the distribution of X-- the probability distributions that have density f, right? So what I know is that the density is non-negative and that it integrates to one, right? That's all I know about densities.
Well, f is not something you're going to be able to describe with a finite number of values, right? All possible functions is a huge set. It's certainly not representable by 10 numbers. And so non-parametric estimation is typically when you actually want to parametrize this by a large class of functions. And so, for example, histograms are the prime tool of non-parametric estimation, because when you fit a histogram to data, you're trying to estimate the density of your data, but you're not trying to represent it as a finite number of points. That's really-- I mean, effectively, you have to represent it, right? So you actually truncate somewhere and just say those things are not going to matter. All right? But really, the key thing is that this is non-parametric, where you have a potentially infinite number of parameters. Whereas we're going to only talk about finite ones. And actually, finite in the overwhelming majority of cases is going to be 1. So Theta is going to be a subset of R1. OK, we're going to be interested in estimating one parameter, just like the parameter of a Poisson, or the parameter of an exponential, or the parameter of a Bernoulli. But, for example, really, we're going to be interested in estimating mu and sigma squared for the normal.

So here are some statistical models. All right? So I'm going to go through them with you. So if I tell you I observe-- I'm interested in understanding-- I'm still [INAUDIBLE] I'm interested in understanding the proportion of people who kiss by bending their head to the right. And for that, I collected n observations. And I'm interested in making some inference in the statistical model. My question to you is, what is the statistical model? Well, if you want to write the statistical model, you're going to have to write this E-- oh, sorry, I never told you what E was. OK, well, actually, let's just go to the examples, and then you'll know what E is. So you're going to have to write to me an E and a P theta, OK?

So let's start with the Bernoulli trials. So this E here is called the sample space. And in normal people's words, it just means the space, or the set, in which X lives-- and, back to your question, X is just a generic observation. OK, and hopefully, this is the smallest one you can think of. OK, so for example, for Bernoulli trials, I'm going to observe a sequence of 0's and 1's. So my experiment-- as written on the board-- is going to have sample space {0, 1}. And then the probability distributions are going to be, well, it's just going to be the Bernoulli distributions indexed by p, right? So rather than writing P sub p, I'm going to write it as Bernoulli p, because it's clear what I mean when I write that. Is everybody happy? Actually, I need to tell you something more. This is a family of distributions, so I need p. And maybe I don't want to have a p that takes value 0 or 1, right? It doesn't make sense. I would probably not look at this problem if I anticipated that everybody would kiss to the right, or everybody would kiss to the left.
So I am going to assume that p is in (0, 1), excluding 0 and 1. OK? So that's the statistical model for a Bernoulli trial.

OK, now the next one, what do we have? Exponential. OK? OK, so when I have exponential distributions, what is the support of the exponential distribution? What values is it going to take? 0 to infinity, right? So what I have is that my sample space is the values that my random variables can take. So it's-- well, actually, I can remove the 0 again-- 0 to plus infinity. And then the family of distributions that I have are exponential with parameter lambda. And again, maybe you've seen me switching from p, to lambda, to theta, to mu, to sigma squared. Honestly, you can do whatever you want. But it's just that it's customary to have this particular group of letters. OK? And so the parameters of an exponential are just positive numbers. OK? And that's my exponential model.

What is the third one? Can somebody tell me? Poisson, OK? OK, so Poisson-- is a Poisson random variable discrete or continuous? Go back to your probability. All right, so the answer being the opposite of continuous-- good job. All right, so it's going to be-- what values can a Poisson take? All the natural integers, right? So 0, 1, 2, 3, all the way to infinity. We don't have any control of this. So I'm going to write this as N without 0. I think in the slides, it's N-star, maybe. Actually, no, it can take value 0. I'm sorry. This actually takes value 0 quite a lot. That's typically, in many instances, actually the mode. So it's N, and then I'm going to write it as Poisson with parameter-- well, here it's again lambda as the parameter. And lambda can take any positive value. OK?

And that's where you can actually see the model that we had for the siblings-- right? So let me actually just squeeze in the siblings model here. So that was the bad model that I had in the first place, when I actually kept this. Let's say we just kept it at 7. Forget about larger than or equal to 7. We just assumed it was 7. What was our sample space? We said 7. So it's 1, 2, to 7, right? Those were the possible values that this thing would take. And then what was my-- what's my parameter space? So it's going to be a nightmare to write, but I'm going to write it. OK, so I'm going to write it as something like, the probability that X is equal to k is equal to p sub k. OK? And that's going to be for p-- OK, so that's for all k's, right? Or for k equal 1 to 7. And here the index is the set of parameters p1 to p7. And I know a little more about those guys, right? I know they are going to be non-negative-- p_j non-negative. And I know that they sum to 1.

OK, so maybe writing this, you start seeing why we like those Poisson, exponential, and the short notation, because I actually don't have to write the PMF of a Poisson. The Poisson is really just this. But I call it Poisson so I don't have to rewrite this all the time. And so here, I did not use a particular form.
So I just have this thing, and that's what it is. The set of parameters is the set of non-negative numbers p1 to p7 that sum to 1, right? And so this is just a list of numbers that are non-negative and sum up to 1. So that's my parameter space. OK? So here, that's my theta. This whole thing here-- this is my capital Theta. OK? So that's just the set of parameters-- the set of values that theta is allowed to take.

OK, and finally, we're going to end with the star of all, and that's the normal distribution. And in the normal distribution, you still have also some flexibility in terms of choices, because then, naturally, the normal distribution is parametrized by two parameters, right? Mean and variance. So what values can a Gaussian random variable take? The entire real line, right? And the set of parameters that it can take-- so this is going to be N(mu, sigma squared). And mu is going to be positive-- and sigma squared is going-- sorry, mu is going to be in R, and sigma squared is going to be positive. OK, so again, here, that's the way you're supposed to write it. If you really want to identify what Theta is, well, Theta formally is the set of (mu, sigma squared) such that-- well, it's in R times (0, infinity), right? That's just to be formal, but this does the job just fine. OK? You don't have to be super formal.

OK, that's not three. That's like five. Actually, I just want to write another one. Let's call it 5-bis. And 5-bis is just Gaussian with known variance. And this arises a lot in labs, when you have measurement error-- when you actually receive your measurement device, this thing has been tested by the manufacturer so much that it actually comes on the side of the box. It says that the standard deviation of your measurements is going to be 0.23. OK, and actually, why they do this is because they can brag about accuracy, right? That's how they sell you this particular device. And so you actually know exactly what sigma squared is. So once you actually get your data in the lab, you actually only have to estimate mu, because sigma comes on the label.

So now, what is your statistical model? Well, the numbers I'm collecting are still in R. But now, the model that I have is N(mu, sigma squared). But the parameter space is not mu in R and sigma positive. It's just mu in R. And to be a little more emphatic about this, this is enough to describe it, right? Because if sigma is the sigma that was specified by the manufacturer, then this is the sigma you want. But you can actually write sigma squared is equal to sigma squared manufacturer. Right? You can just fix it to be this particular value. Or maybe you don't want to write that index that says "manufacturer." And so you just say, well, the sigma-- when I write sigma squared, what I mean is the sigma squared from the manufacturer. Yeah?

AUDIENCE: [INAUDIBLE]

PHILIPPE RIGOLLET: Yeah. For a particular measuring device?
You know, you're in a lab, and you have some measuring device-- I don't know, something that measures tensile strength of something. And it's just going to measure something, and it will naturally make errors. But it's been tested so much by the manufacturer and calibrated by them. They know it's not going to be perfect, but they know exactly what error it was making, because they've actually tried it on things for which they exactly knew what the tensile strength was. OK? Yeah.

AUDIENCE: [INAUDIBLE]

PHILIPPE RIGOLLET: This?

AUDIENCE: [INAUDIBLE]

PHILIPPE RIGOLLET: Oh, like, that's pointing to-- 5 prime? OK? And we can come up with other examples, right? So, for example, here's another one. So the names don't really matter, right? I call it the siblings model, but you won't find the siblings model in the textbook, right? So I wouldn't worry too much. But, for example, let's say you have something-- so let's call it 6. You have-- I don't know-- a truncated-- and that's a name I just came up with, but it's actually not exactly describing what I want. But let's say I observe X, which is the indicator of Y larger than, say, 5, where Y follows some exponential with parameter lambda. OK? This is what I get to observe. I only observe whether my waiting time was more than five minutes, because I see somebody coming out of the Kendall Station being really upset. And that's all I record-- that I've been waiting for more than five minutes. And that's all I get to record. OK? That happens a lot. These are called censored data. I should probably not call it truncated-- this should be censored. OK?

You see a lot of censored data when you ask people how much they make. They say, well, more than five figures. And that's all they want to tell you. OK? And so you see a lot of censored data in survival analysis, right? You are trying to understand how long your patients are going to live after some surgery, OK? And maybe you're not going to keep track of people, and you're not going to actually be in touch with their family every day and ask them, is the guy still alive? And so what you can do is just-- you ask people maybe five years after your study and say, please, come in. And you will just happen to have some people say, well, you know, the person is deceased. And you will only be able to know that the person died less than five years ago. But you only see what happens after that, OK? And so this is this truncated and censored data. It happens all the time, just because you don't have the ability to do better than that.

So this could happen here. So what is my statistical experiment, right? So here, I should probably write this like this, because I just told you that my observations are going to be X, but there is some unknown Y. I will never get to see this Y. I only get to see the X. What is my statistical experiment? Please help me. So is it the real line? My sample space-- is it the real line? Sorry, who does not know what this means? I'm sorry. OK.
So this is called an indicator. So I read it as-- if I write it well, that would be a one with a double bar. You can also write I if you prefer, if you don't feel like writing a one with double bars. And it's "one of," say-- I'm going to write it like that-- 1 of A is equal to 1 if A is true, and 0 if A is false. OK? So that means that if Y is larger than 5, this thing is 1. And if Y is not larger than 5, this thing is 0. OK. So that's called an indicator-- an indicator function. It is very useful to just turn anything into a 0 or 1.

So now that I'm here, what is my sample space? {0, 1}. Well, whatever values this thing I did not tell you about takes-- if I end up telling you that it takes values 6 or 7, then that would be your sample space, OK? OK, so it takes values 0, 1. And then what is the probability here? What should I write here? What should you write without even thinking? Yeah. So let's assume there's two seconds before the end of the exam. You're going to write Bernoulli. And that's when you're going to start checking if I'm going to give you extra time, OK? So you write Bernoulli without thinking, because it's taking values 0, 1. So you just write Bernoulli, but you still have to tell me what possible parameters this thing is taking, right? So I'm going to write it p, because I don't know. And then p takes values-- OK, so, sorry, I could write it like that. Right? That would be perfectly valid, but actually, no-- there's more. It's not any p. The p is the probability that an exponential lambda is larger than 5. And maybe I want to have lambda as the parameter.

OK, so what I need to actually compute is, what is the probability that Y is larger than 5, when Y is this exponential lambda? Which means that what I need to compute is the integral between 5 and infinity of-- what is it? 1 over lambda? How did I define it in this class? Did I change it-- what?

AUDIENCE: [INAUDIBLE].

PHILIPPE RIGOLLET: Yeah, right, right, right. Yeah. Lambda e to the minus lambda x, dx, right? So that's what I need to compute. What is this? Yeah, so what is the value of this integral? Can you take appropriate measures?

AUDIENCE: [INAUDIBLE]

PHILIPPE RIGOLLET: OK? And again, you can cancel this, right? So when I'm going to integrate this guy, those guys are going to cancel. I'm going to get 0 for infinity. I'm going to get a 5 for this guy. And, well, I know it's going to be a positive number, so I'm not really going to bother with the signs, because I know that's what it should be. OK, so I get e to the minus 5 lambda. And so that means that I can actually write this like that-- and now parametrize this thing by lambda positive. OK? So what I did here is I changed the parametrization from p to lambda. Why? Well, because, maybe, if I know this is happening, maybe I am actually interested in reporting lambda to the MBTA, for example.
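Here is a minimal sketch of that change of parametrization in action: simulate censored observations X = 1{Y > 5} with Y exponential, estimate p = e^(-5 lambda) by the sample proportion, and invert it. The plug-in inversion lambda_hat = -log(p_hat)/5 is my own illustration of the idea, not a formula written in the lecture, and the data are simulated.

import math
import random

random.seed(1)
true_lambda = 0.4

# Censored data: we only record whether the exponential waiting time Y
# exceeded 5 minutes, i.e. X = 1{Y > 5}, and P(X = 1) = exp(-5 * lambda).
xs = [1 if random.expovariate(true_lambda) > 5 else 0 for _ in range(10000)]

p_hat = sum(xs) / len(xs)          # estimates exp(-5 * lambda)
lambda_hat = -math.log(p_hat) / 5  # invert the reparametrization
print(p_hat, lambda_hat)           # lambda_hat should land near 0.4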
Maybe I'm actually trying to estimate 1 over lambda, so that I know-- well, lambda is actually the intensity of arrival of my Poisson process, right? I have a Poisson process; that's how my trains are coming in. And so I'm interested in lambda, so I will parametrize things by lambda. So the thing I get is lambda. You can play with this, right? I mean, I could parametrize this by 1 over lambda and put 1 over lambda here if I wanted. But, you know, the context of your problem will tell you exactly how to parametrize this. OK?

So what else did I want to tell you? OK, let's do a final one. By the way, are you guys OK with Poisson, exponential, Bernoulli-- I don't know, binomial, normal-- all these things? I'm not going to go back to them, but I'm going to use them heavily. So just spend five minutes on Wikipedia if you forgot what those things are. Usually, you must have seen them in your probability class, so they should not be crazy names. And again, I'm not expecting you to-- I don't remember what the density of an exponential is, so it would be pretty unfair of me to actually ask you to remember what it is. Even for the Gaussian, I don't expect you to remember what it is. But I want you to remember that if I add 5 to a Gaussian, then I have a Gaussian with mean mu plus 5-- and similarly if I multiply it by something, right? You need to know how to operate on those things. But knowing complicated densities is definitely not part of the game. OK?

So let's do a final one. I don't know what number I have now. I'm going to just do uniform. That's another one. Everybody knows what uniform is? So it's uniform, right? So I'm going to have X-- my observations are going to be uniform on the interval [0, theta], right? So if I want to define a uniform distribution for a random variable, I have to tell you which interval, or which set, I want it to be uniform on. And so here I'm telling you it's the interval [0, theta]. And so what is going to be my sample space?

AUDIENCE: [INAUDIBLE]

PHILIPPE RIGOLLET: I'm sorry? 0 to theta. And then what is my probability distribution? My family of parameters? So, well, I can write it like this, right? Uniform theta, right? And theta, let's say, is positive. Can somebody tell me what's wrong with what I wrote? This makes no sense. Tell me why. Yeah? Yeah, this set depends on theta, and why is that a problem?

AUDIENCE: [INAUDIBLE]

PHILIPPE RIGOLLET: There is no theta. Right now, there's a family of thetas. Which one did you pick here? Right, this is just something that's indexed by theta, but I could have very well written it as, you know-- just not being Greek for a second-- I could have just written this as t rather than theta. That would be the same thing. And then what the hell is theta? There's no such thing as theta. We don't know what the parameter is. This parameter should move with every one. And so that means that I actually am not allowed to pick this theta.
I'm actually-- just for the reason that there is no parameter to put on the left side-- there should not be one, right? So you just said, well, there's a problem because the parameter is on the left-hand side. But there's not even a parameter. I'm describing the family of possible parameters. There is no single one that you can actually plug in. So this should really be 1. And I'm going to go back to writing this as theta, because that's pretty standard. Is that clear for everyone? I cannot just pick one and put it in there and just take the-- before I run my experiments, I could potentially get numbers that are all the way up to 1, because I don't know what theta is going to be ahead of time. Now, if somebody promised to me that theta was going to be less than 0.5, that would be-- sorry, why do I put 1 here? I could put theta between 0 and 1. But if somebody is going to promise me, for example, that theta is going to be less than 1, then you expect to put [0, 1]. All right? Is that clear?

OK, so now you know how to answer the question-- what is the statistical model? And again, within the scope of this class, you will not be asked to just come up with a model-- I will just tell you. Poisson would probably be a good idea here. And then you would just have to trust me that indeed it would be a good idea.

All right, so what I started talking about 20 minutes ago-- so I was definitely getting ahead of myself-- is the notion-- so that's when I was talking about well-specified. Remember, well-specified says that the true distribution is one of the distributions in this parametric family of distributions. The true distribution of my siblings is actually a Poisson with some parameter, and all I need to figure out is what this parameter is. When I started saying that, I said, well, but then it could be that there are several parameters that give me the same distribution, right? It could be the case that Poisson 5 and Poisson 17 are exactly the same distribution when I start putting those numbers in the formula, which I erased, OK? So it could be the case that two different numbers would give me exactly the same probabilities. And in this case, we say that the model is not identifiable-- I mean, the parameter is not identifiable. I cannot identify the parameter, even if you actually gave me an infinite amount of data, which means that I could actually estimate exactly the PMF. I might not be able to go back, because there would be several candidates, and I would not be able to tell you which one it was in the first place. OK?

So what we want is that this function-- theta maps to P theta-- is injective. And that can sound fancy. What I really mean is that if theta is different from theta prime, then P of theta is different from P of theta prime. Or, if you prefer to think about the contrapositive of this, this is the same as saying that if P theta gives me the same distribution as P theta prime, then that implies that theta must be equal to theta prime. Those two statements are logically equivalent, right? So that's what this means.
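Written out in symbols, the requirement just described is:

\theta \mapsto \mathbb{P}_\theta \ \text{is injective}
\quad\Longleftrightarrow\quad
\big(\theta \neq \theta' \implies \mathbb{P}_\theta \neq \mathbb{P}_{\theta'}\big)
\quad\Longleftrightarrow\quad
\big(\mathbb{P}_\theta = \mathbb{P}_{\theta'} \implies \theta = \theta'\big)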
So this is-- we say that the parameter is identifiable, or identified-- it doesn't really matter-- in this model. And this is something we're going to want. OK? So in all the examples that I gave you, those parameters are completely identified. Right? If I tell you-- I mean, if those things are in your probability book, it means that they were probably thought through, right? So when I say exponential lambda, I'm really talking about one specific distribution, and there's not another lambda that's going to give you exactly the same distribution. OK, so that was the case. And you can check that, but it's a little annoying, so I would probably not do it. But rather than doing this, let me just give you some examples where it would not be the case.

Again, here's an example. If I take X-- so now I'm back to just using this indicator function, but now for a Gaussian. So what I observe is, X is the indicator that Y is-- what did we say? Positive. OK? So this is a Bernoulli random variable, right? And it has some parameter p. But p now is going to depend-- sorry, and here Y is N(mu, sigma squared). So the p-- the probability that this thing is positive-- is actually-- I don't think I put the 0. Oh, yeah, because I have mu. OK, so this distribution-- this p, the probability that it's positive, is just the probability that some Gaussian is positive. And it will depend on mu and sigma, right? Because if I draw a 0, and I draw my Gaussian around mu, then the probability of this Bernoulli being 1 is really the area under the curve here. Right? And this thing-- well, if mu is very large, it's going to become very large. If mu is very small, it's going to become very small. And if sigma changes, it's also going to affect it. Is that clear for everyone?

But we can actually compute this, right? So the parameter p that I'm looking for here, as a function of mu and sigma, is simply the probability that some Y is non-negative, which is the probability that Y minus mu, divided by sigma, is larger than minus mu divided by sigma. But when you studied probability, is that some operation you were used to making? Removing the mean and dividing by the standard deviation? What is the effect of doing that on a Gaussian random variable? Yeah, so you normalize it, right? You standardize it. You make it a standard Gaussian. You remove the mean, so it becomes mean 0. And you divide by the standard deviation, so the variance becomes 1. So when you have a Gaussian, remove the mean and divide by the standard deviation, it becomes a standard Gaussian-- which means this thing has an N(0, 1) distribution, which is the one you can read the quantiles of at the end of the book. Right? And that's exactly what we did. OK?

So now you have the probability that some standard Gaussian exceeds negative mu over sigma, which I can write in terms of the cumulative distribution function, capital Phi-- like we did in the first lecture. So if I do this cumulative distribution function, what is this probability in terms of Phi? [INAUDIBLE]?

AUDIENCE: [INAUDIBLE].
PHILIPPE RIGOLLET: Well, that's what your name tag says. 1 minus--

AUDIENCE: [INAUDIBLE].

PHILIPPE RIGOLLET: 1 minus mu over sigma. What happened with Phi? Do you think I defined this for fun? 1 minus Phi of minus mu over sigma, right? Right? Because this is 1 minus the probability that it's less than this. And this is exactly the definition of the cumulative distribution function. So in particular, this thing only depends on mu over sigma. Agreed? So in particular, if I had 2 mu over 2 sigma, p would remain unchanged. If I have 12 mu over 12 sigma, this thing would remain unchanged, which means that p does not change if I scale mu and sigma by the same factor. So there's no way, just by observing X-- even an infinite number of times, so that I can actually get exactly what p is-- that I'm ever going to be able to get mu and sigma separately. All I'm going to be able to get is mu over sigma.

So here, we say that mu, sigma-- the parameter (mu, sigma)-- or actually each of them individually-- those guys-- they're not identifiable. But the parameter mu over sigma is identifiable. So if I wanted to write a statistical model in which the parameter is identifiable-- I would write {0, 1}, Bernoulli, and then I would write 1 minus Phi of minus mu over sigma. And then I would take two parameters, which are mu in R and sigma squared positive-- so let's write sigma positive. Right? No, this is not identifiable. I cannot write those two guys as two different things. Instead, what I want to write is {0, 1}, Bernoulli 1 minus-- and now my parameter-- I forgot this-- my parameter is mu over sigma. Can somebody tell me where mu over sigma lives? What values can this thing take? Any real value, right?

OK, so now, I've done this definitely out of convenience, right? Because that was the only thing I was able to identify-- the ratio mu over sigma. But it's still something that has some meaning. It's the normalized mean. It really tells me what the mean is compared to the standard deviation. So in some models, in reality, in some real applications, this actually might have a good meaning. It's just telling me how big the mean is compared to the standard fluctuations of this model. But I won't be able to get more than that. Agreed? All right?

So now that we've set up a parametric model, let's try to see what our goals are going to be. OK? So now we have a sample and a statistical model. And we want to estimate the parameter theta. And I could say, well, you know what? I don't have time for this analysis. Collecting data is going to take me a while. So I'm just going to-- mmm-- and I'm going to say that mu over sigma is 4. And I'm just going to give it to you. And maybe you will tell me, yeah, it's not very good, right? So we need some measure of performance of a given parameter. We need to be able to evaluate if eyeballing the problem is worse than actually collecting a large amount of data.
64:13 We need to know, even if I come up with an estimator that 64:16 actually sort of uses the data, does it 64:18 make an efficient use of the data? 64:20 Would I actually need 10 times more observations 64:22 to achieve the same accuracy? 64:24 To be able to answer these questions, 64:25 well, I need to define what accuracy means. 64:28 And accuracy is something that sort of makes sense. 64:30 It says, well, I want theta hat 64:31 to be close to theta. 64:33 And theta hat is a random variable. 64:35 So I'm going to have to understand 64:36 what it means for a random variable 64:38 to be close to a deterministic number. 64:40 And so, what is an estimator of the parameter, right? 64:44 So I have an estimator, and I said it's a random variable. 64:46 64:49 And the formal definition-- 64:51 64:59 so an estimator is a measurable function of the data. 65:10 So when I write theta hat, and that 65:12 will typically be my notation for an estimator, right? 65:18 I should really write theta hat of x1, ..., xn. 65:24 OK? 65:25 That's what an estimator is. 65:26 If you want to know what an estimator is, 65:28 it is a measurable function of the data. 65:30 And it's actually also known as a statistic. 65:35 65:37 And you know, 65:39 I see it every time I have, 65:43 you know, a dinner with normal people, 65:47 and I say I'm a statistician. 65:48 Oh, yeah, I really like baseball. 65:50 And they talk to me about batting averages. 65:53 That's not what I do. 65:54 But for them, that's what it is, and that's 65:55 because in a way, that's what a statistic is. 65:58 A batting average is a statistic. 66:00 OK, and so here are some examples. 66:02 You can take the average xn bar. 66:04 You can take the maximum of your observations. 66:06 That's a statistic. 66:07 You can take the first one. 66:08 You can take the first one plus log of 1 66:10 plus the absolute value of the last one. 66:12 You can do whatever you want-- that will be an estimator. 66:15 Some of them are clearly going to be bad. 66:17 But that's still a statistic, and you can do this. 66:20 Now, when I say measurable, I always have-- 66:24 so you know, graduate students sometimes 66:26 ask me like, yeah, how do I know if this estimator is measurable 66:28 or not? 66:29 And usually, my answer is, well, if I give you data, 66:31 can you compute it? 66:32 And they say, yeah, and I'm like, well, then it's measurable. 66:35 That's a very good rule to check 66:38 if something is actually measurable. 66:40 When is this thing non-measurable? 66:42 It's when it's implicitly defined. 66:44 OK, and in particular, the things 66:46 that give you problems are-- 66:48 66:52 sup or inf. 66:53 Anybody knows what a sup or an inf is? 66:55 It's like a max or a min. 66:57 But it's not always attained. 66:59 OK, so 67:02 if I look at the infimum of the function 67:06 f of x for x on the real line-- sorry, 67:11 let's say for x in the interval from 1 to infinity-- 67:13 and f of x is equal to 1 over x. 67:16 Right? 67:18 Then the infimum is the smallest value 67:20 it can take, except that it doesn't really 67:22 take the value 0, right? Because 1 over x is going to 0, 67:28 but it's never really getting there. 67:30 So we just call the inf 0. 67:32 But it's not a value that it ever takes. 67:34 And these things might actually be complicated to compute. 67:37 And so that's when you actually have problems, right? 67:40 When the limit is not-- 67:41 you're not really quite reaching the limit.
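As a small numerical aside on the infimum example just given (a sketch added here, not taken from the lecture boardwork): on the interval from 1 to infinity, f(x) = 1/x has infimum 0, but no point ever attains that value.

    # The values of 1/x only approach the infimum 0; they never equal it.
    for x in (1, 10, 1_000, 1_000_000):
        print(x, 1 / x)   # 1.0, 0.1, 0.001, 1e-06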
67:44 You won't have this problem in general, but just so you know, 67:47 an estimator is not just anything. 67:48 It has to actually be measurable. 67:51 OK, so the first thing we want to know-- I mentioned it-- 67:54 so an estimator is a statistic which does not depend on theta, 67:57 of course. 67:58 So if I give you the data, you have to be able to compute it. 68:01 And that should not require knowing any unknown 68:04 parameters. 68:06 OK, so an estimator is said to be consistent 68:11 when-- as I collect more and more data-- this thing 68:13 is getting closer and closer to the true parameter. 68:16 All right? 68:16 And we said that eyeballing and saying that it's going to be 4 68:20 is not really something that's probably 68:21 going to be consistent. 68:22 But you can have things that are consistent 68:24 but that converge to theta at different speeds. 68:28 OK? 68:29 And we know also that this is a random variable. 68:32 It converges to something. 68:33 And there might be some different notions 68:35 of convergence that kick in. 68:36 And actually there are. 68:38 And we say that it's weakly consistent if it converges 68:40 in probability and strongly consistent 68:43 if it converges almost surely. 68:46 OK? 68:46 And this is just vocabulary. 68:48 It won't make a big difference. 68:50 OK? 68:51 So we will typically say it's consistent with any of the two. 68:56 AUDIENCE: [INAUDIBLE]. 68:57 69:02 PHILIPPE RIGOLLET: Well, so in parametric statistics, 69:07 it's actually a little difficult to come up with. 69:09 But in non-parametric ones, I could just say, if I had xi, 69:15 yi, and I know that yi is f of xi plus some noise epsilon i. 69:24 And I know that f belongs to some class of functions, 69:26 let's say-- 69:27 [INAUDIBLE] class of smooth functions-- it's massive. 69:31 And now, I'm going to actually find the following estimator. 69:33 I'm going to take the average. 69:35 So I'm going to do least squares, right? 69:36 69:40 So I just check-- 69:41 I'm trying to minimize the distance of each of my f of xi 69:44 to my yi. 69:45 And now, I want to find the smallest of them. 69:49 So if I look at the infimum here, then the question is-- 69:56 so that could be-- 69:57 well, that's not really an estimator for f. 69:59 But it's an estimator for the smallest possible value. 70:02 And so for example, this is actually 70:04 an estimator for the variance, sigma squared. 70:07 This might not be attained, and this might not 70:09 be measurable if the class f is massive. 70:13 All right, so that's the infimum over some class of functions f. 70:16 OK? 70:18 So those are always things that are defined implicitly. 70:20 If it's an average, for example, it's completely measurable. 70:24 OK? 70:27 Any other question? 70:28 70:31 OK, so we know the first thing we might want to check-- 70:37 and that's definitely something we want about estimators-- 70:40 is that they are consistent, because all consistency tells 70:43 us is that, as I collect more and more data, 70:45 my estimator is going to get closer 70:47 and closer to the parameter. 70:51 There's other things we can look at. 70:52 For each possible value of n-- now, right now, 70:55 I have a finite number of observations-- 71:00 25. 71:01 And I want to know something about my estimator. 71:04 The first thing I want to check is maybe, on average, right? 71:08 So this is a random variable. 71:09 Is this random variable, on average, 71:11 going to be close to theta or not?
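One way to get a feel for this question is a quick simulation, sketched below under assumed values (Bernoulli data with p = 0.3, sample size n = 50, and numpy as the tool; none of these choices come from the lecture): across many repetitions of the experiment, both the sample average and the first observation land on p on average, but their fluctuations around p are very different.

    # Repeat the experiment many times and compare two estimators of p.
    import numpy as np

    rng = np.random.default_rng(0)
    p, n, reps = 0.3, 50, 20_000
    samples = rng.binomial(1, p, size=(reps, n))
    xbar = samples.mean(axis=1)    # one value of the sample average per experiment
    x1 = samples[:, 0]             # one value of the first observation per experiment
    print(xbar.mean(), x1.mean())  # both close to p = 0.3: same behavior on average
    print(xbar.std(), x1.std())    # roughly sqrt(p(1-p)/n) versus sqrt(p(1-p))

The first print line answers the question in the affirmative for both estimators; the second hints at why being right on average will not be the whole story.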
71:14 And so the difference-- how far I am from theta-- 71:17 is actually called the bias. 71:20 So the bias of an estimator is the expectation of theta hat 71:28 minus the value that I hope it gets, which is theta. 71:31 If this thing is equal to 0, we say that theta hat is unbiased. 71:38 71:42 And unbiased estimators are things that people 71:44 are looking for in general. 71:46 The problem is that there's lots of unbiased estimators. 71:49 And so it might be misleading to look for unbiasedness 71:52 when that's not really the only thing 71:54 you should be looking for. 71:55 OK, so what does it mean to be unbiased? 71:58 Maybe for this particular round of data 72:00 you collected, you're actually pretty far 72:02 from the true parameter. 72:04 But one thing that actually-- 72:08 what it means is that if I redid this experiment over, and over, 72:12 and over again, and I averaged all the values of my estimators 72:16 that I got, then this would actually be the right-- 72:19 the true parameter. 72:21 OK. 72:21 That's what it means. 72:22 If I were to repeat this experiment, 72:25 on average, I would actually get the right thing. 72:27 But you don't get to repeat the experiment. 72:30 OK, just a remark about estimators, 72:33 look at this estimator-- xn bar. 72:34 Right? 72:35 Think of the kiss example. 72:36 I'm looking at the average of my observations. 72:39 And I want to know what the expectation of this thing is. 72:41 72:44 OK? 72:45 Now, by linearity of the expectation, 72:56 this guy is the average of the expectations, right? 72:59 But my data is identically distributed. 73:03 So in particular, all the xi's have the same expectation, 73:07 right? 73:09 Everybody agrees with this. 73:10 When it's identically distributed, 73:12 they all have the same expectation. 73:14 So what it means is that these guys here-- 73:17 they're all equal to the expectation of x1. 73:22 Right? 73:23 So what it means is that these guys-- 73:25 I have the average of the same number. 73:28 So this is actually the expectation of x1. 73:31 OK? 73:32 And it's true. 73:33 In the kiss example, this was p. 73:36 And this is p-- 73:37 73:40 the probability of turning your head to the right. 73:43 OK? 73:43 So those two things are the same. 73:45 In particular, that means that xn bar and just x1 73:50 have the same bias. 73:54 So that should probably illustrate to you 73:56 that bias is not something that really is telling you 73:59 the entire picture, right? 74:02 I can take only one of my observations-- 74:05 Bernoulli 0, 1. 74:06 This thing will have the same bias 74:07 as if I average 1,000 of them. 74:10 But the bias is really telling you where I am on average. 74:13 It's really not telling me what fluctuations I'm getting. 74:16 And so if you want to start having fluctuations coming 74:18 into the picture, we actually have 74:20 to look at the risk or the quadratic risk 74:22 of the estimator. 74:23 And so the quadratic risk is defined as 74:25 the expectation of the squared distance between theta hat 74:28 and theta. 74:30 OK? 74:33 So let's look at this. 74:34 74:42 So the quadratic risk-- 74:43 74:47 sometimes people 74:48 call it the l2 risk of theta hat, of course. 74:57 I'm sorry for maintaining such an ugly board. 74:59 [INAUDIBLE] this stuff. 75:00 75:09 OK, so I look at the squared distance 75:10 between theta hat and theta. 75:12 This is a function of a random variable. 75:14 So it's a random variable as well. 75:16 And now I'm looking at the expectation of this guy.
75:19 That's the definition. 75:23 I claim that when this thing goes to 0, then 75:25 my estimator is actually going to be consistent. 75:28 Everybody agrees with this? 75:30 75:37 So if it goes to zero as n goes to infinity-- and here, 75:47 I don't need to tell you what kind of convergence I have, 75:50 because this is just a number, right? 75:51 It's an expectation. 75:52 So it's a regular, usual calculus-style convergence. 75:57 Then that implies that theta hat is actually weakly consistent. 76:03 76:07 What did I use to tell you this? 76:09 76:14 Yeah, this is the convergence in L2. 76:17 This actually is strictly equivalent. 76:19 This is by definition saying that theta hat converges in L2 76:26 to theta. 76:29 And we know that convergence in L2 76:31 implies convergence in probability to theta. 76:37 That was the picture. 76:38 We're going up. 76:40 And this is actually, by definition, equivalent to consistency-- 76:42 weak consistency. 76:46 OK, so this is actually telling you a little more, 76:48 because these guys here-- 76:50 they are both unbiased. 76:52 Xn bar is unbiased. 76:55 X1 is unbiased. 76:56 But x1 is certainly not consistent, 76:58 because the more data I collect, I'm not even doing anything 77:01 with it. 77:01 I'm just taking the first data point you're giving to me. 77:04 So they're both unbiased. 77:05 But this one is not consistent. 77:07 And this one we'll see is actually consistent. 77:09 xn bar is consistent. 77:11 And actually, we've seen that last time. 77:14 And that's because of the? 77:15 77:19 What guarantees the fact that xn bar is consistent? 77:23 AUDIENCE: The law of large numbers. 77:25 PHILIPPE RIGOLLET: The law of large numbers, right? 77:26 Actually, it's strongly consistent 77:27 if you have a strong law of large numbers. 77:29 OK, so just in the last two minutes, 77:35 I want to tell you a little bit about how this risk is linked 77:39 to the bias and the variance-- you'll see, the quadratic risk is equal to the bias 77:43 squared plus the variance. 77:44 So let's see what I mean by this. 77:48 So I'm going to forget about the absolute values-- 77:50 since we have a square, 77:50 I don't really need them. 77:54 If theta hat were unbiased, this theta 77:57 would be the expectation of theta hat. 78:01 It might not be the case. 78:02 So let me see how I can actually-- put the bias in there. 78:06 Well, one way to do this is to see 78:07 that theta hat minus theta is equal to theta 78:10 hat minus the expectation of theta hat, 78:13 plus the expectation of theta hat minus theta. 78:17 78:21 OK? 78:22 I just removed and added the same thing. 78:24 So I didn't change anything. 78:27 Now, this guy is my bias, right? 78:29 78:32 So now let me expand the square. 78:34 So what I get is the expectation of the square of theta 78:37 hat minus its expectation. 78:39 78:42 I should put some square brackets-- 78:45 plus two times the cross-product. 78:50 So the cross-product is the expectation 78:52 of theta hat minus the expectation of theta hat, times 78:59 the expectation of theta hat minus theta. 79:03 79:07 And then I have the last square. 79:08 79:17 The expectation of theta hat minus theta, squared. 79:22 OK? 79:24 So square, cross-product, square. 79:27 Everybody is with me? 79:29 Now this guy here-- 79:32 if you pay attention, this thing is the expectation 79:35 of some random variable. 79:36 So it's a deterministic number. 79:38 Theta is the true parameter. 79:39 It's a deterministic number.
79:41 So what I can do is pull this entire thing out 79:44 of the expectation and compute the expectation only 79:52 with respect to that part. 79:53 But what is the expectation of this thing? 79:56 79:59 It's zero, right? 80:00 The expectation of theta hat minus the expectation 80:02 of theta hat is 0. 80:03 So this entire thing is equal to 0. 80:07 So now when I actually collect back my quadratic terms-- 80:12 my two squared terms in this expansion-- 80:15 what I get is that the expectation 80:18 of theta hat minus theta squared is 80:21 equal to the expectation of theta hat minus the expectation 80:26 of theta hat squared, plus the square of the expectation 80:32 of theta hat minus theta. 80:35 80:40 Right? 80:41 So those are just the two-- 80:42 the first and the last term of the previous equality. 80:46 Now, here I have the expectation of the square 80:48 of the difference between a random variable 80:50 and its expectation. 80:52 This is otherwise known as the variance, right? 80:56 So this is actually equal to the variance of theta hat. 81:03 And well, this was the bias. 81:05 We already said that's there. 81:07 So this whole thing is the bias squared. 81:09 81:12 OK? 81:13 And hence the quadratic risk is the sum 81:15 of the variance and the squared bias. 81:18 Why squared bias? 81:18 Well, because otherwise, you would be adding dollars 81:21 to dollars squared. 81:22 So you need to add dollars squared to dollars 81:24 squared so that this thing is actually homogeneous. 81:27 So if x is in dollars, then the bias is in dollars, 81:30 but the variance is in dollars squared. 81:32 OK, and the square here forces you to put everything 81:35 on the squared scale. 81:36 All right, so what's nice is that if the quadratic risk goes 81:39 to 0, then since I have the sum of two non-negative terms, 81:42 both of them have to go to 0. 81:45 That means that my variance is going to 0-- 81:46 very little fluctuations. 81:48 And my bias is also going to 0, which means that I'm actually 81:51 going to be on target once I reduce 81:53 my fluctuations, because it's one thing to reduce 81:55 the fluctuations. 81:56 But if I'm not on target, it's an issue, right? 81:58 For example, the estimator that always guesses the value 4 has no variance. 82:03 Every time I'm going to repeat the experiment, 82:05 I'm going to get 4, 4, 4, 4-- 82:07 the variance is 0. 82:08 But the bias is bad. 82:10 The bias is 4 minus theta. 82:12 And if theta is far from 4, that's not doing very well. 82:17 OK, so next week, 82:21 we'll talk about what makes a good estimator-- 82:25 how estimators change if they have 82:26 high variance or low variance, or high bias or low bias. 82:32 And we'll talk about confidence intervals as well. 82:35
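To close the loop on this decomposition, here is a hedged Monte Carlo sketch (assumed values p = 0.3 and n = 50 for Bernoulli data, with numpy as the tool; none of this is from the lecture). It checks that the quadratic risk equals the variance plus the squared bias for three estimators: the sample average, the first observation, and the constant guess 4.

    # Quadratic risk = variance + bias^2, checked by simulation for three estimators of p.
    import numpy as np

    rng = np.random.default_rng(1)
    p, n, reps = 0.3, 50, 50_000
    samples = rng.binomial(1, p, size=(reps, n))

    estimators = {
        "sample average": samples.mean(axis=1),
        "first observation": samples[:, 0].astype(float),
        "always guess 4": np.full(reps, 4.0),
    }
    for name, est in estimators.items():
        risk = np.mean((est - p) ** 2)       # E[(theta_hat - theta)^2]
        var = est.var()                      # variance of the estimator
        bias_sq = (est.mean() - p) ** 2      # squared bias
        print(name, risk, var + bias_sq)     # the last two columns agree

In theory the risks are p(1-p)/n = 0.0042 for the sample average, p(1-p) = 0.21 for the first observation, and (4 - p)^2 = 13.69 for the constant guess; only the first goes to 0 as n grows, matching the consistency discussion above.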