Transcript https://www.youtube.com/watch?v=VPZD_aij8H0&list=PLUl4u3cNGP60uVBMaoNERc6knT_MgPKS0&index=1 00:00 00:01 The following content is provided under a Creative 00:03 Commons license. 00:05 Your support will help MIT OpenCourseWare 00:07 continue to offer high-quality educational resources for free. 00:11 To make a donation or to view additional materials 00:14 from hundreds of MIT courses, visit MIT OpenCourseWare 00:18 at ocw.mit.edu. 00:19 00:23 PHILIPPE RIGOLLET: OK, so the course you're currently sitting 00:25 in is 18.650. 00:27 And it's called Fundamentals of Statistics. 00:29 And until last spring, it was still called Statistics 00:33 for Applications. 00:34 It turned out that really, based on the content, "Fundamentals 00:37 of Statistics" was a more appropriate title. 00:42 I'll tell you a little bit about what 00:43 we're going to be covering in class, what this class is 00:46 about, what it's not about. 00:48 I realize there are several offerings 00:50 in statistics on campus. 00:52 So I want to make sure that you've chosen the right one. 00:56 And I also understand that for some of you, 00:58 it's a matter of scheduling. 01:01 I need to actually throw out a disclaimer. 01:03 I tend to speak too fast. 01:05 I'm aware of that. 01:07 Someone in the back, just do like that when you 01:10 have no idea what I'm saying. 01:12 Hopefully, I will repeat myself many times. 01:14 So if you average over time, you'll 01:15 see that statistics will tell you 01:17 that you will get the right message that I was actually 01:19 trying to send. 01:22 All right, so what are the goals of this class? 01:26 The first one is basically to give you an introduction. 01:28 No one here is expected to have seen statistics before, 01:31 but as you will see, you are expected 01:33 to have seen probability. 01:34 And usually, you do see some statistics 01:36 in a probability course. 01:38 So I'm sure some of you have some ideas, 01:39 but I won't expect anything. 01:42 And we'll be using mathematics. 01:44 It's a math class, so there's going to be a bunch of equations-- 01:48 not so much real data and statistical thinking. 01:52 We're going to try to provide theoretical guarantees. 01:54 If I have two estimators that are available to me, 01:58 how does theory guide me to choose the better of them? 02:01 How certain can I be of my guarantees or predictions? 02:06 It's one thing to just spit out a number. 02:08 It's another thing to put some error bars around it. 02:10 And we'll see how to build error bars, for example. 02:14 You will have your own applications. 02:16 I'm happy to answer questions about specific applications. 02:19 But rather than trying to tailor applications 02:21 to an entire institute, I think we're 02:24 going to work with pretty standard applications, 02:28 mostly not very serious ones. 02:32 And hopefully, you'll be able to take the main principles back 02:36 with you and apply them to your particular problem. 02:39 What I'm hoping that you will get out of this class is that 02:43 when you have a real-life situation-- and by "real life", 02:46 I mean mostly at MIT, so some people probably would not call 02:48 that real life-- 02:50 your goal is to formulate a statistical problem 02:52 in mathematical terms. 02:53 If I want to say, is a drug effective, 02:56 that's not in mathematical terms. 02:58 I have to find out which measure I want 03:00 to use to call it effective. 03:03 Maybe it's over a certain period of time. 03:06 So there's a lot of things that you actually need.
03:08 And I'm not really going to tell you 03:10 how to go from the application to the point you need to be. 03:13 But I will certainly describe to you 03:14 what point you need to be at if you want to start applying 03:19 statistical methodology. 03:21 Then once you understand what kind of question 03:23 you want to answer-- 03:24 do I want a yes/no answer, do I want a number, 03:26 do I want error bars, do I want to make predictions 03:29 five years into future, do I have side information, 03:32 or do I not have side information, all those things-- 03:34 based on that, hopefully, you will 03:36 have a catalog of statistical methods 03:38 that you're going to be able to use and apply it in the wild. 03:44 And also, no statistical method is perfect. 03:49 Some of the math people have agreed upon over the years, 03:52 and people understand that this is the standard. 03:54 But I want you to be able to understand 03:57 what the limitations are, and when you make conclusions 03:59 based on data, that those conclusions might be erroneous, 04:03 for example. 04:03 All right, more practically, my goal here is to have you ready. 04:09 So who has taken, for example, a machine-learning class here? 04:12 All right, so many of you, actually-- maybe a third 04:15 have taken a machine-learning class. 04:19 So statistics has somewhat evolved into machine 04:21 learning in recent years. 04:22 And my goal is to take you there. 04:24 So machine learning has a strong algorithmic component. 04:26 So maybe some of you have taken a machine-learning class 04:29 that displays mostly the algorithmic component. 04:31 But there's also a statistical component. 04:33 The machine learns from data. 04:37 So this is a statistical track. 04:39 And there are some statistical machine-learning classes 04:43 that you can take here. 04:44 They're offered at the graduate level, I believe. 04:47 But I want you to be ready to be able to take those classes, 04:51 having the statistical fundamentals to understand 04:53 what you're doing. 04:54 And then you're going to be able to expand to broader and more 04:58 sophisticated methods. 05:00 Lectures are here from 11:00 to 12:30 on Tuesday and Thursday. 05:05 Victor-Emmanuel will also be-- 05:07 and you can call him Victor-- 05:09 will also be holding mandatory recitation. 05:12 So please go on Stellar and pick your recitation. 05:15 It's either 3:00 to 4:00 or 4:00 to 5:00 on Wednesdays. 05:19 And it's going to be mostly focused on problem-solving. 05:22 They're mandatory in the sense that we're allowed to do this, 05:28 but they're not going to cover entirely new material. 05:32 But they might cover some techniques 05:35 that might save you some time when it comes to the exam. 05:39 So you might get by. 05:41 Attendance is not going to be taken or anything like this. 05:44 But I highly recommend that you go, 05:47 because, well, they're mandatory. 05:49 So you cannot really complain that something was taught only 05:52 in recitation. 05:54 So please register on Stellar for which 05:56 of the two recitations you would like to be in. 05:59 They're capped at 40, so first come, first served. 06:03 Homework will be due weekly. 06:05 There's a total of 11 problem sets. 06:08 I realize this is a lot. 06:09 Hopefully, we'll keep them light. 06:11 I just want you to not rush too much. 06:15 The 10 best will be kept, and this 06:17 will count for a total of 30% of the final grade. 06:20 There are due Mondays at 8:00 PM on Stellar. 06:25 And this is a new thing. 
06:28 We're not going to use the boxes outside of the math department. 06:31 We're going to use only PDF files. 06:34 Well, you're always welcome to type them and practice 06:37 your LaTeX or Word typing. 06:40 I also understand that this can be a bit of a strain, 06:42 so just write them down on a piece of paper, 06:45 use your iPhone, and take a picture of it. 06:48 Dropbox has a nice, new-- 06:50 so try to find something that puts a lot of contrast, 06:53 especially if you use pencil, because we're going 06:55 to check if they're readable. 06:56 And it is your responsibility to have a readable file. 07:01 I've had over the years-- 07:02 not at MIT, I must admit-- but I've 07:03 had students who actually write the doc file 07:06 and think that converting it to a PDF 07:08 consists of erasing the extension .doc 07:11 and replacing it by .pdf. 07:12 This is not how it works. 07:13 07:17 So I'm sure you will figure it out. 07:19 Please try to keep them letter-sized. 07:21 This is not a strict requirement, 07:23 but I don't want to see thumbnails, either. 07:26 You are allowed to have two late homeworks. 07:28 And by late, I mean 24 hours late. 07:31 No questions asked. 07:32 You submit them, this will be counted. 07:34 You don't have to send an email to warn us 07:36 or anything like this. 07:39 Beyond that, given that you have slack 07:42 for one 0 grade and slack for two late homeworks, 07:46 you're going to have to come up with a very good explanation 07:49 for why you actually need more extensions than that, if you 07:52 ever do. 07:53 And in particular, you're going to have 07:54 to keep track of why you've used your three options before. 07:58 There's going to be two midterms. 08:00 One is October 3, and one is November 7. 08:03 They're both going to be in class for the duration 08:05 of the lecture. 08:06 When I say they last for an hour and 20 minutes, 08:09 it does not mean that if you arrive 10 minutes 08:11 before the end of lecture, you still 08:12 get an hour and 20 minutes. 08:14 It will end at the end of lecture time. 08:17 For this as well, no pressure. 08:19 Only the best of the two will be kept. 08:22 And this grade will count for 30% of the grade. 08:26 This will be closed-book and closed-notes. 08:29 The purpose is for you to-- yes? 08:30 AUDIENCE: How many midterms did you say there are? 08:33 PHILIPPE RIGOLLET: Two. 08:34 AUDIENCE: You said the best of the two will be kept? 08:36 PHILIPPE RIGOLLET: I said the best of the two 08:37 will be kept, yes. 08:39 AUDIENCE: So both the midterms will be kept? 08:42 PHILIPPE RIGOLLET: The best of the two, not the best two. 08:45 AUDIENCE: Oh. 08:45 08:50 PHILIPPE RIGOLLET: We will add them, multiply the number by 9, 08:53 and that will be the grade. 08:54 No. 08:55 I am trying to be nice, there's just a limit to what I can do. 08:59 All right, so the goal is for you to learn things 09:02 and to be familiar with them. 09:04 In the final, you will be allowed 09:05 to have your notes with you. 09:07 But the midterms are also a way for you 09:09 to develop some mechanisms so that you don't actually waste 09:11 too much time on things that you should be able to do 09:14 without thinking too much. 09:16 You will be allowed a cheat sheet, 09:17 because, well, you can always forget something. 09:20 And it will be a two-sided, letter-sized sheet, 09:23 and you can practice writing as small as you want. 09:27 And you can put whatever you want on this cheat sheet. 09:30 All right, the final will be decided by the registrar.
09:33 It's going to be three hours, and it's 09:34 going to count for 40%. 09:35 You cannot bring books, but you can bring your notes. 09:38 Yes. 09:38 AUDIENCE: I noticed that the midterm dates 09:40 aren't the same as the dates in the syllabus. 09:41 So I wanted to make sure you know. 09:43 PHILIPPE RIGOLLET: They are not? 09:44 AUDIENCE: Yeah-- 09:45 PHILIPPE RIGOLLET: Oh, yeah, there's 09:45 a "1" that's missing on both of them, isn't there? 09:47 09:59 Yeah, let's figure that out. 10:03 The syllabus is the true one. 10:05 The slides are so that we can discuss, 10:07 but the dates that are on the syllabus 10:08 are the ones that count. 10:09 And I think they're also posted on the calendar on Stellar 10:13 as well. 10:14 Any other question? 10:15 10:20 OK, so the pre-reqs here-- 10:23 and who has looked at the first problem set already? 10:28 OK, so those hands that are raised 10:30 realize that there is a true prerequisite of probability 10:34 for this class. 10:36 It can be at the level of 18.600 or 6.041. 10:40 I should say 6.041A and B now. 10:42 It's two classes. 10:44 I will require you to know some calculus 10:48 and have some notions of linear algebra, 10:51 such as, what is a matrix, what is a vector, how 10:53 do you multiply those things together, 10:55 some notion of what orthonormal vectors are. 10:58 11:01 We'll talk about eigenvectors and eigenvalues, 11:03 but I'll remind you of all of that. 11:05 So this is not a strict pre-req. 11:07 But if you've taken it, for example, 11:09 it doesn't hurt to go back to your notes 11:12 when we get closer to this chapter 11:13 on principal-component analysis. 11:15 The chapters, as they're listed in the syllabus, are in order, 11:19 so you will see when it actually comes. 11:22 There's no required textbook. 11:24 And I know you tend to not like that. 11:29 You like to have your textbook to know where you're going 11:31 and what we're doing. 11:32 I'm sorry, it's just this class. 11:34 Either I would have to go to a mathematical statistics 11:36 textbook, which is just too much, 11:38 or to go to a more engineering-type statistics 11:43 textbook, which is just too little. 11:45 So hopefully, the problem sets will be enough 11:49 for you to practice. 11:50 The recitations will have some problems to solve as well. 11:52 And the material will be posted on the slides. 11:55 So you should have everything you need. 11:57 There's plenty of resources online 11:58 if you want to expand on a particular topic 12:00 or read it as said by somebody else. 12:03 The book that I recommend in the syllabus 12:07 is this book called All of Statistics by Wasserman. 12:11 Mainly because of the title, I'm guessing 12:13 it has all of it in it. 12:16 It's pretty broad. 12:18 There's actually not that much math in it. 12:20 It's more of an intro-grad level. 12:22 It's not very deep, but you see a lot of the overview. 12:27 Certainly, what we're going to cover 12:29 will be a subset of what's in there. 12:30 12:34 The slides will be posted on Stellar 12:35 before lectures, before we start a new chapter, 12:38 and after we're done with the chapter, with the annotations, 12:41 and also with the typos corrected, like for the exam. 12:46 There will be some video lectures. 12:48 Again, the first one will be posted on OCW from last year. 12:52 But all of them will be available on Stellar-- 12:54 of course, modulo technical problems. 12:57 But this is an automated system. 13:00 And hopefully, it will work out well for us.
13:02 So if you somehow have to miss a lecture, 13:04 you can always catch it up by watching it. 13:08 You can also play at that speed 0.75 13:10 in case I end up speaking too fast, 13:12 but I think I've managed myself so far-- 13:15 so just last warning. 13:19 All right, why should you study statistics? 13:22 Well, if you read the news, you will see a lot of statistics. 13:27 I mentioned machine learning. 13:28 It's built on a lot of statistics. 13:32 If I were to teach this class 10 years ago, 13:34 I would have to explain to you that data collection and making 13:37 decisions based on data was something that made sense. 13:40 But now, it's almost in our life. 13:43 We're used to this idea that data helps in making decisions. 13:47 And people use data to conduct studies. 13:51 So here, I found a bunch of press titles that-- 13:55 I think the key word I was looking for was "study finds"-- 13:57 if I want to do this. 13:58 So I actually did not bother doing it again this year. 14:01 This is all 2016, 2016, 2016. 14:04 But the key word that I look for is usually "study find"-- 14:07 so a new study find-- 14:08 traffic is bad for your health. 14:10 So we had to wait for 2016 for data to tell us that. 14:13 And there's a bunch of other slightly more interesting ones. 14:18 For example, one that you might find interesting 14:20 is that this study finds that students benefit from waiting 14:24 to declare a major. 14:26 Now, there's a bunch of press titles. 14:28 There one in the MIT News that finds brain connections, 14:33 key to reading. 14:34 And so here, we have an idea of what happened there. 14:37 Some data was collected. 14:39 Some scientific hypothesis was formulated. 14:42 And then the data was here to try to prove or disprove 14:47 this scientific hypothesis. 14:49 That's the usual scientific process. 14:51 And we need to understand how the scientific process goes, 14:55 because some of those things might be actually questionable. 14:58 Who is 100% sure that study finds that students-- 15:02 do you think that you benefit from waiting 15:04 to declare a major? 15:05 Right I would be skeptical about this. 15:09 I would be like, I don't want to wait to declare a major. 15:13 So what kind of thing can we bring? 15:15 Well maybe this study studied people 15:17 that were different from me. 15:18 Or maybe the study finds that this 15:21 is beneficial for a majority of people. 15:22 I'm not a majority. 15:23 I'm just one person. 15:24 There's a bunch of things that we 15:26 need to understand what those things actually mean. 15:28 And we'll see that those are actually not 15:30 statements about individuals. 15:31 They're not even statements about the cohort of people 15:33 they've actually looked at. 15:35 They're statements about a parameter 15:37 of a distribution that was used to model 15:40 the benefit of waiting. 15:43 So there's a lot of questions. 15:45 And there are a lot of layers that come into this. 15:46 And we're going to want to understand what was going on 15:49 in there and try to peel it off and understand what assumptions 15:52 have been put in there. 15:53 Even though it looks like a totally legit study, out 15:59 of those studies, statistically, I 16:01 think there's going to be one that's going to be wrong. 16:04 Well, maybe not one. 16:05 But if I put a long list of those, 16:07 there would be a few that would actually be wrong. 16:10 If I put 20, there would definitely be one that's wrong. 16:12 So you have to see that. 
16:13 Every time you see 20 studies, one is probably wrong. 16:16 When there are studies about drug effects, 16:19 out of a list of 100, one would be wrong. 16:21 So we'll see what that means and what I mean by that. 16:23 16:26 Of course, not only studies that make discoveries 16:30 are actually making the press titles. 16:32 There's also the press that talks about things 16:36 that make no sense. 16:38 I love this first experiment-- the salmon experiment. 16:41 Actually, it was a grad student who 16:44 came to a neuroscience poster session, 16:47 pulled out this poster, and explained 16:50 the scientific experiment that he was conducting, 16:53 which consisted in taking a previously frozen and thawed 16:59 salmon, putting it in an MRI, showing it 17:02 pictures of violent images, and recording its brain activity. 17:07 And he was able to discover a few voxels that were activated 17:11 by those violent images. 17:14 And can somebody tell me what happened here? 17:16 17:19 Was the salmon responding to the violent activity? 17:23 Basically, this is just a statistical fluke. 17:26 That's just randomness at play. 17:27 There's so many voxels that are recorded, 17:29 and there's so many fluctuations. 17:31 There's always a little bit of noise 17:32 when you're in those things, that some of them, 17:34 just by chance, got lit up. 17:36 And so we need to understand how to correct for that. 17:38 In this particular instance, we need 17:40 to have tools that tell us that, well, finding three voxels that 17:43 are activated for that many voxels 17:47 that you can find in the salmon's brain 17:50 is just too small of a number. 17:53 Maybe we need to find a clump of 20 of them, for example. 17:56 All right, so we're going to have 17:57 mathematical tools that help us find those particular numbers. 18:02 I don't know if you ever saw this one by John Oliver 18:07 about phacking. 18:11 Or actually, it said p-hacking. 18:14 Basically, what John Oliver is saying 18:17 is actually a full-length-- like there's long segments on this. 18:20 And he was explaining how there's a sociology question 18:24 here about how there's a huge incentive for scientists 18:28 to publish results. 18:30 You're not going to say, you know what? 18:31 This year, I found nothing. 18:33 And so people are trying to find things. 18:35 And just by searching, it's as if they 18:36 were searching for all the voxels in a brain 18:39 until they find one that was just lit up by chance. 18:41 And so they just run all these studies. 18:43 And at some point, one will be right just out of chance. 18:46 And so we have to be very careful about doing this. 18:49 There's much more complicated problems associated 18:52 to what's called p-hacking, which 18:53 consists of violating the basic assumptions, in particular, 18:57 looking at the data, and then formulating 19:00 your scientific assumption based on data, 19:02 and then going back to it. 19:03 Your idea doesn't work. 19:04 Let's just formulate another one. 19:05 And if you are doing this, all bets are off. 19:07 19:10 The theory that we're going to develop 19:11 is actually for a very clean use of data, which 19:14 might be a little unpleasant. 19:16 If you've had an army of graduate students collecting 19:19 genomic data for a year, for example, 19:21 maybe you don't want to say, well, 19:23 I had one hypothesis that didn't work. 19:25 Let's throw all the data into the trash. 19:27 And so we need to find ways to be able to do this. 19:30 And there's actually a course been taught at BU. 
19:34 It's still in its early stages, but something 19:37 called "adaptive data analysis" that will allow 19:39 you to do these kind of things. 19:42 Questions? 19:43 19:46 OK, so of course, statistics is not 19:49 just for you to be able to read the press. 19:52 Statistics will probably be used in whatever career 19:56 path you choose for yourself. 19:58 It started in the 10th century in Netherlands for hydrology. 20:03 Netherlands is basically under water, under sea level. 20:06 And so they wanted to build some dikes. 20:08 But once you're going to build a dike, 20:09 you want to make sure that it's going to sustain 20:11 some tides and some floods. 20:13 And so in particular, they wanted 20:15 to build dikes that were high enough, but not too high. 20:19 You could always say, well, I'm going 20:21 to build a 500-meter dike, and then I'm going to be safe. 20:25 You want something that's based on data. 20:27 You want to make sure. 20:28 And so in particular, what did they do? 20:30 Well, they collected data for previous floods. 20:33 And then they just found a dike that 20:36 was going to cover all these things. 20:37 Now, if you look at the data they probably had, 20:40 maybe it was scarce. 20:41 Maybe they had 10 data points. 20:43 And so for those data points, then 20:44 maybe they wanted to sort of interpolate 20:47 between those points, maybe extrapolate for the larger one. 20:50 Based on what they've seen, maybe they 20:51 have chances of seeing something which 20:53 is even larger than everything they've seen before. 20:56 And that's exactly the goal of statistical modeling-- 21:00 being able to extrapolate beyond the data that you have, 21:04 guessing what you have not seen yet might happen. 21:08 When you buy insurance for your car, 21:10 or your apartment, or your phone, 21:14 there is a premium that you have to pay. 21:16 And this premium has been determined 21:17 based on how much you are, in expectation, going 21:21 to cost the insurance. 21:22 It says, OK, this person has, day a 10% chance 21:25 of breaking their iPhone. 21:28 An iPhone costs that much to repair, 21:30 so I'm going to charge them that much. 21:31 And then I'm going to add an extra dollar for my time. 21:34 That's basically how those things are determined. 21:36 And so this is using statistics. 21:39 This is basically where statistics is probably 21:42 mostly used. 21:43 I was personally trained as an actuary. 21:45 And that's me being a statistician at an insurance 21:48 company. 21:50 Clinical trials-- this is also one of the earliest success 21:55 stories of statistics. 21:58 It's actually now widespread. 21:59 Every time a new drug is approved for market by the FDA, 22:04 it requires a very strict regimen of testing with data, 22:08 and control group, and treatment group, 22:10 and how many people you need in there, 22:11 and what kind of significance you need for those things. 22:17 In particular, those things look like this, 22:19 so now it's 5,000 patients. 22:21 It depends on what kind of drug it is, 22:23 but for, say, 100 patients, 56 were cured, 22:26 and 44 showed no improvement. 22:29 Does the FDA consider that this is a good number? 22:31 Do they have a table for how many patients were cured? 22:37 Is there a placebo effect? 22:39 Do I need a control group of people that 22:41 are actually getting a placebo? 22:42 It's not clear, all these things. 22:44 And so there's a lot of things to put into place. 22:46 And there's a lot of floating parameters. 
22:48 So hopefully, we're going to be able to use 22:50 statistical modeling to shrink it down 22:52 to a small number of parameters to be able to ask 22:55 very simple questions. 22:56 "Is a drug effective?" is not a mathematical equation. 22:59 But "Is p larger than 0.5?" 23:02 is a mathematical question. And that's 23:04 essentially what we're going to be doing. 23:05 We're going to take this "is a drug effective?" and reduce it to 23:08 "is a parameter larger than 0.5?" 23:12 Now, of course, genetics is using that. 23:15 That's typically actually the same size of data 23:19 that you would see for fMRI data. 23:21 So this is actually a study that I found. 23:25 You have about 4,000 cases of Alzheimer's and 8,000 controls. 23:29 So people without Alzheimer's-- that's what's called a control. 23:32 That's something just to make sure 23:34 that you can see the difference with people 23:38 that are not affected by either a drug or a disease. 23:42 Is the gene APOE associated with Alzheimer's disease? 23:46 Everybody can see why this would be an important question. 23:49 We now have CRISPR. 23:50 It's targeted to very specific genes. 23:52 If we could edit it, or knock it down, or knock it 23:55 up, or boost it, maybe we could actually 23:57 have an impact on that. 23:58 So those are very important questions, 24:00 because we have the technology to target those things. 24:02 But we need the answers about what those things are. 24:04 And there's a bunch of other questions. 24:07 The minute you're going to talk to biologists and say, 24:09 I can do that, 24:10 they're going to say, OK, are there 24:11 any other genes within the genes, 24:12 or any particular SNPs that I can actually look at? 24:15 And they're looking at very different questions. 24:17 And when you start asking all these questions, 24:19 you have to be careful, because you're reusing your data again. 24:22 And it might lead you to wrong conclusions. 24:26 And those are all over the place, those things. 24:29 And that's why they go all the way to John Oliver talking 24:32 about them. 24:33 Any questions about those examples? 24:35 24:37 So this is really a motivation. 24:38 Again, we're not going to just take 24:40 this data set of those cases and look at them in detail. 24:46 So what is common to all these examples? 24:49 Like, why do we have to use statistics 24:50 for all those things? 24:53 Well, there's the randomness of the data. 24:55 There's some effect that we just don't understand-- 24:59 for example, the randomness associated with the lighting up 25:02 of some voxels. 25:04 Or the fact that, as far as the insurance 25:06 is concerned, whether you're going to break your iPhone 25:09 or not is essentially a coin toss. 25:10 Sure, it's biased. 25:11 But it's a coin toss. 25:14 From the perspective of the statistician, 25:16 those things are actually random events. 25:18 And we need to tame this randomness, 25:20 to understand this randomness. 25:21 Is this going to be a lot of randomness? 25:23 Or is it going to be a little randomness? 25:25 Is it going to be something that's-- 25:26 25:29 25:32 let's see, for example, for the floods. 25:35 Were the floods that I saw consistently almost 25:38 the same size? 25:40 Was it almost a rounding error, or were they just 25:43 really widespread? 25:44 All these things, we need to understand 25:45 so we can understand how to build those dikes 25:48 or how to make decisions based on those data. 25:54 And we need to understand this randomness.
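To make the clinical-trial numbers quoted above concrete (100 patients, 56 cured), here is a minimal sketch, not from the lecture, of the kind of calculation that turns "is the drug effective?" into "is p larger than 0.5?". It assumes the number of cured patients is modeled as Binomial(100, p) and that Python with scipy is available.

```python
# A minimal sketch (not from the lecture): is 56 cured out of 100
# convincing evidence that p > 0.5, under a Binomial(n=100, p) model?
from scipy.stats import binomtest

n, cured = 100, 56  # the numbers quoted in the clinical-trial example

# One-sided test of H0: p = 0.5 against H1: p > 0.5.
result = binomtest(cured, n, p=0.5, alternative="greater")
print(f"p_hat = {cured / n:.2f}")
print(f"one-sided p-value = {result.pvalue:.3f}")
# The p-value comes out around 0.14, so 56/100 on its own is not
# significant at the usual 5% level -- exactly the "or was it just
# by chance?" question the lecture raises.
```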
25:58 OK, so the associated questions to randomness 26:01 were actually hidden in the text. 26:03 So we talked about the notion of average. 26:05 Right, so as far as the insurance is concerned, 26:08 they want to know in average with the probability is. 26:10 Like, what is your chance of actually breaking your iPhone? 26:13 And that's what came in this notion of fair premium. 26:18 There's this notion of quantifying chance. 26:21 We don't want to talk maybe only about average, 26:23 maybe you want to cover say 99% percent of the floods. 26:26 So we need to know what is the height of a flood that's 26:31 higher than 99% of the floods. 26:34 But maybe there's 1% of them, you know. 26:36 When doomsday comes, doomsday comes. 26:38 Right, we're not going to pay for it. 26:40 All right, so that's most of the floods. 26:43 And then there's questions of significance, right? 26:45 So you know I give this example, a second ago 26:47 about clinical trials. 26:50 I give you some numbers. 26:51 Clearly the drug cured more people than it did not. 26:55 But does it mean that it's significantly good, 26:58 or was this just by chance. 26:59 Maybe it's just that these people just recovered. 27:01 It's like you know curing a common cold. 27:04 And you feel like, oh I got cured. 27:06 But it's really you waited five days and then you got cured. 27:09 All right, so there's this notion of significance, 27:11 of variability. 27:12 All these things are actually notions 27:15 that describe randomness and quantify randomness 27:18 into simple things. 27:19 Randomness is a very complicated beast. 27:21 But we can summarize it into things that we understand. 27:24 Just like I am a complicated object. 27:27 I'm made of molecules, and made of genes, 27:29 and made of very complicated things. 27:31 But I can be summarized as my name, my email address, 27:34 my height and my weight, and maybe for most of you, 27:37 this is basically enough. 27:39 You will recognize me without having 27:41 to do a biopsy on me every time you see me. 27:45 All right, so, to understand randomness 27:49 you have to go through probability. 27:51 Probability is the study of randomness. 27:53 That's what it is. 27:54 That's what the first sentence that a lecturer in probability 27:57 will say. 27:58 And so that's why I need the pre-requisite, because this 28:02 is what we're going to use to describe the randomness. 28:04 We'll see in a second how it interacts with statistics. 28:07 So sometimes, and actually probably most of the time 28:10 throughout your semester on probability, 28:13 randomness was very well understood. 28:15 When you saw a probability problem, here 28:18 was the chance of this happening, 28:19 here was the chance of that happening. 28:21 Maybe you had more complicated questions 28:23 that you had some basic elements to answer. 28:26 For example, the probability that I have HBO is this much. 28:32 And the probability that I watch Game of Thrones is that much. 28:34 And given that I play basketball what is the probability-- 28:38 you had all these crazy questions, 28:39 but you were able to build them. 28:42 But all the basic numbers were given to you. 28:45 Statistics will be about finding those basic numbers. 28:48 All right so some examples that you've probably seen 28:51 were dice, cards, roulette, flipping coins. 28:55 All of these things are things that you've 28:57 seen in a probability class. 28:59 And the reason is because it's very easy 29:00 to describe the probability of each outcome. 
29:02 For a die, we know that each face is going 29:05 to come up with probability 1/6. 29:07 Now I'm not going to go into a debate of whether this 29:09 is pure randomness or this is determinism. 29:12 I think as a model for actual randomness 29:14 a die is a pretty good model, flipping a coin 29:18 is a pretty good model. 29:20 So those are actually good things. 29:22 So the questions that you would see, for example, 29:24 in probability are the following. 29:26 I roll one die. 29:27 Alice gets $1 if the number of dots is at most three. 29:31 Bob gets $2 if the number of dots is at most two. 29:35 Do you want to be Alice or Bob, given that your goal is 29:37 actually to make money? 29:40 Yeah, you want to be Bob, right? 29:43 So let's see why. 29:45 So let's look at the expectation of what 29:47 Alice makes. 29:48 So let's call it A. 29:51 This is $1, with probability 1/2. 29:56 So 3/6, that's 1/2. 29:59 And the expectation of what Bob makes, 30:02 this is $2 with probability 2/6, and that's 2/3. 30:11 Which is definitely larger than 1/2. 30:13 So Bob's expectation is actually a bit higher. 30:17 So those are the kind of questions that you 30:18 may ask with probability. 30:19 I described it to you exactly-- you used the fact 30:21 that the die would show at most three dots 30:25 with probability one half. 30:26 We knew that. 30:27 And I didn't have to describe to you what was going on there. 30:29 You didn't have to collect data about a die. 30:32 Same thing, you roll two dice. 30:34 You choose a number between 2 and 12 30:36 and you win $100 if you chose the sum of the two dice. 30:42 Which number do you pick? 30:45 What? 30:46 AUDIENCE: 7. 30:47 PHILIPPE RIGOLLET: 7. 30:48 Why 7? 30:48 AUDIENCE: It's the most likely. 30:50 PHILIPPE RIGOLLET: That's the most likely one, right? 30:52 So your expected gain here will be $100 times the probability 30:56 that the sum of the two dice, let's say X plus Y, 31:00 is equal to your little z, where little z is 31:03 the number you pick. 31:04 31:07 So 7 is the most likely to happen, 31:10 and that's the one that maximizes this function of z. 31:14 And for this you need to study a more complicated function. 31:17 But it's a function that involves two dice. 31:18 But you can compute the probability that X plus Y 31:21 is equal to z, for every z between 2 and 12. 31:26 So you know exactly what the probabilities are, 31:29 and that's how you start probability. 31:30 31:35 So here, that's exactly what I said. 31:38 You have a very simple process that describes basic events. 31:43 Probability 1/6 for each of them. 31:45 And then you can build up on that, 31:46 and understand probabilities of more complicated events. 31:49 You can throw some money in there. 31:50 You can build functions. 31:52 You can do very complicated things building on that. 31:56 Now if I was a statistician, a statistician 31:59 would be the guy who just arrived on Earth, 32:01 had never seen a die, and needs to understand 32:03 that a die comes up with probability 1/6 on each side. 32:05 And the way he would do it is just to roll the die 32:08 until he gets some counts and tries to estimate those. 32:12 And maybe that guy would come and say, 32:14 well, you know, actually, the probability 32:16 that I get a 1 is 1/6 plus 0.001 and the probability 32:23 that I get a 2 is 1/6 minus 0.005. 32:27 And there would be some fluctuations around this. 32:29 And it's going to be his role as a statistician 32:31 to say, listen, this is too complicated 32:33 of a model for this thing.
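Here is a minimal sketch, not from the lecture, of the two dice questions above plus the "statistician who has never seen a die", written in Python. The payoff rule is taken as "at most three" and "at most two" dots, matching the 3/6 and 2/6 used in the expectations; the simulation at the end is an illustration, not the lecture's data.

```python
# A minimal sketch (not from the lecture) of the dice examples above,
# assuming a fair six-sided die.
import random
from collections import Counter
from fractions import Fraction

faces = range(1, 7)

# Alice gets $1 if the roll shows at most 3 dots; Bob gets $2 if it
# shows at most 2 dots.
E_alice = Fraction(1) * Fraction(3, 6)   # 1 * P(roll <= 3) = 1/2
E_bob = Fraction(2) * Fraction(2, 6)     # 2 * P(roll <= 2) = 2/3
print(E_alice, E_bob)                    # 1/2 vs 2/3: Bob does better

# Sum of two dice: P(X + Y = z) for z = 2..12 is maximized at z = 7.
sum_probs = Counter()
for x in faces:
    for y in faces:
        sum_probs[x + y] += Fraction(1, 36)
print(max(sum_probs, key=sum_probs.get))  # 7

# The statistician who has never seen a die: roll it many times and
# estimate each face's probability; the estimates hover around 1/6.
rolls = [random.randint(1, 6) for _ in range(10_000)]
counts = Counter(rolls)
print({face: counts[face] / len(rolls) for face in faces})
```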
32:34 And these should all be the same numbers. 32:36 Just looking at data, they should be all the same numbers. 32:39 And that's part of the modeling. 32:40 You make some simplifying assumptions 32:41 that essentially make your questions more accurate. 32:46 Now, of course, if your model is wrong, 32:48 if it's not true that all the faces arrive 32:50 with the same probability, then you have a model error here. 32:54 So we will be making model errors. 32:56 But that's going to be the price to pay 32:57 to be able to extract anything from our data. 32:59 33:02 So for more complicated processes, 33:07 so of course nobody's going to waste their time rolling dice. 33:09 I mean, I'm sure you might have done 33:11 this in AP stat or something. 33:13 But the need is to estimate parameters from data. 33:18 All right, so for more complicated things 33:19 you might want to estimate some density parameter 33:27 on a particular set of material. 33:29 And for this maybe you need to beam something to it, 33:31 and measure how fast it's coming back. 33:33 And you're going to have some measurement errors. 33:35 And maybe you need to do that several times 33:37 and you have a model for the physical process that's 33:39 actually going on. 33:40 And physics is usually a very good way 33:42 to get models for engineering perspective. 33:46 But there's models for sociology where we 33:49 have no physical system, right. 33:52 God knows how people interact. 33:53 And maybe I'm going to say that the way 33:55 I make friends is by first flipping a coin in my pocket. 33:59 And with probability 2/3, I'm going 34:01 to make my friend at work. 34:02 And with probability 1/3 I'm going 34:04 to make my friend at soccer. 34:05 And once I make my friends at soccer-- 34:07 I decide to make my friend soccer. 34:09 Then I will face someone who's flipping 34:11 the same coin with maybe be slightly different parameters. 34:14 But those things actually exist. 34:16 There's models about how friendships are formed. 34:18 And the one I described is called 34:22 the mixed-membership model. 34:23 So those are models that are sort of hypothesized. 34:25 And they're more reasonable than taking into account 34:29 all the things that made you meet that person 34:31 at that particular time. 34:34 So the goal here-- so based on data now, 34:38 once we have the model is going to be reduced to maybe two, 34:41 three, four parameters, depending 34:42 on how complex the model is. 34:44 And then your goal will be to estimate those parameters. 34:48 So sometimes the randomness we have here is real. 34:51 So there's some true randomness in some surveys. 34:56 If I pick a random student, as long 34:58 as I believe that my random number generator that 35:00 will pick your random ID is actually random, 35:04 there is something random about you. 35:06 The student that I pick at random 35:07 will be a random student. 35:09 The person that I call on the phone is a random person. 35:13 So there's some randomness that I can build into my system 35:16 by drawing something from a random number generator. 35:20 A biased coin is a random thing. 35:22 It's not a very interesting random thing. 35:24 But it is a random thing. 35:26 Again, if I wash out the fact that it actually 35:29 is a deterministic mechanism. 35:30 But at a certain accuracy, a certain granularity, 35:33 this can be thought of as a truly random experiment. 35:36 Measurement error for example, if you by some measurement 35:39 device. 35:39 or some optics device, for example. 
35:42 You will have like standard deviation and things that 35:45 come on the side of the box. 35:46 And it tells you, this will be making some measurement error. 35:48 And it's usually thermal noise maybe, or things like this. 35:51 And those are very accurately described 35:54 by some random phenomenon. 35:56 But sometimes, and I'd say most times, there's no randomness. 36:01 There's no randomness. 36:02 It's not like you breaking your iPhone is a random event. 36:06 This is just something that we sweep-- 36:09 randomness is a big rug under which we sweep 36:11 everything we don't understand. 36:13 And we just hope that in average we've 36:15 captured, the average effect of what's going on. 36:18 And the rest of it might fluctuate to the right, 36:20 might fluctuate to the left. 36:22 But what remains is just sort of randomness 36:26 that can be averaged out. 36:27 So, of course, this is where the leap of faith is. 36:31 We do not know whether we were correct of doing this. 36:33 Maybe we make some huge systematic biases 36:35 by doing this. 36:36 Maybe we forget a very important component. 36:39 Right, for example, if I have-- 36:42 I don't know, let's think of something-- 36:45 36:49 a drug for breast cancer. 36:51 All right, and I throw out the fact 36:52 that my patient is either a man or woman. 36:55 I'm going to have some serious model biases. 36:57 Right. 36:58 So if I say I'm going to collect a random and patient. 37:00 And said I'm going to start doing this. 37:02 There's some information that I really need, clearly, 37:04 to build into my model. 37:06 And so the model should be complicated enough, but not too 37:10 complicated. 37:11 Right so it should take into account things 37:13 there will systematically be important. 37:17 So, in particular, the simple rule of thumb 37:19 is, when you have a complicated process, 37:24 you can think of it as being a simple process 37:26 and some random noise. 37:28 Now, again, the random noise is everything 37:30 you don't understand about the complicated process. 37:33 And the simple process is everything you actually do. 37:37 So good modeling, and this is not 37:40 where we'll be seeing in this class, 37:43 consistent choosing plausible simple models. 37:46 And this requires a tremendous amount of domain knowledge. 37:50 And that's why we're not doing it in this class. 37:52 This is not something where I can make a blanket statement 37:54 about making good modeling. 37:55 You need to know, if I were a statistician working 37:58 on a study, I would have to grill the person in front 38:00 of me, the expert, for two hours to know, but how about this? 38:04 How about that? 38:05 How does this work? 38:06 So it requires to understand a lot of things. 38:08 There's this famous statistician to whom this sentence is 38:14 attributed, and it's probably not his then, 38:16 but Tukey said that he loves being a statistician, 38:21 because you get to play in everybody's backyard. 38:23 Right, so you get to go and see people. 38:25 And you get to understand, at least to a certain extent, what 38:28 their problems are. 38:29 Enough that you can actually build 38:31 a reasonable model for what they're actually doing. 38:33 So you get to do some sociology. 38:34 You get to do some biology. 38:35 You get to do some engineering. 38:37 And you get to do a lot of different things. 38:39 Right, so he was actually at some point 38:40 predicting the presidential election. 38:46 So, you see, you get to do a lot of different things. 
38:48 But it requires a lot of time to understand 38:50 what problem you're working on. 38:52 And if you have a particular application in mind, 38:54 you're the best person to actually understand this. 38:56 So I'm just going to give you the basic tools. 38:58 39:07 So this is the circle of trust. 39:11 No, this is really just a simple graphic 39:14 that tells you what's going on. 39:15 When you do probability, you're given the truth. 39:19 Somebody tells you what die God is rolling. 39:24 So you know exactly what the parameters of the problem are. 39:27 And what you're trying to do is to describe what 39:29 the outcomes are going to be. 39:31 You can say, if you're rolling a fair die, 39:34 you're going to have-- 1/6 of the time in your data 39:36 you're going to have a one. 39:37 1/6 of the time you're going to have a two. 39:39 And so you can describe-- if I told you what the truth is, 39:42 you could actually go into a computer 39:44 and either generate some data, 39:46 or you could describe to me some more macro properties 39:51 of what the data would be like. 39:52 Oh, I would see a bunch of numbers 39:54 that would be centered around 35, if I 39:57 drew from a Gaussian distribution centered at 35. 40:00 Right, you would know this kind of thing. 40:01 I would know that it's very unlikely, if my Gaussian 40:07 has standard deviation-- 40:08 is centered at 0, say, with standard deviation 3-- 40:13 it's very unlikely that I will see numbers below minus 10 40:17 or above 10, right? 40:18 You know this, that you basically will not see them. 40:21 So you know, from the truth, from the distribution 40:25 of a random variable that does not have mu or sigma in it but really 40:27 numbers there, 40:28 you know what data you're going to be having. 40:31 Statistics is about going backwards. 40:33 It's saying, if I have some data, what was 40:37 the truth that generated it? 40:39 And since there are so many possible truths, 40:41 modeling says you have to pick one 40:44 of the simpler possible truths, so that you can average out. 40:47 Statistics basically means averaging. 40:49 You're averaging when you do statistics. 40:51 And averaging means that if I say 40:54 that I received-- so if I collect 40:56 all your GPAs, for example, 40:58 and my model is that the possible GPAs 41:01 are any possible numbers, 41:03 and anybody can have any possible GPA, 41:05 this is going to be a serious problem. 41:06 But if I can summarize those GPAs into two numbers, 41:09 say, mean and standard deviation, 41:11 then I have a pretty good description of what 41:13 is going on, rather than having 41:15 to predict the full list. 41:16 Right, if I learn a full list of GPAs and I say, 41:18 well, this was the distribution, 41:20 then it's not going to be of any use for me to predict what 41:22 the GPA would be of some random student walking in, 41:25 or something like this. 41:26 41:30 So just to finish my rant about probability versus statistics, 41:34 this is a question you would see in a probability-- this 41:37 is a probabilistic question, and this is a statistical question. 41:40 The probabilistic question is, previous studies 41:42 showed that the drug was 80% effective. 41:45 So you know that. 41:46 This is the effectiveness of the drug. 41:48 It's given to you. 41:49 This is how your problem starts. 41:51 Then we can anticipate that, for a study on 100 patients, 41:54 on average, 80 will be cured. 41:57 And at least 65 will be cured with 99% chances.
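That last claim can be checked directly. Here is a minimal sketch, not from the lecture, assuming the number of cured patients in the study is Binomial(n = 100, p = 0.8) and that Python with scipy is available.

```python
# A minimal sketch (not from the lecture): the "probability direction".
# If the drug is 80% effective and patients are independent, the
# number cured out of 100 is Binomial(n=100, p=0.8).
from scipy.stats import binom

X = binom(100, 0.8)

print(X.mean())   # 80.0 -- on average, 80 patients are cured
print(X.sf(64))   # P(X >= 65) = P(X > 64), roughly 0.9999
# So "at least 65 will be cured with 99% chances" is a safe prediction.
```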
42:00 So again, these are not-- 42:03 I'm not predicting, on 100 patients, exactly the number 42:05 of them that are going to be cured 42:07 and the number of them that are not. 42:08 But I'm actually sort of predicting 42:11 what things are going to look like on average, 42:13 or some macro properties of what my data sets will look like. 42:17 So with 99% chances-- that means 42:19 that for 99.99% of the data sets you would 42:23 draw from this particular distribution, 42:25 99.99% of the cohorts of 100 patients to whom you administer 42:30 this drug, I will be able to conclude that at least 65 42:34 of them will be cured-- on 99.99% of those data sets. 42:41 So that's a pretty accurate prediction 42:42 of what's going to happen. 42:45 Statistics is the opposite. 42:46 It says, well, I just know that 78 out of 100 were cured. 42:49 I have only one data set. 42:50 I cannot make predictions for all data sets. 42:53 But I can go back to the probability, 42:57 make some inference about what my probability will look 42:59 like, and then say, OK, then I can make those predictions 43:03 later on. 43:04 So when I start with 78/100, then maybe-- 43:08 in this case, I just don't know. 43:11 My best guess here is 78%, and I 43:16 have to add the extra error that I might be making by predicting 43:19 that here, the drug is not 80% effective but 78% effective. 43:25 And I need some error bars around this, 43:27 that will hopefully contain 80%, and then based on those error 43:30 bars I'm going to make slightly less precise predictions 43:34 for the future. 43:35 43:39 So, to conclude, so this was, why statistics? 43:44 So what is this course about? 43:46 It's about understanding the mathematics 43:48 behind statistical methods. 43:50 It's more of a tool. 43:51 We're not going to have fun and talk about algebraic geometry 43:54 just for fun in the middle of it. 43:57 So it justifies quantitative statements given some modeling 44:01 assumptions, and we will, in this class, 44:03 mostly admit that the modeling assumptions are correct. 44:06 In the first part-- in this introduction-- 44:08 we will go through them, because it's 44:10 very easy to forget what assumptions you're actually 44:12 making. 44:13 But this will be a pretty standard thing. 44:15 The words you will hear a lot are IID-- 44:18 independent and identically distributed-- 44:20 that means that your data is basically all the same, 44:23 and one data point is not impacting another data point. 44:28 Hopefully we can describe some interesting mathematics 44:30 arising in statistics. 44:31 You know, if you've taken linear algebra, 44:33 maybe we can explain to you why. 44:36 If you've done some calculus, maybe we 44:38 can do some interesting calculus. 44:40 We'll see how, in the spirit of applied math, 44:42 those things answer interesting questions. 44:45 And basically we'll try to carve out a math toolbox that's 44:49 useful for statistics. 44:52 And maybe you can extend it to more sophisticated methods 44:55 that we did not cover in this class. 44:57 In particular, in a machine learning class, 44:59 hopefully you'll be able to have some statistical intuition 45:02 about what is going on. 45:04 So what this course is not about-- 45:06 it's not about spending a lot of time looking at data sets, 45:09 and trying to understand some statistical-thinking 45:13 kind of questions. 45:14 That is more of an applied statistics perspective 45:16 on things, or more modeling. 45:19 So I'm going to typically give you the model.
45:22 And say this is a model. 45:23 And this is how we're going to build an estimator 45:26 in the framework of this model. 45:28 So for example, 18.075, to a certain extent, 45:30 is called "Statistical Thinking and Data Analysis." 45:32 So I'm hoping there is some statistical thinking in there. 45:36 We will not talk about software implementation. 45:38 Unfortunately, there's just too little time in a semester. 45:42 There's other courses that are giving you some overview. 45:45 So the main software these days are R 45:49 is the leading software I'd say in statistics, both in academia 45:54 and industry, lots of packages, one every day 45:58 that's probably coming out. 46:00 But there's other things, right, so now Python is probably 46:03 catching up with all these scikit-learn packages that 46:09 are coming up. 46:10 Julia has some statistics in there, 46:14 but it really if you were to learn a statistical software, 46:17 let's say you love doing this, this 46:19 would be the one that would prove most useful for you 46:21 in the future. 46:22 It does not scale super well to high dimensional data. 46:26 So there is a class an IDSS that actually 46:28 uses R. It's called IDS 0.12, I think 46:31 it's called "Statistics, Computation, and Applications," 46:36 or something like this. 46:37 I'm also preparing, with Peter Kempthorne, 46:40 a course called "Computational Statistics." 46:42 46:47 It's going to be offered this Spring as a special topics. 46:50 And so Peter Kempthorne will be teaching it. 46:55 And this class will actually focus 46:58 on using R. And even beyond that, 47:00 it's not just going to be about using. 47:02 It's going to be about understanding-- 47:04 just the same way we we're going to see 47:05 how math helps you do statistics, 47:07 it's going to help see how math helps you 47:09 do algorithims for statistics. 47:12 All right, so we'll talk about maximum likelihood estimator. 47:15 Will need to maximize some function. 47:16 There's an optimization toolbox to do that. 47:19 And we'll see how we can have specialized 47:20 for statistics for that, and what 47:22 are the principles behind it. 47:25 And you know, of course, if you've 47:26 taken AP stats you probably think that stats 47:29 is boring to death because it was just 47:31 a long laundry-list that spent a lot of time on t-test. 47:34 I'm pretty sure we're not going to talk about t-test, well, 47:37 maybe once. 47:38 But this is not a matter of saying you're going to do this. 47:42 And this is a slight variant of it. 47:43 We're going to really try to understand what's going on. 47:46 So, admittedly, you have not chosen the simplest way 47:49 to get an A in statistics on campus. 47:52 All right, this is not the easiest class. 47:54 It might be challenging at times, 47:56 but I can promise you that you will maybe suffer. 47:59 But you will learn something by the time 48:01 you're out of this class. 48:02 This will not be a waste of your time. 48:04 And you will be able to understand, 48:06 and not having to remember by heart how those things actually 48:09 work. 48:10 Are there any questions? 48:13 Anybody want to go to other stats class on campus? 48:16 Maybe it's not too late. 48:18 OK. 48:21 So let's do some statistics. 48:25 So I see the time now and it's 11:56, 48:29 so we have another 30 minutes. 48:31 I will typically give you a three, 48:35 four minute break if you want to stretch, 48:37 if you want to run to the bathroom, 48:39 if you want to check your texts or Instagram. 
48:45 There was very little content in this class, 48:47 hopefully it was entertaining enough 48:49 that you don't need the break. 48:51 But just in the future, so you know you will have a break. 48:55 So statistics, this is how it starts, I'm French, what can 49:01 I say I need to put some French words. 49:05 So this is not how office hours are going to go down. 49:08 49:12 Anybody know this sculpture by a Rodin, The Kiss. 49:16 Maybe probably The Thinker is more famous. 49:18 But this is actually a pretty famous one. 49:20 But is it really this one, or is it this one. 49:23 Anybody knows which one it is? 49:26 This one? 49:27 Or this one? 49:28 AUDIENCE: The previous. 49:30 PHILIPPE RIGOLLET: What's that? 49:32 AUDIENCE: This one. 49:33 PHILIPPE RIGOLLET: It's this one. 49:33 AUDIENCE: Final answer. 49:35 PHILIPPE RIGOLLET: Yeah, who votes for this one. 49:39 OK. 49:40 Who votes for that one? 49:42 Thank you. 49:42 I love that you do not want to pronounce yourself with no data 49:45 actually to make any decision. 49:47 This is a total coin toss right. 49:49 Turns out that there is data, and there 49:51 is in the very serious journal Nature, 49:53 someone published a very serious paper which 49:56 actually looks pretty serious. 49:58 If you look at it, it's like "Human Behavior: 50:00 Adult persistence of head-turning symmetry," 50:02 is a lot of fancy words in there. 50:04 And this, I'm not kidding you, this study 50:07 is about collecting data of people kissing, 50:09 and knowing if they bend their head to the right 50:12 or if they bend they head to the left. 50:14 And that's all it is. 50:15 And so a neonatal right-side preference 50:21 makes a surprising romantic reappearance in later life. 50:25 There's an explanation for it. 50:27 All right, so if we follow this Nature which one is the one. 50:32 This one? 50:33 Or this one? 50:34 This one, right? 50:35 Head to the right. 50:38 And to be fair, for this class I was like, 50:41 oh, I'm going to go and show them what Google Images does. 50:46 When you Google kissing couple, it's 50:49 inappropriate after maybe the first picture. 50:51 And so I cannot show you this. 50:53 But you know you can check for yourself. 50:55 Though I would argue, so this person 50:57 here actually went out in airports 51:00 and took pictures of strangers kissing and collecting data. 51:06 And can somebody guess why did he just not stay home 51:10 and collect data from Google Images 51:12 by just googling kissing couples. 51:17 What's wrong with this data? 51:19 I didn't know actually before I actually went on Google Images. 51:22 AUDIENCE: It can be altered? 51:23 PHILIPPE RIGOLLET: What was that? 51:24 AUDIENCE: It can be altered. 51:25 PHILIPPE RIGOLLET: It can be altered. 51:26 But, you know, who would want to do this? 51:28 I mean there's no particular reason why 51:29 you would want to flip an image before putting it out there. 51:31 I mean, you might, but you know maybe they 51:34 want to hide the brand of your Gap shirt or something. 51:38 AUDIENCE: I guess the people who post pictures of themselves 51:42 kissing on Google Images are not representative 51:44 of the general population. 51:45 PHILIPPE RIGOLLET: Yeah, that's very true. 51:47 And actually it's even worse than that. 51:49 The people who post pictures of themselves, 51:51 are not posting pictures of themselves 51:52 or putting pictures of the people 51:54 that they took a picture of. 51:55 And there usually is a stock watermark on this. 51:59 And it's basically stock images. 
52:00 Those are actors, and so they've been directed to kiss, 52:03 and this is not a natural thing to do. 52:06 And actually, if you go to Google Images-- and I 52:08 encourage you to do this, unless you 52:10 don't want to see inappropriate pictures, 52:12 and they're mightily inappropriate-- 52:14 basically you will see that what this study found is actually not 52:19 showing up at all. 52:20 I mean, I looked briefly. 52:21 I didn't actually collect numbers. 52:22 But I didn't find a particular tendency to bend right. 52:26 If anything, it was actually probably the opposite. 52:28 And it's because those people were directed to do it. 52:31 They just don't actually think about doing it. 52:34 And also because I think you need 52:36 to justify writing in your paper more than, 52:38 I sat in front of my computer. 52:41 So again, this first sentence here, 52:46 a neonatal right-side preference-- 52:49 "is there a right-side preference?" 52:51 is not a mathematical question. 52:53 But we can start saying, let's put some variables on this, 52:57 and ask questions about those variables. 52:59 So, you know, x is actually not a letter that's 53:02 used very much in statistics for parameters. 53:04 But p is one-- p for parameter. 53:07 And so you're going to take your parameter of interest, 53:09 p, which here is going to be the proportion of couples. 53:12 And that's among all couples. 53:13 So here, if you talk about statistical thinking, 53:17 there would be a question about what population this would 53:20 actually be representative of. 53:22 Usually this is a call to your-- 53:24 53:26 sorry, I should not forget this word, it's important for you. 53:30 53:33 OK, I forget this word. 53:34 So this is-- 53:38 OK. 53:43 So if you look at this proportion, 53:44 maybe these couples that are in the study 53:46 might be representative only of couples in airports. 53:49 Maybe they actually put on a show for the other passengers. 53:51 Who knows? 53:52 You know, like, oh, let's just do it as well. 53:54 And then, just like the people in Google Images, 53:56 they are actually doing it. 53:57 So maybe you want to just restrict it. 53:58 But of course, clearly, if it's appearing in Nature, 54:01 it should not be only about couples in airports. 54:04 It's supposedly representative of all couples in the world. 54:07 And so here let's just keep it vague, 54:10 but you need to keep in mind what population 54:12 this is actually making a statement about. 54:14 So you have this full population of couples in the world. 54:20 Right, so those are all the couples. 54:23 And this person went ahead and collected data 54:27 about a bunch of them. 54:29 And we know that, in this thing, there's basically 54:31 a proportion of them, that's like p, 54:33 and that's the proportion of them that's bending 54:35 their head to the right. 54:36 And so everybody on this side is bending their heads right. 54:40 And hopefully we can actually sample 54:41 this thing uniformly. 54:42 That's basically the process that's going on. 54:44 So this is the statistical experiment. 54:47 We're going to observe n kissing couples. 54:49 So here we're going to put as many variables as we can, 54:51 so we don't have to stick with numbers, 54:53 and then we'll just plug in the numbers. 54:55 n kissing couples-- and in statistics, 54:58 by the way, n is the size of your sample 99.9% of the time. 55:04 And collect the value of each outcome. 55:06 So we want numbers. 55:07 We don't want right or left.
55:08 So we're going to code them by 0 and 1, pretty naturally. 55:12 And then we're going to estimate p, which is unknown. 55:16 So p is this area. 55:18 And we're going to estimate it simply 55:19 by the proportion of rights-- so the proportion of crosses 55:24 that actually fell on the right side. 55:27 55:29 So in this study, what you will find 55:33 is that the numbers that were collected 55:36 were 124 couples, and that, out of those 124, 80 of them 55:43 turned their head to the right. 55:46 So, p hat is a proportion. 55:49 How do we compute it? 55:50 Well, you don't need statistics for that. 55:51 You're going to take 80 divided by 124. 55:54 And you will find that in this particular study 55:57 64.5% of the couples were bending 56:00 their heads to the right. 56:01 That's a pretty large number, right? 56:03 The question is, if I picked another 124 couples, maybe 56:07 at different airports, at different times, would I see the same number? 56:10 Would this number be all over the place? 56:11 Would it be sometimes very close to 120, or sometimes 56:14 very close to 10? 56:15 In other words, is this number actually fluctuating a lot? 56:20 And so, hopefully not too much-- 64.5% is definitely 56:26 much larger than 50%. 56:28 And so there seems to be this preference. 56:31 Now we're going to have to quantify 56:33 how strong this preference is. 56:34 Is this number significantly larger than 50%? 56:38 Suppose our data, for example, was just three couples. 56:41 I just go out there, I go to Logan. 56:43 I record it: I see right, right, left. 56:45 And then I see-- 56:47 what's the name of the fish place there? 56:53 I go to Wahlburgers at Logan and I'm like, 56:56 OK, I'm done for the day. 56:57 I collect this data. 56:58 I go home, and I'm like, wow, 66.7% to the right. 57:02 That's a pretty big number. 57:03 It's even farther from 50% than this other guy. 57:06 So I'm doing even better. 57:08 But of course you know that this is not true. 57:10 Three couples is definitely not representative. 57:12 If I stopped at the first one-- actually, 57:13 even at the first two, I would have had 100%. 57:18 So the question that statistics is going to help us answer is, 57:21 how large should the sample be? 57:23 For some reason, I don't know if you guys receive this, 57:25 I'm an affiliate with the Broad Institute, 57:27 and since then I receive one email per day 57:30 that says, sample size determination-- 57:32 how large should your sample be? 57:33 Like, I know how large my sample should be. 57:36 I've taken 18.650 multiple times. 57:39 And so I know, but the question is-- is 124 57:43 a large enough number or not? 57:45 Well, the answer is actually, as usual, it depends. 57:47 It will depend on the true unknown value of p. 57:51 But from those particular values that we got-- so 124 and-- 57:56 how many couples were there? 57:57 80? 57:58 We can actually draw some conclusions. 58:02 So here we said that 80 was large enough-- 58:07 it was allowing us to conclude, at 64.5%, 58:12 that the proportion was larger than 50%. 58:17 50% of 124 is 62. 58:23 So the question is, would I 58:24 be willing to make this conclusion at 63? 58:28 Is that a number that would convince you? 58:30 Who would be convinced by 63? 58:34 Who would be convinced by 72? 58:35 58:38 Who would be convinced by 75? 58:40 Hopefully the number of hands that are raised should grow. 58:42 58:46 Who would be convinced by 80?
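As a quick check of the arithmetic above, here is a minimal Python sketch; the counts (80 out of 124) and the three-couple toy sample are the ones quoted in the lecture, and the variable names are ours:

```python
# Point estimate of p from the counts quoted in the lecture.
n_right, n_total = 80, 124
p_hat = n_right / n_total
print(f"p_hat = {p_hat:.3f}")    # ~0.645, i.e. 64.5% of couples turned right

# Same computation on the tiny "three couples at Logan" sample (1 = right, 0 = not right).
tiny = [1, 1, 0]
print(sum(tiny) / len(tiny))     # 0.667 -- an even larger fraction, but from far too little data
```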
58:48 All right, so basically those numbers 58:51 don't come out of nowhere. 58:52 This 72 is the number that you would need for a study-- 58:56 the number that most statistical studies 58:58 would retain. 58:59 That is, out of 124, 59:01 you would need to see 72 that turn their head 59:04 right to actually make this conclusion. 59:07 And then 75-- 59:08 so we'll see that there are many ways to come to this conclusion 59:11 because, as you can see, this was 59:12 published in Nature with 80. 59:15 So that was OK. 59:15 So 80 is actually a very large number. 59:17 This is 99 point-- 59:20 no, so 72 is 95% confidence. 59:24 75 is 99% confidence. 59:26 And 80 is 99.9% confidence. 59:29 So if you said 80, you're a very conservative person. 59:34 Starting at 72, you can start making this conclusion. 59:36 59:39 To understand this, we need to go into 59:41 our little mathematical kitchen here, 59:45 and we need to do some modeling. 59:49 So by modeling, 59:51 we need to understand what random process we think 59:55 this data is generated from. 59:57 So it's going to have some unknown parameters, 59:59 unlike in probability. 60:00 But we need to have basically everything written down 60:02 except for the values of the parameters. 60:04 When I said a die comes up uniformly, each face with probability 1/6, 60:08 everything was specified-- maybe 60:12 here I should say there are six numbers, 60:14 and I just need to fill in those numbers. 60:18 So for i equal 1 to n, I'm going to define 60:23 Ri to be the indicator. 60:27 An indicator is just something that takes value 1 if something 60:29 is true, and 0 if not. 60:31 So it's an indicator that the i-th couple 60:34 turns their head to the right. 60:36 So, Ri-- it's indexed by i. 60:39 And it's 1 if the i-th couple turns their head to the right, 60:42 and 0 if it's-- 60:45 well actually, I guess they can probably kiss straight, right? 60:48 That would be weird, but they might be able to do this. 60:51 So let's say 0 if not right. 60:54 Then the estimator of p, we said, was p hat. 60:56 It was just the ratio of two numbers. 60:58 But really, what it is, is that I sum those Ri's. 61:02 Since I only add those that take value 1, 61:07 this sum here is actually just counting the number of 1's, 61:10 which is another way of saying it's counting the number of couples 61:13 that are kissing to the right. 61:15 And here I don't even have to tell you anything 61:18 about the numbers. 61:20 I only keep track of-- 61:21 first couple is a 0, second couple is a 1, 61:24 third couple is a 0. 61:25 The data set-- you can actually find it online-- 61:27 is just a sequence of 0's and 1's. 61:29 Now clearly, for the question that we're 61:31 asking about this proportion, I don't 61:32 need to keep track of all this information. 61:34 All I need to keep track of is the number 61:37 of 0's and the number of 1's. 61:39 Those are completely interchangeable. 61:42 There's no time effect in this. 61:44 The first couple is no different from the 15th couple. 61:48 So we call this Rn bar. 61:50 That's going to be a very standard notation that we use. 61:52 R might be replaced by other letters like x-- 61:55 so xn bar, yn bar. 61:58 And this thing essentially means that I 62:00 average the Ri's over n of them. 62:04 The bar means the average. 62:06 So I divide the total number of 1's by n.
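The 72 / 75 / 80 cutoffs quoted above can be reproduced with a one-sided Gaussian approximation to the count of right-turners when there is no preference (p = 1/2). This is only a sketch of where those numbers come from, not the course's formal derivation:

```python
# Sketch: cutoffs for n = 124 couples under p = 1/2, using the Gaussian (CLT)
# approximation to the Binomial(124, 1/2) count of right-turners.
import math
from scipy.stats import norm

n = 124
mean = n / 2                  # 62 right-turners expected if there is no preference
sd = math.sqrt(n) / 2         # sqrt(n * 1/2 * 1/2)

for conf in (0.95, 0.99, 0.999):
    z = norm.ppf(conf)                    # one-sided Gaussian quantile
    cutoff = math.ceil(mean + z * sd)     # smallest count that clears the bar
    print(f"{conf:.1%} confidence: need at least {cutoff} out of {n}")
# prints 72, 75, and 80, matching the numbers quoted in the lecture
```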
62:09 So here this sum was equal to 80 in our example, and n 62:13 was equal to 124. 62:16 Now, this is an estimator. 62:18 So an estimator is different from an estimate. 62:20 An estimate is a number. 62:21 My estimate was 64.5%. 62:23 My estimator is this thing where I keep all the variables free. 62:29 And in particular, I keep those variables 62:31 random, because I'm going to think of a random couple 62:34 kissing left or right as the outcome of a random process, 62:37 just like flipping a coin and getting heads or tails. 62:41 And so this thing here is a random variable, Ri. 62:43 And this average is, of course, an average of random variables. 62:46 It's itself a random variable. 62:47 So an estimator is a random variable. 62:49 An estimate is the realization of a random variable, 62:51 or, in other words, the value that you 62:53 get for this random variable once you plug in the numbers 62:56 that you've collected. 62:58 So I can talk about the accuracy of an estimator. 63:01 Accuracy means what? 63:02 Well, what would we want from an estimator? 63:04 Maybe we don't want it to fluctuate too much. 63:06 It's a random variable. 63:07 So I'm talking about the accuracy of a random variable. 63:11 So maybe I don't want it to be too volatile. 63:13 I could have one estimator which would be-- 63:16 just throw out 122 couples, keep only 2, 63:20 and average those two numbers. 63:21 That's definitely a worse estimator 63:23 than keeping all of the 124. 63:25 So I need to find a way to say that. 63:26 And what I'm going to be able to say 63:28 is that the number is going to fluctuate. 63:30 If I take another two couples, 63:31 I'm probably going to get 63:33 a completely different number. 63:34 But if I take another 124 couples two days later, 63:38 maybe I'm going to get a number that's 63:40 very close to 64.5%. 63:43 So that's one thing. 63:43 The other thing we would like about this estimator-- 63:46 beyond 63:47 not being too volatile-- is that 63:49 we want it to be close to the number that we're looking for. 63:54 Here is an estimator-- 63:55 a beautiful one: 63:57 72%. That's an estimator. 64:00 Go out there, do your favorite study 64:02 about drug performance. 64:06 And then they're going to call you, the MIT student who took 64:10 statistics, and they say, so how are you 64:12 going to build your estimator? 64:13 We've collected these 5,000 observations or something like that. 64:15 I'm just going to spit out 72%. 64:17 Whatever the data says, that's an estimator. 64:19 It's a stupid estimator, but it is an estimator. 64:21 And this estimator is not volatile at all. 64:23 Every time you have a new study, 64:25 even if you change fields, it's still going to be 72%. 64:27 This is beautiful. 64:29 The problem is that it's probably not 64:31 very close to the value you're actually trying to estimate. 64:34 So we need two things. 64:35 We need our estimator-- it's a random variable, 64:36 so think in terms of densities. 64:39 We want the density to be pretty narrow. 64:42 We want this thing to have very little spread-- 64:46 so this is definitely better than this. 64:52 But also, we want the number that we're interested in, p, 64:55 to be close to this-- 64:57 to be close to the values that this thing is likely to take. 65:00 If p is here, this is not very good for us. 65:04 So those are basically the things we're going to be looking at. 65:06 The first one is referred to as variance. 65:08 The second one is referred to as bias.
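The variance-versus-bias picture described above is easy to see in a small simulation. This is an illustration only: the "true" p = 0.645 is hypothetical, and the three estimators (keep 2 couples, keep all 124, always answer 72%) are the ones joked about in the lecture:

```python
# Sketch: compare three estimators of p over many simulated studies.
import numpy as np

rng = np.random.default_rng(0)
p_true, n_studies = 0.645, 10_000      # hypothetical truth, number of repeated studies

def p_hats(sample_size):
    # each row is one simulated study of `sample_size` couples (1 = right, 0 = not right)
    data = rng.binomial(1, p_true, size=(n_studies, sample_size))
    return data.mean(axis=1)

keep_2 = p_hats(2)                      # keep only 2 couples: wild fluctuations
keep_124 = p_hats(124)                  # the full sample: much more stable
always_72 = np.full(n_studies, 0.72)    # the "always answer 72%" estimator

for name, est in [("2 couples", keep_2), ("124 couples", keep_124), ("constant 72%", always_72)]:
    print(f"{name:>12}: std = {est.std():.3f}, bias = {est.mean() - p_true:+.3f}")
# the constant estimator has zero variance but pure bias; the 124-couple average
# has small variance and essentially no bias
```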
65:10 Those things come up all over statistics. 65:14 So we need to set up a model. 65:16 So here's the model that we have for this particular problem. 65:20 We need to make assumptions on the observations 65:22 that we see. 65:23 So we said we're going to assume that they are random variables-- 65:25 that's not too much of a leap of faith. 65:27 We're just sweeping under the rug everything 65:29 we don't understand about those couples. 65:31 And the assumption that we make is 65:33 that Ri is a random variable. 65:36 This one you will forget very soon. 65:38 The second one is that each of the Ri's is-- 65:41 well, it's a random variable that takes value 0 or 1. 65:45 Can anybody suggest a distribution 65:47 for this random variable? 65:48 AUDIENCE: Bernoulli. 65:49 PHILIPPE RIGOLLET: What? 65:50 AUDIENCE: Bernoulli. 65:51 PHILIPPE RIGOLLET: Bernoulli, right? 65:51 And it's actually beautiful. 65:53 This is where you have to do the least statistical modeling. 65:56 A random variable that takes value 0 or 1 65:59 is always a Bernoulli. 66:00 That's the simplest variable you can ever think of. 66:02 Any variable that takes only two possible values 66:04 can be reduced to a Bernoulli. 66:06 OK, so this is a Bernoulli. 66:10 And here we make the assumption that it actually 66:12 has parameter p. 66:16 And there's an assumption here. 66:17 Can anybody tell me what the assumption is? 66:21 AUDIENCE: It's the same. 66:22 PHILIPPE RIGOLLET: Yeah, it's the same, right? 66:24 I could have said p i, but it's p. 66:26 And that's where I'm going to be able to start 66:28 doing some statistics. 66:29 It's that I'm going to be able to pool information 66:31 across all my guys. 66:32 If I assume that they all have p i's 66:34 completely uncoupled from each other, 66:36 then I'm in trouble. 66:37 There's nothing I can actually get. 66:39 And then I'm going to assume that those guys are 66:41 mutually independent. 66:42 Most of the time we will just say independent. 66:45 Meaning that it's not like all these guys called each other 66:48 and it's actually a flash mob, 66:50 and they were like, let's all turn our heads to the left. 66:53 That would definitely not 66:54 give you a valid conclusion. 66:59 So, again, randomness is a way of modeling lack 67:02 of information. 67:03 Here there is a way to figure it out. 67:05 Maybe I could have followed all those guys 67:07 and known exactly what they were doing-- maybe 67:09 I could have looked at pictures of them in the womb 67:11 and guessed how they would turn-- by the way, that's 67:14 one of the conclusions: they're guessing 67:16 that we turn our head to the right 67:17 because our head is turned to the right in the womb. 67:21 But we don't know what goes on in the kissers' minds. 67:24 And there's, you know, physics, sociology. 67:26 There are a lot of things that could help us, 67:28 but it's just too complicated to keep track of, 67:31 or too expensive, in many instances. 67:34 Now again, the nicest part of this modeling 67:37 was the fact that the Ri's take only two values, which 67:39 means that the conclusion that they were Bernoulli 67:41 was totally free for us. 67:43 Once we know it's a random variable, it's a Bernoulli. 67:45 Now, as we said, 67:47 they could have been Bernoulli with parameter p i. 67:51 For each i, I could have put a different parameter, 67:55 but I just don't have enough information. 67:57 What would I have said? 67:58 I would say, well, the first couple turned to the right.
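The independence assumption is doing real work here. A small sketch contrasts iid couples with the "flash mob" scenario in which every couple copies a single coin flip; the numbers are illustrative, not from the study:

```python
# Sketch: iid couples versus perfectly dependent ("flash mob") couples.
import numpy as np

rng = np.random.default_rng(1)
p_true, n, n_studies = 0.645, 124, 10_000

iid = rng.binomial(1, p_true, size=(n_studies, n)).mean(axis=1)
flash_mob = rng.binomial(1, p_true, size=n_studies)   # one flip per study, copied by all n couples

print("iid couples:       std of p_hat =", round(iid.std(), 3))        # ~0.04
print("flash-mob couples: std of p_hat =", round(flash_mob.std(), 3))  # ~0.48 -- averaging did not help
```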
68:00 p1 has to be 1, that's my best guess. 68:04 The second couple kissed to the left-- 68:06 well, p2 should be 0, that's my best guess. 68:10 And so basically I need to be 68:14 able to average my information. 68:16 And the way I do it is by coupling all these guys-- 68:19 the p i's are the same p for all i. 68:22 OK, does it make sense? 68:23 Here what I am assuming is that my population is homogeneous. 68:28 Maybe it's not. 68:29 Maybe I could actually look at a finer grain, 68:31 but I'm basically making a statement about a population. 68:35 And so maybe you kiss to the left, and then you're not-- 68:41 I'm not making a statement about a person individually, 68:44 I'm making a statement about the overall population. 68:47 Now, independence is probably reasonable, right? 68:49 This person just went out there, and we can seriously 68:53 hope that these couples did not communicate with each other, 68:56 or that, you know, Tanya did not text everyone that we should all 68:59 turn our heads to the left now. 69:01 And there's no external stimulus that forces people 69:05 to do something different. 69:08 OK, so-- sorry about that. 69:15 Since we have a bit less than 10 minutes, 69:19 let's do a few exercises-- is that OK with you? 69:22 So I just have some exercises so we can see what 69:24 an exercise is going to look like. 69:26 This is sort of similar to the exercises you will see with me. 69:30 We should do it together, OK? 69:31 So now we're going to have-- 69:32 I have a test. 69:33 69:36 So that's an exam in probability. 69:42 OK. 69:44 And I'm going to have 15 students take this test. 69:50 And hopefully, these should be 15 grades 69:53 that are representative of the grades of a large class. 69:57 Right, so if you take, you know, 18.600, it's a large class-- 70:00 there are definitely more than 15 students. 70:02 And maybe, just by sampling 15 students at random, 70:04 I want to have an idea of what my grade distribution will 70:08 look like. 70:09 I'm grading them, and I want to make an educated guess. 70:13 So I'm going to make some modeling assumptions 70:15 for those guys. 70:16 So here, there are 15 students and the grades are x1 to x15. 70:22 Just like we had R1, R2, all the way to R124. 70:26 Those were my Ri's. 70:27 And now I have my xi's. 70:29 And I'm going to assume that xi follows 70:33 a Gaussian, or normal, distribution with mean mu 70:39 and variance sigma squared. 70:40 Now, this is modeling, right? 70:43 Nobody told me this-- there's no physical process that 70:45 makes this happen. 70:46 We know that there's something called the central limit 70:48 theorem in the background that says that things 70:50 tend to be Gaussian, but this is really a matter of convenience. 70:53 Actually, if you think about it, 70:55 this is terrible, because it puts non-zero probability 70:57 on negative scores. 70:58 I'm definitely not going to get a negative score. 71:00 But, you know, it's good enough, because 71:02 we know the probability is non-zero, but it's probably 10 71:05 to the minus 12. 71:06 So I would be very unlucky to see a negative score. 71:10 So here's the list of grades: I have 65, 41, 70, 90, 58, 82, 71:24 76, 78-- 71:28 maybe I should have done it with 8 --59, 59-- 71:35 sitting next to each other --84, 89, 134, 51, and 72. 71:47 So those are the scores that I got. 71:51 There were clearly some bonus points over there. 71:53 And the question is, find an estimator for mu. 72:05 What is my estimator for mu? 72:06 72:09 Well, an estimator, again, is something that 72:11 depends on the random variables.
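The "negative scores" caveat about the Gaussian model can be checked numerically. This sketch uses illustrative parameter values (the 67.5 and 18 that come up below), not anything taken from the lecture slides:

```python
# Sketch: how much probability a Gaussian grade model puts on negative scores,
# and how to simulate 15 grades from it.  mu and sigma are illustrative values.
from scipy.stats import norm

mu, sigma = 67.5, 18.0
print(norm.cdf(0, loc=mu, scale=sigma))    # P(score < 0): tiny, so the model is harmless in practice
print(norm.rvs(loc=mu, scale=sigma, size=15, random_state=0).round(1))   # 15 simulated grades
```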
72:12 All right, so mu is the expectation, right? 72:15 So a good estimator is definitely the average score, 72:22 just like we had the average of the Ri's. 72:24 Now the xi's no longer need to be 0's and 1's, so it's not 72:28 going to boil down to the number of 1's divided 72:31 by the total number. 72:32 Now, if I'm looking for an estimate, 72:41 well, I need to actually sum those numbers 72:43 and divide them by 15. 72:45 So my estimate is going to be 1/15. 72:47 Then I'm going to start summing those numbers-- 72:49 65 plus 41, and so on, up to 72. 72:51 72:54 OK, and I can do it, and it's 67.5. 73:06 This is my estimate. 73:08 Now, if I want to compute a standard deviation-- 73:13 so let's say an estimate for sigma. 73:18 73:21 You've seen that before, right? 73:23 An estimate for sigma is what? 73:24 An estimate for sigma-- we'll see methods to do this, 73:27 but sigma squared is the variance, 73:31 that is, the expectation of (x minus the expectation of x) squared. 73:35 73:38 And the problem is that I don't know 73:40 what those expectations are. 73:42 And so I'm going to do what 99.9% of statistics is about. 73:47 And what is statistics about? 73:49 What's my motto? 73:51 Statistics is about replacing expectations with averages. 73:54 That's what all of statistics is about. 73:57 There are 300 pages in a purple book called All of Statistics 74:00 that tell you this. 74:01 All right, and then you do something fancy. 74:03 Maybe you minimize something after you 74:05 replace the expectation. 74:07 Maybe you need to plug in other stuff. 74:08 But really, every time you see an expectation, 74:10 you replace it by an average. 74:12 OK, let's do this. 74:13 So sigma squared hat will be what? 74:16 It's going to be 1 over n, sum from i equals 1 to n, 74:20 of xi minus-- 74:22 well, here I need to replace my expectation by an average, 74:25 which is really this average, 74:27 which I'm going to call mu hat-- squared. 74:31 There, I have replaced my expectation with an average. 74:34 OK, so the golden rule is, take your expectation 74:38 and replace it with this. 74:39 74:45 Frame it, get a tattoo, I don't care, but that's what it is. 74:49 If you remember one thing from this class, that's what it is. 74:53 Now, you can be fancy-- if you look at your calculator, 74:56 it's going to put an n minus 1 here, because it 74:59 wants to be unbiased. 75:00 Those are things we are going to come to. 75:02 But let's say right now we stick to this. 75:04 And then, when I plug in my numbers, 75:06 I'm going to get an estimate for sigma, 75:14 which is the square root of the estimator 75:17 once I plug in the numbers. 75:18 And you can check that the number you get will be 18. 75:21 75:27 So those are basic things, and if you've taken any AP stats, 75:31 this should be completely standard for you. 75:32 75:35 Now, I have another list, and we don't have time to look at it. 75:39 75:42 It doesn't really matter. 75:43 75:49 OK, we'll do that next time. 75:50 This is fine. 75:51 We'll see another list of numbers and-- 75:55 we're going to think about modeling assumptions. 75:57 The goal of that exercise is not to compute those things, 75:59 it's really to think about modeling assumptions. 76:01 Is it reasonable to think that things are IID? 76:04 Is it reasonable to think that they 76:05 all have the same parameters, that they're independent, 76:07 et cetera? 76:08 OK, so one thing that I wanted to add: probably by tonight-- 76:16 so I will try to-- 76:18 in the spirit of-- 76:20 I don't know what's starting to happen.
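Both estimates follow the "replace expectations with averages" rule and can be sketched directly from the grades as transcribed; since the audio may have garbled a value or two, the printed numbers need not match the quoted 67.5 and 18 exactly:

```python
# Sketch: sample mean and sample standard deviation for the transcribed grades.
import math

grades = [65, 41, 70, 90, 58, 82, 76, 78, 59, 59, 84, 89, 134, 51, 72]
n = len(grades)

mu_hat = sum(grades) / n                                  # average replaces E[X]
var_hat = sum((x - mu_hat) ** 2 for x in grades) / n      # average replaces E[(X - E[X])^2]
var_unbiased = var_hat * n / (n - 1)                      # the "n - 1" version your calculator uses

print(round(mu_hat, 1), round(math.sqrt(var_hat), 1), round(math.sqrt(var_unbiased), 1))
```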
76:22 In the spirit of using my iPad and fancy things, 76:26 I will try to post some videos-- in particular, 76:29 who here has never used a statistical table to read, say, 76:33 the quantiles of a Gaussian distribution? 76:37 OK, so there are several of you. 76:39 This is a simple but boring exercise. 76:42 I will just post a video on how to do this, 76:44 and you will be able to find it on Stellar. 76:46 It's going to take five minutes, and then you 76:48 will know everything there is to know about those things-- 76:50 and that's something you need for the first problem set. 76:53 By the way, the problem set has 76:54 30 exercises in probability. 76:57 You need to do 15. 76:59 And you only need to turn in 15. 77:01 You can turn in all 30 if you want. 77:03 But by the time we hit those things-- 77:07 well actually, 77:08 by next week-- you need to know what's in there. 77:11 So if you don't have time to do all the homework 77:13 and then go back to your probability class 77:15 to figure out how to do it, just do 15 easy ones that you can do, 77:19 and turn those in. 77:20 But go back to your probability class 77:21 and make sure that you know how to do all of them. 77:23 Those are pretty basic questions, 77:25 and those are things that I'm not going to slow down on. 77:28 So you need to remember that the expectation of the product 77:30 of independent random variables is 77:32 the product of the expectations. 77:34 The expectation of the sum is the sum of the expectations. 77:36 This kind of thing, which is a little silly, 77:38 but it just requires practice. 77:40 So, just have fun. 77:42 Those are simple exercises. 77:43 You will have fun remembering your probability class. 77:46 All right, so I'll see you on Tuesday-- 77:49 or Monday. 77:51
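For readers who would rather not flip through a printed table, the Gaussian quantiles mentioned above can also be read off in a line of Python; this is a convenience sketch, not a substitute for learning to use the table as the lecture suggests:

```python
# Sketch: Gaussian quantiles and tail probabilities without a printed table.
from scipy.stats import norm

print(norm.ppf(0.975))   # ~1.96, the standard two-sided 95% quantile
print(norm.ppf(0.95))    # ~1.645, the one-sided 95% quantile
print(norm.cdf(1.96))    # ~0.975, the reverse lookup
```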