Transcript https://www.youtube.com/watch?v=VPZD_aij8H0&list=PLUl4u3cNGP60uVBMaoNERc6knT_MgPKS0&index=1 00:00 00:01 The following content is provided under a Creative 00:03 Commons license. 00:05 Your support will help MIT OpenCourseWare 00:07 continue to offer high-quality educational resources for free. 00:11 To make a donation or to view additional materials 00:14 from hundreds of MIT courses, visit MIT OpenCourseWare 00:18 at ocw.mit.edu. 00:19 00:23 PHILIPPE RIGOLLET: OK, so the course you're currently sitting 00:25 in is 18.650. 00:27 And it's called Fundamentals of Statistics. 00:29 And until last spring, it was still called Statistics 00:33 for Applications. 00:34 It turned out that really, based on the content, "Fundamentals 00:37 of Statistics" was a more appropriate title. 00:42 I'll tell you a little bit about what 00:43 we're going to be covering in class, what this class is 00:46 about, what it's not about. 00:48 I realize there are several offerings 00:50 in statistics on campus. 00:52 So I want to make sure that you've chosen the right one. 00:56 And I also understand that for some of you, 00:58 it's a matter of scheduling. 01:01 I need to actually throw out a disclaimer. 01:03 I tend to speak too fast. 01:05 I'm aware of that. 01:07 Someone in the back, just do like that when you 01:10 have no idea what I'm saying. 01:12 Hopefully, I will repeat myself many times. 01:14 So if you average over time, you'll 01:15 see that statistics will tell you 01:17 that you will get the right message that I was actually 01:19 trying to send. 01:22 All right, so what are the goals of this class? 01:26 The first one is basically to give you an introduction. 01:28 No one here is expected to have seen statistics before, 01:31 but as you will see, you are expected 01:33 to have seen probability. 01:34 And usually, you do see some statistics 01:36 in a probability course. 01:38 So I'm sure some of you have some ideas, 01:39 but I won't expect anything. 01:42 And we'll be using mathematics. 01:44 It's a math class, so there's going to be a bunch of equations-- 01:48 not so much real data and statistical thinking. 01:52 We're going to try to provide theoretical guarantees. 01:54 If I have two estimators that are available to me, 01:58 how does theory guide me to choose the better of them? 02:01 How certain can I be of my guarantees or predictions? 02:06 It's one thing to just spit out a number. 02:08 It's another thing to put some error bars around it. 02:10 And we'll see how to build error bars, for example. 02:14 You will have your own applications. 02:16 I'm happy to answer questions about specific applications. 02:19 But rather than trying to tailor applications 02:21 to an entire institute, I think we're 02:24 going to work with pretty standard applications, 02:28 mostly not very serious ones. 02:32 And hopefully, you'll be able to take the main principles back 02:36 with you and apply them to your particular problem. 02:39 What I'm hoping that you will get out of this class is that 02:43 when you have a real-life situation-- and by "real life", 02:46 I mean mostly at MIT, so some people probably would not call 02:48 that real life-- 02:50 your goal is to formulate a statistical problem 02:52 in mathematical terms. 02:53 If I want to say, is a drug effective, 02:56 that's not in mathematical terms. 02:58 I have to find out which measure I want 03:00 to use to call it effective. 03:03 Maybe it's over a certain period of time. 03:06 So there's a lot of things that you actually need.
03:08 And I'm not really going to tell you 03:10 how to go from the application to the point you need to be. 03:13 But I will certainly describe to you 03:14 what point you need to be at if you want to start applying 03:19 statistical methodology. 03:21 Then once you understand what kind of question 03:23 you want to answer-- 03:24 do I want a yes/no answer, do I want a number, 03:26 do I want error bars, do I want to make predictions 03:29 five years into future, do I have side information, 03:32 or do I not have side information, all those things-- 03:34 based on that, hopefully, you will 03:36 have a catalog of statistical methods 03:38 that you're going to be able to use and apply it in the wild. 03:44 And also, no statistical method is perfect. 03:49 Some of the math people have agreed upon over the years, 03:52 and people understand that this is the standard. 03:54 But I want you to be able to understand 03:57 what the limitations are, and when you make conclusions 03:59 based on data, that those conclusions might be erroneous, 04:03 for example. 04:03 All right, more practically, my goal here is to have you ready. 04:09 So who has taken, for example, a machine-learning class here? 04:12 All right, so many of you, actually-- maybe a third 04:15 have taken a machine-learning class. 04:19 So statistics has somewhat evolved into machine 04:21 learning in recent years. 04:22 And my goal is to take you there. 04:24 So machine learning has a strong algorithmic component. 04:26 So maybe some of you have taken a machine-learning class 04:29 that displays mostly the algorithmic component. 04:31 But there's also a statistical component. 04:33 The machine learns from data. 04:37 So this is a statistical track. 04:39 And there are some statistical machine-learning classes 04:43 that you can take here. 04:44 They're offered at the graduate level, I believe. 04:47 But I want you to be ready to be able to take those classes, 04:51 having the statistical fundamentals to understand 04:53 what you're doing. 04:54 And then you're going to be able to expand to broader and more 04:58 sophisticated methods. 05:00 Lectures are here from 11:00 to 12:30 on Tuesday and Thursday. 05:05 Victor-Emmanuel will also be-- 05:07 and you can call him Victor-- 05:09 will also be holding mandatory recitation. 05:12 So please go on Stellar and pick your recitation. 05:15 It's either 3:00 to 4:00 or 4:00 to 5:00 on Wednesdays. 05:19 And it's going to be mostly focused on problem-solving. 05:22 They're mandatory in the sense that we're allowed to do this, 05:28 but they're not going to cover entirely new material. 05:32 But they might cover some techniques 05:35 that might save you some time when it comes to the exam. 05:39 So you might get by. 05:41 Attendance is not going to be taken or anything like this. 05:44 But I highly recommend that you go, 05:47 because, well, they're mandatory. 05:49 So you cannot really complain that something was taught only 05:52 in recitation. 05:54 So please register on Stellar for which 05:56 of the two recitations you would like to be in. 05:59 They're capped at 40, so first come, first served. 06:03 Homework will be due weekly. 06:05 There's a total of 11 problem sets. 06:08 I realize this is a lot. 06:09 Hopefully, we'll keep them light. 06:11 I just want you to not rush too much. 06:15 The 10 best will be kept, and this 06:17 will count for a total of 30% of the final grade. 06:20 There are due Mondays at 8:00 PM on Stellar. 06:25 And this is a new thing. 
06:28 We're not going to use the boxes outside of the math department. 06:31 We're going to use only PDF files. 06:34 Well, you're always welcome to type them and practice 06:37 your LaTeX or Word typing. 06:40 I also understand that this can be a bit of a strain, 06:42 so just write them down on a piece of paper, 06:45 use your iPhone, and take a picture of it. 06:48 Dropbox has a nice, new-- 06:50 so try to find something that puts a lot of contrast, 06:53 especially if you use pencil, because we're going 06:55 to check if they're readable. 06:56 And it is your responsibility to have a readable file. 07:01 I've had over the years-- 07:02 not at MIT, I must admit-- but I've 07:03 had students who actually write the doc file 07:06 and think that converting it to a PDF 07:08 consists of erasing the extension .doc 07:11 and replacing it by .pdf. 07:12 This is not how it works. 07:13 07:17 So I'm sure you will figure it out. 07:19 Please try to keep them letter-sized. 07:21 This is not a strict requirement, 07:23 but I don't want to see thumbnails, either. 07:26 You are allowed to have two late homeworks. 07:28 And by late, I mean 24 hours late. 07:31 No questions asked. 07:32 You submit them, this will be counted. 07:34 You don't have to send an email to warn us 07:36 or anything like this. 07:39 Beyond that, given that you have slack 07:42 for one 0 grade and slack for two late homeworks, 07:46 you're going to have to come up with a very good explanation 07:49 for why you actually need more extensions than that, if you 07:52 ever do. 07:53 And in particular, you're going to have 07:54 to keep track of why you've used your three options before. 07:58 There's going to be two midterms. 08:00 One is October 3, and one is November 7. 08:03 They're both going to be in class for the duration 08:05 of the lecture. 08:06 When I say they last for an hour and 20 minutes, 08:09 it does not mean that if you arrive 10 minutes 08:11 before the end of lecture, you still 08:12 get an hour and 20 minutes. 08:14 It will end at the end of lecture time. 08:17 For this as well, no pressure. 08:19 Only the best of the two will be kept. 08:22 And this grade will count for 30% of the grade. 08:26 This will be closed-book and closed-notes. 08:29 The purpose is for you to-- yes? 08:30 AUDIENCE: How many midterms did you say there are? 08:33 PHILIPPE RIGOLLET: Two. 08:34 AUDIENCE: You said the best of the two will be kept? 08:36 PHILIPPE RIGOLLET: I said the best of the two 08:37 will be kept, yes. 08:39 AUDIENCE: So both the midterms will be kept? 08:42 PHILIPPE RIGOLLET: The best of the two, not the best two. 08:45 AUDIENCE: Oh. 08:45 08:50 PHILIPPE RIGOLLET: We will add them, multiply the number by 9, 08:53 and that will be the grade. 08:54 No. 08:55 I am trying to be nice, there's just a limit to what I can do. 08:59 All right, so the goal is for you to learn things 09:02 and to be familiar with them. 09:04 In the final, you will be allowed 09:05 to have your notes with you. 09:07 But the midterms are also a way for you 09:09 to develop some mechanisms so that you don't actually waste 09:11 too much time on things that you should be able to do 09:14 without thinking too much. 09:16 You will be allowed a cheat sheet, 09:17 because, well, you can always forget something. 09:20 And it will be a two-sided, letter-sized sheet, 09:23 and you can practice writing as small as you want. 09:27 And you can put whatever you want on this cheat sheet. 09:30 All right, the final will be decided by the registrar.
09:33 It's going to be three hours, and it's 09:34 going to count for 40%. 09:35 You cannot bring books, but you can bring your notes. 09:38 Yes. 09:38 AUDIENCE: I noticed that the midterm dates 09:40 aren't the same as the dates in the syllabus. 09:41 So I wanted to make sure you know. 09:43 PHILIPPE RIGOLLET: They are not? 09:44 AUDIENCE: Yeah-- 09:45 PHILIPPE RIGOLLET: Oh, yeah, there's 09:45 a "1" that's missing on both of them, isn't there? 09:47 09:59 Yeah, let's figure that out. 10:03 The syllabus is the true one. 10:05 The slides are so that we can discuss, 10:07 but the dates that are on the syllabus 10:08 are the ones that count. 10:09 And I think they're also posted on the calendar on Stellar 10:13 as well. 10:14 Any other question? 10:15 10:20 OK, so the pre-reqs here-- 10:23 and who has looked at the first problem set already? 10:28 OK, so those hands that are raised 10:30 realize that there is a true prerequisite of probability 10:34 for this class. 10:36 It can be at the level of 18.600 or 6.041. 10:40 I should say 6.041A and B now. 10:42 It's two classes. 10:44 I will require you to know some calculus 10:48 and have some notions of linear algebra, 10:51 such as, what is a matrix, what is a vector, how 10:53 do you multiply those things together, 10:55 some notion of what orthonormal vectors are. 10:58 11:01 We'll talk about eigenvectors and eigenvalues, 11:03 but I'll remind you of all of that. 11:05 So this is not a strict pre-req. 11:07 But if you've taken it, for example, 11:09 it doesn't hurt to go back to your notes 11:12 when we get closer to this chapter 11:13 on principal-component analysis. 11:15 The chapters, as they're listed in the syllabus, are in order, 11:19 so you will see when it actually comes. 11:22 There's no required textbook. 11:24 And I know you tend to not like that. 11:29 You like to have your textbook to know where you're going 11:31 and what we're doing. 11:32 I'm sorry, it's just this class. 11:34 Either I would have to go to a mathematical statistics 11:36 textbook, which is just too much, 11:38 or to go to a more engineering-type statistics 11:43 textbook, which is just too little. 11:45 So hopefully, the problem sets will be enough 11:49 for you to practice. 11:50 The recitations will have some problems to solve as well. 11:52 And the material will be posted on the slides. 11:55 So you should have everything you need. 11:57 There's plenty of resources online 11:58 if you want to expand on a particular topic 12:00 or read it as said by somebody else. 12:03 The book that I recommend in the syllabus 12:07 is this book called All of Statistics by Wasserman. 12:11 Mainly because of the title, I'm guessing 12:13 it has all of it in it. 12:16 It's pretty broad. 12:18 There's actually not that much math in it. 12:20 It's more of an intro-grad level. 12:22 It's not very deep, but you see a lot of the overview. 12:27 Certainly, what we're going to cover 12:29 will be a subset of what's in there. 12:30 12:34 The slides will be posted on Stellar 12:35 before lectures, before we start a new chapter, 12:38 and after we're done with the chapter, with the annotations, 12:41 and also with the typos corrected, like for the exam. 12:46 There will be some video lectures. 12:48 Again, the first one will be posted on OCW from last year. 12:52 But all of them will be available on Stellar-- 12:54 of course, modulo technical problems. 12:57 But this is an automated system. 13:00 And hopefully, it will work out well for us.
13:02 So if you somehow have to miss a lecture, 13:04 you can always catch it up by watching it. 13:08 You can also play at that speed 0.75 13:10 in case I end up speaking too fast, 13:12 but I think I've managed myself so far-- 13:15 so just last warning. 13:19 All right, why should you study statistics? 13:22 Well, if you read the news, you will see a lot of statistics. 13:27 I mentioned machine learning. 13:28 It's built on a lot of statistics. 13:32 If I were to teach this class 10 years ago, 13:34 I would have to explain to you that data collection and making 13:37 decisions based on data was something that made sense. 13:40 But now, it's almost in our life. 13:43 We're used to this idea that data helps in making decisions. 13:47 And people use data to conduct studies. 13:51 So here, I found a bunch of press titles that-- 13:55 I think the key word I was looking for was "study finds"-- 13:57 if I want to do this. 13:58 So I actually did not bother doing it again this year. 14:01 This is all 2016, 2016, 2016. 14:04 But the key word that I look for is usually "study find"-- 14:07 so a new study find-- 14:08 traffic is bad for your health. 14:10 So we had to wait for 2016 for data to tell us that. 14:13 And there's a bunch of other slightly more interesting ones. 14:18 For example, one that you might find interesting 14:20 is that this study finds that students benefit from waiting 14:24 to declare a major. 14:26 Now, there's a bunch of press titles. 14:28 There one in the MIT News that finds brain connections, 14:33 key to reading. 14:34 And so here, we have an idea of what happened there. 14:37 Some data was collected. 14:39 Some scientific hypothesis was formulated. 14:42 And then the data was here to try to prove or disprove 14:47 this scientific hypothesis. 14:49 That's the usual scientific process. 14:51 And we need to understand how the scientific process goes, 14:55 because some of those things might be actually questionable. 14:58 Who is 100% sure that study finds that students-- 15:02 do you think that you benefit from waiting 15:04 to declare a major? 15:05 Right I would be skeptical about this. 15:09 I would be like, I don't want to wait to declare a major. 15:13 So what kind of thing can we bring? 15:15 Well maybe this study studied people 15:17 that were different from me. 15:18 Or maybe the study finds that this 15:21 is beneficial for a majority of people. 15:22 I'm not a majority. 15:23 I'm just one person. 15:24 There's a bunch of things that we 15:26 need to understand what those things actually mean. 15:28 And we'll see that those are actually not 15:30 statements about individuals. 15:31 They're not even statements about the cohort of people 15:33 they've actually looked at. 15:35 They're statements about a parameter 15:37 of a distribution that was used to model 15:40 the benefit of waiting. 15:43 So there's a lot of questions. 15:45 And there are a lot of layers that come into this. 15:46 And we're going to want to understand what was going on 15:49 in there and try to peel it off and understand what assumptions 15:52 have been put in there. 15:53 Even though it looks like a totally legit study, out 15:59 of those studies, statistically, I 16:01 think there's going to be one that's going to be wrong. 16:04 Well, maybe not one. 16:05 But if I put a long list of those, 16:07 there would be a few that would actually be wrong. 16:10 If I put 20, there would definitely be one that's wrong. 16:12 So you have to see that. 
16:13 Every time you see 20 studies, one is probably wrong. 16:16 When there are studies about drug effects, 16:19 out of a list of 100, one would be wrong. 16:21 So we'll see what that means and what I mean by that. 16:23 16:26 Of course, not only studies that make discoveries 16:30 are actually making the press titles. 16:32 There's also the press that talks about things 16:36 that make no sense. 16:38 I love this first experiment-- the salmon experiment. 16:41 Actually, it was a grad student who 16:44 came to a neuroscience poster session, 16:47 pulled out this poster, and explained 16:50 the scientific experiment that he was conducting, 16:53 which consisted in taking a previously frozen and thawed 16:59 salmon, putting it in an MRI, showing it 17:02 pictures of violent images, and recording its brain activity. 17:07 And he was able to discover a few voxels that were activated 17:11 by those violent images. 17:14 And can somebody tell me what happened here? 17:16 17:19 Was the salmon responding to the violent activity? 17:23 Basically, this is just a statistical fluke. 17:26 That's just randomness at play. 17:27 There's so many voxels that are recorded, 17:29 and there's so many fluctuations. 17:31 There's always a little bit of noise 17:32 when you're in those things, that some of them, 17:34 just by chance, got lit up. 17:36 And so we need to understand how to correct for that. 17:38 In this particular instance, we need 17:40 to have tools that tell us that, well, finding three voxels that 17:43 are activated for that many voxels 17:47 that you can find in the salmon's brain 17:50 is just too small of a number. 17:53 Maybe we need to find a clump of 20 of them, for example. 17:56 All right, so we're going to have 17:57 mathematical tools that help us find those particular numbers. 18:02 I don't know if you ever saw this one by John Oliver 18:07 about phacking. 18:11 Or actually, it said p-hacking. 18:14 Basically, what John Oliver is saying 18:17 is actually a full-length-- like there's long segments on this. 18:20 And he was explaining how there's a sociology question 18:24 here about how there's a huge incentive for scientists 18:28 to publish results. 18:30 You're not going to say, you know what? 18:31 This year, I found nothing. 18:33 And so people are trying to find things. 18:35 And just by searching, it's as if they 18:36 were searching for all the voxels in a brain 18:39 until they find one that was just lit up by chance. 18:41 And so they just run all these studies. 18:43 And at some point, one will be right just out of chance. 18:46 And so we have to be very careful about doing this. 18:49 There's much more complicated problems associated 18:52 to what's called p-hacking, which 18:53 consists of violating the basic assumptions, in particular, 18:57 looking at the data, and then formulating 19:00 your scientific assumption based on data, 19:02 and then going back to it. 19:03 Your idea doesn't work. 19:04 Let's just formulate another one. 19:05 And if you are doing this, all bets are off. 19:07 19:10 The theory that we're going to develop 19:11 is actually for a very clean use of data, which 19:14 might be a little unpleasant. 19:16 If you've had an army of graduate students collecting 19:19 genomic data for a year, for example, 19:21 maybe you don't want to say, well, 19:23 I had one hypothesis that didn't work. 19:25 Let's throw all the data into the trash. 19:27 And so we need to find ways to be able to do this. 19:30 And there's actually a course been taught at BU. 
19:34 It's still in its early stages, but something 19:37 called "adaptive data analysis" that will allow 19:39 you to do these kind of things. 19:42 Questions? 19:43 19:46 OK, so of course, statistics is not 19:49 just for you to be able to read the press. 19:52 Statistics will probably be used in whatever career 19:56 path you choose for yourself. 19:58 It started in the 10th century in Netherlands for hydrology. 20:03 Netherlands is basically under water, under sea level. 20:06 And so they wanted to build some dikes. 20:08 But once you're going to build a dike, 20:09 you want to make sure that it's going to sustain 20:11 some tides and some floods. 20:13 And so in particular, they wanted 20:15 to build dikes that were high enough, but not too high. 20:19 You could always say, well, I'm going 20:21 to build a 500-meter dike, and then I'm going to be safe. 20:25 You want something that's based on data. 20:27 You want to make sure. 20:28 And so in particular, what did they do? 20:30 Well, they collected data for previous floods. 20:33 And then they just found a dike that 20:36 was going to cover all these things. 20:37 Now, if you look at the data they probably had, 20:40 maybe it was scarce. 20:41 Maybe they had 10 data points. 20:43 And so for those data points, then 20:44 maybe they wanted to sort of interpolate 20:47 between those points, maybe extrapolate for the larger one. 20:50 Based on what they've seen, maybe they 20:51 have chances of seeing something which 20:53 is even larger than everything they've seen before. 20:56 And that's exactly the goal of statistical modeling-- 21:00 being able to extrapolate beyond the data that you have, 21:04 guessing what you have not seen yet might happen. 21:08 When you buy insurance for your car, 21:10 or your apartment, or your phone, 21:14 there is a premium that you have to pay. 21:16 And this premium has been determined 21:17 based on how much you are, in expectation, going 21:21 to cost the insurance. 21:22 It says, OK, this person has, day a 10% chance 21:25 of breaking their iPhone. 21:28 An iPhone costs that much to repair, 21:30 so I'm going to charge them that much. 21:31 And then I'm going to add an extra dollar for my time. 21:34 That's basically how those things are determined. 21:36 And so this is using statistics. 21:39 This is basically where statistics is probably 21:42 mostly used. 21:43 I was personally trained as an actuary. 21:45 And that's me being a statistician at an insurance 21:48 company. 21:50 Clinical trials-- this is also one of the earliest success 21:55 stories of statistics. 21:58 It's actually now widespread. 21:59 Every time a new drug is approved for market by the FDA, 22:04 it requires a very strict regimen of testing with data, 22:08 and control group, and treatment group, 22:10 and how many people you need in there, 22:11 and what kind of significance you need for those things. 22:17 In particular, those things look like this, 22:19 so now it's 5,000 patients. 22:21 It depends on what kind of drug it is, 22:23 but for, say, 100 patients, 56 were cured, 22:26 and 44 showed no improvement. 22:29 Does the FDA consider that this is a good number? 22:31 Do they have a table for how many patients were cured? 22:37 Is there a placebo effect? 22:39 Do I need a control group of people that 22:41 are actually getting a placebo? 22:42 It's not clear, all these things. 22:44 And so there's a lot of things to put into place. 22:46 And there's a lot of floating parameters. 
22:48 So hopefully, we're going to be able to use 22:50 statistical modeling to shrink it down 22:52 to a small number of parameters to be able to ask 22:55 very simple questions. 22:56 "Is a drug effective?" is not a mathematical equation. 22:59 But "Is p larger than 0.5?" 23:02 is a mathematical question. And that's 23:04 essentially what we're going to be doing. 23:05 We're going to take this "is a drug effective?" and reduce it to 23:08 "is a parameter larger than 0.5?" 23:12 Now, of course, genetics is using that. 23:15 That's typically actually the same size of data 23:19 that you would see for fMRI data. 23:21 So this is actually a study that I found. 23:25 You have about 4,000 cases of Alzheimer's and 8,000 controls. 23:29 So people without Alzheimer's-- that's what's called a control. 23:32 That's something just to make sure 23:34 that you can see the difference with people 23:38 that are not affected by either a drug or a disease. 23:42 Is the gene APOE associated with Alzheimer's disease? 23:46 Everybody can see why this would be an important question. 23:49 We now have CRISPR. 23:50 It's targeted to very specific genes. 23:52 If we could edit it, or knock it down, or knock it 23:55 up, or boost it, maybe we could actually 23:57 have an impact on that. 23:58 So those are very important questions, 24:00 because we have the technology to target those things. 24:02 But we need the answers about what those things are. 24:04 And there's a bunch of other questions. 24:07 The minute you're going to talk to biologists and say, 24:09 I can do that, 24:10 they're going to say, OK, are there 24:11 any other genes within the genes, 24:12 or any particular SNPs that I can actually look at? 24:15 And they're looking at very different questions. 24:17 And when you start asking all these questions, 24:19 you have to be careful, because you're reusing your data again. 24:22 And it might lead you to wrong conclusions. 24:26 And those are all over the place, those things. 24:29 And that's why they go all the way to John Oliver talking 24:32 about them. 24:33 Any questions about those examples? 24:35 24:37 So this is really a motivation. 24:38 Again, we're not going to just take 24:40 this data set of those cases and look at them in detail. 24:46 So what is common to all these examples? 24:49 Like, why do we have to use statistics 24:50 for all those things? 24:53 Well, there's the randomness of the data. 24:55 There's some effect that we just don't understand-- 24:59 for example, the randomness associated with the lighting up 25:02 of some voxels. 25:04 Or the fact that, as far as the insurance 25:06 is concerned, whether you're going to break your iPhone 25:09 or not is essentially a coin toss. 25:10 Sure, it's biased. 25:11 But it's a coin toss. 25:14 From the perspective of the statistician, 25:16 those things are actually random events. 25:18 And we need to tame this randomness, 25:20 to understand this randomness. 25:21 Is this going to be a lot of randomness? 25:23 Or is it going to be a little randomness? 25:25 Is it going to be something that's-- 25:26 25:29 25:32 let's see, for example, for the floods. 25:35 Were the floods that I saw consistently almost 25:38 the same size? 25:40 Was it almost a rounding error, or were they just 25:43 really widespread? 25:44 All these things, we need to understand 25:45 so we can understand how to build those dikes 25:48 or how to make decisions based on those data. 25:54 And we need to understand this randomness.
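To make the clinical-trial numbers quoted above concrete (100 patients, 56 cured), here is a minimal sketch, not from the lecture, of the kind of calculation that turns "is the drug effective?" into "is p larger than 0.5?". It assumes the number of cured patients is modeled as Binomial(100, p) and that Python with scipy is available.

```python
# A minimal sketch (not from the lecture): is 56 cured out of 100
# convincing evidence that p > 0.5, under a Binomial(n=100, p) model?
from scipy.stats import binomtest

n, cured = 100, 56  # the numbers quoted in the clinical-trial example

# One-sided test of H0: p = 0.5 against H1: p > 0.5.
result = binomtest(cured, n, p=0.5, alternative="greater")
print(f"p_hat = {cured / n:.2f}")
print(f"one-sided p-value = {result.pvalue:.3f}")
# The p-value comes out around 0.14, so 56/100 on its own is not
# significant at the usual 5% level -- exactly the "or was it just
# by chance?" question the lecture raises.
```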
25:58 OK, so the associated questions to randomness 26:01 were actually hidden in the text. 26:03 So we talked about the notion of average. 26:05 Right, so as far as the insurance is concerned, 26:08 they want to know in average with the probability is. 26:10 Like, what is your chance of actually breaking your iPhone? 26:13 And that's what came in this notion of fair premium. 26:18 There's this notion of quantifying chance. 26:21 We don't want to talk maybe only about average, 26:23 maybe you want to cover say 99% percent of the floods. 26:26 So we need to know what is the height of a flood that's 26:31 higher than 99% of the floods. 26:34 But maybe there's 1% of them, you know. 26:36 When doomsday comes, doomsday comes. 26:38 Right, we're not going to pay for it. 26:40 All right, so that's most of the floods. 26:43 And then there's questions of significance, right? 26:45 So you know I give this example, a second ago 26:47 about clinical trials. 26:50 I give you some numbers. 26:51 Clearly the drug cured more people than it did not. 26:55 But does it mean that it's significantly good, 26:58 or was this just by chance. 26:59 Maybe it's just that these people just recovered. 27:01 It's like you know curing a common cold. 27:04 And you feel like, oh I got cured. 27:06 But it's really you waited five days and then you got cured. 27:09 All right, so there's this notion of significance, 27:11 of variability. 27:12 All these things are actually notions 27:15 that describe randomness and quantify randomness 27:18 into simple things. 27:19 Randomness is a very complicated beast. 27:21 But we can summarize it into things that we understand. 27:24 Just like I am a complicated object. 27:27 I'm made of molecules, and made of genes, 27:29 and made of very complicated things. 27:31 But I can be summarized as my name, my email address, 27:34 my height and my weight, and maybe for most of you, 27:37 this is basically enough. 27:39 You will recognize me without having 27:41 to do a biopsy on me every time you see me. 27:45 All right, so, to understand randomness 27:49 you have to go through probability. 27:51 Probability is the study of randomness. 27:53 That's what it is. 27:54 That's what the first sentence that a lecturer in probability 27:57 will say. 27:58 And so that's why I need the pre-requisite, because this 28:02 is what we're going to use to describe the randomness. 28:04 We'll see in a second how it interacts with statistics. 28:07 So sometimes, and actually probably most of the time 28:10 throughout your semester on probability, 28:13 randomness was very well understood. 28:15 When you saw a probability problem, here 28:18 was the chance of this happening, 28:19 here was the chance of that happening. 28:21 Maybe you had more complicated questions 28:23 that you had some basic elements to answer. 28:26 For example, the probability that I have HBO is this much. 28:32 And the probability that I watch Game of Thrones is that much. 28:34 And given that I play basketball what is the probability-- 28:38 you had all these crazy questions, 28:39 but you were able to build them. 28:42 But all the basic numbers were given to you. 28:45 Statistics will be about finding those basic numbers. 28:48 All right so some examples that you've probably seen 28:51 were dice, cards, roulette, flipping coins. 28:55 All of these things are things that you've 28:57 seen in a probability class. 28:59 And the reason is because it's very easy 29:00 to describe the probability of each outcome. 
29:02 For a die, we know that each face is going 29:05 to come up with probability 1/6. 29:07 Now I'm not going to go into a debate of whether this 29:09 is pure randomness or this is determinism. 29:12 I think as a model for actual randomness 29:14 a die is a pretty good model, flipping a coin 29:18 is a pretty good model. 29:20 So those are actually good things. 29:22 So the questions that you would see, for example, 29:24 in probability are the following. 29:26 I roll one die. 29:27 Alice gets $1 if the number of dots is at most three. 29:31 Bob gets $2 if the number of dots is at most two. 29:35 Do you want to be Alice or Bob, given that your goal is 29:37 actually to make money? 29:40 Yeah, you want to be Bob, right? 29:43 So let's see why. 29:45 So let's look at the expectation of what 29:47 Alice makes. 29:48 So let's call it A. 29:51 This is $1, with probability 1/2. 29:56 So 3/6, that's 1/2. 29:59 And the expectation of what Bob makes, 30:02 this is $2 with probability 2/6, and that's 2/3. 30:11 Which is definitely larger than 1/2. 30:13 So Bob's expectation is actually a bit higher. 30:17 So those are the kind of questions that you 30:18 may ask with probability. 30:19 I described it to you exactly-- you used the fact 30:21 that the die would show at most three dots 30:25 with probability one half. 30:26 We knew that. 30:27 And I didn't have to describe to you what was going on there. 30:29 You didn't have to collect data about a die. 30:32 Same thing, you roll two dice. 30:34 You choose a number between 2 and 12 30:36 and you win $100 if you chose the sum of the two dice. 30:42 Which number do you pick? 30:45 What? 30:46 AUDIENCE: 7. 30:47 PHILIPPE RIGOLLET: 7. 30:48 Why 7? 30:48 AUDIENCE: It's the most likely. 30:50 PHILIPPE RIGOLLET: That's the most likely one, right? 30:52 So your expected gain here will be $100 times the probability 30:56 that the sum of the two dice, let's say X plus Y, 31:00 is equal to your little z, where little z is 31:03 the number you pick. 31:04 31:07 So 7 is the most likely to happen, 31:10 and that's the one that maximizes this function of z. 31:14 And for this you need to study a more complicated function. 31:17 But it's a function that involves two dice. 31:18 But you can compute the probability that X plus Y 31:21 is equal to z, for every z between 2 and 12. 31:26 So you know exactly what the probabilities are, 31:29 and that's how you start probability. 31:30 31:35 So here, that's exactly what I said. 31:38 You have a very simple process that describes basic events. 31:43 Probability 1/6 for each of them. 31:45 And then you can build up on that, 31:46 and understand probabilities of more complicated events. 31:49 You can throw some money in there. 31:50 You can build functions. 31:52 You can do very complicated things building on that. 31:56 Now if I was a statistician, a statistician 31:59 would be the guy who just arrived on Earth, 32:01 had never seen a die, and needs to understand 32:03 that a die comes up with probability 1/6 on each side. 32:05 And the way he would do it is just to roll the die 32:08 until he gets some counts and tries to estimate those. 32:12 And maybe that guy would come and say, 32:14 well, you know, actually, the probability 32:16 that I get a 1 is 1/6 plus 0.001 and the probability 32:23 that I get a 2 is 1/6 minus 0.005. 32:27 And there would be some fluctuations around this. 32:29 And it's going to be his role as a statistician 32:31 to say, listen, this is too complicated 32:33 of a model for this thing.
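Here is a minimal sketch, not from the lecture, of the two dice questions above plus the "statistician who has never seen a die", written in Python. The payoff rule is taken as "at most three" and "at most two" dots, matching the 3/6 and 2/6 used in the expectations; the simulation at the end is an illustration, not the lecture's data.

```python
# A minimal sketch (not from the lecture) of the dice examples above,
# assuming a fair six-sided die.
import random
from collections import Counter
from fractions import Fraction

faces = range(1, 7)

# Alice gets $1 if the roll shows at most 3 dots; Bob gets $2 if it
# shows at most 2 dots.
E_alice = Fraction(1) * Fraction(3, 6)   # 1 * P(roll <= 3) = 1/2
E_bob = Fraction(2) * Fraction(2, 6)     # 2 * P(roll <= 2) = 2/3
print(E_alice, E_bob)                    # 1/2 vs 2/3: Bob does better

# Sum of two dice: P(X + Y = z) for z = 2..12 is maximized at z = 7.
sum_probs = Counter()
for x in faces:
    for y in faces:
        sum_probs[x + y] += Fraction(1, 36)
print(max(sum_probs, key=sum_probs.get))  # 7

# The statistician who has never seen a die: roll it many times and
# estimate each face's probability; the estimates hover around 1/6.
rolls = [random.randint(1, 6) for _ in range(10_000)]
counts = Counter(rolls)
print({face: counts[face] / len(rolls) for face in faces})
```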
32:34 And these should all be the same numbers. 32:36 Just looking at data, they should be all the same numbers. 32:39 And that's part of the modeling. 32:40 You make some simplifying assumptions 32:41 that essentially make your questions more accurate. 32:46 Now, of course, if your model is wrong, 32:48 if it's not true that all the faces arrive 32:50 with the same probability, then you have a model error here. 32:54 So we will be making model errors. 32:56 But that's going to be the price to pay 32:57 to be able to extract anything from our data. 32:59 33:02 So for more complicated processes, 33:07 so of course nobody's going to waste their time rolling dice. 33:09 I mean, I'm sure you might have done 33:11 this in AP stat or something. 33:13 But the need is to estimate parameters from data. 33:18 All right, so for more complicated things 33:19 you might want to estimate some density parameter 33:27 on a particular set of material. 33:29 And for this maybe you need to beam something to it, 33:31 and measure how fast it's coming back. 33:33 And you're going to have some measurement errors. 33:35 And maybe you need to do that several times 33:37 and you have a model for the physical process that's 33:39 actually going on. 33:40 And physics is usually a very good way 33:42 to get models for engineering perspective. 33:46 But there's models for sociology where we 33:49 have no physical system, right. 33:52 God knows how people interact. 33:53 And maybe I'm going to say that the way 33:55 I make friends is by first flipping a coin in my pocket. 33:59 And with probability 2/3, I'm going 34:01 to make my friend at work. 34:02 And with probability 1/3 I'm going 34:04 to make my friend at soccer. 34:05 And once I make my friends at soccer-- 34:07 I decide to make my friend soccer. 34:09 Then I will face someone who's flipping 34:11 the same coin with maybe be slightly different parameters. 34:14 But those things actually exist. 34:16 There's models about how friendships are formed. 34:18 And the one I described is called 34:22 the mixed-membership model. 34:23 So those are models that are sort of hypothesized. 34:25 And they're more reasonable than taking into account 34:29 all the things that made you meet that person 34:31 at that particular time. 34:34 So the goal here-- so based on data now, 34:38 once we have the model is going to be reduced to maybe two, 34:41 three, four parameters, depending 34:42 on how complex the model is. 34:44 And then your goal will be to estimate those parameters. 34:48 So sometimes the randomness we have here is real. 34:51 So there's some true randomness in some surveys. 34:56 If I pick a random student, as long 34:58 as I believe that my random number generator that 35:00 will pick your random ID is actually random, 35:04 there is something random about you. 35:06 The student that I pick at random 35:07 will be a random student. 35:09 The person that I call on the phone is a random person. 35:13 So there's some randomness that I can build into my system 35:16 by drawing something from a random number generator. 35:20 A biased coin is a random thing. 35:22 It's not a very interesting random thing. 35:24 But it is a random thing. 35:26 Again, if I wash out the fact that it actually 35:29 is a deterministic mechanism. 35:30 But at a certain accuracy, a certain granularity, 35:33 this can be thought of as a truly random experiment. 35:36 Measurement error for example, if you by some measurement 35:39 device. 35:39 or some optics device, for example. 
35:42 You will have like standard deviation and things that 35:45 come on the side of the box. 35:46 And it tells you, this will be making some measurement error. 35:48 And it's usually thermal noise maybe, or things like this. 35:51 And those are very accurately described 35:54 by some random phenomenon. 35:56 But sometimes, and I'd say most times, there's no randomness. 36:01 There's no randomness. 36:02 It's not like you breaking your iPhone is a random event. 36:06 This is just something that we sweep-- 36:09 randomness is a big rug under which we sweep 36:11 everything we don't understand. 36:13 And we just hope that in average we've 36:15 captured, the average effect of what's going on. 36:18 And the rest of it might fluctuate to the right, 36:20 might fluctuate to the left. 36:22 But what remains is just sort of randomness 36:26 that can be averaged out. 36:27 So, of course, this is where the leap of faith is. 36:31 We do not know whether we were correct of doing this. 36:33 Maybe we make some huge systematic biases 36:35 by doing this. 36:36 Maybe we forget a very important component. 36:39 Right, for example, if I have-- 36:42 I don't know, let's think of something-- 36:45 36:49 a drug for breast cancer. 36:51 All right, and I throw out the fact 36:52 that my patient is either a man or woman. 36:55 I'm going to have some serious model biases. 36:57 Right. 36:58 So if I say I'm going to collect a random and patient. 37:00 And said I'm going to start doing this. 37:02 There's some information that I really need, clearly, 37:04 to build into my model. 37:06 And so the model should be complicated enough, but not too 37:10 complicated. 37:11 Right so it should take into account things 37:13 there will systematically be important. 37:17 So, in particular, the simple rule of thumb 37:19 is, when you have a complicated process, 37:24 you can think of it as being a simple process 37:26 and some random noise. 37:28 Now, again, the random noise is everything 37:30 you don't understand about the complicated process. 37:33 And the simple process is everything you actually do. 37:37 So good modeling, and this is not 37:40 where we'll be seeing in this class, 37:43 consistent choosing plausible simple models. 37:46 And this requires a tremendous amount of domain knowledge. 37:50 And that's why we're not doing it in this class. 37:52 This is not something where I can make a blanket statement 37:54 about making good modeling. 37:55 You need to know, if I were a statistician working 37:58 on a study, I would have to grill the person in front 38:00 of me, the expert, for two hours to know, but how about this? 38:04 How about that? 38:05 How does this work? 38:06 So it requires to understand a lot of things. 38:08 There's this famous statistician to whom this sentence is 38:14 attributed, and it's probably not his then, 38:16 but Tukey said that he loves being a statistician, 38:21 because you get to play in everybody's backyard. 38:23 Right, so you get to go and see people. 38:25 And you get to understand, at least to a certain extent, what 38:28 their problems are. 38:29 Enough that you can actually build 38:31 a reasonable model for what they're actually doing. 38:33 So you get to do some sociology. 38:34 You get to do some biology. 38:35 You get to do some engineering. 38:37 And you get to do a lot of different things. 38:39 Right, so he was actually at some point 38:40 predicting the presidential election. 38:46 So, you see, you get to do a lot of different things. 
38:48 But it requires a lot of time to understand 38:50 what problem you're working on. 38:52 And if you have a particular application in mind, 38:54 you're the best person to actually understand this. 38:56 So I'm just going to give you the basic tools. 38:58 39:07 So this is the circle of trust. 39:11 No, this is really just a simple graphic 39:14 that tells you what's going on. 39:15 When you do probability, you're given the truth. 39:19 Somebody tells you what die God is rolling. 39:24 So you know exactly what the parameters of the problem are. 39:27 And what you're trying to do is to describe what 39:29 the outcomes are going to be. 39:31 You can say, if you're rolling a fair die, 39:34 you're going to have-- 1/6 of the time in your data 39:36 you're going to have a one. 39:37 1/6 of the time you're going to have a two. 39:39 And so you can describe-- if I told you what the truth is, 39:42 you could actually go into a computer 39:44 and either generate some data, 39:46 or you could describe to me some more macro properties 39:51 of what the data would be like. 39:52 Oh, I would see a bunch of numbers 39:54 that would be centered around 35, if I 39:57 drew from a Gaussian distribution centered at 35. 40:00 Right, you would know this kind of thing. 40:01 I would know that it's very unlikely, if my Gaussian 40:07 has standard deviation-- 40:08 is centered at 0, say, with standard deviation 3-- 40:13 it's very unlikely that I will see numbers below minus 10 40:17 or above 10, right? 40:18 You know this, that you basically will not see them. 40:21 So you know, from the truth, from the distribution 40:25 of a random variable that does not have mu or sigma in it but really 40:27 numbers there, 40:28 you know what data you're going to be having. 40:31 Statistics is about going backwards. 40:33 It's saying, if I have some data, what was 40:37 the truth that generated it? 40:39 And since there are so many possible truths, 40:41 modeling says you have to pick one 40:44 of the simpler possible truths, so that you can average out. 40:47 Statistics basically means averaging. 40:49 You're averaging when you do statistics. 40:51 And averaging means that if I say 40:54 that I received-- so if I collect 40:56 all your GPAs, for example, 40:58 and my model is that the possible GPAs 41:01 are any possible numbers, 41:03 and anybody can have any possible GPA, 41:05 this is going to be a serious problem. 41:06 But if I can summarize those GPAs into two numbers, 41:09 say, mean and standard deviation, 41:11 then I have a pretty good description of what 41:13 is going on, rather than having 41:15 to predict the full list. 41:16 Right, if I learn a full list of GPAs and I say, 41:18 well, this was the distribution, 41:20 then it's not going to be of any use for me to predict what 41:22 the GPA would be of some random student walking in, 41:25 or something like this. 41:26 41:30 So just to finish my rant about probability versus statistics, 41:34 this is a question you would see in a probability-- this 41:37 is a probabilistic question, and this is a statistical question. 41:40 The probabilistic question is, previous studies 41:42 showed that the drug was 80% effective. 41:45 So you know that. 41:46 This is the effectiveness of the drug. 41:48 It's given to you. 41:49 This is how your problem starts. 41:51 Then we can anticipate that, for a study on 100 patients, 41:54 on average, 80 will be cured. 41:57 And at least 65 will be cured with 99% chances.
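That last claim can be checked directly. Here is a minimal sketch, not from the lecture, assuming the number of cured patients in the study is Binomial(n = 100, p = 0.8) and that Python with scipy is available.

```python
# A minimal sketch (not from the lecture): the "probability direction".
# If the drug is 80% effective and patients are independent, the
# number cured out of 100 is Binomial(n=100, p=0.8).
from scipy.stats import binom

X = binom(100, 0.8)

print(X.mean())   # 80.0 -- on average, 80 patients are cured
print(X.sf(64))   # P(X >= 65) = P(X > 64), roughly 0.9999
# So "at least 65 will be cured with 99% chances" is a safe prediction.
```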
42:00 So again, these are not-- 42:03 I'm not predicting, on 100 patients, exactly the number 42:05 of them that are going to be cured 42:07 and the number of them that are not. 42:08 But I'm actually sort of predicting 42:11 what things are going to look like on average, 42:13 or some macro properties of what my data sets will look like. 42:17 So with 99% chances-- that means 42:19 that for 99.99% of the data sets you would 42:23 draw from this particular distribution, 42:25 99.99% of the cohorts of 100 patients to whom you administer 42:30 this drug, I will be able to conclude that at least 65 42:34 of them will be cured-- on 99.99% of those data sets. 42:41 So that's a pretty accurate prediction 42:42 of what's going to happen. 42:45 Statistics is the opposite. 42:46 It says, well, I just know that 78 out of 100 were cured. 42:49 I have only one data set. 42:50 I cannot make predictions for all data sets. 42:53 But I can go back to the probability, 42:57 make some inference about what my probability will look 42:59 like, and then say, OK, then I can make those predictions 43:03 later on. 43:04 So when I start with 78/100, then maybe-- 43:08 in this case, I just don't know. 43:11 My best guess here is 78%, and I 43:16 have to add the extra error that I might be making by predicting 43:19 that here, the drug is not 80% effective but 78% effective. 43:25 And I need some error bars around this, 43:27 that will hopefully contain 80%, and then based on those error 43:30 bars I'm going to make slightly less precise predictions 43:34 for the future. 43:35 43:39 So, to conclude, so this was, why statistics? 43:44 So what is this course about? 43:46 It's about understanding the mathematics 43:48 behind statistical methods. 43:50 It's more of a tool. 43:51 We're not going to have fun and talk about algebraic geometry 43:54 just for fun in the middle of it. 43:57 So it justifies quantitative statements given some modeling 44:01 assumptions, and we will, in this class, 44:03 mostly admit that the modeling assumptions are correct. 44:06 In the first part-- in this introduction-- 44:08 we will go through them, because it's 44:10 very easy to forget what assumptions you're actually 44:12 making. 44:13 But this will be a pretty standard thing. 44:15 The words you will hear a lot are IID-- 44:18 independent and identically distributed-- 44:20 that means that your data is basically all the same, 44:23 and one data point is not impacting another data point. 44:28 Hopefully we can describe some interesting mathematics 44:30 arising in statistics. 44:31 You know, if you've taken linear algebra, 44:33 maybe we can explain to you why. 44:36 If you've done some calculus, maybe we 44:38 can do some interesting calculus. 44:40 We'll see how, in the spirit of applied math, 44:42 those things answer interesting questions. 44:45 And basically we'll try to carve out a math toolbox that's 44:49 useful for statistics. 44:52 And maybe you can extend it to more sophisticated methods 44:55 that we did not cover in this class. 44:57 In particular, in a machine learning class, 44:59 hopefully you'll be able to have some statistical intuition 45:02 about what is going on. 45:04 So what this course is not about-- 45:06 it's not about spending a lot of time looking at data sets, 45:09 and trying to understand some statistical-thinking 45:13 kind of questions. 45:14 That is more of an applied statistics perspective 45:16 on things, or more modeling. 45:19 So I'm going to typically give you the model.
45:22 And say this is a model. 45:23 And this is how we're going to build an estimator 45:26 in the framework of this model. 45:28 So for example, 18.075, to a certain extent, 45:30 is called "Statistical Thinking and Data Analysis." 45:32 So I'm hoping there is some statistical thinking in there. 45:36 We will not talk about software implementation. 45:38 Unfortunately, there's just too little time in a semester. 45:42 There's other courses that are giving you some overview. 45:45 So the main software these days are R 45:49 is the leading software I'd say in statistics, both in academia 45:54 and industry, lots of packages, one every day 45:58 that's probably coming out. 46:00 But there's other things, right, so now Python is probably 46:03 catching up with all these scikit-learn packages that 46:09 are coming up. 46:10 Julia has some statistics in there, 46:14 but it really if you were to learn a statistical software, 46:17 let's say you love doing this, this 46:19 would be the one that would prove most useful for you 46:21 in the future. 46:22 It does not scale super well to high dimensional data. 46:26 So there is a class an IDSS that actually 46:28 uses R. It's called IDS 0.12, I think 46:31 it's called "Statistics, Computation, and Applications," 46:36 or something like this. 46:37 I'm also preparing, with Peter Kempthorne, 46:40 a course called "Computational Statistics." 46:42 46:47 It's going to be offered this Spring as a special topics. 46:50 And so Peter Kempthorne will be teaching it. 46:55 And this class will actually focus 46:58 on using R. And even beyond that, 47:00 it's not just going to be about using. 47:02 It's going to be about understanding-- 47:04 just the same way we we're going to see 47:05 how math helps you do statistics, 47:07 it's going to help see how math helps you 47:09 do algorithims for statistics. 47:12 All right, so we'll talk about maximum likelihood estimator. 47:15 Will need to maximize some function. 47:16 There's an optimization toolbox to do that. 47:19 And we'll see how we can have specialized 47:20 for statistics for that, and what 47:22 are the principles behind it. 47:25 And you know, of course, if you've 47:26 taken AP stats you probably think that stats 47:29 is boring to death because it was just 47:31 a long laundry-list that spent a lot of time on t-test. 47:34 I'm pretty sure we're not going to talk about t-test, well, 47:37 maybe once. 47:38 But this is not a matter of saying you're going to do this. 47:42 And this is a slight variant of it. 47:43 We're going to really try to understand what's going on. 47:46 So, admittedly, you have not chosen the simplest way 47:49 to get an A in statistics on campus. 47:52 All right, this is not the easiest class. 47:54 It might be challenging at times, 47:56 but I can promise you that you will maybe suffer. 47:59 But you will learn something by the time 48:01 you're out of this class. 48:02 This will not be a waste of your time. 48:04 And you will be able to understand, 48:06 and not having to remember by heart how those things actually 48:09 work. 48:10 Are there any questions? 48:13 Anybody want to go to other stats class on campus? 48:16 Maybe it's not too late. 48:18 OK. 48:21 So let's do some statistics. 48:25 So I see the time now and it's 11:56, 48:29 so we have another 30 minutes. 48:31 I will typically give you a three, 48:35 four minute break if you want to stretch, 48:37 if you want to run to the bathroom, 48:39 if you want to check your texts or Instagram. 
48:45 There was very little content in this class, 48:47 hopefully it was entertaining enough 48:49 that you don't need the break. 48:51 But just in the future, so you know you will have a break. 48:55 So statistics, this is how it starts, I'm French, what can 49:01 I say I need to put some French words. 49:05 So this is not how office hours are going to go down. 49:08 49:12 Anybody know this sculpture by a Rodin, The Kiss. 49:16 Maybe probably The Thinker is more famous. 49:18 But this is actually a pretty famous one. 49:20 But is it really this one, or is it this one. 49:23 Anybody knows which one it is? 49:26 This one? 49:27 Or this one? 49:28 AUDIENCE: The previous. 49:30 PHILIPPE RIGOLLET: What's that? 49:32 AUDIENCE: This one. 49:33 PHILIPPE RIGOLLET: It's this one. 49:33 AUDIENCE: Final answer. 49:35 PHILIPPE RIGOLLET: Yeah, who votes for this one. 49:39 OK. 49:40 Who votes for that one? 49:42 Thank you. 49:42 I love that you do not want to pronounce yourself with no data 49:45 actually to make any decision. 49:47 This is a total coin toss right. 49:49 Turns out that there is data, and there 49:51 is in the very serious journal Nature, 49:53 someone published a very serious paper which 49:56 actually looks pretty serious. 49:58 If you look at it, it's like "Human Behavior: 50:00 Adult persistence of head-turning symmetry," 50:02 is a lot of fancy words in there. 50:04 And this, I'm not kidding you, this study 50:07 is about collecting data of people kissing, 50:09 and knowing if they bend their head to the right 50:12 or if they bend they head to the left. 50:14 And that's all it is. 50:15 And so a neonatal right-side preference 50:21 makes a surprising romantic reappearance in later life. 50:25 There's an explanation for it. 50:27 All right, so if we follow this Nature which one is the one. 50:32 This one? 50:33 Or this one? 50:34 This one, right? 50:35 Head to the right. 50:38 And to be fair, for this class I was like, 50:41 oh, I'm going to go and show them what Google Images does. 50:46 When you Google kissing couple, it's 50:49 inappropriate after maybe the first picture. 50:51 And so I cannot show you this. 50:53 But you know you can check for yourself. 50:55 Though I would argue, so this person 50:57 here actually went out in airports 51:00 and took pictures of strangers kissing and collecting data. 51:06 And can somebody guess why did he just not stay home 51:10 and collect data from Google Images 51:12 by just googling kissing couples. 51:17 What's wrong with this data? 51:19 I didn't know actually before I actually went on Google Images. 51:22 AUDIENCE: It can be altered? 51:23 PHILIPPE RIGOLLET: What was that? 51:24 AUDIENCE: It can be altered. 51:25 PHILIPPE RIGOLLET: It can be altered. 51:26 But, you know, who would want to do this? 51:28 I mean there's no particular reason why 51:29 you would want to flip an image before putting it out there. 51:31 I mean, you might, but you know maybe they 51:34 want to hide the brand of your Gap shirt or something. 51:38 AUDIENCE: I guess the people who post pictures of themselves 51:42 kissing on Google Images are not representative 51:44 of the general population. 51:45 PHILIPPE RIGOLLET: Yeah, that's very true. 51:47 And actually it's even worse than that. 51:49 The people who post pictures of themselves, 51:51 are not posting pictures of themselves 51:52 or putting pictures of the people 51:54 that they took a picture of. 51:55 And there usually is a stock watermark on this. 51:59 And it's basically stock images. 
52:00 Those are actors, and so they've been directed to kiss, 52:03 and this is not a natural thing to do. 52:06 And actually, if you go to Google Images-- and I 52:08 encourage you to do this, unless you 52:10 don't want to see inappropriate pictures, 52:12 and they're mightily inappropriate-- 52:14 basically you will see that what this study found is actually not 52:19 showing up at all. 52:20 I mean, I looked briefly. 52:21 I didn't actually collect numbers. 52:22 But I didn't find a particular tendency to bend right. 52:26 If anything, it was actually probably the opposite. 52:28 And it's because those people were directed to do it. 52:31 They just don't actually think about doing it. 52:34 And also because I think you need 52:36 to justify writing in your paper more than, 52:38 I sat in front of my computer. 52:41 So again, this first sentence here, 52:46 a neonatal right-side preference-- 52:49 "is there a right-side preference?" 52:51 is not a mathematical question. 52:53 But we can start saying, let's put some variables on this, 52:57 and ask questions about those variables. 52:59 So, you know, x is actually not a letter that's 53:02 used very much in statistics for parameters. 53:04 But p is one-- p for parameter. 53:07 And so you're going to take your parameter of interest, 53:09 p, which here is going to be the proportion of couples. 53:12 And that's among all couples. 53:13 So here, if you talk about statistical thinking, 53:17 there would be a question about what population this would 53:20 actually be representative of. 53:22 Usually this is a call to your-- 53:24 53:26 sorry, I should not forget this word, it's important for you. 53:30 53:33 OK, I forget this word. 53:34 So this is-- 53:38 OK. 53:43 So if you look at this proportion, 53:44 maybe these couples that are in the study 53:46 might be representative only of couples in airports. 53:49 Maybe they actually put on a show for the other passengers. 53:51 Who knows? 53:52 You know, like, oh, let's just do it as well. 53:54 And then, just like the people in Google Images, 53:56 they are actually doing it. 53:57 So maybe you want to just restrict it. 53:58 But of course, clearly, if it's appearing in Nature, 54:01 it should not be only about couples in airports. 54:04 It's supposedly representative of all couples in the world. 54:07 And so here let's just keep it vague, 54:10 but you need to keep in mind what population 54:12 this is actually making a statement about. 54:14 So you have this full population of couples in the world. 54:20 Right, so those are all the couples. 54:23 And this person went ahead and collected data 54:27 about a bunch of them. 54:29 And we know that, in this thing, there's basically 54:31 a proportion of them, that's like p, 54:33 and that's the proportion of them that's bending 54:35 their head to the right. 54:36 And so everybody on this side is bending their heads right. 54:40 And hopefully we can actually sample 54:41 this thing uniformly. 54:42 That's basically the process that's going on. 54:44 So this is the statistical experiment. 54:47 We're going to observe n kissing couples. 54:49 So here we're going to put as many variables as we can, 54:51 so we don't have to stick with numbers, 54:53 and then we'll just plug in the numbers. 54:55 n kissing couples-- and in statistics, 54:58 by the way, n is the size of your sample 99.9% of the time. 55:04 And collect the value of each outcome. 55:06 So we want numbers. 55:07 We don't want right or left.
55:08 So we're going to code them by 0 and 1, pretty naturally. 55:12 And then we're going to estimate p, which is unknown. 55:16 So p is this area. 55:18 And we're going to estimate it simply 55:19 by the proportion of rights-- so the proportion of crosses 55:24 that actually fell on the right side. 55:27 55:29 So in this study, what you will find 55:33 is that the numbers that were collected 55:36 were 124 couples, and that, out of those 124, 80 of them 55:43 turned their head to the right. 55:46 So, p hat is a proportion. 55:49 How do we compute it? 55:50 Well, you don't need statistics for that. 55:51 You're going to take 80 divided by 124. 55:54 And you will find that in this particular study 55:57 64.5% of the couples were bending 56:00 their heads to the right. 56:01 That's a pretty large number, right? 56:03 The question is, if I picked another 124 couples, maybe 56:07 at different airports, at different times, would I see the same number? 56:10 Would this number be all over the place? 56:11 Would it be sometimes very close to 120, or sometimes 56:14 very close to 10? 56:15 In other words, is this number actually fluctuating a lot? 56:20 And so, hopefully not too much-- 64.5% is definitely 56:26 much larger than 50%. 56:28 And so there seems to be this preference. 56:31 Now we're going to have to quantify 56:33 how strong this preference is. 56:34 Is this number significantly larger than 50%? 56:38 Suppose our data, for example, was just three couples. 56:41 I just go out there, I go to Logan. 56:43 I record it: I see right, right, left. 56:45 And then I see-- 56:47 what's the name of the fish place there? 56:53 I go to Wahlburgers at Logan and I'm like, 56:56 OK, I'm done for the day. 56:57 I collect this data. 56:58 I go home, and I'm like, wow, 66.7% to the right. 57:02 That's a pretty big number. 57:03 It's even farther from 50% than this other guy. 57:06 So I'm doing even better. 57:08 But of course you know that this is not true. 57:10 Three couples is definitely not representative. 57:12 If I stopped at the first one-- actually, 57:13 even at the first two, I would have had 100%. 57:18 So the question that statistics is going to help us answer is, 57:21 how large should the sample be? 57:23 For some reason, I don't know if you guys receive this, 57:25 I'm an affiliate with the Broad Institute, 57:27 and since then I receive one email per day 57:30 that says, sample size determination-- 57:32 how large should your sample be? 57:33 Like, I know how large my sample should be. 57:36 I've taken 18.650 multiple times. 57:39 And so I know, but the question is-- is 124 57:43 a large enough number or not? 57:45 Well, the answer is actually, as usual, it depends. 57:47 It will depend on the true unknown value of p. 57:51 But from those particular values that we got-- so 124 and-- 57:56 how many couples were there? 57:57 80? 57:58 We can actually draw some conclusions. 58:02 So here we said that 80 was large enough-- 58:07 it was allowing us to conclude, at 64.5%, 58:12 that the proportion was larger than 50%. 58:17 50% of 124 is 62. 58:23 So the question is, would I 58:24 be willing to make this conclusion at 63? 58:28 Is that a number that would convince you? 58:30 Who would be convinced by 63? 58:34 Who would be convinced by 72? 58:35 58:38 Who would be convinced by 75? 58:40 Hopefully the number of hands that are raised should grow. 58:42 58:46 Who would be convinced by 80?
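As a quick check of the arithmetic above, here is a minimal Python sketch; the counts (80 out of 124) and the three-couple toy sample are the ones quoted in the lecture, and the variable names are ours:

```python
# Point estimate of p from the counts quoted in the lecture.
n_right, n_total = 80, 124
p_hat = n_right / n_total
print(f"p_hat = {p_hat:.3f}")    # ~0.645, i.e. 64.5% of couples turned right

# Same computation on the tiny "three couples at Logan" sample (1 = right, 0 = not right).
tiny = [1, 1, 0]
print(sum(tiny) / len(tiny))     # 0.667 -- an even larger fraction, but from far too little data
```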
58:48 All right, so basically those numbers 58:51 don't come out of nowhere. 58:52 This 72 is the number that you would need for a study-- 58:56 the number that most statistical studies 58:58 would retain. 58:59 That is, out of 124, 59:01 you would need to see 72 that turn their head 59:04 right to actually make this conclusion. 59:07 And then 75-- 59:08 so we'll see that there are many ways to come to this conclusion 59:11 because, as you can see, this was 59:12 published in Nature with 80. 59:15 So that was OK. 59:15 So 80 is actually a very large number. 59:17 This is 99 point-- 59:20 no, so 72 is 95% confidence. 59:24 75 is 99% confidence. 59:26 And 80 is 99.9% confidence. 59:29 So if you said 80, you're a very conservative person. 59:34 Starting at 72, you can start making this conclusion. 59:36 59:39 To understand this, we need to go into 59:41 our little mathematical kitchen here, 59:45 and we need to do some modeling. 59:49 So by modeling, 59:51 we need to understand what random process we think 59:55 this data is generated from. 59:57 So it's going to have some unknown parameters, 59:59 unlike in probability. 60:00 But we need to have basically everything written down 60:02 except for the values of the parameters. 60:04 When I said a die comes up uniformly, each face with probability 1/6, 60:08 everything was specified-- maybe 60:12 here I should say there are six numbers, 60:14 and I just need to fill in those numbers. 60:18 So for i equal 1 to n, I'm going to define 60:23 Ri to be the indicator. 60:27 An indicator is just something that takes value 1 if something 60:29 is true, and 0 if not. 60:31 So it's an indicator that the i-th couple 60:34 turns their head to the right. 60:36 So, Ri-- it's indexed by i. 60:39 And it's 1 if the i-th couple turns their head to the right, 60:42 and 0 if it's-- 60:45 well actually, I guess they can probably kiss straight, right? 60:48 That would be weird, but they might be able to do this. 60:51 So let's say 0 if not right. 60:54 Then the estimator of p, we said, was p hat. 60:56 It was just the ratio of two numbers. 60:58 But really, what it is, is that I sum those Ri's. 61:02 Since I only add those that take value 1, 61:07 this sum here is actually just counting the number of 1's, 61:10 which is another way of saying it's counting the number of couples 61:13 that are kissing to the right. 61:15 And here I don't even have to tell you anything 61:18 about the numbers. 61:20 I only keep track of-- 61:21 first couple is a 0, second couple is a 1, 61:24 third couple is a 0. 61:25 The data set-- you can actually find it online-- 61:27 is just a sequence of 0's and 1's. 61:29 Now clearly, for the question that we're 61:31 asking about this proportion, I don't 61:32 need to keep track of all this information. 61:34 All I need to keep track of is the number 61:37 of 0's and the number of 1's. 61:39 Those are completely interchangeable. 61:42 There's no time effect in this. 61:44 The first couple is no different from the 15th couple. 61:48 So we call this Rn bar. 61:50 That's going to be a very standard notation that we use. 61:52 R might be replaced by other letters like x-- 61:55 so xn bar, yn bar. 61:58 And this thing essentially means that I 62:00 average the Ri's over n of them. 62:04 The bar means the average. 62:06 So I divide the total number of 1's by n.
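The 72 / 75 / 80 cutoffs quoted above can be reproduced with a one-sided Gaussian approximation to the count of right-turners when there is no preference (p = 1/2). This is only a sketch of where those numbers come from, not the course's formal derivation:

```python
# Sketch: cutoffs for n = 124 couples under p = 1/2, using the Gaussian (CLT)
# approximation to the Binomial(124, 1/2) count of right-turners.
import math
from scipy.stats import norm

n = 124
mean = n / 2                  # 62 right-turners expected if there is no preference
sd = math.sqrt(n) / 2         # sqrt(n * 1/2 * 1/2)

for conf in (0.95, 0.99, 0.999):
    z = norm.ppf(conf)                    # one-sided Gaussian quantile
    cutoff = math.ceil(mean + z * sd)     # smallest count that clears the bar
    print(f"{conf:.1%} confidence: need at least {cutoff} out of {n}")
# prints 72, 75, and 80, matching the numbers quoted in the lecture
```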
62:09 So here this sum was equal to 80 in our example, and n 62:13 was equal to 124. 62:16 Now, this is an estimator. 62:18 So an estimator is different from an estimate. 62:20 An estimate is a number. 62:21 My estimate was 64.5%. 62:23 My estimator is this thing where I keep all the variables free. 62:29 And in particular, I keep those variables 62:31 random, because I'm going to think of a random couple 62:34 kissing left or right as the outcome of a random process, 62:37 just like flipping a coin and getting heads or tails. 62:41 And so this thing here is a random variable, Ri. 62:43 And this average is, of course, an average of random variables. 62:46 It's itself a random variable. 62:47 So an estimator is a random variable. 62:49 An estimate is the realization of a random variable, 62:51 or, in other words, the value that you 62:53 get for this random variable once you plug in the numbers 62:56 that you've collected. 62:58 So I can talk about the accuracy of an estimator. 63:01 Accuracy means what? 63:02 Well, what would we want from an estimator? 63:04 Maybe we don't want it to fluctuate too much. 63:06 It's a random variable. 63:07 So I'm talking about the accuracy of a random variable. 63:11 So maybe I don't want it to be too volatile. 63:13 I could have one estimator which would be-- 63:16 just throw out 122 couples, keep only 2, 63:20 and average those two numbers. 63:21 That's definitely a worse estimator 63:23 than keeping all of the 124. 63:25 So I need to find a way to say that. 63:26 And what I'm going to be able to say 63:28 is that the number is going to fluctuate. 63:30 If I take another two couples, 63:31 I'm probably going to get 63:33 a completely different number. 63:34 But if I take another 124 couples two days later, 63:38 maybe I'm going to get a number that's 63:40 very close to 64.5%. 63:43 So that's one thing. 63:43 The other thing we would like about this estimator-- 63:46 beyond 63:47 not being too volatile-- is that 63:49 we want it to be close to the number that we're looking for. 63:54 Here is an estimator-- 63:55 a beautiful one: 63:57 72%. That's an estimator. 64:00 Go out there, do your favorite study 64:02 about drug performance. 64:06 And then they're going to call you, the MIT student who took 64:10 statistics, and they say, so how are you 64:12 going to build your estimator? 64:13 We've collected these 5,000 observations or something like that. 64:15 I'm just going to spit out 72%. 64:17 Whatever the data says, that's an estimator. 64:19 It's a stupid estimator, but it is an estimator. 64:21 And this estimator is not volatile at all. 64:23 Every time you have a new study, 64:25 even if you change fields, it's still going to be 72%. 64:27 This is beautiful. 64:29 The problem is that it's probably not 64:31 very close to the value you're actually trying to estimate. 64:34 So we need two things. 64:35 We need our estimator-- it's a random variable, 64:36 so think in terms of densities. 64:39 We want the density to be pretty narrow. 64:42 We want this thing to have very little spread-- 64:46 so this is definitely better than this. 64:52 But also, we want the number that we're interested in, p, 64:55 to be close to this-- 64:57 to be close to the values that this thing is likely to take. 65:00 If p is here, this is not very good for us. 65:04 So those are basically the things we're going to be looking at. 65:06 The first one is referred to as variance. 65:08 The second one is referred to as bias.
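The variance-versus-bias picture described above is easy to see in a small simulation. This is an illustration only: the "true" p = 0.645 is hypothetical, and the three estimators (keep 2 couples, keep all 124, always answer 72%) are the ones joked about in the lecture:

```python
# Sketch: compare three estimators of p over many simulated studies.
import numpy as np

rng = np.random.default_rng(0)
p_true, n_studies = 0.645, 10_000      # hypothetical truth, number of repeated studies

def p_hats(sample_size):
    # each row is one simulated study of `sample_size` couples (1 = right, 0 = not right)
    data = rng.binomial(1, p_true, size=(n_studies, sample_size))
    return data.mean(axis=1)

keep_2 = p_hats(2)                      # keep only 2 couples: wild fluctuations
keep_124 = p_hats(124)                  # the full sample: much more stable
always_72 = np.full(n_studies, 0.72)    # the "always answer 72%" estimator

for name, est in [("2 couples", keep_2), ("124 couples", keep_124), ("constant 72%", always_72)]:
    print(f"{name:>12}: std = {est.std():.3f}, bias = {est.mean() - p_true:+.3f}")
# the constant estimator has zero variance but pure bias; the 124-couple average
# has small variance and essentially no bias
```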
65:10 Those things come up all over statistics. 65:14 So we need to set up a model. 65:16 So here's the model that we have for this particular problem. 65:20 We need to make assumptions on the observations 65:22 that we see. 65:23 So we said we're going to assume that they are random variables-- 65:25 that's not too much of a leap of faith. 65:27 We're just sweeping under the rug everything 65:29 we don't understand about those couples. 65:31 And the assumption that we make is 65:33 that Ri is a random variable. 65:36 This one you will forget very soon. 65:38 The second one is that each of the Ri's is-- 65:41 well, it's a random variable that takes value 0 or 1. 65:45 Can anybody suggest a distribution 65:47 for this random variable? 65:48 AUDIENCE: Bernoulli. 65:49 PHILIPPE RIGOLLET: What? 65:50 AUDIENCE: Bernoulli. 65:51 PHILIPPE RIGOLLET: Bernoulli, right? 65:51 And it's actually beautiful. 65:53 This is where you have to do the least statistical modeling. 65:56 A random variable that takes value 0 or 1 65:59 is always a Bernoulli. 66:00 That's the simplest variable you can ever think of. 66:02 Any variable that takes only two possible values 66:04 can be reduced to a Bernoulli. 66:06 OK, so this is a Bernoulli. 66:10 And here we make the assumption that it actually 66:12 has parameter p. 66:16 And there's an assumption here. 66:17 Can anybody tell me what the assumption is? 66:21 AUDIENCE: It's the same. 66:22 PHILIPPE RIGOLLET: Yeah, it's the same, right? 66:24 I could have said p i, but it's p. 66:26 And that's where I'm going to be able to start 66:28 doing some statistics. 66:29 It's that I'm going to be able to pool information 66:31 across all my guys. 66:32 If I assume that they all have p i's 66:34 completely uncoupled from each other, 66:36 then I'm in trouble. 66:37 There's nothing I can actually get. 66:39 And then I'm going to assume that those guys are 66:41 mutually independent. 66:42 Most of the time we will just say independent. 66:45 Meaning that it's not like all these guys called each other 66:48 and it's actually a flash mob, 66:50 and they were like, let's all turn our heads to the left. 66:53 That would definitely not 66:54 give you a valid conclusion. 66:59 So, again, randomness is a way of modeling lack 67:02 of information. 67:03 Here there is a way to figure it out. 67:05 Maybe I could have followed all those guys 67:07 and known exactly what they were doing-- maybe 67:09 I could have looked at pictures of them in the womb 67:11 and guessed how they would turn-- by the way, that's 67:14 one of the conclusions: they're guessing 67:16 that we turn our head to the right 67:17 because our head is turned to the right in the womb. 67:21 But we don't know what goes on in the kissers' minds. 67:24 And there's, you know, physics, sociology. 67:26 There are a lot of things that could help us, 67:28 but it's just too complicated to keep track of, 67:31 or too expensive, in many instances. 67:34 Now again, the nicest part of this modeling 67:37 was the fact that the Ri's take only two values, which 67:39 means that the conclusion that they were Bernoulli 67:41 was totally free for us. 67:43 Once we know it's a random variable, it's a Bernoulli. 67:45 Now, as we said, 67:47 they could have been Bernoulli with parameter p i. 67:51 For each i, I could have put a different parameter, 67:55 but I just don't have enough information. 67:57 What would I have said? 67:58 I would say, well, the first couple turned to the right.
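The independence assumption is doing real work here. A small sketch contrasts iid couples with the "flash mob" scenario in which every couple copies a single coin flip; the numbers are illustrative, not from the study:

```python
# Sketch: iid couples versus perfectly dependent ("flash mob") couples.
import numpy as np

rng = np.random.default_rng(1)
p_true, n, n_studies = 0.645, 124, 10_000

iid = rng.binomial(1, p_true, size=(n_studies, n)).mean(axis=1)
flash_mob = rng.binomial(1, p_true, size=n_studies)   # one flip per study, copied by all n couples

print("iid couples:       std of p_hat =", round(iid.std(), 3))        # ~0.04
print("flash-mob couples: std of p_hat =", round(flash_mob.std(), 3))  # ~0.48 -- averaging did not help
```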
68:00 p1 has to be 1, that's my best guess. 68:04 The second couple kissed to the left-- 68:06 well, p2 should be 0, that's my best guess. 68:10 And so basically I need to be 68:14 able to average my information. 68:16 And the way I do it is by coupling all these guys-- 68:19 the p i's are the same p for all i. 68:22 OK, does it make sense? 68:23 Here what I am assuming is that my population is homogeneous. 68:28 Maybe it's not. 68:29 Maybe I could actually look at a finer grain, 68:31 but I'm basically making a statement about a population. 68:35 And so maybe you kiss to the left, and then you're not-- 68:41 I'm not making a statement about a person individually, 68:44 I'm making a statement about the overall population. 68:47 Now, independence is probably reasonable, right? 68:49 This person just went out there, and we can seriously 68:53 hope that these couples did not communicate with each other, 68:56 or that, you know, Tanya did not text everyone that we should all 68:59 turn our heads to the left now. 69:01 And there's no external stimulus that forces people 69:05 to do something different. 69:08 OK, so-- sorry about that. 69:15 Since we have a bit less than 10 minutes, 69:19 let's do a few exercises-- is that OK with you? 69:22 So I just have some exercises so we can see what 69:24 an exercise is going to look like. 69:26 This is sort of similar to the exercises you will see with me. 69:30 We should do it together, OK? 69:31 So now we're going to have-- 69:32 I have a test. 69:33 69:36 So that's an exam in probability. 69:42 OK. 69:44 And I'm going to have 15 students take this test. 69:50 And hopefully, these should be 15 grades 69:53 that are representative of the grades of a large class. 69:57 Right, so if you take, you know, 18.600, it's a large class-- 70:00 there are definitely more than 15 students. 70:02 And maybe, just by sampling 15 students at random, 70:04 I want to have an idea of what my grade distribution will 70:08 look like. 70:09 I'm grading them, and I want to make an educated guess. 70:13 So I'm going to make some modeling assumptions 70:15 for those guys. 70:16 So here, there are 15 students and the grades are x1 to x15. 70:22 Just like we had R1, R2, all the way to R124. 70:26 Those were my Ri's. 70:27 And now I have my xi's. 70:29 And I'm going to assume that xi follows 70:33 a Gaussian, or normal, distribution with mean mu 70:39 and variance sigma squared. 70:40 Now, this is modeling, right? 70:43 Nobody told me this-- there's no physical process that 70:45 makes this happen. 70:46 We know that there's something called the central limit 70:48 theorem in the background that says that things 70:50 tend to be Gaussian, but this is really a matter of convenience. 70:53 Actually, if you think about it, 70:55 this is terrible, because it puts non-zero probability 70:57 on negative scores. 70:58 I'm definitely not going to get a negative score. 71:00 But, you know, it's good enough, because 71:02 we know the probability is non-zero, but it's probably 10 71:05 to the minus 12. 71:06 So I would be very unlucky to see a negative score. 71:10 So here's the list of grades: I have 65, 41, 70, 90, 58, 82, 71:24 76, 78-- 71:28 maybe I should have done it with 8 --59, 59-- 71:35 sitting next to each other --84, 89, 134, 51, and 72. 71:47 So those are the scores that I got. 71:51 There were clearly some bonus points over there. 71:53 And the question is, find an estimator for mu. 72:05 What is my estimator for mu? 72:06 72:09 Well, an estimator, again, is something that 72:11 depends on the random variables.
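The "negative scores" caveat about the Gaussian model can be checked numerically. This sketch uses illustrative parameter values (the 67.5 and 18 that come up below), not anything taken from the lecture slides:

```python
# Sketch: how much probability a Gaussian grade model puts on negative scores,
# and how to simulate 15 grades from it.  mu and sigma are illustrative values.
from scipy.stats import norm

mu, sigma = 67.5, 18.0
print(norm.cdf(0, loc=mu, scale=sigma))    # P(score < 0): tiny, so the model is harmless in practice
print(norm.rvs(loc=mu, scale=sigma, size=15, random_state=0).round(1))   # 15 simulated grades
```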
72:12 All right, so mu is the expectation, right? 72:15 So a good estimator is definitely the average score, 72:22 just like we had the average of the Ri's. 72:24 Now the xi's no longer need to be 0's and 1's, so it's not 72:28 going to boil down to the number of 1's divided 72:31 by the total number. 72:32 Now, if I'm looking for an estimate, 72:41 well, I need to actually sum those numbers 72:43 and divide them by 15. 72:45 So my estimate is going to be 1/15. 72:47 Then I'm going to start summing those numbers-- 72:49 65 plus 41, and so on, up to 72. 72:51 72:54 OK, and I can do it, and it's 67.5. 73:06 This is my estimate. 73:08 Now, if I want to compute a standard deviation-- 73:13 so let's say an estimate for sigma. 73:18 73:21 You've seen that before, right? 73:23 An estimate for sigma is what? 73:24 An estimate for sigma-- we'll see methods to do this, 73:27 but sigma squared is the variance, 73:31 that is, the expectation of (x minus the expectation of x) squared. 73:35 73:38 And the problem is that I don't know 73:40 what those expectations are. 73:42 And so I'm going to do what 99.9% of statistics is about. 73:47 And what is statistics about? 73:49 What's my motto? 73:51 Statistics is about replacing expectations with averages. 73:54 That's what all of statistics is about. 73:57 There are 300 pages in a purple book called All of Statistics 74:00 that tell you this. 74:01 All right, and then you do something fancy. 74:03 Maybe you minimize something after you 74:05 replace the expectation. 74:07 Maybe you need to plug in other stuff. 74:08 But really, every time you see an expectation, 74:10 you replace it by an average. 74:12 OK, let's do this. 74:13 So sigma squared hat will be what? 74:16 It's going to be 1 over n, sum from i equals 1 to n, 74:20 of xi minus-- 74:22 well, here I need to replace my expectation by an average, 74:25 which is really this average, 74:27 which I'm going to call mu hat-- squared. 74:31 There, I have replaced my expectation with an average. 74:34 OK, so the golden rule is, take your expectation 74:38 and replace it with this. 74:39 74:45 Frame it, get a tattoo, I don't care, but that's what it is. 74:49 If you remember one thing from this class, that's what it is. 74:53 Now, you can be fancy-- if you look at your calculator, 74:56 it's going to put an n minus 1 here, because it 74:59 wants to be unbiased. 75:00 Those are things we are going to come to. 75:02 But let's say right now we stick to this. 75:04 And then, when I plug in my numbers, 75:06 I'm going to get an estimate for sigma, 75:14 which is the square root of the estimator 75:17 once I plug in the numbers. 75:18 And you can check that the number you get will be 18. 75:21 75:27 So those are basic things, and if you've taken any AP stats, 75:31 this should be completely standard for you. 75:32 75:35 Now, I have another list, and we don't have time to look at it. 75:39 75:42 It doesn't really matter. 75:43 75:49 OK, we'll do that next time. 75:50 This is fine. 75:51 We'll see another list of numbers and-- 75:55 we're going to think about modeling assumptions. 75:57 The goal of that exercise is not to compute those things, 75:59 it's really to think about modeling assumptions. 76:01 Is it reasonable to think that things are IID? 76:04 Is it reasonable to think that they 76:05 all have the same parameters, that they're independent, 76:07 et cetera? 76:08 OK, so one thing that I wanted to add: probably by tonight-- 76:16 so I will try to-- 76:18 in the spirit of-- 76:20 I don't know what's starting to happen.
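Both estimates follow the "replace expectations with averages" rule and can be sketched directly from the grades as transcribed; since the audio may have garbled a value or two, the printed numbers need not match the quoted 67.5 and 18 exactly:

```python
# Sketch: sample mean and sample standard deviation for the transcribed grades.
import math

grades = [65, 41, 70, 90, 58, 82, 76, 78, 59, 59, 84, 89, 134, 51, 72]
n = len(grades)

mu_hat = sum(grades) / n                                  # average replaces E[X]
var_hat = sum((x - mu_hat) ** 2 for x in grades) / n      # average replaces E[(X - E[X])^2]
var_unbiased = var_hat * n / (n - 1)                      # the "n - 1" version your calculator uses

print(round(mu_hat, 1), round(math.sqrt(var_hat), 1), round(math.sqrt(var_unbiased), 1))
```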
76:22 In the spirit of using my iPad and fancy things, 76:26 I will try to post some videos-- in particular, 76:29 who here has never used a statistical table to read, say, 76:33 the quantiles of a Gaussian distribution? 76:37 OK, so there are several of you. 76:39 This is a simple but boring exercise. 76:42 I will just post a video on how to do this, 76:44 and you will be able to find it on Stellar. 76:46 It's going to take five minutes, and then you 76:48 will know everything there is to know about those things-- 76:50 and that's something you need for the first problem set. 76:53 By the way, the problem set has 76:54 30 exercises in probability. 76:57 You need to do 15. 76:59 And you only need to turn in 15. 77:01 You can turn in all 30 if you want. 77:03 But by the time we hit those things-- 77:07 well actually, 77:08 by next week-- you need to know what's in there. 77:11 So if you don't have time to do all the homework 77:13 and then go back to your probability class 77:15 to figure out how to do it, just do 15 easy ones that you can do, 77:19 and turn those in. 77:20 But go back to your probability class 77:21 and make sure that you know how to do all of them. 77:23 Those are pretty basic questions, 77:25 and those are things that I'm not going to slow down on. 77:28 So you need to remember that the expectation of the product 77:30 of independent random variables is 77:32 the product of the expectations. 77:34 The expectation of the sum is the sum of the expectations. 77:36 This kind of thing, which is a little silly, 77:38 but it just requires practice. 77:40 So, just have fun. 77:42 Those are simple exercises. 77:43 You will have fun remembering your probability class. 77:46 All right, so I'll see you on Tuesday-- 77:49 or Monday. 77:51
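For readers who would rather not flip through a printed table, the Gaussian quantiles mentioned above can also be read off in a line of Python; this is a convenience sketch, not a substitute for learning to use the table as the lecture suggests:

```python
# Sketch: Gaussian quantiles and tail probabilities without a printed table.
from scipy.stats import norm

print(norm.ppf(0.975))   # ~1.96, the standard two-sided 95% quantile
print(norm.ppf(0.95))    # ~1.645, the one-sided 95% quantile
print(norm.cdf(1.96))    # ~0.975, the reverse lookup
```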