https://www.youtube.com/watch?v=0Va2dOLqUfM&list=PLUl4u3cNGP60uVBMaoNERc6knT_MgPKS0&index=5 Transcript 00:00 00:00 The following content is provided under a Creative 00:02 Commons license. 00:03 Your support will help MIT OpenCourseWare 00:06 continue to offer high quality educational resources for free. 00:10 To make a donation or to view additional materials 00:12 from hundreds of MIT courses, visit MIT OpenCourseWare 00:16 at ocw.mit.edu. 00:17 00:22 PROFESSOR: So I'm using a few things here, right? 00:24 I'm using the fact that KL is non-negative. 00:27 But KL is equal to 0 when I take twice the same argument. 00:31 So I know that this function is always non-negative. 00:34 00:38 So that's theta and that's KL P theta star P theta. 00:45 And I know that at theta star, it's equal to 0. 00:51 OK? 00:52 I could be in the case where I have this happening. 00:56 I have two-- let's call it theta star prime. 01:01 I have two minimizers. 01:04 That could be the case, right? 01:05 I'm not saying that-- so K of L-- 01:07 KL is 0 at the minimum. 01:11 That doesn't mean that I have a unique minimum, right? 01:14 But it does, actually. 01:16 What do I need to use to make sure 01:17 that I have only one minimum? 01:18 01:22 So the definiteness is guaranteeing to me 01:24 that there's a unique P theta star that minimizes it. 01:28 But then I need to make sure that there's 01:30 a unique-- from this unique P theta star, 01:33 I need to make sure there's a unique theta star that 01:35 defines this P theta star. 01:36 01:39 Exactly. 01:40 All right, so I combine definiteness 01:43 and identifiability to make sure that there is a unique 01:47 minimizer, so that this case cannot exist. 01:50 OK, so basically, let me write what I just said. 01:55 So definiteness, that implies that P theta star 02:06 is the unique minimizer of P theta maps to KL P theta star P 02:23 theta. 02:23 So definiteness only guarantees that the probability 02:26 distribution is uniquely identified. 02:29 And identifiability implies that theta star 02:42 is the unique minimizer of theta maps to KL P theta star P 02:56 theta, OK? 03:00 So I'm basically doing the composition 03:02 of two injective functions. 03:04 The first one is the one that maps, say, theta to P theta. 03:07 And the second one is the one that maps P theta 03:11 to the set of minimizers, OK? 03:14 03:20 So at least morally, you should agree that theta star 03:27 is the minimizer of this thing. 03:28 Whether it's unique or not, you should 03:30 agree that it's a good one. 03:33 So maybe you can think a little longer on this. 03:36 So thinking about this being the minimizer, 03:38 then it says, well, if I actually 03:40 had a good estimate for this function, 03:42 I would use the strategy that I described 03:44 for the total variation, which is, 03:45 well, I don't know what this function looks like. 03:48 It depends on theta star. 03:49 But maybe I can find an estimator 03:51 of this function that fluctuates around this function, 03:55 and such that when I minimize this estimator of the function, 03:58 I'm actually not too far, OK? 04:01 And this is exactly what drives me to do this, 04:04 because I can actually construct an estimator. 04:07 I can actually construct an estimator such 04:09 that this estimator is actually-- 04:12 of the KL is actually close to the KL, all right? 04:15 So I define KL hat. 04:18 So all we did is just replacing expectation with respect 04:22 to theta star by averages. 04:27 04:30 That's what we did.
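As a quick sketch of the KL-hat idea in code (not from the lecture; it assumes numpy, a made-up Gaussian location model N(theta, 1), and simulated data): replacing the expectation under P theta star with the sample average gives, up to an unknown constant, the negative average log-likelihood, and minimizing it over a grid lands close to theta star.

```python
import numpy as np

rng = np.random.default_rng(0)
theta_star = 2.0
x = rng.normal(theta_star, 1.0, size=1000)   # i.i.d. sample from P_theta* = N(theta*, 1)

def log_f(theta, x):
    """Log density of N(theta, 1) evaluated at the data points."""
    return -0.5 * np.log(2 * np.pi) - 0.5 * (x - theta) ** 2

def kl_hat_up_to_constant(theta, x):
    """Estimate of KL(P_theta* || P_theta) minus the unknown constant:
    just the negative empirical average log-likelihood."""
    return -np.mean(log_f(theta, x))

grid = np.linspace(-1, 5, 601)
estimates = np.array([kl_hat_up_to_constant(t, x) for t in grid])
print("minimizer of KL-hat:", grid[estimates.argmin()])   # close to theta* = 2
```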
04:33 So if you're a little puzzled by this error, that's all it says. 04:37 Replace this guy by this guy. 04:39 It has no mathematical meaning. 04:41 It just means just replace it by. 04:42 And now that actually tells me how to get my estimator. 04:46 It just says, well, my estimator, KL hat, 04:51 is equal to some constant which I don't know. 04:54 I mean, it certainly depends on theta star, 04:56 but I won't care about it when I'm trying to minimize-- 04:59 minus 1/n sum from i from 1 to n log f theta of x. 05:09 So here I'm reading it with the density. 05:11 You have it with the PMF on the slides, 05:13 and so you have the two versions in front of you, OK? 05:18 Oh sorry, I forgot the xi. 05:22 Now clearly, this function I know how to compute. 05:25 If you give me a theta, since I know the form of the density 05:30 f theta, for each theta that you give me, 05:33 I can actually compute this quantity, right? 05:38 This I don't know, but I don't care. 05:40 Because I'm just shifting the value of the function 05:42 I'm trying to minimize. 05:43 The set of minimizers is not going to change. 05:46 So now, this is my estimation strategy. 05:50 Minimize in theta KL hat P theta star P theta, OK? 06:01 So now let's just make sure that we all agree that-- 06:05 so what we want is the argument of the minimum, 06:07 right? arg min means the theta that minimizes this guy, 06:10 rather than finding the value of the min. 06:13 OK, so I'm trying to find the arg min of this thing. 06:15 Well, this is equivalent to finding the arg 06:18 min of, say, a constant minus 1/n sum from i from 1 to n 06:28 of log f theta of xi. 06:31 06:33 So that's just-- 06:34 06:38 I don't think it likes me. 06:41 No. 06:42 OK, so thus minimizing this average, right? 06:46 I just plugged in the definition of KL hat. 06:48 Now, I claim that taking the arg min 06:50 of a constant plus a function or the arg min of the function 06:53 is the same thing. 06:55 Is anybody not comfortable with this idea? 07:00 OK, so this is the same. 07:03 07:13 By the way, this I should probably 07:15 switch to the next slide, because I'm writing 07:18 the same thing, but better. 07:22 And it's with PMF rather than as PF. 07:29 OK, now, arg min of the minimum is the same of arg max-- 07:34 sorry, arg min of the negative thing 07:35 is the same as arg max without the negative, right? 07:37 07:40 arg max over theta of 1/n from i equal equal 1 to n log f 07:49 theta of xi. 07:49 07:53 Taking the arg min of the average 07:54 or the arg min of the sum, again, it's 07:56 not going to make much difference. 07:59 Just adding constants OR multiplying by constants 08:01 does not change the arg min or the arg max. 08:04 Now, I have the sum of logs, which 08:07 is the log of the product. 08:08 08:23 OK? 08:24 It's the arg max of the log of f theta of x1 times 08:27 f theta of x2, f theta of xn. 08:30 But the log is a function that's increasing, so maximizing 08:37 log of a function or maximizing the function itself 08:40 is the same thing. 08:42 The value is going to change, but the arg max 08:45 is not going to change. 08:46 Everybody agrees with this? 08:47 08:50 So this is equivalent to arg max over theta of pi from 1 to n 08:59 of f theta xi. 09:02 And that's because x maps to log x is increasing. 09:10 09:13 So now I've gone from minimizing the KL 09:17 to minimizing the estimate of the KL 09:19 to maximizing this product. 09:23 Well, this chapter is called maximum likelihood estimation. 
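The chain of equivalences (drop the unknown constant, flip the minus sign, drop the 1/n, turn the sum of logs into a product) can be checked numerically; a small sketch under the same assumed Gaussian location model as above:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(2.0, 1.0, size=50)        # assumed sample from N(theta*, 1), theta* = 2
grid = np.linspace(0.0, 4.0, 401)

# log f_theta(x_i) for every theta on the grid (rows) and every observation (columns)
log_f = -0.5 * np.log(2 * np.pi) - 0.5 * (grid[:, None] - x[None, :]) ** 2

const = 3.14                              # stand-in for the unknown constant
kl_hat = const - log_f.mean(axis=1)       # "constant minus average log-likelihood"
avg_loglik = log_f.mean(axis=1)
sum_loglik = log_f.sum(axis=1)
product_lik = np.exp(sum_loglik)          # product of the densities

i = kl_hat.argmin()
assert i == avg_loglik.argmax() == sum_loglik.argmax() == product_lik.argmax()
print("common arg min / arg max:", grid[i])
```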
09:27 The maximum comes from the fact that our original idea 09:30 was to minimize the negative of a function. 09:32 So that's why it's maximum likelihood. 09:34 And this function here is called the likelihood. 09:42 This function is really just telling me-- 09:45 they call it likelihood because it's 09:47 some measure of how likely it is that theta 09:49 was the parameter that generated the data. 09:52 OK, so let's go to the-- 09:55 well, we'll go to the formal definition in a second. 09:57 But actually, let me just give you 09:59 intuition as to why this is the distribution of the data. 10:05 Why this is the likelihood-- sorry. 10:07 Why is this making sense as a measure of likelihood? 10:11 Let's now think for simplicity of the following model. 10:14 So I have-- 10:15 I'm on the real line and I look at n, say, 10:19 theta 1 for theta in the real-- do you see that? 10:25 OK. 10:26 Probably you don't. 10:27 Not that you care. 10:28 OK, so-- 10:29 10:41 OK, let's look at a simple example. 10:42 10:45 So here's the model. 10:48 As I said, we're looking at observations on the real line. 10:52 And they're distributed according to some n theta 1. 10:57 So I don't care about the variance. 10:58 I know it's 1. 10:59 And it's indexed by theta in the real line. 11:03 OK, so this is-- the only thing I need to figure out 11:05 is, what is the mean of those guys, OK? 11:09 Now, I have this n observations. 11:11 And if you actually remember from your probability class, 11:15 are you familiar with the concept of joint density? 11:18 You have multivariate observations. 11:20 The joint density of independent random variables 11:23 is just a product of their individual densities. 11:26 So really, when I look at the product from i 11:30 equal 1 to n of f theta of xi, this 11:34 is really the joint density of the vector-- 11:44 11:48 well, let me not use the word vector-- 11:51 of x1 xn, OK? 11:55 So if I take the product of density, is it still a density? 11:58 And it's actually-- but this time on the r to the n. 12:04 And so now what this thing is telling me-- so 12:06 think of it in r2, right? 12:07 So this is the joint density of two Gaussians. 12:10 So it's something that looks like some bell-shaped curve 12:14 in two dimensions. 12:15 And it's centered at the value theta theta. 12:20 OK, they both have the mean theta. 12:22 So let's assume for one second-- it's 12:24 going to be hard for me to make pictures in n dimensions. 12:28 Actually, already in two dimensions, 12:29 I can promise you that it's not very easy. 12:31 So I'm actually just going to assume 12:34 that n is equal to 1 for the sake of illustration. 12:37 OK, so now I have this data. 12:40 And now I have one observation, OK? 12:44 And I know that the f theta looks like this. 12:47 And what I'm doing is I'm actually 12:48 looking at the value of x theta as my observation. 12:51 12:54 Let's call it x1. 12:57 Now, my principal tells me, just find the theta that 13:00 makes this guy the most likely. 13:03 What is the likelihood of my x1? 13:05 Well, it's just the value of the function. 13:07 That this value here. 13:09 And if I wanted to find the most likely theta that had generated 13:13 this x1, what I would need to do is to shift this thing 13:16 and put it here. 13:19 And so my estimate, my maximum likelihood estimator 13:21 here would be theta is equal to x1, OK? 13:28 That would be just the observation. 13:30 Because if I have only one observation, 13:32 what else am I going to do? 13:33 OK, and so it sort of makes sense. 
13:34 And if you have more observations, 13:36 you can think of it this way, as if you had more observations. 13:40 So now I have, say, K observations, 13:42 or n observations. 13:44 And what I do is that I look at the value for each 13:46 of these guys. 13:48 So this value, this value, this value, this value. 13:52 I take their product and I make this thing large. 13:55 OK, why do I take the product? 13:57 Well, because I'm trying to maximize their value 14:00 all together, and I need to just turn it into one number 14:02 that I can maximize. 14:04 And taking the product is the natural way 14:06 of doing it, either by motivating it 14:08 by the KL principle or motivating it 14:11 by maximizing the joint density, rather than just maximizing 14:14 anything. 14:15 OK, so that's why, visually, this is the maximum likelihood. 14:20 It just says that if my observations are here, 14:24 then this guy, this mean theta, is more likely than this guy. 14:29 Because now if I look at the value 14:31 of the function for this guy-- if I 14:33 look at theta being this thing, then this 14:35 is a very small value. 14:37 Very small value, very small value, very small value. 14:39 Everything gets a super small value, right? 14:41 That's just the value that it gets in the tail 14:43 here, which is very close to 0. 14:45 But as soon as I start covering all my points 14:47 with my bell-shaped curve, then all the values go up. 14:53 All right, so I just want to make a short break 14:58 into statistics, and just make sure 15:00 that the maximum likelihood principle involves 15:04 maximizing a function. 15:05 So I just want to make sure that we're 15:07 all on par about how do we maximize functions. 15:11 In most instances, it's going to be a one-dimensional function, 15:13 because theta is going to be a one-dimensional parameter. 15:16 Like here it's the real line. 15:18 So it's going to be easy. 15:20 In some cases, it may be a multivariate function 15:22 and it might be more complicated. 15:24 OK, so let's just make this interlude. 15:26 So the first thing I want you to notice 15:28 is that if you open any book on what's called optimization, 15:31 which basically is the science behind optimizing functions, 15:35 you will talk mostly-- 15:36 I mean, I'd say 99.9% of the cases 15:40 will talk about minimizing functions. 15:42 But it doesn't matter, because you can just flip the function 15:44 and you just put a minus sign, and minimizing h 15:47 is the same as maximizing minus h and the opposite, OK? 15:51 So for this class, since we're only 15:53 going to talk about maximum likelihood estimation, 15:55 we will talk about maximizing functions. 15:57 But don't be lost if you decide suddenly 15:59 to open a book on optimization and find only something 16:01 about minimizing functions. 16:03 OK, so maximizing an arbitrary function can actually be fairly 16:08 difficult. If I give you a function that has this weird 16:10 shape, right-- let's think of this polynomial for example-- 16:13 and I wanted to find the maximum, how would we do it? 16:17 16:20 So what is the thing you've learned in calculus on how 16:23 to maximize the function? 16:26 Set the derivative equal to 0. 16:27 Maybe you want to check the second derivative 16:29 to make sure it's a maximum and not a minimum. 16:31 But the thing is, this is only guaranteeing to you that you 16:34 have a local one, right? 
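A tiny numerical illustration of this picture, with assumed values: with one observation the Gaussian likelihood theta maps to f theta of x1 peaks exactly at x1, and with several observations the joint density (the product of the marginals) is large when the bell is centered over the points and essentially zero when they all sit in the tail.

```python
import numpy as np

def joint_density(theta, x):
    """Joint density of i.i.d. N(theta, 1) observations: product of the marginals."""
    return np.prod(np.exp(-0.5 * (x - theta) ** 2) / np.sqrt(2 * np.pi))

# one observation: theta -> f_theta(x1) is maximized at theta = x1
x1 = np.array([1.7])
grid = np.linspace(-3, 5, 801)
vals = [joint_density(t, x1) for t in grid]
print(grid[int(np.argmax(vals))])        # ~1.7

# several observations: centering the bell over the data beats leaving them in the tail
x = np.array([0.8, 1.2, 1.9, 2.3, 2.8])
print(joint_density(np.mean(x), x))      # comparatively large
print(joint_density(-4.0, x))            # every point is in the tail: essentially 0
```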
16:35 So if I do it for this function, for example, then this guy 16:38 is going to satisfy this criterion, 16:39 this guy is going to satisfy this criterion, 16:41 this guy is going to satisfy this criterion, this guy here, 16:43 and this guy satisfies the criterion, but not 16:45 the second derivative one. 16:46 So I have a lot of candidates. 16:50 And if my function can be really anything, 16:52 it's going to be difficult, whether it's 16:54 analytically by taking derivatives and setting them 16:56 to 0, or trying to find some algorithms to do this. 17:00 Because if my function is very jittery, 17:02 then my algorithm basically has to check all candidates. 17:05 And if there's a lot of them, it might take forever, OK? 17:08 So this is-- I have only one, two, three, four, 17:11 five candidates to check. 17:13 But in practice, you might have a million of them to check. 17:15 And that might take forever. 17:17 OK, so what's nice about statistical models, and one 17:21 of the things that makes all these models particularly 17:24 robust, and that we still talk about them 100 17:27 years after they've been introduced 17:29 is that the functions that-- the likelihoods 17:31 that they lead for us to maximize 17:33 are actually very simple. 17:34 And they all share a nice property, 17:37 which is that of being concave. 17:40 All right, so what is a concave function? 17:42 Well, by definition, it's just a function for which-- 17:44 let's think of it as being twice differentiable. 17:47 You can define functions that are not 17:49 differentiable as being concave, but let's think about it 17:51 as having a second derivative. 17:53 And so if you look at the function that 17:54 has a second derivative, concave are the functions 17:57 that have their second derivative that's 17:59 negative everywhere. 18:02 Not just at the maximum, everywhere, OK? 18:06 And so if it's strictly concave, this second derivative 18:09 is actually strictly less than zero. 18:12 And particularly if I think of a linear function, 18:16 y is equal to x, then this function 18:19 has its second derivative which is equal to zero, OK? 18:24 So it is concave. 18:26 But it's not strictly concave, OK? 18:28 If I look at the function which is negative x squared, 18:31 what is its second derivative? 18:33 18:35 Minus 2. 18:36 So it's strictly negative everywhere, OK? 18:39 So actually, this is a pretty canonical example 18:43 strictly concave function. 18:44 If you want to think of a picture of a strictly concave 18:46 function, think of negative x squared. 18:48 So parabola pointing downwards. 18:52 OK, so we can talk about strictly convex functions. 18:56 So convex is just happening when the negative of the function 18:59 is concave. 19:00 So that translates into having a second derivative which 19:03 is either non-negative or positive, depending 19:05 on whether you're talking about convexity or strict convexity. 19:09 But again, those convex functions 19:11 are convenient when you're trying to minimize something. 19:14 And since we're trying to maximize the function, 19:16 we're looking for concave. 19:18 So here are some examples. 19:21 Let's just go through them quickly. 19:23 19:39 OK, so the first one is-- 19:41 so here I made my life a little uneasy 19:46 by talking about the functions in theta, right? 19:49 I'm talking about likelihoods, right? 19:51 So I'm thinking of functions where the parameter is theta. 19:54 So I have h of theta. 
19:56 And so if I start with theta squared, 19:59 negative theta squared, then as we said, 20:02 h prime prime of theta, the second derivative is minus 2, 20:09 which is strictly negative, so this function is strictly 20:11 concave. 20:12 20:19 OK, another function is h of theta, which is-- 20:24 what did we pick-- 20:25 square root of theta. 20:28 What is the first derivative? 20:30 20:35 1/2 square root of theta. 20:39 What is the second derivative? 20:41 20:48 So that's theta to the negative 1/2. 20:51 So I'm just picking up another negative 1/2, 20:53 so I get negative 1/4. 20:56 And then I get theta to the 3/4 downstairs, OK? 21:02 Sorry, 3/2. 21:03 21:09 And that's strictly negative for theta, say, larger than 0. 21:16 And I really need to have this thing larger than 0 21:20 so that it's well-defined. 21:21 But strictly larger than 0 is so that this thing does not 21:24 blow up to infinity. 21:25 And it's true. 21:26 If you think about this function, it looks like this. 21:30 And already, the first derivative to infinity at 0. 21:34 And it's a concave function, OK? 21:37 Another one is the log, of course. 21:39 21:44 What is the derivative of the log? 21:47 That's 1 over theta, where h prime of theta is 1 over theta. 21:52 And the second derivative negative 1 over theta squared, 22:01 which again, is negative if theta is strictly positive. 22:06 Here I define it as-- 22:07 I don't need to define it to be strictly positive here, 22:10 but I need it for the log. 22:13 And sine. 22:16 OK, so let's just do one more. 22:18 So h of theta is sine of theta. 22:22 But here I take it only on an interval, 22:24 because you want to think of this function 22:27 as pointing always downwards. 22:29 And in particular, you don't want this function 22:31 to have an inflection point. 22:32 You don't want it to go down and then up 22:34 and then down and then up, because this is not concave. 22:37 And so sine is certainly going up and down, right? 22:39 So what we do is we restrict it to an interval where sine 22:43 is actually-- so what does the sine function looks 22:45 like at 0, 0? 22:47 And it's going up. 22:48 Where is the first maximum of the sine? 22:53 STUDENT: [INAUDIBLE] 22:54 PROFESSOR: I'm sorry. 22:55 STUDENT: Pi over 2. 22:56 PROFESSOR: Pi over 2, where it takes value 1. 22:59 And then it goes down again. 23:01 And then that's at pi. 23:04 And then I go down again. 23:05 And here you see I actually start changing my inflection. 23:08 So what we do is we stop it at pi. 23:10 And we look at this function, it certainly 23:12 looks like a parabola pointing downwards. 23:14 And so if you look at the-- you can check that it actually 23:16 works with the derivatives. 23:17 So the derivative of sine is cosine. 23:22 23:25 And the derivative of cosine is negative sine. 23:31 23:34 OK, and this thing between 0 and pi is actually positive. 23:38 So this entire thing is going to be negative. 23:40 OK? 23:41 And you know, I can come up with a lot of examples, 23:45 but let's just stick to those. 23:46 There's a linear function, of course. 23:48 And the find function is going to be concave, 23:51 but it's actually going to be convex as well, which 23:53 means that it's certainly not going to be 23:55 strictly concave or convex, OK? 23:58 So here's your standard picture. 
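These second-derivative computations can be double-checked symbolically; a short sketch assuming sympy is available (the linear example at the end is the one whose second derivative is identically zero, so it is concave but not strictly concave):

```python
import sympy as sp

theta = sp.symbols('theta', positive=True)

examples = {
    '-theta**2':   -theta**2,        # h'' = -2                  -> strictly concave
    'sqrt(theta)': sp.sqrt(theta),   # h'' = -1/(4 theta^(3/2))  -> strictly concave for theta > 0
    'log(theta)':  sp.log(theta),    # h'' = -1/theta^2          -> strictly concave for theta > 0
    'sin(theta)':  sp.sin(theta),    # h'' = -sin(theta)         -> negative on (0, pi)
    '2*theta - 3': 2*theta - 3,      # h'' = 0                   -> concave and convex, not strictly
}

for name, h in examples.items():
    print(name, '->', sp.simplify(sp.diff(h, theta, 2)))
```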
24:01 And here, if you look at the dotted line, what 24:04 it tells me is that a concave function, 24:07 and the property we're going to be using 24:08 is that if a strictly concave function has a maximum, which 24:12 is not always the case, but if it has a maximum, 24:15 then it actually must be-- sorry, a local maximum, 24:18 it must be a global maximum. 24:21 OK, so just the fact that it goes up and down and not 24:23 again means that there's only global maximum that can exist. 24:28 Now if you looked, for example, at the square root function, 24:32 look at the entire positive real line, 24:34 then this thing is never going to attain a maximum. 24:36 It's just going to infinity as x goes to infinity. 24:39 So if I wanted to find the maximum, 24:40 I would have to stop somewhere and say 24:42 that the maximum is attained at the right-hand side. 24:46 OK, so that's the beauty about convex functions or concave 24:49 functions, is that essentially, these functions 24:53 are easy to maximize. 24:55 And if I tell you a function is concave, 24:57 you take the first derivative, set it equal to 0. 25:00 If you find a point that satisfies this, 25:01 then it must be a global maximum, OK? 25:07 STUDENT: What if your set theta was 25:09 [INAUDIBLE] then couldn't you have a function that, 25:13 by the definition, is concave, with two upside down parabolas 25:17 at two disjoint intervals, but yet it has two global maximums? 25:22 25:26 PROFESSOR: So you won't get them-- 25:28 so you want the function to be concave on what? 25:31 On the convex cell of the intervals? 25:34 Or you want it to be-- 25:35 STUDENT: [INAUDIBLE] just said that any subset. 25:38 PROFESSOR: OK, OK. 25:40 You're right. 25:40 So maybe the definition-- so you're 25:42 pointing to a weakness in the definition. 25:45 Let's just say that theta is a convex set 25:49 and then you're good, OK? 25:50 So you're right. 25:51 25:54 Since I actually just said that this is true only for theta, 25:56 I can just take pieces of concave functions, right? 25:59 I can do this, and then the next one 26:00 I can do this, on the next one I can do this. 26:03 And then I would have a bunch of them. 26:05 But what I want is think of it as a global function 26:10 on some convex set. 26:11 You're right. 26:13 So think of theta as being convex 26:14 for this guy, an interval, if it's a real line. 26:17 26:20 OK, so as I said, for more generally-- so 26:25 we can actually define concave functions more generally 26:27 in higher dimensions. 26:29 And that will be useful if theta is not just 26:32 one parameter but several parameters. 26:34 And for that, you need to remind yourself of Calculus II, 26:39 and you have generalization of the notion of derivative, which 26:42 is called a gradient, which is basically a vector where 26:46 each coordinate is just the partial derivative with respect 26:49 to each coordinate of theta. 26:51 And the Hessian is the matrix, which 26:54 is essentially a generalization of the second derivative. 26:58 I denote it by nabla squared, but you 27:01 can write it the way you want. 27:02 And so this matrix here is taking as entry 27:07 the second partial derivatives of h with respect 27:10 to theta i and theta j. 27:12 And so that's the ij-th entry. 27:15 Who has never seen that? 27:16 27:19 OK. 27:20 So now, being concave here is essentially generalizing, 27:27 saying that a vector is equal to zero. 27:28 Well, that's just setting the vector-- sorry. 
27:31 The first order condition to say that it's a maximum 27:33 is going to be the same. 27:34 Saying that a function has a gradient equal to zero 27:38 is the same as saying that each of its coordinates 27:43 are equal to zero. 27:44 And that's actually going to be a condition 27:46 for a global maximum here. 27:48 So to check convexity, we need to see that a matrix itself 27:52 is negative. 27:53 Sorry, to check concavity, we need 27:55 to check that a matrix is negative. 27:57 And there is a notion among matrices 27:59 that compare matrix to zero, and that's exactly this notion. 28:03 You pre- and post-multiply by the same x. 28:06 So that works for symmetric matrices, 28:08 which is the case here. 28:10 And so you pre-multiply by x, post-multiply by the same x. 28:13 So you have your matrix, your Hessian here. 28:15 28:20 It's a d by d matrix if you have a d-dimensional matrix. 28:24 So let's call it-- 28:26 OK. 28:27 And then here I pre-multiply by x transpose. 28:31 I post-multiply by x. 28:34 And this has to be non-positive if I want it to be concave, 28:38 and strictly negative if I want it to be strictly concave. 28:42 OK, that's just a real generalization. 28:44 You can check for yourself that this is the same thing. 28:47 If I were in dimension 1, this would be the same thing. 28:49 Why? 28:50 Because in dimension 1, pre- and post-multiplying by x 28:53 is the same as multiplying by x squared. 28:55 Because in dimension 1, I can just move my x's around, right? 28:58 And so that would just mean the first condition 29:01 would mean in dimension 1 that the second derivative times x 29:04 squared has to be less than or equal to zero. 29:11 So here I need this for all x's that are not zero, 29:14 because I can take x to be zero and make this equal to zero, 29:16 right? 29:17 So this is for x's that are not equal to zero, OK? 29:21 And so some examples. 29:25 Just look at this function. 29:27 So now I have functions that depend on two parameters, 29:29 theta1 and theta2. 29:31 So the first one is-- 29:33 so if I take theta to be equal to-- 29:36 now I need two parameters, r squared. 29:39 And I look at the function, which is h of theta. 29:42 Can somebody tell me what h of theta is? 29:45 STUDENT: [INAUDIBLE] 29:49 PROFESSOR: Minus 2 theta2 squared? 29:52 OK, so let's compute the gradient of h of theta. 30:00 So it's going to be something that has two coordinates. 30:04 To get the first coordinate, what do I do? 30:06 Well, I take the derivative with respect 30:07 to theta1, thinking of theta2 as being a constant. 30:10 So this thing is going to go away. 30:11 And so I get negative 2 theta1. 30:14 And when I take the derivative with respect 30:15 to the second part, thinking of this part as being constant, 30:18 I get minus 4 theta2. 30:21 30:24 That clear for everyone? 30:26 That's just the definition of partial derivatives. 30:29 30:32 And then if I want to do the Hessian, 30:40 so now I'm going to get a 2 by 2 matrix. 30:42 30:45 The first guy here, I take the first-- so this guy 30:48 I get by taking the derivative of this guy with respect 30:51 to theta1. 30:52 So that's easy. 30:53 So that's just minus 2. 30:55 This guy I get by taking derivative 30:56 of this guy with respect to theta2. 30:58 So I get what? 31:00 Zero. 31:00 I treat this guy as being a constant. 31:03 This guy is also going to be zero, 31:04 because I take the derivative of this guy with respect 31:06 to theta1. 31:08 And then I take the derivative of this guy with respect 31:10 to theta2, so I get minus 4. 
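For this example the gradient is (-2 theta1, -4 theta2) and the Hessian is the constant diagonal matrix diag(-2, -4); a small numerical check of strict concavity (numpy assumed), previewing the quadratic-form test x transpose Hessian x < 0 that comes next:

```python
import numpy as np

# h(theta) = -theta1^2 - 2*theta2^2, as in the example
def grad_h(theta):
    t1, t2 = theta
    return np.array([-2.0 * t1, -4.0 * t2])

hessian = np.array([[-2.0, 0.0],
                    [0.0, -4.0]])           # constant, since h is quadratic

# negative definiteness: x^T (Hessian) x < 0 for every x != 0
rng = np.random.default_rng(0)
for _ in range(5):
    x = rng.normal(size=2)
    print(x @ hessian @ x)                   # equals -2*x1^2 - 4*x2^2 < 0

# equivalently, all eigenvalues of the symmetric Hessian are negative
print(np.linalg.eigvalsh(hessian))           # [-4., -2.]

# and the gradient vanishes only at the maximizer (0, 0)
print(grad_h(np.array([0.0, 0.0])))          # [0., 0.]
```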
31:14 OK, so now I want to check that this matrix satisfies 31:19 x transpose-- 31:21 this matrix x is negative. 31:24 So what I do is-- 31:25 so what is x transpose x? 31:27 So if I do x transpose delta squared h theta x, what I get 31:33 is minus 2 x1 squared minus 4 x2 squared. 31:42 Because this matrix is diagonal, so all it does is just weights 31:45 the square of the x's. 31:47 So this guy is definitely negative. 31:51 This guy is negative. 31:53 And actually, if one of the two is non-zero, 31:56 which means that x is non-zero, then this thing 31:58 is actually strictly negative. 32:00 So this function is actually strictly concave. 32:02 32:05 And it looks like a parabola that's slightly 32:07 distorted in one direction. 32:09 32:15 So well, I know this might have been some time ago. 32:21 Maybe for some of you might have been since high school. 32:23 So just remind yourself of doing second derivatives and Hessians 32:27 and things like this. 32:29 Here's another one as an exercise. 32:32 h is minus theta1 minus theta2 squared. 32:36 So this one is going to actually not be diagonal. 32:44 The Hessian is not going to be diagonal. 32:46 Who would like to do this now in class? 32:50 OK, thank you. 32:51 This is not a calculus class. 32:53 So you can just do it as a calculus exercise. 32:56 And you can do it for log as well. 32:58 Now, there is a nice recipe for concavity 33:01 that works for the second one and the third one. 33:05 And the thing is, if you look at those particular functions, 33:07 what I'm doing is taking, first of all, a linear combination 33:11 of my arguments. 33:13 And then I take a concave function of this guy. 33:15 And this is always going to work. 33:18 This is always going to give me a complete function. 33:20 So the computations that I just made, 33:22 I actually never made them when I prepared those 33:24 slides because I don't have to. 33:26 I know that if I take a linear combination of those things 33:28 and then I take a concave function of this guy, 33:30 I'm always going to get a concave function. 33:33 OK, so that's an easy way to check this, or at least as 33:39 a sanity check. 33:42 All right, and so as I said, finding maximizers of concave 33:48 or strictly concave function is the same 33:50 as it was in the one-dimensional case. 33:52 What I do-- sorry, in the one-dimensional case, 33:55 we just agreed that we just take the derivative 33:57 and set it to zero. 33:58 In the high dimensional case, we take the gradient 34:00 and set it equal to zero. 34:01 Again, that's calculus, all right? 34:04 So it turns out that so this is going 34:07 to give me equations, right? 34:09 The first one is an equation in theta. 34:11 The second one is an equation in theta1, theta2, theta3, 34:15 all the way to theta d. 34:16 And it doesn't mean that because I can write this equation 34:19 that I can actually solve it. 34:21 This equation might be super nasty. 34:23 It might be like some polynomial and exponentials and logs equal 34:28 zero, or some crazy thing. 34:31 And so there's actually, for a concave function, 34:36 since we know there's a unique maximizer, 34:38 there's this theory of convex optimization, which really, 34:42 since those books are talking about minimizing, 34:44 you had to find some sort of direction. 34:46 But you can think of it as the theory of concave maximization. 34:50 And they allow you to find algorithms to solve 34:54 this numerically and fairly efficiently. 34:57 OK, that means fast. 
34:58 Even if d is of size 10,000, you're 35:01 going to wait for one second and it's 35:02 going to tell you what the maximum is. 35:05 And that's what machine learning is about. 35:06 If you've taken any class on machine learning, 35:08 there's a lot of optimization, because they have really, 35:11 really big problems to solve. 35:13 Often in this class, since this is 35:15 more introductory statistics, we will have a close form. 35:19 For the maximum likelihood estimator 35:21 will be saying theta hat equals, and say x bar, 35:25 and that will be the maximum likelihood estimator. 35:28 So just why-- so has anybody seen convex optimization 35:34 before? 35:36 So let me just give you an intuition 35:38 why those functions are easy to maximize or to minimize. 35:43 In one dimension, it's actually very easy for you to see that. 35:46 35:50 And the reason is this. 35:52 If I want to maximize the concave function, what 35:57 I need to do is to be able to query a point 35:59 and get as an answer the derivative of this function, 36:04 OK? 36:04 So now I said this is the function I want to optimize, 36:07 and I've been running my algorithm for 5/10 of a second. 36:13 And it's at this point here. 36:15 OK, that's the candidate. 36:17 Now, what I can ask is, what is the derivative 36:19 of my function here? 36:21 Well, it's going to give me a value. 36:22 And this value is going to be either negative, positive, 36:26 or zero. 36:27 Well, if it's zero, that's great. 36:28 That means I'm here and I can just go home. 36:30 I've solved my problem. 36:31 I know there's a unique maximum, and that's 36:33 what I wanted to find. 36:34 If it's positive, it actually tells me 36:37 that I'm on the left of the optimizer. 36:41 And on the left of the optimal value. 36:43 And if it's negative, it means that I'm 36:47 at the right of the value I'm looking for. 36:50 And so most of the convex optimization methods 36:53 basically tell you, well, if you query the derivative 36:56 and it's actually positive, move to the right. 37:00 And if it's negative, move to the left. 37:02 Now, by how much you move is basically, well, 37:07 why people write books. 37:09 And in higher dimension, it's a little more complicated, 37:13 because in higher dimension, thinks about two dimensions, 37:16 then I'm only being able to get in a vector. 37:21 And the vector is only telling me, well, here 37:24 is half of the space in which you can move. 37:26 Now here, if you tell me move to the right, 37:28 I know exactly which direction I'm going to have to move. 37:30 But in two dimension, you're going 37:32 to basically tell me, well, move in this global direction. 37:37 And so, of course, I know there's a line on the floor I 37:40 cannot move behind. 37:42 But even if you tell me, draw a line on the floor 37:45 and move only to that side of the line, 37:47 then there's many directions in that line that I can go to. 37:50 And that's also why there's lots of things 37:53 you can do in optimization. 37:55 OK, but still, putting this line on the floor is telling me, 38:00 do not go backwards. 38:02 And that's very important. 38:03 It's just telling you which direction 38:04 I should be going always, OK? 38:07 All right, so that's what's behind this notion 38:11 of gradient descent algorithm, steepest descent. 38:14 Or steepest descent, actually, if we're trying to maximize. 38:17 OK, so let's move on. 38:22 So this course is not about optimization, all right? 38:26 So as I said, the likelihood was this guy. 
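The "query the derivative, move right if it is positive, left if it is negative" rule is one-dimensional gradient ascent; a minimal sketch on an assumed toy concave function h(theta) = -(theta - 3)^2 with a fixed step size (how far to move each time is exactly the part the optimization books are about):

```python
def h_prime(theta):
    # derivative of h(theta) = -(theta - 3)^2
    return -2.0 * (theta - 3.0)

theta = 0.0          # arbitrary starting point
step = 0.1           # fixed step size for this sketch
for _ in range(200):
    d = h_prime(theta)
    if abs(d) < 1e-12:          # derivative ~ 0: at the unique global maximum
        break
    theta += step * d           # positive derivative -> move right, negative -> move left
print(theta)                    # ~ 3.0
```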
38:30 The product of f of the xi's. 38:32 And one way you can do this is just 38:33 basically the joint distribution of my data at the point theta. 38:39 So now the likelihood, formerly-- so here 38:41 I am giving myself the model e theta. 38:44 And here I'm going to assume that e is discrete 38:48 so that I can talk about PMFs. 38:49 But everything you're doing, just 38:51 redo for the sake of yourself by replacing PMFs by PDFs, 38:55 and everything's going to be fine. 38:56 We'll do it in a second. 38:58 All right, so the likelihood of the model. 39:02 So here I'm not looking at the likelihood of a parameter. 39:05 I'm looking at the likelihood of a model. 39:07 So it's actually a function of the parameter. 39:09 And actually, I'm going to make it 39:10 even a function of the points x1 to xn. 39:14 All right, so I have a function. 39:15 And what it takes as input is all the points x1 39:18 to xn and a candidate parameter theta. 39:22 Not the true one. 39:22 A candidate. 39:23 And what I'm going to do is I'm going 39:25 to look at the probability that my random variables 39:28 under this distribution, p theta, 39:29 take these exact values, px1, px2, pxn. 39:34 Now remember, if my data was independent, 39:40 then I could actually just say that the probability 39:43 of this intersection is just a product of the probabilities. 39:45 And it would look something like this. 39:48 But I can define likelihood even if I don't have 39:50 independent random variables. 39:52 But think of them as being independent, 39:54 because that's all we're going to encounter in this class, OK? 39:57 I just want you to be aware that if I had dependent variables, 40:00 I could still define the likelihood. 40:02 I would have to understand how to compute these probabilities 40:04 there to be able to compute it. 40:08 OK, so think of Bernoullis, for example. 40:11 So here is my example of a Bernoulli. 40:12 40:16 So my parameter is-- 40:18 so my model is 0,1 Bernoulli p. 40:25 p is in the interval 0,1. 40:31 The probability, just as a side remark, 40:35 I'm just going to use the fact that I can actually 40:38 write the PMF of a Bernoulli in a very concise form, right? 40:41 If I ask you what the PMF of a Bernoulli is, 40:43 you could tell me, well, the probability that x-- 40:46 so under p, the probability that x is equal to 0 is 1 minus p. 40:50 The probability under p that x is equal to 1 is equal to p. 40:57 But I can be a bit smart and say that for any X that's 41:01 either 0 or 1, the probability under p 41:04 that X is equal to little x, I can write it 41:07 in a compact form as p to the X, 1 minus p to the 1 minus x. 41:14 And you can check that this is the right form because, well, 41:17 you have to check it only for two values of X, 0 and 1. 41:20 And if you plug in 1, you only keep the p. 41:23 If you plug in 0, you only keep the 1 minus p. 41:27 And that's just a trick, OK? 41:31 I could have gone with many other ways. 41:34 Agreed? 41:35 I could have said, actually, something like-- 41:39 another one would be-- which we are not going to use, 41:41 but we could say, well, it's xp plus and minus x 1 minus 41:47 p, right? 41:47 41:50 That's another one. 41:53 But this one is going to be convenient. 41:56 So forget about this guy for a second. 41:57 42:02 So now, I said that the likelihood is just 42:05 this function that's computing the probability that X1 42:12 is equal to little x1. 42:15 So likelihood is L of X1, Xn. 
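A two-line check of the compact form of the Bernoulli PMF, including the alternative form x p + (1 - x)(1 - p) mentioned in passing (assumed values):

```python
def bernoulli_pmf(x, p):
    """P_p(X = x) for x in {0, 1}, written in the compact form p^x (1-p)^(1-x)."""
    return p ** x * (1 - p) ** (1 - x)

p = 0.3
print(bernoulli_pmf(1, p))   # p      -> 0.3
print(bernoulli_pmf(0, p))   # 1 - p  -> 0.7

# the alternative form agrees on {0, 1}, it is just less convenient to work with
for x in (0, 1):
    assert bernoulli_pmf(x, p) == x * p + (1 - x) * (1 - p)
```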
42:27 So let me try to make those calligraphic so you 42:30 know that I'm talking about smaller values, right? 42:33 Small x's. 42:35 x1, xn, and then of course p. 42:38 Sometimes we even put-- 42:40 I didn't do it, but sometimes you can actually 42:42 put a semicolon here, semicolon so you know that those two 42:46 things are treated differently. 42:48 And so now you have this thing is equal to what? 42:51 Well, it's just the probability under p 42:54 that X1 is little x1 all the way to Xn is little xn. 42:59 OK, that's just the definition. 43:02 43:06 All right, so now let's start working. 43:11 So we write the definition, and then we 43:13 want to make it look like something we would potentially 43:16 be able to maximize if I were-- 43:17 like if I take the derivative of this with respect to p, 43:20 it's not very helpful because I just don't know. 43:22 Just want the algebraic function of p. 43:26 So this thing is going to be equal to what? 43:28 Well, what is the first thing I want to use? 43:30 43:32 I have a probability of an intersection of events, 43:35 so it's just the product of the probabilities. 43:39 So this is the product from i equal 1 to n of P, small p, 43:44 Xi is equal to little xi. 43:47 That's independence. 43:49 43:54 OK, now, I'm starting to mean business, because for each P, 43:58 we have a closed form, right? 44:00 I wrote this as this supposedly convenient form. 44:03 I still have to reveal to you why it's convenient. 44:06 So this thing is equal to-- 44:09 well, we said that that was p xi for a little xi. 44:15 1 minus p to the 1 minus xi, OK? 44:20 44:22 So that was just what I wrote over there as the probability 44:26 that Xi is equal to little xi. 44:29 And since they all have the same parameter p, just 44:32 have this p that shows up here. 44:34 44:38 And so now I'm just taking the products of something 44:41 to the xi, so it's this thing to the sum of the xi's. 44:45 Everybody agrees with this? 44:48 So this is equal to p sum of the xi, 1 minus p 44:56 to the n minus sum of the xi. 44:58 45:10 If you don't feel comfortable with writing it directly, 45:13 you can observe that this thing here 45:15 is actually equal to p over 1 minus p to the xi times 1 45:22 minus p, OK? 45:26 So now when I take the product, I'm 45:27 getting the products of those guys. 45:28 So it's just this guy to the power of sum 45:31 and this guy to the power n. 45:33 And then I can rewrite it like this if I want to 45:39 And so now-- well, that's what we have here. 45:42 And now I am in business because I can still 45:45 hope to maximize this function. 45:48 And how to maximize this function? 45:50 All I have to do is to take the derivative. 45:52 Do you want to do it? 45:54 Let's just take the derivative, OK? 45:56 Sorry, I didn't tell you that, well, the maximum likelihood 45:58 principle is to just maxim-- the idea is to maximize this thing, 46:01 OK? 46:02 But I'm not going to get there right now. 46:04 OK, so let's do it maybe for the Poisson model for a second. 46:08 So if you want to do it for the Poisson model, 46:16 let's write the likelihood. 46:18 So right now I'm not doing anything. 46:20 I'm not maximizing. 46:21 I'm just computing the likelihood function. 46:24 46:29 OK, so the likelihood function for Poisson. 46:32 So now I know-- what is my sample space for Poisson? 46:36 STUDENT: Positives. 46:38 PROFESSOR: Positive integers. 46:41 And well, let me write it like this. 46:45 Poisson lambda, and I'm going to take lambda to be positive. 
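A quick check, on an assumed sample, that the product of the individual PMFs really collapses to the closed form p^(sum of the xi) times (1 - p)^(n - sum of the xi):

```python
import numpy as np

x = np.array([1, 0, 1, 1, 0, 1])        # assumed Bernoulli sample
n, s = len(x), x.sum()

def likelihood_product(p):
    """Product of the individual PMFs p^{x_i} (1-p)^{1-x_i}."""
    return np.prod(p ** x * (1 - p) ** (1 - x))

def likelihood_closed_form(p):
    """Same thing collapsed: p^{sum x_i} (1-p)^{n - sum x_i}."""
    return p ** s * (1 - p) ** (n - s)

for p in (0.2, 0.5, 0.66):
    assert np.isclose(likelihood_product(p), likelihood_closed_form(p))
print("both forms agree")
```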
46:51 And so that means that the probability under lambda 46:53 that X is equal to little x in the sample space 46:57 is lambda to the X over factorial x 47:01 e to the minus lambda. 47:03 So that's basically the same as the compact form 47:05 that I wrote over there. 47:06 It's just now a different one. 47:08 And so when I want to write my likelihood, again, 47:12 we said little x's. 47:13 47:17 This is equal to what? 47:18 Well, it's equal to the probability under lambda 47:23 that X1 is little x1, Xn is little xn, 47:31 which is equal to the product. 47:33 47:40 OK? 47:42 Just by independence. 47:45 And now I can write those guys as being-- each 47:47 of them being i equal 1 to n. 47:52 So this guy is just this thing where a plug in Xi. 47:56 So I get lambda to the Xi divided by factorial xi times e 48:05 to the minus lambda, OK? 48:10 And now, I mean, this guy is going to be nice. 48:13 This guy is not going to be too nice. 48:15 But let's write it. 48:16 When I'm going to take the product of those guys here, 48:18 I'm going to pick up lambda to the sum of the xi's. 48:21 Here I'm going to pick up exponential 48:23 minus n times lambda. 48:25 And here I'm going to pick up just the product 48:27 of the factorials. 48:29 So x1 factorial all the way to xn factorial. 48:35 Then I get lambda, the sum of the xi. 48:41 Those are little xi's. 48:43 e to the minus xn lambda. 48:46 OK? 48:47 48:51 So that might be freaky at this point, but remember, 48:55 this is a function we will be maximizing. 48:58 And the denominator here does not depend on lambda. 49:01 So we knew that maximizing this function with this denominator, 49:04 or any other denominator, including 1, 49:07 will give me the same arg max. 49:09 So it won't be a problem for me. 49:12 As long as it does not depend on lambda, 49:14 this thing is going to go away. 49:15 49:19 OK, so in the continuous case, the likelihood I cannot-- 49:24 right? 49:25 So if I would write the likelihood 49:26 like this in the continuous case, 49:29 this one would be equal to what? 49:32 Zero, right? 49:33 So it's not very helpful. 49:34 And so what we do is we define the likelihood 49:36 as the product of the f of theta xi. 49:39 Now that would be a jump if I told you, 49:43 well, just define it like that and go home 49:45 and don't discuss it. 49:46 But we know that this is exactly what's coming from the-- 49:52 well, actually, I think I erased it. 49:53 It was just behind. 49:55 So this was exactly what was coming from the KL 49:58 divergence estimated, right? 50:00 The thing that I showed you, if we 50:01 want to follow this strategy, which 50:03 consists in estimating the KL divergence and minimizing it, 50:06 is exactly doing this. 50:08 50:12 So in the Gaussian case-- 50:16 well, let's write it. 50:17 So in the Gaussian case, let's see 50:19 what the likelihood looks like. 50:20 50:27 OK, so if I have a Gaussian experiment here-- 50:32 did I actually write it? 50:33 50:36 OK, so I'm going to take mu and sigma as being two parameters. 50:40 So that means that my sample space is going to be what? 50:43 50:47 Well, my sample space is still R. 50:49 Those are just my observations. 50:51 But then I'm going to have a N mu sigma squared. 50:56 And the parameters of interest are mu 50:58 and R. And sigma squared and say 0 infinity. 51:04 OK, so that's my Gaussian model. 51:06 Yes. 51:07 STUDENT: [INAUDIBLE] 51:17 PROFESSOR: No, there's no-- 51:18 I mean, there's no difference. 51:20 STUDENT: [INAUDIBLE] 51:21 PROFESSOR: Yeah. 
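The analogous check for the Poisson likelihood; note in the code that the factorial denominator does not involve lambda, which is why it drops out at the maximization step (assumed sample, standard-library factorial):

```python
import math
import numpy as np

x = np.array([2, 0, 3, 1, 4])                 # assumed Poisson sample

def poisson_likelihood(lam):
    """lambda^{sum x_i} e^{-n lambda} / (x_1! ... x_n!)"""
    num = lam ** x.sum() * math.exp(-len(x) * lam)
    den = math.prod(math.factorial(int(xi)) for xi in x)   # does not depend on lambda
    return num / den

def poisson_likelihood_product(lam):
    """Directly as the product of the individual PMFs."""
    return math.prod(lam ** int(xi) * math.exp(-lam) / math.factorial(int(xi)) for xi in x)

for lam in (0.5, 2.0, 3.3):
    assert np.isclose(poisson_likelihood(lam), poisson_likelihood_product(lam))
print("both forms agree")
```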
51:22 I think the all the slides I put the curly bracket, 51:24 then I'm just being lazy. 51:26 I just like those concave parenthesis. 51:31 All right, so let's write it. 51:33 So the definition, L xi, xn. 51:39 And now I have two parameters, mu and sigma squared. 51:43 We said, by definition, is the product from i 51:48 equal 1 to n of f theta of little xi. 51:55 Now, think about it. 51:57 Here we always had an extra line, right? 52:00 The line was to say that the definition was the probability 52:03 that they were all equal to each other. 52:05 That was the joint probability. 52:08 And here it could actually have a line that says it's the joint 52:12 probability distribution of the xi's. 52:14 And if it's not independent, it's 52:15 not going to be the product. 52:16 But again, since we're only dealing 52:18 with independent observations in the scope of this class, 52:21 this is the only definition we're going to be using. 52:23 OK, and actually, from here on, I 52:26 will literally skip this step when I talk about discrete ones 52:30 as well, because they are also independent. 52:33 Agreed? 52:35 So we start with this, which we agreed 52:37 was the definition for this particular case. 52:39 And so now all of you know by heart what the density of a-- 52:44 sorry, that's not theta. 52:45 I should write it mu sigma squared. 52:47 And so you need to understand what this density. 52:50 And it's product of 1 over sigma square root 2 pi times 53:01 exponential minus xi minus mu squared 53:07 divided by 2 sigma squared. 53:10 OK, that's the Gaussian density with parameters mu and sigma 53:13 squared. 53:15 I just plugged in this thing which I don't give you, 53:18 so you just have to trust me. 53:20 It's all over any book. 53:22 Certainly, I mean, you can find it. 53:25 I will give it to you. 53:26 And again, you're not expected to know it by heart. 53:29 Though, if you do your homework every week without wanting to, 53:34 you will definitely use some of your brain 53:36 to remember that thing. 53:38 OK, and so now, well, I have this constant in front. 53:42 1 over sigma square root 2 pi that I can pull out. 53:45 So I get 1 over sigma square root 2 pi to the power n. 53:50 And then I have the product of exponentials, which we know 53:52 is the exponential of the sum. 53:55 So this is equal to exponential minus. 53:58 And here I'm going to put the 1 over 2 sigma squared 54:01 outside the sum. 54:02 54:15 And so that's how this guy shows up. 54:19 Just the product of the density is evaluated at, respectively, 54:23 x1 to xn. 54:24 54:28 OK, any questions about computing those likelihoods? 54:33 Yes. 54:34 STUDENT: Why [INAUDIBLE] 54:41 PROFESSOR: Oh, that's a typo. 54:42 Thank you. 54:43 Because I just took it from probably the previous thing. 54:47 So those are actually-- should be-- 54:48 OK, thank you for noting that one. 54:50 So this line should say for any x1 to xn in R to the n. 55:00 Thank you, good catch. 55:01 55:06 All right, so that's really e to the n, right? 55:10 My sample space always. 55:12 55:16 OK, so what is maximum likelihood estimation? 55:19 Well again, if you go back to the estimate 55:24 that we got, the estimation strategy, which consisted 55:27 in replacing expectation with respect to theta star 55:31 by average of the data in the KL divergence, 55:35 we would try to maximize not this guy, but this guy. 55:41 55:45 The thing that we actually plugged in were not any small 55:48 xi's. 55:48 Were actually-- the random variable is capital Xi. 
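The Gaussian likelihood just derived, (1 over sigma root 2 pi)^n times exp of minus the sum of (xi - mu)^2 over 2 sigma^2, together with its log, which is what one actually maximizes in practice; a small consistency check on an assumed sample:

```python
import numpy as np

x = np.array([1.1, 2.3, 1.8, 2.9, 2.2])    # assumed sample

def gaussian_likelihood(mu, sigma2):
    sigma = np.sqrt(sigma2)
    return (1.0 / (sigma * np.sqrt(2 * np.pi))) ** len(x) \
        * np.exp(-np.sum((x - mu) ** 2) / (2 * sigma2))

def gaussian_log_likelihood(mu, sigma2):
    n = len(x)
    return -0.5 * n * np.log(2 * np.pi * sigma2) - np.sum((x - mu) ** 2) / (2 * sigma2)

mu, sigma2 = 2.0, 0.5
assert np.isclose(np.log(gaussian_likelihood(mu, sigma2)),
                  gaussian_log_likelihood(mu, sigma2))
print(gaussian_log_likelihood(mu, sigma2))
```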
55:52 So the maximum likelihood estimator 55:54 is actually taking the likelihood, 55:57 which is a function of little x's, and now 55:59 the values at which it estimates, if you look at it, 56:02 is actually-- 56:03 the capital X is my data. 56:05 So it looks at the function, at the data, 56:09 and at the parameter theta. 56:11 That's what the-- so that's the first thing. 56:14 And then the maximum likelihood estimator 56:16 is maximizing this, OK? 56:19 So in a way, what it does is it's a function that couples 56:24 together the data, capital X1 to capital Xn, 56:27 with the parameter theta and just now tries to maximize it. 56:32 So if this is just a little hard for you to get, 56:40 the likelihood is formally defined 56:42 as a function of x, right? 56:43 Like when I write f of x. 56:46 f of little x, I define it like that. 56:48 But really, the only x arguments we're 56:52 going to evaluate this function at 56:54 are always the random variable, which is the data. 56:57 So if you want, you can think of it 56:59 as those guys being not parameters of this function, 57:02 but really, random variables themselves directly. 57:04 57:09 Is there any question? 57:10 STUDENT: [INAUDIBLE] those random variables [INAUDIBLE]?? 57:15 PROFESSOR: So those are going to be known once you have-- 57:17 so it's always the same thing in stats. 57:20 You first design your estimator as a function 57:24 of random variables. 57:25 And then once you get data, you just plug it in. 57:27 But we want to think of them as being random variables 57:29 because we want to understand what the fluctuations are. 57:32 So we're going to keep them as random variables for as long 57:34 as we can. 57:35 We're going to spit out the estimator as a function 57:37 of the random variables. 57:38 And then when we want to compute it from data, 57:40 we're just going to plug it in. 57:41 57:44 So keep the random variables for as long as you can. 57:46 Unless I give you numbers, actual numbers, 57:48 just those are random variables. 57:51 OK, so there might be some confusion 57:53 if you've seen any stats class, sometimes there's 57:55 a notation which says, oh, the realization 57:58 of the random variables are lower case versions 58:01 of the original random variables. 58:02 So lowercase x should be thought as the realization 58:05 of the upper case X. This is not the case here. 58:09 When I write this, it's the same way 58:12 as I write f of x is equal to x squared, right? 58:16 It's just an argument of a function that I want to define. 58:20 So those are just generic x. 58:22 So if you correct the typo that I have, 58:24 this should say that this should be for any x and xn. 58:27 I'm just describing a function. 58:28 And now the only place at which I'm 58:30 interested in evaluating that function, 58:32 at least for those first n arguments, is at the capital 58:35 N observations random variables that I have. 58:37 58:41 So there's actually texts, there's actually 58:45 people doing research on when does the maximum likelihood 58:48 estimator exist? 58:49 And that happens when you have infinite sets, thetas. 58:56 And this thing can diverge. 58:58 There is no global maximum. 59:00 There's crazy things that might happen. 59:01 And so we're actually always going to be in a case 59:04 where this maximum likelihood estimator exists. 59:07 And if it doesn't, then it means that you actually 59:09 need to restrict your parameter space, capital Theta, 59:13 to something smaller. 59:15 Otherwise it won't exist. 
59:17 OK, so another thing is the log likelihood estimator. 59:21 So it is still the likelihood estimator. 59:23 We solved before that maximizing a function 59:26 or maximizing log of this function 59:27 is the same thing, because the log function is increasing. 59:30 So the same thing is maximizing a function 59:32 or maximizing, I don't know, exponential of this function. 59:35 Every time I take an increasing function, 59:37 it's actually the same thing. 59:38 Maximizing a function or maximizing 10 times 59:40 this function is the same thing. 59:41 So the function x maps to 10 times x is increasing. 59:45 And so why do we talk about log likelihood rather than 59:49 likelihood? 59:50 So the log of likelihood is really just-- 59:52 I mean the log likelihood is the log of the likelihood. 59:55 And the reason is exactly for this kind of reasons. 59:59 Remember, that was my likelihood, right? 60:02 And I want to maximize it. 60:04 And it turns out that in stats, there's 60:05 a lot of distributions that look like exponential of something. 60:10 So I might as well just remove the exponential 60:12 by taking the log. 60:14 So once I have this guy, I can take the log. 60:17 This is something to a power of something. 60:19 If I take the log, it's going to look better for me. 60:21 I have this thing-- 60:23 well, I have another one somewhere, I think, 60:25 where I had the Poisson. 60:27 Where was the Poisson? 60:29 The Poisson's gone. 60:31 So the Poisson was the same thing. 60:33 If I took the log, because it had a power, 60:35 that would make my life easier. 60:37 So the log doesn't have any particular intrinsic notion, 60:43 except that it's just more convenient. 60:47 Now, that being said, if you think 60:49 about maximizing the KL, the original formulation, 60:53 we actually remove the log. 60:55 If we come back to the KL thing-- 60:57 61:00 where is my KL? 61:01 Sorry. 61:03 That was maximizing the sum of the logs of the pi's. 61:08 And so then we worked at it by saying that the sum of the logs 61:11 was-- 61:12 maximizing the sum of the logs was the same 61:14 as maximizing the product. 61:16 But here, we're basically-- log likelihood 61:18 is just going backwards in this chain of equivalences. 61:21 And that's just because the original formulation 61:23 was already convenient. 61:27 So we went to find the likelihood 61:28 and then coming back to our original estimation strategy. 61:32 So look at the Poisson. 61:34 I want to take log here to make my sum of xi's go down. 61:39 OK, so this is my estimator. 61:47 So the log of L-- 61:50 so one thing that you want to notice 61:51 is that the log of L of x1, xn theta, as we said, 61:59 is equal to the sum from i equal 1 62:02 to n of the log of either p theta of xi, or-- 62:09 so that's in the discrete case. 62:11 And in the continuous case is the sum 62:14 of the log of f theta of xi. 62:16 62:19 The beauty of this is that you don't have to really understand 62:21 the difference between probability mass 62:23 function and probability distribution function 62:25 to implement this. 62:26 Whatever you get, that's what you plug in. 62:29 62:32 Any questions so far? 62:33 62:36 All right, so shall we do some computations 62:39 and check that, actually, we've introduced all this stuff-- 62:44 complicate functions, maximizing, KL divergence, 62:47 lot of things-- so that we can spit out, again, averages? 62:50 All right? 62:51 That's great. 
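Putting the pieces together: the maximum likelihood estimator plugs the data into theta maps to the sum of log f theta of Xi and maximizes over theta. A minimal generic sketch that does this by brute-force grid search (no optimization library; the Gaussian location model is an assumed example and the helper name is made up):

```python
import numpy as np

def mle_grid(log_pmf_or_pdf, data, grid):
    """Maximize theta -> sum_i log f_theta(x_i) over a grid of candidate thetas."""
    log_lik = np.array([np.sum(log_pmf_or_pdf(theta, data)) for theta in grid])
    return grid[log_lik.argmax()]

# example: N(theta, 1) location model
def log_f_gauss(theta, x):
    return -0.5 * np.log(2 * np.pi) - 0.5 * (x - theta) ** 2

rng = np.random.default_rng(2)
data = rng.normal(1.5, 1.0, size=200)
theta_hat = mle_grid(log_f_gauss, data, np.linspace(-2, 5, 1401))
print(theta_hat, data.mean())     # the two should agree up to the grid resolution
```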
62:51 We're going to able to sleep at night 62:52 and know that there's a really powerful mechanism called 62:55 maximum likelihood estimator that was actually 62:57 driving our intuition without us knowing. 63:00 OK, so let's do this so. 63:04 Bernoulli trials. 63:06 I still have it over there. 63:07 63:15 OK, so actually, I don't know what-- 63:19 well, let me write it like that. 63:21 So it's P over 1 minus P xi-- 63:25 sorry, sum of the xi's times 1 minus P is to the n. 63:32 So now I want to maximize this as a function of P. 63:37 Well, the first thing we would want to do 63:39 is to check that this function is concave. 63:41 And I'm just going to ask you to trust me on this. 63:45 So I don't want-- sorry, sum of the xi's. 63:47 I only want to take the derivative and just go home. 63:52 So let's just take the derivative of this with respect 63:55 to P. Actually, no. 63:56 This one was more convenient. 63:57 I'm sorry. 63:58 64:00 This one was slightly more convenient, OK? 64:03 So now we have-- 64:05 so now let me take the log. 64:09 So if I take the log, what I get is sum of the xi's times log p 64:16 plus n minus some of the xi's times log 1 minus p. 64:24 64:27 Now I take the derivative with respect 64:29 to p and set it equal to zero. 64:35 So what does that give me? 64:36 It tells me that sum of the xi's divided by p minus n 64:43 sum of the xi's divided by 1 minus p is equal to 0. 64:50 64:56 So now I need to solve for p. 64:58 So let's just do it. 64:59 So what we get is that 1 minus p sum of the xi's is equal to p n 65:06 minus sum of the xi's. 65:10 So that's p times n minus sum of the xi's plus sum of the xi's. 65:17 So let me put it on the right. 65:18 So that's p times n is equal to sum of the xi's. 65:24 And that's equivalent to p-- 65:27 actually, I should start by putting p hat from here 65:30 on, because I'm already solving an equation, right? 65:33 And so p hat is equal to syn of the xi's 65:36 divided by n, which is my xn bar. 65:38 65:44 Poisson model, as I said, Poisson is gone. 65:50 So let me rewrite it quickly. 65:51 66:00 So Poisson, the likelihood in X1, Xn, and lambda 66:07 was equal to lambda to the sum of the xi's e 66:13 to the minus n lambda divided by X1 factorial, 66:17 all the way to Xn factorial. 66:20 So let me take the log likelihood. 66:25 That's going to be equal to what? 66:26 It's going to tell me. 66:27 It's going to be-- 66:29 well, let me get rid of this guy first. 66:30 Minus log of X1 factorial all the way to Xn factorial. 66:36 That's a constant with respect to lambda. 66:39 So when I'm going to take the derivative, it's going to go. 66:43 Then I'm going to have plus sum of the xi's times log lambda. 66:49 And then I'm going to have minus n lambda. 66:51 66:54 So now then, you take the derivative 66:55 and set it equal to zero. 66:57 So log L-- well, partial with respect to lambda of log L, 67:04 say lambda, equals zero. 67:08 This is equivalent to, so this guy goes. 67:11 This guy gives me sum of the xi's divided by lambda hat 67:16 equals n. 67:17 67:22 And so that's equivalent to lambda hat 67:25 is equal to sum of the xi's divided by n, which is Xn bar. 67:31 67:34 Take derivative, set it equal to zero, and just solve. 67:38 It's a very satisfying exercise, especially when 67:42 you get the average in the end. 67:45 You don't have to think about it forever. 67:49 OK, the Gaussian model I'm going to leave to you as an exercise. 
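The two closed forms just derived, p hat = Xn bar for the Bernoulli and lambda hat = Xn bar for the Poisson, can be checked against a brute-force maximization of the corresponding log-likelihoods; a self-contained sketch with simulated data:

```python
import numpy as np

def mle_grid(log_pmf, data, grid):
    log_lik = np.array([np.sum(log_pmf(theta, data)) for theta in grid])
    return grid[log_lik.argmax()]

rng = np.random.default_rng(3)

# Bernoulli(p): log of p^x (1-p)^(1-x) is x log p + (1-x) log(1-p)
bern = rng.binomial(1, 0.35, size=500)
log_pmf_bern = lambda p, x: x * np.log(p) + (1 - x) * np.log(1 - p)
print(mle_grid(log_pmf_bern, bern, np.linspace(0.001, 0.999, 999)), bern.mean())

# Poisson(lambda): log PMF is x log(lambda) - lambda - log(x!); the log(x!) term
# is constant in lambda, so it is dropped here
pois = rng.poisson(2.4, size=500)
log_pmf_pois = lambda lam, x: x * np.log(lam) - lam
print(mle_grid(log_pmf_pois, pois, np.linspace(0.01, 6.0, 600)), pois.mean())
```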
67:54 Take the log to get rid of the pesky exponential, 67:57 and then take the derivative and you should be fine. 68:00 It's a bit more-- 68:02 it might be one more line than those guys. 68:05 OK, so-- well, actually, you need to take 68:12 the gradient in this case. 68:14 Don't check the second derivative right now. 68:15 You don't have to really think about it. 68:17 68:21 What did I want to add? 68:23 I think there was something I wanted to say. 68:25 Yes. 68:27 When I have a function that's concave and I'm on, like, 68:31 some infinite interval, then it's 68:33 true that taking the derivative and setting it equal to zero 68:36 will give me the maximum. 68:38 But again, I might have a function that looks like this. 68:42 Now, if I'm on some finite interval-- let me go elsewhere. 68:46 So if I'm on some finite interval 68:55 and my function looks like this as a function of theta-- 69:00 let's say this is my log likelihood 69:03 as a function of theta-- 69:06 then, OK, there's no place in this interval-- 69:13 let's say this is between 0 and 1-- there's 69:15 no place in this interval where the derivative is equal to 0. 69:19 And if you actually try to solve this, 69:22 you won't find a solution in the interval 0, 1. 69:26 And that's actually how you know that you probably 69:28 should not be setting the derivative equal to zero. 69:30 So don't panic if you get something that says, 69:32 well, the solution is at infinity, right? 69:34 If this function keeps going, you 69:36 will find that the solution-- you 69:37 won't be able to find a solution apart from infinity. 69:40 You are going to see something like 1 over theta hat 69:43 is equal to 0, or something like this. 69:46 So you know that when you've found this kind of solution, 69:48 you've probably made a mistake at some point. 69:51 And the reason is that for functions like this, 69:54 you don't find the maximum by setting the derivative equal 69:58 to zero. 69:59 You actually just find the maximum by saying, 70:01 well, it's an increasing function on the interval 0, 1, 70:03 so the maximum must be attained at 1. 70:05 70:07 So here in this case, that would mean 70:08 that my maximum would be 1. 70:12 My estimator would be 1, which would be weird. 70:14 So typically here, this endpoint is a function of the xi's. 70:17 One example that you will see many times is when this guy is 70:19 the maximum of the xi's. 70:24 In which case, the maximum of the likelihood is attained here, 70:27 at the maximum of the xi's. 70:29 OK, so just keep in mind-- 70:31 what I would recommend is every time 70:33 you're trying to take the maximum of a function, 70:36 just try to plot the function in your head. 70:39 It's not too complicated. 70:40 Those things are usually squares, or square roots, 70:44 or logs. 70:45 You know what those functions look like. 70:47 Just plot them in your mind and make sure 70:50 that you find a maximum where the function really 70:52 goes up and then down again. 70:54 If you don't, then that means your maximum 70:56 is achieved at the boundary, and you have 70:59 to think differently to get it. 71:01 So the machinery that consists in setting the derivative equal 71:04 to zero works 80% of the time. 71:06 But you do have to be careful. 71:08 And from the context, it will be clear 71:11 that you had to be careful, because you will find 71:14 some crazy stuff, such as having to solve 1 over theta hat 71:17 equals zero.
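A minimal numerical sketch of this boundary situation. The lecture only alludes to an estimator equal to the maximum of the xi's; the Uniform(0, theta) model used here is my assumption of the standard example behind that remark, and the data are made up.

```python
import numpy as np

rng = np.random.default_rng(0)
xs = rng.uniform(0, 0.7, size=20)       # pretend the true theta is 0.7

def log_likelihood(theta, xs):
    # Uniform(0, theta): density 1/theta on [0, theta], so the likelihood is zero
    # (log-likelihood is -inf) whenever theta < max(xs).
    if theta < xs.max():
        return -np.inf
    return -len(xs) * np.log(theta)

grid = np.linspace(0.01, 1.0, 1000)
values = [log_likelihood(t, xs) for t in grid]
print(grid[np.argmax(values)], xs.max())
# On [max(xs), 1] the log-likelihood is strictly decreasing and its derivative, -n/theta,
# is never zero, so the maximizer sits at the boundary: theta_hat = max(xs).
```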
71:18 71:23 All right, so before we conclude, 71:25 I just wanted to give you some intuition about how 71:28 the maximum likelihood estimator performs. 71:30 So there's something called the Fisher information 71:33 that essentially controls how this thing performs. 71:35 And the Fisher information is, essentially, 71:38 a second derivative or a Hessian. 71:40 So if I'm in a one-dimensional parameter case, it's a number, 71:44 it's a second derivative. 71:46 If I'm in a multidimensional case, it's actually a Hessian, 71:51 it's a matrix. 71:52 So I'm actually going to take as notation little curly l 71:57 of theta to be the log likelihood, OK? 72:00 And that's the log likelihood for one observation. 72:02 So let's call it x generically, but think of it as being x1, 72:05 for example. 72:07 And I don't care about, like, summing, 72:09 because I'm actually going to take the expectation of this thing. 72:11 So it's not going to be a data-driven quantity 72:13 I'm going to play with. 72:14 So now I'm going to assume that it 72:15 is twice differentiable, almost surely, because it's 72:19 a random function. 72:21 And so now I'm going to just sweep under the rug 72:23 some technical conditions under which these things hold. 72:27 So typically, when can I permute integrals and derivatives-- 72:32 this kind of stuff that you don't want to think about. 72:35 OK, the rule of thumb is it always 72:36 works until it doesn't, in which case 72:39 that probably means you're actually solving 72:41 some sort of calculus problem. 72:44 Because in practice, it just doesn't happen. 72:47 So the Fisher information is the expectation of the-- 72:56 that's called the outer product. 72:57 So that's the product of this gradient 73:01 and the gradient transpose. 73:02 So that forms a matrix, right? 73:04 That's a matrix, minus the outer product of the expectations. 73:09 So that's really what's called the covariance matrix 73:12 of this vector, nabla l of theta, which 73:16 is a random vector. 73:18 So I'm forming the covariance matrix of this thing. 73:21 And the technical conditions tell me that, actually, 73:23 this guy, which depends only on the-- 73:26 sorry, 73:31 it depends on the gradient-- 73:32 is actually equal to negative the expectation of the Hessian. 73:36 So I can actually get a quantity that 73:38 depends on the second derivatives using only 73:40 first derivatives. 73:41 But the expectation is going to play a role here. 73:44 And the fact that it's a log. 73:45 And lots of things actually show up here. 73:48 And so in this case, what I get is that-- 73:51 so in the one-dimensional case, then this 73:53 is just the covariance matrix of a one-dimensional thing, which 73:56 is just its variance. 73:58 So the variance of the derivative 74:00 is actually equal to negative the expectation 74:04 of the second derivative. 74:07 OK, so we'll see that next time. 74:09 But what I wanted to emphasize with this is: why do 74:12 we care about this quantity? 74:15 That's called the Fisher information. 74:16 Fisher is the founding father of modern statistics. 74:19 Why do we give this quantity his name? 74:23 Well, it's because this quantity is actually very critical. 74:25 What does the second derivative of a function 74:27 tell me at the maximum? 74:29 Well, it's telling me how curved it is, right? 74:34 If I have a zero second derivative, I'm basically flat. 74:37 And if I have a very high second derivative, I'm very curvy.
74:41 And when I'm very curvy, what it means 74:42 is that I'm very robust to estimation error. 74:45 Remember our estimation strategy, 74:47 which consisted in replacing expectations by averages? 74:50 If I'm extremely curvy, I can move a little bit. 74:52 This thing, the maximum, is not going to move much. 74:55 And this formula here-- 74:57 so forget about the matrix version for a second-- 75:00 is actually telling me exactly-- 75:01 it's telling me the curvature is basically the variance 75:06 of the first derivative. 75:08 And so the more the first derivative fluctuates, 75:10 the more your maximum-- your arg max-- 75:12 is going to move all over the place. 75:14 So this is really controlling how flat 75:16 your likelihood, your log likelihood, is at its maximum. 75:20 The flatter it is, the more sensitive to fluctuations 75:23 the arg max is going to be. 75:24 The curvier it is, the less sensitive it is. 75:27 And so what we're hoping for-- is a good model 75:28 going to be one that has a large or a small value 75:31 for the Fisher information? 75:34 Do I want this to be-- 75:36 small? 75:38 No, I want it to be large. 75:40 Because this is the curvature, right? 75:42 This number is negative, since the function is concave. 75:44 So if I put a negative sign in front, it's 75:45 going to be something that's positive. 75:47 And the larger this thing, the more curvy it is. 75:51 Oh, yeah, because it's the variance. 75:52 Again, sorry. 75:53 This is what-- 75:55 OK. 75:55 75:59 Yeah, maybe I should not go into those details, 76:02 because I'm actually out of time. 76:03 But just spoiler alert: the asymptotic variance 76:06 of your-- the variance, basically, as n 76:09 goes to infinity, of the maximum likelihood estimator 76:11 is going to be 1 over this guy. 76:12 So we want it to be large, so that the asymptotic variance 76:15 is going to be very small. 76:16 All right, so we're out of time. 76:18 We'll see that next week. 76:20 And I have your homework with me. 76:22 And I will actually hand it back. 76:25 I will give it to you outside so we 76:26 can let the other room come in. 76:28 OK, I'll just leave you the--
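To make the spoiler concrete, here is a small simulation in the Bernoulli model (my choice of example, not worked in the lecture). It checks the two facts just stated: the variance of the first derivative of the log likelihood matches minus the expected second derivative (the Fisher information), and the variance of the maximum likelihood estimator behaves like 1 over n times that Fisher information.

```python
import numpy as np

p, n, reps = 0.3, 500, 20_000            # illustrative values
rng = np.random.default_rng(1)

# Fisher information identity for one observation X ~ Bernoulli(p):
# score = d/dp log p_p(X) = X/p - (1-X)/(1-p)
# hess  = d^2/dp^2 log p_p(X) = -X/p^2 - (1-X)/(1-p)^2
x = rng.binomial(1, p, size=1_000_000)
score = x / p - (1 - x) / (1 - p)
hess = -x / p**2 - (1 - x) / (1 - p)**2
print(score.var(), -hess.mean(), 1 / (p * (1 - p)))   # all close to 4.76

# Asymptotic variance: the MLE of p is the sample mean, and its variance
# should be close to 1 / (n * I(p)) = p(1 - p) / n.
p_hat = rng.binomial(n, p, size=reps) / n
print(p_hat.var(), p * (1 - p) / n)                   # both close to 0.00042
```

The larger the Fisher information, the smaller 1 / (n * I(p)) is, which is exactly the "large is good" point made at the end of the lecture.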