https://www.youtube.com/watch?v=0Va2dOLqUfM&list=PLUl4u3cNGP60uVBMaoNERc6knT_MgPKS0&index=5 Transcript 00:00 00:00 The following content is provided under a Creative 00:02 Commons license. 00:03 Your support will help MIT OpenCourseWare 00:06 continue to offer high quality educational resources for free. 00:10 To make a donation or to view additional materials 00:12 from hundreds of MIT courses, visit MIT OpenCourseWare 00:16 at ocw.mit.edu. 00:17 00:22 PROFESSOR: So I'm using a few things here, right? 00:24 I'm using the fact that KL is non-negative. 00:27 But KL is equal to 0 when I take twice the same argument. 00:31 So I know that this function is always non-negative. 00:34 00:38 So that's theta and that's KL P theta star P theta. 00:45 And I know that at theta star, it's equal to 0. 00:51 OK? 00:52 I could be in the case where I have this happening. 00:56 I have two-- let's call it theta star prime. 01:01 I have two minimizers. 01:04 That could be the case, right? 01:05 I'm not saying that-- so K of L-- 01:07 KL is 0 at the minimum. 01:11 That doesn't mean that I have a unique minimum, right? 01:14 But it does, actually. 01:16 What do I need to use to make sure 01:17 that I have only one minimum? 01:18 01:22 So the definiteness is guaranteeing to me 01:24 that there's a unique P theta star that minimizes it. 01:28 But then I need to make sure that there's 01:30 a unique-- from this unique P theta star, 01:33 I need to make sure there's a unique theta star that 01:35 defines this P theta star. 01:36 01:39 Exactly. 01:40 All right, so I combine definiteness 01:43 and identifiability to make sure that there is a unique 01:47 minimizer, so that this case cannot exist. 01:50 OK, so basically, let me write what I just said. 01:55 So definiteness, that implies that P theta star 02:06 is the unique minimizer of P theta maps to KL P theta star P 02:23 theta. 02:23 So definiteness only guarantees that the probability 02:26 distribution is uniquely identified. 02:29 And identifiability implies that theta star 02:42 is the unique minimizer of theta maps to KL P theta star P 02:56 theta, OK? 03:00 So I'm basically doing the composition 03:02 of two injective functions. 03:04 The first one is the one that maps, say, theta to P theta. 03:07 And the second one is the one that maps P theta 03:11 to the set of minimizers, OK? 03:14 03:20 So at least morally, you should agree that theta star 03:27 is the minimizer of this thing. 03:28 Whether it's unique or not, you should 03:30 agree that it's a good one. 03:33 So maybe you can think a little longer on this. 03:36 So thinking about this being the minimizer, 03:38 then it says, well, if I actually 03:40 had a good estimate for this function, 03:42 I would use the strategy that I described 03:44 for the total variation, which is, 03:45 well, I don't know what this function looks like. 03:48 It depends on theta star. 03:49 But maybe I can find an estimator 03:51 of this function that fluctuates around this function, 03:55 and such that when I minimize this estimator of the function, 03:58 I'm actually not too far, OK? 04:01 And this is exactly what drives me to do this, 04:04 because I can actually construct an estimator. 04:07 I can actually construct an estimator such 04:09 that this estimator is actually-- 04:12 of the KL is actually close to the KL, all right? 04:15 So I define KL hat. 04:18 So all we did is just replacing expectation with respect 04:22 to theta star by averages. 04:27 04:30 That's what we did.
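As a quick sketch of the KL-hat idea in code (not from the lecture; it assumes numpy, a made-up Gaussian location model N(theta, 1), and simulated data): replacing the expectation under P theta star with the sample average gives, up to an unknown constant, the negative average log-likelihood, and minimizing it over a grid lands close to theta star.

```python
import numpy as np

rng = np.random.default_rng(0)
theta_star = 2.0
x = rng.normal(theta_star, 1.0, size=1000)   # i.i.d. sample from P_theta* = N(theta*, 1)

def log_f(theta, x):
    """Log density of N(theta, 1) evaluated at the data points."""
    return -0.5 * np.log(2 * np.pi) - 0.5 * (x - theta) ** 2

def kl_hat_up_to_constant(theta, x):
    """Estimate of KL(P_theta* || P_theta) minus the unknown constant:
    just the negative empirical average log-likelihood."""
    return -np.mean(log_f(theta, x))

grid = np.linspace(-1, 5, 601)
estimates = np.array([kl_hat_up_to_constant(t, x) for t in grid])
print("minimizer of KL-hat:", grid[estimates.argmin()])   # close to theta* = 2
```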
04:33 So if you're a little puzzled by this error, that's all it says. 04:37 Replace this guy by this guy. 04:39 It has no mathematical meaning. 04:41 It just means just replace it by. 04:42 And now that actually tells me how to get my estimator. 04:46 It just says, well, my estimator, KL hat, 04:51 is equal to some constant which I don't know. 04:54 I mean, it certainly depends on theta star, 04:56 but I won't care about it when I'm trying to minimize-- 04:59 minus 1/n sum from i from 1 to n log f theta of x. 05:09 So here I'm reading it with the density. 05:11 You have it with the PMF on the slides, 05:13 and so you have the two versions in front of you, OK? 05:18 Oh sorry, I forgot the xi. 05:22 Now clearly, this function I know how to compute. 05:25 If you give me a theta, since I know the form of the density 05:30 f theta, for each theta that you give me, 05:33 I can actually compute this quantity, right? 05:38 This I don't know, but I don't care. 05:40 Because I'm just shifting the value of the function 05:42 I'm trying to minimize. 05:43 The set of minimizers is not going to change. 05:46 So now, this is my estimation strategy. 05:50 Minimize in theta KL hat P theta star P theta, OK? 06:01 So now let's just make sure that we all agree that-- 06:05 so what we want is the argument of the minimum, 06:07 right? arg min means the theta that minimizes this guy, 06:10 rather than finding the value of the min. 06:13 OK, so I'm trying to find the arg min of this thing. 06:15 Well, this is equivalent to finding the arg 06:18 min of, say, a constant minus 1/n sum from i from 1 to n 06:28 of log f theta of xi. 06:31 06:33 So that's just-- 06:34 06:38 I don't think it likes me. 06:41 No. 06:42 OK, so thus minimizing this average, right? 06:46 I just plugged in the definition of KL hat. 06:48 Now, I claim that taking the arg min 06:50 of a constant plus a function or the arg min of the function 06:53 is the same thing. 06:55 Is anybody not comfortable with this idea? 07:00 OK, so this is the same. 07:03 07:13 By the way, this I should probably 07:15 switch to the next slide, because I'm writing 07:18 the same thing, but better. 07:22 And it's with PMF rather than as PF. 07:29 OK, now, arg min of the minimum is the same of arg max-- 07:34 sorry, arg min of the negative thing 07:35 is the same as arg max without the negative, right? 07:37 07:40 arg max over theta of 1/n from i equal equal 1 to n log f 07:49 theta of xi. 07:49 07:53 Taking the arg min of the average 07:54 or the arg min of the sum, again, it's 07:56 not going to make much difference. 07:59 Just adding constants OR multiplying by constants 08:01 does not change the arg min or the arg max. 08:04 Now, I have the sum of logs, which 08:07 is the log of the product. 08:08 08:23 OK? 08:24 It's the arg max of the log of f theta of x1 times 08:27 f theta of x2, f theta of xn. 08:30 But the log is a function that's increasing, so maximizing 08:37 log of a function or maximizing the function itself 08:40 is the same thing. 08:42 The value is going to change, but the arg max 08:45 is not going to change. 08:46 Everybody agrees with this? 08:47 08:50 So this is equivalent to arg max over theta of pi from 1 to n 08:59 of f theta xi. 09:02 And that's because x maps to log x is increasing. 09:10 09:13 So now I've gone from minimizing the KL 09:17 to minimizing the estimate of the KL 09:19 to maximizing this product. 09:23 Well, this chapter is called maximum likelihood estimation. 
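The chain of equivalences (drop the unknown constant, flip the minus sign, drop the 1/n, turn the sum of logs into a product) can be checked numerically; a small sketch under the same assumed Gaussian location model as above:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(2.0, 1.0, size=50)        # assumed sample from N(theta*, 1), theta* = 2
grid = np.linspace(0.0, 4.0, 401)

# log f_theta(x_i) for every theta on the grid (rows) and every observation (columns)
log_f = -0.5 * np.log(2 * np.pi) - 0.5 * (grid[:, None] - x[None, :]) ** 2

const = 3.14                              # stand-in for the unknown constant
kl_hat = const - log_f.mean(axis=1)       # "constant minus average log-likelihood"
avg_loglik = log_f.mean(axis=1)
sum_loglik = log_f.sum(axis=1)
product_lik = np.exp(sum_loglik)          # product of the densities

i = kl_hat.argmin()
assert i == avg_loglik.argmax() == sum_loglik.argmax() == product_lik.argmax()
print("common arg min / arg max:", grid[i])
```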
09:27 The maximum comes from the fact that our original idea 09:30 was to minimize the negative of a function. 09:32 So that's why it's maximum likelihood. 09:34 And this function here is called the likelihood. 09:42 This function is really just telling me-- 09:45 they call it likelihood because it's 09:47 some measure of how likely it is that theta 09:49 was the parameter that generated the data. 09:52 OK, so let's go to the-- 09:55 well, we'll go to the formal definition in a second. 09:57 But actually, let me just give you 09:59 intuition as to why this is the distribution of the data. 10:05 Why this is the likelihood-- sorry. 10:07 Why is this making sense as a measure of likelihood? 10:11 Let's now think for simplicity of the following model. 10:14 So I have-- 10:15 I'm on the real line and I look at n, say, 10:19 theta 1 for theta in the real-- do you see that? 10:25 OK. 10:26 Probably you don't. 10:27 Not that you care. 10:28 OK, so-- 10:29 10:41 OK, let's look at a simple example. 10:42 10:45 So here's the model. 10:48 As I said, we're looking at observations on the real line. 10:52 And they're distributed according to some n theta 1. 10:57 So I don't care about the variance. 10:58 I know it's 1. 10:59 And it's indexed by theta in the real line. 11:03 OK, so this is-- the only thing I need to figure out 11:05 is, what is the mean of those guys, OK? 11:09 Now, I have this n observations. 11:11 And if you actually remember from your probability class, 11:15 are you familiar with the concept of joint density? 11:18 You have multivariate observations. 11:20 The joint density of independent random variables 11:23 is just a product of their individual densities. 11:26 So really, when I look at the product from i 11:30 equal 1 to n of f theta of xi, this 11:34 is really the joint density of the vector-- 11:44 11:48 well, let me not use the word vector-- 11:51 of x1 xn, OK? 11:55 So if I take the product of density, is it still a density? 11:58 And it's actually-- but this time on the r to the n. 12:04 And so now what this thing is telling me-- so 12:06 think of it in r2, right? 12:07 So this is the joint density of two Gaussians. 12:10 So it's something that looks like some bell-shaped curve 12:14 in two dimensions. 12:15 And it's centered at the value theta theta. 12:20 OK, they both have the mean theta. 12:22 So let's assume for one second-- it's 12:24 going to be hard for me to make pictures in n dimensions. 12:28 Actually, already in two dimensions, 12:29 I can promise you that it's not very easy. 12:31 So I'm actually just going to assume 12:34 that n is equal to 1 for the sake of illustration. 12:37 OK, so now I have this data. 12:40 And now I have one observation, OK? 12:44 And I know that the f theta looks like this. 12:47 And what I'm doing is I'm actually 12:48 looking at the value of x theta as my observation. 12:51 12:54 Let's call it x1. 12:57 Now, my principal tells me, just find the theta that 13:00 makes this guy the most likely. 13:03 What is the likelihood of my x1? 13:05 Well, it's just the value of the function. 13:07 That this value here. 13:09 And if I wanted to find the most likely theta that had generated 13:13 this x1, what I would need to do is to shift this thing 13:16 and put it here. 13:19 And so my estimate, my maximum likelihood estimator 13:21 here would be theta is equal to x1, OK? 13:28 That would be just the observation. 13:30 Because if I have only one observation, 13:32 what else am I going to do? 13:33 OK, and so it sort of makes sense. 
13:34 And if you have more observations, 13:36 you can think of it this way, as if you had more observations. 13:40 So now I have, say, K observations, 13:42 or n observations. 13:44 And what I do is that I look at the value for each 13:46 of these guys. 13:48 So this value, this value, this value, this value. 13:52 I take their product and I make this thing large. 13:55 OK, why do I take the product? 13:57 Well, because I'm trying to maximize their value 14:00 all together, and I need to just turn it into one number 14:02 that I can maximize. 14:04 And taking the product is the natural way 14:06 of doing it, either by motivating it 14:08 by the KL principle or motivating it 14:11 by maximizing the joint density, rather than just maximizing 14:14 anything. 14:15 OK, so that's why, visually, this is the maximum likelihood. 14:20 It just says that if my observations are here, 14:24 then this guy, this mean theta, is more likely than this guy. 14:29 Because now if I look at the value 14:31 of the function for this guy-- if I 14:33 look at theta being this thing, then this 14:35 is a very small value. 14:37 Very small value, very small value, very small value. 14:39 Everything gets a super small value, right? 14:41 That's just the value that it gets in the tail 14:43 here, which is very close to 0. 14:45 But as soon as I start covering all my points 14:47 with my bell-shaped curve, then all the values go up. 14:53 All right, so I just want to make a short break 14:58 into statistics, and just make sure 15:00 that the maximum likelihood principle involves 15:04 maximizing a function. 15:05 So I just want to make sure that we're 15:07 all on par about how do we maximize functions. 15:11 In most instances, it's going to be a one-dimensional function, 15:13 because theta is going to be a one-dimensional parameter. 15:16 Like here it's the real line. 15:18 So it's going to be easy. 15:20 In some cases, it may be a multivariate function 15:22 and it might be more complicated. 15:24 OK, so let's just make this interlude. 15:26 So the first thing I want you to notice 15:28 is that if you open any book on what's called optimization, 15:31 which basically is the science behind optimizing functions, 15:35 you will talk mostly-- 15:36 I mean, I'd say 99.9% of the cases 15:40 will talk about minimizing functions. 15:42 But it doesn't matter, because you can just flip the function 15:44 and you just put a minus sign, and minimizing h 15:47 is the same as maximizing minus h and the opposite, OK? 15:51 So for this class, since we're only 15:53 going to talk about maximum likelihood estimation, 15:55 we will talk about maximizing functions. 15:57 But don't be lost if you decide suddenly 15:59 to open a book on optimization and find only something 16:01 about minimizing functions. 16:03 OK, so maximizing an arbitrary function can actually be fairly 16:08 difficult. If I give you a function that has this weird 16:10 shape, right-- let's think of this polynomial for example-- 16:13 and I wanted to find the maximum, how would we do it? 16:17 16:20 So what is the thing you've learned in calculus on how 16:23 to maximize the function? 16:26 Set the derivative equal to 0. 16:27 Maybe you want to check the second derivative 16:29 to make sure it's a maximum and not a minimum. 16:31 But the thing is, this is only guaranteeing to you that you 16:34 have a local one, right? 
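A tiny numerical illustration of this picture, with assumed values: with one observation the Gaussian likelihood theta maps to f theta of x1 peaks exactly at x1, and with several observations the joint density (the product of the marginals) is large when the bell is centered over the points and essentially zero when they all sit in the tail.

```python
import numpy as np

def joint_density(theta, x):
    """Joint density of i.i.d. N(theta, 1) observations: product of the marginals."""
    return np.prod(np.exp(-0.5 * (x - theta) ** 2) / np.sqrt(2 * np.pi))

# one observation: theta -> f_theta(x1) is maximized at theta = x1
x1 = np.array([1.7])
grid = np.linspace(-3, 5, 801)
vals = [joint_density(t, x1) for t in grid]
print(grid[int(np.argmax(vals))])        # ~1.7

# several observations: centering the bell over the data beats leaving them in the tail
x = np.array([0.8, 1.2, 1.9, 2.3, 2.8])
print(joint_density(np.mean(x), x))      # comparatively large
print(joint_density(-4.0, x))            # every point is in the tail: essentially 0
```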
16:35 So if I do it for this function, for example, then this guy 16:38 is going to satisfy this criterion, 16:39 this guy is going to satisfy this criterion, 16:41 this guy is going to satisfy this criterion, this guy here, 16:43 and this guy satisfies the criterion, but not 16:45 the second derivative one. 16:46 So I have a lot of candidates. 16:50 And if my function can be really anything, 16:52 it's going to be difficult, whether it's 16:54 analytically by taking derivatives and setting them 16:56 to 0, or trying to find some algorithms to do this. 17:00 Because if my function is very jittery, 17:02 then my algorithm basically has to check all candidates. 17:05 And if there's a lot of them, it might take forever, OK? 17:08 So this is-- I have only one, two, three, four, 17:11 five candidates to check. 17:13 But in practice, you might have a million of them to check. 17:15 And that might take forever. 17:17 OK, so what's nice about statistical models, and one 17:21 of the things that makes all these models particularly 17:24 robust, and that we still talk about them 100 17:27 years after they've been introduced 17:29 is that the functions that-- the likelihoods 17:31 that they lead for us to maximize 17:33 are actually very simple. 17:34 And they all share a nice property, 17:37 which is that of being concave. 17:40 All right, so what is a concave function? 17:42 Well, by definition, it's just a function for which-- 17:44 let's think of it as being twice differentiable. 17:47 You can define functions that are not 17:49 differentiable as being concave, but let's think about it 17:51 as having a second derivative. 17:53 And so if you look at the function that 17:54 has a second derivative, concave are the functions 17:57 that have their second derivative that's 17:59 negative everywhere. 18:02 Not just at the maximum, everywhere, OK? 18:06 And so if it's strictly concave, this second derivative 18:09 is actually strictly less than zero. 18:12 And particularly if I think of a linear function, 18:16 y is equal to x, then this function 18:19 has its second derivative which is equal to zero, OK? 18:24 So it is concave. 18:26 But it's not strictly concave, OK? 18:28 If I look at the function which is negative x squared, 18:31 what is its second derivative? 18:33 18:35 Minus 2. 18:36 So it's strictly negative everywhere, OK? 18:39 So actually, this is a pretty canonical example 18:43 strictly concave function. 18:44 If you want to think of a picture of a strictly concave 18:46 function, think of negative x squared. 18:48 So parabola pointing downwards. 18:52 OK, so we can talk about strictly convex functions. 18:56 So convex is just happening when the negative of the function 18:59 is concave. 19:00 So that translates into having a second derivative which 19:03 is either non-negative or positive, depending 19:05 on whether you're talking about convexity or strict convexity. 19:09 But again, those convex functions 19:11 are convenient when you're trying to minimize something. 19:14 And since we're trying to maximize the function, 19:16 we're looking for concave. 19:18 So here are some examples. 19:21 Let's just go through them quickly. 19:23 19:39 OK, so the first one is-- 19:41 so here I made my life a little uneasy 19:46 by talking about the functions in theta, right? 19:49 I'm talking about likelihoods, right? 19:51 So I'm thinking of functions where the parameter is theta. 19:54 So I have h of theta. 
19:56 And so if I start with theta squared, 19:59 negative theta squared, then as we said, 20:02 h prime prime of theta, the second derivative is minus 2, 20:09 which is strictly negative, so this function is strictly 20:11 concave. 20:12 20:19 OK, another function is h of theta, which is-- 20:24 what did we pick-- 20:25 square root of theta. 20:28 What is the first derivative? 20:30 20:35 1/2 square root of theta. 20:39 What is the second derivative? 20:41 20:48 So that's theta to the negative 1/2. 20:51 So I'm just picking up another negative 1/2, 20:53 so I get negative 1/4. 20:56 And then I get theta to the 3/4 downstairs, OK? 21:02 Sorry, 3/2. 21:03 21:09 And that's strictly negative for theta, say, larger than 0. 21:16 And I really need to have this thing larger than 0 21:20 so that it's well-defined. 21:21 But strictly larger than 0 is so that this thing does not 21:24 blow up to infinity. 21:25 And it's true. 21:26 If you think about this function, it looks like this. 21:30 And already, the first derivative to infinity at 0. 21:34 And it's a concave function, OK? 21:37 Another one is the log, of course. 21:39 21:44 What is the derivative of the log? 21:47 That's 1 over theta, where h prime of theta is 1 over theta. 21:52 And the second derivative negative 1 over theta squared, 22:01 which again, is negative if theta is strictly positive. 22:06 Here I define it as-- 22:07 I don't need to define it to be strictly positive here, 22:10 but I need it for the log. 22:13 And sine. 22:16 OK, so let's just do one more. 22:18 So h of theta is sine of theta. 22:22 But here I take it only on an interval, 22:24 because you want to think of this function 22:27 as pointing always downwards. 22:29 And in particular, you don't want this function 22:31 to have an inflection point. 22:32 You don't want it to go down and then up 22:34 and then down and then up, because this is not concave. 22:37 And so sine is certainly going up and down, right? 22:39 So what we do is we restrict it to an interval where sine 22:43 is actually-- so what does the sine function looks 22:45 like at 0, 0? 22:47 And it's going up. 22:48 Where is the first maximum of the sine? 22:53 STUDENT: [INAUDIBLE] 22:54 PROFESSOR: I'm sorry. 22:55 STUDENT: Pi over 2. 22:56 PROFESSOR: Pi over 2, where it takes value 1. 22:59 And then it goes down again. 23:01 And then that's at pi. 23:04 And then I go down again. 23:05 And here you see I actually start changing my inflection. 23:08 So what we do is we stop it at pi. 23:10 And we look at this function, it certainly 23:12 looks like a parabola pointing downwards. 23:14 And so if you look at the-- you can check that it actually 23:16 works with the derivatives. 23:17 So the derivative of sine is cosine. 23:22 23:25 And the derivative of cosine is negative sine. 23:31 23:34 OK, and this thing between 0 and pi is actually positive. 23:38 So this entire thing is going to be negative. 23:40 OK? 23:41 And you know, I can come up with a lot of examples, 23:45 but let's just stick to those. 23:46 There's a linear function, of course. 23:48 And the find function is going to be concave, 23:51 but it's actually going to be convex as well, which 23:53 means that it's certainly not going to be 23:55 strictly concave or convex, OK? 23:58 So here's your standard picture. 
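These second-derivative computations can be double-checked symbolically; a short sketch assuming sympy is available (the linear example at the end is the one whose second derivative is identically zero, so it is concave but not strictly concave):

```python
import sympy as sp

theta = sp.symbols('theta', positive=True)

examples = {
    '-theta**2':   -theta**2,        # h'' = -2                  -> strictly concave
    'sqrt(theta)': sp.sqrt(theta),   # h'' = -1/(4 theta^(3/2))  -> strictly concave for theta > 0
    'log(theta)':  sp.log(theta),    # h'' = -1/theta^2          -> strictly concave for theta > 0
    'sin(theta)':  sp.sin(theta),    # h'' = -sin(theta)         -> negative on (0, pi)
    '2*theta - 3': 2*theta - 3,      # h'' = 0                   -> concave and convex, not strictly
}

for name, h in examples.items():
    print(name, '->', sp.simplify(sp.diff(h, theta, 2)))
```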
24:01 And here, if you look at the dotted line, what 24:04 it tells me is that a concave function, 24:07 and the property we're going to be using 24:08 is that if a strictly concave function has a maximum, which 24:12 is not always the case, but if it has a maximum, 24:15 then it actually must be-- sorry, a local maximum, 24:18 it must be a global maximum. 24:21 OK, so just the fact that it goes up and down and not 24:23 again means that there's only global maximum that can exist. 24:28 Now if you looked, for example, at the square root function, 24:32 look at the entire positive real line, 24:34 then this thing is never going to attain a maximum. 24:36 It's just going to infinity as x goes to infinity. 24:39 So if I wanted to find the maximum, 24:40 I would have to stop somewhere and say 24:42 that the maximum is attained at the right-hand side. 24:46 OK, so that's the beauty about convex functions or concave 24:49 functions, is that essentially, these functions 24:53 are easy to maximize. 24:55 And if I tell you a function is concave, 24:57 you take the first derivative, set it equal to 0. 25:00 If you find a point that satisfies this, 25:01 then it must be a global maximum, OK? 25:07 STUDENT: What if your set theta was 25:09 [INAUDIBLE] then couldn't you have a function that, 25:13 by the definition, is concave, with two upside down parabolas 25:17 at two disjoint intervals, but yet it has two global maximums? 25:22 25:26 PROFESSOR: So you won't get them-- 25:28 so you want the function to be concave on what? 25:31 On the convex cell of the intervals? 25:34 Or you want it to be-- 25:35 STUDENT: [INAUDIBLE] just said that any subset. 25:38 PROFESSOR: OK, OK. 25:40 You're right. 25:40 So maybe the definition-- so you're 25:42 pointing to a weakness in the definition. 25:45 Let's just say that theta is a convex set 25:49 and then you're good, OK? 25:50 So you're right. 25:51 25:54 Since I actually just said that this is true only for theta, 25:56 I can just take pieces of concave functions, right? 25:59 I can do this, and then the next one 26:00 I can do this, on the next one I can do this. 26:03 And then I would have a bunch of them. 26:05 But what I want is think of it as a global function 26:10 on some convex set. 26:11 You're right. 26:13 So think of theta as being convex 26:14 for this guy, an interval, if it's a real line. 26:17 26:20 OK, so as I said, for more generally-- so 26:25 we can actually define concave functions more generally 26:27 in higher dimensions. 26:29 And that will be useful if theta is not just 26:32 one parameter but several parameters. 26:34 And for that, you need to remind yourself of Calculus II, 26:39 and you have generalization of the notion of derivative, which 26:42 is called a gradient, which is basically a vector where 26:46 each coordinate is just the partial derivative with respect 26:49 to each coordinate of theta. 26:51 And the Hessian is the matrix, which 26:54 is essentially a generalization of the second derivative. 26:58 I denote it by nabla squared, but you 27:01 can write it the way you want. 27:02 And so this matrix here is taking as entry 27:07 the second partial derivatives of h with respect 27:10 to theta i and theta j. 27:12 And so that's the ij-th entry. 27:15 Who has never seen that? 27:16 27:19 OK. 27:20 So now, being concave here is essentially generalizing, 27:27 saying that a vector is equal to zero. 27:28 Well, that's just setting the vector-- sorry. 
27:31 The first order condition to say that it's a maximum 27:33 is going to be the same. 27:34 Saying that a function has a gradient equal to zero 27:38 is the same as saying that each of its coordinates 27:43 are equal to zero. 27:44 And that's actually going to be a condition 27:46 for a global maximum here. 27:48 So to check convexity, we need to see that a matrix itself 27:52 is negative. 27:53 Sorry, to check concavity, we need 27:55 to check that a matrix is negative. 27:57 And there is a notion among matrices 27:59 that compare matrix to zero, and that's exactly this notion. 28:03 You pre- and post-multiply by the same x. 28:06 So that works for symmetric matrices, 28:08 which is the case here. 28:10 And so you pre-multiply by x, post-multiply by the same x. 28:13 So you have your matrix, your Hessian here. 28:15 28:20 It's a d by d matrix if you have a d-dimensional matrix. 28:24 So let's call it-- 28:26 OK. 28:27 And then here I pre-multiply by x transpose. 28:31 I post-multiply by x. 28:34 And this has to be non-positive if I want it to be concave, 28:38 and strictly negative if I want it to be strictly concave. 28:42 OK, that's just a real generalization. 28:44 You can check for yourself that this is the same thing. 28:47 If I were in dimension 1, this would be the same thing. 28:49 Why? 28:50 Because in dimension 1, pre- and post-multiplying by x 28:53 is the same as multiplying by x squared. 28:55 Because in dimension 1, I can just move my x's around, right? 28:58 And so that would just mean the first condition 29:01 would mean in dimension 1 that the second derivative times x 29:04 squared has to be less than or equal to zero. 29:11 So here I need this for all x's that are not zero, 29:14 because I can take x to be zero and make this equal to zero, 29:16 right? 29:17 So this is for x's that are not equal to zero, OK? 29:21 And so some examples. 29:25 Just look at this function. 29:27 So now I have functions that depend on two parameters, 29:29 theta1 and theta2. 29:31 So the first one is-- 29:33 so if I take theta to be equal to-- 29:36 now I need two parameters, r squared. 29:39 And I look at the function, which is h of theta. 29:42 Can somebody tell me what h of theta is? 29:45 STUDENT: [INAUDIBLE] 29:49 PROFESSOR: Minus 2 theta2 squared? 29:52 OK, so let's compute the gradient of h of theta. 30:00 So it's going to be something that has two coordinates. 30:04 To get the first coordinate, what do I do? 30:06 Well, I take the derivative with respect 30:07 to theta1, thinking of theta2 as being a constant. 30:10 So this thing is going to go away. 30:11 And so I get negative 2 theta1. 30:14 And when I take the derivative with respect 30:15 to the second part, thinking of this part as being constant, 30:18 I get minus 4 theta2. 30:21 30:24 That clear for everyone? 30:26 That's just the definition of partial derivatives. 30:29 30:32 And then if I want to do the Hessian, 30:40 so now I'm going to get a 2 by 2 matrix. 30:42 30:45 The first guy here, I take the first-- so this guy 30:48 I get by taking the derivative of this guy with respect 30:51 to theta1. 30:52 So that's easy. 30:53 So that's just minus 2. 30:55 This guy I get by taking derivative 30:56 of this guy with respect to theta2. 30:58 So I get what? 31:00 Zero. 31:00 I treat this guy as being a constant. 31:03 This guy is also going to be zero, 31:04 because I take the derivative of this guy with respect 31:06 to theta1. 31:08 And then I take the derivative of this guy with respect 31:10 to theta2, so I get minus 4. 
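For this example the gradient is (-2 theta1, -4 theta2) and the Hessian is the constant diagonal matrix diag(-2, -4); a small numerical check of strict concavity (numpy assumed), previewing the quadratic-form test x transpose Hessian x < 0 that comes next:

```python
import numpy as np

# h(theta) = -theta1^2 - 2*theta2^2, as in the example
def grad_h(theta):
    t1, t2 = theta
    return np.array([-2.0 * t1, -4.0 * t2])

hessian = np.array([[-2.0, 0.0],
                    [0.0, -4.0]])           # constant, since h is quadratic

# negative definiteness: x^T (Hessian) x < 0 for every x != 0
rng = np.random.default_rng(0)
for _ in range(5):
    x = rng.normal(size=2)
    print(x @ hessian @ x)                   # equals -2*x1^2 - 4*x2^2 < 0

# equivalently, all eigenvalues of the symmetric Hessian are negative
print(np.linalg.eigvalsh(hessian))           # [-4., -2.]

# and the gradient vanishes only at the maximizer (0, 0)
print(grad_h(np.array([0.0, 0.0])))          # [0., 0.]
```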
31:14 OK, so now I want to check that this matrix satisfies 31:19 x transpose-- 31:21 this matrix x is negative. 31:24 So what I do is-- 31:25 so what is x transpose x? 31:27 So if I do x transpose delta squared h theta x, what I get 31:33 is minus 2 x1 squared minus 4 x2 squared. 31:42 Because this matrix is diagonal, so all it does is just weights 31:45 the square of the x's. 31:47 So this guy is definitely negative. 31:51 This guy is negative. 31:53 And actually, if one of the two is non-zero, 31:56 which means that x is non-zero, then this thing 31:58 is actually strictly negative. 32:00 So this function is actually strictly concave. 32:02 32:05 And it looks like a parabola that's slightly 32:07 distorted in one direction. 32:09 32:15 So well, I know this might have been some time ago. 32:21 Maybe for some of you might have been since high school. 32:23 So just remind yourself of doing second derivatives and Hessians 32:27 and things like this. 32:29 Here's another one as an exercise. 32:32 h is minus theta1 minus theta2 squared. 32:36 So this one is going to actually not be diagonal. 32:44 The Hessian is not going to be diagonal. 32:46 Who would like to do this now in class? 32:50 OK, thank you. 32:51 This is not a calculus class. 32:53 So you can just do it as a calculus exercise. 32:56 And you can do it for log as well. 32:58 Now, there is a nice recipe for concavity 33:01 that works for the second one and the third one. 33:05 And the thing is, if you look at those particular functions, 33:07 what I'm doing is taking, first of all, a linear combination 33:11 of my arguments. 33:13 And then I take a concave function of this guy. 33:15 And this is always going to work. 33:18 This is always going to give me a complete function. 33:20 So the computations that I just made, 33:22 I actually never made them when I prepared those 33:24 slides because I don't have to. 33:26 I know that if I take a linear combination of those things 33:28 and then I take a concave function of this guy, 33:30 I'm always going to get a concave function. 33:33 OK, so that's an easy way to check this, or at least as 33:39 a sanity check. 33:42 All right, and so as I said, finding maximizers of concave 33:48 or strictly concave function is the same 33:50 as it was in the one-dimensional case. 33:52 What I do-- sorry, in the one-dimensional case, 33:55 we just agreed that we just take the derivative 33:57 and set it to zero. 33:58 In the high dimensional case, we take the gradient 34:00 and set it equal to zero. 34:01 Again, that's calculus, all right? 34:04 So it turns out that so this is going 34:07 to give me equations, right? 34:09 The first one is an equation in theta. 34:11 The second one is an equation in theta1, theta2, theta3, 34:15 all the way to theta d. 34:16 And it doesn't mean that because I can write this equation 34:19 that I can actually solve it. 34:21 This equation might be super nasty. 34:23 It might be like some polynomial and exponentials and logs equal 34:28 zero, or some crazy thing. 34:31 And so there's actually, for a concave function, 34:36 since we know there's a unique maximizer, 34:38 there's this theory of convex optimization, which really, 34:42 since those books are talking about minimizing, 34:44 you had to find some sort of direction. 34:46 But you can think of it as the theory of concave maximization. 34:50 And they allow you to find algorithms to solve 34:54 this numerically and fairly efficiently. 34:57 OK, that means fast. 
34:58 Even if d is of size 10,000, you're 35:01 going to wait for one second and it's 35:02 going to tell you what the maximum is. 35:05 And that's what machine learning is about. 35:06 If you've taken any class on machine learning, 35:08 there's a lot of optimization, because they have really, 35:11 really big problems to solve. 35:13 Often in this class, since this is 35:15 more introductory statistics, we will have a close form. 35:19 For the maximum likelihood estimator 35:21 will be saying theta hat equals, and say x bar, 35:25 and that will be the maximum likelihood estimator. 35:28 So just why-- so has anybody seen convex optimization 35:34 before? 35:36 So let me just give you an intuition 35:38 why those functions are easy to maximize or to minimize. 35:43 In one dimension, it's actually very easy for you to see that. 35:46 35:50 And the reason is this. 35:52 If I want to maximize the concave function, what 35:57 I need to do is to be able to query a point 35:59 and get as an answer the derivative of this function, 36:04 OK? 36:04 So now I said this is the function I want to optimize, 36:07 and I've been running my algorithm for 5/10 of a second. 36:13 And it's at this point here. 36:15 OK, that's the candidate. 36:17 Now, what I can ask is, what is the derivative 36:19 of my function here? 36:21 Well, it's going to give me a value. 36:22 And this value is going to be either negative, positive, 36:26 or zero. 36:27 Well, if it's zero, that's great. 36:28 That means I'm here and I can just go home. 36:30 I've solved my problem. 36:31 I know there's a unique maximum, and that's 36:33 what I wanted to find. 36:34 If it's positive, it actually tells me 36:37 that I'm on the left of the optimizer. 36:41 And on the left of the optimal value. 36:43 And if it's negative, it means that I'm 36:47 at the right of the value I'm looking for. 36:50 And so most of the convex optimization methods 36:53 basically tell you, well, if you query the derivative 36:56 and it's actually positive, move to the right. 37:00 And if it's negative, move to the left. 37:02 Now, by how much you move is basically, well, 37:07 why people write books. 37:09 And in higher dimension, it's a little more complicated, 37:13 because in higher dimension, thinks about two dimensions, 37:16 then I'm only being able to get in a vector. 37:21 And the vector is only telling me, well, here 37:24 is half of the space in which you can move. 37:26 Now here, if you tell me move to the right, 37:28 I know exactly which direction I'm going to have to move. 37:30 But in two dimension, you're going 37:32 to basically tell me, well, move in this global direction. 37:37 And so, of course, I know there's a line on the floor I 37:40 cannot move behind. 37:42 But even if you tell me, draw a line on the floor 37:45 and move only to that side of the line, 37:47 then there's many directions in that line that I can go to. 37:50 And that's also why there's lots of things 37:53 you can do in optimization. 37:55 OK, but still, putting this line on the floor is telling me, 38:00 do not go backwards. 38:02 And that's very important. 38:03 It's just telling you which direction 38:04 I should be going always, OK? 38:07 All right, so that's what's behind this notion 38:11 of gradient descent algorithm, steepest descent. 38:14 Or steepest descent, actually, if we're trying to maximize. 38:17 OK, so let's move on. 38:22 So this course is not about optimization, all right? 38:26 So as I said, the likelihood was this guy. 
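The "query the derivative, move right if it is positive, left if it is negative" rule is one-dimensional gradient ascent; a minimal sketch on an assumed toy concave function h(theta) = -(theta - 3)^2 with a fixed step size (how far to move each time is exactly the part the optimization books are about):

```python
def h_prime(theta):
    # derivative of h(theta) = -(theta - 3)^2
    return -2.0 * (theta - 3.0)

theta = 0.0          # arbitrary starting point
step = 0.1           # fixed step size for this sketch
for _ in range(200):
    d = h_prime(theta)
    if abs(d) < 1e-12:          # derivative ~ 0: at the unique global maximum
        break
    theta += step * d           # positive derivative -> move right, negative -> move left
print(theta)                    # ~ 3.0
```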
38:30 The product of f of the xi's. 38:32 And one way you can do this is just 38:33 basically the joint distribution of my data at the point theta. 38:39 So now the likelihood, formerly-- so here 38:41 I am giving myself the model e theta. 38:44 And here I'm going to assume that e is discrete 38:48 so that I can talk about PMFs. 38:49 But everything you're doing, just 38:51 redo for the sake of yourself by replacing PMFs by PDFs, 38:55 and everything's going to be fine. 38:56 We'll do it in a second. 38:58 All right, so the likelihood of the model. 39:02 So here I'm not looking at the likelihood of a parameter. 39:05 I'm looking at the likelihood of a model. 39:07 So it's actually a function of the parameter. 39:09 And actually, I'm going to make it 39:10 even a function of the points x1 to xn. 39:14 All right, so I have a function. 39:15 And what it takes as input is all the points x1 39:18 to xn and a candidate parameter theta. 39:22 Not the true one. 39:22 A candidate. 39:23 And what I'm going to do is I'm going 39:25 to look at the probability that my random variables 39:28 under this distribution, p theta, 39:29 take these exact values, px1, px2, pxn. 39:34 Now remember, if my data was independent, 39:40 then I could actually just say that the probability 39:43 of this intersection is just a product of the probabilities. 39:45 And it would look something like this. 39:48 But I can define likelihood even if I don't have 39:50 independent random variables. 39:52 But think of them as being independent, 39:54 because that's all we're going to encounter in this class, OK? 39:57 I just want you to be aware that if I had dependent variables, 40:00 I could still define the likelihood. 40:02 I would have to understand how to compute these probabilities 40:04 there to be able to compute it. 40:08 OK, so think of Bernoullis, for example. 40:11 So here is my example of a Bernoulli. 40:12 40:16 So my parameter is-- 40:18 so my model is 0,1 Bernoulli p. 40:25 p is in the interval 0,1. 40:31 The probability, just as a side remark, 40:35 I'm just going to use the fact that I can actually 40:38 write the PMF of a Bernoulli in a very concise form, right? 40:41 If I ask you what the PMF of a Bernoulli is, 40:43 you could tell me, well, the probability that x-- 40:46 so under p, the probability that x is equal to 0 is 1 minus p. 40:50 The probability under p that x is equal to 1 is equal to p. 40:57 But I can be a bit smart and say that for any X that's 41:01 either 0 or 1, the probability under p 41:04 that X is equal to little x, I can write it 41:07 in a compact form as p to the X, 1 minus p to the 1 minus x. 41:14 And you can check that this is the right form because, well, 41:17 you have to check it only for two values of X, 0 and 1. 41:20 And if you plug in 1, you only keep the p. 41:23 If you plug in 0, you only keep the 1 minus p. 41:27 And that's just a trick, OK? 41:31 I could have gone with many other ways. 41:34 Agreed? 41:35 I could have said, actually, something like-- 41:39 another one would be-- which we are not going to use, 41:41 but we could say, well, it's xp plus and minus x 1 minus 41:47 p, right? 41:47 41:50 That's another one. 41:53 But this one is going to be convenient. 41:56 So forget about this guy for a second. 41:57 42:02 So now, I said that the likelihood is just 42:05 this function that's computing the probability that X1 42:12 is equal to little x1. 42:15 So likelihood is L of X1, Xn. 
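A two-line check of the compact form of the Bernoulli PMF, including the alternative form x p + (1 - x)(1 - p) mentioned in passing (assumed values):

```python
def bernoulli_pmf(x, p):
    """P_p(X = x) for x in {0, 1}, written in the compact form p^x (1-p)^(1-x)."""
    return p ** x * (1 - p) ** (1 - x)

p = 0.3
print(bernoulli_pmf(1, p))   # p      -> 0.3
print(bernoulli_pmf(0, p))   # 1 - p  -> 0.7

# the alternative form agrees on {0, 1}, it is just less convenient to work with
for x in (0, 1):
    assert bernoulli_pmf(x, p) == x * p + (1 - x) * (1 - p)
```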
42:27 So let me try to make those calligraphic so you 42:30 know that I'm talking about smaller values, right? 42:33 Small x's. 42:35 x1, xn, and then of course p. 42:38 Sometimes we even put-- 42:40 I didn't do it, but sometimes you can actually 42:42 put a semicolon here, semicolon so you know that those two 42:46 things are treated differently. 42:48 And so now you have this thing is equal to what? 42:51 Well, it's just the probability under p 42:54 that X1 is little x1 all the way to Xn is little xn. 42:59 OK, that's just the definition. 43:02 43:06 All right, so now let's start working. 43:11 So we write the definition, and then we 43:13 want to make it look like something we would potentially 43:16 be able to maximize if I were-- 43:17 like if I take the derivative of this with respect to p, 43:20 it's not very helpful because I just don't know. 43:22 Just want the algebraic function of p. 43:26 So this thing is going to be equal to what? 43:28 Well, what is the first thing I want to use? 43:30 43:32 I have a probability of an intersection of events, 43:35 so it's just the product of the probabilities. 43:39 So this is the product from i equal 1 to n of P, small p, 43:44 Xi is equal to little xi. 43:47 That's independence. 43:49 43:54 OK, now, I'm starting to mean business, because for each P, 43:58 we have a closed form, right? 44:00 I wrote this as this supposedly convenient form. 44:03 I still have to reveal to you why it's convenient. 44:06 So this thing is equal to-- 44:09 well, we said that that was p xi for a little xi. 44:15 1 minus p to the 1 minus xi, OK? 44:20 44:22 So that was just what I wrote over there as the probability 44:26 that Xi is equal to little xi. 44:29 And since they all have the same parameter p, just 44:32 have this p that shows up here. 44:34 44:38 And so now I'm just taking the products of something 44:41 to the xi, so it's this thing to the sum of the xi's. 44:45 Everybody agrees with this? 44:48 So this is equal to p sum of the xi, 1 minus p 44:56 to the n minus sum of the xi. 44:58 45:10 If you don't feel comfortable with writing it directly, 45:13 you can observe that this thing here 45:15 is actually equal to p over 1 minus p to the xi times 1 45:22 minus p, OK? 45:26 So now when I take the product, I'm 45:27 getting the products of those guys. 45:28 So it's just this guy to the power of sum 45:31 and this guy to the power n. 45:33 And then I can rewrite it like this if I want to 45:39 And so now-- well, that's what we have here. 45:42 And now I am in business because I can still 45:45 hope to maximize this function. 45:48 And how to maximize this function? 45:50 All I have to do is to take the derivative. 45:52 Do you want to do it? 45:54 Let's just take the derivative, OK? 45:56 Sorry, I didn't tell you that, well, the maximum likelihood 45:58 principle is to just maxim-- the idea is to maximize this thing, 46:01 OK? 46:02 But I'm not going to get there right now. 46:04 OK, so let's do it maybe for the Poisson model for a second. 46:08 So if you want to do it for the Poisson model, 46:16 let's write the likelihood. 46:18 So right now I'm not doing anything. 46:20 I'm not maximizing. 46:21 I'm just computing the likelihood function. 46:24 46:29 OK, so the likelihood function for Poisson. 46:32 So now I know-- what is my sample space for Poisson? 46:36 STUDENT: Positives. 46:38 PROFESSOR: Positive integers. 46:41 And well, let me write it like this. 46:45 Poisson lambda, and I'm going to take lambda to be positive. 
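A quick check, on an assumed sample, that the product of the individual PMFs really collapses to the closed form p^(sum of the xi) times (1 - p)^(n - sum of the xi):

```python
import numpy as np

x = np.array([1, 0, 1, 1, 0, 1])        # assumed Bernoulli sample
n, s = len(x), x.sum()

def likelihood_product(p):
    """Product of the individual PMFs p^{x_i} (1-p)^{1-x_i}."""
    return np.prod(p ** x * (1 - p) ** (1 - x))

def likelihood_closed_form(p):
    """Same thing collapsed: p^{sum x_i} (1-p)^{n - sum x_i}."""
    return p ** s * (1 - p) ** (n - s)

for p in (0.2, 0.5, 0.66):
    assert np.isclose(likelihood_product(p), likelihood_closed_form(p))
print("both forms agree")
```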
46:51 And so that means that the probability under lambda 46:53 that X is equal to little x in the sample space 46:57 is lambda to the X over factorial x 47:01 e to the minus lambda. 47:03 So that's basically the same as the compact form 47:05 that I wrote over there. 47:06 It's just now a different one. 47:08 And so when I want to write my likelihood, again, 47:12 we said little x's. 47:13 47:17 This is equal to what? 47:18 Well, it's equal to the probability under lambda 47:23 that X1 is little x1, Xn is little xn, 47:31 which is equal to the product. 47:33 47:40 OK? 47:42 Just by independence. 47:45 And now I can write those guys as being-- each 47:47 of them being i equal 1 to n. 47:52 So this guy is just this thing where a plug in Xi. 47:56 So I get lambda to the Xi divided by factorial xi times e 48:05 to the minus lambda, OK? 48:10 And now, I mean, this guy is going to be nice. 48:13 This guy is not going to be too nice. 48:15 But let's write it. 48:16 When I'm going to take the product of those guys here, 48:18 I'm going to pick up lambda to the sum of the xi's. 48:21 Here I'm going to pick up exponential 48:23 minus n times lambda. 48:25 And here I'm going to pick up just the product 48:27 of the factorials. 48:29 So x1 factorial all the way to xn factorial. 48:35 Then I get lambda, the sum of the xi. 48:41 Those are little xi's. 48:43 e to the minus xn lambda. 48:46 OK? 48:47 48:51 So that might be freaky at this point, but remember, 48:55 this is a function we will be maximizing. 48:58 And the denominator here does not depend on lambda. 49:01 So we knew that maximizing this function with this denominator, 49:04 or any other denominator, including 1, 49:07 will give me the same arg max. 49:09 So it won't be a problem for me. 49:12 As long as it does not depend on lambda, 49:14 this thing is going to go away. 49:15 49:19 OK, so in the continuous case, the likelihood I cannot-- 49:24 right? 49:25 So if I would write the likelihood 49:26 like this in the continuous case, 49:29 this one would be equal to what? 49:32 Zero, right? 49:33 So it's not very helpful. 49:34 And so what we do is we define the likelihood 49:36 as the product of the f of theta xi. 49:39 Now that would be a jump if I told you, 49:43 well, just define it like that and go home 49:45 and don't discuss it. 49:46 But we know that this is exactly what's coming from the-- 49:52 well, actually, I think I erased it. 49:53 It was just behind. 49:55 So this was exactly what was coming from the KL 49:58 divergence estimated, right? 50:00 The thing that I showed you, if we 50:01 want to follow this strategy, which 50:03 consists in estimating the KL divergence and minimizing it, 50:06 is exactly doing this. 50:08 50:12 So in the Gaussian case-- 50:16 well, let's write it. 50:17 So in the Gaussian case, let's see 50:19 what the likelihood looks like. 50:20 50:27 OK, so if I have a Gaussian experiment here-- 50:32 did I actually write it? 50:33 50:36 OK, so I'm going to take mu and sigma as being two parameters. 50:40 So that means that my sample space is going to be what? 50:43 50:47 Well, my sample space is still R. 50:49 Those are just my observations. 50:51 But then I'm going to have a N mu sigma squared. 50:56 And the parameters of interest are mu 50:58 and R. And sigma squared and say 0 infinity. 51:04 OK, so that's my Gaussian model. 51:06 Yes. 51:07 STUDENT: [INAUDIBLE] 51:17 PROFESSOR: No, there's no-- 51:18 I mean, there's no difference. 51:20 STUDENT: [INAUDIBLE] 51:21 PROFESSOR: Yeah. 
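The analogous check for the Poisson likelihood; note in the code that the factorial denominator does not involve lambda, which is why it drops out at the maximization step (assumed sample, standard-library factorial):

```python
import math
import numpy as np

x = np.array([2, 0, 3, 1, 4])                 # assumed Poisson sample

def poisson_likelihood(lam):
    """lambda^{sum x_i} e^{-n lambda} / (x_1! ... x_n!)"""
    num = lam ** x.sum() * math.exp(-len(x) * lam)
    den = math.prod(math.factorial(int(xi)) for xi in x)   # does not depend on lambda
    return num / den

def poisson_likelihood_product(lam):
    """Directly as the product of the individual PMFs."""
    return math.prod(lam ** int(xi) * math.exp(-lam) / math.factorial(int(xi)) for xi in x)

for lam in (0.5, 2.0, 3.3):
    assert np.isclose(poisson_likelihood(lam), poisson_likelihood_product(lam))
print("both forms agree")
```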
51:22 I think the all the slides I put the curly bracket, 51:24 then I'm just being lazy. 51:26 I just like those concave parenthesis. 51:31 All right, so let's write it. 51:33 So the definition, L xi, xn. 51:39 And now I have two parameters, mu and sigma squared. 51:43 We said, by definition, is the product from i 51:48 equal 1 to n of f theta of little xi. 51:55 Now, think about it. 51:57 Here we always had an extra line, right? 52:00 The line was to say that the definition was the probability 52:03 that they were all equal to each other. 52:05 That was the joint probability. 52:08 And here it could actually have a line that says it's the joint 52:12 probability distribution of the xi's. 52:14 And if it's not independent, it's 52:15 not going to be the product. 52:16 But again, since we're only dealing 52:18 with independent observations in the scope of this class, 52:21 this is the only definition we're going to be using. 52:23 OK, and actually, from here on, I 52:26 will literally skip this step when I talk about discrete ones 52:30 as well, because they are also independent. 52:33 Agreed? 52:35 So we start with this, which we agreed 52:37 was the definition for this particular case. 52:39 And so now all of you know by heart what the density of a-- 52:44 sorry, that's not theta. 52:45 I should write it mu sigma squared. 52:47 And so you need to understand what this density. 52:50 And it's product of 1 over sigma square root 2 pi times 53:01 exponential minus xi minus mu squared 53:07 divided by 2 sigma squared. 53:10 OK, that's the Gaussian density with parameters mu and sigma 53:13 squared. 53:15 I just plugged in this thing which I don't give you, 53:18 so you just have to trust me. 53:20 It's all over any book. 53:22 Certainly, I mean, you can find it. 53:25 I will give it to you. 53:26 And again, you're not expected to know it by heart. 53:29 Though, if you do your homework every week without wanting to, 53:34 you will definitely use some of your brain 53:36 to remember that thing. 53:38 OK, and so now, well, I have this constant in front. 53:42 1 over sigma square root 2 pi that I can pull out. 53:45 So I get 1 over sigma square root 2 pi to the power n. 53:50 And then I have the product of exponentials, which we know 53:52 is the exponential of the sum. 53:55 So this is equal to exponential minus. 53:58 And here I'm going to put the 1 over 2 sigma squared 54:01 outside the sum. 54:02 54:15 And so that's how this guy shows up. 54:19 Just the product of the density is evaluated at, respectively, 54:23 x1 to xn. 54:24 54:28 OK, any questions about computing those likelihoods? 54:33 Yes. 54:34 STUDENT: Why [INAUDIBLE] 54:41 PROFESSOR: Oh, that's a typo. 54:42 Thank you. 54:43 Because I just took it from probably the previous thing. 54:47 So those are actually-- should be-- 54:48 OK, thank you for noting that one. 54:50 So this line should say for any x1 to xn in R to the n. 55:00 Thank you, good catch. 55:01 55:06 All right, so that's really e to the n, right? 55:10 My sample space always. 55:12 55:16 OK, so what is maximum likelihood estimation? 55:19 Well again, if you go back to the estimate 55:24 that we got, the estimation strategy, which consisted 55:27 in replacing expectation with respect to theta star 55:31 by average of the data in the KL divergence, 55:35 we would try to maximize not this guy, but this guy. 55:41 55:45 The thing that we actually plugged in were not any small 55:48 xi's. 55:48 Were actually-- the random variable is capital Xi. 
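The Gaussian likelihood just derived, (1 over sigma root 2 pi)^n times exp of minus the sum of (xi - mu)^2 over 2 sigma^2, together with its log, which is what one actually maximizes in practice; a small consistency check on an assumed sample:

```python
import numpy as np

x = np.array([1.1, 2.3, 1.8, 2.9, 2.2])    # assumed sample

def gaussian_likelihood(mu, sigma2):
    sigma = np.sqrt(sigma2)
    return (1.0 / (sigma * np.sqrt(2 * np.pi))) ** len(x) \
        * np.exp(-np.sum((x - mu) ** 2) / (2 * sigma2))

def gaussian_log_likelihood(mu, sigma2):
    n = len(x)
    return -0.5 * n * np.log(2 * np.pi * sigma2) - np.sum((x - mu) ** 2) / (2 * sigma2)

mu, sigma2 = 2.0, 0.5
assert np.isclose(np.log(gaussian_likelihood(mu, sigma2)),
                  gaussian_log_likelihood(mu, sigma2))
print(gaussian_log_likelihood(mu, sigma2))
```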
55:52 So the maximum likelihood estimator 55:54 is actually taking the likelihood, 55:57 which is a function of little x's, and now 55:59 the values at which it estimates, if you look at it, 56:02 is actually-- 56:03 the capital X is my data. 56:05 So it looks at the function, at the data, 56:09 and at the parameter theta. 56:11 That's what the-- so that's the first thing. 56:14 And then the maximum likelihood estimator 56:16 is maximizing this, OK? 56:19 So in a way, what it does is it's a function that couples 56:24 together the data, capital X1 to capital Xn, 56:27 with the parameter theta and just now tries to maximize it. 56:32 So if this is just a little hard for you to get, 56:40 the likelihood is formally defined 56:42 as a function of x, right? 56:43 Like when I write f of x. 56:46 f of little x, I define it like that. 56:48 But really, the only x arguments we're 56:52 going to evaluate this function at 56:54 are always the random variable, which is the data. 56:57 So if you want, you can think of it 56:59 as those guys being not parameters of this function, 57:02 but really, random variables themselves directly. 57:04 57:09 Is there any question? 57:10 STUDENT: [INAUDIBLE] those random variables [INAUDIBLE]?? 57:15 PROFESSOR: So those are going to be known once you have-- 57:17 so it's always the same thing in stats. 57:20 You first design your estimator as a function 57:24 of random variables. 57:25 And then once you get data, you just plug it in. 57:27 But we want to think of them as being random variables 57:29 because we want to understand what the fluctuations are. 57:32 So we're going to keep them as random variables for as long 57:34 as we can. 57:35 We're going to spit out the estimator as a function 57:37 of the random variables. 57:38 And then when we want to compute it from data, 57:40 we're just going to plug it in. 57:41 57:44 So keep the random variables for as long as you can. 57:46 Unless I give you numbers, actual numbers, 57:48 just those are random variables. 57:51 OK, so there might be some confusion 57:53 if you've seen any stats class, sometimes there's 57:55 a notation which says, oh, the realization 57:58 of the random variables are lower case versions 58:01 of the original random variables. 58:02 So lowercase x should be thought as the realization 58:05 of the upper case X. This is not the case here. 58:09 When I write this, it's the same way 58:12 as I write f of x is equal to x squared, right? 58:16 It's just an argument of a function that I want to define. 58:20 So those are just generic x. 58:22 So if you correct the typo that I have, 58:24 this should say that this should be for any x and xn. 58:27 I'm just describing a function. 58:28 And now the only place at which I'm 58:30 interested in evaluating that function, 58:32 at least for those first n arguments, is at the capital 58:35 N observations random variables that I have. 58:37 58:41 So there's actually texts, there's actually 58:45 people doing research on when does the maximum likelihood 58:48 estimator exist? 58:49 And that happens when you have infinite sets, thetas. 58:56 And this thing can diverge. 58:58 There is no global maximum. 59:00 There's crazy things that might happen. 59:01 And so we're actually always going to be in a case 59:04 where this maximum likelihood estimator exists. 59:07 And if it doesn't, then it means that you actually 59:09 need to restrict your parameter space, capital Theta, 59:13 to something smaller. 59:15 Otherwise it won't exist. 
59:17 OK, so another thing is the log likelihood estimator. 59:21 So it is still the likelihood estimator. 59:23 We solved before that maximizing a function 59:26 or maximizing log of this function 59:27 is the same thing, because the log function is increasing. 59:30 So the same thing is maximizing a function 59:32 or maximizing, I don't know, exponential of this function. 59:35 Every time I take an increasing function, 59:37 it's actually the same thing. 59:38 Maximizing a function or maximizing 10 times 59:40 this function is the same thing. 59:41 So the function x maps to 10 times x is increasing. 59:45 And so why do we talk about log likelihood rather than 59:49 likelihood? 59:50 So the log of likelihood is really just-- 59:52 I mean the log likelihood is the log of the likelihood. 59:55 And the reason is exactly for this kind of reasons. 59:59 Remember, that was my likelihood, right? 60:02 And I want to maximize it. 60:04 And it turns out that in stats, there's 60:05 a lot of distributions that look like exponential of something. 60:10 So I might as well just remove the exponential 60:12 by taking the log. 60:14 So once I have this guy, I can take the log. 60:17 This is something to a power of something. 60:19 If I take the log, it's going to look better for me. 60:21 I have this thing-- 60:23 well, I have another one somewhere, I think, 60:25 where I had the Poisson. 60:27 Where was the Poisson? 60:29 The Poisson's gone. 60:31 So the Poisson was the same thing. 60:33 If I took the log, because it had a power, 60:35 that would make my life easier. 60:37 So the log doesn't have any particular intrinsic notion, 60:43 except that it's just more convenient. 60:47 Now, that being said, if you think 60:49 about maximizing the KL, the original formulation, 60:53 we actually remove the log. 60:55 If we come back to the KL thing-- 60:57 61:00 where is my KL? 61:01 Sorry. 61:03 That was maximizing the sum of the logs of the pi's. 61:08 And so then we worked at it by saying that the sum of the logs 61:11 was-- 61:12 maximizing the sum of the logs was the same 61:14 as maximizing the product. 61:16 But here, we're basically-- log likelihood 61:18 is just going backwards in this chain of equivalences. 61:21 And that's just because the original formulation 61:23 was already convenient. 61:27 So we went to find the likelihood 61:28 and then coming back to our original estimation strategy. 61:32 So look at the Poisson. 61:34 I want to take log here to make my sum of xi's go down. 61:39 OK, so this is my estimator. 61:47 So the log of L-- 61:50 so one thing that you want to notice 61:51 is that the log of L of x1, xn theta, as we said, 61:59 is equal to the sum from i equal 1 62:02 to n of the log of either p theta of xi, or-- 62:09 so that's in the discrete case. 62:11 And in the continuous case is the sum 62:14 of the log of f theta of xi. 62:16 62:19 The beauty of this is that you don't have to really understand 62:21 the difference between probability mass 62:23 function and probability distribution function 62:25 to implement this. 62:26 Whatever you get, that's what you plug in. 62:29 62:32 Any questions so far? 62:33 62:36 All right, so shall we do some computations 62:39 and check that, actually, we've introduced all this stuff-- 62:44 complicate functions, maximizing, KL divergence, 62:47 lot of things-- so that we can spit out, again, averages? 62:50 All right? 62:51 That's great. 
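Putting the pieces together: the maximum likelihood estimator plugs the data into theta maps to the sum of log f theta of Xi and maximizes over theta. A minimal generic sketch that does this by brute-force grid search (no optimization library; the Gaussian location model is an assumed example and the helper name is made up):

```python
import numpy as np

def mle_grid(log_pmf_or_pdf, data, grid):
    """Maximize theta -> sum_i log f_theta(x_i) over a grid of candidate thetas."""
    log_lik = np.array([np.sum(log_pmf_or_pdf(theta, data)) for theta in grid])
    return grid[log_lik.argmax()]

# example: N(theta, 1) location model
def log_f_gauss(theta, x):
    return -0.5 * np.log(2 * np.pi) - 0.5 * (x - theta) ** 2

rng = np.random.default_rng(2)
data = rng.normal(1.5, 1.0, size=200)
theta_hat = mle_grid(log_f_gauss, data, np.linspace(-2, 5, 1401))
print(theta_hat, data.mean())     # the two should agree up to the grid resolution
```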
62:51 We're going to able to sleep at night 62:52 and know that there's a really powerful mechanism called 62:55 maximum likelihood estimator that was actually 62:57 driving our intuition without us knowing. 63:00 OK, so let's do this so. 63:04 Bernoulli trials. 63:06 I still have it over there. 63:07 63:15 OK, so actually, I don't know what-- 63:19 well, let me write it like that. 63:21 So it's P over 1 minus P xi-- 63:25 sorry, sum of the xi's times 1 minus P is to the n. 63:32 So now I want to maximize this as a function of P. 63:37 Well, the first thing we would want to do 63:39 is to check that this function is concave. 63:41 And I'm just going to ask you to trust me on this. 63:45 So I don't want-- sorry, sum of the xi's. 63:47 I only want to take the derivative and just go home. 63:52 So let's just take the derivative of this with respect 63:55 to P. Actually, no. 63:56 This one was more convenient. 63:57 I'm sorry. 63:58 64:00 This one was slightly more convenient, OK? 64:03 So now we have-- 64:05 so now let me take the log. 64:09 So if I take the log, what I get is sum of the xi's times log p 64:16 plus n minus some of the xi's times log 1 minus p. 64:24 64:27 Now I take the derivative with respect 64:29 to p and set it equal to zero. 64:35 So what does that give me? 64:36 It tells me that sum of the xi's divided by p minus n 64:43 sum of the xi's divided by 1 minus p is equal to 0. 64:50 64:56 So now I need to solve for p. 64:58 So let's just do it. 64:59 So what we get is that 1 minus p sum of the xi's is equal to p n 65:06 minus sum of the xi's. 65:10 So that's p times n minus sum of the xi's plus sum of the xi's. 65:17 So let me put it on the right. 65:18 So that's p times n is equal to sum of the xi's. 65:24 And that's equivalent to p-- 65:27 actually, I should start by putting p hat from here 65:30 on, because I'm already solving an equation, right? 65:33 And so p hat is equal to syn of the xi's 65:36 divided by n, which is my xn bar. 65:38 65:44 Poisson model, as I said, Poisson is gone. 65:50 So let me rewrite it quickly. 65:51 66:00 So Poisson, the likelihood in X1, Xn, and lambda 66:07 was equal to lambda to the sum of the xi's e 66:13 to the minus n lambda divided by X1 factorial, 66:17 all the way to Xn factorial. 66:20 So let me take the log likelihood. 66:25 That's going to be equal to what? 66:26 It's going to tell me. 66:27 It's going to be-- 66:29 well, let me get rid of this guy first. 66:30 Minus log of X1 factorial all the way to Xn factorial. 66:36 That's a constant with respect to lambda. 66:39 So when I'm going to take the derivative, it's going to go. 66:43 Then I'm going to have plus sum of the xi's times log lambda. 66:49 And then I'm going to have minus n lambda. 66:51 66:54 So now then, you take the derivative 66:55 and set it equal to zero. 66:57 So log L-- well, partial with respect to lambda of log L, 67:04 say lambda, equals zero. 67:08 This is equivalent to, so this guy goes. 67:11 This guy gives me sum of the xi's divided by lambda hat 67:16 equals n. 67:17 67:22 And so that's equivalent to lambda hat 67:25 is equal to sum of the xi's divided by n, which is Xn bar. 67:31 67:34 Take derivative, set it equal to zero, and just solve. 67:38 It's a very satisfying exercise, especially when 67:42 you get the average in the end. 67:45 You don't have to think about it forever. 67:49 OK, the Gaussian model I'm going to leave to you as an exercise. 
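The two closed forms just derived, p hat = Xn bar for the Bernoulli and lambda hat = Xn bar for the Poisson, can be checked against a brute-force maximization of the corresponding log-likelihoods; a self-contained sketch with simulated data:

```python
import numpy as np

def mle_grid(log_pmf, data, grid):
    log_lik = np.array([np.sum(log_pmf(theta, data)) for theta in grid])
    return grid[log_lik.argmax()]

rng = np.random.default_rng(3)

# Bernoulli(p): log of p^x (1-p)^(1-x) is x log p + (1-x) log(1-p)
bern = rng.binomial(1, 0.35, size=500)
log_pmf_bern = lambda p, x: x * np.log(p) + (1 - x) * np.log(1 - p)
print(mle_grid(log_pmf_bern, bern, np.linspace(0.001, 0.999, 999)), bern.mean())

# Poisson(lambda): log PMF is x log(lambda) - lambda - log(x!); the log(x!) term
# is constant in lambda, so it is dropped here
pois = rng.poisson(2.4, size=500)
log_pmf_pois = lambda lam, x: x * np.log(lam) - lam
print(mle_grid(log_pmf_pois, pois, np.linspace(0.01, 6.0, 600)), pois.mean())
```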
67:54 Take the log to get rid of the pesky exponential, 67:57 and then take the derivative and you should be fine. 68:00 It's a bit more-- 68:02 it might be one more line than those guys. 68:05 OK, so-- well, actually, you need to take 68:12 the gradient in this case. 68:14 Don't check the second derivative right now. 68:15 You don't have to really think about it. 68:17 68:21 What did I want to add? 68:23 I think there was something I wanted to say. 68:25 Yes. 68:27 When I have a function that's concave and I'm on, like, 68:31 some infinite interval, then it's 68:33 true that taking the derivative and setting it equal to zero 68:36 will give me the maximum. 68:38 But again, I might have a function that looks like this. 68:42 Now, if I'm on some finite interval-- let me go elsewhere. 68:46 So if I'm on some finite interval 68:55 and my function looks like this as a function of theta-- 69:00 let's say this is my log likelihood 69:03 as a function of theta-- 69:06 then, OK, there's no place in this interval-- 69:13 let's say this is between 0 and 1-- there's 69:15 no place in this interval where the derivative is equal to 0. 69:19 And if you actually try to solve this, 69:22 you won't find a solution in the interval 0, 1. 69:26 And that's actually how you know that you probably 69:28 should not be setting the derivative equal to zero. 69:30 So don't panic if you get something that says, 69:32 well, the solution is at infinity, right? 69:34 If this function keeps going, you 69:36 will find that the solution-- you 69:37 won't be able to find a solution apart from infinity. 69:40 You are going to see something like 1 over theta hat 69:43 is equal to 0, or something like this. 69:46 So you know that when you've found this kind of solution, 69:48 you've probably made a mistake at some point. 69:51 And the reason is that for functions like this, 69:54 you don't find the maximum by setting the derivative equal 69:58 to zero. 69:59 You actually just find the maximum by saying, 70:01 well, it's an increasing function on the interval 0, 1, 70:03 so the maximum must be attained at 1. 70:05 70:07 So here in this case, that would mean 70:08 that my maximum would be 1. 70:12 My estimator would be 1, which would be weird. 70:14 So typically here, this endpoint is a function of the xi's. 70:17 One example that you will see many times is when this guy is 70:19 the maximum of the xi's. 70:24 In which case, the maximum of the likelihood is attained here, 70:27 at the maximum of the xi's. 70:29 OK, so just keep in mind-- 70:31 what I would recommend is every time 70:33 you're trying to take the maximum of a function, 70:36 just try to plot the function in your head. 70:39 It's not too complicated. 70:40 Those things are usually squares, or square roots, 70:44 or logs. 70:45 You know what those functions look like. 70:47 Just plot them in your mind and make sure 70:50 that you find a maximum where the function really 70:52 goes up and then down again. 70:54 If you don't, then that means your maximum 70:56 is achieved at the boundary, and you have 70:59 to think differently to get it. 71:01 So the machinery that consists in setting the derivative equal 71:04 to zero works 80% of the time. 71:06 But you do have to be careful. 71:08 And from the context, it will be clear 71:11 that you had to be careful, because you will find 71:14 some crazy stuff, such as having to solve 1 over theta hat 71:17 equals zero.
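A minimal numerical sketch of this boundary situation. The lecture only alludes to an estimator equal to the maximum of the xi's; the Uniform(0, theta) model used here is my assumption of the standard example behind that remark, and the data are made up.

```python
import numpy as np

rng = np.random.default_rng(0)
xs = rng.uniform(0, 0.7, size=20)       # pretend the true theta is 0.7

def log_likelihood(theta, xs):
    # Uniform(0, theta): density 1/theta on [0, theta], so the likelihood is zero
    # (log-likelihood is -inf) whenever theta < max(xs).
    if theta < xs.max():
        return -np.inf
    return -len(xs) * np.log(theta)

grid = np.linspace(0.01, 1.0, 1000)
values = [log_likelihood(t, xs) for t in grid]
print(grid[np.argmax(values)], xs.max())
# On [max(xs), 1] the log-likelihood is strictly decreasing and its derivative, -n/theta,
# is never zero, so the maximizer sits at the boundary: theta_hat = max(xs).
```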
71:18 71:23 All right, so before we conclude, 71:25 I just wanted to give you some intuition about how 71:28 the maximum likelihood estimator performs. 71:30 So there's something called the Fisher information 71:33 that essentially controls how this thing performs. 71:35 And the Fisher information is, essentially, 71:38 a second derivative or a Hessian. 71:40 So if I'm in a one-dimensional parameter case, it's a number, 71:44 it's a second derivative. 71:46 If I'm in a multidimensional case, it's actually a Hessian, 71:51 it's a matrix. 71:52 So I'm actually going to take as notation little curly l 71:57 of theta to be the log likelihood, OK? 72:00 And that's the log likelihood for one observation. 72:02 So let's call it x generically, but think of it as being x1, 72:05 for example. 72:07 And I don't care about, like, summing, 72:09 because I'm actually going to take the expectation of this thing. 72:11 So it's not going to be a data-driven quantity 72:13 I'm going to play with. 72:14 So now I'm going to assume that it 72:15 is twice differentiable, almost surely, because it's 72:19 a random function. 72:21 And so now I'm going to just sweep under the rug 72:23 some technical conditions under which these things hold. 72:27 So typically, when can I permute integrals and derivatives-- 72:32 this kind of stuff that you don't want to think about. 72:35 OK, the rule of thumb is it always 72:36 works until it doesn't, in which case 72:39 that probably means you're actually solving 72:41 some sort of calculus problem. 72:44 Because in practice, it just doesn't happen. 72:47 So the Fisher information is the expectation of the-- 72:56 that's called the outer product. 72:57 So that's the product of this gradient 73:01 and the gradient transpose. 73:02 So that forms a matrix, right? 73:04 That's a matrix, minus the outer product of the expectations. 73:09 So that's really what's called the covariance matrix 73:12 of this vector, nabla l of theta, which 73:16 is a random vector. 73:18 So I'm forming the covariance matrix of this thing. 73:21 And the technical conditions tell me that, actually, 73:23 this guy, which depends only on the-- 73:26 sorry, 73:31 it depends on the gradient-- 73:32 is actually equal to negative the expectation of the Hessian. 73:36 So I can actually get a quantity that 73:38 depends on the second derivatives using only 73:40 first derivatives. 73:41 But the expectation is going to play a role here. 73:44 And the fact that it's a log. 73:45 And lots of things actually show up here. 73:48 And so in this case, what I get is that-- 73:51 so in the one-dimensional case, then this 73:53 is just the covariance matrix of a one-dimensional thing, which 73:56 is just its variance. 73:58 So the variance of the derivative 74:00 is actually equal to negative the expectation 74:04 of the second derivative. 74:07 OK, so we'll see that next time. 74:09 But what I wanted to emphasize with this is: why do 74:12 we care about this quantity? 74:15 That's called the Fisher information. 74:16 Fisher is the founding father of modern statistics. 74:19 Why do we give this quantity his name? 74:23 Well, it's because this quantity is actually very critical. 74:25 What does the second derivative of a function 74:27 tell me at the maximum? 74:29 Well, it's telling me how curved it is, right? 74:34 If I have a zero second derivative, I'm basically flat. 74:37 And if I have a very high second derivative, I'm very curvy.
74:41 And when I'm very curvy, what it means 74:42 is that I'm very robust to estimation error. 74:45 Remember our estimation strategy, 74:47 which consisted in replacing expectations by averages? 74:50 If I'm extremely curvy, I can move a little bit. 74:52 This thing, the maximum, is not going to move much. 74:55 And this formula here-- 74:57 so forget about the matrix version for a second-- 75:00 is actually telling me exactly-- 75:01 it's telling me the curvature is basically the variance 75:06 of the first derivative. 75:08 And so the more the first derivative fluctuates, 75:10 the more your maximum-- your arg max-- 75:12 is going to move all over the place. 75:14 So this is really controlling how flat 75:16 your likelihood, your log likelihood, is at its maximum. 75:20 The flatter it is, the more sensitive to fluctuations 75:23 the arg max is going to be. 75:24 The curvier it is, the less sensitive it is. 75:27 And so what we're hoping for-- is a good model 75:28 going to be one that has a large or a small value 75:31 for the Fisher information? 75:34 Do I want this to be-- 75:36 small? 75:38 No, I want it to be large. 75:40 Because this is the curvature, right? 75:42 This number is negative, since the function is concave. 75:44 So if I put a negative sign in front, it's 75:45 going to be something that's positive. 75:47 And the larger this thing, the more curvy it is. 75:51 Oh, yeah, because it's the variance. 75:52 Again, sorry. 75:53 This is what-- 75:55 OK. 75:55 75:59 Yeah, maybe I should not go into those details, 76:02 because I'm actually out of time. 76:03 But just spoiler alert: the asymptotic variance 76:06 of your-- the variance, basically, as n 76:09 goes to infinity, of the maximum likelihood estimator 76:11 is going to be 1 over this guy. 76:12 So we want it to be large, so that the asymptotic variance 76:15 is going to be very small. 76:16 All right, so we're out of time. 76:18 We'll see that next week. 76:20 And I have your homework with me. 76:22 And I will actually hand it back. 76:25 I will give it to you outside so we 76:26 can let the other room come in. 76:28 OK, I'll just leave you the--
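To make the spoiler concrete, here is a small simulation in the Bernoulli model (my choice of example, not worked in the lecture). It checks the two facts just stated: the variance of the first derivative of the log likelihood matches minus the expected second derivative (the Fisher information), and the variance of the maximum likelihood estimator behaves like 1 over n times that Fisher information.

```python
import numpy as np

p, n, reps = 0.3, 500, 20_000            # illustrative values
rng = np.random.default_rng(1)

# Fisher information identity for one observation X ~ Bernoulli(p):
# score = d/dp log p_p(X) = X/p - (1-X)/(1-p)
# hess  = d^2/dp^2 log p_p(X) = -X/p^2 - (1-X)/(1-p)^2
x = rng.binomial(1, p, size=1_000_000)
score = x / p - (1 - x) / (1 - p)
hess = -x / p**2 - (1 - x) / (1 - p)**2
print(score.var(), -hess.mean(), 1 / (p * (1 - p)))   # all close to 4.76

# Asymptotic variance: the MLE of p is the sample mean, and its variance
# should be close to 1 / (n * I(p)) = p(1 - p) / n.
p_hat = rng.binomial(n, p, size=reps) / n
print(p_hat.var(), p * (1 - p) / n)                   # both close to 0.00042
```

The larger the Fisher information, the smaller 1 / (n * I(p)) is, which is exactly the "large is good" point made at the end of the lecture.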