https://www.youtube.com/watch?v=JBIz7UadY5M&list=PLUl4u3cNGP60uVBMaoNERc6knT_MgPKS0&index=13 Subtitle transcript 00:00 00:00 The following content is provided under a Creative 00:02 Commons license. 00:03 Your support will help MIT OpenCourseWare 00:06 continue to offer high-quality educational resources for free. 00:10 To make a donation or to view additional materials 00:12 from hundreds of MIT courses, visit MIT OpenCourseWare 00:16 at ocw.mit.edu. 00:19 PHILIPPE RIGOLLET: [INAUDIBLE] minus xi transpose t. 00:26 I just pick whatever notation I want for a variable. 00:30 And let's say it's t. 00:33 So that's the least squares estimator. 00:35 And it turns out that, as I said last time, 00:38 it's going to be convenient to think 00:39 of those things as matrices. 00:42 So here, I already have vectors. 00:44 I've already gone from one dimension, just real valued 00:47 random variables through random vectors when 00:49 I think of each xi, but if I start stacking them together, 00:52 I'm going to have vectors and matrices that show up. 00:56 So the first vector I'm getting is 00:57 y, which is just a vector where I have y1 to yn. 01:04 Then I have-- so that's a boldface vector. 01:07 Then I have x, which is a matrix where I have-- 01:12 well, the first coordinate is always 1. 01:16 So I have 1, and then x1 through xp minus 1, and that's-- sorry, 01:24 x1 through xp minus 1, and that's for observation 1. 01:29 And then I have the same thing all the way down 01:31 for observation n. 01:32 01:40 OK, everybody understands what this is? 01:42 So I'm just basically stacking up all the xi's. 01:47 So this i-th row is xi transpose. 01:55 I am just stacking them up. 01:57 And so if I want to write all these things to be 02:00 true for each of them, all I need to do 02:03 is to write a vector epsilon, which 02:05 is epsilon 1 to epsilon n. 02:08 And what I'm going to have is that y, the boldface vector, 02:11 now is equal to the matrix x times the vector 02:14 beta plus the vector epsilon. 02:18 And it's really just exactly saying 02:20 what's there, because for 2-- so this is a vector, right? 02:23 This is a vector. 02:25 And what is the dimension of this vector? 02:27 02:32 n, so this is n observations. 02:37 And for all these-- for two vectors to be equal, 02:39 I need to have all the coordinates to be equal, 02:41 and that's exactly the same thing as saying that this 02:44 holds for i equal 1 to n. 02:46 02:48 But now, when I have this, I can actually 02:51 rewrite the sum for t equals-- 02:55 sorry, for i equals 1 to n of yi minus xi transpose 03:03 beta squared, this turns out to be 03:05 equal to the Euclidean norm of the vector y minus the matrix x 03:12 times beta squared. 03:14 And I'm going to put a 2 here so we 03:16 know we're talking about the Euclidean norm. 03:19 This just means this is the Euclidean norm. 03:20 03:27 That's the one we've seen before when 03:28 we talked about chi squared-- 03:30 that's the square norm is the sum 03:31 of the square of the coefficients, 03:32 and then I take a square root, but here I 03:34 have an extra square. 03:35 So it's really just the sum of the square of the coefficients, 03:38 which is this. 03:38 And here are the coefficients. 03:40 03:43 So then, now that I write this thing like that, then minimizing-- 03:49 so my goal here, now, is going to be to solve minimum over t 03:54 in R^p of the norm of y minus x times t, squared. 04:05 And just like we did for one dimension, 04:07 we can actually write optimality conditions for this. 04:12 I mean, this is a function. 
04:14 So this is a function from rp to r. 04:23 And if I want to minimize it, all I have to do 04:24 is to take its gradient and set it equal to 0. 04:28 So minimum, set gradient to 0. 04:42 So that's where it becomes a little complicated. 04:45 Now I'm going to have to take the gradient of this norm. 04:49 It might be a little annoying to do. 04:51 But actually, what's nice about those things-- 04:53 I mean, I remember that it was a bit annoying to learn. 04:56 I mean, it's just basically rules of calculus 04:59 that you don't use that much. 05:01 But essentially, you can actually expand this norm. 05:05 And you will see that the rules are basically 05:07 the same as in one dimension, you just 05:09 have to be careful about the fact that matrices do not 05:11 commute. 05:13 So let's expand this thing. 05:15 y minus xt squared-- 05:19 well, this is equal to the norm of y 05:21 squared plus the norm of xt squared plus 2 05:30 times y transpose xt. 05:33 05:36 That's just expanding the square in more dimensions. 05:41 And this, I'm actually going to write as y squared plus-- 05:47 so here, the norm squared of this guy, 05:50 I always have that the norm of x squared 05:53 is equal to x transpose x. 05:56 So I'm going to write this as x transpose 05:58 x, so it's t transpose x transpose xt 06:04 plus 2 times y transpose xt. 06:10 So now, if I'm going to take the gradient with respect to t, 06:13 I have basically three terms, and each of them 06:16 has some sort of a different nature. 06:18 This term is linear in t, and it's 06:21 going to differentiate the same way 06:23 that I differentiate a times x. 06:25 I'm just going to keep the a. 06:28 This guy is quadratic. 06:29 t appears twice. 06:32 And this guy, I'm going to pick up a 2, 06:34 and it's going to differentiate just like when I differentiate 06:37 a times x squared. 06:38 It's 2 times ax. 06:41 And this guy is a constant with respect to t, 06:43 so it's going to differentiate to 0. 06:47 So when I compute the gradient-- 06:49 06:53 now, of course, all of these rules that I give you, you 06:55 can check by looking at the partial derivative with respect 06:58 to each coordinate. 06:59 But arguably, it's much faster to know 07:02 the rules of differentiation. 07:04 It's like if I gave you the function exponential x 07:06 and I said, what is the derivative, 07:08 and you started writing, well, I'm 07:09 going to write exponential x plus h minus exponential x 07:13 divided by h and let h go to 0. 07:15 That's a bit painful. 07:17 AUDIENCE: Why did you transpose your-- 07:19 why does x have to be [INAUDIBLE]? 07:23 PHILIPPE RIGOLLET: I'm sorry? 07:25 AUDIENCE: I was wondering why you times t 07:26 times the [INAUDIBLE]? 07:29 PHILIPPE RIGOLLET: The transpose of ab 07:33 is b transpose a transpose. 07:35 07:38 If you're not sure about this, just 07:40 make a and b have different sizes, 07:42 and then you will see that there's some incompatibility. 07:46 I mean, there's basically only one way to not screw 07:48 that one up, so that's easy to remember. 07:51 So if I take the gradient, then it's going to be equal to what? 07:54 It's going to be 0 plus-- we said here, 07:58 this is going to differentiate like-- so 07:59 think a times x squared. 08:05 So I'm going to have 2ax. 08:06 So here, basically, this guy is going to go to x transpose xt. 
08:13 Now, I could have made this one go away, 08:17 but that's the same thing as saying that my gradient is-- 08:20 I can think of my gradient as being 08:21 either a horizontal vector or a vertical vector. 08:24 So if I remove this guy, I'm thinking of my gradient 08:26 as being horizontal. 08:27 If I remove that guy, I'm thinking of my gradient 08:30 as being vertical. 08:31 And that's what I want to think of, typically-- 08:33 vertical vectors, column vectors. 08:36 And then this guy, well, it's like these guys just 08:39 think a times x. 08:42 So the derivative is just a, so I'm going 08:44 to keep only that part here. 08:47 Sorry, I forgot a minus somewhere-- yeah, here. 08:49 08:55 Minus 2y transpose x. 08:59 And what I want is this thing to be equal to 0. 09:02 09:06 So t, the optimal t, is called beta hat and satisfies-- 09:20 09:24 well, I can cancel the 2's and put the minus 09:27 on the other side, and what I get 09:29 is that x transpose xt is equal to y transpose x. 09:36 09:44 Yeah, that's not working for me. 09:48 Yeah, that's because when I took the derivative, 09:50 I still need to make sure-- 09:51 so it's the same question of whether I want 09:53 things to be columns or rows. 09:55 So this is not a column. 09:57 If I remove that guy, y transpose x is a row. 10:01 So I'm just going to take the transpose of this guy 10:03 to make things work, and this is just going to be x transpose y. 10:07 10:11 And this guy is x transpose y so that I have columns. 10:14 10:19 So this is just a linear equation in t. 10:23 And I have to solve it, so it's of the form some matrix times 10:26 t is equal to another vector. 10:29 And so that's basically a linear system. 10:31 And the way to solve it, at least formally, 10:33 is to just take the inverse of the matrix on the left. 10:36 So if x transpose x is invertible, then-- sorry, 10:45 beta hat is the t I want. 10:48 I get that beta hat is equal to x transpose 10:51 x inverse x transpose y. 10:54 10:57 And that's the least squares estimator. 10:59 11:12 So here, I use this condition. 11:16 I want it to be invertible so I can actually write its inverse. 11:21 Here, I wrote, rank of x is equal to p. 11:25 What is the difference? 11:26 11:36 Well, there's basically no difference. 11:40 Basically, here, I have to assume-- 11:45 what is the size of the matrix x transpose x? 11:47 11:52 [INTERPOSING VOICES] 11:53 PHILIPPE RIGOLLET: Yeah, so what is the size? 11:55 AUDIENCE: p by p. 11:56 PHILIPPE RIGOLLET: p by p. 11:58 So this matrix is invertible if it has rank p, 12:00 if you know what rank means. 12:02 If you don't, rank p just means that it's invertible. 12:05 So it's full rank and it's invertible. 12:07 And the rank of x transpose x is actually 12:10 just the rank of x because this is the same matrix that you 12:13 apply twice. 12:14 And that's all it's saying. 12:15 So if you're not comfortable with the notion of rank 12:18 that you see here, just think of this condition 12:21 just being the condition that x transpose x is invertible. 12:25 And that's all it says. 12:26 What it means for it to be invertible-- this was true. 12:29 We made no assumption up to this point. 12:32 If x transpose x is not invertible, it means that there 12:35 might be multiple solutions to this equation. 12:38 In particular, for a matrix to not be invertible, 12:42 it means that there's some vector v. 
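Before the discussion of what goes wrong when x transpose x is not invertible, here is a minimal numerical sketch of the normal equations and the closed-form estimator just derived. The data, sizes, and seed below are made up for illustration; numpy is assumed.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical design: n = 20 observations, p = 3 (intercept + 2 covariates).
n, p = 20, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
beta_true = np.array([1.0, 2.0, -0.5])
y = X @ beta_true + 0.3 * rng.normal(size=n)

# Closed form from the board: beta_hat = (X^T X)^{-1} X^T y.
beta_hat = np.linalg.inv(X.T @ X) @ X.T @ y

# Equivalent (and numerically preferable): solve the normal equations
# X^T X t = X^T y directly, or let lstsq minimize ||y - X t||_2^2 for us.
beta_solve = np.linalg.solve(X.T @ X, X.T @ y)
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print(beta_hat, beta_solve, beta_lstsq)  # all three agree up to rounding
```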
12:45 So if x transpose x is not invertible, 12:55 then this is equivalent to there exists a vector 13:00 v, which is not 0, and such that x transpose xv is equal to 0. 13:07 That's what it means to not be invertible. 13:10 So in particular, if beta hat is a solution-- 13:13 13:22 so this equation is sometimes called score equations, 13:26 because the gradient is called the score, 13:28 and so you're just checking if the gradient is equal to 0. 13:31 So if beta hat satisfies star, then so 13:33 does beta hat plus lambda v for all lambda in the real line. 13:46 13:51 And the reason is because, well, if I start looking at-- 13:54 what is x transpose x times beta hat plus lambda v? 14:02 Well, by linearity, this is just x transpose x 14:08 beta hat plus lambda x transpose x times v. 14:16 But this guy is what? 14:17 14:22 It's 0, just because that's what we assumed. 14:27 We assumed that x transpose xv was equal to 0, 14:31 so we're left only with this part, which, by star, is just 14:34 x transpose y. 14:35 14:40 So that means that x transpose x beta hat plus lambda v 14:44 is actually equal to x transpose y, which means that there's 14:48 another solution, which is not just beta hat, 14:50 but any move of beta hat along this direction v by any size. 14:56 So that's going to be an issue, because you're 14:58 looking for one estimator. 15:00 And there's not just one, in this case, there's many. 15:03 And so this is not going to be well-defined 15:05 and you're going to have some issues. 15:07 So if you want to talk about the least squares estimator, 15:09 you have to make this assumption. 15:13 What does it imply in terms of, can I 15:15 think of p being 2n, for example, in this case? 15:18 What happens if p is equal to 2n? 15:20 15:27 AUDIENCE: Well, then the rank of your matrix is only p/2. 15:31 PHILIPPE RIGOLLET: So the rank of your matrix is only p/2, 15:33 so that means that this is actually not going to happen. 15:36 I mean, it's not only p/2, it's at most p/2. 15:39 It's at most the smaller of the two dimensions of your matrix. 15:42 So if your matrix is n times 2n, it's 15:44 at most n, which means that it's not going to be full rank, 15:47 so it's not going to be invertible. 15:49 So every time the dimension p is larger than the sample size, 15:53 your matrix is not invertible, and you cannot talk about 15:56 the least squares estimator. 15:57 So that's something to keep in mind. 15:59 And it's actually a very simple thing. 16:01 It's essentially saying, well, if p is larger than n, 16:05 it means that you have more parameters 16:07 to estimate than you have equations to estimate them. 16:11 So you have this linear system. 16:12 There's one equation per observation. 16:17 Each row, which was each observation, 16:19 was giving me one equation. 16:21 But then the number of unknowns in this linear system is p, 16:24 and so I cannot solve linear systems that have more unknowns 16:28 than they have equations. 16:30 And so that's basically what's happening. 16:32 Now, in practice, if you think about what 16:34 data sets look like these days, for example, 16:36 people are trying to express some phenotype. 16:38 So phenotype is something you can measure on people-- maybe 16:41 the color of your eyes, or your height, 16:43 or whether you have diabetes or not, things like this, 16:47 so things that are macroscopic. 16:51 And then they want to use the genotype to do that. 
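A small sketch of the failure mode just described: with p larger than n, the matrix X^T X is singular, and any solution of the normal equations stays a solution after shifting it along a null-space direction v. Made-up sizes and numbers; numpy assumed.

```python
import numpy as np

rng = np.random.default_rng(1)

# More unknowns than observations: n = 5, p = 10.
n, p = 5, 10
X = rng.normal(size=(n, p))
y = rng.normal(size=n)

G = X.T @ X                        # p x p, but rank at most n < p
print(np.linalg.matrix_rank(G))    # 5, not 10: G is not invertible

# One particular solution of the normal equations (the minimum-norm one).
beta_hat = np.linalg.pinv(X) @ y

# A direction v with X^T X v = 0: the last right singular vector of X
# corresponds to a zero singular value, so it lies in the null space of X.
_, _, Vt = np.linalg.svd(X)
v = Vt[-1]

# beta_hat + lambda * v solves the normal equations for every lambda.
for lam in (0.0, 3.0, -7.5):
    print(np.allclose(G @ (beta_hat + lam * v), X.T @ y))   # True each time
```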
16:53 They want to measure your-- they want to sequence your genome 16:55 and try to use this to predict whether you're going 16:58 to be responsive to a drug or whether your eyes are 17:01 going to be blue, or something like this. 17:03 Now, the data sets that you can have-- 17:05 people, maybe, for a given study about some sort of disease. 17:09 Maybe you will sequence the genome of maybe 100 people. 17:15 n is equal to 100. 17:17 p is basically the number of genes they're sequencing. 17:21 This is of the order of 100,000. 17:23 So you can imagine that this is a case where n is much, 17:26 much smaller than p, and you cannot talk about the least 17:28 squares estimator. 17:29 There's plenty of them. 17:31 There's not just one line like that, 17:33 lambda times v that you can move away. 17:36 There's basically an entire space in which you can move, 17:40 and so it's not well-defined. 17:42 So at the end of this class, I will give you 17:43 a short introduction on how you do this. 17:46 This actually represents more and more-- 17:49 it becomes a more and more preponderant part of the data 17:51 sets you have to deal with, because people just 17:53 collect data. 17:54 When I do the sequencing, the machine 17:57 allows me to sequence 100,000 genes. 17:59 I'm not going to stop at 100 because doctors are never 18:03 going to have cohorts of more than 100 patients. 18:06 So you just collect everything you can collect. 18:08 And this is true for everything. 18:11 Cars have sensors all over the place, 18:13 much more than they actually gather data. 18:15 There's data, there's-- we're creating, 18:16 we're recording everything we can. 18:18 And so we need some new techniques for that, 18:20 and that's what high-dimensional statistics is trying to answer. 18:23 So this is way beyond the scope of this class, 18:25 but towards the end, I will give you 18:27 some hints about what can be done in this framework 18:29 because, well, this is the new reality we have to deal with. 18:34 So here, we're in a case where p's less than n 18:37 and typically much smaller than n. 18:38 So the kind of orders of magnitude you want to have 18:40 is maybe p's of order 10 and n's of order 100, something 18:46 like this. 18:46 So you can scale that, but maybe 10 times larger. 18:50 So maybe you cannot solve this guy for beta hat, but actually, 18:57 you can talk about x times beta hat, 19:00 even if p is larger than n. 19:02 And the reason is that x times beta hat 19:06 is actually something that's very well-defined. 19:09 So what is x times beta hat? 19:11 Remember, I started with the model. 19:16 So if I look at this definition, essentially, what I 19:20 had as the original thing was that the vector 19:24 y was equal to x times beta plus the vector epsilon. 19:29 That was my model. 19:32 So beta is actually giving me something. 19:36 Beta is actually some parameter, some coefficients 19:39 that are interesting. 19:40 But a good estimator for-- so here, it 19:43 means that the observations that I have 19:45 are of the form x times beta plus some noise. 19:48 So if I want to adjust the noise, remove the noise, 19:51 a good candidate to do that is x times beta hat. 19:57 x times beta hat is something that should actually 19:59 be useful to me, which should be close to x times beta. 
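A quick numerical illustration of the point being made, with made-up data and numpy assumed: the fitted vector X beta_hat is never farther from the true X beta than the noisy observation y is, so it behaves like a denoised version of y, and it stays well-defined even when beta_hat itself is not unique.

```python
import numpy as np

rng = np.random.default_rng(2)

n, p = 30, 4
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
beta_true = rng.normal(size=p)
y = X @ beta_true + 0.5 * rng.normal(size=n)

beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
fitted = X @ beta_hat

# X beta_hat is the projection of y onto the column span of X, so it is at
# least as close to the (unknown) X beta_true as the noisy y itself is.
print(np.linalg.norm(fitted - X @ beta_true)
      <= np.linalg.norm(y - X @ beta_true))            # True

# Even when beta_hat is not unique (p > n), the fitted values are: two
# different solutions of the normal equations give the same X beta_hat.
X_wide = rng.normal(size=(5, 10))
y_wide = rng.normal(size=5)
b1 = np.linalg.pinv(X_wide) @ y_wide
_, _, Vt = np.linalg.svd(X_wide)
b2 = b1 + 4.2 * Vt[-1]              # shift along a null-space direction
print(np.allclose(X_wide @ b1, X_wide @ b2))            # True: same fitted vector
```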
20:10 So in the one-dimensional case, what it means is that if I 20:13 have-- let's say this is the true line, 20:16 and these are my x's, so I have-- 20:19 these are the true points on the real line, 20:22 and then I have my little epsilon 20:24 that just give me my observations that 20:26 move around this line. 20:28 So this is one of epsilons, say epsilon i. 20:34 Then I can actually either talk-- 20:37 to say that I recovered the line, 20:39 I can actually talk about recovering the right intercept 20:42 or recovering the right slope for this line. 20:44 Those are the two parameters that I need to recover. 20:46 But I can also say that I've actually 20:48 found a set of points that's closer 20:50 to being on the line that are closer to this set of points 20:56 right here than the original crosses that I observed. 21:00 So if we go back to the picture here, 21:03 for example, what I could do is say, well, for this point 21:08 here-- 21:09 there was an x here-- 21:11 rather than looking at this dot, which was my observation, 21:14 I can say, well, now that I've estimated the red line, 21:17 I can actually just say, well, this point 21:19 should really be here. 21:21 And actually, I can move all these dots 21:23 so that they're actually on the red line. 21:26 And this should be a better value, something 21:28 that has less noise than the original y value 21:30 that I should see. 21:32 It should be close to the true value 21:33 that I should be seeing without the extra noise. 21:37 So that's definitely something that could be of interest. 21:40 21:43 For example, in imaging, you're not 21:45 trying to understand-- so when you do imaging, 21:48 y is basically an image. 21:50 So think of a pixel image, and you just 21:53 stack it into one long vector. 21:55 And what you see is something that 21:57 should look like some linear combination of some feature 21:59 vectors, maybe. 22:01 So there's people created a bunch of features. 22:05 They're called, for example, Gabor frames or wavelet 22:09 transforms-- so just well-known libraries of variables x such 22:14 that when you take linear combinations of those guys, 22:17 this should looks like a bunch of images. 22:19 And what you want for your image-- 22:22 you don't care what the coefficients of the image 22:24 are in these bases that you came up with. 22:26 What you care about is the noise in the image. 22:28 And so you really want to get x beta. 22:31 So if you want to estimate x beta, 22:34 well, you can use x beta hat. 22:36 What is x beta hat? 22:37 Well, since beta hat is x transpose x inverse x transpose 22:42 y, this is x transpose. 22:44 22:48 That's my estimator for x beta. 22:50 22:54 Now, this thing, actually, I can define 22:59 even if I'm not low rank. 23:01 So why is this thing interesting? 23:03 Well, there's a formula for this estimator, 23:08 but actually, I can visualize what this thing is. 23:10 23:18 So let's assume, for the sake of illustration, 23:22 that n is equal to 3. 23:26 23:29 So that means that y lives in a three-dimensional space. 23:33 And so let's say it's here. 23:36 And so I have my, let's say, y's here. 23:43 And I also have a plane that's given 23:48 by the vectors x1 transpose x2 transpose, which 23:55 is, by the way, 1-- 23:58 sorry, that's not what I want to do. 24:01 24:04 I'm going to say that n is equal to 3 and that p is equal to 2. 24:10 So I basically have two vectors, 1, 1 and another one, 24:18 let's assume that it's, for example, abc. 24:25 So those are my two vectors. 
24:27 This is x1, and this is x2. 24:33 24:36 And those are my three observations for this guy. 24:39 So what I want when I minimize this, 24:48 I'm looking at the point which can 24:50 be formed as the linear combination of the columns 24:52 of x, and I'm trying to find the guy that's the closest to y. 24:57 So what does it look like? 24:58 Well, the two points, 1, 1, 1 is going to be, say, here. 25:01 That's the point 1, 1, 1. 25:04 And let's say that abc is this point. 25:06 25:14 So now I have a line that goes through those two guys. 25:17 25:20 That's not really-- let's say it's 25:22 going through those two guys. 25:24 And this is the line which can be formed by looking only 25:27 at linear combination. 25:28 So this is the line of x times t for t in r2. 25:36 That's this entire line that you can get. 25:39 Why is it-- yeah, sorry, it's not just a line, 25:42 I also have to have t, all the 0's thing. 25:45 So that actually creates an entire plane, 25:48 which is going to be really hard for me to represent. 25:54 I don't know. 25:55 I mean, maybe I shouldn't do it in these dimensions. 26:00 26:05 So I'm going to do it like that. 26:08 26:11 So this plane here is the set of xt for t and r2. 26:14 26:17 So that's a two-dimensional plane, definitely goes to 0, 26:22 and those are all these things. 26:23 So think of a sheet of paper in three dimensions. 26:25 Those are the things I can get. 26:27 So now, what I'm going to have as y 26:29 is not necessarily in this plane. 26:32 y is actually something in this plane, x beta 26:39 plus some epsilon. 26:40 26:44 y is x beta plus epsilon. 26:50 So I start from this plane, and then 26:51 I have this epsilon that pushes me, 26:53 maybe, outside of this plane. 26:54 And what least squares is doing is saying, 26:56 well, I know that epsilon should be fairly small, 26:59 so the only thing I'm going to be doing that actually makes 27:01 sense is to take y and find the point that's 27:04 on this plane that's the closest to it. 27:06 And that corresponds to doing an orthogonal projection of y 27:10 onto this thing, and that's actually exactly x beta hat. 27:13 27:18 So in one dimension, just because this is actually 27:21 a little hard-- 27:22 in one dimension, so that's if p is equal to 1. 27:34 So let's say this is my point. 27:36 And then I have y, which is in two dimensions, 27:38 so this is all on the plane. 27:40 27:42 What it does, this is my-- 27:44 the point that's right here is actually x beta hat. 27:48 That's how you find x beta hat. 27:49 You take your point y and you project it 27:51 on the linear span of the columns of x. 27:54 And that's x beta hat. 27:56 This does not tell you exactly what beta should be. 27:59 And if you know a little bit of linear algebra, 28:00 it's pretty clear, because if you want to find beta hat, 28:04 that means that you should be able to find 28:06 the coordinates of a point in the system of columns of x. 28:12 And if those guys are redundant, there's 28:13 not going to be unique coordinates for these guys, 28:17 so that's why it's actually not easy to find. 28:19 But x beta hat is uniquely defined. 28:21 It's a projection. 28:21 Yeah? 28:22 AUDIENCE: And epsilon is the distance 28:24 between the y and the-- 28:25 PHILIPPE RIGOLLET: No, epsilon is the vector that goes from-- 28:29 so there's a true x beta. 28:33 That's the true one. 28:36 It's not clear. 28:36 I mean, x beta hat is unlikely to be exactly equal to x beta. 28:41 And then the epsilon is the one that starts from this line. 
28:44 It's the vector that pushes you away. 28:46 So really, this is this vector. 28:48 That's epsilon. 28:50 So it's not a length. 28:51 The length of epsilon is the distance, 28:54 but epsilon is just the actual vector that takes you 28:57 from one to the other. 28:58 29:01 So this is all in two dimensions, 29:03 and it's probably much clearer than what's here. 29:05 29:09 And so here, I claim that this x beta hat-- 29:12 so from this picture, I implicitly 29:15 claim that forming this operator that takes y 29:22 and maps it into this vector x times x transpose x inverse x transpose y, 29:26 this should actually be equal to the projection of y 29:33 onto the linear span of the columns of x. 29:44 That's what I just drew for you. 29:46 And what it means is that this matrix 29:48 must be the projection matrix. 29:49 29:54 So of course, anybody-- 29:56 who knows linear algebra here? 29:59 OK, wow. 30:01 So what are the conditions that a projection matrix 30:04 should be satisfying? 30:06 AUDIENCE: Squares to itself. 30:08 PHILIPPE RIGOLLET: Squares to itself, right. 30:09 If I project twice, I'm not moving. 30:11 If I keep on iterating projection, 30:13 once I'm in the space I'm projecting onto, 30:15 I'm not moving. 30:16 What else? 30:17 30:24 Do they have to be symmetric, maybe? 30:28 AUDIENCE: If it's an orthogonal projection. 30:29 PHILIPPE RIGOLLET: Yeah, so this is an orthogonal projection. 30:32 It has to be symmetric. 30:34 And that's pretty much it. 30:36 So from those things, you can actually 30:38 get quite a bit. 30:39 But what's interesting is that if you actually 30:41 look at the eigenvalues of this matrix, 30:44 they should be either 0 or 1, essentially. 30:47 And they are 1 if the eigenvector associated 30:52 is within this space, and 0 otherwise. 30:55 And so that's basically what you can check. 30:56 This is not an exercise in linear algebra, 30:58 so I'm not going to go too much into those details. 31:00 But this is essentially what you want to keep in mind. 31:03 What's associated to orthogonal projections 31:05 is Pythagoras theorem. 31:07 And that's something that's going to be useful for us. 31:10 What it's essentially telling us is that if I 31:12 look at this norm squared, it's equal to this norm squared-- 31:16 sorry, this norm squared plus this norm squared 31:18 is equal to this norm squared. 31:20 And that's something-- the norm of y squared. 31:22 So Pythagoras tells me that the norm of y squared 31:32 is equal to the norm of x beta hat squared plus the norm of y 31:40 minus x beta hat squared. 31:41 31:46 Agreed? 31:47 It's just because I have a right angle here. 31:51 So that's this plus this is equal to this. 31:54 31:58 So now, to define this, I made no assumption. 32:02 Epsilon could be as wild as it wants. 32:04 I was just crossing my fingers that epsilon was actually 32:07 small enough that it would make sense 32:09 to project onto the linear span, because I implicitly 32:13 assumed that epsilon did not take me all the way there, 32:16 so that actually, it makes sense to project back. 32:19 And so for that, I need to somehow make assumptions 32:22 that epsilon is well-behaved and that it's not 32:24 completely wild, that it's moving uniformly 32:31 in all directions of the space. 32:33 There's no privileged direction where 32:34 it's always going, otherwise, I'm 32:36 going to make a systematic error. 32:37 And I need that those epsilons are going to average somehow. 
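The picture just drawn can be checked numerically. Below is a small sketch, on made-up data with numpy assumed, verifying that H = X (X^T X)^{-1} X^T is a symmetric, idempotent matrix with eigenvalues 0 and 1, and that the Pythagoras identity ||y||^2 = ||X beta_hat||^2 + ||y - X beta_hat||^2 holds.

```python
import numpy as np

rng = np.random.default_rng(3)

n, p = 12, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
y = rng.normal(size=n)

# Hat / projection matrix onto the column span of X.
H = X @ np.linalg.inv(X.T @ X) @ X.T

print(np.allclose(H, H.T))         # symmetric
print(np.allclose(H @ H, H))       # idempotent: projecting twice = once
eig = np.sort(np.linalg.eigvalsh(H))
print(np.allclose(eig[-p:], 1.0), np.allclose(eig[:-p], 0.0))  # eigenvalues 0 or 1
print(np.isclose(np.trace(H), p))  # trace = dimension of the span = p

# Pythagoras: y splits into its projection and an orthogonal residual.
fitted = H @ y
resid = y - fitted
print(np.isclose(np.linalg.norm(y) ** 2,
                 np.linalg.norm(fitted) ** 2 + np.linalg.norm(resid) ** 2))
```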
32:42 So here are the assumptions we're 32:44 going to be making so that we can actually 32:46 do some statistical inference. 32:48 The first one is that the design matrix is deterministic. 32:53 So I started by saying the x-- 32:55 I have xi, yi, and maybe they're independent. 32:58 Here, they are, but the xi's, I want to think of as deterministic. 33:03 If they're not deterministic, I can condition on them, 33:06 but otherwise, it's very difficult 33:08 to think about this thing if I think of those entries 33:11 as being random, because then I have 33:14 the inverse of a random matrix, and things become very, very 33:17 complicated. 33:18 So we're going to think of those guys as being deterministic. 33:21 We're going to think of the model as being homoscedastic. 33:27 And actually, let me come back to this in a second. 33:29 Homoscedastic-- well, I mean, if you're 33:31 trying to find the etymology of this word, 33:34 "homo" means the same, "scedastic" means scaling. 33:38 So what I want to say is that the epsilons 33:40 have the same scaling. 33:41 And since my third assumption is that epsilon is Gaussian, then 33:46 essentially, what I'm going to want is that they all share 33:49 the same sigma squared. 33:52 They're independent, so this is indeed sigma squared times the identity 33:55 covariance matrix. 33:56 And I want them to be centered, as well. 33:58 That means that there's no direction 34:00 that I'm always privileging when I'm moving away from my plane 34:04 there. 34:05 So these are important conditions. 34:09 It depends on how much inference you want to do. 34:13 If you want to write t-tests, you need all these assumptions. 34:16 But if you only want to write, for example, the fact 34:19 that your least squares estimator is consistent, 34:23 you really just need the fact that epsilon 34:25 has variance sigma squared. 34:27 The fact that it's Gaussian won't matter, just 34:29 like Gaussianity doesn't matter for the law of large numbers. 34:33 Yeah? 34:34 AUDIENCE: So the first assumption 34:35 that x has to be deterministic, but I just 34:38 made up this x1, x2-- 34:40 PHILIPPE RIGOLLET: x is the matrix. 34:41 AUDIENCE: Yeah. 34:42 So those are random variables, right? 34:45 PHILIPPE RIGOLLET: No, that's the assumption. 34:47 AUDIENCE: OK. 34:49 So I mean, once we collect the data and put it in the matrix, 34:52 it becomes deterministic. 34:54 So maybe I'm missing something. 34:55 PHILIPPE RIGOLLET: Yeah. 34:56 So this is for the purpose of the analysis. 35:00 I can actually assume that-- 35:01 I look at my data, and I think of this. 35:04 So what is the difference between thinking 35:06 of data as deterministic or thinking of it as random? 35:08 When I talked about random data, the only assumptions 35:11 that I made were about the distribution. 35:12 I said, well, if my x is a random variable, 35:14 I want it to have this variance and I want it to have, 35:16 maybe, this distribution, things like this. 35:19 Here, I'm actually making an assumption on the values 35:25 that I see. 35:25 I'm saying that the value that you give me is-- 35:30 the matrix is actually invertible. 35:32 x transpose x will be invertible. 35:33 So I've never done that before, assuming 35:36 that some random variable-- 35:38 assuming that some Gaussian random variable was positive, 35:41 for example. 35:43 We don't do that, because there's always some probability 35:45 that things don't happen if you make things at random. 35:49 And so here, I'm just going to say, OK, forget about-- 35:52 here, it's basically a little stronger. 
35:54 I start my assumption by saying, the data that's given to me 35:58 will actually satisfy those assumptions. 36:00 And that means that I don't actually 36:02 need to make some modeling assumption on this thing, 36:05 because I'm actually putting directly 36:06 the assumption I want to see. 36:08 36:12 So here, either I know sigma squared 36:14 or I don't know sigma squared. 36:16 So is that clear? 36:16 So essentially, I'm assuming that I have this model, where 36:21 this guy, now, is deterministic, and this 36:26 is some multivariate Gaussian with mean 0 36:30 and covariance matrix identity of rn. 36:33 That's the model I'm assuming. 36:36 And I'm observing this, and I'm given this matrix x. 36:40 Where does this make sense? 36:42 You could say, well, if I think of my rows as being people 36:44 and I'm collecting genes, it's a little intense 36:48 to assume that I actually know, ahead of time, 36:50 what I'm going to be seeing, and that those things are 36:51 deterministic. 36:52 That's true, but it still does not prevent the analysis 36:55 to go through, for one. 36:56 And second, a better example might be this imaging example 37:00 that I described, where those x's are actually libraries. 37:04 Those are libraries of patterns that people 37:07 have created, maybe from deep learning nets, 37:09 or something like this. 37:10 But they've created patterns, and they 37:12 say that all images should be representable as a linear 37:14 combination of those patterns. 37:16 And those patterns are somewhere in books, 37:18 so they're certainly deterministic. 37:19 Everything that's actually written down in a book 37:21 is as deterministic as it gets. 37:22 37:29 Any questions about those assumptions? 37:30 Those are the things we're going to be working with. 37:32 There's only three of them. 37:33 One is about x. 37:35 Actually, there's really two of them. 37:37 I mean, this guy already appears here. 37:41 So there's two-- one on the noise, one on the x's. 37:44 That's it. 37:45 37:48 Those things allow us to do quite a bit. 37:51 They will allow us to-- 37:52 37:55 well, that's actually-- they allow 38:02 me to write the distribution of beta hat, which is great, 38:09 because when I know the distribution of my estimator, 38:12 I know it's fluctuations. 38:14 If it's centered around the true parameter, 38:16 I know that it's going to be fluctuating 38:19 around the true parameter. 38:20 And it should tell me what kind of distribution 38:22 the fluctuations are. 38:23 I actually know how to build confidence intervals. 38:26 I know how to build tests. 38:27 I know how to build everything. 38:29 It's just like when I told you that asymptotically, 38:31 the empirical variance was Gaussian 38:33 with mean theta and standard deviation that depended on n, 38:39 et cetera, that's basically the only thing I needed. 38:42 And this is what I'm actually getting here. 38:44 So let me start with this statement. 38:49 So remember, beta hat satisfied this, 38:52 so I'm going to rewrite it here. 38:53 38:57 So beta hat was equal to x transpose 39:02 x inverse x transpose y. 39:07 That was the definition that we found. 39:09 And now, I also know that y was equal to x beta plus epsilon. 39:17 So let me just replace y by x beta plus epsilon here. 39:20 Yeah? 39:21 AUDIENCE: Isn't it x transpose x inverse x transpose y? 39:25 PHILIPPE RIGOLLET: Yes, x transpose. 39:26 Thank you. 39:27 39:31 So I'm going to replace y by x beta plus epsilon. 39:36 So that's-- and here comes the magic. 
39:58 I have an inverse of a matrix, and then 40:00 I have the true matrix, I have the original matrix. 40:03 So this is actually the identity times beta. 40:08 And now this guy, well, this is a Gaussian, 40:11 because this is a Gaussian random vector, 40:13 and I just multiply it by a deterministic matrix. 40:18 So we're going to use the rule that if I have, say, epsilon, 40:22 which is n0 sigma, then b times epsilon is n0-- 40:29 can somebody tell me what the covariance matrix of b epsilon 40:32 is? 40:32 40:35 AUDIENCE: What is capital B in this case? 40:37 PHILIPPE RIGOLLET: It's just a matrix. 40:38 40:42 And for any matrix, I mean any matrix that I can premultiply-- 40:46 that I can postmultiply with epsilon. 40:48 Yeah? 40:49 AUDIENCE: b transpose b. 40:50 PHILIPPE RIGOLLET: b transpose? 40:52 AUDIENCE: Times b. 40:53 PHILIPPE RIGOLLET: And sigma is gone. 40:55 AUDIENCE: Oh, times sigma, sorry. 40:57 PHILIPPE RIGOLLET: That's the matrix, right? 40:59 AUDIENCE: b transpose sigma b. 41:00 PHILIPPE RIGOLLET: Almost. 41:01 41:04 Anybody wants to take a guess at the last one? 41:07 I think we've removed all other possibilities. 41:12 It's b sigma b transpose. 41:15 41:20 So if you ever answered to the question, 41:24 do you know Gaussian random vectors, 41:26 but you did not know that, there's a gap in your knowledge 41:29 that you need to fill, because that's probably 41:31 the most important property of Gaussian vectors. 41:33 When you multiply them by matrices, 41:38 you have a simple rule on how to update the covariance matrix. 41:43 So here, sigma is the identity. 41:49 And here, this is the matrix b that I had here. 41:53 So what this is is, basically, n, some multivariate n, 41:58 of course. 41:59 Then I'm going to have 0. 42:00 And so what I need to do is b times the identity times b 42:04 transpose, which is just b b transpose. 42:07 And what is it going to tell me? 42:08 It's x transpose x-- 42:12 sorry, that's inverse-- inverse x transpose, and then 42:17 the transpose of this guy, which is x x 42:21 transpose x inverse transpose. 42:25 But this matrix is symmetric, so I'm actually 42:27 not going to make the transpose of this guy. 42:30 And again, magic shows up. 42:34 Inverse times the matrix of those two guys 42:36 cancel, and so this is actually equal to beta 42:38 plus some n0 x transpose x inverse. 42:43 42:46 Yeah? 42:47 AUDIENCE: I'm a little lost on the [INAUDIBLE].. 42:49 So you define that as the b matrix, and what happens? 42:52 PHILIPPE RIGOLLET: So I just apply this rule, right? 42:54 AUDIENCE: Yeah. 42:55 PHILIPPE RIGOLLET: So if I multiply a matrix 42:57 by a Gaussian, then let's say this Gaussian had 43:01 mean 0, which is the case of epsilon here, 43:05 then the covariance matrix that I get 43:07 is b times the original covariance matrix times b 43:10 transpose. 43:11 So all I did is write this matrix times the identity 43:15 times this matrix transpose. 43:18 And the identity, of course, doesn't play any role, 43:20 so I can remove it. 43:21 It's just this matrix, then the matrix transpose. 43:23 And what happened? 43:25 So what is the transpose of this matrix? 43:27 So I used the fact that if I look at x transpose x inverse x 43:32 transpose, and now I look at the whole transpose of this thing, 43:39 that's actually equal 2. 43:40 And I use the rule that ab transpose is b transpose 43:43 a transpose-- let me finish-- 43:46 and it's x x transpose x inverse. 43:51 43:55 Yes? 43:55 AUDIENCE: I thought the-- 43:58 for epsilon, it was sigma squared. 
44:00 PHILIPPE RIGOLLET: Oh, thank you. 44:01 There's a sigma squared somewhere. 44:03 So this was sigma squared times the identity, so I can just 44:08 pick up a sigma squared anywhere. 44:10 44:14 So here, in our case, so for epsilon, this is sigma. 44:28 Sigma squared times the identity, 44:30 that's my covariance matrix. 44:31 44:33 You seem perplexed. 44:35 AUDIENCE: It's just a new idea for me 44:37 to think of a maximum likelihood estimator as a random variable. 44:41 PHILIPPE RIGOLLET: Oh, it should not be. 44:43 Any estimator is a random variable. 44:45 AUDIENCE: Oh, yeah, that's a good point. 44:48 PHILIPPE RIGOLLET: [LAUGHS] And I have not 44:52 told you that this was the maximum likelihood 44:54 estimator just yet. 44:55 The estimator is a random variable. 44:58 There's a word-- some people use estimate just 45:00 to differentiate the estimator while you're 45:03 doing the analysis with random variables and the values 45:05 when you plug in the numbers in there. 45:09 But then, of course, people use estimate because it's shorter, 45:12 so then it's confusing. 45:14 So any questions about this computation? 45:17 Did I forget any other Greek letter along the way? 45:20 All right, I think we're good. 45:22 So one thing that it says-- and actually, 45:26 thank you for pointing this out-- 45:27 I said there's actually a little hidden statement there. 45:30 By the way, this answers this question. 45:33 Beta hat is of the form beta plus something that's centered, 45:35 so it's indeed of the form Gaussian with mean beta 45:39 and covariance matrix sigma squared x transpose x inverse. 45:41 45:45 So that's very nice. 45:47 As long as x transpose x is not huge, 45:50 I'm going to have something that is close to what I want. 45:55 Oh, sorry, x transpose x inverse is not huge. 45:58 46:01 So there's a hidden claim in there, 46:05 which is that least squares estimator 46:08 is equal to the maximum likelihood estimator. 46:11 46:15 Why does the maximum likelihood estimator just 46:17 enter the picture now? 46:19 We've been talking about regression for the past 18 46:23 slides. 46:24 And we've been talking about estimators. 46:26 And I just dumped on you the least squares estimator, 46:29 but I never really came back to this thing that we know-- 46:31 maybe the method of moments, or maybe the maximum likelihood 46:35 estimator. 46:35 It turns out that those two things are the same. 46:37 But if I want to talk about a maximum likelihood estimator, 46:41 I need to have a likelihood. 46:43 In particular, I need to have a density. 46:46 And so if I want a density, I have 46:47 to make those assumptions, such as the epsilons have 46:53 this Gaussian distribution. 46:55 So why is this the maximum likelihood estimator? 46:58 Well, remember, y is x transpose beta plus epsilon. 47:04 So I actually have a bunch of data. 47:07 So what is my model here? 47:14 Well, its the family of Gaussians 47:18 on n observations with mean x beta, variance sigma 47:22 squared identity, and beta lives in rp. 47:31 Here's my family of distributions. 47:34 That's the possible distributions for y. 47:38 And so in particular, I can write the density of y. 47:41 47:47 Well, what is it? 47:48 It's something that looks like p of x-- 47:52 well, p of y, let's say, is equal to 1 over-- 47:58 so now its going to be a little more complicated, 48:00 but its sigma squared times 2 pi to the p/2 exponential minus 48:17 norm of y minus x beta squared divided by 2 sigma squared. 48:26 So that's just the multivariate Gaussian density. 
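Before moving on with the likelihood, the distributional claim just derived, beta_hat ~ N(beta, sigma^2 (X^T X)^{-1}), is easy to check by simulation. A rough Monte Carlo sketch, with made-up X, beta, and sigma and numpy assumed:

```python
import numpy as np

rng = np.random.default_rng(4)

n, p, sigma = 50, 3, 0.7
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])  # kept fixed: deterministic design
beta = np.array([1.0, -2.0, 0.5])

reps = 20000
estimates = np.empty((reps, p))
for r in range(reps):
    y = X @ beta + sigma * rng.normal(size=n)          # fresh Gaussian noise each time
    estimates[r] = np.linalg.solve(X.T @ X, X.T @ y)   # least squares estimate

theory_cov = sigma ** 2 * np.linalg.inv(X.T @ X)

print(estimates.mean(axis=0))           # close to beta: the estimator is unbiased
print(np.cov(estimates, rowvar=False))  # close to sigma^2 (X^T X)^{-1}
print(theory_cov)
```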
48:29 I just wrote it. 48:30 That's the density of a multivariate Gaussian 48:33 with mean x beta and covariance matrix sigma squared 48:36 times the identity. 48:37 That's what it is. 48:40 So you don't have to learn this by heart, 48:42 but if you are familiar with the case where p is equal to 1, 48:47 you can check that you recover what you're familiar with, 48:49 and this makes sense as an extension. 48:54 48:59 So now, I can actually write my log likelihood. 49:08 How many observations do I have of this vector y? 49:14 49:23 Do I have n observations of y? 49:25 49:30 I have just one, right? 49:33 Oh, sorry, I shouldn't have said p, this is n. 49:36 Everything is in dimension n. 49:38 So I can think of either having n independent observations 49:42 of each coordinate, or I can think 49:44 of having just one observation of the vector y. 49:47 So when I write my log likelihood, 49:50 it's just the log of the density at y. 49:54 50:09 And that's the vector y, which I can 50:13 write as minus n/2 log sigma squared 50:18 2 pi minus 1 over 2 sigma squared norm of y minus x beta squared. 50:28 And that's, again, my boldface y. 50:30 50:36 And what is my maximum likelihood estimator? 50:39 50:44 Well, this guy does not depend on beta. 50:47 And this is just a constant factor in front of this guy. 50:50 So it's the same thing as just minimizing, 50:54 because I have a minus sign, over all beta in R^p. 50:57 51:03 y minus x beta squared, and that's my least squares 51:05 estimator. 51:06 51:15 Is there anything that's unclear on this board? 51:17 Any question? 51:17 51:20 So all I used was-- so I wrote my log likelihood, which 51:23 is just the log of this expression 51:25 where y is my observation. 51:28 And that's indeed the observation that I have here. 51:32 And that was just some constant minus some constant times 51:35 this quantity that depends on beta. 51:37 So maximizing this whole thing is the same thing 51:40 as minimizing only this thing. 51:42 The minimizers are the same. 51:44 And so that tells me that I actually just 51:47 have to minimize the squared norm 51:49 to get my maximum likelihood estimator. 51:51 But this used, heavily, the fact that I could actually 51:55 write exactly what my density was, 52:03 and that when I took the log of this thing, 52:06 I had exactly the square norm that showed up. 52:09 If I had a different density, if, for example, 52:12 I assumed that my coordinates of epsilons were, say, iid 52:17 double exponential random variables. 52:18 So it's just half of an exponential on the plus side, 52:21 and half of an exponential on the negatives. 52:24 So if I said that, then this would not 52:27 have the square norm that shows up. 52:28 This is really idiosyncratic to Gaussians. 52:31 If I had something else, I would have, 52:32 maybe, a different norm here, or something different 52:35 that measures the difference between y and x beta. 52:39 And that's how you come up with other maximum likelihood 52:41 estimators that lead to other estimators that 52:44 are not the least squares-- 52:45 maybe the least absolute deviation, 52:47 for example, or this fourth moment, 52:50 for example, that you suggested last time. 52:52 So I can come up with a bunch of different things, 52:55 and they might be tied-- 52:56 maybe I can come to them from the same perspective 52:59 that I came to the least squares estimator. 53:01 I said, let's just do something smart 53:03 and check, then, that it's indeed the maximum likelihood 53:06 estimator. 
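As a sanity check on the equivalence just argued, that the least squares estimator is also the Gaussian maximum likelihood estimator, one can minimize the negative log-likelihood numerically and compare with the closed form. A sketch on made-up data; numpy and scipy.optimize are assumed.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(5)

n, p, sigma = 40, 3, 0.5
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
y = X @ np.array([2.0, 1.0, -1.0]) + sigma * rng.normal(size=n)

def neg_log_lik(beta):
    # Negative of the log-likelihood from the board (sigma treated as known):
    # n/2 log(2 pi sigma^2) + ||y - X beta||^2 / (2 sigma^2)
    resid = y - X @ beta
    return 0.5 * n * np.log(2 * np.pi * sigma ** 2) + resid @ resid / (2 * sigma ** 2)

mle = minimize(neg_log_lik, x0=np.zeros(p)).x          # numerical maximum likelihood
lse = np.linalg.solve(X.T @ X, X.T @ y)                # closed-form least squares

print(np.allclose(mle, lse, atol=1e-4))  # True: same minimizer
```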
53:08 Or I could just start with the modeling on-- 53:11 and check, then, what happens-- 53:13 what was the implicit assumption that I put on my noise. 53:15 Or I could start with the assumption of the noise, 53:18 compute the maximum likelihood estimator 53:19 and see what it turns into. 53:21 53:24 So that was the first thing. 53:26 I've just proved to you the first line. 53:29 And from there, you can get what you want. 53:31 So all the other lines are going to follow. 53:34 So what is beta hat-- so for example, let's look 53:39 at the second line, the quadratic risk. 53:41 53:46 Beta hat minus beta, from this formula, 53:49 has a distribution, which is n n0, 53:53 and then I have x transpose x inverse. 53:58 AUDIENCE: Wouldn't the dimension be p on the board? 54:03 54:07 PHILIPPE RIGOLLET: Sorry, the dimension of what? 54:10 AUDIENCE: Oh beta hat minus beta. 54:11 Isn't beta only a p dimensional? 54:13 PHILIPPE RIGOLLET: Oh, yeah, you're right, you're right. 54:15 That was all p dimensional there. 54:17 54:22 Yeah. 54:23 So if b here, the matrix that I'm actually applying, 54:28 has dimension p times n-- 54:30 so even if epsilon was an n dimensional Gaussian vector, 54:34 then b times epsilon is a p dimensional Gaussian vector 54:39 now. 54:39 So that's how I switch from p to n-- 54:42 from n to p. 54:43 Thank you. 54:45 So you're right, this is beta hat minus beta is this guy. 54:50 And so in particular, if I look at the expectation 54:54 of the norm of beta hat minus beta squared, what is it? 55:01 It's the expectation of the norm of some Gaussian vector. 55:08 55:12 And so it turns out-- so maybe we don't have-- 55:15 well, that's just also a property of a Gaussian vector. 55:18 So if epsilon is n0 sigma, then the expectation 55:26 of the norm of epsilon squared is just the trace of sigma. 55:34 55:37 Actually, we can probably check this 55:41 by saying that this is the sum from j equal 1 55:44 to p of the expectation of beta hat j minus beta j squared. 55:51 55:54 Since beta j squared is the expectation-- beta j 55:57 is the expectation of beta hat. 55:59 This is actually equal to the sum from j equal 1 56:01 to p of the variance of beta hat j, 56:08 just because this is the expectation of beta hat. 56:11 And how do I read the variances in a covariance matrix? 56:15 There are just the diagonal elements. 56:17 So that's really just sigma jj. 56:25 And so that's really equal to-- 56:27 so that's the sum of the diagonal elements 56:29 of this matrix. 56:30 Let's call it sigma. 56:33 So that's equal to the trace of x transpose x inverse. 56:40 56:42 The trace is the sum of the diagonal elements of a matrix. 56:45 56:48 And I still had something else. 56:49 I'm sorry, this was sigma squared. 56:52 I forget it all the time. 56:54 So the sigma squared comes out. 56:56 It's there. 56:58 And so the sigma squared comes out 57:01 because the trace is a linear operator. 57:02 If I multiply all the entries of my matrix by the same number, 57:06 then all the diagonal elements are multiplied 57:08 by the same number, so when I sum them, 57:09 the sum is multiplied by the same number. 57:13 So that's for the quadratic risk of beta hat. 57:18 And now I need to tell you about x beta hat. 57:21 x beta hat was something that was actually telling me 57:27 that that was the point that I reported on the red line 57:30 that I estimated. 57:31 That was my x beta hat. 57:32 That was my y minus the noise. 57:40 Now, this thing here-- 57:42 so remember, we had this line, and I had my observation. 
57:47 And here, I'm really trying to measure this distance squared. 57:51 This distance is actually quite important for me 57:53 because it actually shows up in the Pythagoras theorem. 57:58 And so you could actually try to estimate this thing. 58:02 So what is the prediction error? 58:03 58:12 So we said we have y minus x beta hat, so that's 58:18 the norm of this thing we're trying to compute. 58:21 But let's write this for what it is for one second. 58:25 So we said that beta hat was x transpose 58:27 x inverse x transpose y, and we know that y is 58:31 x beta plus epsilon. 58:35 So let's write this-- 58:37 58:40 x beta plus epsilon plus x. 58:43 58:57 And actually, maybe I should not write it. 59:00 Let me keep the y for what it is now. 59:02 59:07 So that means that I have, essentially, 59:08 the identity of rn times y minus this matrix times y. 59:13 So I can factor y out, and that's 59:15 the identity of rn minus x x transpose 59:20 x inverse x transpose, the whole thing times y. 59:27 59:32 We call this matrix p because this was the projection matrix 59:37 onto the linear span of the x's. 59:41 So that means that if I take a point x and I apply p times x, 59:46 I'm projecting onto the linear span of the columns of x. 59:50 What happens if I do i minus p times x? 59:57 It's x minus px. 59:59 60:01 So if I look at the point on which-- 60:04 so this is the point on which I project. 60:07 This is x. 60:08 I project orthogonally to get p times x. 60:13 And so what it means is that this operator i 60:15 minus p is actually giving me this guy, this vector here-- 60:21 x minus p times x. 60:23 60:30 Let's say this is 0. 60:33 This means that this vector, I can put it here. 60:36 It's this vector here. 60:38 And that's actually the orthogonal projection 60:40 of x onto the orthogonal complement of the span 60:43 of the columns of x. 60:45 So if I project x, or if I look at x minus its projection, 60:51 I'm basically projecting onto two orthogonal spaces. 60:55 What I'm trying to say here is that this here 60:59 is another projection matrix p prime. 61:01 61:04 That is just the projection matrix onto the orthogonal-- 61:10 projection onto orthogonal of column span of x. 61:29 Orthogonal means the set of vectors 61:31 that's orthogonal to everyone in this linear space. 61:34 61:37 So now, when I'm doing this, this is exactly what-- 61:40 I mean, in a way, this is illustrating this Pythagoras 61:42 theorem. 61:43 And so when I want to compute the norm of this guy, the norm 61:47 squared of this guy, I'm really computing-- 61:49 if this is my y now, this is px of y, 61:52 I'm really controlling the norm squared of this thing. 61:55 62:06 So if I want to compute the norm squared-- 62:08 62:42 so I'm almost there. 62:48 So what am I projecting here onto the orthogonal projector? 62:52 So here, y, now, I know that y is 62:55 equal to x beta plus epsilon. 63:00 So when I look at this matrix p prime times y, 63:06 it's actually p prime times x beta plus p prime times 63:11 epsilon. 63:11 63:14 What's happening to p prime times x beta? 63:18 Let's look at this picture. 63:19 63:23 So we know that p prime takes any point here and projects it 63:26 orthogonally on this guy. 63:29 But x beta is actually a point that lives here. 63:33 It's something that's on the linear span. 63:36 So where do all the points that are on this line 63:39 get projected to? 63:43 AUDIENCE: The origin. 63:43 PHILIPPE RIGOLLET: The origin, to 0. 63:45 They all get projected to 0. 
63:47 And that's because I'm basically projecting 63:50 something that's on the column span of x onto its orthogonal. 63:54 So that's always 0 that I'm getting here. 63:56 64:02 So when I apply p prime to y, I'm 64:04 really just applying p prime to epsilon. 64:08 So I know that now, this, actually, 64:10 is equal to the norm of some multivariate Gaussian. 64:18 What is the size of this Gaussian? 64:20 64:22 What is the size of this matrix? 64:24 Well, I actually had it there. 64:25 It's i n, so it's n dimensional. 64:28 So it's some n dimensional Gaussian with mean 0. 64:31 And what is the covariance matrix 64:32 of p prime times epsilon? 64:34 64:39 AUDIENCE: p p transpose. 64:40 PHILIPPE RIGOLLET: Yeah, p prime p prime transpose, 64:43 which we just said p prime transpose is p prime, 64:48 so that's p prime squared. 64:49 And we see that when we project twice, 64:51 it's as if we projected only once. 64:54 So here, this is n0 p prime p prime transpose. 65:00 That's the formula for the covariance matrix. 65:05 But this guy is actually equal to p prime times p prime, 65:09 which is equal to p prime. 65:13 So now, what I'm looking for is the norm squared, and its expectation is the trace. 65:18 So that means that this whole thing here 65:20 is actually equal to the trace. 65:22 Oh, did I forget again a sigma squared? 65:24 Yeah, I forgot it only here, which is good news. 65:28 So I should assume that sigma squared is equal to 1. 65:32 So sigma squared's here. 65:34 And then what I'm left with is sigma squared 65:36 times the trace of p prime. 65:39 65:45 At some point, I mentioned that the eigenvalues of a projection 65:51 matrix were actually 0 or 1. 65:54 The trace is the sum of the eigenvalues. 65:56 So that means that the trace is going 65:58 to be an integer number, the number of non-0 eigenvalues. 66:03 And the non-0 eigenvalues are just 66:05 the dimension of the space onto which I'm projecting. 66:07 66:10 Now, I'm projecting from something of dimension n 66:15 onto the orthogonal of a space of dimension p. 66:19 What is the dimension of the orthogonal 66:21 of a space of dimension p when thought 66:23 of inside a space of dimension n? 66:26 AUDIENCE: [? 1. ?] 66:27 PHILIPPE RIGOLLET: N minus p-- 66:28 that's the so-called rank theorem, I guess, as a name. 66:32 And so that's how I get this n minus p here. 66:35 This is really just equal to n minus p. 66:40 Yeah? 66:40 AUDIENCE: Here, we're taking the expectation of the whole thing. 66:43 PHILIPPE RIGOLLET: Yes, you're right. 66:44 So that's actually the expectation 66:48 of this thing that's equal to that. 66:50 Absolutely. 66:53 But I actually have something much better. 66:55 I know, even, that the norm that I'm looking at, 66:57 I know it's going to be this thing. 66:58 What is going to be the distribution of this guy? 67:00 67:03 Norm squared of a Gaussian, chi squared. 67:06 So there's going to be some chi squared that shows up. 67:09 And the number of degrees of freedom 67:10 is actually going to be also n minus p. 67:12 And maybe it's actually somewhere-- 67:16 yeah, right here-- n minus p times sigma hat 67:20 squared over sigma squared. 67:22 This is my sigma hat squared. 67:24 If I multiply by n minus p, I'm left only with this thing, 67:28 and so that means that I get sigma squared times-- 67:31 because I always forget my sigma squared-- 67:33 I get sigma squared times this thing. 67:34 And it turns out that the square norm of this guy 67:37 is actually exactly chi squared with n minus p 67:39 degrees of freedom. 
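A simulation sketch of the claim just stated: with Gaussian noise, ||y - X beta_hat||^2 / sigma^2 behaves like a chi squared with n - p degrees of freedom, so dividing the residual sum of squares by n - p gives an unbiased estimator of sigma^2. Made-up data; numpy assumed.

```python
import numpy as np

rng = np.random.default_rng(6)

n, p, sigma = 30, 4, 1.3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
beta = rng.normal(size=p)
H = X @ np.linalg.inv(X.T @ X) @ X.T     # projection onto the column span of X

reps = 20000
rss = np.empty(reps)
for r in range(reps):
    y = X @ beta + sigma * rng.normal(size=n)
    resid = y - H @ y                    # residuals = (I - H) y
    rss[r] = resid @ resid

scaled = rss / sigma ** 2
print(scaled.mean(), n - p)              # mean of a chi^2_{n-p} is n - p
print(scaled.var(), 2 * (n - p))         # its variance is 2(n - p)

sigma2_hat = rss / (n - p)               # the unbiased estimator from the slide
print(sigma2_hat.mean(), sigma ** 2)     # close to the true sigma^2
```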
67:40 67:43 So in particular, so we know that the expectation 67:47 of this thing is equal to sigma squared times n minus p. 67:50 So if I divide both sides by n minus p, 67:53 I'm going to have something whose expectation is 67:55 sigma squared. 67:57 And this something, I can actually compute. 67:59 It depends on y, and x that I know, 68:02 and beta hat that I've just estimated. 68:04 I know what n is. 68:05 And p and n are the dimensions of my matrix x. 68:07 So I'm actually given an estimator whose expectation 68:11 is sigma squared. 68:13 And so now, I actually have an unbiased estimator 68:15 of sigma squared. 68:17 That's this guy right here. 68:19 And it's actually super useful. 68:20 68:23 So those are called the-- 68:25 this is the normalized sum of square residuals. 68:27 These are called the residuals. 68:29 Those are whatever is residual when 68:32 I project my points onto the line that I've estimated. 68:36 And so in a way, those guys-- if you go back to this picture, 68:40 this was yi and this was xi transpose beta hat. 68:47 So if beta hat is close to beta, the difference 68:49 between yi and xi transpose beta hat should 68:52 be close to my epsilon i. 68:55 It's some sort of epsilon i hat. 68:57 69:00 Agreed? 69:02 And so that means that if I think 69:04 of those as being epsilon i hat, they 69:07 should be close to epsilon i, and so their norm 69:09 should be giving me something that looks like sigma squared. 69:14 And so that's why it actually makes sense. 69:16 It's just magical that everything works out together, 69:18 because I'm not projecting on the right line, 69:21 I'm actually projecting on the wrong line. 69:23 But in the end, things actually work out pretty well. 69:27 There's one thing-- so here, the theorem 69:28 is that this thing not only has the right expectation, 69:31 but also has a chi squared distribution. 69:33 That's what we just discussed. 69:34 So here, I'm just telling you this. 69:36 But it's not too hard to believe, 69:37 because it's actually the norm of some vector. 69:40 You could make this obvious, but again, I 69:42 didn't want to bring in too much linear algebra. 69:44 So to prove this, you actually have 69:46 to diagonalize the matrix p. 69:48 So you have to invoke the eigenvalue decomposition 69:53 and the fact that the norm is invariant by rotation. 69:56 So for those who are familiar with this, what I can do 69:59 is just look at the decomposition of p 70:01 prime into u d u transpose, where this is an orthogonal matrix, 70:08 and this is a diagonal matrix of eigenvalues. 70:10 And when I look at the norm squared of this thing, 70:13 I mean, I have, basically, the norm 70:14 squared of p prime times some epsilon. 70:20 It's the norm of u d u transpose epsilon squared. 70:26 The norm of a rotation of a vector 70:28 is the same as the norm of the vector, so this guy goes away. 70:32 This is not actually-- 70:34 I mean, you don't have to care about this if you 70:36 don't understand what I'm saying, so don't freak out. 70:37 This is really for those who follow. 70:39 What is the distribution of u transpose epsilon? 70:42 70:45 I take a Gaussian vector that has covariance matrix sigma 70:50 squared times the [? identity, ?] and I basically 70:52 rotate it. 70:54 What is its distribution? 70:57 Yeah? 70:58 AUDIENCE: The same. 70:59 PHILIPPE RIGOLLET: It's the same. 71:00 It's completely invariant, because the Gaussian 71:02 thinks of all directions as being the same. 71:04 So it doesn't really matter if I take a Gaussian or a rotated 71:07 Gaussian. 
71:08 So this is also a Gaussian, so I'm 71:10 going to call it epsilon prime. 71:11 And I am left with just the norm of epsilon prime. 71:15 So this is the sum of the dj's squared times epsilon 71:23 prime j squared. 71:24 71:27 And we just said that the eigenvalues of p 71:30 are either 0 or 1, because it's a projector. 71:33 And so here, I'm going to get only 0's and 1's. 71:36 So I'm really just summing a certain number 71:39 of epsilon prime i squared. 71:42 So squares of standard Gaussians-- 71:45 sorry, with a sigma squared somewhere. 71:48 And basically, how many am I summing? 71:50 Well, n minus p, the number of non-0 eigenvalues 71:55 of p prime. 71:57 So that's how it shows up. 72:00 When you see this, what theorem am I using here? 72:05 Cochran's theorem. 72:06 This is this magic book. 72:07 I'm actually going to dump everything that I'm not going 72:09 to prove to you and say, oh, this is actually Cochran's. 72:11 No, Cochran's theorem is really just 72:12 telling me something about orthogonality of things, 72:15 and therefore, independence of things. 72:17 And Cochran's theorem was something 72:19 that I used when I wanted to use what? 72:23 That's something I used just one slide before. 72:27 Student t-test, right? 72:28 I used Cochran's theorem to see that the numerator 72:30 and the denominator of the student statistic 72:33 were independent of each other. 72:35 And this is exactly what I'm going to do here. 72:37 72:40 I'm going to actually write a test to test, maybe, 72:42 if the beta j's are equal to 0. 72:44 I'm going to form a numerator, which is beta hat minus beta. 72:49 This is normal. 72:50 And we know that beta hat has a Gaussian distribution. 72:53 I'm going to standardize by something 72:54 that makes sense to me. 72:55 And I'm not going to go into details, 72:56 because we're out of time. 72:58 But there's the sigma hat that shows up. 72:59 And then there's a gamma j, which takes into account 73:03 the fact that my x's-- 73:06 if I look at the distribution of beta hat, which is gone, I think-- 73:12 yeah, beta is gone. 73:14 Oh, yeah, that's where it is. 73:16 The covariance matrix depends on this matrix x transpose x. 73:20 So this will show up in the variance. 73:22 In particular, diagonal elements are going to play a role here. 73:25 And so that's what my gammas are. 73:26 The gamma j is the j-th diagonal element of this matrix. 73:30 So we'll resume that on Tuesday, so 73:35 don't worry too much if this is going too fast. 73:38 I'm not supposed to cover it, but just so you 73:40 get a hint of why Cochran's theorem actually was useful. 73:45 So I don't know if we actually ended up recording. 73:51 I have your homework. 73:53 And as usual, I will give it to you outside. 73:56
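As a hedged preview of the test sketched at the very end (to be developed in the next lecture): under the Gaussian assumptions above, (beta_hat_j - beta_j) / (sigma_hat sqrt(gamma_j)), with gamma_j the j-th diagonal entry of (X^T X)^{-1}, follows a Student t distribution with n - p degrees of freedom, and that is what a test of beta_j = 0 would use. A sketch on made-up data; numpy and scipy.stats are assumed.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)

n, p = 60, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
beta = np.array([0.5, 0.0, 1.5])         # the second covariate has no effect
y = X @ beta + 0.8 * rng.normal(size=n)

XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y
resid = y - X @ beta_hat
sigma2_hat = resid @ resid / (n - p)     # unbiased estimate of sigma^2

# t statistic for H0: beta_j = 0, one coordinate at a time.
for j in range(p):
    gamma_j = XtX_inv[j, j]              # j-th diagonal element of (X^T X)^{-1}
    t_j = beta_hat[j] / np.sqrt(sigma2_hat * gamma_j)
    p_val = 2 * stats.t.sf(abs(t_j), df=n - p)
    print(j, round(t_j, 2), round(p_val, 3))   # coordinate 1 should look insignificant
```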