https://www.youtube.com/watch?v=yP1S37BiEsQ&list=PLUl4u3cNGP60uVBMaoNERc6knT_MgPKS0&index=12 Transcript 00:00 The following content is provided under a Creative 00:02 Commons license. 00:03 Your support will help MIT OpenCourseWare 00:06 continue to offer high quality educational resources for free. 00:10 To make a donation or to view additional materials 00:12 from hundreds of MIT courses, visit 00:15 MITOpenCourseWare@OCW.MIT.edu. 00:17 00:20 PHILIPPE RIGOLLET: It's because if I was not, 00:22 this would be basically the last topic we would ever see. 00:25 And this is arguably, probably the most important topic 00:29 in statistics, or at least that's probably 00:30 the reason why most of you are taking this class. 00:33 Because regression implies prediction, 00:36 and prediction is what people are after now, right? 00:39 You don't need to understand what 00:40 the model for the financial market 00:41 is if you actually have a formula 00:43 to predict what the stock prices are going to be tomorrow. 00:47 And regression, in a way, allows us to do that. 00:49 And we'll start with a very simple version of regression, 00:52 which is linear regression, which is the most standard one. 00:55 And then we'll move on to slightly more advanced notions 00:58 such as nonparametric regression. 00:59 At least, we're going to see the principles behind it. 01:02 And I'll touch upon a little bit of high dimensional regression, 01:06 which is what people are doing today. 01:09 So the goal of regression is to try 01:12 to predict one variable based on another variable. 01:16 All right, so here the notation is very important. 01:19 It's extremely standard. 01:22 It goes everywhere essentially, and essentially you're 01:25 trying to explain y as a function of x, 01:29 which is the usual y equals f of x question-- 01:33 except that, you know, if you look at a calculus class, 01:36 people tell you y equals f of x, and they give you 01:39 a specific form for f, and then you do something. 01:42 Here, we're just going to try to estimate 01:43 what this linear function is. 01:46 And this is why we often call y the explained variable 01:49 and x the explanatory variable. 01:52 All right, so we're statisticians, 01:55 so we start with data. 01:56 All right, then what does our data look like? 01:58 Well, it looks like a bunch of input, output 02:01 to this relationship. 02:03 All right, so we have a bunch of xi, yi. 02:05 Those are pairs, and I can do a scatterplot of those guys. 02:09 So each point here has an x-coordinate, which is xi, 02:14 and a y-coordinate, which is yi, and here, I 02:16 have a bunch of n points. 02:17 And I just draw them like that. 02:19 Now, the functions we're going to be interested in 02:23 are often functions of the form y equals a plus b times x, OK. 02:30 And that means that this function looks like this. 02:32 02:36 So if I do x and y, this function 02:38 looks exactly like a line, and clearly those points 02:41 are not on the line. 02:42 And it will basically never happen 02:44 that those points are on a line. 02:45 There's a famous T-shirt from, I think, 02:48 U.C. Berkeley's stat department, 02:50 that shows this picture and put a line between them 02:52 like we're going to see it. 02:53 And it says, oh, statisticians, so many points, 02:56 and you still managed to miss all of them. 02:59 And so essentially, we don't believe that this relationship 03:04 y is equal to a plus bx is true, but maybe up to some noise. 03:08 And that's where the statistics is going to come into play.
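As a rough illustration of the setup just described, points (xi, yi) scattered around a line y = a + bx that they never fall exactly on, here is a minimal simulation sketch. It assumes NumPy and Matplotlib are available; the values a = 1, b = 2, the interval 0 to 2, and the noise level are arbitrary illustration choices, not values taken from the lecture.

```python
import numpy as np
import matplotlib.pyplot as plt

# Points (x_i, y_i) that follow y = a + b*x only up to some noise epsilon.
rng = np.random.default_rng(0)
n = 50
a, b = 1.0, 2.0                      # "true" intercept and slope (illustrative)
x = rng.uniform(0, 2, size=n)        # explanatory variable
eps = rng.normal(0, 0.5, size=n)     # noise: everything the line does not explain
y = a + b * x + eps                  # explained variable

plt.scatter(x, y, label="data (x_i, y_i)")
plt.plot([0, 2], [a, a + 2 * b], "r", label="line y = a + b x")
plt.legend()
plt.show()
```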
03:11 There's going to be some random noise that's going to play out, 03:13 and hopefully the noise is going to be spread out evenly, 03:17 so that we can average it if we have enough points. 03:20 Average it out, OK. 03:22 And so this epsilon here is not necessarily due to randomness. 03:26 But again, just like we did modeling in the first place, 03:29 it essentially accounts for everything 03:30 we don't understand about this relationship. 03:33 All right, so for example-- 03:36 so here, I'm not going to be-- 03:37 give me one second, so we'll see an example in a second. 03:41 But the idea here is that if you have data, 03:44 and if you believe that it's of the form, 03:45 a plus b x plus some noise, you're 03:47 trying to find the line that will explain your data 03:50 the best, right? 03:51 In the terminology we've been using before, 03:54 this would be the most likely line that explains the data. 03:58 So we can see that it's slightly-- 03:59 we've just added another dimension 04:01 to our statistical problem. 04:02 We don't have just x's, but we have y's, and we're 04:04 trying to find the most likely explanation of the relationship 04:07 between y and x. 04:09 All right, and so in practice, the way 04:12 it's going to look like is that we're going to have basically 04:14 two parameters to find the slope b 04:17 and the intercept a, and given data, 04:20 the goal is going to be to try to find the best possible line. 04:23 All right? 04:24 So what we're going to find is not 04:25 exactly a and b, the ones that actually generate the data, 04:29 but some estimators of those parameters, a hat and b hat 04:33 constructed from the data. 04:35 All right, so we'll see that more generally, 04:38 but we're not going to go too much in the details of this. 04:40 There's actually quite a bit that you 04:42 can understand if you do what's called 04:43 univariate regression when x is actually 04:47 a real valued random variable. 04:49 So when this happens, this is called univariate regression. 04:52 04:59 And when x is in rp for p larger than or equal to 2, 05:05 this is called multivariate regression. 05:07 05:16 OK, and so here we're just trying to explain y 05:20 is a plus bx plus epsilon. 05:23 And here we're going to have something more complicated. 05:26 We're going to have y, which is equal to a plus b1, x1 plus b2, 05:33 x2 plus bp, xp plus epsilon-- 05:39 where x is equal to-- 05:42 the coordinates of x are given by x1, 2xp, rp. 05:46 OK, so it's still linear. 05:49 Right, they still add all the coordinates 05:51 of x with a coefficient in front of them, 05:53 but it's a bit more complicated than just one coefficient 05:56 for one coordinate of x, OK? 05:58 So we'll come back to multivariate regression. 06:03 Of course, you can write this as x transpose b, right? 06:08 So this entire thing here, this linear combination 06:14 is of the form x transpose b, where 06:17 b is the vector that has coordinates b1 to bp. 06:23 OK? 06:25 Sorry, here, it's in [? rd, ?] p is the natural notation. 06:31 All right, so our goal here, in the univariate one, 06:35 is to try to write the model, make sense 06:38 of this little twiddle here-- 06:40 essentially, from a statistical modeling question, 06:44 the question is going to be, what distributional assumptions 06:47 do you want to put on epsilon? 06:48 Are you going to say they're Gaussian? 06:50 Are you going to say they're binomial? 06:52 07:00 OK, are you going to say they're binomial? 07:03 Are you going to say they're Bernoulli? 
07:05 So that's going to be what we we're going to make sense of, 07:07 and then we're going to try to find a method 07:10 to estimate a and b. 07:11 And then maybe we're going to try to do 07:13 some inference about a and b-- 07:15 maybe test if a and b take certain values, if they're 07:18 less than something, maybe find some confidence 07:20 regions for a and b, all right? 07:24 So why would you want to do this? 07:25 Well, I'm sure all of you have an application, if I give you 07:29 some x, you're trying to predict what y is. 07:32 Machine learning is all about doing this, right? 07:34 Without maybe trying to even understand 07:36 the physics behind this, they're saying, 07:38 well, you give me a bag of words, 07:40 I want to understand whether it's going to be a spam or not. 07:43 You give me a bunch of economic indicators, 07:47 I want you to tell me how much I should be selling my car for. 07:51 You give me a bunch of measurements on some patient, 07:55 I want you to predict how this person is 07:57 going to respond to my drug-- and things like this. 08:00 All right, and often we actually don't have much modeling 08:04 intuition about what the relationship between x and y 08:07 is, and this linear thing is basically the simplest function 08:10 we can think of. 08:11 Arguably, linear functions are the simplest functions 08:15 that are not trivial. 08:16 Otherwise, we would just say, well, let's just predict x of y 08:19 to be a constant, meaning it does not depend on x. 08:21 But if you want it to depend on x, then 08:23 your functions are basically as simple as it gets. 08:25 It turns out, amazingly, this does the trick quite often. 08:30 So for example, if you look at economics, 08:33 you might want to assume that the demand is 08:35 a linear function of the price. 08:38 So if your price is zero, there's 08:40 going to be a certain demand. 08:41 And as the price increases, the demand is going to move. 08:45 Do you think b is going to be positive or negative here? 08:47 08:51 What? 08:52 Typically, it's negative unless we're 08:53 talking about maybe luxury goods, 08:56 where you know, the more expensive, 08:57 the more people actually want it. 09:00 I mean, if we're talking about actual economic demand, 09:02 that's probably definitely negative. 09:06 It doesn't have to be, you know, clearly linear, 09:11 so that you can actually make it linear, transform it 09:13 into something linear. 09:14 So for example, you have this like multiplicative 09:17 relationship, PV equals nRT, which is the Ideal gas law. 09:24 If you want to actually write this relationship, 09:26 if you want to predict what the pressure is 09:28 going to be as a function of the volume and the temperature-- 09:33 and well, let's assume that n is the Avogadro constant, 09:37 and let's assume that the radius is actually fixed. 09:42 Then you take the log on each side, so you get PV equals nRT. 09:47 10:03 So what that means is that log PV is equal to log nRT. 10:07 10:10 So that means log P plus log V is equal to the log nR plus log 10:23 T. So we said that R is constant, so this is actually 10:28 your constant. 10:29 I'm going to call it a. 10:31 And then that means that log P is 10:35 equal to minus log V. That log P is equal to a minus log 10:49 V plus log T. OK? 10:55 And so in particular, if I write b equal to negative 1 11:01 and c equal to plus 1, this gives me the formula 11:04 that I have here. 11:06 Now again, it might be the case that this is the ideal gas law. 
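To illustrate the variable transformation just derived, taking logs of PV = nRT to get log P = a - log V + log T with a = log(nR), here is a small sketch. It assumes NumPy; the constant nR, the ranges for V and T, and the (small, multiplicative) noise level are arbitrary illustration choices.

```python
import numpy as np

# Taking logs turns the multiplicative relationship PV = nRT into something
# linear in (log V, log T):  log P = a + b*log V + c*log T  with
# a = log(nR), b = -1, c = +1.  All numbers below are illustrative only.
rng = np.random.default_rng(1)
nR = 8.314                              # treat n*R as a fixed constant
V = rng.uniform(1.0, 5.0, 200)          # volumes
T = rng.uniform(250.0, 350.0, 200)      # temperatures
P = nR * T / V * np.exp(rng.normal(0, 0.01, 200))   # small multiplicative noise

logP, logV, logT = np.log(P), np.log(V), np.log(T)
a, b, c = np.log(nR), -1.0, 1.0
print(np.allclose(logP, a + b * logV + c * logT, atol=0.05))   # approximately True
```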
11:10 So in practice, if I start recording pressure, 11:12 and temperature, and volume, I might make measurement errors, 11:16 there might be slightly different conditions 11:18 in such a way that I'm not going to get exactly those. 11:21 And I'm just going to put this little twiddle 11:23 to account for the fact that the points that I'm 11:25 going to be recording for log pressure, 11:28 log volume, and log temperature are not going 11:30 to be exactly on one line. 11:32 OK, they're going to be close. 11:33 Actually, in those physics experiments, 11:36 usually, they're very close because the conditions 11:39 are controlled under lab experiments. 11:41 So it means that the noise is very small. 11:44 But for other cases, like demand and prices, 11:47 it's not a law of physics, and so this must change. 11:50 Even the linear structure is probably not clear, right. 11:53 At some points, there's probably going 11:54 to be some weird curvature happening. 11:57 All right, so this slide is just to tell you maybe you 12:00 don't have, obviously, a linear relationship, 12:03 but maybe you do if you start taking 12:04 logs exponentials, squares. 12:08 You can sometimes take the product of two variables, 12:10 things like this, right. 12:12 So this is variable transformation, 12:13 and it's mostly domain-specific, so we're not 12:15 going to go into more details of this. 12:18 Any questions? 12:19 12:22 All right, so now I'm going to be giving-- 12:27 so if we start thinking a little more about what 12:29 these coefficients should be, well, 12:32 remember-- so everybody's clear why 12:34 I don't put the little i here? 12:36 12:41 Right, I don't put the little i because I'm just 12:43 talking about a generic x and a generic y, 12:47 but the observations are x1, y1, right. 12:49 So typically, on the blackboard I'm 12:53 often going to write only xy, but the data really is x1, 13:02 y1, all the way to xn, yn. 13:07 So those are those points in this two dimensional plot. 13:10 But I think of those as being independent copies of the pair 13:21 xy. 13:24 They have to have-- 13:26 to contain their relationship. 13:27 And so when I talk about distribution 13:29 of those random variables, I talk about the distribution 13:32 of xy, and that's the same. 13:34 All right, so the first thing you might want to ask 13:36 is, well, if I have an infinite amount of data, 13:41 what can I hope to get for a and b? 13:44 If my simple size goes to infinity, 13:46 then I should actually know exactly what 13:48 the distribution of xy is. 13:50 And so there should be an a and a b 13:52 that captures this linear relationship between y and x. 13:57 And so in particular, we're going 13:59 to try to ask the population, or theoretic, values of a and b, 14:02 and you can see that you can actually 14:04 compute them explicitly. 14:05 So let's just try to find how. 14:08 So as I said, we have a bunch of points 14:10 on this line close to a line, and I'm 14:16 trying to find the best fit. 14:20 All right, so this guy is not a good fit. 14:23 This guy is not a good fit. 14:24 And we know that this guy is a good fit somehow. 14:27 So we need to mathematically formulate the fact 14:30 that this line here is better than this line here 14:35 or better than this line here. 14:37 So what we're trying to do is to create a function that 14:41 has values that are smaller for this curve 14:43 and larger for these two curves. 
14:45 And the way we do it is by measuring the fit, 14:47 and the fit is essentially the aggregate distance 14:51 of all the points to the curve. 14:55 And there's many ways I can measure 14:56 the distance to a curve. 14:58 So if I want to find so-- let's just open a parenthesis. 15:01 If I have a point here-- so we're 15:03 going to do it for one point at a time. 15:05 So if I have a point, there's many ways 15:07 I can measure its distance to the curve, right? 15:09 I can measure it like that. 15:12 That is one distance to the curve. 15:14 I can measure it like that by having a right angle here that 15:19 is one distance to the curve. 15:20 Or I can measure it like that. 15:23 That is another distance to the curve, right. 15:27 There's many ways I can go for it. 15:29 It turns out that one is actually 15:31 going to be fairly convenient for us, 15:33 and that's the one that says, let's look at the square 15:36 of the value of x on the curve. 15:38 So if this is the curve, y is equal to a plus bx. 15:43 15:51 Now, I'm going to think of this point as a random point, 15:54 capital X, capital Y, so that means 15:57 that it's going to be x1, y1 or x2, y2, et cetera. 16:02 Now, I want to measure the distance. 16:04 Can somebody tell me which of the three-- 16:06 the first one, the second one, or the third one-- 16:08 this formula, expectation of y minus a minus bx squared is-- 16:13 which of the three is it representing? 16:18 AUDIENCE: The second one. 16:20 PHILIPPE RIGOLLET: The second one 16:21 where I have the right angle? 16:22 OK, everybody agrees with this? 16:26 Anybody wants to vote for something else? 16:28 Yeah? 16:29 AUDIENCE: The third one? 16:30 PHILIPPE RIGOLLET: The third one? 16:31 Everybody agrees with the third one? 16:34 So by default, everybody's on the first one? 16:38 Yeah, it is the vertical distance actually. 16:42 And the reason is if it was the one with the straight angle, 16:44 with the right angle, it would actually 16:46 be a very complicated mathematical formula, 16:48 so let's just see y, right? 16:51 And by y, I mean y. 16:53 OK, so this means that this is my x, and this is my y. 16:59 17:02 All right, so that means that this point is xy. 17:05 So what I'm measuring is the difference 17:07 between y minus a plus b times x. 17:15 This is the thing I'm going to take the expectation off-- 17:18 the square and then the expectation-- so a 17:20 plus b times x, if this is this line, this is this point. 17:24 So that's this value here. 17:27 This value here is a plus bx, right? 17:33 So what I'm really measuring is the difference 17:35 between y and N plus bx, which is this distance here. 17:38 17:42 And since I like things like Pythagoras theorem, 17:45 I'm actually going to put a square here 17:47 before I take the expectation. 17:51 So now this is a random variable. 17:53 This is this random variable. 17:55 And so I want a number, so I'm going to turn it 17:58 into a deterministic number. 18:00 And the way I do this is by taking expectation. 18:03 And if you think expectations should be close to average, 18:07 this is the same thing as saying, 18:09 I want that in average, the y's are 18:12 close to the a plus bx, right? 18:14 So we're doing it in expectation, 18:16 but that's going to translate into doing it 18:18 in average for all the points. 18:20 All right, so this is the thing I want to measure. 18:22 So that's this vertical distance. 18:24 Yeah? 18:26 OK. 18:26 18:32 This is my fault actually. 18:36 Maybe we should close those shades. 
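The vertical-distance criterion just described, the expectation of (Y - (a + bX)) squared, can be evaluated on data by replacing the expectation with an average, which makes it easy to compare a good candidate line with a bad one. A minimal sketch, again with arbitrary simulated data and NumPy assumed:

```python
import numpy as np

# The fit criterion: the average squared vertical distance from the points
# to a candidate line y = a + b*x.  Smaller values mean a better fit.
rng = np.random.default_rng(2)
x = rng.uniform(0, 2, 200)
y = 1.0 + 2.0 * x + rng.normal(0, 0.5, 200)   # illustrative data

def squared_fit(a, b, x, y):
    """Empirical version of E[(Y - (a + b X))^2]."""
    return np.mean((y - (a + b * x)) ** 2)

print(squared_fit(1.0, 2.0, x, y))   # near the noise variance (about 0.25)
print(squared_fit(0.0, 0.0, x, y))   # a bad line: much larger
```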
18:37 18:50 OK, I cannot do just one at a time, sorry. 18:53 19:11 All right, so now that I do those vertical distances, 19:15 I can ask-- well, now, I have this function, 19:18 right-- to have a function that takes two parameters a and b, 19:22 maps it to the expectation of y minus a plus bx squared. 19:30 Sorry, the square is here. 19:32 And I could ask, well, this is a function that 19:35 measures the fit of the parameters a and b, right? 19:38 This function should be small. 19:40 The value of this function here, function 19:45 of a and b that measures how close the point xy is 20:07 to the line a plus b times x while y 20:14 is equal to a plus b times x in expectation. 20:18 20:23 OK, agreed? 20:24 This is what we just said. 20:27 Again, if you're not comfortable with the reason why 20:29 you get expectations, just think about having data points 20:32 and taking the average value for this guy. 20:34 So it's basically an aggregate distance 20:36 of the points to their line. 20:41 OK, everybody agrees this is a legitimate measure? 20:44 If all my points were on the line-- if my distribution-- 20:48 if y was actually equal to a plus bx for some a 20:51 and b then this function would be equal to 0 20:54 for the correct a and b, right? 20:57 If they are far-- well, it's going 20:59 to depend on how much noise I'm getting, 21:01 but it's still going to be minimized for the best one. 21:04 So let's minimize this thing. 21:06 So here, I don't make any-- 21:11 again, sorry. 21:12 I don't make an assumption on the distribution of x or y. 21:21 Here, I assume, somehow, that the variance of x 21:27 is not equal to 0. 21:28 Can somebody tell me why? 21:29 Yeah? 21:30 AUDIENCE: Not really a question-- the slides, 21:33 you have y minus a minus bx quantity squared expectation 21:38 of that, and here you've written square of the expectation. 21:41 PHILIPPE RIGOLLET: No, here I'm actually 21:42 in the expectation of the square. 21:46 If I wanted to write the square of the expectation, 21:49 I would just do this. 21:52 So let's just make it clear. 21:53 22:00 Right? 22:01 Do you want me to put an extra set of parenthesis? 22:03 That's what you want me to do? 22:06 AUDIENCE: Yeah, it's just confusing with the [INAUDIBLE] 22:11 PHILIPPE RIGOLLET: OK, that's the one that makes sense, so 22:13 the square of the expectation? 22:14 AUDIENCE: Yeah. 22:15 PHILIPPE RIGOLLET: Oh, the expectation of the square, 22:17 sorry. 22:17 22:20 Yeah, dyslexia. 22:22 All right, any question? 22:25 Yeah? 22:25 AUDIENCE: Does this assume that the error is Gaussian? 22:28 PHILIPPE RIGOLLET: No. 22:29 22:32 AUDIENCE: I mean, in the sense that like, 22:34 if we knew that the error was, like, 22:36 even the minus followed like-- so even the minus x 22:40 to the fourth distribution, would we want to minimise 22:44 the expectation of what the fourth power of y minus 22:48 a equals bx in order to get [? what the ?] [? best is? ?] 22:52 PHILIPPE RIGOLLET: Why? 22:53 22:57 So you know the answers to your question, 22:59 so I just want you to use the words that-- 23:01 right, so why would you want to use the fourth power? 23:04 AUDIENCE: Well, because, like, we 23:06 want to more strongly penalize deviations 23:08 because we'd expect very large deviations to be 23:11 very rare, or more rare, than it would 23:15 with the Gaussian [INAUDIBLE] power. 23:18 PHILIPPE RIGOLLET: Yeah so, that would be the maximum likely 23:19 estimator that you're describing to me, right? 
23:21 I can actually write the likelihood 23:22 of a pair of numbers ab. 23:25 And if I know this, that's actually 23:26 what's going to come into it because I 23:28 know that the density is going to come into play when 23:31 I talk about there. 23:32 But here, I'm just talking about-- 23:34 this is a mechanical tool. 23:36 I'm just saying, let's minimize the distance to the curve. 23:39 Another thing I could have done is take the absolute value 23:42 of this thing, for example. 23:43 I just decided to take the square root before I did it. 23:46 OK, so regardless of what I'm doing, 23:48 I'm just taking the squares because that's just 23:50 going to be convenient for me to do my computations for now. 23:53 But we don't have any statistical model 23:55 at this point. 23:56 I didn't say anything-- that y follows this. 23:59 X follows this. 24:00 I'm just doing minimal assumptions 24:01 as we go, all right? 24:04 So the variance of x is not equal to 0? 24:06 Could somebody tell me why? 24:07 24:11 What would my cloud point look like if the variance of x 24:14 was equal to 0? 24:16 Yeah, they would all be at the same point. 24:18 So it's going to be hard for me to start fitting in a line, 24:20 right? 24:21 I mean, best case scenario, I have this x. 24:24 It has variance, zero, so this is the expectation of x. 24:26 And all my points have the same expectation, 24:31 and so, yes, I could probably fit that line. 24:33 But that wouldn't help very much for other x's. 24:38 So I need a bit of variance so that things spread out 24:41 a little bit. 24:42 24:47 OK, I'm going to have to do this. 24:51 I think it's just my-- 24:52 25:10 All right, so I'm going to put a little bit of variance. 25:13 And the other thing is here, I don't want to do much more, 25:15 but I'm actually going to think of x as having means zero. 25:22 And the way I do this is as follows. 25:24 Let's define x tilde, which is x minus the expectation of x. 25:30 OK, so definitely the expectation of x tilde is what? 25:33 25:36 Zero, OK. 25:38 And so now I want to minimize in ab, expectation 25:43 of y minus a plus b, x squared. 25:53 And the way I'm going to do this is by turning x into x tilde 26:03 and stuffing the extra-- 26:07 and putting the extra expectation of x into the a. 26:12 So I'm going to write this as an expectation of y minus a plus 26:19 b expectation of x-- 26:25 which I'm going to a tilde-- 26:27 and plus b x tilde. 26:30 26:33 OK? 26:35 And everybody agrees with this? 26:38 So now I have two parameters, a tilde and b, 26:41 and I'm going to pretend that now x tilde-- 26:44 so now the role of x is played by x tilde, which is now 26:50 a centered random variable. 26:53 OK, so I'm going to call this guy a tilde, 26:55 but for my computations I'm going to call it a. 26:58 So how do I find the minimum of this thing? 27:00 27:05 Derivative equal to zero, right? 27:06 So here it's a quadratic thing. 27:08 It's going to be like that. 27:09 I take the derivative, set it to zero. 27:10 So I'm first going to take the derivative with respect 27:13 to a and set it equal to zero, so that's equivalent to saying 27:16 that the expectation of-- 27:18 well, here, I'm going to pick up a 2-- 27:21 y minus a plus bx tilde is equal to zero. 27:33 And then I also have that the derivative with respect to b is 27:36 equal to zero, which is equivalent to the expectation 27:40 of-- well, I have a negative sign somewhere, 27:42 so let me put it here-- 27:43 minus 2x tilde, y minus a plus bx tilde. 
27:50 27:55 OK, see that's why I don't want to put too many parenthesis. 27:58 28:03 OK. 28:05 So I just took the derivative with respect 28:07 to a, which is just basically the square, 28:09 and then I have a negative 1 that comes out from inside. 28:12 And then I take the derivative with respect 28:14 to b, and since b has x tilde. 28:17 In [? factor, ?] it comes out as well. 28:19 All right, so the minus 2's really won't matter for me. 28:24 And so now I have two equations. 28:26 The first equation, while it's pretty simple, 28:28 it's just telling me that the expectation of y minus a 28:31 is equal to zero. 28:33 So what I know is that a is equal to the expectation of y. 28:41 And really that was a tilde, which 28:44 implies that the a I want is actually 28:47 equal to the expectation of y minus b 29:00 times the expectation of x. 29:05 OK? 29:05 29:10 Just because a tilde is a plus b times the expectation of x. 29:13 29:16 So that's for my a. 29:19 And then for my b, I use the second one. 29:22 So the second one tells me that the expectation of x tilde of y 29:27 is equal to a plus b times the expectation of x tilde 29:32 which is zero, right? 29:33 29:38 OK? 29:39 But this a is actually a tilde in this problem, 29:41 so it's actually a plus b expectation of x. 29:47 29:51 Now, this is the expectation of the product 29:53 of two random variables, but x tilde is centered, right? 29:57 It's x minus expectation of x, so this thing is actually 30:00 equal to the covariance between x and y 30:03 by definition of covariance. 30:05 30:09 So now I have everything I need, right. 30:11 How do I just-- 30:14 I'm sorry about that. 30:16 So I have everything I need. 30:18 Now, I now have two equations with two unknowns, 30:22 and all I have to do is to basically plug it in. 30:25 So it's essentially telling me that the covariance of xy-- 30:29 so the first equation tells me that the covariance of xy 30:31 is equal to a plus b expectation of x, but a is expectation of y 30:36 minus b expectation of x. 30:39 So it's-- well, actually, maybe I should start with b. 30:45 30:54 Oh, sorry. 30:56 OK, I forgot one thing. 30:59 This is not true, right. 31:00 I forgot this term. 31:02 x tilde multiplies x tilde here, so what 31:05 I'm left with is x tilde-- 31:07 it's minus b times the expectation of x tilde squared. 31:11 So that's actually minus b times the variance of x 31:14 tilde because x tilde is already centered, 31:17 which is actually the variance of x. 31:19 31:23 So now I have that this thing is actually a plus b expectation 31:29 of x minus b variance of x. 31:36 And I also have that a is equal to expectation 31:42 of y minus b expectation of x. 31:45 31:53 So if I sum the two, those guys are going to cancel. 31:58 Those guys are going to cancel. 32:00 And so what I'm going to be left with is covariance of xy 32:05 is equal to expectation of x, expectation of y, 32:10 and then I'm left with this term here, minus 32:12 b times the variance of x. 32:14 32:17 And so that tells me that b-- 32:20 why do I still have the variance there? 32:21 32:34 AUDIENCE: So is the covariance really 32:37 the expectation of x tilde times y minus expectation of y? 32:43 Because y is not centered, correct? 32:46 PHILIPPE RIGOLLET: Yeah. 32:47 AUDIENCE: OK, but x is still the center. 32:48 PHILIPPE RIGOLLET: But x is still the center, right. 32:50 So you just need to have one that's 32:52 centered for this to work. 32:53 32:57 Right, I mean, you can check it. 
32:58 But basically when you're going to have 33:00 the product of the expectations, you only need one of the two 33:02 in the product to be zero. 33:03 So the product is zero. 33:04 33:09 OK, why do I keep my-- 33:11 so I get a, a, and then the b expectation. 33:14 OK, so that's probably earlier that I made a mistake. 33:16 33:25 So I get-- so this was a tilde. 33:29 Let's just be clear about the-- 33:30 33:40 So that tells me that a tilde-- 33:43 maybe it's not super fair of me to-- 33:45 33:48 yeah, OK, I think I know where I made a mistake. 33:50 I should not have centered. 33:51 I wanted to make my life easier, and I should not 33:54 have done that. 33:55 And the reason is a tilde depends on b, 33:59 so when I take the derivative with respect 34:01 to b, what I'm left with here-- 34:04 since a tilde depends on b, when I 34:06 take the derivative of this guy, I actually 34:09 don't get a tilde here, but I really get-- 34:12 34:17 so again, this was not-- 34:20 so that's the first one. 34:21 34:30 This is actually x here-- 34:33 because when I take the derivative with respect to b. 34:38 And so now, what I'm left with is that the expectation-- so 34:40 yeah, I'm basically left with nothing that helps. 34:43 So I'm sorry about. 34:46 Let's start from the beginning because this is not 34:49 getting us anywhere, and a fix is not going to help. 34:53 So let's just do it again. 34:55 Sorry about that. 34:56 So let's not center anything and just do brute force 34:59 because we're going to-- 35:01 b x squared. 35:04 All right. 35:07 Partial, with respect to a, is giving 35:09 equal zero is equivalent, so my minus 2 35:11 is going to cancel, right. 35:13 So I'm going to actually forget about this. 35:14 So it's actually telling me that the expectation 35:17 of y minus a plus bx is equal to zero, which 35:25 is equivalent to a plus b expectation of x, is 35:31 equal to the expectation of y. 35:33 Now, if I take the derivative with respect to 35:35 b and set it equal to zero, this is telling me 35:38 that the expectation of-- 35:41 well, it's the same thing except that this time I'm 35:43 going to pull out an x. 35:45 35:52 This guy is equal to zero-- 35:54 this guy is not here-- 35:56 and so that implies that the expectation of xy 36:03 is equal to a times the expectation of x, 36:09 plus b times the expectation of x square. 36:16 OK? 36:17 36:21 All right, so the first one is actually not giving me much, 36:26 so I need to actually work with the two of those guys. 36:29 So I'm going to take the first-- 36:31 so let me rewrite those two inequalities that I have. 36:33 I have a plus b, e of x is equal to e of y. 36:40 And then I have e of xy. 36:43 36:50 OK, and now what I do is that I multiply this guy. 37:01 So I want to cancel one of those things, right? 37:03 So what I'm going to-- 37:04 37:12 so I'm going to take this guy, and I'm 37:13 going to multiply it by e of x and take the difference. 37:19 So I do times e of x, and then I take the sum of those two, 37:26 and then those two terms are going to cancel. 37:28 So then that tells me that b times e 37:33 of x squared, plus the expectation of xy is equal to-- 37:45 so this guy is the one that cancelled. 37:48 37:53 Then I get this guy here, expectation 37:56 of x times the expectation of y, plus the guy that 38:02 remains here-- 38:04 which is b times the expectation of x square. 38:08 38:11 So here I have b expectation of x, the whole thing squared. 38:16 And here I have b expectation of x square. 
38:18 So if I pull this guy here, what do I get? 38:22 b times the variance of x, OK? 38:26 So I'm going to move here. 38:28 And this guy here, when I move this guy here, 38:31 I get the expectation of x times y, 38:32 minus the expectation of x times the expectation of y. 38:35 So this is actually telling me that the covariance of x and y 38:40 is equal to b times the variance of x. 38:45 And so then that tells me that b is 38:48 equal to covariance of xy divided by the variance of x. 38:55 And that's why I actually need the variance 38:57 of x to be non-zero because I couldn't do that otherwise. 39:01 And because if it was, it would mean 39:03 that b should be plus infinity, which 39:04 is what the limit of this guy is when the variance goes 39:08 to zero or negative infinity. 39:11 I cannot sort them out. 39:14 All right, so I'm sorry about the mess, 39:16 but that should be more clear. 39:19 Then a, of course, you can write it 39:21 by plugging in the value of b, so you 39:23 know it's only a function of your distribution, right? 39:27 So what are the characteristics of the distribution-- 39:29 so distribution can have a bunch of things. 39:31 It can have moments of order 4, of order 26. 39:34 It can have heavy tails or light tails. 39:36 But when you compute least squares, 39:39 the only thing that matters are the variance 39:41 of x, the expectation of the individual ones-- 39:45 and really what captures how y changes when you change x, 39:50 is captured in the covariance. 39:51 The rest is really just normalization. 39:54 It's just telling you, I want things to cross the y-axis 39:58 at the right place. 39:59 I want things to cross the x-axis at the right place. 40:02 But the slope is really captured by how much more covariance 40:05 you have relative to the variance of x. 40:08 So this is essentially setting the scale for the x-axis, 40:12 and this is telling you for a unit scale, 40:15 this is the unit of y that you're changing. 40:20 OK, so we have explicit forms. 40:23 And what I could do, if I wanted to estimate those things, 40:26 is just say, well again, we have expectations, right? 40:32 The expectation of xy minus the product of the expectations, 40:36 I could replace expectations by averages 40:38 and get an empirical covariance just 40:40 like we can replace the expectations for the variance 40:42 and get a sample variance. 40:44 And this is basically what we're going to be doing. 40:47 All right, this is essentially what you want. 40:49 The problem is that if you view it that way, 40:51 you sort of prevent yourself from being able to solve 40:54 the multivariate problem. 40:56 Because it's only in the univariate problem 40:58 that you have closed form solutions for your problem. 41:00 But if you actually go to multivariate, 41:03 this is not where you want to replace expectations 41:05 by averages. 41:06 You actually want to replace expectation by averages here. 41:09 41:12 And once you do it here, then you 41:14 can actually just solve the minimisation problem. 41:17 41:23 OK, so one thing that arises from this guy 41:29 is that this is an interesting formula. 41:35 41:40 All right, think about it. 41:43 If I have that y is a plus bx plus some noise. 42:00 Things are no longer on something. 42:02 I have that y is equal to a plus bx plus some noise, which 42:08 is usually denoted by epsilon. 42:11 So that's the distribution, right? 42:12 If I tell you the distribution of x, and I 42:15 say y is a plus bx plus epsilon-- 42:17 I tell you the distribution of y, 42:18 and if [?
they mean ?] that those two are independent, 42:21 you have a distribution on y. 42:23 So what happens is that I can actually always say-- well, you 42:27 know, this is equivalent to saying 42:28 that epsilon is equal to y minus a plus bx, right? 42:35 I can always write this as just-- 42:37 I mean, as tautology. 42:40 But here, for those guys-- 42:42 this is not for any guy, right. 42:43 This is really for the best fit, a 42:45 and b, those ones that satisfy this gradient is 42:50 equal to zero thing. 42:51 Then what we had is that the expectation of epsilon 42:55 was equal to expectation of y minus a plus 42:59 b expectation of x by linearity of the expectation, which 43:03 was equal to zero. 43:05 So for this best fit we have zero. 43:10 Now, the covariance between x and y-- 43:13 43:17 Between, sorry, x and epsilon, is what? 43:20 Well, it's the covariance between x-- 43:23 and well, epsilon was y minus a plus bx. 43:27 43:30 Now, the covariance is bilinear, so what I have 43:33 is that the covariance of this is 43:35 the covariance of xn times y-- 43:38 sorry, of x and y, minus the variance-- well, 43:41 minus a plus b, covariance of x and x, 43:50 which is the variance of x? 43:54 43:59 Covariance of xy minus a plus b variance of x. 44:03 44:12 OK, I didn't write it. 44:13 So here I have covariance of xy is 44:16 equal to b variance of x, right? 44:17 44:34 Covariance of xy. 44:35 Yeah, that's because they cannot do that with the covariance. 44:38 44:44 Yeah, I have those averages again. 44:46 No, because this is centered, right? 44:48 Sorry, this is centered, so this is actually 44:51 equal to the expectation of x times y minus a plus bx. 44:56 45:01 The covariance is equal to the product 45:03 just because this insight is actually centered. 45:05 So this is the expectation of x times y 45:09 minus the expectation of a times the expectation of x, plus b 45:20 minus b times the expectation of x squared. 45:23 45:32 Well, actually maybe I should not really go too far. 45:34 45:38 So this is actually the one that I need. 45:40 But if I stop here, this is actually equal to zero, right. 45:47 Those are the same equations. 45:49 45:52 OK? 45:53 Yeah? 45:53 AUDIENCE: What are we doing right now? 45:55 PHILIPPE RIGOLLET: So we're just saying 45:57 that if I actually believe that this best fit was the one that 46:01 gave me the right parameters, what would 46:02 that imply on the noise itself, on this epsilon? 46:05 So here we're actually just trying 46:07 to find some necessary condition for the noise to hold-- 46:10 for the noise. 46:11 And so those conditions are, that first, the expectation 46:14 is zero. 46:15 That's what we've got here. 46:17 And then, that the covariance between the noise and x 46:20 has to be zero as well. 46:22 OK, so those are actually conditions 46:24 that the noise must satisfy. 46:26 But the noise was just not really defined as noise itself. 46:29 We were just saying, OK, if we're 46:31 going to put some assumptions on the epsilon, what 46:35 do we better have? 46:36 So the first one is that it's centered, which is good, 46:38 because otherwise, the noise would shift everything. 46:41 So now when you look at a linear regression model-- 46:45 typically, if you open a book, it doesn't start by saying, 46:48 let the noise be the difference between y 46:50 and what I actually want y to be. 46:52 It says let y be a plus bx plus epsilon. 
46:57 So conversely, if we assume that this is the model that we have, 47:02 then we're going to have to assume that epsilon-- 47:04 we're going to assume that epsilon is centered, 47:06 and that the covariance between x and epsilon is zero. 47:10 Actually, often, we're going to assume much more. 47:13 And one way to ensure that those two things are satisfied 47:17 is to assume that x is independent of epsilon, 47:19 for example. 47:21 If you assume that x is independent of epsilon, 47:23 of course the covariance is going to be zero. 47:28 Or we might assume that the conditional expectation 47:30 of epsilon, given x, is equal to zero, then that implies that. 47:35 OK, now the fact that it's centered is one thing. 47:38 So if we make this assumption, the only thing it's telling us 47:43 is that those ab's that come-- right, we started from there. 47:47 y is equal to a plus bx plus some epsilon for some a, 47:51 for some b. 47:51 What it turns out is that those a's and b's are actually 47:55 the ones that you would get by solving this expectation 47:58 of square thing. 48:00 All right, so when you asked-- 48:02 back when you were following-- 48:04 so when you asked, you know, why don't we 48:07 take the square, for example, or the power 48:10 4, or something like this-- 48:12 then here, I'm saying, well, if I have y is equal to a plus bx, 48:15 I don't actually need to put too much assumptions on epsilon. 48:19 If epsilon is actually satisfying those two things, 48:22 expectation is equal to zero and the covariance 48:25 with x is equal to zero, then the right a and b 48:28 that I'm looking for are actually the ones that 48:30 come with the square-- 48:32 not with power 4 or power 25. 48:36 So those are actually pretty weak assumptions. 48:39 If we want to do inference, we're 48:41 going to have to assume slightly more. 48:43 If we want to use T-distributions at some point, 48:45 for example, and we will, we're going 48:47 to have to assume that epsilon has a Gaussian distribution. 48:50 So if you want to start doing more statistics beyond just 48:53 like doing this least square thing, which is minimizing 48:56 the square of criterion, you're actually 48:58 going to have to put more assumptions. 48:59 But right now, we did not need them. 49:01 We only need that epsilon as mean zero and covariant 49:04 zero with x. 49:04 49:08 OK, so that was basically probabilistic, right. 49:13 If I were to do probability and I 49:14 were trying to model the relationship between two 49:17 random variables, x and y, in the form 49:20 y is a plus bx plus some noise, this is what would come out. 49:24 Everything was expectations. 49:25 There was no data involved. 49:27 So now let's go to the data problem, which is now, 49:33 I do not know what those expectations are. 49:35 In particular, I don't know what the covariance of x and y is, 49:38 and I don't know with the expectation of x 49:40 and the expectation of y r. 49:42 So I have data to do that. 49:44 So how am I going to do this? 49:45 49:49 Well, I'm just going to say, well, 49:50 if I want x1, y1, xn, yn, and I'm going 49:57 to assume that they're [? iid. ?] 49:59 And I'm actually going to assume that they 50:01 have some model, right. 50:02 So I'm going to assume that I have that a-- 50:06 so that Yi follows the same model. 50:09 50:14 So epsilon i [? rad, ?] and I won't 50:17 say that expectation of epsilon i is zero and covariance of xi, 50:23 epsilon i is equal to zero. 50:25 So I'm going to put the same model on all the data. 
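Putting the last few steps together, here is a minimal sketch of the statistical problem just set up: generate i.i.d. pairs (Xi, Yi) from Yi = a + b Xi + epsilon i, then estimate a and b by plugging sample averages into the formulas derived above, b = Cov(X, Y)/Var(X) and a = E[Y] - b E[X]. NumPy is assumed, and the true values a = 1, b = 2 and the noise level are arbitrary illustration choices.

```python
import numpy as np

# Generate i.i.d. data from Y_i = a + b*X_i + eps_i, then estimate (a, b)
# by replacing expectations with averages in
#   b = Cov(X, Y) / Var(X),   a = E[Y] - b E[X].
rng = np.random.default_rng(3)
n = 1000
a, b = 1.0, 2.0
x = rng.uniform(0, 2, n)
y = a + b * x + rng.normal(0, 0.5, n)

b_hat = np.mean(x * y) - np.mean(x) * np.mean(y)     # sample covariance of (x, y)
b_hat /= np.mean(x ** 2) - np.mean(x) ** 2           # divided by sample variance of x
a_hat = np.mean(y) - b_hat * np.mean(x)
print(a_hat, b_hat)   # should be close to (1, 2), and closer as n grows
```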
50:28 So you can see that a is not ai, and b is not bi. 50:31 It's the same. 50:32 So as my data increases, I should 50:34 be able to recover the correct things-- 50:36 as the size of my data increases. 50:39 OK, so this is what the statistical problem looks like. 50:43 You're given the points. 50:45 There is a true line from which this point 50:47 was generated, right. 50:48 There was this line. 50:49 There was a true ab that I used to draw this plot, 50:54 and that was the line. 50:55 So first I picked an x, say uniformly at random 50:59 on this interval, 0 to 2. 51:02 I said that was this one. 51:03 Then I said well, I want y to be a plus bx, 51:06 so it should be here, but then I'm 51:08 going to add some noise epsilon to go away again 51:10 back from this line. 51:13 And that's actually me, here, we actually got two points correct 51:16 on this line. 51:18 So there's basically two epsilons 51:20 that were small enough that the dots actually 51:22 look like they're on the line. 51:24 Everybody's clear about what I'm drawing? 51:27 So now of course if you're a statistician, 51:28 you don't see this. 51:29 You only see this. 51:30 And you have to recover this guy, 51:32 and it's going to look like this. 51:34 You're going to have an estimated line, which 51:36 is the red one. 51:37 And the blue line, which is the true one, the one that 51:42 actually generated the data. 51:44 And your question is, while this line corresponds 51:46 to some parameters a hat and b hat, 51:48 how could I make sure that those two lines-- how far those two 51:51 lines are? 51:52 And one way to address this question is 51:53 to say how far is a from a hat, and how far is b from b hat? 51:57 OK? 51:58 Another question, of course, that you may ask 52:00 is, how do you find a hat and b hat? 52:04 And as you can see, it's basically the same thing. 52:07 Remember, what was a-- so b was the covariance between x 52:15 and y divided by the variance of x, right? 52:21 We can rewrite this. 52:22 The expectation of xy minus expectation 52:26 of x times the expectation of y, divided 52:30 by expectation of x squared minus expectation of x, 52:35 the whole thing squared-- 52:37 OK? 52:39 If you look at the expression for b hat, 52:42 I basically replaced all the expectations by bars. 52:47 So I said, well, this guy I'm going 52:49 to estimate by an average. 52:53 So that's the xy bar, and is 1 over n, 52:59 sum from i equal 1 to n of Xi times Yi. 53:03 53:05 x bar, of course, is just the one that we're used to. 53:08 53:12 And same for y bar. 53:14 X squared bar, the one that's here, 53:20 is the average of the squares. 53:22 And x bar square is the square of the average. 53:24 53:39 OK, so you just basically replace this guy by x bar, 53:44 this guy by y bar, this guy by x square bar, 53:47 and this guy by x bar and no square. 53:52 OK, so that's basically one way to do it. 53:54 Everywhere you see an expectation, 53:56 you replace it by an average. 53:58 That's the usual statistical hammer. 54:02 You can actually be slightly more subtle about this. 54:04 54:09 And as an exercise, I invite you-- 54:12 just to make sure that you know how to do this computation, 54:14 it's going to be exactly the same kind of computations 54:17 that we've done. 54:18 But as an exercise, you can check 54:20 that if you actually look at say, well, 54:23 what I wanted to minimize here, I had an expectation, right? 54:25 54:32 And I said, let's minimize this thing. 54:35 Well, let's replace this by an average first. 54:41 54:51 And now minimize.
54:54 OK, so if I do this, it turns out 54:57 I'm going to actually get the same result. 55:00 The minimum of the average is basically-- 55:03 when I replace the average by-- sorry, 55:06 when I replace the expectation by the average 55:09 and then minimize, it's the same thing 55:11 as first minimizing and then replacing expectation 55:13 by averages in this case. 55:17 Again, this is a much more general principle 55:21 because if you don't have a closed 55:23 form for the minimum like for some, say, likelihood problems, 55:27 well, you might not actually have a possibility 55:30 to just look at what the formula looks like-- see where 55:32 the expectations show up-- and then just plug in the averages 55:35 instead. 55:36 So this is the one you want to keep in mind. 55:39 And again, as an exercise. 55:41 55:47 OK, so here, and then you do expectation 55:48 replaced by averages. 55:52 And then that's the same answer, and I encourage 55:57 you to solve the exercise. 56:00 OK, everybody's clear that this is actually the same expression 56:03 for a hat and b hat that we had before that we had for a and b 56:07 when we replaced the expectations by averages? 56:12 Here, by the way, I minimize the sum rather than the average. 56:16 It's clear to everyone that this is the same thing, right? 56:19 56:22 Yep? 56:23 AUDIENCE: [INAUDIBLE] sum replacing it [INAUDIBLE] 56:27 minimize the expectation, I'm assuming 56:29 it's switched with the derivative 56:31 on the expectation [INAUDIBLE]. 56:33 56:37 PHILIPPE RIGOLLET: So we did switch 56:39 the derivative and the expectation before you came, 56:43 I think. 56:44 56:47 All right, so indeed, the picture 56:49 was the one that we said, so visually, this 56:52 is what we're doing. 56:53 We're looking among all the lines. 56:55 For each line, we compute this distance. 56:58 So if I give you another line there 57:00 would be another set of arrows. 57:01 You look at their length. 57:02 You square it. 57:03 And then you sum it all, and you find 57:05 the line that has the minimum sum of squared lengths 57:08 of the arrows. 57:09 All right, and those are the arrows that we're looking at. 57:11 But again, you could actually think of other distances, 57:14 and you would actually get different-- 57:17 you could actually get different solutions, right. 57:19 So there's something called, mean absolute deviation, 57:22 which rather than minimizing this thing, 57:24 is actually minimizing the sum from i to co 1 to n 57:27 of the absolute value of y minus a plus bXi. 57:33 And that's not something for which 57:36 you're going to have a closed form, as you can imagine. 57:39 You might have something that's sort of implicit, 57:42 but you can actually still solve it numerically. 57:44 And this is something that people also 57:46 like to use but way, way less than the least squares one. 57:50 AUDIENCE: [INAUDIBLE] 57:52 PHILIPPE RIGOLLET: What did I just what? 57:53 AUDIENCE: [INAUDIBLE] 57:56 The sum of the absolute values of Yi minus a plus bXi. 58:02 So it's the same except I don't square here. 58:04 58:07 OK? 58:08 58:11 So arguably, you know, predicting a demand 58:18 based on price is a fairly naive problem. 58:21 Typically, what we have is a bunch of data 58:23 that we've collected, and we're hoping that, 58:25 together, they can help us do a better prediction. 58:29 All right, so maybe I don't have only the price, 58:31 but maybe I have a bunch of other social indicators. 58:35 Maybe I know the competition, the price of the competition. 
58:40 And maybe I know a bunch of other things 58:42 that are actually relevant. 58:43 And so I'm trying to find a way to combine a bunch of points, 58:48 a bunch of measures. 58:50 There's a nice example that I like, 58:52 which is people were trying to measure something 58:56 related to your body mass index, so basically 59:00 the volume of your-- the density of your body. 59:04 And the way you can do this is by just, really, 59:07 weighing someone and also putting them 59:10 in some cubic meter of water and see how much overflows. 59:13 And then you have both the volume 59:15 and the mass of this person, and you 59:20 can start computing density. 59:23 But as you can imagine, you know, 59:25 I would not personally like to go to a gym 59:27 when the first thing they ask me is to just go 59:29 in a bucket of water, and so people 59:33 try to find ways to measure this based on other indicators that 59:36 are much easier to measure. 59:38 For example, I don't know, the length of my forearm, 59:41 and the circumference of my head, and maybe my belly 59:45 would probably be more appropriate here. 59:46 And so you know, they just try to find something 59:48 that actually makes sense. 59:50 And so there's actually a nice example 59:52 where you can show that if you measure-- 59:53 I think one of the most significant 59:55 was with the circumference of your wrist. 59:56 This is actually a very good indicator of your body density. 60:02 And it turns out that if you stuff all the bunch of things 60:06 together, you might actually get a very good formula that 60:09 explains things. 60:10 All right, so what we're going to do 60:12 is rather than saying we have only one x 60:14 to explain y's, let's say we have 60:15 20 x's that we're trying to combine to explain y. 60:19 And again, just like assuming something of the form, 60:22 y is a plus b times x was the simplest thing we could do, 60:26 here we're just going to assume that we have y is a plus 60:28 b1, x1 plus b2, x2, plus b3, x3. 60:31 And we can write it in a vector form 60:33 by writing that Yi is Xi transposed b, which 60:39 is now a vector plus epsilon i. 60:42 OK, and here, on the board, I'm going 60:44 to have a hard time doing boldface, 60:46 but all these things are vectors except for y, 60:52 which is a number. 60:53 Yi is a number. 60:54 It's always the value of my y-axis. 60:57 So even if my x-axis lives on-- 60:59 this is x1, and this is x2, y is really just the real valued 61:04 function. 61:05 And so I'm going to get a bunch of points, x1,y1, 61:07 and I'm going to see how much they respond. 61:10 So for example, my body density is y, 61:13 and then all the x's are a bunch of other things. 61:16 Agreed with that? 61:17 So this is an equation that holds on the real line, 61:20 but this guy here is an r p, and this guy's an rp. 61:27 61:30 It's actually common to talk to call b, beta, 61:33 when it's a vector, and that's the usual linear regression 61:38 notation. 61:39 Y is x beta plus epsilon. 61:42 So x's are called explanatory variables. 61:45 y is called explained variable, or dependent variable, 61:50 or response variable. 61:52 It has a bunch of names. 61:53 You can use whatever you feel more comfortable with. 61:55 It should actually be explicit, right, 61:57 so that's all you care about. 61:58 62:01 Now, what we typically do is that rather-- so you 62:05 notice here, that there's actually no intercept. 62:07 If I actually fold that back down to one dimension, 62:10 there's actually a is equal to zero, right? 
62:13 If I go back to p is equal to 1, that 62:18 would imply that Yi is, well, say, beta times 62:22 x plus epsilon i. 62:24 And that's not good, I want to have an intercept. 62:27 And the way I do this, rather than writing 62:29 a plus this, and you know, just have 62:31 like an overload of notation, what I am actually doing 62:35 is that I fold back. 62:37 I fold my intercept back into my x. 62:40 62:43 And so if I measure 20 variables, 62:46 I'm going to create a 21st variable, which 62:48 is always equal to 1. 62:49 OK, so you should need to think of x as being 1. 62:52 And then x1 xp. 62:58 And sorry, xp minus 1, I guess. 63:00 OK, and now this is an rp. 63:02 63:05 I'm always going to assume that the first one is 1. 63:07 I can always do that. 63:09 If I have a table of data-- 63:11 if my data is given to me in an Excel spreadsheet-- 63:15 and here I have the density that I measured on my data, 63:19 and then maybe here I have the height, 63:22 and here I have the wrist circumference. 63:25 And I have all these things. 63:26 All I have to do is to create another column here of ones, 63:31 and I just put 1-1-1-1-1. 63:34 OK, that's all I have to do to create this guy. 63:37 Agreed? 63:39 And now my x is going to be just one of those rows. 63:43 So that's this is Xi, this entire row. 63:46 And this entry here is Yi. 63:47 63:54 So now, for my noise coefficients, 63:56 I'm still going to ask for the same thing 63:59 except that here, the covariance is not between x-- 64:04 between one random variable and another random variable. 64:07 It's between a random vector and a random variable. 64:10 OK, how do I measure the covariance between a vector 64:13 and a random variable? 64:14 64:23 AUDIENCE: [INAUDIBLE] 64:25 PHILIPPE RIGOLLET: Yeah, so basically-- 64:29 AUDIENCE: [INAUDIBLE] 64:31 PHILIPPE RIGOLLET: Yeah, I mean, the covariance vector 64:33 is equal to 0 is the same thing as [INAUDIBLE] equal to zero, 64:36 but yeah, this is basically thought of entry-wise. 64:39 For each coordinate of x, I want that the covariance 64:41 between epsilon and this coordinate of x is equal to 0. 64:47 So I'm just asking this for all coordinates. 64:50 Again, in most instances, we're going 64:52 to think that epsilon is independent 64:53 of x, and that's something we can understand without thinking 64:56 about coordinates. 64:59 Yep? 65:00 AUDIENCE: [INAUDIBLE] like what if beta equals alpha 65:03 [INAUDIBLE]? 65:04 65:06 PHILIPPE RIGOLLET: I'm sorry, can you repeat the question? 65:09 I didn't hear. 65:09 AUDIENCE: Is this the parameter of beta, a parameter? 65:12 PHILIPPE RIGOLLET: Yeah, beta is the parameter 65:13 we're looking for, right. 65:14 Just like it was the pair ab has become the whole vector of beta 65:18 now. 65:19 AUDIENCE: And what's [INAUDIBLE]?? 65:20 65:22 PHILIPPE RIGOLLET: Well, can you think of an intercept 65:25 of a function that take-- 65:26 I mean, there is one actually. 65:28 There's the one for which betas-- 65:30 all the betas that don't correspond 65:31 to the vector of all ones, so the intercept 65:35 is really the weight that I put on this guy. 65:38 That's the beta that's going to come to this guy, 65:40 but we don't really talk about intercept. 65:44 So if x lives in two dimensions, the way 65:49 you want to think about this is you 65:50 take a sheet of paper like that, so now I 65:54 have points that live in three dimensions. 65:57 So let's say one direction here is x1. 65:59 This direction is x2, and this direction is y. 
66:02 And so what's going to happen is that I'm 66:04 going to have my points that live in this three 66:07 dimensional space. 66:08 And what I'm trying to do when I'm 66:10 trying to do a linear model for those guys-- 66:12 when I assume a linear model. 66:13 What I assume is that there's a plane in those three 66:17 dimensions. 66:17 So think of this guy as going everywhere, 66:20 and there's a plane close to which all my points should be. 66:23 That's what's happening in two dimensions. 66:26 If you see higher dimensions, then congratulations to you, 66:29 but I can't. 66:30 66:33 But you know, you can definitely formalize that fairly easily 66:36 mathematically and just talk about vectors. 66:38 66:40 So now here, if I talk about the least square error estimator, 66:44 or just the least squares estimator of beta, 66:47 it's simply the same thing as before. 66:49 Just like we said-- 66:52 so remember, you should think of beta 66:56 as being both the pair a b generalized. 66:59 So we said, oh, we wanted to minimize the expectation of y 67:05 minus a plus bx squared, right? 67:13 Now, so that's in-- for p is equal to 1. 67:16 Now for p larger than or equal to 2, 67:19 we're just going to write it as y minus x transpose beta 67:28 squared. 67:29 67:34 OK, so I'm just trying to minimize this quantity. 67:37 Of course, I don't have access to this, 67:40 so what I'm going to do is I'm going to replace 67:42 my expectation by an average. 67:44 67:51 So here I'm using the notation t because beta is the true one, 67:54 and I don't want you to just-- 67:56 so here, I have a variable t that's just moving around. 67:59 And so now I'm going to take the square of this thing. 68:02 And when I minimize this over all t in rp, the argmin, 68:08 the minimum is attained at beta hat, which is my estimator. 68:19 OK? 68:20 68:25 So if I want to actually compute-- 68:29 yeah? 68:29 AUDIENCE: I'm sorry, on the last slide 68:31 did we require the expectation of [INAUDIBLE] to be zero? 68:36 PHILIPPE RIGOLLET: You mean the previous slide? 68:38 AUDIENCE: Yes. 68:38 [INAUDIBLE] 68:40 PHILIPPE RIGOLLET: So again, I'm just defining an estimator 68:42 just like I would tell you, just take the estimator that 68:45 has coordinates for everywhere. 68:46 AUDIENCE: So I'm saying like [? in that sign, ?] we'll say 68:48 the noise [? terms ?] we want to satisfy the covariance of that 68:51 [? side. ?] We also want them to satisfy expectation of each 68:55 noise term zero? 68:56 69:07 PHILIPPE RIGOLLET: And so the answer is yes. 69:09 I was just trying to think if this was captured. 69:13 So it is not captured in this guy 69:15 because this is just telling me that the expectation 69:17 of epsilon i minus expectation of some i is equal to zero. 69:23 OK, so yes I need to have that epsilon has mean zero-- 69:27 let's assume that expectation of epsilon 69:29 is zero for this problem. 69:31 69:43 And we're going to need something 69:45 about some sort of question about the variance being 69:47 not equal to zero, right, but this is going to come up later. 69:51 So let's think for one second about doing the same approach 69:54 as we did before. 69:55 Take the partial derivative with respect 69:57 to the first coordinate of t, with respect 69:59 to the second coordinate of t, with respect 70:01 to the third coordinate of t, et cetera. 70:03 So that's what we did before. 70:04 We had two equations, and we reconciled them 70:07 because it was fairly easy to solve, right?
70:10 But in general, what's going to happen 70:11 is we're going to have a system of equations. 70:13 We're going to have a system of p equations, one for each 70:17 of the coordinates of t. 70:19 And we're going to have p unknowns, each coordinate of t. 70:23 And so we're going to have this system to solve-- 70:26 actually, it turns out it's going to be a linear system. 70:28 But it's not going to be something 70:29 that we're going to be able to solve coordinate by coordinate. 70:32 It's going to be annoying to solve. 70:34 You know, you can guess what's going to happen, right. 70:36 Here, it involved the covariance between x and epsilon, right-- 70:40 that's what it involved to understand-- 70:43 sorry, the correlation between x and y, 70:47 to understand what the solution of this problem was. 70:50 In this case, there's going to be 70:52 not only the covariance between x1 and y, x2 and y, x3, et 70:57 cetera, all the way to xp and y. 70:59 There's also going to be all the cross covariances between xj 71:02 and xk. 71:04 And so this is going to be a nightmare 71:05 to solve, like, as a system. 71:08 And what we do is that we go on to using matrix notation, 71:12 so that when we take derivatives, 71:14 we talk about gradients, and then we 71:16 can invert matrices and solve linear systems in a somewhat 71:20 formal manner by just saying that, if I want to solve 71:23 the system Ax equals b-- 71:27 rather than actually solving this 71:28 for each coordinate of x individually, 71:30 I just say that x is equal to A inverse times b. 71:33 So that's really why we're going to the matrix formulation, 71:37 because we have a formalism to write that x 71:40 is the solution of the system. 71:42 I'm not telling you that this is going 71:43 to be easy to solve numerically, but at least I can write it. 71:48 And so here's how it goes. 71:51 I have a bunch of vectors. 71:52 71:55 So what are my vectors, right? 71:56 So I have X1-- 71:57 oh, by the way, I didn't actually 71:59 mention that when I put the lowercase, when 72:01 I put the subscript, I'm talking about the observation. 72:03 And when I put the superscript, I'm 72:05 talking about the coordinates, right? 72:07 So I have X1, which is equal to 1, X1 superscript 1, up to 72:13 X1 superscript p; X2, which is 1, 72:19 X2 superscript 1, up to X2 superscript p; all the way to Xn, which is 1, Xn superscript 1, up to Xn superscript p. 72:32 All right, so those are n observed x's, and then I 72:35 have y1, y2, up to yn, that come paired with those guys. 72:40 OK? 72:42 So the first thing is that I'm going 72:44 to stack those guys into some vector 72:46 that I'm going to call y. 72:47 So maybe I should put an arrow for the purpose 72:49 of the blackboard, and it's just y1 to yn. 72:53 OK, so this is a vector in Rn. 72:56 Now, if I want to stack those x's together, 72:59 I could create a long vector of size n times p, 73:03 but the problem is that I lose the role of who's a coordinate 73:05 and who's an observation. 73:08 And so it's actually nicer for me 73:10 to just put those guys next to each other 73:12 and create one new variable. 73:15 And so the way I'm going to do this is-- rather than actually 73:18 stacking those guys like that, I'm taking their transposes 73:22 and stacking them as rows of a matrix. 73:24 OK, so I'm going to create a matrix, which 73:26 here is denoted typically by-- 73:28 I'm going to write x double bar. 73:31 And here, I'm going to actually just-- so since I'm 73:33 taking those guys like this, the first column 73:35 is going to be only ones.
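The remark about matrix notation can be illustrated with a tiny numpy sketch: rather than solving a system of equations coordinate by coordinate, write it as Ax = b and hand it to a linear solver. The numbers below are arbitrary and only there to show the mechanics.

    import numpy as np

    A = np.array([[3.0, 1.0, 2.0],
                  [1.0, 4.0, 0.0],
                  [2.0, 0.0, 5.0]])
    b = np.array([1.0, 2.0, 3.0])

    # Conceptually x = A^{-1} b; numerically np.linalg.solve is preferred
    # to forming the inverse explicitly.
    x = np.linalg.solve(A, b)
    assert np.allclose(A @ x, b)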
73:37 73:40 And then I'm going to have-- 73:41 well, x1 superscript 1, up to x1 superscript p. 73:47 And here, I'm going to have xn superscript 1, up to xn superscript p. 73:52 OK, so here the number of rows is n, and the number of columns 73:57 is p. 73:58 One row per observation, one column per coordinate. 74:02 74:05 And again, I make your life miserable because this really 74:10 should be p minus 1 because I already used 74:13 the first one for this guy. 74:15 I'm sorry about that. 74:16 It's a bit painful. 74:18 So usually we don't even write what's in there, 74:20 so we don't have to think about it. 74:21 Those are just vectors of size p. 74:23 OK? 74:25 So now that I've created this thing, 74:27 I can actually just basically stack up all my models. 74:31 So Yi equals Xi transpose beta plus epsilon i, for all i 74:39 equal 1 to n. 74:41 This transforms into-- this is equivalent to saying 74:44 that the vector y is equal to the matrix x 74:47 times beta, plus a vector epsilon, 74:51 where epsilon is just epsilon 1 to epsilon n, right. 74:57 So I have just this system, which 74:59 I write as a matrix, which really just consists 75:02 of stacking up all these equations next to each other. 75:04 75:10 So now that I have this model-- this is the usual least squares 75:12 model. 75:13 And here, I want to write my least squares criterion 75:16 in terms of matrices, right? 75:17 My least squares criterion, remember, 75:19 was the sum from i equal 1 to n of Yi minus Xi transpose beta, 75:27 squared. 75:28 Well, here it's really just the sum 75:31 of the squares of the coordinates of the vector 75:35 y minus x beta. 75:37 So this is actually equal to the norm squared 75:40 of y minus x beta. 75:43 75:46 That's just the square. 75:47 The norm squared is, by definition, the sum of the squares 75:49 of the coordinates. 75:51 And so now I can actually talk about minimizing 75:53 a norm squared, and here it's going 75:56 to be easier for me to take derivatives. 75:58 All right, so we'll do that next time. 76:01
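To tie the pieces together, here is a minimal numpy sketch of the matrix form: stack the observations as rows of the design matrix, write y = X beta + epsilon, and check that the least squares criterion, the sum of squared residuals, equals the squared norm of y minus X beta. The simulated data and the true beta are made up, and using np.linalg.lstsq to compute the minimizer anticipates the derivation announced for the next lecture.

    import numpy as np

    rng = np.random.default_rng(1)
    n, p = 100, 4

    # Design matrix: n rows (observations), p columns, first column all ones.
    X = np.hstack([np.ones((n, 1)), rng.normal(size=(n, p - 1))])
    beta_true = np.array([1.0, 0.5, -2.0, 0.0])   # made-up "true" parameter
    epsilon = rng.normal(scale=0.2, size=n)
    y = X @ beta_true + epsilon                   # vector form: y = X beta + epsilon

    t = rng.normal(size=p)                        # any candidate coefficient vector
    sum_of_squares = np.sum((y - X @ t) ** 2)
    norm_squared = np.linalg.norm(y - X @ t) ** 2
    assert np.isclose(sum_of_squares, norm_squared)

    # The minimizer of the squared norm can already be computed numerically,
    # even before deriving it with gradients:
    beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)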