https://www.youtube.com/watch?v=k2inA31Gups&list=PLUl4u3cNGP60uVBMaoNERc6knT_MgPKS0&index=16 Subtitle transcript 00:00 00:00 The following content is provided under a Creative 00:02 Commons license. 00:03 Your support will help MIT OpenCourseWare 00:06 continue to offer high-quality educational resources for free. 00:10 To make a donation or to view additional materials 00:12 from hundreds of MIT courses, visit MIT OpenCourseWare 00:16 at ocw.mit.edu. 00:17 00:20 PHILIPPE RIGOLLET: So today, we're 00:21 going to close this chapter, this short chapter, 00:24 on Bayesian inference. 00:26 Again, this was just an overview of what you 00:28 can do in Bayesian inference. 00:32 And last time, we started defining 00:34 what's called Jeffreys priors. 00:36 Right? 00:36 So when you do Bayesian inference, 00:38 you have to introduce a prior on your parameter. 00:41 And we said that usually, it's something 00:43 that encodes your domain knowledge about where 00:45 the parameter could be. 00:47 But there's also some principled way to do it, 00:49 if you want to do Bayesian inference without really 00:51 having to think about it. 00:53 And for example, one of the natural priors 00:56 were those non-informative priors, right? 00:58 If you are on a compact set, it's 00:59 a uniform prior on this set. 01:01 If you're on an infinite set, you can still think of taking 01:04 the constant prior, 01:06 the one that's always equal to 1, 01:09 or proportional to 1. And that's an improper prior if you are on an infinite set. 01:14 And so another prior that you can think of, 01:17 in the case where you have a Fisher information which 01:20 is well-defined, is something called Jeffreys prior. 01:23 And this prior is a prior which is 01:25 proportional to the square root of the determinant of the Fisher 01:28 information matrix. 01:29 And if you're in one dimension, it's 01:31 basically proportional to the square root of the Fisher 01:37 information coefficient, whose inverse, we know, 01:40 is the asymptotic variance of the maximum likelihood 01:44 estimator. 01:45 And it turns out that, basically, 01:48 the square root of this thing is 01:50 one over the asymptotic standard deviation of the maximum likelihood 01:54 estimator. 01:55 And so you can compute this, right? 01:56 So you can compute it for the maximum likelihood estimator. 01:59 We know that the asymptotic variance is going 02:01 to be p times (1 minus p) in the Bernoulli 02:09 statistical experiment. 02:11 So you get one over the square root of this thing. 02:13 And for example, in the Gaussian setting, 02:16 you actually have that the Fisher information, 02:19 even in the multivariate one, is actually 02:22 going to be something like the identity matrix. 02:24 So this is proportional to 1. 02:25 It's the improper prior that you get, in this case, OK? 02:29 Meaning that, for the Gaussian setting, 02:31 no place where you center your Gaussian 02:33 is actually better than any other. 02:36 All right. 02:36 So we basically left off on this slide, 02:40 where we saw that Jeffreys priors satisfy 02:43 a reparametrization invariance-- they are invariant 02:46 by transformation of your parameter, which 02:49 is a desirable property. 02:51 And the way it works, it says that, well, if I have my prior on theta, 02:57 and then I suddenly decide that theta is not 02:59 the parameter I want to use to parameterize my problem, 03:01 actually what I want is phi of theta. 03:04 So think, for example, of theta being the mean of a Gaussian, 03:07 and phi of theta as being the mean cubed. 03:11 OK?
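To spell out the formulas just described (a short recap in LaTeX, writing $I(\theta)$ for the Fisher information; nothing here goes beyond what was said above):

\[
\pi_J(\theta) \;\propto\; \sqrt{\det I(\theta)}.
\]

For the Bernoulli experiment, $I(p) = \frac{1}{p(1-p)}$, so

\[
\pi_J(p) \;\propto\; \frac{1}{\sqrt{p(1-p)}} \;=\; p^{-1/2}(1-p)^{-1/2},
\]

which is, up to constants, the Beta(1/2, 1/2) density. For a Gaussian model with known variance, the Fisher information does not depend on the mean $\theta$, so $\pi_J(\theta) \propto 1$: the flat, improper prior, under which no center is favored over any other.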
03:11 This is a one-to-one map phi, right? 03:15 So for example, if I want to go from theta to theta cubed, 03:20 and now I decide that this is the actual parameter that I 03:22 want, well, then it means that, on this parameter, 03:26 my original prior is going to induce another prior. 03:29 And here, it says, well, this prior 03:30 is actually also Jeffreys prior. 03:33 OK? 03:33 So it's essentially telling you that, 03:35 for this new parametrization, if you take Jeffreys prior, then 03:38 you actually go back to having exactly something that's 03:41 of the form's [INAUDIBLE] of determinant of the Fisher 03:43 information, but this thing with respect 03:45 to your new parametrization All right. 03:47 And so why is this true? 03:50 Well, it's just this change of variable theorem. 03:53 So it's essentially telling you that, if you call-- 03:58 let's call p-- well, let's go pi tilde of eta prior over eta. 04:08 And you have pi of theta as the prior 04:11 over theta, than since eta is of the form phi of theta, 04:18 just by change of variable, so that's essentially 04:26 a probability result. It says that pi tilde of eta 04:33 is equal to pi of eta times d pi of theta times d 04:42 theta over d eta and-- 04:48 04:55 sorry, is that the one? 04:57 Sorry, I'm going to have to write it, 04:58 because I always forget this. 04:59 05:05 So if I take a function-- 05:07 05:14 OK. 05:14 So what I want is to check. 05:16 05:38 OK, so I want the function of eta that I can here. 05:41 And what I know is that this is h of phi of theta. 05:48 All right? 05:48 So sorry, eta is phi of theta, right? 05:51 Yeah. 05:53 So what I'm going to do is I'm going 05:54 to do the change of variable, theta is phi inverse of eta. 06:09 So eta is phi of theta, which means 06:14 that d eta is equal to d-- 06:20 well, to phi prime of theta d theta. 06:26 So when I'm going to write this, I'm going to get integral of h. 06:31 Actually, let me write this, as I 06:33 am more comfortable writing this as e 06:36 with respect to eta of h of eta. 06:40 OK? 06:40 So that's just eta according to being drawn from the prior. 06:44 And I want to write this as the integral of he of eta times 06:47 some function, right? 06:49 So this is the integral of h of phi 06:58 of theta pi of theta d theta. 07:03 Now, I'm going to do my change of variable. 07:06 So this is going to be the integral of h of eta. 07:09 And then pi of phi of-- 07:16 so theta is phi inverse of eta. 07:20 And then d theta is phi prime of theta d theta, OK? 07:27 And so what is pi of phi theta? 07:30 So this thing is proportional. 07:32 So we're in, say, dimension 1, so it's 07:33 proportional of square root of the Fisher information. 07:38 And the Fisher information, we know, 07:39 is the expectation of the square of the derivative of the log 07:44 likelihood, right? 07:45 So this is square root of the expectation 07:48 of d over d theta of log of-- 08:03 well, now, I need the density. 08:06 Well, let's just call it l of theta. 08:10 And I want this to be taken at phi inverse of eta squared. 08:17 08:19 And then what I pick up is the-- 08:21 08:23 so I'm going to put everything under the square. 08:25 So I get phi prime of theta squared d theta. 08:31 OK? 08:33 So now, I have the expectation of a square. 08:35 This does not depend, so this is-- sorry, this is l of theta. 08:38 This is the expectation of l of theta of an x, right? 08:42 That's for some variable, and the expectation here 08:44 is with respect to x. 08:45 That's just the definition of the Fisher information. 
08:49 So now I'm going to squeeze this guy into the expectation. 08:52 It does not depend on x. 08:53 It just acts as a constant. 08:55 And so what I have now is that this is actually 08:57 proportional to the integral of h 08:59 eta times the square root of the expectation with respect 09:05 to x of what? 09:06 Well, here, I have d over d theta of log of theta. 09:10 And here, this guy is really d eta over d theta, right? 09:15 09:19 Agree? 09:21 So now, what I'm really left by-- so I get d over d theta 09:24 times d-- 09:25 sorry, times d theta over d eta. 09:28 09:42 so that's just d over d eta of log of eta x. 09:51 10:00 And then this guy is now becoming d eta, right? 10:04 OK, so this was a mess. 10:06 10:09 This is a complete mess, because I actually want to use phi. 10:12 I should not actually introduce phi at all. 10:14 I should just talk about d eta over d theta type of things. 10:21 And then that would actually make my life so much easier. 10:24 OK. 10:25 I'm not going to spend more time on this. 10:26 This is really just the idea, right? 10:28 You have square root of a square in there. 10:30 And then, when you do your change of variable, 10:31 you just pick up a square. 10:32 You just pick up something in here. 10:35 And so you just move this thing in there. 10:38 You get a square. 10:38 It goes inside the square. 10:40 And so your derivative of the log likelihood 10:42 with respect to theta becomes a derivative of the log 10:44 likelihood with respect to eta. 10:46 And that's the only thing that's happening here. 10:48 I'm just being super sloppy, for some reason. 10:52 OK. 10:54 And then, of course, now, what you're left with 10:56 is that this is really just proportional. 10:59 Well, this is actually equal. 11:00 Everything is proportional, but this 11:02 is equal to the Fisher information tilde with respect 11:05 to eta now. 11:07 Right? 11:07 You're doing this with respect to eta. 11:09 And so that's your new prior with respect to eta. 11:17 OK. 11:17 So one thing that you want to do, 11:21 once you have-- so remember, when you actually 11:23 compute your posterior rate, rather 11:26 than having-- so you start with a prior, 11:29 and you have some observations, let's say, x1 to xn. 11:32 11:36 When you do Bayesian inference, rather than spitting 11:41 out just some theta hat, which is an estimator for theta, 11:45 you actually spit out an entire posterior distribution-- 11:48 11:53 pi of theta, given x1 xn. 11:57 OK? 11:57 So there's an entire distribution 11:59 on the [INAUDIBLE] theta. 12:01 And you can actually use this to perform inference, rather 12:04 than just having one number. 12:06 OK? 12:06 And so you could actually build confidence regions 12:09 from this thing. 12:10 OK. 12:11 And so a Bayesian confidence interval-- 12:16 so if your set of parameters is included in the real line, 12:21 then you can actually-- it's not even guaranteed 12:23 to be to be an interval. 12:25 So let me call it a confidence region, so a Bayesian 12:33 confidence region, OK? 12:40 So it's just a random subspace. 12:43 So let's call it r, is included in theta. 12:47 And when you have the deterministic one, 12:49 we had a definition, which was with respect to the randomness 12:53 of the data, right? 12:54 That's how you actually had a random subset. 12:57 So you had a random confidence interval. 12:59 Here, it's actually conditioned on the data, 13:02 but with respect to the randomness 13:03 that you actually get from your posterior distribution. 13:06 OK? 
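To recap the change-of-variable argument from a moment ago in one clean chain (a sketch in the lecture's notation, in dimension one, with $\eta = \phi(\theta)$, $\theta = \phi^{-1}(\eta)$, and $\tilde\pi$ the prior induced on $\eta$):

\[
\tilde\pi(\eta) \;=\; \pi\big(\phi^{-1}(\eta)\big)\left|\frac{d\theta}{d\eta}\right|
\;\propto\; \sqrt{\mathbb{E}_x\!\left[\Big(\tfrac{\partial}{\partial\theta}\log L(\theta;x)\Big)^{2}\right]}\;\left|\frac{d\theta}{d\eta}\right|
\;=\; \sqrt{\mathbb{E}_x\!\left[\Big(\tfrac{\partial}{\partial\theta}\log L(\theta;x)\,\tfrac{d\theta}{d\eta}\Big)^{2}\right]}
\;=\; \sqrt{\mathbb{E}_x\!\left[\Big(\tfrac{\partial}{\partial\eta}\log \tilde L(\eta;x)\Big)^{2}\right]}
\;=\; \sqrt{\tilde I(\eta)}.
\]

So the prior that Jeffreys prior on $\theta$ induces on $\eta$ is again proportional to the square root of the Fisher information, now computed in the $\eta$-parametrization: that is the invariance property.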
13:07 So such that the probability that your theta 13:16 belongs to this confidence region, 13:18 given x1 xn is, say, at least 1 minus alpha. 13:24 Let's just take it equal to 1 minus alpha. 13:27 OK, so that's a confidence region at level 1 minus alpha. 13:34 OK, so that's one way. 13:36 So why would you actually-- 13:38 when I actually implement Bayesian inference, 13:41 I'm actually spitting out that entire distribution. 13:44 I need to summarize this thing to communicate it, right? 13:47 I cannot just say this is this entire function. 13:49 I want to know where are the regions 13:51 of high probability, where my parameter is supposed to be? 13:54 And so here, when I have this thing, what I actually 13:56 want to have is something that says, 13:58 well, I want to summarize this thing 14:00 into some subset of the real line, in which I'm 14:03 sure that the area under the curve, here, of my posterior 14:08 is actually 1 minus alpha. 14:11 And there's many ways to do this, right? 14:13 14:16 So one way to do this is to look at level sets. 14:22 14:27 And so rather than actually-- so let's 14:29 say my posterior looks like this. 14:32 I know, for example, if I have a Gaussian distribution, 14:35 I can actually take my posterior to be-- my posterior is 14:38 actually going to be Gaussian. 14:39 14:43 And what I can do is to try to cut it here on the y-axis 14:50 so that now, the area under the curve, when I cut here, 14:54 is actually 1 minus alpha. 14:59 OK, so I have some threshold tau. 15:02 If tau goes to plus infinity, then I'm 15:05 going to have that this area under the curve 15:07 here is going to-- 15:10 15:18 AUDIENCE: [INAUDIBLE] 15:19 PHILIPPE RIGOLLET: Well, no. 15:21 So the area under the curve, when 15:23 tau is going to plus infinity, think 15:24 of when tau is just right here. 15:27 AUDIENCE: [INAUDIBLE] 15:29 PHILIPPE RIGOLLET: So this is actually going to 0, right? 15:32 And so I start here. 15:33 And then I start going down and down and down and down, 15:36 until the area actually gets up to 1 minus 15:39 alpha. 15:40 And if tau is going down to 0, then my area under the curve 15:44 is going to-- 15:44 15:48 if tau is here, I'm cutting nowhere. 15:51 And so I'm getting 1, right? 15:52 15:56 Agree? 15:56 Think of, when tau is very close to 0, 16:00 I'm cutting very far down here. 16:02 And so I'm getting some area under the curve, 16:04 which is almost everything. 16:06 And so it's going to 1 as tau goes down to 0. 16:08 Yeah? 16:09 AUDIENCE: Does this only work for [INAUDIBLE] 16:12 PHILIPPE RIGOLLET: No, it does not. 16:14 I mean-- so this is a picture. 16:17 So those two things work for all of them, right? 16:20 But when you have a bimodal one, actually, 16:22 this is actually when things start 16:23 to become interesting, right? 16:24 So when we built a frequentist confidence interval, 16:30 it was always of the form x bar plus or minus something. 16:34 But now, if I start to have a posterior that 16:36 looks like this, when I start cutting it off, 16:40 I'm going to have two-- 16:41 I mean, my confidence region is going 16:44 to be the union of those two things, right? 16:47 And it really reflects the fact that there 16:50 is this bimodal thing. 16:51 It's going to say, well, with high probability, 16:53 I'm actually going to be either here or here. 16:56 Now, the meaning here of a Bayesian confidence region 16:59 and of a frequentist confidence interval are completely distinct notions, 17:02 right?
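As a concrete illustration of the level-set construction just described, here is a small numerical sketch (assuming numpy and scipy are available; the grid, the bimodal example, and the function name hpd_region are illustrative choices, not from the lecture). It lowers the threshold tau on a discretized posterior until the region where the density exceeds tau captures mass 1 minus alpha; on a bimodal posterior the region comes out as a union of two intervals, as discussed.

import numpy as np
from scipy import stats

def hpd_region(grid, density, alpha=0.05):
    """Level-set (highest posterior density) region on an equally spaced grid.

    Keeps the highest-density grid points until their total mass reaches
    1 - alpha, which is the same as lowering the threshold tau from the top.
    Returns a boolean mask over the grid.
    """
    dx = grid[1] - grid[0]
    mass = density * dx                          # approximate posterior probabilities
    order = np.argsort(density)[::-1]            # highest-density points first
    k = np.searchsorted(np.cumsum(mass[order]), 1 - alpha) + 1
    mask = np.zeros(grid.shape, dtype=bool)
    mask[order[:k]] = True
    return mask

# Example: a bimodal posterior, a mixture of two Gaussians
grid = np.linspace(-6.0, 6.0, 2001)
dx = grid[1] - grid[0]
post = 0.5 * stats.norm.pdf(grid, -2, 0.7) + 0.5 * stats.norm.pdf(grid, 2, 0.7)
post /= (post * dx).sum()                        # normalize on the grid

mask = hpd_region(grid, post, alpha=0.05)
print("mass of the region:", (post[mask] * dx).sum())   # about 0.95
# The region is a union of two intervals, one around -2 and one around +2.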
17:03 And I'm going to work out an example with you 17:06 so that we can actually see that sometimes-- 17:08 I mean, both of them, actually you 17:10 can come up with some crazy paradoxes. 17:11 So since we don't have that much time, 17:13 I will actually talk to you about why, in some instances, 17:17 it's actually a good idea to think of Bayesian confidence 17:19 intervals rather than frequentist ones. 17:22 So before we go into more details about what 17:25 those Bayesian confidence intervals are, 17:27 let's remind ourselves what it 17:29 means to have a frequentist confidence interval. 17:33 Right? 17:33 17:46 OK. 17:46 So when I have a frequentist confidence interval, 17:49 let's say something like x bar n minus 1.96 sigma over root n 17:59 and x bar n plus 1.96 sigma over root n, 18:06 so that's the confidence interval 18:07 that you get for the mean of some Gaussian 18:10 with known variance equal to sigma squared, OK. 18:16 So what we know is that the meaning of this 18:18 is the probability that theta belongs 18:20 to this is equal to 95%, right? 18:25 And this, more generally, you can 18:27 think of being q alpha over 2. 18:29 And what you're going to get is 1 minus alpha here, OK? 18:33 So what does it mean here? 18:34 Well, it looks very much like what we have here, 18:37 except that we're not conditioning on x1 xn. 18:39 And we should not. 18:40 Because there was a question like that in the midterm-- 18:43 if I condition on x1 xn, this probability is either 0 or 1. 18:47 OK? 18:48 Because once I condition-- so here, 18:50 this probability, actually, here is with respect 18:52 to the randomness in x1 xn. 18:55 So if I condition-- 18:56 18:58 so let's build this thing, r freq, for frequentist. 19:04 19:07 Well, given x1 xn-- 19:11 and actually, I don't need to know x1 xn really. 19:13 What I need to know is what xn bar is. 19:16 Well, this thing now is what? 19:18 It's 1, if theta is in r, and it's 0, 19:22 if theta is not in r, right? 19:27 That's all there is. 19:28 This is a deterministic confidence interval, 19:29 once I condition on x1 xn. 19:32 So I have a number. 19:33 The average is maybe 3. 19:35 And so I get 3. 19:36 Either theta is between 3 minus 0.5 and 3 plus 0.5, 19:41 or it's not. 19:42 And so there's basically-- 19:44 I mean, I write it as a probability, 19:45 but it's really not a probabilistic statement. 19:47 Either it's true or not. 19:49 Agreed? 19:50 So what does it mean to have a frequentist confidence 19:52 interval? 19:53 It means that if I were-- 19:55 and here is where the word frequentist comes from-- 19:58 it says that if I repeat this experiment over and over, 20:02 meaning that on Monday, I collect a sample of size n, 20:06 and I build a confidence interval, 20:09 and then on Tuesday, I collect another sample of size n, 20:12 and I build a confidence interval, 20:13 and on Wednesday, I do this again and again, what's going 20:17 to happen is the following. 20:18 I'm going to have my true theta that lives here. 20:21 And then on Monday, this is the confidence interval 20:23 that I build. 20:25 OK, so this is the real line. 20:28 The true theta is here, and this is the confidence interval 20:31 I build on Monday. 20:32 All right? 20:32 So x bar was here, and this is my confidence interval. 20:37 On Tuesday, I build this confidence interval maybe. 20:41 x bar was closer to theta, but smaller. 20:44 But then on Wednesday, I build this confidence interval. 20:49 I'm not here. 20:50 It's not in there. 20:51 And that's this case. 20:53 Right?
20:54 It happens that it's just not in there. 20:56 And then on Thursday, I build another one. 20:57 I almost miss it, but I'm in there, et cetera. 21:01 Maybe here, Here, I miss again. 21:04 And so what it means to have a confidence interval-- so what 21:07 does it mean to have a confidence interval at 95%? 21:12 AUDIENCE: [INAUDIBLE] 21:15 PHILIPPE RIGOLLET: Yeah, so it means that if I repeat this 21:18 the frequency of times-- 21:19 hence, the word frequentist-- at which 21:21 I'm actually going to overlap that, 21:24 I'm actually going to contain theta, should be 95%. 21:26 That's what frequentist means. 21:28 So it's just a matter of trusting that. 21:31 So on one given thing, one given realization of your data, 21:35 it's not telling you anything. 21:36 [INAUDIBLE] it's there or not. 21:38 So it's not really something that's actually 21:42 something that assesses the confidence of your decision, 21:46 such as data is in there or not. 21:48 It's something that assesses the confidence 21:50 you have in the method that you're using. 21:52 If you were you repeat it over and again, 21:54 it'd be the same thing. 21:56 It would be 95% of the time correct, right? 21:58 So for example, we know that we could build a test. 22:02 So it's pretty clear that you can actually 22:04 build a test for whether theta is equal to theta naught 22:09 or not equal to theta naught, by just 22:10 checking whether theta naught is in a confidence interval 22:13 or not. 22:13 And what it means is that, if you actually 22:15 are doing those tests at 5%, that means that 5% of the time, 22:21 if you do this over and again, 5% of the time 22:23 you're going to be wrong. 22:24 I mentioned my wife does market research. 22:27 And she does maybe, I don't know, 100,000 tests a year. 22:31 And if they do all of them at 1%, 22:34 then it means that 1% of the time, which is a lot of time, 22:37 right? 22:38 When you do 100,000 a year, it's 1,000 of them 22:40 are actually wrong. 22:41 OK, I mean, she's actually hedging 22:44 against the fact that 1% of them that are going to be wrong. 22:47 That's 1,000 of them that are going to be wrong. 22:49 Just like, if you do this 100,000 times at 95%, 22:52 5,000 of those guys are actually not going 22:54 to be the correct ones. 22:56 OK? 22:56 So I mean, it's kind of scary. 22:58 But that's the way it is. 23:01 So that's with the frequentist interpretation of this is. 23:03 Now, as I mentioned, when we started this Bayesian chapter, 23:07 I said, Bayesian statistics converge to-- 23:10 I mean, Bayesian decisions and Bayesian methods converge 23:14 to frequentist methods. 23:16 When the sample size is large enough, 23:18 they lead to the same decisions. 23:20 And in general, they need not be the same, 23:22 but they tend to actually, when the sample 23:24 size is large enough, to have the same behavior. 23:27 Think about, for example, the posterior 23:30 that you have when you have in the Gaussian case, right? 23:34 We said that, in the Gaussian case, 23:36 what you're going to see is that it's 23:38 as if you had an extra observation which 23:40 was essentially given by your prior. 23:43 OK? 23:44 And now, what's going to happen is that, when this just one 23:50 observation among n plus 1, it's really 23:53 going to be totally drawn, and you 23:55 won't see it when the sample size grows larger. 23:58 So Bayesian methods are particularly useful when 24:00 you have a small sample size. 24:02 And when you have a small sample size, the effect of the prior 24:05 is going to be bigger. 
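Here is a tiny simulation of the "repeat the experiment on Monday, Tuesday, Wednesday" picture described above (a sketch, assuming numpy; the true theta, sigma, n, and the number of repetitions are arbitrary choices). It rebuilds the interval x bar plus or minus 1.96 sigma over root n on fresh data many times and checks how often the interval contains the true theta; the frequency should come out close to 95%, which is exactly the frequentist guarantee and nothing more.

import numpy as np

rng = np.random.default_rng(0)
theta, sigma, n, reps = 3.0, 1.0, 20, 100_000

hits = 0
for _ in range(reps):
    x = rng.normal(theta, sigma, size=n)      # a fresh sample, like a new day
    xbar = x.mean()
    half = 1.96 * sigma / np.sqrt(n)
    hits += (xbar - half <= theta <= xbar + half)

print("fraction of intervals containing theta:", hits / reps)   # about 0.95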
24:06 But most importantly, you're not going 24:08 to have to repeat this thing over and again. 24:10 And you're going to have a meaning. 24:11 You're going to have something 24:13 that has a meaning for this particular data set 24:15 that you have. 24:16 When I said that the probability that theta belongs to r-- 24:19 and here, I'm going to specify the fact that it's a Bayesian 24:22 confidence region, like this one-- 24:24 this is actually conditionally on the data 24:27 that you've collected. 24:29 It says, given this data, given the points that you have-- 24:32 just put in some numbers, if you want, in there-- 24:34 it's actually telling you the probability 24:36 that theta belongs to this Bayesian thing, 24:39 to this Bayesian confidence region. 24:41 Here, since I have conditioned on x1 xn, 24:44 this probability is really just with respect to theta 24:46 drawn from the posterior, right? 24:51 And so now, it has a slightly different meaning. 24:54 It's just telling you that when-- 24:57 it's really making a statement about where 24:59 the regions of high probability of your posterior are. 25:03 Now, why is that useful? 25:05 Well, there's actually an interesting story that 25:11 goes behind Bayesian methods. 25:13 Does anybody know the story of the USS-- I think it's the Scorpion? 25:17 Do you know the story? 25:18 So that was an American vessel that disappeared. 25:22 I think it was close to Bermuda or something. 25:25 But you can tell the story of the Malaysian Airlines, 25:28 except that I don't think it's such a successful story. 25:31 But the idea was essentially, we're 25:33 trying to find where this thing happened. 25:36 And of course, this is a one-time thing. 25:39 You need something that works once. 25:41 You need something that works for this particular vessel. 25:44 And you don't care, if you go to the Navy, and you tell them, 25:46 well, here's a method. 25:48 And for 95 out of 100 vessels that you're going to lose, 25:51 we're going to be able to find it. 25:53 And they want this to work for this particular one. 25:57 And so they were looking, and they were 25:59 diving in different places. 26:02 And suddenly, they brought in this guy. 26:04 I forget his name. 26:05 I mean, there's a whole story about this on Wikipedia. 26:08 And he started collecting the data 26:10 that they had from different dives and maybe from currents. 26:13 And he started to put everything in. 26:14 And he said, OK, what is the posterior distribution 26:17 of the location of the vessel, given all the things 26:21 that I've seen? 26:22 And what have you seen? 26:23 Well, you've seen that it's not here, it's not there, 26:25 and it's not there. 26:26 And you've also seen that the currents were going that way, 26:29 and the winds were going that way. 26:30 And you can actually put in some modeling 26:32 to try to understand this. 26:33 Now, given this, for this particular data that you have, 26:37 you can actually think of having a two-dimensional density that 26:41 tells you where it's more likely that the vessel is. 26:44 And where are you going to be looking? 26:46 Well, if it's a multimodal distribution, 26:48 you're just going to go to the highest mode first, 26:50 because that's where it's the most likely to be. 26:52 And maybe it's not there, so you're just 26:53 going to update your posterior, based on the fact 26:55 that it's not there, and do it again. 26:56 And actually, after two dives, I think, 26:59 he actually found the thing.
27:01 And that's exactly where Bayesian statistics 27:03 start to kick in. 27:03 Because you put a lot of knowledge into your model, 27:08 but you also can actually factor in a bunch of information, 27:11 right? 27:11 The model-- he had to build a model 27:13 that was actually taking into account the currents and the winds. 27:17 And what you can have as a guarantee is that, 27:20 when you talk about the probability 27:22 that this vessel is in this location, 27:27 given what you've observed in the past, 27:28 it actually makes some sense. 27:30 Whereas, if you were to use a frequentist approach, 27:34 then there's no probability. 27:35 Either it's underneath this position or it's not, right? 27:38 So that's actually where it starts to make sense. 27:41 And so you can actually build this. 27:43 And there are actually a lot of methods 27:44 for search that 27:47 are based on Bayesian methods. 27:48 I think, for example, the Higgs boson 27:50 was based on a lot of Bayesian methods, 27:51 because this is something you need to find [INAUDIBLE], 27:54 right? 27:54 I mean, there was a lot of prior that had to be built in. 27:57 OK. 27:57 So now, you build this confidence interval. 27:59 And the nicest way to do it is to use level sets. 28:02 But again, just like for Gaussians-- I mean, 28:05 even in the Gaussian case, I decided 28:12 to go with x bar plus or minus something, 28:16 but I could go with something that's completely asymmetric. 28:19 So what's happening is that here, this method 28:21 guarantees that you're going to have the narrowest 28:23 possible confidence intervals. 28:24 That's essentially what it's telling you, OK? 28:27 Because every time I'm choosing a point, starting from here, 28:31 I'm actually putting as much area under the curve as I can. 28:36 All right. 28:38 So those are called Bayesian confidence intervals. 28:41 Oh yeah, and I promised you that we're 28:43 going to work on some example that actually 28:46 gives a meaning to what I just told you, with actual numbers. 28:50 So this is something that's taken from Wasserman's book. 28:56 And also, it's coming from a paper, 29:01 from a stats paper, from Wolpert and I 29:03 don't know who, from the '80s. 29:05 And essentially, this is how it works. 29:07 So assume that you have n equals 2 observations. 29:10 29:14 And you have y1, so those observations are y1-- 29:18 no, sorry, let's call them x1, which 29:20 is theta plus epsilon 1, and x2, which is theta plus epsilon 2, 29:26 where epsilon 1 and epsilon 2 are iid. 29:31 And the probability that epsilon i is equal 29:33 to plus 1 is equal to the probability 29:35 that epsilon i is equal to minus 1 is equal to 1/2. 29:38 OK, so it's just the uniform sign plus minus 1, OK? 29:44 Now, let's think about this. So you're trying 29:46 to do some inference on theta. 29:47 Maybe you actually want to find some inference on theta 29:50 that's actually based on-- 29:51 and that's based only on the x1 and x2. 29:55 OK? 29:56 So I'm going to actually build a confidence interval. 29:58 But what I really want to build is a-- 30:01 30:03 but let's start thinking about how 30:05 I would find an estimator for those two things. 30:07 Well, what values am I going to be getting, right? 30:09 So I'm going to get either theta plus 1 or theta minus 1. 30:13 And actually, I can get basically four 30:15 different observations, right? 30:19 Sorry, four different pairs of observations-- 30:21 30:30 theta plus 1 with theta plus 1, theta minus 1 with theta minus 1, theta plus 1 with theta minus 1, and theta minus 1 with theta plus 1. 30:32 Agreed?
30:33 Those are the four possible observations that I can get. 30:37 Agreed? 30:38 Either they're both equal to plus 1, both equal to minus 1, 30:42 or one of the two is equal to plus 30:44 1, the other one to minus 1, or the epsilons. 30:46 OK. 30:47 So those are the four observations I can get. 30:49 So in particular, if they take the same value, 30:56 and you know it's either theta plus 1 or theta minus 1, 30:59 and if they take a different value, I know one of them 31:02 is theta plus 1, and one is actually theta minus 1. 31:04 So in particular, if I take the average of those two guys, when 31:07 they take different values, I know I'm actually 31:09 getting theta right. 31:10 So let's build a confidence region. 31:14 OK, so I'm actually going to take a confidence region, which 31:16 is just a singleton. 31:18 31:21 And I'm going to say the following. 31:23 Well, if x1 is equal to x2, I'm just going to take x1 minus 1, 31:32 OK? 31:33 So I'm just saying, well, I'm never 31:34 going to able to resolve whether it's plus 1 or minus 1 31:37 that actually gives me the best one, 31:38 so I'm just going to take a dive and say, well, it's 31:41 just plus 1. 31:42 OK? 31:44 And then, if they're different, then here, 31:47 I can do much better. 31:50 I'm going to actually just think the average. 31:52 31:56 OK? 31:58 Now, what I claim is that this is a confidence region-- 32:08 and by default, when I don't mention it, 32:10 this is a frequentist confidence region-- 32:16 at level 75%. 32:18 32:21 OK? 32:21 So let's just check that. 32:23 To check that this is correct, I need 32:24 to check that the probability under the realization of x1 32:27 and x2, that theta belongs, is one of those two guys, 32:30 is actually equal to 0.75. 32:33 Yes? 32:33 AUDIENCE: What are the [INAUDIBLE] 32:36 PHILIPPE RIGOLLET: Well, it's just the frequentist confidence 32:39 interval that does not need to be an interval. 32:41 Actually, in this case, it's going to be an interval. 32:44 But that's just what it means. 32:46 Yeah, region for Bayesian was just because-- 32:50 I mean, the confidence intervals, 32:51 when we're frequentist, we tend to make them 32:53 intervals, because we want-- 32:54 but when you're Bayesian, and you're doing this level set 32:56 thing, you cannot really guarantee, 32:58 unless its [INAUDIBLE] is going to be an interval. 33:00 So region is just a way to not have to say interval, 33:02 in case it's not. 33:03 33:06 OK. 33:06 So I have this thing. 33:08 So what I need to check is the probability that theta 33:11 is in one of those two things, right? 33:13 So what I need to find is the probability that theta 33:16 is an [INAUDIBLE] Well, x1 minus 1 and x1 is not equal to x2. 33:24 And those are disjoint events, so it's plus the probability 33:26 that theta is in x1 plus x2 over 2 and x1-- 33:35 sorry, that's equal. 33:37 That's different. 33:39 OK. 33:40 And OK, just before we actually finish the computation, 33:42 why do I have 75%? 33:44 75% is 3/4. 33:46 So it means that we have four cases. 33:48 And essentially, I did not account for one case. 33:52 And it's true. 33:52 I did not account for this case, when 33:56 the both of the epsilon i's are equal to minus 1. 34:01 Right? 34:01 So this is essentially the one I'm not going 34:03 to be able to account for. 34:04 And so we'll see that in a second. 34:06 So in this case, we know that everything goes great. 34:09 Right? 34:09 So in this case, this is-- 34:11 OK. 34:11 Well, let's just start from the first line. 
34:13 So the first line is the probability 34:15 that theta is equal to x1 minus 1 and those two are equal. 34:20 So this is the probability that theta is equal to-- 34:28 well, this is theta plus epsilon 1 minus 1. 34:36 And epsilon 1 is equal to epsilon 2, right? 34:43 Because I can remove the theta from here, 34:45 and I can actually remove the theta from here, 34:47 so that this guy here is just epsilon 1 is equal to 1. 34:50 So when I intersect with this guy, 34:52 it's actually the same thing as epsilon 1 is equal to 1, 34:54 as well-- 34:56 epsilon 2 is equal to 1, as well, OK? 34:59 So this first thing is actually equal to the probability 35:05 that epsilon 1 is equal to 1 and epsilon 2 is equal to 1, 35:10 which is equal to what? 35:14 AUDIENCE: [INAUDIBLE] 35:15 PHILIPPE RIGOLLET: Yeah, 1/4, right? 35:17 So that's just the first case over there. 35:19 They're independent. 35:21 Now, I still need to do the second one. 35:23 So this case is what? 35:24 Well, x1 plus x2 over 2 35:28 is what? 35:29 Well, I get theta plus epsilon 1 plus epsilon 2 over 2. 35:31 So that's just equal to the probability 35:33 that epsilon 1 plus epsilon 2 over 2 is equal to 0 35:39 and epsilon 1 is different from epsilon 2. 35:43 Agreed? 35:44 35:46 I just removed the thetas from these equations, because I can. 35:49 They're just on both sides every time. 35:51 35:54 OK. 35:55 And so that means what? 35:56 That means that the second part-- so this thing 35:58 is actually equal to 1/4 plus the probability 36:02 that epsilon 1 plus epsilon 2 over 2 is equal to 0. 36:05 I can remove the 2. 36:06 So this is just the probability that one is 1, 36:08 and the other one is minus 1, right? 36:10 So that's equal to the probability 36:12 that epsilon 1 is equal to 1 and epsilon 2 is equal to minus 1 36:17 plus the probability that epsilon 1 is equal to minus 1 36:21 and epsilon 2 is equal to plus 1, OK? 36:24 Because they're disjoint events. 36:25 So I can break them into the sum of the two. 36:28 And each of those guys is also one of the atomic parts of it. 36:32 It's one of the basic things. 36:33 And so each of those guys has probability 1/4. 36:36 And so here, we can really see that we accounted 36:38 for everything, except for the case when epsilon 1 was equal 36:41 to minus 1, and epsilon 2 was equal to minus 1. 36:44 So this is 1/4. 36:45 This is 1/4. 36:46 So the whole thing is equal to 3/4. 36:49 So now, what we have is that the probability that epsilon 1 36:56 is in-- 36:57 so the probability that theta belongs to this confidence 37:03 region is equal to 3/4. 37:06 And that's very nice. 37:07 But the thing is some people are sort of-- 37:09 I mean, it's not super nice to be able to see this, 37:12 because, in a way, I know that, if I observe x1 and x2 that 37:17 are different, I know for sure that theta, 37:24 that I actually got the right theta, right? 37:25 That this confidence interval is actually 37:27 happening with probability 1. 37:31 And the problem is that I do not know-- 37:34 I cannot make this precise with the notion of frequentist 37:37 confidence intervals. 37:39 OK? 37:39 Because frequentist confidence intervals 37:41 have to account for the fact that, in the future, 37:43 it might not be the case that x1 and x2 are different. 37:47 So Bayesian confidence regions, by definition-- 37:53 well, they're all gone-- 37:54 but they are conditioned on the data that I have. 37:57 And so that's what I want.
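A quick numerical check of the 3/4 that was just computed (a sketch, assuming numpy; the value of theta and the number of repetitions are arbitrary choices). It simulates x1 = theta + epsilon 1 and x2 = theta + epsilon 2 with each epsilon equal to plus or minus 1 with probability 1/2, builds the region from the board (x1 minus 1 when x1 = x2, the average when they differ), and counts how often the region contains theta.

import numpy as np

rng = np.random.default_rng(1)
theta, reps = 6, 100_000

hits = 0
for _ in range(reps):
    eps1, eps2 = rng.choice([-1, 1], size=2)     # each +/- 1 with probability 1/2
    x1, x2 = theta + eps1, theta + eps2
    guess = x1 - 1 if x1 == x2 else (x1 + x2) / 2
    hits += (guess == theta)

print("frequentist coverage:", hits / reps)      # about 0.75
# Conditionally on x1 != x2 the guess is always exactly theta; every miss
# comes from the case epsilon 1 = epsilon 2 = -1, which has probability 1/4.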
37:58 I want to be able to make this statement conditionally 38:00 and the data that I have. 38:02 OK. 38:03 So if I want to be able to make this statement, 38:06 if I want to build a Bayesian confidence region, 38:08 I'm going to have to put a prior on theta. 38:10 So without loss of generality-- 38:12 I mean, maybe with-- but let's assume 38:16 that pi is a prior on theta. 38:25 And let's assume that pi of j is strictly positive 38:31 for all integers j equal, say, 0-- 38:35 well, actually, for all j in the integers, positive or negative. 38:42 OK. 38:43 So that's a pretty weak assumption on my prior. 38:46 I'm just assuming that theta is some integer. 38:52 And now, let's build our Bayesian confidence region. 38:57 Well, if I want to build a Bayesian confidence region, 38:59 I need to understand what my posterior is going to be. 39:01 OK? 39:02 And if I want to understand what my posterior is going to be, 39:04 I actually need to build a likelihood, right? 39:11 So we know that it's the product of the likelihood 39:16 and of the prior divided by-- 39:20 OK. 39:21 39:31 So what is my likelihood? 39:32 So my likelihood is the probability 39:35 of x1 x2, given theta. 39:40 Right? 39:41 That's what the likelihood should be. 39:45 And now let's say that actually, just 39:49 to make things a little simpler, let 39:51 us assume that x1 is equal to, I don't know, 5, 40:07 and x2 is equal to 7. 40:11 OK? 40:12 So I'm not going to take the case where they're actually 40:16 equal to each other, because I know that, in this case, 40:19 x1 and x2 are different. 40:20 I know I'm going to actually nail exactly what theta is, 40:23 by looking at the average of those guys, right? 40:26 Here, it must be that theta is equal to 6. 40:30 So what I want is to compute the likelihood at 5 and 7, OK? 40:34 40:38 And what is this likelihood? 40:42 Well, if theta is equal to 6, that's 40:53 just the probability that I will observe 5 and 7, right? 41:00 So what is the probability I observe 5 and 7? 41:01 41:04 Yeah? 41:05 1? 41:06 AUDIENCE: 1/4. 41:08 PHILIPPE RIGOLLET: That's 1/4, right? 41:10 As the probability, I have minus 1 for the first epsilon 1, 41:15 right? 41:15 So this is infinity 6. 41:17 This is the probability that epsilon 1 is equal to minus 1, 41:23 and epsilon 2 is equal to plus 1, which is equal to 1/4. 41:28 So this probability is 1/4. 41:31 If theta is different from 6, what is this probability? 41:35 So if theta is different from 6, since we 41:37 know that we've only loaded the integers-- 41:41 so if theta has to be another integer, 41:46 what is the probability that I see 5 and 7? 41:49 AUDIENCE: 0. 41:49 PHILIPPE RIGOLLET: 0. 41:50 41:53 So that's my likelihood. 41:55 And if I want to know what my posterior is, 42:00 well, it's just pi of theta times 42:03 p of 5/6, given theta, divided by the sum over all T's, say, 42:10 in Z. Right? 42:11 So now, I just need to normalize this thing. 42:14 So of pi of T, p of 4/6, given T. Agreed? 42:21 42:24 That's just the definition of the posterior. 42:27 But when I sum these guys, there's 42:30 only one that counts, because, for those things, 42:34 we know that this is actually equal to 0 for every T, 42:38 except for when T is equal to 6. 42:41 So this entire sum here is actually 42:45 equal to pi of 6 times p of 5/6-- 42:54 sorry, 5/7, of 5/7, given that theta 43:03 is equal to 6, which we know is equal to 1/4. 43:08 And I did not tell you what pi of 6 was. 43:10 43:16 But it's the same thing here. 
43:18 The posterior for any theta that's not 6 43:21 is actually going to be-- this guy's going to be equal to 0. 43:23 So I really don't care what this guy is. 43:26 So what it means is that my posterior becomes what? 43:29 43:33 It becomes the posterior pi of theta, 43:40 given 5 and 7 is equal to-- well, when theta is not 43:46 equal to 6, this is actually 0. 43:49 So regardless of what I do here, I get something which is 0. 43:52 43:55 And if theta is equal to 6, what I get 43:58 is pi of 6 times p of 5/7, given 6, 44:02 which I've just computed here, which is 1/4 divided 44:05 by pi of 6 times 1/4. 44:08 So it's the ratio of two things that are identical. 44:10 So I get 1. 44:13 So now, my posterior tells me that, given 44:16 that I observe 5 and 7, theta has 44:22 to be 1 with probability-- has to be 6 with probability 1. 44:27 So now, I say that this thing here-- so now, this 44:32 is not something that actually makes 44:34 sense when I talk about frequentist confidence 44:37 intervals. 44:38 They don't really make sense, to talk about confidence 44:40 intervals, given something. 44:42 And so now, given that I observe 5 and 7, 44:44 I know that the probability of theta is equal to 1. 44:46 And in this sense, the Bayesian confidence interval 44:50 is actually more meaningful. 44:54 So one thing I want to actually say about this Bayesian 44:56 confidence interval is that it's-- 44:58 45:01 I mean, here, it's equal to the value 1, right? 45:03 So it really encompasses the thing that we want. 45:05 But the fact that we actually computed 45:06 it using the Bayesian posterior and the Bayesian rule 45:09 did not really matter for this argument. 45:10 All I just said was that it had a prior. 45:12 But just what I want to illustrate 45:15 is the fact that we can actually give a meaning 45:17 to the probability that theta is equal to 6, 45:21 given that I see 5 and 7. 45:23 Whereas, we cannot really in the other cases. 45:26 And we don't have to be particularly 45:28 precise in the prior and theta to be able to give theta this-- 45:31 to give this meaning. 45:32 OK? 45:35 All right. 45:36 45:38 So now, as I said, I think the main power of Bayesian 45:43 inference is that it spits out the posterior distribution, 45:45 and not just the single number, like frequentists 45:48 would give you. 45:50 Then we can say decorate, or theta hat, or point estimate, 45:55 with maybe some confidence interval. 45:56 Maybe we can do a bunch of tests. 45:58 But at the end of the day, we just have, 46:01 essentially, one number, right? 46:02 Then maybe we can understand where 46:04 the fluctuations of this number are in a frequentist setup. 46:07 but the Bayesian framework is essentially 46:11 giving you a natural method. 46:13 And you can interpret it in terms of the probabilities that 46:15 are associated to the prior. 46:17 But you can actually also try to make some-- 46:21 so a Bayesian, if you give me any prior, 46:25 you're going to actually build an estimator from this prior, 46:29 maybe from the posterior. 46:30 And maybe it's going to have some frequentist properties. 46:32 And that's what's really nice about [? Bayesians, ?] is 46:35 that you can actually try to give 46:36 some frequentist properties of Bayesian methods, that 46:39 are built using Bayesian methodology. 46:42 But you cannot really go the other way around. 46:44 If I give you a frequency methodology, 46:46 how are you going to say something about the fact 46:48 that there's a prior going on, et cetera? 
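Here is the same posterior computation done numerically for the example above (a sketch, assuming numpy; the particular prior below, a symmetric geometric-type prior on a truncated range of integers, is just one arbitrary choice of a prior with pi(j) > 0 for every integer j). With x1 = 5 and x2 = 7 observed, only theta = 6 has nonzero likelihood, so the posterior puts mass 1 on 6 no matter which such prior you pick.

import numpy as np

x1, x2 = 5, 7
support = np.arange(-50, 51)                 # the integers, truncated for the computation
prior = 0.5 ** np.abs(support)               # any prior positive on all integers works the same way
prior = prior / prior.sum()

def likelihood(theta):
    # P(X1 = x1, X2 = x2 | theta), with Xi = theta + epsilon_i and epsilon_i = +/- 1 w.p. 1/2
    return 0.25 if abs(x1 - theta) == 1 and abs(x2 - theta) == 1 else 0.0

lik = np.array([likelihood(t) for t in support])
posterior = prior * lik
posterior = posterior / posterior.sum()

print("posterior mass at theta = 6:", posterior[support == 6][0])   # 1.0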
46:51 And so this is actually one of the things 46:53 there's actually some research that's going on for this. 46:55 They call it Bayesian posterior concentration. 46:58 And one of the things-- so there's something 46:59 called the Bernstein-von Mises theorem. 47:01 And those are a class of theorems, 47:03 and those are essentially methods that tell you, well, 47:06 if I actually run a Bayesian method, 47:10 and I look at the posterior that I get-- 47:12 it's going to be something like this-- 47:14 but now, I try to study this in a frequentist point of view, 47:16 there's actually a true parameter of theta 47:18 somewhere, the true one. 47:20 There's no prior for this guy. 47:21 This is just one fixed number. 47:23 Is it true that as my sample size is 47:25 going to go to infinity, then this thing is going 47:27 to concentrate around theta? 47:29 And the rate of concentration of this thing, 47:31 the size of this width, the standard deviation 47:35 of this thing, is something that should decay maybe 47:38 like 1 over square root of n, or something like this. 47:40 And the rate of posterior concentration, 47:43 when you characterize it, it's called the Bernstein-von Mises 47:45 theorem. 47:46 And so people are looking at this 47:47 in some non-parametric cases. 47:49 You can do it in pretty much everything 47:51 we've been doing before. 47:52 You can do it for non-parametric regression estimation 47:55 or density estimation. 47:56 You can do it for, of course-- you 47:58 can do it for sparse estimation, if you want. 48:01 OK. 48:01 So you can actually compute the procedure and-- 48:04 48:08 yeah. 48:09 And so you can think of it as being just a method somehow. 48:12 Now, the estimator I'm talking about-- so 48:14 that's just a general Bayesian posterior concentration. 48:18 But you can also try to understand 48:20 what is the property of something that's 48:22 extracted from this posterior. 48:24 And one thing that we actually describe 48:26 was, for example, well, given this guy, 48:28 maybe it's a good idea to think about what 48:30 the mean of this thing is, right? 48:32 So there's going to be some theta hat, 48:35 which is just the integral of theta pi theta, given x1 xn-- 48:41 so that's my posterior-- 48:43 d theta. 48:44 Right? 48:44 So that's the posterior mean. 48:46 That's the expected value with respect 48:48 to the posterior distribution. 48:50 And I want to know how does this thing behave, 48:53 how close it is to a true theta if I actually 48:56 am in a frequency setup. 48:58 So that's the posterior mean. 48:59 49:04 But this is not the only thing I can actually spit out, right? 49:08 This is definitely uniquely defined. 49:09 If you give me a distribution, I can actually 49:13 spit out its posterior mean. 49:15 But I can also think of the posterior median. 49:17 49:21 But now, if this is not continuous, 49:23 you might have some uncertainty. 49:24 Maybe the median is not uniquely defined, 49:26 and so maybe that's not something you use as much. 49:29 Maybe you can actually talk about the posterior mode. 49:31 49:35 All right, so for example, if you're posterior density looks 49:38 like this, then maybe you just want 49:40 to summarize your posterior with this number. 49:43 So clearly, in this case, it's not such a good idea, 49:46 because you completely forget about this mode. 49:48 But maybe that's what you want to do. 49:49 Maybe you want to focus on the most peak mode. 49:53 And this is actually called maximum a posteriori. 
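To make these three summaries concrete, here is a small sketch that computes the posterior mean, the posterior median, and the maximum a posteriori for a Beta posterior, the kind that shows up in the Bernoulli example below (assuming scipy; the numbers a = 1/2, n = 10, and seven observed 1s are arbitrary choices).

from scipy import stats
from scipy.optimize import minimize_scalar

# Posterior Beta(a + sum(x), a + n - sum(x)) for Bernoulli data with a Beta(a, a) prior
a, n, s = 0.5, 10, 7                       # a = 1/2 is Jeffreys prior; s = number of 1s
post = stats.beta(a + s, a + n - s)

post_mean = post.mean()                    # equals (a + s) / (2a + n)
post_median = post.median()
post_map = minimize_scalar(lambda p: -post.pdf(p),        # maximize the posterior density
                           bounds=(1e-6, 1 - 1e-6), method="bounded").x

print("posterior mean  :", post_mean)      # 7.5 / 11, about 0.68
print("posterior median:", post_median)    # between the mean and the mode here
print("posterior MAP   :", post_map)       # (a + s - 1) / (2a + n - 2) = 6.5 / 9, about 0.72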
49:58 As I said, maybe you want a sample 49:59 from this posterior distribution. 50:03 OK, and so in all these cases, these Bayesian estimators 50:06 will depend on the prior distribution. 50:09 And the hope is that, as the sample size grows, 50:11 you won't see that again. 50:14 OK. 50:14 So to conclude, let's just do a couple of experiments. 50:20 So if I look at-- 50:22 50:25 did we do this? 50:26 Yes. 50:26 So for example, so let's focus on the posterior mean. 50:30 50:34 And we know-- so remember in experiment one-- 50:45 [INAUDIBLE] example one, what we had 50:48 was x1 xn that were [? iid, ?] Bernoulli p, 50:56 and the prior I put on p was a beta with parameter aa. 51:06 OK? 51:07 And if I go back to what we computed, 51:09 you can actually compute the posterior of this thing. 51:12 And we know that it's actually going to be-- 51:15 sorry, that was uniform? 51:17 Where is-- yeah. 51:18 So what we get is that the posterior, this thing 51:31 is actually going to be a beta with parameter 51:36 a plus the sum, so a plus the number of 1s 51:42 and a plus the number of 0s. 51:44 51:48 OK? 51:49 And the beta was just something that looked like-- 51:53 51:56 the density was p to the a minus 1, 1 minus p. 52:00 52:05 OK? 52:05 So if I want to understand the posterior mean, 52:11 I need to be able to compute the expectation of a beta, 52:13 and then maybe plug in a for a plus 52:16 this guy and minus this guy. 52:17 OK. 52:18 So actually, let me do this. 52:21 OK. 52:22 So what is the expectation? 52:23 52:26 So what I want is something that looks 52:27 like the integral between 0 and 1 of p times a minus 1-- 52:34 sorry, p times p a minus 1, 1 minus p, b minus 1. 52:42 Do we agree that this-- 52:43 and then there's a normalizing constant. 52:46 Let's call it c. 52:49 OK? 52:49 52:53 So this is what I need to compute. 52:56 So that's c of a and b. 52:57 53:00 Do we agree that this is the posterior 53:01 mean with respect to a beta with parameters a and b? 53:08 Right? 53:09 I just integrate p against the density. 53:13 So what does this thing look like? 53:14 Well, I can actually move this guy in here. 53:18 And here, I'm going to have a plus 1 minus 1. 53:23 OK? 53:26 So the problem is that this thing is actually-- 53:29 the constant is going to play a big role, right? 53:31 Because this is essentially equal 53:33 to c a plus 1b divided by c ab, where 53:40 ca plus 1b is just the normalizing 53:42 constant of a beta a plus 1 b. 53:46 So I need to know the ratio of those two constants. 53:48 53:58 And this is not something-- 53:59 I mean, this is just a calculus exercise. 54:01 So in this case, what you get is-- 54:06 sorry. 54:08 In this case, you get-- 54:09 54:12 well, OK, so we get essentially a divided by, 54:34 I think, it's a plus b. 54:37 Yeah, it's a plus b. 54:38 54:41 So that's this quantity. 54:43 54:47 OK? 54:47 54:51 And when I plug in a to be this guy and b to be this guy, what 54:56 I get is a plus sum of the xi. 55:02 And then I get a plus this guy, a plus n minus this guy. 55:06 So those two guys go away, and I'm 55:07 left with 2a plus n, which does not work. 55:14 No, that actually works. 55:15 And so now what I do, I can actually divide and get 55:18 this thing, over there. 55:19 OK. 55:20 So what you can see, the reason why this thing has been divided 55:23 is that you can really see that, as n goes to infinity, 55:27 then this thing behaves like xn bar, which 55:30 is our frequentist estimator. 55:31 The effect of a is actually going away. 
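Writing $B(a,b) = \int_0^1 p^{a-1}(1-p)^{b-1}\,dp$ for the normalizing constant of the Beta density, the board computation above reads

\[
\mathbb{E}[p] \;=\; \frac{1}{B(a,b)}\int_0^1 p\cdot p^{a-1}(1-p)^{b-1}\,dp
\;=\; \frac{B(a+1,b)}{B(a,b)}
\;=\; \frac{a}{a+b},
\]

using $B(a,b) = \Gamma(a)\Gamma(b)/\Gamma(a+b)$ and $\Gamma(a+1) = a\,\Gamma(a)$ for the last equality. Plugging in $a \mapsto a + \sum_{i=1}^n x_i$ and $b \mapsto a + n - \sum_{i=1}^n x_i$ gives the posterior mean

\[
\frac{a + \sum_{i=1}^n x_i}{2a + n},
\]

which behaves like $\bar X_n$ as $n \to \infty$, as stated above.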
55:34 The effect of the prior, which is completely captured by a, 55:37 is going away as n goes to infinity. 55:40 Is there any question? 55:42 55:47 You guys have a question. 55:48 What is it? 55:50 Do you have a question? 55:51 AUDIENCE: Yeah, on the board, is that divided 55:53 by some [INAUDIBLE] stuff? 55:56 PHILIPPE RIGOLLET: Is that divided by what? 55:58 AUDIENCE: That a over a plus b, and then you just expanded-- 56:00 PHILIPPE RIGOLLET: Oh yeah, yeah, 56:01 then I said that this is equal to this, right. 56:05 So that's for a becomes a plus sum of the xi's, and b becomes 56:15 a plus n minus sum of the xi's. 56:20 OK. 56:20 So that's just for the posterior one. 56:22 AUDIENCE: What's [INAUDIBLE] 56:26 PHILIPPE RIGOLLET: This guy? 56:27 AUDIENCE: Yeah. 56:28 PHILIPPE RIGOLLET: 2a. 56:28 AUDIENCE: 2a. 56:29 Oh, OK. 56:30 PHILIPPE RIGOLLET: Right. 56:31 So I get a plus a plus n. 56:34 And then those two guys cancel. 56:37 OK? 56:38 And that's what you have here. 56:41 So for a is equal to 1/2-- 56:44 and I claim that this is Jeffreys prior. 56:47 Because remember, Jeffreys prior 56:53 was proportional to one over the square root of p times 1 minus 56:56 p, which I can write as p to the minus 1/2 times 1 minus p to the minus 1/2. 57:01 Since the Beta(a, a) density has exponents a minus 1, that's just the case a is equal to 1/2. 57:03 OK. 57:04 So if I use Jeffreys prior, I just plug in a equals to 1/2, 57:07 and this is what I get. 57:10 OK? 57:12 So those things are going to have an impact when 57:14 n is not too large. 57:16 For large n, those things, whether you take Jeffreys prior 57:19 or you take whatever a you prefer, 57:20 it's going to have no impact whatsoever. 57:23 But if n is of the order of 10, maybe, 57:26 then you're going to start to see some impact, 57:28 depending on what a you want to pick. 57:30 57:33 OK. 57:34 And then in the second example, well, here we actually 57:38 computed the posterior to be this guy. 57:42 Well, here, I can just read off what the expectation is, right? 57:45 I mean, I don't have to actually compute 57:47 the expectation of a Gaussian. 57:48 It's just xn bar. 57:50 And so in this case, there's actually no-- 57:52 I mean, when I have a non-informative prior 57:57 for a Gaussian, then I have basically xn bar. 58:01 As you can see, actually, this is an interesting example. 58:04 When I actually look at the posterior, 58:06 it's not something that cost me a lot to communicate to you, 58:09 right? 58:10 There's one symbol here, one symbol here, and one symbol 58:12 here. 58:13 I tell you the posterior is a Gaussian with mean xn bar 58:17 and variance 1/n. 58:19 When I actually turn that into a posterior mean, 58:23 I'm dropping all this information. 58:26 I'm just giving you the first parameter. 58:27 So you can see there's actually much more information 58:30 in the posterior than there is in the posterior mean. 58:35 The posterior mean is just a point. 58:37 It's not telling me how confident I am in this point. 58:39 And this thing is actually very interesting. 58:41 OK. 58:42 So you can talk about the posterior variance 58:44 that's associated to it, right? 58:45 You can talk about, as an output, 58:47 you could give the posterior mean and posterior variance. 58:49 And those things are actually interesting. 58:53 All right. 58:53 So I think this is it. 58:56 So as I said, in general, just like in this case, 59:05 the impact of the prior is being washed away 59:07 as the sample size goes to infinity. 59:10 Just like here, there's no impact of the prior.
59:12 It was a non-informative one. 59:14 But if you actually had an informative one, cf. 59:17 the homework-- yeah? 59:18 AUDIENCE: [INAUDIBLE] 59:19 PHILIPPE RIGOLLET: Yeah, so, cf. the homework, 59:21 you would actually see an impact of the prior, which, 59:23 again, would be washed away as your sample size increases. 59:25 Here, it goes away. 59:26 You just get xn bar. 59:29 And actually, in these cases, you 59:31 see that the posterior distribution converges 59:35 to-- sorry, the Bayesian estimator 59:37 is asymptotically normal. 59:39 This is different from the distribution of the posterior, 59:43 right? 59:43 This is just the posterior mean, which happens 59:45 to be asymptotically normal. 59:47 But the posterior may not have a-- 59:49 I mean, here, the posterior is a beta, right? 59:53 I mean, it's not normal. 59:55 OK, so there's different-- those things 59:57 are two different things. 59:59 Your question? 60:01 AUDIENCE: What was the prior [INAUDIBLE] 60:04 PHILIPPE RIGOLLET: All 1, right? 60:05 That was the improper prior. 60:06 AUDIENCE: OK. 60:08 And so that would give you the same thing as [INAUDIBLE], not 60:12 just the proportion. 60:13 PHILIPPE RIGOLLET: Well, I mean, yeah. 60:15 So it's essentially telling you that-- 60:17 so we said that, when you have a non-informative prior, 60:23 essentially, the maximum likelihood is the maximum 60:25 a posteriori, right? 60:26 But in this case, there's so much symmetry 60:28 that it just so happens that the posterior 60:30 is completely symmetric around its maximum. 60:32 So it means that the expectation is equal to the maximum, 60:34 to the arg max. 60:35 60:40 Yeah? 60:41 AUDIENCE: I read somewhere that one 60:43 of the issues with Bayesian methods 60:45 is that if we choose the wrong prior, 60:46 it could mess up your results. 60:49 PHILIPPE RIGOLLET: Yeah, but hence, 60:51 do not pick the wrong prior. 60:53 I mean, of course, it would. 60:55 I mean, it would mess up your res-- of course. 60:57 I mean, you're putting extra information. 60:58 But you could say the same thing by saying, 61:00 well, the issue with frequentist methods 61:03 is that, if you mess up the choice of your likelihood, 61:06 then it's going to mess up your output. 61:09 So here, you just have two chances of messing it up, 61:11 right? 61:12 You have the-- well, it's gone. 61:14 So you have the product of the likelihood and the prior, 61:17 and you have one more chance to-- 61:20 but it's true, if you assume that the model is 61:22 right, then, of course, finding the wrong prior could 61:25 completely mess up things if your prior, for example, 61:28 has no support on the true parameter. 61:30 But if your prior has a positive weight on the true parameter, 61:34 as n goes to infinity-- 61:38 I mean, OK, I cannot speak for all counterexamples 61:40 in the world. 61:41 But I'm sure, under minor technical conditions, 61:44 you can guarantee that your posterior 61:46 mean is going to converge to what 61:48 you need it to converge to. 61:49 61:53 Any other question? 61:54 61:57 All right. 61:58 So I think this closes the more traditional mathematical-- not 62:07 mathematical, but traditional statistics part of this class. 62:11 And from here on, we'll talk about more multivariate 62:14 statistics, starting with principal component analysis. 62:17 So that's more like when you have multivariate data. 62:19 We started, in a way, to talk about multivariate statistics 62:22 when we talked about multivariate regression.
62:25 But we'll move on to principal component analysis. 62:28 I'll talk a bit about multiple testing. 62:30 I haven't made up my mind yet about what 62:32 we'll really talk about in December. 62:34 But I want to make sure that you have 62:36 a taste and a flavor of what is interesting in statistics 62:41 these days, especially as you go towards more machine 62:44 learning type of questions, where really, the focus is 62:46 on prediction rather than the modeling itself. 62:48 We'll talk about logistic regression, 62:50 as well, for example, which is a generalized 62:52 linear model, which is just the generalization of regression to the case 62:55 where y does not take values in the whole real line, but maybe in 0, 1, 63:00 for example. 63:03 All right. 63:03 Thanks.