https://www.youtube.com/watch?v=bFZ-0FH5hfs&list=PLUl4u3cNGP60uVBMaoNERc6knT_MgPKS0&index=15 Transcript 00:00 The following content is provided under a Creative 00:02 Commons license. 00:03 Your support will help MIT OpenCourseWare 00:06 continue to offer high quality educational resources for free. 00:10 To make a donation or to view additional materials 00:12 from hundreds of MIT courses, visit 00:15 MIT OpenCourseWare at ocw.mit.edu. 00:21 PHILIPPE RIGOLLET: So today we'll actually just do a brief 00:26 chapter on Bayesian statistics. 00:28 And there's entire courses on Bayesian statistics, 00:31 there's entire books on Bayesian statistics, 00:33 there's entire careers in Bayesian statistics. 00:36 So admittedly, I'm not going to be 00:39 able to do it justice and tell you 00:40 all the interesting things that are happening 00:42 in Bayesian statistics. 00:44 But I think it's important as a statistician 00:47 to know what it is, how it works, 00:49 because it's actually a weapon of choice 00:52 for many practitioners. 00:55 And because it allows them to incorporate their knowledge 00:58 about a problem in a fairly systematic manner. 01:00 So if you look at, say, the Bayesian statistics literature, 01:04 it's huge. 01:05 And so here I give you sort of a range 01:09 of what you can expect to see in Bayesian statistics, 01:12 from your second edition of a traditional book, to something 01:18 that involves computation, to some things that 01:20 involve risk thinking. 01:22 And there's a lot of Bayesian thinking. 01:24 There's a lot of things that are, you know, 01:26 talking about sort of the philosophy of thinking 01:29 Bayesian. 01:30 This book, for example, seems to be one of them. 01:32 This book is definitely one of them. 01:34 This one represents sort of a wide, a broad literature 01:38 on Bayesian statistics, for applications, for example, 01:42 in social sciences.
01:43 But even in large scale machine learning, 01:45 there's a lot of Bayesian statistics happening, 01:47 in particular using something called Bayesian nonparametrics, 01:50 or hierarchical Bayesian modeling. 01:53 So we do have some experts at MIT in CSAIL. 01:59 Tamara Broderick, for example, is a person 02:02 who does quite a bit of interesting work 02:04 on Bayesian nonparametrics. 02:06 And if that's something you want to know more about, 02:08 I urge you to go and talk to her. 02:10 So before we go into more advanced things, 02:14 we need to start with what is the Bayesian approach. 02:17 What do Bayesians do, and how is it 02:19 different from what we've been doing so far? 02:22 So to understand the difference between Bayesians 02:26 and what we've been doing so far, 02:28 we need to first put a name on what we've been doing so far. 02:31 It's called frequentist statistics. 02:32 It's usually Bayesian versus frequentist statistics, but 02:36 by versus I don't mean that they are naturally 02:38 in opposition. 02:40 Actually, often you will see the same method 02:43 come out of both approaches. 02:45 So let's see how we did it, right. 02:46 The first thing, we had data. 02:48 We observed some data. 02:50 And we assumed that this data was generated randomly. 02:52 The reason we did that is because this 02:54 would allow us to leverage tools from probability. 02:57 So let's say by nature, measurements, you do a survey, 03:01 you get some data. 03:03 Then we made some assumptions on the data generating process. 03:06 For example, we assumed they were iid. 03:07 That was one of the recurring things. 03:09 Sometimes we assumed it was Gaussian, 03:11 if you wanted to use, say, a t-test. 03:13 Maybe we did some nonparametric statistics. 03:15 We assumed it was a smooth function or maybe 03:18 a linear regression function. 03:20 So that was our modeling.
03:21 And this was basically a way to say, well, 03:24 we're not going to allow for any distributions for the data 03:28 that we have. 03:29 But maybe a small set of distributions 03:31 that are indexed by some small parameters, for example. 03:34 Or at least remove some of the possibilities. 03:38 Otherwise, there's nothing we can learn. 03:41 And so for example, this was associated 03:45 to some parameter of interest, say theta, or beta 03:48 in the regression model. 03:51 Then we had this unknown thing, 03:55 an unknown parameter. 03:56 And we wanted to find it. 03:57 We wanted to either estimate it or test it, 03:59 or maybe find a confidence interval for it. 04:02 So, so far I should not have said anything that's new. 04:06 But this last sentence is actually 04:08 what's going to be different from the Bayesian part. 04:10 In particular, this unknown but fixed thing 04:12 is what's going to be changing. 04:14 04:16 In the Bayesian approach, we still 04:18 assume that we observe some random data. 04:22 But the generating process is slightly different. 04:24 It's sort of a two-layer process. 04:25 There's one process that generates 04:27 the parameter and then one process 04:28 that, given this parameter, generates the data. 04:31 So for the first layer, nobody really 04:35 believes that there's some random process that's 04:38 happening, generating what is going 04:41 to be the true expected number of people 04:44 who turn their head to the right when they kiss. 04:47 But this is actually going to be something that makes it 04:49 easier for us to incorporate 04:53 what we call prior belief. 04:57 We'll see an example in a second. 04:58 But often, you actually have a prior belief 05:01 of what this parameter should be. 05:02 When we did, say, least squares, we looked 05:05 over all of the vectors in all of R to the p, 05:09 including the ones that have coefficients equal 05:11 to 50 million.
05:15 Those are things that we might be able to rule out. 05:18 We might be able to rule that out on a much smaller scale. 05:21 For example, well, I'm not an expert 05:24 on turning your head to the right or to the left. 05:29 But maybe you can rule out the fact 05:30 that almost everybody is turning their head 05:33 in the same direction, or almost everybody is turning their head 05:35 in the other direction. 05:38 So we have this prior belief. 05:39 And this belief is going to play, hopefully, a 05:43 less and less important role as we collect more and more data. 05:47 But if we have a smaller amount of data, 05:49 we might want to be able to use this information, 05:52 rather than just shooting in the dark. 05:54 And so the idea is to have this prior belief. 05:58 And then, we want to update this prior belief 06:00 into what's called the posterior belief after we've 06:03 seen some data. 06:04 Maybe I believe that there's something 06:08 that should be in some range. 06:09 But maybe after I see data, it confirms my beliefs. 06:12 So I'm actually having maybe a belief that's more. 06:15 So belief encompasses basically what you think 06:18 and how strongly you think it. 06:20 That's what I call belief. 06:21 So for example, if I have a belief about some parameter 06:24 theta, maybe my belief is telling me 06:26 where theta should be and how strongly I 06:28 believe in it, in the sense that I have a very narrow region 06:32 where theta could be. 06:35 For the posterior belief, well, you see some data. 06:37 And maybe you're more confident or less confident about what 06:40 you've seen. 06:40 Maybe you've shifted your belief a little bit. 06:42 And so that's what we're going to try to see, 06:44 and how to do this in a principled manner. 06:48 To understand this better, there's 06:50 nothing better than an example. 06:52 So let's talk about another stupid statistical question. 06:56 Which is, let's try to understand p.
06:58 Of course, I'm not going to talk about politics from now on. 07:01 So let's talk about p, the proportion of women 07:03 in the population. 07:04 07:15 And so what I could do is to collect some data, X1, ..., Xn, 07:21 and assume that they're Bernoulli 07:23 with some parameter, p, unknown. 07:25 So p is in [0, 1]. 07:30 OK, let's assume that those guys are iid. 07:33 So this is just an indicator for each of my collected data: 07:38 if the person I randomly sample is a woman, I get a one. 07:42 If it's a man, I get a zero. 07:43 07:46 Now the question is, I sample these people randomly. 07:49 I do know their gender. 07:51 And the frequentist approach was just saying, 07:54 OK, let's just estimate p by p hat, being Xn bar. 07:58 And then we could do some tests. 08:01 So here, there's a test. 08:02 I want to test maybe if p is equal to 0.5 or not. 08:05 That sounds like a pretty reasonable thing to test. 08:09 But we want to also maybe estimate p. 08:13 But here, this is a case where we definitely have a prior belief 08:16 of what p should be. 08:17 We are pretty confident that p is not going to be 0.7. 08:22 We actually believe that we should 08:23 be extremely close to one half, but maybe not exactly. 08:29 Maybe this population is not the population in the world. 08:32 But maybe this is the population of, say, some college 08:35 and we want to understand if this college has half women 08:38 or not. 08:40 Maybe we know it's going to be close to one half, 08:42 but maybe we're not quite sure. 08:43 08:46 We're going to want to integrate that knowledge. 08:49 So I could integrate it in a blunt manner by saying, 08:52 discard the data and say that p is equal to one half. 08:55 But maybe that's just a little too much. 08:57 So how do I do this trade-off between using the data 09:01 and combining it with this prior knowledge? 09:06 In many instances, essentially what's going to happen 09:09 is this one half is going to act like one new observation.
09:14 So if you have five observations, 09:17 this is just the sixth observation, 09:18 which will play a role. 09:20 If you have a million observations, 09:21 you're going to have a million and one. 09:22 It's not going to play so much of a role. 09:24 That's basically how it goes. 09:25 09:28 But definitely not always, because we'll 09:33 see that if I take my prior to be a point mass at one half here, 09:36 it's basically as if I was discarding my data. 09:39 So essentially, there's also your ability 09:41 to encompass how strongly you believe in this prior. 09:45 And if you believe infinitely more in the prior 09:47 than you believe in the data you collected, 09:49 then it's not going to act like one more observation. 09:54 The Bayesian approach is a tool to, one, 09:56 include mathematically our prior 09:59 belief into statistical procedures. 10:02 Maybe I have this prior knowledge. 10:04 But if I'm a medical doctor, it's not clear to me 10:06 how I'm going to turn this into some principled way of building 10:09 estimators. 10:10 And the second goal is going to be 10:12 to update this prior belief into a posterior belief 10:16 by using the data. 10:17 10:22 How do I do this? 10:23 And at some point, I sort of suggested 10:25 that there's two layers. 10:28 One is where you draw the parameter at random. 10:31 And two, once you have the parameter, 10:35 conditionally on this parameter, you draw your data. 10:39 Nobody believes this is actually happening, that nature is just 10:42 rolling dice for us and choosing parameters at random. 10:45 But what's happening is that this idea 10:48 that the parameter comes from some random distribution 10:51 actually captures, very well, this idea of how 10:54 you would encompass your prior. 10:56 How would you say, my belief is as follows? 10:59 Well, here's an example about p. 11:01 I'm 90% sure that p is between 0.4 and 0.6. 11:07 And I'm 95% sure that p is between 0.3 and 0.8.
11:14 So essentially, I have these possible values of p. 11:18 And what I know is that there's 90% here, between 0.4 and 0.6. 11:35 And then I have 0.3 and 0.8. 11:39 And I know that I'm 95% sure that I'm in here. 11:44 If you remember, this sort of looks like the kind of pictures 11:47 that I made when I had some Gaussian, for example. 11:50 And I said, oh, here we have 90% of the observations. 11:54 And here, we have 95% of the observations. 11:57 12:00 So in a way, if I were able to tell you 12:04 all those ranges for all possible values, 12:07 then I would essentially describe a probability 12:10 distribution for p. 12:13 And what I'm saying is that p is going 12:15 to have this kind of shape. 12:16 So of course, if I tell you only these two pieces of information, 12:19 that with 90% I'm between here and here, 12:22 and with 95% I'm between here and here, then there's 12:24 many ways I can accomplish that, right. 12:26 I could have something that looks like this, maybe. 12:28 12:33 It could be like this. 12:35 There's many ways I can have this. 12:37 Some of them are definitely going 12:38 to be mathematically more convenient than others. 12:42 And hopefully, we're going to have things 12:44 that I can parameterize very well. 12:47 Because if I tell you this is this guy, 12:49 then there's basically one, two, three, four, five, six, 12:54 seven parameters. 12:56 So I probably don't want something 12:57 that has seven parameters. 12:59 But maybe I can say, oh, it's a Gaussian and all 13:01 I have to do is to tell you where it's centered 13:03 and what the standard deviation is. 13:04 13:07 So the idea of using this two-layer thing, 13:11 where we think of the parameter p 13:12 as being drawn from some distribution, 13:14 is really just a way for us to capture this information. 13:17 Our prior belief being, well, there's 13:20 this percentage of chances that it's there.
13:22 But with the percentage of chance, I'm 13:24 deliberately not using probability here. 13:28 So it's really a way to get close to this. 13:30 13:33 That's why I say, the true parameter is not random. 13:36 But the Bayesian approach acts as if it was random. 13:40 And then, just spits out a procedure 13:42 out of this thought process, this thought experiment. 13:49 So when you practice Bayesian statistics a lot, 13:54 you start getting automatisms. 13:57 You start getting some things that you do without really 14:00 thinking about it, just like when 14:02 you're a statistician, the first thing you ask is, 14:04 can I think of this data as being Gaussian, for example? 14:07 When you're Bayesian you're thinking about, 14:09 OK, I have a set of parameters. 14:11 So here, I can describe my parameter 14:14 as being theta in general, in some big 14:20 parameter space, capital Theta. 14:21 But what spaces did we encounter? 14:24 Well, we encountered the real line. 14:27 We encountered the interval [0, 1] for Bernoullis. And we 14:31 encountered the positive real line 14:36 for exponential distributions, etc. 14:39 And so what I'm going to need to do, 14:42 if I want to put some prior on those spaces, 14:44 I'm going to have to have a usual set of tools 14:47 for this guy, a usual set of tools for this guy, 14:49 a usual set of tools for this guy. 14:51 And by usual set of tools, I mean 14:52 I'm going to have to have a family of distributions that's 14:54 supported on this. 14:56 So in particular, this is the space 14:59 in which my parameter, that I usually denote 15:01 by p for Bernoulli, lives. 15:03 And so what I need is to find a distribution on the interval [0, 15:07 1], just like this guy. 15:13 The problem with the Gaussian is that it's 15:15 not on the interval [0, 1]. 15:17 It's going to spill out at the ends. 15:20 And it's not going to be something that works for me.
15:22 And so the question is, I need to think about distributions 15:25 that are probably continuous. 15:27 Why would I restrict myself to discrete distributions? And among those that 15:30 are actually convenient for Bernoulli, one that's 15:34 basically the main tool that everybody is using 15:36 is the so-called beta distribution. 15:39 So the beta distribution has two parameters. 15:42 15:50 So x follows a beta with parameters 15:56 a and b if it has a density, f of x 16:05 is equal to x to the a minus 1, 16:09 times 1 minus x to the b minus 1, if x is in the interval [0, 16:15 1], and 0 for all other x's. 16:22 OK? 16:23 16:27 Why is that a good thing? 16:30 Well, it's a density that's on the interval [0, 1] for sure. 16:33 But now I have these two parameters, and the set of shapes 16:37 that I can get by tweaking those two parameters is incredible. 16:41 16:44 It's going to be a unimodal distribution. 16:46 It's still fairly nice. 16:47 It's not going to be something that goes like this and this. 16:49 Because if you think about this, what 16:52 would it mean if your prior distribution on the interval [0, 16:55 1] had this shape? 16:57 16:59 It would mean that maybe you think that p is here 17:01 or maybe you think that p is here, 17:03 or maybe you think that p is here. 17:05 Which essentially means that you think 17:06 that p can come from three different phenomena. 17:10 And there's other models that are called mixtures 17:12 for that, that directly account for the fact 17:15 that maybe there are several phenomena that are aggregated 17:19 in your data set. 17:21 But if you think that your data set is sort of pure, 17:23 and that everything comes from the same phenomenon, 17:25 you want something that looks like this, 17:28 or maybe looks like this, or maybe is sort of symmetric. 17:32 You want to get all this stuff.
17:34 Maybe you want something that says, well, 17:36 if I'm talking about p being the 17:42 proportion of women in the whole world, you want something that's probably 17:45 really spiked around one half. 17:48 Almost a point mass, because you know, 17:50 let's agree that 0.5 is the actual number. 17:54 So you want something that says, OK, maybe I'm wrong. 17:58 But I'm sure I'm not going to be really that way off. 18:01 So you want something that's really pointy. 18:03 But if it's something you've never checked, 18:06 and again I cannot make references at this point, 18:09 but something where you might have some uncertainty that 18:13 should be around one half. 18:14 Maybe you want something that allows 18:17 you a little more to say, well, I think there's more around one half. 18:19 But there's still some fluctuations that are possible. 18:22 And in particular here, I talk about the case 18:25 where the two parameters a and b are actually the same. 18:29 I call them a. 18:30 One is called scale. 18:31 The other one is called shape. 18:33 Oh sorry, this is not a density. 18:35 So it actually has to be normalized. 18:38 When you integrate this guy, it's 18:40 going to be some function that depends on a 18:41 and b, actually depends on a and b 18:43 through the beta function. 18:45 Which is this combination of gamma functions, 18:47 so that's why it's called the beta distribution. 18:51 That's the definition of the beta function when you 18:53 integrate this thing anyway. 18:55 You just have to normalize it. 18:56 That's just a number that depends on a and b. 18:59 So here, if you take a equal to b, 19:01 you have something that essentially 19:03 is symmetric around one half. 19:05 Because what does it look like? 19:07 Well, so my density f of x is going to be what? 19:10 It's going to be my constant times x times 1 minus x, 19:19 to the a minus 1. 19:21 And this function, x times 1 minus x, looks like this. 19:26 We've drawn it before.
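As a small illustration of the normalization just described, here is a minimal Python sketch of the Beta(a, b) density. It assumes the standard identity B(a, b) = Gamma(a) Gamma(b) / Gamma(a + b) for the normalizing constant; the function names `beta_pdf` and `integrate` are hypothetical and are not from the lecture.

```python
import math


def beta_pdf(x, a, b):
    """Density of Beta(a, b) at x; zero outside the open interval (0, 1)."""
    if not 0.0 < x < 1.0:
        return 0.0
    # Normalizing constant 1 / B(a, b), where B(a, b) = Gamma(a) Gamma(b) / Gamma(a + b).
    const = math.gamma(a + b) / (math.gamma(a) * math.gamma(b))
    return const * x ** (a - 1) * (1 - x) ** (b - 1)


def integrate(f, n=100_000):
    """Midpoint rule on [0, 1]; a crude numerical check, not a general integrator."""
    h = 1.0 / n
    return sum(f((i + 0.5) * h) for i in range(n)) * h


# With a = b the density is symmetric around one half.
total = integrate(lambda x: beta_pdf(x, 3.0, 3.0))
print(total)                                            # close to 1
print(beta_pdf(0.3, 3.0, 3.0), beta_pdf(0.7, 3.0, 3.0))  # equal by symmetry
```

Taking a larger common value of a makes the density more sharply peaked at one half, which is the "pointy" prior mentioned above.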
19:27 That was something that showed up 19:29 as being the variance of my Bernoulli. 19:36 So we know it's something that takes its maximum at one half. 19:42 And now I'm just taking a power of this guy. 19:44 So I'm really just distorting this thing 19:46 in some fairly symmetric manner. 19:51 19:56 This is the distribution that we actually take for p. 20:00 I assume that p, the parameter, notice 20:03 that this is kind of weird. 20:04 First of all, this is probably the first time 20:06 in this entire course that something 20:09 has a distribution when it's actually a lower case letter. 20:12 That's something you have to deal with, 20:13 because we've been using lower case letters for parameters. 20:16 And now we want them to have a distribution. 20:18 So that's what's going to happen. 20:20 This is called the prior distribution. 20:23 So really, I should write something like f of p 20:27 is equal to a constant times p, 1 minus p, to the a minus 1. 20:35 Well no, actually I should not, because then it's confusing. 20:39 One thing in terms of notation that I'm 20:41 going to write, when I have a constant here 20:43 and I don't want to make it explicit. 20:45 And we'll see in a second why I don't need to make it explicit. 20:48 I'm going to write this as f of x 20:53 is proportional to x, 1 minus x, to the a minus 1. 21:04 That's just to say, equal to some constant that does not 21:08 depend on x, times this thing. 21:11 21:16 So if we continue with our experiment 21:21 where I'm drawing this data, X1 to Xn, 21:25 which is Bernoulli p, if p has some distribution 21:29 it's not clear what it means to have a Bernoulli 21:31 with some random parameter. 21:32 So what I'm going to do is, I'm going to first draw my p. 21:35 Let's say I get a number, 0.52. 21:38 And then, I'm going to draw my data conditionally on p. 21:41 So here comes the first and last flowchart of this class. 21:45 21:49 So nature first draws p. 21:51 21:53 p follows some Beta(a, a).
21:58 Then I condition on p. 21:59 22:02 And then I draw X1, ..., Xn that are iid Bernoulli p. 22:10 Everybody understand the process of generating this data? 22:14 So you first draw a parameter, and then you just 22:16 flip those independent biased coins with this particular p. 22:21 There's this layered thing. 22:23 22:26 Now conditionally on p, right, so here I have this prior about p, 22:31 which was the thing. 22:32 So this is just the thought process again, 22:34 it's not anything that actually happens in practice. 22:36 This is my way of thinking about how the data was generated. 22:39 And from this, I'm going to try to come up with some procedure. 22:43 Just like, if your estimator is the average of the data, 22:47 you don't have to understand probability 22:49 to say that my estimator is the average of the data. 22:52 Anyone outside this room understands 22:54 that the average is a good estimator 22:55 for some average behavior. 22:58 And they don't need to think of the data 23:01 as being a random variable, et cetera. 23:02 So same thing, basically. 23:04 23:10 In this case, you can see that the posterior distribution 23:13 is still a beta. 23:14 23:18 What it means is that I had this thing. 23:20 Then, I observed my data. 23:21 And then, I continue and here I'm 23:23 going to update my prior into some posterior 23:32 distribution, pi. 23:36 And here, this guy is actually also a beta. 23:39 23:43 My posterior distribution of p is also 23:45 a beta distribution, with the parameters 23:48 that are on this slide. 23:48 And I'll have the space to reproduce them. 23:51 So I start the beginning of this flowchart 23:54 as having p, which has a prior. 23:57 I'm going to get some observations 23:58 and then, I'm going to update what my posterior is.
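The two-layer thought experiment in the flowchart can be sketched in a few lines of Python. This is only an illustration under assumptions: the helper name `draw_dataset`, the seed, and the values a = 50 and n = 1000 are arbitrary choices, not from the lecture. It uses `random.betavariate` from the standard library for the first layer.

```python
import random


def draw_dataset(a, n, rng):
    """Two-layer generation: first p ~ Beta(a, a), then X1..Xn iid Bernoulli(p) given p."""
    p = rng.betavariate(a, a)  # layer 1: "nature" draws the parameter
    xs = [1 if rng.random() < p else 0 for _ in range(n)]  # layer 2: biased coin flips
    return p, xs


rng = random.Random(0)  # fixed seed so the run is reproducible
p, xs = draw_dataset(a=50.0, n=1000, rng=rng)
print(p, sum(xs) / len(xs))  # the sample mean should land near the drawn p
```

In practice only `xs` would be observed; `p` is returned here just to make the thought experiment visible.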
24:01 24:04 This posterior is basically something 24:06 that, what's beautiful in Bayesian statistics is, 24:09 as soon as you have this distribution, 24:13 it's essentially capturing all the information about the data 24:17 that you want for p. 24:19 And it's not just the point. 24:20 It's not just an average. 24:21 It's actually an entire distribution 24:23 for the possible values of theta. 24:27 And it's not the same thing as saying, well, 24:30 if theta hat is equal to Xn bar, in the Gaussian case I know 24:35 that this has some mean, mu. 24:37 And then maybe it has variance sigma squared over n. 24:39 That's not what I mean by, this is my posterior distribution. 24:43 This is not what I mean. 24:46 This is going to come from this guy, the Gaussian thing 24:49 and the central limit theorem. 24:51 But what I mean is this guy. 24:52 And this came exclusively from the prior distribution. 24:58 If I had another prior, I would not necessarily 25:00 have a beta distribution on the output. 25:03 So when I have the same family of distributions 25:07 at the beginning and at the end of this flowchart, 25:11 I say that beta is a conjugate prior. 25:16 25:21 Meaning I put in beta as a prior and I get beta as [INAUDIBLE] 25:27 And that's why betas are so popular. 25:30 Conjugate priors are really nice, 25:32 because you know that whatever you put in, what you're going 25:35 to get in the end is a beta. 25:37 So all you have to think about is the parameters. 25:38 You don't have to check again what the posterior is 25:41 going to look like, what the PDF of this guy is going to be. 25:43 You don't have to think about it. 25:44 You just have to check what the parameters are. 25:46 And there's families of conjugate priors. 25:48 Gaussian gives Gaussian, for example. 25:51 There's a bunch of them. 25:52 And this is what drives people into using specific priors as 25:57 opposed to others. 25:58 It has nice mathematical properties.
26:00 Nobody believes that p is really distributed according to beta. 26:05 But it's flexible enough and super convenient 26:08 mathematically. 26:09 26:12 Now let's see for one second, before we actually 26:14 go any further. 26:17 I didn't mention, a and b are both in here, 26:19 a and b are both positive numbers. 26:21 26:24 They can be anything positive. 26:27 So here what I did is that I updated a 26:29 into a plus the sum of my data, and b 26:34 into b plus n minus the sum of my data. 26:38 So that's essentially, a becomes a plus the number of ones. 26:41 26:45 Well, that's only when I have a and a. 26:47 So the first parameter becomes itself plus the number of ones. 26:50 And the second one becomes itself 26:51 plus the number of zeros. 26:52 26:55 And so just as a sanity check, what does this mean? 26:59 If a goes to zero, what is the beta when a goes to 0? 27:08 We can actually read this from here. 27:10 27:16 Actually, let's take a goes to-- 27:19 27:25 no. 27:26 Sorry, let's just do this. 27:27 27:38 I'll do it when we talk about non-informative priors, 27:40 because it's a little too messy. 27:42 27:47 How do we do this? 27:47 How did I get this posterior distribution, given the prior? 27:51 How do I update? Well, this is called Bayesian statistics. 27:56 And you've heard this word, Bayes, before. 27:58 And the way you've heard it is in the Bayes formula. 28:02 What was the Bayes formula? 28:03 The Bayes formula was telling you 28:05 that the probability of A given B was equal to something that 28:11 depended on the probability of B given A. That's what it was. 28:14 28:16 You can actually either remember the formula 28:18 or you can remember the definition. 28:20 And this is p of A and B divided by p of B. 28:26 So this is p of B given A, times p of A, divided by p of B. 28:35 That's what Bayes formula is telling you. 28:37 Agree? 28:40 So now what I want is to have something that's telling me 28:46 how this is going to work.
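The update rule stated above (the first parameter gains the number of ones, the second gains the number of zeros) is short enough to write down directly. The sketch below is illustrative only: the names `update_beta` and `beta_mean` and the toy data are hypothetical, and the posterior-mean formula a / (a + b) is the standard mean of a Beta(a, b) distribution rather than something derived in the lecture.

```python
def update_beta(a, b, xs):
    """Conjugate update for Bernoulli data: Beta(a, b) -> Beta(a + #ones, b + #zeros)."""
    ones = sum(xs)
    zeros = len(xs) - ones
    return a + ones, b + zeros


def beta_mean(a, b):
    """Mean of a Beta(a, b) distribution."""
    return a / (a + b)


xs = [1, 0, 1, 1, 0]                        # toy data: three ones, two zeros
a_post, b_post = update_beta(2.0, 2.0, xs)  # prior Beta(2, 2)
print(a_post, b_post)                       # 5.0 4.0
print(beta_mean(2.0, 2.0), beta_mean(a_post, b_post))  # prior mean 0.5, posterior mean 5/9
```

The posterior mean 5/9 sits between the prior mean 0.5 and the sample mean 0.6, which matches the earlier remark that the prior plays a smaller role as more data comes in.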
28:49 What is going to play the role of those events, A and B? 28:54 Well, one is going to be, this is going 28:59 to be the distribution of my parameter theta, 29:01 given that I see the data. 29:03 And this is going to tell me, what 29:05 is the distribution of the data, given that I know what 29:07 my parameter theta is. 29:09 But that part, if this is the data and this 29:11 is the parameter theta, this is what 29:13 we've been doing all along. 29:15 The distribution of the data, given the parameter, here 29:18 was n iid Bernoulli p. 29:22 I knew exactly what their joint probability mass function is. 29:27 Then, that was what? 29:29 So we said that this is going to be my data 29:32 and this is going to be my parameter. 29:34 29:37 So that means that this is the probability of my data, 29:40 given the parameter. 29:43 This is the probability of the parameter. 29:45 What is this? 29:46 What did we call this? 29:49 This is the prior. 29:50 It's just the distribution of my parameter. 29:53 Now what is this? 29:56 Well, this is just the distribution 29:57 of the data itself. 30:00 This is essentially the distribution of this, 30:06 if this was indeed not conditioned on p. 30:15 So if I don't condition on p, this data 30:18 is going to be a bunch of iid Bernoulli with some parameter. 30:23 But the parameter is random, right. 30:25 So for different realizations of this data set, 30:27 I'm going to get different parameters for the Bernoulli. 30:30 And so that leads to some sort of convolution. 30:34 It's not really a convolution in this case, 30:36 but it's like some sort of composition of distributions. 30:38 I have the randomness that comes from here and then, 30:41 the randomness that comes from realizing the Bernoulli. 30:44 That's just the marginal distribution. 30:46 It actually might be painful to understand what this is, right. 30:49 In a way, it's sort of a mixture and it's not super nice. 30:52 But we'll see that this actually won't matter for us.
30:55 This is going to be some number. 30:57 It's going to be there. 30:58 But it won't matter for us what it is. 31:00 Because it actually does not depend on the parameter. 31:02 And that's all that matters to us. 31:04 31:09 Let's put some names on those things. 31:11 This was very informal. 31:12 So let's put some actual names on what we call prior. 31:19 So what is the formal definition of a prior, 31:22 what is the formal definition of a posterior, 31:24 and what are the rules to update it? 31:27 So I'm going to have my data, which is going to be X1, ..., Xn. 31:30 31:35 Let's say they are iid, but they don't actually have to be. 31:38 And so I'm going to have them given theta. 31:41 31:47 And when I say given, it's either 31:48 given like I did in the first part of this course, 31:51 in all previous chapters, or conditionally on. 31:55 If you're thinking like a Bayesian, what I really mean 31:58 is conditionally on this random parameter. 32:02 It's as if it was a fixed number. 32:06 They're going to have a distribution, 32:08 X1, ..., Xn is going to have some distribution. 32:12 Let's assume for now it's a PDF, pn of X1, ..., Xn. 32:19 I'm going to write given theta like this. 32:22 So for example, what is this? 32:24 Let's say this is a PDF. 32:27 It could be a PMF. 32:28 Everything I say, I'm going to think of them as being PDFs. 32:31 I'm going to combine PDFs with PDFs, but I 32:33 could combine PDF with PMF, PMF with PDF, or PMF with PMF. 32:37 So everywhere you see a D, it could be an M. 32:41 Now I have those things. 32:42 So what does it mean? 32:43 So here is an example. 32:46 X1, ..., Xn are iid N(theta, 1). 32:53 Now I know exactly what the joint PDF of this thing is. 32:57 It means that pn of X1, ..., Xn given theta is equal to what? 33:03 Well, it's 1 over 2 pi to the power n over 2, times 33:10 e to the minus sum from i equal 1 to n 33:15 of xi minus theta squared, divided by 2. 33:18 So that's just the joint distribution of n iid 33:21 N(theta, 1) random variables. 33:25 That's my pn given theta.
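For reference, the Gaussian example above can be written out as follows. This is a reconstruction in standard notation, assuming the model is n iid N(theta, 1) observations; the symbols p_n, x_i and theta follow the lecture.

```latex
p_n(x_1, \dots, x_n \mid \theta)
  = \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi}}
      \exp\!\left( -\frac{(x_i - \theta)^2}{2} \right)
  = \frac{1}{(2\pi)^{n/2}}
      \exp\!\left( -\frac{1}{2} \sum_{i=1}^{n} (x_i - \theta)^2 \right)
```

The product form comes from independence; collecting the n factors of 1 over the square root of 2 pi gives the (2 pi) to the n over 2 in the denominator.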
33:27 Now this is what we denoted by f sub theta before. 33:33 We had the subscript before, but now we just put a bar and theta 33:36 because we want to remember that this is actually 33:38 conditioned on theta. 33:40 But this is just notation. 33:42 You should just think of this as being just the usual thing 33:46 that you get from some statistical model. 33:50 Now, that's going to be pn. 33:53 34:11 Theta has prior distribution, pi. 34:19 34:22 For example, so think of it as either PDF or PMF again. 34:29 For example, pi of theta was what? 34:33 Well, it was some constant times theta to the a minus 1, 34:40 1 minus theta to the a minus 1. 34:43 So it has some prior distribution, 34:45 and that's another PDF. 34:49 So now I'm given the distribution of my 34:51 X's given theta, and given the distribution of my theta. 34:54 I'm given this guy. 34:57 That's this guy. 35:00 I'm given that guy, which is my pi. 35:05 So that's my pn of X1, ..., Xn given theta. 35:11 That's my pi of theta. 35:13 35:17 Well, this is just the integral of pn 35:21 of X1, ..., Xn given theta, times pi of theta, d theta, 35:28 over all possible values of theta. 35:29 That's just when I integrate out my theta, 35:33 or I compute the marginal distribution, 35:35 I do this by integrating. 35:37 That's just basic probability, conditional probabilities. 35:41 Then if I had a PMF, I would just 35:42 sum over the values of theta. 35:43 35:49 Now what I want is to find what's called, 35:55 so that's the prior distribution, 35:58 and I want to find the posterior distribution. 36:01 36:15 It's pi of theta, given X1, ..., Xn. 36:18 36:21 If I use Bayes' rule I know that this 36:23 is pn of X1, ..., Xn given theta, times pi of theta. 36:34 And then it's divided by the distribution 36:37 of those guys, which I will write as the integral over theta 36:41 of pn of X1, ..., Xn given theta, times pi of theta, d theta. 36:48 36:55 Everybody's with me, still?
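The formula above (likelihood times prior, divided by the integral of that product over theta) can be checked numerically on a grid for the Bernoulli and beta example. This is a rough sketch under assumptions: the grid size, the toy data, and the names `posterior_on_grid` and `beta_pdf` are hypothetical, and the integral is approximated by a midpoint sum.

```python
import math


def beta_pdf(x, a, b):
    """Density of Beta(a, b) on (0, 1), using B(a, b) = Gamma(a) Gamma(b) / Gamma(a + b)."""
    const = math.gamma(a + b) / (math.gamma(a) * math.gamma(b))
    return const * x ** (a - 1) * (1 - x) ** (b - 1)


def posterior_on_grid(xs, a, b, m=2000):
    """Bayes' rule on a grid: likelihood times prior, divided by its integral over theta."""
    s, n = sum(xs), len(xs)
    grid = [(i + 0.5) / m for i in range(m)]        # midpoints of m cells in (0, 1)
    unnorm = [
        t ** s * (1 - t) ** (n - s)                 # Bernoulli likelihood of the data
        * t ** (a - 1) * (1 - t) ** (b - 1)         # Beta(a, b) prior, constant dropped
        for t in grid
    ]
    evidence = sum(unnorm) / m                      # midpoint approximation of the integral
    return grid, [u / evidence for u in unnorm]


xs = [1, 0, 1, 1, 0]
grid, post = posterior_on_grid(xs, a=2.0, b=2.0)
# Compare with the closed form Beta(a + #ones, b + #zeros) = Beta(5, 4).
max_gap = max(abs(p - beta_pdf(t, 5.0, 4.0)) for t, p in zip(grid, post))
print(max_gap)  # small: the grid posterior matches the conjugate formula
```

Note that the prior's normalizing constant is dropped inside `unnorm` and it does not matter: any factor that does not depend on theta cancels when dividing by the integral.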
36:57 If you're not comfortable with this, 36:59 it means that you probably need to go read your couple of pages 37:03 on conditional densities and conditional 37:04 PMFs from your probability class. 37:07 There's really not much there. 37:08 It's just a matter of being able to define those quantities, f, 37:13 the density of x given y. 37:15 This is just what's called a conditional density. 37:17 You need to understand what this object is 37:19 and how it relates to the joint distribution of x and y, 37:21 or maybe the distribution of x or the distribution of y. 37:24 37:27 But it's the same rules. 37:29 One way to actually remember this 37:31 is, this is exactly the same rules as this. 37:33 When you see a bar, it's the same thing as the probability 37:36 of this and this guy, 37:37 so for densities, it's just a comma, 37:40 divided by the probability of the second guy. 37:43 That's it. 37:45 So if you remember this, you can just do some pattern matching 37:48 and see what I just wrote here. 37:49 37:53 Now, I can compute every single one of these guys. 37:57 This is something I get from my modeling. 38:04 So I did not write this. 38:05 It's not written in the slides. 38:09 But I gave a name to this guy, that was my prior distribution. 38:14 And that was my posterior distribution. 38:16 38:22 In chapter three, maybe, what did we call this guy? 38:26 38:32 The one that does not have a name and that's in the box. 38:35 38:39 What did we call it? 38:40 38:43 AUDIENCE: [INAUDIBLE] 38:46 PHILIPPE RIGOLLET: It is the joint distribution of the Xi's. 38:48 38:51 And we gave it a name. 38:53 AUDIENCE: [INAUDIBLE] 38:54 PHILIPPE RIGOLLET: It's the likelihood, right? 38:56 This is exactly the likelihood. 38:57 This was the likelihood of theta. 38:59 39:03 And this is something that's very important to remember, 39:06 and that really reminds you that these things are really not 39:10 that different.
39:11 Maximum likelihood estimation and Bayesian estimation, 39:13 because your posterior is really just your likelihood times 39:18 something that's just putting some weights on the thetas, 39:23 depending on where you think theta should be. 39:26 If I had, say a maximum likelihood estimate, 39:28 and my likelihood in theta looked like this, 39:31 but my prior in theta looked like this. 39:33 I said, oh I really want thetas that are like this. 39:37 So what's going to happen is that, I'm 39:38 going to turn this into some posterior that looks like this. 39:41 39:44 So I'm just really weighting. In this posterior, 39:47 this is a constant that does not depend on theta, right? 39:49 Agreed? 39:50 I integrated over theta, so theta is gone. 39:53 So forget about this guy. 39:56 I have basically, that the posterior distribution up 39:59 to scaling, because it has to be a probability density and not 40:01 just any function that's positive, 40:03 is the product of these two guys. 40:05 It's a weighted version of my likelihood. 40:06 That's all it is. 40:07 I'm just weighting the likelihood, 40:09 using my prior belief on theta. 40:13 And so given this guy, a natural estimator, 40:16 if you follow the maximum likelihood principle, 40:19 would be the maximum of this posterior. 40:23 Agreed? 40:24 That would basically be doing exactly what maximum likelihood 40:28 estimation is telling you. 40:31 So it turns out that you can. 40:33 It's called Maximum A Posteriori, 40:35 and I won't talk much about this, or MAP. 40:39 That's Maximum a Posteriori. 40:44 So it's just the theta hat that is the 40:47 argmax of pi of theta, given X1, Xn. 40:50 40:54 And it sounds like it's OK. 40:56 I'll give you a density and you say, OK 40:58 I have a density for all values of my parameters. 41:00 You're asking me to summarize it into one number. 41:03 I'm just going to take the most likely number of those guys. 41:06 But you could summarize it otherwise. 41:08 You could take the average.
41:10 You could take the median. 41:12 You could take a bunch of numbers. 41:14 And the beauty of Bayesian statistics 41:16 is that you don't have to take any number in particular. 41:19 You have an entire posterior distribution. 41:21 This is not only telling you where theta is, 41:25 but it's actually telling you the difference 41:29 if you actually give us something 41:31 that gives you the posterior. 41:33 Now, let's say the theta is p between 0 and 1. 41:36 If my posterior distribution looks like this, 41:39 or my posterior distribution looks like this, 41:43 then those two guys have the same mode. 41:47 This is the same value. 41:49 And they're symmetric, so they'll also have the same mean. 41:51 So these two posterior distributions 41:53 give me the same summary into one number. 41:55 However clearly, one is much more confident 41:58 than the other one. 41:59 So I might as well just spit it out as a solution. 42:04 You can do even better. 42:05 People actually do things, such as drawing a random number 42:09 from this distribution. 42:10 Say, this is my number. 42:12 That's kind of dangerous, but you 42:14 can imagine you could do this. 42:15 42:20 This is what works. 42:22 That's what we went through. 42:23 So here, as you notice I don't care so much about this part 42:28 here. 42:30 Because it does not depend on theta. 42:32 So I know that given the product of those two things, 42:35 this thing is only the constant that I need to divide by 42:37 so that when I integrate this thing over theta, 42:40 it integrates to one. 42:41 Because this has to be a probability density on theta. 42:45 I can write this and just forget about that part. 42:47 And that's what's written on the top of this slide. 42:52 This notation, this sort of weird alpha, or I don't know. 42:57 Infinity sign cropped to the right. 42:59 Whatever you want to call this thing 43:02 is actually just really emphasizing the fact 43:04 that I don't care.
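The point about two posteriors that share a mode and a mean but differ in confidence can be illustrated with a small sketch. It assumes, purely for illustration, that both posteriors are symmetric beta distributions, Beta(2, 2) and Beta(50, 50); the lecture only draws the two curves on the board. The formulas used are the standard mean, mode, and variance of a Beta(a, b).

```python
def beta_summary(a, b):
    """Mean, mode and variance of a Beta(a, b); the mode formula needs a, b > 1."""
    mean = a / (a + b)
    mode = (a - 1) / (a + b - 2)
    var = a * b / ((a + b) ** 2 * (a + b + 1))
    return mean, mode, var

wide = beta_summary(2, 2)      # a spread-out posterior on (0, 1)
narrow = beta_summary(50, 50)  # a much more concentrated posterior

# Both have mean 0.5 and mode 0.5, so MAP and posterior mean agree,
# but the variances differ by a factor of about 20.
print(wide, narrow)
```

A single-number summary would report 0.5 in both cases, which is why keeping the whole posterior is more informative than keeping only the MAP or the mean.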
43:06 I write it because I can, but you know what it is. 43:12 43:17 In some instances, you have to compute the integral. 43:19 In some instances, you don't have to compute the integral. 43:21 And a lot of Bayesian computation 43:23 is about saying, OK it's actually 43:25 really hard to compute this integral, 43:27 so I'd rather not do it. 43:28 So let me try to find some methods that will allow me 43:31 to sample from the posterior distribution, 43:33 without having to compute this. 43:35 And that's what's called Markov chain 43:37 Monte Carlo, or MCMC, and that's exactly what they're doing. 43:40 They're using only ratios of things 43:42 like that for different thetas, 43:44 which means that if you take ratios, 43:45 the normalizing constant is gone and you don't 43:47 need to find this integral. 43:50 So we won't go into those details at all. 43:53 That would be the purpose of an entire course 43:54 on Bayesian inference. 43:56 Actually, even Bayesian computation 43:59 would be an entire course on its own. 44:02 And there's some very interesting things 44:03 that are going on there, at the interface of stats 44:05 and computation. 44:06 44:10 So let's go back to our example and see if we can actually 44:12 compute any of those things. 44:13 Because it's very nice to give you some data, some formulas. 44:17 Let's see if we can actually do it. 44:19 In particular, can I actually recover this claim 44:23 that the posterior associated to a beta prior with a Bernoulli 44:31 likelihood is actually giving me a beta again? 44:35 What was my prior? 44:36 44:42 So p was following a beta with parameters A, A, which 44:45 means that the density of p, 44:48 44:53 that was pi of theta. 44:56 Well I'm going to write this as pi of p-- 44:59 was proportional to p to the A minus 1 times 1 minus p 45:05 to the A minus 1. 45:08 So that's the first ingredient I need to compute my posterior. 45:11 I really need only two, if I want it up to a constant. 45:14 The second one was p hat.
45:16 45:20 We've computed that many times. 45:22 And we had even a nice compact way of writing it, 45:25 which was that pn of X1, Xn, given the parameter p. 45:32 So the joint density of my data, given p, that's my likelihood. 45:36 The likelihood of p was what? 45:38 Well it was p to the sum of Xi's. 45:41 45:44 1 minus p to the n minus sum of the Xi's. 45:46 45:50 Anybody wants me to parse this more? 45:53 Or do you remember seeing that from maximum likelihood 45:56 estimation? 45:57 Yeah? 45:57 AUDIENCE: [INAUDIBLE] 46:02 PHILIPPE RIGOLLET: That's what conditioning does. 46:04 46:10 AUDIENCE: [INAUDIBLE] previous slide. 46:15 [INAUDIBLE] bottom there, it says D pi of t. 46:19 Shouldn't it be dt pi of t? 46:23 PHILIPPE RIGOLLET: So D pi of T is 46:25 a measure theoretic notation, which I used without thinking. 46:29 And I should not because I can see it upsets you. 46:32 D pi of T is just a natural way to say 46:35 that I integrate against whatever I'm 46:38 given for the prior of theta. 46:43 In particular, if the prior of theta is a mix of a PDF and a point 46:48 mass, maybe I say that my p takes 46:51 value 0.5 with probability 0.5. 46:54 And then is uniform on the interval with probability 0.5. 46:58 For this, I neither have a PDF nor a PMF. 47:01 But I can still talk about integrating with respect 47:04 to this, right? 47:04 It's going to look like, if I take the integral of a function f of T, 47:08 D pi of T, it is going to be one half of f of one half. 47:14 That's the point mass with probability one half, 47:16 at one half. 47:17 Plus one half of the integral between 0 and 1, of f of T, dT. 47:23 This is just the notation, which is actually funnily enough, 47:26 interchangeable with pi of DT. 47:29 47:32 But if you have a density, it's really 47:34 just the density pi of T, dT. 47:39 That's if pi is really a density, and the other notation 47:41 is for when pi is a measure and not a density. 47:44 47:46 Everybody else, forget about this.
47:49 This is not something you should really 47:51 worry about at this point. 47:52 This is more graduate level probability classes. 47:55 But yeah, it's called measure theory. 47:57 And that's when you think of pi as being a measure 47:59 in an abstract fashion. 47:59 You don't have to worry whether it's a density 48:01 or not, or whether it has a density. 48:04 48:08 So everybody is OK with this? 48:10 48:15 Now I need to compute my posterior. 48:17 And as I said, my posterior is really 48:23 just the product of the likelihood weighted 48:25 by the prior. 48:28 Hopefully, at this stage of your application, 48:33 you can multiply two functions. 48:35 So what's happening is, if I multiply this guy 48:37 with this guy, p gets this guy to the power 48:41 this guy plus this guy. 48:42 48:53 And then 1 minus p gets the power n minus some of Xi's. 49:00 So this is always from I equal 1 to n. 49:02 And then plus A minus 1 as well. 49:04 49:10 This is up to constant, because I still need to solve this. 49:15 And I could try to do it. 49:17 But I really don't have to, because I 49:18 know that if my density has this form, then 49:24 it's a beta distribution. 49:25 And then I can just go on Wikipedia 49:26 and see what should be the normalization factor. 49:29 But I know it's going to be a beta distribution. 49:31 It's actually the beta with parameter. 49:34 So this is really my beta with parameter, sum of Xi, 49:39 i equal 1 to n plus A minus 1. 49:43 And then the second parameter is n minus sum 49:46 of the Xi's plus A minus 1. 49:49 49:54 I just wrote what was here. 49:59 What happened to my one? 50:01 Oh no, sorry. 50:02 Beta has the power minus 1. 50:05 So that's the parameter of the beta. 50:08 And this is the parameter of the beta. 50:10 50:15 Beta is over there, right? 50:16 So I just replace A by what I see. 50:19 A is just becoming this guy plus this guy 50:22 and this guy plus this guy. 50:26 Everybody is comfortable with this computation? 
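The computation above boils down to a simple update rule for the two beta parameters. The sketch below states it for a general Beta(a, b) prior, which reduces to the lecture's case when b equals a. The function name beta_bernoulli_update and the sample data are hypothetical.

```python
def beta_bernoulli_update(a, b, xs):
    """Posterior parameters for a Beta(a, b) prior and iid Bernoulli data xs.

    Multiplying p^(a-1) (1-p)^(b-1) by p^s (1-p)^(n-s) adds the number of
    ones to the first parameter and the number of zeros to the second.
    """
    s = sum(xs)   # number of ones
    n = len(xs)   # sample size
    return a + s, b + (n - s)

# Prior Beta(2, 2), then four observations containing three ones.
posterior = beta_bernoulli_update(2, 2, [1, 0, 1, 1])

# Updating one observation at a time lands on the same parameters.
step = (2, 2)
for x in [1, 0, 1, 1]:
    step = beta_bernoulli_update(step[0], step[1], [x])
```

One consequence of conjugacy is visible here: processing the data in one batch or one observation at a time gives the same posterior.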
50:28 50:34 We just agreed that beta priors for Bernoulli observations 50:38 are certainly convenient. 50:42 Because they are just conjugate, and we know 50:44 that's what is going to come out in the end. 50:46 That's going to be a beta as well. 50:48 I just claim it was convenient. 50:50 It was certainly convenient to compute this, right? 50:52 There was certainly some compatibility 50:55 when I had to multiply this function by that function. 50:57 And you can imagine that things could go much more wrong, 51:00 than just having p to some power and p to some power, 1 minus p 51:03 to some power, when it might just be some other power. 51:06 Things were nice. 51:09 Now this is nice, but I can also question the following things. 51:12 Why beta, for one? 51:14 The beta tells me something. 51:17 That's convenient, but then how do I pick A? 51:20 I know that A should definitely capture the fact that where 51:27 I want to have my p most likely located. 51:30 But it also actually also captures 51:32 the variance of my beta. 51:34 And so choosing different As is going 51:36 to have different functions. 51:37 If I have A and B, If I started with the beta with parameter. 51:43 If I started with a B here, I would just pick up the B here. 51:48 Agreed? 51:49 And that would just be a symmetric. 51:51 But they're going to capture mean and variance 51:53 of this thing. 51:53 And so how do I pick those guys? 51:56 If I'm a doctor and you're asking me, 51:59 what do you think the chances of this drug working 52:01 in this kind of patients is? 52:03 And I have to spit out the parameters of a beta for you, 52:06 it might be a bit of a complicated thing to do. 52:08 So how do you do this, especially for problems? 52:10 So by now, people have actually mastered 52:14 the art of coming up with how to formulate those numbers. 52:19 But in new problems that come up, how do you do this? 
52:21 What happens if you want to use Bayesian methods, 52:23 but you actually do not know what you expect to see? 52:30 To be fair, before we started this class, I hope all of you 52:33 had no idea whether people tend to bend their head to the right 52:36 or to the left before kissing. 52:38 Because if you did, well you have too much time 52:40 on your hands and I should double your homework. 52:42 52:44 So in this case, maybe you still want 52:46 to use the Bayesian machinery. 52:48 Maybe you just want to do something nice. 52:50 It's nice right, I mean it worked out pretty well. 52:53 What if you want to do? 52:54 Well you actually want to use some priors that 52:56 carry no information, that basically do not prefer 53:00 any theta to another theta. 53:02 Now, you could read this slide or you 53:05 could look at this formula. 53:06 53:10 We just said that this pi here was just here 53:14 to weigh some thetas more than others, depending 53:18 on their prior belief. 53:19 If our prior belief does not want 53:21 to put any preference towards some thetas than to others, 53:24 what do I do? 53:26 AUDIENCE: [INAUDIBLE] 53:27 PHILLIPE RIGOLLET: Yeah, I remove it. 53:29 And the way to remove something we multiply by, 53:31 is just replace it by one. 53:32 That's really what we're doing. 53:35 If this was a constant not depending on theta, 53:38 then that would mean that we're not preferring any theta. 53:41 And we're looking at the likelihood. 53:44 But not as a function that we're trying to maximize, 53:46 but it is a function that we normalize in such a way 53:50 that it's actually a distribution. 53:52 So if I have pi, which is not here, 53:54 this is really just taking the like likelihood, 53:56 which is a positive function. 53:57 It may not integrate to 1, so I normalize it 53:59 so that it integrates to 1. 54:02 And then I just say, well this is my posterior distribution. 
54:05 Now I could just maximize this thing 54:06 and spit out my maximum likelihood estimator. 54:09 But I can also integrate and find 54:10 what the expectation of this guy is. 54:12 I can find what the median of this guy is. 54:14 I can sample from this guy. 54:16 I can build, understand what the variance of this guy is. 54:19 Which is something we did not do when we just did 54:21 maximum likelihood estimation because given a function, all 54:24 we cared about was the argmax of this function. 54:27 54:31 These priors are called uninformative. 54:36 This is just replacing this function by one or by a constant. 54:43 Because it still has to be a density. 54:45 54:49 If I have a bounded set, I'm just 54:50 looking for the uniform distribution 54:52 on this bounded set, the one that puts constant one 54:56 over the size of this thing. 54:59 But if I have an unbounded set, what 55:01 is the density that takes a constant value 55:03 on the entire real line, for example? 55:07 What is this density? 55:08 55:13 AUDIENCE: [INAUDIBLE] 55:16 PHILIPPE RIGOLLET: Doesn't exist, right? 55:18 It just doesn't exist. 55:20 The way you can think of it is a Gaussian 55:22 with the variance going to infinity, maybe, 55:24 or something like this. 55:26 But you can think of it in many ways. 55:27 You can think of the limit of the uniform between minus T 55:32 and T, with T going to infinity. 55:34 But this thing is actually zero. 55:36 There's nothing there. 55:39 You can actually still talk about this. 55:41 You could always talk about this thing, where 55:44 you think of this guy as being a constant, 55:46 remove this thing from this equation, and just say, 55:49 well my posterior is just the likelihood 55:51 divided by the integral of the likelihood over theta. 55:54 And if theta is the entire real line, so be it. 55:58 As long as this integral converges, 56:00 you can still talk about this stuff. 56:01 56:04 This is what's called an improper prior.
56:06 56:09 An improper prior is just a non-negative function defined 56:11 on theta, but it does not have to integrate to one, 56:17 or to anything finite. 56:18 56:20 If I integrate the function equal to 1 56:22 on the entire real line, what do I get? 56:24 56:27 Infinity. 56:28 56:32 It's not a proper prior, and it's called an improper prior. 56:35 And those improper priors are usually 56:39 what you see when you start to want non-informative priors 56:42 on infinite sets of thetas. 56:44 That's just the nature of it. 56:46 You should think of them as being the uniform distribution 56:50 on some infinite set, if that thing were to exist. 56:52 56:56 Let's see some examples of non-informative priors. 57:01 If I'm in the interval 0, 1 this is a bounded set. 57:04 So I can talk about the uniform prior 57:07 on the interval 0, 1 for a parameter, p of a Bernoulli. 57:10 57:26 If I want to talk about this, then it 57:28 means that my prior is p follows some uniform on the interval 57:35 0, 1. 57:37 So that means that f of x is 1 if x is in 0, 1, 57:48 and 0 otherwise. There is actually not even a normalization. 57:52 This thing integrates to 1. 57:53 And so now if I look at my likelihood, 57:56 it's still the same thing. 57:57 So my posterior becomes pi of p given X1, Xn. 58:04 That's my posterior. 58:07 I don't write the likelihood again, 58:08 because we still have it-- 58:09 well we don't have it here anymore. 58:11 58:15 The likelihood is given here. 58:17 Copy, paste over there. 58:20 The posterior is just this thing times 1. 58:23 So you will see it in a second. 58:24 So it's p to the power sum of the Xi's, one minus p 58:28 to the power, n minus sum of the Xi's. 58:31 And then it's multiplied by 1, and then divided by this 58:36 integral between 0 and 1 of p to the power sum of the Xi's, 58:42 times 1 minus p to the power n minus sum of the Xi's, 58:47 dp, which does not depend on p. 58:51 And I really don't care what the thing actually is. 58:53 58:58 That's the posterior of p.
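As a sanity check on the formula just written, the sketch below normalizes the Bernoulli likelihood numerically over the interval 0, 1 and compares it with the density of a beta distribution with parameters sum of the Xi's plus 1 and n minus sum of the Xi's plus 1. The midpoint rule, the number of cells m, and the values s = 3, n = 10 are illustrative assumptions.

```python
import math

def beta_pdf(p, a, b):
    """Density of Beta(a, b) at p, with the usual gamma-function constant."""
    const = math.gamma(a + b) / (math.gamma(a) * math.gamma(b))
    return const * p ** (a - 1) * (1 - p) ** (b - 1)

def likelihood_integral(s, n, m=20000):
    """Midpoint-rule approximation of the integral of p^s (1-p)^(n-s) over (0, 1)."""
    step = 1.0 / m
    return sum(((i + 0.5) * step) ** s * (1 - (i + 0.5) * step) ** (n - s)
               for i in range(m)) * step

s, n = 3, 10                      # three ones out of ten observations (made up)
z = likelihood_integral(s, n)     # the constant that does not depend on p

def uniform_prior_posterior(p):
    # Likelihood times 1 (the uniform prior), divided by the constant.
    return p ** s * (1 - p) ** (n - s) / z

max_gap = max(abs(uniform_prior_posterior(p) - beta_pdf(p, s + 1, n - s + 1))
              for p in (0.1, 0.3, 0.5, 0.9))
```

The gap between the two curves is at the level of numerical error, which is consistent with the normalized likelihood being exactly that beta density.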
59:03 And now I can see, well what is this? 59:06 It's actually just the beta with parameters 59:12 this guy plus 1. 59:14 59:19 And this guy plus 1. 59:21 59:34 I didn't tell you what the expectation of a beta was. 59:38 We don't know what the expectation of a beta 59:39 is, agreed? 59:42 If I wanted to find say, the expectation of this thing that 59:45 would be some good estimator, we know 59:47 that the maximum of this guy-- what 59:49 is the maximum of this thing? 59:51 59:54 Well, it's just this thing, it's the average of the Xi's. 59:57 That's just the maximum likelihood estimator 59:59 for Bernoulli. 60:00 We know it's the average. 60:01 Do you think if I take the expectation of this thing, 60:03 I'm going to get the average? 60:05 60:13 So actually, I'm not going to get the average. 60:15 I'm going to get this guy, divided by this guy plus this guy, which is the sum of the Xi's plus 1, divided by n plus 2. 60:19 60:27 Let's look at what this thing is doing. 60:28 It's looking at the number of ones and it's adding one. 60:34 And this guy is looking at the number of zeros 60:36 and it's adding one. 60:39 Why is it adding this one? 60:41 What's going on here? 60:42 60:47 This is going to matter mostly when the number of ones 60:52 is actually zero, or the number of zeros is zero. 60:56 Because what it does is just push the estimate away from zero. 61:00 And why is that something that this Bayesian method actually 61:03 does for you automatically? 61:04 It's because we put this non-informative 61:06 prior on p, which was uniform on the interval 0, 1. 61:11 In particular, we know that the probability 61:12 that p is equal to 0 is zero. 61:16 And the probability p is equal to 1 is zero. 61:19 And so the problem is that if I did not 61:21 add this 1, with some positive probability 61:24 I would be allowed to spit out something that actually had 61:28 p hat equal to 0. 61:30 If by chance, let's say I have n is equal to 3, 61:33 and I get only 0, 0, 0, that could happen with probability
61:37 1 minus p, cubed. 61:41 61:46 That's not something that I want. 61:47 And I'm using my priors. 61:49 My prior is not informative, but somehow it captures the fact 61:51 that I don't want to believe p is going 61:53 to be either equal to 0 or 1. 61:56 So that's sort of taken care of here. 61:59 So let's move away a little bit from the Bernoulli example, 62:05 shall we? 62:06 I think we've seen enough of it. 62:08 And so let's talk about the Gaussian model. 62:10 Let's say I want to do Gaussian inference. 62:12 62:17 I want to do inference in a Gaussian model, 62:19 using Bayesian methods. 62:20 62:30 What I want is that X1, Xn are, say, N(0, 1) iid. 62:39 62:44 Sorry, N(theta, 1), iid conditionally on theta. 62:47 62:50 That means that pn of X1, Xn, given theta 62:56 is equal to exactly what I wrote before. 62:58 So 1 over square root of 2 pi, to the n, times exponential of minus one half 63:04 sum of Xi minus theta squared. 63:09 So that's just the joint distribution 63:11 of my Gaussian with mean theta. 63:13 And the question is, what 63:14 is the posterior distribution? 63:17 Well here I said, let's use the uninformative prior, 63:22 which is an improper prior. 63:23 It puts the same weight on everyone. 63:25 That's the so-called uniform on the entire real line. 63:29 So that's certainly not a density. 63:31 But I can still just use this. 63:34 So all I need to do is take this 63:40 and normalize it. 63:44 But if you look at this, essentially I 63:47 want to understand. 63:49 So this is proportional to the exponential of 63:52 minus one half sum from i equal 1 63:55 to n of Xi minus theta squared. 63:58 And now I want to see this thing as a density, 64:01 not on the Xi's but on theta. 64:03 64:06 What I want is a density on theta. 64:10 So it looks like I have chances of getting something 64:13 that looks like a Gaussian. 64:16 To have a Gaussian, I would need to see minus one half.
64:19 And then I would need to see theta minus something 64:21 here, not just the sum of something minus thetas. 64:25 So I need to work a little bit more, 64:29 to expand the square here. 64:31 So this thing here is going to be 64:32 equal to exponential minus one half sum from I equal 1 64:37 to n of Xi squared minus 2Xi theta plus theta squared. 64:45 65:10 Now what I'm going to do is, everything remember 65:13 is up to this little sign. 65:15 So every time I see a term that does not depend on theta, 65:19 I can just push it in there and just make it disappear. 65:22 Agreed? 65:24 This term here, exponential minus one half sum of Xi 65:28 squared, does it depend on theta? 65:31 No. 65:32 So I'm just pushing it here. 65:33 This guy, yes. 65:34 And the other one, yes. 65:35 So this is proportional to exponential sum of the Xi. 65:45 And then I'm going to pull out my theta, the minus one half 65:47 canceled with the minus 2. 65:50 And then I have minus one half sum from I 65:56 equal 1 to n of theta squared. 65:58 66:01 Agreed? 66:03 So now what this thing looks like, 66:05 this looks very much like some theta minus something squared. 66:09 This thing here is really just n over 2 times theta. 66:15 66:18 Sorry, times theta squared. 66:21 So now what I need to do is to write this of the form, theta 66:25 minus something. 66:26 Let's call it mu, squared, divided by 2 sigma squared. 66:31 I want to turn this into that, maybe up to terms 66:34 that do not depend on theta. 66:36 That's what I'm going to try to do. 66:39 So that's called completing the squaring. 66:40 That's some exercises you do. 66:42 You've done it probably, already in the homework. 66:44 And that's something you do a lot when 66:46 you do Bayesian statistics, in particular. 66:48 So let's do this. 66:50 What is it going to be the leading term? 66:51 Theta squared is going to be multiplied by this thing. 66:54 So I'm going to pull out my n over 2. 
66:57 And then I'm going to write this as minus n over 2. 67:03 And then I'm going to write theta minus something squared. 67:06 And this something is going to be one half of what 67:08 I see in the cross-product. 67:10 67:12 I need to actually pull this thing out. 67:14 So let me write it like that first. 67:18 So that's theta squared. 67:21 And then I'm going to write it as minus 2 times 1 over n sum 67:30 from i equal 1 to n of Xi's times theta. 67:36 That's exactly just a rewriting of what we had before. 67:39 And that should look much more familiar. 67:41 67:44 A squared minus 2AB, and then I'm missing something. 67:49 So this thing, I'm going to be able to rewrite 67:51 as theta minus Xn bar, squared. 67:57 But then I need to remove the square of Xn bar. 68:00 Because it's not here. 68:01 68:09 So I just complete the square. 68:11 And then I actually really don't care what this thing actually 68:13 was, because it's going to go again in the little alpha 68:16 sign over there. 68:18 So this thing eventually is going 68:19 to be proportional to exponential 68:24 of minus n over 2 times theta minus Xn bar, squared. 68:31 And so we know that if this is a density that's 68:33 proportional to this guy, it has to be some N, a Gaussian, with mean Xn bar. 68:44 And the variance: this guy over here, this n, 68:47 is supposed to be 1 over sigma squared. 68:49 So the variance is really just 1 over n. 68:50 68:53 So the posterior distribution is a Gaussian 69:01 centered at the average of my observations. 69:05 And with variance, 1 over n. 69:08 69:13 Everybody's with me? 69:14 69:16 Why I'm saying this, this was the output of some computation. 69:19 But it sort of makes sense, right? 69:21 It's really telling me that the more observations I have, 69:24 the more concentrated this posterior is. 69:26 Concentrated around what? 69:27 Well around this Xn bar. 69:30 That looks like something we've sort of seen before. 69:33 But it does not have the same meaning, somehow.
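For reference, the board computation for the flat prior can be written out in one chain. This is a reconstruction of the steps described above, with the bar notation for the sample average; every factor that does not depend on theta is absorbed into the proportionality sign.

```latex
\begin{align*}
\pi(\theta \mid X_1,\dots,X_n)
  &\propto \exp\Big(-\tfrac12 \sum_{i=1}^n (X_i-\theta)^2\Big) \\
  &= \exp\Big(-\tfrac12 \sum_{i=1}^n X_i^2
       + \theta \sum_{i=1}^n X_i - \tfrac n2\,\theta^2\Big) \\
  &\propto \exp\Big(-\tfrac n2\big(\theta^2 - 2\bar X_n\,\theta\big)\Big) \\
  &= \exp\Big(-\tfrac n2(\theta-\bar X_n)^2\Big)\,
     \exp\Big(\tfrac n2\,\bar X_n^2\Big) \\
  &\propto \exp\Big(-\frac{(\theta-\bar X_n)^2}{2\cdot\frac1n}\Big),
  \qquad \bar X_n = \frac1n\sum_{i=1}^n X_i .
\end{align*}
```

Matching the last line with the form theta minus mu, squared, divided by 2 sigma squared gives mu equal to the sample average and sigma squared equal to 1 over n, so the posterior is N(X̄n, 1/n).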
69:35 This is really just the posterior distribution. 69:37 69:40 It's sort of a sanity check, that I have this 1 over n 69:43 when I have Xn bar. 69:44 But it's not the same thing as saying 69:45 that the variance of Xn bar was 1 over n, like we had before. 69:48 69:55 As an exercise, I would recommend, 69:59 if you don't get it, just try pi of theta 70:10 to be equal to some N(mu, 1). 70:15 70:18 Here, the prior that we used was completely non-informative. 70:22 What happens if I take my prior to be some Gaussian, which 70:25 is centered at mu and has the same variance 70:27 as the other guys? 70:30 So what's going to happen here is that we're 70:32 going to put a weight. 70:33 And everything that's away from mu 70:34 is going to actually get less weight. 70:38 I want to know how I'm going to be updating 70:40 this prior into a posterior. 70:41 70:44 Everybody sees what I'm saying here? 70:47 So that means that pi of theta has the density proportional 70:50 to exponential of minus one half theta minus mu squared. 70:55 So I need to multiply my likelihood by this, 71:00 and then see. 71:01 It's actually going to be a Gaussian. 71:03 This is also a conjugate prior. 71:04 It's going to spit out another Gaussian. 71:06 You're going to have to complete a square again, and just check 71:09 what it's actually giving you. 71:10 And so spoiler alert, it's going to look 71:12 like you get an extra observation, which is actually 71:14 equal to mu. 71:15 71:18 It's going to be the average of n plus 1 observations. 71:22 The first n ones being X1 to Xn. 71:24 And then, the last one being mu. 71:27 And it sort of makes sense. 71:30 That's actually a fairly simple exercise. 71:34 Rather than going into more computation, 71:36 this is something you can definitely 71:37 do when you're in the comfort of your room. 71:41 I want to talk about other types of priors.
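The spoiler for the exercise can be checked numerically. The sketch below assumes the N(theta, 1) model with an N(mu, 1) prior, and takes the posterior to be Gaussian with mean equal to the average of the n observations and mu, and variance 1 over n plus 1, which is what completing the square gives. The function names and the sample values are illustrative.

```python
def gaussian_posterior(xs, mu):
    """Posterior mean and variance for N(theta, 1) data with an N(mu, 1) prior."""
    n = len(xs)
    mean = (mu + sum(xs)) / (n + 1)  # mu acts like one extra observation
    var = 1.0 / (n + 1)
    return mean, var

def log_unnormalized_posterior(theta, xs, mu):
    """Log of likelihood times prior, dropping constants free of theta."""
    return -0.5 * sum((x - theta) ** 2 for x in xs) - 0.5 * (theta - mu) ** 2

xs = [0.2, 1.4, -0.3, 0.9]  # made-up observations
mu = 2.0                    # made-up prior mean
mean, var = gaussian_posterior(xs, mu)

# If the posterior is N(mean, var), the log posterior must differ from its
# value at the mean by exactly -(theta - mean)^2 / (2 var).
gaps = [
    abs(log_unnormalized_posterior(t, xs, mu)
        - log_unnormalized_posterior(mean, xs, mu)
        + (t - mean) ** 2 / (2 * var))
    for t in (-1.0, 0.0, 0.5, 3.0)
]
```

All the gaps are zero up to rounding, which confirms the quadratic form without redoing the algebra by hand.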
71:43 The first thing I said is, there's this beta prior 71:47 that I just pulled out of my hat and that was just convenient. 71:50 Then there was this non-informative prior. 71:52 It was convenient. 71:53 It was non-informative, so if you don't know anything 71:56 else maybe that's what you want to do. 71:58 The question is, are there any other priors that 72:01 are sort of principled and generic, in the sense 72:04 that the uninformative prior was generic, right? 72:08 It was equal to 1, that's as generic as it gets. 72:11 So is there anything that's generic as well? 72:14 Well, there are these priors that are called Jeffreys priors. 72:17 And Jeffreys prior is proportional to the square root 72:20 of the determinant of the Fisher information of theta. 72:23 72:26 This is actually a weird thing to do. 72:28 It says, look at your model. 72:31 Your model is going to have a Fisher information. 72:34 Let's say it exists. 72:34 72:38 Because we know it does not always exist. 72:39 For example, in the multinomial model, 72:41 we didn't have a Fisher information. 72:44 The determinant of a matrix is somehow 72:46 measuring the size of a matrix. 72:48 If you don't trust me, just think 72:50 about the matrix being of size one by one, 72:53 then the determinant is just the number that you have there. 72:56 And so this is really something that looks like the Fisher 73:00 information. 73:01 73:04 It's proportional to the amount of information 73:06 that you have at a certain point. 73:09 And so what my prior is saying is, well, 73:12 I want to put more weight on those thetas that 73:14 are going to just extract more information from the data. 73:17 73:20 You can actually compute those things. 73:22 In the first example, Jeffreys prior 73:26 is something that looks like this. 73:28 In one dimension, Fisher information 73:30 is essentially one over the variance. 73:33 That's just 1 over the square root of the variance, 73:35 because I have the square root.
73:37 And when I have the Jeffreys prior, when I have the Gaussian 73:45 case, this is the identity matrix 73:48 that I would have in the Gaussian case. 73:50 The determinant of the identity is 1. 73:52 So square root of 1 is 1, and so I would basically get 1. 73:56 And that gives me my improper prior, my uninformative prior 73:59 that I had. 74:01 So the uninformative prior 1 is fine. 74:03 Clearly, all the thetas carry the same information 74:06 in the Gaussian model. 74:08 Whether I translate it here or here, 74:10 it's pretty clear none of them is actually 74:12 better than the other. 74:13 But clearly for the Bernoulli case, 74:16 the p's that are closer to the boundary carry 74:22 more information. 74:23 I sort of like those guys, because they just 74:26 carry more information. 74:27 So what I do is, I take this function. 74:29 So p times 1 minus p. 74:30 Remember, it's something that looks like this. 74:34 On the interval 0, 1. 74:35 74:38 This guy, 1 over square root of p times 1 minus p, 74:40 is something that looks like this. 74:42 74:45 Agreed? 74:47 What it's doing is it sort of wants to push 74:49 towards the p's that actually carry more information. 74:54 Whether you want to bias your data that 74:56 way or not, is something you need to think about. 74:59 When you put a prior on your data, on your parameter, 75:01 you're sort of biasing your data towards this idea. 75:06 That's maybe not such a good idea, 75:07 when you have some p that's actually close to one half, 75:13 for example. 75:13 You're actually saying, no I don't 75:14 want to see a p that's close to one half. 75:16 Just make a decision, one way or another. 75:18 But just make a decision. 75:19 So it's forcing you to do that. 75:20 75:23 Jeffreys prior, I'm running out of time 75:26 so I don't want to go into too much detail. 75:29 We'll probably stop here, actually. 75:31 75:44 So Jeffreys priors have this very nice property.
75:47 It's that they actually do not care about the parameterization 75:51 of your space. 75:53 Say you actually have p and you suddenly 75:56 decide that p is not the right parameter for Bernoulli, 75:58 but that p squared is. 76:00 You could decide to parameterize this by p squared. 76:03 Maybe your doctor is actually much more able 76:05 to formulate some prior assumption on p squared, 76:08 rather than p. 76:09 You never know. 76:11 And so what happens is that Jeffreys priors 76:14 are invariant under this. 76:15 And the reason is that the information carried by p 76:18 is the same as the information carried by p squared, somehow. 76:21 76:28 They're essentially the same thing. 76:30 76:32 You need to have a one-to-one map, 76:34 where basically for each parameter from before, 76:37 you have another parameter. 76:39 Let's call eta the new parameter. 76:40 76:45 The PDF of the new prior, indexed by eta this time, 76:50 is actually also a Jeffreys prior. 76:52 But this time, the new Fisher information 76:55 is not the Fisher information with respect to theta. 76:57 It's the Fisher information associated 77:00 with the statistical model indexed by eta. 77:03 So essentially, when you change the parameterization 77:08 of your model, you still get the Jeffreys prior 77:10 for the new parameterization. 77:12 Which is, in a way, a desirable property. 77:15 77:19 Jeffreys priors are just uninformative priors, 77:21 or priors you want to use when you 77:24 want a systematic way to choose without really thinking about what 77:26 to pick for your model. 77:27 77:35 I'll finish this next time. 77:37 And we'll talk about Bayesian confidence regions. 77:39 We'll talk about Bayesian estimation. 77:41 Once I have a posterior, what do I get? 77:44 And basically, the only message is 77:45 going to be that you might want to integrate 77:47 against the posterior. 77:48 Find the posterior, then take the expectation of your posterior 77:51 distribution. 77:52 That's a good point estimator for theta.
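The invariance under reparameterization described above can be checked numerically. This is a minimal sketch, assuming the Bernoulli model and the lecture's example eta = p squared. It computes the Jeffreys prior directly in the eta parameterization and compares it with the Jeffreys prior in p pushed through the change of variables (multiplied by |dp/d eta|). The helper names and the finite-difference step h are illustrative choices, not part of the lecture.

```python
import math


def fisher_information(log_pmf, param, h=1e-6):
    """Fisher information of a {0, 1}-valued model, E[score^2].

    The score is approximated by a central finite difference of log_pmf.
    """
    total = 0.0
    for x in (0, 1):
        score = (log_pmf(x, param + h) - log_pmf(x, param - h)) / (2.0 * h)
        total += math.exp(log_pmf(x, param)) * score ** 2
    return total


def log_pmf_p(x, p):
    """Bernoulli log-pmf indexed by p."""
    return math.log(p) if x == 1 else math.log(1.0 - p)


def log_pmf_eta(x, eta):
    """Same Bernoulli model, indexed by eta = p^2, so p = sqrt(eta)."""
    return log_pmf_p(x, math.sqrt(eta))


def jeffreys_p(p):
    return math.sqrt(fisher_information(log_pmf_p, p))


def jeffreys_eta(eta):
    return math.sqrt(fisher_information(log_pmf_eta, eta))


def pushed_forward_from_p(eta):
    """Density on eta induced by the Jeffreys prior on p (change of variables)."""
    dp_deta = 1.0 / (2.0 * math.sqrt(eta))
    return jeffreys_p(math.sqrt(eta)) * abs(dp_deta)


for eta in (0.1, 0.4, 0.8):
    print(eta, jeffreys_eta(eta), pushed_forward_from_p(eta))
```

The two printed columns should agree up to finite-difference error, which is what "you still get the Jeffreys prior for the new parameterization" means in practice.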
77:54 77:56 We'll just do a couple of computations. 78:01
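As a preview of what such a computation might look like, here is a hedged sketch of the posterior mean as a point estimator. It assumes Bernoulli data and the Jeffreys prior, which is proportional to 1 over the square root of p times 1 minus p, that is, a Beta(1/2, 1/2) density. By Beta-Bernoulli conjugacy the posterior is then Beta(1/2 + s, 1/2 + n - s), where s is the number of ones, so its mean is (s + 1/2) / (n + 1). The sample data and function names are made up for illustration.

```python
def posterior_mean_closed_form(data):
    """Mean of the Beta(1/2 + s, 1/2 + n - s) posterior."""
    n, s = len(data), sum(data)
    return (s + 0.5) / (n + 1.0)


def posterior_mean_numeric(data, grid_size=100_000):
    """Integrate p against the unnormalized posterior with a midpoint rule."""
    n, s = len(data), sum(data)
    numerator = 0.0
    denominator = 0.0
    for i in range(grid_size):
        p = (i + 0.5) / grid_size
        # Likelihood p^s (1 - p)^(n - s) times the prior p^(-1/2) (1 - p)^(-1/2).
        weight = p ** (s - 0.5) * (1.0 - p) ** (n - s - 0.5)
        numerator += p * weight
        denominator += weight
    return numerator / denominator


# Hypothetical sample: 6 ones out of 8 observations.
sample = [1, 0, 1, 1, 0, 1, 1, 1]
print(posterior_mean_closed_form(sample))
print(posterior_mean_numeric(sample))
```

The numerical integral is only a cross-check on the closed form. In models without a conjugate prior, some form of numerical integration against the posterior would be the main route.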