Transcript

Hello everyone. You can see my title, which is also the title of my forthcoming book, scheduled for publication sometime next year. You can find a fairly complete draft of the book, video lectures, a copy of these slides, and other material on my website.

The purpose of the book is to provide a new conceptual framework for reinforcement learning and its connections with the traditional control methodologies of model predictive control and adaptive control. The book is based, for the most part, on visualization and geometric interpretations. On the other hand, there is quite a bit of mathematical analysis, given in my earlier books, that supports the visualization.

I'd like to start with two remarkable programs. The first program is AlphaZero, which played chess and created a sensation when it came out in 2017. The other program is much earlier, from 25 years ago; it is called TD-Gammon, and it played backgammon. At the time it created a lot of enthusiasm about the prospects of reinforcement learning.

These two programs are very remarkable, and even though they were implemented quite a few years apart, they are quite similar. Both of them involve two algorithms; some people do not quite realize that there are two algorithms involved in these programs. One algorithm is the offline training algorithm, by which the program learns how to evaluate positions and also how to select moves in given positions. This is done with a lot of data, neural network approximations, and so on. The second algorithm is the online play algorithm, which is what the program uses to play against other computer programs or human opponents. It is based on multistep lookahead, rollout, and cost function approximations. All of this has strong connections with dynamic programming, the classical method of policy iteration, and reinforcement learning types of approximations.

What I'm going to try to do in this talk is to understand this methodology so that it applies far more generally, and in particular to see how it connects with the control system design methods of MPC, model predictive control, and also adaptive control. There are also extensions in a broader context, for example discrete optimization, integer programming, and the like, although I'm not going to talk about those in this particular talk.

Okay, I'm going to start by describing the online play algorithm of AlphaZero, AlphaGo, TD-Gammon, and other related settings, and then I'm going to discuss the offline training algorithm.

Now, the online play algorithm is what you see here in blue. The online play algorithm uses the results of offline training at two points: it uses a position evaluator to evaluate positions, and a base player that generates moves at given positions.
So this is online play, but it crucially uses information provided by the offline training algorithm. What we are trying to do here is improve the base player, the one trained offline, by the following process. Given that we are at some position, say a current position x_k after k moves, we search forward for several moves through a lookahead tree of moves and countermoves, up to a certain depth. Then, from the leaves of that tree, we simulate the base player for some more moves. We arrive at some positions, which we evaluate using the position evaluator, and we approximate the effect of the future moves beyond that point by using this evaluation. Now we back these values up to the root, we calculate a value for each possible move at the current position, and we play the best move. That is what we do at a given position; then we go to the next position, after one move, do the same thing, and so on.

Now, all of these programs use this structure, but there are differences. AlphaZero does not use the middle portion; it uses a very long initial portion. The lookahead portion is very long, and it does not need the middle portion. AlphaGo uses a version of this middle portion. TD-Gammon relies very much on the middle portion, because its lookahead portion is short. TD-Gammon plays backgammon, which is a stochastic game, and the lookahead tree expands very quickly, so it cannot use long lookahead; it relies instead on this middle portion, the truncated rollout. In fact, the term rollout was coined by the originator of TD-Gammon, Gerry Tesauro, and I'm using the term here in the same way that he has used it.

There are also similarities with model predictive control. Model predictive control structures are remarkably similar to this structure, to the AlphaZero structure. There is an initial lookahead portion where minimization takes place, which is called the control interval; then there is sometimes a rollout portion, which is called the prediction interval; and then there is a terminal cost approximation. So model predictive control is pretty close to this structure, except that some of the parts are implemented differently, because model predictive control deals with infinite state and control spaces for the most part, whereas these games deal with discrete state spaces.
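Before moving on to offline training, here is a minimal sketch of the generic online play step just described: lookahead minimization for a few moves, then simulation of a base player (truncated rollout), then a terminal evaluation. The function names (`transition`, `stage_cost`, `base_policy`, `terminal_cost`), the undiscounted deterministic setting, and the exhaustive enumeration over controls are my own illustrative assumptions for a small problem; this is not the implementation of AlphaZero or of any specific program.

```python
def online_play_control(x, controls, transition, stage_cost,
                        base_policy, terminal_cost, ell=2, m=3):
    """One decision of generic online play: ell-step lookahead minimization,
    then m steps of rollout with the base policy, then a terminal cost."""

    def rollout_value(x):
        # Simulate the base player for m steps, then use the terminal evaluation.
        total = 0.0
        for _ in range(m):
            u = base_policy(x)
            total += stage_cost(x, u)
            x = transition(x, u)
        return total + terminal_cost(x)

    def lookahead_value(x, depth):
        # Exhaustive minimization over the lookahead tree (deterministic case).
        if depth == 0:
            return rollout_value(x)
        return min(stage_cost(x, u) + lookahead_value(transition(x, u), depth - 1)
                   for u in controls(x))

    # Evaluate each first move, play the best one, and discard the rest of the plan.
    return min(controls(x),
               key=lambda u: stage_cost(x, u) + lookahead_value(transition(x, u), ell - 1))
```

At each encountered state the returned control is applied and the same computation is repeated at the next state, exactly as in the description above.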
So let me go now into the offline training algorithm of AlphaZero. It is approximate policy iteration with self-generated data. What we do here is generate a sequence of players; the current player is used to train an improved player, and this process is repeated several times until we stop somewhere, and the trained player at the end is given to the online play algorithm.

Policy iteration has two parts at every step, policy evaluation and policy improvement, and these are done approximately, with neural networks. The current player is evaluated by playing many games, collecting data, and representing its evaluation by a value neural network, which is trained with lots and lots of data. Once we have the evaluation of the current player, we can approximately improve the current player by using multistep lookahead minimization, in an approximate form called Monte Carlo tree search. After we collect all this data from the improved player, we represent it with a policy neural network through training. So a sequence of value and policy networks is generated here, and these are given at the end to the online player, for use in online play.

Now, TD-Gammon uses a very similar policy iteration algorithm for offline training, using the so-called TD(lambda) method. It does not have a policy network; the functionality of the policy network is provided by using the value network through one-step lookahead. It also does not use Monte Carlo tree search, but that is not really fundamental; it does not really affect the overall character of the process.
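The offline training loop just described, approximate policy iteration with a learned value approximation, can be sketched roughly as follows. This is a minimal illustration under my own simplifying assumptions: a small finite problem, a least-squares fit over a feature map standing in for neural network training, simulation-based policy evaluation, and a single-sample lookahead; it is not the AlphaZero training procedure.

```python
import numpy as np

def approximate_policy_iteration(states, controls, transition, stage_cost,
                                 features, alpha=0.9, n_iter=10, horizon=50):
    """Offline training sketch: evaluate the current policy by simulation,
    fit a value approximation, then improve the policy by one-step lookahead."""
    rng = np.random.default_rng(0)
    policy = {x: controls(x)[0] for x in states}      # arbitrary initial player

    for _ in range(n_iter):
        # Policy evaluation: simulate the current policy from each state,
        # record the discounted cost, and fit weights r so that J(x) ~ features(x) @ r.
        costs = []
        for x in states:
            total, xk = 0.0, x
            for k in range(horizon):
                u = policy[xk]
                total += alpha**k * stage_cost(xk, u)
                xk = transition(xk, u, rng)
            costs.append(total)
        Phi = np.array([features(x) for x in states])
        r, *_ = np.linalg.lstsq(Phi, np.array(costs), rcond=None)
        J = lambda x, r=r: features(x) @ r            # trained value approximation

        # Policy improvement: one-step lookahead using the fitted value function
        # (single-sample approximation of the expectation, for brevity).
        policy = {x: min(controls(x),
                         key=lambda u: stage_cost(x, u)
                                       + alpha * J(transition(x, u, rng)))
                  for x in states}
    return policy, J
```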
So here are some major empirical observations that I want to focus on. Here is the online player; it uses an offline player that has been trained with neural networks. But the online player of AlphaZero plays much better than the offline-trained player. This is just a fact, an obvious fact for people who have used these programs. The online player of AlphaZero is a phenomenal chess player; by contrast, the offline-trained player is relatively mediocre. So there is a tremendous performance improvement from having online play on top of offline training.

TD-Gammon plays much better with the middle portion, the truncated rollout portion, than without it. Tesauro had two programs in the 90s. The first one, in 1991, was based on this structure but without the middle portion. Then in 1996 he introduced the middle portion, the rollout portion, and he found a great deal of improvement. The 1991 program's performance was subhuman; it was being beaten by the best humans. By contrast, the 1996 program had superhuman performance, and it has not been improved upon since that time.

Now, these are empirical facts, and what we would like to do is explain them, draw insights from them, and generalize them to other contexts, by using abstract dynamic programming ideas, abstract Bellman operators, and visualization for the most part. We will focus on the central role of Newton's method; this is the key new insight that I'm going to present today: how Newton's method ties everything together.

So here are the principal viewpoints of this talk. I'm going to show you that online play is just a single step of Newton's method for solving the Bellman equation associated with the dynamic programming problem. This is true for one-step lookahead; in the case of multistep lookahead, the Newton step is preceded by some value iterations. In numerical analysis this is called a Newton-SOR method; it is a well-known method with a lot of literature, and SOR stands for successive over-relaxation. So online play is a Newton step. Offline training, on the other hand, provides the starting point for the Newton step.

In my opinion, online play is the real workhorse. Everything happens in online play; the big performance improvement is during online play. By comparison, offline training plays a secondary role. A major reason for this is that online play is an exact Newton step: it does not involve approximations, and it is not degraded by neural network training errors. That makes it different and much more powerful.

Now, the offline training can be done by many different methods. In reinforcement learning we have temporal difference methods, aggregation, linear programming; there are many different ways to do the offline training. However, the differences and the various imperfections of these methods affect only the starting point; they do not affect at all the Newton step, which is exact, and they are all washed out by the Newton step. The Newton step is a big equalizer: errors that you may make in offline training are washed out by the Newton step.

Now, there is a cultural difference between the reinforcement learning and artificial intelligence community and the control community, particularly model predictive control and adaptive control. People in reinforcement learning focus primarily on offline training issues. People in model predictive control are focused largely on online play, and also on stability issues, which are hardly dealt with at all by the reinforcement learning community. So we are going to try to bridge this gap through these visualizations.

I'm also going to talk a little bit about adaptive control, but instead of using reoptimization in response to changes in the estimated model, we are going to use multistep lookahead and rollout. This framework is going to involve an exact Newton step applied to a changing Bellman equation, changing in response to a system identification process. But the bottom line is that it is still a Newton step, and it is very powerful.

Now, all of this has surprisingly great generality. Abstract dynamic programming is a very general framework. It can accommodate arbitrary state and control spaces; stochastic systems, deterministic systems, hybrid systems involving both discrete and continuous state and control components; multi-agent systems; games and minimax; finite horizon and infinite horizon; discrete optimization and integer programming. Dynamic programming is a very, very general methodology, and abstract dynamic programming captures this generality. Because everything I'm going to talk about is couched within the abstract dynamic programming framework, it has very broad applications.
Now, some of these viewpoints may be a little controversial and personal, so I thought of showing you this cartoon. These two people are looking at the same number; this one says it's a six, and this one insists it's a nine. I think this cartoon may have some relevance to my talk today.

So let me give you an outline. First I'm going to talk about discounted and undiscounted infinite horizon problems, to provide some context. Then I'm going to talk about abstract dynamic programming and abstract Bellman operators. Then I'm going to show, by visualization, how online play can be viewed as a Newton step. Then I will discuss stability issues, the region of stability, and how you can visualize it, and the visualization of rollout, truncated rollout, and policy iteration. Then I'm going to talk about a familiar problem, the classical linear quadratic problem, where the visualizations are in terms of Riccati equation operators as opposed to Bellman operators. Then I'm going to discuss the application to model predictive control and the connections there, and also adaptive control, indirect adaptive control with model estimation.

So let me start with discounted and undiscounted infinite horizon problems. What we have here is a discrete-time system involving a state, which I denote by x_k, the state at time k, a control u_k, and a random disturbance w_k, and the system evolves according to this equation. We are interested in stationary policies that map states into controls, and the control must satisfy a control constraint for every x. There is a cost for each transition, encoded by this function g, and the cost is discounted by a factor alpha; alpha could be less than one or equal to one, and at stage k the cost is discounted by alpha to the power k. The cost of a policy, starting from some initial condition x_0, is just the limit of the N-stage cost; the expectation here is over all the disturbances w, since this is a random variable, and we take the limit as the number of stages N goes to infinity. So this is a function: for every state there is a value, given by this cost function, and the optimal cost function is obtained by minimizing over all policies the policy cost functions.
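In symbols, the setup just described is the standard infinite horizon formulation; this is only a compact restatement of the verbal description above, in the notation of the slides:

```latex
\begin{aligned}
&x_{k+1} = f(x_k, u_k, w_k), \qquad u_k = \mu(x_k) \in U(x_k), \qquad k = 0,1,\ldots,\\[4pt]
&J_\mu(x_0) = \lim_{N\to\infty}\, \mathbb{E}\Bigl\{\sum_{k=0}^{N-1} \alpha^k\, g\bigl(x_k,\mu(x_k),w_k\bigr)\Bigr\},
\qquad 0<\alpha\le 1,\\[4pt]
&J^*(x) = \min_{\mu}\, J_\mu(x)\quad\text{for all } x.
\end{aligned}
```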
I have in mind primarily three contexts. One is discounted problems, where alpha is strictly less than one and g is bounded. This is the nice case; everything I'm going to talk about is correct for discounted problems. For other problems there may be exceptions, and I'm going to mention a few of those as I go along. The second class of problems is stochastic shortest path problems, where there is a special cost-free termination state (think of the origin as being the termination state) and alpha is equal to one. And then there are the MPC-type problems, where the system is deterministic, the cost per stage is positive, and the origin is the termination state. This is a regulation problem: the aim is to drive the state to the origin and keep it close to the origin, and this is encoded in the positive cost, which is zero at the termination state and positive everywhere else. So these are the problems I have in mind.

Let me give you a synopsis of the theory. The first result is that J*, the optimal cost function, satisfies Bellman's equation. There is one such scalar equation for every state, so it is a functional equation, and J* satisfies it, often uniquely but not necessarily; that is why I have these question marks. There is an optimality condition: if, for every x, mu*(x) attains the minimum in this equation, then the policy that you obtain is optimal; again, there are some exceptions to that.

Regarding algorithms, the first major algorithm is value iteration. It generates a sequence of cost functions, starting from some J_0, which can be more or less arbitrary, and uses this recursion, which is just the right-hand side of the Bellman equation but with J_{k-1} replacing J*. This generates a sequence of finite horizon costs, and it is one of the major algorithms. The other major algorithm is policy iteration. It generates a sequence of policies and their cost functions; mu_0 is arbitrary. The typical iteration starts with some policy mu and generates a new policy mu-tilde; I'm going to be using these symbols a lot in this talk, mu for the current policy and mu-tilde for the next policy. It does this in two steps: there is a policy evaluation step, which computes the cost function J_mu, and there is a policy improvement step, which calculates the improved policy by doing, for every x, this minimization, which is the same minimization as in Bellman's equation, except that we use J_mu, the cost of the current policy, in place of J*.

Okay, so now let's talk about approximation in value space, a major approximate dynamic programming and reinforcement learning framework. It is very difficult to find J*, so we replace it with an approximation J-tilde in Bellman's equation. When we arrive at a state x, we perform this minimization, which balances the cost of the first stage plus the future cost, as evaluated using J-tilde and appropriately discounted. The minimization is over all the possible controls at the given state and provides a control to use. We apply this control, then at the next state we do the same type of minimization again, and we proceed going forward. This is called one-step lookahead, because we look forward only one step before going to the approximation.

Multistep lookahead is essentially the same thing, except that we minimize the cost of the first ℓ stages plus the future cost as evaluated by J-tilde. This is now an ℓ-step problem, and it involves minimization over the control u_k and the policies from stage k+1 to stage k+ℓ-1. We solve this problem, we use at x_k the first control, and we throw away all the rest. Either way, one-step lookahead or multistep lookahead, this defines a lookahead policy mu-tilde: mu-tilde of x_k is the control obtained from this minimization, the minimizing u_k. I'm going to be using these symbols a lot: J-tilde is the cost approximation, and mu-tilde is the lookahead policy obtained by either one-step or multistep lookahead.
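Written out, the equations behind this synopsis and the lookahead schemes take the following familiar form; again this is just a compact restatement of what was described verbally:

```latex
\begin{aligned}
\text{Bellman equation:}\quad & J^*(x) = \min_{u\in U(x)} \mathbb{E}\bigl\{ g(x,u,w) + \alpha\, J^*\bigl(f(x,u,w)\bigr)\bigr\},\\[4pt]
\text{Value iteration:}\quad & J_{k}(x) = \min_{u\in U(x)} \mathbb{E}\bigl\{ g(x,u,w) + \alpha\, J_{k-1}\bigl(f(x,u,w)\bigr)\bigr\},\\[4pt]
\text{Policy improvement:}\quad & \tilde\mu(x) \in \arg\min_{u\in U(x)} \mathbb{E}\bigl\{ g(x,u,w) + \alpha\, J_{\mu}\bigl(f(x,u,w)\bigr)\bigr\},\\[4pt]
\text{One-step lookahead:}\quad & \tilde\mu(x) \in \arg\min_{u\in U(x)} \mathbb{E}\bigl\{ g(x,u,w) + \alpha\, \tilde J\bigl(f(x,u,w)\bigr)\bigr\},\\[4pt]
\text{$\ell$-step lookahead:}\quad & \min_{u_k,\,\mu_{k+1},\ldots,\mu_{k+\ell-1}} \mathbb{E}\Bigl\{ g(x_k,u_k,w_k)
+ \sum_{i=k+1}^{k+\ell-1} \alpha^{\,i-k} g\bigl(x_i,\mu_i(x_i),w_i\bigr) + \alpha^{\ell}\, \tilde J(x_{k+\ell})\Bigr\},
\end{aligned}
```

with the first control u_k of the ℓ-step minimization applied at x_k.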
Here is the key fact: J_{mu-tilde}, the cost function of the lookahead policy, is the result of a Newton step for solving Bellman's equation, starting from J-tilde. This is true for one-step lookahead; for multistep lookahead, with ℓ greater than one, it is the result of a Newton-SOR step, which is some value iterations followed by a Newton step. Either way, the error decreases superlinearly. That is the generic and key property of Newton's method, whereby the errors diminish very rapidly; in particular, this ratio goes to zero as J-tilde tends to J*. You had an error in cost, between J-tilde and J*, and you have a new error after the Newton step, between the cost of the lookahead policy and J*; the numerator is tiny relative to the denominator, so the errors are reduced very fast, superlinearly.

So now let me go into abstract dynamic programming concepts, and how you interpret all of this in terms of Bellman operators and through visualization.

Okay, so here is the Bellman operator. It is the right-hand side of what you have in Bellman's equation, and it maps cost functions J into cost functions T_mu J. J has a component for every x; T_mu J also has a component for every x, and it is given by this expression, involving the cost of the first stage using mu and the cost of the remaining stages as measured by J. So there is one such scalar equation for every x, and I call this the mu Bellman operator. There is also the min Bellman operator, which is the one that appears on the right-hand side of Bellman's equation, and it involves minimization over all controls: if you take these values and minimize over mu, what you get is (TJ)(x). Again, J is a function of the state, and TJ is also a function of the state, so this is an operator mapping functions into functions. Bellman's equation can be written in shorthand using these operators: the min Bellman equation is represented as a fixed-point equation, J* is a fixed point of T; and there is a Bellman equation for every policy: for a policy mu, J_mu, the cost of that policy, is a fixed point of the mu Bellman operator. So T_mu and T transform real-valued functions J into the functions T_mu J and TJ, and for this talk I'm going to assume that these are real-valued.
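In the notation used above, these operators and the Newton-step interpretation read as follows; the last line is the superlinear error property stated a moment ago:

```latex
\begin{aligned}
(T_\mu J)(x) &= \mathbb{E}\bigl\{ g\bigl(x,\mu(x),w\bigr) + \alpha\, J\bigl(f(x,\mu(x),w)\bigr)\bigr\},\\[4pt]
(T J)(x) &= \min_{u\in U(x)} \mathbb{E}\bigl\{ g(x,u,w) + \alpha\, J\bigl(f(x,u,w)\bigr)\bigr\} = \min_{\mu}\,(T_\mu J)(x),\\[4pt]
J^* &= T J^*, \qquad J_\mu = T_\mu J_\mu,\\[4pt]
T_{\tilde\mu}\,\tilde J &= T\,\tilde J \quad\text{(one-step lookahead: $T_{\tilde\mu}$ linearizes $T$ at $\tilde J$)},\\[4pt]
&\lim_{\tilde J \to J^*}\;\frac{\|\,J_{\tilde\mu}-J^*\,\|}{\|\,\tilde J - J^*\,\|} = 0 .
\end{aligned}
```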
Now, these are pretty big operators. How many dimensions do they have? Well, for every x there is a scalar equation, so the dimension is the number of states; it could be astronomical, or it could be infinite. On the other hand, for every x these are real-valued functions of J, and you can view them for a single state.

As an example, suppose you have a two-state system. Then J is a two-dimensional vector (J(1), J(2)); each component is a real number, so J is a vector in R^2, and TJ has two components, (TJ)(1) and (TJ)(2), so it is also a vector in R^2. So even for a two-state system we have an operator that maps two-dimensional space into two-dimensional space, and you need a four-dimensional graph to represent it. That poses a little bit of a problem.

On the other hand, even though T_mu and T are large and perhaps difficult to visualize at first sight, they have some nice properties. The first is that they are both monotone: as you increase the function J uniformly, T_mu J and TJ also increase. The reason is that the expectation preserves monotonicity (an expectation is a positive linear combination, so it has the monotonicity property), and the minimization also preserves monotonicity, so T_mu and T are monotone. Another important fact is that T_mu is linear: expectations are linear operations, so this is a linear operator. T, on the other hand, is concave. It is concave because it can be viewed as the minimum of linear operators, and if you take the minimum of linear functions you get something that is concave. So in particular, for every fixed state x, (TJ)(x) is a concave function of J (not of x, of J; remember that J is a function), concave in the traditional sense of convex analysis.

For an infinite-dimensional state space, T_mu and T are infinite-dimensional operators, mapping functions in an infinite-dimensional function space into itself. So another question is: we want to do visualization with these operators, how do we do that? What we do is use one-dimensional slices of the graphs of these operators.

To give you an idea of this, let's consider a two-state, two-control example. We have a four-dimensional graph, but we can represent it as two three-dimensional graphs, and I'm going to do that here. Blue stands for the mu operator, and red stands for the min operator. Here is the T_mu operator; it has two components, one for state one and one for state two, and I'm showing these components here. This one is linear and monotone, and this one is concave and monotone.
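As a concrete numerical illustration of these properties, here is a small Python sketch with an assumed two-state, two-control Markov decision problem; the transition probabilities, stage costs, and discount factor are made up purely for illustration. It applies T_mu and T to cost vectors J and checks monotonicity and the fact that T is the minimum, over policies, of the linear operators T_mu.

```python
import numpy as np
from itertools import product

alpha = 0.9                                        # assumed discount factor
# P[u] = transition matrix under control u; g[u] = expected stage cost under u
P = {0: np.array([[0.8, 0.2], [0.3, 0.7]]),
     1: np.array([[0.5, 0.5], [0.9, 0.1]])}
g = {0: np.array([1.0, 2.0]), 1: np.array([3.0, 0.5])}

def T_mu(J, mu):
    """mu Bellman operator: linear in J for a fixed policy mu = (mu(1), mu(2))."""
    return np.array([g[mu[i]][i] + alpha * P[mu[i]][i] @ J for i in range(2)])

def T(J):
    """min Bellman operator: componentwise minimum over the two controls."""
    return np.array([min(g[u][i] + alpha * P[u][i] @ J for u in (0, 1))
                     for i in range(2)])

J1, J2 = np.array([1.0, 1.0]), np.array([2.0, 3.0])          # J2 >= J1 componentwise
assert np.all(T(J1) <= T(J2))                                # monotonicity
policies = list(product((0, 1), repeat=2))                   # the four stationary policies
assert np.allclose(T(J1), np.min([T_mu(J1, mu) for mu in policies], axis=0))
print(T(J1), T(J2))                                          # T is the lower envelope of the T_mu
```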
So for state one, there are two controls, and each control defines a linear graph; that is the blue graph that you see here. Here are J(1) and J(2), and this is the linear function (T_mu J)(1), for state one. Now, the min Bellman operator is obtained by minimizing over these two, so it is a piecewise linear function; it has this red graph that you see. It is a function of J(1) and J(2), and I'm also showing J* here; J* is shown right here. I want to focus attention on the fact that we can consider one-dimensional slices that go through J*, and you can see this slice here. By using one-dimensional slices we are able to visualize this operator on a piece of paper, rather than deal with the high dimensionality. There is a slice like that for state one, and there is also a slice like that for state two, and the slice involves a piecewise linear monotone function; you can see the 45-degree line here. So that is the visualization for a two-by-two problem, and from now on we are going to be focusing on one-dimensional slices that pass through J*.

Okay, so let's visualize the mu Bellman operator. It is given by this equation; the mu Bellman equation says that J_mu is a fixed point of T_mu. The operator has a linear graph, this blue line that you see here, which defines T_mu J as a function of J. To find the cost of mu, J_mu, I have to find the point where this blue line meets the 45-degree line; this is the fixed point of the operator T_mu, and it is the cost of mu, which satisfies the Bellman equation. Of course this cost has to be somewhere to the right of the optimal cost, and all of these costs are to the right; there are many of these mu operators, and they are all defined in a similar way.

Okay, now the min Bellman operator. The min Bellman operator is obtained by minimizing over all controls, or equivalently over all of these blue operators, for a given x. It is concave and monotone, and it is defined as follows. Here is TJ, this red graph, as a function of J. To obtain its value, say at some point J-tilde, we have to consider all the policies, find the points of intersection of their graphs with this vertical line, and go to the lowest point; this is the minimum over policies of T_mu J-tilde. That is how you find T J-tilde, and you do the same for every J. Basically, the red curve is the lower envelope of the blue lines: you have all these blue lines, and what lies below them is the graph of the min operator.

And now the optimal cost J* is obtained as the fixed point of the operator T. It is the point where the red curve meets the 45-degree line. That is how you visualize J*, all these operators, and the connection of the blue operators with the red operator. A thing to notice is that the red operator is concave, and also that, at J-tilde, the blue operator that attains the minimum is tangent to it; it provides a linearization of the red operator at this point. This linearization is going to be key for understanding the role of Newton's method.
So let me also illustrate how value iteration can be viewed within this operator context. We have TJ as a function of J, and in value iteration we generate a sequence of cost functions. We start with some J_0 and we find T J_0; this is J_1. Then we reflect J_1 across the 45-degree line to this point here, and apply T to it, and we obtain the second iterate J_2, so J_2 is T-squared J_0. We reflect it again across the 45-degree line, apply T, and we obtain J_3, and so on. So there is this staircase construction that provides a visualization of value iteration. You can see that there is a tendency to converge to the fixed point, but not always; it depends on what J_0 is and on the slope of T. In particular, if T is a contraction mapping, which means that the slope is uniformly less than one, then you get convergence and you also get a unique fixed point. That is the favorable case, but it does not always hold true. So that is the visualization for value iteration.

Now let me go into Newton's method and its connection with online play and one-step lookahead. Okay, so let's look at the familiar Newton's method for solving a fixed-point problem, finding a fixed point of an operator T. This is a generic operator here, and I'm showing it in red, concave and monotone, as it is in dynamic programming. Newton's method is an iterative method: you generate a sequence of iterates J_k; you start at J_0 and generate J_1, J_2, and so on. At the typical step you are at J_k, and what you do is linearize T at J_k; this blue line is the linearization, and it involves the first-order Taylor series expansion of T at the point J_k. We then solve the linearized fixed-point equation; in other words, we find this point here, and that is the new iterate J_{k+1}. That is what Newton's method does, and it is pretty similar to what I showed you earlier for online play and for one-step lookahead. So: at J_k, linearize, solve the linearized problem to get J_{k+1}; then linearize at that point, solve the linearized problem, obtain J_{k+2}, and so on. You can see that this process can be very fast, indeed quadratic under the right conditions. I have shown T here as differentiable, but we will relax this assumption in what follows. So this is the generic Newton's method.
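For concreteness, here is a minimal Python sketch of this iteration for a scalar, concave, monotone map. The particular map F below and its derivative are assumed purely for illustration; they are not the Bellman operator of any specific problem.

```python
import math

def newton_fixed_point(F, dF, J0, tol=1e-12, max_iter=50):
    """Newton's method for solving J = F(J): linearize F at the current iterate
    and solve the linearized fixed-point equation exactly."""
    J = J0
    for _ in range(max_iter):
        # Linearization at J:  F(y) ~ F(J) + dF(J)*(y - J); solve y = F(J) + dF(J)*(y - J).
        J_next = (F(J) - dF(J) * J) / (1.0 - dF(J))
        if abs(J_next - J) < tol:
            return J_next
        J = J_next
    return J

# Example: a concave, monotone map with a unique fixed point (illustrative only).
F = lambda J: 1.0 + 0.9 * math.sqrt(J + 1.0)
dF = lambda J: 0.45 / math.sqrt(J + 1.0)
print(newton_fixed_point(F, dF, J0=0.0))   # converges in a handful of iterations
```

One-step lookahead corresponds to a single such step, with the linearization supplied by the minimizing policy rather than by a derivative.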
And here is how it relates to one-step lookahead. In one-step lookahead we have J-tilde, a cost approximation, and we find mu-tilde, a one-step lookahead policy, which is obtained by minimizing over all possible policies. So what one-step lookahead does is linearize T at J-tilde and obtain this blue operator, which corresponds to the lookahead policy mu-tilde. How do we represent the cost function of this lookahead policy? Well, it is simply the intersection of this blue line with the 45-degree line, and it gives you J_{mu-tilde} as the fixed point of T_{mu-tilde}. So we start with this cost approximation and we end up with the cost of the one-step lookahead policy, and you can see that it is an honest-to-goodness Newton step.

Now, this is the key insight, but immediately some questions arise. In the classical form of Newton's method you have to have the operator T be differentiable, perhaps twice differentiable. Do we need that? The answer is no, you don't. You can substitute concavity and monotonicity for differentiability. This is known in numerical analysis; if you look at the classical books, like Ortega and Rheinboldt from the 60s, there is theory associated with that, primarily with monotonicity. But it is also true that you do not need differentiability here; concavity is sufficient, because this blue line, in the case of a nondifferentiable operator, is just a subgradient. You use subgradients instead of derivatives and gradients. The important thing is that you have a tangent hyperplane, a supporting hyperplane, and this can be provided by a subgradient or by a gradient just as well. Now, you do need some assumptions for this process to be valid, because there are all sorts of anomalies that can occur, but everything works out fine for contractive problems, such as the discounted problem that I gave earlier.

And notice one more thing: the Newton step is very fast, with superlinear convergence, so it smooths out starting-point variations. If you make variations to J-tilde, there will be corresponding variations to J_{mu-tilde}, but these variations are tiny relative to the variations in J-tilde. So you may make some errors here, and they are smoothed out by the Newton step, and there is lots of empirical evidence to support all this.

Okay, now let me go into stability issues and how we visualize them. Here I'm going to be focusing mostly on the MPC context: deterministic system, cost-free terminal state, and so on. Stability is of paramount importance in MPC, of course. But the question arises: how do you define stability, given that the state space can be discrete or continuous? Instead of going to concepts of Lyapunov stability and the like, we adopt an optimization-based definition of stability, and we say that a policy is stable if its cost function is finite for all x. If the cost of the policy does not blow up for any initial state, then we say that the policy is stable, and this has the advantage of being a very general definition. Generally this is true if T_mu is a contraction, or has slope less than one in the context of our figures.

So let's look at one-step lookahead. We define the region of stability to be the set of all cost approximations J-tilde
for which the lookahead policy is stable. You can see here that if you choose J-tilde to the right of some threshold, then the corresponding policies, because of the monotonicity, are stable. On the other hand, if you choose J-tilde to the left of that threshold, the policies are not stable, because essentially they do not intersect the 45-degree line; they have slope greater than one. The threshold is precisely the point where the linearization of the Bellman operator becomes parallel to the 45-degree line, that is, where it has slope equal to one. So this is how the region of stability is demarcated: a J-tilde to the right of this threshold point gives you a stable policy, and otherwise it gives an unstable policy. Of course, this is for a one-dimensional slice; in the multidimensional case it is more complicated than that, but the general idea still applies.

Now, this is for one-step lookahead. Multistep lookahead tends to push J-tilde into the stable region. Remember that multistep lookahead amounts to some value iterations followed by the Newton step. This figure gives an example with two-step lookahead. Which J-tilde are stable? The ones from which, after you perform one value iteration, you cross into the stable region for one-step lookahead. So the threshold moves, and the reach of stability is expanded by multistep lookahead. There are rigorous studies, mathematical analyses, that verify this, and it is also a known fact in MPC. So it makes sense to push J-tilde towards J_mu, which is stable, and that is what multistep lookahead does; it is also an idea that underlies the concept of rollout.

So now let me go into rollout and policy iteration, and how we visualize them. Rollout is simply a Newton step starting from J-tilde equal to the cost function of a policy. You have some policy mu, which we call the base policy. It has an evaluation, obtained by this intersection point; this is J_mu. Then rollout does a Newton step: it linearizes here, and this gives you the rollout policy. That is simply what it is, and it can be implemented in a number of ways, including by simulation. An interesting fact is that if you have a stable policy to begin with, then the rollout policy will also be stable; you can see this from the figure.

Now, policy iteration is simply repeated rollout: you start from a policy, do rollout on it, then do rollout on the rollout, and so on, and generate a sequence of policies. The visualization is the following. Let's say mu_k is the current policy, and this is its evaluation, the cost of mu_k. You do a Newton step: linearize the Bellman operator at J_{mu_k}, then obtain mu_{k+1}; this is the policy improvement step (policy evaluation, then policy improvement), and you obtain a new policy. The new policy corresponds to this blue line, and its cost is obtained by projecting this point of intersection onto the horizontal axis. So you start with mu_k, generate mu_{k+1}, then generate mu_{k+2}, and so on, and it converges very fast.
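As a small concrete instance of repeated rollout, here is exact policy iteration on an assumed two-state, two-control discounted problem (the same kind of toy data as in the earlier sketch, made up for illustration); it typically terminates in two or three iterations, reflecting the Newton-like convergence just described.

```python
import numpy as np

alpha = 0.9
P = {0: np.array([[0.8, 0.2], [0.3, 0.7]]),
     1: np.array([[0.5, 0.5], [0.9, 0.1]])}
g = {0: np.array([1.0, 2.0]), 1: np.array([3.0, 0.5])}

def policy_iteration(mu=(0, 0)):
    while True:
        # Policy evaluation: solve the linear system J = T_mu J exactly.
        P_mu = np.array([P[mu[i]][i] for i in range(2)])
        g_mu = np.array([g[mu[i]][i] for i in range(2)])
        J_mu = np.linalg.solve(np.eye(2) - alpha * P_mu, g_mu)
        # Policy improvement: rollout on the base policy mu (the Newton step).
        mu_new = tuple(min((0, 1), key=lambda u: g[u][i] + alpha * P[u][i] @ J_mu)
                       for i in range(2))
        if mu_new == mu:
            return mu, J_mu        # fixed point reached: mu is optimal
        mu = mu_new

print(policy_iteration())
```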
The pure form of policy iteration is exactly Newton's method, and it does not rely on differentiability at all. The connection with Newton's method is actually well known in control theory, starting with Kleinman's paper on linear quadratic problems; Kleinman was a student of Michael Athans at MIT, and his paper was followed up by a lot of other works, both for linear quadratic problems and also for more general forms of policy iteration. There are also corresponding results in Markov decision process theory; all of this is very well known in dynamic programming, as well as for linear quadratic problems and the like.

Okay, so now let's go into truncated rollout. Rollout is a single Newton step starting from the base policy. Truncated rollout starts from some J-tilde, does several value iterations using the mu operator, and then at that point takes the Newton step. So it is first value iterations with the mu operator, m of them (m equals 4 in this particular figure), starting with some J-tilde, and then applying the Newton step by doing this minimization here; that gives the cost of the truncated rollout policy. So truncated rollout also involves a Newton step.

Now, you can have a combination of ℓ-step lookahead minimization and m-step truncated rollout, and the total lookahead is ℓ plus m. An interesting fact is that it is the sum that matters, not so much the individual values of ℓ and m. Rollout with a policy is much cheaper than lookahead minimization, so truncated rollout can be viewed as an economical substitute for multistep lookahead, and this was the motivation of Tesauro when he introduced truncated rollout in the context of backgammon. People also use this rationale in MPC: use rollout with some stable base policy to enhance the starting point for the Newton step.

So now let me go into a problem familiar to people in control, the linear quadratic problem. In linear quadratic problems there are similar visualizations, but instead of the Bellman operator we use the Riccati equation operator. For this particular illustration I'm using a one-dimensional linear system, with an a coefficient and a b coefficient, a quadratic cost involving nonnegative q and r coefficients, and alpha equal to one, no discounting. It is well known that the optimal cost function is quadratic, with quadratic coefficient K*, and this K* solves the Riccati equation. Riccati equations are fixed-point equations; the algebraic Riccati equation is given by this expression here, involving a, b, r, and q. It is concave, and the solution of the Riccati equation is obtained by the intersection of the 45-degree line and the red graph here. Actually there are two solutions; however, there is a unique solution that is positive, and the other solution is not positive.
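For this scalar case, the Riccati operator that plays the role of the Bellman operator is the following concave, monotone function of K; it follows from the standard one-dimensional Riccati recursion for the system and cost just described:

```latex
\begin{aligned}
&x_{k+1} = a\,x_k + b\,u_k, \qquad \text{cost per stage } q\,x_k^2 + r\,u_k^2, \qquad J^*(x) = K^* x^2,\\[4pt]
&F(K) = \frac{a^2 r K}{r + b^2 K} + q, \qquad K^* = F(K^*), \qquad K^* \ge 0 .
\end{aligned}
```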
In fact, this is true for multidimensional problems as well. The normal case is when q and r are positive, and then K* is the unique positive solution. However, just to sensitize you to exceptions, here is an exceptional case where q is equal to zero. In that case you lose observability, but K* is actually equal to zero, because there is only a cost on the control, and you can set the control to zero and attain an optimal cost of zero. However, the Riccati equation has two nonnegative solutions, and there is another solution here which has an interesting interpretation. This is an example where you do not have uniqueness, and in fact if you try value iteration in this exceptional case, you will find that it converges to the non-optimal solution rather than to K*. This is just an example of exceptional behavior, and there are many of those in other contexts that are more complicated than linear quadratic.

Okay, so here is a common question. In the end I want to have a policy, and we say that this online play policy is good; why can I not learn this policy with some neural network and use that? In fact, there is a whole community in reinforcement learning that practices approximation in policy space, where we do not care about cost function approximations, multistep lookahead, and the like, but instead we parameterize the policy and optimize the parameters by using some kind of training, perhaps neural network training. There are methods called policy gradient methods, random search methods, and so on; there is a wide variety, it is a very broad field. My visualizations and my point of view suggest that this viewpoint is flawed, because it lacks the exact Newton step that does all the magic: it corrects all the errors of offline training, and superlinearly so. There is no such correction in an approach where we do not use a Newton step.

To illustrate this point I have an example here. It is a one-dimensional linear quadratic example, with a known and fixed model, and we consider a parameterization of the policy: we use linear feedback policies with coefficient L, and we vary L broadly. An L of around minus 0.4 is optimal, and what I have plotted here are the cost differences from the optimum. Without the Newton step, if you just apply the policy, we see that the performance degrades very quickly as L moves away from the optimum. On the other hand, if you introduce one-step lookahead, in other words if you introduce a Newton step, then the performance is almost indistinguishable from the optimal. So all of this is the loss that you incur for not having a Newton step to correct the errors of the suboptimal policy parameterization.
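The effect can be reproduced on any scalar linear quadratic example. The sketch below uses assumed parameters a, b, q, r (not the ones from the slide, which were not stated in full). It compares the cost coefficient of a linear policy u = Lx applied directly with the cost coefficient obtained when that policy's cost is used as the terminal approximation for one-step lookahead, i.e., after a Newton step, both measured against the optimal Riccati solution.

```python
import numpy as np

a, b, q, r = 1.2, 1.0, 1.0, 0.5      # assumed example parameters (undiscounted)

def riccati_fixed_point(K=1.0, iters=1000):
    for _ in range(iters):
        K = a * a * r * K / (r + b * b * K) + q
    return K

def policy_cost(L):
    """Quadratic cost coefficient of the linear policy u = L x (requires |a + bL| < 1)."""
    c = a + b * L
    return (q + r * L * L) / (1.0 - c * c)

def one_step_lookahead_cost(L):
    """Cost coefficient after a Newton step: one-step lookahead with terminal cost K_L x^2."""
    K_L = policy_cost(L)
    L_tilde = -a * b * K_L / (r + b * b * K_L)   # minimizer of q x^2 + r u^2 + K_L (a x + b u)^2
    return policy_cost(L_tilde)

K_star = riccati_fixed_point()
for L in (-0.5, -0.7, -0.9, -1.1):
    print(f"L={L:5.2f}  direct={policy_cost(L) - K_star:8.4f}  "
          f"with lookahead={one_step_lookahead_cost(L) - K_star:8.4f}")
```

With these assumed numbers, the direct cost grows noticeably as L moves away from its optimal value, while the lookahead column stays within a small fraction of the optimum, which is the qualitative pattern shown on the slide.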
Let me say a few things about model predictive control. Well, I'm not doing too badly with time, I'm glad about that, so I'm going to go a little slowly here.

Okay, model predictive control is a major methodology, perhaps the dominant methodology, in control research these days. It dates back close to 40 years and has an extensive literature; it has been applied to very challenging problems, and there is a lot of experience with it. The classical form of MPC applies to continuous state and control problems, deterministic problems with positive cost. So we have a nonlinear system and a cost function that is positive everywhere except at the origin, where the cost is zero; basically this encodes our desire to drive the system to the origin and keep it there.

Now, MPC is culturally different from the reinforcement learning and artificial intelligence practice of sequential decision making. The architecture of MPC, as I mentioned earlier, is very similar to AlphaZero. It includes lookahead minimization, which is often called the control interval; in some versions of MPC there is a so-called prediction interval to improve the terminal cost approximation, which amounts to rollout with some stable policy; and there is also a terminal cost function that is being used. So all the elements are there. On the other hand, MPC focuses on continuous-space applications and also on stability issues, and it relies primarily on online play. There is some offline training in MPC, for example computing terminal cost approximations offline, computing some stable policies to do rollout with, and also computing target tubes, safe regions, in order to deal with state constraints.

One of the central motivations of MPC is to be able to deal with state constraints, critical state constraints, which the unconstrained linear quadratic framework does not deal with very well. To do that, there are some interesting issues of how to select controls that keep you within the safe region, and there is a theory of target tube reachability, which is actually very dear to my heart, because my PhD thesis, 50 years ago, dealt precisely with this topic, introducing the notion of target tubes and also how you construct them. Nobody read my thesis for about 15 years, but then all of a sudden, through MPC, it became an interesting topic.

Okay, so this is the classical form of MPC, but there are a lot of extensions. You can look, for example, at the recent books by Rawlings, Mayne, and Diehl, and by Borrelli, Bemporad, and Morari, which describe a lot of extensions that deal with stochastic uncertainties, hybrid systems involving both continuous and discrete control and state components, minimax versions, target tubes, and so on. So there is a very broad literature dealing with a broad variety of problems.
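In the notation used earlier in the talk, the classical MPC scheme described above computes, at each encountered state x_k, the control by an ℓ-step lookahead of the familiar form (deterministic system, state and control constraints, terminal cost approximation J-tilde):

```latex
\min_{u_k,\ldots,u_{k+\ell-1}} \;\; \sum_{i=k}^{k+\ell-1} g(x_i,u_i) + \tilde J(x_{k+\ell})
\qquad\text{subject to } x_{i+1}=f(x_i,u_i),\;\; x_i\in X,\;\; u_i\in U(x_i),
```

after which the first control u_k is applied and the minimization is repeated at the next state; the prediction interval variant replaces part of this horizon with simulation of a stable base policy.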
Moreover, more recently there has been some emphasis on using data, simulation-obtained data, and rollout with some offline-trained policy, and this comes under the umbrella name of learning MPC. This term was introduced in the paper by Rosolia and Borrelli, who have done interesting work in this area. There has also been a recent paper by Yuchao Li, together with Johansson and Mårtensson, professors at KTH, and I'm also a co-author of that paper.

So now let me go into adaptive control with model estimation, what is called indirect adaptive control, which combines model identification and adaptation of the control in response. What we have here is a system that is changing over time; its model is changing over time. Data is collected from the system, and we run a model estimation, system identification algorithm, the results of which are passed to an adaptive controller, which makes the appropriate adaptations. The indirect adaptive control viewpoint is to re-optimize the controller when the estimated model changes. On the other hand, there is a problem with that: the reoptimization may be difficult or very time-consuming. So there is room for simpler alternatives, and rollout, combined with multistep lookahead, provides such an alternative. You use multistep lookahead that takes into account the current model, as estimated by the system identification process, but the rollout itself uses some robust policy. All of this involves a Newton step, so we use a Newton step instead of reoptimization. If your changes from the nominal model are not very serious, then the Newton step, I argue, is just as good as reoptimization, and computationally simpler. So we capitalize on the fast convergence of the Newton step to avoid expensive reoptimizations. That is the idea behind the use of rollout in adaptive control, as a substitute for re-optimization.

Let me give you an example. This is a one-dimensional linear system, and it involves an unknown parameter, the parameter b; the r coefficient in the cost function may also be variable here. Now, as b changes, the optimal cost, as measured by the quadratic cost coefficient, changes. If you use a fixed policy, in particular the policy that is optimal for the nominal values b equal to 2 and r equal to 0.5, then the performance degrades. So robustness does not really fix the problem here: if you have large deviations from the nominal parameters, you can get a lot of performance degradation. On the other hand, if you use rollout, including the Newton step, there is hardly any change from the optimal; it is just as effective as reoptimization. This is true when we change b but keep r fixed, and, in the right figure, when we change the cost parameter r and keep b fixed. So what I want to say is that using a robust controller as a base policy without the Newton step is often flawed, and we correct the flaw by introducing the Newton step.
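A bare-bones sketch of this idea, under my own illustrative assumptions (a scalar linear system, a crude least-squares estimate of b from observed transitions, a fixed nominal linear base policy, and one-step lookahead with the current estimated model), might look as follows. It is meant only to show where the changing model enters the Newton step, not as a recommended estimator or controller, and it is not the example from the slide.

```python
import numpy as np

a, q, r = 1.0, 1.0, 0.5          # assumed known dynamics/cost parameters
b_true, b_nom = 1.5, 2.0         # true (unknown) and nominal control gains
L_base = -0.4                    # fixed base policy u = L_base * x (assumed stabilizing)

def base_policy_cost(b_hat):
    """Quadratic cost coefficient of the base policy under the currently estimated model."""
    c = a + b_hat * L_base
    return (q + r * L_base**2) / (1.0 - c * c)     # requires |a + b_hat*L_base| < 1

def lookahead_control(x, b_hat):
    """One-step lookahead (the Newton step) with the estimated model and K_base x^2 as terminal cost."""
    K = base_policy_cost(b_hat)
    return -a * b_hat * K / (r + b_hat**2 * K) * x

rng = np.random.default_rng(0)
x, b_hat, num, den = 5.0, b_nom, 0.0, 1e-6
for k in range(30):
    u = lookahead_control(x, b_hat)
    x_next = a * x + b_true * u + 0.01 * rng.standard_normal()   # the real, changing system
    # Crude least-squares update of b from the observed transition (illustration only).
    num += (x_next - a * x) * u
    den += u * u
    b_hat = num / den
    x = x_next
print(f"final state {x:.4f}, estimated b {b_hat:.3f}")
```

The point of the sketch is that no reoptimization is ever performed; only the lookahead minimization is redone with the latest model estimate, which is exactly the Newton step applied to the changing Bellman equation.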
Okay, let me conclude. I want to make a few points. The first one is that there is much to be gained by using online play on top of offline training; there is a potential for very large improvements. If you use just offline training, without online play, it may not work very well, because you lack the Newton step that corrects for training errors and can deal with changing system parameters. On the other hand, if you use just online play without offline training, you may miss out on performance, because you will not have good starting points for the Newton step. That is the first point I want to make.

The second point is that Newton's method is central in all of this. This is a new insight, and it can guide analysis, explain behavior, and provide motivation for new algorithmic designs. Again, the Newton step is exact; all the approximations, neural networks and the like, go into the starting point for the Newton step, which washes out training method differences and errors.

Another point concerns the cultural divide between reinforcement learning, as practiced by the AI community, and the control community. We are dealing with very similar problems using similar methods, and the cultural differences can be bridged by combining offline training and online play in the mold of AlphaZero. MPC uses a very similar architecture to AlphaZero, but it can benefit from reinforcement learning and artificial intelligence ideas, and we can also approach indirect adaptive control through rollout, using a Newton step in place of re-optimization.

The last point I want to make is that all of this is very general, even though I focused on just a few problem types. The generality comes from having arbitrary state and control spaces, and from the generality of abstract dynamic programming. In particular, there are a lot of interesting applications in discrete optimization, integer programming, and multi-agent versions, which involve exploitation of the special structure; a lot of this is in my 2020 book, the details of that. Finally, there are exceptional behaviors all over the place once you get away from the benign discounted case, and there is a lot of room for clarification by mathematical analysis and computational experimentation.

So let me close with some words of optimism. We have had tremendous successes in both artificial intelligence and reinforcement learning, and in MPC, and this provides hope for the future. More success can be expected by bridging the gap between these two communities. Lookahead minimization and rollout can be a computational bottleneck, and people in both artificial intelligence and MPC have been struggling with that. On the other hand, we have massive computational power; through distributed computation a lot of these computations can be parallelized, and all of this power can mitigate the computational bottleneck, so we can obtain online strategies that are more sophisticated as a result. So, all in all, the future looks very bright, and I thank you very much for your attendance.
sophisticated as a result. 72:37 So all in all, the future looks very 72:40 bright, 72:42 and I thank you very much for your 72:43 attendance. 72:48 All right, 72:49 thank you very much. So please, 72:53 we have time for questions. It was a very 72:55 exciting talk; I think I'd like to leave 72:57 the floor open for questions 72:59 before asking my own questions if 73:01 there's time in the end. So 73:04 we can try: if you raise your hand, 73:08 then I can pass the word. So 73:11 you, Kim, can unmute maybe, if 73:14 you want to ask a question. Yes, so 73:16 let me join Mikael first by saying: a 73:18 fantastic talk, it really opens up the 73:20 thought. 73:21 And perhaps my question is because I 73:23 haven't fully digested the concept yet, 73:25 but when I think about this 73:27 exploration, the alpha-beta pruning 73:30 in these types of games, and then with a proxy 73:31 of a cost of future 73:33 possible steps, 73:35 I tend to view it more as an exploration 73:38 of state space. So when you take the jump 73:40 to 73:41 dynamical systems, is the state in the 73:43 dynamical system the same state that you 73:45 would apply, say, to a chessboard, or is it 73:47 more of an abstract concept 73:51 that is used in the analysis, 73:53 or is there a one-to-one correspondence 73:55 there? The state space is arbitrary here; 73:59 the visualizations are valid for an 74:01 arbitrary state space. 74:03 Now, in the case of 74:05 chess and games, the state space is 74:07 discrete; the same is true for Markovian 74:09 decision problems, the type of 74:11 interest in operations research and other 74:12 places. In 74:16 the control applications it's mostly a 74:17 continuous state space. 74:19 Now of course the minimization, 74:22 the lookahead minimization, depends very 74:24 much on the nature of the state space. 74:27 Indeed, in chess you have forward search 74:30 through a tree, 74:33 with alpha-beta pruning and all this 74:35 other stuff, Monte Carlo tree search, 74:37 which facilitate and expedite this 74:40 search; 74:41 on the other hand, it's a discrete 74:43 process, search through a tree. 74:45 In MPC it's a different sort of thing; 74:48 people in MPC use 74:50 continuous-type methods or 74:52 combinations of continuous and discrete 74:54 methods that are closer to integer 74:58 programming, 75:00 nonlinear programming, or sequential 75:02 quadratic programming. 75:05 So depending on the character of the 75:07 state space you would use different 75:09 methods for the lookahead minimization; 75:13 the concept, however, is the same, 75:16 it's just that 75:19 the implementation depends on the 75:21 problem at hand. 75:24 I guess my question was: 75:27 so when you interpret it in terms of a 75:29 control problem, does the state 75:32 describe the whole lookahead process, or 75:34 is it... 75:37 maybe I just didn't fully 75:40 understand it. 75:41 So when you take the control 75:44 theory view of it, 75:46 is the lookahead 75:51 algorithm something different from the 75:53 state, or is that actually the state in 75:55 that interpretation? 75:57 So, 75:59 the state evolution is supposed to be 76:01 known here; 76:03 the system equation is supposed to 76:06 be known. It could be stochastic; 76:08 however, we can predict the future fully: 76:11 for a given policy we can predict the 76:13 future fully, because we have a model for it. 76:16 Right? I don't know if I've answered your 76:17 question.
76:19 I think... yeah, I'll probably have to think 76:20 about it more, but it's definitely input 76:23 to that. 76:23 Thanks. Okay, 76:26 so Mario, maybe you can unmute and 76:30 ask a question. 76:32 Okay, 76:33 thank you very much. 76:37 This talk was extremely inspiring for 76:40 me. 76:41 There's one point which maybe I missed: 76:44 it's very interesting to see how 76:46 these methods are essentially 76:47 Newton steps, but 76:49 I missed a bit the connection with 76:51 policy gradient methods, because policy 76:53 gradient methods essentially also do 76:55 something along these lines: they take 76:57 some sort of a Newton step, 76:58 approximately, depending on which Hessian 77:00 approximation one uses. So I was 77:03 wondering if you could comment a bit 77:04 more on this aspect. 77:09 Sure. Newton's method 77:11 can be used in several different 77:14 optimization contexts. 77:17 What I was talking about here is 77:18 Newton's method for solving fixed point 77:20 problems, the Bellman equation in 77:22 particular. 77:24 Policy gradient methods are 77:26 conceptually quite different from 77:29 the approximation in value space that 77:31 I've been using here. 77:34 In policy gradient methods there is an 77:37 optimization problem to be solved, and 77:39 the optimization 77:42 involves a parameter as a 77:44 characterization of the policy. 77:46 So 77:48 for this you can use any optimization 77:50 method that you would like: you can use 77:52 Newton's method, or you can use a 77:55 gradient method, you can do random search, 77:57 anything that you would like. 77:59 But the context is different; the cost 78:01 function is different. 78:05 In the policy gradient context you 78:08 optimize the performance with respect to 78:10 the controller parameters, the 78:12 performance of the system. 78:14 Here, in my talk, 78:16 Newton's method is used 78:19 not for an optimization problem; it's 78:21 used to solve a fixed point problem, 78:23 which is not the same. 78:26 Right, yeah, maybe what confused me is 78:28 that in 78:29 policy gradient you take the expectation 78:32 over a distribution of initial states; 78:38 so that's what makes it similar. But yeah, 78:40 indeed, it's a very different kind of 78:43 optimization problem. Thanks a lot. Okay.
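[Editor's note: a compact way to state the distinction just discussed, in the notation of the speaker's books (this is my restatement, not a formula shown in the talk): T and T_mu denote the Bellman operators, J-tilde the offline-trained cost approximation, and mu_theta a parametrized policy. The "unique fixed point" claim holds under the usual contraction assumptions, e.g., the discounted case.]

```latex
% Approximation in value space: the one-step lookahead policy \tilde\mu and its
% cost J_{\tilde\mu} amount to a Newton step for the fixed point problem J = TJ,
% starting from \tilde J:
\[
T_{\tilde\mu}\tilde J = T\tilde J,
\qquad
J_{\tilde\mu} = \text{the fixed point of } T_{\tilde\mu},
\]
% i.e., J_{\tilde\mu} solves the linearized equation J = T_{\tilde\mu} J.
% Policy gradient, by contrast, iterates on the policy parameter \theta to
% minimize the expected cost from an initial-state distribution:
\[
\theta_{k+1} = \theta_k - \gamma_k \,\nabla_\theta\, \mathbb{E}_{x_0}\!\left[\,J_{\mu_\theta}(x_0)\,\right].
\]
```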
78:55 I see some hands, but I don't hear 78:57 anybody. 79:00 Yeah, I was waiting. Mikael, can I go on? 79:03 You're muted now, Mikael, 79:04 that's why we didn't hear you. 79:07 Okay, so, 79:09 right, so yeah, thanks also, a very 79:12 nice talk, and it's good to see such a 79:14 broad audience from all over the world 79:16 in this talk. I have a control 79:20 question, 79:22 Dimitri. So, I mean, you commented on 79:25 indirect adaptive control and the 79:27 interpretation here, how one could 79:30 think about this in this framework. 79:32 So, 79:34 you know, at the time there was this 79:36 discussion about different architectures 79:37 for adaptive control, indirect adaptive 79:39 control and direct adaptive control and 79:41 so on. 79:42 So could you comment a little bit on the 79:44 role of models here? Because in indirect 79:48 adaptive control you explicitly have the 79:50 plant model: you identify the plant 79:52 model and you use that for the 79:54 design. 79:55 So with your framework here, is it so 79:58 that, so to say, one can argue that there 80:01 should be such a model, 80:03 explicit, 80:04 in the architecture for the control? 80:08 So is this, so to say, an argument 80:11 that would lead to this 80:13 particular structure that you present 80:16 here, with the online play and offline 80:18 training? 80:21 Yes. 80:22 Speaking of adaptive control, 80:25 I'd like to take the opportunity to pay 80:27 some tribute to the outstanding 80:29 contributions of the Swedish 80:31 mathematicians and control theorists in 80:33 this area, particularly Karl Åström, who 80:36 has been a friend and a source of 80:39 inspiration for me over many years. 80:43 About the difference between 80:46 indirect and direct adaptive control: 80:51 I don't think, from what I can tell, that 80:55 direct adaptive control 80:57 can fall into the framework that I have 81:00 discussed, 81:02 because it involves 81:04 primarily stability types of analysis and 81:07 also optimization types of analysis that 81:10 do not seem to fit the mold of dynamic 81:13 programming very well. 81:14 It's like in policy gradient methods: 81:17 you do an optimization, but it does not 81:19 involve so much dynamic programming; you 81:22 just optimize. 81:24 In the direct adaptive control 81:26 case you stabilize the system, you try to 81:28 find parameters that stabilize the 81:30 system, 81:31 and cost is kind of secondary; but 81:33 sometimes they combine the two. So I'm 81:36 also not an expert in 81:39 direct adaptive control, 81:41 and perhaps someone will be able to find 81:43 some connections that I don't see right 81:45 now. 81:47 Thank you, thank you very much. 81:50 Great, thanks. 81:51 Raj? 81:53 Yes, I 81:54 have two questions. The first question 81:57 is about the convergence rate of 82:00 value iteration and policy iteration 82:02 algorithms; could you please elaborate a 82:06 bit on 82:08 this point, and in particular how 82:12 we can make them faster. And the 82:14 second question is about the optimality 82:17 gap of 82:18 rollout algorithms; by 82:23 optimality gap I mean the difference 82:25 between the cost of 82:27 the derived policy and J*. 82:30 Do you think we can 82:33 find any bound on the 82:35 optimality gap of these algorithms? 82:41 Yeah, the issue of convergence rate, at 82:43 least for the 82:45 simpler cases, 82:46 is fairly 82:48 clear. 82:50 Value iteration has a linear convergence 82:52 rate 82:54 and is much slower than policy iteration; 82:57 policy iteration has a superlinear 82:58 convergence rate, 83:00 and converges in a finite number of 83:02 iterations if the 83:03 number of policies is finite. 83:05 So 83:06 policy iteration is much faster, 83:09 and rollout takes advantage of that 83:15 fast convergence. 83:18 Computational studies show 83:21 very remarkably how effective rollout is, 83:23 even though it's a single iteration, 83:26 particularly if you have a good starting 83:27 point that's obtained by some kind of 83:30 offline training. It just works 83:33 wonders; 83:34 it's very reliable also. 83:37 So, 83:38 while there are error bounds, if you look 83:41 at my books for example there are error 83:43 bounds associated with rollout, policy 83:45 iteration, value iteration, and so on, with 83:48 J-tilde 83:50 approximations, 83:53 these error bounds 83:55 are interesting mathematically, 83:57 but they are very conservative and they 83:59 don't really tell the whole story; they 84:01 sort of point to the direction toward which 84:03 performance tends to go. 84:06 But 84:09 I would not rely on these bounds very 84:12 much to draw quantitative conclusions or 84:14 to base design decisions on those.
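[Editor's note: to illustrate the convergence-rate contrast just mentioned, here is a small self-contained toy comparison. It is my own example, not from the talk; the 3-state, 2-action MDP, its transition probabilities, stage costs, and the discount factor 0.95 are all arbitrary assumptions.]

```python
# Toy comparison (hypothetical discounted MDP) of value iteration vs policy iteration.
import numpy as np

gamma = 0.95
# P[a, i, :] is the transition distribution from state i under action a;
# g[a, i] is the corresponding stage cost.
P = np.array([[[0.7, 0.2, 0.1], [0.1, 0.8, 0.1], [0.2, 0.2, 0.6]],
              [[0.3, 0.3, 0.4], [0.5, 0.4, 0.1], [0.1, 0.1, 0.8]]])
g = np.array([[2.0, 1.0, 3.0],
              [0.5, 2.5, 1.5]])

def bellman(J):
    return np.min(g + gamma * P @ J, axis=0)   # (T J)(i) for every state i

# Reference solution via many value iterations (error shrinks like gamma^k).
J_star = np.zeros(3)
for _ in range(2000):
    J_star = bellman(J_star)

# Value iteration: linear convergence, roughly a factor gamma per iteration.
J = np.zeros(3)
for k in range(1, 101):
    J = bellman(J)
    if k % 25 == 0:
        print(f"VI iter {k:3d}: error {np.max(np.abs(J - J_star)):.2e}")

# Policy iteration: each iteration is a Newton step for J = T J, and the
# method stops after finitely many iterations (finitely many policies).
mu = np.zeros(3, dtype=int)
for k in range(1, 20):
    P_mu, g_mu = P[mu, np.arange(3)], g[mu, np.arange(3)]
    J_mu = np.linalg.solve(np.eye(3) - gamma * P_mu, g_mu)   # policy evaluation
    mu_new = np.argmin(g + gamma * P @ J_mu, axis=0)         # policy improvement
    if np.array_equal(mu_new, mu):
        print(f"PI converged at iter {k}: error {np.max(np.abs(J_mu - J_star)):.2e}")
        break
    mu = mu_new
```

On a problem this small, policy iteration should terminate after only a handful of iterations, while value iteration keeps shrinking its error by roughly the discount factor per pass.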
84:18 So all in all, 84:20 policy iteration is very effective, but I 84:22 cannot fully justify 84:26 my argument; on the other hand, there's 84:28 like a 84:29 50-year experience with this 84:32 methodology, and 84:35 what I'm saying is pretty reliable based 84:38 on just the experimental evidence. 84:41 Thank you. 84:46 Excellent. 84:48 Any other questions? 84:53 Maybe I can take the opportunity to ask 84:54 one. So if I understand correctly, 84:57 I mean, a lot of things here are 84:58 very abstract, 85:01 but it seems like you're using 85:03 full state 85:05 feedback, let's say full access to the 85:07 state of the system. And what would 85:09 happen, computationally, 85:11 if you don't have that? Like, if I want to make 85:12 a machine for playing a Texas Hold'em 85:15 card game, where I can only have partial 85:17 observations, let's say, of the state? 85:22 So does your question have to do with 85:24 imperfect state observation? Exactly, yes, 85:27 partial or imperfect. 85:29 Right, yeah. 85:31 Okay. 85:32 So once you pass into imperfect state 85:34 observation you're getting a more 85:35 complicated problem; on the other hand, 85:38 it can be transformed 85:42 into a perfect state information problem 85:44 where the state is something much more 85:46 complicated: it is a belief state, 85:49 a probability distribution of where the 85:51 state is. 85:53 Now, 85:54 that's where approximations are called 85:57 for, perhaps neural network 85:58 approximations, 86:00 and 86:00 there is not much experience in this 86:02 area using 86:04 this type of 86:06 value space method, including rollout. 86:10 However, I have two papers with some 86:12 collaborators from ASU, Stephanie Gil 86:16 and also 86:18 some of her students, and these papers 86:20 have been generated over the 86:21 last couple of years with very 86:23 challenging, very high dimensional 86:26 POMDP 86:28 problems, 86:29 and everything works exactly the same as 86:32 for 86:32 lower dimensional problems. Of course the 86:35 implementation 86:36 is more complicated; the offline 86:39 training takes days to run, okay, 86:44 but still, when it comes to online play 86:47 it's possible to do it, and the results 86:50 seem to be quite encouraging. 86:52 So that's all I can say. It's a very 86:54 interesting area for many people to work 86:57 on; there are efforts in 86:59 various communities, 87:03 but 87:04 my work, the two papers I mentioned, are 87:06 the ones that most closely relate 87:08 to the viewpoints I've expressed today. 87:13 Could I also ask if you could share some 87:15 other insight? I mean, your framework, I 87:17 think it's very beautiful, it 87:18 conceptually explains a lot of things 87:20 that we can see, 87:22 but algorithmically, I mean, you mentioned 87:23 a little bit that the rollout 87:26 optimization could be challenging, but do 87:28 you also have some, 87:29 say, algorithmic insight that you can 87:32 share of how this could transform the 87:34 way we do computations, not just 87:36 analysis, say, and conceptually structure 87:39 algorithms? 87:41 I think the computation is feasible. 87:44 Yeah, I mean, 87:46 based on my experience with the POMDP 87:49 problems, which were horrendously 87:50 complicated, I can tell you that it's 87:53 feasible. You have to be clever about it, 87:56 you have to be lucky about it, 87:59 but 88:00 we have used a lot of distributed 88:02 computation to do all this, and like I 88:05 said, training for days. 88:07 So as the availability of parallel 88:09 computation increases, 88:12 it's going to have a big effect, 88:14 because a lot of these 88:18 database 88:20 computations 88:21 can be parallelized; for example, Monte 88:23 Carlo simulation is eminently 88:25 parallelizable. 88:27 So it's not as hard or prohibitive as it 88:29 used to be, 88:30 and in the future it's going to be more 88:32 accessible. 88:33 So a lot of the things that in the past 88:36 had seemed to be 88:38 out of the realm of feasibility may 88:39 become feasible, or are feasible right 88:42 now. 88:43 So I'm optimistic in that regard; the 88:46 computational 88:47 resources are just so plentiful, and 88:49 that gives me a lot of 88:52 hope.
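[Editor's note: as an illustration of the kind of parallelization just described, here is a minimal sketch of truncated rollout with the Monte Carlo simulations distributed across worker processes. It is my own construction, not code from the talk or the papers mentioned; the scalar system, base policy, rollout horizon, and terminal cost coefficient are all assumptions.]

```python
# Minimal sketch: truncated rollout with a base policy, with the Monte Carlo
# simulations for the different candidate controls run in parallel processes.
import random
from concurrent.futures import ProcessPoolExecutor

# Hypothetical scalar stochastic system x_{k+1} = x_k + u_k + w_k,
# stage cost x^2 + u^2; the base policy is a crude proportional rule.
def step(x, u, rng):
    return x + u + rng.gauss(0.0, 0.5)

def base_policy(x):
    return -0.5 * x

def rollout_cost(args):
    x0, u0, horizon, seed = args
    rng = random.Random(seed)
    cost, x = x0**2 + u0**2, step(x0, u0, rng)
    for _ in range(horizon - 1):             # truncated rollout with base policy
        u = base_policy(x)
        cost += x**2 + u**2
        x = step(x, u, rng)
    return cost + 1.2 * x**2                 # terminal cost approximation (assumed)

def rollout_control(x, candidates, horizon=20, n_sim=200):
    # Q-factor of each candidate first control, averaged over parallel simulations.
    with ProcessPoolExecutor() as pool:
        q = []
        for u in candidates:
            jobs = [(x, u, horizon, s) for s in range(n_sim)]
            q.append(sum(pool.map(rollout_cost, jobs)) / n_sim)
    return candidates[q.index(min(q))]

if __name__ == "__main__":
    print(rollout_control(x=3.0, candidates=[-2.0, -1.0, 0.0, 1.0]))
```

The point of the sketch is only the structure: the per-candidate simulations are independent, so they map directly onto parallel workers, which is what makes long lookahead and rollout budgets increasingly affordable.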
88:54 Sounds excellent. I think we have one 88:56 more question before concluding: Jin 88:59 Zhao. 89:02 Yes, thanks, professor, for 89:05 your talk; it's really interesting. 89:07 Actually I have two questions, but in 89:10 terms of the time you may just answer 89:12 one of them. I can stay on if anyone 89:15 wants to stay 89:17 beyond; I can stay on, no 89:19 problem. 89:21 Yeah, go ahead please. Thank you. 89:25 First of all, 89:27 you mentioned that it's a 89:30 linearization of 89:32 some 89:36 operator; the Newton step linearizes 89:38 the function or operator. And then 89:41 when you 89:43 talk about this subgradient, how do 89:46 you define the proximity or the 89:49 metric in the function space? Because in 89:51 terms of the Fréchet derivative of 89:53 this operator, I am curious 89:56 about 89:58 the metric in this space. And secondly, 90:00 it's a monotone concave function, and 90:04 I'm curious about what the dual of this 90:07 function looks like, because 90:09 there might be another physical 90:11 meaning of the dual of the function. But 90:13 that's all, my question. 90:17 Yes. 90:18 Even though the problem 90:20 may be discrete, 90:22 the state space may be discrete, like a 90:24 finite number of states, five states, 10 90:26 states, 100 states, 90:28 the Bellman equation is defined over 90:31 real-valued functions, right? 90:34 And that's why you can talk about 90:35 convexity, 90:37 and the metric there, okay, you're 90:40 dealing with function spaces, Euclidean 90:42 space perhaps, so that's what 90:45 you use. 90:46 You could consider 90:48 the possibility of dual functions, I 90:50 guess what you would call 90:53 convex conjugate functions, but I haven't 90:55 found any use for that so far. But it's 90:58 a possibility. 91:00 I don't know if I've answered your 91:03 question. 91:03 Oh, thank you, thank you, yeah. 91:06 Okay. 91:08 Okay, excellent, I think we 91:10 can 91:12 conclude here, and if there are people 91:14 that still have questions for Professor 91:16 Bertsekas, 91:17 maybe you can just stay on, 91:19 and those of us that want to 91:21 continue discussing a little bit more 91:23 can do so. So 91:24 with that I think we conclude, and 91:26 thank you so much. I wish we could all 91:29 applaud also 91:32 with the sound, but we do it with our 91:33 little reactions. And thank you very much 91:35 for everything here. 91:38 Thank you for the opportunity, 91:41 and it was a pleasure to give this talk. 91:48 I'd like to say hello to a few friends 91:50 whom I see in the
audience 91:52 here: 91:53 David and Ben 91:54 and Mark. 92:02 Excellent. I think now there are 92:05 so few people left that those of you 92:07 who want to speak can just unmute 92:11 and we will sort it out. 92:17 I 92:19 think my question 92:21 was similar to the one of Mikael: 92:24 that 92:26 one of the reasons why you would use 92:28 policy iteration, for example, is that you 92:31 do not have a 92:36 model of the system, or the rules of the 92:39 system, that would allow you to do 92:41 lookahead; 92:43 but I guess 92:45 most of the methods that you have talked 92:47 about in this presentation do assume 92:49 some 92:51 model of the system. 92:55 Right, you need to have a model; like in 92:57 MPC, 92:59 you need to have a model. 93:00 On the other hand, you may be estimating 93:02 the model online, 93:07 and that's a limitation. 93:13 If you look at the problems where you 93:14 don't have a model, 93:16 then you are in the area of dual control, 93:19 what's called dual control in 93:21 traditional stochastic control theory. 93:24 It's a concept that goes back to 93:26 the 60s, and people have been trying 93:29 to deal with that, also in conjunction 93:31 with adaptive control, for decades, and 93:33 it's one of the great challenges 93:38 of control theory, and I don't think that 93:41 you're going to 93:42 solve this problem with 93:44 reinforcement learning methods; the 93:45 problems are very fundamental. 93:47 So basically 93:51 I think that 93:55 it's not very clear how you deal 93:57 with these unknown or changing models. 94:02 Certainly in my talk today I have 94:05 assumed a known model, 94:08 and 94:10 I have hand-waved when it came 94:13 to estimating a model and using the 94:15 estimate; in practice you do that, but 94:18 whether this is sound or not is not 94:20 very clear to me. 94:25 All right, yeah, thank you. 94:30 So, 94:32 do you think we could get a handle on 94:34 that problem of how to select the rollout 94:36 horizon? I mean, do you have any more 94:39 insight on that? 94:40 For sure, I mean, 94:42 you could say, okay, the longer the better, 94:44 but 94:45 is that always true, the longer 94:47 the better? 94:49 Actually not, 94:53 and one can construct interesting examples 94:58 where you increase the length of 95:02 the rollout horizon and the performance 95:04 becomes worse. 95:06 Basically, 95:13 with rollout with a policy you push 95:16 the starting point of the Newton step 95:18 close to the cost of that policy; 95:21 on the other hand, you don't really want 95:22 to go 95:23 to the cost of that policy, you want to 95:25 go close to the optimal cost and apply 95:28 the Newton step there. 95:30 Now, how do you choose 95:32 the rollout horizon to do that? It's not 95:35 very clear. 95:36 On the other hand, in practice this can 95:39 be resolved by some kind of 95:41 experimentation. Basically, 95:45 there's no theory that can guide you 95:46 about the length of the horizon, because 95:48 what's a good horizon also depends on J-tilde, 95:51 the starting point, 95:54 and that cannot be predicted very 95:56 easily. 95:59 So it's an interesting question to deal 96:01 with both practically and perhaps 96:03 also theoretically.
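[Editor's note: a compact way to write the trade-off just described, again in the notation of the speaker's books and as my own restatement, under the usual contraction assumptions: mu is the base policy, T_mu its Bellman operator, J-tilde the terminal cost approximation, and m the truncated rollout horizon.]

```latex
% m steps of truncated rollout shift the starting point of the Newton step
% from \tilde J to T_\mu^m \tilde J, and
\[
T_\mu^m \tilde J \;\to\; J_\mu \quad \text{as } m \to \infty,
\]
% so a very long rollout pulls the starting point toward J_\mu, the cost of the
% base policy, which need not be any closer to the optimal J^* than a
% moderate-m point; hence "longer is always better" can fail.
```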
96:21 I had another question: 96:23 you talked about AlphaZero now, but there's 96:26 also... 96:27 are you familiar with the method MuZero, 96:29 that's also from DeepMind? 96:33 Yeah, do you have any comment about that? 96:36 Yes, I do. 96:37 I've actually looked at 96:40 this MuZero paper, not very 96:42 carefully, but one statement struck me: 96:45 that Monte Carlo tree search is not 96:48 very good. 96:49 I mean, we've 96:53 been hearing from DeepMind about how 96:55 good Monte Carlo tree search is, and people 96:58 have been trying to use Monte Carlo 97:00 tree search in all kinds of contexts, 97:04 even contexts where it was implausible 97:06 that it would be well suited. 97:09 Now the same AlphaZero DeepMind people 97:11 are telling us 97:14 that Monte Carlo tree search is not good, 97:19 and so I don't know what to believe. 97:22 But I have done my own 97:23 experimentation with Monte Carlo 97:25 tree search, 97:28 and I have found that it's a hit or miss 97:30 proposition: 97:31 sometimes it doesn't work at all, 97:33 no matter how hard you try; 97:35 other times it may work, 97:37 and if you look at how the method works 97:39 you can see that that's probably the 97:41 case. 97:42 Now, depending on the application, it may work 97:44 very well, and I think 97:46 that in AlphaZero, at least 97:48 I've been told, it has been a major 97:52 contributor to success, because it prunes 97:55 the lookahead tree intelligently and 97:57 can save some computation time. 98:00 On the other hand, I don't know what to 98:02 make of this MuZero: 98:04 they say that Monte Carlo tree search does 98:06 not work very well and they have 98:09 rejected it, they have taken it out of the 98:10 design, but I don't understand the 98:13 reasons. 98:14 Just as I did not understand why the 98:16 original Monte Carlo tree search worked, I 98:18 don't understand now why it does not 98:20 work. 98:23 So I have more to learn about 98:25 MuZero; perhaps you have some insight 98:27 that I do not, because my reading of the 98:29 paper was so superficial. I just know 98:32 that, you know, it is 98:34 a more 98:35 modern deep learning approach where 98:37 you map everything to a sort of latent 98:40 vector space where you assume that you 98:42 can 98:42 kind of do everything; 98:46 so it is, but it is also less 98:48 interpretable because of it. 98:53 Yeah. 98:54 Well, 98:56 one of the more controversial viewpoints 98:58 of my talk today 99:01 is that 99:02 offline training 99:04 does not matter all that much. 99:06 Refining 99:10 very closely this offline training 99:12 process, 99:15 it doesn't make much 99:17 difference, because the Newton step sort 99:19 of 99:20 washes these differences out. 99:22 So, 99:24 is it the deep neural networks that make 99:26 a difference? 99:27 I mean, Tesauro's software does not use a 99:29 deep neural network; it uses a 99:31 two-layer network, 99:35 and deep is not necessarily 99:37 better. 99:38 Maybe it's better for some applications, 99:40 image processing applications, but in 99:42 dynamic programming, 99:44 well, 99:45 AlphaZero has proved it, MuZero perhaps has; 99:48 there's some evidence that it has worked, 99:55 but whether it works in general or not, 99:58 I'm not so sure. 100:02 I like deep neural networks and the 100:04 ideas behind them, 100:09 but 100:11 I do not think that it is critical 100:13 that you use deep neural networks in 100:15 offline
training. 100:22 All right, thank you. 100:26 Professor, I 100:28 would like to 100:30 ask: 100:31 you mentioned having this model 100:34 and knowing the model, and 100:38 you refer to the case that the model 100:40 could be a simulator as well, right? 100:42 Because 100:45 for policy gradient and those types of 100:47 methods they still need to have a 100:49 simulator, a model or whatever, to collect 100:52 data in the first place. 100:54 Yeah, yeah, that's a good point. 100:58 When I say model I don't 100:59 necessarily mean equations and 101:01 probability distributions and the like; 101:02 it could be a computer model. 101:04 I mean, a computer model implements a 101:06 mathematical model, 101:08 so what I mean by 101:11 needing to have a model, it could be a 101:13 computer model, 101:15 as opposed to the case where 101:19 you don't have such a model and 101:21 you need to do system 101:23 identification. 101:27 So yeah, you're right: 101:30 a model has a broader meaning in my 101:33 context. 101:43 Yeah, sorry, 101:44 my Zoom crashed right as I asked the 101:47 question and then I disappeared; 101:49 I apologize for that. I asked about 101:51 this 101:52 rollout 101:54 length design, but I can ask Yuchao 101:56 later what the answer was. I just 101:59 wanted to apologize for disappearing; it 102:01 was not very polite. 102:02 I may not be able to teleport 102:05 now. 102:07 Thank you. 102:08 But professor, I think one of your 102:11 slides that has truncated rollout and 102:14 this slope actually 102:17 explains the idea 102:19 elegantly: if you have too long a rollout, 102:22 then the rollout goes to 102:26 the cost function of that particular 102:28 policy 102:30 instead of the 102:32 Bellman equation's solution. I think that 102:35 figure would explain that it's not 102:37 necessarily good to have way too long a 102:40 rollout. 102:42 Yeah, yeah, thank you very much, Yuchao. 102:44 I think if you look at the draft of the 102:46 book there's a section on this question 102:48 precisely, 102:49 the situation where a longer rollout 102:52 may help, 102:54 but even so there's no definitive answer 102:58 on how you do it. It's just that 103:01 you might gain some better understanding 103:03 about what's going on if you're looking 103:05 at this section of the book or 103:07 the figure that Yuchao mentioned, 103:14 but it's 103:16 hard to say 103:18 how much rollout would be most effective. 103:23 On the other hand, there's always a 103:24 Newton step, okay; once you have a Newton 103:27 step you can afford to play 103:29 around with different 103:31 parameters of the design; the Newton step 103:34 has the power to correct a lot. 103:40 Do you know, in TD-Gammon, the 103:42 author, did he reveal, I mean, I 103:45 guess, 103:47 the design of TD-Gammon, did he reveal 103:50 how that was tuned? I mean, I guess, or 103:54 Google with AlphaZero, did they also 103:55 discuss this 103:57 in their own contributions? 103:59 Well, 104:00 Tesauro's paper very explicitly says 104:03 that it's important not to have a very 104:05 long rollout, 104:07 but the reason that he gave is that 104:10 it takes a very long time to do a long 104:12 rollout, and that's why he 104:15 introduced truncation, 104:17 introducing a cost function 104:18 approximation at the end of the rollout. 104:21 Yeah, 104:22 Tesauro had it all figured out, 104:25 but 104:26
he did not give any mathematical 104:28 analysis, 104:29 and his motivations were always 104:32 computational, how to make the thing work, 104:35 because remember, 104:37 Tesauro's 104:38 rollout program dates to 1996, and 104:42 computers were very primitive at that 104:44 time. Yeah, and for Tesauro to play a 104:46 single game of backgammon 104:48 he had to pull together 104:51 a lot of workstations at the IBM research 104:54 center to work all night in order to 104:57 play a single game, 104:59 because 105:00 Monte Carlo simulation was involved, 105:04 and it was important to use parallel 105:06 computation and also to keep the length 105:08 of the truncated rollout short. 105:13 That was a good use of computing 105:15 resources in the end, you know. 105:17 Yes, it's amazing. Tesauro has been amazing 105:20 in this field, and I'm afraid, and 105:22 I think, that his great influence 105:25 and great contributions have not been 105:27 fully appreciated, unfortunately. In this 105:30 field of reinforcement learning, 105:32 memory is very short, yes; 105:35 it stretches back a few years; 105:38 anything that's been done 20 years ago 105:40 or longer is as if it never existed. Yeah, 105:44 yes, yes. 105:46 I guess it's important... 105:49 it's a more general problem. Oh yeah, 105:52 people tend to be forgotten 105:54 so fast. 105:56 Yes. 105:57 Great, but thank you, Dimitri, we will not 106:00 forget you. 106:01 Yeah, 106:02 it's good to hear, 106:04 but I'm not sure I can hold you to it. 106:06 [Laughter] 106:10 Great, this was fantastic, so I guess 106:13 we should conclude. And I know you 106:16 work with Yuchao, and so that's 106:18 fantastic; I hope you'll be able to come 106:20 to visit us at KTH in a non-pandemic 106:23 future, hopefully. 106:25 I love your city, I'd love to visit, 106:28 and also I hope that we can do some 106:31 further work and more computational 106:32 experimentation with meaningful problems; 106:35 I think there are many possibilities, 106:38 and working with Yuchao has been 106:40 fantastic for me and gave me the 106:42 opportunity also to connect with your 106:44 group. 106:46 That's great. 106:47 All right, so thanks a lot. 106:50 I'm leaving now also, 106:53 thanks, and many, many thanks, 106:55 bye for now. Okay, thank you very much. 106:58 Thank you, Dimitri, and I will send you 106:59 the recording. 107:01 Okay guys, thank you very much. Yeah, I'd 107:03 like to have the recording, absolutely. 107:05 I'll fix that. Thank you so much. 107:07 Take care, bye-bye.