Transcript

Hello everyone. You can see my title, which is also the title of my forthcoming book, scheduled for publication sometime next year. You can find a fairly complete draft of the book, video lectures, a copy of these slides, and other material on my website.

The purpose of the book is to provide a new conceptual framework for reinforcement learning and its connections with the traditional control methodologies of model predictive control and adaptive control. The book is based, for the most part, on visualization and geometric interpretations. On the other hand, there is quite a bit of mathematical analysis, given in my earlier books, that supports the visualization.

I'd like to start with two remarkable programs. The first program is AlphaZero, which played chess and created a sensation when it came out in 2017. The other program is much earlier, from 25 years ago; it is called TD-Gammon, and it played backgammon. At the time it created a lot of enthusiasm about the prospects of reinforcement learning.

These two programs are very remarkable, and even though they were implemented quite a few years apart, they are quite similar. Both of them involve two algorithms; some people do not quite realize that there are two algorithms involved in these programs. One algorithm is the offline training algorithm, by which the program learns how to evaluate positions and also how to select moves in given positions. This is done with a lot of data, neural network approximations, and so on. The second algorithm is the online play algorithm, which is what the program uses to play against other computer programs or human opponents. It is based on multistep lookahead, rollout, and cost function approximations. All of this has strong connections with dynamic programming, the classical method of policy iteration, and reinforcement learning types of approximations.

What I'm going to try to do in this talk is to understand this methodology so that it applies far more generally, and in particular to see how it connects with the control system design methods of MPC, model predictive control, and also adaptive control. There are also extensions in a broader context, for example discrete optimization, integer programming, and the like, although I'm not going to talk about those in this particular talk.

Okay, I'm going to start by describing the online play algorithm of AlphaZero, AlphaGo, TD-Gammon, and other related settings, and then I'm going to discuss the offline training algorithm.

Now, the online play algorithm is what you see here in blue. The online play algorithm uses the results of offline training at two points: it uses a position evaluator to evaluate positions, and a base player that generates moves at given positions.
So this is online play, but it crucially uses information provided by the offline training algorithm. What we are trying to do here is improve the base player, the one trained offline, by the following process. Given that we are at some position, say a current position x_k after k moves, we search forward for several moves through a lookahead tree of moves and countermoves, up to a certain depth. Then, from the leaves of that tree, we simulate the base player for some more moves. We arrive at some positions, which we evaluate using the position evaluator, and we approximate the effect of the future moves beyond that point by using this evaluation. Now we back these values up to the root, we calculate a value for each possible move at the current position, and we play the best move. That is what we do at a given position; then we go to the next position, after one move, do the same thing, and so on.

Now, all of these programs use this structure, but there are differences. AlphaZero does not use the middle portion; it uses a very long initial portion. The lookahead portion is very long, and it does not need the middle portion. AlphaGo uses a version of this middle portion. TD-Gammon relies very much on the middle portion, because its lookahead portion is short. TD-Gammon plays backgammon, which is a stochastic game, and the lookahead tree expands very quickly, so it cannot use long lookahead; it relies instead on this middle portion, the truncated rollout. In fact, the term rollout was coined by the originator of TD-Gammon, Gerry Tesauro, and I'm using the term here in the same way that he has used it.

There are also similarities with model predictive control. Model predictive control structures are remarkably similar to this structure, to the AlphaZero structure. There is an initial lookahead portion where minimization takes place, which is called the control interval; then there is sometimes a rollout portion, which is called the prediction interval; and then there is a terminal cost approximation. So model predictive control is pretty close to this structure, except that some of the parts are implemented differently, because model predictive control deals with infinite state and control spaces for the most part, whereas these games deal with discrete state spaces.
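Before moving on to offline training, here is a minimal sketch of the generic online play step just described: lookahead minimization for a few moves, then simulation of a base player (truncated rollout), then a terminal evaluation. The function names (`transition`, `stage_cost`, `base_policy`, `terminal_cost`), the undiscounted deterministic setting, and the exhaustive enumeration over controls are my own illustrative assumptions for a small problem; this is not the implementation of AlphaZero or of any specific program.

```python
def online_play_control(x, controls, transition, stage_cost,
                        base_policy, terminal_cost, ell=2, m=3):
    """One decision of generic online play: ell-step lookahead minimization,
    then m steps of rollout with the base policy, then a terminal cost."""

    def rollout_value(x):
        # Simulate the base player for m steps, then use the terminal evaluation.
        total = 0.0
        for _ in range(m):
            u = base_policy(x)
            total += stage_cost(x, u)
            x = transition(x, u)
        return total + terminal_cost(x)

    def lookahead_value(x, depth):
        # Exhaustive minimization over the lookahead tree (deterministic case).
        if depth == 0:
            return rollout_value(x)
        return min(stage_cost(x, u) + lookahead_value(transition(x, u), depth - 1)
                   for u in controls(x))

    # Evaluate each first move, play the best one, and discard the rest of the plan.
    return min(controls(x),
               key=lambda u: stage_cost(x, u) + lookahead_value(transition(x, u), ell - 1))
```

At each encountered state the returned control is applied and the same computation is repeated at the next state, exactly as in the description above.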
So let me go now into the offline training algorithm of AlphaZero. It is approximate policy iteration with self-generated data. What we do here is generate a sequence of players; the current player is used to train an improved player, and this process is repeated several times until we stop somewhere, and the trained player at the end is given to the online play algorithm.

Policy iteration has two parts at every step, policy evaluation and policy improvement, and these are done approximately, with neural networks. The current player is evaluated by playing many games, collecting data, and representing its evaluation by a value neural network, which is trained with lots and lots of data. Once we have the evaluation of the current player, we can approximately improve the current player by using multistep lookahead minimization, in an approximate form called Monte Carlo tree search. After we collect all this data from the improved player, we represent it with a policy neural network through training. So a sequence of value and policy networks is generated here, and these are given at the end to the online player, for use in online play.

Now, TD-Gammon uses a very similar policy iteration algorithm for offline training, using the so-called TD(lambda) method. It does not have a policy network; the functionality of the policy network is provided by using the value network through one-step lookahead. It also does not use Monte Carlo tree search, but that is not really fundamental; it does not really affect the overall character of the process.
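The offline training loop just described, approximate policy iteration with a learned value approximation, can be sketched roughly as follows. This is a minimal illustration under my own simplifying assumptions: a small finite problem, a least-squares fit over a feature map standing in for neural network training, simulation-based policy evaluation, and a single-sample lookahead; it is not the AlphaZero training procedure.

```python
import numpy as np

def approximate_policy_iteration(states, controls, transition, stage_cost,
                                 features, alpha=0.9, n_iter=10, horizon=50):
    """Offline training sketch: evaluate the current policy by simulation,
    fit a value approximation, then improve the policy by one-step lookahead."""
    rng = np.random.default_rng(0)
    policy = {x: controls(x)[0] for x in states}      # arbitrary initial player

    for _ in range(n_iter):
        # Policy evaluation: simulate the current policy from each state,
        # record the discounted cost, and fit weights r so that J(x) ~ features(x) @ r.
        costs = []
        for x in states:
            total, xk = 0.0, x
            for k in range(horizon):
                u = policy[xk]
                total += alpha**k * stage_cost(xk, u)
                xk = transition(xk, u, rng)
            costs.append(total)
        Phi = np.array([features(x) for x in states])
        r, *_ = np.linalg.lstsq(Phi, np.array(costs), rcond=None)
        J = lambda x, r=r: features(x) @ r            # trained value approximation

        # Policy improvement: one-step lookahead using the fitted value function
        # (single-sample approximation of the expectation, for brevity).
        policy = {x: min(controls(x),
                         key=lambda u: stage_cost(x, u)
                                       + alpha * J(transition(x, u, rng)))
                  for x in states}
    return policy, J
```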
So here are some major empirical observations that I want to focus on. Here is the online player; it uses an offline player that has been trained with neural networks. But the online player of AlphaZero plays much better than the offline-trained player. This is just a fact, an obvious fact for people who have used these programs. The online player of AlphaZero is a phenomenal chess player; by contrast, the offline-trained player is relatively mediocre. So there is a tremendous performance improvement from having online play on top of offline training.

TD-Gammon plays much better with the middle portion, the truncated rollout portion, than without it. Tesauro had two programs in the 90s. The first one, in 1991, was based on this structure but without the middle portion. Then in 1996 he introduced the middle portion, the rollout portion, and he found a great deal of improvement. The 1991 program's performance was subhuman; it was being beaten by the best humans. By contrast, the 1996 program had superhuman performance, and it has not been improved upon since that time.

Now, these are empirical facts, and what we would like to do is explain them, draw insights from them, and generalize them to other contexts, by using abstract dynamic programming ideas, abstract Bellman operators, and visualization for the most part. We will focus on the central role of Newton's method; this is the key new insight that I'm going to present today: how Newton's method ties everything together.

So here are the principal viewpoints of this talk. I'm going to show you that online play is just a single step of Newton's method for solving the Bellman equation associated with the dynamic programming problem. This is true for one-step lookahead; in the case of multistep lookahead, the Newton step is preceded by some value iterations. In numerical analysis this is called a Newton-SOR method; it is a well-known method with a lot of literature, and SOR stands for successive over-relaxation. So online play is a Newton step. Offline training, on the other hand, provides the starting point for the Newton step.

In my opinion, online play is the real workhorse. Everything happens in online play; the big performance improvement is during online play. By comparison, offline training plays a secondary role. A major reason for this is that online play is an exact Newton step: it does not involve approximations, and it is not degraded by neural network training errors. That makes it different and much more powerful.

Now, the offline training can be done by many different methods. In reinforcement learning we have temporal difference methods, aggregation, linear programming; there are many different ways to do the offline training. However, the differences and the various imperfections of these methods affect only the starting point; they do not affect at all the Newton step, which is exact, and they are all washed out by the Newton step. The Newton step is a big equalizer: errors that you may make in offline training are washed out by the Newton step.

Now, there is a cultural difference between the reinforcement learning and artificial intelligence community and the control community, particularly model predictive control and adaptive control. People in reinforcement learning focus primarily on offline training issues. People in model predictive control are focused largely on online play, and also on stability issues, which are hardly dealt with at all by the reinforcement learning community. So we are going to try to bridge this gap through these visualizations.

I'm also going to talk a little bit about adaptive control, but instead of using reoptimization in response to changes in the estimated model, we are going to use multistep lookahead and rollout. This framework is going to involve an exact Newton step applied to a changing Bellman equation, changing in response to a system identification process. But the bottom line is that it is still a Newton step, and it is very powerful.

Now, all of this has surprisingly great generality. Abstract dynamic programming is a very general framework. It can accommodate arbitrary state and control spaces; stochastic systems, deterministic systems, hybrid systems involving both discrete and continuous state and control components; multi-agent systems; games and minimax; finite horizon and infinite horizon; discrete optimization and integer programming. Dynamic programming is a very, very general methodology, and abstract dynamic programming captures this generality. Because everything I'm going to talk about is couched within the abstract dynamic programming framework, it has very broad applications.
Now, some of these viewpoints may be a little controversial and personal, so I thought of showing you this cartoon. These two people are looking at the same number; this one says it's a six, and this one insists it's a nine. I think this cartoon may have some relevance to my talk today.

So let me give you an outline. First I'm going to talk about discounted and undiscounted infinite horizon problems, to provide some context. Then I'm going to talk about abstract dynamic programming and abstract Bellman operators. Then I'm going to show, by visualization, how online play can be viewed as a Newton step. Then I will discuss stability issues, the region of stability, and how you can visualize it, and the visualization of rollout, truncated rollout, and policy iteration. Then I'm going to talk about a familiar problem, the classical linear quadratic problem, where the visualizations are in terms of Riccati equation operators as opposed to Bellman operators. Then I'm going to discuss the application to model predictive control and the connections there, and also adaptive control, indirect adaptive control with model estimation.

So let me start with discounted and undiscounted infinite horizon problems. What we have here is a discrete-time system involving a state, which I denote by x_k, the state at time k, a control u_k, and a random disturbance w_k, and the system evolves according to this equation. We are interested in stationary policies that map states into controls, and the control must satisfy a control constraint for every x. There is a cost for each transition, encoded by this function g, and the cost is discounted by a factor alpha; alpha could be less than one or equal to one, and at stage k the cost is discounted by alpha to the power k. The cost of a policy, starting from some initial condition x_0, is just the limit of the N-stage cost; the expectation here is over all the disturbances w, since this is a random variable, and we take the limit as the number of stages N goes to infinity. So this is a function: for every state there is a value, given by this cost function, and the optimal cost function is obtained by minimizing over all policies the policy cost functions.
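In symbols, the setup just described is the standard infinite horizon formulation; this is only a compact restatement of the verbal description above, in the notation of the slides:

```latex
\begin{aligned}
&x_{k+1} = f(x_k, u_k, w_k), \qquad u_k = \mu(x_k) \in U(x_k), \qquad k = 0,1,\ldots,\\[4pt]
&J_\mu(x_0) = \lim_{N\to\infty}\, \mathbb{E}\Bigl\{\sum_{k=0}^{N-1} \alpha^k\, g\bigl(x_k,\mu(x_k),w_k\bigr)\Bigr\},
\qquad 0<\alpha\le 1,\\[4pt]
&J^*(x) = \min_{\mu}\, J_\mu(x)\quad\text{for all } x.
\end{aligned}
```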
I have in mind primarily three contexts. One is discounted problems, where alpha is strictly less than one and g is bounded. This is the nice case; everything I'm going to talk about is correct for discounted problems. For other problems there may be exceptions, and I'm going to mention a few of those as I go along. The second class of problems is stochastic shortest path problems, where there is a special cost-free termination state (think of the origin as being the termination state) and alpha is equal to one. And then there are the MPC-type problems, where the system is deterministic, the cost per stage is positive, and the origin is the termination state. This is a regulation problem: the aim is to drive the state to the origin and keep it close to the origin, and this is encoded in the positive cost, which is zero at the termination state and positive everywhere else. So these are the problems I have in mind.

Let me give you a synopsis of the theory. The first result is that J*, the optimal cost function, satisfies Bellman's equation. There is one such scalar equation for every state, so it is a functional equation, and J* satisfies it, often uniquely but not necessarily; that is why I have these question marks. There is an optimality condition: if, for every x, mu*(x) attains the minimum in this equation, then the policy that you obtain is optimal; again, there are some exceptions to that.

Regarding algorithms, the first major algorithm is value iteration. It generates a sequence of cost functions, starting from some J_0, which can be more or less arbitrary, and uses this recursion, which is just the right-hand side of the Bellman equation but with J_{k-1} replacing J*. This generates a sequence of finite horizon costs, and it is one of the major algorithms. The other major algorithm is policy iteration. It generates a sequence of policies and their cost functions; mu_0 is arbitrary. The typical iteration starts with some policy mu and generates a new policy mu-tilde; I'm going to be using these symbols a lot in this talk, mu for the current policy and mu-tilde for the next policy. It does this in two steps: there is a policy evaluation step, which computes the cost function J_mu, and there is a policy improvement step, which calculates the improved policy by doing, for every x, this minimization, which is the same minimization as in Bellman's equation, except that we use J_mu, the cost of the current policy, in place of J*.

Okay, so now let's talk about approximation in value space, a major approximate dynamic programming and reinforcement learning framework. It is very difficult to find J*, so we replace it with an approximation J-tilde in Bellman's equation. When we arrive at a state x, we perform this minimization, which balances the cost of the first stage plus the future cost, as evaluated using J-tilde and appropriately discounted. The minimization is over all the possible controls at the given state and provides a control to use. We apply this control, then at the next state we do the same type of minimization again, and we proceed going forward. This is called one-step lookahead, because we look forward only one step before going to the approximation.

Multistep lookahead is essentially the same thing, except that we minimize the cost of the first ℓ stages plus the future cost as evaluated by J-tilde. This is now an ℓ-step problem, and it involves minimization over the control u_k and the policies from stage k+1 to stage k+ℓ-1. We solve this problem, we use at x_k the first control, and we throw away all the rest. Either way, one-step lookahead or multistep lookahead, this defines a lookahead policy mu-tilde: mu-tilde of x_k is the control obtained from this minimization, the minimizing u_k. I'm going to be using these symbols a lot: J-tilde is the cost approximation, and mu-tilde is the lookahead policy obtained by either one-step or multistep lookahead.
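Written out, the equations behind this synopsis and the lookahead schemes take the following familiar form; again this is just a compact restatement of what was described verbally:

```latex
\begin{aligned}
\text{Bellman equation:}\quad & J^*(x) = \min_{u\in U(x)} \mathbb{E}\bigl\{ g(x,u,w) + \alpha\, J^*\bigl(f(x,u,w)\bigr)\bigr\},\\[4pt]
\text{Value iteration:}\quad & J_{k}(x) = \min_{u\in U(x)} \mathbb{E}\bigl\{ g(x,u,w) + \alpha\, J_{k-1}\bigl(f(x,u,w)\bigr)\bigr\},\\[4pt]
\text{Policy improvement:}\quad & \tilde\mu(x) \in \arg\min_{u\in U(x)} \mathbb{E}\bigl\{ g(x,u,w) + \alpha\, J_{\mu}\bigl(f(x,u,w)\bigr)\bigr\},\\[4pt]
\text{One-step lookahead:}\quad & \tilde\mu(x) \in \arg\min_{u\in U(x)} \mathbb{E}\bigl\{ g(x,u,w) + \alpha\, \tilde J\bigl(f(x,u,w)\bigr)\bigr\},\\[4pt]
\text{$\ell$-step lookahead:}\quad & \min_{u_k,\,\mu_{k+1},\ldots,\mu_{k+\ell-1}} \mathbb{E}\Bigl\{ g(x_k,u_k,w_k)
+ \sum_{i=k+1}^{k+\ell-1} \alpha^{\,i-k} g\bigl(x_i,\mu_i(x_i),w_i\bigr) + \alpha^{\ell}\, \tilde J(x_{k+\ell})\Bigr\},
\end{aligned}
```

with the first control u_k of the ℓ-step minimization applied at x_k.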
Here is the key fact: J_{mu-tilde}, the cost function of the lookahead policy, is the result of a Newton step for solving Bellman's equation, starting from J-tilde. This is true for one-step lookahead; for multistep lookahead, with ℓ greater than one, it is the result of a Newton-SOR step, which is some value iterations followed by a Newton step. Either way, the error decreases superlinearly. That is the generic and key property of Newton's method, whereby the errors diminish very rapidly; in particular, this ratio goes to zero as J-tilde tends to J*. You had an error in cost, between J-tilde and J*, and you have a new error after the Newton step, between the cost of the lookahead policy and J*; the numerator is tiny relative to the denominator, so the errors are reduced very fast, superlinearly.

So now let me go into abstract dynamic programming concepts, and how you interpret all of this in terms of Bellman operators and through visualization.

Okay, so here is the Bellman operator. It is the right-hand side of what you have in Bellman's equation, and it maps cost functions J into cost functions T_mu J. J has a component for every x; T_mu J also has a component for every x, and it is given by this expression, involving the cost of the first stage using mu and the cost of the remaining stages as measured by J. So there is one such scalar equation for every x, and I call this the mu Bellman operator. There is also the min Bellman operator, which is the one that appears on the right-hand side of Bellman's equation, and it involves minimization over all controls: if you take these values and minimize over mu, what you get is (TJ)(x). Again, J is a function of the state, and TJ is also a function of the state, so this is an operator mapping functions into functions. Bellman's equation can be written in shorthand using these operators: the min Bellman equation is represented as a fixed-point equation, J* is a fixed point of T; and there is a Bellman equation for every policy: for a policy mu, J_mu, the cost of that policy, is a fixed point of the mu Bellman operator. So T_mu and T transform real-valued functions J into the functions T_mu J and TJ, and for this talk I'm going to assume that these are real-valued.
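In the notation used above, these operators and the Newton-step interpretation read as follows; the last line is the superlinear error property stated a moment ago:

```latex
\begin{aligned}
(T_\mu J)(x) &= \mathbb{E}\bigl\{ g\bigl(x,\mu(x),w\bigr) + \alpha\, J\bigl(f(x,\mu(x),w)\bigr)\bigr\},\\[4pt]
(T J)(x) &= \min_{u\in U(x)} \mathbb{E}\bigl\{ g(x,u,w) + \alpha\, J\bigl(f(x,u,w)\bigr)\bigr\} = \min_{\mu}\,(T_\mu J)(x),\\[4pt]
J^* &= T J^*, \qquad J_\mu = T_\mu J_\mu,\\[4pt]
T_{\tilde\mu}\,\tilde J &= T\,\tilde J \quad\text{(one-step lookahead: $T_{\tilde\mu}$ linearizes $T$ at $\tilde J$)},\\[4pt]
&\lim_{\tilde J \to J^*}\;\frac{\|\,J_{\tilde\mu}-J^*\,\|}{\|\,\tilde J - J^*\,\|} = 0 .
\end{aligned}
```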
Now, these are pretty big operators. How many dimensions do they have? Well, for every x there is a scalar equation, so the dimension is the number of states; it could be astronomical, or it could be infinite. On the other hand, for every x these are real-valued functions of J, and you can view them for a single state.

As an example, suppose you have a two-state system. Then J is a two-dimensional vector (J(1), J(2)); each component is a real number, so J is a vector in R^2, and TJ has two components, (TJ)(1) and (TJ)(2), so it is also a vector in R^2. So even for a two-state system we have an operator that maps two-dimensional space into two-dimensional space, and you need a four-dimensional graph to represent it. That poses a little bit of a problem.

On the other hand, even though T_mu and T are large and perhaps difficult to visualize at first sight, they have some nice properties. The first is that they are both monotone: as you increase the function J uniformly, T_mu J and TJ also increase. The reason is that the expectation preserves monotonicity (an expectation is a positive linear combination, so it has the monotonicity property), and the minimization also preserves monotonicity, so T_mu and T are monotone. Another important fact is that T_mu is linear: expectations are linear operations, so this is a linear operator. T, on the other hand, is concave. It is concave because it can be viewed as the minimum of linear operators, and if you take the minimum of linear functions you get something that is concave. So in particular, for every fixed state x, (TJ)(x) is a concave function of J (not of x, of J; remember that J is a function), concave in the traditional sense of convex analysis.

For an infinite-dimensional state space, T_mu and T are infinite-dimensional operators, mapping functions in an infinite-dimensional function space into itself. So another question is: we want to do visualization with these operators, how do we do that? What we do is use one-dimensional slices of the graphs of these operators.

To give you an idea of this, let's consider a two-state, two-control example. We have a four-dimensional graph, but we can represent it as two three-dimensional graphs, and I'm going to do that here. Blue stands for the mu operator, and red stands for the min operator. Here is the T_mu operator; it has two components, one for state one and one for state two, and I'm showing these components here. This one is linear and monotone, and this one is concave and monotone.
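As a concrete numerical illustration of these properties, here is a small Python sketch with an assumed two-state, two-control Markov decision problem; the transition probabilities, stage costs, and discount factor are made up purely for illustration. It applies T_mu and T to cost vectors J and checks monotonicity and the fact that T is the minimum, over policies, of the linear operators T_mu.

```python
import numpy as np
from itertools import product

alpha = 0.9                                        # assumed discount factor
# P[u] = transition matrix under control u; g[u] = expected stage cost under u
P = {0: np.array([[0.8, 0.2], [0.3, 0.7]]),
     1: np.array([[0.5, 0.5], [0.9, 0.1]])}
g = {0: np.array([1.0, 2.0]), 1: np.array([3.0, 0.5])}

def T_mu(J, mu):
    """mu Bellman operator: linear in J for a fixed policy mu = (mu(1), mu(2))."""
    return np.array([g[mu[i]][i] + alpha * P[mu[i]][i] @ J for i in range(2)])

def T(J):
    """min Bellman operator: componentwise minimum over the two controls."""
    return np.array([min(g[u][i] + alpha * P[u][i] @ J for u in (0, 1))
                     for i in range(2)])

J1, J2 = np.array([1.0, 1.0]), np.array([2.0, 3.0])          # J2 >= J1 componentwise
assert np.all(T(J1) <= T(J2))                                # monotonicity
policies = list(product((0, 1), repeat=2))                   # the four stationary policies
assert np.allclose(T(J1), np.min([T_mu(J1, mu) for mu in policies], axis=0))
print(T(J1), T(J2))                                          # T is the lower envelope of the T_mu
```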
So for state one, there are two controls, and each control defines a linear graph; that is the blue graph that you see here. Here are J(1) and J(2), and this is the linear function (T_mu J)(1), for state one. Now, the min Bellman operator is obtained by minimizing over these two, so it is a piecewise linear function; it has this red graph that you see. It is a function of J(1) and J(2), and I'm also showing J* here; J* is shown right here. I want to focus attention on the fact that we can consider one-dimensional slices that go through J*, and you can see this slice here. By using one-dimensional slices we are able to visualize this operator on a piece of paper, rather than deal with the high dimensionality. There is a slice like that for state one, and there is also a slice like that for state two, and the slice involves a piecewise linear monotone function; you can see the 45-degree line here. So that is the visualization for a two-by-two problem, and from now on we are going to be focusing on one-dimensional slices that pass through J*.

Okay, so let's visualize the mu Bellman operator. It is given by this equation; the mu Bellman equation says that J_mu is a fixed point of T_mu. The operator has a linear graph, this blue line that you see here, which defines T_mu J as a function of J. To find the cost of mu, J_mu, I have to find the point where this blue line meets the 45-degree line; this is the fixed point of the operator T_mu, and it is the cost of mu, which satisfies the Bellman equation. Of course this cost has to be somewhere to the right of the optimal cost, and all of these costs are to the right; there are many of these mu operators, and they are all defined in a similar way.

Okay, now the min Bellman operator. The min Bellman operator is obtained by minimizing over all controls, or equivalently over all of these blue operators, for a given x. It is concave and monotone, and it is defined as follows. Here is TJ, this red graph, as a function of J. To obtain its value, say at some point J-tilde, we have to consider all the policies, find the points of intersection of their graphs with this vertical line, and go to the lowest point; this is the minimum over policies of T_mu J-tilde. That is how you find T J-tilde, and you do the same for every J. Basically, the red curve is the lower envelope of the blue lines: you have all these blue lines, and what lies below them is the graph of the min operator.

And now the optimal cost J* is obtained as the fixed point of the operator T. It is the point where the red curve meets the 45-degree line. That is how you visualize J*, all these operators, and the connection of the blue operators with the red operator. A thing to notice is that the red operator is concave, and also that, at J-tilde, the blue operator that attains the minimum is tangent to it; it provides a linearization of the red operator at this point. This linearization is going to be key for understanding the role of Newton's method.
So let me also illustrate how value iteration can be viewed within this operator context. We have TJ as a function of J, and in value iteration we generate a sequence of cost functions. We start with some J_0 and we find T J_0; this is J_1. Then we reflect J_1 across the 45-degree line to this point here, and apply T to it, and we obtain the second iterate J_2, so J_2 is T-squared J_0. We reflect it again across the 45-degree line, apply T, and we obtain J_3, and so on. So there is this staircase construction that provides a visualization of value iteration. You can see that there is a tendency to converge to the fixed point, but not always; it depends on what J_0 is and on the slope of T. In particular, if T is a contraction mapping, which means that the slope is uniformly less than one, then you get convergence and you also get a unique fixed point. That is the favorable case, but it does not always hold true. So that is the visualization for value iteration.

Now let me go into Newton's method and its connection with online play and one-step lookahead. Okay, so let's look at the familiar Newton's method for solving a fixed-point problem, finding a fixed point of an operator T. This is a generic operator here, and I'm showing it in red, concave and monotone, as it is in dynamic programming. Newton's method is an iterative method: you generate a sequence of iterates J_k; you start at J_0 and generate J_1, J_2, and so on. At the typical step you are at J_k, and what you do is linearize T at J_k; this blue line is the linearization, and it involves the first-order Taylor series expansion of T at the point J_k. We then solve the linearized fixed-point equation; in other words, we find this point here, and that is the new iterate J_{k+1}. That is what Newton's method does, and it is pretty similar to what I showed you earlier for online play and for one-step lookahead. So: at J_k, linearize, solve the linearized problem to get J_{k+1}; then linearize at that point, solve the linearized problem, obtain J_{k+2}, and so on. You can see that this process can be very fast, indeed quadratic under the right conditions. I have shown T here as differentiable, but we will relax this assumption in what follows. So this is the generic Newton's method.
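For concreteness, here is a minimal Python sketch of this iteration for a scalar, concave, monotone map. The particular map F below and its derivative are assumed purely for illustration; they are not the Bellman operator of any specific problem.

```python
import math

def newton_fixed_point(F, dF, J0, tol=1e-12, max_iter=50):
    """Newton's method for solving J = F(J): linearize F at the current iterate
    and solve the linearized fixed-point equation exactly."""
    J = J0
    for _ in range(max_iter):
        # Linearization at J:  F(y) ~ F(J) + dF(J)*(y - J); solve y = F(J) + dF(J)*(y - J).
        J_next = (F(J) - dF(J) * J) / (1.0 - dF(J))
        if abs(J_next - J) < tol:
            return J_next
        J = J_next
    return J

# Example: a concave, monotone map with a unique fixed point (illustrative only).
F = lambda J: 1.0 + 0.9 * math.sqrt(J + 1.0)
dF = lambda J: 0.45 / math.sqrt(J + 1.0)
print(newton_fixed_point(F, dF, J0=0.0))   # converges in a handful of iterations
```

One-step lookahead corresponds to a single such step, with the linearization supplied by the minimizing policy rather than by a derivative.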
And here is how it relates to one-step lookahead. In one-step lookahead we have J-tilde, a cost approximation, and we find mu-tilde, a one-step lookahead policy, which is obtained by minimizing over all possible policies. So what one-step lookahead does is linearize T at J-tilde and obtain this blue operator, which corresponds to the lookahead policy mu-tilde. How do we represent the cost function of this lookahead policy? Well, it is simply the intersection of this blue line with the 45-degree line, and it gives you J_{mu-tilde} as the fixed point of T_{mu-tilde}. So we start with this cost approximation and we end up with the cost of the one-step lookahead policy, and you can see that it is an honest-to-goodness Newton step.

Now, this is the key insight, but immediately some questions arise. In the classical form of Newton's method you have to have the operator T be differentiable, perhaps twice differentiable. Do we need that? The answer is no, you don't. You can substitute concavity and monotonicity for differentiability. This is known in numerical analysis; if you look at the classical books, like Ortega and Rheinboldt from the 60s, there is theory associated with that, primarily with monotonicity. But it is also true that you do not need differentiability here; concavity is sufficient, because this blue line, in the case of a nondifferentiable operator, is just a subgradient. You use subgradients instead of derivatives and gradients. The important thing is that you have a tangent hyperplane, a supporting hyperplane, and this can be provided by a subgradient or by a gradient just as well. Now, you do need some assumptions for this process to be valid, because there are all sorts of anomalies that can occur, but everything works out fine for contractive problems, such as the discounted problem that I gave earlier.

And notice one more thing: the Newton step is very fast, with superlinear convergence, so it smooths out starting-point variations. If you make variations to J-tilde, there will be corresponding variations to J_{mu-tilde}, but these variations are tiny relative to the variations in J-tilde. So you may make some errors here, and they are smoothed out by the Newton step, and there is lots of empirical evidence to support all this.

Okay, now let me go into stability issues and how we visualize them. Here I'm going to be focusing mostly on the MPC context: deterministic system, cost-free terminal state, and so on. Stability is of paramount importance in MPC, of course. But the question arises: how do you define stability, given that the state space can be discrete or continuous? Instead of going to concepts of Lyapunov stability and the like, we adopt an optimization-based definition of stability, and we say that a policy is stable if its cost function is finite for all x. If the cost of the policy does not blow up for any initial state, then we say that the policy is stable, and this has the advantage of being a very general definition. Generally this is true if T_mu is a contraction, or has slope less than one in the context of our figures.

So let's look at one-step lookahead. We define the region of stability to be the set of all cost approximations J-tilde
for which the lookahead policy is stable. You can see here that if you choose J-tilde to the right of some threshold, then the corresponding policies, because of the monotonicity, are stable. On the other hand, if you choose J-tilde to the left of that threshold, the policies are not stable, because essentially they do not intersect the 45-degree line; they have slope greater than one. The threshold is precisely the point where the linearization of the Bellman operator becomes parallel to the 45-degree line, that is, where it has slope equal to one. So this is how the region of stability is demarcated: a J-tilde to the right of this threshold point gives you a stable policy, and otherwise it gives an unstable policy. Of course, this is for a one-dimensional slice; in the multidimensional case it is more complicated than that, but the general idea still applies.

Now, this is for one-step lookahead. Multistep lookahead tends to push J-tilde into the stable region. Remember that multistep lookahead amounts to some value iterations followed by the Newton step. This figure gives an example with two-step lookahead. Which J-tilde are stable? The ones from which, after you perform one value iteration, you cross into the stable region for one-step lookahead. So the threshold moves, and the reach of stability is expanded by multistep lookahead. There are rigorous studies, mathematical analyses, that verify this, and it is also a known fact in MPC. So it makes sense to push J-tilde towards J_mu, which is stable, and that is what multistep lookahead does; it is also an idea that underlies the concept of rollout.

So now let me go into rollout and policy iteration, and how we visualize them. Rollout is simply a Newton step starting from J-tilde equal to the cost function of a policy. You have some policy mu, which we call the base policy. It has an evaluation, obtained by this intersection point; this is J_mu. Then rollout does a Newton step: it linearizes here, and this gives you the rollout policy. That is simply what it is, and it can be implemented in a number of ways, including by simulation. An interesting fact is that if you have a stable policy to begin with, then the rollout policy will also be stable; you can see this from the figure.

Now, policy iteration is simply repeated rollout: you start from a policy, do rollout on it, then do rollout on the rollout, and so on, and generate a sequence of policies. The visualization is the following. Let's say mu_k is the current policy, and this is its evaluation, the cost of mu_k. You do a Newton step: linearize the Bellman operator at J_{mu_k}, then obtain mu_{k+1}; this is the policy improvement step (policy evaluation, then policy improvement), and you obtain a new policy. The new policy corresponds to this blue line, and its cost is obtained by projecting this point of intersection onto the horizontal axis. So you start with mu_k, generate mu_{k+1}, then generate mu_{k+2}, and so on, and it converges very fast.
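As a small concrete instance of repeated rollout, here is exact policy iteration on an assumed two-state, two-control discounted problem (the same kind of toy data as in the earlier sketch, made up for illustration); it typically terminates in two or three iterations, reflecting the Newton-like convergence just described.

```python
import numpy as np

alpha = 0.9
P = {0: np.array([[0.8, 0.2], [0.3, 0.7]]),
     1: np.array([[0.5, 0.5], [0.9, 0.1]])}
g = {0: np.array([1.0, 2.0]), 1: np.array([3.0, 0.5])}

def policy_iteration(mu=(0, 0)):
    while True:
        # Policy evaluation: solve the linear system J = T_mu J exactly.
        P_mu = np.array([P[mu[i]][i] for i in range(2)])
        g_mu = np.array([g[mu[i]][i] for i in range(2)])
        J_mu = np.linalg.solve(np.eye(2) - alpha * P_mu, g_mu)
        # Policy improvement: rollout on the base policy mu (the Newton step).
        mu_new = tuple(min((0, 1), key=lambda u: g[u][i] + alpha * P[u][i] @ J_mu)
                       for i in range(2))
        if mu_new == mu:
            return mu, J_mu        # fixed point reached: mu is optimal
        mu = mu_new

print(policy_iteration())
```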
The pure form of policy iteration is exactly Newton's method, and it does not rely on differentiability at all. The connection with Newton's method is actually well known in control theory, starting with Kleinman's paper on linear quadratic problems; Kleinman was a student of Michael Athans at MIT, and his paper was followed up by a lot of other works, both for linear quadratic problems and also for more general forms of policy iteration. There are also corresponding results in Markov decision process theory; all of this is very well known in dynamic programming, as well as for linear quadratic problems and the like.

Okay, so now let's go into truncated rollout. Rollout is a single Newton step starting from the base policy. Truncated rollout starts from some J-tilde, does several value iterations using the mu operator, and then at that point takes the Newton step. So it is first value iterations with the mu operator, m of them (m equals 4 in this particular figure), starting with some J-tilde, and then applying the Newton step by doing this minimization here; that gives the cost of the truncated rollout policy. So truncated rollout also involves a Newton step.

Now, you can have a combination of ℓ-step lookahead minimization and m-step truncated rollout, and the total lookahead is ℓ plus m. An interesting fact is that it is the sum that matters, not so much the individual values of ℓ and m. Rollout with a policy is much cheaper than lookahead minimization, so truncated rollout can be viewed as an economical substitute for multistep lookahead, and this was the motivation of Tesauro when he introduced truncated rollout in the context of backgammon. People also use this rationale in MPC: use rollout with some stable base policy to enhance the starting point for the Newton step.

So now let me go into a problem familiar to people in control, the linear quadratic problem. In linear quadratic problems there are similar visualizations, but instead of the Bellman operator we use the Riccati equation operator. For this particular illustration I'm using a one-dimensional linear system, with an a coefficient and a b coefficient, a quadratic cost involving nonnegative q and r coefficients, and alpha equal to one, no discounting. It is well known that the optimal cost function is quadratic, with quadratic coefficient K*, and this K* solves the Riccati equation. Riccati equations are fixed-point equations; the algebraic Riccati equation is given by this expression here, involving a, b, r, and q. It is concave, and the solution of the Riccati equation is obtained by the intersection of the 45-degree line and the red graph here. Actually there are two solutions; however, there is a unique solution that is positive, and the other solution is not positive.
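For this scalar case, the Riccati operator that plays the role of the Bellman operator is the following concave, monotone function of K; it follows from the standard one-dimensional Riccati recursion for the system and cost just described:

```latex
\begin{aligned}
&x_{k+1} = a\,x_k + b\,u_k, \qquad \text{cost per stage } q\,x_k^2 + r\,u_k^2, \qquad J^*(x) = K^* x^2,\\[4pt]
&F(K) = \frac{a^2 r K}{r + b^2 K} + q, \qquad K^* = F(K^*), \qquad K^* \ge 0 .
\end{aligned}
```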
In fact, this is true for multidimensional problems as well. The normal case is when q and r are positive, and then K* is the unique positive solution. However, just to sensitize you to exceptions, here is an exceptional case where q is equal to zero. In that case you lose observability, but K* is actually equal to zero, because there is only a cost on the control, and you can set the control to zero and attain an optimal cost of zero. However, the Riccati equation has two nonnegative solutions, and there is another solution here which has an interesting interpretation. This is an example where you do not have uniqueness, and in fact if you try value iteration in this exceptional case, you will find that it converges to the non-optimal solution rather than to K*. This is just an example of exceptional behavior, and there are many of those in other contexts that are more complicated than linear quadratic.

Okay, so here is a common question. In the end I want to have a policy, and we say that this online play policy is good; why can I not learn this policy with some neural network and use that? In fact, there is a whole community in reinforcement learning that practices approximation in policy space, where we do not care about cost function approximations, multistep lookahead, and the like, but instead we parameterize the policy and optimize the parameters by using some kind of training, perhaps neural network training. There are methods called policy gradient methods, random search methods, and so on; there is a wide variety, it is a very broad field. My visualizations and my point of view suggest that this viewpoint is flawed, because it lacks the exact Newton step that does all the magic: it corrects all the errors of offline training, and superlinearly so. There is no such correction in an approach where we do not use a Newton step.

To illustrate this point I have an example here. It is a one-dimensional linear quadratic example, with a known and fixed model, and we consider a parameterization of the policy: we use linear feedback policies with coefficient L, and we vary L broadly. An L of around minus 0.4 is optimal, and what I have plotted here are the cost differences from the optimum. Without the Newton step, if you just apply the policy, we see that the performance degrades very quickly as L moves away from the optimum. On the other hand, if you introduce one-step lookahead, in other words if you introduce a Newton step, then the performance is almost indistinguishable from the optimal. So all of this is the loss that you incur for not having a Newton step to correct the errors of the suboptimal policy parameterization.
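The effect can be reproduced on any scalar linear quadratic example. The sketch below uses assumed parameters a, b, q, r (not the ones from the slide, which were not stated in full). It compares the cost coefficient of a linear policy u = Lx applied directly with the cost coefficient obtained when that policy's cost is used as the terminal approximation for one-step lookahead, i.e., after a Newton step, both measured against the optimal Riccati solution.

```python
import numpy as np

a, b, q, r = 1.2, 1.0, 1.0, 0.5      # assumed example parameters (undiscounted)

def riccati_fixed_point(K=1.0, iters=1000):
    for _ in range(iters):
        K = a * a * r * K / (r + b * b * K) + q
    return K

def policy_cost(L):
    """Quadratic cost coefficient of the linear policy u = L x (requires |a + bL| < 1)."""
    c = a + b * L
    return (q + r * L * L) / (1.0 - c * c)

def one_step_lookahead_cost(L):
    """Cost coefficient after a Newton step: one-step lookahead with terminal cost K_L x^2."""
    K_L = policy_cost(L)
    L_tilde = -a * b * K_L / (r + b * b * K_L)   # minimizer of q x^2 + r u^2 + K_L (a x + b u)^2
    return policy_cost(L_tilde)

K_star = riccati_fixed_point()
for L in (-0.5, -0.7, -0.9, -1.1):
    print(f"L={L:5.2f}  direct={policy_cost(L) - K_star:8.4f}  "
          f"with lookahead={one_step_lookahead_cost(L) - K_star:8.4f}")
```

With these assumed numbers, the direct cost grows noticeably as L moves away from its optimal value, while the lookahead column stays within a small fraction of the optimum, which is the qualitative pattern shown on the slide.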
Let me say a few things about model predictive control. Well, I'm not doing too badly with time, I'm glad about that, so I'm going to go a little slowly here.

Okay, model predictive control is a major methodology, perhaps the dominant methodology, in control research these days. It dates back close to 40 years and has an extensive literature; it has been applied to very challenging problems, and there is a lot of experience with it. The classical form of MPC applies to continuous state and control problems, deterministic problems with positive cost. So we have a nonlinear system and a cost function that is positive everywhere except at the origin, where the cost is zero; basically this encodes our desire to drive the system to the origin and keep it there.

Now, MPC is culturally different from the reinforcement learning and artificial intelligence practice of sequential decision making. The architecture of MPC, as I mentioned earlier, is very similar to AlphaZero. It includes lookahead minimization, which is often called the control interval; in some versions of MPC there is a so-called prediction interval to improve the terminal cost approximation, which amounts to rollout with some stable policy; and there is also a terminal cost function that is being used. So all the elements are there. On the other hand, MPC focuses on continuous-space applications and also on stability issues, and it relies primarily on online play. There is some offline training in MPC, for example computing terminal cost approximations offline, computing some stable policies to do rollout with, and also computing target tubes, safe regions, in order to deal with state constraints.

One of the central motivations of MPC is to be able to deal with state constraints, critical state constraints, which the unconstrained linear quadratic framework does not deal with very well. To do that, there are some interesting issues of how to select controls that keep you within the safe region, and there is a theory of target tube reachability, which is actually very dear to my heart, because my PhD thesis, 50 years ago, dealt precisely with this topic, introducing the notion of target tubes and also how you construct them. Nobody read my thesis for about 15 years, but then all of a sudden, through MPC, it became an interesting topic.

Okay, so this is the classical form of MPC, but there are a lot of extensions. You can look, for example, at the recent books by Rawlings, Mayne, and Diehl, and by Borrelli, Bemporad, and Morari, which describe a lot of extensions that deal with stochastic uncertainties, hybrid systems involving both continuous and discrete control and state components, minimax versions, target tubes, and so on. So there is a very broad literature dealing with a broad variety of problems.
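In the notation used earlier in the talk, the classical MPC scheme described above computes, at each encountered state x_k, the control by an ℓ-step lookahead of the familiar form (deterministic system, state and control constraints, terminal cost approximation J-tilde):

```latex
\min_{u_k,\ldots,u_{k+\ell-1}} \;\; \sum_{i=k}^{k+\ell-1} g(x_i,u_i) + \tilde J(x_{k+\ell})
\qquad\text{subject to } x_{i+1}=f(x_i,u_i),\;\; x_i\in X,\;\; u_i\in U(x_i),
```

after which the first control u_k is applied and the minimization is repeated at the next state; the prediction interval variant replaces part of this horizon with simulation of a stable base policy.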
Moreover, more recently there has been some emphasis on using data, simulation-obtained data, and rollout with some offline-trained policy, and this comes under the umbrella name of learning MPC. This term was introduced in the paper by Rosolia and Borrelli, who have done interesting work in this area. There has also been a recent paper by Yuchao Li, together with Johansson and Mårtensson, professors at KTH, and I'm also a co-author of that paper.

So now let me go into adaptive control with model estimation, what is called indirect adaptive control, which combines model identification and adaptation of the control in response. What we have here is a system that is changing over time; its model is changing over time. Data is collected from the system, and we run a model estimation, system identification algorithm, the results of which are passed to an adaptive controller, which makes the appropriate adaptations. The indirect adaptive control viewpoint is to re-optimize the controller when the estimated model changes. On the other hand, there is a problem with that: the reoptimization may be difficult or very time-consuming. So there is room for simpler alternatives, and rollout, combined with multistep lookahead, provides such an alternative. You use multistep lookahead that takes into account the current model, as estimated by the system identification process, but the rollout itself uses some robust policy. All of this involves a Newton step, so we use a Newton step instead of reoptimization. If your changes from the nominal model are not very serious, then the Newton step, I argue, is just as good as reoptimization, and computationally simpler. So we capitalize on the fast convergence of the Newton step to avoid expensive reoptimizations. That is the idea behind the use of rollout in adaptive control, as a substitute for re-optimization.

Let me give you an example. This is a one-dimensional linear system, and it involves an unknown parameter, the parameter b; the r coefficient in the cost function may also be variable here. Now, as b changes, the optimal cost, as measured by the quadratic cost coefficient, changes. If you use a fixed policy, in particular the policy that is optimal for the nominal values b equal to 2 and r equal to 0.5, then the performance degrades. So robustness does not really fix the problem here: if you have large deviations from the nominal parameters, you can get a lot of performance degradation. On the other hand, if you use rollout, including the Newton step, there is hardly any change from the optimal; it is just as effective as reoptimization. This is true when we change b but keep r fixed, and, in the right figure, when we change the cost parameter r and keep b fixed. So what I want to say is that using a robust controller as a base policy without the Newton step is often flawed, and we correct the flaw by introducing the Newton step.
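A bare-bones sketch of this idea, under my own illustrative assumptions (a scalar linear system, a crude least-squares estimate of b from observed transitions, a fixed nominal linear base policy, and one-step lookahead with the current estimated model), might look as follows. It is meant only to show where the changing model enters the Newton step, not as a recommended estimator or controller, and it is not the example from the slide.

```python
import numpy as np

a, q, r = 1.0, 1.0, 0.5          # assumed known dynamics/cost parameters
b_true, b_nom = 1.5, 2.0         # true (unknown) and nominal control gains
L_base = -0.4                    # fixed base policy u = L_base * x (assumed stabilizing)

def base_policy_cost(b_hat):
    """Quadratic cost coefficient of the base policy under the currently estimated model."""
    c = a + b_hat * L_base
    return (q + r * L_base**2) / (1.0 - c * c)     # requires |a + b_hat*L_base| < 1

def lookahead_control(x, b_hat):
    """One-step lookahead (the Newton step) with the estimated model and K_base x^2 as terminal cost."""
    K = base_policy_cost(b_hat)
    return -a * b_hat * K / (r + b_hat**2 * K) * x

rng = np.random.default_rng(0)
x, b_hat, num, den = 5.0, b_nom, 0.0, 1e-6
for k in range(30):
    u = lookahead_control(x, b_hat)
    x_next = a * x + b_true * u + 0.01 * rng.standard_normal()   # the real, changing system
    # Crude least-squares update of b from the observed transition (illustration only).
    num += (x_next - a * x) * u
    den += u * u
    b_hat = num / den
    x = x_next
print(f"final state {x:.4f}, estimated b {b_hat:.3f}")
```

The point of the sketch is that no reoptimization is ever performed; only the lookahead minimization is redone with the latest model estimate, which is exactly the Newton step applied to the changing Bellman equation.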
Okay, let me conclude. I want to make a few points. The first one is that there is much to be gained by using online play on top of offline training; there is a potential for very large improvements. If you use just offline training, without online play, it may not work very well, because you lack the Newton step that corrects for training errors and can deal with changing system parameters. On the other hand, if you use just online play without offline training, you may miss out on performance, because you will not have good starting points for the Newton step. That is the first point I want to make.

The second point is that Newton's method is central in all of this. This is a new insight, and it can guide analysis, explain behavior, and provide motivation for new algorithmic designs. Again, the Newton step is exact; all the approximations, neural networks and the like, go into the starting point for the Newton step, which washes out training method differences and errors.

Another point concerns the cultural divide between reinforcement learning, as practiced by the AI community, and the control community. We are dealing with very similar problems using similar methods, and the cultural differences can be bridged by combining offline training and online play in the mold of AlphaZero. MPC uses a very similar architecture to AlphaZero, but it can benefit from reinforcement learning and artificial intelligence ideas, and we can also approach indirect adaptive control through rollout, using a Newton step in place of re-optimization.

The last point I want to make is that all of this is very general, even though I focused on just a few problem types. The generality comes from having arbitrary state and control spaces, and from the generality of abstract dynamic programming. In particular, there are a lot of interesting applications in discrete optimization, integer programming, and multi-agent versions, which involve exploitation of the special structure; a lot of this is in my 2020 book, the details of that. Finally, there are exceptional behaviors all over the place once you get away from the benign discounted case, and there is a lot of room for clarification by mathematical analysis and computational experimentation.

So let me close with some words of optimism. We have had tremendous successes in both artificial intelligence and reinforcement learning, and in MPC, and this provides hope for the future. More success can be expected by bridging the gap between these two communities. Lookahead minimization and rollout can be a computational bottleneck, and people in both artificial intelligence and MPC have been struggling with that. On the other hand, we have massive computational power; through distributed computation a lot of these computations can be parallelized, and all of this power can mitigate the computational bottleneck, so we can obtain online strategies that are more sophisticated as a result. So, all in all, the future looks very bright, and I thank you very much for your attendance.
sophisticated as a result. 72:37 So all in all, the future looks very 72:40 bright, 72:42 and I thank you very much for your 72:43 attendance. 72:48 All right, 72:49 thank you very much. So please, 72:53 we have time for questions. It was a very 72:55 exciting talk; I think I'd like to leave 72:57 the floor open for questions 72:59 before asking my own questions if 73:01 there's time in the end. So 73:04 we can try: if you raise your hand, 73:08 then I can pass the word. So 73:11 you, Kim, can unmute maybe, if 73:14 you want to ask a question. Yes, so 73:16 let me join Mikael first by saying: a 73:18 fantastic talk, it really opens up the 73:20 thought. 73:21 And perhaps my question is because I 73:23 haven't fully digested the concept yet, 73:25 but when I think about this 73:27 exploration, the alpha-beta pruning 73:30 in these types of games, and then with a proxy 73:31 of a cost of future 73:33 possible steps, 73:35 I tend to view it more as an exploration 73:38 of state space. So when you take the jump 73:40 to 73:41 dynamical systems, is the state in the 73:43 dynamical system the same state that you 73:45 would apply, say, to a chessboard, or is it 73:47 more of an abstract concept 73:51 that is used in the analysis, 73:53 or is there a one-to-one correspondence 73:55 there? The state space is arbitrary here; 73:59 the visualizations are valid for an 74:01 arbitrary state space. 74:03 Now, in the case of 74:05 chess and games, the state space is 74:07 discrete; the same is true for Markovian 74:09 decision problems, the type of 74:11 interest in operations research and other 74:12 places. In 74:16 the control applications it's mostly a 74:17 continuous state space. 74:19 Now of course the minimization, 74:22 the lookahead minimization, depends very 74:24 much on the nature of the state space. 74:27 Indeed, in chess you have forward search 74:30 through a tree, 74:33 with alpha-beta pruning and all this 74:35 other stuff, Monte Carlo tree search, 74:37 which facilitate and expedite this 74:40 search; 74:41 on the other hand, it's a discrete 74:43 process, search through a tree. 74:45 In MPC it's a different sort of thing; 74:48 people in MPC use 74:50 continuous-type methods or 74:52 combinations of continuous and discrete 74:54 methods that are closer to integer 74:58 programming, 75:00 nonlinear programming, or sequential 75:02 quadratic programming. 75:05 So depending on the character of the 75:07 state space you would use different 75:09 methods for the lookahead minimization; 75:13 the concept, however, is the same, 75:16 it's just that 75:19 the implementation depends on the 75:21 problem at hand. 75:24 I guess my question was: 75:27 so when you interpret it in terms of a 75:29 control problem, does the state 75:32 describe the whole lookahead process, or 75:34 is it... 75:37 maybe I just didn't fully 75:40 understand it. 75:41 So when you take the control 75:44 theory view of it, 75:46 is the lookahead 75:51 algorithm something different from the 75:53 state, or is that actually the state in 75:55 that interpretation? 75:57 So, 75:59 the state evolution is supposed to be 76:01 known here; 76:03 the system equation is supposed to 76:06 be known. It could be stochastic; 76:08 however, we can predict the future fully: 76:11 for a given policy we can predict the 76:13 future fully, because we have a model for it. 76:16 Right? I don't know if I've answered your 76:17 question.
76:19 I think... yeah, I'll probably have to think 76:20 about it more, but it's definitely input 76:23 to that. 76:23 Thanks. Okay, 76:26 so Mario, maybe you can unmute and 76:30 ask a question. 76:32 Okay, 76:33 thank you very much. 76:37 This talk was extremely inspiring for 76:40 me. 76:41 There's one point which maybe I missed: 76:44 it's very interesting to see how 76:46 these methods are essentially 76:47 Newton steps, but 76:49 I missed a bit the connection with 76:51 policy gradient methods, because policy 76:53 gradient methods essentially also do 76:55 something along these lines: they take 76:57 some sort of a Newton step, 76:58 approximately, depending on which Hessian 77:00 approximation one uses. So I was 77:03 wondering if you could comment a bit 77:04 more on this aspect. 77:09 Sure. Newton's method 77:11 can be used in several different 77:14 optimization contexts. 77:17 What I was talking about here is 77:18 Newton's method for solving fixed point 77:20 problems, the Bellman equation in 77:22 particular. 77:24 Policy gradient methods are 77:26 conceptually quite different from 77:29 the approximation in value space that 77:31 I've been using here. 77:34 In policy gradient methods there is an 77:37 optimization problem to be solved, and 77:39 the optimization 77:42 involves a parameter as a 77:44 characterization of the policy. 77:46 So 77:48 for this you can use any optimization 77:50 method that you would like: you can use 77:52 Newton's method, or you can use a 77:55 gradient method, you can do random search, 77:57 anything that you would like. 77:59 But the context is different; the cost 78:01 function is different. 78:05 In the policy gradient context you 78:08 optimize the performance with respect to 78:10 the controller parameters, the 78:12 performance of the system. 78:14 Here, in my talk, 78:16 Newton's method is used 78:19 not for an optimization problem; it's 78:21 used to solve a fixed point problem, 78:23 which is not the same. 78:26 Right, yeah, maybe what confused me is 78:28 that in 78:29 policy gradient you take the expectation 78:32 over a distribution of initial states; 78:38 so that's what makes it similar. But yeah, 78:40 indeed, it's a very different kind of 78:43 optimization problem. Thanks a lot. Okay.
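[Editor's note: a compact way to state the distinction just discussed, in the notation of the speaker's books (this is my restatement, not a formula shown in the talk): T and T_mu denote the Bellman operators, J-tilde the offline-trained cost approximation, and mu_theta a parametrized policy. The "unique fixed point" claim holds under the usual contraction assumptions, e.g., the discounted case.]

```latex
% Approximation in value space: the one-step lookahead policy \tilde\mu and its
% cost J_{\tilde\mu} amount to a Newton step for the fixed point problem J = TJ,
% starting from \tilde J:
\[
T_{\tilde\mu}\tilde J = T\tilde J,
\qquad
J_{\tilde\mu} = \text{the fixed point of } T_{\tilde\mu},
\]
% i.e., J_{\tilde\mu} solves the linearized equation J = T_{\tilde\mu} J.
% Policy gradient, by contrast, iterates on the policy parameter \theta to
% minimize the expected cost from an initial-state distribution:
\[
\theta_{k+1} = \theta_k - \gamma_k \,\nabla_\theta\, \mathbb{E}_{x_0}\!\left[\,J_{\mu_\theta}(x_0)\,\right].
\]
```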
78:55 I see some hands, but I don't hear 78:57 anybody. 79:00 Yeah, I was waiting. Mikael, can I go on? 79:03 You're muted now, Mikael, 79:04 that's why we didn't hear you. 79:07 Okay, so, 79:09 right, so yeah, thanks also, a very 79:12 nice talk, and it's good to see such a 79:14 broad audience from all over the world 79:16 in this talk. I have a control 79:20 question, 79:22 Dimitri. So, I mean, you commented on 79:25 indirect adaptive control and the 79:27 interpretation here, how one could 79:30 think about this in this framework. 79:32 So, 79:34 you know, at the time there was this 79:36 discussion about different architectures 79:37 for adaptive control, indirect adaptive 79:39 control and direct adaptive control and 79:41 so on. 79:42 So could you comment a little bit on the 79:44 role of models here? Because in indirect 79:48 adaptive control you explicitly have the 79:50 plant model: you identify the plant 79:52 model and you use that for the 79:54 design. 79:55 So with your framework here, is it so 79:58 that, so to say, one can argue that there 80:01 should be such a model, 80:03 explicit, 80:04 in the architecture for the control? 80:08 So is this, so to say, an argument 80:11 that would lead to this 80:13 particular structure that you present 80:16 here, with the online play and offline 80:18 training? 80:21 Yes. 80:22 Speaking of adaptive control, 80:25 I'd like to take the opportunity to pay 80:27 some tribute to the outstanding 80:29 contributions of the Swedish 80:31 mathematicians and control theorists in 80:33 this area, particularly Karl Åström, who 80:36 has been a friend and a source of 80:39 inspiration for me over many years. 80:43 About the difference between 80:46 indirect and direct adaptive control: 80:51 I don't think, from what I can tell, that 80:55 direct adaptive control 80:57 can fall into the framework that I have 81:00 discussed, 81:02 because it involves 81:04 primarily stability types of analysis and 81:07 also optimization types of analysis that 81:10 do not seem to fit the mold of dynamic 81:13 programming very well. 81:14 It's like in policy gradient methods: 81:17 you do an optimization, but it does not 81:19 involve so much dynamic programming; you 81:22 just optimize. 81:24 In the direct adaptive control 81:26 case you stabilize the system, you try to 81:28 find parameters that stabilize the 81:30 system, 81:31 and cost is kind of secondary; but 81:33 sometimes they combine the two. So I'm 81:36 also not an expert in 81:39 direct adaptive control, 81:41 and perhaps someone will be able to find 81:43 some connections that I don't see right 81:45 now. 81:47 Thank you, thank you very much. 81:50 Great, thanks. 81:51 Raj? 81:53 Yes, I 81:54 have two questions. The first question 81:57 is about the convergence rate of 82:00 value iteration and policy iteration 82:02 algorithms; could you please elaborate a 82:06 bit on 82:08 this point, and in particular how 82:12 we can make them faster. And the 82:14 second question is about the optimality 82:17 gap of 82:18 rollout algorithms; by 82:23 optimality gap I mean the difference 82:25 between the cost of 82:27 the derived policy and J*. 82:30 Do you think we can 82:33 find any bound on the 82:35 optimality gap of these algorithms? 82:41 Yeah, the issue of convergence rate, at 82:43 least for the 82:45 simpler cases, 82:46 is fairly 82:48 clear. 82:50 Value iteration has a linear convergence 82:52 rate 82:54 and is much slower than policy iteration; 82:57 policy iteration has a superlinear 82:58 convergence rate, 83:00 and converges in a finite number of 83:02 iterations if the 83:03 number of policies is finite. 83:05 So 83:06 policy iteration is much faster, 83:09 and rollout takes advantage of that 83:15 fast convergence. 83:18 Computational studies show 83:21 very remarkably how effective rollout is, 83:23 even though it's a single iteration, 83:26 particularly if you have a good starting 83:27 point that's obtained by some kind of 83:30 offline training. It just works 83:33 wonders; 83:34 it's very reliable also. 83:37 So, 83:38 while there are error bounds, if you look 83:41 at my books for example there are error 83:43 bounds associated with rollout, policy 83:45 iteration, value iteration, and so on, with 83:48 J-tilde 83:50 approximations, 83:53 these error bounds 83:55 are interesting mathematically, 83:57 but they are very conservative and they 83:59 don't really tell the whole story; they 84:01 sort of point to the direction toward which 84:03 performance tends to go. 84:06 But 84:09 I would not rely on these bounds very 84:12 much to draw quantitative conclusions or 84:14 to base design decisions on those.
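[Editor's note: to illustrate the convergence-rate contrast just mentioned, here is a small self-contained toy comparison. It is my own example, not from the talk; the 3-state, 2-action MDP, its transition probabilities, stage costs, and the discount factor 0.95 are all arbitrary assumptions.]

```python
# Toy comparison (hypothetical discounted MDP) of value iteration vs policy iteration.
import numpy as np

gamma = 0.95
# P[a, i, :] is the transition distribution from state i under action a;
# g[a, i] is the corresponding stage cost.
P = np.array([[[0.7, 0.2, 0.1], [0.1, 0.8, 0.1], [0.2, 0.2, 0.6]],
              [[0.3, 0.3, 0.4], [0.5, 0.4, 0.1], [0.1, 0.1, 0.8]]])
g = np.array([[2.0, 1.0, 3.0],
              [0.5, 2.5, 1.5]])

def bellman(J):
    return np.min(g + gamma * P @ J, axis=0)   # (T J)(i) for every state i

# Reference solution via many value iterations (error shrinks like gamma^k).
J_star = np.zeros(3)
for _ in range(2000):
    J_star = bellman(J_star)

# Value iteration: linear convergence, roughly a factor gamma per iteration.
J = np.zeros(3)
for k in range(1, 101):
    J = bellman(J)
    if k % 25 == 0:
        print(f"VI iter {k:3d}: error {np.max(np.abs(J - J_star)):.2e}")

# Policy iteration: each iteration is a Newton step for J = T J, and the
# method stops after finitely many iterations (finitely many policies).
mu = np.zeros(3, dtype=int)
for k in range(1, 20):
    P_mu, g_mu = P[mu, np.arange(3)], g[mu, np.arange(3)]
    J_mu = np.linalg.solve(np.eye(3) - gamma * P_mu, g_mu)   # policy evaluation
    mu_new = np.argmin(g + gamma * P @ J_mu, axis=0)         # policy improvement
    if np.array_equal(mu_new, mu):
        print(f"PI converged at iter {k}: error {np.max(np.abs(J_mu - J_star)):.2e}")
        break
    mu = mu_new
```

On a problem this small, policy iteration should terminate after only a handful of iterations, while value iteration keeps shrinking its error by roughly the discount factor per pass.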
84:18 So all in all, 84:20 policy iteration is very effective, but I 84:22 cannot fully justify 84:26 my argument; on the other hand, there's 84:28 like a 84:29 50-year experience with this 84:32 methodology, and 84:35 what I'm saying is pretty reliable based 84:38 on just the experimental evidence. 84:41 Thank you. 84:46 Excellent. 84:48 Any other questions? 84:53 Maybe I can take the opportunity to ask 84:54 one. So if I understand correctly, 84:57 I mean, a lot of things here are 84:58 very abstract, 85:01 but it seems like you're using 85:03 full state 85:05 feedback, let's say full access to the 85:07 state of the system. And what would 85:09 happen, computationally, 85:11 if you don't have that? Like, if I want to make 85:12 a machine for playing a Texas Hold'em 85:15 card game, where I can only have partial 85:17 observations, let's say, of the state? 85:22 So does your question have to do with 85:24 imperfect state observation? Exactly, yes, 85:27 partial or imperfect. 85:29 Right, yeah. 85:31 Okay. 85:32 So once you pass into imperfect state 85:34 observation you're getting a more 85:35 complicated problem; on the other hand, 85:38 it can be transformed 85:42 into a perfect state information problem 85:44 where the state is something much more 85:46 complicated: it is a belief state, 85:49 a probability distribution of where the 85:51 state is. 85:53 Now, 85:54 that's where approximations are called 85:57 for, perhaps neural network 85:58 approximations, 86:00 and 86:00 there is not much experience in this 86:02 area using 86:04 this type of 86:06 value space method, including rollout. 86:10 However, I have two papers with some 86:12 collaborators from ASU, Stephanie Gil 86:16 and also 86:18 some of her students, and these papers 86:20 have been generated over the 86:21 last couple of years with very 86:23 challenging, very high dimensional 86:26 POMDP 86:28 problems, 86:29 and everything works exactly the same as 86:32 for 86:32 lower dimensional problems. Of course the 86:35 implementation 86:36 is more complicated; the offline 86:39 training takes days to run, okay, 86:44 but still, when it comes to online play 86:47 it's possible to do it, and the results 86:50 seem to be quite encouraging. 86:52 So that's all I can say. It's a very 86:54 interesting area for many people to work 86:57 on; there are efforts in 86:59 various communities, 87:03 but 87:04 my work, the two papers I mentioned, are 87:06 the ones that most closely relate 87:08 to the viewpoints I've expressed today. 87:13 Could I also ask if you could share some 87:15 other insight? I mean, your framework, I 87:17 think it's very beautiful, it 87:18 conceptually explains a lot of things 87:20 that we can see, 87:22 but algorithmically, I mean, you mentioned 87:23 a little bit that the rollout 87:26 optimization could be challenging, but do 87:28 you also have some, 87:29 say, algorithmic insight that you can 87:32 share of how this could transform the 87:34 way we do computations, not just 87:36 analysis, say, and conceptually structure 87:39 algorithms? 87:41 I think the computation is feasible. 87:44 Yeah, I mean, 87:46 based on my experience with the POMDP 87:49 problems, which were horrendously 87:50 complicated, I can tell you that it's 87:53 feasible. You have to be clever about it, 87:56 you have to be lucky about it, 87:59 but 88:00 we have used a lot of distributed 88:02 computation to do all this, and like I 88:05 said, training for days. 88:07 So as the availability of parallel 88:09 computation increases, 88:12 it's going to have a big effect, 88:14 because a lot of these 88:18 database 88:20 computations 88:21 can be parallelized; for example, Monte 88:23 Carlo simulation is eminently 88:25 parallelizable. 88:27 So it's not as hard or prohibitive as it 88:29 used to be, 88:30 and in the future it's going to be more 88:32 accessible. 88:33 So a lot of the things that in the past 88:36 had seemed to be 88:38 out of the realm of feasibility may 88:39 become feasible, or are feasible right 88:42 now. 88:43 So I'm optimistic in that regard; the 88:46 computational 88:47 resources are just so plentiful, and 88:49 that gives me a lot of 88:52 hope.
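[Editor's note: as an illustration of the kind of parallelization just described, here is a minimal sketch of truncated rollout with the Monte Carlo simulations distributed across worker processes. It is my own construction, not code from the talk or the papers mentioned; the scalar system, base policy, rollout horizon, and terminal cost coefficient are all assumptions.]

```python
# Minimal sketch: truncated rollout with a base policy, with the Monte Carlo
# simulations for the different candidate controls run in parallel processes.
import random
from concurrent.futures import ProcessPoolExecutor

# Hypothetical scalar stochastic system x_{k+1} = x_k + u_k + w_k,
# stage cost x^2 + u^2; the base policy is a crude proportional rule.
def step(x, u, rng):
    return x + u + rng.gauss(0.0, 0.5)

def base_policy(x):
    return -0.5 * x

def rollout_cost(args):
    x0, u0, horizon, seed = args
    rng = random.Random(seed)
    cost, x = x0**2 + u0**2, step(x0, u0, rng)
    for _ in range(horizon - 1):             # truncated rollout with base policy
        u = base_policy(x)
        cost += x**2 + u**2
        x = step(x, u, rng)
    return cost + 1.2 * x**2                 # terminal cost approximation (assumed)

def rollout_control(x, candidates, horizon=20, n_sim=200):
    # Q-factor of each candidate first control, averaged over parallel simulations.
    with ProcessPoolExecutor() as pool:
        q = []
        for u in candidates:
            jobs = [(x, u, horizon, s) for s in range(n_sim)]
            q.append(sum(pool.map(rollout_cost, jobs)) / n_sim)
    return candidates[q.index(min(q))]

if __name__ == "__main__":
    print(rollout_control(x=3.0, candidates=[-2.0, -1.0, 0.0, 1.0]))
```

The point of the sketch is only the structure: the per-candidate simulations are independent, so they map directly onto parallel workers, which is what makes long lookahead and rollout budgets increasingly affordable.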
88:54 Sounds excellent. I think we have one 88:56 more question before concluding: Jin 88:59 Zhao. 89:02 Yes, thanks, professor, for 89:05 your talk; it's really interesting. 89:07 Actually I have two questions, but in 89:10 terms of the time you may just answer 89:12 one of them. I can stay on if anyone 89:15 wants to stay 89:17 beyond; I can stay on, no 89:19 problem. 89:21 Yeah, go ahead please. Thank you. 89:25 First of all, 89:27 you mentioned that it's a 89:30 linearization of 89:32 some 89:36 operator; the Newton step linearizes 89:38 the function or operator. And then 89:41 when you 89:43 talk about this subgradient, how do 89:46 you define the proximity or the 89:49 metric in the function space? Because in 89:51 terms of the Fréchet derivative of 89:53 this operator, I am curious 89:56 about 89:58 the metric in this space. And secondly, 90:00 it's a monotone concave function, and 90:04 I'm curious about what the dual of this 90:07 function looks like, because 90:09 there might be another physical 90:11 meaning of the dual of the function. But 90:13 that's all, my question. 90:17 Yes. 90:18 Even though the problem 90:20 may be discrete, 90:22 the state space may be discrete, like a 90:24 finite number of states, five states, 10 90:26 states, 100 states, 90:28 the Bellman equation is defined over 90:31 real-valued functions, right? 90:34 And that's why you can talk about 90:35 convexity, 90:37 and the metric there, okay, you're 90:40 dealing with function spaces, Euclidean 90:42 space perhaps, so that's what 90:45 you use. 90:46 You could consider 90:48 the possibility of dual functions, I 90:50 guess what you would call 90:53 convex conjugate functions, but I haven't 90:55 found any use for that so far. But it's 90:58 a possibility. 91:00 I don't know if I've answered your 91:03 question. 91:03 Oh, thank you, thank you, yeah. 91:06 Okay. 91:08 Okay, excellent, I think we 91:10 can 91:12 conclude here, and if there are people 91:14 that still have questions for Professor 91:16 Bertsekas, 91:17 maybe you can just stay on, 91:19 and those of us that want to 91:21 continue discussing a little bit more 91:23 can do so. So 91:24 with that I think we conclude, and 91:26 thank you so much. I wish we could all 91:29 applaud also 91:32 with the sound, but we do it with our 91:33 little reactions. And thank you very much 91:35 for everything here. 91:38 Thank you for the opportunity, 91:41 and it was a pleasure to give this talk. 91:48 I'd like to say hello to a few friends 91:50 whom I see in the
audience 91:52 here: 91:53 David and Ben 91:54 and Mark. 92:02 Excellent. I think now there are 92:05 so few people left that those of you 92:07 who want to speak can just unmute 92:11 and we will sort it out. 92:17 I 92:19 think my question 92:21 was similar to the one of Mikael: 92:24 that 92:26 one of the reasons why you would use 92:28 policy iteration, for example, is that you 92:31 do not have a 92:36 model of the system, or the rules of the 92:39 system, that would allow you to do 92:41 lookahead; 92:43 but I guess 92:45 most of the methods that you have talked 92:47 about in this presentation do assume 92:49 some 92:51 model of the system. 92:55 Right, you need to have a model; like in 92:57 MPC, 92:59 you need to have a model. 93:00 On the other hand, you may be estimating 93:02 the model online, 93:07 and that's a limitation. 93:13 If you look at the problems where you 93:14 don't have a model, 93:16 then you are in the area of dual control, 93:19 what's called dual control in 93:21 traditional stochastic control theory. 93:24 It's a concept that goes back to 93:26 the 60s, and people have been trying 93:29 to deal with that, also in conjunction 93:31 with adaptive control, for decades, and 93:33 it's one of the great challenges 93:38 of control theory, and I don't think that 93:41 you're going to 93:42 solve this problem with 93:44 reinforcement learning methods; the 93:45 problems are very fundamental. 93:47 So basically 93:51 I think that 93:55 it's not very clear how you deal 93:57 with these unknown or changing models. 94:02 Certainly in my talk today I have 94:05 assumed a known model, 94:08 and 94:10 I have hand-waved when it came 94:13 to estimating a model and using the 94:15 estimate; in practice you do that, but 94:18 whether this is sound or not is not 94:20 very clear to me. 94:25 All right, yeah, thank you. 94:30 So, 94:32 do you think we could get a handle on 94:34 that problem of how to select the rollout 94:36 horizon? I mean, do you have any more 94:39 insight on that? 94:40 For sure, I mean, 94:42 you could say, okay, the longer the better, 94:44 but 94:45 is that always true, the longer 94:47 the better? 94:49 Actually not, 94:53 and one can construct interesting examples 94:58 where you increase the length of 95:02 the rollout horizon and the performance 95:04 becomes worse. 95:06 Basically, 95:13 with rollout with a policy you push 95:16 the starting point of the Newton step 95:18 close to the cost of that policy; 95:21 on the other hand, you don't really want 95:22 to go 95:23 to the cost of that policy, you want to 95:25 go close to the optimal cost and apply 95:28 the Newton step there. 95:30 Now, how do you choose 95:32 the rollout horizon to do that? It's not 95:35 very clear. 95:36 On the other hand, in practice this can 95:39 be resolved by some kind of 95:41 experimentation. Basically, 95:45 there's no theory that can guide you 95:46 about the length of the horizon, because 95:48 what's a good horizon also depends on J-tilde, 95:51 the starting point, 95:54 and that cannot be predicted very 95:56 easily. 95:59 So it's an interesting question to deal 96:01 with both practically and perhaps 96:03 also theoretically.
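[Editor's note: a compact way to write the trade-off just described, again in the notation of the speaker's books and as my own restatement, under the usual contraction assumptions: mu is the base policy, T_mu its Bellman operator, J-tilde the terminal cost approximation, and m the truncated rollout horizon.]

```latex
% m steps of truncated rollout shift the starting point of the Newton step
% from \tilde J to T_\mu^m \tilde J, and
\[
T_\mu^m \tilde J \;\to\; J_\mu \quad \text{as } m \to \infty,
\]
% so a very long rollout pulls the starting point toward J_\mu, the cost of the
% base policy, which need not be any closer to the optimal J^* than a
% moderate-m point; hence "longer is always better" can fail.
```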
96:21 I had another question: 96:23 you talked about AlphaZero now, but there's 96:26 also... 96:27 are you familiar with the method MuZero, 96:29 that's also from DeepMind? 96:33 Yeah, do you have any comment about that? 96:36 Yes, I do. 96:37 I've actually looked at 96:40 this MuZero paper, not very 96:42 carefully, but one statement struck me: 96:45 that Monte Carlo tree search is not 96:48 very good. 96:49 I mean, we've 96:53 been hearing from DeepMind about how 96:55 good Monte Carlo tree search is, and people 96:58 have been trying to use Monte Carlo 97:00 tree search in all kinds of contexts, 97:04 even contexts where it was implausible 97:06 that it would be well suited. 97:09 Now the same AlphaZero DeepMind people 97:11 are telling us 97:14 that Monte Carlo tree search is not good, 97:19 and so I don't know what to believe. 97:22 But I have done my own 97:23 experimentation with Monte Carlo 97:25 tree search, 97:28 and I have found that it's a hit or miss 97:30 proposition: 97:31 sometimes it doesn't work at all, 97:33 no matter how hard you try; 97:35 other times it may work, 97:37 and if you look at how the method works 97:39 you can see that that's probably the 97:41 case. 97:42 Now, depending on the application, it may work 97:44 very well, and I think 97:46 that in AlphaZero, at least 97:48 I've been told, it has been a major 97:52 contributor to success, because it prunes 97:55 the lookahead tree intelligently and 97:57 can save some computation time. 98:00 On the other hand, I don't know what to 98:02 make of this MuZero: 98:04 they say that Monte Carlo tree search does 98:06 not work very well and they have 98:09 rejected it, they have taken it out of the 98:10 design, but I don't understand the 98:13 reasons. 98:14 Just as I did not understand why the 98:16 original Monte Carlo tree search worked, I 98:18 don't understand now why it does not 98:20 work. 98:23 So I have more to learn about 98:25 MuZero; perhaps you have some insight 98:27 that I do not, because my reading of the 98:29 paper was so superficial. I just know 98:32 that, you know, it is 98:34 a more 98:35 modern deep learning approach where 98:37 you map everything to a sort of latent 98:40 vector space where you assume that you 98:42 can 98:42 kind of do everything; 98:46 so it is, but it is also less 98:48 interpretable because of it. 98:53 Yeah. 98:54 Well, 98:56 one of the more controversial viewpoints 98:58 of my talk today 99:01 is that 99:02 offline training 99:04 does not matter all that much. 99:06 Refining 99:10 very closely this offline training 99:12 process, 99:15 it doesn't make much 99:17 difference, because the Newton step sort 99:19 of 99:20 washes these differences out. 99:22 So, 99:24 is it the deep neural networks that make 99:26 a difference? 99:27 I mean, Tesauro's software does not use a 99:29 deep neural network; it uses a 99:31 two-layer network, 99:35 and deep is not necessarily 99:37 better. 99:38 Maybe it's better for some applications, 99:40 image processing applications, but in 99:42 dynamic programming, 99:44 well, 99:45 AlphaZero has proved it, MuZero perhaps has; 99:48 there's some evidence that it has worked, 99:55 but whether it works in general or not, 99:58 I'm not so sure. 100:02 I like deep neural networks and the 100:04 ideas behind them, 100:09 but 100:11 I do not think that it is critical 100:13 that you use deep neural networks in 100:15 offline
training. 100:22 All right, thank you. 100:26 Professor, I 100:28 would like to 100:30 ask: 100:31 you mentioned having this model 100:34 and knowing the model, and 100:38 you refer to the case that the model 100:40 could be a simulator as well, right? 100:42 Because 100:45 for policy gradient and those types of 100:47 methods they still need to have a 100:49 simulator, a model or whatever, to collect 100:52 data in the first place. 100:54 Yeah, yeah, that's a good point. 100:58 When I say model I don't 100:59 necessarily mean equations and 101:01 probability distributions and the like; 101:02 it could be a computer model. 101:04 I mean, a computer model implements a 101:06 mathematical model, 101:08 so what I mean by 101:11 needing to have a model, it could be a 101:13 computer model, 101:15 as opposed to the case where 101:19 you don't have such a model and 101:21 you need to do system 101:23 identification. 101:27 So yeah, you're right: 101:30 a model has a broader meaning in my 101:33 context. 101:43 Yeah, sorry, 101:44 my Zoom crashed right as I asked the 101:47 question and then I disappeared; 101:49 I apologize for that. I asked about 101:51 this 101:52 rollout 101:54 length design, but I can ask Yuchao 101:56 later what the answer was. I just 101:59 wanted to apologize for disappearing; it 102:01 was not very polite. 102:02 I may not be able to teleport 102:05 now. 102:07 Thank you. 102:08 But professor, I think one of your 102:11 slides that has truncated rollout and 102:14 this slope actually 102:17 explains the idea 102:19 elegantly: if you have too long a rollout, 102:22 then the rollout goes to 102:26 the cost function of that particular 102:28 policy 102:30 instead of the 102:32 Bellman equation's solution. I think that 102:35 figure would explain that it's not 102:37 necessarily good to have way too long a 102:40 rollout. 102:42 Yeah, yeah, thank you very much, Yuchao. 102:44 I think if you look at the draft of the 102:46 book there's a section on this question 102:48 precisely, 102:49 the situation where a longer rollout 102:52 may help, 102:54 but even so there's no definitive answer 102:58 on how you do it. It's just that 103:01 you might gain some better understanding 103:03 about what's going on if you're looking 103:05 at this section of the book or 103:07 the figure that Yuchao mentioned, 103:14 but it's 103:16 hard to say 103:18 how much rollout would be most effective. 103:23 On the other hand, there's always a 103:24 Newton step, okay; once you have a Newton 103:27 step you can afford to play 103:29 around with different 103:31 parameters of the design; the Newton step 103:34 has the power to correct a lot. 103:40 Do you know, in TD-Gammon, the 103:42 author, did he reveal, I mean, I 103:45 guess, 103:47 the design of TD-Gammon, did he reveal 103:50 how that was tuned? I mean, I guess, or 103:54 Google with AlphaZero, did they also 103:55 discuss this 103:57 in their own contributions? 103:59 Well, 104:00 Tesauro's paper very explicitly says 104:03 that it's important not to have a very 104:05 long rollout, 104:07 but the reason that he gave is that 104:10 it takes a very long time to do a long 104:12 rollout, and that's why he 104:15 introduced truncation, 104:17 introducing a cost function 104:18 approximation at the end of the rollout. 104:21 Yeah, 104:22 Tesauro had it all figured out, 104:25 but 104:26
he did not give any mathematical 104:28 analysis, 104:29 and his motivations were always 104:32 computational, how to make the thing work, 104:35 because remember, 104:37 Tesauro's 104:38 rollout program dates to 1996, and 104:42 computers were very primitive at that 104:44 time. Yeah, and for Tesauro to play a 104:46 single game of backgammon 104:48 he had to pull together 104:51 a lot of workstations at the IBM research 104:54 center to work all night in order to 104:57 play a single game, 104:59 because 105:00 Monte Carlo simulation was involved, 105:04 and it was important to use parallel 105:06 computation and also to keep the length 105:08 of the truncated rollout short. 105:13 That was a good use of computing 105:15 resources in the end, you know. 105:17 Yes, it's amazing. Tesauro has been amazing 105:20 in this field, and I'm afraid, and 105:22 I think, that his great influence 105:25 and great contributions have not been 105:27 fully appreciated, unfortunately. In this 105:30 field of reinforcement learning, 105:32 memory is very short, yes; 105:35 it stretches back a few years; 105:38 anything that's been done 20 years ago 105:40 or longer is as if it never existed. Yeah, 105:44 yes, yes. 105:46 I guess it's important... 105:49 it's a more general problem. Oh yeah, 105:52 people tend to be forgotten 105:54 so fast. 105:56 Yes. 105:57 Great, but thank you, Dimitri, we will not 106:00 forget you. 106:01 Yeah, 106:02 it's good to hear, 106:04 but I'm not sure I can hold you to it. 106:06 [Laughter] 106:10 Great, this was fantastic, so I guess 106:13 we should conclude. And I know you 106:16 work with Yuchao, and so that's 106:18 fantastic; I hope you'll be able to come 106:20 to visit us at KTH in a non-pandemic 106:23 future, hopefully. 106:25 I love your city, I'd love to visit, 106:28 and also I hope that we can do some 106:31 further work and more computational 106:32 experimentation with meaningful problems; 106:35 I think there are many possibilities, 106:38 and working with Yuchao has been 106:40 fantastic for me and gave me the 106:42 opportunity also to connect with your 106:44 group. 106:46 That's great. 106:47 All right, so thanks a lot. 106:50 I'm leaving now also, 106:53 thanks, and many, many thanks, 106:55 bye for now. Okay, thank you very much. 106:58 Thank you, Dimitri, and I will send you 106:59 the recording. 107:01 Okay guys, thank you very much. Yeah, I'd 107:03 like to have the recording, absolutely. 107:05 I'll fix that. Thank you so much. 107:07 Take care, bye-bye.