LAB 7 - Experimenting with a Markov Decision Process (MDP)
Goal
- Understand the behavior of the modeled system by running a large number of simulations.
- Develop an automated series of simulations for a parametric study of the problem. Feel free to use any tool you like for running simulations. Scripts are allowed.
Problem
Consider the system modeled in the following state diagram:
The proposed model consists of:
- a set of environment states S = {s1,s2,s3,...,sN}
N = number of states.
In the picture N = 6
- a set of actions A = {"go left", "go right"}.
action "go right" has probability p
action "go left" has probability 1-p
On boundary environment states, there is a special action "Remain in the same environment state"
- a set of "rewards" R = {r1,r2,r3,...,rN}
In our problem the rewards are R = {-1,0,0,0,0,+1}
At each time t, the agent perceives its state s(t) and the set of possible actions A(s(t)).
It randomly chooses an action, according to the given probabilities, and receives from the environment
the new state s(t+1) and the reward r(t+1).
At each time step we define V(t+1) = r(t+1) + γ*V(t), where γ is called the "future reward
discounting factor" and takes a value between 0.0 and 1.0.
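The dynamics and the update of V described above can be sketched in a few lines of Python. This is only one possible reading of the problem, not a reference solution: the function name `simulate`, the 0-based state indices (index i stands for state s(i+1)), the starting state (the middle one by default), the initial value V(0) = 0, and the generalization of the rewards to any N (-1 in s1, +1 in sN, 0 elsewhere) are all assumptions that the text does not fix. The reward received at each step is taken to be the one of the new state.

```python
import random


def simulate(n_states, p, gamma, t_fin, start=None, seed=None):
    """Run one simulation of t_fin steps; return (visited states, final V)."""
    rng = random.Random(seed)

    # Assumption: rewards generalize the N = 6 example
    # (-1 in the leftmost state, +1 in the rightmost, 0 elsewhere).
    rewards = [0.0] * n_states
    rewards[0] = -1.0
    rewards[-1] = 1.0

    # Assumption: start in the middle state unless a start index is given.
    state = n_states // 2 if start is None else start
    value = 0.0  # Assumption: V(0) = 0
    visited = []

    for _ in range(t_fin):
        # "go right" with probability p, "go left" with probability 1 - p.
        move = 1 if rng.random() < p else -1
        # On a boundary state, a move that would leave the chain
        # becomes "remain in the same environment state".
        state = min(max(state + move, 0), n_states - 1)
        # V(t+1) = r(t+1) + gamma * V(t), exactly as defined in the text.
        value = rewards[state] + gamma * value
        visited.append(state)

    return visited, value
```

For instance, `simulate(3, 1.0, 0.5, 3, start=0)` always moves right, so it visits the indices 1, 2, 2 and ends with V = 1.5. Passing a `seed` makes a random run repeatable, which helps when comparing parameter settings.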
What you have to do: Parametric study of P(S) and V, changing N,p,γ,Tfin
Define P(s1) = (number of times the agent is in state s1) / (total number of actions)
P(s2) = (number of times the agent is in state s2) / (total number of actions)
...
P(S) = {P(s1),P(s2),...}
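As an illustration of the definition of P(S), the following Python sketch turns a list of visited states into relative frequencies. The function name `state_distribution` and the 0-based state indices are assumptions made for the example. It also assumes that the state is recorded once after every action, so that the length of the list equals the total number of actions.

```python
from collections import Counter


def state_distribution(visited, n_states):
    """Return [P(s1), ..., P(sN)] as visit counts divided by the number of actions."""
    if not visited:
        raise ValueError("at least one action is needed to estimate P(S)")
    counts = Counter(visited)   # missing states count as 0
    total = len(visited)        # Assumption: one recorded state per action
    return [counts[s] / total for s in range(n_states)]
```

For example, `state_distribution([1, 2, 2, 2], 3)` gives `[0.0, 0.25, 0.75]`: the first state is never visited, the second once, the third three times out of four actions.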
Study the distribution P(S) and the value V(t) while varying the following parameters:
- N = number of environment states = 3,5,9,50
- p = probability of going right = 0,.1,.25,.5,.75,.9,1.0
- γ = future discounting factor = 0,.1,.5,.9,.999,1.0
- Tfin = number of time steps = 1,10,100,1000,10000,100000
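One way to automate the parametric study is to enumerate every combination of the values listed above and call a simulation routine on each. The Python sketch below only builds the grid and loops over it. The names `parameter_grid` and `run_sweep` are hypothetical, and `run_one` stands for whatever simulation function you write: it is assumed to accept (N, p, γ, Tfin) and return any result you want to record.

```python
import itertools

# Parameter values taken from the assignment text.
N_VALUES = [3, 5, 9, 50]
P_VALUES = [0, 0.1, 0.25, 0.5, 0.75, 0.9, 1.0]
GAMMA_VALUES = [0, 0.1, 0.5, 0.9, 0.999, 1.0]
TFIN_VALUES = [1, 10, 100, 1000, 10000, 100000]


def parameter_grid():
    """Return every (N, p, gamma, Tfin) combination as a list of tuples."""
    return list(itertools.product(N_VALUES, P_VALUES, GAMMA_VALUES, TFIN_VALUES))


def run_sweep(run_one, grid):
    """Call run_one(n, p, gamma, t_fin) on each combination and collect the rows."""
    rows = []
    for n, p, gamma, t_fin in grid:
        result = run_one(n, p, gamma, t_fin)  # run_one is supplied by the caller
        rows.append({"N": n, "p": p, "gamma": gamma, "Tfin": t_fin, "result": result})
    return rows
```

The full grid has 4 x 7 x 6 x 6 = 1008 combinations. Because each run is random, you may want to repeat every combination several times and average the results; the number of repetitions is left to you.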