LAB 7 - Experimenting with a Markov Decision Process (MDP)

Goal

Problem

Consider the system modeled in the following state diagram:
The proposed model consists of:

At each time t, the agent perceives its state s(t) and the set of possible actions A(s(t)). It chooses an action at random, according to a given probability, and receives from the environment the new state s(t+1) and the reward r(t+1).
At each time step we define V(t+1) = r(t+1) + γ*V(t), where γ is called the "future reward discounting factor" and takes a value between 0.0 and 1.0.
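
The loop described above (perceive the state, pick a random action, receive the new state and reward, update V) could be sketched in Python roughly as follows. This is only an illustration under stated assumptions: the two-state environment, the action names "stay" and "move", the reward of 1.0 for entering s2, the parameter name p_move, and the starting value V(0) = 0.0 are all hypothetical placeholders, not the model of the lab's state diagram.

```python
import random

def run_episode(actions, action_probs, step, s0, gamma, t_fin, seed=0, v0=0.0):
    """Simulate t_fin steps and return (states, values).

    states[t] is s(t); values[t] is V(t), updated with
    V(t+1) = r(t+1) + gamma * V(t). v0 = V(0) is an assumption.
    """
    rng = random.Random(seed)
    s, v = s0, v0
    states, values = [s0], [v0]
    for _ in range(t_fin):
        # Choose an action from A(s) at random, with the given probabilities.
        a = rng.choices(actions(s), weights=action_probs(s))[0]
        # The environment returns the new state and the reward.
        s, r = step(s, a)
        v = r + gamma * v
        states.append(s)
        values.append(v)
    return states, values

# --- Hypothetical placeholder environment (NOT the lab's diagram) ---
def make_two_state_env(p_move):
    other = {"s1": "s2", "s2": "s1"}

    def actions(s):
        return ["stay", "move"]

    def action_probs(s):
        return [1.0 - p_move, p_move]

    def step(s, a):
        new_s = other[s] if a == "move" else s
        reward = 1.0 if (a == "move" and new_s == "s2") else 0.0
        return new_s, reward

    return actions, action_probs, step

actions, action_probs, step = make_two_state_env(p_move=0.3)
states, values = run_episode(actions, action_probs, step, "s1", gamma=0.9, t_fin=20)
```

Replacing the three placeholder functions with ones that encode the actual diagram leaves run_episode unchanged, which may make it easier to vary γ and Tfin independently of the model.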

What you have to do: perform a parametric study of P(S) and V, varying N, p, γ, and Tfin.

Define P(s1) = (number of times the agent is in state s1) / (total number of actions)
P(s2) = (number of times the agent is in state s2) / (total number of actions)
...
P(S) = {P(s1),P(s2),...}
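
As a minimal sketch, the frequencies defined above could be computed from a recorded run as shown below. The function name state_distribution is hypothetical, and the sketch assumes that the recorded list holds the state reached after each action (so the initial state is not counted and the frequencies sum to 1); the handout does not say which convention is intended.

```python
from collections import Counter

def state_distribution(states_after_actions):
    """Return {state: P(state)} with P(s) = visits to s / total number of actions.

    Assumption: the list has one entry per action taken, namely the state
    the agent is in after that action.
    """
    n_actions = len(states_after_actions)
    if n_actions == 0:
        raise ValueError("at least one action is required")
    counts = Counter(states_after_actions)
    return {s: c / n_actions for s, c in counts.items()}

# Example with a hand-written trace of four actions.
dist = state_distribution(["s2", "s1", "s2", "s2"])
```

For the trace in the example, the agent is in s2 after three of the four actions, so P(s2) = 0.75 and P(s1) = 0.25.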

Study the distribution P(S) and the value V(t) while varying the following parameters: