MDP
Based on Prof Amir-massoud Farahmand's Lecture on RL
We define our MDP differently from the Introduction to RL book: an MDP is a tuple \((X, A, P, R, \gamma)\) with transition probability kernel \(P: X \times A \to M(X)\) and reward kernel \(R: X \times A \to M(\mathbb{R})\),
where \(M(X)\) is the space of all probability distributions defined over the state space \(X\).
So a sample of the process can be described as a rollout or trajectory:
\[\tau = X_1, A_1, R_1, X_2, A_2, R_2, ...\]
A deterministic dynamical system always behaves exactly the same given the same starting state and action; that is, it can be described by a transition function \(f\) instead of a transition distribution \(P\):
\[x_{t+1} = f(x_t, a_t)\]
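As a toy illustration (a made-up 1-D environment, not from the lecture), sampling from a stochastic \(P(\cdot | x, a)\) versus applying a deterministic \(f(x, a)\) might look like this in Python:

```python
import random

def f(x, a):
    # Deterministic dynamics: the next state is fully determined by (x, a).
    return x + a

def sample_next_state(x, a):
    # Stochastic dynamics: the next state is drawn from P(. | x, a),
    # here the deterministic part plus symmetric noise.
    return x + a + random.choice([-1, 0, 1])

print(f(0, 1), f(0, 1))                                   # always the same: 1 1
print(sample_next_state(0, 1), sample_next_state(0, 1))   # may differ between calls
```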
Policy
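A policy \(\pi: X \to M(A)\) maps each state to a distribution over actions; \(\pi(a | x)\) is the probability of selecting action \(a\) in state \(x\). A deterministic policy is the special case where all probability mass is placed on a single action, written \(a = \pi(x)\).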
Policy-induced Transition Kernels
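Given a policy \(\pi\), averaging the dynamics over the action choice gives the policy-induced transition kernel
\[P^{\pi}(\cdot | x) = \sum_{a} \pi(a | x) P(\cdot | x, a)\]
(an integral for continuous action spaces). While following \(\pi\), the sequence of visited states is a Markov chain with kernel \(P^{\pi}\).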
Rewards and Returns
Let's define expected reward as:
\[r(x, a) = E[R | X=x, A=a]\]
then the expected reward following policy \(\pi\) is defined as
\[r^{\pi} (x) = E_{A \sim \pi(\cdot | x)} [r(x, A)]\]
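For a finite MDP stored as arrays, \(r^{\pi}\) is just a policy-weighted average over actions. A minimal sketch, with made-up numbers for \(r\) and \(\pi\):

```python
import numpy as np

# Hypothetical tabular quantities, purely for illustration.
r = np.array([[1.0, 0.0],
              [0.5, 0.5],
              [0.0, 2.0]])   # r[x, a] = E[R | X=x, A=a]
pi = np.array([[0.5, 0.5],
               [1.0, 0.0],
               [0.2, 0.8]])  # pi[x, a] = pi(a | x), each row sums to 1

r_pi = (pi * r).sum(axis=1)  # r_pi[x] = E_{A ~ pi(.|x)}[r(x, A)]
print(r_pi)                  # [0.5 0.5 1.6]
```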
Define the return following policy \(\pi\) as the discounted sum of rewards starting from \(X_1\):
\[G^{\pi}_{1} = R_1 + \gamma R_2 + \gamma^2 R_3 + \dots = \sum^{\infty}_{k=1} \gamma^{k - 1} R_{k}\]
(for an episodic task with final step \(T\), the sum runs only up to \(k = T\))
or starting at any step \(t\) (from state \(X_t\)):
\[G^{\pi}_{t} = \sum^{\infty}_{k=t} \gamma^{k - t} R_{k}\]
Note that \(\gamma\) is part of the problem setting; it is usually not a hyperparameter. One way to think about it is as a constant per-step probability of termination of \(1 - \gamma\). Alternatively, we can set \(\gamma = 1\) when we have episodic tasks.
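As a quick sanity check of the formula, here is a small sketch that computes the discounted return of a finite reward list (the rewards and \(\gamma\) are made up):

```python
def discounted_return(rewards, gamma):
    # G_1 = sum_{k=1}^{T} gamma^{k-1} R_k for a finite list [R_1, ..., R_T].
    g = 0.0
    for k, r_k in enumerate(rewards):   # k = 0 corresponds to R_1
        g += (gamma ** k) * r_k
    return g

print(discounted_return([1.0, 1.0, 1.0], gamma=0.9))  # 1 + 0.9 + 0.81 = 2.71
```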
Example
Finite horizon example:
- Start at \(X_1 \sim \rho\)
- Select \(A_1 \sim \pi(\cdot | X_1)\)
- Get \(R_1 \sim R(\cdot | X_1, A_1)\)
- Get \(X_2 \sim P(\cdot | X_1, A_1)\)
- ...
- ...
- Get \(R_{T} \sim R(\cdot | X_{T}, A_{T})\)
- Get \(X_{T+1} \sim P(\cdot | X_{T}, A_{T})\) (This is the terminal state, which is often dropped because its value is 0.)
The process stops!
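A minimal simulation of this loop, assuming a hypothetical tabular MDP with randomly generated \(P\) and \(r\), and uniform \(\pi\) and \(\rho\) (none of these come from the lecture):

```python
import numpy as np

rng = np.random.default_rng(0)

# A made-up finite MDP, purely for illustration.
n_states, n_actions, T = 3, 2, 5
rho = np.full(n_states, 1.0 / n_states)                            # initial state distribution
pi = np.full((n_states, n_actions), 1.0 / n_actions)               # uniform policy pi(a | x)
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))   # P[x, a] is a distribution over next states
r = rng.normal(size=(n_states, n_actions))                         # expected rewards r(x, a)

def rollout():
    # Follows the loop above: X_1 ~ rho, A_t ~ pi(.|X_t), R_t ~ R(.|X_t, A_t), X_{t+1} ~ P(.|X_t, A_t).
    x = rng.choice(n_states, p=rho)
    trajectory = []
    for _ in range(T):
        a = rng.choice(n_actions, p=pi[x])
        reward = r[x, a] + rng.normal()   # Gaussian reward noise is an arbitrary choice here
        x_next = rng.choice(n_states, p=P[x, a])
        trajectory.append((x, a, reward))
        x = x_next
    return trajectory

print(rollout())
```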
Value functions
Next we can define value functions as:
State value function (alias: Value function, V, v) starting from state x and following policy \(\pi\):
\[V^{\pi}(x) = E[G_1^{\pi} | X_{1} = x] = E[G^{\pi}_{t} | X_{t} = x] \text{ (Markov Property)}\]
Action value function following policy \(\pi\) starting from state x and taking action a:
\[Q^{\pi} (x, a) = E[G_1^{\pi} | X_{1} = x, A_{1} = a] = E[G^{\pi}_{t} | X_{t} = x, A_{t} = a] \text{ (Markov Property)}\]
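These definitions suggest a direct Monte Carlo estimator: roll out \(\pi\) from a fixed start state many times and average the discounted returns. A sketch, again on a hypothetical tabular MDP, truncating the infinite sum at a finite horizon:

```python
import numpy as np

rng = np.random.default_rng(1)

# The same kind of made-up tabular MDP as in the rollout sketch above.
n_states, n_actions, gamma = 3, 2, 0.9
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))
pi = np.full((n_states, n_actions), 1.0 / n_actions)
r = rng.normal(size=(n_states, n_actions))

def mc_value(x0, n_episodes=2000, horizon=200):
    # Monte Carlo estimate of V^pi(x0) = E[G_1 | X_1 = x0]:
    # average the (truncated) discounted return of rollouts started at x0.
    returns = []
    for _ in range(n_episodes):
        x, g, discount = x0, 0.0, 1.0
        for _ in range(horizon):
            a = rng.choice(n_actions, p=pi[x])
            g += discount * r[x, a]
            discount *= gamma
            x = rng.choice(n_states, p=P[x, a])
        returns.append(g)
    return np.mean(returns)

print(mc_value(0))
```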
In the termination-based formulation, the value function \(V_{term}^{\pi}\) can be written as:
\[V_{term}^{\pi}(x) = E[R_{1} + R_2 + \dots + R_{T} | X_1 = x] \;\;\;\;\;\; \forall x \in X\]
where the process terminates at a random time \(T\) determined by \(\gamma\): after each step it stops with probability \(1 - \gamma\). In this formulation termination must be assumed to occur within a finite number of steps (i.e., if we have an episodic task we can set \(\gamma = 1\)).
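A small simulation can illustrate the termination interpretation: if a process earns a reward of 1 per step (an arbitrary choice for this check) and stops with probability \(1 - \gamma\) after each step, its average undiscounted return matches the discounted value \(\sum_k \gamma^{k-1} = 1 / (1 - \gamma)\).

```python
import random

gamma, n_runs = 0.9, 100_000

# Each run collects a reward of 1 per step and terminates with probability
# (1 - gamma) after every step; the average undiscounted return should be
# close to the discounted return 1 / (1 - gamma) = 10.
total = 0.0
for _ in range(n_runs):
    while True:
        total += 1.0
        if random.random() > gamma:   # terminate with probability 1 - gamma
            break
print(total / n_runs)  # approximately 10.0
```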
Another formulation
We can also define the T-step rollout or trajectory distribution \(P(\tau | \pi)\) following policy \(\pi\):
\[P(\tau | \pi) = \rho(X_1)\prod_{t=1}^{T} P(X_{t+1}| X_{t}, A_{t}) \pi(A_t | X_{t}) R(R_t | X_{t}, A_{t})\]
Then:
\[R(\tau) = G^{\pi}_{1} = \sum^{T}_{k=1} \gamma^{k - 1} R_{k}\]
\[J(\pi) = E_{\tau \sim P(\tau | \pi) }[R(\tau)] = \int R(\tau) P(d\tau | \pi)\]
\[V^{\pi} (x) = E_{\tau \sim P(\tau | \pi, X_1=x)}[R(\tau) | X_1 = x]\]
And:
\[J(\pi) = \int V^{\pi} (x) d\rho(x)\]
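For a finite state space this integral reduces to a weighted sum over initial states. A tiny sketch with hypothetical values for \(\rho\) and \(V^{\pi}\):

```python
import numpy as np

# Hypothetical arrays: an initial-state distribution rho and a value function
# V^pi (e.g. estimated by the Monte Carlo sketch above).
rho = np.array([0.2, 0.3, 0.5])
V_pi = np.array([1.0, 0.0, 2.0])

J = float(rho @ V_pi)   # J(pi) = sum_x rho(x) V^pi(x)
print(J)                # 0.2*1.0 + 0.3*0.0 + 0.5*2.0 = 1.2
```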