Bellman Equations

Please read the post on MDPs before reading this post.

Bellman Equations

By expanding value functions, we reveal the recursive property of return:

\[ \begin{aligned} V^{\pi} (x) &= E[G^{\pi}_{t} | X_{t} = x]\\ &= E[R_{t} + \gamma G^{\pi}_{t+1} | X_{t} = x]\\ &= E_{A_{t} \sim \pi(\cdot | x)}[R_{t} | X_{t} = x] + \gamma E[G^{\pi}_{t+1} | X_{t} = x]\\ &= r^{\pi} (x) + \gamma E[V^{\pi} (X_{t+1}) | X_{t} = x]\\ \end{aligned} \]

Where: \[V^{\pi} (X_{t+1}) = E[G^{\pi}_{t+1} | X_{t+1}] \]

\[\implies E_{A_{t} \sim \pi(\cdot | x_t),\, X_{t+1} \sim P(\cdot | x_t, A_{t})}[V^{\pi} (X_{t+1}) | X_{t} = x_t] = E\big[E[G^{\pi}_{t+1} | X_{t+1}] \,\big|\, X_{t} = x_t\big] = E[G^{\pi}_{t+1} | X_{t} = x_t]\]

The last equality is the tower property of conditional expectation: conditioned on \(X_{t} = x_t\), neither side is random. Expanding

\[E[V^{\pi} (X_{t+1}) | X_{t} = x]\]

we get:

\[E[V^{\pi} (X_{t+1}) | X_{t} = x] = \int P(dx^\prime | x, a) \pi (da | x) V^{\pi} (x^\prime)\]
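For a finite state and action space (an assumption made here purely for concreteness; the integral above covers the general case), this expectation is a double sum over the policy and the transition probabilities:

\[E[V^{\pi} (X_{t+1}) | X_{t} = x] = \sum_{a} \pi(a | x) \sum_{x^\prime} P(x^\prime | x, a) V^{\pi}(x^\prime)\]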

Thus, the value function can be expressed as:

\[V^{\pi} (x) = r^{\pi} (x) + \gamma \int P(dx^\prime | x, a) \pi (da | x) V^{\pi} (x^\prime)\]

This is the Bellman equation for a policy \(\pi\). It can be interpreted as follows: the value of following a policy \(\pi\) is the expected immediate reward that the \(\pi\)-following agent receives at that state, plus the discounted expected value that the agent receives at the next state.
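In the finite-state case the Bellman equation can be read as a fixed-point condition \(V^{\pi} = r^{\pi} + \gamma P^{\pi} V^{\pi}\). Below is a minimal sketch of iterative policy evaluation that solves this fixed point on a tiny made-up MDP; the arrays `P`, `r`, `pi` and the discount `gamma` are hypothetical placeholders, not quantities from this post.

```python
import numpy as np

# Hypothetical 2-state, 2-action MDP (placeholder numbers, not from the post).
# P[x, a, y] = P(y | x, a); each P[x, a, :] sums to 1.
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.1, 0.9]]])
r = np.array([[1.0, 0.0],       # r[x, a]: expected immediate reward
              [0.0, 2.0]])
pi = np.array([[0.7, 0.3],      # pi[x, a]: probability of action a in state x
               [0.4, 0.6]])
gamma = 0.9

# r^pi(x) = sum_a pi(a|x) r(x, a)
r_pi = (pi * r).sum(axis=1)
# P^pi(x, y) = sum_a pi(a|x) P(y|x, a)
P_pi = np.einsum('xa,xay->xy', pi, P)

# Fixed-point iteration on V = r^pi + gamma * P^pi V (the Bellman equation above).
V = np.zeros(2)
for _ in range(1000):
    V_new = r_pi + gamma * P_pi @ V
    if np.max(np.abs(V_new - V)) < 1e-10:
        V = V_new
        break
    V = V_new

print(V)  # approximately V^pi(x) for each state x
```

Since \(\gamma < 1\), the update is a contraction, so the iteration converges to the unique solution \(V^{\pi} = (I - \gamma P^{\pi})^{-1} r^{\pi}\).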

Similarly, the action-value function satisfies its own Bellman equation:

\[Q^{\pi} (x, a) = r(x, a) + \gamma \int P(dx^\prime | x, a) V^{\pi}(x^\prime) = r(x, a) + \gamma \int P(dx^\prime | x, a) \pi (da^\prime | x^\prime) Q^{\pi} (x^\prime, a^\prime)\]

Compared with the value function, the first term is the expected reward given action \(a\), and the second term is the discounted expected value that the agent receives at the next state while following \(\pi\) (in the action-value function, the action taken at state \(x\) is given rather than sampled from the policy).
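As an illustrative continuation of the sketch above (the helper name `q_from_v` is hypothetical, not from this post), computing \(Q^{\pi}\) from \(V^{\pi}\) is one application of the first equality, and averaging \(Q^{\pi}\) over the policy recovers \(V^{\pi}\):

```python
import numpy as np

def q_from_v(P, r, pi, gamma, V):
    """Q^pi(x, a) = r(x, a) + gamma * sum_y P(y | x, a) V^pi(y)."""
    Q = r + gamma * np.einsum('xay,y->xa', P, V)
    # Sanity check: V^pi(x) = sum_a pi(a|x) Q^pi(x, a)
    assert np.allclose((pi * Q).sum(axis=1), V, atol=1e-6)
    return Q
```

Calling `q_from_v(P, r, pi, gamma, V)` with the arrays from the previous sketch returns a \(2 \times 2\) table of action values, and the check inside confirms \(V^{\pi}(x) = \sum_{a} \pi(a | x) Q^{\pi}(x, a)\).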