Bellman Equations
Please read the post on MDPs prior to this post.
By expanding the value function, we reveal the recursive property of the return:
\[ \begin{aligned} V^{\pi} (x) &= E[G^{\pi}_{t} | X_{t} = x]\\ & = E[R_{t} + \gamma G^{\pi}_{t+1} | X_{t} = x]\\ & = E_{A_t \sim \pi(\cdot | x)}[R_{t} | X_{t} = x] + \gamma E[G^{\pi}_{t+1} | X_{t} = x]\\ & = r^{\pi} (x) + \gamma E[V^{\pi} (X_{t+1}) | X_{t} = x]\\ \end{aligned} \]
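Here \(r^{\pi}(x)\) denotes the expected immediate reward when acting according to \(\pi\) at state \(x\); assuming the reward notation \(r(x, a)\) from the MDP post, it can be written as:
\[r^{\pi} (x) = E_{A_t \sim \pi(\cdot | x)}[R_{t} | X_{t} = x] = \int \pi (da | x) r(x, a)\]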
Where: \[V^{\pi} (X_{t+1}) = E[G^{\pi}_{t+1} | X_{t+1}] \]
\[\implies E_{X_{t+1} \sim P(\cdot | x_t, A_t)}[V^{\pi} (X_{t+1}) | X_t=x_t] = E_{X_{t+1} \sim P(\cdot | x_t, A_t)}[E[G^{\pi}_{t+1} | X_{t+1}] | X_{t}=x_t] = E[G^{\pi}_{t+1} | X_{t} = x_t]\]
The last equality is the tower property of conditional expectation; since both sides are conditioned on the realized state \(x_t\), neither side is random. By expanding:
\[E[V^{\pi} (X_{t+1}) | X_{t} = x]\]
we get:
\[E[V^{\pi} (X_{t+1}) | X_{t} = x] = \int P(dx^\prime | x, a) \pi (da | x) V^{\pi} (x^\prime)\]
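For finite state and action spaces, where the integrals become sums, this expansion reads:
\[E[V^{\pi} (X_{t+1}) | X_{t} = x] = \sum_{a} \pi (a | x) \sum_{x^\prime} P(x^\prime | x, a) V^{\pi} (x^\prime)\]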
Thus, the value function can be expressed as:
\[V^{\pi} (x) = r^{\pi} (x) + \gamma \int P(dx^\prime | x, a) \pi (da | x) V^{\pi} (x^\prime)\]
This is the Bellman equation for a policy \(\pi\). It can be interpreted as follows: the value of following a policy \(\pi\) at a state is the expected immediate reward the \(\pi\)-following agent receives at that state, plus the discounted average (expected) value the agent receives at the next state.
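As a concrete sanity check, here is a minimal sketch of iterative policy evaluation on a small finite MDP. The transition kernel `P`, rewards `r`, policy `pi`, and discount `gamma` below are all made-up illustrative values, not something from this post; the point is only that repeatedly applying the Bellman equation above converges to \(V^{\pi}\):

```python
import numpy as np

n_states, n_actions = 3, 2
gamma = 0.9

# P[a, x, x'] = probability of moving from state x to x' under action a (made-up values)
P = np.array([
    [[0.8, 0.2, 0.0], [0.1, 0.6, 0.3], [0.0, 0.3, 0.7]],   # action 0
    [[0.5, 0.5, 0.0], [0.0, 0.5, 0.5], [0.2, 0.0, 0.8]],   # action 1
])
# r[x, a] = expected immediate reward for taking action a in state x
r = np.array([[1.0, 0.0], [0.0, 2.0], [0.5, 0.5]])
# pi[x, a] = probability of choosing action a in state x
pi = np.array([[0.5, 0.5], [0.2, 0.8], [1.0, 0.0]])

# r^pi(x) = sum_a pi(a|x) r(x, a)
r_pi = (pi * r).sum(axis=1)
# P^pi[x, x'] = sum_a pi(a|x) P(x'|x, a)
P_pi = np.einsum('xa,axy->xy', pi, P)

# Repeatedly apply the Bellman backup V <- r^pi + gamma * P^pi V until convergence
V = np.zeros(n_states)
for _ in range(1000):
    V_new = r_pi + gamma * P_pi @ V
    if np.max(np.abs(V_new - V)) < 1e-10:
        break
    V = V_new

print(V)  # the fixed point is V^pi
```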
Similarly, for the action-value function we have the Bellman equation:
\[Q^{\pi} (x, a) = r(x, a) + \gamma \int P(dx^\prime | x, a) V^{\pi}(x^\prime) = r(x, a) + \gamma \int P(dx^\prime | x, a) \pi (da^\prime | x^\prime) Q^{\pi} (x^\prime, a^\prime)\]
Compared with the value function, we have the expected reward given action \(a\), plus the discounted average (expected) value the agent receives at the next state while following \(\pi\) (in the action-value function, the choice of action at state \(x\) is given).
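Continuing the sketch above (it reuses the made-up `P`, `r`, `pi`, `gamma`, and the computed `V`), the Bellman equation for \(Q^{\pi}\) can be checked numerically: one backup through the transition kernel gives \(Q^{\pi}\) from \(V^{\pi}\), and averaging \(Q^{\pi}\) over the policy recovers \(V^{\pi}\):

```python
# Q^pi(x, a) = r(x, a) + gamma * sum_x' P(x'|x, a) V^pi(x')
Q = r + gamma * np.einsum('axy,y->xa', P, V)

# Consistency check: V^pi(x) = sum_a pi(a|x) Q^pi(x, a)
assert np.allclose((pi * Q).sum(axis=1), V)
```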