Dynamic Programming
We have defined concepts and properties such as value functions, Bellman equations, and Bellman operators. The question now is: how can we find the optimal policy? Before we start, we assume that the dynamics of the MDP are given, that is, we know the transition distribution \(P\) and the immediate reward distribution \(R\). The assumption of a known model does not hold in the RL setting, but designing methods for finding the optimal policy with a known model provides the foundation for developing methods in the RL setting.
Dynamic programming (DP) methods exploit the structure of the MDP, in particular the recursive structure encoded in the Bellman equation, in order to compute the value function.
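As a concrete illustration, here is a minimal sketch of value iteration for a small tabular MDP with known dynamics. It assumes \(P\) is given as a NumPy array of shape (S, A, S) of transition probabilities and \(R\) as an array of shape (S, A) of expected immediate rewards; the function and variable names are my own and not taken from the text.

```python
import numpy as np

def value_iteration(P, R, gamma=0.9, tol=1e-8):
    """Sketch: repeatedly apply the Bellman optimality backup until the
    value estimates stop changing (up to tol).

    P : (S, A, S) array, P[s, a, s'] = transition probability
    R : (S, A) array,    R[s, a]     = expected immediate reward
    """
    S, A, _ = P.shape
    V = np.zeros(S)
    while True:
        # Q[s, a] = R[s, a] + gamma * sum_{s'} P[s, a, s'] * V[s']
        Q = R + gamma * (P @ V)        # shape (S, A)
        V_new = Q.max(axis=1)          # greedy backup over actions
        if np.max(np.abs(V_new - V)) < tol:
            break
        V = V_new
    policy = Q.argmax(axis=1)          # greedy policy w.r.t. the final V
    return V, policy
```

The loop is exactly the recursion highlighted above: each sweep replaces the current estimate of the value function with the right-hand side of the Bellman optimality equation evaluated under the known \(P\) and \(R\).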