Continuous Control with Deep Reinforcement Learning
Notations
A trajectory is defined as \((s_1, a_1, r_1, ...)\)
The initial state distribution: \(\rho (s_1)\)
Transition dynamics \(p(s_{t+1} | s_t, a_t)\)
Reward function \(r(s_t, a_t)\)
The return from a state is defined as the sum of discounted future reward:
\[R_t = \sum_{i=t}^{T} \gamma^{i - t} r(s_i, a_i)\]
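As a sanity check on this definition, here is a minimal Python sketch (the reward values and \(\gamma\) below are made up for illustration) that computes \(R_t\) for every step of a finite trajectory via the backward recursion \(R_t = r_t + \gamma R_{t+1}\):

```python
def discounted_returns(rewards, gamma):
    """Compute R_t = sum_{i >= t} gamma^(i - t) * r_i for every step t."""
    returns = [0.0] * len(rewards)
    running = 0.0
    # Iterate backwards: R_t = r_t + gamma * R_{t+1}
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

# Example: rewards from a length-4 trajectory, gamma = 0.9
print(discounted_returns([1.0, 0.0, 0.0, 2.0], 0.9))
# [2.458, 1.62, 1.8, 2.0]; e.g. R_1 = 1 + 0.9^3 * 2 = 2.458
```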
The discounted state distribution following \(\pi\) is denoted as \(\rho^{\pi}\)
The action-value function:
\[Q^{\pi} (s_t, a_t) = E_{r_{i \geq t}, s_{i > t} \sim P, a_{i >t} \sim \pi} [R_t | s_t, a_t]\]
The action-value function can be expanded recursively using the Bellman equation:
\[Q^{\pi} (s_t, a_t) = E_{r_{t}, s_{t+1} \sim P} [r(s_t, a_t) + \gamma E_{a_{t+1} \sim \pi (\cdot | s_{t+1})}[Q^{\pi} (s_{t+1}, a_{t+1})] | s_t, a_t]\]
If the policy is deterministic, \(\mu: S \rightarrow A\), the inner expectation over next actions disappears and we have:
\[Q^{\mu} (s_t, a_t) = E_{r_{t}, s_{t+1} \sim P} [r(s_t, a_t) + \gamma Q^{\mu} (s_{t+1}, \mu(s_{t + 1})) | s_t, a_t]\]
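To illustrate, here is a small sketch that evaluates \(Q^{\mu}\) on a toy deterministic MDP (the two states, rewards, and policy are invented for the example) by repeatedly applying the deterministic Bellman equation as a fixed-point update:

```python
import numpy as np

# Toy deterministic MDP: 2 states, 2 actions (all values invented for illustration).
# next_state[s, a] and reward[s, a] play the roles of p(s'|s,a) and r(s,a).
next_state = np.array([[0, 1],
                       [1, 0]])
reward = np.array([[0.0, 1.0],
                   [2.0, 0.0]])
mu = np.array([1, 0])          # deterministic policy mu: S -> A
gamma = 0.9

Q = np.zeros((2, 2))
for _ in range(1000):
    # Q^mu(s, a) <- r(s, a) + gamma * Q^mu(s', mu(s'))  with s' = next_state[s, a]
    Q = reward + gamma * Q[next_state, mu[next_state]]

print(Q)   # converges to Q^mu, e.g. Q^mu(s_0, a_1) = 1 + 0.9 * 20 = 19
```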
The deterministic Bellman equation implies that we can learn \(Q^{\mu}\) off-policy, using transitions generated by a different behavior policy \(\beta\).
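In the DDPG setup this is realized with a replay buffer: transitions \((s_t, a_t, r_t, s_{t+1})\) collected by the behavior policy \(\beta\) are stored and later sampled uniformly for the updates below. A minimal sketch (the capacity and batch size are placeholder choices):

```python
import random
from collections import deque

buffer = deque(maxlen=1_000_000)   # replay buffer (capacity is a placeholder)

def store(transition):
    """transition = (s_t, a_t, r_t, s_next), generated by the behavior policy beta."""
    buffer.append(transition)

def sample_minibatch(batch_size=64):
    """Uniformly sample stored transitions for the off-policy critic update."""
    return random.sample(buffer, batch_size)
```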
For a function approximator with parameters \(\theta^{Q}\), the off-policy loss (Bellman residual minimization) is:
\[L(\theta^{Q}) = E_{s_t \sim \rho^\beta, a_t \sim \beta, r_t} [(Q(s_t, a_t | \theta^{Q}) - y_t)^2]\]
where:
\[y_t = r(s_t, a_t) + \gamma Q(s_{t+1}, \mu(s_{t+1}) | \theta^{Q})\]
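A minimal PyTorch sketch of this loss on one minibatch (the network architectures and dimensions are placeholder assumptions); as in the paper, the dependence of \(y_t\) on \(\theta^{Q}\) is ignored, so the target is computed without tracking gradients:

```python
import torch
import torch.nn as nn

state_dim, action_dim, gamma = 3, 1, 0.99   # placeholder dimensions and discount

# Q(s, a | theta^Q): a simple MLP critic over the concatenated (s, a) pair.
critic = nn.Sequential(nn.Linear(state_dim + action_dim, 64), nn.ReLU(), nn.Linear(64, 1))
# mu(s | theta^mu): deterministic actor.
actor = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(), nn.Linear(64, action_dim))

def critic_loss(s, a, r, s_next):
    """L(theta^Q) = E[(Q(s_t, a_t | theta^Q) - y_t)^2] over a sampled minibatch."""
    with torch.no_grad():   # y_t is treated as a fixed regression target
        y = r + gamma * critic(torch.cat([s_next, actor(s_next)], dim=-1))
    q = critic(torch.cat([s, a], dim=-1))
    return ((q - y) ** 2).mean()

# Usage with a fake minibatch of 64 transitions (random data for illustration):
s, a = torch.randn(64, state_dim), torch.randn(64, action_dim)
r, s_next = torch.randn(64, 1), torch.randn(64, state_dim)
loss = critic_loss(s, a, r, s_next)
loss.backward()   # gradients w.r.t. theta^Q
```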
The use of large, non-linear function approximators for learning value or action-value functions has often been avoided in the past, since theoretical performance guarantees are impossible and, in practice, learning tends to be unstable.
The off-policy deterministic policy gradient:
\[\nabla_{\theta^{\mu}} J \approx E_{s_{t} \sim \rho^{\beta}} [\nabla_{a} Q(s, a | \theta^Q) |_{s=s_t, a=\mu(s_t)} \nabla_{\theta^{\mu}} \mu(s | \theta^\mu) |_{s=s_t}]\]
As with Q-learning, introducing non-linear function approximators means that convergence is no longer guaranteed for the off-policy deterministic policy gradient.
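A minimal PyTorch sketch of an actor update based on this gradient (same placeholder actor and critic shapes as in the critic-loss sketch above): autograd applies the chain rule through \(Q(s, \mu(s | \theta^{\mu}) | \theta^Q)\), which reproduces the product of gradients written above.

```python
import torch
import torch.nn as nn

state_dim, action_dim = 3, 1   # same placeholder dimensions as above
critic = nn.Sequential(nn.Linear(state_dim + action_dim, 64), nn.ReLU(), nn.Linear(64, 1))
actor = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(), nn.Linear(64, action_dim))
actor_optim = torch.optim.Adam(actor.parameters(), lr=1e-4)   # learning rate is a placeholder

def actor_update(s):
    """One ascent step along grad_a Q(s,a|theta^Q)|_{a=mu(s)} * grad_{theta^mu} mu(s|theta^mu)."""
    actor_optim.zero_grad()
    # Maximize E[Q(s, mu(s|theta^mu))] w.r.t. theta^mu by minimizing the negative.
    loss = -critic(torch.cat([s, actor(s)], dim=-1)).mean()
    loss.backward()      # autograd performs exactly the chain rule written above
    actor_optim.step()   # only the actor parameters are updated by this optimizer

actor_update(torch.randn(64, state_dim))   # states drawn from rho^beta (random placeholders here)
```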