TD3
Addressing Function Approximation Error in Actor-Critic Methods
As with function approximation in discrete action spaces, the overestimation property also exists in the continuous control setting.
Background
The return is defined as the discounted sum of rewards:
$$R_t = \sum_{i=t}^{T} \gamma^{i-t} r(s_i, a_i)$$
The objective of RL is to find the optimal policy $\pi_\phi$, with parameters $\phi$, that maximizes the expected return $J(\phi) = \mathbb{E}_{s_i \sim p_\pi,\, a_i \sim \pi}[R_0]$.
In DPG, the policy's (actor) parameters are updated along the deterministic policy gradient:
$$\nabla_\phi J(\phi) = \mathbb{E}_{s \sim p_\pi}\!\left[\nabla_a Q^\pi(s, a)\big|_{a = \pi(s)} \nabla_\phi \pi_\phi(s)\right]$$
where the critic is the action-value function:
$$Q^\pi(s, a) = \mathbb{E}_{s_i \sim p_\pi,\, a_i \sim \pi}\!\left[R_t \mid s, a\right]$$
In large state spaces, the critic can be approximated with function-approximation methods, for example DQN (approximate value iteration), trained off-policy from a replay buffer:
$$y = r + \gamma Q_\theta(s', a'), \quad \text{with } a' = \pi_\phi(s')$$
In DDPG, delayed (target) parameters $\theta'$ and $\phi'$ are used to provide stability, giving the delayed target:
$$y = r + \gamma Q_{\theta'}(s', a'), \quad a' = \pi_{\phi'}(s')$$
The weights of a target network are either updated periodically to exactly match the weights of the current network, or moved toward them by some proportion at each step, $\theta' \leftarrow \tau\theta + (1 - \tau)\theta'$, as in DDPG.
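A minimal sketch of these two target-network update schemes in PyTorch; the `nn.Linear` critic is only a stand-in, and $\tau = 0.005$ is an assumed typical value:

```python
import copy
import torch
import torch.nn as nn

def hard_update(target_net: nn.Module, net: nn.Module) -> None:
    # Periodically copy the current weights into the target network exactly.
    target_net.load_state_dict(net.state_dict())

def soft_update(target_net: nn.Module, net: nn.Module, tau: float = 0.005) -> None:
    # DDPG-style Polyak averaging: theta' <- tau * theta + (1 - tau) * theta'.
    with torch.no_grad():
        for p_t, p in zip(target_net.parameters(), net.parameters()):
            p_t.mul_(1.0 - tau).add_(tau * p)

# Example: a critic and its target copy.
critic = nn.Linear(4, 1)              # stand-in for a Q-network
critic_target = copy.deepcopy(critic)
soft_update(critic_target, critic)
```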
Overestimation Bias
In Q-learning with discrete actions, the value estimate is updated with a greedy target $y = r + \gamma \max_{a'} Q(s', a')$. If the target is susceptible to a zero-mean error $\epsilon$, the maximum over the noisy estimates will generally exceed the true maximum:
$$\mathbb{E}_\epsilon\!\left[\max_{a'}\left(Q(s', a') + \epsilon\right)\right] \geq \max_{a'} Q(s', a')$$
This consistent overestimation is then propagated through the Bellman update. The issue extends to the actor-critic setting, where the policy is updated via gradient descent.
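A small Monte Carlo illustration of this inequality (NumPy; the number of actions, the zero true values, and the noise scale are arbitrary choices for the demo):

```python
import numpy as np

rng = np.random.default_rng(0)

n_actions = 10
true_q = np.zeros(n_actions)     # true values are all 0, so max_a Q(s', a) = 0
noise_std = 1.0
n_trials = 100_000

# Monte Carlo estimate of E_eps[ max_a (Q(s', a) + eps_a) ].
noisy_max = np.max(true_q + rng.normal(0.0, noise_std, size=(n_trials, n_actions)), axis=1)

print(f"true max:          {true_q.max():.3f}")
print(f"mean of noisy max: {noisy_max.mean():.3f}  (positive => overestimation)")
```

Even though the noise is zero-mean per action, taking the maximum systematically picks out positive errors.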
Overestimation Bias in Actor-Critic
Assumptions:
- The policy is updated using the deterministic policy gradient (DPG).
- The normalization terms $Z_1, Z_2$ in the actor updates below are chosen to normalize the gradients (i.e., the gradient only provides a direction, with $Z^{-1}\lVert \mathbb{E}[\cdot] \rVert = 1$).
Given the current policy parameters $\phi$:
- Let $\phi_{\text{approx}}$ be the parameters from the actor update induced by the maximization of the approximate critic $Q_\theta(s, a)$:
$$\phi_{\text{approx}} = \phi + \frac{\alpha}{Z_1} \mathbb{E}_{s \sim p_\pi}\!\left[\nabla_\phi \pi_\phi(s)\, \nabla_a Q_\theta(s, a)\big|_{a = \pi_\phi(s)}\right]$$
The resulting policy is $\pi_{\text{approx}}$.
- Let $\phi_{\text{true}}$ be the parameters from the hypothetical actor update with respect to the true underlying value function $Q^\pi(s, a)$ (which is unknown during training):
$$\phi_{\text{true}} = \phi + \frac{\alpha}{Z_2} \mathbb{E}_{s \sim p_\pi}\!\left[\nabla_\phi \pi_\phi(s)\, \nabla_a Q^\pi(s, a)\big|_{a = \pi_\phi(s)}\right]$$
The resulting policy is $\pi_{\text{true}}$.
Since the gradient locally points in the direction of maximum increase:
- There exists $\epsilon_1$ sufficiently small such that if $\alpha \leq \epsilon_1$, then the approximate value of $\pi_{\text{approx}}$ is bounded below by the approximate value of $\pi_{\text{true}}$:
$$\mathbb{E}\!\left[Q_\theta(s, \pi_{\text{approx}}(s))\right] \geq \mathbb{E}\!\left[Q_\theta(s, \pi_{\text{true}}(s))\right]$$
- Conversely, there exists $\epsilon_2$ sufficiently small such that if $\alpha \leq \epsilon_2$, then the true value of $\pi_{\text{true}}$ is bounded below by the true value of $\pi_{\text{approx}}$:
$$\mathbb{E}\!\left[Q^\pi(s, \pi_{\text{true}}(s))\right] \geq \mathbb{E}\!\left[Q^\pi(s, \pi_{\text{approx}}(s))\right]$$
If, in expectation, the value estimate is at least as large as the true value with respect to $\pi_{\text{true}}$,
$$\mathbb{E}\!\left[Q_\theta(s, \pi_{\text{true}}(s))\right] \geq \mathbb{E}\!\left[Q^\pi(s, \pi_{\text{true}}(s))\right],$$
then for $\alpha < \min(\epsilon_1, \epsilon_2)$ the two bounds above combine to give
$$\mathbb{E}\!\left[Q_\theta(s, \pi_{\text{approx}}(s))\right] \geq \mathbb{E}\!\left[Q^\pi(s, \pi_{\text{approx}}(s))\right],$$
which means that in expectation the approximate value function overestimates the true value function with respect to the updated policy. This implies that policy improvement based on the approximate value function will over-estimate certain actions and can lead to sub-optimal policies.
Consequences of overestimation:
- Overestimation may develop into a more significant bias over many updates if left unchecked.
- An inaccurate value estimate may lead to poor policy updates.
The feedback loop between the actor and the critic is particularly prone to overestimation: suboptimal actions may be rated highly by a suboptimal critic, and the next policy update then reinforces those suboptimal actions.
Clipped Double Q-learning for Actor-Critic (Solution)
In Double DQN, the learning target uses the greedy action selected by the current value network rather than by the target network. In an actor-critic setting (DPG), the analogous update uses the current policy instead of the target policy in the learning target:
$$y = r + \gamma Q_{\theta'}(s', \pi_\phi(s'))$$
However, because the policy changes slowly in actor-critic methods, the current and target networks are too similar to give an independent estimate, and this offers little improvement.
Instead, we can go back to the original Double Q-learning formulation, with a pair of actors $(\pi_{\phi_1}, \pi_{\phi_2})$ and critics $(Q_{\theta_1}, Q_{\theta_2})$, where $\pi_{\phi_1}$ is optimized with respect to $Q_{\theta_1}$ and $\pi_{\phi_2}$ with respect to $Q_{\theta_2}$:
$$y_1 = r + \gamma Q_{\theta_2'}(s', \pi_{\phi_1}(s'))$$
$$y_2 = r + \gamma Q_{\theta_1'}(s', \pi_{\phi_2}(s'))$$
The paper's value-estimate plots show that this Double Q-learning actor-critic variant is more effective, but it does not entirely eliminate the over-estimation. One cause is that the two critics are not fully independent, due to the use of the opposite critic in the learning targets and the shared replay buffer. As a result, for some states $s$ we will have $Q_{\theta_2}(s, \pi_{\phi_1}(s)) > Q_{\theta_1}(s, \pi_{\phi_1}(s))$, even though $Q_{\theta_1}(s, \pi_{\phi_1}(s))$ already tends to overestimate.
One solution is to simply upper-bound the less biased value estimate $Q_{\theta_2}$ by the more biased estimate $Q_{\theta_1}$, i.e., take the minimum of the two estimates in the target:
$$y_1 = r + \gamma \min_{i=1,2} Q_{\theta_i'}(s', \pi_{\phi_1}(s'))$$
This is the Clipped Double Q-learning algorithm. With Clipped Double Q-learning, the exaggeration of overestimation is eliminated; however, an underestimation bias can occur. Underestimation is far less harmful, since the values of underestimated actions are not explicitly propagated through the policy update.
In implementation, computational costs can be reduced by using a single actor optimized with respect to $Q_{\theta_1}$; the same target $y_2 = y_1$ is then used to update both critics.
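A minimal sketch of the clipped target computation in PyTorch, assuming target networks `actor_target`, `critic1_target`, `critic2_target` (callables over batched tensors) and a replay-buffer batch with a `done` flag; both critics are then regressed toward this same `y`:

```python
import torch

def clipped_double_q_target(reward, done, next_state,
                            actor_target, critic1_target, critic2_target,
                            gamma: float = 0.99):
    """y = r + gamma * min_i Q_theta_i'(s', pi_phi'(s')) for non-terminal s'."""
    with torch.no_grad():
        next_action = actor_target(next_state)
        q1 = critic1_target(next_state, next_action)
        q2 = critic2_target(next_state, next_action)
        target_q = torch.min(q1, q2)          # clip to the smaller of the two estimates
        return reward + gamma * (1.0 - done) * target_q
```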
Addressing Variance
Besides its impact on overestimation bias, a high-variance estimate provides a noisy gradient for the policy update, which reduces learning speed and hurts performance in practice.
Since we never exactly learn $Q^\pi$, each update with function approximation leaves some residual TD-error $\delta(s, a)$:
$$Q_\theta(s, a) = r + \gamma \mathbb{E}\!\left[Q_\theta(s', a')\right] - \delta(s, a)$$
Then, unrolling the recursion:
$$Q_\theta(s_t, a_t) = r_t + \gamma \mathbb{E}\!\left[Q_\theta(s_{t+1}, a_{t+1})\right] - \delta_t = \mathbb{E}_{s_i \sim p_\pi,\, a_i \sim \pi}\!\left[\sum_{i=t}^{T} \gamma^{i-t} (r_i - \delta_i)\right]$$
Thus, the value function we estimate approximates the expected return minus the expected discounted sum of future TD-errors, rather than the expected return itself.
If the value estimate is a function of future rewards and estimation errors, its variance is proportional to the variance of those future rewards and errors; with a large discount factor $\gamma$, the variance can grow rapidly with each update unless the per-update error is kept small.
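A quick Monte Carlo check (NumPy; the horizon, error scale, and discount values are arbitrary) that the variance of the discounted sum of per-step errors grows sharply as the discount factor approaches 1:

```python
import numpy as np

rng = np.random.default_rng(0)
T, sigma, n_rollouts = 200, 1.0, 20_000

for gamma in (0.5, 0.9, 0.99):
    # delta_i ~ N(0, sigma^2), i.i.d. per step; accumulate sum_i gamma^i * delta_i.
    deltas = rng.normal(0.0, sigma, size=(n_rollouts, T))
    discounted = deltas @ (gamma ** np.arange(T))
    closed_form = (1 - gamma ** (2 * T)) / (1 - gamma ** 2)
    print(f"gamma={gamma:>4}: empirical Var ~= {discounted.var():.2f} "
          f"(closed form {closed_form:.2f})")
```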
Target Networks and Delayed Policy Updates
One way to reduce the estimation error is to slow down (delay) the updates to the target network. Consider the two extremes of the target update rate (the paper plots value estimates for several values of $\tau$):
- If the target network is simply the current network ($\tau = 1$), we obtain a batch semi-gradient TD algorithm for updating the critic, which may have high variance.
- If the target network is held fixed between infrequent updates, we obtain approximate value iteration.
If target networks can be used to reduce the error over multiple updates, and policy updates on high-error states cause divergent behavior, then the policy network should be updated at a lower frequency than the value network, to first minimize error before introducing a policy update.
Thus, delaying policy updates can be helpful in minimizing the estimation error: the policy and the target networks are updated only after a fixed number $d$ of critic updates, as in the schedule sketched below.
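A sketch of this update schedule; `do_critic_update` and `do_actor_and_target_update` are hypothetical callbacks standing in for the corresponding steps, and `policy_delay = 2` matches the delay $d$ used in the paper:

```python
def td3_update_schedule(num_iterations, do_critic_update, do_actor_and_target_update,
                        policy_delay=2):
    # Critics are trained every iteration; the actor and all target networks are only
    # refreshed every `policy_delay` iterations, so the value estimate can settle first.
    for it in range(num_iterations):
        do_critic_update(it)
        if it % policy_delay == 0:
            do_actor_and_target_update(it)

# Example with print stubs: iterations 0, 2, 4 trigger the delayed updates.
td3_update_schedule(6,
                    lambda it: print(f"iter {it}: critic update"),
                    lambda it: print(f"iter {it}: actor + target update"))
```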
Target Policy Smoothing Regularization
Since deterministic policies can overfit to narrow peaks in the value estimate, a regularization strategy is used for deep value learning: target policy smoothing, which mimics the learning update of SARSA. The idea is that similar actions should have similar values, which reduces the variance of the target:
$$y = r + \gamma\, \mathbb{E}_\epsilon\!\left[Q_{\theta'}(s', \pi_{\phi'}(s') + \epsilon)\right]$$
In practice, we can approximate this expectation over actions by adding a small amount of clipped random noise to the target policy's action and averaging over mini-batches:
$$y = r + \gamma Q_{\theta'}(s', \pi_{\phi'}(s') + \epsilon), \quad \epsilon \sim \text{clip}(\mathcal{N}(0, \sigma), -c, c)$$
The noise is clipped so that the smoothed action stays in a small region around the original target action.
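A minimal sketch of the smoothed target action in PyTorch; `actor_target` and the action bound `max_action` are assumed, and $\sigma = 0.2$, $c = 0.5$ follow the defaults reported in the paper:

```python
import torch

def smoothed_target_action(next_state, actor_target, max_action: float,
                           sigma: float = 0.2, noise_clip: float = 0.5):
    """pi_phi'(s') + eps, with eps ~ clip(N(0, sigma), -c, c), kept inside action bounds."""
    with torch.no_grad():
        action = actor_target(next_state)
        noise = (torch.randn_like(action) * sigma).clamp(-noise_clip, noise_clip)
        return (action + noise).clamp(-max_action, max_action)
```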
Algorithm
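The algorithm combines the three components above: clipped double-Q targets, delayed policy and target updates, and target policy smoothing. Below is a condensed sketch of one training iteration in PyTorch; the network objects, optimizer setup, batch layout (batched tensors with a `done` flag), and default hyperparameters are assumptions in the spirit of the paper, not the authors' reference code.

```python
import torch
import torch.nn.functional as F

def td3_train_step(it, batch, actor, actor_t, critic1, critic2, critic1_t, critic2_t,
                   actor_opt, critic_opt, max_action,
                   gamma=0.99, tau=0.005, sigma=0.2, noise_clip=0.5, policy_delay=2):
    """One TD3 iteration: critic update every call, actor/target update every `policy_delay` calls."""
    state, action, reward, next_state, done = batch

    # Critic targets: target policy smoothing + clipped double Q-learning.
    with torch.no_grad():
        noise = (torch.randn_like(action) * sigma).clamp(-noise_clip, noise_clip)
        next_action = (actor_t(next_state) + noise).clamp(-max_action, max_action)
        target_q = torch.min(critic1_t(next_state, next_action),
                             critic2_t(next_state, next_action))
        y = reward + gamma * (1.0 - done) * target_q

    # Both critics regress to the same clipped target y.
    critic_loss = F.mse_loss(critic1(state, action), y) + F.mse_loss(critic2(state, action), y)
    critic_opt.zero_grad()
    critic_loss.backward()
    critic_opt.step()

    # Delayed policy and target-network updates.
    if it % policy_delay == 0:
        actor_loss = -critic1(state, actor(state)).mean()   # DPG step w.r.t. Q_theta1 only
        actor_opt.zero_grad()
        actor_loss.backward()
        actor_opt.step()

        with torch.no_grad():                               # Polyak averaging of all target nets
            for t_net, net in ((critic1_t, critic1), (critic2_t, critic2), (actor_t, actor)):
                for p_t, p in zip(t_net.parameters(), net.parameters()):
                    p_t.mul_(1.0 - tau).add_(tau * p)
```

Here `critic_opt` would optimize the parameters of both critics, e.g. `torch.optim.Adam(list(critic1.parameters()) + list(critic2.parameters()))`, while data collection adds separate exploration noise to the actor's actions, as in DDPG.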
Conclusion
The paper focuses on resolving overestimation error in the actor-critic setting (DPG) and argues that failure can occur due to the interplay between the actor and critic updates: value estimates diverge through overestimation when the policy is poor, and the policy becomes poor if the value estimate itself is inaccurate (high variance).
Solutions:
- Overestimation error in value estimation: use Clipped Double Q-learning (the minimum of two critics) to estimate the target value.
- Estimation error (accumulated TD error) and high variance in value estimation: use delayed policy and target-network updates, so the value function is updated several times and becomes more stable and accurate before each policy update.
- Overfitting in value estimation: force the value estimate to be similar for similar actions (target policy smoothing), which smooths out the value estimate.
The general structure of the algorithm follows DDPG.