TD3

Addressing Function Approximation Error in Actor-Critic Methods

As with function approximation in discrete action spaces, overestimation also arises in the continuous control (actor-critic) setting.

Background

The return is defined as the discounted sum of rewards:

$$R_t = \sum_{i=t}^{T} \gamma^{i-t} r(s_i, a_i)$$

The objective of RL is to find the optimal policy $\pi_\phi$, with parameters $\phi$, which maximizes the expected return:

$$J(\phi) = \mathbb{E}_{s_i \sim p_\pi,\, a_i \sim \pi_\phi}[R_0]$$

In DPG, the policy (actor) parameters are updated via:

$$\nabla_\phi J(\phi) = \mathbb{E}_{s \sim \rho^{\pi_\phi}}\!\left[\nabla_\phi \pi_\phi(s)\, \nabla_a Q^{\pi_\phi}(s,a)\big|_{a=\pi_\phi(s)}\right]$$

where the critic is the action-value function:

$$Q^{\pi_\phi}(s,a) = \mathbb{E}_{s_i \sim p_\pi,\, a_i \sim \pi_\phi}[R_t \mid s,a] = r(s,a) + \gamma\,\mathbb{E}_{s',a'}\!\left[Q^{\pi_\phi}(s',a') \mid s,a\right]$$

(the reward is deterministic, so it comes out of the expectation).
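As a concrete illustration, here is a minimal sketch of the DPG actor update in PyTorch; `actor`, `critic`, `actor_optimizer`, and the `state` batch are hypothetical names, not from the paper:

```python
# A minimal sketch of the DPG actor update: ascend Q_theta(s, pi_phi(s)) w.r.t. phi.
# `actor` and `critic` are assumed torch.nn.Module networks, `state` a batch of states.

def dpg_actor_update(actor, critic, actor_optimizer, state):
    # Minimizing the negative critic value is equivalent to ascending the DPG objective.
    actor_loss = -critic(state, actor(state)).mean()
    actor_optimizer.zero_grad()
    actor_loss.backward()
    actor_optimizer.step()
```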

In large state spaces, the critic can be approximated using function approximation (FA) methods, for example as in DQN (approximate value iteration), in an off-policy fashion with a replay buffer:

$$Q_{k+1} = \arg\min_{Q} \left\lVert Q - \hat{\mathcal{T}}^{\pi_\phi} Q_k \right\rVert^2_{\rho^{\pi_\phi}}$$

with $Q_k$ parameterized by a network $Q_\theta$, so the target $\hat{\mathcal{T}}^{\pi_\phi} Q_\theta$ can be written as:

$$\hat{\mathcal{T}}^{\pi_\phi} Q_\theta = r + \gamma Q_\theta(s', a'), \qquad a' \sim \pi_\phi(\cdot \mid s')$$

In DDPG, delayed (target) parameters $\theta'$ and $\phi'$ are used to provide stability, and the delayed target is used instead:

$$\hat{\mathcal{T}}^{\pi_{\phi'}} Q_{\theta'} = r + \gamma Q_{\theta'}(s', a'), \qquad a' \sim \pi_{\phi'}(\cdot \mid s')$$

The weights of a target network are either updated periodically to exactly match the weights of the current network, or moved toward them by some proportion at each step (Polyak averaging), as in DDPG.
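Both schemes are simple to implement; a sketch, assuming `net` and `target_net` are `torch.nn.Module` instances with matching architectures:

```python
# Hard (periodic copy) vs. soft (Polyak, as in DDPG) target-network updates.
import torch

def hard_update(target_net, net):
    # Periodically copy the current weights into the target network exactly.
    target_net.load_state_dict(net.state_dict())

def soft_update(target_net, net, tau=0.005):
    # Move the target weights a small step toward the current weights:
    # theta_target <- tau * theta + (1 - tau) * theta_target
    with torch.no_grad():
        for p_target, p in zip(target_net.parameters(), net.parameters()):
            p_target.mul_(1.0 - tau).add_(tau * p)
```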

Overestimation Bias

In Q-learning with discrete actions, if the target is susceptible to an error $\epsilon$, then the maximum over the value along with its error will generally be greater than the true maximum, even if the error $\epsilon$ has zero mean:

$$\mathbb{E}_\epsilon\!\left[\max_{a'} \left(Q(s,a') + \epsilon\right)\right] \ge \max_{a'} Q(s,a')$$
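A small numerical check of this inequality, with zero-mean Gaussian noise added to a hand-picked set of action values:

```python
# Numerical check: with zero-mean noise on each action value, the expected
# maximum exceeds the true maximum (the values below are made up for illustration).
import numpy as np

rng = np.random.default_rng(0)
q_true = np.array([1.0, 0.5, 0.0])                 # true Q(s, a) for three actions
noise = rng.normal(0.0, 1.0, size=(100_000, 3))    # zero-mean error per action
noisy_max = (q_true + noise).max(axis=1).mean()

print(f"max_a Q(s,a)            = {q_true.max():.3f}")
print(f"E[max_a (Q(s,a) + eps)] = {noisy_max:.3f}")  # noticeably larger
```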

This issue extends to the actor-critic setting, where the policy is updated via gradient descent.

Overestimation Bias in Actor-Critic

Assumptions:

  1. Policy is updated using DPG.
  2. $Z_1, Z_2$ are chosen to normalize the gradient, i.e. $Z_1^{-1}\lVert \mathbb{E}[\cdot] \rVert = 1$ (the gradient only provides a direction).


Given current policy parameters ϕ, let:

  1. $\phi_{\text{approx}}$ be the parameters from the actor update induced by the maximization of the approximate critic $Q_\theta(s,a)$. The resulting policy is $\pi_{\text{approx}}$: $\phi_{\text{approx}} = \phi + \frac{\alpha}{Z_1}\,\mathbb{E}_{s \sim \rho^{\pi_\phi}}\!\left[\nabla_\phi \pi_\phi(s)\, \nabla_a Q_\theta(s,a)\big|_{a=\pi_\phi(s)}\right]$

  2. $\phi_{\text{true}}$ be the parameters from the hypothetical actor update with respect to the true underlying value function $Q^{\pi_\phi}(s,a)$ (which is unknown during training). The resulting policy is $\pi_{\text{true}}$: $\phi_{\text{true}} = \phi + \frac{\alpha}{Z_2}\,\mathbb{E}_{s \sim \rho^{\pi_\phi}}\!\left[\nabla_\phi \pi_\phi(s)\, \nabla_a Q^{\pi_\phi}(s,a)\big|_{a=\pi_\phi(s)}\right]$

Since the gradient always points in the direction of maximum local increase:

  1. $\exists\, \epsilon_1$ such that if $\alpha \le \epsilon_1$, the approximate value of $\pi_{\text{approx}}$ is bounded below by the approximate value of $\pi_{\text{true}}$: $\mathbb{E}[Q_\theta(s, \pi_{\text{approx}}(s))] \ge \mathbb{E}[Q_\theta(s, \pi_{\text{true}}(s))]$

  2. $\exists\, \epsilon_2$ such that if $\alpha \le \epsilon_2$, the true value of $\pi_{\text{approx}}$ is bounded above by the true value of $\pi_{\text{true}}$: $\mathbb{E}[Q^{\pi_\phi}(s, \pi_{\text{true}}(s))] \ge \mathbb{E}[Q^{\pi_\phi}(s, \pi_{\text{approx}}(s))]$

If: $\mathbb{E}[Q_\theta(s, \pi_{\text{true}}(s))] \ge \mathbb{E}[Q^{\pi_\phi}(s, \pi_{\text{true}}(s))]$,

which means that, in expectation, the approximate value function overestimates the true value function under $\pi_{\text{true}}$ (e.g. for some state-action pairs $(s, \pi_{\text{true}}(s))$ the approximate value is much larger than the true value), then for $\alpha < \min(\epsilon_1, \epsilon_2)$: $\mathbb{E}[Q_\theta(s, \pi_{\text{approx}}(s))] \ge \mathbb{E}[Q^{\pi_\phi}(s, \pi_{\text{approx}}(s))]$
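Chaining inequality 1, the assumption above, and inequality 2 (valid for $\alpha < \min(\epsilon_1, \epsilon_2)$) makes the conclusion explicit:

$$\mathbb{E}\!\left[Q_\theta(s, \pi_{\text{approx}}(s))\right] \ge \mathbb{E}\!\left[Q_\theta(s, \pi_{\text{true}}(s))\right] \ge \mathbb{E}\!\left[Q^{\pi_\phi}(s, \pi_{\text{true}}(s))\right] \ge \mathbb{E}\!\left[Q^{\pi_\phi}(s, \pi_{\text{approx}}(s))\right]$$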

That is, the value estimate of the policy produced by the approximate update is itself an overestimate; repeated policy improvement against the approximate value function can therefore overvalue certain actions and lead to suboptimal policies.


Consequences of overestimation:

  1. Overestimation may develop into a more significant bias over many updates if left unchecked.
  2. Inaccurate value estimate may lead to poor policy updates.

The feedback loop between actor and critic is prone to overestimation: suboptimal actions may be rated highly by a suboptimal critic, which reinforces those actions in the next policy update.

Clipped Double Q-learning for Actor-Critic (Solution)

In Double DQN, the target is estimated using the greedy action selected by the current value network (rather than the target network) and evaluated by the target network. Translated to the actor-critic (DPG) setting, the target uses the current policy with the target critic:

$$y = r + \gamma Q_{\theta'}(s', \pi_\phi(s'))$$

However, because the policy changes slowly in actor-critic methods, the current and target value networks are too similar to provide an independent estimate, so this offers little improvement.

Instead, we can go back to a formulation similar to Double Q-learning with a pair of actors $(\pi_{\phi_1}, \pi_{\phi_2})$ and critics $(Q_{\theta_1}, Q_{\theta_2})$, where $\pi_{\phi_1}$ is optimized w.r.t. $Q_{\theta_1}$ and $\pi_{\phi_2}$ w.r.t. $Q_{\theta_2}$:

$$y_1 = r + \gamma Q_{\theta'_2}(s', \pi_{\phi_1}(s')) \qquad y_2 = r + \gamma Q_{\theta'_1}(s', \pi_{\phi_2}(s'))$$
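A sketch of these two targets as a helper function; the target critics `critic1_targ`, `critic2_targ` and the actors `actor1`, `actor2` are assumed names:

```python
# Double Q-learning targets in the actor-critic setting: each critic is regressed
# toward a target evaluated by the *other* target critic, using the actor optimized
# against it.
import torch

def double_q_targets(critic1_targ, critic2_targ, actor1, actor2,
                     reward, next_state, gamma=0.99):
    with torch.no_grad():
        y1 = reward + gamma * critic2_targ(next_state, actor1(next_state))
        y2 = reward + gamma * critic1_targ(next_state, actor2(next_state))
    return y1, y2
```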


As the paper's measurements show, Double Q-learning in the actor-critic setting is more effective, but it does not entirely eliminate the overestimation. One cause is the shared replay buffer; as a result, for some states $s$ we have $Q_{\theta_2}(s, \pi_{\phi_1}(s)) > Q_{\theta_1}(s, \pi_{\phi_1}(s))$. This is problematic because $Q_{\theta_1}(s, \pi_{\phi_1}(s))$ already generally overestimates the true value, so the overestimation is exaggerated further.

One solution is to simply upper-bound the less biased value estimate Qθ2 by the biased estimate Qθ1:

$$y_1 = r + \gamma \min_{i=1,2} Q_{\theta'_i}(s', \pi_{\phi_1}(s'))$$

This is the Clipped Double Q-learning algorithm. With it, the exaggerated overestimation is eliminated; however, an underestimation bias can occur instead. Underestimation is far less harmful, since the value of underestimated actions is not explicitly propagated through the policy update.

In implementation, computational cost can be reduced by using a single actor optimized w.r.t. $Q_{\theta_1}$, and using the same target $y_2 = y_1$ for $Q_{\theta_2}$. If $Q_{\theta_2} > Q_{\theta_1}$, the update is identical to the standard update and induces no additional bias. If $Q_{\theta_2} < Q_{\theta_1}$, this suggests overestimation has occurred, and the smaller (less biased) estimate is used, as in Double Q-learning.
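A sketch of the resulting clipped target with a single target actor; `critic1_targ`, `critic2_targ`, and `actor_targ` are assumed names:

```python
# Clipped Double Q-learning target with a single (target) actor: take the minimum
# of the two target critics and regress both critics toward the same y.
import torch

def clipped_double_q_target(critic1_targ, critic2_targ, actor_targ,
                            reward, next_state, gamma=0.99):
    with torch.no_grad():
        next_action = actor_targ(next_state)
        q_min = torch.min(critic1_targ(next_state, next_action),
                          critic2_targ(next_state, next_action))
        y = reward + gamma * q_min   # shared target: y1 = y2 = y
    return y
```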

Addressing Variance

Beyond its effect on overestimation bias, a high-variance value estimate provides a noisy gradient for the policy update, which reduces learning speed and hurts performance in practice.

Since we never learn $Q^\pi$ exactly (we learn $Q_\theta$ instead), there will always be some TD error $\delta(s,a) = r + \gamma\,\mathbb{E}[Q_\theta(s',a') \mid s,a] - Q_\theta(s,a)$ in each update, so that:

$$r + \gamma\,\mathbb{E}[Q_\theta(s',a') \mid s,a] - \mathbb{E}[\delta(s,a) \mid s,a] = \gamma\,\mathbb{E}[Q_\theta(s',a') \mid s,a] - \gamma\,\mathbb{E}[Q_\theta(s',a') \mid s,a] + Q_\theta(s,a) = Q_\theta(s,a)$$

Then:

$$\begin{aligned}
Q_\theta(s_t, a_t) &= r_t + \gamma\,\mathbb{E}[Q_\theta(s_{t+1}, a_{t+1}) \mid s_t, a_t] - \mathbb{E}[\delta_t \mid s_t, a_t] \\
&= r_t + \gamma\,\mathbb{E}\big[r_{t+1} + \gamma\,\mathbb{E}[Q_\theta(s_{t+2}, a_{t+2}) \mid s_{t+1}, a_{t+1}] - \mathbb{E}[\delta_{t+1} \mid s_{t+1}, a_{t+1}] \,\big|\, s_t, a_t\big] - \mathbb{E}[\delta_t \mid s_t, a_t] \\
&= \mathbb{E}_{s_i \sim p_\pi,\, a_i \sim \pi}\!\left[\sum_{i=t}^{T} \gamma^{i-t}(r_i - \delta_i)\right]
\end{aligned}$$

Thus, the value estimate $Q_\theta(s_t, a_t)$ approximates the expected return minus the expected discounted sum of future TD errors, rather than the true expected return $Q^\pi$.

Since the value estimate is a function of future rewards and estimation errors, its variance is proportional to the variance of those future rewards and errors. If the variance of each term $\gamma^{i-t}(r_i - \delta_i)$ is large, the accumulated variance will be large, especially when $\gamma$ is large.
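A quick numerical illustration of this accumulation, using i.i.d. synthetic noise as a stand-in for the per-step terms $\gamma^{i-t}(r_i - \delta_i)$:

```python
# Synthetic illustration: the variance of a discounted sum of i.i.d. noise terms
# grows sharply with gamma.
import numpy as np

rng = np.random.default_rng(0)
T, n_rollouts = 200, 50_000
terms = rng.normal(0.0, 1.0, size=(n_rollouts, T))   # unit-variance per-step noise

for gamma in (0.9, 0.99):
    discounts = gamma ** np.arange(T)
    var = (terms * discounts).sum(axis=1).var()
    print(f"gamma = {gamma}: variance of discounted sum = {var:.1f}")
```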

Target Networks and Delayed Policy Updates

One way to reduce the estimation error is to use a slowly-updated target network for the critic. Comparing target-network update rates $\tau$ (as in the paper's experiments):

  1. If $\tau = 1$ (the target always matches the current network), we obtain a batch semi-gradient TD algorithm for updating the critic, which may have high variance.
  2. If $\tau < 1$, the targets change slowly and we obtain something closer to approximate value iteration.

If target networks can be used to reduce the error over multiple updates, and policy updates on high-error states cause divergent behavior, then the policy network should be updated at a lower frequency than the value network, to first minimize error before introducing a policy update.

Thus, delaying policy updates helps minimize the estimation error: the policy and target networks are updated only after a fixed number of updates $d$ to the critic. To ensure the TD error remains small, the target networks are also updated slowly: $\theta' \leftarrow \tau\theta + (1-\tau)\theta'$.
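A minimal skeleton of this schedule, with hypothetical `critic_update`, `actor_update`, and `soft_update_targets` callables (a full update step is sketched under Algorithm below):

```python
# Delayed-update schedule: the critic is updated every step, while the actor
# and target networks are updated only every d-th step.

def training_loop(critic_update, actor_update, soft_update_targets,
                  total_steps=1_000_000, d=2):
    for step in range(1, total_steps + 1):
        critic_update()                  # value update every step
        if step % d == 0:                # delayed policy and target updates
            actor_update()
            soft_update_targets()        # theta' <- tau*theta + (1 - tau)*theta'
```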

Target Policy Smoothing Regularization

Since deterministic policies can overfit to narrow peaks in the value estimate, a regularization strategy is used for deep value learning: target policy smoothing, which mimics the learning update of SARSA. The idea is that similar actions should have similar values (which also reduces the variance of the target):

$$y = r + \gamma\,\mathbb{E}_\epsilon\!\left[Q_{\theta'}(s', \pi_{\phi'}(s') + \epsilon)\right]$$

In practice, we can approximate this expectation over actions ($\pi_{\phi'}(s') + \epsilon$) by adding a small amount of random noise to the target policy's action and averaging over mini-batches:

$$y = r + \gamma Q_{\theta'}(s', \pi_{\phi'}(s') + \epsilon), \qquad \epsilon \sim \mathrm{clip}(\mathcal{N}(0, \sigma), -c, c)$$

where the noise is clipped so that the perturbed action stays in a small region around the original action.
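A sketch of how the smoothed target action can be computed; `actor_targ`, `max_action`, and the noise scales are assumptions (the values mirror common defaults):

```python
# Target policy smoothing: perturb the target policy's action with clipped,
# zero-mean Gaussian noise before evaluating the target critics.
import torch

def smoothed_target_action(actor_targ, next_state, max_action=1.0,
                           sigma=0.2, noise_clip=0.5):
    with torch.no_grad():
        action = actor_targ(next_state)
        # epsilon ~ clip(N(0, sigma), -c, c): stay in a small region around the action
        noise = (torch.randn_like(action) * sigma).clamp(-noise_clip, noise_clip)
        # also keep the perturbed action inside the valid action range
        return (action + noise).clamp(-max_action, max_action)
```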

Algorithm
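
The paper's algorithm box is not reproduced here; below is a condensed sketch of one TD3 update step combining clipped double Q-learning, delayed policy updates, and target policy smoothing. All module, optimizer, and buffer names, as well as the hyperparameter values, are assumptions for illustration (a single optimizer over both critics' parameters is assumed).

```python
# Condensed sketch of one TD3 update step.
import torch
import torch.nn.functional as F

def td3_update(step, actor, critic1, critic2, actor_targ, critic1_targ, critic2_targ,
               actor_opt, critic_opt, replay_buffer, batch_size=256, gamma=0.99,
               tau=0.005, policy_delay=2, sigma=0.2, noise_clip=0.5, max_action=1.0):
    state, action, reward, next_state, not_done = replay_buffer.sample(batch_size)

    with torch.no_grad():
        # Target policy smoothing: clipped noise around the target policy's action.
        noise = (torch.randn_like(action) * sigma).clamp(-noise_clip, noise_clip)
        next_action = (actor_targ(next_state) + noise).clamp(-max_action, max_action)
        # Clipped double Q-learning: both critics share the min-based target.
        q_targ = torch.min(critic1_targ(next_state, next_action),
                           critic2_targ(next_state, next_action))
        y = reward + gamma * not_done * q_targ

    # Critic update every step.
    critic_loss = (F.mse_loss(critic1(state, action), y)
                   + F.mse_loss(critic2(state, action), y))
    critic_opt.zero_grad()
    critic_loss.backward()
    critic_opt.step()

    # Delayed actor and target updates.
    if step % policy_delay == 0:
        actor_loss = -critic1(state, actor(state)).mean()
        actor_opt.zero_grad()
        actor_loss.backward()
        actor_opt.step()

        with torch.no_grad():
            for targ, src in ((actor_targ, actor), (critic1_targ, critic1),
                              (critic2_targ, critic2)):
                for p_targ, p in zip(targ.parameters(), src.parameters()):
                    p_targ.mul_(1.0 - tau).add_(tau * p)
```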

Conclusion

The paper focuses on resolving overestimation error in the actor-critic (DPG) setting and argues that failure can occur through the interplay between the actor and critic updates: value estimates diverge through overestimation when the policy is poor, and the policy becomes poor when the value estimate is inaccurate (high variance).

Solutions:

  1. Overestimation error in value estimation: use Clipped Double Q-learning to estimate the target values $y_1, y_2$: $y_1 = r + \gamma \min_{i=1,2} Q_{\theta'_i}(s', \pi_{\phi'}(s'))$, $y_2 = y_1$

  2. Estimation error (TD error) and high variance in value estimation: by using delayed policy updates together with slowly-updated target networks, the value function is updated several times before each policy update, giving a more stable and more accurate value estimate.

  3. Overfitting in value estimation: by forcing the value estimate to be similar for similar actions, we smooth out the value estimate: $y = r + \gamma Q_{\theta'}(s', \pi_{\phi'}(s') + \epsilon)$, $\epsilon \sim \mathrm{clip}(\mathcal{N}(0,\sigma), -c, c)$

The general structure of the algorithm follows DDPG.