Double Q-learning

The Overestimation Phenomenon

Assume that during learning the agent executes action \(a\) at state \(s\), resulting in the successor state \(s^{\prime}\) and an immediate reward \(r^{a}_{s}\). The Q-learning update can be written as:

\[Q(s, a) \leftarrow r^{a}_{s} + \gamma \max_{\hat{a}} Q(s^{\prime}, \hat{a})\]

It has been shown that repeated application of this update eventually yields \(Q\)-values that give rise to a policy maximizing the expected cumulative discounted reward. However, these results only apply when the \(Q\)-values are stored exactly (e.g., in a lookup table).
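For concreteness, here is a minimal tabular sketch of this update in Python (the state and action counts, the step size \(\alpha\), and the function name are illustrative assumptions; with \(\alpha = 1\) the rule reduces exactly to the assignment above):

```python
import numpy as np

# Illustrative sizes and hyperparameters (not from the text).
n_states, n_actions = 10, 4
gamma = 0.9   # discount factor
alpha = 1.0   # step size; alpha = 1 matches the update equation above

Q = np.zeros((n_states, n_actions))  # lookup-table representation of Q

def q_learning_update(s, a, r, s_next):
    """Apply one Q-learning update for the observed transition (s, a, r, s')."""
    target = r + gamma * Q[s_next].max()      # r^a_s + gamma * max_a' Q(s', a')
    Q[s, a] += alpha * (target - Q[s, a])     # move Q(s, a) toward the target
```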

Assume, instead, that \(Q\) is represented by a function approximator that induces some noise on the estimates of \(Q\). More specifically, let us assume that the currently stored \(Q\)-values, denoted by \(Q^{approx}\), represent some implicit target values \(Q^{target}\), corrupted by a noise term \(Y^{\hat{a}}_{s^{\prime}}\) that is due to the function approximator:

\[Q^{approx} (s^{\prime}, \hat{a}) = Q^{target} (s^{\prime}, \hat{a}) + Y^{\hat{a}}_{s^{\prime}}\]

Here the noise is modeled by the family of random variables \(Y^{\hat{a}}_{s^{\prime}}\) with zero mean. Consider the error \(Z_s\) that this noise induces in the Q-learning target:

\[\begin{aligned} Z_s &\triangleq r^{a}_{s} + \gamma \max_{\hat{a}}Q^{approx} (s^{\prime}, \hat{a}) - (r^{a}_{s} + \gamma \max_{\hat{a}}Q^{target} (s^{\prime}, \hat{a})) \\ &= \gamma (\max_{\hat{a}}Q^{approx} (s^{\prime}, \hat{a}) - \max_{\hat{a}}Q^{target} (s^{\prime}, \hat{a}))\\ \end{aligned}\]

Clearly, the noise causes the error \(Z_s\) on the left-hand side to be nonzero in general, even though each individual noise term \(Y^{\hat{a}}_{s^{\prime}}\) has zero mean.
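To see this error concretely, here is a minimal Monte Carlo sketch (assumptions: uniform zero-mean noise on \([-\epsilon, \epsilon]\), and equal target values \(Q^{target}(s^{\prime}, \hat{a})\) across all actions; the variable names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

gamma = 0.9            # discount factor
n_actions = 5
epsilon = 0.1          # noise scale (assumed uniform on [-epsilon, epsilon])
n_samples = 100_000    # number of simulated successor states s'

# Assume all target values at s' are equal (here zero), so that
# max_a Q^target(s', a) = 0 for every sample.
q_target = np.zeros(n_actions)

# Q^approx(s', a) = Q^target(s', a) + Y, with zero-mean noise Y.
noise = rng.uniform(-epsilon, epsilon, size=(n_samples, n_actions))
q_approx = q_target + noise

# Z_s = gamma * (max_a Q^approx(s', a) - max_a Q^target(s', a))
z = gamma * (q_approx.max(axis=1) - q_target.max())

print(f"mean of the noise Y : {noise.mean():+.5f}")  # approximately zero
print(f"mean of Z_s         : {z.mean():+.5f}")      # strictly positive
```

Even though each \(Y^{\hat{a}}_{s^{\prime}}\) is zero-mean, the \(\max\) operator tends to pick out the actions whose noise happens to be positive, so the sample mean of \(Z_s\) comes out positive in this setting.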