
Knapsack Problem

0–1 Knapsack Problem

Given weights and values of \(n\) items, put these items in a knapsack of capacity \(W\) to get the maximum total value in the knapsack.

\(f(i, j):\) the maximum total value you can get using only the first \(i\) items with a maximum capacity of \(j\).

Then, we can write our base cases:

  • \(f(0, :) = 0\)
  • \(f(:, 0) = 0\)

With 0 items the total value is 0, and no item can be put into a knapsack of capacity 0.

At \(f(i, j)\), compared to \(f(i - 1, j)\), we can choose whether or not to include item \(i\). So, we can write out the transition function \(f(i, j)\):

  1. If there is still spare capacity for item \(i\) at total capacity \(j\) (i.e., \(sum(weights[: i + 1]) \leq j\)), then we can simply add the value of item \(i\): \[f(i, j) = f(i - 1, j) + values[i]\]

  2. If there is no spare capacity for item \(i\) at \(j\) and the item's weight \(weights[i] > j\), then we only keep the previous maximum value, because item \(i\) cannot fit at all: \[f(i, j) = f(i - 1, j)\]

  3. If there is no spare capacity for item \(i\) at \(j\) but \(weights[i] \leq j\), then we can either include this item and discard some past items, or move on without including this item:

    1. The total value if we include this item: its value plus the maximum value we can get from the first \(i - 1\) items with the remaining capacity \(j - weights[i]\) (the item can be used at most once, so we exclude it from the subproblem): \[\text{total}_{include} = values[i] + f(i - 1, j - weights[i])\]

    2. The maximum value if we exclude the current item: \[f(i - 1, j)\]

    By taking the maximum over these two, we have:

    \[f(i, j) = \max(f(i - 1, j), \; values[i] + f(i - 1, j - weights[i]))\]
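Putting the three cases together, here is a minimal bottom-up sketch in Python (0-indexed `weights` and `values` lists and a capacity `W` are assumed as inputs):

```python
def knapsack_01(weights, values, W):
    """0-1 knapsack: f[i][j] = max value using the first i items with capacity j."""
    n = len(weights)
    # Base cases: f[0][:] = 0 and f[:][0] = 0.
    f = [[0] * (W + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        w, v = weights[i - 1], values[i - 1]   # item i (0-indexed in the input lists)
        for j in range(1, W + 1):
            if w > j:
                f[i][j] = f[i - 1][j]                   # case 2: item i cannot fit
            else:
                f[i][j] = max(f[i - 1][j],              # case 3: exclude item i
                              v + f[i - 1][j - w])      # case 3: include item i once
    return f[n][W]

# Example: items (weight, value) = (1, 1), (3, 4), (4, 5) with capacity 4 -> 5
print(knapsack_01([1, 3, 4], [1, 4, 5], 4))
```

Note that case 1 (everything so far still fits) is handled automatically by the `max`, since including item \(i\) is then at least as good as excluding it.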

Unbounded Knapsack Problem

Given a knapsack of capacity \(W\) and a set of \(n\) items, each with value \(v_i\) and weight \(w_i\), we need to compute the maximum total value that fits within the capacity. This differs from the classical knapsack problem in that here we are allowed to use an unlimited number of instances of each item.

\(f(i, j):\) the maximum total value you can get using only the first \(i\) items with a maximum capacity of \(j\).

Then, we can write our base cases:

  • \(f(0, :) = 0\)
  • \(f(:, 0) = 0\)

With 0 items the total value is 0, and no item can be put into a knapsack of capacity 0.

At \(f(i, j)\), compared to \(f(i - 1, j)\), we can choose whether or not to include item \(i\). So, we can write out the transition function \(f(i, j)\):

  1. If there is still spare capacity for item \(i\) at total capacity \(j\) (i.e., \(sum(weights[: i + 1]) \leq j\)), then we can simply add the value of item \(i\): \[f(i, j) = f(i - 1, j) + values[i]\]

  2. If there is no spare capacity for item \(i\) at \(j\) and the item's weight \(weights[i] > j\), then we only keep the previous maximum value, because item \(i\) cannot fit at all: \[f(i, j) = f(i - 1, j)\]

  3. If there is no spare capacity for item \(i\) at \(j\) but \(weights[i] \leq j\), then we can either include this item and discard some past items, or move on without including this item:

    1. The total value if we include this item: its value plus the maximum value we can get with the remaining capacity \(j - weights[i]\), still considering the first \(i\) items (the current item may be chosen again, since copies are unlimited): \[\text{total}_{include} = values[i] + f(i, j - weights[i])\]

    2. The maximum value if we exclude the current item: \[f(i - 1, j)\]

    By taking the maximum over these two, we have:

    \[f(i, j) = \max(f(i - 1, j), \; values[i] + f(i, j - weights[i]))\]
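The only change from the 0–1 version is the row index in the include branch (\(f(i, \cdot)\) instead of \(f(i - 1, \cdot)\)). A minimal sketch, again assuming `weights`, `values`, and `W` as inputs:

```python
def knapsack_unbounded(weights, values, W):
    """Unbounded knapsack: the include branch reuses row i, so item i can be taken again."""
    n = len(weights)
    f = [[0] * (W + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        w, v = weights[i - 1], values[i - 1]
        for j in range(1, W + 1):
            if w > j:
                f[i][j] = f[i - 1][j]              # item i cannot fit
            else:
                f[i][j] = max(f[i - 1][j],         # skip item i entirely
                              v + f[i][j - w])     # take another copy of item i
    return f[n][W]

# Example: weights (1, 50), values (1, 30), capacity 100 -> 100 copies of item 1 = 100
print(knapsack_unbounded([1, 50], [1, 30], 100))
```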

Multivariate Calculus (2)

Double Integrals

Double Integrals over Rectangles

In a similar manner to the definite integral, we consider a function \(f\) of two variables defined on a closed rectangle:

\[R = [a, b] \times [c, d] = \{(x, y) \in \mathbb{R}^2 \mid a \leq x \leq b, c \leq y \leq d\}\]

and we first suppose that \(f(x, y) \geq 0\). The graph of \(f\) is a surface with equation \(z = f(x, y)\). Let \(S\) be the solid that lies above \(R\) and under the graph of \(f\):

\[S = \{(x, y, z) \in \mathbb{R}^3 | a \leq x \leq b, c \leq y \leq d, 0 \leq z \leq f(x, y)\}\]

We first divide \(R\) into subrectangles. We accomplish this by dividing the interval \([a, b]\) into \(m\) subintervals \([x_{i-1}, x_{i}]\) of equal width \(\Delta x = \frac{b - a}{m}\) and dividing \([c, d]\) into \(n\) subintervals \([y_{j - 1}, y_j]\) of equal width \(\Delta y = \frac{d - c}{n}\). Then the subrectangles are defined as:

\[R_{i, j} = [x_{i - 1}, x_i] \times [y_{j - 1}, y_j] = \{(x, y) \mid x_{i - 1} \leq x \leq x_i, y_{j - 1} \leq y \leq y_j\}\]

Then the area of each subrectangle is:

\[\Delta A = \Delta x \Delta y\]

If we choose a sample point \((x^{*}_{i, j}, y^{*}_{i, j}) \in R_{i, j}\), then we can approximate the part of \(S\) that lies above each \(R_{i, j}\) by a column with height \(f(x^{*}_{i, j}, y^{*}_{i, j})\) and base area \(\Delta A\). The volume of this column is:

\[f(x^{*}_{i, j}, y^{*}_{i, j})\Delta A\]

Then the volume of \(S\) (when \(f(x, y) \geq 0\) for all \((x, y) \in R\)) can be approximated by:

\[V \approx \sum^{m}_{i=1} \sum^{n}_{j=1}f(x^{*}_{i, j}, y^{*}_{i, j})\Delta A\]

This sum is called a double Riemann sum.
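A small numerical sketch of the double Riemann sum, using midpoints as the sample points (the function and rectangle bounds below are arbitrary choices for illustration):

```python
import numpy as np

def double_riemann_sum(f, a, b, c, d, m, n):
    """Approximate the volume under z = f(x, y) over [a, b] x [c, d]
    with an m x n grid of subrectangles, sampling at midpoints."""
    dx, dy = (b - a) / m, (d - c) / n
    dA = dx * dy
    xs = a + dx * (np.arange(m) + 0.5)   # midpoint sample points in x
    ys = c + dy * (np.arange(n) + 0.5)   # midpoint sample points in y
    X, Y = np.meshgrid(xs, ys, indexing="ij")
    return np.sum(f(X, Y) * dA)

# Example: f(x, y) = x * y over [0, 1] x [0, 2]; the exact volume is 1.
print(double_riemann_sum(lambda x, y: x * y, 0, 1, 0, 2, m=100, n=100))  # ~1.0
```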


Understanding the Difficulty of Training Deep Feedforward Neural Networks

The analysis is driven by investigative experiments to monitor activations (watching for saturation of hidden units) and gradients, across layers and across training iterations.

Experimental Setting

Feedforward neural networks with one to five hidden layers, one thousand hidden units per layer, and a softmax logistic regression output layer are optimized. The cost function is the negative log-likelihood \(-\log p_{Y | X}(y | x)\), where \((x, y)\) is the (input image, target class) ordered pair. The neural networks were optimized with stochastic back-propagation on mini-batches of size ten. The average gradient \(g\) over the mini-batch is used to update the parameters \(\theta\) in that direction: \(\theta \leftarrow \theta - \epsilon g\).

We varied the type of non-linear activation function in the hidden layers:

  1. Sigmoid: \(\frac{1}{(1 + e^{-x})}\)
  2. Hyperbolic tangent: \(\tanh(x)\)
  3. Softsign: \(\frac{x}{1 + |x|}\) (similar to tanh but approaches its asymptotes much more slowly, i.e., with a softer slope)

The initial weights are drawn as \(W_{i, j} \sim U[-\frac{1}{\sqrt{n}}, \frac{1}{\sqrt{n}}]\) and the biases are initialized to 0, where \(n\) is the size of the previous layer.
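A quick NumPy sketch of this setup: the three activations and the uniform initialization described above (the layer sizes and input below are arbitrary, for illustration only):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softsign(x):
    return x / (1.0 + np.abs(x))

def init_layer(n_in, n_out, rng):
    """W ~ U[-1/sqrt(n), 1/sqrt(n)] with n the fan-in (previous layer size); biases start at 0."""
    bound = 1.0 / np.sqrt(n_in)
    W = rng.uniform(-bound, bound, size=(n_in, n_out))
    b = np.zeros(n_out)
    return W, b

rng = np.random.default_rng(0)
W, b = init_layer(n_in=784, n_out=1000, rng=rng)   # one hidden layer of 1000 units
x = rng.standard_normal((10, 784))                 # a mini-batch of 10 flattened inputs
h = np.tanh(x @ W + b)                             # swap in sigmoid(...) or softsign(...) to compare
```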


Rectifier Nonlinearities Improve Neural Network Acoustic Models

Background

A single hidden unit's activation \(h^{(i)}\) is given by:

\[h^{(i)} = \sigma ({\mathbf{w}^i}^T \mathbf{x})\]

Where \(\sigma (\cdot)\) is the tanh function.

Some drawbacks of tanh:

  1. Vanishing gradient problem when lower layers of a DNN have gradients of nearly 0 because higher layer units are nearly saturated at \(-1\) or \(1\).
  2. Does not produce a sparse representation in the sense of hard zeros.

ReLU addresses the vanishing gradient problem because active units have a constant gradient of 1. It also produces a sparse representation, which is useful in classification. However, an inactive unit may never activate, because gradient-based optimization will not adjust the weights of a unit that never activates initially. Thus, we might expect learning to be slow.

Leaky ReLU

LReLU allows for a small, non-zero gradient when the unit is saturated and not active:

\[ h_i = \begin{cases} {\mathbf{w}^i}^T \mathbf{x}, \quad & {\mathbf{w}^i}^T \mathbf{x} > 0\\ 0.01\,{\mathbf{w}^i}^T \mathbf{x}, \quad & {\mathbf{w}^i}^T \mathbf{x} \leq 0 \end{cases} \]
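A minimal NumPy sketch of the two rectifiers (the 0.01 negative slope matches the formula above; the input values are arbitrary):

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def leaky_relu(z, alpha=0.01):
    # Pass z through unchanged when positive, scale it by alpha otherwise.
    return np.where(z > 0, z, alpha * z)

z = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(relu(z))        # [0, 0, 0, 0.5, 2]
print(leaky_relu(z))  # [-0.02, -0.005, 0, 0.5, 2]
```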

Intrinsically Motivated Reinforcement Learning

Definitions

Extrinsic Motivation

Being moved to do something because of some specific rewarding outcome.

Intrinsic Motivation

Being moved to do something because it is inherently enjoyable. Intrinsic motivation leads organisms to engage in exploration, play and other behavior driven by curiosity in the absence of explicit reward.

Behavior Intrinsically Motivated

Psychologists call behavior intrinsically motivated when it is engaged in for its own sake rather than as a step toward solving a specific problem of clear practical value.

Reinforcement Learning of Skills


Go-Explore: a New Approach for Hard-Exploration Problems

Definitions

What is hard-exploration problem?

  1. Sparse rewards: precise sequences of many actions must be taken between obtaining rewards.
  2. Deceptive rewards: the reward function provides misleading feedback for reaching the overall global objective.

Both sparse and deceptive reward problems constitute "hard-exploration" problems.


What is Derailment?

Typical IM agents have two layers of exploration mechanisms:

  1. A higher-level intrinsic reward (IR) incentive that rewards the agent when new states are reached; IM agents rely on this mechanism to return to promising high-IR states.
  2. A more basic exploratory mechanism, such as epsilon-greedy exploration, action-space noise, or parameter-space noise; IM agents rely on this mechanism to explore around high-IR states.

However, returning to a previously discovered high-IR state requires a longer, more complex, and more precise sequence of actions. In this case, the lower-level exploration mechanism (e.g., epsilon-greedy) will "derail" the agent from ever returning to that state, because the needed precise sequence of actions is naively perturbed by the basic exploration mechanism.


What is detachment?

Detachment is the idea that an agent driven by intrinsic motivation could become detached from the frontiers of high intrinsic reward. IM is nearly always a consumable resource: a curious agent is curious about states to the extent that it has not often visited them. If an agent discovers multiple areas of the state space that produce high IR, its policy may in the short term focus on one such area. After exhausting some of the IR offered by that area, the policy may by chance begin consuming IR in another area. Once it has exhausted that IR, it is difficult for it to rediscover the frontier it detached from in the initial area, because it has already consumed the IR that led to that frontier and it likely will not remember how to return to that frontier due to catastrophic forgetting.

In theory, a replay buffer could prevent detachment, but in practice it would have to be large to prevent data about the abandoned frontier from being purged before it becomes needed, and large replay buffers introduce their own optimization stability difficulties.


Very Deep Convolutional Networks For Large-Scale Image Recognition

Generic Architecture

  1. Inputs are \(224 \times 224\) RGB images, with the mean RGB value subtracted from each pixel.
  2. The convolutional layers use small \(3 \times 3\) filters; one configuration (network C) also uses \(1 \times 1\) filters.
  3. Convolutions use padding 1 and stride 1 to preserve the spatial resolution.
  4. 5 max-pooling layers with a \(2 \times 2\) window and stride 2.
  5. 2 fully connected layers with 4096 units each, and a final FC output layer with 1000 units.
  6. ReLU activation function.

Configurations

  1. The shallowest network has 11 weight layers (8 conv layers and 3 FC layers); the deepest has 19 weight layers (16 conv layers and 3 FC layers).

  2. The widths of the conv layers are relatively small, starting at 64 channels in the first layer and doubling after each max-pooling layer until reaching 512.
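A rough PyTorch sketch of the shallowest configuration (11 weight layers); the layer widths follow the description above, but treat this as an illustrative reconstruction rather than the exact published model:

```python
import torch.nn as nn

def conv_block(in_ch, out_ch, n_convs):
    # n_convs 3x3 convolutions (padding 1, stride 1) followed by 2x2 max pooling with stride 2.
    layers = []
    for _ in range(n_convs):
        layers += [nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1), nn.ReLU(inplace=True)]
        in_ch = out_ch
    layers.append(nn.MaxPool2d(kernel_size=2, stride=2))
    return layers

vgg_11 = nn.Sequential(
    # 8 conv layers, widths 64 -> 128 -> 256 -> 512 -> 512, with 5 max-pooling stages
    *conv_block(3, 64, 1),
    *conv_block(64, 128, 1),
    *conv_block(128, 256, 2),
    *conv_block(256, 512, 2),
    *conv_block(512, 512, 2),
    nn.Flatten(),
    # 3 FC layers; 224 / 2^5 = 7, hence 512 * 7 * 7 input features
    nn.Linear(512 * 7 * 7, 4096), nn.ReLU(inplace=True),
    nn.Linear(4096, 4096), nn.ReLU(inplace=True),
    nn.Linear(4096, 1000),
)
```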


Addressing Function Approximation Error in Actor-Critic Methods

As with function approximation in discrete action spaces, overestimation also exists in the continuous control setting.

Background

The return is defined as the discounted sum of rewards:

\[R_t = \sum^{T}_{i=t} \gamma^{i - t} r(s_i, a_i)\]

The objective of RL is to find the optimal policy \(\pi_{\phi}\) with parameters \(\phi\) which is defined as:

\[J(\phi) = E_{s_i \sim P_{\pi_\phi}, a_i \sim \pi_{\phi}} [R_0]\]

In DPG, the policy (actor) parameters can be updated as:

\[\nabla_{\phi} J(\phi) = E_{s_i \sim \rho_{\pi_\phi}} [\nabla_{\phi} \pi_{\phi} \nabla_{a} Q^{\pi_{\phi}} (s, a) |_{a = \pi_{\phi}(s)}]\]

Where the critic is the action value function:

\[Q^{\pi_{\phi}} (s, a) = E_{s_i \sim P_{\pi_{\phi}}, a_i \sim \pi_{\phi}}[R_t | s, a] = \underbrace{r}_{\text{the reward is deterministic, so its expectation is itself}} + \gamma E_{s^{\prime}, a^{\prime}} [Q^{\pi} (s^{\prime}, a^{\prime}) | s, a]\]

In large state spaces, the critic can be approximated using function-approximation methods, for example DQN (approximate value iteration), in an off-policy fashion (with a replay buffer):

\[Q_{k + 1} = {\arg\min}_{Q} \|Q - \hat{T}^{\pi_{\phi}} Q_{k}\|^2_{\rho_{\pi_{\phi}}}\]

with \(Q_{k}\) parameterized by a target network \(Q_{\theta^{\prime}}\), so the target \(\hat{T}^{\pi_{\phi}} Q_{\theta^{\prime}}\) can be written as:

\[\hat{T}^{\pi_{\phi}} Q_{\theta^{\prime}} = r + \gamma Q_{\theta^{\prime}} (s^{\prime}, a^{\prime}), \quad \quad a^{\prime} \sim \pi_{\phi} (\cdot | s^{\prime})\]

In DDPG, delayed parameters are used to provide stability, so the delayed target \(\hat{T}^{\pi_{\phi^{\prime}}} Q_{\theta^{\prime}}\) is used instead:

\[\hat{T}^{\pi_{\phi^{\prime}}} Q_{\theta^{\prime}} = r + \gamma Q_{\theta^{\prime}} (s^{\prime}, a^{\prime}), \quad \quad a^{\prime} \sim \pi_{\phi^{\prime}} (\cdot | s^{\prime})\]

The weights of a target network are either updated periodically to exactly match the weights of the current network, or updated by some proportion at each step as in DDPG.
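A minimal PyTorch sketch of the two target-update styles (the network names and the value of `tau` are illustrative):

```python
import torch

def hard_update(target_net, net):
    # Periodically copy the current network's weights into the target network.
    target_net.load_state_dict(net.state_dict())

def soft_update(target_net, net, tau=0.005):
    # DDPG-style partial update: theta' <- tau * theta + (1 - tau) * theta'
    with torch.no_grad():
        for p_target, p in zip(target_net.parameters(), net.parameters()):
            p_target.mul_(1.0 - tau).add_(tau * p)
```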

Overestimation Bias

In Q-learning with discrete actions, if the target is susceptible to an error \(\epsilon\), then the maximum over the value along with its error will generally be greater than the true maximum, even if the error \(\epsilon\) has zero mean:

\[E_{\epsilon} [\max_{a^{\prime}} (Q(s^{\prime}, a^{\prime}) + \epsilon)] \geq \max_{a^{\prime}} Q(s^{\prime}, a^{\prime})\]
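A quick NumPy illustration of this inequality, with made-up Q-values and zero-mean Gaussian noise:

```python
import numpy as np

rng = np.random.default_rng(0)
q = np.array([1.0, 0.5, 0.0])      # hypothetical true Q(s', a') for three actions
true_max = q.max()                  # = 1.0

# Zero-mean noise added independently to each action's value estimate.
eps = rng.normal(0.0, 0.5, size=(100_000, q.size))
noisy_max = (q + eps).max(axis=1).mean()

print(true_max, noisy_max)          # the noisy maximum is biased upward (> 1.0 on average)
```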

This issue extends to the actor-critic setting, where the policy is updated via gradient descent.

Overestimation Bias in Actor-Critic

Assumptions:

  1. Policy is updated using DPG.
  2. \(Z_1, Z_2\) are chosen to normalize the gradient (i.e., \(Z^{-1} \|E[\cdot]\| = 1\), so the update only provides a direction)


Given current policy parameters \(\phi\), let:

  1. \(\phi_{approx}\) be the parameters from the actor update induced by the maximization of the approximate critic \(Q_{\theta} (s, a)\). The resulting policy is \(\pi_{approx}\). \[\phi_{approx} = \phi + \frac{\alpha}{Z_1} E_{s \sim \rho_{\pi_{\phi}}} [\nabla_{\phi} \pi_{\phi} \nabla_{a} Q_{\theta} (s, a) |_{a = \pi_{\phi}(s)}]\]

  2. \(\phi_{true}\) be the parameters from the hypothetical actor update with respect to the true underlying value function \(Q^{\pi_{\phi}} (s, a)\) (which is unknown during training). The resulting policy is \(\pi_{true}\). \[\phi_{true} = \phi + \frac{\alpha}{Z_2} E_{s \sim \rho_{\pi_{\phi}}} [\nabla_{\phi} \pi_{\phi} \nabla_{a} Q^{\pi_{\phi}} (s, a) |_{a = \pi_{\phi}(s)}]\]

Since the gradient always points in the direction of locally maximal increase, we have:

  1. \(\exists \epsilon_1\), such that if \(\alpha \leq \epsilon_1\), then the approximate value of \(\pi_{approx}\) will be bounded below by the approximate value of \(\pi_{true}\): \[E[Q_{\theta} (s, \pi_{approx} (s))] \geq E[Q_{\theta} (s, \pi_{true} (s))]\]

  2. \(\exists \epsilon_2\), such that if \(\alpha \leq \epsilon_2\), then the true value of \(\pi_{approx}\) will be bounded above by the true value of \(\pi_{true}\): \[E[Q^{\pi_{\phi}} (s, \pi_{true} (s))] \geq E[Q^{\pi_{\phi}} (s, \pi_{approx} (s))]\]

If: \[E[Q_{\theta} (s, \pi_{true} (s))] \geq E[Q^{\pi_{\phi}} (s, \pi_{true} (s))]\]

which means that, in expectation, the approximate value function is greater than the true value function under the policy \(\pi_{true}\) (e.g., for some state-action pairs \((s, \pi_{true} (s))\) the approximate value is much larger than the true value). Then, if \(\alpha < \min(\epsilon_1, \epsilon_2)\), the value estimate of \(\pi_{approx}\) will be overestimated: \[E[Q_{\theta} (s, \pi_{approx} (s))] \geq E[Q^{\pi_{\phi}} (s, \pi_{approx} (s))]\]

This implies that policy improvement based on the approximate value function can overestimate certain actions and lead to suboptimal policies.


Consequences of overestimation:

  1. Overestimation may develop into a more significant bias over many updates if left unchecked.
  2. Inaccurate value estimate may lead to poor policy updates.

The feedback loop between actor and critic is prone to overestimation: suboptimal actions may be rated highly by a suboptimal critic, which in turn reinforces those suboptimal actions in the next policy update.

Clipped Double Q-learning for Actor-Critic (Solution)

In Double DQN, the target is estimated using the greedy action from the current value network rather than the current target network. In an actor-critic setting (DPG), the analogous update uses the current policy rather than the target policy in the learning target:

\[y = r + \gamma Q_{\theta^{\prime}} (s^{\prime}, \pi_{\phi} (s^{\prime}))\]

However, because the policy changes slowly in actor-critic methods, the current and target networks are too similar to make independent estimates, and this offers little improvement.

Instead, we can go back to a formulation similar to Double Q-learning, with a pair of actors \((\pi_{\phi_1}, \pi_{\phi_2})\) and critics \((Q_{\theta_1}, Q_{\theta_2})\), where \(\pi_{\phi_1}\) is optimized w.r.t. \(Q_{\theta_1}\) and \(\pi_{\phi_2}\) is optimized w.r.t. \(Q_{\theta_2}\):

\[y_1 = r + \gamma Q_{\theta^{\prime}_2} (s^{\prime}, \pi_{\phi_1} (s^{\prime}))\] \[y_2 = r + \gamma Q_{\theta^{\prime}_1} (s^{\prime}, \pi_{\phi_2} (s^{\prime}))\]


Empirically, this Double Q-learning actor-critic variant is more effective, but it does not entirely eliminate the overestimation. One cause is the shared replay buffer; as a result, for some states \(s\) we would have \(Q_{\theta_2} (s, \pi_{\phi_1} (s)) > Q_{\theta_1} (s, \pi_{\phi_1} (s))\). This is problematic because \(Q_{\theta_1} (s, \pi_{\phi_1} (s))\) will generally overestimate the true value, so the overestimation is exaggerated.

One solution is to simply upper-bound the less biased value estimate \(Q_{\theta_2}\) by the biased estimate \(Q_{\theta_1}\):

\[y_1 = r + \gamma \min_{i=1, 2} Q_{\theta^{\prime}_{i}} (s^{\prime}, \pi_{\phi_1} (s^{\prime}))\]

This algorithm is called Clipped Double Q-learning. With Clipped Double Q-learning, the exaggeration of overestimation is eliminated; however, an underestimation bias can occur.

In implementation, computational cost can be reduced by using a single actor optimized w.r.t. \(Q_{\theta_1}\), and then using the same target \(y_2 = y_1\) for \(Q_{\theta_2}\). If \(Q_{\theta_2} > Q_{\theta_1}\), the update is identical to the standard update and induces no additional bias. If \(Q_{\theta_2} < Q_{\theta_1}\), this suggests overestimation has occurred.
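A hedged PyTorch sketch of the clipped target computation with a single actor (the network and variable names are illustrative, not from the paper's code):

```python
import torch

def clipped_double_q_target(reward, next_state, done,
                            actor_target, critic1_target, critic2_target, gamma=0.99):
    """Compute y = r + gamma * min_i Q_theta'_i(s', pi_phi'(s')) for a batch of transitions."""
    with torch.no_grad():
        next_action = actor_target(next_state)
        q1 = critic1_target(next_state, next_action)
        q2 = critic2_target(next_state, next_action)
        target_q = torch.min(q1, q2)                  # clip with the smaller estimate
        y = reward + gamma * (1.0 - done) * target_q  # no bootstrap at terminal states;
                                                      # the same target is used for both critics
    return y
```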

Addressing Variance

Besides their impact on overestimation bias, high-variance estimates provide a noisy gradient for the policy update, which reduces learning speed and hurts performance in practice.

Since we never learn \(Q^{\pi}\) exactly (we learn \(Q_{\theta}\) instead), there will always be some TD error \(\delta(s, a) = r + \gamma E[Q_{\theta} (s^{\prime}, a^{\prime}) \mid s, a] - Q_{\theta} (s, a)\) in each update:

\[\begin{aligned} r + \gamma E[Q_{\theta} (s^{\prime}, a^{\prime}) | s, a] - E[\delta(s, a) | s, a] &= \gamma E[Q_{\theta} (s^{\prime}, a^{\prime}) | s, a] - \gamma E[Q_{\theta} (s^{\prime}, a^{\prime}) | s, a] + Q_{\theta} (s, a)\\ &= Q_{\theta} (s, a) \end{aligned}\]

Then:

\[\begin{aligned} Q_{\theta} (s_t, a_t) &= r_t + \gamma E[Q_{\theta} (s_{t+1}, a_{t+1}) | s_t, a_t] - E[\delta_{t} | s_t, a_t]\\ &= r_t + \gamma E[r_{t+1} + \gamma E[Q_{\theta} (s_{t+2}, a_{t+2}) | s_{t+1}, a_{t+1}] - E[\delta_{t+1} | s_{t+1}, a_{t+1}] | s_t, a_t] - E[\delta_{t} | s_t, a_t]\\ & = E_{s_i \sim P_{\pi}, a_i \sim \pi} [\sum^{T}_{i=t} \gamma^{i - t} (r_i - \delta_i)] \end{aligned}\]

Thus, the value function we estimate, \(Q_{\theta} (s_t, a_t)\), approximates the expected return minus the expected discounted sum of future TD errors, rather than the expected return \(Q^{\pi}\).

If the value estimate (a sample of \(Q_{\theta}\)) is a function of future rewards and estimation errors, then the variance of the estimate will be proportional to the variance of the future rewards and estimation errors. If the variance of each term \(\gamma^{i - t} (r_i - \delta_i)\) is large, the accumulated variance will be large, especially when \(\gamma\) is large.

Target Networks and Delayed Policy Updates

One way to reduce the estimation error is to use a target network and delay its updates. Depending on the target-network update rate \(\tau\):

  1. If \(\tau = 1\), we obtain the batch semi-gradient TD algorithm for updating the critic, which may have high variance.
  2. If \(\tau < 1\), we obtain approximate value iteration.

If target networks can be used to reduce the error over multiple updates, and policy updates on high-error states cause divergent behavior, then the policy network should be updated at a lower frequency than the value network, to first minimize error before introducing a policy update.

Thus, delaying policy updates can help minimize the estimation error: the policy and target networks are updated only after a fixed number of updates \(d\) to the critic. To ensure the TD error remains small, the target network is also slowly updated by \(\theta^{\prime} \leftarrow \tau \theta + (1 - \tau) \theta^{\prime}\).

Target Policy Smoothing Regularization

Since deterministic policies can overfit to narrow peaks in the value estimate, a regularization strategy for deep value learning, target policy smoothing, is used; it mimics the learning update from SARSA. The idea is that similar actions should have similar values (which reduces the variance of the target):

\[y = r + E_{\epsilon} [Q_{\theta^{\prime}} (s^{\prime}, \pi_{\phi^{\prime}} (s^{\prime}) + \epsilon)]\]

In practice, we can approximate this expectation over actions (\(\pi_{\phi^{\prime}} (s^{\prime}) + \epsilon\)) by adding a small amount of random noise to the target policy and averaging over mini-batches:

\[y = r + \gamma Q_{\theta^{\prime}} (s^{\prime}, \pi_{\phi^{\prime}} (s^{\prime}) + \epsilon)\]

\[\epsilon \sim clip(N(0, \sigma), -c, c)\]

where the noise is clipped to keep the perturbed action close to the original action (i.e., within a small region around it).
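A small PyTorch sketch of the smoothed target action; `sigma` and `c` are hyperparameters, and the specific values below are only illustrative:

```python
import torch

def smoothed_target_action(actor_target, next_state, sigma=0.2, c=0.5):
    """Target policy smoothing: add clipped zero-mean Gaussian noise to the target policy's action."""
    with torch.no_grad():
        action = actor_target(next_state)
        noise = (torch.randn_like(action) * sigma).clamp(-c, c)  # epsilon ~ clip(N(0, sigma), -c, c)
        return action + noise
```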

Algorithm

Conclusion

The paper focuses on resolving overestimation error in the actor-critic setting (DPG) and discusses how failure can occur due to the interplay between the actor and critic updates. Value estimates diverge through overestimation when the policy is poor, and the policy becomes poor if the value estimate itself is inaccurate (high variance).

Solutions:

  1. Overestimation Error in value estimation: using clipped double Q-learning to estimate the target value \(y_1, y_2\) \[y_1 = r + \gamma \min_{i=1, 2} Q_{\theta^{\prime}_{i}} (s^{\prime}, \pi_{\phi^{\prime}} (s^{\prime}))\] \[y_2 = y_1\]

  2. Estimation error (TD error) and high variance in value estimation: by delaying policy updates and using target networks, we obtain a more stable and more accurate estimate of the value function, since the value function is updated several times before each policy update.

  3. Overfitting in Value estimation: by forcing the value estimate to be similar for similar actions, we smooth out the value estimation. \[y = r + \gamma Q_{\theta^{\prime}} (s^{\prime}, \pi_{\phi^{\prime}} (s^{\prime}) + \epsilon)\] \[\epsilon \sim clip(N(0, \sigma), -c, c)\]

The general structure of the algorithm follows DDPG.