Policy Gradient (3)

Policy Search Methods (REINFORCE)

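We already know that, if \((X, A)\) is sampled from the correct distribution and we have \(Q^{\pi_{\theta}} (X, A)\) or an unbiased estimate of it, then the following is an unbiased sample of the policy gradient in the discounted setting, where \(k\) is the time step at which \((X, A)\) is visited: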

\[\gamma^{k} \nabla \log \pi_{\theta}(A | X) Q^{\pi_{\theta}} (X, A)\]

The remaining issue is the computation of \(Q^{\pi_{\theta}} (X, A)\). This is essentially a Policy Evaluation problem, and we may use various action-value function estimators.


A simple approach uses Monte Carlo (MC) estimates of \(Q^{\pi_{\theta}} (X, A)\). This leads to what is known as the REINFORCE algorithm:

In the on-policy setting, when the agent follows \(\pi_{\theta}\):

  • It generates the sequence \(X_1, A_1, R_1, X_2, A_2, R_2, \dots\) with \(A_{t} \sim \pi_{\theta} (\cdot | X_t)\) and \(X_t \sim \rho^{\pi_{\theta}}\).
  • Then \(G_{t}^{\pi_{\theta}} = \sum_{k \geq t} \gamma^{k - t} R_{k}\) is an unbiased estimator of \(Q^{\pi_{\theta}} (X_t, A_t)\).
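  • We replace the action-value function \(Q^{\pi_{\theta}} (X_t, A_t)\) in the policy gradient sample with the estimate \(G_{t}^{\pi_{\theta}}\).
  • Although it is unbiased, the return has high variance.
  • We can subtract a state-dependent baseline function (usually the value function) from the return to reduce the variance. Because the baseline does not depend on the action, this does not introduce bias.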
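
The steps above can be sketched in code. The following is a minimal illustration, not a definitive implementation. It assumes a tabular softmax policy with parameters `theta` of shape (number of states, number of actions), an episode stored as a list of (state, action, reward) tuples, and time steps counted from 0, so that the discount factor on step t is \(\gamma^{t}\). The function names (`discounted_returns`, `action_probs`, `grad_log_pi`, `reinforce_gradient`) and the `baseline` array are hypothetical and chosen only for this example.

```python
import numpy as np


def discounted_returns(rewards, gamma):
    """Return G_t = sum_{k >= t} gamma^(k - t) * R_k for every step t."""
    returns = np.zeros(len(rewards))
    running = 0.0
    # Sweep backwards so each G_t reuses G_{t+1}: G_t = R_t + gamma * G_{t+1}.
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


def action_probs(theta, x):
    """Tabular softmax policy pi_theta(. | x) (an assumed parameterization)."""
    logits = theta[x] - np.max(theta[x])  # shift for numerical stability
    exp = np.exp(logits)
    return exp / exp.sum()


def grad_log_pi(theta, x, a):
    """Gradient of log pi_theta(a | x) w.r.t. theta (same shape as theta)."""
    grad = np.zeros_like(theta, dtype=float)
    # For a softmax policy only row x is non-zero: one_hot(a) - pi(. | x).
    grad[x] = -action_probs(theta, x)
    grad[x, a] += 1.0
    return grad


def reinforce_gradient(theta, episode, gamma, baseline=None):
    """One-episode REINFORCE estimate of the policy gradient.

    episode:  list of (state, action, reward) tuples gathered by following pi_theta.
    baseline: optional array b[x] of state-dependent baselines (hypothetical input).
    """
    rewards = [r for (_, _, r) in episode]
    returns = discounted_returns(rewards, gamma)
    grad = np.zeros_like(theta, dtype=float)
    for t, (x, a, _) in enumerate(episode):
        b = 0.0 if baseline is None else baseline[x]
        # gamma^t * grad log pi(A_t | X_t) * (G_t - b(X_t))
        grad += (gamma ** t) * grad_log_pi(theta, x, a) * (returns[t] - b)
    return grad


# Usage sketch: one gradient-ascent step on a made-up two-step episode.
example_theta = np.zeros((2, 2))
example_episode = [(0, 1, 1.0), (1, 0, 0.0)]
example_grad = reinforce_gradient(example_theta, example_episode, gamma=0.9)
example_theta = example_theta + 0.1 * example_grad  # step size 0.1 is arbitrary
```

Passing a `baseline` array changes only the factor multiplying each score term, from \(G_t\) to \(G_t - b(X_t)\). In practice the baseline would typically be a learned estimate of the value function rather than a fixed array.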
