Policy Gradient (1)

Policy Search Methods (Introduction)

So far, we have encountered methods that compute the optimal policy based on the computation of the value function. In those methods, only the value function was explicitly represented. There are also methods that explicitly represent the policy and optimize the agent's performance by searching in the space of policies. These methods are called policy search methods.

Consider a stochastic policy \(\pi_\theta: X \rightarrow M(A)\) that maps the state space to distributions over the action space and is parameterized by \(\theta \in \Theta\). The set \(\Theta\) is the parameter space. The space of all such parameterized policies can be described as:

\[\Pi_\Theta = \{\pi_\theta: X \rightarrow M(A) \mid \theta \in \Theta\}\]

This space depends on the parameterization \(\theta \mapsto \pi_\theta\) and on the parameter space \(\Theta\).

One example is parameterizing the policy by a softmax distribution. Given a function \(f_{\theta}: X \times A \rightarrow \mathbb{R}\), the probability of choosing action \(a\) at state \(x\) is:

\[\pi_{\theta} (a \mid x) = \frac{\exp(f_{\theta} (x, a))}{\sum_{a^{\prime}} \exp(f_{\theta} (x, a^{\prime}))}\]

Another example is a normal (Gaussian) policy with mean \(\mu_\theta(x)\) and covariance \(\Sigma_\theta(x)\):

\[\pi_{\theta} (\cdot \mid x) = \mathcal{N}(\mu_\theta(x), \Sigma_{\theta} (x))\]
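As a concrete illustration, here is a minimal sketch of the softmax parameterization above, assuming a linear score \(f_\theta(x, a) = \theta^\top \phi(x, a)\) with a hypothetical feature map \(\phi\) (the linear form and the toy features are my assumptions, not part of the text):

```python
import numpy as np

def softmax_policy(theta, features, x, actions):
    """Compute pi_theta(. | x) for f_theta(x, a) = theta^T features(x, a)."""
    scores = np.array([theta @ features(x, a) for a in actions])
    scores -= scores.max()                      # subtract max for numerical stability
    probs = np.exp(scores)
    return probs / probs.sum()

# Toy check: two actions, one-hot action features, theta favoring action 0.
features = lambda x, a: np.eye(2)[a]            # hypothetical feature map phi(x, a)
theta = np.array([1.0, -1.0])
print(softmax_policy(theta, features, x=None, actions=[0, 1]))  # ~[0.88, 0.12]
```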

Why Parameterize the Policy?

Explicit parameterization of the policy makes it easy to choose actions in a continuous action space. For example, if the policy is a normal distribution over actions, we can simply sample an action from that distribution (see the sketch after the list below). For value-based methods, this can be challenging:

  • Even if we know \(Q^{*}\), acting greedily requires maximizing \(Q^{*}(x, \cdot)\) over the action space at every decision, which is generally intractable when the action space is continuous.
  • Both VI and PI require repeated computation of the greedy policy.
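To make the contrast concrete, the following sketch samples a continuous action from a Gaussian policy in one line; the mean \(\mu_\theta(x)\) is taken to be a hypothetical linear map and the covariance a fixed \(\sigma^2 I\) (both are my simplifying assumptions). A value-based agent would instead have to maximize \(Q(x, \cdot)\) over the entire continuous action set at every step.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_gaussian_policy(x, W, b, sigma=0.1):
    """Sample a ~ N(mu_theta(x), sigma^2 I), with mu_theta a (hypothetical) linear map."""
    mu = W @ x + b                              # mean of the action distribution
    return mu + sigma * rng.standard_normal(mu.shape)

x = np.array([0.5, -0.2])                       # a toy 2-dimensional state
W, b = np.ones((1, 2)), np.zeros(1)             # theta = (W, b)
action = sample_gaussian_policy(x, W, b)        # one continuous action, no argmax over A
print(action)
```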

How Can We Compare Policies?

Performance can be measured in various ways. In value function approximation (VFA), we have the Bellman residual, the Bellman error, the projected Bellman error, and so on (but the underlying idea is still to maximize the expected return). In policy gradient methods, we also maximize the expected return of following \(\pi_\theta\); at the same time, we may incorporate the variance of the return or other risk measures.

Our goal is to find a policy that maximizes this performance measure within the space \(\Pi_\Theta\).

Reasonable Choice of Performance Measure

Assume that we only care about the performance at a single state \(x \in X\). Then the goal of policy search is:

\[\mathop{\mathrm{argmax}}_{\pi \in \Pi_{\Theta}} V^{\pi} (x) = \mathop{\mathrm{argmax}}_{\theta \in \Theta} V^{\pi_{\theta}} (x)\]

In this case, we care only about state \(x\): its performance, measured by the expected return from \(x\), is maximized. Such a policy only needs to be good at the starting state \(x\) and may ignore the performance at other states, so the obtained policy is initial-state-dependent. If we change \(x\) to another state \(x^{\prime} \neq x\), we may get a different optimal policy (according to this performance measure).

Recall that the optimal value function \(V^{*}\) is optimal at every state: the optimal policy \(\pi^{*}\) maximizes the value function not only at this particular state \(x\), but also at every other \(x^{\prime} \in X\). So, instead of the extreme case of considering only a single initial state \(x\), we can let the initial state be distributed according to some distribution \(\rho \in M(X)\) (this formulation is emphasized in the MDP blog post, where a trajectory starts by sampling the initial state from this distribution).

The performance measure is then the expected return of following \(\pi_\theta\), averaged over the initial state distribution \(X_1 \sim \rho\):

\[J(\pi_\theta) = J_\rho(\pi_\theta) \triangleq \int_{x_1} V^{\pi_{\theta}} (x_1) d\rho(x_1) = E_{X_1 \sim \rho}[V^{\pi_{\theta}}(X_1)]\]

In this case:

  • The optimal policy maximizes \(J_\rho\)
  • \(J_\rho (\pi_\theta) \leq J_\rho (\pi^*)\) for any \(\pi_{\theta} \in \Pi_{\Theta}\)
  • If \(\pi^{*} \notin \Pi_{\Theta}\), we may have \(J_\rho (\pi^*) > J_\rho (\pi_\theta)\) for every \(\pi_{\theta} \in \Pi_{\Theta}\), i.e., the best policy within \(\Pi_{\Theta}\) can be strictly worse than \(\pi^{*}\)
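Given only the ability to sample \(X_1 \sim \rho\) and to roll out \(\pi_\theta\) in the environment, the performance measure \(J_\rho(\pi_\theta)\) defined above can be estimated by Monte Carlo: average the (discounted) returns of episodes started from \(X_1 \sim \rho\). A minimal sketch, assuming a Gymnasium-style `env.reset()`/`env.step()` interface whose reset distribution plays the role of \(\rho\), and a callable `policy` (both are my assumptions, not part of the text):

```python
import numpy as np

def estimate_J(policy, env, gamma=0.99, n_episodes=100, horizon=200):
    """Monte Carlo estimate of J_rho(pi) = E_{X_1 ~ rho}[V^pi(X_1)]."""
    returns = []
    for _ in range(n_episodes):
        x, _ = env.reset()                      # draws X_1 ~ rho (assumed)
        G, discount = 0.0, 1.0
        for _ in range(horizon):
            a = policy(x)
            x, r, terminated, truncated, _ = env.step(a)
            G += discount * r
            discount *= gamma
            if terminated or truncated:
                break
        returns.append(G)
    return float(np.mean(returns))              # average return approximates J_rho(pi)
```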

Goal

So, in policy search methods, we aim to find the policy that maximizes the performance measure within \(\Pi_{\Theta}\):

\[\bar{\pi} = \mathop{\mathrm{argmax}}_{\pi_{\theta} \in \Pi_{\Theta}} J_{\rho} (\pi_{\theta})\]

  • The maximizer is denoted by \(\bar{\pi}\) (rather than \(\pi^{*}\)) because the optimal policy \(\pi^{*}\) may not lie in \(\Pi_{\Theta}\).
  • For different \(\rho\), we may get different maximizers. However, if \(\pi^{*} \in \Pi_{\Theta}\), the choice of \(\rho\) does not matter.
  • To emphasize the dependence of \(\bar{\pi}\) on \(\rho\), the notation \(\bar{\pi}_{\rho}\) may be used.
  • Sometimes, \(J(\pi_{\theta})\) or \(J_{\rho}(\pi_{\theta})\) is written as \(J_{\rho}(\theta)\) for brevity.
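In principle, any black-box optimizer over \(\Theta\) can be applied to an estimate of \(J_{\rho}(\theta)\). As a naive illustration only (not a method from these notes), the sketch below performs random search over \(\theta\), reusing the hypothetical `estimate_J` helper from the previous sketch; `make_policy(theta)` is an assumed factory that returns a callable policy. The policy gradient methods discussed next instead exploit gradient information about \(J_{\rho}(\theta)\).

```python
import numpy as np

def random_search(make_policy, env, theta_dim, n_iters=50, scale=1.0, seed=0):
    """Keep the theta with the best Monte Carlo estimate of J_rho(pi_theta)."""
    rng = np.random.default_rng(seed)
    best_theta, best_J = None, -np.inf
    for _ in range(n_iters):
        theta = scale * rng.standard_normal(theta_dim)   # candidate parameters
        J_hat = estimate_J(make_policy(theta), env)      # helper from the earlier sketch
        if J_hat > best_J:
            best_theta, best_J = theta, J_hat
    return best_theta, best_J
```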

Next: How can we solve for \(\bar{\pi}_{\rho}\)?