Policy Iteration

A different approach is based on the iterative application of:

  1. Policy Evaluation: Given a policy \(\pi_k\), compute \(V^{\pi_k}, Q^{\pi_k}\)
  2. Policy Improvement: Find a new policy \(\pi_{k+1}\) that is better than \(\pi_{k}\), i.e., \(V^{\pi_{k+1}} \geq V^{\pi_k}\) (with a strict inequality in at least one state, unless at convergence)

So, how do we perform policy evaluation and policy improvement?

  1. Policy Evaluation: We can use the policy evaluation part of VI (\(V_{k+1} \leftarrow T^{\pi} V_{k}\)) to compute the value of a policy \(\pi_k\)

  2. Policy Improvement: We take the greedy policy (a short code sketch follows this list),

    \[\pi_{k+1} \leftarrow \pi_g (x; Q^{\pi_k}) = \arg\max_{a \in A} Q^{\pi_k} (x, a), \quad \forall x \in X\]

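For a finite MDP with a tabular \(Q\), the greedy step is just an argmax per state. A minimal sketch (the array shapes and names below are illustrative assumptions, not something prescribed by these notes):

```python
import numpy as np

def greedy_policy(Q: np.ndarray) -> np.ndarray:
    """Return the greedy (deterministic) policy pi_g(x; Q) = argmax_a Q(x, a).

    Q is assumed to be a tabular action-value function of shape
    (num_states, num_actions); the result is an integer array of shape
    (num_states,) holding the greedy action for each state.
    """
    return np.argmax(Q, axis=1)
```
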
Why Greedy Policy? (Intuition)

Assume that at state \(x\), we act according to the greedy policy \(\pi_g (x; Q^{\pi_{k}})\), and afterwards we follow \(\pi_{k}\) (i.e., we modify our policy only at state \(x\) to choose the greedy action).

The value of this new policy is:

\[Q^{\pi_k} (x, \pi_g (x; Q^{\pi_{k}})) = Q^{\pi_k} \Big(x, \arg\max_{a \in A} Q^{\pi_k} (x, a)\Big) = \max_{a \in A} Q^{\pi_k} (x, a)\]

Comparing \(\max_{a \in A} Q^{\pi_k} (x, a)\) with \(V^{\pi_k} (x) = Q^{\pi_k} (x, \pi_k (x))\) (here \(\pi_k\) can be any deterministic policy), we see that:

\[\max_{a \in A} Q^{\pi_k} (x, a) = Q^{\pi_k} (x, \pi_g (x; Q^{\pi_{k}})) \geq Q^{\pi_k} (x, \underbrace{\pi_k (x)}_{\text{any deterministic policy}}) = V^{\pi_k} (x)\]

So this new policy is at least as good as \(\pi_k\) at state \(x\).
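
As a concrete illustration with made-up numbers, suppose state \(x\) has two actions with

\[Q^{\pi_k}(x, a_1) = 1, \qquad Q^{\pi_k}(x, a_2) = 2, \qquad \pi_k(x) = a_1 .\]

Then \(V^{\pi_k}(x) = Q^{\pi_k}(x, a_1) = 1\), while \(\max_{a \in A} Q^{\pi_k}(x, a) = 2\), so switching to the greedy action \(a_2\) at \(x\) (and following \(\pi_k\) afterwards) can only improve the value.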

Algorithm

Recall that:

  • \(V^{\pi_k}\) is the unique fixed point of the Bellman operator \(T^{\pi_{k}}\) (and likewise \(Q^{\pi_k}\) is the unique fixed point of its action-value version)

  • The greedy policy \(\pi_{k+1}\) satisfies (see the numerical check after this list):

    \[\begin{aligned} (T^{\pi_{k+1}} Q^{\pi_k})(x, a) &= r(x, a) + \gamma \int P(dx^{\prime} | x, a) \underbrace{\pi_{k+1} (da^{\prime} | x^{\prime})}_{\text{the greedy policy}} Q^{\pi_k} (x^{\prime}, a^{\prime})\\ &= r(x, a) + \gamma \int P(dx^{\prime} | x, a)\, Q^{\pi_k} \Big(x^{\prime}, \arg\max_{a^{\prime} \in A} Q^{\pi_k} (x^{\prime}, a^{\prime})\Big)\\ &= r(x, a) + \gamma \int P(dx^{\prime} | x, a) \max_{a^{\prime} \in A} Q^{\pi_k} (x^{\prime}, a^{\prime})\\ &= (T^{*} Q^{\pi_{k}})(x, a) \end{aligned}\]

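For a finite MDP this identity is easy to verify numerically. A minimal sketch, assuming a tabular representation where `P` has shape `(S, A, S)`, `r` has shape `(S, A)`, and `Q` has shape `(S, A)` (the names and shapes are assumptions of this sketch):

```python
import numpy as np

def apply_T_pi(Q, pi, r, P, gamma):
    """One application of T^pi to Q for a deterministic policy pi (shape (S,))."""
    next_values = Q[np.arange(Q.shape[0]), pi]   # Q(x', pi(x')) for every x'
    return r + gamma * P @ next_values           # result has shape (S, A)

def apply_T_star(Q, r, P, gamma):
    """One application of the Bellman optimality operator T* to Q."""
    return r + gamma * P @ Q.max(axis=1)

# With pi_next the greedy policy w.r.t. Q, the two operators coincide on Q:
# pi_next = np.argmax(Q, axis=1)
# np.allclose(apply_T_pi(Q, pi_next, r, P, gamma), apply_T_star(Q, r, P, gamma))  # True
```
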
We can summarize each iteration of the Policy iteration algorithm as:

  • (Policy Evaluation) Given \(\pi_k\), compute \(Q^{\pi_k}\), i.e., find a \(Q\) that satisfies \(Q = T^{\pi_k} Q\) (see the sketch after this list)

  • (Policy Improvement) Obtain \(\pi_{k+1}\) as a policy that satisfies (i.e., select the greedy action at each state \(x\)):

    \[T^{\pi_{k+1}} Q^{\pi_k} = T^{*} Q^{\pi_{k}}\]
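
As a sketch of the exact policy evaluation step for a finite MDP (same assumed tabular representation as above, with a deterministic policy stored as an integer array `pi` of shape `(S,)`), note that \(Q = T^{\pi_k} Q\) is a linear system in \(Q\) and can be solved directly:

```python
import numpy as np

def evaluate_policy(pi, r, P, gamma):
    """Solve Q = T^pi Q exactly by treating it as a linear system.

    pi: deterministic policy, integer array of shape (S,)
    r:  rewards, shape (S, A); P: transition kernel, shape (S, A, S)
    """
    S, A, _ = P.shape
    # P_pi[(x, a), (x', a')] = P(x' | x, a) * 1{a' = pi(x')}
    P_pi = np.zeros((S * A, S * A))
    for x in range(S):
        for a in range(A):
            for x_next in range(S):
                P_pi[x * A + a, x_next * A + pi[x_next]] = P[x, a, x_next]
    # (I - gamma * P_pi) q = r  =>  q is the vectorized Q^pi
    q = np.linalg.solve(np.eye(S * A) - gamma * P_pi, r.reshape(-1))
    return q.reshape(S, A)
```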

There are also approximate policy iteration algorithms, in which the policy evaluation or improvement steps are performed only approximately (e.g., a truncated evaluation step, or function approximation):

\[Q \approx T^{\pi_k} Q\]

\[T^{\pi_{k+1}} Q^{\pi_k} \approx T^{*} Q^{\pi_{k}}\]
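
A sketch of the truncated (approximate) evaluation step, under the same assumed tabular setup as above: instead of solving \(Q = T^{\pi_k} Q\) exactly, apply \(T^{\pi_k}\) a fixed number of times, typically warm-starting from the previous \(Q\):

```python
import numpy as np

def truncated_evaluation(Q, pi, r, P, gamma, num_sweeps=10):
    """Approximate policy evaluation: a few applications of T^pi to Q."""
    for _ in range(num_sweeps):
        next_values = Q[np.arange(Q.shape[0]), pi]  # Q(x', pi(x'))
        Q = r + gamma * P @ next_values             # one sweep of T^pi
    return Q                                        # Q ~= Q^pi for large num_sweeps
```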

A generic algorithm can be found in the Intro to RL book.

Implementation
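
A minimal sketch of tabular Policy Iteration, reusing the `evaluate_policy` helper from the policy evaluation sketch above (again, the tabular representation `P` of shape `(S, A, S)` and `r` of shape `(S, A)` is an assumption of this sketch, not something prescribed by the notes):

```python
import numpy as np

def policy_iteration(r, P, gamma, max_iters=1000):
    """Tabular Policy Iteration: exact evaluation + greedy improvement."""
    S, A, _ = P.shape
    pi = np.zeros(S, dtype=int)                  # arbitrary initial deterministic policy
    Q = np.zeros((S, A))
    for _ in range(max_iters):
        # Policy Evaluation: solve Q = T^pi Q (evaluate_policy is sketched earlier)
        Q = evaluate_policy(pi, r, P, gamma)
        # Policy Improvement: take the greedy policy w.r.t. Q
        pi_new = np.argmax(Q, axis=1)
        if np.array_equal(pi_new, pi):           # policy stable => optimal (finite MDP)
            break
        pi = pi_new
    return pi, Q
```

The stopping test (`pi_new == pi`) is exactly the policy-stability condition used in the convergence discussion below.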

Convergence of Policy Iteration

We can guarantee that the policy iteration algorithm converges to the optimal policy. For finite MDPs, the convergence happens in a finite number of iterations: each iteration produces a policy that is at least as good as the previous one (and strictly better unless the current policy is already optimal), and a finite MDP has only finitely many deterministic policies.

Policy Improvement Theorem

Let \(\pi\) be any policy and let \(\pi^{\prime}\) be the greedy policy with respect to \(Q^{\pi}\), i.e., \(\pi^{\prime}(x) = \arg\max_{a \in A} Q^{\pi}(x, a)\) for all \(x \in X\). Then \(Q^{\pi^{\prime}} \geq Q^{\pi}\). In other words, if we choose \(\pi^{\prime}\) to be the greedy policy, then we are guaranteed to obtain a policy that is at least as good.

Proof

We have \(T^{*} Q^{\pi} \geq T^{\pi} Q^{\pi} = Q^{\pi}\) because for any \((x, a) \in X \times A\) it holds that:

Deterministic Policy \(\pi(x)\):

\[r(x, a) + \gamma \int P(dx^{\prime} | x, a) \max_{a^{\prime} \in A} Q^{\pi} (x^\prime, a^\prime) \geq r(x, a) + \gamma \int P(dx^{\prime} | x, a)\, Q^{\pi} (x^{\prime}, \pi(x^{\prime}))\]

Stochastic Policy \(\pi(a | x)\):

\[r(x, a) + \gamma \int P(dx^{\prime} | x, a) \max_{a^{\prime} \in A} Q^{\pi} (x^\prime, a^\prime) \geq r(x, a) + \gamma \int P(dx^{\prime} | x, a)\, \pi(da^{\prime} | x^{\prime})\, Q^{\pi} (x^{\prime}, a^{\prime})\]

Thus, since \(\pi^{\prime}\) is the greedy policy with respect to \(Q^{\pi}\), \[T^{\pi^{\prime}} Q^{\pi} = T^{*} Q^{\pi} \geq T^{\pi} Q^{\pi} = Q^{\pi}\]

We have:

\[T^{\pi^{\prime}} Q^{\pi} \geq Q^{\pi}\]

Then, applying \(T^{\pi^\prime}\) to both sides and using the monotonicity of the Bellman operator, we have:

\[T^{\pi^{\prime}}(T^{\pi^{\prime}} Q^{\pi}) \geq T^{\pi^{\prime}} Q^{\pi} = T^{*} Q^{\pi} \geq Q^{\pi}\]

Repeating this, we have for any \(m \geq 1\):

\[(T^{\pi^{\prime}})^{m} Q^{\pi} \geq T^{*} Q^{\pi} \geq Q^{\pi}\]

As \(m \rightarrow \infty\), by the contraction property of \(T^{\pi^{\prime}}\), the iterates \((T^{\pi^{\prime}})^{m} Q^{\pi}\) converge to its unique fixed point:

\[\lim_{m \rightarrow \infty} (T^{\pi^{\prime}})^{m} Q^{\pi} = Q^{\pi^\prime}\]

\[\implies Q^{\pi^\prime} = \lim_{m \rightarrow \infty} (T^{\pi^{\prime}})^{m} Q^{\pi} \geq Q^{\pi}\]

Proof: Convergence of Policy Iteration