Policy Iteration
A different approach is based on the iterative application of:
- Policy Evaluation: Given a policy \(\pi_k\), compute \(V^{\pi_k}, Q^{\pi_k}\)
- Policy Improvement: Find a new policy \(\pi_{k+1}\) that is better than \(\pi_{k}\), i.e., \(V^{\pi_{k+1}} \geq V^{\pi_k}\) (with a strict inequality in at least one state, unless at convergence)
So, how do we perform policy evaluation and policy improvement?
Policy Evaluation: We can compute the value of a policy \(\pi_k\) by iteratively applying its Bellman operator, as in VI but with \(T^{\pi_k}\) in place of \(T^{*}\), i.e., \(V_{j+1} \leftarrow T^{\pi_k} V_{j}\)
Policy Improvement: We take the greedy policy,
\[\pi_{k+1}(x) \leftarrow \pi_g (x; Q^{\pi_k}) = \arg\max_{a \in A} Q^{\pi_k} (x, a), \quad \forall x \in X\]
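For a finite (tabular) MDP, these two steps can be sketched as follows. This is only an illustration (not from the lecture): the arrays `P[x, a, x']` (transition kernel), `r[x, a]` (rewards), and the integer policy array `pi` are assumed, and the evaluation is done on \(Q\)-functions.

```python
import numpy as np

def bellman_pi(Q, pi, P, r, gamma):
    """One application of T^pi to an action-value function Q of shape (S, A)."""
    S, A = r.shape
    next_vals = Q[np.arange(S), pi]    # Q(x', pi(x')), shape (S,)
    return r + gamma * P @ next_vals   # (S, A, S) @ (S,) -> (S, A)

def greedy_policy(Q):
    """Greedy improvement: pi_{k+1}(x) = argmax_a Q(x, a)."""
    return np.argmax(Q, axis=1)

def evaluate_policy_iteratively(pi, P, r, gamma, n_sweeps=200):
    """Approximate Q^pi by repeatedly applying T^pi (VI-style sweeps)."""
    Q = np.zeros_like(r)
    for _ in range(n_sweeps):
        Q = bellman_pi(Q, pi, P, r, gamma)
    return Q
```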
Why Greedy Policy? (Intuition)
Assume that at state \(x\) we act according to the greedy policy \(\pi_g (x; Q^{\pi_{k}})\), and afterwards we follow \(\pi_{k}\) (i.e., we modify our policy only at state \(x\) to choose the greedy action).
The value of this new policy is:
\[Q^{\pi_k} (x, \pi_g (x; Q^{\pi_{k}})) = Q^{\pi_k} \big(x, \arg\max_{a \in A} Q^{\pi_k} (x, a)\big) = \max_{a \in A} Q^{\pi_k} (x, a)\]
Comparing \(\max_{a \in A} Q^{\pi_k} (x, a)\) with \(V^{\pi_k} (x) = Q^{\pi_k} (x, \pi_k (x))\) (for any deterministic policy \(\pi_k\)), we can see that:
\[\max_{a \in A} Q^{\pi_k} (x, a) = Q^{\pi_k} (x, \pi_g (x; Q^{\pi_{k}})) \geq Q^{\pi_k} (x, \underbrace{\pi_k (x)}_{\text{any deterministic policy}}) = V^{\pi_k} (x)\]
So this new policy is equal to or better than \(\pi_k\) at state \(x\).
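As a concrete (made-up) numerical illustration: suppose at some state \(x\) we have
\[Q^{\pi_k}(x, a_1) = 1, \qquad Q^{\pi_k}(x, a_2) = 3, \qquad \pi_k(x) = a_1 .\]
Then \(V^{\pi_k}(x) = Q^{\pi_k}(x, a_1) = 1\), while choosing the greedy action \(a_2\) at \(x\) and following \(\pi_k\) afterwards yields \(Q^{\pi_k}(x, a_2) = 3 \geq V^{\pi_k}(x)\).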
Algorithm
Recall that:
- \(V^{\pi_k}\) (and likewise \(Q^{\pi_k}\)) is the unique fixed point of the Bellman operator \(T^{\pi_{k}}\)
- The greedy policy \(\pi_{k+1}\) satisfies:
\[\begin{aligned} (T^{\pi_{k+1}} Q^{\pi_k})(x, a) &= r(x, a) + \gamma \int P(dx^{\prime} | x, a) \int \underbrace{\pi_{k+1} (da^{\prime} | x^{\prime})}_{\text{the greedy policy}} Q^{\pi_k} (x^{\prime}, a^{\prime})\\ &= r(x, a) + \gamma \int P(dx^{\prime} | x, a) \, Q^{\pi_k} \big(x^{\prime}, \arg\max_{a^{\prime} \in A} Q^{\pi_k} (x^{\prime}, a^{\prime})\big)\\ &= r(x, a) + \gamma \int P(dx^{\prime} | x, a) \max_{a^{\prime} \in A} Q^{\pi_k} (x^{\prime}, a^{\prime})\\ &= (T^{*} Q^{\pi_{k}})(x, a) \end{aligned}\]
We can summarize each iteration of the policy iteration algorithm as:
- Policy Evaluation: Given \(\pi_k\), compute \(Q^{\pi_k}\), i.e., find a \(Q\) that satisfies \(Q = T^{\pi_k} Q\)
- Policy Improvement: Obtain \(\pi_{k+1}\) as a policy that is greedy at each state \(x\), i.e., one that satisfies:
\[T^{\pi_{k+1}} Q^{\pi_k} = T^{*} Q^{\pi_{k}}\]
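The improvement condition \(T^{\pi_{k+1}} Q^{\pi_k} = T^{*} Q^{\pi_k}\) holds for the greedy policy by construction, and it is easy to check numerically. The following sketch (my own, on a hypothetical random finite MDP) verifies it for an arbitrary \(Q\) standing in for \(Q^{\pi_k}\):

```python
import numpy as np

# Check that the greedy policy pi' w.r.t. Q satisfies T^{pi'} Q = T^* Q.
# P, r, Q are random stand-ins (hypothetical), shapes (S, A, S), (S, A), (S, A).
rng = np.random.default_rng(0)
S, A, gamma = 4, 3, 0.9
P = rng.random((S, A, S))
P /= P.sum(axis=-1, keepdims=True)       # rows sum to 1: a valid transition kernel
r = rng.random((S, A))
Q = rng.random((S, A))                   # stands in for Q^{pi_k}

pi_greedy = np.argmax(Q, axis=1)         # pi_{k+1}(x') = argmax_{a'} Q(x', a')
T_pi_Q = r + gamma * P @ Q[np.arange(S), pi_greedy]   # (T^{pi_{k+1}} Q)(x, a)
T_star_Q = r + gamma * P @ Q.max(axis=1)              # (T^* Q)(x, a)
assert np.allclose(T_pi_Q, T_star_Q)
```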
There are also approximate policy iteration algorithms, where the policy evaluation or improvement steps are performed only approximately (e.g., a truncated evaluation step, or function approximation):
\[Q \approx T^{\pi_k} Q\]
\[T^{\pi_{k+1}} Q^{\pi_k} \approx T^{*} Q^{\pi_{k}}\]
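A minimal sketch of the truncated-evaluation variant for a tabular MDP (my own illustration, assuming the same hypothetical `P`, `r`, `gamma` arrays as above; the sweep count `m` controls how approximate the evaluation step is):

```python
import numpy as np

def approximate_policy_iteration(P, r, gamma, m=5, n_iters=50):
    """Approximate PI: the evaluation step is truncated to m sweeps of T^{pi_k}."""
    S, A = r.shape
    pi = np.zeros(S, dtype=int)          # arbitrary initial deterministic policy
    Q = np.zeros((S, A))
    for _ in range(n_iters):
        for _ in range(m):               # truncated policy evaluation: Q ~ Q^{pi_k}
            Q = r + gamma * P @ Q[np.arange(S), pi]
        pi = np.argmax(Q, axis=1)        # greedy improvement w.r.t. the approximate Q
    return pi, Q
```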
A generic algorithm can be found in the Intro to RL book.
Implementation
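The notes do not include code here, so the following is only a sketch of exact policy iteration for a finite MDP, assuming hypothetical inputs `P` of shape `(S, A, S)`, `r` of shape `(S, A)`, and a discount factor `gamma < 1`:

```python
import numpy as np

def policy_evaluation(pi, P, r, gamma):
    """Compute Q^pi exactly by solving the linear system Q = r + gamma * P_pi Q."""
    S, A = r.shape
    # P_pi[(x, a), (x', a')] = P(x' | x, a) * 1{a' == pi(x')}
    P_pi = np.zeros((S * A, S * A))
    for x in range(S):
        for a in range(A):
            for xn in range(S):
                P_pi[x * A + a, xn * A + pi[xn]] = P[x, a, xn]
    q = np.linalg.solve(np.eye(S * A) - gamma * P_pi, r.reshape(-1))
    return q.reshape(S, A)

def policy_iteration(P, r, gamma, max_iters=1000):
    S, A = r.shape
    pi = np.zeros(S, dtype=int)                  # arbitrary initial deterministic policy
    for _ in range(max_iters):
        Q = policy_evaluation(pi, P, r, gamma)   # policy evaluation: Q = T^{pi_k} Q
        pi_new = np.argmax(Q, axis=1)            # policy improvement: greedy w.r.t. Q^{pi_k}
        if np.array_equal(pi_new, pi):           # pi is greedy w.r.t. its own Q => stop
            break
        pi = pi_new
    return pi, Q

# Usage on a small random MDP:
rng = np.random.default_rng(0)
S, A = 3, 2
P = rng.random((S, A, S))
P /= P.sum(axis=-1, keepdims=True)
r = rng.random((S, A))
pi_star, Q_star = policy_iteration(P, r, gamma=0.9)
print("greedy/optimal policy:", pi_star)
```

The evaluation step here solves the linear system \((I - \gamma P_{\pi}) Q = r\) exactly; in large problems it is usually replaced by the iterative or approximate schemes sketched above.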
Convergence of Policy Iteration
We can guarantee that the policy iteration algorithm converges to the optimal policy. For finite MDPs, convergence happens in a finite number of iterations, since there are only finitely many deterministic policies and each iteration yields a strictly better policy until the optimal one is reached.
Policy Improvement Theorem
Let \(\pi\) be any policy and let \(\pi^{\prime}\) be greedy with respect to \(Q^{\pi}\), i.e., \(T^{\pi^{\prime}} Q^{\pi} = T^{*} Q^{\pi}\). Then \(Q^{\pi^{\prime}} \geq Q^{\pi}\). In other words, if we choose \(\pi^{\prime}\) to be the greedy policy, then we are guaranteed to obtain a policy that is at least as good (and strictly better in at least one state unless \(\pi\) is already optimal).
Proof
We have \(T^{*} Q^{\pi} \geq T^{\pi} Q^{\pi} = Q^{\pi}\) because for any \((x, a) \in X \times A\) it holds that:
Deterministic policy \(\pi(x)\):
\[r(x, a) + \gamma \int P(dx^{\prime} | x, a) \max_{a^{\prime} \in A} Q^{\pi} (x^\prime, a^\prime) \geq r(x, a) + \gamma \int P(dx^{\prime} | x, a) Q^{\pi} (x^{\prime}, \pi(x^{\prime}))\]
Stochastic policy \(\pi(a | x)\):
\[r(x, a) + \gamma \int P(dx^{\prime} | x, a) \max_{a^{\prime} \in A} Q^{\pi} (x^\prime, a^\prime) \geq r(x, a) + \gamma \int P(dx^{\prime} | x, a) \int \pi(da^{\prime} | x^{\prime}) Q^{\pi} (x^{\prime}, a^{\prime})\]
Thus, \[T^{\pi^{\prime}} Q^{\pi} = T^{*} Q^{\pi} \geq T^{\pi} Q^{\pi} = Q^{\pi}\]
We have:
\[T^{\pi^{\prime}} Q^{\pi} \geq Q^{\pi}\]
Then, applying \(T^{\pi^\prime}\) to both sides and using the monotonicity of the Bellman operator, we get:
\[T^{\pi^{\prime}}(T^{\pi^{\prime}} Q^{\pi}) \geq T^{\pi^{\prime}} Q^{\pi} = T^{*} Q^{\pi} \geq Q^{\pi}\]
Repeating this, for any \(m \geq 1\) we have:
\[(T^{\pi^{\prime}})^{m} Q^{\pi} \geq T^{*} Q^{\pi} \geq Q^{\pi}\]
As \(m \rightarrow \infty\), since \(T^{\pi^{\prime}}\) is a \(\gamma\)-contraction whose unique fixed point is \(Q^{\pi^{\prime}}\):
\[\lim_{m \rightarrow \infty} (T^{\pi^{\prime}})^{m} Q^{\pi} = Q^{\pi^\prime}\]
\[\implies Q^{\pi^\prime} = \lim_{m \rightarrow \infty} (T^{\pi^{\prime}})^{m} Q^{\pi} \geq Q^{\pi}\]
This shows that \(Q^{\pi^{\prime}} \geq Q^{\pi}\), i.e., the greedy policy \(\pi^{\prime}\) is at least as good as \(\pi\), which completes the proof.
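As a quick, purely illustrative numerical sanity check of this result, the sketch below builds a small random MDP, evaluates an arbitrary policy \(\pi\) exactly, forms the greedy policy \(\pi^{\prime}\), and verifies that \(Q^{\pi^{\prime}} \geq Q^{\pi}\) elementwise; the random setup and all names are my own assumptions.

```python
import numpy as np

def q_of_policy(pi, P, r, gamma):
    """Exact Q^pi via the linear system Q = r + gamma * P_pi Q."""
    S, A = r.shape
    P_pi = np.zeros((S * A, S * A))
    for x in range(S):
        for a in range(A):
            for xn in range(S):
                P_pi[x * A + a, xn * A + pi[xn]] = P[x, a, xn]
    return np.linalg.solve(np.eye(S * A) - gamma * P_pi, r.reshape(-1)).reshape(S, A)

rng = np.random.default_rng(1)
S, A, gamma = 5, 3, 0.95
P = rng.random((S, A, S))
P /= P.sum(axis=-1, keepdims=True)           # valid transition kernel
r = rng.random((S, A))

pi = rng.integers(A, size=S)                 # arbitrary deterministic policy
Q_pi = q_of_policy(pi, P, r, gamma)
pi_prime = np.argmax(Q_pi, axis=1)           # greedy policy w.r.t. Q^{pi}
Q_pi_prime = q_of_policy(pi_prime, P, r, gamma)
assert np.all(Q_pi_prime >= Q_pi - 1e-10)    # Q^{pi'} >= Q^{pi}
```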