SARSA

SARSA (State-Action-Reward-State-Action) is an on-policy TD algorithm that follows a procedure similar to policy iteration (PI):

  1. Estimate \(Q^{\pi_k}\) for a given policy \(\pi_k\)
  2. Perform policy improvement to obtain a new policy \(\pi_{k+1}\)

Compared with standard policy iteration, SARSA follows generalized policy iteration: it improves the policy before \(Q\) has converged to \(Q^{\pi_k}\).
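
As a sketch of what the estimation step can look like in practice, the standard one-step SARSA update is given below. The step size \(\alpha\), the discount factor \(\gamma\), and the convention that \(r_t\) is the reward received after taking \(a_t\) in \(s_t\) are assumptions, since this section does not define them.

\[
Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ r_t + \gamma \, Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t) \right]
\]

The update uses the tuple \((s_t, a_t, r_t, s_{t+1}, a_{t+1})\), which is where the name comes from. Because the target is built from \(a_{t+1}\), the action the current policy actually selects, the estimate tracks the policy being followed, which is what makes the method on-policy.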

Algorithm

Note that \(\pi_t\) here is an \(\epsilon\)-greedy policy, which ensures some exploration. The greedy part of the policy performs the policy improvement, while the occasional random action lets the agent explore.
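
To make the role of the \(\epsilon\)-greedy policy concrete, here is a minimal tabular sketch in Python. Everything in it is an assumption for illustration rather than the implementation these notes refer to: the function names `epsilon_greedy`, `chain_step`, and `sarsa`, the 5-state corridor environment, and the hyperparameter values (`alpha`, `gamma`, `epsilon`, `max_steps`) are all hypothetical.

```python
import random
from collections import defaultdict


def epsilon_greedy(Q, state, n_actions, epsilon, rng):
    """Return a random action with probability epsilon, else a greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(n_actions)  # exploration
    values = [Q[(state, a)] for a in range(n_actions)]
    best = max(values)
    # Greedy choice (the policy-improvement part); ties are broken at random.
    return rng.choice([a for a, v in enumerate(values) if v == best])


def chain_step(state, action):
    """Hypothetical 5-state corridor: action 1 moves right, action 0 moves left.

    Reaching state 4 gives reward 1 and ends the episode.
    """
    next_state = min(state + 1, 4) if action == 1 else max(state - 1, 0)
    done = next_state == 4
    return next_state, (1.0 if done else 0.0), done


def sarsa(step, start_state, n_actions, episodes=500, alpha=0.1,
          gamma=0.9, epsilon=0.1, max_steps=1000, seed=0):
    """Tabular SARSA; `step(state, action)` returns (next_state, reward, done)."""
    rng = random.Random(seed)
    Q = defaultdict(float)  # Q[(state, action)], initialised to 0
    for _episode in range(episodes):
        s = start_state
        a = epsilon_greedy(Q, s, n_actions, epsilon, rng)
        for _t in range(max_steps):  # cap on episode length (an assumption)
            s2, r, done = step(s, a)
            a2 = epsilon_greedy(Q, s2, n_actions, epsilon, rng)
            # On-policy target: uses a2, the action that will actually be taken.
            target = r if done else r + gamma * Q[(s2, a2)]
            Q[(s, a)] += alpha * (target - Q[(s, a)])
            if done:
                break
            s, a = s2, a2
    return Q


Q = sarsa(chain_step, start_state=0, n_actions=2)
# Greedy action per non-terminal state under the learned Q.
greedy = {s: max(range(2), key=lambda a: Q[(s, a)]) for s in range(4)}
print(greedy)
```

The same `epsilon_greedy` call chooses both the action that is executed and the action used in the update target, so policy improvement happens implicitly every time `Q` changes, without waiting for `Q` to converge.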

Implementation