SARSA
SARSA
SARSA (State Action State Reward State Action) is an on-policy TD algorithm that follows a PI-like procedure:
- Estimate \(Q^{\pi_k}\) for a given policy \(\pi_k\)
- Perform policy improvement to obtain a new policy \(\pi_{k+1}\)
Compare with usual Policy iteration, SARSA uses Generalized policy iteration
to improve the policy before \(Q\) converges to \(Q^{\pi_k}\).
Algorithm
Notice that, \(\pi_t\) here is a \(\epsilon\)-greedy policy because we want to ensure some exploration. The greedy part of the policy performs the policy improvement while the occasional random choice of actions allows the agent to have some exploration.
Implementation
Related Posts