Policy Gradient (2)
Policy Search Methods (Policy Gradient)
From the introduction blog post, our goal is to find policy parameters $\theta$ that maximize the performance objective

$$J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\big[R(\tau)\big],$$

the expected return of trajectories generated by the policy $\pi_\theta$. By taking the gradient w.r.t. $\theta$ we get

$$\nabla_\theta J(\theta) = \nabla_\theta \int p_\theta(\tau)\, R(\tau)\, \mathrm{d}\tau = \int \nabla_\theta p_\theta(\tau)\, R(\tau)\, \mathrm{d}\tau.$$

We can swap the integral and the derivative as long as suitable regularity conditions hold. An alternative family of gradient estimators is the pathwise derivative estimator, or stochastic backpropagation estimator (more details in the Stochastic Graph post). This still leaves us with an integral we cannot evaluate exactly. The question is: how can we approximate this gradient in practice using samples?
Immediate Reward Problem (Easy Case)
First, let's consider an immediate reward problem. In this setting, the agent observes a state $S \sim \rho$, takes a single action $A \sim \pi_\theta(\cdot \mid S)$, receives the reward $r(S, A)$, and the interaction ends. The value function is simply

$$V^{\pi_\theta}(s) = \mathbb{E}_{A \sim \pi_\theta(\cdot \mid s)}\big[r(s, A)\big] = \int \pi_\theta(a \mid s)\, r(s, a)\, \mathrm{d}a.$$

Then, the policy gradient is

$$\nabla_\theta J(\theta) = \nabla_\theta\, \mathbb{E}_{S \sim \rho}\big[V^{\pi_\theta}(S)\big] = \mathbb{E}_{S \sim \rho}\left[\int \nabla_\theta \pi_\theta(a \mid S)\, r(S, a)\, \mathrm{d}a\right].$$
How can we compute this gradient using only samples of states, actions, and rewards? Recall the log-derivative trick:

$$\nabla_\theta \pi_\theta(a \mid s) = \pi_\theta(a \mid s)\, \nabla_\theta \log \pi_\theta(a \mid s),$$

which lets us write the gradient as

$$\nabla_\theta J(\theta) = \mathbb{E}_{S \sim \rho,\, A \sim \pi_\theta(\cdot \mid S)}\big[\nabla_\theta \log \pi_\theta(A \mid S)\, r(S, A)\big].$$

Then, we can draw samples from this expectation: sample $S \sim \rho$ and $A \sim \pi_\theta(\cdot \mid S)$, and compute $\hat{g} = \nabla_\theta \log \pi_\theta(A \mid S)\, r(S, A)$. These samples are called REINFORCE or score-function estimators, and they are unbiased estimates of the policy gradient $\nabla_\theta J(\theta)$.
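To make this concrete, here is a minimal NumPy sketch of the score-function estimator in the immediate-reward case. The tabular softmax policy, the random reward table, and helper names such as `grad_log_pi` and `reinforce_sample` are illustrative assumptions, not anything prescribed above; the point is only that averaging many REINFORCE samples recovers the exact gradient.

```python
import numpy as np

rng = np.random.default_rng(0)

n_states, n_actions = 3, 4
theta = rng.normal(size=(n_states, n_actions))    # policy parameters
reward = rng.uniform(size=(n_states, n_actions))  # r(s, a), fixed for the example
rho = np.array([0.5, 0.3, 0.2])                   # state distribution

def pi(theta, s):
    """Softmax policy pi_theta(. | s)."""
    logits = theta[s]
    p = np.exp(logits - logits.max())
    return p / p.sum()

def grad_log_pi(theta, s, a):
    """d/dtheta log pi_theta(a | s) for the tabular softmax parameterization."""
    g = np.zeros_like(theta)
    g[s] = -pi(theta, s)
    g[s, a] += 1.0
    return g

def reinforce_sample(theta):
    """One unbiased sample of the policy gradient: grad log pi(A|S) * r(S, A)."""
    s = rng.choice(n_states, p=rho)
    a = rng.choice(n_actions, p=pi(theta, s))
    return grad_log_pi(theta, s, a) * reward[s, a]

# Average many samples and compare with the exact gradient of
# J(theta) = E_{S ~ rho, A ~ pi}[r(S, A)].
est = np.mean([reinforce_sample(theta) for _ in range(100_000)], axis=0)

exact = np.zeros_like(theta)
for s in range(n_states):
    p = pi(theta, s)
    for a in range(n_actions):
        exact += rho[s] * p[a] * grad_log_pi(theta, s, a) * reward[s, a]

print(np.abs(est - exact).max())  # close to zero, up to Monte Carlo error
```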
Baseline
We know that the above sample is an unbiased estimate of the policy gradient, but how about its variance? There are two sources of variance:
- Variance that comes from sampling the state $S \sim \rho$ and using this single sample to estimate the outer expectation.
- Variance that comes from sampling the action $A \sim \pi_\theta(\cdot \mid S)$ and using this single sample to estimate the inner expectation.
Recall the law of total variance:

$$\mathrm{Var}(X) = \mathbb{E}\big[\mathrm{Var}(X \mid Y)\big] + \mathrm{Var}\big(\mathbb{E}[X \mid Y]\big).$$

One can show that the variance of $\hat{g}$ decomposes in exactly this way into an action-sampling term and a state-sampling term. To see this more clearly, let's define

$$g(s) = \mathbb{E}_{A \sim \pi_\theta(\cdot \mid s)}\big[\nabla_\theta \log \pi_\theta(A \mid s)\, r(s, A)\big].$$

This function $g(s)$ is the exact gradient contribution of state $s$, and the decomposition becomes

$$\mathrm{Var}(\hat{g}) = \mathbb{E}_{S \sim \rho}\big[\mathrm{Var}_A(\hat{g} \mid S)\big] + \mathrm{Var}_{S \sim \rho}\big(g(S)\big).$$
These two sources of variance make our estimate of the gradient inaccurate. We can reduce the first, action-sampling term using a baseline $b(s)$.

Consider the variance of estimating $g(s)$ for a fixed state $s$ with the modified sample $\nabla_\theta \log \pi_\theta(A \mid s)\,\big(r(s, A) - b(s)\big)$. Subtracting $b(s)$ does not change the expectation, because

$$\mathbb{E}_{A \sim \pi_\theta(\cdot \mid s)}\big[\nabla_\theta \log \pi_\theta(A \mid s)\, b(s)\big] = b(s)\, \nabla_\theta \int \pi_\theta(a \mid s)\, \mathrm{d}a = b(s)\, \nabla_\theta 1 = 0.$$

This shows that, if we add a state-dependent function $b(s)$ in this way, the estimator stays unbiased while its variance changes. Choosing the baseline that minimizes the variance, i.e. by solving this optimization problem, we have:

$$b^*(s) = \frac{\mathbb{E}_{A \sim \pi_\theta(\cdot \mid s)}\big[\lVert \nabla_\theta \log \pi_\theta(A \mid s) \rVert^2\, r(s, A)\big]}{\mathbb{E}_{A \sim \pi_\theta(\cdot \mid s)}\big[\lVert \nabla_\theta \log \pi_\theta(A \mid s) \rVert^2\big]}.$$

We can also choose a single scalar baseline $b$, independent of the state, optimized in the same way.
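The effect of a baseline is easy to see numerically. Below is a small sketch, assuming a single state and a softmax policy whose rewards are shifted by a large constant (all names and numbers are made up for illustration): subtracting a baseline leaves the sample mean unchanged but shrinks the variance considerably.

```python
import numpy as np

rng = np.random.default_rng(1)

n_actions = 5
theta = rng.normal(size=n_actions)            # softmax policy over actions (single state)
reward = 10.0 + rng.uniform(size=n_actions)   # r(a); large offset -> large variance without baseline

def pi(theta):
    p = np.exp(theta - theta.max())
    return p / p.sum()

def grad_log_pi(theta, a):
    g = -pi(theta)
    g[a] += 1.0
    return g

def sample_grads(theta, baseline=0.0, n=50_000):
    """Score-function samples grad log pi(A) * (r(A) - baseline)."""
    actions = rng.choice(n_actions, size=n, p=pi(theta))
    return np.stack([grad_log_pi(theta, a) * (reward[a] - baseline) for a in actions])

g_plain = sample_grads(theta, baseline=0.0)
g_base = sample_grads(theta, baseline=reward @ pi(theta))  # b = E_A[r(A)], a simple choice

# Both estimators have (approximately) the same mean ...
print(g_plain.mean(axis=0))
print(g_base.mean(axis=0))
# ... but the baseline version has a much smaller total variance.
print(g_plain.var(axis=0).sum(), g_base.var(axis=0).sum())
```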
Policy Gradient (General Case)
Now consider the general sequential case, where the agent interacts with the environment over many time steps. There are two standard ways to define the long-run objective: the average reward setting and the discounted return setting.
Average Reward Setting
The difference with the immediate reward case, where we only needed the initial distribution $\rho$, is that here the gradient involves the distribution of states that the agent visits while following $\pi_\theta$.
Recall that in the average reward setting, the quality of a policy is measured as the long-term expected reward per time step, also called the average rate of reward or simply the average reward obtained by following $\pi_\theta$:

$$J(\theta) = \lim_{n \to \infty} \frac{1}{n} \sum_{t=1}^{n} \mathbb{E}\big[R_t \mid S_0,\, A_{0:t-1} \sim \pi_\theta\big] = \sum_{s} \mu_{\pi_\theta}(s) \sum_{a} \pi_\theta(a \mid s)\, r(s, a),$$

where $\mu_{\pi_\theta}$ is the stationary distribution of the Markov chain induced by $\pi_\theta$. Under an ergodicity assumption this distribution exists, $\mu_{\pi_\theta}(s) = \lim_{t \to \infty} \Pr\{S_t = s \mid A_{0:t-1} \sim \pi_\theta\}$, and is independent of the starting state $S_0$, so $J(\theta)$ is also independent of the starting state.

Stationary Distribution
In general, let $\{S_t\}_{t \ge 0}$ be the Markov chain over states induced by following $\pi$ in the MDP. The expected proportion of visits to a state $s$ during the first $n$ time steps is then:

$$\mathbb{E}\left[\frac{1}{n} \sum_{t=1}^{n} \mathbb{1}\{S_t = s\}\right] = \frac{1}{n} \sum_{t=1}^{n} \Pr\{S_t = s\}.$$

The stationary distribution $\mu_\pi$ can be interpreted as the long-run expected proportion of time that the chain spends in each state of the state space, so it is defined as

$$\mu_\pi(s) = \lim_{n \to \infty} \frac{1}{n} \sum_{t=1}^{n} \Pr\{S_t = s\}.$$

So $\mu_\pi$ satisfies the balance equation $\mu_\pi(s') = \sum_s \mu_\pi(s) \sum_a \pi(a \mid s)\, P(s' \mid s, a)$. We can easily sample states from this distribution by following $\pi$ in the environment: after an initial burn-in period, the states we visit are approximately distributed according to $\mu_\pi$.
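As a quick illustration, the sketch below takes an arbitrary small ergodic transition matrix (standing in for the chain induced by some fixed policy; the numbers are made up) and compares the stationary distribution computed as the fixed point of $\mu = \mu P$ with the empirical long-run visit frequencies obtained by simply running the chain.

```python
import numpy as np

rng = np.random.default_rng(2)

# Transition matrix P[s, s'] of a small ergodic Markov chain
# (think of it as the chain induced by following a fixed policy).
P = np.array([[0.1, 0.6, 0.3],
              [0.4, 0.2, 0.4],
              [0.5, 0.3, 0.2]])

# 1) Stationary distribution as the fixed point mu = mu P
#    (left eigenvector of P with eigenvalue 1).
evals, evecs = np.linalg.eig(P.T)
mu = np.real(evecs[:, np.argmin(np.abs(evals - 1.0))])
mu = mu / mu.sum()

# 2) Long-run proportion of time spent in each state when we just
#    follow the chain, discarding a short burn-in.
n_steps, burn_in = 100_000, 1_000
s = 0
visits = np.zeros(3)
for t in range(n_steps):
    s = rng.choice(3, p=P[s])
    if t >= burn_in:
        visits[s] += 1

print(mu)                      # exact stationary distribution
print(visits / visits.sum())   # empirical visit frequencies, approximately mu
```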
Discounted Setting
In the discounted setting, the objective is defined as the long-term discounted return starting from the initial state distribution $\rho$:

$$J(\theta) = \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^t R_{t+1} \,\middle|\, S_0 \sim \rho,\, \pi_\theta\right].$$

Recall that the value function is $V^{\pi_\theta}(s) = \mathbb{E}\big[\sum_{t=0}^{\infty} \gamma^t R_{t+1} \mid S_0 = s,\, \pi_\theta\big]$. By taking the expected value over the initial state we have:

$$J(\theta) = \mathbb{E}_{S_0 \sim \rho}\big[V^{\pi_\theta}(S_0)\big].$$

Along the way, the quantity $\sum_{t=0}^{\infty} \gamma^t \Pr\{S_t = s \mid S_0 = s_0,\, \pi_\theta\}$ will show up. This is the expected (discounted) number of time steps spent in state $s$ when starting from $s_0$ and following $\pi_\theta$.
Let's now define the discounted future-state distribution (similar to the stationary distribution, but with discounting and dependent on the initial state) starting from state $s_0$:

$$\rho^{\pi_\theta}_\gamma(s; s_0) = (1 - \gamma) \sum_{t=0}^{\infty} \gamma^t \Pr\{S_t = s \mid S_0 = s_0,\, \pi_\theta\},$$

where $0 \le \gamma < 1$, and the factor $(1 - \gamma)$ normalizes the weights $\gamma^t$ so that $\rho^{\pi_\theta}_\gamma(\cdot\,; s_0)$ sums to one over states. We can also define the discounted future-state distribution of starting from the initial state distribution $\rho$ instead of a fixed state:

$$\rho^{\pi_\theta}_\gamma(s) = \mathbb{E}_{S_0 \sim \rho}\big[\rho^{\pi_\theta}_\gamma(s; S_0)\big] = (1 - \gamma) \sum_{t=0}^{\infty} \gamma^t \Pr\{S_t = s \mid S_0 \sim \rho,\, \pi_\theta\}.$$
Expected Reward and Discounted Return
We can see the connection between the expected reward and the discounted return:

$$J(\theta) = \mathbb{E}_{S_0 \sim \rho}\big[V^{\pi_\theta}(S_0)\big] = \sum_{s_0} \rho(s_0) \sum_{t=0}^{\infty} \gamma^t \sum_{s} \Pr\{S_t = s \mid S_0 = s_0\} \sum_a \pi_\theta(a \mid s)\, r(s, a) = \frac{1}{1-\gamma}\, \mathbb{E}_{S \sim \rho^{\pi_\theta}_\gamma}\Big[\sum_a \pi_\theta(a \mid S)\, r(S, a)\Big].$$

We can interchange the (finite) summation over states and the infinite sum over time because the series converges absolutely. Hence, up to the factor $\frac{1}{1-\gamma}$, the discounted return is the expected immediate reward when states are drawn from the discounted future-state distribution.
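This identity is easy to verify numerically. The sketch below assumes a small MDP already flattened under a fixed policy (an arbitrary transition matrix $P_\pi$, reward vector $r_\pi$, and initial distribution $\rho$, all made up for illustration), computes $\rho^{\pi}_\gamma = (1-\gamma)\,\rho\,(I - \gamma P_\pi)^{-1}$, and checks that $J = \rho^\top V^\pi = \frac{1}{1-\gamma}\,\mathbb{E}_{S \sim \rho^{\pi}_\gamma}[r_\pi(S)]$.

```python
import numpy as np

gamma = 0.9
n_states = 3

# A small MDP "flattened" under a fixed policy pi:
# P_pi[s, s'] is the state transition matrix and r_pi[s] the expected
# immediate reward when acting according to pi in state s.
P_pi = np.array([[0.1, 0.6, 0.3],
                 [0.4, 0.2, 0.4],
                 [0.5, 0.3, 0.2]])
r_pi = np.array([1.0, 0.0, 2.0])
rho = np.array([0.5, 0.3, 0.2])   # initial state distribution

# Discounted future-state distribution:
# rho_gamma(s) = (1 - gamma) * sum_t gamma^t Pr{S_t = s | S_0 ~ rho}
#              = (1 - gamma) * rho (I - gamma P_pi)^{-1}
rho_gamma = (1 - gamma) * rho @ np.linalg.inv(np.eye(n_states) - gamma * P_pi)
print(rho_gamma.sum())            # sums to 1

# Discounted return objective: J = E_{S_0 ~ rho}[V^pi(S_0)],
# with V^pi = (I - gamma P_pi)^{-1} r_pi.
V = np.linalg.solve(np.eye(n_states) - gamma * P_pi, r_pi)
J = rho @ V

# Connection: J equals 1/(1 - gamma) times the expected immediate
# reward under the discounted future-state distribution.
print(J, (rho_gamma @ r_pi) / (1 - gamma))   # the two numbers match
```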
Policy Gradient Theorem
The policy gradient theorem (Sutton et al., 1999) states that, in both settings, the gradient takes the form

$$\nabla_\theta J(\theta) \propto \sum_{s} d^{\pi_\theta}(s) \sum_a \nabla_\theta \pi_\theta(a \mid s)\, Q^{\pi_\theta}(s, a),$$

where $d^{\pi_\theta}$ is the stationary distribution $\mu_{\pi_\theta}$ in the average reward setting and the discounted future-state distribution $\rho^{\pi_\theta}_\gamma$ in the discounted setting. Importantly, the gradient does not involve the derivative of the state distribution with respect to $\theta$.
Proof: Average Reward Setting
In the average reward setting, the differential value function satisfies

$$V^{\pi_\theta}(s) = \sum_a \pi_\theta(a \mid s)\, Q^{\pi_\theta}(s, a).$$

Recall that the differential action-value function is $Q^{\pi_\theta}(s, a) = r(s, a) - J(\theta) + \sum_{s'} P(s' \mid s, a)\, V^{\pi_\theta}(s')$. Taking the gradient of $V^{\pi_\theta}(s)$ with respect to $\theta$ and rearranging for $\nabla_\theta J(\theta)$, that is:

$$\nabla_\theta J(\theta) = \sum_a \Big[\nabla_\theta \pi_\theta(a \mid s)\, Q^{\pi_\theta}(s, a) + \pi_\theta(a \mid s) \sum_{s'} P(s' \mid s, a)\, \nabla_\theta V^{\pi_\theta}(s')\Big] - \nabla_\theta V^{\pi_\theta}(s).$$

Weighting both sides by $\mu_{\pi_\theta}(s)$ and summing over $s$ (taking expectation with respect to the stationary distribution), the value-gradient terms cancel by stationarity of $\mu_{\pi_\theta}$, leaving:

$$\nabla_\theta J(\theta) = \sum_s \mu_{\pi_\theta}(s) \sum_a \nabla_\theta \pi_\theta(a \mid s)\, Q^{\pi_\theta}(s, a).$$
Proof: Discounted Return Setting
In the discounted return setting, $J(\theta) = \mathbb{E}_{S_0 \sim \rho}\big[V^{\pi_\theta}(S_0)\big]$, so we need $\nabla_\theta V^{\pi_\theta}$. Recall that

$$V^{\pi_\theta}(s) = \sum_a \pi_\theta(a \mid s)\, Q^{\pi_\theta}(s, a), \qquad Q^{\pi_\theta}(s, a) = r(s, a) + \gamma \sum_{s'} P(s' \mid s, a)\, V^{\pi_\theta}(s').$$

Now we expand $\nabla_\theta V^{\pi_\theta}(s)$ and keep unrolling the recursion:

$$\nabla_\theta V^{\pi_\theta}(s) = \sum_a \Big[\nabla_\theta \pi_\theta(a \mid s)\, Q^{\pi_\theta}(s, a) + \pi_\theta(a \mid s)\, \gamma \sum_{s'} P(s' \mid s, a)\, \nabla_\theta V^{\pi_\theta}(s')\Big] = \sum_{x} \sum_{t=0}^{\infty} \gamma^t \Pr\{S_t = x \mid S_0 = s\} \sum_a \nabla_\theta \pi_\theta(a \mid x)\, Q^{\pi_\theta}(x, a).$$

Then, we can make this an expectation by adding the normalization term $(1 - \gamma)$ and averaging over the initial state:

$$\nabla_\theta J(\theta) = \mathbb{E}_{S_0 \sim \rho}\big[\nabla_\theta V^{\pi_\theta}(S_0)\big] = \frac{1}{1-\gamma} \sum_{s} \rho^{\pi_\theta}_{\gamma}(s) \sum_a \nabla_\theta \pi_\theta(a \mid s)\, Q^{\pi_\theta}(s, a).$$

Then, in both settings, the gradient is a weighted sum over states of $\sum_a \nabla_\theta \pi_\theta(a \mid s)\, Q^{\pi_\theta}(s, a)$, with the weights given by the appropriate state distribution.
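As a sanity check on this derivation, the sketch below builds a tiny random MDP with a tabular softmax policy (all sizes, numbers, and helper names such as `policy_gradient_theorem` are illustrative assumptions) and compares the gradient given by the theorem, in its unnormalized form $\nabla_\theta J = \sum_s \eta(s) \sum_a \nabla_\theta \pi_\theta(a \mid s)\, Q^{\pi_\theta}(s, a)$ with $\eta(s) = \sum_t \gamma^t \Pr\{S_t = s \mid S_0 \sim \rho\}$, against a finite-difference approximation of $\nabla_\theta J(\theta)$.

```python
import numpy as np

rng = np.random.default_rng(3)
gamma = 0.9
nS, nA = 3, 2

P = rng.dirichlet(np.ones(nS), size=(nS, nA))   # P[s, a, s']
R = rng.uniform(size=(nS, nA))                  # r(s, a)
rho = np.array([0.4, 0.4, 0.2])                 # initial state distribution
theta = rng.normal(size=(nS, nA))               # tabular softmax policy parameters

def policy(theta):
    z = np.exp(theta - theta.max(axis=1, keepdims=True))
    return z / z.sum(axis=1, keepdims=True)     # pi[s, a]

def J(theta):
    """Discounted objective J(theta) = E_{S0 ~ rho}[V^pi(S0)], computed exactly."""
    pi = policy(theta)
    P_pi = np.einsum('sa,sap->sp', pi, P)       # state transition matrix under pi
    r_pi = (pi * R).sum(axis=1)                 # expected reward under pi
    V = np.linalg.solve(np.eye(nS) - gamma * P_pi, r_pi)
    return rho @ V, pi, P_pi, V

def policy_gradient_theorem(theta):
    """grad J = sum_s eta(s) sum_a grad pi(a|s) Q^pi(s, a)."""
    _, pi, P_pi, V = J(theta)
    Q = R + gamma * np.einsum('sap,p->sa', P, V)
    eta = rho @ np.linalg.inv(np.eye(nS) - gamma * P_pi)   # unnormalized discounted distribution
    grad = np.zeros_like(theta)
    for s in range(nS):
        for a in range(nA):
            # Softmax derivative: d pi(a|s) / d theta[s, b] = pi(a|s) (1{a=b} - pi(b|s))
            dpi = np.zeros_like(theta)
            dpi[s] = pi[s, a] * ((np.arange(nA) == a) - pi[s])
            grad += eta[s] * dpi * Q[s, a]
    return grad

# Finite-difference check of the theorem.
eps = 1e-5
fd = np.zeros_like(theta)
for idx in np.ndindex(theta.shape):
    t_plus, t_minus = theta.copy(), theta.copy()
    t_plus[idx] += eps
    t_minus[idx] -= eps
    fd[idx] = (J(t_plus)[0] - J(t_minus)[0]) / (2 * eps)

print(np.abs(policy_gradient_theorem(theta) - fd).max())   # close to zero
```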
Policy Gradients as Expectations
Recall that the score-function estimator relies on the identity

$$\nabla_\theta \pi_\theta(a \mid s) = \pi_\theta(a \mid s)\, \nabla_\theta \log \pi_\theta(a \mid s).$$

Thus, the results above can be written as:

Average Reward Setting:

$$\nabla_\theta J(\theta) = \mathbb{E}_{S \sim \mu_{\pi_\theta},\, A \sim \pi_\theta(\cdot \mid S)}\big[\nabla_\theta \log \pi_\theta(A \mid S)\, Q^{\pi_\theta}(S, A)\big]$$

Discounted Return Setting:

$$\nabla_\theta J(\theta) = \frac{1}{1-\gamma}\, \mathbb{E}_{S \sim \rho^{\pi_\theta}_\gamma,\, A \sim \pi_\theta(\cdot \mid S)}\big[\nabla_\theta \log \pi_\theta(A \mid S)\, Q^{\pi_\theta}(S, A)\big]$$
Now that we have clean expectations over the stationary distribution or the discounted future-state distribution, we can use samples to estimate the gradient.
Sampling from Expectations
Average Reward Setting
In the average reward setting, or in episodic tasks where we effectively have $\gamma = 1$, sampling is straightforward: we simply run $\pi_\theta$ in the environment, and the visited state-action pairs are (after a burn-in period) distributed according to $\mu_{\pi_\theta}$ and $\pi_\theta(\cdot \mid S)$, which are exactly the distributions appearing in the expectation above.
Discounted Return Setting
In the discounted return setting, however, sampling is a little tricky. Suppose we collect data in the natural way:
- The agent starts an episode from $S_0 \sim \rho$ and follows $\pi_\theta$.
- We get a sequence of states $S_0, S_1, S_2, \ldots$ and actions $A_t \sim \pi_\theta(\cdot \mid S_t)$.

However, these states are samples from the undiscounted on-policy state distribution, not from the discounted future-state distribution $\rho^{\pi_\theta}_\gamma$. In order to sample from the correct distribution, we either:
- Interpret $1 - \gamma$ as a termination probability: terminate the rollout at each time step with probability $1 - \gamma$ and use the state where we stopped.
- Add a weight $\gamma^t$, which depends on the current time step $t$, to account for the discounting.
If we have sampled from the correct distribution and have an estimate of $Q^{\pi_\theta}(S, A)$ (for example, the discounted return observed from $(S, A)$ onward), then $\nabla_\theta \log \pi_\theta(A \mid S)\, \hat{Q}(S, A)$ is a stochastic estimate of the policy gradient. We can now use these samples in a stochastic gradient (SG) ascent scheme to improve the policy.
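Here is a small sketch of the two sampling options for the discounted future-state distribution (the chain, its numbers, and the episode counts are arbitrary illustrations): terminating with probability $1-\gamma$ at every step, or reweighting visited states by $\gamma^t$, both recover $\rho^{\pi}_\gamma$.

```python
import numpy as np

rng = np.random.default_rng(4)
gamma = 0.9
nS = 3

# Chain induced by a fixed policy, plus the initial state distribution.
P_pi = np.array([[0.1, 0.6, 0.3],
                 [0.4, 0.2, 0.4],
                 [0.5, 0.3, 0.2]])
rho = np.array([0.5, 0.3, 0.2])

# Exact discounted future-state distribution for reference.
rho_gamma = (1 - gamma) * rho @ np.linalg.inv(np.eye(nS) - gamma * P_pi)

# Option 1: interpret (1 - gamma) as a termination probability; at every
# step stop with probability (1 - gamma) and keep the state where we stopped.
counts = np.zeros(nS)
for _ in range(20_000):
    s = rng.choice(nS, p=rho)
    while rng.random() < gamma:          # continue with probability gamma
        s = rng.choice(nS, p=P_pi[s])
    counts[s] += 1
opt1 = counts / counts.sum()

# Option 2: run ordinary (undiscounted) rollouts and reweight each visited
# state S_t by gamma^t.
weights = np.zeros(nS)
for _ in range(1_000):
    s = rng.choice(nS, p=rho)
    for t in range(200):
        weights[s] += gamma ** t
        s = rng.choice(nS, p=P_pi[s])
opt2 = weights / weights.sum()

print(rho_gamma)   # exact
print(opt1)        # termination trick
print(opt2)        # gamma^t reweighting
```

The same two tricks carry over to the gradient estimator itself: either stop rollouts early with probability $1-\gamma$, or keep full rollouts and multiply each per-step gradient term by $\gamma^t$.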
Next: REINFORCE and A2C
References
Amir-massoud Farahmand. Introduction to Reinforcement Learning (Spring 2021) Lecture 6
Sham M. Kakade. On the sample complexity of reinforcement learning. 2003. https://homes.cs.washington.edu/~sham/papers/thesis/sham_thesis.pdf
Thomas, P.. (2014). Bias in Natural Actor-Critic Algorithms. Proceedings of the 31st International Conference on Machine Learning, in PMLR 32(1):441-448
Sutton, R. S., McAllester, D. A., Singh, S. P., and Mansour, Y. (1999). Policy gradient methods for reinforcement learning with function approximation. In Advances in Neural Information Processing Systems 12, pages 1057-1063.
Sutton & Barto's "Reinforcement Learning: An Introduction", 2nd edition
https://stats.stackexchange.com/questions/400087/how-deriving-the-formula-for-the-on-policy-distribution-in-episodic-tasks
Markov chains: http://wwwf.imperial.ac.uk/~ejm/M3S4/NOTES3.pdf
A First Look at Stochastic Processes https://www.worldscientific.com/worldscibooks/10.1142/11488