Policy Gradient (2)

Policy Search Methods (Policy Gradient)

Throughout, $\nabla$ refers to $\nabla_\theta$ for simplicity.

From the introduction blog post, our goal is:

$$\operatorname*{maximize}_{\pi_\theta} J_\rho(\pi_\theta) = \operatorname*{maximize}_{\theta} J_\rho(\pi_\theta)$$

By taking the gradient w.r.t θ, we have:

$$
\begin{aligned}
\nabla J_\rho(\pi_\theta) &= \nabla\,\mathbb{E}_{X_1\sim\rho}\!\left[V^{\pi_\theta}(X_1)\right] \\
&= \nabla \int_{x_1}\rho(x_1)\int_a \pi_\theta(a|x_1)\left[r(x_1,a) + \gamma\int_{x'}P(x'\mid x_1,a)\,V^{\pi_\theta}(x')\,\mathrm{d}x'\right]\mathrm{d}a\,\mathrm{d}x_1 \\
&= \int_{x_1}\rho(x_1)\int_a \nabla\!\left[\pi_\theta(a|x_1)\left(r(x_1,a) + \gamma\int_{x'}P(x'\mid x_1,a)\,V^{\pi_\theta}(x')\,\mathrm{d}x'\right)\right]\mathrm{d}a\,\mathrm{d}x_1
\end{aligned}
$$

We can swap the integral and the derivative as long as $V^{\pi_\theta}$ is a continuous function of $\theta$. The estimator of this expectation is called the pathwise derivative estimator or stochastic backpropagation estimator (more details in the Stochastic Graph post).

This $\nabla J_\rho(\pi_\theta)$ is called the policy gradient. In the ideal case, if we know the policy gradient (i.e., if we know the environment dynamics $P$ and $R$), we can use gradient ascent to update the parameters $\theta$.
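
To make the ideal case concrete, here is a minimal sketch assuming a small hypothetical 2-state, 2-action MDP with known $P$ and $r$ (all names and numbers below are mine, for illustration only): it computes $J_\rho(\pi_\theta)$ exactly through the Bellman equation and approximates the policy gradient by finite differences, then runs gradient ascent.

```python
import numpy as np

# Hypothetical 2-state, 2-action MDP with known dynamics (assumed for illustration).
P = np.array([[[0.9, 0.1], [0.2, 0.8]],     # P[x, a, x']: transition probabilities
              [[0.7, 0.3], [0.05, 0.95]]])
r = np.array([[1.0, 0.0], [0.0, 2.0]])       # r[x, a]: immediate rewards
rho = np.array([0.5, 0.5])                   # initial state distribution
gamma = 0.9

def softmax_policy(theta):
    """pi_theta(a|x) as a softmax over per-state logits theta[x, a]."""
    z = np.exp(theta - theta.max(axis=1, keepdims=True))
    return z / z.sum(axis=1, keepdims=True)

def J(theta):
    """Exact J_rho(pi_theta) = E_{X1 ~ rho}[V^{pi_theta}(X1)]."""
    pi = softmax_policy(theta)
    P_pi = np.einsum('xa,xay->xy', pi, P)                 # P^pi(x'|x)
    r_pi = (pi * r).sum(axis=1)                           # r^pi(x)
    V = np.linalg.solve(np.eye(2) - gamma * P_pi, r_pi)   # V = (I - gamma P^pi)^{-1} r^pi
    return rho @ V

def grad_J_fd(theta, eps=1e-5):
    """Finite-difference approximation of the policy gradient."""
    g = np.zeros_like(theta)
    for idx in np.ndindex(*theta.shape):
        e = np.zeros_like(theta)
        e[idx] = eps
        g[idx] = (J(theta + e) - J(theta - e)) / (2 * eps)
    return g

theta = np.zeros((2, 2))
for _ in range(200):                  # gradient ascent with known dynamics
    theta = theta + 0.5 * grad_J_fd(theta)
print(J(theta))                       # J increases toward the optimal value
```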

The question is, how can we approximate this gradient in reality using samples?

Immediate Reward Problem (Easy Case)

First, let's consider an immediate reward problem:

In this setting, $J_\rho(\pi_\theta)$ is:

$$J_\rho(\pi_\theta) = \int_{x_1} V^{\pi_\theta}(x_1)\,\rho(x_1)\,\mathrm{d}x_1 = \int_{x_1}\int_a r(x_1,a)\,\pi_\theta(a|x_1)\,\rho(x_1)\,\mathrm{d}a\,\mathrm{d}x_1$$

The value function $V^{\pi_\theta}$ in this setting is the same as the expected immediate reward $r^{\pi_\theta}$, and the action space is assumed to be continuous.

Then, the policy gradient $\nabla J_\rho(\pi_\theta)$ can be written as:

$$\nabla J_\rho(\pi_\theta) = \int_{x_1}\int_a r(x_1,a)\,\rho(x_1)\,\nabla\pi_\theta(a|x_1)\,\mathrm{d}a\,\mathrm{d}x_1$$

How can we compute $\nabla J_\rho(\pi_\theta)$?

Recall that $\frac{\partial \log\theta}{\partial\theta} = \frac{1}{\theta}$, which gives $\nabla\pi_\theta = \pi_\theta\,\nabla\log\pi_\theta$:

$$
\begin{aligned}
\nabla J_\rho(\pi_\theta) &= \int_{x_1}\int_a r(x_1,a)\,\rho(x_1)\,\nabla\pi_\theta(a|x_1)\,\mathrm{d}a\,\mathrm{d}x_1 \\
&= \int_{x_1}\int_a r(x_1,a)\,\rho(x_1)\,\pi_\theta(a|x_1)\,\frac{\nabla\pi_\theta(a|x_1)}{\pi_\theta(a|x_1)}\,\mathrm{d}a\,\mathrm{d}x_1 \\
&= \int_{x_1}\int_a \rho(x_1)\,\pi_\theta(a|x_1)\,r(x_1,a)\,\nabla\log\pi_\theta(a|x_1)\,\mathrm{d}a\,\mathrm{d}x_1 \\
&= \mathbb{E}_{X_1\sim\rho,\,A\sim\pi_\theta(\cdot|X_1)}\!\left[r(X_1,A)\,\nabla\log\pi_\theta(A|X_1)\right]
\end{aligned}
$$

Then, we can draw samples from this expectation:

$$r(X_1,A)\,\nabla\log\pi_\theta(A|X_1)$$

These samples are called REINFORCE or score function estimators. They are unbiased samples of the policy gradient $\nabla J_\rho(\pi_\theta)$, so we can use these unbiased but noisy samples to update the parameters. This makes the procedure stochastic gradient ascent.
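
As a minimal sketch of this stochastic gradient ascent, assume a hypothetical 1-d linear-Gaussian policy $\pi_\theta(a|x)=\mathcal{N}(a;\theta x,\sigma^2)$ and a made-up immediate reward; none of these choices come from the text, they only illustrate the score function update.

```python
import numpy as np

rng = np.random.default_rng(0)
sigma = 0.5       # fixed policy standard deviation (an assumption)
alpha = 0.01      # step size
theta = 0.0       # policy: pi_theta(a|x) = N(a; theta * x, sigma^2)

def reward(x, a):
    """Hypothetical immediate reward; the best action is a = 2x."""
    return -(a - 2.0 * x) ** 2

for _ in range(20_000):
    x = rng.uniform(-1.0, 1.0)                 # X1 ~ rho
    a = rng.normal(theta * x, sigma)           # A  ~ pi_theta(.|x)
    # Score function: d/dtheta log pi_theta(a|x) = (a - theta*x) * x / sigma^2
    score = (a - theta * x) * x / sigma ** 2
    grad_sample = reward(x, a) * score         # unbiased sample of the policy gradient
    theta += alpha * grad_sample               # stochastic gradient ascent step

print(theta)   # should end up near 2.0
```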

Baseline

We know that the above sample is an unbiased sample of the policy gradient, but what about its variance?

There are two sources of variance:

  1. Variance that comes from sampling $X_1\sim\rho$ and using this sample to estimate the outer expectation.
  2. Variance that comes from sampling $A\sim\pi_\theta(\cdot|X_1)$ and using this sample to estimate the inner expectation.

Recall the law of total variance:

$$\mathrm{Var}[Y] = \mathbb{E}_X\!\left[\mathrm{Var}[Y\mid X]\right] + \mathrm{Var}_X\!\left[\mathbb{E}[Y\mid X]\right]$$

One can show that the variance of the $i$-th component of this gradient estimator decomposes as:

$$\mathrm{Var}\!\left[r(X_1,A)\,\tfrac{\partial}{\partial\theta_i}\log\pi_\theta(A|X_1)\right] = \mathbb{E}_{X_1}\!\left[\mathrm{Var}_A\!\left[r(X_1,A)\,\tfrac{\partial}{\partial\theta_i}\log\pi_\theta(A|X_1)\,\middle|\,X_1\right]\right] + \mathrm{Var}_{X_1}\!\left[\mathbb{E}_A\!\left[r(X_1,A)\,\tfrac{\partial}{\partial\theta_i}\log\pi_\theta(A|X_1)\,\middle|\,X_1\right]\right]$$

To see this more clearly, let's define:

$$g(x;\theta) = \mathbb{E}_{A\sim\pi_\theta(\cdot|x)}\!\left[r(x,A)\,\nabla\log\pi_\theta(A|x)\right]$$

This function $g:\mathcal{X}\times\Theta\to\mathbb{R}^p$ is the policy gradient at a particular starting state $x$, and it is a $p$-dimensional vector (assuming $\theta$ is $p$-dimensional). Then we have:

$$\mathrm{Var}\!\left[r(X_1,A)\,\tfrac{\partial}{\partial\theta_i}\log\pi_\theta(A|X_1)\right] = \mathbb{E}_{X_1}\!\left[\mathrm{Var}_A\!\left[r(X_1,A)\,\tfrac{\partial}{\partial\theta_i}\log\pi_\theta(A|X_1)\,\middle|\,X_1\right]\right] + \mathrm{Var}_{X_1}\!\left[g_i(X_1;\theta)\right]$$

These two sources of variance make our estimate of the gradient inaccurate. The first one, the variance of the inner (action-sampling) estimate, can be reduced using a baseline.

Consider the variance of estimating $g(x;\theta)$ using a single sample $r(x,A)\,\nabla\log\pi_\theta(A|x)$ with $A\sim\pi_\theta(\cdot|x)$. For each dimension $i$, we have:

$$
\begin{aligned}
\mathbb{E}_A\!\left[(r(x,A) - b(x))\,\tfrac{\partial}{\partial\theta_i}\log\pi_\theta(A|x)\,\middle|\,x\right] &= \int_a (r(x,a) - b(x))\,\pi_\theta(a|x)\,\tfrac{\partial}{\partial\theta_i}\log\pi_\theta(a|x)\,\mathrm{d}a \\
&= \int_a r(x,a)\,\tfrac{\partial}{\partial\theta_i}\pi_\theta(a|x)\,\mathrm{d}a - b(x)\int_a \tfrac{\partial}{\partial\theta_i}\pi_\theta(a|x)\,\mathrm{d}a \\
&= \int_a r(x,a)\,\tfrac{\partial}{\partial\theta_i}\pi_\theta(a|x)\,\mathrm{d}a - b(x)\,\tfrac{\partial}{\partial\theta_i}\underbrace{\int_a \pi_\theta(a|x)\,\mathrm{d}a}_{=1} \\
&= \int_a r(x,a)\,\tfrac{\partial}{\partial\theta_i}\pi_\theta(a|x)\,\mathrm{d}a - b(x)\cdot 0 \\
&= \mathbb{E}_A\!\left[r(x,A)\,\tfrac{\partial}{\partial\theta_i}\log\pi_\theta(A|x)\,\middle|\,x\right]
\end{aligned}
$$


This shows that if we subtract a state-dependent function $b(x)$ from $r(x,a)$, the expectation does not change. But it may change the variance! We can formulate this idea as an optimization problem and look for a function $b:\mathcal{X}\to\mathbb{R}^p$ such that for all $x\in\mathcal{X}$ (component-wise, for each $b_i$):

$$\min_{b}\sum_{i=1}^{p}\mathrm{Var}\!\left[(r(x,A) - b_i(x))\,\tfrac{\partial}{\partial\theta_i}\log\pi_\theta(A|x)\,\middle|\,x\right]$$

By solving this optimization problem, we obtain:

$$b_i(x) = \frac{\mathbb{E}_A\!\left[r(x,A)\left(\tfrac{\partial}{\partial\theta_i}\log\pi_\theta(A|x)\right)^2\,\middle|\,x\right]}{\mathbb{E}_A\!\left[\left(\tfrac{\partial}{\partial\theta_i}\log\pi_\theta(A|x)\right)^2\,\middle|\,x\right]}$$

We can also choose a single scalar function $b:\mathcal{X}\to\mathbb{R}$ instead:

$$b(x) = \frac{\mathbb{E}_A\!\left[r(x,A)\,\left\|\nabla\log\pi_\theta(A|x)\right\|_2^2\,\middle|\,x\right]}{\mathbb{E}_A\!\left[\left\|\nabla\log\pi_\theta(A|x)\right\|_2^2\,\middle|\,x\right]}$$
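
The following sketch illustrates the effect, assuming a hypothetical 1-d Gaussian policy at a single fixed state (none of this setup is from the text): it estimates the optimal scalar baseline from samples and compares the empirical variance of the gradient samples with and without it.

```python
import numpy as np

rng = np.random.default_rng(0)
x, theta, sigma = 1.0, 0.0, 0.5               # fixed state and policy parameters (assumptions)

def reward(x, a):
    return 10.0 - (a - 2.0 * x) ** 2          # a large constant offset inflates the variance

A = rng.normal(theta * x, sigma, size=100_000)   # A ~ pi_theta(.|x)
score = (A - theta * x) * x / sigma ** 2         # d/dtheta log pi_theta(A|x)
R = reward(x, A)

# Optimal scalar baseline: b(x) = E[R * score^2] / E[score^2]  (in 1-d the norm is just score^2).
b = np.mean(R * score ** 2) / np.mean(score ** 2)

g_plain    = R * score                        # REINFORCE samples without a baseline
g_baseline = (R - b) * score                  # with the (estimated) optimal baseline

print("means (should match):     ", g_plain.mean(), g_baseline.mean())
print("variance without baseline:", g_plain.var())
print("variance with baseline:   ", g_baseline.var())
```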

Policy Gradient (General Case)

Recall that:

$$P^{\pi}(x'\mid y) = \int_a P(x'\mid y,a)\,\pi(a\mid y)\,\mathrm{d}a$$

Average Reward Setting

The difference from the immediate reward case, where we only needed the initial distribution $\rho$, is that in the average reward case we also depend on the transition dynamics through $P^{\pi_\theta}$. As we update $\theta$, we obtain a different policy $\pi_\theta$, and therefore a different induced transition distribution $P^{\pi_\theta}$.

Recall that in the average reward setting, the quality of a policy is measured by the long-term expected reward, also called the average rate of reward or simply the average reward, obtained by following $\pi_\theta$; it is independent of the starting state (the $x_1$ in the expectation below only emphasizes that the step-$t$ expected reward is conditioned on the starting state):

$$
\begin{aligned}
J(\pi_\theta) &= \lim_{n\to\infty}\frac{1}{n}\,\mathbb{E}\!\left[R_1 + R_2 + \cdots + R_n \,\middle|\, A_1,A_2,\ldots\sim\pi_\theta,\ x_1\right] \\
&= \lim_{n\to\infty}\frac{1}{n}\sum_{t=1}^{n}\mathbb{E}\!\left[R_t \,\middle|\, A_1,A_2,\ldots\sim\pi_\theta,\ x_1\right] \\
&= \lim_{n\to\infty}\frac{1}{n}\sum_{t=1}^{n}\int_{x,a}P^{\pi_\theta}(X_t=x\mid x_1;t)\,\pi_\theta(a|x)\,R(x,a)\,\mathrm{d}a\,\mathrm{d}x \\
&= \int_{x,a}\lim_{n\to\infty}\frac{1}{n}\sum_{t=1}^{n}P^{\pi_\theta}(X_t=x\mid x_1;t)\,\pi_\theta(a|x)\,R(x,a)\,\mathrm{d}a\,\mathrm{d}x \\
&= \int_{x}\rho^{\pi_\theta}(x)\int_a\pi_\theta(a|x)\,R(x,a)\,\mathrm{d}a\,\mathrm{d}x
\end{aligned}
$$


where $\rho^{\pi_\theta}(x) = \lim_{n\to\infty}\frac{1}{n}\sum_{t=1}^{n}P^{\pi_\theta}(X_t=x\mid x_1;t) = \lim_{t\to\infty}P^{\pi_\theta}(X_t=x\mid x_1;t)$.

Stationary Distribution

$\rho^{\pi_\theta}$ above is the stationary distribution under $\pi_\theta$, which is assumed to exist (e.g., when the Markov chain is ergodic) and to be independent of the starting state ($x_1$ here only emphasizes that $P^{\pi_\theta}(\cdot\mid x_1;t)$ is the $t$-step transition probability).

In general, let $\sum_{n=1}^{N}\mathbb{I}\{X_n=j\}$ (with $X_0=i$) be the number of visits to state $j$ in $N$ steps starting from $i$. Then the expected number of visits to state $j$ is:

$$\mathbb{E}\!\left[\sum_{n=1}^{N}\mathbb{I}\{X_n=j\}\,\middle|\,X_0=i\right] = \sum_{n=1}^{N}\mathbb{E}\!\left[\mathbb{I}\{X_n=j\}\mid X_0=i\right] = \sum_{n=1}^{N}P(X_n=j\mid X_0=i)$$

The expected proportion of visits is then:

$$\frac{\sum_{n=1}^{N}P(X_n=j\mid X_0=i)}{N}$$

The stationary distribution can be interpreted as the long-run expected proportion of time that the chain spends in each state, so it is defined as a distribution $\pi=(\pi_0,\pi_1,\ldots)$ over the state space $\mathcal{X}$:

$$\pi_j = \lim_{n\to\infty}\frac{1}{n}\sum_{m=1}^{n}\mathbb{E}\!\left[\mathbb{I}\{X_m=j\}\mid X_0=i\right] = \lim_{n\to\infty}\frac{\sum_{m=1}^{n}P(X_m=j\mid X_0=i)}{n} = \lim_{m\to\infty}P(X_m=j) \qquad \text{for all initial states } i$$


So $\rho^{\pi_\theta}(\cdot)$ is the special distribution that is invariant under the dynamics: if the current state is distributed according to it and you select actions according to $\pi_\theta$, the next state follows the same distribution:

$$\int_x \rho^{\pi_\theta}(x)\int_a\pi_\theta(a|x)\,P(x'\mid x,a)\,\mathrm{d}a\,\mathrm{d}x = \rho^{\pi_\theta}(x')$$

We can easily sample states from this distribution by following πθ.
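
As a quick numerical illustration with a hypothetical 2-state, 2-action MDP and a fixed stochastic policy (assumed purely for this sketch), the empirical visit frequencies of a long trajectory match the fixed point of $P^{\pi_\theta}$, and the average reward can be computed either from the formula above or as the running average of rewards:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical MDP and policy (assumptions for illustration).
P = np.array([[[0.9, 0.1], [0.2, 0.8]],     # P[x, a, x']
              [[0.7, 0.3], [0.05, 0.95]]])
r = np.array([[1.0, 0.0], [0.0, 2.0]])       # r[x, a]
pi = np.array([[0.3, 0.7], [0.6, 0.4]])      # pi(a|x)

P_pi = np.einsum('xa,xay->xy', pi, P)        # induced chain P^pi(x'|x)

# Stationary distribution as the fixed point rho P^pi = rho (eigenvector of P_pi^T for eigenvalue 1).
vals, vecs = np.linalg.eig(P_pi.T)
rho_stat = np.real(vecs[:, np.argmax(np.real(vals))])
rho_stat = rho_stat / rho_stat.sum()

# Empirical visit frequencies and average reward from one long trajectory following pi.
counts, total_reward, x = np.zeros(2), 0.0, 0
n_steps = 200_000
for _ in range(n_steps):
    a = rng.choice(2, p=pi[x])
    total_reward += r[x, a]
    x = rng.choice(2, p=P[x, a])
    counts[x] += 1

print("stationary (fixed point):  ", rho_stat)
print("empirical visit freq.:     ", counts / n_steps)
print("average reward (formula):  ", rho_stat @ (pi * r).sum(axis=1))
print("average reward (empirical):", total_reward / n_steps)
```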

Discounted Setting

In the discounted setting, the objective is defined as the long-term discounted return starting from the initial state distribution $X_1\sim\rho$ and following $\pi_\theta$:

$$J_\rho(\pi_\theta) = \mathbb{E}_{X_1\sim\rho}\!\left[V^{\pi_\theta}(X_1)\right]$$

Recall that $P^{\pi_\theta}(\cdot\mid x;k) = \big(P^{\pi_\theta}\big)^k(\cdot\mid x)$ is the $k$-step transition distribution, and consider:

$$\sum_{k=0}^{\infty}\gamma^k P^{\pi_\theta}(X_{k+1}=x\mid x_1) = P^{\pi_\theta}(X_1=x\mid x_1) + \gamma\,P^{\pi_\theta}(X_2=x\mid x_1) + \cdots$$

By taking the expectation over the initial state, we have:

$$
\begin{aligned}
\mathbb{E}_{x_1}\!\left[\sum_{k=0}^{\infty}\gamma^k P^{\pi_\theta}(X_{k+1}=x\mid x_1)\right] &= \mathbb{E}_{x_1}\!\left[P^{\pi_\theta}(X_1=x\mid x_1)\right] + \gamma\,\mathbb{E}_{x_1}\!\left[P^{\pi_\theta}(X_2=x\mid x_1)\right] + \cdots \\
&= \rho(x) + \gamma\int_{x_1}P^{\pi_\theta}(X_2=x\mid x_1)\,\rho(x_1)\,\mathrm{d}x_1 + \cdots \\
&= \rho(x) + \gamma\,P^{\pi_\theta}(X_2=x) + \cdots \\
&= \Pr(X_1=x) + \gamma\Pr(X_2=x) + \cdots
\end{aligned}
$$

This is the expected number of time steps spent in state $x$, discounted by $\gamma^k$, given that we start at $X_1\sim\rho$.

Let's now define the discounted future-state distribution (similar to the stationary distribution, but with discounting and a dependence on the initial state), starting from state $x_1\in\mathcal{X}$ and following $\pi_\theta$:

$$\rho_\gamma^{\pi_\theta}(\cdot\mid x_1) \triangleq (1-\gamma)\sum_{k=0}^{\infty}\gamma^k\,P^{\pi_\theta}(\cdot\mid x_1;k)$$

Here $1-\gamma$ is the normalization term, and we can easily verify that this is a valid distribution. One interpretation of this distribution is that the agent starts at $x_1$ and at each time step terminates with probability $1-\gamma$; this accounts for the discount factor inside the average counts. That is, we can sample states by following the policy and terminating at each step with probability $1-\gamma$. Or, instead, we can weight the sample at step $k$ by $\gamma^k$ to account for the termination.

We can also define the discounted future-state distribution starting from $\rho$ and following $\pi$ as:

$$\rho_\gamma^{\pi}(\cdot) = \int_{x_1}\rho_\gamma^{\pi}(\cdot\mid x_1)\,\rho(x_1)\,\mathrm{d}x_1$$

$$\rho_\gamma^{\pi}(x) = (1-\gamma)\,\mathbb{E}_{x_1}\!\left[\sum_{k=0}^{\infty}\gamma^k P^{\pi}(X_{k+1}=x\mid x_1)\right] = (1-\gamma)\left(\Pr(X_1=x) + \gamma\Pr(X_2=x) + \cdots\right)$$
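
Here is a small sketch, using a hypothetical 2-state induced chain $P^\pi$ (an assumption, not from the text), showing that the two sampling interpretations above agree with the analytic $\rho_\gamma^\pi(\cdot\mid x_1)$: terminating with probability $1-\gamma$ at every step, or weighting the $k$-th visited state by $\gamma^k$.

```python
import numpy as np

rng = np.random.default_rng(0)
gamma = 0.9

P_pi = np.array([[0.8, 0.2],                  # hypothetical induced chain P^pi(x'|x)
                 [0.3, 0.7]])

# Analytic rho_gamma^pi(.|x1=0) = (1-gamma) * sum_k gamma^k P^pi(.|x1; k)
e0 = np.array([1.0, 0.0])
analytic = (1 - gamma) * e0 @ np.linalg.inv(np.eye(2) - gamma * P_pi)

# Scheme 1: follow the chain, terminating with probability 1-gamma; keep the final state.
term_counts = np.zeros(2)
for _ in range(100_000):
    x = 0
    while rng.random() >= 1 - gamma:          # continue with probability gamma
        x = rng.choice(2, p=P_pi[x])
    term_counts[x] += 1

# Scheme 2: run (truncated) trajectories and weight the k-th visited state by gamma^k.
weighted = np.zeros(2)
n_traj, horizon = 2_000, 100
for _ in range(n_traj):
    x = 0
    for k in range(horizon):
        weighted[x] += gamma ** k
        x = rng.choice(2, p=P_pi[x])

print("analytic:          ", analytic)
print("termination scheme:", term_counts / term_counts.sum())
print("gamma^k weighting: ", (1 - gamma) * weighted / n_traj)
```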


Expected Reward and Discounted Return

We can see the connection between expected reward and discounted return:

$$
\begin{aligned}
V^{\pi}(x) &= \mathbb{E}_{\pi}\!\left[\sum_{t=1}^{\infty}\gamma^{t-1}R_t\,\middle|\,X_1=x\right] = \sum_{t=1}^{\infty}\gamma^{t-1}\,\mathbb{E}_{\pi}\!\left[R_t\mid X_1=x\right] \\
&= \sum_{t=1}^{\infty}\gamma^{t-1}\int_{x',a} P^{\pi}(X_t=x'\mid X_1=x)\,\pi(a\mid x')\,R(x',a)\,\mathrm{d}a\,\mathrm{d}x' \\
&= \frac{1}{1-\gamma}\int_{x',a}\rho_\gamma^{\pi}(x'\mid x)\,\pi(a\mid x')\,R(x',a)\,\mathrm{d}a\,\mathrm{d}x' \\
&= \frac{1}{1-\gamma}\,\mathbb{E}_{X\sim\rho_\gamma^{\pi}(\cdot\mid x),\,A\sim\pi(\cdot\mid X)}\!\left[R(X,A)\right]
\end{aligned}
$$

We can interchange the expectation and the infinite sum because the discounted sum is finite. Here $R(X,A)$ denotes the reward received at state $X$ after taking action $A$, i.e., the realization of the reward random variable. This tells us that the value function at a state $x$ is, up to the $\frac{1}{1-\gamma}$ factor, the expected reward when $X$ is distributed according to $\rho_\gamma^\pi(\cdot\mid x)$ and $A$ follows $\pi$.
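
A quick numerical check of this identity, with the same kind of hypothetical tabular MDP used in the earlier sketches (again, the numbers are made up):

```python
import numpy as np

gamma = 0.9
P = np.array([[[0.9, 0.1], [0.2, 0.8]],      # hypothetical P[x, a, x']
              [[0.7, 0.3], [0.05, 0.95]]])
r = np.array([[1.0, 0.0], [0.0, 2.0]])        # r[x, a]
pi = np.array([[0.3, 0.7], [0.6, 0.4]])       # pi(a|x)

P_pi = np.einsum('xa,xay->xy', pi, P)         # P^pi(x'|x)
r_pi = (pi * r).sum(axis=1)                   # r^pi(x)

# Left-hand side: value function from the Bellman equation.
V = np.linalg.solve(np.eye(2) - gamma * P_pi, r_pi)

# Right-hand side: (1/(1-gamma)) * E_{X ~ rho_gamma^pi(.|x)}[r^pi(X)]
rho_gamma = (1 - gamma) * np.linalg.inv(np.eye(2) - gamma * P_pi)  # row x: rho_gamma^pi(.|x)
rhs = (rho_gamma @ r_pi) / (1 - gamma)

print(V)      # the two should agree exactly
print(rhs)
```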

Policy Gradient Theorem

Proof: Average Reward Setting

In the average reward setting, the correspondence with the policy gradient theorem notation is $d^\pi(x)=\rho^{\pi_\theta}(x)$ and $\rho=J(\pi_\theta)$, and $Q^\pi(x,a)$ is defined as:

$$Q^{\pi}(x,a) = r(x,a) - J(\pi_\theta) + \int_{x'}P(x'\mid x,a)\,V^{\pi_\theta}(x')\,\mathrm{d}x'$$

Recall that,

$$
\begin{aligned}
\nabla V^{\pi_\theta}(x) &= \nabla\int_a\pi_\theta(a|x)\,Q^{\pi_\theta}(x,a)\,\mathrm{d}a \qquad \forall x\in\mathcal{X} \\
&= \int_a\nabla\pi_\theta(a|x)\,Q^{\pi_\theta}(x,a)\,\mathrm{d}a + \int_a\pi_\theta(a|x)\,\nabla Q^{\pi_\theta}(x,a)\,\mathrm{d}a \\
&= \int_a\nabla\pi_\theta(a|x)\,Q^{\pi_\theta}(x,a)\,\mathrm{d}a + \int_a\pi_\theta(a|x)\,\nabla\!\left(r(x,a) - J(\pi_\theta) + \int_{x'}P(x'\mid x,a)\,V^{\pi_\theta}(x')\,\mathrm{d}x'\right)\mathrm{d}a \\
&= \int_a\nabla\pi_\theta(a|x)\,Q^{\pi_\theta}(x,a)\,\mathrm{d}a + \int_a\pi_\theta(a|x)\left(-\nabla J(\pi_\theta) + \int_{x'}P(x'\mid x,a)\,\nabla V^{\pi_\theta}(x')\,\mathrm{d}x'\right)\mathrm{d}a
\end{aligned}
$$


That is:

$$\nabla J(\pi_\theta)\,\underbrace{\int_a\pi_\theta(a|x)\,\mathrm{d}a}_{=1} = \int_a\nabla\pi_\theta(a|x)\,Q^{\pi_\theta}(x,a)\,\mathrm{d}a + \int_a\pi_\theta(a|x)\int_{x'}P(x'\mid x,a)\,\nabla V^{\pi_\theta}(x')\,\mathrm{d}x'\,\mathrm{d}a - \nabla V^{\pi_\theta}(x)$$

Weighting both sides by $\rho^{\pi_\theta}(x)$ (i.e., taking the expectation over states) and remembering the stationarity property $\int_x \rho^{\pi_\theta}(x)\int_a\pi_\theta(a|x)\,P(x'\mid x,a)\,\mathrm{d}a\,\mathrm{d}x = \rho^{\pi_\theta}(x')$:

$$
\begin{aligned}
\underbrace{\int_x\rho^{\pi_\theta}(x)\,\mathrm{d}x}_{=1}\,\nabla J(\pi_\theta) &= \int_x\rho^{\pi_\theta}(x)\int_a\nabla\pi_\theta(a|x)\,Q^{\pi_\theta}(x,a)\,\mathrm{d}a\,\mathrm{d}x + \int_x\rho^{\pi_\theta}(x)\int_a\pi_\theta(a|x)\int_{x'}P(x'\mid x,a)\,\nabla V^{\pi_\theta}(x')\,\mathrm{d}x'\,\mathrm{d}a\,\mathrm{d}x - \int_x\rho^{\pi_\theta}(x)\,\nabla V^{\pi_\theta}(x)\,\mathrm{d}x \\
\nabla J(\pi_\theta) &= \int_x\rho^{\pi_\theta}(x)\int_a\nabla\pi_\theta(a|x)\,Q^{\pi_\theta}(x,a)\,\mathrm{d}a\,\mathrm{d}x + \int_{x'}\rho^{\pi_\theta}(x')\,\nabla V^{\pi_\theta}(x')\,\mathrm{d}x' - \int_x\rho^{\pi_\theta}(x)\,\nabla V^{\pi_\theta}(x)\,\mathrm{d}x \\
&= \int_x\rho^{\pi_\theta}(x)\int_a\nabla\pi_\theta(a|x)\,Q^{\pi_\theta}(x,a)\,\mathrm{d}a\,\mathrm{d}x
\end{aligned}
$$


Proof: Discounted Return Setting

In the discounted return setting, the correspondence is $d^\pi(x)=\rho_\gamma^{\pi_\theta}(x)$ and $\rho=J_\rho(\pi_\theta)$.

Recall that,

$$
\begin{aligned}
\nabla V^{\pi_\theta}(x) &= \nabla\int_a\pi_\theta(a|x)\,Q^{\pi_\theta}(x,a)\,\mathrm{d}a \qquad \forall x\in\mathcal{X} \\
&= \int_a\nabla\pi_\theta(a|x)\,Q^{\pi_\theta}(x,a)\,\mathrm{d}a + \int_a\pi_\theta(a|x)\,\nabla Q^{\pi_\theta}(x,a)\,\mathrm{d}a \\
&= \int_a\nabla\pi_\theta(a|x)\,Q^{\pi_\theta}(x,a)\,\mathrm{d}a + \int_a\pi_\theta(a|x)\,\nabla\!\left(r(x,a) + \gamma\int_{x'}P(x'\mid x,a)\,V^{\pi_\theta}(x')\,\mathrm{d}x'\right)\mathrm{d}a \\
&= \int_a\nabla\pi_\theta(a|x)\,Q^{\pi_\theta}(x,a)\,\mathrm{d}a + \int_a\pi_\theta(a|x)\,\gamma\int_{x'}P(x'\mid x,a)\,\nabla V^{\pi_\theta}(x')\,\mathrm{d}x'\,\mathrm{d}a \\
&= \int_a\nabla\pi_\theta(a|x)\,Q^{\pi_\theta}(x,a)\,\mathrm{d}a + \gamma\int_{x'}P^{\pi_\theta}(x'\mid x)\,\nabla V^{\pi_\theta}(x')\,\mathrm{d}x'
\end{aligned}
$$


Now, expanding $\nabla V^{\pi_\theta}(x')$ in the same way and unrolling the recursion, we end up with:

$$\nabla V^{\pi_\theta}(x) = \int_{x'}\sum_{k=0}^{\infty}\gamma^k P^{\pi_\theta}(x'\mid x;k)\int_a\nabla\pi_\theta(a|x')\,Q^{\pi_\theta}(x',a)\,\mathrm{d}a\,\mathrm{d}x'$$

Then, we can turn this into an expectation by inserting the normalization term $1-\gamma$:

$$
\begin{aligned}
\nabla V^{\pi_\theta}(x) &= \frac{1}{1-\gamma}\int_{x'}(1-\gamma)\sum_{k=0}^{\infty}\gamma^k P^{\pi_\theta}(x'\mid x;k)\int_a\nabla\pi_\theta(a|x')\,Q^{\pi_\theta}(x',a)\,\mathrm{d}a\,\mathrm{d}x' \\
&= \frac{1}{1-\gamma}\int_{x'}\rho_\gamma^{\pi_\theta}(x'\mid x)\int_a\nabla\pi_\theta(a|x')\,Q^{\pi_\theta}(x',a)\,\mathrm{d}a\,\mathrm{d}x'
\end{aligned}
$$


Then $\nabla J_\rho(\pi_\theta) = \nabla\,\mathbb{E}_{X\sim\rho}\!\left[V^{\pi_\theta}(X)\right]$ can be expressed as:

$$
\begin{aligned}
\nabla J_\rho(\pi_\theta) &= \frac{1}{1-\gamma}\int_{x_1}\int_{x}\rho(x_1)\,\rho_\gamma^{\pi_\theta}(x\mid x_1)\int_a\nabla\pi_\theta(a|x)\,Q^{\pi_\theta}(x,a)\,\mathrm{d}a\,\mathrm{d}x\,\mathrm{d}x_1 \\
&= \frac{1}{1-\gamma}\int_{x}\rho_\gamma^{\pi_\theta}(x)\int_a\nabla\pi_\theta(a|x)\,Q^{\pi_\theta}(x,a)\,\mathrm{d}a\,\mathrm{d}x
\end{aligned}
$$
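
This closed form can be checked numerically. The sketch below, using the same hypothetical 2-state MDP and a softmax policy (all assumptions of mine), compares $\frac{1}{1-\gamma}\int_x \rho_\gamma^{\pi_\theta}(x)\int_a \nabla\pi_\theta(a|x)\,Q^{\pi_\theta}(x,a)\,\mathrm{d}a\,\mathrm{d}x$ with a finite-difference estimate of $\nabla J_\rho(\pi_\theta)$.

```python
import numpy as np

gamma = 0.9
P = np.array([[[0.9, 0.1], [0.2, 0.8]],      # hypothetical P[x, a, x']
              [[0.7, 0.3], [0.05, 0.95]]])
r = np.array([[1.0, 0.0], [0.0, 2.0]])        # r[x, a]
rho = np.array([0.5, 0.5])                    # initial distribution

def softmax_policy(theta):
    z = np.exp(theta - theta.max(axis=1, keepdims=True))
    return z / z.sum(axis=1, keepdims=True)

def quantities(theta):
    """pi, Q, the discounted future-state distribution rho_gamma, and V under pi_theta."""
    pi = softmax_policy(theta)
    P_pi = np.einsum('xa,xay->xy', pi, P)
    r_pi = (pi * r).sum(axis=1)
    V = np.linalg.solve(np.eye(2) - gamma * P_pi, r_pi)
    Q = r + gamma * np.einsum('xay,y->xa', P, V)
    rho_gamma = (1 - gamma) * rho @ np.linalg.inv(np.eye(2) - gamma * P_pi)
    return pi, Q, rho_gamma, V

def J(theta):
    return rho @ quantities(theta)[3]

theta = np.array([[0.2, -0.1], [0.4, 0.3]])

# Closed form: (1/(1-gamma)) * sum_x rho_gamma(x) sum_a grad pi(a|x) Q(x,a).
pi, Q, rho_gamma, _ = quantities(theta)
grad_pg = np.zeros_like(theta)
for x in range(2):
    for a in range(2):
        # Softmax gradient: d pi(a|x) / d theta[x, b] = pi(a|x) * (1[a=b] - pi(b|x))
        dpi = pi[x, a] * ((np.arange(2) == a).astype(float) - pi[x])
        grad_pg[x] += rho_gamma[x] * dpi * Q[x, a] / (1 - gamma)

# Finite-difference check.
grad_fd = np.zeros_like(theta)
eps = 1e-6
for idx in np.ndindex(*theta.shape):
    e = np.zeros_like(theta)
    e[idx] = eps
    grad_fd[idx] = (J(theta + e) - J(theta - e)) / (2 * eps)

print(grad_pg)
print(grad_fd)    # the two should agree
```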


Policy Gradients as Expectations

Recall the score function estimator:

$$\nabla_\theta\,\mathbb{E}_{X\sim p(\cdot;\theta)}\!\left[f(X)\right] = \int_x \nabla_\theta p(x;\theta)\,\frac{p(x;\theta)}{p(x;\theta)}\,f(x)\,\mathrm{d}x = \mathbb{E}_{X\sim p(\cdot;\theta)}\!\left[f(X)\,\nabla_\theta\log p(X;\theta)\right]$$

Thus, the results above can be written as:

  • Average Reward Setting:

    $$
    \begin{aligned}
    \nabla J(\pi_\theta) &= \int_x\rho^{\pi_\theta}(x)\int_a\nabla\pi_\theta(a|x)\,Q^{\pi_\theta}(x,a)\,\mathrm{d}a\,\mathrm{d}x \\
    &= \mathbb{E}_{X\sim\rho^{\pi_\theta}(\cdot)}\!\left[\int_a\pi_\theta(a|X)\,\frac{\nabla\pi_\theta(a|X)}{\pi_\theta(a|X)}\,Q^{\pi_\theta}(X,a)\,\mathrm{d}a\right] \\
    &= \mathbb{E}_{X\sim\rho^{\pi_\theta}(\cdot)}\!\left[\int_a\pi_\theta(a|X)\,\nabla\log\pi_\theta(a|X)\,Q^{\pi_\theta}(X,a)\,\mathrm{d}a\right] \\
    &= \mathbb{E}_{X\sim\rho^{\pi_\theta}(\cdot),\,A\sim\pi_\theta(\cdot|X)}\!\left[\nabla\log\pi_\theta(A|X)\,Q^{\pi_\theta}(X,A)\right]
    \end{aligned}
    $$


  • Discounted Return Setting:

    $$\nabla J_\rho(\pi_\theta) \propto \mathbb{E}_{X\sim\rho_\gamma^{\pi_\theta}(\cdot),\,A\sim\pi_\theta(\cdot|X)}\!\left[\nabla\log\pi_\theta(A|X)\,Q^{\pi_\theta}(X,A)\right]$$

Now that we have a clean expectation over the stationary (or discounted future-state) distribution, we can use samples to estimate the gradient.

Sampling from Expectations

Average Reward Setting

In the average reward setting, or in episodic tasks with $\gamma=1$, sampling from the expectation is straightforward. We can directly sample trajectories by following $\pi_\theta$; then the states are (asymptotically) samples from the stationary state distribution and the actions are samples from the policy.

Discounted Return Setting

In discounted return setting, however, sampling is a little tricky.

  1. The agent starts an episode from $X_1\sim\rho$ and follows $\pi_\theta$.
  2. We get a sequence of states $X_1,\ldots$

However, these states are (asymptotically) samples from the stationary state distribution, not from the discounted future-state distribution. In order to sample from the correct distribution, we either:

  1. Interpret $\gamma$ as a termination probability: terminate at each time step with probability $1-\gamma$.
  2. Add a weight $\gamma^k$, which depends on the current time step $k$, to account for the discounting.

If we account for the discounting in one of these two ways and have an estimate of $Q^{\pi_\theta}(X,A)$, then (up to the $\frac{1}{1-\gamma}$ normalization constant) we have an unbiased sample of the policy gradient; with the weighting scheme, the sample at time step $k$ is:

$$\gamma^k\,\nabla\log\pi_\theta(A\mid X)\,Q^{\pi_\theta}(X,A)$$

We can now use these samples in an SG scheme to improve the policy.
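
Putting the pieces together, here is a minimal sketch of such a stochastic gradient scheme for the discounted setting. The environment, the softmax parameterization, and the use of the Monte Carlo return from step $k$ as the estimate of $Q^{\pi_\theta}(X,A)$ are all my assumptions for illustration (this is essentially the REINFORCE algorithm of the next post):

```python
import numpy as np

rng = np.random.default_rng(0)
gamma = 0.9
P = np.array([[[0.9, 0.1], [0.2, 0.8]],      # hypothetical P[x, a, x']
              [[0.7, 0.3], [0.05, 0.95]]])
r = np.array([[1.0, 0.0], [0.0, 2.0]])        # r[x, a]
rho = np.array([0.5, 0.5])                    # initial state distribution

def softmax_policy(theta):
    z = np.exp(theta - theta.max(axis=1, keepdims=True))
    return z / z.sum(axis=1, keepdims=True)

theta = np.zeros((2, 2))
alpha, horizon = 0.01, 60

for episode in range(5000):
    pi = softmax_policy(theta)
    x = rng.choice(2, p=rho)                          # X_1 ~ rho
    states, actions, rewards = [], [], []
    for _ in range(horizon):                          # roll out one (truncated) episode
        a = rng.choice(2, p=pi[x])
        states.append(x); actions.append(a); rewards.append(r[x, a])
        x = rng.choice(2, p=P[x, a])

    grad = np.zeros_like(theta)
    for k in range(horizon):
        x_k, a_k = states[k], actions[k]
        # Monte Carlo return from step k, used as the estimate of Q^{pi_theta}(X, A).
        G_k = sum(gamma ** (t - k) * rewards[t] for t in range(k, horizon))
        # grad log pi_theta(a_k | x_k) for the softmax parameterization.
        grad_log = -pi[x_k].copy()
        grad_log[a_k] += 1.0
        # The gamma^k weight accounts for the discounted future-state distribution.
        grad[x_k] += gamma ** k * grad_log * G_k
    theta += alpha * grad                             # stochastic gradient ascent step

print(softmax_policy(theta))                          # learned action probabilities per state
```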

Next: REINFORCE and A2C

References

Amir-massoud Farahmand. Introduction to Reinforcement Learning (Spring 2021) Lecture 6

Sham M. Kakade. On the sample complexity of reinforcement learning. 2003. https://homes.cs.washington.edu/~sham/papers/thesis/sham_thesis.pdf

Thomas, P.. (2014). Bias in Natural Actor-Critic Algorithms. Proceedings of the 31st International Conference on Machine Learning, in PMLR 32(1):441-448

Sutton, R. S., McAllester, D. A., Singh, S. P., and Mansour, Y. (1999). Policy gradient methods for reinforcement learning with function approximation. In Neural Information Processing Systems 12, pages 1057–1063.

Sutton & Barto's "Reinforcement Learning: An Introduction", 2nd edition

https://stats.stackexchange.com/questions/400087/how-deriving-the-formula-for-the-on-policy-distribution-in-episodic-tasks

Markov chains:

http://wwwf.imperial.ac.uk/~ejm/M3S4/NOTES3.pdf

A First Look at Stochastic Processes https://www.worldscientific.com/worldscibooks/10.1142/11488