Value Function Approximation (2)
Value Function Approximation (Algorithms)
Approximate Value Iteration
Population Version
Recall that the procedure for VI is:

$$V_{k+1} \leftarrow T^\pi V_k \quad \text{(or } Q_{k+1} \leftarrow T^* Q_k \text{ for the control case)}.$$

One way to develop its approximate version is to perform each step only approximately, i.e., find a function $V_{k+1} \in \mathcal{F}$ that is close to $T^\pi V_k$:

$$V_{k+1} \leftarrow \operatorname*{argmin}_{V \in \mathcal{F}} \left\| V - T^\pi V_k \right\|_{2,\mu}^2,$$

where $\mathcal{F}$ is the value function space and $\mu$ is a distribution over the state space. We start from an initial $V_0 \in \mathcal{F}$. At each iteration, we solve a regression problem whose target function is $T^\pi V_k$, and the minimizer becomes $V_{k+1}$.
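A minimal sketch of the population version for a finite MDP with a linear value function space, assuming the transition matrix `P`, expected reward `r`, and state distribution `mu` of the policy are known (all names below are illustrative, not from the original notes):

```python
import numpy as np

def avi_population(P, r, Phi, mu, gamma=0.9, n_iter=100):
    """Approximate value iteration with a linear function space.

    P   : (N, N) transition matrix of the policy, P[x, x'] = P^pi(x' | x)
    r   : (N,) expected reward r^pi(x)
    Phi : (N, d) feature matrix; F = {Phi @ w : w in R^d}
    mu  : (N,) state distribution used for the weighted regression
    """
    N, d = Phi.shape
    w = np.zeros(d)
    M = np.diag(mu)
    for _ in range(n_iter):
        target = r + gamma * P @ (Phi @ w)   # T^pi V_k, computable because P and r are known
        # mu-weighted least-squares projection of the target onto F
        w = np.linalg.solve(Phi.T @ M @ Phi, Phi.T @ M @ target)
    return Phi @ w
```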
Batch Version
The objective of AVI cannot be computed because:

- The environment dynamics $\mathcal{P}$ is unknown, so the target $T^\pi V_k$ cannot be computed.
- The distribution $\mu$ is often not available, thus the expectation defining the norm cannot be computed.

If we have samples in the form of $\mathcal{D}_n = \{ (X_i, R_i, X_i') \}_{i=1}^n$, where $X_i \sim \mu$, $R_i$ is the observed reward, and $X_i' \sim \mathcal{P}^\pi(\cdot \mid X_i)$ is the next state, the question is: can we replace the unknown target $(T^\pi V_k)(X_i)$ with the sample-based target $R_i + \gamma V_k(X_i')$?

Since $\mathbb{E}\left[ R_i + \gamma V_k(X_i') \mid X_i \right] = (T^\pi V_k)(X_i)$, the sample-based target is an unbiased estimate of the true target. Decomposing the expected squared error at $X_i$ into the squared distance to $(T^\pi V_k)(X_i)$ plus a variance term, we see that the variance term does not depend on $V$ (it involves only the fixed $V_k$). Hence minimizing the empirical objective

$$\hat{V}_{k+1} \leftarrow \operatorname*{argmin}_{V \in \mathcal{F}} \frac{1}{n} \sum_{i=1}^n \left| V(X_i) - \left( R_i + \gamma V_k(X_i') \right) \right|^2$$

is, in expectation, the same as minimizing the population AVI objective up to an additive constant. Similar to ERM, we do not know the environment dynamics or $\mu$, but we can still minimize this empirical loss computed from samples.
This is the basis of DQN.
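A minimal sketch of this batch procedure, i.e., fitted value iteration with linear features and ordinary least squares as the regression step (the dataset format and function names are assumptions; DQN applies the same idea to action-value functions with a neural network):

```python
import numpy as np

def fitted_vi(dataset, phi, d, gamma=0.9, n_iter=50):
    """Batch AVI / fitted value iteration with linear features.

    dataset : list of (x, r, x_next) transitions collected under the policy
    phi     : feature map, phi(x) -> np.ndarray of shape (d,)
    """
    X  = np.array([phi(x)  for x, _, _ in dataset])   # (n, d) features of current states
    Xp = np.array([phi(xp) for _, _, xp in dataset])   # (n, d) features of next states
    R  = np.array([r for _, r, _ in dataset])          # (n,) rewards
    w = np.zeros(d)
    for _ in range(n_iter):
        y = R + gamma * Xp @ w                          # sample-based targets R_i + gamma V_k(X_i')
        # regression of targets on features: V_{k+1} = argmin sum_i (V(X_i) - y_i)^2
        w, *_ = np.linalg.lstsq(X, y, rcond=None)
    return w
```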
Bellman Residual Minimization
Population Version
Recall that the value function of a policy satisfies the Bellman equation:

$$V^\pi = T^\pi V^\pi.$$

Under FA, we may not achieve this exact equality; instead, we can only hope that

$$V \approx T^\pi V \quad \text{for some } V \in \mathcal{F}.$$

Thus, we can formulate our objective as minimizing the Bellman residual:

$$V \leftarrow \operatorname*{argmin}_{V \in \mathcal{F}} \left\| V - T^\pi V \right\|_{2,\mu}^2.$$

By minimizing this objective (Bellman Residual Minimization, BRM), we obtain an approximation of $V^\pi$.

This procedure is different from AVI in that we do not mimic the iterative process of VI (which is convergent in the exact case without any FA), but instead directly go for the solution of the fixed-point equation ($V = T^\pi V$).

If there exists a $V \in \mathcal{F}$ with zero Bellman residual, that $V$ is exactly $V^\pi$.
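For a finite MDP with known dynamics and linear features, the population BRM objective is a $\mu$-weighted least-squares problem in the weight vector and has a closed-form solution. A sketch under those assumptions (names are illustrative):

```python
import numpy as np

def brm_population(P, r, Phi, mu, gamma=0.9):
    """Population Bellman Residual Minimization with linear features.

    Minimizes || Phi w - (r + gamma P Phi w) ||^2_mu, a mu-weighted
    least-squares problem with design matrix (Phi - gamma * P @ Phi).
    """
    M = np.diag(mu)
    A = Phi - gamma * P @ Phi                  # residual design matrix
    w = np.linalg.solve(A.T @ M @ A, A.T @ M @ r)
    return w
```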
Batch Version
Similar to AVI, we may want to replace the unknown $(T^\pi V)(X_i)$ with the sample-based target $R_i + \gamma V(X_i')$, using the empirical objective

$$\operatorname*{argmin}_{V \in \mathcal{F}} \frac{1}{n} \sum_{i=1}^n \left| V(X_i) - \left( R_i + \gamma V(X_i') \right) \right|^2.$$

We can see that, unlike the AVI case, this empirical objective is a biased estimate of the Bellman residual. To see this, for any fixed $V$, consider the expected squared error conditioned on $X$:

$$\mathbb{E}\left[ \left| V(X) - \left( R + \gamma V(X') \right) \right|^2 \,\middle|\, X \right].$$

Then, adding and subtracting $(T^\pi V)(X)$ and expanding, the cross term vanishes because $\mathbb{E}\left[ R + \gamma V(X') \mid X \right] = (T^\pi V)(X)$, so

$$\mathbb{E}\left[ \left| V(X) - \left( R + \gamma V(X') \right) \right|^2 \,\middle|\, X \right] = \left| V(X) - (T^\pi V)(X) \right|^2 + \mathbb{E}\left[ \left| (T^\pi V)(X) - \left( R + \gamma V(X') \right) \right|^2 \,\middle|\, X \right].$$

The second term is the conditional variance of the sample-based target; ignoring the reward noise, it equals $\gamma^2 \,\mathrm{Var}\left[ V(X') \mid X \right]$.

We can see that the variance term now depends on $V$ itself (not on a fixed $V_k$ as in AVI), so minimizing the empirical objective does not minimize the Bellman residual alone; it additionally penalizes the variance of $V$ at the next state.
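A quick numerical check of this decomposition on a constructed two-successor example (everything below is made up for illustration, not taken from the notes):

```python
import numpy as np

rng = np.random.default_rng(0)
gamma = 0.9

# Toy setup: from state x, the next state is x1 or x2 with equal probability; reward is 1.
V = {"x": 2.0, "x1": 1.0, "x2": 3.0}            # some fixed value function
r = 1.0
TV_x = r + gamma * 0.5 * (V["x1"] + V["x2"])     # (T^pi V)(x)

# Monte Carlo estimate of E[|V(x) - (R + gamma V(X'))|^2 | x]
samples = rng.choice([V["x1"], V["x2"]], size=100_000)
lhs = np.mean((V["x"] - (r + gamma * samples)) ** 2)

# Decomposition: squared Bellman residual + gamma^2 * Var[V(X') | x]
rhs = (V["x"] - TV_x) ** 2 + gamma**2 * np.var([V["x1"], V["x2"]])
print(lhs, rhs)   # the two agree up to Monte Carlo error
```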
Projected Bellman Error
From BRM, we know that even though $V^\pi$ satisfies the fixed-point equation $V^\pi = T^\pi V^\pi$, for a $V \in \mathcal{F}$ the function $T^\pi V$ may not lie in $\mathcal{F}$, so exact equality may be unattainable within the function space.

We want to find a $V \in \mathcal{F}$ that satisfies the projected fixed-point equation

$$V = \Pi_{\mathcal{F},\mu} T^\pi V,$$

where $\Pi_{\mathcal{F},\mu}$ denotes the projection onto $\mathcal{F}$ with respect to the norm $\|\cdot\|_{2,\mu}$. The projection operator is defined as

$$\Pi_{\mathcal{F},\mu} V \triangleq \operatorname*{argmin}_{f \in \mathcal{F}} \left\| f - V \right\|_{2,\mu}^2.$$

It has some properties:

- The projection belongs to the function space: $\Pi_{\mathcal{F},\mu} V \in \mathcal{F}$.
- If $V \in \mathcal{F}$, the projection is itself: $\Pi_{\mathcal{F},\mu} V = V$.
- The projection operator onto a subspace is a non-expansion: $\left\| \Pi_{\mathcal{F},\mu} V_1 - \Pi_{\mathcal{F},\mu} V_2 \right\|_{2,\mu} \le \left\| V_1 - V_2 \right\|_{2,\mu}$.
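For a linear function space the projection operator has an explicit matrix form (it reappears in the LSTD derivation below). A small sketch that builds it and checks the properties numerically; all numbers here are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)
N, d = 6, 2
Phi = rng.normal(size=(N, d))            # feature matrix; F = span of its columns
mu = np.full(N, 1.0 / N)                 # state distribution
M = np.diag(mu)

# Projection onto F with respect to the mu-weighted L2 norm
Pi = Phi @ np.linalg.solve(Phi.T @ M @ Phi, Phi.T @ M)

mu_norm = lambda v: np.sqrt(v @ M @ v)
V1, V2 = rng.normal(size=N), rng.normal(size=N)

# If V is already in F, the projection leaves it unchanged
w = rng.normal(size=d)
assert np.allclose(Pi @ (Phi @ w), Phi @ w)

# The projection onto a subspace is a non-expansion in the mu-weighted norm
assert mu_norm(Pi @ V1 - Pi @ V2) <= mu_norm(V1 - V2) + 1e-12
```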
Population Version
We can define a loss function based on the distance between $V$ and the projection of $T^\pi V$ onto $\mathcal{F}$:

$$\left\| V - \Pi_{\mathcal{F},\mu} T^\pi V \right\|_{2,\mu}^2.$$

This is called the Projected Bellman Error, or Mean Square Projected Bellman Error (MSPBE).

We find the value function by solving the following optimization problem:

$$V \leftarrow \operatorname*{argmin}_{V \in \mathcal{F}} \left\| V - \Pi_{\mathcal{F},\mu} T^\pi V \right\|_{2,\mu}^2.$$

Since $V \in \mathcal{F}$, we have $\Pi_{\mathcal{F},\mu} V = V$, so the objective can be rewritten as

$$\left\| \Pi_{\mathcal{F},\mu} \left( V - T^\pi V \right) \right\|_{2,\mu}^2,$$

which is the norm of the projection of the Bellman Residual onto $\mathcal{F}$.

We can think of solving the projected Bellman error objective as solving two coupled optimization problems:

- Find the projection point given the value function $V$: $\bar{V} = \operatorname*{argmin}_{f \in \mathcal{F}} \left\| f - T^\pi V \right\|_{2,\mu}^2$.
- Find the value function $V \in \mathcal{F}$ that is closest to the projection point: $V = \operatorname*{argmin}_{f \in \mathcal{F}} \left\| f - \bar{V} \right\|_{2,\mu}^2$.
Least Squares Temporal Difference Learning (Population Version)
If $\mathcal{F}$ is a linear function space, we can find a direct solution to the PBE objective; that is, we want to find $V \in \mathcal{F}$ such that $V = \Pi_{\mathcal{F},\mu} T^\pi V$.

We assume that:

- The state space $\mathcal{X}$ is finite with $|\mathcal{X}| = N$, and each state $x$ has a feature vector $\phi(x)$, where each $\phi(x)$ is a $d$-dimensional vector.

Then, we define the feature matrix $\Phi \in \mathbb{R}^{N \times d}$ whose $x$-th row is $\phi(x)^\top$. The value function is represented linearly as $V(x) = \phi(x)^\top w$, or in matrix form $V = \Phi w$.

Our goal is to find a weight vector $w \in \mathbb{R}^d$ such that

$$\Phi w = \Pi_{\mathcal{F},\mu} T^\pi \Phi w.$$

Let $M = \mathrm{diag}(\mu)$ be the diagonal matrix of state probabilities. Then, the projection operator onto the linear space $\mathcal{F} = \{ \Phi w : w \in \mathbb{R}^d \}$ is obtained by solving

$$\operatorname*{argmin}_{w} \left\| \Phi w - V \right\|_{2,\mu}^2 = \operatorname*{argmin}_{w} (\Phi w - V)^\top M (\Phi w - V).$$

By taking the derivative and setting it to zero (assuming that $\Phi^\top M \Phi$ is invertible), we get $w^* = (\Phi^\top M \Phi)^{-1} \Phi^\top M V$. Then the projection is:

$$\Pi_{\mathcal{F},\mu} = \Phi \left( \Phi^\top M \Phi \right)^{-1} \Phi^\top M.$$

Since $T^\pi \Phi w = r^\pi + \gamma P^\pi \Phi w$, substitute the equation of the projection into the fixed-point condition:

$$\Phi w = \Phi \left( \Phi^\top M \Phi \right)^{-1} \Phi^\top M \left( r^\pi + \gamma P^\pi \Phi w \right).$$

Multiply both sides by $\Phi^\top M$ to obtain

$$\Phi^\top M \left( \Phi - \gamma P^\pi \Phi \right) w = \Phi^\top M r^\pi \quad \Longrightarrow \quad w_{\mathrm{LSTD}} = \left[ \Phi^\top M \left( \Phi - \gamma P^\pi \Phi \right) \right]^{-1} \Phi^\top M r^\pi.$$

This solution is the Least-Squares Temporal Difference method (LSTD).

Notice that the term $\Phi^\top M \left( \Phi w - T^\pi \Phi w \right)$ equals zero at the solution, i.e., the Bellman residual is orthogonal (in the $\mu$-weighted inner product) to every feature, and hence to the whole space $\mathcal{F}$.

Thus, LSTD finds a $V \in \mathcal{F}$ whose Bellman residual is orthogonal to the function space, which is exactly the condition of having zero projected Bellman error.
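A compact sketch of the population LSTD solution for a finite MDP with known $P^\pi$, $r^\pi$, and $\mu$, following the matrices defined above (names are illustrative):

```python
import numpy as np

def lstd_population(P, r, Phi, mu, gamma=0.9):
    """Population LSTD: solve Phi^T M (Phi - gamma P Phi) w = Phi^T M r."""
    M = np.diag(mu)
    A = Phi.T @ M @ (Phi - gamma * P @ Phi)
    b = Phi.T @ M @ r
    return np.linalg.solve(A, b)   # V = Phi @ w has zero projected Bellman error
```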
Batch Version
We have shown that the solution to the projected fixed-point equation satisfies $A w = b$, where:

$$A = \Phi^\top M \left( \Phi - \gamma P^\pi \Phi \right), \qquad b = \Phi^\top M r^\pi.$$

Expanding these matrix products as expectations over $X \sim \mu$ gives

$$A = \mathbb{E}_{X \sim \mu}\left[ \phi(X) \left( \phi(X) - \gamma \,\mathbb{E}\left[ \phi(X') \mid X \right] \right)^\top \right], \qquad b = \mathbb{E}_{X \sim \mu}\left[ \phi(X)\, r^\pi(X) \right].$$

Thus, in terms of the current state $X$, the reward $R$, and the next state $X'$, both $A$ and $b$ are expectations of quantities computable from a single transition: $A = \mathbb{E}\left[ \phi(X) \left( \phi(X) - \gamma \phi(X') \right)^\top \right]$ and $b = \mathbb{E}\left[ \phi(X) R \right]$.

Given a data set $\mathcal{D}_n = \{ (X_i, R_i, X_i') \}_{i=1}^n$ with $X_i \sim \mu$, we estimate

$$\hat{A}_n = \frac{1}{n} \sum_{i=1}^n \phi(X_i) \left( \phi(X_i) - \gamma \phi(X_i') \right)^\top, \qquad \hat{b}_n = \frac{1}{n} \sum_{i=1}^n \phi(X_i) R_i,$$

and compute the batch LSTD solution $\hat{w}_n = \hat{A}_n^{-1} \hat{b}_n$.
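A sketch of the batch estimator; the dataset format mirrors the one used above and the names are illustrative:

```python
import numpy as np

def lstd_batch(dataset, phi, d, gamma=0.9):
    """Batch LSTD: estimate A and b from transitions and solve A w = b."""
    A = np.zeros((d, d))
    b = np.zeros(d)
    for x, r, x_next in dataset:
        f, f_next = phi(x), phi(x_next)
        A += np.outer(f, f - gamma * f_next)   # phi(X_i)(phi(X_i) - gamma phi(X_i'))^T
        b += f * r                             # phi(X_i) R_i
    return np.linalg.solve(A / len(dataset), b / len(dataset))
```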
We can use LSTD to define an approximate policy iteration procedure to obtain a close-to-optimal policy, called Least-Squares Policy Iteration (LSPI):

- Start with an initial policy (or an initial weight vector).
- Policy evaluation: use the action-value variant of LSTD (LSTDQ) on the batch of transitions to estimate $\hat{Q}^{\pi_k}$.
- Policy improvement: set $\pi_{k+1}(x) = \operatorname*{argmax}_{a} \hat{Q}^{\pi_k}(x, a)$.
- Repeat until the policy (or the weight vector) stops changing, as sketched below.
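A minimal sketch of LSPI with linear features over state-action pairs, using LSTDQ for each evaluation step (all function and variable names are assumptions made for illustration):

```python
import numpy as np

def lstdq(dataset, phi_sa, policy, d, gamma=0.9):
    """LSTDQ: evaluate `policy` from a fixed batch of (x, a, r, x') transitions."""
    A, b = np.zeros((d, d)), np.zeros(d)
    for x, a, r, x_next in dataset:
        f = phi_sa(x, a)
        f_next = phi_sa(x_next, policy(x_next))       # next action chosen by the evaluated policy
        A += np.outer(f, f - gamma * f_next)
        b += f * r
    return np.linalg.solve(A, b)

def lspi(dataset, phi_sa, actions, d, gamma=0.9, n_iter=20):
    """Least-Squares Policy Iteration: alternate LSTDQ evaluation and greedy improvement."""
    w = np.zeros(d)
    for _ in range(n_iter):
        greedy = lambda x, w=w: max(actions, key=lambda a: phi_sa(x, a) @ w)
        w = lstdq(dataset, phi_sa, greedy, d, gamma)  # evaluate the current greedy policy
    return w
```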
Semi-Gradient TD
Suppose that we know the true value function $V^\pi$. We could then fit $V(\cdot; w) \in \mathcal{F}$ by minimizing $\frac{1}{2}\left\| V(\cdot; w) - V^\pi \right\|_{2,\mu}^2$ with stochastic gradient descent.

Using samples $X_t \sim \mu$, the SGD update is

$$w_{t+1} \leftarrow w_t + \alpha_t \left( V^\pi(X_t) - V(X_t; w_t) \right) \nabla_w V(X_t; w_t).$$

If we select a proper step size schedule $\alpha_t$ (e.g., satisfying $\sum_t \alpha_t = \infty$ and $\sum_t \alpha_t^2 < \infty$), this converges to a minimizer of the loss.

When we do not know $V^\pi$, we replace $V^\pi(X_t)$ with a bootstrapped target $R_t + \gamma V(X_{t+1}; w_t)$, while keeping the gradient term $\nabla_w V(X_t; w_t)$ unchanged, i.e., we do not differentiate through the target.

The substitution of the target, without differentiating through it, gives the semi-gradient TD update:

$$w_{t+1} \leftarrow w_t + \alpha_t \left( R_t + \gamma V(X_{t+1}; w_t) - V(X_t; w_t) \right) \nabla_w V(X_t; w_t).$$

It is called a semi-gradient method because the update ignores the dependence of the target on $w$ and is therefore not the full gradient of any fixed objective.
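A sketch of semi-gradient TD(0) with linear features, where the gradient of $V(x; w) = \phi(x)^\top w$ is simply $\phi(x)$ (names are illustrative):

```python
import numpy as np

def semi_gradient_td0(transitions, phi, d, gamma=0.9, alpha=0.05):
    """Semi-gradient TD(0) with a linear value function V(x; w) = phi(x) @ w."""
    w = np.zeros(d)
    for x, r, x_next in transitions:
        td_error = r + gamma * phi(x_next) @ w - phi(x) @ w   # bootstrapped target minus prediction
        w += alpha * td_error * phi(x)                        # gradient of V(x; w) w.r.t. w is phi(x)
    return w
```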