Monte Carlo Estimation
Learning From a Stream of Data (Monte Carlo Methods)
MC Policy Evaluation
The goal is to learn or estimate the value function $V^\pi(s)$ of a given policy $\pi$.
Recall that:

$$V^\pi(s) = \mathbb{E}_\pi\left[G_t \mid S_t = s\right] = \mathbb{E}_\pi\left[\sum_{k=0}^{\infty} \gamma^k R_{t+k+1} \,\Big|\, S_t = s\right]$$

This is the expected return starting from state $s$ at time $t$ and following policy $\pi$ thereafter.
Running one episode under $\pi$ from $s$ yields one sampled return. If we repeat this process from the same state, we get another draw of the random variable $G_t$, and averaging the samples,

$$V^\pi(s) \approx \frac{1}{N}\sum_{i=1}^{N} G^{(i)},$$

gives an estimate of $V^\pi(s)$. The convergence of this sample mean to the true expectation follows from the weak law of large numbers. Since the value is an expectation, we can also estimate it using a Stochastic Approximation (SA) procedure to obtain an update in an online fashion:

$$V(S_t) \leftarrow V(S_t) + \alpha\left[G_t - V(S_t)\right]$$
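As a concrete illustration, here is a minimal Python sketch of this online update applied once per sampled episode; the `(state, reward)` episode format, the tabular dictionary for $V$, and the `sample_episode` helper in the usage comment are illustrative assumptions.

```python
from collections import defaultdict

def mc_value_update(episode, V, alpha=0.1, gamma=0.99):
    """One stochastic-approximation pass over a single episode.

    episode: list of (state, reward) pairs in time order,
             where reward is the reward received after leaving state.
    V: dict mapping state -> current value estimate.
    """
    G = 0.0
    # Walk the episode backwards so the return G_t accumulates incrementally.
    for state, reward in reversed(episode):
        G = reward + gamma * G
        # Online SA update: move V(s) a small step toward the sampled return.
        V[state] = V[state] + alpha * (G - V[state])
    return V

# Usage sketch (assumes some sample_episode(env, policy) helper exists):
# V = defaultdict(float)
# for _ in range(1000):
#     episode = sample_episode(env, policy)
#     V = mc_value_update(episode, V, alpha=0.05)
```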
First Visit and Every Visit MC
The two algorithms above are examples of the first-visit MC method. Each occurrence of state $s$ in an episode is called a visit to $s$. The first-visit MC method estimates $V^\pi(s)$ as the average of the returns following only the first visit to $s$ in each episode, whereas the every-visit MC method averages the returns following all visits to $s$.
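To make the distinction concrete, here is a small sketch of batch first-visit versus every-visit evaluation; the `(state, reward)` episode format and the `first_visit` flag are illustrative assumptions.

```python
from collections import defaultdict

def mc_evaluate(episodes, gamma=0.99, first_visit=True):
    """Estimate V(s) by averaging sampled returns.

    episodes: iterable of episodes; each episode is a list of (state, reward) pairs.
    first_visit: if True, only the return following the first visit to a state
                 in each episode is used; otherwise every visit contributes.
    """
    returns_sum = defaultdict(float)
    returns_count = defaultdict(int)
    for episode in episodes:
        # Compute G_t for every time step with a backward sweep.
        G = 0.0
        returns = []
        for state, reward in reversed(episode):
            G = reward + gamma * G
            returns.append((state, G))
        returns.reverse()  # restore time order
        visited = set()
        for state, G in returns:
            if first_visit and state in visited:
                continue
            visited.add(state)
            returns_sum[state] += G
            returns_count[state] += 1
    return {s: returns_sum[s] / returns_count[s] for s in returns_sum}
```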
MC Control
Exploring Starts
The general idea of MC control is to use some version of Policy Iteration. If we run many rollouts from each state-action pair $(s, a)$ under the current policy $\pi$, we can estimate the action-value function $Q^\pi(s, a)$ by averaging the sampled returns.
We can then choose the greedy policy:

$$\pi'(s) = \arg\max_{a} Q^\pi(s, a)$$
Then we repeat this process until convergence.
However, we do not need to have a very accurate estimate of $Q^\pi$ before each improvement step: we can update the estimate after every episode,
or use an online version with exploring starts:

$$Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha\left[G_t - Q(S_t, A_t)\right]$$
We can see that all returns are accumulated and averaged irrespective of which policy generated them; nevertheless, the algorithm can still converge to the optimal action-value function $Q^*$.
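A minimal sketch of this exploring-starts control loop, using the online update above, might look as follows; the environment interface (`env.states`, `env.actions`, `env.reset_to`, `env.step`) is a hypothetical one assumed for illustration.

```python
import random
from collections import defaultdict

def mc_control_exploring_starts(env, num_episodes, gamma=0.99, alpha=0.1):
    """Tabular MC control with exploring starts (online-update version).

    Assumes env exposes: states, actions, reset_to(state) -> state,
    and step(action) -> (next_state, reward, done).
    """
    Q = defaultdict(float)                       # Q[(s, a)] -> action-value estimate
    policy = {s: random.choice(env.actions) for s in env.states}

    for _ in range(num_episodes):
        # Exploring start: every (state, action) pair has nonzero start probability.
        s0 = random.choice(env.states)
        a0 = random.choice(env.actions)
        state = env.reset_to(s0)   # assumed helper: force the episode to start in s0
        action = a0

        episode, done = [], False
        while not done:
            next_state, reward, done = env.step(action)
            episode.append((state, action, reward))
            state = next_state
            if not done:
                action = policy[state]

        # Backward sweep: update Q toward the sampled return, then improve greedily.
        G = 0.0
        for s, a, r in reversed(episode):
            G = r + gamma * G
            Q[(s, a)] += alpha * (G - Q[(s, a)])
            policy[s] = max(env.actions, key=lambda act: Q[(s, act)])
    return Q, policy
```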
Without Exploring Starts
Even though MC with exploring starts is guaranteed to converge to the optimal action-value function, the exploring-starts assumption is unrealistic, especially when the state and action spaces are large. To avoid this assumption, we can replace exploring starts with an $\epsilon$-soft behavior policy (e.g., $\epsilon$-greedy) that gives every action a nonzero probability of being selected.
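Here is a hedged sketch of the same control loop with exploring starts replaced by an $\epsilon$-greedy behavior policy; the `env.reset()` / `env.step()` / `env.actions` interface is again an assumption.

```python
import random
from collections import defaultdict

def epsilon_greedy(Q, state, actions, epsilon):
    """Pick a random action with probability epsilon, else the greedy one."""
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(state, a)])

def mc_control_epsilon_greedy(env, num_episodes, gamma=0.99, alpha=0.1, epsilon=0.1):
    """On-policy MC control with an epsilon-soft policy instead of exploring starts.

    Assumes env exposes: actions, reset() -> state,
    and step(action) -> (next_state, reward, done).
    """
    Q = defaultdict(float)
    for _ in range(num_episodes):
        state = env.reset()
        episode, done = [], False
        while not done:
            # Epsilon-greedy exploration guarantees every action keeps being tried.
            action = epsilon_greedy(Q, state, env.actions, epsilon)
            next_state, reward, done = env.step(action)
            episode.append((state, action, reward))
            state = next_state
        # Backward sweep with the same online update as before.
        G = 0.0
        for s, a, r in reversed(episode):
            G = r + gamma * G
            Q[(s, a)] += alpha * (G - Q[(s, a)])
    return Q
```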
Further
In this post, we built a basic understanding of generic MC methods and the basic framework of MC policy evaluation and control. Next, we will explore some popular MC methods for solving control problems. We will see that MC methods can be generalized to both on-policy and off-policy sampling scenarios.