Intrinsic Motivation
Intrinsically Motivated Reinforcement Learning
Definitions
Extrinsic Motivation
Being moved to do something because of some specific rewarding outcome.
Intrinsic Motivation
Being moved to do something because it is inherently enjoyable. Intrinsic motivation leads organisms to engage in exploration, play and other behavior driven by curiosity in the absence of explicit reward.
Behavior Intrinsically Motivated
Psychologists call behavior intrinsically motivated when it is engaged in for its own sake rather than as a step toward solving a specific problem of clear practical value.
Reinforcement Learning of Skills
Options (Skills)
Starting from a finite MDP, which we call the core MDP, the simplest kind of option
\(o\) consists of:
- A policy \(\pi^o: S \times \cup_{s \in S} A_s \rightarrow [0, 1]\)
- A termination condition \(\beta^o: S \rightarrow [0, 1]\)
- An input set \(I^o \subset S\)
The option is defined by the triplet \(o = \langle I^o, \pi^o, \beta^o \rangle\). The option is available in state \(s\) if and only if \(s \in I^o\).
If the option is executed, then actions are selected according to \(\pi^o\) until the option terminates stochastically according to \(\beta^o\).
Example:
If the current state is \(s\), the next action is \(a\) with probability \(\pi^o (s, a)\), the environment makes a transition to state \(s^{\prime}\), where the option either terminates with probability \(\beta^o (s^{\prime})\) or else continues, determining the next action \(a^{\prime}\) with probability \(\pi^o (s^{\prime}, a^{\prime})\) and so on.
When the option terminates, the agent can select another option from the set of those available at the termination state. Note that any primitive action of the core MDP, \(a \in \cup_{s \in S} A_s\), is itself an option, called a one-step option, with \(I = \{s: a \in A_s\}\) and \(\beta (s) = 1\) for all \(s \in S\). A one-step option simply executes its primitive action for a single step and then terminates.
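As a concrete illustration, here is a minimal sketch of an option as a data structure over a tabular core MDP (the class and function names are ours, not from the source; states and actions are assumed to be plain integers):

```python
from dataclasses import dataclass
from typing import Callable, Dict, Set

# A minimal sketch of an option <I, pi, beta> over a tabular core MDP.
# eq=False keeps identity-based hashing, so options can later be used as dict keys.
@dataclass(eq=False)
class Option:
    input_set: Set[int]                        # I^o: states where the option may start
    policy: Callable[[int], Dict[int, float]]  # pi^o: state -> distribution over choices
    termination: Callable[[int], float]        # beta^o: state -> probability of terminating

    def available(self, s: int) -> bool:
        return s in self.input_set

def one_step_option(a: int, states_with_a: Set[int]) -> Option:
    """A primitive action a viewed as an option: available wherever a is,
    always selects a, and terminates after a single step (beta = 1)."""
    return Option(
        input_set=states_with_a,
        policy=lambda s: {a: 1.0},
        termination=lambda s: 1.0,
    )
```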
A policy \(\mu\) over options selects option \(o\) in state \(s\) with probability \(\mu(s, o)\). \(o\)'s policy in turn selects other options until \(o\) terminates. The policy of each of these selected options selects other options and so on, until one-step options are selected that correspond to actions of the core MDP.
Example:
We start at \(s_0\) and select option \(o\) according to \(\mu\). \(o\)'s policy selects a child option \(o_1\) with probability \(\pi^{o} (s_0, o_1)\), \(o_1\)'s policy selects \(o_2\), and so on, until some \(o_T\) is a one-step option with \(\pi^{o_T} (s_0, a) = 1\) for a primitive action \(a\). The environment then makes a transition to \(s^{\prime}\) with probability \(P(s^{\prime} | s_0, a)\). Each option in the chain either terminates at \(s^{\prime}\) according to its \(\beta\) or continues selecting from \(s^{\prime}\), and the process repeats until \(o\) itself terminates.
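A rough sketch of how such hierarchical execution could be simulated, assuming the `Option` structure above, an environment transition function `step(s, a)` that returns `(s', r)`, and that an option's policy may return either primitive actions or other options (all names are ours):

```python
import random

def execute_option(option, s, step, is_primitive, rng):
    """Run `option` from state s until it terminates according to its beta,
    recursively executing any child options its policy selects.
    Returns (terminal_state, list of core-MDP rewards)."""
    rewards = []
    while True:
        # Sample a child (a primitive action or a lower-level option) from pi^o(s, .).
        dist = option.policy(s)
        child = rng.choices(list(dist), weights=list(dist.values()))[0]
        if is_primitive(child):
            s, r = step(s, child)  # one core-MDP transition
            rewards.append(r)
        else:
            s, child_rewards = execute_option(child, s, step, is_primitive, rng)
            rewards.extend(child_rewards)
        # Terminate stochastically according to beta^o at the resulting state.
        if rng.random() < option.termination(s):
            return s, rewards

# Usage (hypothetical): s1, rs = execute_option(o, s0, env_step, is_primitive, random.Random(0))
```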
Adding any set of options to a core finite MDP yields a well-defined discrete-time semi-Markov decision process whose actions are the options and whose rewards are the return delivered over the course of an option's execution.
One can define value functions corresponding to options in a manner analogous to how they are defined for simple MDPs. For example, the option-value function corresponding to \(\mu\) is defined as:
\[Q^{\mu} (s, o) = E[r_{t+1} + \gamma r_{t+2} + \cdots + \gamma^{\tau - 1} r_{t+\tau} + \cdots | \xi (o\mu, s, t)]\]
Where \(\xi (o\mu, s, t)\) is the event of \(o\) being initiated at time \(t\) in \(s\) and being followed until it terminates after \(\tau\) time steps, at which point control continues according to \(\mu\).
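Option values of this kind are commonly learned with an SMDP-style Q-learning update applied when an option finishes; a minimal tabular sketch under the assumptions above (a standard technique from the options literature, not code taken from the source):

```python
from collections import defaultdict

def smdp_q_update(Q, s, o, s_next, rewards, options_at, gamma=0.95, alpha=0.1):
    """SMDP Q-learning: after option o runs for tau steps from s, collecting
    `rewards` and ending in s_next, move Q[(s, o)] toward the discounted
    return plus gamma^tau times the best option value available at s_next."""
    tau = len(rewards)
    discounted_return = sum((gamma ** k) * r for k, r in enumerate(rewards))
    best_next = max((Q[(s_next, o2)] for o2 in options_at(s_next)), default=0.0)
    target = discounted_return + (gamma ** tau) * best_next
    Q[(s, o)] += alpha * (target - Q[(s, o)])

# Q = defaultdict(float), keyed by (state, option) pairs; options_at(s) lists the
# options whose input sets contain s.
```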
A multi-time model of an option, which we can call an option model
, generalizes the one-step model of a primitive action. For any option \(o\), let \(\xi (o, s, t)\) denote the event of \(o\) being initiated in state \(s\) at time \(t\). Then the reward part of the option model of \(o\) for any \(s \in S\) is:
\[R(s, o) = E[r_{t+1} + \gamma r_{t+2} + \cdots + \gamma^{\tau - 1} r_{t+\tau} | \xi (o, s, t)]\]
Where \(t + \tau\) is the random time at which \(o\) terminates. The state prediction part of the model of \(o\) for \(s\) is:
\[P(s^{\prime} | s, o) = \sum^{\infty}_{\tau = 1} p(s^{\prime}, \tau) \gamma^{\tau}\]
For all \(s^{\prime} \in S\), where \(p(s^{\prime}, \tau)\) is the probability that \(o\) terminates in \(s^{\prime}\) after \(\tau\) steps when initiated in \(s\). Though not itself a probability, \(P(s^{\prime} | s, o)\) combines the probability that \(s^{\prime}\) is the state in which \(o\) terminates with a measure, in terms of \(\gamma\), of how delayed that outcome is.
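Both parts of the option model can in principle be estimated from sample executions of the option; a rough Monte Carlo sketch, assuming a helper `run_option(s)` that returns the terminal state and the reward sequence of one execution (names and the sample count are illustrative):

```python
from collections import defaultdict

def estimate_option_model(run_option, s, n_samples=1000, gamma=0.95):
    """Monte Carlo estimates of R(s, o) and P(s' | s, o) from repeated
    executions of an option starting in s."""
    R_total = 0.0
    P = defaultdict(float)
    for _ in range(n_samples):
        s_term, rewards = run_option(s)          # one execution of the option
        tau = len(rewards)
        R_total += sum((gamma ** k) * r for k, r in enumerate(rewards))
        P[s_term] += (gamma ** tau) / n_samples  # discounted 'probability' mass
    return R_total / n_samples, dict(P)
```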
The quantities \(R(s, o)\) and \(P(s^{\prime} | s, o)\) generalize the rewards and transition probabilities of the core MDP in such a way that it is possible to write a generalized form of the Bellman optimality equation and extend RL methods to options.
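For instance, writing \(\mathcal{O}_s\) for the set of options available in state \(s\) and \(V^{*}_{\mathcal{O}}\) for the optimal value function over a set of options \(\mathcal{O}\) (this notation is ours), the generalized Bellman optimality equation takes the form, with no extra \(\gamma\) needed because \(P(s^{\prime} | s, o)\) already absorbs the discounting over the option's duration:
\[V^{*}_{\mathcal{O}} (s) = \max_{o \in \mathcal{O}_s} \left[ R(s, o) + \sum_{s^{\prime} \in S} P(s^{\prime} | s, o) V^{*}_{\mathcal{O}} (s^{\prime}) \right]\]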
Intrinsic Rewards and Options
The idea is to identify states that may usefully serve as 'subgoals' for a given task. An option is created whose policy, when fully learned, will drive the environment to a subgoal state efficiently, usually in minimum time, from any state in the option's input set, which may itself be learned.
- The option's termination condition is set to be the achievement of a subgoal state
- The option's policy is learned via a "pseudo reward function" that rewards the achievement of the subgoal and assigns a small penalty to all other transitions. Unlike the reward function of the overall goal, the pseudo reward does not influence the agent's behavior; it is used only to support the learning of the option's policy (see the sketch after this list).
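A minimal sketch of such a pseudo-reward function, assuming a designated set of subgoal states (the constants and names are illustrative, not values from the source):

```python
def pseudo_reward(s_next, subgoal_states, subgoal_bonus=1.0, step_penalty=-0.01):
    """Pseudo-reward used only to learn the option's policy: reward reaching
    the subgoal, lightly penalize every other transition, so the learned
    policy reaches the subgoal in roughly minimum time."""
    return subgoal_bonus if s_next in subgoal_states else step_penalty
```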
The connection between intrinsic motivation and options is the idea of creating an option upon the occurrence of an intrinsically-rewarding event, where what constitutes an intrinsically-rewarding event can be characterized as follows:
- Intrinsic reward influences agent behavior: the agent should change its behavior so as to focus exploration on quickly refining its skill in bringing about the intrinsically-rewarding event. A corollary is that intrinsic reward should diminish with continued repetition of the activity that generates it (i.e., the agent should eventually get bored and move on to create and learn another option); see the sketch below.
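One simple way to realize this 'boredom' property is to let the intrinsic reward attached to a salient event decay with the number of times the event has occurred; a sketch with an illustrative count-based decay schedule (the source does not prescribe this particular mechanism):

```python
from collections import defaultdict

class DecayingIntrinsicReward:
    """Intrinsic reward for salient events that shrinks each time an event
    recurs, so the agent eventually 'gets bored' and explores elsewhere."""
    def __init__(self, initial=1.0, decay=0.5):
        self.initial = initial
        self.decay = decay
        self.counts = defaultdict(int)  # how often each (hashable) event has occurred

    def __call__(self, event):
        r = self.initial * (self.decay ** self.counts[event])
        self.counts[event] += 1
        return r
```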
Conclusion
- Construction of temporally-extended skills (options) can confer clear advantages over learning solely with primitive actions.
- Defining an effective form of intrinsic reward is not as straightforward.
- Intrinsic reward can reduce the speed of learning by making the agent persist in behavior directed toward a salient event long after that behavior has been well learned.
- This kind of 'obsessive-compulsive' behavior hinders the attainment of extrinsic goals.
- Intrinsic reward does not propagate well, tending to remain restricted to the immediate vicinity of the salient event that gave rise to it.