Note

I don’t know who this article is for, except me, but whoever comes across it, I hope you get something out of it.

Reinforcement Learning consists of two parts, the Agent and the Environment, interacting with each other through states, actions, and rewards. As an agent learns to navigate the states of an Environment in order to accomplish a particular goal, it receives in return a reward, usually emitted by the Environment. These rewards guide the agent towards the predetermined goal. The aim of the agent is to maximize the quantity of rewards it can gather from the environment, which is possible by accomplishing the assigned goal in the most optimal way. There are scenarios where this may not happen, due to a badly designed environment or because the agent has been made to optimize for the wrong things.

Generally, a good way to think about this setup is to imagine the agent and the environment playing a two-person turn-taking game. The environment always takes the first turn and, through the state, provides a range of options for the kinds of actions the agent can take. Based on this range of available actions, the agent tries to take the action that not only maximizes the immediate reward it can get from the environment but also, somehow, the overall reward it can gain from playing the game itself. In this game setup, the environment generally does not have agency the way the agent does. If the environment could evolve over time in order to discourage the agent, this turn-taking setup would simply reduce to a variation of one of von Neumann’s minimax games. But in reinforcement learning, the environment can have probabilistic outputs, yet it does not evolve over time in response to the agent. One of the simplest ways to represent such a game is as a Markov Decision Process.

```mermaid
flowchart TD
A(Agent)
E(Environment)

A --action--> E
E --reward--> A
E --state--> A
```
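The turn-taking loop above can be sketched in a few lines of Python. Everything here — `TinyEnv`, its four states, and its reward values — is made up for illustration: the agent takes its turn (an action), then the environment takes its turn (the next state and a reward), until the goal is reached.

```python
import random

random.seed(42)  # for a reproducible episode

class TinyEnv:
    """A toy environment: walk right from state 0 to reach the goal state 3."""
    def __init__(self):
        self.state = 0

    def step(self, action):
        # action is +1 (right) or -1 (left); the state is clipped to [0, 3]
        self.state = max(0, min(3, self.state + action))
        reward = 1.0 if self.state == 3 else 0.0
        done = self.state == 3
        return self.state, reward, done

def random_agent(state):
    # The agent's internal logic: here, just a uniformly random choice
    return random.choice([-1, 1])

env = TinyEnv()
state, total_reward, done = env.state, 0.0, False
while not done:
    action = random_agent(state)             # the agent's turn
    state, reward, done = env.step(action)   # the environment's turn
    total_reward += reward

print(total_reward)  # 1.0: the episode only ends once the goal pays out
```

A random agent still eventually stumbles into the goal here; learning is about reaching it in fewer turns.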

Markov Decision Processes (MDPs)

MDPs are mathematical models and can be represented by the tuple

$$\mathcal{M} = (\mathcal{S}, \mathcal{A}, P, R, \gamma, \rho_0)$$

Each element of the tuple has been defined below:

States $\mathcal{S}$ are the available positions in the MDP, and there is usually a set of actions associated with any state. Any state can therefore be thought of as a function $s \mapsto \mathcal{A}(s)$, such that it takes the “ID” of a state and accordingly returns a set of actions, out of which the agent chooses one based on its internal probabilistic logic. A state is almost always a part of the environment; the agent navigates through the states of an environment collecting rewards until either it runs out of turns or the goal is achieved. To follow through with the turn-taking metaphor, the state at turn $t$ can be represented as $s_t \in \mathcal{S}$.

Actions $\mathcal{A}(s)$ are the possible decisions that an agent can take from a particular state $s$. The available actions depend on the state that the agent is currently in.

Transition Dynamics $P(s' \mid s, a)$ is the probability of going from state $s$ to $s'$ through action $a$. The environment induces randomness through the transition dynamics ingrained in it. Consider the action of jumping ($a$) from the roof of one building (state $s$) to the roof of another building (state $s'$). It is incredibly unlikely that you can land on the other roof if the two buildings are situated far enough apart, so the resulting probability of reaching the other state is fairly low. Similarly, you can easily reach the next state if the roofs are very close to each other. The underlying probability of transition is a function of the distance between the two roofs. This underlying function is abstracted away, and you just see a singular probability value that communicates how likely or unlikely the transition is.

Reward Functions are one of the most important parts of the entire setup. The reward function is essentially the guiding hand that determines how the agent learns to adapt to and excel in the environment it has been placed in. A reward function determines how many points are awarded to the agent based on which action it takes from a specific state. For a state $s_t$ and action $a_t$ at turn $t$, the reward is $r_t = R(s_t, a_t)$.

Discount Factor $\gamma \in [0, 1]$ is the scalar that determines how much importance the agent affords to a reward as the number of turns increases. It gives the agent either a farsighted approach ($\gamma \approx 0.999$), causing the agent to place importance on rewards further down the line, or a more shortsighted one ($\gamma \approx 0.95$). For a discount factor of $\gamma = 1$, the agent places equal importance on each and every reward received from the environment, and can be thought of as having perfect 20/20 vision.

Initial Probability $\rho_0$ is the probability of starting in some initial state. This is the first turn that the environment takes: it randomly places the agent in one of the states $s_0$ with probability $\rho_0(s_0)$. It is part of the problem definition and is a probability distribution spread over all possible initial states. Since it does not depend on the parameters $\theta$ of the agent’s policy $\pi_\theta$, it cannot be modelled or optimized for via the policy gradient, as $\nabla_\theta \rho_0(s_0) = 0$. More about the policy and its parameters below.
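To make the tuple concrete, here is a minimal sketch of an MDP written out as plain Python data, reusing the roof-jumping example from above; all state names, rewards, and probabilities are invented for illustration:

```python
# A tiny MDP written out explicitly as the tuple (S, A, P, R, gamma, rho0).
# All names and numbers here are made up for illustration.
S = ["near_roof", "far_roof", "ground"]
A = {"near_roof": ["jump", "stay"], "far_roof": ["stay"], "ground": ["stay"]}

# Transition dynamics P[s][a] -> {s': probability}
P = {
    "near_roof": {
        "jump": {"far_roof": 0.9, "ground": 0.1},  # close roofs: likely to land
        "stay": {"near_roof": 1.0},
    },
    "far_roof": {"stay": {"far_roof": 1.0}},
    "ground": {"stay": {"ground": 1.0}},
}

# Reward function R(s, a)
R = {("near_roof", "jump"): 1.0, ("near_roof", "stay"): 0.0,
     ("far_roof", "stay"): 0.0, ("ground", "stay"): 0.0}

gamma = 0.99                   # discount factor
rho0 = {"near_roof": 1.0}      # initial state distribution

# Sanity check: every transition distribution sums to 1
for s in S:
    for a in A[s]:
        assert abs(sum(P[s][a].values()) - 1.0) < 1e-9
print("valid MDP")
```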

The above factors are all usually part of the environment. As described in the scenario above, the environment mostly dictates the dynamics of this turn-taking game; the agent just has to do its best to thrive in it. From the perspective of the agent, we similarly define multiple elements that will help it adapt to the environment with each turn. The first and most important element is the internal probabilistic logic of the agent. This is represented by the policy $\pi$, which maps states to a probability distribution over the actions that can be taken from that particular state. There are two types of policies: the deterministic policy, where a definitive action is returned, $a = \mu(s)$; and the stochastic policy, which returns a distribution of probabilities associated with each action, from which an action is sampled at turn $t$:

$$a_t \sim \pi(\cdot \mid s_t)$$

The notation $a_t \sim \pi(\cdot \mid s_t)$ means that an action is randomly sampled from the probability distribution over actions defined by the policy $\pi$, conditioned on the current state $s_t$. The $\cdot$ is merely a placeholder meaning “over all possible actions”. This requires that the current state is already known and the set of possible actions is provided. For a specific action we have $\pi(a_t \mid s_t)$, which returns the probability of the action $a_t$.
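Sampling $a_t \sim \pi(\cdot \mid s_t)$ can be sketched with the standard library; the states, actions, and probabilities below are all made up:

```python
import random

random.seed(0)

# A made-up stochastic policy pi(. | s): each state maps to a probability
# distribution over the actions available in that state.
policy = {
    "s0": {"left": 0.2, "right": 0.8},
    "s1": {"left": 0.5, "right": 0.5},
}

def sample_action(state):
    """a_t ~ pi(. | s_t): draw one action from the policy's distribution."""
    actions = list(policy[state])
    weights = list(policy[state].values())
    return random.choices(actions, weights=weights, k=1)[0]

# pi(a | s) for a specific action is just a lookup
print(policy["s0"]["right"])  # 0.8

# Sampling many times roughly recovers the underlying probabilities
counts = {"left": 0, "right": 0}
for _ in range(10_000):
    counts[sample_action("s0")] += 1
print(counts["right"] / 10_000)  # close to 0.8
```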

The agent thus traces a trajectory $\tau$ through the environment based on where it first starts, which action it takes next, and what reward it collects per iteration:

$$\tau = (s_0, a_0, r_0, s_1, a_1, r_1, \ldots)$$

Each trajectory is a unique path traced through the environment, resulting from the stochastic processes induced by both the agent’s policy and the environment’s probabilistic transition dynamics. The probability of a trajectory is as follows

$$P(\tau \mid \theta) = \rho_0(s_0) \prod_{t=0}^{T-1} \pi_\theta(a_t \mid s_t)\, P(s_{t+1} \mid s_t, a_t)$$

Here, the policy $\pi_\theta$ has internal parameters $\theta$ that are learned in order to find the optimal trajectory, the one that maximizes the reward collected from the environment. The initial probability of starting in some state, represented by $\rho_0(s_0)$, and the transition dynamics of the environment, $P(s_{t+1} \mid s_t, a_t)$, are also accounted for. In the resulting expression corresponding to a trajectory $\tau$, only the policy depends on $\theta$. For a trajectory, the total amount of reward that can be collected over a horizon of $T$ turns can be calculated as

$$R(\tau) = \sum_{t=0}^{T} \gamma^t\, r_t$$

The horizon of $T$ turns over which the task has to be completed can either be finite or infinite. The total reward accumulated depends, as mentioned, on the discount factor $\gamma$ and the rewards that the agent can extract from the environment based on its decisions (actions) when interacting with it. For a set of trial runs of the agent, we can calculate the expectation by taking a weighted sum of how likely each run is and the total accumulated reward each run returns via the scoring function $R(\tau)$, resulting in the expression

$$\mathbb{E}_{\tau \sim \pi_\theta}\left[R(\tau)\right] = \sum_{\tau} P(\tau \mid \theta)\, R(\tau)$$
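The discounted total reward of a single trajectory is just a weighted sum of its per-turn rewards; a minimal sketch, with made-up reward values:

```python
def discounted_return(rewards, gamma):
    """R(tau) = sum over t of gamma^t * r_t, over a finite horizon."""
    return sum(gamma ** t * r for t, r in enumerate(rewards))

rewards = [0.0, 0.0, 1.0, 1.0]  # made-up rewards over a 4-turn horizon

print(discounted_return(rewards, 1.0))  # 2.0 (undiscounted: all turns count equally)
print(discounted_return(rewards, 0.5))  # 0.375 (0.25 + 0.125: later rewards shrink)
```

Note how the same rewards are worth less under a smaller $\gamma$, which is exactly the shortsightedness described earlier.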

The aim of any RL technique is to maximize the expectation of this discounted total reward while navigating the environment. The objective function associated with such a scenario is

$$J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[R(\tau)\right]$$

This computes the expectation of rewards for a trajectory that has been sampled via the policy associated with $\theta$, which is $\tau \sim \pi_\theta$. The expected return over all such possible trajectories can be calculated in either a discrete or a continuous manner. For the simple discrete and continuous cases we will therefore have

$$J(\theta) = \sum_{\tau} P(\tau \mid \theta)\, R(\tau) \qquad \text{and} \qquad J(\theta) = \int_{\tau} P(\tau \mid \theta)\, R(\tau)\, d\tau$$

Note

The idea is to find the expectation over all the trajectories possible for an agent in the environment, while accounting for the probabilistic effects of both the agent and the environment. For a setup with discrete actions and states this can be explicitly written out as

$$J(\theta) = \sum_{\tau} \left[ \rho_0(s_0) \prod_{t=0}^{T-1} \pi_\theta(a_t \mid s_t)\, P(s_{t+1} \mid s_t, a_t) \right] R(\tau)$$

When everything is continuous, the same expectation over trajectories can be written as:

$$J(\theta) = \int_{\tau} \rho_0(s_0) \prod_{t=0}^{T-1} \pi_\theta(a_t \mid s_t)\, P(s_{t+1} \mid s_t, a_t)\, R(\tau)\, d\tau$$

All of this is condensed into the single compact expectation notation $\mathbb{E}_{\tau \sim \pi_\theta}\left[R(\tau)\right]$ represented by the continuous-case integral given above.
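For a small enough discrete setup, the expectation can be computed exactly by enumerating every trajectory, multiplying its probability by its return, and summing. A sketch on a made-up two-turn, two-action MDP with deterministic transitions (so the trajectory probability reduces to the product of the policy probabilities alone):

```python
from itertools import product

# Exact J = sum over trajectories of P(tau) * R(tau), for a made-up
# two-turn MDP: the same two actions are available each turn.
pi = {"a": 0.3, "b": 0.7}        # policy: action probabilities, each turn
reward = {"a": 1.0, "b": 0.0}    # per-turn reward of each action
gamma = 0.9

J = 0.0
for traj in product(["a", "b"], repeat=2):   # all 4 possible action sequences
    p, R = 1.0, 0.0
    for t, a in enumerate(traj):
        p *= pi[a]                   # trajectory probability (policy only here)
        R += gamma ** t * reward[a]  # discounted return of this trajectory
    J += p * R

print(round(J, 6))  # 0.57, i.e. 0.3 * (1 + 0.9)
```

This brute-force enumeration is exactly the discrete sum above; it just stops scaling long before real problems do, which is what motivates the sampling approach next.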

For any turn $t$, consider $N$ total samples $a_1, \ldots, a_N$, all sampled from a particular policy $\pi_\theta$ such that $a_i \sim \pi_\theta(\cdot \mid s_t)$. The expected reward, advantage, loss, etc. for that particular turn depends on the corresponding function $f$ and can be computed as

$$\mathbb{E}_{a \sim \pi_\theta}\left[f(a)\right] \approx \frac{1}{N} \sum_{i=1}^{N} f(a_i)$$

This simulates the effect of drawing an action from the policy over multiple iterations. Similarly, for any $N$ complete trajectories sampled from the policy, $\tau_i \sim \pi_\theta$, the expectation over the mentioned trajectories can be computed as

$$\mathbb{E}_{\tau \sim \pi_\theta}\left[R(\tau)\right] \approx \frac{1}{N} \sum_{i=1}^{N} R(\tau_i)$$

where $R(\tau_i)$ is the total reward accumulated over the trajectory $\tau_i$. Gradients are taken inside the expectation, trajectories are sampled, and $\nabla_\theta J(\theta)$ is evaluated in order to determine the best policy.
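The sample-mean estimate above is plain Monte Carlo, and it is easy to check against an exact expectation on a made-up two-action example:

```python
import random

random.seed(0)

# Monte Carlo: E_{a ~ pi}[f(a)] is approximately (1/N) * sum_i f(a_i).
# The two-action policy and the per-action scores f are made up.
pi = {"a": 0.3, "b": 0.7}
f = {"a": 10.0, "b": 0.0}

exact = sum(pi[a] * f[a] for a in pi)  # 0.3 * 10.0 = 3.0

N = 100_000
samples = random.choices(list(pi), weights=list(pi.values()), k=N)
estimate = sum(f[a] for a in samples) / N

print(exact)                        # 3.0
print(abs(estimate - exact) < 0.1)  # True: the estimate concentrates on 3.0
```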

Taking the gradient of the continuous-case expectation function,

$$\nabla_\theta J(\theta) = \nabla_\theta \int_{\tau} P(\tau \mid \theta)\, R(\tau)\, d\tau = \int_{\tau} \nabla_\theta P(\tau \mid \theta)\, R(\tau)\, d\tau$$

Since $P(\tau \mid \theta)$ is differentiable with respect to $\theta$, and the gradient can be dominated by some integrable function $g$ such that $\lVert \nabla_\theta P(\tau \mid \theta)\, R(\tau) \rVert \le g(\tau)$, the gradient can be swapped with the integral. This arises from a combination of Leibniz’s theorem for differentiation under the integral sign and the dominated convergence theorem. The latter is ensured by the fact that any trajectory will be computed over a finite horizon, and/or the rewards are discounted by a factor $\gamma < 1$, thereby diminishing the return, eventually making the series convergent and allowing the integrals and limits to commute.

Info

Leibniz Integral Rule:

For the below integral

$$I(x) = \int_{a(x)}^{b(x)} f(x, t)\, dt$$

if we seek to differentiate it w.r.t. $x$, the resulting expression will be of the form

$$\frac{d I}{d x} = f\big(x, b(x)\big)\, \frac{d b}{d x} - f\big(x, a(x)\big)\, \frac{d a}{d x} + \int_{a(x)}^{b(x)} \frac{\partial}{\partial x} f(x, t)\, dt$$

when the limits are constants, $a(x) = a$ and $b(x) = b$, we will have a simpler resulting expression

$$\frac{d}{d x} \int_{a}^{b} f(x, t)\, dt = \int_{a}^{b} \frac{\partial}{\partial x} f(x, t)\, dt$$

This leads to the “trick” of putting the gradient, which is a derivative, under the integral sign. The additional conditions required are that $f(x, t)$ is continuous in $x$ and $t$ on the region of integration with an existing partial derivative w.r.t. $x$, and that $a(x)$ and $b(x)$ are finite, continuous, and differentiable functions.
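The constant-limit case can be verified numerically; a sketch that checks differentiation under the integral sign for $f(x, t) = e^{xt}$ on $[0, 1]$, using a midpoint Riemann sum and a central finite difference (the grid size and step are arbitrary choices):

```python
import math

# Check d/dx of the integral of e^{xt} over t in [0, 1] against the
# integral of t * e^{xt}, i.e. the derivative pushed under the integral.
def integral(g, n=20_000):
    # midpoint Riemann sum of g over [0, 1]
    return sum(g((i + 0.5) / n) for i in range(n)) / n

x, h = 0.5, 1e-4

# Left side: central finite difference of I(x) = integral of e^{xt}
I = lambda x: integral(lambda t: math.exp(x * t))
lhs = (I(x + h) - I(x - h)) / (2 * h)

# Right side: the derivative pushed under the integral sign
rhs = integral(lambda t: t * math.exp(x * t))

print(abs(lhs - rhs) < 1e-6)  # True: the two sides agree
```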

Now we concern ourselves with the term $\nabla_\theta P(\tau \mid \theta)$ inside the integral, which was obtained by pushing the gradient computed on the parameters into the integral. As can be seen above, there are a lot of extraneous factors in the trajectory probability which should ideally be removed in order to more simply compute the gradient of the policy, which happens to be our ultimate goal. To aid in this matter we employ a simple trick, a consequence of the chain rule known as the log-derivative trick. It is not so much a trick as an elementary rearrangement of the terms of log differentiation:

$$\nabla_\theta P(\tau \mid \theta) = P(\tau \mid \theta)\, \nabla_\theta \log P(\tau \mid \theta)$$
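The log-derivative identity can be checked numerically on any differentiable probability; a sketch using a made-up one-parameter Bernoulli probability $p(\theta) = \sigma(\theta)$ and central finite differences:

```python
import math

# Check grad p = p * grad log p for a made-up one-parameter
# probability p(theta) = sigmoid(theta).
sigmoid = lambda z: 1.0 / (1.0 + math.exp(-z))

theta, h = 0.3, 1e-6
p = sigmoid(theta)

# Left side: direct finite-difference gradient of p
lhs = (sigmoid(theta + h) - sigmoid(theta - h)) / (2 * h)

# Right side: p times the finite-difference gradient of log p
grad_log_p = (math.log(sigmoid(theta + h)) - math.log(sigmoid(theta - h))) / (2 * h)
rhs = p * grad_log_p

print(abs(lhs - rhs) < 1e-6)  # True: the identity holds
```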

Taking the log thus decomposes the product in the trajectory probability into a summation of log-probabilities and helps us selectively remove all terms that do not depend on the parameters $\theta$.

Thus we break down the log of the trajectory probability as:

$$\log P(\tau \mid \theta) = \log \rho_0(s_0) + \sum_{t=0}^{T-1} \Big[ \log \pi_\theta(a_t \mid s_t) + \log P(s_{t+1} \mid s_t, a_t) \Big]$$

Only the policy term $\log \pi_\theta(a_t \mid s_t)$ depends on the parameters $\theta$, and therefore the rest reduce to 0 when the gradient is applied w.r.t. $\theta$:

$$\nabla_\theta \log P(\tau \mid \theta) = \sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta(a_t \mid s_t)$$

The resulting expectation equation will have both the gradient of the log-policy and the cumulative reward summation

$$\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[ \left( \sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta(a_t \mid s_t) \right) R(\tau) \right]$$

A trajectory is one possible rollout of the interaction between the environment and the policy, and its randomness comes from three sources: the initial state ($s_0 \sim \rho_0$), the policy, and the environment’s dynamics. Given $s_t$, the policy produces a distribution $\pi_\theta(\cdot \mid s_t)$, and an action is sampled from it, $a_t \sim \pi_\theta(\cdot \mid s_t)$. The environment then takes $(s_t, a_t)$, samples the next state $s_{t+1} \sim P(\cdot \mid s_t, a_t)$, and emits a reward $r_t$. Now we have the partial trajectory $(s_0, a_0, r_0, \ldots, s_t, a_t, r_t, s_{t+1})$. This process continues until termination or until the horizon $T$ has been exhausted.
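Putting the whole derivation together, the gradient expectation can be estimated by sampling, which is exactly the REINFORCE score-function estimator. A minimal sketch on a made-up two-armed bandit, where a “trajectory” is a single action, compared against the exact gradient:

```python
import math, random

random.seed(1)

# A made-up two-armed bandit: arm 0 pays 1.0, arm 1 pays 0.0.
# The policy has one parameter theta:
#   pi(arm 0) = sigmoid(theta),  pi(arm 1) = 1 - sigmoid(theta).
sigmoid = lambda z: 1.0 / (1.0 + math.exp(-z))
rewards = [1.0, 0.0]
theta = 0.0
p0 = sigmoid(theta)

# Exact gradient of J(theta) = p0 * 1.0 + (1 - p0) * 0.0
exact_grad = p0 * (1.0 - p0)  # derivative of sigmoid; 0.25 at theta = 0

# Score-function (REINFORCE) estimate:
#   grad J is approximately (1/N) * sum_i grad log pi(a_i) * R(a_i)
N = 100_000
total = 0.0
for _ in range(N):
    a = 0 if random.random() < p0 else 1
    # grad of log pi(a) w.r.t. theta: (1 - p0) for arm 0, -p0 for arm 1
    grad_log_pi = (1.0 - p0) if a == 0 else -p0
    total += grad_log_pi * rewards[a]
estimate = total / N

print(exact_grad)                         # 0.25
print(abs(estimate - exact_grad) < 0.01)  # True: the estimator matches
```

Note that the environment's dynamics and $\rho_0$ never appear in the estimator, only $\nabla_\theta \log \pi_\theta$, which is precisely what the log-derivative decomposition bought us.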