In this chapter we consider methods that instead learn a parameterized policy that can select actions without consulting a value function. A value function may still be used to learn the policy parameter, but is not required for action selection.

$\pi(a|s,\theta) = Pr\{ A_t = a | S_t = s, \theta_t = \theta \}$

In this chapter we consider methods for learning the policy parameter based on the gradient of some performance measure $J(\theta)$ with respect to the policy parameter. These methods seek to maximize performance, so their updates approximate gradient ascent in $J$:

$\theta_{t+1} = \theta_t + \alpha \nabla J(\theta_t)$

All methods that follow this general schema we call policy gradient methods, whether or not they also learn an approximate value function. Methods that learn approximations to both policy and value functions are often called actor-critic methods, where ‘actor’ is a reference to the learned policy, and ‘critic’ refers to the learned value function, usually a state-value function.

## 13.1 Policy Approximation and its Advantages

The most common parameterization: $\pi(a | s, \theta) = \frac{\exp(h(s,a,\theta))}{\sum_b \exp(h(s,b,\theta))}$

linear approximation: $h(s,a,\theta) = \theta^T x(s,a)$

1. An immediate advantage of selecting actions according to the softmax in action preferences is that the approximate policy can approach a deterministic policy, whereas with $\epsilon$-greedy action selection over action values there is always an $\epsilon$ probability of selecting a random action.
2. Action-value methods have no natural way of finding stochastic optimal policies, whereas policy approximating methods can, as shown in Example 13.1.

Example 13.1 Short corridor with switched actions

## 13.2 The Policy Gradient Theorem

In addition to the practical advantages of policy parameterization over $\epsilon$-greedy action selection, there is also an important theoretical advantage. With continuous policy parameterization the action probabilities may change dramatically for an arbitrarily small change in the estimated action values, if that change results in a different action having the maximal value.

In the episodic case we define performance as

$J(\theta) = v_{\pi_\theta}(s_0)$

The performance depends on:

• action selections
• distribution of states <- hard to know how parameter affects.

Both of them are affected by the policy parameter.

An excellent theoretical answer to this challenge in the form of the policy gradient theorem:

$\nabla J(\theta) \propto \sum_s \mu(s) \sum_a q_\pi(s,a) \nabla_\theta \pi(a | s, \theta)$

Proof of the Policy Gradient Theorem

## 13.3 REINFORCE: Monte Carlo Policy Gradient

$\begin{eqnarray*} \nabla J(\theta) &\propto& \sum_s \mu(s) \sum_a q_\pi(s,a) \nabla_\theta \pi(a | s, \theta) \\ \nabla J(\theta) &=& E_\pi [ \sum_a q_\pi(S_t, a) \nabla_\theta \pi(a | S_t, \theta) ] \\ &=& E_\pi [ \sum_a \pi(a|S_t, \theta) q_\pi(S_t, a) \frac{\nabla_\theta \pi(A_t | S_t, \theta)}{\pi(a | S_t, \theta)} ] \\ &=& E_\pi [q_\pi(S_t, A_t) \frac{\nabla_\theta \pi(A_t | S_t, \theta)}{\pi(A_t | S_t, \theta)}] \quad \text{sample A_t \sim \pi}\\ &=& E_\pi [ G_t \frac{\nabla_\theta \pi(A_t | S_t, \theta)}{\pi(A_t | S_t, \theta)} ], \quad \text{because E_\pi[G_t | S_t, A_t] = q_\pi(S_t, A_t)}\\ \end{eqnarray*}$

the update rule:

$\theta_{t+1} = \theta_t + \alpha G_t \frac{\nabla_\theta \pi(A_t | S_t, \theta_t)}{\pi(A_t | S_t, \theta_t)}$

We call this algorithm REINFORCE.

The update increases the parameter vector in this direction proportional to the return, and inversely proportional to the action probablility. The former makes sense because it causes the parameter to move most in the directions that favor actions that yield thehighest return. The latter makes sense because otherwise actions that are selected frequently are at an advantage and might win out even if they do not yield the hightest return.

$\begin{eqnarray*} \nabla_\theta \ln \pi(a|s,\theta) &=& \frac{\nabla_\theta \pi(A_t | S_t, \theta_t)}{\pi(A_t | S_t, \theta_t)} \\ &=& x(s,a) - \sum_b \pi(b|s,\theta) x(s,b) \end{eqnarray*}$

## 13.4 REINFORCE with Baseline

The policy gradient theorem can be generalized to include a comparison of the action value to an arbitrary baseline $b(s)$:

$\nabla J(\theta) \propto \sum_s \mu(s) \sum_a (q_\pi(s,a) - b(s)) \nabla_\theta \pi(a | s, \theta)$

The baselin can be any function, even a random variable, as long as it does not vary with $a$; the equation remains valid because the subtracted quantity is zero:

$\sum_a b(s) \nabla_\theta \pi(a|s,\theta) = b(s) \nabla_\theta \sum_a \pi(a|s,\theta) = b(s)\nabla_\theta 1 = 0.$

The update rule:

$\theta_{t+1} = \theta_t + \alpha (G_t - b(S_t)) \frac{\nabla_\theta \pi(A_t | S_t, \theta_t)}{\pi(A_t | S_t, \theta_t)}$

In general, the baseline leaves the expected value of the update unchanged, but it can have a large effect on its variance.

• reduce the variance and speed the learning.

One natural choice for the baseline is an estimate of the state value, $v(S_t, w)$.

REINFORCE with Baseline

Input: a diiferentiable policy parameterization $\pi(a|s,\theta)$

Input: a diiferentiable state-value parameterization $v(s,w)$

Parameters: step size $\alpha^\theta > 0$, $\alpha^w > 0$

Initialize policy parameter $\theta \in R^d$ and state-value weights $w \in R^{d’}$

Repeat forever:

• Generate an episode $S_0, A_0, R_1, dots, S_{T-1}, A_{T-1}, R_T$, following $\pi$.
• For each step of episode $t = 0, dots, T - 1$:
• $G_t \leftarrow \text{return from step$t$}$
• $\delta \leftarrow G_t - v(S_t, w)$
• $w \leftarrow w + \alpha^w \gamma^t \delta \nabla_w v(S_t, w)$
• $\theta \leftarrow \theta + \alpha^\theta \gamma^t \delta \nabla_\theta \ln \pi(A_t | S_t, \theta)$

## 13.5 Actor-Critic Methods

Although the REINFORCE-with-baseline method learns both a policy and a state-value function, we do not consider it to be an actor-critic method because its state-value function is used only as a baseline, not as a critic.

• only for baseline, not bootstrapping.

Add one-step bootstrapping to REINFORCE-with-baseline update rule:

$\begin{eqnarray*} \theta_{t+1} &=& \theta_t + \alpha (G_{t:t+1} - v(S_t,w)) \frac{\nabla_\theta \pi(A_t | S_t, \theta_t)}{\pi(A_t | S_t, \theta_t)} \\ &=& \theta_t + \alpha (R_{t+1} + \gamma v(S_{t+1}, w) - v(S_t,w)) \frac{\nabla_\theta \pi(A_t | S_t, \theta_t)}{\pi(A_t | S_t, \theta_t)} \\ &=& \theta_t + \alpha \theta_t \frac{\nabla_\theta \pi(A_t | S_t, \theta_t)}{\pi(A_t | S_t, \theta_t)} \end{eqnarray*}$

## 13.6 Policy Gradient for Continuing Problems

$J(\theta) = \lim_{h \rightarrow \infty} \frac{1}{h} \sum_{t=1}{h} E[R_t | A_{0:t-1} \sim \pi]$

## 13.7 Policy Parameterization for Continuous Actions

Using the normal distribution to model actions:

$p(x) = \frac{1}{\sigma \sqrt{2\pi}} \exp(-\frac{(x-\mu)^2}{2 \sigma^2})$

Policy function:

$\pi(a|s,\theta) = \frac{1}{\sigma(s,\theta) \sqrt{2\pi}} \exp(-\frac{(a-\mu(s,\theta))^2}{2 \sigma(s,\theta)^2})$

where $\mu(s,\theta) = \theta_\mu^T x(s)$ and $\sigma(s,\theta) = \exp(\theta_\sigma^T x(s))$.