# 6 Temporal-Difference Learning

## 6.1 TD Prediction

Tabular TD(0) for estimating $v_\pi$
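As a rough sketch of the update behind this algorithm box, TD(0) moves $V(S_t)$ toward the one-step target $R_{t+1} + \gamma V(S_{t+1})$ with step size $\alpha$. The function name `td0_update`, the dictionary value table, and the three-step episode below are illustrative assumptions, not part of the original notes.

```python
def td0_update(V, s, r, s_next, alpha, gamma, terminal=False):
    """Apply one tabular TD(0) update to V[s] in place; return the TD error."""
    # The value of a terminal state is taken to be 0, so the target is just r.
    target = r if terminal else r + gamma * V[s_next]
    delta = target - V[s]
    V[s] += alpha * delta
    return delta


# Hypothetical episode as (state, reward, next_state, terminal) transitions.
episode = [
    ("A", 0.0, "B", False),
    ("B", 0.0, "C", False),
    ("C", 1.0, None, True),
]

V = {"A": 0.0, "B": 0.0, "C": 0.0}  # arbitrary initial estimates
for s, r, s_next, terminal in episode:
    td0_update(V, s, r, s_next, alpha=0.5, gamma=1.0, terminal=terminal)
print(V)
```

Each update uses only the current transition, so learning can proceed online without waiting for the end of the episode.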

Example 6.1: Driving Home

## 6.2 Advantages of TD Prediction Methods

Example 6.2: Random Walk

## 6.3 Optimality of TD(0)

Example 6.3: Random walk under batch updating

Example 6.4: You are the Predictor

## 6.4 Sarsa: On-policy TD Control

$Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha [R_{t+1} + \gamma Q(S_{t+1}, A_{t+1}) - Q(S_t, A_t)]$

Sarsa (on-policy TD control) for estimating $Q \approx q_*$
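A minimal sketch of the Sarsa update, assuming a tabular `Q` stored as a list of per-state action-value lists and an $\varepsilon$-greedy behaviour policy. The helper names `epsilon_greedy` and `sarsa_update` are hypothetical. The point to notice is that the target uses the action $A_{t+1}$ the policy actually selects.

```python
import random


def epsilon_greedy(Q, s, epsilon, rng):
    """Pick a random action with probability epsilon, else a greedy one."""
    n_actions = len(Q[s])
    if rng.random() < epsilon:
        return rng.randrange(n_actions)
    return max(range(n_actions), key=lambda a: Q[s][a])


def sarsa_update(Q, s, a, r, s_next, a_next, alpha, gamma, terminal=False):
    """One Sarsa update using the quintuple (s, a, r, s_next, a_next)."""
    # Terminal next states contribute no bootstrap term.
    target = r if terminal else r + gamma * Q[s_next][a_next]
    Q[s][a] += alpha * (target - Q[s][a])


rng = random.Random(0)
Q = [[0.0, 0.0], [1.0, 2.0]]  # two states, two actions (assumed sizes)

# One hypothetical transition: state 0, action 1, reward -1, next state 1.
a_next = epsilon_greedy(Q, 1, epsilon=0.1, rng=rng)
sarsa_update(Q, 0, 1, -1.0, 1, a_next, alpha=0.5, gamma=0.9)
```

In a full control loop, `a_next` would then be executed on the following step, which is what makes the method on-policy.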

Example 6.5: Windy Gridworld

## 6.5 Q-learning: Off-policy TD Control

Q-learning (off-policy TD control) for estimating $\pi \approx \pi_*$

$Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha [R_{t+1} + \gamma \max_a Q(S_{t+1}, a) - Q(S_t, A_t)]$
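A small sketch of the Q-learning update, under the same assumed list-of-lists table layout; `q_learning_update` is a hypothetical name. The target bootstraps from the largest action value in the next state, whichever action the behaviour policy goes on to take.

```python
def q_learning_update(Q, s, a, r, s_next, alpha, gamma, terminal=False):
    """One Q-learning update; the target maximises over next-state actions."""
    # The max ranges over every action available in s_next.
    target = r if terminal else r + gamma * max(Q[s_next])
    Q[s][a] += alpha * (target - Q[s][a])


Q = [[0.0, 0.0], [1.0, 2.0]]  # two states, two actions (assumed sizes)

# One hypothetical transition: state 0, action 1, reward -1, next state 1.
q_learning_update(Q, 0, 1, -1.0, 1, alpha=0.5, gamma=0.9)
print(Q[0][1])
```

Because the target does not depend on the next action actually chosen, the learned values approximate $q_*$ while the agent follows an exploratory policy.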

Example 6.6: Cliff Walking

## 6.6 Expected Sarsa

$Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha [R_{t+1} + \gamma \mathbb{E}_\pi[Q(S_{t+1}, A_{t+1}) \mid S_{t+1}] - Q(S_t, A_t)]$
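The expectation can be written out as $\sum_a \pi(a \mid S_{t+1}) Q(S_{t+1}, a)$. The sketch below assumes the policy is $\varepsilon$-greedy with respect to the current `Q` (one common choice, not the only one) and that ties among greedy actions share probability evenly. Both function names are hypothetical.

```python
def epsilon_greedy_probs(q_values, epsilon):
    """Action probabilities of an epsilon-greedy policy for one state."""
    n = len(q_values)
    best = max(q_values)
    greedy = [a for a, q in enumerate(q_values) if q == best]
    probs = [epsilon / n] * n
    for a in greedy:
        # Greedy actions split the remaining 1 - epsilon mass evenly.
        probs[a] += (1.0 - epsilon) / len(greedy)
    return probs


def expected_sarsa_update(Q, s, a, r, s_next, alpha, gamma, epsilon,
                          terminal=False):
    """One Expected Sarsa update with an epsilon-greedy target policy."""
    if terminal:
        expected = 0.0
    else:
        probs = epsilon_greedy_probs(Q[s_next], epsilon)
        expected = sum(p * q for p, q in zip(probs, Q[s_next]))
    Q[s][a] += alpha * (r + gamma * expected - Q[s][a])


Q = [[0.0, 0.0], [1.0, 2.0]]  # two states, two actions (assumed sizes)
expected_sarsa_update(Q, 0, 1, -1.0, 1, alpha=0.5, gamma=0.9, epsilon=0.2)
```

Averaging over next actions removes the variance that Sarsa incurs from sampling $A_{t+1}$. With `epsilon=0.0` the target policy is greedy and the update coincides with Q-learning.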

## 6.7 Maximization Bias and Double Learning

Example 6.7: Maximization Bias Example

Double Q-learning
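A hedged sketch of the Double Q-learning update: two tables are kept, and on each step one of them picks the maximising next action while the other supplies its value, which decouples selection from evaluation. The function name `double_q_update` and the `update_first` flag are illustrative assumptions; here the caller makes the coin flip.

```python
import random


def double_q_update(Q1, Q2, s, a, r, s_next, alpha, gamma, update_first,
                    terminal=False):
    """One Double Q-learning update on Q1 (update_first=True) or Q2."""
    learner, evaluator = (Q1, Q2) if update_first else (Q2, Q1)
    if terminal:
        target = r
    else:
        # The updated table selects the action; the other table evaluates it.
        n_actions = len(learner[s_next])
        best = max(range(n_actions), key=lambda b: learner[s_next][b])
        target = r + gamma * evaluator[s_next][best]
    learner[s][a] += alpha * (target - learner[s][a])


rng = random.Random(0)
Q1 = [[0.0, 0.0], [1.0, 3.0]]  # two states, two actions (assumed sizes)
Q2 = [[0.0, 0.0], [2.0, 0.5]]

# Flip a fair coin to decide which table to update on this step.
double_q_update(Q1, Q2, 0, 0, 0.0, 1, alpha=0.5, gamma=1.0,
                update_first=rng.random() < 0.5)
```

When acting, a behaviour policy can be derived from both tables, for example $\varepsilon$-greedy on the sum `Q1[s][a] + Q2[s][a]`.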