Reinforcement Learning 6: Temporal-Difference Learning
6 Temporal-Difference Learning
6.1 TD Prediction
Tabular TD(0) for estimating $v_\pi$
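The tabular TD(0) update is

\[V(S_t) \leftarrow V(S_t) + \alpha [R_{t+1} + \gamma V(S_{t+1}) - V(S_t)]\]

A minimal sketch of the episode loop, assuming a hypothetical environment with `reset()` returning a state and `step(action)` returning `(next_state, reward, done)`, plus a fixed `policy` callable; these interface names are illustrative, not from the book:

```python
from collections import defaultdict

def td0_prediction(env, policy, num_episodes, alpha=0.1, gamma=1.0):
    """Tabular TD(0) estimate of v_pi for a fixed policy.

    Hypothetical interface: env.reset() -> state,
    env.step(action) -> (next_state, reward, done).
    """
    V = defaultdict(float)  # value estimates, 0 for unseen states
    for _ in range(num_episodes):
        state = env.reset()
        done = False
        while not done:
            next_state, reward, done = env.step(policy(state))
            # Bootstrapped target; terminal states are given value 0
            target = reward + gamma * V[next_state] * (not done)
            V[state] += alpha * (target - V[state])
            state = next_state
    return V
```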
Example 6.1: Driving Home
6.2 Advantages of TD Prediction Methods
Example 6.2: Random Walk
6.3 Optimality of TD(0)
Example 6.3: Random walk under batch updating
Example 6.4: You are the Predictor
6.4 Sarsa: On-policy TD Control
\[Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha [R_{t+1} + \gamma Q(S_{t+1}, A_{t+1}) - Q(S_t, A_t)]\]
Sarsa (on-policy TD control) for estimating $Q \approx q_*$
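A minimal sketch of the Sarsa loop, using the same hypothetical `env.reset()`/`env.step()` interface as above plus an assumed `env.actions` list; the ε-greedy behavior policy is the usual choice but is not mandated by the update itself:

```python
import random
from collections import defaultdict

def sarsa(env, num_episodes, alpha=0.5, gamma=1.0, epsilon=0.1):
    """Sarsa (on-policy TD control); Q is keyed by (state, action)."""
    Q = defaultdict(float)

    def epsilon_greedy(state):
        if random.random() < epsilon:
            return random.choice(env.actions)
        return max(env.actions, key=lambda a: Q[(state, a)])

    for _ in range(num_episodes):
        state = env.reset()
        action = epsilon_greedy(state)
        done = False
        while not done:
            next_state, reward, done = env.step(action)
            next_action = epsilon_greedy(next_state)
            # On-policy: the target uses the action actually taken next
            target = reward + gamma * Q[(next_state, next_action)] * (not done)
            Q[(state, action)] += alpha * (target - Q[(state, action)])
            state, action = next_state, next_action
    return Q
```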
Example 6.5: Windy Gridworld
6.5 Q-learning: Off-policy TD Control
Q-learning (off-policy TD control) for estimating $\pi \approx \pi_*$
\[Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha [R_{t+1} + \gamma \max_a Q(S_{t+1}, a) - Q(S_t, A_t)]\]
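A matching Q-learning sketch under the same assumed interface; the only change from Sarsa is that the target maximizes over next actions rather than following the behavior policy:

```python
import random
from collections import defaultdict

def q_learning(env, num_episodes, alpha=0.5, gamma=1.0, epsilon=0.1):
    """Q-learning (off-policy TD control) under the same hypothetical
    env.reset()/env.step()/env.actions interface as the sketches above."""
    Q = defaultdict(float)
    for _ in range(num_episodes):
        state = env.reset()
        done = False
        while not done:
            # Behavior policy: epsilon-greedy in the current Q
            if random.random() < epsilon:
                action = random.choice(env.actions)
            else:
                action = max(env.actions, key=lambda a: Q[(state, a)])
            next_state, reward, done = env.step(action)
            # Off-policy target: greedy value of the next state
            best_next = max(Q[(next_state, a)] for a in env.actions)
            target = reward + gamma * best_next * (not done)
            Q[(state, action)] += alpha * (target - Q[(state, action)])
            state = next_state
    return Q
```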
Example 6.6: Cliff Walking
6.6 Expected Sarsa
\[Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha [R_{t+1} + \gamma \mathbb{E}_\pi[Q(S_{t+1}, A_{t+1}) \mid S_{t+1}] - Q(S_t, A_t)]\]
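Relative to Sarsa, only the target changes: the sampled next-action value is replaced by its expectation under the current policy. A sketch of that target for an ε-greedy policy, reusing the hypothetical `Q` and `env.actions` conventions from the sketches above:

```python
def expected_sarsa_target(Q, env, next_state, reward, gamma=1.0,
                          epsilon=0.1, done=False):
    """Expected Sarsa target: reward + gamma * E_pi[Q(S', .)],
    with pi taken to be epsilon-greedy over env.actions."""
    if done:
        return reward
    values = [Q[(next_state, a)] for a in env.actions]
    n = len(values)
    # epsilon-greedy: epsilon/n probability on every action,
    # plus the remaining 1 - epsilon on the greedy action
    expected = (epsilon / n) * sum(values) + (1 - epsilon) * max(values)
    return reward + gamma * expected
```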
6.7 Maximization Bias and Double Learning
Example 6.7: Maximization Bias Example
Double Q-learning
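Double Q-learning keeps two independent estimates and, on each step, updates one using the other to evaluate the greedy action, so the same samples never both select and evaluate the maximum. A sketch under the same assumed interface as the earlier blocks:

```python
import random
from collections import defaultdict

def double_q_learning(env, num_episodes, alpha=0.5, gamma=1.0, epsilon=0.1):
    """Double Q-learning: decouples action selection from evaluation
    to avoid maximization bias. Same hypothetical env interface as above."""
    Q1, Q2 = defaultdict(float), defaultdict(float)
    for _ in range(num_episodes):
        state = env.reset()
        done = False
        while not done:
            # Behave epsilon-greedily with respect to Q1 + Q2
            if random.random() < epsilon:
                action = random.choice(env.actions)
            else:
                action = max(env.actions,
                             key=lambda a: Q1[(state, a)] + Q2[(state, a)])
            next_state, reward, done = env.step(action)
            # With probability 0.5, update Q1 using Q2's evaluation
            # of Q1's greedy action (and vice versa)
            A, B = (Q1, Q2) if random.random() < 0.5 else (Q2, Q1)
            greedy = max(env.actions, key=lambda a: A[(next_state, a)])
            target = reward + gamma * B[(next_state, greedy)] * (not done)
            A[(state, action)] += alpha * (target - A[(state, action)])
            state = next_state
    return Q1, Q2
```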