16 Applications and Case Studies

16.1 TD-Gammon


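  • TD($\lambda$) combined with a neural network (NN) as the value-function approximator
  • The TD error is backpropagated through the network to update its weights

Network: 198 input units -> 40-80 hidden units -> one output unit (predicted probability of winning), trained on the TD error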
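The combination of TD($\lambda$) and backpropagation can be sketched as follows. This is a minimal illustration, not TD-Gammon's actual implementation: the class name `TDLambdaNet`, the absence of bias terms, the weight initialization, and the values of `alpha`, `lam`, and `gamma` are all assumptions made for the example. Only the 198-input, 40-hidden-unit, single-output shape comes from the notes above.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class TDLambdaNet:
    """Hypothetical one-hidden-layer value network trained with TD(lambda)."""

    def __init__(self, n_in=198, n_hidden=40, alpha=0.1, lam=0.7, gamma=1.0, seed=0):
        rng = np.random.default_rng(seed)
        self.W1 = rng.normal(0.0, 0.1, (n_hidden, n_in))  # input -> hidden
        self.W2 = rng.normal(0.0, 0.1, n_hidden)          # hidden -> output
        self.alpha, self.lam, self.gamma = alpha, lam, gamma
        self.reset_traces()

    def reset_traces(self):
        # One eligibility trace per weight; reset at the start of each game.
        self.e1 = np.zeros_like(self.W1)
        self.e2 = np.zeros_like(self.W2)

    def value(self, x):
        h = sigmoid(self.W1 @ x)
        return sigmoid(self.W2 @ h), h  # (estimated win probability, hidden activations)

    def update(self, x, reward, x_next, terminal):
        v, h = self.value(x)
        v_next = 0.0 if terminal else self.value(x_next)[0]
        delta = reward + self.gamma * v_next - v  # TD error

        # Backpropagation: gradient of the output v with respect to each weight.
        dv = v * (1.0 - v)
        g2 = dv * h
        g1 = np.outer(dv * self.W2 * h * (1.0 - h), x)

        # Accumulate traces, then move every weight along its trace.
        self.e1 = self.gamma * self.lam * self.e1 + g1
        self.e2 = self.gamma * self.lam * self.e2 + g2
        self.W1 += self.alpha * delta * self.e1
        self.W2 += self.alpha * delta * self.e2
        return delta
```

In this sketch a game would call `reset_traces()` once, then `update(...)` after every move, with `reward` nonzero only on the final transition.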

16.2 Samuel’s Checkers Player

Linear function approximation + TD + minimax search with alpha-beta cutoffs
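How a linear evaluation function plugs into minimax search with alpha-beta cutoffs can be sketched on a toy game tree. This is a generic illustration, not Samuel's program: the tree encoding (nested lists with feature tuples at the leaves) and the function names are assumptions chosen for the example.

```python
def linear_value(weights, features):
    """Linear function approximation: weighted sum of board features."""
    return sum(w * f for w, f in zip(weights, features))

def alphabeta(node, weights, maximizing=True,
              alpha=float("-inf"), beta=float("inf")):
    """Minimax value of `node`, skipping branches that cannot affect the result.

    Assumed encoding: an internal node is a list of children, a leaf is a
    tuple of feature values scored by the linear evaluation function.
    """
    if isinstance(node, tuple):
        return linear_value(weights, node)
    if maximizing:
        best = float("-inf")
        for child in node:
            best = max(best, alphabeta(child, weights, False, alpha, beta))
            alpha = max(alpha, best)
            if alpha >= beta:  # cutoff: the minimizer will avoid this branch
                break
        return best
    best = float("inf")
    for child in node:
        best = min(best, alphabeta(child, weights, True, alpha, beta))
        beta = min(beta, best)
        if alpha >= beta:  # cutoff: the maximizer will avoid this branch
            break
    return best
```

For example, with `weights = (1.0, 0.5)` and the tree `[[(3, 0), (5, 2)], [(2, 0), (9, 0)]]`, the second subtree is cut off after its first leaf, because the minimizer can already force a value of 2 there while the maximizer has 3 available from the first subtree.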

16.3 Watson’s Daily-Double Wagering

16.4 Optimizing Memory Control

The case study is a reinforcement learning memory controller for DRAM.

MDP actions: precharge, activate, read, write, and NoOp.

Sarsa was used to learn an action-value function.

States were represented by six integer-valued features.

The linear function approximation was implemented by tile coding with hashing.
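A Sarsa update on top of hashed tile coding might look like the sketch below. The class name `HashedTileQ`, the number of tilings, tile width, table size, step size, and discount are all illustrative assumptions; the original controller was implemented in hardware and its parameters are not given in these notes. Only the five actions and the six integer-valued state features come from the text above.

```python
import numpy as np

ACTIONS = ["precharge", "activate", "read", "write", "noop"]

class HashedTileQ:
    """Hypothetical linear action-value function using hashed tile coding."""

    def __init__(self, n_tilings=4, tile_width=4, table_size=4096):
        self.n_tilings = n_tilings
        self.tile_width = tile_width
        self.table_size = table_size
        self.w = np.zeros(table_size)  # one weight per hash-table slot

    def active_tiles(self, state, action):
        """Return one hashed tile index per tiling for this (state, action)."""
        tiles = []
        for t in range(self.n_tilings):
            # Each tiling is shifted by t before the features are quantized.
            coords = tuple((s + t) // self.tile_width for s in state)
            tiles.append(hash((t, action) + coords) % self.table_size)
        return tiles

    def value(self, state, action):
        return sum(self.w[i] for i in self.active_tiles(state, action))

    def sarsa_update(self, s, a, r, s_next, a_next, terminal,
                     alpha=0.5, gamma=0.95):
        target = r if terminal else r + gamma * self.value(s_next, a_next)
        delta = target - self.value(s, a)
        # Linear update: only the weights of the active tiles change.
        for i in self.active_tiles(s, a):
            self.w[i] += alpha / self.n_tilings * delta
        return delta
```

Here `state` is a tuple of six integers and `action` is an index into `ACTIONS`. Hashing keeps the weight table at a fixed size regardless of how many distinct tiles the six features could produce, at the cost of occasional collisions.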

16.5 Human-level Video Game Play

DQN modifies basic Q-learning in three ways:

  • experience replay
  • hold a copy of the network fixed for the next C updates and use it to compute the Q-learning targets
  • clip the error term to the interval [-1, 1]
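The three modifications listed above can be sketched together in a few lines. To keep the example self-contained, a linear Q-function (a weight matrix `W` with one row per action) stands in for the deep convolutional network; that substitution, the function names, and the values of `lr` and `gamma` are assumptions made for illustration only.

```python
import random
from collections import deque
import numpy as np

class ReplayBuffer:
    """Experience replay: store transitions, sample them later at random."""

    def __init__(self, capacity):
        self.data = deque(maxlen=capacity)  # oldest transitions drop out

    def add(self, s, a, r, s_next, done):
        self.data.append((s, a, r, s_next, done))

    def sample(self, batch_size):
        return random.sample(list(self.data), batch_size)

def clipped_td_error(W, W_target, s, a, r, s_next, done, gamma=0.99):
    """TD error using the frozen target weights, clipped to [-1, 1]."""
    target = r if done else r + gamma * np.max(W_target @ s_next)
    return float(np.clip(target - (W @ s)[a], -1.0, 1.0))

def train_step(W, W_target, batch, lr=0.01, gamma=0.99):
    """Update the online weights W in place; W_target is only read."""
    for s, a, r, s_next, done in batch:
        delta = clipped_td_error(W, W_target, s, a, r, s_next, done, gamma)
        W[a] += lr * delta * s  # gradient of a linear Q-value w.r.t. row a is s

def maybe_sync(W, W_target, step, C):
    """Copy the online weights into the target every C updates."""
    return W.copy() if step % C == 0 else W_target
```

A training loop under these assumptions would call `buffer.add(...)` after each environment step, `train_step(W, W_target, buffer.sample(32))` to learn, and `W_target = maybe_sync(W, W_target, step, C)` to refresh the target.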

16.6 Mastering the Game of Go

MCTS + deep convolutional networks (residual networks, ResNet, in AlphaGo Zero)

16.6.1 AlphaGo

16.6.2 AlphaGo Zero

16.7 Personalized Web Services

The objective is to maximize the click-through rate, which can be formulated as a contextual bandit problem.

Greedy optimization: maximizing only the probability of immediate clicks.

Life-time value (LTV) optimization: improving the number of clicks users make over multiple visits to a website.
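The greedy approach can be illustrated with a very small contextual bandit that tracks an empirical click-through rate per (context, action) pair and always shows the action with the highest estimate. The class name `GreedyCTRBandit`, the tabular estimates, and the string contexts are assumptions made for this sketch; a deployed system would use richer features and some form of exploration.

```python
from collections import defaultdict

class GreedyCTRBandit:
    """Hypothetical tabular contextual bandit that maximizes immediate clicks."""

    def __init__(self, n_actions):
        self.n_actions = n_actions
        self.clicks = defaultdict(float)  # (context, action) -> total clicks
        self.shows = defaultdict(int)     # (context, action) -> times shown

    def estimate(self, context, action):
        """Empirical click-through rate; 0.0 if the pair was never shown."""
        n = self.shows[(context, action)]
        return self.clicks[(context, action)] / n if n else 0.0

    def choose(self, context):
        # Greedy: pick the action with the highest estimated immediate CTR.
        return max(range(self.n_actions),
                   key=lambda a: self.estimate(context, a))

    def update(self, context, action, clicked):
        self.shows[(context, action)] += 1
        self.clicks[(context, action)] += float(clicked)
```

Because each decision looks only at the immediate click, this sketch ignores any effect the chosen action has on later visits. LTV optimization instead treats the sequence of visits as an MDP and optimizes clicks over the whole sequence.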

16.8 Thermal Soaring