16 Applications and Case Studies

16.1 TD-Gammon

Backgammon

Network: 198 input units -> 40-80 hidden units -> estimated probability of winning; the TD error drives the weight updates.
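A minimal sketch of that pipeline, assuming a single hidden layer of sigmoid units and a plain semi-gradient TD(0) step. This is a simplification for illustration, not the original training setup. The class name `ValueNet`, the step size, and the toy 4-input position are all hypothetical.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

class ValueNet:
    """One hidden layer of sigmoid units; the output is an estimated win probability."""

    def __init__(self, n_in=198, n_hidden=40, seed=0):
        rng = random.Random(seed)
        self.w1 = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_hidden)]
        self.w2 = [rng.uniform(-0.1, 0.1) for _ in range(n_hidden)]

    def forward(self, x):
        h = [sigmoid(sum(w * xi for w, xi in zip(row, x))) for row in self.w1]
        v = sigmoid(sum(w * hj for w, hj in zip(self.w2, h)))
        return v, h

    def td_update(self, x, target, alpha=0.1):
        """Semi-gradient TD(0) step toward `target`; returns the TD error.

        `target` is the next position's estimate, or the game outcome
        (1 = win, 0 = loss) at the end of a game (assumed undiscounted).
        """
        v, h = self.forward(x)
        delta = target - v                 # TD error
        dv = v * (1.0 - v)                 # derivative of the output sigmoid
        for j, hj in enumerate(h):
            # Gradient w.r.t. hidden unit j's input, computed with the old w2[j].
            grad_hidden = dv * self.w2[j] * hj * (1.0 - hj)
            self.w2[j] += alpha * delta * dv * hj
            for i, xi in enumerate(x):
                if xi:
                    self.w1[j][i] += alpha * delta * grad_hidden * xi
        return delta

# Toy usage with a 4-feature "position" instead of the 198-unit encoding.
net = ValueNet(n_in=4, n_hidden=3)
x = [1.0, 0.0, 1.0, 0.0]
before, _ = net.forward(x)
delta = net.td_update(x, target=1.0)   # pretend this position led to a win
after, _ = net.forward(x)
```

After the update the estimate for `x` moves toward the target, which is all the TD error is meant to do here.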

Samuel's checkers player: linear function approximation + TD-like learning + minimax search + alpha-beta cutoffs
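The search side of that combination can be sketched as minimax with alpha-beta cutoffs over a tiny hand-built tree, with leaves scored by a linear evaluation function. The tree, the feature vectors, and the weights are made up for illustration. Learning the weights is not shown.

```python
import math

def evaluate(features, weights):
    """Linear evaluation: weighted sum of position features."""
    return sum(w * f for w, f in zip(weights, features))

def alphabeta(node, weights, maximizing=True, alpha=-math.inf, beta=math.inf):
    """Minimax value of `node` with alpha-beta cutoffs.

    A leaf is a tuple of features; an internal node is a list of children.
    """
    if isinstance(node, tuple):
        return evaluate(node, weights)
    if maximizing:
        best = -math.inf
        for child in node:
            best = max(best, alphabeta(child, weights, False, alpha, beta))
            alpha = max(alpha, best)
            if alpha >= beta:      # the minimizer above will never allow this line
                break
        return best
    best = math.inf
    for child in node:
        best = min(best, alphabeta(child, weights, True, alpha, beta))
        beta = min(beta, best)
        if alpha >= beta:          # the maximizer above already has something better
            break
    return best

# Depth-2 tree: the root maximizes, its children minimize.
weights = (1.0, 0.5)
tree = [
    [(2, 2), (4, 2)],   # leaf values 3.0 and 5.0 -> min is 3.0
    [(1, 2), (8, 2)],   # first leaf is 2.0, so the second leaf is cut off
]
value = alphabeta(tree, weights)
```

In the second subtree the first leaf already scores below the 3.0 secured by the first subtree, so the remaining leaf is never evaluated.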

A reinforcement learning memory controller

MDP actions: precharge, activate, read, write, NoOp.

Using Sarsa to learn an action-value function.

States were represented by six integer-valued features.

The linear function approximation was implemented by tile coding with hashing.
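A rough sketch of how hashed tile coding and a linear Sarsa update fit together, assuming six integer features and five actions as in the notes above. The tiling count, tile width, table size, step size, discount, and reward used here are hypothetical, not the controller's actual settings.

```python
import math

ACTIONS = ["precharge", "activate", "read", "write", "noop"]

class HashedTileCoder:
    """Maps (features, action) to a few indices in a fixed-size weight table."""

    def __init__(self, n_tilings=8, table_size=4096, tile_width=4.0):
        self.n_tilings = n_tilings
        self.table_size = table_size
        self.tile_width = tile_width

    def active_tiles(self, features, action):
        tiles = []
        for t in range(self.n_tilings):
            # Each tiling is shifted by a different fraction of a tile.
            offset = t * self.tile_width / self.n_tilings
            coords = tuple(math.floor((f + offset) / self.tile_width) for f in features)
            # Hashing folds the huge tile space into `table_size` slots.
            tiles.append(hash((t, action, coords)) % self.table_size)
        return tiles

class LinearSarsa:
    """Action values are sums of the weights of the active tiles."""

    def __init__(self, coder, alpha=0.1, gamma=0.95):
        self.coder = coder
        self.w = [0.0] * coder.table_size
        self.alpha = alpha / coder.n_tilings   # split the step over tilings
        self.gamma = gamma

    def q(self, features, action):
        return sum(self.w[i] for i in self.coder.active_tiles(features, action))

    def update(self, s, a, r, s2, a2):
        delta = r + self.gamma * self.q(s2, a2) - self.q(s, a)
        for i in self.coder.active_tiles(s, a):
            self.w[i] += self.alpha * delta
        return delta

agent = LinearSarsa(HashedTileCoder())
s, a = (3, 0, 1, 7, 2, 5), ACTIONS.index("read")     # six integer features
s2, a2 = (3, 0, 1, 6, 2, 5), ACTIONS.index("noop")
delta = agent.update(s, a, r=1.0, s2=s2, a2=a2)      # hypothetical reward of 1
```

Actions are passed as integer indices so that the built-in `hash` gives the same tile indices on every run; hashing strings in Python is randomized per process.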

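DQN: Q-learning with a deep convolutional network, modified in three ways: experience replay, a separate target network for bootstrapping, and clipping the error term to [-1, 1].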
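The three modifications usually listed for DQN (experience replay, a periodically synced target copy, and clipping the TD error to [-1, 1]) can be sketched on top of ordinary Q-learning. To stay self-contained this sketch uses a lookup table where DQN uses a deep convolutional network, so it only illustrates the control flow. All names and hyperparameters are hypothetical.

```python
import random
from collections import deque

class ReplayQLearner:
    def __init__(self, n_actions, alpha=0.1, gamma=0.99,
                 buffer_size=1000, sync_every=50, seed=0):
        self.n_actions = n_actions
        self.alpha, self.gamma = alpha, gamma
        self.q = {}                 # online estimates: (state, action) -> value
        self.q_target = {}          # older copy used only for bootstrapping
        self.replay = deque(maxlen=buffer_size)
        self.sync_every = sync_every
        self.steps = 0
        self.rng = random.Random(seed)

    def store(self, s, a, r, s2, done):
        """1) Experience replay: keep transitions instead of learning online."""
        self.replay.append((s, a, r, s2, done))

    def learn(self, batch_size=8):
        if len(self.replay) < batch_size:
            return
        for s, a, r, s2, done in self.rng.sample(list(self.replay), batch_size):
            # 2) Bootstrap from the target copy, not the online estimates.
            bootstrap = 0.0 if done else max(
                self.q_target.get((s2, b), 0.0) for b in range(self.n_actions))
            delta = r + self.gamma * bootstrap - self.q.get((s, a), 0.0)
            # 3) Clip the error term to [-1, 1].
            delta = max(-1.0, min(1.0, delta))
            self.q[(s, a)] = self.q.get((s, a), 0.0) + self.alpha * delta
        self.steps += 1
        if self.steps % self.sync_every == 0:
            self.q_target = dict(self.q)   # refresh the target copy

learner = ReplayQLearner(n_actions=2)
learner.store(s=0, a=1, r=10.0, s2=1, done=True)
learner.learn(batch_size=1)
```

Even though the raw error for this transition is 10, the clipped error is 1, so the stored value moves by only `alpha * 1`.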

AlphaGo Zero: MCTS + ResNet

The objective is to maximize the click-through rate. Contextual bandit problem.

Greedy optimization: maximizing only the probability of immediate clicks.

Life-time value (LTV) optimization: increasing the number of clicks users make over multiple visits to a website.
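A toy calculation of why the two objectives can disagree. It assumes a made-up user model: on every visit the user clicks with probability `p_click`, then leaves for good with probability `p_leave`. The expected number of visits is then `1 / p_leave`, and the expected lifetime clicks are `p_click / p_leave`. The offers and their numbers are hypothetical.

```python
# Hypothetical offers: the aggressive one gets more clicks now
# but drives users away faster.
OFFERS = {
    "aggressive": {"p_click": 0.5, "p_leave": 0.8},
    "gentle": {"p_click": 0.3, "p_leave": 0.2},
}

def immediate_ctr(offer):
    """Greedy objective: probability of a click on the current visit."""
    return OFFERS[offer]["p_click"]

def lifetime_clicks(offer):
    """LTV objective: expected clicks over all visits under the toy model."""
    o = OFFERS[offer]
    return o["p_click"] / o["p_leave"]

greedy_choice = max(OFFERS, key=immediate_ctr)
ltv_choice = max(OFFERS, key=lifetime_clicks)
```

Under these assumed numbers the greedy criterion picks the aggressive offer, while the life-time value criterion picks the gentle one, because keeping the user around yields more clicks in total.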