Deep Learning: Chapter 10
10 Sequence Modeling: Recurrent and Recursive Nets
10.1 Unfolding Computational Graphs
\[h^{(t)} = f(h^{(t-1)}, x^{(t)}; \theta)\]

10.2 Recurrent Neural Networks
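A minimal NumPy sketch of unfolding this recurrence with the classical tanh update, the same function \(f\) and parameters \(\theta\) being reused at every step (the parameter names `W`, `U`, `b` are illustrative):

```python
import numpy as np

def rnn_forward(x_seq, h0, W, U, b):
    """Unfold h^(t) = tanh(W h^(t-1) + U x^(t) + b) over a sequence.

    x_seq: (T, input_dim), h0: (hidden_dim,),
    W: (hidden_dim, hidden_dim), U: (hidden_dim, input_dim), b: (hidden_dim,).
    Returns all hidden states, shape (T, hidden_dim).
    """
    h = h0
    states = []
    for x_t in x_seq:                      # the same f and theta at every time step
        h = np.tanh(W @ h + U @ x_t + b)
        states.append(h)
    return np.stack(states)

# example: 5 time steps, 3-dim inputs, 4-dim hidden state
rng = np.random.default_rng(0)
T, d_in, d_h = 5, 3, 4
h_states = rnn_forward(rng.normal(size=(T, d_in)), np.zeros(d_h),
                       0.1 * rng.normal(size=(d_h, d_h)),
                       0.1 * rng.normal(size=(d_h, d_in)),
                       np.zeros(d_h))
print(h_states.shape)  # (5, 4)
```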
back-propagation through time (BPTT)
10.2.1 Teacher Forcing and Networks with Output Recurrence
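A hedged sketch of teacher forcing for a model whose output at step t-1 feeds back as an input at step t. Names are illustrative (an output-to-hidden matrix `R`, softmax readout, one-hot targets); the only difference between training with teacher forcing and free-running generation is which previous output gets fed back in:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def step(h_prev, y_prev, x_t, params):
    """One step of an output-recurrent model: the previous output is an input."""
    W, U, R, V, b, c = params
    h = np.tanh(W @ h_prev + U @ x_t + R @ y_prev + b)
    return h, softmax(V @ h + c)

def run(x_seq, params, targets=None):
    """With targets given (training), feed the ground-truth previous output
    (teacher forcing); otherwise feed the model's own previous prediction."""
    W, U, R, V, b, c = params
    h, y_prev = np.zeros(W.shape[0]), np.zeros(c.shape[0])
    outputs = []
    for t, x_t in enumerate(x_seq):
        h, y = step(h, y_prev, x_t, params)
        outputs.append(y)
        y_prev = targets[t] if targets is not None else y   # the only difference
    return np.stack(outputs)
```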
10.2.2 Computing the Gradient in a Recurrent Neural Network
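The gradient is obtained by applying ordinary back-propagation to the unrolled graph. A minimal BPTT sketch, assuming for simplicity that the loss is applied only to the final hidden state and that its gradient `dL_dhT` is given (names follow the forward sketch above and are illustrative):

```python
import numpy as np

def bptt(x_seq, h0, W, U, b, dL_dhT):
    """BPTT for h^(t) = tanh(W h^(t-1) + U x^(t) + b).
    Walks the unrolled graph backwards and accumulates one gradient
    contribution per time step, since the parameters are shared across steps."""
    # forward pass: keep every hidden state for reuse in the backward pass
    hs = [h0]
    for x_t in x_seq:
        hs.append(np.tanh(W @ hs[-1] + U @ x_t + b))

    dW, dU, db = np.zeros_like(W), np.zeros_like(U), np.zeros_like(b)
    dh = dL_dhT                                   # gradient flowing into h^(T)
    for t in range(len(x_seq) - 1, -1, -1):
        da = dh * (1.0 - hs[t + 1] ** 2)          # back through tanh
        dW += np.outer(da, hs[t])                 # shared parameters: accumulate
        dU += np.outer(da, x_seq[t])
        db += da
        dh = W.T @ da                             # pass the gradient back one step
    return dW, dU, db
```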
10.5 Deep Recurrent Networks
Three blocks, each of which can itself be made deep (see the sketch after this list):
- from the input to the hidden state,
- from the previous hidden state to the next hidden state, and
- from the hidden state to the output
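A rough sketch of a deep RNN step in which each of the three blocks is replaced by a small multi-layer transformation; the block compositions and the summation of the input and recurrent paths are one illustrative choice, not the book's only variant:

```python
import numpy as np

def mlp(x, layers):
    """A small feed-forward stack used for one block; layers is a list of (W, b)."""
    for W, b in layers:
        x = np.tanh(W @ x + b)
    return x

def deep_rnn_step(h_prev, x_t, input_block, recurrent_block, output_block):
    """Deep RNN step: input-to-hidden, hidden-to-hidden, and hidden-to-output
    are each deep transformations rather than single affine layers."""
    h = np.tanh(mlp(x_t, input_block) + mlp(h_prev, recurrent_block))
    y = mlp(h, output_block)
    return h, y
```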
10.6 Recursive Neural Networks
10.7 The Challenge of Long-Term Dependencies
Gradients propagated over many stages tend to either vanish (most of the time) or explode (rarely, but with severe damage to optimization).
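A quick numerical illustration of why: back-propagating through many steps multiplies the gradient repeatedly by (transposes of) the same recurrent Jacobian, so its norm shrinks or grows roughly like the spectral radius raised to the number of steps. The specific matrices and scales here are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
d, T = 10, 100
W = rng.normal(size=(d, d)) / np.sqrt(d)
for scale in (0.5, 1.5):                                # spectral radius < 1 vs. > 1
    Ws = scale * W / np.max(np.abs(np.linalg.eigvals(W)))   # set the spectral radius
    g = np.ones(d)                                      # stand-in for a back-propagated gradient
    for _ in range(T):
        g = Ws.T @ g                                    # repeated Jacobian-transpose products
    print(scale, np.linalg.norm(g))                     # near zero for 0.5, astronomically large for 1.5
```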
10.8 Echo State Networks
10.9 Leaky Units and Other Strategies for Multiple Time Scales
10.9.1 Adding Skip Connections through Time
10.9.2 Leaky Units and a Spectrum of Different Time Scales
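A leaky unit keeps a running average of its past state, \(h^{(t)} = \alpha\, h^{(t-1)} + (1-\alpha)\, v^{(t)}\): units with \(\alpha\) near 1 remember over long time scales, units with \(\alpha\) near 0 discard the past rapidly. A tiny sketch (the particular \(\alpha\) values and the impulse input are illustrative):

```python
import numpy as np

def leaky_update(h_prev, v_t, alpha):
    """Linear self-connection with weight alpha: a per-unit time scale."""
    return alpha * h_prev + (1.0 - alpha) * v_t

alpha = np.array([0.99, 0.9, 0.5])     # one time scale per unit
h = np.zeros(3)
for t in range(50):
    v = np.ones(3) if t == 0 else np.zeros(3)   # a single impulse at t = 0
    h = leaky_update(h, v, alpha)
print(h)   # the alpha=0.99 unit retains a far larger trace of the impulse than the others
```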
10.9.3 Removing Connections
10.10 The Long Short-Term Memory and Other Gated RNNs
10.10.1 LSTM
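A hedged NumPy sketch of one LSTM step, with forget, input, and output gates and a separate cell state (the weight names and the dict layout are illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, p):
    """One LSTM step: the cell state c carries information through a self-loop
    whose weight is controlled by the forget gate, avoiding repeated squashing."""
    f = sigmoid(p['Wf'] @ x_t + p['Uf'] @ h_prev + p['bf'])   # forget gate
    i = sigmoid(p['Wi'] @ x_t + p['Ui'] @ h_prev + p['bi'])   # input gate
    o = sigmoid(p['Wo'] @ x_t + p['Uo'] @ h_prev + p['bo'])   # output gate
    g = np.tanh(p['Wg'] @ x_t + p['Ug'] @ h_prev + p['bg'])   # candidate update
    c = f * c_prev + i * g                                    # gated cell state
    h = o * np.tanh(c)                                        # gated output
    return h, c
```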
10.10.2 Other Gated RNNs
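The main alternative is the gated recurrent unit (GRU), where a single update gate plays the roles of both the forget and input gates. A sketch under the same illustrative naming as the LSTM above:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(x_t, h_prev, p):
    """One GRU step: the update gate z interpolates between the old state and a
    candidate state computed after the reset gate r has gated the old state."""
    z = sigmoid(p['Wz'] @ x_t + p['Uz'] @ h_prev + p['bz'])   # update gate
    r = sigmoid(p['Wr'] @ x_t + p['Ur'] @ h_prev + p['br'])   # reset gate
    h_tilde = np.tanh(p['Wh'] @ x_t + p['Uh'] @ (r * h_prev) + p['bh'])
    return (1.0 - z) * h_prev + z * h_tilde
```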
10.11 Optimization for Long-Term Dependencies
10.11.1 Clipping Gradients
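One common form clips the norm of the whole gradient just before the parameter update, keeping the direction but bounding the step size. A minimal sketch (the threshold value is illustrative):

```python
import numpy as np

def clip_by_global_norm(grads, threshold=1.0):
    """If the overall gradient norm exceeds the threshold, rescale:
    g <- g * threshold / ||g||."""
    total_norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    if total_norm > threshold:
        grads = [g * (threshold / total_norm) for g in grads]
    return grads
```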
10.11.2 Regularizing to Encourage Information Flow
10.12 Explicit Memory
working memory
memory network
neural Turing machine https://arxiv.org/abs/1410.5401
It is difficult to optimize functions that produce exact, integer addresses. To alleviate this problem, NTMs actually read from or write to many memory cells simultaneously. To read, they take a weighted average of many cells. To write, they modify multiple cells by different amounts. The coefficients for these operations are chosen to be focused on a small number of cells.
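A minimal sketch of such soft (differentiable) addressing: a softmax weighting over all cells, derived here from content similarity as one illustrative choice; reading is a weighted average of cell vectors, and writing nudges every cell in proportion to its weight. Function names and the erase/add split are illustrative, loosely following the NTM paper:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def address(memory, key, sharpness=10.0):
    """Content-based weights: cosine similarity of each cell to a key,
    sharpened and normalized so the weights focus on a few cells."""
    sims = memory @ key / (np.linalg.norm(memory, axis=1) * np.linalg.norm(key) + 1e-8)
    return softmax(sharpness * sims)

def read(memory, w):
    """Soft read: a weighted average of all cell vectors."""
    return w @ memory

def write(memory, w, erase, add):
    """Soft write: each cell is modified by an amount proportional to its weight."""
    return memory * (1.0 - np.outer(w, erase)) + np.outer(w, add)

# memory with 8 cells, each holding a 4-dim vector
M = np.zeros((8, 4))
w = address(M + 1e-3, key=np.ones(4))      # nearly uniform weights over empty memory
M = write(M, w, erase=np.zeros(4), add=np.array([1.0, 2.0, 3.0, 4.0]))
print(read(M, w))
```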
These memory cells are typically augmented to contain a vector rather than a single scalar. There are two reasons:
- it amortizes the increased cost of accessing a memory cell, since the same addressing coefficients retrieve a whole vector of values, and
- it allows for content-based addressing, where the weight used to read a cell is a function of that cell's contents.