8 Optimization for Training Deep Models

This chapter focuses on one particular case of optimization: finding the parameters $\theta$ of a neural network that significantly reduce a cost function $J(\theta)$, which typically includes a performance measure evaluated on the entire training set as well as additional regularization terms.

8.1 How Learning Differs from Pure Optimization

8.1.1 Empirical Risk Minimization

8.1.2 Surrogate Loss Functions and Early Stopping

8.1.3 Batch and Minibatch Algorithms

standard error of the mean: $\sigma / \sqrt{n}$

large numbers of examples that all make very similar contributions to the gradient

Minibatch sizes are generally driven by the following factors:

Larger bathces provide a more accurate estimate of the gradient, but with less than linear return.
Multicore architectures are usually underutilized by extremely small batches.
Parallel processing
Specific sizes of arrays
Small batches can offer a regularizing effect.

Some algorithms are more sensitive to sampling error than others:

either because they use information that is difficult to estimate accurately with few sampels.
or because they use information in ways that amplify sampling errors more.

It is also crucial that the minibatches be selected randomly.

shuffle the order of the dataset once and then store itin shuffled fashion.

8.2 Challenges in Neural Network Optimization

8.2.1 Ill-Conditioning

8.2.2 Local Minima

model identifiability

weight space symmetry

8.2.3 Plateaus, Saddle Points and Other Flat Regions

8.2.4 Cliffs and Exploding Gradients

8.2.5 Long-Term Dependencies

8.2.6 Inexact Gradients

8.2.7 Poor Correspondence between Local and Global Structure

8.2.8 Theoretical Limits of Optimization

8.3 Basic Algorithms

8.3.1 Stochastic Gradient Descent

8.3.2 Momentum

8.3.3 Nesterov Momentum

8.4 Parameter Initialization Strategies

Unfortunately, these optimal criteria for initial weights often do not lead to optimal performance. Three different reasons:

wrong criteria
the properties may not persist after learning
the criteria might succeed at improving the speed of optimization but inadvertently increase generalization error.

There are a few situations where we may set some biases to nonzero values:

if a bias is for an output unit.
choose the bias to avoid causing too much saturation at initialization.
a control unit.

8.5 Algorithms with Adaptive Learning Rates

delta-bar-delta

8.5.1 AdaGrad

Algorithm 8.4 The AdaGrad algorithm

8.5.2 RMSProp

Algorithm 8.5 The RMSProp algorithm

Algorithm 8.6 RMSProp algorithm with Nesterov momentum

8.5.3 Adam ???

8.5.4 Choosing the Right Optimization Algorithm

hahaha: The choice of which algorithm to use, at this point, seems to depend largely on the user’s familiarity with the algorithm(for ease of hyperparameter tuning).

8.6 Approximate Second-Order Methods

8.6.1 Newton’s Method

8.6.2 Conjugate Gradients

Conjugate gradients is a method to efficiently avoid the calculation of the inverse Hessian by iteratively descending conjugate directions.

8.6.3 BFGS ???

8.7 Optimization Strategies and Meta-Algorithms

8.7.1 Batch Normalization

8.7.2 Coordinate Descent

8.7.3 Polyak Averaging

When applying Polyak averaging to non-convex problems, it is typical to use an exponentially decaying running average:

\[\hat{\theta}^{(t)} = \alpha \hat{\theta}^{(t-1)} + (1 - \alpha) \theta^{(t)}.\]

8.7.4 Supervised Pretraining

greedy supervised pretraining

transfer learning

teacher student network

8.7.5 Designing Models to Aid Optimization

In practice, it is more important to choose a model family that is easy to optimize than to use a powerful optimization algorithm.

ReLU

linear paths or skip connections

8.7.6 Continuation Methods and Curriculum Learning

Continuation methods

Traditional continuation methods are usually based on smoothing the objective function.

Curriculum learning