Deep Learning: Chapter 5
5 Machine Learning Basics
5.1 Learning Algorithms
5.1.1 The Task, $T$
- Classification
- Classification with missing inputs
- Regression
- Transcription
- Machine translation
- Structured output
- Anomaly detection
- Synthesis and sampling
- Imputation of missing values
- Denoising
- Density estimation or probability mass function estimation
5.1.2 The Performance Measure, $P$
5.1.3 The Experience, $E$
5.1.4 Example: Linear Regression
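A minimal sketch of this example, assuming synthetic data: the weights are obtained by solving the normal equations that minimize $\text{MSE}_{\text{train}}$. The data and variable names are illustrative, not taken from the text.

```python
# Linear regression fit via the normal equations (sketch with assumed synthetic data).
import numpy as np

rng = np.random.default_rng(0)
m, n = 100, 3                      # m training examples, n input features
X_train = rng.normal(size=(m, n))  # design matrix
true_w = np.array([2.0, -1.0, 0.5])
y_train = X_train @ true_w + 0.1 * rng.normal(size=m)  # noisy targets

# Minimize MSE_train by solving the normal equations:
#   w = (X^T X)^{-1} X^T y
w = np.linalg.solve(X_train.T @ X_train, X_train.T @ y_train)

y_hat = X_train @ w                          # predictions on the training set
mse_train = np.mean((y_hat - y_train) ** 2)  # mean squared error
print(w, mse_train)
```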
5.2 Capacity, Overfitting and Underfitting
The error incurred by an oracle making predictions from the true distribution $p(x, y)$ is called the Bayes error.
5.2.1 The No Free Lunch Theorem
5.2.2 Regularization
Regularization is any modification we make to a learning algorithm that is intended to reduce its generalization error but not its training error.
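The chapter's running example of regularization is weight decay, which adds a penalty $\lambda w^\top w$ to the training objective. A minimal sketch under assumed synthetic data (the coefficient `lam` and the exact scaling of the penalty are illustrative choices):

```python
# Weight decay (L2 regularization) for linear regression, on assumed synthetic data.
import numpy as np

rng = np.random.default_rng(0)
m, n = 20, 10
X = rng.normal(size=(m, n))
y = X @ rng.normal(size=n) + 0.1 * rng.normal(size=m)

lam = 0.5  # weight decay coefficient λ (a hyperparameter)

# Minimizing J(w) = MSE_train + λ wᵀw gives the closed form
#   w = (XᵀX + mλI)⁻¹ Xᵀy    (the mλ factor follows from the 1/m in MSE_train)
w_reg = np.linalg.solve(X.T @ X + m * lam * np.eye(n), X.T @ y)
w_unreg = np.linalg.solve(X.T @ X, X.T @ y)

# The regularized weights have smaller norm, expressing a preference for simpler solutions.
print(np.linalg.norm(w_reg), np.linalg.norm(w_unreg))
```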
5.3 Hyperparameters and Validation Sets
5.3.1 Cross-Validation
5.4 Estimators, Bias and Variance
5.4.1 Point Estimation
5.4.2 Bias
5.4.3 Variance and Standard Error
5.4.4 Trading Off Bias and Variance to Minimize Mean Squared Error
5.4.5 Consistency
5.5 Maximum Likelihood Estimation
5.5.1 Conditional Log-Likelihood and Mean Squared Error
Example: Linear Regression as Maximum Likelihood
We now revisit linear regression from the point of view of maximum likelihood estimation.
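As a short derivation of the connection: if we take $p(y \mid x) = \mathcal{N}(y; \hat{y}(x; w), \sigma^2)$ with a fixed $\sigma^2$, the conditional log-likelihood of the training set is

$$\sum_{i=1}^{m} \log p\big(y^{(i)} \mid x^{(i)}; w\big) = -m \log \sigma - \frac{m}{2}\log(2\pi) - \sum_{i=1}^{m} \frac{\big(\hat{y}^{(i)} - y^{(i)}\big)^2}{2\sigma^2},$$

so maximizing it with respect to $w$ is equivalent to minimizing $\text{MSE}_{\text{train}}$.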
5.5.2 Properties of Maximum Likelihood
Statistical efficiency
5.6 Bayesian Statistics
Example: Bayesian Linear Regression
5.6.1 Maximum A Posteriori (MAP) Estimation
5.7 Supervised Learning Algorithms
5.7.1 Probabilistic Supervised Learning
logistic regression
5.7.2 Support Vector Machines
kernel trick
radial basis function
5.7.3 Other Simple Supervised Learning Algorithms
k-nearest neighbors
decision tree
5.8 Unsupervised Learning Algorithms
There are multiple ways of defining a simpler representation. Three of the most common include lower-dimensional representations, sparse representations, and independent representations.
5.8.1 Principal Components Analysis
5.8.2 k-means Clustering
5.9 Stochastic Gradient Descent
The insight of SGD is that the gradient is an expectation. The expectation may be approximately estimated using a small set of samples. Specifically, on each step of the algorithm, we can sample a minibatch of examples $B = \{ x^{(1)}, \dots, x^{(m')} \}$ drawn uniformly from the training set. The minibatch size $m'$ is typically chosen to be a relatively small number of examples, ranging from one to a few hundred.
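A minimal sketch of this minibatch gradient estimate, assuming a linear model with squared loss; the dataset, learning rate `eps`, and minibatch size `m_prime` are illustrative choices, not taken from the text.

```python
# Minibatch SGD sketch for a linear model with squared loss (assumed synthetic data).
import numpy as np

rng = np.random.default_rng(0)
m, n = 10_000, 5
X = rng.normal(size=(m, n))
y = X @ rng.normal(size=n) + 0.1 * rng.normal(size=m)

w = np.zeros(n)
eps = 0.01        # learning rate
m_prime = 100     # minibatch size m'

for step in range(1_000):
    # Sample a minibatch uniformly from the training set.
    idx = rng.integers(0, m, size=m_prime)
    X_b, y_b = X[idx], y[idx]

    # Minibatch estimate of the gradient of the expected loss:
    #   g = (1/m') * sum_i ∇_w L(x_i, y_i, w)
    g = (2.0 / m_prime) * X_b.T @ (X_b @ w - y_b)

    # Follow the estimated gradient downhill.
    w -= eps * g
```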
5.10 Building a Machine Learning Algorithm
A fairly simple recipe: combine a specification of a dataset, a cost function, an optimization procedure and a model.
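A minimal sketch of the recipe instantiated for linear regression; the tiny dataset and all function names are illustrative assumptions.

```python
# The four ingredients of the recipe, made explicit for linear regression (sketch).
import numpy as np

# 1. Dataset
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([0.1, 0.9, 2.1, 2.9])

# 2. Model: linear predictions ŷ = Xw
def model(w, X):
    return X @ w

# 3. Cost function: mean squared error (equivalently, Gaussian negative log-likelihood)
def cost(w, X, y):
    return np.mean((model(w, X) - y) ** 2)

# 4. Optimization procedure: plain gradient descent on the cost
def optimize(w, X, y, eps=0.05, steps=500):
    for _ in range(steps):
        grad = 2.0 / len(y) * X.T @ (model(w, X) - y)
        w = w - eps * grad
    return w

w = optimize(np.zeros(X.shape[1]), X, y)
print(w, cost(w, X, y))
```

Swapping any single ingredient, for example replacing gradient descent with the normal equations or the squared loss with another negative log-likelihood, changes the algorithm without changing the recipe.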
5.11 Challenges Motivating Deep Learning
5.11.1 The Curse of Dimensionality
5.11.2 Local Constancy and Smoothness Regularization
5.11.3 Manifold Learning
The first observation in favor of the manifold hypothesis is that the probability distribution over images, text strings and sounds that occur in real life is highly concentrated.
The second argument in favor of the manifold hypothesis is that we can also imagine such neighborhoods and transformations, at least informally.