Deep Learning: Chapter 5
5 Machine Learning Basics
5.1 Learning Algorithms
5.1.1 The Task, $T$
- Classification
- Classification with missing inputs
- Regression
- Transcription
- Machine translation
- Structured output
- Anomaly detection
- Synthesis and sampling
- Imputation of missing values
- Denoising
- Density estimation or probability mass function estimation
5.1.2 The Performance Measure, $P$
5.1.3 The Experience, $E$
5.1.4 Example: Linear Regression
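A minimal sketch of this example, assuming synthetic data: the weights are obtained by solving the normal equations that minimize $\text{MSE}_{\text{train}}$. The data and variable names are illustrative, not taken from the text.

```python
# Linear regression fit via the normal equations (sketch with assumed synthetic data).
import numpy as np

rng = np.random.default_rng(0)
m, n = 100, 3                      # m training examples, n input features
X_train = rng.normal(size=(m, n))  # design matrix
true_w = np.array([2.0, -1.0, 0.5])
y_train = X_train @ true_w + 0.1 * rng.normal(size=m)  # noisy targets

# Minimize MSE_train by solving the normal equations:
#   w = (X^T X)^{-1} X^T y
w = np.linalg.solve(X_train.T @ X_train, X_train.T @ y_train)

y_hat = X_train @ w                          # predictions on the training set
mse_train = np.mean((y_hat - y_train) ** 2)  # mean squared error
print(w, mse_train)
```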
5.2 Capacity, Overfitting and Underfitting
The error incurred by an oracle making predictions from the true distribution $p(x, y)$ is called the Bayes error.
5.2.1 The No Free Lunch Theorem
5.2.2 Regularization
Regularization is any modification we make to a learning algorithm that is intended to reduce its generalization error but not its training error.
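The chapter's running example of regularization is weight decay, which adds a penalty $\lambda w^\top w$ to the training objective. A minimal sketch under assumed synthetic data (the coefficient `lam` and the exact scaling of the penalty are illustrative choices):

```python
# Weight decay (L2 regularization) for linear regression, on assumed synthetic data.
import numpy as np

rng = np.random.default_rng(0)
m, n = 20, 10
X = rng.normal(size=(m, n))
y = X @ rng.normal(size=n) + 0.1 * rng.normal(size=m)

lam = 0.5  # weight decay coefficient λ (a hyperparameter)

# Minimizing J(w) = MSE_train + λ wᵀw gives the closed form
#   w = (XᵀX + mλI)⁻¹ Xᵀy    (the mλ factor follows from the 1/m in MSE_train)
w_reg = np.linalg.solve(X.T @ X + m * lam * np.eye(n), X.T @ y)
w_unreg = np.linalg.solve(X.T @ X, X.T @ y)

# The regularized weights have smaller norm, expressing a preference for simpler solutions.
print(np.linalg.norm(w_reg), np.linalg.norm(w_unreg))
```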
5.3 Hyperparameters and Validation Sets
5.3.1 Cross-Validation
5.4 Estimators, Bias and Variance
5.4.1 Point Estimation
5.4.2 Bias
5.4.3 Variance and Standard Error
5.4.4 Trading Off Bias and Variance to Minimize Mean Squared Error
5.4.5 Consistency
5.5 Maximum Likelihood Estimation
5.5.1 Conditional Log-Likelihood and Mean Squared Error
Example: Linear Regression as Maximum Likelihood
We now revisit linear regression from the point of view of maximum likelihood estimation.
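As a short derivation of the connection: if we take $p(y \mid x) = \mathcal{N}(y; \hat{y}(x; w), \sigma^2)$ with a fixed $\sigma^2$, the conditional log-likelihood of the training set is

$$\sum_{i=1}^{m} \log p\big(y^{(i)} \mid x^{(i)}; w\big) = -m \log \sigma - \frac{m}{2}\log(2\pi) - \sum_{i=1}^{m} \frac{\big(\hat{y}^{(i)} - y^{(i)}\big)^2}{2\sigma^2},$$

so maximizing it with respect to $w$ is equivalent to minimizing $\text{MSE}_{\text{train}}$.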
5.5.2 Properties of Maximum Likelihood
Statistical efficiency
5.6 Bayesian Statistics
Example: Bayesian Linear Regression
5.6.1 Maximum A Posteriori (MAP) Estimation
5.7 Supervised Learning Algorithms
5.7.1 Probabilistic Supervised Learning
logistic regression
5.7.2 Support Vector Machines
kernel trick
radial basis function
5.7.3 Other Simple Supervised Learning Algorithms
k-nearest neighbors
decision tree
5.8 Unsupervised Learning Algorithms
There are multiple ways of defining a simpler representation. Three of the most common include lower-dimensional representations, sparse representations, and independent representations.
5.8.1 Principal Components Analysis
5.8.2 k-means Clustering
5.9 Stochastic Gradient Descent
The insight of SGD is that the gradient is an expectation. The expectation may be approximately estimated using a small set of samples. Specifically, on each step of the algorithm, we can sample a minibatch of examples $B = \{ x^{(1)}, \dots, x^{(m')} \}$ drawn uniformly from the training set. The minibatch size $m'$ is typically chosen to be a relatively small number of examples, ranging from one to a few hundred.
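A minimal sketch of this minibatch gradient estimate, assuming a linear model with squared loss; the dataset, learning rate `eps`, and minibatch size `m_prime` are illustrative choices, not taken from the text.

```python
# Minibatch SGD sketch for a linear model with squared loss (assumed synthetic data).
import numpy as np

rng = np.random.default_rng(0)
m, n = 10_000, 5
X = rng.normal(size=(m, n))
y = X @ rng.normal(size=n) + 0.1 * rng.normal(size=m)

w = np.zeros(n)
eps = 0.01        # learning rate
m_prime = 100     # minibatch size m'

for step in range(1_000):
    # Sample a minibatch uniformly from the training set.
    idx = rng.integers(0, m, size=m_prime)
    X_b, y_b = X[idx], y[idx]

    # Minibatch estimate of the gradient of the expected loss:
    #   g = (1/m') * sum_i ∇_w L(x_i, y_i, w)
    g = (2.0 / m_prime) * X_b.T @ (X_b @ w - y_b)

    # Follow the estimated gradient downhill.
    w -= eps * g
```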
5.10 Building a Machine Learning Algorithm
A fairly simple recipe: combine a specification of a dataset, a cost function, an optimization procedure and a model.
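A minimal sketch of the recipe instantiated for linear regression; the tiny dataset and all function names are illustrative assumptions.

```python
# The four ingredients of the recipe, made explicit for linear regression (sketch).
import numpy as np

# 1. Dataset
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([0.1, 0.9, 2.1, 2.9])

# 2. Model: linear predictions ŷ = Xw
def model(w, X):
    return X @ w

# 3. Cost function: mean squared error (equivalently, Gaussian negative log-likelihood)
def cost(w, X, y):
    return np.mean((model(w, X) - y) ** 2)

# 4. Optimization procedure: plain gradient descent on the cost
def optimize(w, X, y, eps=0.05, steps=500):
    for _ in range(steps):
        grad = 2.0 / len(y) * X.T @ (model(w, X) - y)
        w = w - eps * grad
    return w

w = optimize(np.zeros(X.shape[1]), X, y)
print(w, cost(w, X, y))
```

Swapping any single ingredient, for example replacing gradient descent with the normal equations or the squared loss with another negative log-likelihood, changes the algorithm without changing the recipe.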
5.11 Challenges Motivating Deep Learning
5.11.1 The Curse of Dimensionality
5.11.2 Local Constancy and Smoothness Regularization
5.11.3 Manifold Learning
The first observation in favor of the manifold hypothesis is that the probability distribution over images, text strings and sounds that occur in real life is highly concentrated.
The second argument in favor of the manifold hypothesis is that we can also imagine such neighborhoods and transformations, at least informally.