This post contains the notes taken from reading of the following paper:

This paper does not introduce any novelties in the field of Machine Learning, nor
some kinds of benchmarks, but rather offers a overview of the *black art* of
Machine Learning. Domingos covers a wide area of Machine Learning, but each
parts are not explored in depth.

# The Right Algorithm

Domingos splits the problem of choosing the right algorithm in three sub-problems:

- Finding the good
**representation**(*hyperplanes, rules, decision trees, etc.*) - The
**objective function**to optimize (*accuracy, likelihood, cross-entropy, etc.*) - The
**optimization method**(*quadratric, beam search, gradient descent, etc.*)

The optimal combinations should be taken according to hyperparameters: The accuracy, the training time, the problem type, etc.

# Evaluating The Algorithm

Domingos notes that while a high accuracy may seem *good*, it is not a sufficient
indicator. A high score of accuracy on the *train data* may simply mean that the
algorithm has an *overfit* problem, and thus generalize badly on new unseen data.

A common pitfall would be to train the algorithm on the train data and tweak
the hyperparameters in order to maximize our score on the *test data*. This may
lead to an overfit on also the test data!

The generalization problems (*how can I estimate my generalization?* and *how
can I improve my generalization*) are detailed in the further sections.

# The Bias-Variance Trade-off

When building a model it is interesting to decompose the generalization error
into two components: the *bias* and the *variance*.

Biasis a learner’s tendency to consistently learn the same wrong thing.

Varianceis the tendency to learn random things irrespective of the real signal.

This trade-off explains why a powerful learner may not be better than a weak learner. If my powerful learner has a low bias, he is performing well on the train data. However if my powerful learner has also a high variance, it may have learned noise from the train data that would be irrelevant for the test data and behave randomly.

# Reducing The Variance

There are several ways to reduce the variance:

### Train, Validation, and Test

Before training your model, the data should be split in three parts:

**Train**: On which the model will learn.**Validation**: On which we will optimize model’s performance by tweaking the parameters.**Test**: To test the model, only at the end.

In a certain way, we are overfitting on *validation* by tweaking the parameters
according to the *validation*’s performance. In order to mitigate this we can
use the cross-validation:

### Cross-Validation

We are still training the model on *train*, and tweaking the parameters in order
to optimize *validation*.

However instead of evaluating a fixed validation set, we are evaluating the average performance of the different folds:

Note that if there is too many parameters choices, the cross-validation may not be able to avoid overfitting.

### Regularization

Another way to way to avoid overfitting is to add *regularization*. It will force
the model to be simpler.

Let’s say the model has a set of weights $W$, an evaluation function $f(X)$ (that depends of the weights), and a loss function $L(X, Y)$.

Without regularization the model will try to optimize:

$$L(X, f(X))$$

With a regularization $R(W)$:

$$L(X, f(X)) + \lambda R(W)$$

The regularization is multiplied by a factor $\lambda$ that is determined empirically, with cross-validation for example.

There are several regularizations possible. The two most common are **L1**
(also known as *LASSO*), and **L2** (also known as *Ridge*):

L1 is the absolute norm:

$$\Vert W \Vert_1 = \Sigma_{i=1}^n |w_i|$$

While L2 is:

$$\Vert W \Vert_2 = \Sigma_{i=1}^{n} w_i^2$$

# The Curse Of Dimensionality

In addition of overfitting, a model can also fail to learn high-dimensional data.

For example, let’s imagine that we want to use a decision tree to learn data which features are binary discrete values. If there are 10 features, it would mean that there is a thousand possible samples. If there are 100 features (which is common), there are a thousand billion of billion of billion possible samples. It is unlearnable, either because the model will never generalize correctly, or the model will take a non-practical amount of time to learn.

Thankfully, the data’s features are often not completely independent and many
features are just noise. The *blessing of non-uniformity* as Domingos calls,
implies the samples are often spread on a lower-dimensional manifold.

To reduce the dimension, i.e. choosing the right features, many algorithms exist: PCA, NMF, LDA, etc.

The reduction of dimensionality is an often necessary step before feeding the model with the data.

# Feature Engineering Is The Key

Feature Engineering is the action of transforming raw data into something that
is more learnable by the model. It is dependant on the data’s type, and here
lies most of the *black art* of Machine Learning.

Two examples:

For text data, several processing are necessary:
- **tokenization** to split the words of the sentence.
- **lemmatization** to get the lemma (*loved, loving, lover -> love*)
- **POS-Tagging** to get the grammar label of a token (*be -> verb, car -> noun*)

For image data, in the case of object detection we can extract interesting features with the HOG algorithm and feed these features to a SVM to improve significantly the performances.

While feature engineering is major part of Machine Learning, it is less important in Deep Learning: with Convolutional Neural Network (CNN) the model is learning by itself the convolution kernels extracting the interesting features.

# Model Ensembles

In order to achieve the best performance we want to decrease both bias and variance.
It is often complicated to optimize this trade-off. A great way to achieve this
is to combine different models, kind of like a *wisdom of the crowd*.

There are three main categories of ensembles:

### Bagging

Used in the **Random Forest**, bagging generates plenty of model. Each has a low
bias but a high variance. A voting system is set up between them to choose the
output, thus lowering the individual variances.

### Boosting

Used in **Adaboost** or in **Gradient Boosting**, boosting generates at first
a simple weak learner: It should just be a bit better than a random guess. At each
iteration of the training, a new weak learner is added to the global learner. The
new weak learner focuses on the previously poorly predicted data.

At each iteration the bias is reduced as the overall model improves. There is a
diminished risk of overfitting with boosting: Because each iteration’s learner
focuses on poorly predicted data, the risk of *over-learning* data is small.

### Stacking

The stacking ensemble is the easiest to understand: Each model is connected to another: The output of one is the input of another.

# Data, Data, And Data

While Domingos offers us great insights into Machine Learning, and various methods to improve our models, he notes one constant:

More data beats a cleverer algorithm

It is often more advisable to focus the efforts on getting as much data as possible, and begin with a simple model, than to expect a complex model to generalize from few data.

### Available Data

There are plenty of resources available: