What is early stopping in machine learning?
Ask a data scientist or machine learning (ML) practitioner what the biggest obstacles in predictive modeling are, and they will most likely mention model overfitting and underfitting, along with the regularization techniques to overcome them.
Overfitting and underfitting are central problems in supervised learning, and this article is dedicated to addressing them. Specifically, we will discuss:
- The bias-variance tradeoff in ML
- ML model overfitting and underfitting
- Regularization techniques to combat overfitting
- Early Stopping in neural networks
- Implementing Early Stopping in Keras with Layer and exploring its effect as an overfitting regularizer
- Tips for Early Stopping and final remarks
Before getting deep into the weeds of overfitting or underfitting, let’s recap the prediction errors of machine learning algorithms and the bias-variance tradeoff. There are various algorithms for data scientists to choose from for supervised predictive models, such as regressions, support vector machines (SVM), decision trees, and neural networks.
While distinct in their underlying mechanisms, these algorithms all face the same tension between two sources of prediction error, bias and variance, a.k.a. the Bias-Variance tradeoff.
There are three components of ML prediction errors:
- Bias: represents the difference between model predicted values and the true observed values. This is an error from oversimplified assumptions in the learning algorithm. As such, models with high bias are not able to capture the relations between input data and the outcome variable;
- Variance: represents the variability in model predictions. This is an error from overly complicated assumptions. Models with high variance are sensitive to small fluctuations in a specific training set, and thus, cannot be generalized to other unseen datasets with equal performance;
- Irreducible (error): refers to errors that can’t be reduced no matter which training algorithm we use. Causes of this type of error include an inaccurate definition of the business problem or unknown confounding variables omitted from the model.
Below is the decomposition of ML prediction errors for a squared-error loss such as MSE, where the bias term enters as a square:

Total Error = Bias² + Variance + Irreducible Error
It should be noted that the irreducible error measures the amount of noise in the data, and it cannot be eliminated because there is no “perfect” model in the real world. Therefore, bias and variance are the errors that ML practitioners strive to minimize.
This brings to mind the spirit of the No Free Lunch Theorem: nothing can be gained without cost. Bias and variance sit at opposite ends of one spectrum, where reducing one will inevitably increase the other.
Since it’s impossible to decrease both bias and variance simultaneously, we need to strike a balance between them: the Bias-Variance tradeoff, marked by the vertical dotted line in the chart above. Thus, rather than choosing bias over variance or vice versa, why don’t we find a middle region of model complexity that minimizes the total error? This middle point ensures that our model performs relatively well (i.e., low bias) and is flexible enough to be applied to new or out-of-sample datasets (i.e., low variance).
Now, with an understanding of the bias-variance tradeoff in ML, let’s discuss the aforementioned overfitting and underfitting issues.
- Underfitting: when the model bias is high yet variance is low, which essentially indicates that our model is “universally inaccurate” because it is unable to capture the underlying structure of the data. Model performance on both the training and testing sets is low. Simple algorithms, e.g., linear regression, logistic regression, and naive Bayes, tend to underfit.
- Overfitting: Overfitting happens when the model bias is low, whereas variance is high. What it implies is that our model is “only accurate” with respect to the current training set because it memorizes the noise but not the signal of the data. In other words, our model appears to perform well on the training set, yet badly on the testing set. Examples of algorithms that tend to overfit are (generalized) linear regression with higher-order interactions, SVM, tree models, and nearest neighbors; oftentimes non-linear or non-parametric algorithms.
If you are a visual person, this chart depicts underfitting vs. overfitting,
As stated before, in data science/ML practice, one wants to build a model that strikes an optimal balance, neither underfitting nor overfitting. Now, you must be wondering: if we were to compare a model that overfits and another that underfits, which one is more detrimental? The short answer is that it depends on your specific application. Nonetheless, overfitting is arguably worse than underfitting from a practical point of view:
- a model that overfits is considered “wrong” because it is memorizing noise in the training set. Hence, it will reproduce the noise when applied to the out-of-sample data, causing decreased model performance. What’s even worse is that this degradation of prediction performance has no upper limit. Hence, under most circumstances, a model that overfits is useless, even though there are scenarios where overfitting is acceptable, and we will discuss more in the next section.
- a model that underfits, on the other hand, is “not good enough” because, although it fits signal that actually exists in the data, the fit is only a coarse approximation without high accuracy. As such, there is a limit to the performance degradation, such as predicting the mean value in regression. Underfitting can be useful because it gives your models credibility, not to mention that these models are generally more interpretable.
Therefore, in this article, we will be focusing on techniques to prevent our models from overfitting.
Before we dive into the specific techniques, let’s address the argument I alluded to, i.e., not all model overfitting should be rejected; sometimes, we want to overfit our models! One scenario is the ensemble model, where it’s desirable to slightly overfit the base learners to generate strong meta-features to be included in the final meta-learner.
Another scenario pertains to embeddings to reduce the high cardinality of categorical variables. High-dimensional categorical data, e.g., zip code, cannot be encoded using one-hot or label encoding techniques. Instead, an effective method is entity embedding, where real-valued numeric vectors represent each category. These vectors come from the weights of a neural network model predicting the outcome variable with the categorical variables. Overfitting this type of neural net is considered acceptable.
Since this article is about overfitting regularization rather than “overfitting appreciation”, these examples are just some food for thought.
Now, you may think, to combat overfitting (or underfitting), all one needs to do is to find that “optimal” point on the model complexity scale. True, but it’s easier said than done. In the next part of this section, we will look at several common methods for model regularization. It should be noted that this, by no means, is an exhaustive list, but hopefully, it can set the stage for our use case implementation later.
We can modify the loss function to prevent model overfitting. The core idea is to add an extra penalty term to the loss function that penalizes overly complicated models, i.e., cases where the magnitude of the model coefficients gets too large. L1 and L2 regularization fall into this category: a regression model with L1 regularization is called Lasso regression, and one with L2 regularization is called Ridge regression.
To apply L1 regularization, we add a term of the absolute value of the magnitude of coefficients to the loss function, whereas, for L2 regularization, we add the squared value of the magnitude of coefficients. L1 regularization shrinks the variable coefficients to zero; hence, it can be used for feature selection. More detail about the underlying math is linked here if you are interested.
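To make the contrast concrete, here is a minimal scikit-learn sketch (the synthetic data and `alpha` values are illustrative, not from the original project) showing that L1 zeroes out irrelevant coefficients while L2 only shrinks them:

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
# synthetic data: only the first 2 of 10 features carry signal
X = rng.normal(size=(200, 10))
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.1, size=200)

lasso = Lasso(alpha=0.1).fit(X, y)  # L1: penalty on the sum of |coefficients|
ridge = Ridge(alpha=1.0).fit(X, y)  # L2: penalty on the sum of squared coefficients

# L1 drives irrelevant coefficients exactly to zero (feature selection),
# while L2 only shrinks them toward zero
print((lasso.coef_ == 0).sum())
print((ridge.coef_ == 0).sum())
```

In this toy run, Lasso zeroes out most of the eight noise features while keeping the two informative ones, which is exactly the feature-selection behavior described above.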
When it comes to tree-based models, such as random forest or gradient boosting, the commonly used technique to overcome overfitting is hyper-parameter tuning. These hyper-parameters include (but are not limited to) the maximum tree depth (max_depth), the minimum number of samples to split a node (min_samples_split), and the maximum number of features considered for splitting (max_features).
Take max_depth as an example: it controls how deep our tree is allowed to grow. The larger the value, the deeper the tree, which means more splits to capture more information in the data. When this value is set too large, our tree will fit the training set perfectly yet fail to generalize to unseen data, i.e., overfitting. For an exhaustive list of such hyper-parameters and their effects, check out the articles here and here.
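As a rough sketch of this kind of tuning (the dataset and grid values here are made up for illustration), we can search over `max_depth` and `min_samples_split` with scikit-learn’s `GridSearchCV`:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=300, n_features=10, random_state=0)

# small grid over the depth-related hyper-parameters discussed above
param_grid = {"max_depth": [2, 4, 8, None], "min_samples_split": [2, 10]}
search = GridSearchCV(
    RandomForestClassifier(n_estimators=50, random_state=0),
    param_grid,
    cv=3,                 # 3-fold cross-validation on the training data
    scoring="accuracy",
)
search.fit(X, y)
print(search.best_params_)  # the combination that generalizes best across folds
```

Because the grid is scored on held-out folds rather than the training data, a `max_depth` that merely memorizes the training set will not win the search.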
A neural network model usually consists of multiple layers, such as the input layer, hidden layer, convolutional layer, etc., and within each layer, there are many neurons. The weights associated with these neurons can get large as the training goes on, leading to an overfitted model.
Batch normalization refers to a layer that brings all the input features to the same scale. Why is this useful? When our features are measured on significantly different ranges, e.g., 0.1-0.2 vs. 100-200, the weights associated with these features also vary drastically. This uneven weight distribution is troublesome because it causes our learning algorithm to spend a long time oscillating around a plateau before reaching the global minimum.
All the input/activation values are brought to the same scale with batch normalization, improving the training speed. This efficiency, in turn, could prevent the model from overfitting.
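Here is a minimal numpy sketch of what a batch-normalization layer computes at training time (`gamma` and `beta` stand in for the learnable scale and shift; this omits the running statistics a real layer keeps for inference):

```python
import numpy as np

def batch_norm(x, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize each feature of a mini-batch to zero mean / unit variance,
    then apply the learnable scale (gamma) and shift (beta)."""
    mean = x.mean(axis=0)               # per-feature mean over the batch
    var = x.var(axis=0)                 # per-feature variance over the batch
    x_hat = (x - mean) / np.sqrt(var + eps)
    return gamma * x_hat + beta

# features on wildly different scales, as in the 0.1-0.2 vs. 100-200 example
batch = np.array([[0.1, 100.0], [0.15, 150.0], [0.2, 200.0]])
out = batch_norm(batch)
print(out.mean(axis=0))  # ~0 for both features
print(out.std(axis=0))   # ~1 for both features
```

After the transform, both features live on the same scale, so neither dominates the gradient updates.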
Dropout is another technique to prevent overfitting by reducing the complexity of neural networks. In a nutshell, dropout happens when we randomly switch off (hidden or visible) neurons during the training process, resulting in a ‘thinned’ network.
As depicted above, the left shows a standard multi-layer neural network, whereas on the right, it is the same network but with several neurons dropped. The dropout rate can be treated as a hyper-parameter to tune; generally speaking, dropout rate = 0.5 yields maximum regularization, and thus, is preferred for large networks.
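A minimal numpy sketch of dropout, assuming the common “inverted dropout” formulation where surviving activations are rescaled by 1/(1 - rate) so the expected activation is unchanged:

```python
import numpy as np

def dropout(activations, rate=0.5, rng=None):
    """Inverted dropout: randomly switch off a fraction `rate` of the
    activations, then rescale the survivors by 1/(1 - rate) so the
    expected activation is unchanged."""
    rng = rng or np.random.default_rng(0)
    mask = rng.random(activations.shape) >= rate   # True = neuron kept
    return activations * mask / (1.0 - rate)

a = np.ones((1000, 100))
thinned = dropout(a, rate=0.5)
print((thinned == 0).mean())  # roughly 0.5 of the units are dropped
print(thinned.mean())         # roughly 1.0: expectation preserved
```

Each training step samples a different mask, so the network cannot rely on any single neuron, which is what reduces its effective complexity.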
When in doubt, get more samples! Big sample sizes (from the same population) are generally considered to be better in statistical inference as well as in ML. Augmenting a training set with more (good quality) data points could introduce small noise for the algorithm to learn from. Thus, it makes the model robust and improves generalization.
Another way of manipulating input data is (K-fold) Cross-validation, a technique to create a potentially unbiased representation of the true distribution by splitting the initial training set into multiple mini sets.
Now, we know that the root cause of overfitting lies in over-complicated models. Hence, intuitively, another approach to solve this challenge is to stop the iterative training process, like gradient descent, before the algorithm starts to over-fit! This is exactly what Early Stopping does. In the following sections, we will discuss this strategy and then implement a neural net with and without early stopping to show how the model performance would change.
As the name suggests, Early Stopping is a technique to halt training to prevent the model from learning the noise in the data. Compared to other hyper-parameters in neural nets, such as the number of neurons, learning rate, dropout rate, and so on, the early stopping strategy can be viewed as (partially) parameter-free. Depending on the criterion we choose, the number of epochs may not need to be decided beforehand; rather, it will be determined dynamically during training. We will get into this later.
Then how does early stopping combat overfitting? Let’s illustrate with the plot below.
When training a neural network model, we specify the number of iterations (i.e., epochs) for the algorithm to run. In the beginning, the error/loss on both the training and validation sets decreases. After a certain iteration, however, the network only appears to get better and better, as shown by the still-decreasing error on the TRAINING data, while the error on the VALIDATION set starts to increase, i.e., a sign of overfitting! If we discontinue our training at that point, we achieve model regularization. Simple and easy!
For early stopping to work, we need to define a criterion when stopping is triggered. There are a number of plausible criteria, among which the most commonly used are the following; we stop when,
- the pre-specified number of epochs is reached;
- the validation error/loss exceeds a certain threshold;
- the (absolute) change in validation error/loss is below a threshold over a certain number of epochs.
Upon reaching any of these criteria, must model training be halted immediately? Not necessarily. Due to the stochastic nature of deep learning/ML algorithms, a best practice is to wait for a few more epochs in case our optimizer is merely stuck in a local minimum.
For our demo in this article, we will be using the third criterion, and later you will see how this “wait-around” idea is implemented via the `patience` argument in Keras.
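To make the criterion plus the wait-around idea concrete, here is a small pure-Python sketch of patience-based stopping (the names and thresholds are illustrative, not Keras internals):

```python
def early_stop_training(val_losses, patience=3, min_delta=1e-4):
    """Return the epoch index at which training would stop: when the
    validation loss has not improved by at least `min_delta` for
    `patience` consecutive epochs."""
    best_loss = float("inf")
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss - min_delta:
            best_loss = loss
            epochs_without_improvement = 0   # improvement: reset the patience counter
        else:
            epochs_without_improvement += 1  # "wait around" one more epoch
            if epochs_without_improvement >= patience:
                return epoch                 # patience exhausted: stop here
    return len(val_losses) - 1               # criterion never triggered

# loss improves, then plateaus and rises: stop 3 epochs after the best epoch (3)
losses = [0.9, 0.7, 0.5, 0.4, 0.41, 0.42, 0.45, 0.5]
print(early_stop_training(losses, patience=3))  # → 6
```

Note how the counter resets on every improvement, so a single noisy epoch does not end the training prematurely.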
Admittedly, choosing the appropriate stopping criterion is important for early stopping. What’s equally important is saving the optimal model. What does that mean? Say we stop the training when performance has decreased for a few epochs; then the weights from the best earlier epoch are the ones we want. But how can we go back to that prior model?
The answer is simple. Instead of deciding on which models to save, we can maintain a “best model” object, such that whenever one iteration outperforms its prior iteration in validation results, we store a copy of the current weights to the “best model” object; this is the concept of checkpointing.
In practice, early stopping and checkpointing go hand in hand: checkpointing monitors and saves the optimal model as training goes on, and early stopping halts the training when validation performance decreases.
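Here is a toy sketch of the checkpointing idea, keeping a copy of the best weights seen so far (the update and loss functions are stand-ins for illustration, not a real training loop):

```python
import copy

def train_with_checkpoint(weights, update_fn, evaluate_fn, epochs=10):
    """Run `epochs` training steps, checkpointing a copy of the weights
    whenever the validation loss improves, and return the best snapshot."""
    best_weights, best_loss = copy.deepcopy(weights), float("inf")
    for _ in range(epochs):
        weights = update_fn(weights)           # one training step (stand-in)
        loss = evaluate_fn(weights)            # validation loss (stand-in)
        if loss < best_loss:                   # new best: take a snapshot
            best_loss = loss
            best_weights = copy.deepcopy(weights)
    return best_weights, best_loss

# toy run: the "weights" walk past the optimum at w = 4, but the
# checkpoint remembers the best state even after performance degrades
best_w, best_loss = train_with_checkpoint(
    0.0,
    update_fn=lambda w: w + 1.0,
    evaluate_fn=lambda w: (w - 4.0) ** 2,
)
print(best_w, best_loss)  # → 4.0 0.0
```

Because the snapshot is a deep copy, the “best model” is untouched by whatever the training loop does afterwards, which is exactly what lets early stopping restore it.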
Now, let’s move on to the fun experimentation part. For this exercise, we will be using Layer, a Declarative MLOps (DM) platform, to track and compare our model performance.
Layer setup: Layer lets users keep track of all their model details, such as what works, what fails, what parameters are used, and what model performance looks like, etc.; basically everything you need for model versioning. More importantly, Layer does the versioning automatically after the initial setup, meaning that there are no additional configuration steps from your side.
For Layer to work with Python, let’s first install the Layer SDK using,
```shell
pip install layer-sdk
```
It’s worth noting that the Layer SDK is compatible with Python versions 3.8+. Then, after running the Layer login command, we will be prompted to log into Layer at https://beta.layer.co.
To incorporate the latest updates and new features from Layer, we can upgrade the package with this command,
```shell
pip install --upgrade layer-sdk
```
Layer project structure: there are two key components in each Layer project: the data folder and the models folder, which contain the data/feature sets and the training models, respectively. Aside from these two folders, we will also see a README file with a general description of the Layer project. For example, here is the file structure in the README file of our current `early-stop` project,
As we can see, the dataset we will be working with is the Titanic Survival data that comes in Layer’s titanic survival model example; more details can be found here. In a nutshell, our goal is to build a neural network model using the features listed above to predict whether a given passenger survived or not. We will explore how Early Stopping affects the performance of this model.
First, let’s import all the required packages,
```python
# core packages
import numpy as np
from collections import Counter
import seaborn as sns
import matplotlib.pyplot as plt

# scikit-learn utilities
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import NearestNeighbors
from sklearn.model_selection import train_test_split, GridSearchCV, KFold, StratifiedKFold
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler

# Keras / TensorFlow
from tensorflow.keras import backend as K
from tensorflow.keras.models import Sequential, Model
from tensorflow.keras.layers import (Input, Dense, Embedding, Concatenate, Flatten,
                                     BatchNormalization, Dropout, Reshape, Activation)
from tensorflow.keras.callbacks import Callback, EarlyStopping, ModelCheckpoint
```
Next, we need to create the training and testing datasets,
```python
# create the training and label data (tf is the Layer featureset passed to train())
train_df = tf.to_pandas()
X = train_df.drop(["PassengerId", "Survived"], axis=1)
Y = train_df["Survived"]

# specify seed and testing size
random_state = 9125
test_size = 0.1
train.log_parameter("test_size", test_size)

# check and log to Layer the class distribution (survived vs. not survived)
class_cnt = Counter(train_df["Survived"])
class_ratio = class_cnt[1] / class_cnt[0]
train.log_parameter("class_ratio", class_ratio)
```
- The `train` object represents our current train of the model, which will be passed to Layer when training starts;
- The `train.log_parameter` method is used to log the relevant data information (i.e., testing data size and training data class distribution) in Layer for tracking purposes.
Then, split the titanic data into training and testing, and standardize the input features for modeling,
```python
### Preprocess the training and testing data
def Pre_proc(X, Y, current_test_size, current_seed=random_state):
    x_train, x_test, y_train, y_test = train_test_split(
        X, Y, test_size=current_test_size, random_state=current_seed
    )
    # standardize the input features
    sc = StandardScaler()
    x_train = sc.fit_transform(x_train)
    x_test = sc.transform(x_test)
    return x_train, x_test, np.array(y_train), np.array(y_test)

x_train, x_test, y_train, y_test = Pre_proc(X, Y, current_test_size=test_size)
train.log_parameter("train_columns", x_train.shape[1])  # number of input columns
```
A tip here, we logged `the number of columns` in Layer; this information will come in handy when we specify the input shape of our neural nets.
Moving on to the model, let’s define a neural net with three hidden layers and a sigmoid activation function on the output layer for binary classification,
```python
### Build a neural network with three hidden layers
def runModel(cols):
    inp = Input(shape=(cols,))
    x = Dense(128, activation="relu")(inp)
    x = BatchNormalization()(x)
    x = Dense(128, activation="relu")(x)
    x = Dense(64, activation="relu")(x)
    x = BatchNormalization()(x)
    out = Dense(1, activation="sigmoid")(x)
    return Model(inp, out)
```
With everything ready to go, we can now specify the model parameters for training, as shown below:
```python
## training configuration: a large number of epochs to deliberately overfit
## (these values are illustrative; the original values are not shown)
current_epochs = 500
current_batch_size = 32
current_validation_ratio = 0.2

## 6 features in the dataset
model = runModel(cols=6)
model.compile(loss="binary_crossentropy", optimizer="adam", metrics=["accuracy"])

train.log_parameter("early_stopping_indicator", 0)
train.log_parameter("total_epochs", current_epochs)
train.log_parameter("batch_size", current_batch_size)
train.log_parameter("validation_ratio", current_validation_ratio)
```
A couple of notes,
- We added an `early_stopping_indicator` parameter; value = 0 for model WITHOUT Early Stopping, whereas value = 1 for model WITH Early Stopping (which will be defined in the next section);
- We specified a large number of epochs to overfit the model.
Then fitting the model,
```python
#### fit the current model WITHOUT Early Stopping
history = model.fit(
    x_train, y_train,
    validation_split=current_validation_ratio,
    epochs=current_epochs,
    batch_size=current_batch_size,
    verbose=1,
)

#### send the model loss to Layer for logging
train.log_metric("Avg. Training Loss", np.round(np.mean(history.history["loss"]), 3))
train.log_metric("Avg. Validation Loss", np.round(np.mean(history.history["val_loss"]), 3))
```
Immediately after we kick off the training with:
```shell
# from inside the local early-stop folder
layer start
```
We will see training progress in the command window:
Upon completion of the training, let’s check the model performance in our Layer project:
As we can see, the mean training loss = 0.361, whereas the mean validation loss = 0.686, almost twice the training loss. This, as discussed in the previous sections, is a sign of model overfitting!
Next, let’s introduce early stopping to our model,
```python
#### Early Stopping: stop if no improvement for CURRENT_PATIENCE epochs
CURRENT_PATIENCE = 3
es = EarlyStopping(monitor="val_loss", mode="min", restore_best_weights=True,
                   verbose=1, patience=CURRENT_PATIENCE)

#### checkpointing to save the optimal model as training goes
mc = ModelCheckpoint("best_model.h5", monitor="val_loss", mode="min",
                     save_best_only=True, verbose=1, save_weights_only=True)

history = model.fit(
    x_train, y_train,
    callbacks=[es, mc],  ## callbacks for es & mc
    validation_split=current_validation_ratio,
    epochs=current_epochs,
    batch_size=current_batch_size,
    verbose=1,
)

#### send the model loss to Layer for logging
train.log_metric("Avg. Training Loss", np.round(np.mean(history.history["loss"]), 3))
train.log_metric("Avg. Validation Loss", np.round(np.mean(history.history["val_loss"]), 3))
```
A few points to ensure a correct implementation of Early Stopping in Keras:
- the `EarlyStopping` method is a convenient API function that takes in multiple arguments. In our code, we selected the validation loss as the metric to monitor. Additionally, setting the `patience` argument to 3 tells the training process to stop if there is no further improvement within 3 epochs of the best model so far;
- the `ModelCheckpoint` method is used to take a “snapshot” and save the best model;
- to ensure that our training process factors in both EarlyStopping and ModelCheckpoint, we set the `callbacks` argument of `model.fit` to the list `[es, mc]`.
Let’s run this model, and Layer will automatically version the runs for us. Below shows this model performance,
Training loss has a mean value of 0.490, whereas validation loss has a mean value of 0.598; these two values are reasonably close, as the training loss is expected to be slightly smaller than the validation loss.
Comparing this version of the model with Early Stopping (purple v26.1 on the right side) vs. the previous version without Early Stopping (green v27.1 on the left side),
we observed that,
- even though v27.1 without Early Stopping reached a lower training loss (0.361 on average) compared to v26.1 with Early Stopping (0.490), v26.1 outperforms v27.1 on the unseen validation data. Clearly, v27.1 overfits, leading to its tremendously decreased performance beyond the training set. By contrast, in the presence of Early Stopping, v26.1 was regularized to (1) perform consistently across training and validation and (2) beat its unregularized competitor on the unseen data;
- as to the total run time, which is automatically tracked in Layer, v26.1 completed within about 14 seconds, whereas v27.1 took a whopping 4 minutes to run! Imagine how much efficiency we could gain with Early Stopping when we scale up our models!
When it comes to implementing the Early Stopping technique, several tips are worth sharing:
- choose the appropriate metric to monitor. In our exercise, we monitored the validation loss as the stopping criterion. Another commonly used metric is accuracy. Depending on the specific question, the metric to monitor can even be a business KPI of interest.
- we set `patience` to be 3 for demonstration purposes; in a real-world application, this number usually is treated as a hyper-parameter to tune through cross-validation.
- along the same line of hyper-parameter tuning, the number of stopping epochs/patience is often tuned last, with all other hyper-parameters already selected. We can also plot the model training history as guidance.
In this article, we went through a fundamental question of ML, the bias-variance tradeoff, and the basic principles of the Early Stopping regularizer. On top of that, we implemented and compared the effect of Early Stopping on neural networks with Keras.
I hope this article provides some guidance on how Early Stopping works, and you are encouraged to set up and explore your own projects in Layer. The full script can be found in my GitHub repo here.