Underfitting and Overfitting in Machine Learning
30 December 2022
Introduction
The following terms are central to understanding underfitting and overfitting:
1. Signal - In machine learning, the term "signal" refers to the useful information or patterns in the data that the model is trying to learn.
2. Noise - The term "noise" refers to the irrelevant or random variations in the data that can interfere with the model's ability to learn the signal.
3. Bias - The term "bias" refers to the systematic error or deviation of the model's predictions from the true values. A model with high bias is said to be underfitting: it is too simple to capture the complexity and patterns in the data.
4. Variance - The term "variance" refers to the degree to which the model's predictions vary or fluctuate across different training data sets. A model with high variance is said to be overfitting: it fits the training data too closely and performs poorly on new, unseen data.
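To make these terms concrete, here is a minimal sketch (my own illustration, using NumPy and scikit-learn, neither of which is named in the article) that fits a very simple and a very flexible polynomial model to the same noisy data. The degree-1 fit is dominated by bias, while the degree-15 fit is dominated by variance:

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

rng = np.random.RandomState(0)

# Signal: a smooth sine curve. Noise: random Gaussian fluctuations added on top.
X = rng.uniform(0, 1, size=(40, 1))
y = np.sin(2 * np.pi * X).ravel() + rng.normal(scale=0.2, size=40)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=0)

for degree in (1, 15):
    # Degree 1 is too simple (high bias); degree 15 is too flexible (high variance).
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_train, y_train)
    train_mse = mean_squared_error(y_train, model.predict(X_train))
    test_mse = mean_squared_error(y_test, model.predict(X_test))
    print(f"degree={degree:2d}  train MSE={train_mse:.3f}  test MSE={test_mse:.3f}")
```

Running this typically shows the degree-15 model with near-zero training error but a much larger test error, while the degree-1 model has similar, mediocre errors on both sets.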
What is Overfitting?
Overfitting occurs when a model learns the training data too closely, picking up the noise along with the signal, so that it performs well on the training set but poorly on new, unseen data. Common causes include the following (a short demonstration appears after this list):
• Too many features: If the model has too many input variables (features), it may fit the training data very well but fail to generalize to new data, because it has learned the noise in the training data rather than the underlying relationships and patterns.
• Lack of regularization: Some machine learning algorithms, such as neural networks and decision trees, can "memorize" the training data if they are not regularized. Regularization is a technique that adds a penalty to the model to prevent it from fitting the training data too closely.
• Insufficient training data: If the training dataset is small, the model may fit it very well but fail to generalize, because it has not seen enough examples to learn the underlying relationships and patterns.
Figure: Overfitting in Machine Learning
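The demonstration promised above is a minimal sketch (my own example, using scikit-learn) of an unrestricted decision tree memorizing a small, noisy dataset; the gap between training and test accuracy is the typical symptom of overfitting:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# A small, noisy classification problem.
X, y = make_classification(n_samples=200, n_features=20, n_informative=5,
                           flip_y=0.1, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# No depth limit and no regularization: the tree can memorize the training set.
tree = DecisionTreeClassifier(random_state=42)
tree.fit(X_train, y_train)

print("train accuracy:", tree.score(X_train, y_train))  # typically 1.0
print("test accuracy: ", tree.score(X_test, y_test))    # noticeably lower
```

Limiting the tree's capacity (for example with max_depth=3) or gathering more training data usually narrows this gap.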
What is Underfitting?
Underfitting occurs when a model is too simple to capture the patterns in the training data, so it performs poorly on the training set and on new data alike. Common causes include the following (a short demonstration appears after this list):
• Too few features: If the model has too few input variables (features), it may not be able to capture the complexity and patterns in the training data.
• Insufficient model complexity: Some machine learning algorithms, such as linear regression and logistic regression, have a limited capacity to capture complex patterns in data. If the training data is too complex, these algorithms may be unable to fit it accurately.
• Insufficient training data: If the training dataset is small, the model may not have seen enough examples to learn the underlying relationships and patterns in the data.
Figure: Underfitting in Machine Learning
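Here is the corresponding minimal sketch (again my own illustration) of underfitting: a plain linear classifier applied to data whose classes form concentric circles scores poorly on both the training and the test set, because no straight line can separate the classes:

```python
from sklearn.datasets import make_circles
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

# Two concentric circles: a linear decision boundary cannot separate them.
X, y = make_circles(n_samples=400, noise=0.1, factor=0.5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# A plain linear model cannot capture the curved boundary (high bias).
linear = LogisticRegression()
linear.fit(X_train, y_train)

print("train accuracy:", linear.score(X_train, y_train))  # close to chance level
print("test accuracy: ", linear.score(X_test, y_test))    # similarly low
```

Switching to a more expressive model (a kernel SVM or a decision tree, for instance) or adding informative features typically resolves this kind of underfitting.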
How to Prevent and Mitigate Overfitting and Underfitting
• Splitting the data into training and validation sets: One way to detect overfitting is to split the data into a training set and a validation set. The model is trained on the training set, and its performance is evaluated on the validation set. Comparing the two scores tells us whether the model is overfitting or underfitting the data (see the first sketch after this list).
• Using cross-validation: Another way to guard against overfitting is cross-validation, which involves training the model on different subsets of the data and evaluating its performance on the remaining data. This helps to ensure that the model is not overly dependent on any particular subset of the data.
• Regularization: As mentioned earlier, regularization adds a penalty to the model to prevent it from fitting the training data too closely, which helps to prevent overfitting and improves generalization to new, unseen data. Common variants include L1 regularization, L2 regularization, and elastic net regularization. Each adds a penalty term to the model's objective function, encouraging the model to keep the learned relationships simple. The strength of the penalty is controlled by a hyperparameter; by tuning it, one can find the balance between bias and variance that prevents overfitting while still capturing the useful signal in the data (see the second sketch after this list).
• Ensemble techniques: Ensemble techniques combine the predictions of multiple models to produce a more accurate and robust prediction. They can help to prevent both overfitting and underfitting by reducing the variance and bias of the model (see the third sketch after this list). Common ensemble techniques include:
  • Bagging: Bagging (short for bootstrap aggregating) involves training multiple models on different subsets of the data and then averaging or voting on their predictions. This reduces the variance of the model, as the individual models are less likely to overfit the data.
  • Boosting: Boosting involves training multiple models sequentially, where each model is trained to correct the mistakes of the previous one. This reduces the bias of the model, as later models learn from the errors of earlier ones.
  • Stacking: Stacking involves training multiple models and then using a "meta-model" to combine their predictions. This can reduce both the bias and the variance, as the meta-model learns how to weight and correct the base models' outputs.
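The first two points translate into only a few lines of code. The following is a minimal sketch (my own example, using scikit-learn; the dataset and model choices are arbitrary) of a hold-out validation split and of 5-fold cross-validation:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

# Hold-out validation: train on one part of the data, evaluate on the other.
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=0)
model = RandomForestClassifier(random_state=0).fit(X_train, y_train)
print("train score:     ", model.score(X_train, y_train))
print("validation score:", model.score(X_val, y_val))

# 5-fold cross-validation: every sample is used for validation exactly once,
# so the estimate does not depend on one particular split.
scores = cross_val_score(RandomForestClassifier(random_state=0), X, y, cv=5)
print("cross-validation scores:", scores, "mean:", scores.mean())
```

A large gap between the training and validation scores points to overfitting; low scores on both point to underfitting.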
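For regularization, this sketch (again my own illustration) fits ridge (L2) and lasso (L1) regression with several values of the penalty hyperparameter alpha; larger values shrink the coefficients more strongly, trading a little extra bias for lower variance:

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge, Lasso
from sklearn.model_selection import train_test_split

# Many features, few of them informative: easy to overfit without a penalty.
X, y = make_regression(n_samples=100, n_features=50, n_informative=5,
                       noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for alpha in (0.01, 1.0, 100.0):
    # alpha is the regularization strength: the hyperparameter that controls
    # how heavily large coefficients are penalized.
    ridge = Ridge(alpha=alpha).fit(X_train, y_train)                   # L2 penalty
    lasso = Lasso(alpha=alpha, max_iter=10000).fit(X_train, y_train)   # L1 penalty
    print(f"alpha={alpha:7.2f}  ridge R^2={ridge.score(X_test, y_test):.3f}  "
          f"lasso R^2={lasso.score(X_test, y_test):.3f}")
```

In practice alpha is tuned on a validation set or with cross-validation rather than inspected by hand as it is here.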
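Finally, bagging, boosting, and stacking all have ready-made implementations in scikit-learn. The sketch below (my own example) evaluates one instance of each with cross-validation:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import (BaggingClassifier, GradientBoostingClassifier,
                              StackingClassifier)

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

models = {
    # Bagging: many trees, each trained on a bootstrap sample; predictions are averaged.
    "bagging": BaggingClassifier(DecisionTreeClassifier(), n_estimators=50, random_state=0),
    # Boosting: trees are added sequentially, each one correcting the previous ones' errors.
    "boosting": GradientBoostingClassifier(random_state=0),
    # Stacking: a meta-model (logistic regression) combines the base models' predictions.
    "stacking": StackingClassifier(
        estimators=[("tree", DecisionTreeClassifier(random_state=0)),
                    ("gb", GradientBoostingClassifier(random_state=0))],
        final_estimator=LogisticRegression()),
}

for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5)
    print(f"{name:9s} mean CV accuracy: {scores.mean():.3f}")
```

Which ensemble works best depends on the data; the point here is only that each technique is a one-line swap once the rest of the pipeline is in place.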
Inference
Underfitting (high bias) and overfitting (high variance) are the two main ways a machine learning model can fail to generalize. Splitting off a validation set or using cross-validation makes it possible to detect them, while regularization and ensemble techniques help control model complexity and strike a balance between bias and variance.