Regularization is a key concept in machine learning that helps prevent overfitting, improve model generalization, and make models more robust to new data. It adds a penalty to the loss function that discourages the model from fitting the noise in the training data, which is what produces overfitting.
Overfitting occurs when a model performs well on the training data but fails to generalize to new, unseen data. This happens when the model is too complex and captures both the signal and the noise in the data.
Underfitting, on the other hand, happens when a model is too simple to capture the underlying patterns in the data, resulting in poor performance even on the training set.
Regularization helps strike a balance between overfitting and underfitting by controlling model complexity and encouraging simpler models that generalize better.
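The contrast between underfitting and overfitting can be sketched with polynomial fits of increasing degree. This is only an illustration on made-up data: the sine signal, the noise level, and the degrees 1, 4, and 12 are assumptions chosen for the example, not values from this article.

```python
import numpy as np
from numpy.polynomial import polynomial as P

rng = np.random.default_rng(0)


def signal(x):
    # Hypothetical smooth signal used only for this illustration.
    return np.sin(3 * x)


# Small noisy training set, larger noisy test set.
x_train = np.sort(rng.uniform(-1, 1, 20))
x_test = np.sort(rng.uniform(-1, 1, 200))
y_train = signal(x_train) + 0.2 * rng.standard_normal(x_train.size)
y_test = signal(x_test) + 0.2 * rng.standard_normal(x_test.size)


def mse(y_true, y_pred):
    return float(np.mean((y_true - y_pred) ** 2))


results = {}
for degree in (1, 4, 12):
    coefs = P.polyfit(x_train, y_train, degree)  # least-squares polynomial fit
    train_err = mse(y_train, P.polyval(x_train, coefs))
    test_err = mse(y_test, P.polyval(x_test, coefs))
    results[degree] = (train_err, test_err)
    print(f"degree={degree:2d}  train MSE={train_err:.3f}  test MSE={test_err:.3f}")
```

The training error can only go down as the degree grows, because each higher-degree model contains the lower-degree ones. The test error is the number to watch: a degree that is too low typically leaves both errors high, while a degree that is too high typically widens the gap between them.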
Types of Regularization
There are several types of regularization techniques used in machine learning, with the most common being:
\(L_2\) Regularization (Ridge Regression)
\(L_1\) Regularization (Lasso Regression)
Elastic Net Regularization
Dropout (for neural networks)
Here we will discuss only the first two.
\(L_2\) Regularization (Ridge Regression)
\(L_2\) regularization (also known as Ridge regression in linear models) adds a penalty term to the loss function proportional to the sum of the squared coefficients (weights) of the model. The goal is to minimize both the original loss function and the magnitude of the coefficients.
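For a linear regression model, the objective is to minimize the following regularized loss function:

\[
J(\theta) = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y_i})^2 + \lambda \sum_{j=1}^{p} \theta_j^2
\]

where: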
\(\hat{y_i}\) is the model’s predicted output for input \(x_i\).
\(y_i\) is the true target value.
\(\theta_j\) are the model parameters (coefficients).
\(\lambda\) is the regularization strength, controlling the magnitude of the penalty (higher \(\lambda\) increases regularization).
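As a rough sketch, the regularized loss (mean squared error plus \(\lambda\) times the sum of squared coefficients) can be evaluated directly with NumPy. The helper name `ridge_loss` and the tiny data set are hypothetical, and the model is assumed to have no intercept so that \(\hat{y} = X\theta\).

```python
import numpy as np


def ridge_loss(X, y, theta, lam):
    """J(theta) = (1/n) * sum((y - X @ theta)^2) + lam * sum(theta^2)."""
    residuals = y - X @ theta
    return float(np.mean(residuals ** 2) + lam * np.sum(theta ** 2))


# Hypothetical toy data: 3 samples, 2 features, no intercept.
X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
y = np.array([1.0, 2.0, 3.0])
theta = np.array([1.0, 2.0])  # fits the toy data exactly

loss_unregularized = ridge_loss(X, y, theta, lam=0.0)  # data term only
loss_regularized = ridge_loss(X, y, theta, lam=0.1)    # data term plus penalty
print(loss_unregularized, loss_regularized)
```

Because `theta` reproduces `y` exactly here, the first value is the pure data term (zero) and the second is the penalty alone, \(0.1 \times (1^2 + 2^2)\).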
More about \(\lambda\)
\(\lambda\) is a continuous, non-negative scalar, typically a floating-point number.
The minimum value is \(\lambda=0\), at which the model reduces to standard linear regression. For small \(\lambda\) the regularization effect is minimal, allowing the model to fit the training data more closely.
In theory, there is no upper bound for \(\lambda\). However, as \(\lambda\) increases, the model becomes more regularized, and the coefficients tend to shrink toward zero.
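The shrinking effect can be checked numerically. The sketch below solves the Ridge problem in closed form for several values of \(\lambda\) and prints the size of the coefficient vector. The data, the "true" coefficients, and the helper `ridge_fit` are all assumptions made up for the illustration, and no intercept is fitted.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 200, 5
X = rng.standard_normal((n, p))
true_theta = np.array([3.0, -2.0, 1.0, 0.5, 0.0])  # hypothetical coefficients
y = X @ true_theta + rng.standard_normal(n)


def ridge_fit(X, y, lam):
    """Minimise (1/n) * ||y - X theta||^2 + lam * ||theta||^2 in closed form."""
    n_samples, n_features = X.shape
    gram = X.T @ X / n_samples + lam * np.eye(n_features)
    return np.linalg.solve(gram, X.T @ y / n_samples)


lams = [0.0, 0.1, 1.0, 10.0, 100.0]
norms = [float(np.linalg.norm(ridge_fit(X, y, lam))) for lam in lams]
for lam, norm in zip(lams, norms):
    print(f"lambda={lam:6.1f}  ||theta||_2={norm:.4f}")
```

The norm falls steadily as \(\lambda\) grows, but it stays strictly positive: the coefficients approach zero without reaching it.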
Selecting a good value of \(\lambda\) is crucial. It is typically chosen by cross-validation: several candidate values are tried, and the one that gives the best performance on held-out validation data is kept.
\(L_2\) regularization shrinks the coefficients towards zero but doesn’t force them to be exactly zero, thus retaining all features in the model.
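A short derivation shows why the shrinkage never reaches exactly zero. It is a sketch under simplifying assumptions that are not part of the example in this article: the data are centered, there is no intercept, and the model is written in matrix form with design matrix \(X\) and target vector \(y\).

```latex
% Ridge objective in matrix form (assumption: centered data, no intercept)
J(\theta) = \frac{1}{n}\lVert y - X\theta \rVert_2^2 + \lambda \lVert \theta \rVert_2^2

% Setting the gradient to zero gives the closed-form solution
\nabla_\theta J = -\frac{2}{n} X^\top (y - X\theta) + 2\lambda\theta = 0
\quad\Longrightarrow\quad
\hat{\theta}_{\text{ridge}}
  = \left(\tfrac{1}{n} X^\top X + \lambda I\right)^{-1} \tfrac{1}{n} X^\top y

% Special case (assumption): orthonormal features, \tfrac{1}{n} X^\top X = I
\hat{\theta}_{\text{ridge},j} = \frac{\hat{\theta}_{\text{OLS},j}}{1 + \lambda}
```

In the orthonormal case every ordinary least squares coefficient is divided by \(1+\lambda\), so a nonzero coefficient gets smaller as \(\lambda\) grows but remains nonzero for any finite \(\lambda\).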
```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import Ridge, LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_squared_error

# Generate synthetic data
np.random.seed(0)
X = 2 * np.random.rand(100, 1)
y = 4 + 3 * X + np.random.randn(100, 1)

# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# LinearRegression model
linear_model = LinearRegression()
linear_model.fit(X_train, y_train)
y_pred_linear = linear_model.predict(X_test)
mse_linear = mean_squared_error(y_test, y_pred_linear)
print(f"Mean Squared Error (Linear Regression): {mse_linear:.2f}")

# Train Ridge regression model (L2 Regularization)
sc = StandardScaler()
X_train_sc = sc.fit_transform(X_train)
X_test_sc = sc.transform(X_test)
ridge_model = Ridge(alpha=10)  # alpha is the regularization strength (lambda)
ridge_model.fit(X_train_sc, y_train)

# Predictions and evaluation
y_pred_ridge = ridge_model.predict(X_test_sc)
mse_ridge = mean_squared_error(y_test, y_pred_ridge)
print(f"Mean Squared Error (Ridge Regression): {mse_ridge:.2f}")

# Plot the results
plt.scatter(X_test, y_test, color='blue', label='True Data')
plt.plot(X_test, y_pred_linear, color='green', label='Linear Prediction')
plt.plot(X_test, y_pred_ridge, color='red', label='Ridge Prediction')
plt.xlabel('X')
plt.ylabel('y')
plt.title('Ridge Regularization')
plt.legend()
plt.gca().set_facecolor('#f4f4f4')
plt.gcf().patch.set_facecolor('#f4f4f4')
plt.savefig('rg.png')
plt.show()
```
Mean Squared Error (Linear Regression): 0.92
Mean Squared Error (Ridge Regression): 0.92
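In this example, alpha plays the role of \(\lambda\), the regularization strength. A higher value of alpha results in stronger regularization and shrinks the model coefficients more. The two are not numerically identical: scikit-learn's Ridge adds the penalty to the unaveraged sum of squared errors, so for a loss that is averaged over \(n\) samples, alpha corresponds to \(n\lambda\).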
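The relation between alpha and \(\lambda\) can be checked against a closed-form solution. This sketch assumes a loss averaged over \(n\) samples with an L2 penalty weighted by \(\lambda\), no intercept, and made-up data. It relies on scikit-learn's documented Ridge objective, which is the sum of squared errors plus alpha times the squared L2 norm of the coefficients.

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
n, p = 50, 3
X = rng.standard_normal((n, p))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.standard_normal(n)

lam = 0.2  # hypothetical regularization strength for the averaged loss

# Closed-form minimiser of (1/n) * RSS + lam * ||theta||^2 (no intercept).
theta_closed = np.linalg.solve(X.T @ X / n + lam * np.eye(p), X.T @ y / n)

# scikit-learn's Ridge minimises RSS + alpha * ||w||^2, so alpha = n * lam.
model = Ridge(alpha=n * lam, fit_intercept=False).fit(X, y)

print("closed form :", theta_closed)
print("sklearn     :", model.coef_)
```

Both lines should print the same coefficients up to floating-point error, which confirms the factor of \(n\) between the two parameterizations.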
\(L_1\) Regularization (Lasso Regression)
\(L_1\) regularization (also known as Lasso regression) adds a penalty term proportional to the sum of the absolute values of the coefficients. This type of regularization can force some coefficients to be exactly zero, effectively performing feature selection.
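The objective function for \(L_1\) regularization is:

\[
J(\theta) = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y_i})^2 + \lambda \sum_{j=1}^{p} |\theta_j|
\]

where:
The terms are the same as those for \(L_2\) regularization.
The penalty uses the absolute values of the coefficients instead of their squares.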
\(L_1\) regularization has the effect of making some coefficients exactly zero, which means it can be used to reduce the number of features in the model.
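With a single feature, as in the example in this article, the feature-selection effect is hard to see. The sketch below uses a made-up data set with six features, of which only the first two influence the target, and compares the fitted coefficients of Lasso and Ridge. The data and the alpha values are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 200
X = rng.standard_normal((n, 6))
# Hypothetical ground truth: only the first two features matter.
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + 0.5 * rng.standard_normal(n)

X_sc = StandardScaler().fit_transform(X)
lasso = Lasso(alpha=0.2).fit(X_sc, y)
ridge = Ridge(alpha=10.0).fit(X_sc, y)

print("Lasso coefficients:", np.round(lasso.coef_, 3))
print("Ridge coefficients:", np.round(ridge.coef_, 3))
print("Exact zeros, Lasso:", int(np.sum(lasso.coef_ == 0)))
print("Exact zeros, Ridge:", int(np.sum(ridge.coef_ == 0)))
```

Lasso sets the coefficients of the four irrelevant features to exactly zero, while Ridge keeps small nonzero values for all of them.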
```python
from sklearn.linear_model import Lasso

# Reuses the data split, scaled features, and linear baseline from the Ridge example
print(f"Mean Squared Error (Linear Regression): {mse_linear:.2f}")

# Train Lasso regression model (L1 Regularization)
lasso_model = Lasso(alpha=.5)  # alpha is the regularization strength (lambda)
lasso_model.fit(X_train_sc, y_train)

# Predictions and evaluation
y_pred_lasso = lasso_model.predict(X_test_sc)
mse_lasso = mean_squared_error(y_test, y_pred_lasso)
print(f"Mean Squared Error (Lasso Regression): {mse_lasso:.2f}")

# Plot the results
plt.scatter(X_test, y_test, color='blue', label='Data')
plt.plot(X_test, y_pred_linear, color='red', label='Linear Prediction')
plt.plot(X_test, y_pred_lasso, color='green', label='Lasso Prediction')
plt.xlabel('X')
plt.ylabel('y')
plt.title('Lasso Regularization')
plt.legend()
plt.gca().set_facecolor('#f4f4f4')
plt.gcf().patch.set_facecolor('#f4f4f4')
plt.show()
```
Mean Squared Error (Linear Regression): 0.92
Mean Squared Error (Lasso Regression): 1.02
Discussion
Choosing the Right \(\lambda\)
Selecting the optimal value of \(\lambda\) is crucial. Typically, it’s done via cross-validation, where different values of \(\lambda\) are tried, and the model is evaluated based on its performance on the validation set. The value that results in the best generalization is selected.
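One possible way to run this search is scikit-learn's `GridSearchCV` over a scaling-plus-Ridge pipeline. This is a sketch, not the procedure used for the numbers in this article: the data, the candidate grid, and the choice of 5 folds are assumptions.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 6))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.standard_normal(200)

alphas = np.logspace(-3, 2, 30)  # hypothetical candidate grid

# The scaler sits inside the pipeline, so it is refit on each training fold.
pipeline = make_pipeline(StandardScaler(), Ridge())
search = GridSearchCV(
    pipeline,
    param_grid={"ridge__alpha": alphas},
    cv=5,
    scoring="neg_mean_squared_error",
)
search.fit(X, y)

best_alpha = search.best_params_["ridge__alpha"]
print(f"selected alpha = {best_alpha:.4f}")
print(f"cross-validated MSE = {-search.best_score_:.4f}")
```

Swapping `Ridge()` for `Lasso()` and the parameter name for `lasso__alpha` gives the same search for the L1 penalty. scikit-learn also provides `RidgeCV` and `LassoCV`, which bundle the search into the estimator.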
Impact of \(\lambda\) on Bias-Variance Trade-off
Low \(\lambda\): Leads to a low bias and high variance model because the model closely fits the training data.
High \(\lambda\): Leads to a high bias and low variance model, as the regularization prevents the model from fitting the training data too closely, reducing the variance but increasing the bias.
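The trade-off can be observed by sweeping the regularization strength and recording training and test error. The setup below is an assumption made for illustration: few samples, many features, and an arbitrary grid of alpha values.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.standard_normal((80, 30))  # few samples relative to the feature count
y = X[:, 0] - X[:, 1] + rng.standard_normal(80)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=0
)

alphas = [1e-3, 1e-1, 1.0, 10.0, 100.0, 1000.0]
train_mse, test_mse = [], []
for alpha in alphas:
    model = Ridge(alpha=alpha).fit(X_train, y_train)
    train_mse.append(mean_squared_error(y_train, model.predict(X_train)))
    test_mse.append(mean_squared_error(y_test, model.predict(X_test)))
    print(f"alpha={alpha:8.3f}  train MSE={train_mse[-1]:.3f}  test MSE={test_mse[-1]:.3f}")
```

The training error never decreases as alpha grows, which reflects the rising bias. The test error is where the trade-off shows: it is usually lowest at some intermediate alpha, where the variance reduction outweighs the added bias.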
Facts
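Features should be standardized before fitting Ridge or Lasso regression, because neither method is scale invariant. The penalty acts directly on the size of each coefficient, and that size depends on the units of the corresponding feature, so without scaling, features measured on different scales are penalized unequally.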
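A quick check of this sensitivity: the sketch below fits Ridge on the same made-up data twice, once with the second feature expressed in units that make its values 1000 times larger, with and without a `StandardScaler` in front. The data and alpha value are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 2))
y = 2.0 * X[:, 0] + 1.0 * X[:, 1] + 0.1 * rng.standard_normal(100)

# Same information, but the second feature is recorded in a smaller unit
# (for example metres instead of kilometres), so its values are 1000x larger.
X_rescaled = X * np.array([1.0, 1000.0])

# Without scaling, the change of units changes the fitted model.
raw_a = Ridge(alpha=50.0).fit(X, y).predict(X)
raw_b = Ridge(alpha=50.0).fit(X_rescaled, y).predict(X_rescaled)

# With standardization inside a pipeline, the units no longer matter.
scaled_a = make_pipeline(StandardScaler(), Ridge(alpha=50.0)).fit(X, y).predict(X)
scaled_b = (
    make_pipeline(StandardScaler(), Ridge(alpha=50.0))
    .fit(X_rescaled, y)
    .predict(X_rescaled)
)

raw_change = float(np.max(np.abs(raw_a - raw_b)))
scaled_change = float(np.max(np.abs(scaled_a - scaled_b)))
print("max prediction change without scaling:", raw_change)
print("max prediction change with scaling:   ", scaled_change)
```

Putting the scaler in a pipeline also ensures that the scaling statistics are learned from the training data only when the model is evaluated on a separate test set.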
| Criteria | L1 Regularization (Lasso) | L2 Regularization (Ridge) |
| --- | --- | --- |
| Feature Selection | Can set some coefficients exactly to zero, effectively performing feature selection. | Does not set coefficients to zero; shrinks them but retains all features. |
| Handling Multicollinearity | Not ideal for handling highly correlated features, as it may arbitrarily select one feature and discard the others. | Works better in the presence of multicollinearity, as it tends to spread the penalty across correlated features. |
| Effect on Coefficients | Sparse solutions; coefficients are either zero or relatively large, favoring simpler models with fewer features. | Coefficients are small and distributed more evenly across all features, leading to less sparse solutions. |
| Interpretability | Easier to interpret, as some features are removed, simplifying the model. | All features remain in the model, making it harder to interpret when there are many features. |
| Computational Complexity | Can be computationally intensive with a large number of features due to the non-smooth nature of the L1 penalty. | Less computationally expensive due to its smooth penalty term (squared coefficients). |
| Best Suited For | When you want a sparse model with feature selection, and when the number of irrelevant features is large. | When you want to retain all features, especially in cases of multicollinearity, and avoid overfitting by shrinking coefficients. |
| When to Use | When you expect only a few features to be important; when you want automatic feature selection; when you need a simple, interpretable model. | When you believe all features contribute to the target; when dealing with multicollinear data; when you want to prevent overfitting but don't want feature elimination. |
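The multicollinearity row can be illustrated with two almost identical features. This is a sketch on made-up data. The exact Lasso coefficients depend on the solver and the data, so only the Ridge behaviour is treated as reliable here.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
n = 200
x = rng.standard_normal(n)
# Hypothetical extreme case: two nearly identical (highly correlated) columns.
X = np.column_stack([x, x + 0.01 * rng.standard_normal(n)])
y = 2.0 * x + 0.1 * rng.standard_normal(n)

ridge = Ridge(alpha=1.0).fit(X, y)
lasso = Lasso(alpha=0.1).fit(X, y)

print("Ridge coefficients:", np.round(ridge.coef_, 3))
print("Lasso coefficients:", np.round(lasso.coef_, 3))
```

Ridge splits the total effect of about 2 roughly evenly between the two columns. Lasso may instead place most or all of the weight on one of them, and which one it picks is essentially arbitrary.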
Model Fine Tuning: Regularization. Rafiq Islam, 2024-09-24.