L1 VS L2 Regularization: How it works, difference, use cases | Machine Learning | Deep Learning

Sourin
3 min read · Apr 4, 2024

Regularization is a technique used to prevent overfitting in machine learning and deep learning models. It adds a penalty term to the loss function of the model, which discourages the model from learning overly complex patterns that fit the training data too closely. By doing this, regularization helps find the sweet spot between underfitting and overfitting (the bias-variance trade-off).

The most commonly used regularization techniques are: L1 regularization (Lasso), L2 regularization (Ridge), and Elastic Net regularization.

L2 regularization/ Ridge regularization:

L2 regularization adds the sum of the squared weights to the loss function, penalizing large weights more strongly than L1 regularization does. It smooths out the model weights and prevents extreme values.

Defined as: L2 regularized loss = Loss function + λ * Σ w_i²
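As a concrete illustration (not from the original article), here is a minimal NumPy sketch of an L2-penalized loss for linear regression. The names `X`, `y`, `w`, and `lam` are illustrative placeholders.

```python
import numpy as np

def ridge_loss(w, X, y, lam):
    """Mean squared error plus lambda times the sum of squared weights."""
    residuals = X @ w - y
    mse = np.mean(residuals ** 2)
    l2_penalty = lam * np.sum(w ** 2)  # λ * Σ w_i²
    return mse + l2_penalty

# Tiny usage example on synthetic data
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
w_true = np.array([1.0, 0.5, 0.0, 0.0, -2.0])
y = X @ w_true + rng.normal(scale=0.1, size=100)
print(ridge_loss(w_true, X, y, lam=0.1))
```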

L1 regularization/ Lasso regularization:

L1 regularization adds the sum of the absolute values of the weights to the loss function, penalizing large weights. It encourages sparsity in the model: it tends to drive some coefficients to exactly zero, effectively selecting a subset of features while discarding the others.

Defined as: L1 regularized loss = Loss function + λ * Σ |w_i|
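The same sketch with the L1 penalty (again an illustrative example, using the same placeholder names as above); only the penalty term changes.

```python
import numpy as np

def lasso_loss(w, X, y, lam):
    """Mean squared error plus lambda times the sum of absolute weights."""
    residuals = X @ w - y
    mse = np.mean(residuals ** 2)
    l1_penalty = lam * np.sum(np.abs(w))  # λ * Σ |w_i|
    return mse + l1_penalty
```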

Difference between L1 and L2 regularization:

L2 Regularization (Ridge): L2 regularization adds the sum of the squared values of the weights to the loss function used in your model.

L1 Regularization (Lasso): L1 regularization adds the sum of the absolute values of the weights to the loss function.

L2 Regularization: L2 regularization leads to smaller but non-zero coefficients for all the features in your dataset. It reduces the impact of individual features but does not drive the weights to exactly zero.

L1 Regularization: L1 regularization tends to produce sparse models by driving some coefficients to exactly zero. It performs automatic feature selection: the features with zero weights are effectively ignored by the model.

L2 Regularization: The L2 penalty works smoothly with standard gradient descent algorithms because it is differentiable everywhere.

L1 Regularization: The L1 penalty is not differentiable at zero, which makes optimization more challenging. Subgradient or proximal methods are typically used to optimize L1-regularized objectives.
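A small sketch (assuming scikit-learn, with synthetic data invented for illustration) that makes the sparsity difference visible: Lasso zeroes out the irrelevant coefficients, while Ridge only shrinks them.

```python
import numpy as np
from sklearn.linear_model import Ridge, Lasso

# Synthetic data: only the first two features are informative
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 10))
true_w = np.array([3.0, -2.0, 0, 0, 0, 0, 0, 0, 0, 0])
y = X @ true_w + rng.normal(scale=0.5, size=200)

ridge = Ridge(alpha=1.0).fit(X, y)
lasso = Lasso(alpha=0.5).fit(X, y)

print("Ridge coefficients:", np.round(ridge.coef_, 3))  # all small but non-zero
print("Lasso coefficients:", np.round(lasso.coef_, 3))  # many exactly zero
```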

When should you use L2 and when L1?

# Use L2 over L1 when:

  • Your dataset contains features that are correlated. L2 regularization does not let any single feature dominate the others; it spreads the weight across all the features of your dataset.
  • Your model is not very complex and you simply want to control its overall complexity. L2 is generally the better choice when there is no specific need for feature selection.

# Use L1 over L2 when:

  • Your dataset contains many features and you need to select the right ones (perform feature selection). L1 promotes sparsity: it keeps the most relevant features and ignores the irrelevant or less important ones.
  • L1 can also help when interpretability is important, as it produces sparse models (a small sketch comparing both choices follows below).
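If you are unsure which penalty suits your data, one practical option is to compare cross-validated Ridge and Lasso models. This is a hedged sketch using scikit-learn's RidgeCV/LassoCV with an arbitrary alpha grid and synthetic data, not a recipe prescribed by the article.

```python
import numpy as np
from sklearn.linear_model import RidgeCV, LassoCV
from sklearn.model_selection import cross_val_score

# Synthetic data: only the first two of 20 features matter
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 20))
y = 2.0 * X[:, 0] - X[:, 1] + rng.normal(scale=0.5, size=200)

ridge = RidgeCV(alphas=np.logspace(-3, 3, 13))
lasso = LassoCV(alphas=np.logspace(-3, 1, 9), max_iter=10_000)

for name, model in [("Ridge", ridge), ("Lasso", lasso)]:
    scores = cross_val_score(model, X, y, cv=5, scoring="r2")
    print(f"{name}: mean R^2 = {scores.mean():.3f}")
```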
