Why Use L2 Norm Instead of L1 Norm in Loss Functions?


Have you noticed that, in many applications, MSE (Mean Squared Error), RMSE (Root Mean Squared Error), and SSE (Sum of Squared Errors) are often the preferred choices for the loss function? But why is this the case? Why do we favor the L2 norm over L1-norm losses such as Mean Absolute Error (MAE)?

For a linear regression model, the answer is straightforward: the Gauss-Markov theorem implies that ordinary least squares, the estimator that minimizes the L2 norm of the errors, is the best linear unbiased estimator (BLUE). But in practice, not all models we work with are linear regression models…

Consider the loss function in a typical (often non-linear) machine learning model, which is commonly defined as

$$ \text{MSE} = \dfrac{1}{N}\sum_{i=1}^{N} \left( y_i -\hat{y}_i \right)^2 $$

One might argue that the L2 norm emphasizes larger errors by squaring the residuals, effectively “zooming in” on significant deviations. But if that’s the case, why not use an even higher power to penalize large errors more heavily, such as $ \dfrac{1}{N}\displaystyle\sum_{i=1}^{N} \left( y_i -\hat{y}_i \right)^4 $?
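To see this “zooming in” effect concretely, here is a minimal sketch (with made-up residuals; the outlier value 5.0 is purely illustrative) comparing how much a single outlier dominates the L1, L2, and fourth-power losses:

```python
import numpy as np

# Hypothetical residuals: four small errors and one large outlier.
residuals = np.array([0.5, -0.3, 0.2, -0.4, 5.0])

# Per-sample contributions under each loss.
for name, contrib in [("L1 (MAE)", np.abs(residuals)),
                      ("L2 (MSE)", residuals ** 2),
                      ("L4      ", residuals ** 4)]:
    share = contrib[-1] / contrib.sum()
    print(f"{name}: outlier accounts for {share:.1%} of the total loss")
```

Running this, the outlier’s share grows from roughly 78% under L1 to about 98% under L2 and over 99.9% under the fourth power.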

Indeed, higher powers would penalize large errors even more heavily. However, the preference for the L2 norm isn’t just about magnifying errors. Let’s delve into the real reason!

Usually, the goal of a statistical model is to find the function $f(\mathbf{x})$ that best describes the relationship between the input $\mathbf{x}$ and the observed data, enabling accurate predictions and generalization to new data.

To achieve this, we typically use Maximum Likelihood Estimation (MLE), which allows us to estimate the model parameters that make the observed data most probable. Specifically, when we maximize the likelihood function of the errors $\mathbf{\epsilon}$ — the differences between the model’s predictions and the observed data — we are finding the parameters that make these errors most likely under our model.

Why? Because by maximizing the likelihood of these errors, we identify the parameter values that are most consistent with the errors we actually observed. This approach is rooted in empirical evidence: it makes sense to choose the parameters under which the observed errors are the most probable, since we have no reason to prefer an explanation that makes the observations less likely.

For example, imagine your parents walk into your room five times, and each time they catch you playing computer games instead of doing homework 😂. They might conclude that you’ve been playing computer games all day, even though you actually spent hours doing homework and just happened to take a break at the wrong moments (what a bad excuse, btw 😂)… Here, they’re maximizing the likelihood of their “model”, the assumption that you’re always gaming, because that assumption makes the moments they observed most probable, and they don’t believe it was a rare coincidence. In reality, you were just unlucky, but based on the evidence they have, their conclusion is the most probable one under MLE.
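To make the parents’ reasoning concrete, here is a toy sketch of the same idea (a hypothetical Bernoulli model, with the “probability of gaming” $p$ as the parameter) that grid-searches the $p$ maximizing the likelihood of their five observations:

```python
import numpy as np

# Hypothetical data: parents check in 5 times, observe "gaming" (1) each time.
observations = np.ones(5)

# Bernoulli model: likelihood of the data as a function of p = P(gaming).
p_grid = np.linspace(0.01, 1.0, 100)
likelihood = [np.prod(np.where(observations == 1, p, 1 - p)) for p in p_grid]

# MLE picks the p that makes the observations most probable.
print("MLE of p:", p_grid[np.argmax(likelihood)])  # -> 1.0, i.e. "always gaming"
```

Just like the parents, MLE commits to the explanation that makes the evidence most probable, even if the truth was an unlucky coincidence.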

So, typically, the statistical model’s goal is to find the predictions $\hat{y}$ such that:

$$
\hat{y} = \arg \max_{\hat{y}} \Big( L(\epsilon) \Big)
$$

where

  • $\hat{y}$ is the set of the model’s predicted outputs, where $\hat{y} = \{\hat{y}_1,\hat{y}_2,\cdots, \hat{y}_N\}$
  • $y$ is the set of observed data, where $y = \{y_1, y_2, \cdots, y_N\}$
  • $L(\epsilon)$ is the joint likelihood of all the individual errors, where $L(\epsilon) = L\left (\displaystyle \bigcap^{N}_{i=1} \epsilon_i \right )$
  • $\epsilon_i$ is the $i$-th individual error, where $\epsilon_i=\hat{y}_i-y_i$ (the errors depend on the predictions $\hat{y}$, while the observed data $y$ stay fixed, which is why we maximize over $\hat{y}$)

For simplicity, and to make the model computationally feasible, we assume the individual errors in $\epsilon$ to be statistically independent, so that $L(\epsilon) = \displaystyle \prod^{N}_{i=1} L(\epsilon_i)$. Consequently:

$$
\hat{y} = \arg \max_{\hat{y}} \bigg( \prod^{N}_{i=1} L(\epsilon_i) \bigg)
$$

Taking the logarithm to simplify the product into a sum (the logarithm is a strictly increasing function, i.e. $\forall x_1, x_2 \in \mathbb{R}^{+}, \, x_1 < x_2 \implies \log(x_1) < \log(x_2)$, so the maximizer is preserved):

\begin{align*}
\hat{y} &= \arg \max_{\hat{y}} \bigg( \log \Big( \prod^{N}_{i=1} L(\epsilon_i) \Big) \bigg) \\
&= \arg \max_{\hat{y}} \bigg( \sum^{N}_{i=1} \log L(\epsilon_i) \bigg)
\end{align*}
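As a side note, taking the logarithm is not only an algebraic convenience; it also matters numerically. Multiplying thousands of densities underflows 64-bit floats, while summing log-densities stays well behaved. A quick sketch (assuming standard normal errors):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
eps = rng.normal(0.0, 1.0, size=2000)  # simulated errors

print(np.prod(norm.pdf(eps)))    # raw product of densities: underflows to 0.0
print(np.sum(norm.logpdf(eps)))  # sum of log-densities: a finite number
```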

Here, we assume that every individual error follows a normal distribution, $L(\epsilon_i) = \dfrac{1}{\sqrt{2\pi \sigma_i^2}} \ \exp\left(-\dfrac{\epsilon_i^2}{2\sigma_i^2}\right)$, with zero mean and homoscedasticity, i.e. $\sigma_1=\sigma_2=\cdots=\sigma_N=\sigma$. This assumption is motivated by the Central Limit Theorem: if each error can be seen as the sum of many small i.i.d. (independent and identically distributed) perturbations, then its distribution tends toward a normal distribution as the number of perturbations grows.
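You can convince yourself of this with a quick simulation: each simulated error below is the sum of many uniform (clearly non-normal) perturbations, yet the histogram of the sums comes out bell-shaped:

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(42)

# Each "error" is the sum of 100 small i.i.d. uniform perturbations.
n_errors, n_components = 10_000, 100
errors = rng.uniform(-0.5, 0.5, size=(n_errors, n_components)).sum(axis=1)

plt.hist(errors, bins=60, density=True)
plt.title("Sums of 100 uniform perturbations look normal (CLT)")
plt.show()
```

Substituting the normal density into the log-likelihood and continuing the derivation: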

\begin{align*}
\hat{y} &= \arg \max_{\hat{y}} \bigg( \sum^{N}_{i=1} \log L(\epsilon_i) \bigg) \\
&= \arg \max_{\hat{y}} \left( \sum_{i=1}^{N} \left( -\frac{1}{2} \log(2\pi\sigma_i^2) -\frac{\epsilon_i^2}{2\sigma_i^2} \right) \right) \\
&= \arg \min_{\hat{y}} \left( \sum_{i=1}^{N} \left( \frac{1}{2} \log(2\pi\sigma_i^2) + \frac{\epsilon_i^2}{2\sigma_i^2} \right) \right) \\
&= \arg \min_{\hat{y}} \left( \frac{1}{2} \sum_{i=1}^{N} \log(2\pi\sigma_i^2) + \frac{1}{2} \sum_{i=1}^{N} \frac{\epsilon_i^2}{\sigma_i^2} \right) \\
&= \arg \min_{\hat{y}} \left( \frac{N}{2} \log(2\pi) + \frac{N}{2} \log(\sigma^2) + \frac{1}{2\sigma^2} \sum_{i=1}^{N} \epsilon_i^2 \right) \\
&= \arg \min_{\hat{y}} \left( \frac{1}{2\sigma^2} \sum_{i=1}^{N} \epsilon_i^2 \right) \\
&= \arg \min_{\hat{y}} \left( \sum_{i=1}^{N} \epsilon_i^2 \right) \\
&= \arg \min_{\hat{y}} \left( \sum_{i=1}^{N} \left( \hat{y}_i-y_i \right)^2 \right)
\end{align*}

(In the last few steps, the constant terms $\frac{N}{2} \log(2\pi)$ and $\frac{N}{2} \log(\sigma^2)$ drop out because they do not depend on $\hat{y}$, and multiplying by the positive constant $\frac{1}{2\sigma^2}$ does not change the minimizer.)

Given the above derivation, we see that minimizing the sum of squared errors is equivalent to maximizing the likelihood of the errors under the assumption that they follow a normal distribution with mean zero and constant variance. This directly leads to the use of the L2 norm (squared errors) in loss functions such as Mean Squared Error (MSE).
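As a sanity check on this equivalence, the sketch below fits the simplest possible model, a single constant prediction $c$ for every observation, in two ways: by minimizing the SSE and by maximizing the Gaussian log-likelihood directly. Under the normality assumption both criteria select the same value (the sample mean):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(7)
y = 3.0 + rng.normal(0.0, 1.0, size=500)  # observations around a true mean of 3

candidates = np.linspace(0.0, 6.0, 601)   # candidate constant predictions
sse = [np.sum((y - c) ** 2) for c in candidates]
loglik = [np.sum(norm.logpdf(y - c)) for c in candidates]

print("argmin SSE:    ", candidates[np.argmin(sse)])
print("argmax loglik: ", candidates[np.argmax(loglik)])
print("sample mean:   ", y.mean())
```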

However, it’s important to note that the L2 norm error may not be the best choice in all cases. Specifically, when the error distribution deviates from normality, the L2 norm’s assumptions break down.

For example, in classification tasks the errors are often not normally distributed, so an L2 norm loss can lead to suboptimal results. There, the errors arise from assigning the wrong category rather than from continuous deviations, and the data typically follow a categorical distribution; the loss is then more appropriately modeled by Cross-Entropy. Similarly, if the errors follow a Laplace distribution, an L1 norm loss is the better option. (Note: you can derive these results by applying the same strategy as above.)
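For instance, if every individual error follows a Laplace distribution, $L(\epsilon_i) = \dfrac{1}{2b}\exp\left(-\dfrac{|\epsilon_i|}{b}\right)$, the same steps give

\begin{align*}
\hat{y} &= \arg \max_{\hat{y}} \bigg( \sum^{N}_{i=1} \log L(\epsilon_i) \bigg) \\
&= \arg \max_{\hat{y}} \left( \sum_{i=1}^{N} \left( -\log(2b) - \frac{|\epsilon_i|}{b} \right) \right) \\
&= \arg \min_{\hat{y}} \left( \sum_{i=1}^{N} \left| \hat{y}_i - y_i \right| \right)
\end{align*}

which is exactly the L1 norm (the sum of absolute errors).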


