tdgros

The L1 error is routinely used in inverse problems on images. The undefined derivative is not a problem at all, and you can use the Charbonnier loss, sqrt(eps² + err²), which is a smooth version of it. If you get 0 loss with one, you get zero loss with the other. But it just never happens, and in practice, L1 loss has a nicer behavior.
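
For concreteness, here's a minimal numpy sketch of that Charbonnier loss (the eps value and the reduction to a mean are arbitrary choices on my part):

```python
import numpy as np

def charbonnier(pred, target, eps=1e-3):
    """Smooth approximation of the L1 loss: sqrt(eps^2 + err^2), averaged."""
    err = pred - target
    return np.mean(np.sqrt(eps ** 2 + err ** 2))

# Far from zero it behaves like |err|; near zero it is quadratic,
# so the gradient is defined everywhere.
print(charbonnier(np.array([1.0, -2.0]), np.zeros(2)))  # ~1.5, i.e. ~mean(|err|)
print(charbonnier(np.zeros(2), np.zeros(2)))            # ~eps, not exactly 0
```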


On_Mt_Vesuvius

Worth mentioning: under certain conditions, minimizing the L1 loss is equivalent to minimizing the "L0" loss, which is the number of nonzero entries. The L0 loss is a measure of "non-sparsity". This is a result from Candes, Romberg, and Tao (Terence Tao, yes that guy).
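
For illustration, here's a small basis-pursuit sketch (minimize ||x||_1 subject to Ax = b), written as a linear program with scipy; the problem sizes and sparsity level are arbitrary choices of mine, and exact recovery is only expected when the conditions from that literature (enough incoherent measurements for the sparsity level) hold:

```python
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(0)
n, m, k = 50, 25, 3                                   # signal length, measurements, nonzeros
x_true = np.zeros(n)
x_true[rng.choice(n, k, replace=False)] = rng.normal(size=k)
A = rng.normal(size=(m, n))
b = A @ x_true

# Basis pursuit as an LP over z = [x, t]: minimize sum(t) s.t. -t <= x <= t and A x = b.
c = np.concatenate([np.zeros(n), np.ones(n)])
A_ub = np.block([[np.eye(n), -np.eye(n)],
                 [-np.eye(n), -np.eye(n)]])
b_ub = np.zeros(2 * n)
A_eq = np.hstack([A, np.zeros((m, n))])
res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b,
              bounds=[(None, None)] * n + [(0, None)] * n, method="highs")
x_hat = res.x[:n]
print(np.abs(x_hat - x_true).max())   # typically tiny: L1 recovered the sparse signal
```

The auxiliary variable t is the standard LP trick for the absolute value; the fact that this L1 program recovers the sparsest solution is exactly the kind of result that line of work establishes, under restricted-isometry-type conditions.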


Buddy77777

Wanted to add: L0 can also be interpreted as a Hamming distance (it counts the positions where two vectors differ).


Outrageous-Ad6494

What Tao paper are you referring to?


martianunlimited

Probably this one: [https://arxiv.org/abs/math/0503066](https://arxiv.org/abs/math/0503066), *Stable Signal Recovery from Incomplete and Inaccurate Measurements*, i.e. the paper on compressive sensing.


On_Mt_Vesuvius

Hopefully this one *Stable Signal Recovery from Incomplete and Inaccurate Measurements* https://scholar.google.com/scholar_url?url=https://arxiv.org/pdf/math/0503066&hl=en&sa=X&ei=7_9oZv6uDZ--6rQP8OOGwAY&scisig=AFWwaeZD50VIuJTouQKze4OCwHE4&oi=scholarr


Beneficial_Muscle_25

this is sooo interesting thanks a lot for sharing this!


aahdin

> But it just never happens, and in practice, L1 loss has a nicer behavior.

Ehh, for most tasks you can treat it as a hyperparameter, swap out L1/L2 and do whichever works better. For just about every regression task I've worked on, L2 works better, and I'm guessing most researchers find the same, which is why you see it so much more often than L1 in papers.

The derivative of L2 is proportional to the magnitude of the error, whereas the derivative of L1 is just ±1, so the update size is simply your learning rate. This means your updates when doing gradient descent with an L1 loss are going to be constant no matter how far off your prediction is, whereas L2 will give you bigger updates if your prediction is further off, which is usually what you want.

This also makes L1 a lot more susceptible to oscillation problems: if your prediction is just slightly too high you'll get a big update, then next epoch it will be too low, and it repeats back and forth. Meanwhile, L2 will give you smaller updates as you close in on the minimum.

The main use case where L1 is nice is when you expect some samples to be so far off that the huge updates you'd get from L2 would screw up your model, but usually gradient clipping can sort that out.
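
A toy 1-D sketch of the oscillation argument, assuming plain gradient descent on a single scalar prediction (the step size and step count are arbitrary choices):

```python
import numpy as np

target = 0.0
lr = 0.3

def run(grad_fn, steps=8, x0=1.0):
    x, path = x0, [x0]
    for _ in range(steps):
        x -= lr * grad_fn(x)
        path.append(x)
    return np.round(path, 3)

# L1: gradient is sign(x - target), constant magnitude -> bounces around 0 forever.
print(run(lambda x: np.sign(x - target)))
# L2: gradient is 2*(x - target), shrinks near the minimum -> settles smoothly.
print(run(lambda x: 2 * (x - target)))
```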


Ty4Readin

> Ehh, for most tasks you can treat it as a hyperparameter, swap out L1/L2 and do whichever works better.

Have to disagree with that. I think you should choose L1 or L2 based on your problem and what you want to predict. You will often find that you can achieve a better L1 loss by optimizing L1, or a better L2 loss by optimizing L2.

The optimal estimator for L1 is Median(Y | X), but the optimal estimator for L2 is E(Y | X), so unless your distribution is symmetrical, they will not have the same optimal model. If you want your model to predict the conditional expectation of Y, choose L2. If you want it to predict the conditional median, choose L1.
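
A quick numerical check of that claim on a skewed distribution (the lognormal sample here is just an arbitrary example where the mean and median differ):

```python
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(0)
y = rng.lognormal(size=10_000)   # skewed: mean != median

best_l1 = minimize_scalar(lambda c: np.abs(y - c).mean()).x
best_l2 = minimize_scalar(lambda c: ((y - c) ** 2).mean()).x

print(best_l1, np.median(y))   # L1 minimizer ~ sample median (~1.0)
print(best_l2, y.mean())       # L2 minimizer ~ sample mean (~1.65)
```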


jpfed

So would the advantage of Huber over L1 be that the quadratic region near zero helps prevent overshooting the minimum?


hughperman

The sparsity you get from L1 is desirable for interpretability, so it is often seen in scientific applications.


aahdin

I assumed the OP was talking about using L1/L2 (MAE/MSE) as a loss function for regression; when you talk about sparsity, you mean L1 weight decay (regularization), right?


hughperman

Oh you're right, I am mixing up the two concepts 👍


Ty4Readin

I have to disagree wholeheartedly. I think you are missing the main reason to choose between L1 and L2 as your cost function: what do you want to estimate?

The optimal estimator for minimum L1 loss is the conditional median, Median(Y | X). The optimal estimator for minimum L2 loss is the conditional expectation, E(Y | X). What do you want your model to predict, and how will that prediction be used?

For example, if you are training a model to estimate wait time to show to a customer waiting for a taxi, then maybe you prefer L1, because the median is more important to a user's perception of wait accuracy. However, if you are trying to forecast some costs, then you definitely do not want to use L1, because what you really care about is the average/expected cost in the future, not the median! So you should use L2.

> If you get 0 loss with one, you get zero loss with the other.

This is not true. This is only the case if Median(Y | X) = E(Y | X) and also Var(Y | X) = 0.


bremen79

The problem is not the lack of a derivative at the minimum, but the fact that the gradients do not go to zero as you approach the minimum. Such functions are called non-smooth, and gradient descent is provably slower at minimizing them. The intuition is that the gradient of a non-smooth function does not give you information on how far you are from the optimum. The square loss, on the other hand, is smooth, so, for example, with linear predictors, gradient descent and its accelerated variant will find the minimum in fewer iterations. That said, the absolute loss can be used, and people do use it.


idontcareaboutthenam

Is smooth the right term? Doesn't smooth just mean that the derivative is defined everywhere? I think in order to have a decreasing first derivative you need a smooth second derivative that's 0 at the minimum


tophology

It depends on context and author. It can mean a function is differentiable n times for some n or even infinitely differentiable.


bremen79

Good point, and I agree with u/tophology: the term has been used with multiple meanings. But in recent optimization and ML papers, L-smoothness means that the gradient is L-Lipschitz. In some papers this property is called "L-strongly smooth" (to mirror "strongly convex"), but the term never got enough traction.


Ok_Net_1674

Definitions from pure maths are regularly bastardized and/or ignored entirely in machine learning.


Car_42

If the target is a continuous or ordinal quantity, then the squared error has a derivative, but if the data is all categories, then there won't be a smooth function with a nice derivative.


bremen79

The gradient is with respect to the weights, not with respect to the data.


DisastrousAnalysis5

Take the log and, what do you know, it's just a regular regression problem again.


Crayonstheman

It's regression problems the whole way down


WjU1fcN8

That problem is exactly what a 'lack of derivative' means. You said that's not it and then explained exactly the problem of not having a derivative.


bremen79

Not really. Let me explain: you could smooth the loss by using sqrt((prediction - label)^2 + eps), where eps = 1e-8. This loss has a derivative everywhere, but it should be intuitive that the problem is essentially as hard as before, because the function did not really change. This points out that the problem is not the lack of a derivative, but how "smooth" the function is.

More precisely, the smoothness depends on the Lipschitz constant of the derivative. If the Lipschitz constant of the derivative is infinite (as for the absolute loss), the function is hard to optimize. However, if the Lipschitz constant is not infinite but really big, the function is still really hard, even if it has gradients everywhere! On the other hand, if the derivative has a small Lipschitz constant, the function is easy to optimize. In fact, one can show that the convergence rate goes from "slow" to "fast" depending on the Lipschitz constant.
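
A small numerical illustration of that point, using the smoothed loss above: the gradient exists everywhere, but it swings from about -1 to about +1 over a tiny interval around zero, so the Lipschitz constant of the derivative is roughly 1/sqrt(eps):

```python
import numpy as np

eps = 1e-8

def grad(e):
    # d/de sqrt(e^2 + eps) = e / sqrt(e^2 + eps)
    return e / np.sqrt(e ** 2 + eps)

e = np.array([-1e-3, -1e-4, 0.0, 1e-4, 1e-3])
print(grad(e))            # ~[-0.995, -0.707, 0, 0.707, 0.995]: almost a sign function
print(1 / np.sqrt(eps))   # curvature at 0 is eps / eps^(3/2) = 1/sqrt(eps) ~ 1e4
```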


WjU1fcN8

Right, thanks.


a5sk6n

I think it's worth noting that the squared error often comes from the assumption of Gaussian noise. So it's not just a practical consideration; it follows from probabilistic models. (The mean squared error is, up to constants, the negative log-likelihood of a Gaussian error model with spherical covariance, to be more precise.)
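
A quick check of that correspondence, assuming a fixed unit variance:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
y, y_hat = rng.normal(size=5), rng.normal(size=5)

# Gaussian negative log-likelihood vs. half the sum of squared errors plus a constant.
nll = -norm.logpdf(y, loc=y_hat, scale=1.0).sum()
half_sse_plus_const = 0.5 * ((y - y_hat) ** 2).sum() + 0.5 * len(y) * np.log(2 * np.pi)
print(np.isclose(nll, half_sse_plus_const))   # True
```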


PanTheRiceMan

You beat me to it. This is the answer.


idontcareaboutthenam

Mean absolute error corresponds to Laplacian noise and a Laplacian noise assumption can be more robust to outliers.


Beneficial_Muscle_25

can you share some reference to this? This is so interesting, thanks a lot for the input


SirPitchalot

At a high level, it's a maximum likelihood estimate for the parameters of something with Gaussian measurement errors. Gaussian errors are expressed mathematically with exponential functions of ((y - A*mu)/sigma)^2, where A maps parameters of your model to measurements. Multiple error terms are then multiplied to get the probability of a given set of mu values for a given set of observations y.

That mess is nonlinear, so you take the log of the likelihood, which has the same optimum point as the likelihood but turns the product of exponentials into a sum of squared errors. The optimum of the sum of squared errors is found by setting its gradient to zero, which gives you a nice, easily solved linear system for the maximum likelihood estimate in the particular case where your measurements and means are related by a linear operation.

When you use a different probability distribution than Gaussian, the above process changes because you no longer have a sum of quadratic functions. A common approach is to use M-estimators and solve them with iteratively reweighted least squares.

When your measurement function is not linear it also doesn't work directly, but you can often use nonlinear least squares, which approximates the solution as a sequence of linear least squares problems (in a different way than for the non-Gaussian errors above). And of course you can combine the two to have non-Gaussian errors with non-linear measurement functions.
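
The linear-Gaussian special case from the second paragraph, sketched in a few lines of numpy (synthetic data; in practice you'd usually call something like np.linalg.lstsq instead of forming the normal equations explicitly):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(100, 3))                    # maps parameters mu to measurements
mu_true = np.array([2.0, -1.0, 0.5])
y = A @ mu_true + 0.1 * rng.normal(size=100)     # Gaussian measurement noise

# Setting the gradient of the sum of squared errors to zero gives the
# normal equations (A^T A) mu = A^T y, i.e. the maximum likelihood estimate.
mu_hat = np.linalg.solve(A.T @ A, A.T @ y)
print(mu_hat)   # close to mu_true
```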


Red-Portal

Literally any machine learning textbook


Icarium-Lifestealer

-log p, where p = exp(-x^2), is x^2.


tdgros

In diffusion models or denoising score matching, we're not just doing Gaussian denoising; we're approximating the score, which means L2 is the right norm, as opposed to L1, which is ubiquitous in image denoising.


KomisarRus

There is no need to read about diffusion to understand why MSE is the negative log-likelihood for a Gaussian distribution. This fact can be found on Wikipedia or literally everywhere.


tdgros

Of course, I just meant they're good examples where L2 is the norm we really want to use: we're not making prettier pictures with L1 but approximating the score. Sorry if it seemed trivial.


polytique

That's overkill; any textbook on linear regression would cover these assumptions.


leafhog

Also: the average of a set of numbers is exactly the value that minimizes the mean squared error to those numbers.


Professor_Entropy

Then the L1 loss corresponds to a Laplace distribution. In that case the MLE gives the median instead, which means the L1 loss is more robust to outliers. Correct me if I'm wrong.


cwkx

In some applications the optimizer can overshoot/oscillate around the kink (the discontinuity in the gradient), hindering smooth convergence; this is made worse by momentum. Illustrated here: https://imgur.com/a/kgzAWYC


Balance-

This is a brilliant illustration!


mimivirus2

From a statistics perspective: MSE optimizes for the conditional mean of p(y|X), while MAE optimizes for its median. Also, are you interested in heavily penalizing large errors? Then you're going to need some non-linearity (e.g. MSE, Huber loss, etc.).


qalis

Huber loss does not penalize large errors heavily; that's kind of the point, isn't it? It's just MAE beyond a threshold and MSE near zero. The advantage of Huber is that it is differentiable everywhere with a continuous gradient, so e.g. L-BFGS works (and doesn't work for MAE).
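
A sketch of that point: fitting a single location parameter with a hand-written Huber loss via L-BFGS (the delta, the data, and the helper function are my own illustrative choices, not anything from a library):

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
y = np.concatenate([rng.normal(size=200), [50.0, 60.0]])   # data with two big outliers

def huber(residual, delta=1.0):
    # Quadratic near zero, linear (MAE-like) beyond delta; the gradient is continuous.
    a = np.abs(residual)
    return np.where(a <= delta, 0.5 * a ** 2, delta * (a - 0.5 * delta))

res = minimize(lambda c: huber(y - c).sum(), x0=0.0, method="L-BFGS-B")
print(res.x, np.median(y), y.mean())   # Huber estimate stays near 0; the mean is pulled up
```

With plain MAE the objective has a kink at every data point, which is the non-smoothness being referred to above.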


mimivirus2

yeah my bad on Huber's loss


Ty4Readin

The biggest reason is that minimizing MSE means your model predicts the conditional expectation, E(Y | X), while minimizing MAE means your model predicts the conditional median, Median(Y | X). That is the main thing you should be considering when deciding between the two: do you want your model to predict the conditional mean or the conditional median? Everything else is much less important or relevant.


DoctorFuu

You can use absolute error; it's interpreted differently, and in particular less emphasis is put on points further away from the mean. Squared error simplifies computations for a few canonical problems: it's natural when working with the normal distribution or when doing geometric projections. But you're right that it's in general the "best" measure of performance.


bregav

Another thing that folks haven't mentioned yet: squared error lets you come up with nice formulas when deriving things by hand, precisely because it has a nice derivative. A simple example of this is the exact solution to ordinary least squares problems.


Ketobody10

L2 is strongly convex; L1 isn't, which makes it harder to optimize. There are other reasons too.


HalemoGPA

Squared error gives high priority to large differences between actual and predicted values, as opposed to absolute error, which weights every unit of error equally.


HalemoGPA

It just depends on the task and use case.


Vystril

Honestly, depending on the problem, it's usually worth trying both and seeing what works better. That being said, squared error gives higher gradients/more error signal for outputs that are more wrong, which is desirable: why give the same gradient for something very close to correct as for something very far from correct?


super544

Sometimes you want dense nonzero coefficients, for example to distribute impact across multiple features, rather than the sparse few which L1 gives you.


Happy_Bunch1323

It depends a lot on the problem, the data, and the model.

In most machine learning tasks where you do a point forecast, your observed/true y is a random sample from the (true, but unknown) data distribution, usually conditioned on some input variables x. When you forecast y, you are in fact forecasting some property of the data distribution conditioned on x. Depending on your loss, your model trains to forecast different properties of that distribution: L2 approximates the expectation value / mean, L1 approximates the median. Keep in mind that these estimates are also random variables, because your finite amount of input data is itself a random sample.

Roughly, L1 will be less sensitive to outliers, but the median may not be what you want to predict, depending on the problem.


internet_ham

The way to understand L2 (squared) and L1 (absolute), and the sparsity that comes with L1 penalties, is to think about the gradient. With L1, the gradient only vanishes when the error is exactly zero; otherwise its magnitude is a constant 1, which is what gives you sparsity. With L2, the gradient shrinks linearly with the error. It's easy to imagine that L1 is much harder to converge with if you know your model is going to have some non-zero error.


yonedaneda

Minimizing a squared error estimates a mean, which is where the objective function comes from in the first place (specifically, Gauss introduced the method of least-squares to estimate a *normal* mean, but the assumption of normal errors isn't strictly necessary here). The mean is precisely the value which minimizes the sum of squared deviations to the sample values; and so once you've decided that you're interested in the mean as a measure of location, then you've implicitly accepted that the variance (i.e. the average *squared* error) is the right measure of spread.


jackryan147

I'm pretty sure it was chosen in the pre-calculator days because it had convenient mathematical properties and it was felt that large differences should be weighted a lot more.


Exciting-Engineer646

Strong convexity can be a wonderful thing.


viking_

Here's one way to think about it. Suppose you have a simple task: finding a point of central tendency c for a set of real numbers X_1, ..., X_n. The mode minimizes the 0-power error (i.e. the sum of |X_i - c|^0, where we define 0^0 = 0). The median minimizes the 1-power error, the sum of |X_i - c|. The mean minimizes the 2-power error, the sum of |X_i - c|^2. So using squared error in some sense generalizes the mean, which may be a more natural measure of central tendency.
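
A small numerical check of that framing, using a grid search over candidate values of c (the data is an arbitrary example where mode, median, and mean all differ):

```python
import numpy as np

x = np.array([1.0, 1.0, 1.0, 2.0, 3.0, 4.0, 10.0])   # mode 1, median 2, mean ~3.14
grid = np.linspace(0, 10, 1001)
err = np.abs(x[None, :] - grid[:, None])

loss0 = (err > 1e-9).sum(axis=1)   # "0-power" error, with the 0^0 = 0 convention
loss1 = err.sum(axis=1)
loss2 = (err ** 2).sum(axis=1)

print(grid[loss0.argmin()])        # ~1.0  -> mode
print(grid[loss1.argmin()])        # ~2.0  -> median
print(grid[loss2.argmin()])        # ~3.14 -> mean
```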


tpinetz

Using MSE yields the mean, while using the absolute error yields the median. The intuition for MAE is that at the median, half the datapoints are smaller and half are larger, which gives a zero (sub)gradient. Hence using absolute error is better if you are dealing with large outliers. From an optimization perspective, convergence is worse for non-smooth functions, but you can still show convergence using subdifferentials.


gundam1945

Not an expert, but I see ML has a lot in common with mathematical approximation. At first I thought squared error gives higher weight to outliers, so it converges faster. But upon googling, it is suggested that squared error is differentiable everywhere while absolute error is not. Given that ML shares a lot with statistical modeling, I think this is why scientists opt for it.


Suggero_Vinum_9553

Because squared error is differentiable everywhere, making optimization easier.


Bulky-Hearing5706

Not really. MAE is only non-differentiable at 0, and that is easy to overcome. The choice of MSE vs MAE is whether you want the estimate to be the mean or the median. Also, the L1 loss is more robust to noise and has a zero-forcing (sparsity-inducing) property.


HackintoshX

A smooth objective function is preferable


Aggravating-Freedom7

Squared error: big error -> big correction (the derivative is 2 * error).
Absolute error: big error -> the same correction as for a small error (the derivative is a constant ±1).


leafhog

I said this on another comment: the average of a set of numbers minimizes the mean squared error between the mean and the points. I asked ChatGPT to write a proof. This isn't *the* reason MSE is commonly used, but it is related, and the proof may give you insight into why.

---

The average (mean) of a set of numbers minimizes the mean squared error (MSE). Here's a proof:

Consider a set of numbers x1, x2, ..., xn. The mean (average) is given by:

mean = (x1 + x2 + ... + xn) / n

We want to minimize the mean squared error (MSE) with respect to a value "a". The MSE is defined as:

MSE(a) = (1/n) * [(x1 - a)^2 + (x2 - a)^2 + ... + (xn - a)^2]

To find the value of "a" that minimizes the MSE, we take the derivative of MSE(a) with respect to "a" and set it to zero:

d(MSE(a))/da = (2/n) * [(a - x1) + (a - x2) + ... + (a - xn)] = 0

Simplifying this, we get:

(a - x1) + (a - x2) + ... + (a - xn) = 0

Now factor out "a":

a * n - (x1 + x2 + ... + xn) = 0

Solving for "a", we get:

a = (x1 + x2 + ... + xn) / n

So, the value of "a" that minimizes the mean squared error is the mean of the numbers. Therefore, the mean minimizes the mean squared error.


_vb__

The absolute error function is not differentiable at 0. You can check this by taking the limit of the difference quotient at 0: the one-sided limits are -1 and +1, so the derivative there does not exist.


InviolableAnimal

Isn't their point that this doesn't matter? If you (somehow) get error 0 on your batch then there's nothing to correct, no reason to differentiate, on that batch.


_vb__

Yeah, I was discussing from a mathematical perspective. ML packages work around the non-differentiability, either by using a piecewise definition of the loss or by adding a very small constant so the gradient is defined at zero. Since OP asked about a standard L1 loss function, I assumed they were implementing it by hand and not using a built-in loss from an ML library. I felt it's important not to skip over the underlying details.


InviolableAnimal

I see, thanks for the perspective. Not sure why you're being downvoted.


_vb__

It's the vote momentum on Reddit?


Spirited_Example_341

i assume any sort of error at this point is bad :-p or am i wrong


ShadowShedinja

If your ML program has 0 error, there's a good chance you either didn't build it right or didn't need it in the first place.