# Where Those Loss Constants Come From

# tl;dr

A common form of the objective function for LASSO, a.k.a. L1-regularized least-squares regression, evaluated on a batch (size $n$) of training data (total size $N$), is

\[\begin{equation} L(\theta) = \lambda \vert \theta \vert + \frac{1}{n} \sum_i^n \frac{(y_i - f(\mathbf{x}_i;\theta))^2}{2}. \end{equation}\]

The total loss $L$ consists of a regularization term and the mean of a loss function $\ell$. Why does that $1/n$ show up in the loss term? Where does that $2$ come from in the denominator? The objective function would be prettier without them!

$1/n$ shows up to support batch gradient descent optimization. The mean loss over a batch of size $n<N$ is an unbiased estimate of the mean loss over the entire data set. $n$ can be as small as $1$ and this is still true.
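To see this numerically, here is a small sketch (NumPy, with made-up squared errors standing in for per-point losses; all names are mine):

```python
import numpy as np

rng = np.random.default_rng(0)

# Fake "full data set" of per-point squared errors for N points.
N, n = 100_000, 32
sq_errors = rng.exponential(scale=0.1, size=N)

full_mean = sq_errors.mean()  # mean loss over all N points

# Each random batch's mean loss estimates full_mean; averaging many
# batch means shows the estimate is centered on the full-data mean.
batch_means = [rng.choice(sq_errors, size=n).mean() for _ in range(5_000)]
estimate = np.mean(batch_means)

print(full_mean, estimate)  # the two should be close
```

The same holds at $n=1$: the estimate just gets noisier, which is the usual SGD trade-off.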

As for that $2$… that is introduced in the loss purely so that the gradient has no dangling constants. That’s it. Not super compelling, but also not completely devoid of purpose.

Along the way, we’ll get a bonus of seeing what other constants are implicitly packed into $\lambda$, the regularization strength.

I’ll show my work as best I can.

# Posterior Probability

In typical machine learning problems the data is divided into the target or dependent variable $y$ (a scalar) and the features or independent variables $\mathbf{x}$ (a vector). The target and the features are paired into tuples $y_i\vert \mathbf{x}_i$, read “y given x”, for all data points indexed by $i$ from $1$ to $N$, the size of the data set. The set of all data is \(\{ y_i \vert \mathbf{x}_i \}\).

The *posterior probability*
\(p(\theta \vert \{ y_i \vert \mathbf{x}_i \})\)
is the probability of the model parameters given the observed data.
You can use Bayes’ rule to show that
it is proportional to the product of the
prior on parameters $p(\theta)$
and the likelihood function $\mathcal{L}$:

\[\begin{equation} p(\theta \vert \{ y_i \vert \mathbf{x}_i \}) \propto p(\theta) \, \mathcal{L}(\theta \vert \{ y_i \vert \mathbf{x}_i \}). \end{equation}\]

The likelihood is interpreted as the probability that the observed data was generated by the model. The prior is interpreted as the range of likely values of the parameters of the model, as expressed using probabilities.

The likelihood function will lead to the loss function, which is the square-error in this case, and the prior will lead to the regularization term.

The best model is the one that maximizes the posterior probability. This is called the maximum a posteriori or MAP model. If the prior is flat ($p(\theta)=1$), the maximum a posteriori model is also the maximum likelihood estimate, abbreviated MLE.

I am ignoring an additional term called the model evidence or marginal likelihood, the normalizing probability by which the product of the prior and the likelihood is divided. It can be discarded immediately because it does not depend on the model parameters $\theta$, so maximizing the full posterior probability is equivalent to maximizing just the prior times the likelihood. I bring this up because discarding the marginal likelihood does not affect the constants that enter into the optimization problem.

Optimization is more conveniently performed in log space. Maximization is turned into minimization by multiplying by $-1$; this is entirely an historical convention.

\[\begin{equation} -\log{p(\theta \vert \{ y_i \vert \mathbf{x}_i \})} = - \log{p(\theta)} - \log{\mathcal{L}(\theta \vert \{ y_i \vert \mathbf{x}_i \})}. \end{equation}\]

# Gaussian Likelihood

Now to make things more concrete. Let’s say the dependent variable $y_i$ is continuous and unbounded from \(-\infty\) to \(\infty\), and errors between $y_i$ and predictions of \(y_i\) from the model $f(\mathbf{x}_i;\theta)$ are normally distributed. Then the likelihood function is

\[\begin{equation} \mathcal{L}(\theta \vert \{ y_i \vert \mathbf{x}_i \}) = \prod_{i=1}^N\frac{1}{\sigma\sqrt{2\pi}}\exp{\left(-\frac{(y_i - f(\mathbf{x}_i;\theta))^2}{2\sigma^2}\right)} \end{equation}\]

The negative log-likelihood is

\[\begin{eqnarray} - \log{\mathcal{L}(\theta \vert \{ y_i \vert \mathbf{x}_i \})} & = & - \log{\left( \prod_{i=1}^N\frac{1}{\sigma\sqrt{2\pi}}\exp{\left(-\frac{(y_i - f(\mathbf{x}_i;\theta))^2}{2\sigma^2}\right)} \right)} \\ & = & - \sum_i^N \log{\left( \frac{1}{\sigma\sqrt{2\pi}} \exp{\left(-\frac{(y_i - f(\mathbf{x}_i;\theta))^2}{2\sigma^2}\right)} \right)} \\ & = & - \sum_i^N \log{\left( \frac{1}{\sigma\sqrt{2\pi}} \right)} - \sum_i^N \log{\exp{\left(-\frac{(y_i - f(\mathbf{x}_i;\theta))^2}{2\sigma^2}\right)}} \\ & = & - \sum_i^N \log{\left( \frac{1}{\sigma\sqrt{2\pi}} \right)} + \sum_i^N \frac{(y_i - f(\mathbf{x}_i;\theta))^2}{2\sigma^2} \end{eqnarray}\]

Minimization of the negative log-likelihood over $\theta$ can further simplify the task:

\[\begin{eqnarray} \mathrm{argmin}_{\theta} \left( - \log{\mathcal{L}(\theta \vert \{ y_i \vert \mathbf{x}_i \})} \right) & = & \mathrm{argmin}_{\theta} \left( - \sum_i^N \log{\left( \frac{1}{\sigma\sqrt{2\pi}} \right)} + \sum_i^N \frac{(y_i - f(\mathbf{x}_i;\theta))^2}{2\sigma^2} \right) \\ & = & \mathrm{argmin}_{\theta} \left( \sum_i^N \frac{(y_i - f(\mathbf{x}_i;\theta))^2}{2\sigma^2} \right). \end{eqnarray}\]

That is to say, you can get rid of the constant term because it does not depend on $\theta$.
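Here is a numerical sketch of that equivalence (my own toy setup, with a one-parameter linear model $f(x;\theta)=\theta x$, which the post has not committed to yet): the $\theta$ that minimizes the full negative log-likelihood, constants and all, is the same $\theta$ that minimizes the bare sum of squared residuals.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy data: y = 2x + Gaussian noise, model f(x; theta) = theta * x.
N, sigma = 200, 0.5
x = rng.normal(size=N)
y = 2.0 * x + rng.normal(scale=sigma, size=N)

# Evaluate both objectives on a grid of candidate thetas.
thetas = np.linspace(0.0, 4.0, 2001)
resid_sq = (y[None, :] - thetas[:, None] * x[None, :]) ** 2

# Full negative log-likelihood, constant term included.
nll = N * np.log(sigma * np.sqrt(2 * np.pi)) + resid_sq.sum(axis=1) / (2 * sigma**2)

# Plain sum of squared residuals, constants dropped.
ssr = resid_sq.sum(axis=1)

# Dropping constants does not move the minimizer.
print(thetas[nll.argmin()], thetas[ssr.argmin()])  # identical
```

The NLL here is an increasing affine function of the SSR, so their argmins agree exactly, not just approximately.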

# Laplace Prior

Let’s say that the parameters of the model should be Laplace distributed about $0$, like this:

\[\begin{eqnarray} p(\theta) = \frac{1}{2b} \exp{\left( - \frac{\vert \theta \vert}{b} \right)}. \end{eqnarray}\]

This prior asserts that
I believe values of $\theta$ should usually be about $b$ or less away from $0$,
and I most expect $\theta=0$ “with prejudice”.
As far as priors on model parameters go,
this prior amounts to a really silly assertion,
because if I thought the model parameters should most likely be $0$
then it’s not clear why I’m building a model at all.
More precisely, in real situations I always expect *at least some* model parameters
to be different from zero, so the Laplace prior results in a biased
MAP model.

That said, I’m going to proceed with this prior anyway because it ends up being enormously useful.

The negative log-prior is

\[\begin{eqnarray} - \log{p(\theta)} & = & - \log \left( \frac{1}{2b} \exp{\left( - \frac{\vert \theta \vert}{b} \right)} \right) \\ & = & - \log \left( \frac{1}{2b} \right) - \log \exp{\left( - \frac{\vert \theta \vert}{b} \right)} \\ & = & - \log \left( \frac{1}{2b} \right) + \frac{\vert \theta \vert}{b} \end{eqnarray}\]

Minimization of the negative log-prior over $\theta$ can further simplify the task:

\[\begin{eqnarray} \mathrm{argmin}_{\theta} \left( - \log{p(\theta)} \right) & = & \mathrm{argmin}_{\theta} \left( - \log \left( \frac{1}{2b} \right) + \frac{\vert \theta \vert}{b} \right) \\ & = & \mathrm{argmin}_{\theta} \left( \frac{\vert \theta \vert}{b} \right). \end{eqnarray}\]

That is to say, you can get rid of the constant term because it does not depend on $\theta$.
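As a quick check of that algebra (a sketch assuming SciPy is available; `scipy.stats.laplace` parameterizes the distribution with `loc` and `scale`, where `scale` plays the role of $b$):

```python
import numpy as np
from scipy.stats import laplace

b = 0.7
theta = np.linspace(-3.0, 3.0, 13)

# Hand-derived negative log-prior: log(2b) + |theta| / b.
by_hand = np.log(2 * b) + np.abs(theta) / b

# Same quantity via SciPy's Laplace distribution centered at 0.
by_scipy = -laplace.logpdf(theta, loc=0.0, scale=b)

print(np.allclose(by_hand, by_scipy))  # True
```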

# Total Loss

Math is beautiful to look at, so let’s write out the full posterior probability model I’ve developed so far simply to behold it:

\[\begin{eqnarray} p(\theta \vert \{ y_i \vert \mathbf{x}_i \}) \propto \frac{1}{2b} \exp{\left( - \frac{\vert \theta \vert}{b} \right)} \prod_{i=1}^N\frac{1}{\sigma\sqrt{2\pi}}\exp{\left(-\frac{(y_i - f(\mathbf{x}_i;\theta))^2}{2\sigma^2}\right)} \end{eqnarray}\]

The objective function $Q(\theta)$ is defined as the negative log-posterior, and because I intend to minimize the objective with respect to $\theta$, I discard the constant terms immediately for simplicity:

\[\begin{eqnarray} Q(\theta) & = & \frac{\vert \theta \vert}{b} + \sum_i^N \frac{(y_i - f(\mathbf{x}_i;\theta))^2}{2\sigma^2}. \end{eqnarray}\]

To be super clear, I’ve simply taken the negative log-prior term that depends on $\theta$ and the negative log-likelihood term that depends on $\theta$ and labeled their sum $Q(\theta)$. This is not quite yet the total loss I talked about at the beginning. We’re getting there.

Machine learning reduces to learning the best $\theta$, which is the one that minimizes $Q(\theta)$. How is that done?

# Stochastic Gradient Descent

Gradient descent (GD) is a class of optimization techniques where you search for a function’s minimum by taking steps in the direction of the function’s steepest descent. The direction of steepest descent is the negative of the function’s gradient.

To do GD you do two things.

1. **Take a measurement**. Measure the objective function value at the current step $j$, $Q(\theta_j)$.
2. **Take a step in the right direction**. Descend the objective in the direction of the negative gradient, with a step size equal to the gradient’s magnitude times some tunable number $\eta_j$. This puts us in the next location $\theta_{j+1} = \theta_j - \eta_j \nabla Q(\theta_j)$.

If you do these two steps repeatedly and you are smart to make $\eta_j$ smaller with each step, you will eventually reach a (local) minimum of the function $Q$.
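The loop above can be sketched in a few lines (a toy quadratic objective of my own invention, not the LASSO objective):

```python
import numpy as np

def Q(theta):
    # Toy convex objective with its minimum at theta = 3.
    return (theta - 3.0) ** 2

def grad_Q(theta):
    return 2.0 * (theta - 3.0)

theta = 0.0
for j in range(1, 501):
    eta = 0.3 / j                        # shrink the step size with each step
    value = Q(theta)                     # step 1: take a measurement
    # (in practice `value` drives monitoring and stopping criteria)
    theta = theta - eta * grad_Q(theta)  # step 2: step against the gradient

print(theta)  # approaches 3.0
```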

That’s great if $N$ is not absolutely enormous. If it is, computing each step can take forever and standard GD takes too long to be practically useful. So I have to use more tricks.

# Batching

A solution at huge $N$ is to break up the training data \(\{y_i \vert \mathbf{x}_i\}\)
into *batches* that are small enough to support quick computation of
the objective and its gradient for steps 1 and 2, respectively.

If the batches are $n<N$ randomly sampled data points, how do I fairly estimate the total objective function in step 1 and the gradient in step 2? If I naively compute the objective over the subset of data points I get

\[\begin{eqnarray} q(\theta) & = & \frac{\vert \theta \vert}{b} + \sum_i^n \frac{(y_i - f(\mathbf{x}_i;\theta))^2}{2\sigma^2}. \end{eqnarray}\]

The problem is that this is not a good estimate of $Q$ from the full data set. The reason is that $Q$ accumulates square-residuals (times constants) over every data point up to $N$. If square residuals are on average, say, $0.1$ for illustration’s sake, then

\[\begin{eqnarray} Q = \frac{\vert \theta \vert}{b} + \frac{0.1N}{2\sigma^2}. \end{eqnarray}\]

But on a batch of data $q$ will be about $\vert \theta \vert/b + 0.1n/2\sigma^2 < \vert \theta \vert/b + 0.1N/2\sigma^2$, so $q<Q$ in expectation. A solution to this is to multiply the likelihood term by $N/n$ to scale it back up. Then I get, in my cartoon, a new batch objective of

\[\begin{eqnarray} q_{\mathrm{batch}} = \frac{\vert \theta \vert}{b} + \frac{N}{n} \frac{0.1n}{2\sigma^2} = \frac{\vert \theta \vert}{b} + \frac{0.1N}{2\sigma^2} = Q, \end{eqnarray}\]

on average. And if this is true for my randomly chosen average square residual of $0.1$ it must be true for every average square residual.
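A quick numerical check of the $N/n$ rescaling (a sketch with stand-in squared residuals; all the names and numbers here are my own illustration):

```python
import numpy as np

rng = np.random.default_rng(2)

N, n, sigma, b = 10_000, 50, 1.0, 1.0
theta = 0.8
sq_resid = rng.exponential(scale=0.1, size=N)  # stand-in squared residuals

# Full-data objective Q.
Q_full = abs(theta) / b + sq_resid.sum() / (2 * sigma**2)

# Rescaled batch objective, averaged over many random batches.
q_vals = []
for _ in range(2_000):
    batch = rng.choice(sq_resid, size=n)
    q_vals.append(abs(theta) / b + (N / n) * batch.sum() / (2 * sigma**2))

print(Q_full, np.mean(q_vals))  # close on average
```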

So now the task is to minimize

\[\begin{eqnarray} q_{\mathrm{batch}}(\theta) & = & \frac{\vert \theta \vert}{b} + \frac{N}{n} \sum_i^n \frac{(y_i - f(\mathbf{x}_i;\theta))^2}{2\sigma^2}. \end{eqnarray}\]

It is also true that, for the purposes of minimization,

\[\begin{eqnarray} \mathrm{argmin}_{\theta} q_{\mathrm{batch}}(\theta) & = & \mathrm{argmin}_{\theta}\left( \frac{\vert \theta \vert}{b} + \frac{N}{n} \sum_i^n \frac{(y_i - f(\mathbf{x}_i;\theta))^2}{2\sigma^2}\right) \\ & = & \mathrm{argmin}_{\theta}\left( \frac{\sigma^2}{N} \frac{\vert \theta \vert}{b} + \frac{1}{n} \sum_i^n \frac{(y_i - f(\mathbf{x}_i;\theta))^2}{2}\right) \\ & = & \mathrm{argmin}_{\theta}\left( \lambda \vert \theta \vert + \frac{1}{n} \sum_i^n \ell_i(y_i\vert \mathbf{x}_i;\theta) \right) \end{eqnarray}\]

if $\lambda = \sigma^2/(Nb)$ and
\(\ell_i(y_i\vert \mathbf{x}_i;\theta) = (y_i - f(\mathbf{x}_i;\theta))^2/2\).
In the machine learning community,
$\lambda$ is treated as a tunable hyperparameter and
$\ell_i(y_i\vert\mathbf{x}_i;\theta)$ is a useful abstraction called the *loss*.
The objective function to be minimized is called the *total loss* $L$,
as estimated from a batch of size $n$:

\[\begin{equation} L(\theta) = \lambda \vert \theta \vert + \frac{1}{n} \sum_i^n \ell_i(y_i\vert \mathbf{x}_i;\theta). \end{equation}\]

The gradient in this case, to be used in gradient descent, is

\[\begin{eqnarray} \nabla L(\theta) & = & \nabla \lambda \vert \theta \vert + \nabla \frac{1}{n} \sum_i^n \ell_i(y_i\vert\mathbf{x}_i;\theta) \\ & = & \lambda (2\times \mathbb{1}_{\theta > 0} - 1) + \frac{1}{n} \sum_i^n \nabla \ell_i(y_i\vert\mathbf{x}_i;\theta) \\ & = & \lambda (2\times \mathbb{1}_{\theta > 0} - 1) + \frac{1}{n} \sum_i^n 2 \mathbf{x}_i \frac{(f(\mathbf{x}_i;\theta) - y_i)}{2} \\ & = & \lambda (2\times \mathbb{1}_{\theta > 0} - 1) + \frac{1}{n} \sum_i^n \mathbf{x}_i (f(\mathbf{x}_i;\theta) - y_i) \end{eqnarray}\]

where $\mathbb{1}_{\theta > 0}$ is the indicator function equal to $1$ when $\theta > 0$ and $0$ otherwise, and where I’ve assumed a linear model $f(\mathbf{x}_i;\theta) = \theta^\intercal \mathbf{x}_i$ so that $\nabla f(\mathbf{x}_i;\theta) = \mathbf{x}_i$. Note that the $2$ in the denominator of this loss function canceled with the $2$ that fell out of the gradient operation. This cancellation is the only reason to keep the $2$ around in the denominator (and it was not a very good reason).
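A finite-difference sanity check of this gradient (a sketch assuming the linear model with scalar $\theta$, i.e. $f(x;\theta)=\theta x$, evaluated away from the kink at $\theta = 0$; the data here is invented):

```python
import numpy as np

rng = np.random.default_rng(3)

lam, n = 0.1, 64
x = rng.normal(size=n)
y = 1.5 * x + rng.normal(scale=0.3, size=n)

def total_loss(theta):
    # L(theta) = lam * |theta| + mean of squared residuals over 2.
    return lam * abs(theta) + np.mean((y - theta * x) ** 2 / 2)

def grad(theta):
    # Hand-derived gradient: lam * sign(theta) + mean of x * (f - y).
    return lam * np.sign(theta) + np.mean(x * (theta * x - y))

theta = 0.9  # away from the kink at theta = 0
eps = 1e-6
numeric = (total_loss(theta + eps) - total_loss(theta - eps)) / (2 * eps)

print(grad(theta), numeric)  # agree to several decimal places
```

At $\theta = 0$ exactly, $\vert\theta\vert$ is not differentiable and the sign term becomes a subgradient, which is why the check stays away from zero.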

# Conclusion

That’s it! I’ve shown in gory detail where the $1/n$ and the $2$ came from. As a bonus I showed that $\lambda$ can be expressed entirely in terms of other fundamental constants of the problem.

Here are some variants of this procedure to try so you can stay sharp and impress people at parties.

- Use a zero-centered Gaussian prior to reproduce Ridge regression.
- Use a Bernoulli likelihood function to reproduce (regularized) logistic regression.
- Think about L0 regularization in terms of its implied prior. What makes L0 regression intractable?
