Machine Learning Likelihood, Loss, Gradient, and Hessian Cheat Sheet

Cheat sheet for likelihoods, loss functions, gradients, and Hessians. This is a living document that I’ll update over time.

Motivating theory

Bayes’ theorem

Bayes’ theorem tells us that the posterior probability of a hypothesis $H$ given data $D$ is

\begin{equation} P(H|D) = \frac{P(H) P(D|H)}{P(D)}, \end{equation}


  • $P(H \vert D)$ is the posterior probability of the (variable) hypothesis given the (fixed) observed data
  • $P(H)$ is the prior probability of the hypothesis
  • $P(D \vert H)$ is the likelihood $\mathcal{L}$, the probability that the observed data was generated by $H$
  • $P(D)$ is the marginal likelihood, usually discarded because it’s not a function of $H$.

In supervised machine learning, models are hypotheses and data are $y_i | \mathbf{x}_i$ label-feature vector tuples.

We’re looking for the best model, which maximizes the posterior probability. If the prior is flat (constant in $H$), this reduces to likelihood maximization.

If the prior on model parameters is normal you get Ridge regression. If the prior on model parameters is Laplace distributed you get LASSO.
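The ridge correspondence is easy to check numerically. A minimal sketch (all variable names and data are mine, for illustration): gradient descent on the squared-error objective plus a Gaussian-prior penalty recovers the closed-form ridge solution.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=100)
lam = 1.0  # ridge penalty strength (inverse prior variance)

# Closed-form ridge solution: (X^T X + lam*I)^-1 X^T y
beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(3), X.T @ y)

# Gradient descent on the negative log-posterior:
# 1/2 ||y - X beta||^2 + lam/2 ||beta||^2 (normal likelihood, normal prior)
beta = np.zeros(3)
for _ in range(20000):
    grad = X.T @ (X @ beta - y) + lam * beta
    beta -= 1e-3 * grad

assert np.allclose(beta, beta_ridge, atol=1e-6)
```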

Gradient descent

Objectives are derived as the negative of the log-likelihood function. Objectives with regularization can be thought of as the negative of the log-posterior probability function, but I’ll be ignoring regularizing priors here.

The objective can also be expressed as the mean of a loss function $\ell$ over data points.

\[L = -\log{\mathcal{L}} = \frac{1}{N}\sum_i^{N} \ell_i.\]

In linear regression, gradient descent happens in parameter space

For linear models like least-squares and logistic regression,

\[\ell_i = \ell(f(\beta; \mathbf{x}_i))\]


\[f(\beta; \mathbf{x}_i) = \mathbf{x}_i^T \mathbf{\beta},\]

$\beta$ is the vector of coefficients and $\mathbf{x}_i$ is the $i$-th feature vector. This formulation supports a y-intercept or offset term by defining $x_{i,0} = 1$. The rest of the entries $x_{i,j}: j>0$ are the model features.

Gradient descent minimization methods make use of the first partial derivative.

\[\begin{equation} \ell^{\prime} = \frac{\partial \ell}{\partial \mathbf{\beta}} = \mathbf{x}_i \frac{\partial \ell}{\partial f} \end{equation}\]

Some gradient descent variants, like Newton-Raphson, use the second partial derivative or Hessian.

\[\begin{equation} \ell^{\prime\prime} = \frac{\partial^2 \ell}{\partial \mathbf{\beta}^2} = \mathbf{x}_i \mathbf{x}_i^T \frac{\partial^2 \ell}{\partial f^2} \end{equation}\]
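The chain-rule factorization can be sanity-checked against finite differences; a minimal sketch with made-up numbers, using square loss as the outer $\ell$ (note the Hessian factor is the outer product $\mathbf{x}_i \mathbf{x}_i^T$):

```python
import numpy as np

def sq_loss(beta, x, y):
    # per-point square loss for a linear model f = x . beta
    return (y - x @ beta) ** 2

x = np.array([1.0, 2.0, -1.0])   # feature vector (first entry = intercept)
y = 0.5
beta = np.array([0.3, -0.7, 1.2])

f = x @ beta
grad = x * (2 * (f - y))         # x_i times dl/df
hess = np.outer(x, x) * 2.0      # x_i x_i^T times d2l/df2

# Finite-difference check of the gradient
eps = 1e-6
num_grad = np.array([
    (sq_loss(beta + eps * e, x, y) - sq_loss(beta - eps * e, x, y)) / (2 * eps)
    for e in np.eye(3)
])
assert np.allclose(grad, num_grad, atol=1e-5)
```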

In gradient boosting, gradient descent happens in function space

In gradient boosting,

\[\begin{equation} \ell_i = \ell(f(\mathbf{x}_i)) \end{equation}\]

where optimization is done over the set of different functions $\{f\}$ in functional space rather than over parameters of a single linear function. In this case the gradient is taken w.r.t. the function $f$.

\[\begin{equation} \ell^{\prime} = \frac{\partial \ell}{\partial f} \end{equation}\]

and the Hessian is

\[\begin{equation} \ell^{\prime\prime} = \frac{\partial^2 \ell}{\partial f^2}. \end{equation}\]

All derivatives below will be computed with respect to $f$. If you are using them in a gradient boosting context, this is all you need. If you are using them in a linear model context, you need to multiply the gradient and Hessian by $\mathbf{x}_i$ and the outer product $\mathbf{x}_i \mathbf{x}_i^T$, respectively.

Likelihood, loss, gradient, Hessian

The loss is the negative log-likelihood for a single data point.

Square loss

Used in continuous variable regression problems.


Start by asserting normally distributed errors.

\[\begin{equation} \prod_{i=1}^N\frac{1}{\sigma\sqrt{2\pi}}\exp{-\frac{(y_i - f(\mathbf{x}_i))^2}{2\sigma^2}} \end{equation}\]


Taking the negative log and dropping terms that do not depend on $f$ gives the per-point loss, up to the constant factor $1/(2\sigma^2)$:

\[\begin{equation} \ell = (y_i - f(\mathbf{x}_i))^2 \end{equation}\]


\[\begin{equation} \frac{\partial \ell}{\partial f} = 2(f(\mathbf{x}_i) - y_i) \end{equation}\]


\[\begin{equation} \frac{\partial^2 \ell}{\partial f^2} = 2 \end{equation}\]
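Because the Hessian is the constant 2, a single Newton-Raphson step from any starting point lands exactly on the sample mean, the minimizer of the average square loss. A tiny sketch (numbers are arbitrary):

```python
import numpy as np

y = np.array([1.0, 2.0, 4.0])
f = 0.0                          # initial constant prediction
grad = np.mean(2 * (f - y))      # average gradient of the square loss
hess = 2.0                       # constant Hessian
f_new = f - grad / hess          # one Newton-Raphson step

# one step lands on the mean, the exact minimizer
assert np.isclose(f_new, y.mean())
```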

Log loss

Used in binary classification problems.


Start by asserting binary outcomes are Bernoulli distributed.

\begin{equation} \prod_{i=1}^N p(\mathbf{x}_i)^{y_i} (1 - p(\mathbf{x}_i))^{1 - {y_i}} \end{equation}

The model in this case is a function with unbounded range $f \in (-\infty, \infty)$ that maps to the Bernoulli probability parameter $p$ via the log-odds or “logit” link function.

\begin{equation} f(\mathbf{x}_i) = \log{\frac{p(\mathbf{x}_i)}{1 - p(\mathbf{x}_i)}} \end{equation}

This formulation maps the unbounded hypotheses onto probabilities $p \in (0, 1)$; just solve for $p$:

\begin{equation} p(\mathbf{x}_i) = \frac{1}{1 + \exp{(-f(\mathbf{x}_i))}} \end{equation}
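A minimal sketch (function names mine) showing that the logit link and the sigmoid are inverses of each other:

```python
import math

def logit(p):
    # log-odds link: maps (0, 1) to (-inf, inf)
    return math.log(p / (1 - p))

def sigmoid(f):
    # inverse link: maps (-inf, inf) back to (0, 1)
    return 1 / (1 + math.exp(-f))

# round-trips recover the input in both directions
assert math.isclose(sigmoid(logit(0.8)), 0.8)
assert math.isclose(logit(sigmoid(-1.3)), -1.3)
```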


For labels following the binary indicator convention $y \in \{0, 1\}$, all of the following are equivalent. The easiest way to prove they are equivalent is to plug in $y = 0$ and $y = 1$ and rearrange.

\[\begin{equation} \begin{split} \ell & = -y_i\log{p(\mathbf{x}_i)} - (1 - y_i)\log{(1 - p(\mathbf{x}_i))} \\ & = y_i \log{(1 + \exp{(-f(\mathbf{x}_i))})} \\ & \qquad + (1 - y_i) \log{(1 + \exp{(f(\mathbf{x}_i))})} \\ & = - y_i f(\mathbf{x}_i) + \log{(1 + \exp{(f(\mathbf{x}_i))})} \end{split} \end{equation}\]

The first form is useful if you want to use different link functions.
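A quick numerical confirmation that the three forms agree for both label values; a sketch (`log1p` is used for mild numerical stability):

```python
import numpy as np

f = 0.7
p = 1 / (1 + np.exp(-f))  # sigmoid of the model output

for y in (0.0, 1.0):
    form1 = -y * np.log(p) - (1 - y) * np.log(1 - p)
    form2 = y * np.log1p(np.exp(-f)) + (1 - y) * np.log1p(np.exp(f))
    form3 = -y * f + np.log1p(np.exp(f))
    assert np.isclose(form1, form2) and np.isclose(form1, form3)
```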

For labels following the transformed convention $z = 2y-1 \in \{-1, 1\}$:

\[\begin{equation} \ell = \log{(1 + \exp{(-z f(\mathbf{x}_i))})} \end{equation}\]


\[\begin{equation} \frac{\partial \ell}{\partial f} = p(\mathbf{x}_i) - y_i \end{equation}\]


\[\begin{equation} \frac{\partial^2 \ell}{\partial f^2} = p(\mathbf{x}_i)(1 - p(\mathbf{x}_i)) \end{equation}\]
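The gradient and Hessian can be checked against finite differences of the third log-loss form above; a sketch with arbitrary values:

```python
import numpy as np

def log_loss(f, y):
    # third form: -y*f + log(1 + exp(f))
    return -y * f + np.log1p(np.exp(f))

f, y = 0.3, 1.0
p = 1 / (1 + np.exp(-f))

eps = 1e-6
num_grad = (log_loss(f + eps, y) - log_loss(f - eps, y)) / (2 * eps)
assert np.isclose(num_grad, p - y, atol=1e-6)

eps = 1e-4  # larger step for the noisier second difference
num_hess = (log_loss(f + eps, y) - 2 * log_loss(f, y) + log_loss(f - eps, y)) / eps**2
assert np.isclose(num_hess, p * (1 - p), atol=1e-6)
```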

Quantile regression


I have not yet seen somebody write down a motivating likelihood function for quantile regression loss, though it can be obtained as the negative log-likelihood of an asymmetric Laplace distribution.


Sometimes called the pinball loss.

\[\begin{equation} \begin{split} \ell & = (y_i - f(\mathbf{x}_i)) ( \tau - \mathbb{1}_{y_i < f(\mathbf{x}_i)} ) \\ & = \sum_{y_i \geq f(\mathbf{x}_i)}\tau (y_i - f(\mathbf{x}_i)) \\ & \qquad - \sum_{y_i < f(\mathbf{x}_i)}(1 - \tau) (y_i - f(\mathbf{x}_i)) \end{split} \end{equation}\]


\[\begin{equation} \begin{split} \frac{\partial \ell}{\partial f} & = - ( \tau - \mathbb{1}_{y_i < f(\mathbf{x}_i)} ) \\ & = - \sum_{y_i \geq f(\mathbf{x}_i)}\tau + \sum_{y_i < f(\mathbf{x}_i)}(1 - \tau) \end{split} \end{equation}\]


\[\begin{equation} \frac{\partial^2 \ell}{\partial f^2} = 0 \end{equation}\]
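A sketch of the pinball loss and its (sub)gradient with the convention above; at the kink $y = f$ the derivative is undefined, and this sketch takes the $y \geq f$ branch there (function names mine):

```python
import numpy as np

def pinball(y, f, tau):
    # tau*(y - f) when y >= f, (1 - tau)*(f - y) when y < f
    u = y - f
    return np.where(u >= 0, tau * u, (tau - 1.0) * u)

def pinball_grad(y, f, tau):
    # subgradient w.r.t. f; at y == f the indicator is 0 (y >= f branch)
    return -(tau - (y < f).astype(float))

y = np.array([1.0, 2.0, 5.0])
f = 2.5
assert np.allclose(pinball(y, f, 0.5), [0.75, 0.25, 1.25])
assert np.allclose(pinball_grad(y, f, 0.5), [0.5, 0.5, -0.5])
```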

Mean absolute deviation

Mean absolute deviation is quantile regression at $\tau=0.5$, up to an overall factor of two.

Cox proportional hazards


Start from the Cox proportional hazards partial likelihood function. The partial likelihood is, as you might guess, just part of a larger likelihood, but it is sufficient for maximum likelihood estimation and therefore regression.

\[\begin{equation} L(f) = \prod_{i:C_i = 1} \frac{\exp{f_i}}{\sum_{j:t_j \geq t_i} \exp{f_j}} \end{equation}\]

Using the analogy of subscribers to a business who may or may not renew from period to period, the terminology of survival analysis is as follows.

  • $i$ and $j$ index users.
  • $C_i = 1$ is a cancelation or churn event for user $i$ at time $t_i$
  • $C_i = 0$ is a renewal or survival event for user $i$ at time $t_i$
  • Subscribers $i:C_i = 1$ are users who canceled at time $t_i$.
  • $j:t_j \geq t_i$ are users who have survived up to and including time $t_i$, which is the instant before subscriber $i$ canceled their subscription and churned out of the business. This is called the risk set, because they are the users at risk of canceling at the time user $i$ canceled. The risk set includes user $i$.

In clinical studies, users are subjects and churn is non-survival, i.e. death.


\[\begin{equation} \ell_i = \delta_i \left[ - f_i + \log{\sum_{j:t_j \geq t_i} \exp{f_j}} \right] \end{equation}\]

where $\delta_i$ is the churn/death indicator, identical to $C_i$ above.


The efficient algorithm to compute the gradient and Hessian involves ordering the $n$ survival data points, indexed by $i$, by time $t_i$. This turns $n^2$ time complexity into $n\log{n}$ for the sort followed by $n$ for the progressive total-loss compute (ref).
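The sort-then-accumulate idea can be sketched as follows (helper name mine; this assumes distinct event times, and ties would need extra grouping):

```python
import numpy as np

def riskset_sums(t, f):
    """Risk-set sums S_i = sum over {j : t_j >= t_i} of exp(f_j),
    via one sort and a running cumulative sum (O(n log n) total)."""
    order = np.argsort(-t)                 # indices by descending time
    running = np.cumsum(np.exp(f[order]))  # running risk-set total
    S = np.empty_like(f)
    S[order] = running
    return S

t = np.array([3.0, 1.0, 2.0, 5.0])
f = np.array([0.1, -0.2, 0.3, 0.0])

# check against the brute-force O(n^2) definition
brute = np.array([np.exp(f[t >= ti]).sum() for ti in t])
assert np.allclose(riskset_sums(t, f), brute)
```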

For linear regression, the gradient for instance $i$ is

\[\begin{equation} \frac{\partial \ell_i}{\partial \beta} = \delta_i \left[ - \mathbf{x}_i + \frac{\sum_{j:t_j \geq t_i} \mathbf{x}_j \exp{f(\beta; \mathbf{x}_j)}}{\sum_{j:t_j \geq t_i} \exp{f(\beta; \mathbf{x}_j)}} \right] \end{equation}\]

For gradient boosting, the gradient of the total loss with respect to $f(\mathbf{x}_i)$ collects a contribution from every event whose risk set contains instance $i$:

\[\begin{equation} \frac{\partial L}{\partial f(\mathbf{x}_i)} = - \delta_i + \sum_{j=1}^n \delta_j \mathbb{1}_{t_i \geq t_j} \frac{\exp{(f(\mathbf{x}_i))}}{\sum_{r=1}^n \mathbb{1}_{t_r \geq t_j} \exp{(f(\mathbf{x}_r))}} \end{equation}\]
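As a sanity check, the boosting gradient of the total partial-likelihood loss can be verified against finite differences; a brute-force $O(n^2)$ sketch with made-up data (function names mine):

```python
import numpy as np

def cox_loss(f, t, d):
    # total negative partial log-likelihood
    S = np.array([np.exp(f[t >= ti]).sum() for ti in t])
    return np.sum(d * (-f + np.log(S)))

def cox_grad(f, t, d):
    S = np.array([np.exp(f[t >= ti]).sum() for ti in t])
    # each event j adds exp(f_i)/S_j for every i in its risk set (t_i >= t_j)
    return -d + np.exp(f) * np.array([np.sum(d * (ti >= t) / S) for ti in t])

t = np.array([3.0, 1.0, 2.0, 5.0])   # survival times
d = np.array([1.0, 0.0, 1.0, 1.0])   # churn/death indicators
f = np.array([0.1, -0.2, 0.3, 0.0])

eps = 1e-6
num = np.array([
    (cox_loss(f + eps * e, t, d) - cox_loss(f - eps * e, t, d)) / (2 * eps)
    for e in np.eye(4)
])
assert np.allclose(cox_grad(f, t, d), num, atol=1e-6)
```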


To be written.


Further reading