Negative Log Likelihood Loss Function

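Abstract

The goal of this article is to express, in matrix form, the negative log-likelihood loss for binary classification and the gradient of that loss with respect to the parameters.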

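0, 1 encoding

Suppose we encode the negative and positive classes as \(0\) and \(1\) respectively. Let \(X\) be the \(m \times n\) matrix whose \(i\)-th row is \({\bf x^{(i)}}^T\), let \({\bf y}\) be the \(m \times 1\) vector of labels, and let \({\bf 1}\) be the \(m \times 1\) all-ones vector; \(\log\), \(\exp\) and division of vectors are applied elementwise. The Negative Log Likelihood is: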

\[ \begin{align*} \text{NLL} &= - \Big(\sum_{i=1}^m y^{(i)}\log(\frac{1}{1+\exp(-{\bf w}^T{\bf x^{(i)}})})+(1-y^{(i)})\log(1-\frac{1}{1+\exp(-{\bf w}^T{\bf x^{(i)}})})\Big)\\ &= - \Big(\sum_{i=1}^m -y^{(i)}\log\big(1+\exp(-{\bf w}^T{\bf x^{(i)}})\big)+(1-y^{(i)})\log(\frac{\exp(-{\bf w}^T{\bf x^{(i)}})}{1+\exp(-{\bf w}^T{\bf x^{(i)}})})\Big)\\ &= - \Big(\sum_{i=1}^m -y^{(i)}\log\big(1+\exp(-{\bf w}^T{\bf x^{(i)}})\big)-y^{(i)}\log(\frac{\exp(-{\bf w}^T{\bf x^{(i)}})}{1+\exp(-{\bf w}^T{\bf x^{(i)}})})+\log(\frac{\exp(-{\bf w}^T{\bf x^{(i)}})}{1+\exp(-{\bf w}^T{\bf x^{(i)}})})\Big)\\ &= - \bigg(\sum_{i=1}^m -y^{(i)}\log\Big(\big(1+\exp(-{\bf w}^T{\bf x^{(i)}})\big)\times\frac{\exp(-{\bf w}^T{\bf x^{(i)}})}{1+\exp(-{\bf w}^T{\bf x^{(i)}})}\Big)+\log(\frac{1}{1+\exp({\bf w}^T{\bf x^{(i)}})})\bigg)\\ &=- \Big(\sum_{i=1}^m -y^{(i)}\log\big(\exp(-{\bf w}^T{\bf x^{(i)}})\big) + \log(\frac{1}{1+\exp({\bf w}^T{\bf x^{(i)}})})\Big)\\ &=- \big(\sum_{i=1}^m y^{(i)}{\bf w}^T{\bf x^{(i)}} + \log(\frac{1}{1+\exp({\bf w}^T{\bf x^{(i)}})})\big)\\ &= \sum_{i=1}^m \Big(-y^{(i)}{\bf w}^T{\bf x^{(i)}} + \log\big(1+\exp({\bf w}^T{\bf x^{(i)}})\big)\Big)\\ &=-{\bf y}^TX{\bf w} + {\bf 1}^T\log\big(1+\exp(X{\bf w})\big) \end{align*} \]
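As a quick sanity check on the derivation above, the following minimal sketch compares the first line (the per-sample definition) with the last line (the matrix form) on random data. The function names `nll_definition` and `nll_matrix`, the data shapes, and the random seed are illustrative assumptions, not part of the original derivation.

```python
import numpy as np

def sigmoid(z):
    """Elementwise logistic function 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

def nll_definition(w, X, y):
    """First line of the derivation: per-sample sum with labels in {0, 1}."""
    total = 0.0
    for x_i, y_i in zip(X, y):
        p = sigmoid(w @ x_i)
        total -= y_i * np.log(p) + (1.0 - y_i) * np.log(1.0 - p)
    return float(total)

def nll_matrix(w, X, y):
    """Last line of the derivation: -y^T X w + 1^T log(1 + exp(X w))."""
    z = X @ w
    # np.logaddexp(0, z) evaluates log(exp(0) + exp(z)) = log(1 + exp(z)) elementwise
    return float(-y @ z + np.sum(np.logaddexp(0.0, z)))

# Hypothetical toy data: m = 6 samples, n = 3 features
rng = np.random.default_rng(0)
X = rng.normal(size=(6, 3))
y = rng.integers(0, 2, size=6).astype(float)
w = rng.normal(size=3)

loss_definition = nll_definition(w, X, y)
loss_matrix = nll_matrix(w, X, y)
print(loss_definition, loss_matrix)
```

The two printed values should agree up to floating-point rounding, which is what the chain of equalities claims.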

The differential is:
\[ \begin{align*}
\text{d}(\text{NLL}) &= \text{d}\Big(-{\bf y}^TX{\bf w} + {\bf 1}^T\log\big(1+\exp(X{\bf w})\big)\Big)\\
&=\text{d}(-{\bf y}^TX{\bf w}) + \text{d}\Big({\bf 1}^T\log\big(1+\exp(X{\bf w})\big)\Big)\\
&=-{\bf y}^TX\text{d}{\bf w} + {\bf 1}^T\Big(\frac{1}{1 + \exp(X{\bf w})} \odot \big(\exp(X{\bf w})\odot (X\text{d}{\bf w})\big)\Big)\\
&=-{\bf y}^TX\text{d}{\bf w} + {\bf 1}^T \Big(\frac{\exp(X{\bf w})}{1 + \exp(X{\bf w})}\odot (X\text{d}{\bf w})\Big)\\
&= -{\bf y}^TX\text{d}{\bf w} + {\bf 1}^T \Big(\frac{1}{1 + \exp(-X{\bf w})}\odot (X\text{d}{\bf w})\Big)\\
&= \text{tr}(-{\bf y}^TX\text{d}{\bf w}) + \text{tr}\Big({\bf 1}^T \Big(\frac{1}{1 + \exp(-X{\bf w})}\odot (X\text{d}{\bf w})\Big)\Big)\\
&= \text{tr}(-{\bf y}^TX\text{d}{\bf w}) + \text{tr}\Big(\Big({\bf 1} \odot \frac{1}{1 + \exp(-X{\bf w})}\Big)^T (X\text{d}{\bf w})\Big)\\
&= \text{tr}(-{\bf y}^TX\text{d}{\bf w}) + \text{tr}\Big(\Big(\frac{1}{1 + \exp(-X{\bf w})}\Big)^T X\text{d}{\bf w}\Big)\\
&= \text{tr}\Big(\Big(\frac{1}{1 + \exp(-X{\bf w})}- {\bf y}\Big)^TX\text{d}{\bf w}\Big)\\
&= \text{tr}\Big(\Big(X^T\Big(\frac{1}{1 + \exp(-X{\bf w})}- {\bf y}\Big)\Big)^T\text{d}{\bf w}\Big)
\end{align*} \]
The step that moves the Hadamard product across the transpose uses the identity \({\bf a}^T({\bf b} \odot {\bf c}) = ({\bf a} \odot {\bf b})^T{\bf c}\) for vectors of the same shape. Note that here we treat \(\bf w\) as an \(n \times 1\) matrix, so its gradient is also an \(n\times 1\) matrix; that is, we use the denominator layout, in which \(\text{d}(\text{NLL}) = \text{tr}\big((\nabla_{\bf w} \text{NLL})^T\text{d}{\bf w}\big)\).

Therefore,

the gradient is: \[ \begin{align*} \nabla_{\bf w} \text{NLL} = X^T\Big(\frac{1}{1 + \exp(-X{\bf w})}- {\bf y}\Big) \end{align*} \]
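One way to gain confidence in this gradient is to compare it with a central finite-difference approximation of the matrix-form loss. The sketch below does that on random data; the helper `numerical_grad`, the step size `eps`, and the toy data are assumptions made for illustration only.

```python
import numpy as np

def sigmoid(z):
    """Elementwise logistic function 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

def nll(w, X, y):
    """Matrix-form loss for {0, 1} labels: -y^T X w + 1^T log(1 + exp(X w))."""
    z = X @ w
    return float(-y @ z + np.sum(np.logaddexp(0.0, z)))

def nll_grad(w, X, y):
    """Analytic gradient X^T (sigmoid(X w) - y), same shape as w."""
    return X.T @ (sigmoid(X @ w) - y)

def numerical_grad(f, w, eps=1e-6):
    """Central finite differences, one coordinate of w at a time."""
    g = np.zeros_like(w)
    for j in range(w.size):
        step = np.zeros_like(w)
        step[j] = eps
        g[j] = (f(w + step) - f(w - step)) / (2.0 * eps)
    return g

# Hypothetical toy data: m = 8 samples, n = 4 features
rng = np.random.default_rng(1)
X = rng.normal(size=(8, 4))
y = rng.integers(0, 2, size=8).astype(float)
w = rng.normal(size=4)

grad_analytic = nll_grad(w, X, y)
grad_numeric = numerical_grad(lambda v: nll(v, X, y), w)
max_abs_diff = float(np.max(np.abs(grad_analytic - grad_numeric)))
print(max_abs_diff)
```

A small `max_abs_diff` indicates that the analytic expression and the numerical estimate agree; it does not replace the derivation, but it catches sign and transpose mistakes.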

-1, 1 encoding

Suppose we instead encode the negative and positive classes as \(-1\) and \(1\) respectively. The loss function is: \[ \begin{align*} \text{NLL} &= - \sum_{i=1}^m \log\Big(\frac{1}{1+\exp(-y^{(i)}{\bf w}^T{\bf x^{(i)}})}\Big)\\ &= \sum_{i=1}^m \log\big(1+\exp(-y^{(i)}{\bf w}^T{\bf x^{(i)}})\big)\\ &= {\bf 1}^T\log\Big(1+ \exp\big(-{\bf y} \odot (X{\bf w})\big)\Big) \end{align*} \]
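The two encodings describe the same model, so the two loss expressions should give the same value once the labels are mapped by \(y_{\pm} = 2y_{01} - 1\). The sketch below checks this numerically; the function names and the random data are illustrative assumptions.

```python
import numpy as np

def nll_01(w, X, y01):
    """Loss for labels in {0, 1}: -y^T X w + 1^T log(1 + exp(X w))."""
    z = X @ w
    return float(-y01 @ z + np.sum(np.logaddexp(0.0, z)))

def nll_pm1(w, X, ypm):
    """Loss for labels in {-1, 1}: 1^T log(1 + exp(-y ⊙ (X w)))."""
    z = X @ w
    # ypm * z is the elementwise (Hadamard) product y ⊙ (X w)
    return float(np.sum(np.logaddexp(0.0, -ypm * z)))

# Hypothetical toy data: m = 7 samples, n = 3 features
rng = np.random.default_rng(2)
X = rng.normal(size=(7, 3))
y01 = rng.integers(0, 2, size=7).astype(float)
ypm = 2.0 * y01 - 1.0  # map 0 -> -1 and 1 -> +1
w = rng.normal(size=3)

loss_01 = nll_01(w, X, y01)
loss_pm1 = nll_pm1(w, X, ypm)
print(loss_01, loss_pm1)
```

Both values should match up to rounding, since \(\log(1+e^{-z}) = -z + \log(1+e^{z})\) covers the positive class and the negative-class terms are identical.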

The differential is:
\[ \begin{align*}
\text{d}(\text{NLL}) &= \text{d}\bigg({\bf 1}^T\log\Big(1+ \exp\big(-{\bf y} \odot (X{\bf w})\big)\Big)\bigg)\\
&= {\bf 1}^T\bigg(\frac{1}{1+ \exp\big(-{\bf y} \odot (X{\bf w})\big)} \odot\Big(\exp\big(-{\bf y} \odot (X{\bf w})\big) \odot \big(-{\bf y} \odot (X\text{d}{\bf w})\big)\Big)\bigg)\\
&= {\bf 1}^T\Big(-{\bf y} \odot\frac{1}{1+\exp\big({\bf y} \odot (X{\bf w})\big)}\odot (X\text{d}{\bf w})\Big)\\
&= \text{tr}\bigg({\bf 1}^T\Big(-{\bf y} \odot\frac{1}{1+\exp\big({\bf y} \odot (X{\bf w})\big)}\odot (X\text{d}{\bf w})\Big)\bigg)\\
&= \text{tr}\bigg(\Big(-{\bf 1} \odot {\bf y} \odot \frac{1}{1+\exp\big({\bf y} \odot (X{\bf w})\big)}\Big)^TX\text{d}{\bf w}\bigg)\\
&= \text{tr}\bigg(\Big(X^T\Big(-{\bf y} \odot \frac{1}{1+\exp\big({\bf y} \odot (X{\bf w})\big)}\Big)\Big)^T\text{d}{\bf w}\bigg)
\end{align*} \]

Therefore,

the gradient is: \[ \begin{align*} \nabla_{\bf w} \text{NLL} = X^T\Big(-{\bf y} \odot \frac{1}{1+\exp\big({\bf y} \odot (X{\bf w})\big)}\Big) \end{align*} \]
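As with the 0, 1 encoding, this gradient can be checked against central finite differences, and it should also coincide with \(X^T(\sigma(X{\bf w}) - {\bf y}_{01})\) when \({\bf y}_{01} = ({\bf y} + {\bf 1})/2\). The following sketch performs both checks; the helper names, step size, and toy data are assumptions for illustration.

```python
import numpy as np

def nll_pm1(w, X, y):
    """Loss for labels in {-1, 1}: 1^T log(1 + exp(-y ⊙ (X w)))."""
    return float(np.sum(np.logaddexp(0.0, -y * (X @ w))))

def grad_pm1(w, X, y):
    """Gradient for labels in {-1, 1}: X^T (-y ⊙ 1 / (1 + exp(y ⊙ (X w))))."""
    z = X @ w
    return X.T @ (-y / (1.0 + np.exp(y * z)))

def grad_01(w, X, y01):
    """Gradient for labels in {0, 1}: X^T (sigmoid(X w) - y)."""
    return X.T @ (1.0 / (1.0 + np.exp(-(X @ w))) - y01)

def numerical_grad(f, w, eps=1e-6):
    """Central finite differences, one coordinate of w at a time."""
    g = np.zeros_like(w)
    for j in range(w.size):
        step = np.zeros_like(w)
        step[j] = eps
        g[j] = (f(w + step) - f(w - step)) / (2.0 * eps)
    return g

# Hypothetical toy data: m = 8 samples, n = 4 features
rng = np.random.default_rng(3)
X = rng.normal(size=(8, 4))
y01 = rng.integers(0, 2, size=8).astype(float)
ypm = 2.0 * y01 - 1.0  # map 0 -> -1 and 1 -> +1
w = rng.normal(size=4)

g_pm1 = grad_pm1(w, X, ypm)
g_num = numerical_grad(lambda v: nll_pm1(v, X, ypm), w)
g_01 = grad_01(w, X, y01)
print(np.max(np.abs(g_pm1 - g_num)), np.max(np.abs(g_pm1 - g_01)))
```

Both printed differences should be close to zero: the first confirms the differential above, the second that the two encodings lead to the same gradient.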