by Heishiro Kanagawa

Categories

  • density estimation

Noise contrastive estimation (NCE) is a powerful estimation method proposed by Gutmann and Hyvärinen, 2010 (see also Gutmann and Hyvärinen, 2012). The method is known for enabling the training of density models whose normalising constants may be intractable. In the following, I will describe what it does. Let us assume that we have an i.i.d. sample \(\{x_1,\dots, x_n\}\subset \mathbb{R}^N\), \(N\geq1\), from a distribution defined by an unknown density function \(p_d\), which we model with another density \(p_m\). Intuitively, we want \(p_m\) to be close to \(p_d\) in some sense. In NCE, we introduce another dataset \(\{y_1,\dots, y_k\}\) by sampling from a known distribution \(p_n\), called a (contrastive) noise distribution. In short, the crux of NCE is to solve a binary classification problem in which we distinguish between the two datasets. More specifically, we train a classifier whose logit is given by the log-ratio \(\log(p_m/p_n)\) (up to an additive constant); since the optimal classifier depends on the ratio \(p_d/p_n\), training \(p_m\) this way should bring \(p_m\) close to \(p_d\). As we are training a density model, how can we understand this procedure in terms of distributional divergences?
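To make the classification view concrete, here is a minimal NumPy sketch of the resulting objective (my own illustration, not code from the papers; the callables `log_pm`, `log_pn` and the sample arrays are hypothetical placeholders). The classifier's logit is \(\log p_m - \log p_n\) shifted by the log of the noise-to-data ratio \(k/n\), and the loss is the usual binary cross entropy.

```python
import numpy as np

def nce_loss(log_pm, log_pn, x_data, y_noise):
    """Binary cross entropy for classifying data samples against noise samples.

    log_pm : callable returning the (possibly unnormalised) model log-density log p_m
    log_pn : callable returning the noise log-density log p_n (must be evaluable)
    x_data : array of n data samples; y_noise : array of k noise samples
    """
    nu = len(y_noise) / len(x_data)                 # noise-to-data ratio k / n
    # Log-odds that a point came from the data rather than the noise,
    # with the model p_m standing in for the unknown p_d.
    logit = lambda u: log_pm(u) - log_pn(u) - np.log(nu)
    log_sigmoid = lambda z: -np.logaddexp(0.0, -z)  # numerically stable log(sigmoid(z))
    # Data points should be labelled as data, noise points as noise.
    return -(np.mean(log_sigmoid(logit(x_data)))
             + nu * np.mean(log_sigmoid(-logit(y_noise))))
```

Minimising this in the parameters of `log_pm` (for example with automatic differentiation) is the classification problem described above; in NCE the normalising constant of \(p_m\) can be treated as an extra free parameter, which is why the method handles unnormalised models.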

Let us consider the following hypothetical generative process:

  1. Generate a binary label \(C\in\{0,1\}\) with \(\Pr(C=1)=q\).
  2. Generate \(x\) with \(\tilde{p}(x\rvert C)\), where \(\tilde{p}(x\rvert0)=p_m\) and \(\tilde{p}(x\rvert1)=p_n\).

Similarly, we consider the following process:

  1. Generate a binary label \(C\in\{0,1\}\) with \(\Pr(C=1)=q\).
  2. Generate \(x\) with \(p(x\rvert C)\), where \(p(x\rvert 0)=p_d\) and \(p(x\rvert 1)=p_n\).

These processes can be considered as two Bayesian models with different likelihood functions, defining two joint distributions \(p(x,c)\) and \(\tilde{p}(x,c)\). Unless our model \(p_m\) equals the data density \(p_d\), the likelihood functions above differ, and so do the corresponding posteriors over \(C\).
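For reference, Bayes' rule gives the two posteriors explicitly:

\[\begin{align} p(C=0\mid x) &= \frac{(1-q)\,p_d(x)}{(1-q)\,p_d(x)+q\,p_n(x)} = \frac{p_d(x)}{p_d(x)+\nu p_n(x)},\\ \tilde{p}(C=0\mid x) &= \frac{(1-q)\,p_m(x)}{(1-q)\,p_m(x)+q\,p_n(x)} = \frac{p_m(x)}{p_m(x)+\nu p_n(x)}, \end{align}\]

where \(\nu = q/(1-q)\), and \(p(C=1\mid x)\) and \(\tilde{p}(C=1\mid x)\) are the respective complements. These are the expressions used in the computation below.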

This construction might look somewhat arbitrary, but let us characterise the difference between the posteriors, which is easy since they are discrete distributions. We use the KL divergence for this:

\[\begin{align} \mathrm{KL} [p(\cdot|x)||\tilde{p}(\cdot|x)] &= \sum_{c\in\{0,1\}} p(c|x)\left[\log p(c|x) - \log \tilde{p}(c|x)\right ]\\ &={p_d(x)\over p_d(x) + \nu p_n(x)}\log {p_d(x) \over p_m(x)} + \log {\nu p_n(x) \over p_d(x) + \nu p_n(x)} - \log {\nu p_n(x) \over p_m(x) + \nu p_n (x)},\\ \end{align}\]

where \(\nu=q/(1-q)\). Taking the average w.r.t. \(p_d\) gives us

\[\begin{align} \mathbb{E}_{x\sim p_d}\bigl[\mathrm{KL} [p(\cdot|x)||\tilde{p}(\cdot|x)]\bigr] &= \int {p_d(x)\over p_d(x) + \nu p_n(x)}\left(\log {p_d(x) \over p_m(x) } \right)p_d(x)\mathrm{d}x\\ & \quad +\int\left(\log {\nu p_n(x) \over p_d(x) + \nu p_n(x)} - \log {\nu p_n(x) \over p_m(x) + \nu p_n (x)}\right)p_d(x)\mathrm{d}x.\\ \end{align}\]

What are these terms? The first can be thought of as a KL divergence between \(p_d\) and \(p_m\), weighted by \(p_d/(p_d+\nu p_n)\). The second is the log-difference between \(p(C=1|x)\) and \(\tilde{p}(C=1|x)\), averaged over \(x\). When \(p_n=0\) wherever \(p_d>0\) (i.e., \(p_n\) and \(p_d\) are mutually singular), the posterior KL divergence is zero: in that case \(p(C=0|x)=\tilde{p}(C=0|x)=1\) for \(p_d\)-almost every \(x\), so the pointwise KL vanishes. In other words, when the classification task is too easy (the distributions are completely separated), the KL does not provide a useful signal. On the other hand, in the case \(p_n=p_d\), we have

\[\begin{align} \mathbb{E}_{x\sim p_d}\bigl[\mathrm{KL} [p(\cdot|x)||\tilde{p}(\cdot|x)]\bigr] &= \frac{1}{1+\nu}\mathrm{KL}[p_d || p_m] \\ & \quad +\int \left\{\log\left(1+\nu \frac{p_d(x)}{p_m(x)}\right) - \log \left(\frac{p_d(x)}{p_m(x)}\right) - \log(1+\nu)\right\}p_d(x)\mathrm{d}x \end{align}\]

Note that the integrand of the second term can be negative where \(p_d(x)/p_m(x)>1\), so the second term itself may be negative. As intended in the construction of the posteriors, the KL divergence gives us a measure of the discrepancy between \(p_m\) and \(p_d\).
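As a quick numerical sanity check (my own, not from the post), the \(p_n = p_d\) identity above can be verified with Monte Carlo samples. The specific densities below, standard normal data and noise and a shifted Gaussian model with \(\nu = 1\), are arbitrary choices for illustration.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
nu = 1.0
x = rng.standard_normal(100_000)          # samples from p_d = p_n = N(0, 1)
log_pd = norm.logpdf(x, 0.0, 1.0)
log_pm = norm.logpdf(x, 0.5, 1.2)         # model p_m = N(0.5, 1.2^2)
ratio = np.exp(log_pd - log_pm)           # p_d(x) / p_m(x)

# Left-hand side: KL between the two Bernoulli posteriors, averaged over p_d.
p0 = 1.0 / (1.0 + nu)                     # p(C=0 | x), constant when p_n = p_d
p0_tilde = 1.0 / (1.0 + nu * ratio)       # p_m / (p_m + nu * p_d)
lhs = np.mean(p0 * np.log(p0 / p0_tilde)
              + (1 - p0) * np.log((1 - p0) / (1 - p0_tilde)))

# Right-hand side: weighted KL[p_d || p_m] plus the second term in the display above.
rhs = (np.mean(np.log(ratio)) / (1 + nu)
       + np.mean(np.log(1 + nu * ratio) - np.log(ratio) - np.log(1 + nu)))

print(lhs, rhs)   # the two values agree up to floating-point error
```

The two Monte Carlo averages agree exactly (up to rounding) because the identity holds pointwise, before the expectation is taken.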

Why did we bring this up? Because, roughly speaking, we can treat the expected KL divergence above as the objective used in NCE¹. Informally, suppose the sample sizes \(k, n\) tend to infinity with \(\nu = \lim_{k,n\to \infty} k/n\). Then the NCE objective is the average cross entropy (or negative log-likelihood)

\[-\mathbb{E}_{x} \left[\sum_c p(c|x) \log \tilde{p}(c|x)\right] = \mathbb{E}_x\,\mathrm{KL}[p(\cdot|x)\,\|\,\tilde{p}(\cdot|x)] + \mathrm{const},\]

which is equal to the expected KL divergence up to an additive constant that does not depend on the model \(p_m\). What we saw above is that when \(p_n\) is close to \(p_d\), this objective resembles the KL divergence between \(p_d\) and \(p_m\).
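To spell out the additive constant mentioned above: for each \(x\),

\[-\sum_{c\in\{0,1\}} p(c|x)\log \tilde{p}(c|x) = \mathrm{KL}[p(\cdot|x)\,\|\,\tilde{p}(\cdot|x)] - \sum_{c\in\{0,1\}} p(c|x)\log p(c|x),\]

and the last term, the negative entropy of the posterior \(p(\cdot|x)\), involves only \(p_d\), \(p_n\) and \(\nu\). Minimising the cross entropy over \(p_m\) is therefore the same as minimising the expected posterior KL.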

Connection to GANs

We have seen an interpretation of NCE in terms of the training of a density model \(p_m\) (or a classifier). The performance of NCE depends on the choice of \(p_n\), and, as we saw, the noise \(p_n\) should be chosen so that the classification task is difficult; that is, \(p_n\) should be similar to \(p_d\). We might then want to train \(p_n\) to meet this requirement: train \(p_n\) so that the classification becomes difficult, and train \(p_m\) to correctly discriminate between the real and noise data.
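Below is a toy sketch of what such an alternating scheme could look like; it is purely illustrative and not part of the original NCE or GAN papers. Everything here is a hypothetical choice: a one-dimensional Gaussian model \(p_m\) and Gaussian noise \(p_n\), both with learnable parameters, trained with PyTorch. The "discriminator" step fits \(p_m\) with the NCE classification loss, and the "generator" step moves \(p_n\) to make the classification harder.

```python
import math
import torch
import torch.nn.functional as F

torch.manual_seed(0)
x_data = 2.0 + 0.5 * torch.randn(2000)                   # samples from the unknown p_d

theta_m = torch.tensor([0.0, 0.0], requires_grad=True)   # mean and log-std of the model p_m
theta_n = torch.tensor([0.0, 0.0], requires_grad=True)   # mean and log-std of the noise p_n

def log_gauss(x, theta):
    mean, log_std = theta[0], theta[1]
    return -0.5 * ((x - mean) / log_std.exp()) ** 2 - log_std - 0.5 * math.log(2 * math.pi)

def nce_loss(x, y, theta_m, theta_n):
    # Binary cross entropy for classifying data (x) against noise (y).
    nu = y.numel() / x.numel()
    logit = lambda u: log_gauss(u, theta_m) - log_gauss(u, theta_n) - math.log(nu)
    return -(F.logsigmoid(logit(x)).mean() + nu * F.logsigmoid(-logit(y)).mean())

opt_m = torch.optim.Adam([theta_m], lr=1e-2)
opt_n = torch.optim.Adam([theta_n], lr=1e-2)

for step in range(3000):
    # Noise samples from the current p_n, reparameterised so gradients reach theta_n.
    y_noise = theta_n[0] + theta_n[1].exp() * torch.randn(2000)

    # (1) "Discriminator" step: fit p_m to distinguish data from the (fixed) noise.
    loss_m = nce_loss(x_data, y_noise.detach(), theta_m, theta_n.detach())
    opt_m.zero_grad(); loss_m.backward(); opt_m.step()

    # (2) "Generator" step: update p_n to make the classification task harder,
    #     i.e. ascend the same classification loss with respect to theta_n.
    loss_n = nce_loss(x_data, y_noise, theta_m.detach(), theta_n)
    opt_n.zero_grad(); (-loss_n).backward(); opt_n.step()
```

Unlike in a standard GAN, the classifier here is built from \(p_m\) and \(p_n\) themselves rather than being a separate network, so the analogy is loose; the sketch only illustrates the alternating objectives, not a recommended training procedure.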

You might have noticed that this has a flavour similar to Generative Adversarial Networks (GANs). We can think of the noise distribution \(p_n\) as a generator and the classifier (or \(p_m\)) as a discriminator. One difference between GAN and NCE training is that the NCE framework requires the noise to have an evaluable density function, whereas GANs can use implicit generators. The connection between GANs and NCE is mentioned in the original GAN paper by Goodfellow et al., 2014. I was not aware of this point prior to writing this post (thanks to Michael Arbel for his feedback!).

  1. Technically, the empirical version of the KL divergence above differs from the NCE objective (see Equation (8) in Gutmann and Hyvärinen, 2012).