Sampling from a multivariate discrete distribution
Generating random samples is a key step in Monte Carlo simulations. Examples include computing the expectation or the variance of a random variable and estimating quantiles (and synthesising fancy, realistic images, as pursued by many people these days). In this post we consider sampling from discrete distributions, namely those defined on finite sets. In a nutshell, this post is about approximating an (unnormalised) distribution with a tractable one, a common setting in Bayesian inference.
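As a minimal illustration of the setting (the weight function below is made up for the example), here is how one might sample from a distribution on a finite set given only unnormalised weights, and use the samples for Monte Carlo estimates of an expectation and a variance:

```python
import numpy as np

rng = np.random.default_rng(0)

# A discrete distribution on a finite set, specified only up to a constant:
# unnormalised weights w(i) over states i = 0, ..., 9 (a toy choice).
states = np.arange(10)
w = np.exp(-0.5 * (states - 4.0) ** 2)  # any nonnegative weights will do

# Normalise the weights and draw i.i.d. samples.
p = w / w.sum()
samples = rng.choice(states, size=10_000, p=p)

# Monte Carlo estimates versus the exact values.
mean_exact = (states * p).sum()
var_exact = (states ** 2 * p).sum() - mean_exact ** 2
print("E[X]   ~", samples.mean(), " (exact:", mean_exact, ")")
print("Var[X] ~", samples.var(), " (exact:", var_exact, ")")
```

Of course, normalising the weights is only tractable here because the state space is tiny; the interesting case, taken up below, is when the normalising constant is out of reach.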
A Description of Noise Contrastive Estimation
Noise contrastive estimation (NCE) is a powerful estimation method proposed by Gutmann and Hyvärinen, 2010 (see also Gutmann and Hyvärinen, 2012). The method is known for enabling the training of density models whose normalising constants are intractable. In the following, I will describe what it does. Let us assume that we have an i.i.d. sample \(\{x_1,\dots, x_n\}\subset \mathbb{R}^N\), \(N\geq1\), from a distribution defined by an unknown density function \(p_d\), which we model with another density \(p_m\). Intuitively, we want \(p_m\) to be close to \(p_d\) in some sense. In NCE, we introduce another dataset \(\{y_1,\dots, y_k\}\) by sampling from a known distribution \(p_n\), called a (contrastive) noise distribution. In short, the crux of NCE is to solve a binary classification problem in which we distinguish between the two datasets. More specifically, we train a classifier whose logit is given by \(\log(p_m/p_n)\); since the optimal classifier depends on the ratio \(p_d/p_n\), training \(p_m\) this way should bring \(p_m\) close to \(p_d\). As we are training a density model, how can we understand this procedure in terms of distributional divergences?
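To make the classification problem concrete: writing \(\nu = k/n\) for the noise-to-data ratio, the NCE classifier assigns an input \(u\) the posterior probability
\[
h(u) \;=\; \frac{p_m(u)}{p_m(u) + \nu\, p_n(u)},
\]
of being a data point, i.e. its logit is \(\log p_m(u) - \log p_n(u) - \log\nu\), and the parameters of \(p_m\) are fitted by maximising the log-likelihood of the binary labels. Below is a minimal sketch on a toy one-dimensional problem; the Gaussian data, the Gaussian noise distribution, and the unnormalised Gaussian model (with a free log-normaliser \(c\)) are all assumptions made for illustration:

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import log_expit  # numerically stable log of the logistic sigmoid

rng = np.random.default_rng(0)

# Toy data from p_d: a 1D Gaussian with mean 1 and standard deviation 1.5.
x = rng.normal(1.0, 1.5, size=1000)

# Contrastive noise from p_n: a wider zero-mean Gaussian we can evaluate exactly.
noise_scale = 3.0
y = rng.normal(0.0, noise_scale, size=2000)
nu = y.size / x.size  # noise-to-data ratio

def log_pn(u):
    return -0.5 * (u / noise_scale) ** 2 - np.log(noise_scale) - 0.5 * np.log(2 * np.pi)

def log_pm(u, theta):
    # Unnormalised Gaussian model; c is a free log-normaliser estimated by NCE.
    mu, log_prec, c = theta
    return -0.5 * np.exp(log_prec) * (u - mu) ** 2 + c

def nce_loss(theta):
    # Classifier logit: G(u) = log p_m(u) - log p_n(u) - log(nu).
    gx = log_pm(x, theta) - log_pn(x) - np.log(nu)
    gy = log_pm(y, theta) - log_pn(y) - np.log(nu)
    # Negative NCE objective: data points should be classified as "data",
    # noise points as "noise"; log(1 - sigmoid(g)) = log_expit(-g).
    return -(np.sum(log_expit(gx)) + np.sum(log_expit(-gy))) / x.size

theta_hat = minimize(nce_loss, x0=np.zeros(3), method="Nelder-Mead").x
mu, log_prec, c = theta_hat
print("estimated mean:", mu, "estimated std:", np.exp(-0.5 * log_prec))
```

Note that the model is never normalised explicitly: the constant \(c\) is treated as just another parameter, which is exactly what makes NCE attractive when the normalising constant is intractable.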