For quite some time now I’ve been working on neural inference methods, which have become very popular recently. There is an abundance of resources on these methods, and that is precisely why I decided to write a considerably terse post about the most prominent of these techniques. Since this is only an attempt to summarize the vast body of work being done in this field, I will try to provide links to more detailed (read: much better than mine) posts and papers for each of the methods. As this is only an attempt to summarize a fairly sophisticated field, please correct me via the comments section at the end of this page if you find any mistakes in my descriptions, of which I can promise there will be plenty to fix :) Lastly, before starting: this is more of a *dynamic* post, meaning that I will keep updating the entries and adding details as I get time.

With that, let’s delve straight into our main topic, **Approximate Neural Inference**, or as we will refer to it throughout this text, **NI**. Simply put, variational inference (**VI**) is a deterministic method for carrying out approximate inference in probabilistic models when the posterior distribution over the variables of interest (or the latent state) is intractable. To do so, VI uses a tractable family of distributions to approximate the intractable posterior via an optimization procedure that maximizes a lower bound on the log-likelihood of the data the model is trying to fit. Consider the following example.

Let $p(x;\theta) = \int p(x, z;\theta)\,dz$ be the likelihood of our data $x$ under our model parameterized by $\theta$, where $z$ is a latent variable (continuous, or discrete, in which case the integral is replaced by a sum). Assuming the posterior distribution $p(z \mid x;\theta)$ is intractable, mean-field VI (which is a type of VI) approximates this posterior by first introducing a variational parameter $\lambda_i$ for each $z_i$ and then approximating the true posterior by the factorized variational posterior $q(z;\lambda) = \prod_i q(z_i;\lambda_i)$. In practice, instead of directly optimizing the log-likelihood to fit the posterior, VI methods optimize a simpler lower bound on it (popularly referred to as the **ELBO**, or evidence lower bound).

Traditional VI methods make use of *conjugate* priors over the latent variables to derive closed-form `EM-like` updates for optimization, which accounts for their fast inference. But this is also the reason for their limited applicability and/or accuracy: not all models can leverage the convenience of conjugate priors. Another drawback is the need for repeated derivations after even minor changes in the initial assumptions. **NI** can be thought of as an alternative form of approximate variational inference that has a certain *black-box* characteristic to it, and therefore allows carrying out approximate inference without deriving update equations, even in models that do not have conjugate priors.

I think that is all we need to know from traditional VI theory to proceed further. The only other thing that needs to be spelled out is the actual **ELBO** that is optimized in the older methods, which can be written in two equivalent ways as follows,

$$\mathcal{L}(\theta, \lambda) = \mathbb{E}_{q(z;\lambda)}\left[\log p(x, z;\theta) - \log q(z;\lambda)\right] = \mathbb{E}_{q(z;\lambda)}\left[\log p(x \mid z;\theta)\right] - \mathrm{KL}\left(q(z;\lambda)\,\|\,p(z)\right).$$

`Kingma and Welling, 2013`

VAE uses the second form of the ELBO, $\mathcal{L} = \mathbb{E}_{q(z \mid x;\lambda)}[\log p(x \mid z;\theta)] - \mathrm{KL}(q(z \mid x;\lambda)\,\|\,p(z))$, where within the KL term, instead of using the posterior, it regularizes the variational posterior by the prior over $z$. The benefit of this change is that unlike the (intractable) posteriors, priors are always available in Bayesian methods. The problematic term is the expectation of the log conditional of the data, $\mathbb{E}_{q}[\log p(x \mid z;\theta)]$. While this expectation cannot always be computed in closed form, it can always be approximated using *Monte-Carlo* (**MC**) sampling, given that we can sample from the variational posterior. Since that is true for at least some exponential-family distributions, we can therefore estimate the ELBO.
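To make the estimation concrete, here is a minimal numpy sketch of the MC-estimated ELBO for a toy conjugate model (prior $z \sim \mathcal{N}(0,1)$, likelihood $x \mid z \sim \mathcal{N}(z,1)$, Gaussian $q$); the values of `x`, `mu`, and `sigma` are purely illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy model: prior z ~ N(0, 1), likelihood x | z ~ N(z, 1),
# variational posterior q(z) = N(mu, sigma^2) with illustrative values.
x, mu, sigma = 1.5, 0.8, 0.6

def log_normal(v, mean, std):
    return -0.5 * np.log(2 * np.pi * std**2) - (v - mean)**2 / (2 * std**2)

# Monte-Carlo estimate of E_q[log p(x | z)] using samples from q.
z = rng.normal(mu, sigma, size=10_000)
expected_loglik = np.mean(log_normal(x, z, 1.0))

# KL(q || p) between two univariate Gaussians has a closed form here,
# so only the expectation term needs sampling.
kl = np.log(1.0 / sigma) + (sigma**2 + mu**2) / 2 - 0.5

elbo = expected_loglik - kl
print(f"MC estimate of the ELBO: {elbo:.3f}")
```

Since in this toy model the true marginal is tractable ($x \sim \mathcal{N}(0, 2)$, so $\log p(1.5) \approx -1.83$), you can verify that the estimate indeed lower-bounds it.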

VAEs use a feed-forward neural network, called the *inference network* (or recognition network), for generating the parameters of the variational posterior, and a *generative network* for decoding the data. Using an inference network to generate the variational distribution imposes a certain smoothness assumption over the data points: instead of having mean-field-type local parameters, the network allows sharing of global variational parameters. Training this network is problematic because using MC to approximate the expectation in the above ELBO introduces a discontinuity in the computation graph, and hence backpropagation is not possible. Backpropagating through a random sampler is a well-studied problem, and VAEs solve it by using the **reparameterization trick**, which is applicable to any continuous distribution that has a **non-centered parametric form**. I do not want to go into the details of these tricks here, as that would require a lot more space than I have, so I will finish this description of VAEs by saying that VAEs can be used as a **neural inference** method for a large class of continuous latent-state models where the prior over the latent variable has a **non-centered parametric form**. Recently, we also showed that VAEs can be used with **Dirichlet** priors, as in a topic model, using a **Laplace approximation**. Details can be found here. In the future I will update this section with more recent work, such as the following.
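The core of the reparameterization trick can be sketched in a few lines: for a Gaussian $q(z;\mu,\sigma)$, write $z = \mu + \sigma\epsilon$ with $\epsilon \sim \mathcal{N}(0,1)$, so the sample is a differentiable function of the variational parameters. The loss $f(z) = z^2$ below is just a stand-in for the ELBO term:

```python
import numpy as np

rng = np.random.default_rng(1)

# Reparameterization for Gaussian q(z; mu, sigma): instead of sampling
# z ~ N(mu, sigma^2) directly (which blocks backpropagation), sample
# eps ~ N(0, 1) and set z = mu + sigma * eps, which is differentiable
# in mu and sigma.
mu, sigma = 0.5, 1.2
eps = rng.normal(size=100_000)
z = mu + sigma * eps

# For a loss f(z) = z^2, the pathwise gradient is df/dmu = 2z * dz/dmu
# = 2z, so the reparameterized estimate of dE[f]/dmu is mean(2z) = 2*mu.
grad_mu = np.mean(2 * z)
print(f"estimated dE[z^2]/dmu: {grad_mu:.3f}  (exact: {2 * mu})")
```

In a real VAE the same idea is applied inside an autodiff framework, with `mu` and `sigma` produced by the inference network.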

So what if your posterior is not a unimodal Gaussian distribution? This is the type of issue that Normalizing Flows and related methods are trying to resolve. The basic idea is fairly simple: let’s assume that your complicated posterior is some complicated transformation of a unimodal Gaussian. Normalizing Flows then use special invertible functions (**such that the determinant of the Jacobian of their inverse is known**) to transform the unimodal Gaussian so that it approximates your complicated posterior well. More on this later, but for now note that there are several ways to go about using the same idea, including but not limited to **volume-preserving transformations**, **invertible neural networks**, etc.
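The change-of-variables formula behind this idea can be verified numerically. A minimal sketch: push a standard normal base density through the invertible map $f(z) = e^z$ and check that the transformed density matches the known log-normal density (the map is chosen only because its inverse and Jacobian are trivial):

```python
import numpy as np

# Change of variables underlying normalizing flows: push a base density
# q0 through an invertible map f; the transformed density is
#   q1(z1) = q0(f^{-1}(z1)) * |det d f^{-1} / d z1|.
# Here f(z) = exp(z) turns a standard normal into a log-normal.

def log_q0(z):                       # base density: N(0, 1)
    return -0.5 * np.log(2 * np.pi) - 0.5 * z**2

def log_q1(z1):                      # transformed density via the flow
    z = np.log(z1)                   # f^{-1}(z1)
    log_det_inv_jac = -np.log(z1)    # d f^{-1}/d z1 = 1/z1
    return log_q0(z) + log_det_inv_jac

# Compare against the known standard log-normal density at a test point.
z1 = 2.0
lognormal = -np.log(z1 * np.sqrt(2 * np.pi)) - 0.5 * np.log(z1)**2
print(np.isclose(log_q1(z1), lognormal))  # True
```

A flow stacks several such maps (with learnable parameters) so that the composed transformation can bend a simple base density into something multimodal.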

We are releasing an extended version of our ICLR paper to explain this point in further mathematical detail.


While there are plenty of VAE implementations all over the web, in almost every deep learning package, I find this `TensorFlow` implementation by Jan Hendrik Metzen quite intuitive and easy to follow.

`Mnih and Gregor, 2014`

NVIL is a more general method of inference that is equally applicable to both continuous and discrete latent-state probabilistic models. Let’s start by considering how the ELBO is written as an expectation in NVIL, by absorbing the prior over $z$ into the joint distribution of the model,

$$\mathcal{L}(\theta, \phi) = \mathbb{E}_{q(z \mid x;\phi)}\left[\log p(x, z;\theta) - \log q(z \mid x;\phi)\right].$$

Like VAEs, NVIL constructs the variational posterior using an inference network as well. The gradients of this ELBO *w.r.t.* $\theta$ and $\phi$ are given as follows,

$$\nabla_{\theta}\mathcal{L} = \mathbb{E}_{q(z \mid x;\phi)}\left[\nabla_{\theta}\log p(x, z;\theta)\right]$$

and

$$\nabla_{\phi}\mathcal{L} = \mathbb{E}_{q(z \mid x;\phi)}\left[\left(\log p(x, z;\theta) - \log q(z \mid x;\phi)\right)\nabla_{\phi}\log q(z \mid x;\phi)\right].$$

There is no problem with approximating $\nabla_{\theta}\mathcal{L}$ with MC, but the high variance in the MC approximation of $\nabla_{\phi}\mathcal{L}$ requires additional methods to train successfully. To this end, the authors proposed two black-box methods for variance reduction in the gradient estimates. The quantity $\log p(x, z;\theta) - \log q(z \mid x;\phi)$ is called the **learning signal**.

The reason that the gradient with respect to $\phi$ is written as an expectation (by clever use of the log-derivative trick and moving the differentiation operator in and out of the summation) is to allow subtracting from the learning signal any quantity that does not depend on $z$. Since the expectation remains invariant when such a quantity is subtracted, a systematically calculated function of the data, when subtracted from the learning signal, can help bring down the noise in the gradients.

While centering the learning signal reduces the variance in the gradients, it does not suffice by itself, as the gradients still tend to shift drastically at times. Therefore, the variance of the learning signal is used to further normalize the gradients.
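The effect of centering is easy to demonstrate on a toy problem. The sketch below uses a simple constant baseline rather than the input-dependent baseline network from the paper, and a hand-picked $q(z) = \mathcal{N}(\mu, 1)$ with learning signal $f(z) = (z-3)^2$, all purely illustrative:

```python
import numpy as np

rng = np.random.default_rng(2)

# Score-function gradient of E_q[f(z)] w.r.t. mu, with q(z) = N(mu, 1):
#   grad = E_q[f(z) * d log q / d mu].
# Subtracting a baseline b from f leaves the gradient unbiased because
# E_q[d log q / d mu] = 0, but it can reduce the variance a lot.
mu = 0.0
z = rng.normal(mu, 1.0, size=(1000, 5000))   # 1000 batches of 5000 samples
score = z - mu                                # d log q / d mu for unit variance
fz = (z - 3.0) ** 2                           # learning signal

grads_raw = np.mean(fz * score, axis=1)
baseline = fz.mean()                          # crude constant baseline
grads_centered = np.mean((fz - baseline) * score, axis=1)

# Both estimators agree in expectation (exact gradient is 2*(mu-3) = -6),
# but the centered one has visibly lower variance across batches.
print("means:", grads_raw.mean(), grads_centered.mean())
print("variances:", grads_raw.var(), grads_centered.var())
```

NVIL goes further: the baseline is itself a learned function of the input, and the signal is additionally divided by a running estimate of its standard deviation.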

NVIL also allows benefiting from the local structure of the model.

There are a few example implementations of this paper in Theano and MATLAB.

`Rajesh Ranganath, Sean Gerrish and David M. Blei, 2014`

If I remember correctly, this is a precursor to the NVIL method above. Essentially, NVIL generalizes this method with an encoder to get the variational parameters, without needing to resort to the mean-field assumption, and therefore uses different variance reduction techniques. The lower bound for BBVI is given by:

$$\mathcal{L}(\lambda) = \mathbb{E}_{q(z;\lambda)}\left[\log p(x, z) - \log q(z;\lambda)\right]$$

where $\lambda$ represents the free parameters. Then the gradients are given as:

$$\nabla_{\lambda}\mathcal{L} = \mathbb{E}_{q(z;\lambda)}\left[\nabla_{\lambda}\log q(z;\lambda)\left(\log p(x, z) - \log q(z;\lambda)\right)\right]$$

and its Monte-Carlo approximation becomes:

$$\nabla_{\lambda}\mathcal{L} \approx \frac{1}{S}\sum_{s=1}^{S}\nabla_{\lambda}\log q(z_s;\lambda)\left(\log p(x, z_s) - \log q(z_s;\lambda)\right)$$

where $z_s \sim q(z;\lambda)$.

This ELBO can be optimized using a gradient-based method, but the variance of these gradient estimates is generally too high for that to work, so the method relies on the following variance reduction techniques:

Suppose we are interested in calculating the expectation of a function, for example the gradient with respect to $\lambda_i$ as above. The idea behind Rao-Blackwellization is to replace this function with another function that has the same expectation but lower variance. It does so by using the conditional expectation of the function.
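A tiny numeric sketch of Rao-Blackwellization (with a made-up function $f(x,y) = x + y^2$ whose conditional expectation is known in closed form, rather than an actual BBVI gradient):

```python
import numpy as np

rng = np.random.default_rng(3)

# Rao-Blackwellization: to estimate E[f(x, y)], replace f with its
# conditional expectation g(x) = E[f(x, y) | x]. Both estimators are
# unbiased for E[f] = 1, but Var(g(x)) <= Var(f(x, y)).
n = 100_000
x = rng.normal(size=n)
y = rng.normal(size=n)

f = x + y**2            # naive estimator samples, Var = 1 + 2 = 3
g = x + 1.0             # E[f | x] = x + E[y^2] = x + 1, Var = 1

print("estimates:", f.mean(), g.mean())   # both near E[f] = 1
print("variances:", f.var(), g.var())     # g has lower variance
```

In BBVI the same trick is applied per factor of the mean-field posterior: the gradient for $\lambda_i$ only needs the terms of the joint that interact with $z_i$.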

This is very similar to the centering and variance normalization in the NVIL case above, so we do not discuss it here.

I initially wanted to provide a very short description of GANs here, although they are generally not explained under a *variational inference*-type framework. But given their popularity, I think it is only fair to write a separate post on GANs. If you are interested in the divergence-minimization formulation of GANs, have a look at the $f$-GAN paper.

Automatic Differentiation Variational Inference (ADVI) generalizes the VAE framework by providing a library of invertible mappings from the domain of the distribution of interest to the real coordinate space $\mathbb{R}^K$, which can be used to transform the prior density in order to build a Gaussian approximation to the posterior. This allows for using the reparameterization trick for inference, as in VAEs. Since these mappings are invertible, the Gaussian approximations can be transformed back to the required domain.
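As an illustration of the transform step, here is a sketch for a positivity-constrained latent: map $\theta > 0$ to the real line with $\zeta = \log\theta$, work with a Gaussian there (parameters below are made up), and map samples back with the appropriate Jacobian correction:

```python
import numpy as np

rng = np.random.default_rng(4)

# ADVI-style handling of a constrained latent: a positive variable
# theta > 0 is mapped to the real line via zeta = log(theta); a Gaussian
# N(mu, sigma^2) is used in zeta-space, and samples are mapped back with
# theta = exp(zeta). The density in theta-space picks up the
# log-det-Jacobian of the inverse map:
#   log q(theta) = log N(log theta; mu, sigma^2) - log theta.
mu, sigma = 0.3, 0.5                   # illustrative Gaussian in zeta-space

zeta = rng.normal(mu, sigma, size=100_000)
theta = np.exp(zeta)                   # samples automatically satisfy theta > 0

def log_q_theta(t):
    log_normal = -0.5 * np.log(2 * np.pi * sigma**2) \
                 - (np.log(t) - mu) ** 2 / (2 * sigma**2)
    return log_normal - np.log(t)      # Jacobian correction

print("all positive:", np.all(theta > 0))
print("log q at theta=1:", log_q_theta(1.0))
```

ADVI supplies such transforms automatically for common constraints (positivity, simplexes, bounded intervals), which is what makes it “automatic.”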

There is a very nice implementation of ADVI in Stan. A reasonably good implementation is also available in PyMC3.

Okay, I should say that I am quite a fan of this and other related papers on this topic. So I plan on writing at least a paragraph about each one of them. But that is for later, so for now I only have a one-liner: remember how we generally resort to the familiar exponential family in most variational inference methods? How our variational posteriors are mostly unimodal (except in the case of mixture distributions, but that is a different story for another time)? Well, here is a solution to all of that, and it’s called Normalizing Flows.