Week 10: Stochastic Variational Inference / Automatic Differentiation Variational Inference (SVI / ADVI)¶
Assigned Reading¶
- Murphy: Chapter 18
Overview¶
- Review Variational Inference
- Derive the variational objective
- ELBO intuition
- Stochastic optimization
Posterior Inference for Latent Variable Models¶
Imagine we had the following latent variable model

which represents the probabilistic model p(x, z ; \theta) where
- x_{1:N} are the observations
- z_{1:N} are the unobserved local latent variables
- \theta are the global latent variables (i.e. the parameters)
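For concreteness, one standard factorization consistent with these variables (an assumption here, following the usual global/local latent variable setup rather than anything shown in lecture) is
p(x_{1:N}, z_{1:N}, \theta) = p(\theta) \prod_{n=1}^{N} p(z_n | \theta) \, p(x_n | z_n, \theta)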
The conditional distribution of the unobserved variables given the observed variables (the posterior) is
p(z, \theta | x) = \frac{p(x, z, \theta)}{\int \int p(x, z, \theta) \, dz \, d\theta}
which we will denote as p_{\theta}(z | x).
Because the normalizing integral \int \int p(x, z, \theta) \, dz \, d\theta is intractable, the conditional distribution itself is intractable to compute, and we must turn to variational methods.
Approximating the Posterior Inference with Variational Methods¶
Approximation of the posterior inference with variational methods works as follows:
- Introduce a variational family, q_\phi(z | x) with parameters \phi.
- Encode some notion of "distance" between p_\theta and q_\phi.
- Minimize this distance.
This process effectively turns Bayesian Inference into an optimization problem (and we love optimization problems in machine learning).
It is important to note that, whatever family we choose for q_\phi, it is unlikely to contain the true posterior p_\theta(z | x).

Kullback-Leibler Divergence¶
We will measure the distance between q_\phi and p_\theta using the Kullback-Leibler divergence.
Note
Kullback–Leibler divergence goes by many names; we will stick to "KL divergence".
We compute D_{KL} as follows:
D_{KL}(q_\phi \,||\, p_\theta) = \mathbb{E}_{q_\phi(z | x)} \left[ \log \frac{q_\phi(z | x)}{p_\theta(z | x)} \right] = \int q_\phi(z | x) \log \frac{q_\phi(z | x)}{p_\theta(z | x)} \, dz
Properties of the KL Divergence¶
- D_{KL}(q_\phi || p_\theta) \ge 0
- D_{KL}(q_\phi || p_\theta) = 0 \Leftrightarrow q_\phi = p_\theta
- D_{KL}(q_\phi || p_\theta) \not = D_{KL}(p_\theta || q_\phi)
The significance of the last property is that D_{KL} is not a true distance metric, since it is not symmetric.
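As a quick numerical illustration of this asymmetry (my own sketch, not from the lecture), the KL divergence between two univariate Gaussians has a closed form, and evaluating it in both directions gives different answers:

```python
import numpy as np

def kl_gaussian(mu_q, sigma_q, mu_p, sigma_p):
    """Closed-form KL( N(mu_q, sigma_q^2) || N(mu_p, sigma_p^2) )."""
    return (np.log(sigma_p / sigma_q)
            + (sigma_q**2 + (mu_q - mu_p)**2) / (2 * sigma_p**2)
            - 0.5)

# Two different Gaussians: the divergence depends on the direction.
print(kl_gaussian(0.0, 1.0, 1.0, 2.0))  # KL(q || p) ≈ 0.443
print(kl_gaussian(1.0, 2.0, 0.0, 1.0))  # KL(p || q) ≈ 1.307
```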
Variational Objective¶
We want to approximate p_\theta by finding a q_\phi such that
\phi^* = \arg\min_\phi D_{KL}(q_\phi(z | x) \,||\, p_\theta(z | x))
but the computation of D_{KL}(q_\phi || p_\theta) is intractable (as discussed above).
Note
D_{KL}(q_\phi || p_\theta) is intractable because it contains the term p_\theta(z | x), which, as we have already established, is intractable.
To circumvent this issue of intractability, we will derive the evidence lower bound (ELBO) and show that maximizing the ELBO \Rightarrow minimizing D_{KL}(q_\phi || p_\theta).
D_{KL}(q_\phi (z | x) \,||\, p_\theta (z | x)) = \mathbb{E}_{q_\phi(z | x)} \big[ \log q_\phi(z | x) - \log p_\theta(z | x) \big]
= \mathbb{E}_{q_\phi(z | x)} \big[ \log q_\phi(z | x) - \log p_\theta(x, z) \big] + \log p_\theta(x)
= -\mathcal L(\theta, \phi ; x) + \log p_\theta(x)
Where \mathcal L(\theta, \phi ; x) = \mathbb{E}_{q_\phi(z | x)} \big[ \log p_\theta(x, z) - \log q_\phi(z | x) \big] is the ELBO.
Note
Notice that \log p_\theta(x) does not depend on z, so it can be pulled out of the expectation.
Rearranging, we get
\log p_\theta(x) = \mathcal L(\theta, \phi ; x) + D_{KL}(q_\phi (z | x) \,||\, p_\theta (z | x))
Because D_{KL}(q_\phi (z | x) || p_\theta (z | x)) \ge 0,
\log p_\theta(x) \ge \mathcal L(\theta, \phi ; x)
\therefore maximizing the ELBO \Rightarrow minimizing D_{KL}(q_\phi (z | x) || p_\theta (z | x)).
Alternative Derivation¶
Starting with Jensen's inequality,
f(\mathbb{E}[X]) \le \mathbb{E}[f(X)]
if X is a random variable and f is a convex function (the inequality is reversed for concave f).
Given that \log is a concave function, we have
\log p_\theta(x) = \log \int p_\theta(x, z) \, dz = \log \mathbb{E}_{q_\phi(z | x)} \left[ \frac{p_\theta(x, z)}{q_\phi(z | x)} \right] \ge \mathbb{E}_{q_\phi(z | x)} \left[ \log \frac{p_\theta(x, z)}{q_\phi(z | x)} \right] = \mathcal L(\theta, \phi ; x)
Alternative Forms of ELBO and Intuitions¶
We have that
\log p_\theta(x) \ge \mathcal L(\theta, \phi ; x)
and the ELBO can be written in several equivalent forms.
1) The most general interpretation of the ELBO is given by
\mathcal L(\theta, \phi ; x) = \mathbb{E}_{q_\phi(z | x)} \big[ \log p_\theta(x, z) - \log q_\phi(z | x) \big]
2) We can also re-write 1) using entropy
\mathcal L(\theta, \phi ; x) = \mathbb{E}_{q_\phi(z | x)} \big[ \log p_\theta(x, z) \big] + \mathbb{H} \big[ q_\phi(z | x) \big]
3) Another re-write and we arrive at
\mathcal L(\theta, \phi ; x) = \mathbb{E}_{q_\phi(z | x)} \big[ \log p_\theta(x | z) \big] - D_{KL}(q_\phi(z | x) \,||\, p_\theta(z))
Tip
The instructor suggests that this form will be useful for assignment 3.
This frames the ELBO as a tradeoff. The first term can be thought of as a "reconstruction likelihood", i.e. how probable x is given z, which encourages the model to choose the distribution that best reconstructs the data. The second term acts as regularization, penalizing approximate posteriors q_\phi(z | x) that stray too far from the prior p_\theta(z).
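As a rough sketch of how form 3) can be estimated by Monte Carlo (my own toy one-dimensional Gaussian example; none of these function names come from the lecture):

```python
import numpy as np

rng = np.random.default_rng(0)

def log_normal(x, mu, sigma):
    """Log density of a diagonal Gaussian, summed over dimensions."""
    return np.sum(-0.5 * np.log(2 * np.pi * sigma**2) - (x - mu)**2 / (2 * sigma**2))

def elbo_estimate(x, q_mu, q_sigma, n_samples=100):
    """Monte Carlo estimate of E_q[log p(x|z)] - KL(q(z|x) || p(z)) for a toy model:
    prior p(z) = N(0, 1), likelihood p(x|z) = N(z, 1), variational q(z|x) = N(q_mu, q_sigma^2).
    """
    total = 0.0
    for _ in range(n_samples):
        z = q_mu + q_sigma * rng.standard_normal(size=q_mu.shape)
        reconstruction = log_normal(x, z, 1.0)                            # log p(x | z)
        kl_term = log_normal(z, q_mu, q_sigma) - log_normal(z, 0.0, 1.0)  # log q(z|x) - log p(z)
        total += reconstruction - kl_term
    return total / n_samples

x = np.array([0.8])
print(elbo_estimate(x, q_mu=np.array([0.4]), q_sigma=np.array([0.7])))
```

Averaging over more samples reduces the Monte Carlo noise in the estimate.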
Note
The instructor recommends we read "Sticking the Landing" (Roeder et al., 2017).
Mean Field Variational Inference¶
In mean field variational inference, we restrict ourselves to variational families q whose gradients we can compute, and assume the approximate distribution fully factorizes as q_\phi(z) (note: no x!). I.e., we approximate p_\theta(z | x) with
q_\phi(z) = q(\theta ; \phi_\theta) \prod_{n=1}^{N} q(z_n ; \phi_n)
where \phi = (\phi_\theta, \phi_{1:N}).

If the q factors are in the same family as the corresponding p factors, we can optimize via coordinate ascent.
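For reference, the standard coordinate-ascent update for a single mean-field factor (stated here without derivation; it is not given explicitly in the lecture notes) holds all other factors fixed and sets
q_j^*(z_j) \propto \exp \big( \mathbb{E}_{q_{-j}} [ \log p(x, z) ] \big)
where q_{-j} denotes the product of all factors other than q_j.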
Traditional Variational Inference (ASIDE)¶
- Fix all other variables --> optimize local
- Aggregate local --> optimize global
- Repeat until the KL divergence (equivalently, the ELBO) converges
Warning
I think this was meant to be an aside.
Optimizing ELBO¶
We have that
\mathcal L(\phi ; x) = \mathbb{E}_{q_\phi(z)} \big[ \log p(x, z) - \log q_\phi(z) \big]
If we want to optimize this with gradient methods, we will need to compute \nabla_\phi \mathcal L(\phi ; x). Nowadays, we have automatic differentiation (AD). We can optimize with gradient methods if:
- z is continuous
- dependence on \phi is exposed to AD
If these are both true, then
\nabla_\phi \mathcal L(\phi ; x) = \nabla_\phi \mathbb{E}_{q_\phi(z)} \big[ \log p(x, z) - \log q_\phi(z) \big]
but this is difficult, because we are taking the gradient of an expectation while trying to compute that gradient from samples. This brings us to our big idea: instead of taking the gradient of an expectation, we compute the gradient as an expectation.
Score Gradient¶
The score gradient, also called the likelihood-ratio or REINFORCE gradient, was independently developed in 1990, 1992, 2013, and 2014 (twice). It is given by
\nabla_\phi \mathcal L(\phi ; x) = \nabla_\phi \mathbb{E}_{q_\phi(z)} \big[ \log p(x, z) - \log q_\phi(z) \big]
If we assume that q_\phi(z) is a continuous function of \phi, then
\nabla_\phi \mathcal L(\phi ; x) = \int \nabla_\phi q_\phi(z) \big( \log p(x, z) - \log q_\phi(z) \big) \, dz - \int q_\phi(z) \nabla_\phi \log q_\phi(z) \, dz
where the second term vanishes, since \int q_\phi(z) \nabla_\phi \log q_\phi(z) \, dz = \int \nabla_\phi q_\phi(z) \, dz = \nabla_\phi \int q_\phi(z) \, dz = 0. Using the log-derivative trick \big ( \nabla_\phi \log q_\phi = \frac{\nabla_\phi q_\phi}{q_\phi} \big ) on the first term:
\nabla_\phi \mathcal L(\phi ; x) = \mathbb{E}_{q_\phi(z)} \big[ \nabla_\phi \log q_\phi(z) \big( \log p(x, z) - \log q_\phi(z) \big) \big]
where \nabla_\phi \log q_\phi(z) is the score function. Finally, taking a Monte Carlo estimate with samples z^{(s)} \sim q_\phi(z), we have
\nabla_\phi \mathcal L(\phi ; x) \approx \frac{1}{S} \sum_{s=1}^{S} \nabla_\phi \log q_\phi(z^{(s)}) \big( \log p(x, z^{(s)}) - \log q_\phi(z^{(s)}) \big)
which is unbiased, but high variance.
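A minimal sketch of the score-function estimator (a toy one-dimensional Gaussian q and a hypothetical log_joint of my own; not the lecture's code):

```python
import numpy as np

rng = np.random.default_rng(0)

def score_gradient(log_joint, q_mu, q_sigma, n_samples=1000):
    """Score-function (REINFORCE) estimate of d/d(q_mu) E_q[log p(x,z) - log q(z)],
    where q(z) = N(q_mu, q_sigma^2) and only the mean q_mu is being differentiated.
    """
    samples = []
    for _ in range(n_samples):
        z = q_mu + q_sigma * rng.standard_normal()
        log_q = -0.5 * np.log(2 * np.pi * q_sigma**2) - (z - q_mu)**2 / (2 * q_sigma**2)
        score = (z - q_mu) / q_sigma**2   # d/d(q_mu) log q(z) for a Gaussian
        samples.append(score * (log_joint(z) - log_q))
    return np.mean(samples), np.std(samples) / np.sqrt(n_samples)

# Hypothetical toy joint (up to an additive constant): p(z) = N(0, 1), p(x=1 | z) = N(z, 1)
log_joint = lambda z: -0.5 * z**2 - 0.5 * (1.0 - z)**2
grad, stderr = score_gradient(log_joint, q_mu=0.0, q_sigma=1.0)
print(grad, stderr)  # unbiased (exact gradient is 1 - 2*q_mu = 1.0 here), but note the standard error
```

The same estimator works for any q we can sample from and whose log density we can evaluate, which is what makes it so general, and also why its variance can be large.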
Pathwise Gradient¶
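A hedged sketch of the idea (an addition, not taken from the notes): for a Gaussian q_\phi(z) = N(\mu, \sigma^2), write z = \mu + \sigma \epsilon with \epsilon \sim N(0, 1); the sample is then a differentiable function of \phi, so AD can propagate gradients through it. A toy illustration, continuing the example above:

```python
import numpy as np

rng = np.random.default_rng(0)

def pathwise_gradient(grad_log_joint, q_mu, q_sigma, n_samples=1000):
    """Pathwise (reparameterization) estimate of d/d(q_mu) E_q[log p(x,z) - log q(z)]
    for q(z) = N(q_mu, q_sigma^2), using z = q_mu + q_sigma * eps with eps ~ N(0, 1).

    With this parameterization, log q(z) = -0.5*log(2*pi*q_sigma^2) - eps^2/2 does not
    depend on q_mu, so only the log-joint term contributes to the gradient w.r.t. q_mu.
    """
    samples = []
    for _ in range(n_samples):
        eps = rng.standard_normal()
        z = q_mu + q_sigma * eps            # z is a differentiable function of q_mu
        samples.append(grad_log_joint(z))   # chain rule: dz/d(q_mu) = 1
    return np.mean(samples), np.std(samples) / np.sqrt(n_samples)

# Same hypothetical toy joint as in the score-gradient sketch: p(z) = N(0,1), p(x=1 | z) = N(z, 1)
grad_log_joint = lambda z: -z + (1.0 - z)   # d/dz of [ -z^2/2 - (1-z)^2/2 ]
grad, stderr = pathwise_gradient(grad_log_joint, q_mu=0.0, q_sigma=1.0)
print(grad, stderr)  # compare the standard error with the score-function estimate above
```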
Appendix¶
Useful Resources¶
- High-level overview of variational inference.