Black Box Variational Inference (BBVI)

Theory

In variational inference, we approximate the posterior $p\left(\theta\mid{y}\right)$ with a variational distribution $q\left(\theta\mid{\phi}\right)$, choosing the variational parameters $\phi$ to minimize the Kullback-Leibler divergence:

$$ KL\left[q\left(\theta\mid{\phi}\right)\mid\mid{p}\left(\theta\mid{y}\right)\right] = -E_{q}\left[\log\left(\frac{p\left(\theta\mid{y}\right)}{q\left(\theta\mid{\phi}\right)}\right)\right] $$

$$ KL\left[q\left(\theta\mid{\phi}\right)\mid\mid{p}\left(\theta\mid{y}\right)\right] = -E_{q}\left[\log{p\left(\theta\mid{y}\right)}\right] + E_{q}\left[\log{q\left(\theta\mid{\phi}\right)}\right] $$

We can't minimize this quantity directly because the first term contains the intractable log evidence $\log{p\left(y\right)}$ (since $\log{p\left(\theta\mid{y}\right)} = \log{p\left(\theta,{y}\right)} - \log{p\left(y\right)}$), but we can minimize the KL divergence indirectly by maximizing the evidence lower bound (ELBO):

$$ ELBO = E_{q}\left[\log\left({p\left(\theta,{y}\right)}\right)\right] - E_{q}\left[\log\left({q\left(\theta\mid{\phi}\right)}\right)\right] $$
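To see why this works, substitute the decomposition of the log posterior into the KL expression above: the ELBO and the KL divergence sum to a constant,

$$ \log{p\left(y\right)} = ELBO + KL\left[q\left(\theta\mid{\phi}\right)\mid\mid{p}\left(\theta\mid{y}\right)\right] $$

The log evidence does not depend on $\phi$ and the KL divergence is non-negative, so raising the ELBO necessarily lowers the KL divergence.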

There are many choices of approximating distribution. The simplest is a mean-field approximation for $M$ parameters:

$$ q\left(\theta\mid{\phi}\right) = \prod^{M}_{m=1}q\left(\theta_{m}\mid{\phi_{m}}\right) $$

PyFlux currently implements this approximation with Gaussian distributions. Future versions will be more flexible in the choices of distribution and dependency structure.
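For intuition, here is a minimal NumPy sketch of a mean-field Gaussian approximation of this kind; the class name and interface are hypothetical illustrations of the idea, not PyFlux's internal code:

import numpy as np

class MeanFieldGaussian:
    """Fully factorized Gaussian q(theta | phi) with phi = (means, log-stds)."""
    def __init__(self, means, log_stds):
        self.means = np.asarray(means, dtype=float)
        self.log_stds = np.asarray(log_stds, dtype=float)

    def sample(self, size):
        # One row per draw, one column per parameter
        eps = np.random.randn(size, self.means.shape[0])
        return self.means + np.exp(self.log_stds) * eps

    def log_prob(self, theta):
        # Sum of the independent Gaussian log-densities across parameters
        var = np.exp(2.0 * self.log_stds)
        return np.sum(-0.5 * np.log(2.0 * np.pi * var)
                      - 0.5 * (theta - self.means) ** 2 / var, axis=-1)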

BBVI is a general-purpose methodology for conducting variational inference. Traditionally, VI placed restrictive conditions on the model (such as conditional conjugacy in the exponential family), but BBVI, introduced by Ranganath et al. (2013), allows for non-conjugate models. The basic idea is to approximate the expectation in the ELBO gradient by Monte Carlo: we simulate from the variational approximation $q\left(\theta\mid{\phi}\right)$, compute a noisy estimate of the gradient, and use a stochastic optimization technique to follow this gradient and maximize the ELBO. The gradient is:

$$ \nabla_{\phi}{ELBO} = E_{q}\left[\nabla_{\phi}\log{q\left(\theta\mid{\phi}\right)}\left( \log{p\left(\theta,{y}\right)} - \log{q\left(\theta\mid{\phi}\right)}\right)\right] $$

Simulating $S$ draws $\theta_{s}$ from $q\left(\theta\mid{\phi}\right)$, the naive Monte Carlo estimator is:

$$ \nabla_{\phi}{ELBO} \approx \frac{1}{S}\sum^{S}_{s=1} \nabla_{\phi}\log{q\left(\theta_{s}\mid{\phi}\right)}\left( \log{p\left(\theta_{s},{y}\right)} - \log{q\left(\theta_{s}\mid{\phi}\right)}\right) $$
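Continuing the sketch above, this estimator is only a few lines of NumPy. The log_joint callable, which evaluates $\log{p\left(\theta,{y}\right)}$ for a batch of draws, is assumed to be supplied by the model:

def naive_elbo_gradient(q, log_joint, S=100):
    # theta_s ~ q(theta | phi), stacked into an (S, M) array
    theta = q.sample(S)
    # f_s = log p(theta_s, y) - log q(theta_s | phi)
    f = log_joint(theta) - q.log_prob(theta)
    # Score function of the diagonal Gaussian w.r.t. phi = (means, log-stds)
    std = np.exp(q.log_stds)
    z = (theta - q.means) / std
    score_means = z / std          # d log q / d mean
    score_log_stds = z ** 2 - 1.0  # d log q / d log-std
    grad_means = np.mean(score_means * f[:, None], axis=0)
    grad_log_stds = np.mean(score_log_stds * f[:, None], axis=0)
    return grad_means, grad_log_stds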

But this naive estimator has very high variance, so variance reduction techniques are needed in practice. Ranganath et al. (2013) use Rao-Blackwellisation and control variates to reduce the variance, and AdaGrad for the stochastic optimization that follows the gradient.
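To illustrate the control variate idea in the same hypothetical setting (this sketches the generic technique, not PyFlux's exact implementation): the score function has expectation zero under $q$, so a scaled copy of it can be subtracted from the per-sample gradient terms without introducing bias, with the scaling chosen to minimize variance:

def control_variate_gradient(score, f):
    # Naive per-sample gradient terms: g_s = score_s * f_s
    g = score * f[:, None]
    # g - a * score stays unbiased for any a because E_q[score] = 0;
    # the variance-minimizing choice is a = Cov(g, score) / Var(score)
    a = (np.mean((g - g.mean(axis=0)) * (score - score.mean(axis=0)), axis=0)
         / (np.var(score, axis=0) + 1e-12))
    return np.mean(g - a * score, axis=0)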

In the PyFlux implementation for time series models, we use control variates, following Ranganath et al., as well as Rao-Blackwellisation for some model types. The user has a choice of optimizer: RMSProp or ADAM. The final parameter estimates are taken as an average over the final ten percent of iterations.
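For reference, the generic RMSProp update scales each component of the step by a running average of that component's squared gradient; a minimal sketch follows, with illustrative default hyperparameters rather than PyFlux's exact settings:

def rmsprop_step(phi, grad, cache, lr=0.001, decay=0.9, eps=1e-8):
    # Exponential moving average of squared gradients
    cache = decay * cache + (1.0 - decay) * grad ** 2
    # Gradient *ascent* on the ELBO, hence the plus sign
    phi = phi + lr * grad / (np.sqrt(cache) + eps)
    return phi, cache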

PyFlux

We will illustrate BBVI by estimating a GASNormal model. We use the following dataset of monthly car driver deaths in Great Britain:

In [1]:
import numpy as np
import pyflux as pf
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline 

data = pd.read_csv("https://vincentarelbundock.github.io/Rdatasets/csv/MASS/drivers.csv")
accidents = data['drivers'].values
plt.figure(figsize=(15, 5))
plt.plot(data['time'].values, accidents)
plt.ylabel('Deaths')
plt.xlabel('Year')
plt.title("Deaths of Car Drivers in Great Britain 1969-84");

Since we are dealing with large counts, a Normal model should be a reasonable approximation.

In [2]:
model = pf.GASNormal(data=accidents, ar=3, sc=3)
model.parameters.adjust_prior(0, prior=pf.Uniform())
model.parameters.adjust_prior([1, 4], prior=pf.Normal(0, 3))
model.parameters.adjust_prior([2, 5], prior=pf.Normal(0, 2.70))
model.parameters.adjust_prior([3, 6], prior=pf.Normal(0, 2.40))
print(model.parameters)
Index  Parameter                 Prior      Hyperparameters           V.I. Dist  Transform 
====== ========================= ========== ========================= ========== ==========
0      Constant                  Uniform    n/a (non-informative)     Normal     None      
1      AR(1)                     Normal     mu0: 0, sigma0: 3         Normal     None      
2      AR(2)                     Normal     mu0: 0, sigma0: 2.7       Normal     None      
3      AR(3)                     Normal     mu0: 0, sigma0: 2.4       Normal     None      
4      SC(1)                     Normal     mu0: 0, sigma0: 3         Normal     None      
5      SC(2)                     Normal     mu0: 0, sigma0: 2.7       Normal     None      
6      SC(3)                     Normal     mu0: 0, sigma0: 2.4       Normal     None      
7      Scale                     Uniform    n/a (non-informative)     Normal     exp       

We next run BBVI on this model. We will use the RMSProp optimizer.

In [3]:
x = model.fit('BBVI', iterations=1000, optimizer='RMSProp')
x.summary()
10% done : ELBO is -1820.99156193
20% done : ELBO is -1820.99156192
30% done : ELBO is -1821.48278908
40% done : ELBO is -1821.48278907
50% done : ELBO is -1821.48278903
60% done : ELBO is -1821.48278898
70% done : ELBO is -1821.48278894
80% done : ELBO is -1821.48278882
90% done : ELBO is -1821.48278882
100% done : ELBO is -1821.48278877

Final model ELBO is -1821.4827887
NORMAL GAS(3,0,3) REGRESSION                                                                              
======================================================= ==================================================
Dependent Variable: Series                              Method: BBVI                                      
Start Date: 3                                           Unnormalized Log Posterior: -1804.7849            
End Date: 191                                           AIC: 3625.56981946                                
Number of observations: 189                             BIC: 3651.50379558                                
==========================================================================================================
Parameter                                Median             Mean               95% Credibility Interval 
======================================== ================== ================== =========================
Constant                                 1670.3205          1670.3203          (1670.237 | 1670.4034)   
AR(1)                                    -0.306             -0.306             (-0.3883 | -0.2237)      
AR(2)                                    0.2039             0.2039             (0.1247 | 0.284)         
AR(3)                                    0.0863             0.086              (0.0043 | 0.1666)        
SC(1)                                    0.646              0.646              (0.566 | 0.7266)         
SC(2)                                    0.7063             0.7062             (0.6259 | 0.7856)        
SC(3)                                    0.2576             0.2576             (0.1764 | 0.3391)        
Scale                                    69.377             69.3642            (63.9675 | 75.2274)      
==========================================================================================================

We can plot the parameters below:

In [4]:
model.plot_parameters(indices=range(1,6))

We can check the in-sample fit of the model using plot_fit:

In [5]:
model.plot_fit(figsize=(15,5))

We can predict forward with the model and plot the results using plot_predict:

In [6]:
model.plot_predict(h=3,past_values=16,figsize=(15,5))