## Theory

ARIMA models are the classic Box-Jenkins tool for time-series forecasting. They take the form $ARIMA(p,d,q)$ where $p$ is the number of autoregressive lags, $d$ is the degree of differencing and $q$ is the number of moving average lags:

$\Delta^{D}y_{t} = \sum^{p}_{i=1}\phi_{i}\Delta^{D}y_{t-i} + \sum^{q}_{j=1}\theta_{j}\epsilon_{t-j} + \epsilon_{t}$

$\epsilon_{t} \sim N\left(0,\sigma^{2}\right)$

The Box-Jenkins philosophy for time series is that you difference your series until it is stationary, and then use information criteria to choose the appropriate lag order for the $ARIMA$ process. The PyFlux library is not in the Box-Jenkins tradition, but we still provide support for this model type due to its popularity.

## PyFlux

We’ll run an ARIMA Model for yearly sunspot data:

import numpy as np
import pandas as pd
import pyflux as pf
from datetime import datetime
import matplotlib.pyplot as plt
%matplotlib inline

data.index = data['time'].values

plt.figure(figsize=(15,5))
plt.plot(data.index,data['sunspot.year'])
plt.ylabel('Sunspots')
plt.title('Yearly Sunspot Data');


We can build an ARIMA model as follows, specifying the order of model we want, as well as a pandas DataFrame or numpy array carrying the data. Here we specify an arbitrary ARIMA(4,0,4) model:

model = pf.ARIMA(data=data,ar=4,ma=4,integ=0,target='sunspot.year')


Next we estimate the latent variables. For this example we will use a maximum likelihood point mass estimate $z^{MLE}$:

x = model.fit("MLE")
x.summary()

ARIMA(4,0,4)
======================================================= =================================================
Dependent Variable: sunspot.year                        Method: MLE
Start Date: 1704                                        Log Likelihood: -1189.488
End Date: 1988                                          AIC: 2398.9759
Number of observations: 285                             BIC: 2435.5008
=========================================================================================================
Latent Variable                          Estimate   Std Error  z        P>|z|    95% C.I.
======================================== ========== ========== ======== ======== ========================
Constant                                 8.0092     3.2275     2.4816   0.0131   (1.6834 | 14.3351)
AR(1)                                    1.6255     0.0367     44.2529  0.0      (1.5535 | 1.6975)
AR(2)                                    -0.4345    0.2455     -1.7701  0.0767   (-0.9157 | 0.0466)
AR(3)                                    -0.8819    0.2295     -3.8432  0.0001   (-1.3317 | -0.4322)
AR(4)                                    0.5261     0.0429     12.2515  0.0      (0.4419 | 0.6103)
MA(1)                                    -0.5061    0.0383     -13.2153 0.0      (-0.5812 | -0.4311)
MA(2)                                    -0.481     0.1361     -3.533   0.0004   (-0.7478 | -0.2142)
MA(3)                                    0.2511     0.1093     2.2979   0.0216   (0.0369 | 0.4653)
MA(4)                                    0.2846     0.0602     4.7242   0.0      (0.1665 | 0.4027)
Sigma                                    15.7944
=========================================================================================================


We can plot the latent variables $z$ using plot_z on the model object:

model.plot_z(indices=range(1,9))


We can plot the in-sample fit using plot_fit:

model.plot_fit(figsize=(15,5))


We can get an idea of the performance of our model by using rolling in-sample prediction through the plot_predict_is function:

model.plot_predict_is(50,figsize=(15,5))


We can plot h-step ahead predictions using the plot_predict function:

model.plot_predict(h=20,past_values=20,figsize=(15,5))


If we just want the predictions themselves, use the predict method:

model.predict(h=20)

sunspot.year
1989 134.372090
1990 147.839299
1991 127.473194
1992 95.732619
1993 48.541385
1994 10.669902
1995 -13.106388
1996 -10.378003
1997 12.961587
1998 50.760091
1999 87.145753
2000 110.718263
2001 112.168327
2002 92.078392
2003 57.144659
2004 20.211365
2005 -6.163868
2006 -12.748751
2007 2.202417
2008 33.197924

Note that we should compare different types of ARIMA model using information criteria before settling on which model type to use. You could also use cross-validation and compare the out-of-sample performance of different specifications.