## Theory

ARIMA models are the classic Box-Jenkins tool for time-series forecasting. They take the form where is the number of autoregressive lags, is the degree of differencing and is the number of moving average lags:

The Box-Jenkins philosophy for time series is that you difference your series until it is stationary, and then use information criteria to choose the appropriate lag order for the process. The PyFlux library is not in the Box-Jenkins tradition, but we still provide support for this model type due to its popularity.

## PyFlux

We’ll run an ARIMA Model for yearly sunspot data:

```
import numpy as np
import pandas as pd
import pyflux as pf
from datetime import datetime
import matplotlib.pyplot as plt
%matplotlib inline
data = pd.read_csv('https://vincentarelbundock.github.io/Rdatasets/csv/datasets/sunspot.year.csv')
data.index = data['time'].values
plt.figure(figsize=(15,5))
plt.plot(data.index,data['sunspot.year'])
plt.ylabel('Sunspots')
plt.title('Yearly Sunspot Data');
```

We can build an ARIMA model as follows, specifying the order of model we want, as well as a pandas DataFrame or numpy array carrying the data. Here we specify an arbitrary ARIMA(4,0,4) model:

```
model = pf.ARIMA(data=data,ar=4,ma=4,integ=0,target='sunspot.year')
```

Next we estimate the latent variables. For this example we will use a maximum likelihood point mass estimate :

```
x = model.fit("MLE")
x.summary()
```

```
ARIMA(4,0,4)
======================================================= =================================================
Dependent Variable: sunspot.year Method: MLE
Start Date: 1704 Log Likelihood: -1189.488
End Date: 1988 AIC: 2398.9759
Number of observations: 285 BIC: 2435.5008
=========================================================================================================
Latent Variable Estimate Std Error z P>|z| 95% C.I.
======================================== ========== ========== ======== ======== ========================
Constant 8.0092 3.2275 2.4816 0.0131 (1.6834 | 14.3351)
AR(1) 1.6255 0.0367 44.2529 0.0 (1.5535 | 1.6975)
AR(2) -0.4345 0.2455 -1.7701 0.0767 (-0.9157 | 0.0466)
AR(3) -0.8819 0.2295 -3.8432 0.0001 (-1.3317 | -0.4322)
AR(4) 0.5261 0.0429 12.2515 0.0 (0.4419 | 0.6103)
MA(1) -0.5061 0.0383 -13.2153 0.0 (-0.5812 | -0.4311)
MA(2) -0.481 0.1361 -3.533 0.0004 (-0.7478 | -0.2142)
MA(3) 0.2511 0.1093 2.2979 0.0216 (0.0369 | 0.4653)
MA(4) 0.2846 0.0602 4.7242 0.0 (0.1665 | 0.4027)
Sigma 15.7944
=========================================================================================================
```

We can plot the latent variables using

**plot_z**on the model object:

```
model.plot_z(indices=range(1,9))
```

We can plot the in-sample fit using

**plot_fit**:

```
model.plot_fit(figsize=(15,5))
```

We can get an idea of the performance of our model by using rolling in-sample prediction through the

**plot_predict_is**function:

```
model.plot_predict_is(50,figsize=(15,5))
```

We can plot h-step ahead predictions using the

**plot_predict**function:

```
model.plot_predict(h=20,past_values=20,figsize=(15,5))
```

If we just want the predictions themselves, use the

**predict**method:

```
model.predict(h=20)
```

sunspot.year | |
---|---|

1989 | 134.372090 |

1990 | 147.839299 |

1991 | 127.473194 |

1992 | 95.732619 |

1993 | 48.541385 |

1994 | 10.669902 |

1995 | -13.106388 |

1996 | -10.378003 |

1997 | 12.961587 |

1998 | 50.760091 |

1999 | 87.145753 |

2000 | 110.718263 |

2001 | 112.168327 |

2002 | 92.078392 |

2003 | 57.144659 |

2004 | 20.211365 |

2005 | -6.163868 |

2006 | -12.748751 |

2007 | 2.202417 |

2008 | 33.197924 |

Note that we should compare different types of ARIMA model using information criteria before settling on which model type to use. You could also use cross-validation and compare the out-of-sample performance of different specifications.