ARIMA Model in Python

The ARIMA model is one of the most popular statistical methods for time series analysis and forecasting. It combines three components: AutoRegression (AR), Integration (I), and Moving Average (MA). Let’s break it down step by step, covering both the theory and the Python implementation.

1. Introduction to ARIMA

ARIMA stands for AutoRegressive Integrated Moving Average. It is a technique for forecasting time series.

It combines three components:

AR (AutoRegression): Relates an observation to its own past observations (lags).
I (Integration): Differences the data to make it stationary.
MA (Moving Average): Relates an observation to past error terms.

The parameters of ARIMA are:

p: Number of lag observations (AR order).
d: Number of times the data is differenced to achieve stationarity.
q: The number of lagged forecast errors (MA order).
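
To make the three parameters concrete, here is a minimal sketch that simulates an ARIMA(1, 1, 1) series with plain NumPy. The function name `simulate_arima_111` and the coefficient values `phi` and `theta` are illustrative assumptions, not part of any library.

```python
import numpy as np

def simulate_arima_111(n, phi=0.6, theta=0.3, seed=0):
    """Simulate an ARIMA(1, 1, 1) series of length n (illustrative helper)."""
    rng = np.random.default_rng(seed)
    eps = rng.normal(size=n)          # white-noise error terms
    w = np.zeros(n)                   # stationary ARMA(1, 1) part
    for t in range(1, n):
        # AR term (p = 1) uses the previous value; MA term (q = 1) uses the previous error
        w[t] = phi * w[t - 1] + eps[t] + theta * eps[t - 1]
    y = np.cumsum(w)                  # integrating once corresponds to d = 1
    return y, w

y, w = simulate_arima_111(200)
# Differencing once (d = 1) recovers the stationary ARMA part
print(np.allclose(np.diff(y), w[1:]))
```

The check at the end shows why d counts differencing steps: one difference of the integrated series gives back the stationary part that the AR and MA terms describe.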

2. Steps to Build and Implement an ARIMA Model

We’ll use Python to implement an ARIMA model step-by-step.

Step 1: Import Libraries

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from statsmodels.tsa.stattools import adfuller
from statsmodels.tsa.arima.model import ARIMA
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf

Step 2: Load Data

For this example, let’s use the classic AirPassengers dataset. It contains monthly airline passenger numbers from 1949 to 1960.

# Load dataset
data = pd.read_csv('https://raw.githubusercontent.com/jbrownlee/Datasets/master/airline-passengers.csv', 
                   index_col='Month', parse_dates=True)
data.index.freq = 'MS'  # Monthly Start frequency
data.columns = ['Passengers']

# Plot the data
data.plot(figsize=(10, 5), title='Monthly Airline Passengers', color='blue')
plt.show()

Output: A line plot showing the number of passengers over time, clearly indicating an upward trend and seasonality.

Step 3: Check for Stationarity

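ARIMA assumes the series is stationary (constant mean and variance over time) once it has been differenced d times, so we first check whether the raw data is stationary. Perform the Augmented Dickey-Fuller (ADF) test, whose null hypothesis is that the series has a unit root (is non-stationary).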

# Perform ADF Test
result = adfuller(data['Passengers'])

print(f"ADF Statistic: {result[0]}")
print(f"p-value: {result[1]}")

Mock Output:

ADF Statistic: 0.815368
p-value: 0.991880

Interpretation: The p-value is greater than 0.05, so we cannot reject the null hypothesis of a unit root, and the series is treated as non-stationary. We’ll need to difference it.

Step 4: Differencing to Achieve Stationarity

Apply differencing to remove trends.

# First-order differencing
data_diff = data['Passengers'].diff().dropna()

# Plot the differenced data
data_diff.plot(figsize=(10, 5), title='Differenced Series', color='purple')
plt.show()

# Recheck stationarity
result_diff = adfuller(data_diff)
print(f"ADF Statistic: {result_diff[0]}")
print(f"p-value: {result_diff[1]}")

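Mock Output:

Differenced series shows no obvious trend, although the seasonal swings remain visible.
Stationarity test result:

ADF Statistic: -2.829
p-value: 0.054

Interpretation: The p-value has dropped from about 0.99 to roughly 0.05. The result is borderline rather than clearly below the threshold, largely because the seasonal pattern is still present, but first-order differencing has removed the trend. We proceed with d=1 for this example.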
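
To see what differencing does at the level of individual values, here is a small pandas sketch on a short illustrative series. It also shows how a first difference can be undone, which is how a model with d=1 can return forecasts on the original scale.

```python
import pandas as pd

s = pd.Series([112, 118, 132, 129, 121], dtype=float)

diff1 = s.diff().dropna()          # y_t - y_{t-1}; the first value is lost
print(diff1.tolist())              # [6.0, 14.0, -3.0, -8.0]

# Undo the differencing: start from the first observation and accumulate
restored = pd.concat([s.iloc[:1], diff1]).cumsum()
print(restored.tolist() == s.tolist())   # True
```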

Step 5: Plot ACF and PACF

Use ACF (Autocorrelation Function) and PACF (Partial Autocorrelation Function) plots to identify p and q.

# ACF and PACF
plot_acf(data_diff, lags=20, title='ACF Plot')
plot_pacf(data_diff, lags=20, title='PACF Plot')
plt.show()

Output:

ACF Plot: Significant spikes indicate potential q (MA order).
PACF Plot: Significant spikes indicate potential p (AR order).

Example Inference:

ACF suggests q=2.
PACF suggests p=2.
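
Reading spikes off a plot can be subjective, so it may help to compute the autocorrelations numerically and compare them with the approximate 95% band of ±1.96/√n. The sketch below uses plain NumPy on a synthetic AR(1) series; `sample_acf` is a hypothetical helper, not a statsmodels function.

```python
import numpy as np

def sample_acf(x, nlags):
    """Sample autocorrelation for lags 0..nlags (plain NumPy sketch)."""
    x = np.asarray(x, dtype=float)
    x = x - x.mean()
    denom = np.dot(x, x)
    return np.array([np.dot(x[: len(x) - k], x[k:]) / denom for k in range(nlags + 1)])

rng = np.random.default_rng(1)
n = 500
x = np.zeros(n)
for t in range(1, n):                  # synthetic AR(1) series with coefficient 0.8
    x[t] = 0.8 * x[t - 1] + rng.normal()

acf_vals = sample_acf(x, nlags=10)
band = 1.96 / np.sqrt(n)               # approximate 95% band under white noise
significant = [k for k in range(1, 11) if abs(acf_vals[k]) > band]
print("Lags outside the band:", significant)
```

For an autoregressive series like this one, the ACF tends to decay gradually over many lags, which is why the PACF rather than the ACF is used to suggest p.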

Step 6: Fit the ARIMA Model

Now, fit the ARIMA model using the identified (p, d, q) values.

# Fit ARIMA model
model = ARIMA(data['Passengers'], order=(2, 1, 2))  # ARIMA(p=2, d=1, q=2)
model_fit = model.fit()

# Summary of the model
print(model_fit.summary())

Mock Output:

                               SARIMAX Results                                
==============================================================================
Dep. Variable:             Passengers   No. Observations:                  144
Model:                 ARIMA(2, 1, 2)   Log Likelihood                -506.453
Date:                Thu, 23 Jan 2025   AIC                           1022.905
Time:                        12:34:12   BIC                           1036.679
Sample:                    01-01-1949   HQIC                          1028.377
                         - 12-01-1960                                         
==============================================================================

Step 7: Check Residuals

If the model fits well, the residuals should show no remaining pattern and behave like white noise.

# Plot residuals
residuals = model_fit.resid
residuals.plot(title='Residuals')
plt.show()

# Plot residual density
residuals.plot(kind='kde', title='Residual Density')
plt.show()

Output:

Residuals appear randomly distributed around zero.
The density plot is roughly bell-shaped and centered near zero.

Step 8: Forecast Future Values

Forecast the next 12 months of passenger data.

# Forecast future values
forecast = model_fit.forecast(steps=12)
print("Forecasted Values:")
print(forecast)

# Plot forecast against actual data
plt.figure(figsize=(10, 5))
plt.plot(data['Passengers'], label='Actual Data')
plt.plot(forecast, label='Forecast', color='red')
plt.title('Actual vs Forecasted Data')
plt.legend()
plt.show()

Mock Output:

1. Forecasted values:

1961-01-01    461.29
1961-02-01    469.42
1961-03-01    475.38
...

2. The plot shows the forecasted red line extending beyond the historical data.

3. Examples of ARIMA Parameter Tuning

You may need to try different (p, d, q) values using trial and error or automated tuning with libraries like pmdarima.

4. Best Practices

1. Use auto_arima for automatic parameter selection:

from pmdarima import auto_arima
auto_model = auto_arima(data['Passengers'], seasonal=False, trace=True)
print(auto_model.summary())

2. Validate the model using train-test split to ensure generalizability.
