ARIMA Model in Python
The ARIMA model is one of the most popular statistical methods for time series analysis and forecasting. It is a combination of three components: AutoRegression (AR), Integration (I), and Moving Average (MA). Let’s break it down step-by-step, covering both the theory and Python implementation.
1. Introduction to ARIMA
ARIMA stands for AutoRegressive Integrated Moving Average, a technique for forecasting time series.
It combines three components:
- AR (AutoRegression): Relates an observation to its own past observations (lags).
- I (Integration): Differences the data to make it stationary.
- MA (Moving Average): Relates an observation to past forecast errors.
The parameters of ARIMA are:
- p: Number of lag observations (AR order).
- d: Number of times the data is differenced to achieve stationarity.
- q: The number of lagged forecast errors (MA order).
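The three parameters can be written as a single equation. The sketch below uses the backshift operator $B$ (where $B y_t = y_{t-1}$); the symbols $\phi_i$, $\theta_j$, $c$, and $\varepsilon_t$ are standard textbook notation assumed here, not notation taken from this article:

```latex
\left(1 - \sum_{i=1}^{p} \phi_i B^i\right) (1 - B)^d \, y_t
  = c + \left(1 + \sum_{j=1}^{q} \theta_j B^j\right) \varepsilon_t
```

For example, ARIMA(1, 1, 1) with $c = 0$ expands to $y_t - y_{t-1} = \phi_1 (y_{t-1} - y_{t-2}) + \varepsilon_t + \theta_1 \varepsilon_{t-1}$: an AR term and an MA term applied to the once-differenced series.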
2. Steps to Build and Implement an ARIMA Model
We’ll use Python to implement an ARIMA model step-by-step.
Step 1: Import Libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from statsmodels.tsa.stattools import adfuller
from statsmodels.tsa.arima.model import ARIMA
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf
Step 2: Load Data
For this example, let’s use the classic AirPassengers dataset. It contains monthly airline passenger numbers from 1949 to 1960.
# Load dataset
data = pd.read_csv('https://raw.githubusercontent.com/jbrownlee/Datasets/master/airline-passengers.csv',
index_col='Month', parse_dates=True)
data.index.freq = 'MS' # Monthly Start frequency
data.columns = ['Passengers']
# Plot the data
data.plot(figsize=(10, 5), title='Monthly Airline Passengers', color='blue')
plt.show()
Output: A line plot showing the number of passengers over time, clearly indicating an upward trend and seasonality.
Step 3: Check for Stationarity
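ARIMA assumes the series is stationary (constant mean and variance over time) once it has been differenced d times, so the first step is to check whether the raw series is already stationary. Use the Augmented Dickey-Fuller (ADF) test, whose null hypothesis is that the series has a unit root (is non-stationary).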
# Perform ADF Test
result = adfuller(data['Passengers'])
print(f"ADF Statistic: {result[0]}")
print(f"p-value: {result[1]}")
Mock Output:
ADF Statistic: 0.815368
p-value: 0.991880
- Interpretation: The p-value is above 0.05, so we cannot reject the null hypothesis of a unit root. The series is treated as non-stationary, and we'll need to difference it.
Step 4: Differencing to Achieve Stationarity
Apply differencing to remove trends.
# First-order differencing
data_diff = data['Passengers'].diff().dropna()
# Plot the differenced data
data_diff.plot(figsize=(10, 5), title='Differenced Series', color='purple')
plt.show()
# Recheck stationarity
result_diff = adfuller(data_diff)
print(f"ADF Statistic: {result_diff[0]}")
print(f"p-value: {result_diff[1]}")
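Mock Output:
- Differenced series shows no obvious trend.
- Stationarity test result:
ADF Statistic: -2.830
p-value: 0.054
- Interpretation: A statistic of about -2.83 lies just short of the 5% critical value (roughly -2.88), so the p-value sits right at the 0.05 threshold and the evidence for stationarity is borderline. First-order differencing removes the trend, but the seasonal pattern remains. This walkthrough proceeds with d=1.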
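The "Integration" in ARIMA refers to the fact that differencing can be undone by cumulative summation, which is how forecasts on the differenced scale are mapped back to the original scale. A small sketch with a short illustrative series (the values are chosen for the example):

```python
import numpy as np
import pandas as pd

# Short illustrative series (assumption: values picked for the example).
y = pd.Series([112.0, 118.0, 132.0, 129.0, 121.0, 135.0])

# First-order differencing: each value minus the previous one.
diff = y.diff().dropna()

# Undo the differencing: prepend the first observation, then cumulatively sum.
restored = pd.concat([y.iloc[:1], diff]).cumsum()

print(diff.tolist())
print(restored.tolist())
```

The restored series matches the original, so no information beyond the first observation is lost by differencing once.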
Step 5: Plot ACF and PACF
Use ACF (Autocorrelation Function) and PACF (Partial Autocorrelation Function) plots to identify p and q.
# ACF and PACF
plot_acf(data_diff, lags=20, title='ACF Plot')
plot_pacf(data_diff, lags=20, title='PACF Plot')
plt.show()
Output:
- ACF Plot: Significant spikes indicate potential q (MA order).
- PACF Plot: Significant spikes indicate potential p (AR order).
Example Inference:
- ACF suggests q=2.
- PACF suggests p=2.
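Reading "significant spikes" off a plot amounts to comparing each autocorrelation against a confidence band. The sketch below computes the sample ACF by hand with NumPy and uses the large-sample band of ±1.96/√n, which is an assumption here (plot_acf may compute its band somewhat differently). The series is a simulated MA(1) process, so only lag 1 is expected to stand out clearly:

```python
import numpy as np


def sample_acf(x, nlags):
    """Sample autocorrelation for lags 0..nlags."""
    x = np.asarray(x, dtype=float)
    x = x - x.mean()
    denom = np.dot(x, x)
    lags = [np.dot(x[:-k], x[k:]) / denom for k in range(1, nlags + 1)]
    return np.array([1.0] + lags)


# Simulated MA(1) series with theta = 0.8 (assumption: illustration only).
rng = np.random.default_rng(42)
eps = rng.normal(size=2001)
ma1 = eps[1:] + 0.8 * eps[:-1]

acf_vals = sample_acf(ma1, nlags=5)
bound = 1.96 / np.sqrt(len(ma1))  # approximate 95% band under white noise
significant_lags = [k for k in range(1, 6) if abs(acf_vals[k]) > bound]

print(np.round(acf_vals, 3))
print("band:", round(bound, 3), "significant lags:", significant_lags)
```

For an MA(q) process the ACF cuts off after lag q, while for an AR(p) process the PACF cuts off after lag p, which is why the two plots are read for q and p respectively.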
Step 6: Fit the ARIMA Model
Now, fit the ARIMA model using the identified (p, d, q) values.
# Fit ARIMA model
model = ARIMA(data['Passengers'], order=(2, 1, 2)) # ARIMA(p=2, d=1, q=2)
model_fit = model.fit()
# Summary of the model
print(model_fit.summary())
Mock Output:
                               SARIMAX Results
==============================================================================
Dep. Variable:             Passengers   No. Observations:                  144
Model:                 ARIMA(2, 1, 2)   Log Likelihood                -506.453
Date:                Thu, 23 Jan 2025   AIC                           1022.905
Time:                        12:34:12   BIC                           1036.679
Sample:                    01-01-1949   HQIC                          1028.377
                         - 12-01-1960
==============================================================================
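The AIC in the summary follows from the log likelihood and the number of estimated parameters: AIC = 2k - 2·ln(L). As a quick arithmetic check on the mock numbers above, assuming k = 5 (two AR coefficients, two MA coefficients, and the error variance):

```python
# Values copied from the mock summary above.
log_likelihood = -506.453

# Assumption: 2 AR + 2 MA coefficients + 1 error variance = 5 parameters.
k_params = 5

aic = 2 * k_params - 2 * log_likelihood
print(round(aic, 3))  # about 1022.906, matching the summary up to rounding
```

Lower AIC is better when comparing models fitted to the same data with the same d; the penalty term 2k discourages adding lags that barely improve the fit.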
Step 7: Check Residuals
If the model fits well, the residuals should look like white noise: centered on zero, with no remaining trend or autocorrelation.
# Plot residuals
residuals = model_fit.resid
residuals.plot(title='Residuals')
plt.show()
# Plot residual density
residuals.plot(kind='kde', title='Residual Density')
plt.show()
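Output:
- Residuals appear randomly distributed around zero. With d=1 the first residual is typically large (close to the first observation), because there is no earlier value to difference against; consider dropping it before judging the plots.
- The density plot is roughly bell-shaped, which is consistent with approximately normal residuals.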
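Visual checks can be backed by a numeric one. The sketch below hand-computes the Ljung-Box Q statistic with NumPy and compares it to the 5% chi-square critical value for 10 lags (18.307). The two input series are simulated stand-ins for "good" and "bad" residuals, and the function name ljung_box_q is hypothetical; statsmodels also ships a ready-made version, acorr_ljungbox, in statsmodels.stats.diagnostic.

```python
import numpy as np


def ljung_box_q(resid, nlags=10):
    """Ljung-Box Q statistic: n(n+2) * sum_k r_k^2 / (n - k)."""
    e = np.asarray(resid, dtype=float)
    e = e - e.mean()
    n = len(e)
    denom = np.dot(e, e)
    total = 0.0
    for k in range(1, nlags + 1):
        r_k = np.dot(e[:-k], e[k:]) / denom  # lag-k autocorrelation
        total += r_k ** 2 / (n - k)
    return n * (n + 2) * total


CHI2_95_DF10 = 18.307  # 95th percentile of chi-square with 10 degrees of freedom

# Simulated stand-ins for residuals (assumption: illustration only).
rng = np.random.default_rng(1)
noise = rng.normal(size=500)          # behaves like residuals of a good model
ar1 = np.zeros(500)                   # strongly autocorrelated "residuals"
for t in range(1, 500):
    ar1[t] = 0.8 * ar1[t - 1] + noise[t]

q_noise = ljung_box_q(noise)
q_ar1 = ljung_box_q(ar1)
print(f"white noise Q = {q_noise:.1f}, AR(1) Q = {q_ar1:.1f}, "
      f"critical value = {CHI2_95_DF10}")
```

A Q value above the critical value signals leftover autocorrelation. When testing residuals of a fitted ARIMA(p, d, q), the degrees of freedom are usually reduced by p + q, so the fixed threshold here is only a rough guide.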
Step 8: Forecast Future Values
Forecast the next 12 months of passenger data.
# Forecast future values
forecast = model_fit.forecast(steps=12)
print("Forecasted Values:")
print(forecast)
# Plot forecast against actual data
plt.figure(figsize=(10, 5))
plt.plot(data['Passengers'], label='Actual Data')
plt.plot(forecast, label='Forecast', color='red')
plt.title('Actual vs Forecasted Data')
plt.legend()
plt.show()
Mock Output:
1. Forecasted values:
1961-01-01 461.29
1961-02-01 469.42
1961-03-01 475.38
...
2. The plot shows the forecasted red line extending beyond the historical data.
3. Examples of ARIMA Parameter Tuning
The orders read from the ACF and PACF plots are only a starting point. Compare several (p, d, q) candidates, either by hand using an information criterion such as AIC, or automatically with a library such as pmdarima.
4. Best Practices
1. Use auto_arima for automatic parameter selection:
from pmdarima import auto_arima
auto_model = auto_arima(data['Passengers'], seasonal=False, trace=True)
print(auto_model.summary())
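2. Validate the model on a held-out test set to check that it generalizes. For time series, split chronologically (train on the earlier part, test on the most recent part) rather than at random.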