S&P 500 Stock Price Forecasting

Comparing ARIMA and LSTM models for financial time series prediction. Explore traditional statistical methods versus deep learning approaches in stock market forecasting.

Project Overview

This project explores time series forecasting of the S&P 500 stock market index using both traditional statistical methods and modern deep learning approaches. We implement and compare ARIMA (Auto-Regressive Integrated Moving Average) and LSTM (Long Short-Term Memory) networks to predict future stock prices based on historical data. The analysis includes time series decomposition, stationarity testing, model training, and comprehensive performance evaluation.

Important Disclaimer

This is an educational project for learning time series forecasting techniques. Stock markets are inherently volatile, and their future movements are highly uncertain. These models should NOT be used for actual trading decisions or financial planning. Past performance does not guarantee future results.

What is the S&P 500?

The Standard & Poor's 500 is a stock market index tracking the performance of 500 large companies listed on U.S. stock exchanges. It's widely regarded as the best gauge of the U.S. equity market and serves as a benchmark for investment performance. The index is weighted by market capitalization, meaning larger companies have more influence on its movements.
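
As a toy illustration of cap-weighting (the names and numbers below are invented, not real market caps): each stock's index weight is simply its share of the combined market cap.

# Hypothetical market caps in $ billions (illustrative, not real data)
market_caps = {'MegaCorp': 3000, 'MidCorp': 1500, 'SmallCorp': 500}
total_cap = sum(market_caps.values())

# Each stock's weight is its share of the total market cap
weights = {name: cap / total_cap for name, cap in market_caps.items()}
print(weights)  # {'MegaCorp': 0.6, 'MidCorp': 0.3, 'SmallCorp': 0.1}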

Time Series Decomposition

Before building forecasting models, we decompose the time series into its fundamental components to understand the underlying patterns:

  1. Trend Component: the long-term progression of the series, the overall upward or downward movement over extended periods.
  2. Seasonal Component: repeating patterns or cycles occurring at regular intervals (daily, weekly, monthly, yearly).
  3. Residual Component: random fluctuations, or noise, that cannot be attributed to trend or seasonality.
from statsmodels.tsa.seasonal import seasonal_decompose
import pandas as pd

# Perform time series decomposition on daily closing prices
# (sp500_data: a pandas Series of daily closes, loaded as in Step 1 below).
# period=252 approximates one year of trading days; the multiplicative model
# suits series whose seasonal swings scale with the level of the trend.
decomposition = seasonal_decompose(sp500_data, model='multiplicative', period=252)

# Extract the individual components
trend = decomposition.trend
seasonal = decomposition.seasonal
residual = decomposition.resid
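
To inspect these components visually, the result object returned by seasonal_decompose can be plotted directly; a minimal sketch:

import matplotlib.pyplot as plt

# Draw observed, trend, seasonal, and residual panels in a single figure
fig = decomposition.plot()
fig.set_size_inches(12, 8)
plt.show()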

Model Comparison: ARIMA vs LSTM

ARIMA Model

Statistical Approach

  • Classical time series method
  • Captures linear dependencies
  • Requires stationary data
  • Fast to train and interpretable
  • Parameters: (p, d, q)

LSTM Network

Deep Learning Approach

  • Recurrent neural network
  • Captures non-linear patterns
  • Handles long-term dependencies
  • Requires more data and compute
  • Memory cells with gates

ARIMA Model Implementation

ARIMA is a powerful statistical method for forecasting time series data. It combines three components: AutoRegressive (AR), Integrated (I), and Moving Average (MA).

ARIMA(p, d, q) Mathematical Formula

\[ Y'_t = c + \phi_1 Y'_{t-1} + \phi_2 Y'_{t-2} + \dots + \phi_p Y'_{t-p} + \theta_1 \epsilon_{t-1} + \theta_2 \epsilon_{t-2} + \dots + \theta_q \epsilon_{t-q} + \epsilon_t \]

Here \( Y'_t \) is the series after differencing d times. p: number of lag observations (AR order), d: degree of differencing, q: size of the moving average window (MA order).
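
For example, with d = 1 the differenced series is simply the day-over-day change:

\[ Y'_t = Y_t - Y_{t-1} \]

Applying the operator again gives d = 2: \( Y''_t = Y'_t - Y'_{t-1} \).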

Step 1: Load and Prepare Data

import yfinance as yf
import pandas as pd
from statsmodels.tsa.arima.model import ARIMA
from statsmodels.tsa.stattools import adfuller

# Download S&P 500 historical data
sp500 = yf.download('^GSPC', start='2010-01-01', end='2024-01-01')
prices = sp500['Close'].squeeze().dropna()  # squeeze to a 1-D Series for adfuller

# Test for stationarity using the Augmented Dickey-Fuller test.
# Null hypothesis: the series has a unit root (is non-stationary);
# a p-value above 0.05 means we cannot reject it, so differencing is needed.
result = adfuller(prices)
print(f'ADF Statistic: {result[0]:.4f}')
print(f'p-value: {result[1]:.4f}')
print(f'Critical Values: {result[4]}')

Step 2: Make Series Stationary

import matplotlib.pyplot as plt
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf

# Apply first-order differencing to achieve stationarity
diff_prices = prices.diff().dropna()

# Visualize ACF and PACF to choose the ARIMA orders:
# the PACF cutoff lag suggests p, the ACF cutoff lag suggests q
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 4))
plot_acf(diff_prices, ax=ax1, lags=40)
plot_pacf(diff_prices, ax=ax2, lags=40)
ax1.set_title('Autocorrelation Function (ACF)')
ax2.set_title('Partial Autocorrelation Function (PACF)')
plt.show()
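
Alternatively, if the optional pmdarima package is installed (an assumption; it is not used in the original listing), auto_arima can search candidate orders automatically:

from pmdarima import auto_arima

# Stepwise search over (p, d, q) orders, minimizing AIC
auto_model = auto_arima(prices, seasonal=False, stepwise=True, trace=True)
print(auto_model.order)  # the selected (p, d, q) tuple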

Step 3: Fit ARIMA Model and Forecast

# Split data into training and testing sets
train_size = int(len(prices) * 0.8)
train, test = prices[:train_size], prices[train_size:]

# Fit ARIMA(5, 1, 0) model
model = ARIMA(train, order=(5, 1, 0))
model_fit = model.fit()

# Display model summary
print(model_fit.summary())

# Forecast over the test horizon so the predictions can be scored against actual prices
arima_predictions = model_fit.forecast(steps=len(test))
print(f'\nForecasted values:\n{arima_predictions}')
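
A single multi-step forecast tends to flatten toward the mean, so walk-forward evaluation is also common: refit on an expanding window and forecast one step at a time. A minimal sketch of that setup (an assumption about the evaluation, not taken from the original notebook); rolling_preds can be scored the same way as arima_predictions below:

# Walk-forward validation: refit after each observed test point
expanding_history = list(train)
rolling_preds = []
for obs in test:
    step_fit = ARIMA(expanding_history, order=(5, 1, 0)).fit()
    rolling_preds.append(step_fit.forecast(steps=1)[0])
    expanding_history.append(obs)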

LSTM Model Implementation

Long Short-Term Memory networks are a special type of recurrent neural network capable of learning long-term dependencies. They use a sophisticated architecture with memory cells and gates to control information flow.

LSTM Architecture

LSTM cells contain three types of gates that regulate information flow:

  • Forget gate: decides what information to discard from the cell state.
  • Input gate: decides which new information to store in the cell state.
  • Output gate: decides what part of the cell state to expose as the hidden-state output.

This gating mechanism allows LSTMs to remember important patterns over long sequences while forgetting irrelevant information.
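
In equations (the standard LSTM formulation, with \( \sigma \) the logistic sigmoid and \( \odot \) element-wise multiplication):

\[ f_t = \sigma(W_f [h_{t-1}, x_t] + b_f), \quad i_t = \sigma(W_i [h_{t-1}, x_t] + b_i), \quad o_t = \sigma(W_o [h_{t-1}, x_t] + b_o) \]

\[ \tilde{C}_t = \tanh(W_C [h_{t-1}, x_t] + b_C), \quad C_t = f_t \odot C_{t-1} + i_t \odot \tilde{C}_t, \quad h_t = o_t \odot \tanh(C_t) \]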

Step 1: Data Preprocessing and Normalization

import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Normalize data to the range [0, 1] for better LSTM performance.
# Fit the scaler on the training portion only, so no information from the
# test period leaks into the scaling (test values may then fall slightly
# outside [0, 1], which is fine).
train_len = int(0.8 * len(prices))
scaler = MinMaxScaler(feature_range=(0, 1))
scaler.fit(prices.values[:train_len].reshape(-1, 1))
scaled_data = scaler.transform(prices.values.reshape(-1, 1))

# Create sequences of 60 days to predict the next day
def create_sequences(data, seq_length=60):
    X, y = [], []
    for i in range(seq_length, len(data)):
        X.append(data[i-seq_length:i, 0])
        y.append(data[i, 0])
    return np.array(X), np.array(y)

X, y = create_sequences(scaled_data)
X = X.reshape((X.shape[0], X.shape[1], 1))

# Split into train/test sets
split = int(0.8 * len(X))
X_train, X_test = X[:split], X[split:]
y_train, y_test = y[:split], y[split:]
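
As a quick check, the arrays should have the 3-D shape Keras LSTMs expect; the exact sample counts depend on the downloaded date range:

# (samples, timesteps, features); e.g. roughly (2770, 60, 1) for 2010-2024 daily data
print(X_train.shape, y_train.shape)
print(X_test.shape, y_test.shape)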

Step 2: Build LSTM Architecture

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense, Dropout
from tensorflow.keras.callbacks import EarlyStopping

# Build a stacked LSTM with dropout for regularization.
# return_sequences=True makes a layer emit its full hidden-state sequence,
# which the next LSTM layer needs as input; the final LSTM returns only its
# last hidden state, which feeds the Dense layer that predicts the next price.
model = Sequential([
    LSTM(50, return_sequences=True, input_shape=(60, 1)),
    Dropout(0.2),
    LSTM(50, return_sequences=True),
    Dropout(0.2),
    LSTM(50),
    Dropout(0.2),
    Dense(1)
])

model.compile(optimizer='adam', loss='mean_squared_error')
model.summary()

Step 3: Train Model and Generate Predictions

# Define early stopping to prevent overfitting
early_stop = EarlyStopping(monitor='val_loss', patience=10, restore_best_weights=True)

# Train the model
history = model.fit(
    X_train, y_train,
    epochs=100,
    batch_size=32,
    validation_split=0.1,
    callbacks=[early_stop],
    verbose=1
)

# Make predictions on test set
predictions = model.predict(X_test)
predictions = scaler.inverse_transform(predictions)
y_test_actual = scaler.inverse_transform(y_test.reshape(-1, 1))
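
A quick sanity-check plot (not part of the original listing) shows how closely the predictions track the test-set prices:

import matplotlib.pyplot as plt

# Compare predicted vs. actual prices over the test period
plt.figure(figsize=(12, 5))
plt.plot(y_test_actual, label='Actual')
plt.plot(predictions, label='LSTM prediction')
plt.title('LSTM Predictions vs. Actual Prices (Test Set)')
plt.xlabel('Test-set trading day')
plt.ylabel('Price (USD)')
plt.legend()
plt.show()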

Performance Comparison

After training both models on the same dataset, we evaluate their performance using standard regression metrics:

Metric                                   ARIMA             LSTM                       Better Model
RMSE (Root Mean Square Error)            ~45-60 points     ~35-50 points              LSTM
MAE (Mean Absolute Error)                ~35-45 points     ~25-40 points              LSTM
MAPE (Mean Absolute Percentage Error)    ~1.2-1.8%         ~0.8-1.4%                  LSTM
Training Time                            Fast (seconds)    Slow (minutes to hours)    ARIMA
Interpretability                         High              Low (black box)            ARIMA
Data Requirements                        Moderate          High                       ARIMA

Calculate Metrics

from sklearn.metrics import mean_squared_error, mean_absolute_error
import numpy as np

# Calculate performance metrics
def calculate_metrics(actual, predicted):
    rmse = np.sqrt(mean_squared_error(actual, predicted))
    mae = mean_absolute_error(actual, predicted)
    mape = np.mean(np.abs((actual - predicted) / actual)) * 100

    print(f'RMSE: {rmse:.2f}')
    print(f'MAE: {mae:.2f}')
    print(f'MAPE: {mape:.2f}%')

    return rmse, mae, mape

# Evaluate both models
print('ARIMA Performance:')
calculate_metrics(test, arima_predictions)

print('\nLSTM Performance:')
calculate_metrics(y_test_actual, predictions)

Limitations and Considerations

Stock market prediction is extremely challenging. Markets are influenced by countless factors including economic indicators, geopolitical events, investor sentiment, news, and regulatory changes. No model can consistently predict market movements with high accuracy. These techniques should be used for educational purposes and understanding time series analysis, not as financial advice or trading systems.

Explore the Full Implementation

Run the complete notebook with detailed visualizations, hyperparameter tuning, and extended analysis. Train both models on historical S&P 500 data and compare their predictions.