US GDP Forecasting#
Author: Dron Mongia
Course Project, UC Irvine, Math 10, S24
I would like to post my notebook on the course’s website: Yes
Motivation and Procedure#
GDP is a very strong metric of a country's economic wellbeing, and forecasts of it are therefore highly sought after. Policymakers and legislators, for example, may want a rough forecast of the trend in the country's GDP prior to passing a new bill or law. My procedure for this project is to first compile time series data from the FRED API consisting of economic metrics closely related to GDP (GDP = Consumption + Investment + Government Spending + Net Exports). We will then conduct a series of exploratory tests to better understand our data, and finally we will use a variety of models (including machine learning methods) to see which one provides the most accurate forecast.
import fredapi as fd
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import plotly.express as px
Feature Creation#
fred = fd.Fred(api_key = 'efdb939ff3f6b3129c564ef0aa34a91e') # generally this should be hidden, however for the purposes of this project I will keep it visible
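In practice, an API key like this would normally be kept out of the notebook entirely, for example by reading it from an environment variable. A minimal sketch of that approach (assuming a hypothetical `FRED_API_KEY` variable has been set beforehand):

```python
import os

# Read the key from the environment so it never appears in the notebook itself.
# FRED_API_KEY is an assumed variable name, not something set by this project.
fred = fd.Fred(api_key=os.environ.get('FRED_API_KEY'))
```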
def gen_df(category, series):
    gen_ser = fred.get_series(series, frequency='q')
    return pd.DataFrame({'Date': gen_ser.index, category + ' : Billions of dollars': gen_ser.values})

def merge_dataframes(dataframes, on_column):
    merged_df = dataframes[0]
    for df in dataframes[1:]:
        merged_df = pd.merge(merged_df, df, on=on_column)
    return merged_df
dataframes_list = [
gen_df('GDP', 'GDP'),
gen_df('PCE', 'PCE'),
gen_df('GPDI', 'GPDI'),
gen_df('NETEXP', 'NETEXP'),
gen_df('GovTotExp', 'W068RCQ027SBEA')
]
data = merge_dataframes(dataframes_list,'Date')
data
| | Date | GDP : Billions of dollars | PCE : Billions of dollars | GPDI : Billions of dollars | NETEXP : Billions of dollars | GovTotExp : Billions of dollars |
|---|---|---|---|---|---|---|
| 0 | 1960-01-01 | 542.648 | 326.4 | 96.476 | 2.858 | 144.233 |
| 1 | 1960-04-01 | 541.080 | 332.2 | 87.096 | 3.395 | 147.417 |
| 2 | 1960-07-01 | 545.604 | 332.1 | 86.377 | 4.682 | 150.459 |
| 3 | 1960-10-01 | 540.197 | 334.0 | 75.963 | 5.880 | 153.780 |
| 4 | 1961-01-01 | 545.018 | 334.5 | 78.378 | 5.902 | 157.254 |
| ... | ... | ... | ... | ... | ... | ... |
| 252 | 2023-01-01 | 26813.601 | 18269.6 | 4725.828 | -825.687 | 9326.383 |
| 253 | 2023-04-01 | 27063.012 | 18419.0 | 4780.290 | -806.093 | 9422.404 |
| 254 | 2023-07-01 | 27610.128 | 18679.5 | 4915.033 | -779.231 | 10007.677 |
| 255 | 2023-10-01 | 27956.998 | 18914.5 | 4954.426 | -783.734 | 9700.808 |
| 256 | 2024-01-01 | 28255.928 | 19164.2 | 5004.419 | -850.094 | 9924.229 |

257 rows × 6 columns
Date_col = data['Date'].tail(26) # to be used later on
print(Date_col)
231 2017-10-01
232 2018-01-01
233 2018-04-01
234 2018-07-01
235 2018-10-01
236 2019-01-01
237 2019-04-01
238 2019-07-01
239 2019-10-01
240 2020-01-01
241 2020-04-01
242 2020-07-01
243 2020-10-01
244 2021-01-01
245 2021-04-01
246 2021-07-01
247 2021-10-01
248 2022-01-01
249 2022-04-01
250 2022-07-01
251 2022-10-01
252 2023-01-01
253 2023-04-01
254 2023-07-01
255 2023-10-01
256 2024-01-01
Name: Date, dtype: datetime64[ns]
EDA#
Below is a simple line plot comparing each of the time series. We notice that some of the features resemble GDP far more than others; however, we cannot begin selecting and removing features prior to conducting more exploratory tests.
date_column = 'Date'
feature_columns = data.columns.difference([date_column])
fig, ax = plt.subplots(figsize=(10, 6))
fig.suptitle('Features vs Time', y=1.02)
for i, feature in enumerate(feature_columns):
    ax.plot(data[date_column], data[feature], label=feature, color=plt.cm.viridis(i / len(feature_columns)))
ax.set_xlabel('Date')
ax.set_ylabel('Billions of Dollars')
ax.legend(loc='upper left', bbox_to_anchor=(1, 1))
# Show the plot
plt.show()
Here we run a test to evaluate the stationarity of our time series. Stationarity essentially means that a time series has a constant mean and variance. This is important to test for because many popular forecasting methods require stationarity to function properly. Although we could judge the stationarity of most of these series just by looking at the graph, running the test is still beneficial because we will likely reuse it in later parts of the forecast.
from statsmodels.tsa.stattools import adfuller
for column in data.columns:
    if column != 'Date':
        result = adfuller(data[column])
        print(f"ADF Statistic for {column}: {result[0]}")
        print(f"P-value for {column}: {result[1]}")
        print("Critical Values:")
        for key, value in result[4].items():
            print(f" {key}: {value}")
        print("\n" + "=" * 40 + "\n")
ADF Statistic for GDP : Billions of dollars: 7.301049607594838
P-value for GDP : Billions of dollars: 1.0
Critical Values:
1%: -3.4562572510874396
5%: -2.8729420379793598
10%: -2.5728461399461744
========================================
ADF Statistic for PCE : Billions of dollars: 6.798336970342141
P-value for PCE : Billions of dollars: 1.0
Critical Values:
1%: -3.4562572510874396
5%: -2.8729420379793598
10%: -2.5728461399461744
========================================
ADF Statistic for GPDI : Billions of dollars: 2.765116699661126
P-value for GPDI : Billions of dollars: 1.0
Critical Values:
1%: -3.4561550092339512
5%: -2.8728972266578676
10%: -2.5728222369384763
========================================
ADF Statistic for NETEXP : Billions of dollars: -0.009808856906310032
P-value for NETEXP : Billions of dollars: 0.9577133440470131
Critical Values:
1%: -3.4565688966099373
5%: -2.8730786194395455
10%: -2.5729189953388762
========================================
ADF Statistic for GovTotExp : Billions of dollars: 2.7291557866089486
P-value for GovTotExp : Billions of dollars: 0.9990880001645039
Critical Values:
1%: -3.4577787098622674
5%: -2.873608704758507
10%: -2.573201765981991
========================================
The numbers we are interested in from this test are the P-values. A P-value close to zero (equal to or less than 0.05) implies stationarity, while a value closer to 1 implies nonstationarity. We can see that all of our time series are highly nonstationary: their P-values are statistically insignificant, meaning we are unable to reject the ADF test's null hypothesis of a unit root (nonstationarity). Below is a simple visual representation of the test. The red dotted line marks the P-value at which we could conclude stationarity, and the blue bar shows where each series currently sits.
adf_results = []
for column in data.columns:
    if column != 'Date':
        result = adfuller(data[column])
        adf_results.append({
            'Feature': column,
            'ADF Statistic': result[0],
            'P-value': result[1],
            'Critical Values': result[4]
        })
adf_df = pd.DataFrame(adf_results)
fig, axes = plt.subplots(nrows=len(adf_df), figsize=(12, 6 * len(adf_df)))
for idx, row in adf_df.iterrows():
    ax = axes[idx] if len(adf_df) > 1 else axes
    ax.bar(row['Feature'] + '_P-value', row['P-value'], color='skyblue')
    ax.text(row['Feature'] + '_P-value', row['P-value'] + 0.005, f"{row['P-value']:.4f}", ha='center')
    ax.set_title(f'ADF Test Results for {row["Feature"]}')
    ax.set_ylabel('Value')
    ax.axhline(y=0.05, color='red', linestyle='--', label='Significance Level (0.05)')
    ax.legend()
plt.tight_layout()
plt.show()
Below is a VIF (variance inflation factor) test. Its purpose is to check for multicollinearity, i.e. the degree of correlation the predictors share with one another. High multicollinearity is not necessarily detrimental to our forecast; however, it can make it much harder to determine the individual effect of each feature time series on the prediction, hurting the interpretability of the model. A score below 5 usually indicates little multicollinearity; most of our scores are far above that threshold, implying that our features are highly correlated with one another. This will play into our model selection later.
from statsmodels.stats.outliers_influence import variance_inflation_factor
X = data.iloc[:, [2, 3, 4, 5]].copy()  # .copy() so later column assignments do not trigger a SettingWithCopyWarning
X['intercept'] = 1
X = X[['intercept'] + [col for col in X if col != 'intercept']]
vif_data = pd.DataFrame()
vif_data["Feature"] = X.columns[1:]
vif_data["VIF"] = [variance_inflation_factor(X.values, i) for i in range(1, X.shape[1])]
print(vif_data)
plt.figure(figsize=(10, 6))
plt.bar(vif_data["Feature"], vif_data["VIF"], color='skyblue')
plt.title('VIF Scores for Features')
plt.xlabel('Features')
plt.ylabel('VIF Score')
plt.show()
Feature VIF
0 PCE : Billions of dollars 170.028884
1 GPDI : Billions of dollars 77.277626
2 NETEXP : Billions of dollars 7.335556
3 GovTotExp : Billions of dollars 52.196119
Modeling#
VAR forecasting#
To use the Vector Auto Regression (VAR) model we need to first ensure our data is stationary. We can achieve this using a technique called differencing, which is essentially a transformation of the raw data. It is defined mathematically as y'(t) = y(t) - y(t-1): we take values from our time series and calculate the change that occurred relative to the previous point.
data_diff = data.drop('Date', axis=1).diff().dropna()
for column in data_diff.columns:
    result = adfuller(data_diff[column])
    print(f"ADF Statistic for {column}: {result[0]}")
    print(f"P-value for {column}: {result[1]}")
    print("Critical Values:")
    for key, value in result[4].items():
        print(f" {key}: {value}")
    print("\n" + "=" * 40 + "\n")
ADF Statistic for GDP : Billions of dollars: -3.906538294382316
P-value for GDP : Billions of dollars: 0.0019872523174590214
Critical Values:
1%: -3.4566744514553016
5%: -2.8731248767783426
10%: -2.5729436702592023
========================================
ADF Statistic for PCE : Billions of dollars: -6.0438368049790725
P-value for PCE : Billions of dollars: 1.3263283721705999e-07
Critical Values:
1%: -3.4564641849494113
5%: -2.873032730098417
10%: -2.572894516864816
========================================
ADF Statistic for GPDI : Billions of dollars: -14.67874111897741
P-value for GPDI : Billions of dollars: 3.180913951450319e-27
Critical Values:
1%: -3.4562572510874396
5%: -2.8729420379793598
10%: -2.5728461399461744
========================================
ADF Statistic for NETEXP : Billions of dollars: -8.837398948334888
P-value for NETEXP : Billions of dollars: 1.7044265762844e-14
Critical Values:
1%: -3.4565688966099373
5%: -2.8730786194395455
10%: -2.5729189953388762
========================================
ADF Statistic for GovTotExp : Billions of dollars: -3.110199822877988
P-value for GovTotExp : Billions of dollars: 0.0258075066297484
Critical Values:
1%: -3.4577787098622674
5%: -2.873608704758507
10%: -2.573201765981991
========================================
fig, ax = plt.subplots(figsize=(10, 6))
fig.suptitle('Features vs Time', y=1.02)
for i, feature in enumerate(data_diff.columns):
    ax.plot(data['Date'][1:], data_diff[feature], label=feature, color=plt.cm.viridis(i / len(data_diff.columns)))
ax.set_xlabel('Date')
ax.set_ylabel('Difference')
ax.legend(loc='upper left', bbox_to_anchor=(1, 1))
# Show the plot
plt.show()
Below is an Autocorrelation Function (ACF) plot, which tells us how strongly each series is correlated with its own previous lags. This is usually helpful for evaluating properties like seasonality, which matter for certain models. We can see that, for the most part, previous lags have reasonably small autocorrelations for our time series.
from statsmodels.graphics.tsaplots import plot_acf
for column in data_diff.columns:
    if column != 'Date':
        plot_acf(data_diff[column], lags=12)
        plt.title(f'Autocorrelation Function (ACF) - {column}')
        plt.xlabel('Lag')
        plt.ylabel('Autocorrelation')
        plt.show()
From the ADF test along with the plot, we can determine that our data is now centered around the mean. Our next step is to perform a Granger causality test, which indicates whether a lagged version of a specific time series can help us predict our target time series (though not necessarily that one series causes the other).
from statsmodels.tsa.stattools import grangercausalitytests
columns = ['PCE : Billions of dollars', 'GPDI : Billions of dollars', 'NETEXP : Billions of dollars', 'GovTotExp : Billions of dollars']
lags = [6, 9, 1, 1] #determined from individually testing each combination
for column, lag in zip(columns, lags):
    df_new = data_diff[['GDP : Billions of dollars', column]]
    print(f'For: {column}')
    gc_res = grangercausalitytests(df_new, lag)
    print("\n" + "=" * 40 + "\n")
For: PCE : Billions of dollars
Granger Causality
number of lags (no zero) 1
ssr based F test: F=0.8821 , p=0.3485 , df_denom=252, df_num=1
ssr based chi2 test: chi2=0.8926 , p=0.3448 , df=1
likelihood ratio test: chi2=0.8910 , p=0.3452 , df=1
parameter F test: F=0.8821 , p=0.3485 , df_denom=252, df_num=1
Granger Causality
number of lags (no zero) 2
ssr based F test: F=0.7486 , p=0.4741 , df_denom=249, df_num=2
ssr based chi2 test: chi2=1.5273 , p=0.4660 , df=2
likelihood ratio test: chi2=1.5228 , p=0.4670 , df=2
parameter F test: F=0.7486 , p=0.4741 , df_denom=249, df_num=2
Granger Causality
number of lags (no zero) 3
ssr based F test: F=0.9103 , p=0.4366 , df_denom=246, df_num=3
ssr based chi2 test: chi2=2.8087 , p=0.4221 , df=3
likelihood ratio test: chi2=2.7933 , p=0.4246 , df=3
parameter F test: F=0.9103 , p=0.4366 , df_denom=246, df_num=3
Granger Causality
number of lags (no zero) 4
ssr based F test: F=0.8554 , p=0.4914 , df_denom=243, df_num=4
ssr based chi2 test: chi2=3.5483 , p=0.4706 , df=4
likelihood ratio test: chi2=3.5235 , p=0.4743 , df=4
parameter F test: F=0.8554 , p=0.4914 , df_denom=243, df_num=4
Granger Causality
number of lags (no zero) 5
ssr based F test: F=1.8087 , p=0.1118 , df_denom=240, df_num=5
ssr based chi2 test: chi2=9.4578 , p=0.0921 , df=5
likelihood ratio test: chi2=9.2840 , p=0.0983 , df=5
parameter F test: F=1.8087 , p=0.1118 , df_denom=240, df_num=5
Granger Causality
number of lags (no zero) 6
ssr based F test: F=2.0385 , p=0.0614 , df_denom=237, df_num=6
ssr based chi2 test: chi2=12.9021 , p=0.0446 , df=6
likelihood ratio test: chi2=12.5802 , p=0.0502 , df=6
parameter F test: F=2.0385 , p=0.0614 , df_denom=237, df_num=6
========================================
For: GPDI : Billions of dollars
Granger Causality
number of lags (no zero) 1
ssr based F test: F=0.3612 , p=0.5484 , df_denom=252, df_num=1
ssr based chi2 test: chi2=0.3655 , p=0.5454 , df=1
likelihood ratio test: chi2=0.3653 , p=0.5456 , df=1
parameter F test: F=0.3612 , p=0.5484 , df_denom=252, df_num=1
Granger Causality
number of lags (no zero) 2
ssr based F test: F=0.0208 , p=0.9794 , df_denom=249, df_num=2
ssr based chi2 test: chi2=0.0425 , p=0.9790 , df=2
likelihood ratio test: chi2=0.0425 , p=0.9790 , df=2
parameter F test: F=0.0208 , p=0.9794 , df_denom=249, df_num=2
Granger Causality
number of lags (no zero) 3
ssr based F test: F=1.3069 , p=0.2727 , df_denom=246, df_num=3
ssr based chi2 test: chi2=4.0323 , p=0.2580 , df=3
likelihood ratio test: chi2=4.0005 , p=0.2614 , df=3
parameter F test: F=1.3069 , p=0.2727 , df_denom=246, df_num=3
Granger Causality
number of lags (no zero) 4
ssr based F test: F=1.0861 , p=0.3640 , df_denom=243, df_num=4
ssr based chi2 test: chi2=4.5051 , p=0.3419 , df=4
likelihood ratio test: chi2=4.4653 , p=0.3467 , df=4
parameter F test: F=1.0861 , p=0.3640 , df_denom=243, df_num=4
Granger Causality
number of lags (no zero) 5
ssr based F test: F=1.0364 , p=0.3968 , df_denom=240, df_num=5
ssr based chi2 test: chi2=5.4197 , p=0.3668 , df=5
likelihood ratio test: chi2=5.3620 , p=0.3733 , df=5
parameter F test: F=1.0364 , p=0.3968 , df_denom=240, df_num=5
Granger Causality
number of lags (no zero) 6
ssr based F test: F=1.0198 , p=0.4130 , df_denom=237, df_num=6
ssr based chi2 test: chi2=6.4546 , p=0.3742 , df=6
likelihood ratio test: chi2=6.3727 , p=0.3828 , df=6
parameter F test: F=1.0198 , p=0.4130 , df_denom=237, df_num=6
Granger Causality
number of lags (no zero) 7
ssr based F test: F=1.6969 , p=0.1105 , df_denom=234, df_num=7
ssr based chi2 test: chi2=12.6395 , p=0.0814 , df=7
likelihood ratio test: chi2=12.3291 , p=0.0902 , df=7
parameter F test: F=1.6969 , p=0.1105 , df_denom=234, df_num=7
Granger Causality
number of lags (no zero) 8
ssr based F test: F=1.8874 , p=0.0628 , df_denom=231, df_num=8
ssr based chi2 test: chi2=16.2107 , p=0.0395 , df=8
likelihood ratio test: chi2=15.7029 , p=0.0468 , df=8
parameter F test: F=1.8874 , p=0.0628 , df_denom=231, df_num=8
Granger Causality
number of lags (no zero) 9
ssr based F test: F=2.3202 , p=0.0162 , df_denom=228, df_num=9
ssr based chi2 test: chi2=22.6221 , p=0.0071 , df=9
likelihood ratio test: chi2=21.6454 , p=0.0101 , df=9
parameter F test: F=2.3202 , p=0.0162 , df_denom=228, df_num=9
========================================
For: NETEXP : Billions of dollars
Granger Causality
number of lags (no zero) 1
ssr based F test: F=9.1682 , p=0.0027 , df_denom=252, df_num=1
ssr based chi2 test: chi2=9.2773 , p=0.0023 , df=1
likelihood ratio test: chi2=9.1126 , p=0.0025 , df=1
parameter F test: F=9.1682 , p=0.0027 , df_denom=252, df_num=1
========================================
For: GovTotExp : Billions of dollars
Granger Causality
number of lags (no zero) 1
ssr based F test: F=57.2092 , p=0.0000 , df_denom=252, df_num=1
ssr based chi2 test: chi2=57.8903 , p=0.0000 , df=1
likelihood ratio test: chi2=52.1702 , p=0.0000 , df=1
parameter F test: F=57.2092 , p=0.0000 , df_denom=252, df_num=1
========================================
Here we are just looking for a single lag for each feature which has a statistically significant p-value. Personal Consumption Expenditures (PCE) arguably did not make this cutoff; however, its sixth lag is so close that I decided to keep it in. Now that we have decided all of our features are significant, our next step is to build the VAR model.
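As an aside, the lag selection above was done by hand (as noted in the comment in the code cell above); a small helper along the following lines could automate the search by returning the first lag at which the ssr-based F test is significant at the 5% level. This is a hypothetical sketch, not part of the original analysis:

```python
# Hypothetical helper: find the first lag (up to max_lag) at which `feature`
# Granger-causes GDP according to the ssr-based F test.
def first_significant_lag(feature, max_lag=12, alpha=0.05):
    pair = data_diff[['GDP : Billions of dollars', feature]]
    res = grangercausalitytests(pair, max_lag, verbose=False)  # verbose=False suppresses the per-lag printout
    for lag in range(1, max_lag + 1):
        p_value = res[lag][0]['ssr_ftest'][1]  # p-value of the ssr-based F test at this lag
        if p_value <= alpha:
            return lag
    return None  # no significant lag found within max_lag
```

Note that under this strict rule PCE's sixth lag (p = 0.0614 on the F test) would not qualify, consistent with the observation above that it only narrowly misses the cutoff.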
from statsmodels.tsa.api import VAR
from sklearn.metrics import mean_absolute_error
columns = ['GDP : Billions of dollars', 'PCE : Billions of dollars', 'GPDI : Billions of dollars', 'NETEXP : Billions of dollars', 'GovTotExp : Billions of dollars']
data_diff = data_diff[columns]
split_index = int(len(data_diff) * 0.90)
train_data = data_diff.iloc[:split_index]
test_data = data_diff.iloc[split_index:]
model = VAR(train_data)
results = model.fit(maxlags=14)
forecast_steps = len(test_data)
forecast = results.forecast(train_data.values[-results.k_ar:], steps=forecast_steps)
forecast_df = pd.DataFrame(forecast, index=test_data.index, columns=columns)
gdp_forecast = forecast_df['GDP : Billions of dollars']
gdp_actual = test_data['GDP : Billions of dollars']
dates = data_diff.index[split_index:]
comparison_df = pd.DataFrame({'Date': Date_col, 'Actual GDP': gdp_actual.values, 'Forecasted GDP': gdp_forecast.values})
comparison_df.dropna(inplace=True)
mae = mean_absolute_error(comparison_df['Actual GDP'], comparison_df['Forecasted GDP'])
print(f"MAE: {mae:.2f}")
def mean_absolute_percentage_error(y_true, y_pred):
    return np.mean(np.abs((y_true - y_pred) / y_true)) * 100
mape = mean_absolute_percentage_error(comparison_df['Actual GDP'], comparison_df['Forecasted GDP'])
print(f"MAPE: {mape:.2f}%")
print("GDP Actual vs Forecast:")
print(comparison_df)
plt.figure(figsize=(12, 6))
plt.plot(comparison_df['Date'], comparison_df['Actual GDP'], label='Actual GDP', color='blue')
plt.plot(comparison_df['Date'], comparison_df['Forecasted GDP'], label='Forecasted GDP', color='red', linestyle='--')
plt.xlabel('Date')
plt.ylabel('GDP : Billions of dollars')
plt.title('Actual vs Forecasted GDP using VAR')
plt.legend()
plt.show()
C:\Users\Dronm\anaconda3\Lib\site-packages\statsmodels\tsa\base\tsa_model.py:473: ValueWarning: An unsupported index was provided and will be ignored when e.g. forecasting.
self._init_dates(dates, freq)
MAE: 328.39
MAPE: 57.72%
GDP Actual vs Forecast:
Date Actual GDP Forecasted GDP
231 2017-10-01 344.493 84.479946
232 2018-01-01 291.465 179.477634
233 2018-04-01 252.359 155.624087
234 2018-07-01 217.818 105.220413
235 2018-10-01 119.137 153.221441
236 2019-01-01 186.266 240.780124
237 2019-04-01 280.642 216.362075
238 2019-07-01 309.507 151.012279
239 2019-10-01 208.108 286.572411
240 2020-01-01 -195.877 232.309027
241 2020-04-01 -1793.370 226.008422
242 2020-07-01 1734.497 242.669525
243 2020-10-01 376.862 255.241138
244 2021-01-01 575.683 203.352525
245 2021-04-01 692.177 247.383534
246 2021-07-01 536.611 211.806723
247 2021-10-01 825.630 188.329785
248 2022-01-01 374.513 193.401142
249 2022-04-01 515.157 220.097723
250 2022-07-01 450.366 193.285245
251 2022-10-01 413.766 200.739952
252 2023-01-01 405.196 220.162682
253 2023-04-01 249.411 174.751380
254 2023-07-01 547.116 229.388535
255 2023-10-01 346.870 241.182963
256 2024-01-01 298.930 201.618804
Looking at this forecast, we can clearly see that despite missing the mark quite heavily on both evaluation metrics used (MAE and MAPE), our model visually was not too inaccurate barring the outliers caused by the pandemic. We managed to stay close to the actual values for the most part from 2018-2019 and from 2022-2024; however, the global events in between obviously introduced unpredictability that affected the model's ability to judge the trends precisely.
VECM forecasting#
VECM (Vector Error Correction Model) is similar to VAR, albeit with a few key differences. Unlike VAR, VECM does not rely on stationarity, so differencing and normalizing the time series will not be necessary. Below is a Johansen cointegration test, which compares test statistics to critical values in order to determine the number of relevant cointegrating relationships (variables which share a long-term equilibrium relationship).
from statsmodels.tsa.vector_ar.vecm import VECM, select_order, select_coint_rank
columns = ['GDP : Billions of dollars', 'PCE : Billions of dollars', 'GPDI : Billions of dollars', 'NETEXP : Billions of dollars', 'GovTotExp : Billions of dollars']
data = data[columns]
rank_test = select_coint_rank(data, det_order=0, k_ar_diff=1)
print(rank_test.summary())
Johansen cointegration test using trace test statistic with 5% significance level
=====================================
r_0 r_1 test statistic critical value
-------------------------------------
0 5 243.6 69.82
1 5 94.27 47.85
2 5 47.34 29.80
3 5 15.06 15.49
-------------------------------------
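Reading the table: the selected cointegration rank is the smallest r_0 whose trace statistic falls below its critical value, which here happens at r_0 = 3 (15.06 < 15.49). The same value is available programmatically, which is what the model-fitting cell below relies on:

```python
# select_coint_rank stores the stopping point of the trace test in .rank,
# so the chosen rank can be printed rather than read off the table by hand.
print(f"Selected cointegration rank: {rank_test.rank}")
```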
columns = ['GDP : Billions of dollars', 'PCE : Billions of dollars', 'GPDI : Billions of dollars', 'NETEXP : Billions of dollars', 'GovTotExp : Billions of dollars']
data = data[columns]
split_index1 = int(len(data) * 0.90)
train_data1 = data.iloc[:split_index1]
test_data1 = data.iloc[split_index1:]
rank_test = select_coint_rank(data, det_order=0, k_ar_diff=1)
lag_order = select_order(data, maxlags=15, deterministic="ci")
model = VECM(train_data1, k_ar_diff=lag_order.aic, coint_rank=rank_test.rank, deterministic='ci')
results = model.fit()
forecast_steps = len(test_data1)
forecast = results.predict(steps=forecast_steps)
forecast_df = pd.DataFrame(forecast, index=test_data1.index, columns=columns)
gdp_forecast = forecast_df['GDP : Billions of dollars']
gdp_actual = test_data1['GDP : Billions of dollars']
dates = dates[:len(gdp_actual)]
comparison_df = pd.DataFrame({'Date': Date_col, 'Actual GDP': gdp_actual.values, 'Forecasted GDP': gdp_forecast.values})
comparison_df.dropna(inplace=True)
mae = mean_absolute_error(comparison_df['Actual GDP'], comparison_df['Forecasted GDP'])
print(f"MAE: {mae:.2f}")
print("GDP Actual vs Forecast:")
print(comparison_df)
plt.figure(figsize=(12, 6))
plt.plot(comparison_df['Date'], comparison_df['Actual GDP'], label='Actual GDP', color='blue')
plt.plot(comparison_df['Date'], comparison_df['Forecasted GDP'], label='Forecasted GDP', color='red', linestyle='--')
plt.xlabel('Date')
plt.ylabel('GDP : Billions of dollars')
plt.title('Actual vs Forecasted GDP using VECM')
plt.legend()
plt.show()
MAE: 1482.53
GDP Actual vs Forecast:
Date Actual GDP Forecasted GDP
231 2017-10-01 20037.088 19753.822655
232 2018-01-01 20328.553 19886.303817
233 2018-04-01 20580.912 20019.396912
234 2018-07-01 20798.730 20091.444369
235 2018-10-01 20917.867 20201.747583
236 2019-01-01 21104.133 20397.007183
237 2019-04-01 21384.775 20590.328595
238 2019-07-01 21694.282 20715.361432
239 2019-10-01 21902.390 20933.376095
240 2020-01-01 21706.513 21102.670722
241 2020-04-01 19913.143 21333.003367
242 2020-07-01 21647.640 21572.994781
243 2020-10-01 22024.502 21812.328621
244 2021-01-01 22600.185 22030.977990
245 2021-04-01 23292.362 22299.984264
246 2021-07-01 23828.973 22534.530745
247 2021-10-01 24654.603 22750.677701
248 2022-01-01 25029.116 22985.403756
249 2022-04-01 25544.273 23271.176604
250 2022-07-01 25994.639 23483.148023
251 2022-10-01 26408.405 23719.922245
252 2023-01-01 26813.601 23977.040794
253 2023-04-01 27063.012 24160.506343
254 2023-07-01 27610.128 24361.691614
255 2023-10-01 27956.998 24615.083068
256 2024-01-01 28255.928 24786.868614
Though it is hard to compare this model to our VAR model given that we are now using the nonstationary data, we can still deduce, both from the error metric and from the visualization, that this model was not able to accurately capture the trends in this forecast. With this, it is fair to say that we can rule out traditional statistical methods for approaching this problem.
Machine Learning forecasting#
When deciding on a machine learning approach to model this problem, I wanted to keep in mind the amount of data I am working with. Prior to creating lagged columns, our dataset has a total of 1275 observations across all time series. This means that more complex approaches, such as LSTMs or gradient boosting, are perhaps unnecessary, as we can use simpler models to achieve comparable accuracy and far more interpretability.
def printevals(y_test, y_pred):
    mape = mean_absolute_percentage_error(y_test, y_pred)
    print(f"MAPE: {mape:.2f}%")
    mae = mean_absolute_error(y_test, y_pred)
    print(f"MAE: {mae:.2f}")

def plotresults(title):
    # Note: relies on the global y_test and y_pred from the most recently fitted model
    comparison_df = pd.DataFrame({'Date': Date_col, 'Actual GDP': y_test.values, 'Forecasted GDP': y_pred})
    comparison_df.dropna(inplace=True)
    print("GDP Actual vs Forecast:")
    print(comparison_df)
    plt.figure(figsize=(12, 6))
    plt.plot(comparison_df['Date'], comparison_df['Actual GDP'], label='Actual GDP', color='blue')
    plt.plot(comparison_df['Date'], comparison_df['Forecasted GDP'], label='Forecasted GDP', color='red', linestyle='--')
    plt.xlabel('Date')
    plt.ylabel('GDP : Billions of dollars')
    plt.title(title)
    plt.legend()
    plt.show()
Train-Test Split#
Train-test splits for time series problems differ slightly from splits in traditional regression or classification tasks. Here it is imperative that we do not shuffle our data, since that would mean training the model on data from the future, which in turn causes data leakage. Also notice that we are training over a very large portion (90%) of the data. This is because we are not actually concerned with forecasting over a large time frame; realistically, even forecasting over several years is not probable for this task. (Note that we also used this train-test split in our VAR and VECM models; it just felt more appropriate to discuss it in the machine learning section.)
split_index = int(len(data_diff) * 0.90)
train_data = data_diff.iloc[:split_index]
test_data = data_diff.iloc[split_index:]
X_train = train_data.drop('GDP : Billions of dollars', axis=1)
y_train = train_data['GDP : Billions of dollars']
X_test = test_data.drop('GDP : Billions of dollars', axis=1)
y_test = test_data['GDP : Billions of dollars']
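If we ever wanted to tune hyperparameters on this data, ordinary cross-validation would shuffle future quarters into the training folds; scikit-learn's TimeSeriesSplit avoids this by using expanding, temporally ordered folds. A minimal sketch (not used for the results below):

```python
from sklearn.model_selection import TimeSeriesSplit

# Each fold trains on an expanding window of past quarters and validates on the
# quarters immediately after it, so no future data leaks into training.
tscv = TimeSeriesSplit(n_splits=5)
for fold, (train_idx, val_idx) in enumerate(tscv.split(X_train)):
    print(f"fold {fold}: train rows {train_idx[0]}-{train_idx[-1]}, "
          f"validate rows {val_idx[0]}-{val_idx[-1]}")
```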
Random Forests#
Remembering our VIF test from earlier, we know our features are highly multicollinear. This partially plays into the decision to choose random forests as our model: decision trees make binary splits on features, so in theory our features being highly correlated should not be detrimental to the model. In addition, a random forest is generally a very strong model, robust to overfitting thanks to the stochastic way its trees are built, and very simple to implement.
from sklearn.ensemble import RandomForestRegressor
rf_model = RandomForestRegressor(n_estimators=100, random_state=42)
rf_model.fit(X_train, y_train)
y_pred = rf_model.predict(X_test)
printevals(y_test,y_pred)
plotresults('Actual vs Forecasted GDP using Random Forest')
MAPE: 44.31%
MAE: 287.97
GDP Actual vs Forecast:
Date Actual GDP Forecasted GDP
231 2017-10-01 344.493 222.66964
232 2018-01-01 291.465 232.99618
233 2018-04-01 252.359 192.05078
234 2018-07-01 217.818 220.23716
235 2018-10-01 119.137 118.40956
236 2019-01-01 186.266 115.82722
237 2019-04-01 280.642 198.88359
238 2019-07-01 309.507 228.12818
239 2019-10-01 208.108 156.30033
240 2020-01-01 -195.877 -78.69897
241 2020-04-01 -1793.370 -161.98572
242 2020-07-01 1734.497 233.81564
243 2020-10-01 376.862 233.81564
244 2021-01-01 575.683 151.72623
245 2021-04-01 692.177 129.98664
246 2021-07-01 536.611 233.97176
247 2021-10-01 825.630 235.97884
248 2022-01-01 374.513 233.81564
249 2022-04-01 515.157 163.32067
250 2022-07-01 450.366 167.84568
251 2022-10-01 413.766 275.92276
252 2023-01-01 405.196 157.41253
253 2023-04-01 249.411 240.37866
254 2023-07-01 547.116 296.78346
255 2023-10-01 346.870 198.98305
256 2024-01-01 298.930 179.61721
We can see that random forests produced our best forecast yet, attaining better error metrics than our attempts at VAR and VECM. Perhaps most impressively, we can see visually that the model almost perfectly captured the data from 2017-2019, just prior to encountering the outliers.
KNN#
KNN (K-Nearest Neighbors) was one final approach I wanted to attempt. Given that our data is fairly low dimensional (5 features) with over 1000 observations, we have a ratio which should allow this algorithm to perform very well. Notice below that we test out various values of the hyperparameter k, with k=2 giving the lowest MAPE.
from sklearn.neighbors import KNeighborsRegressor
for i in range(1, 10):
    knn_model = KNeighborsRegressor(n_neighbors=i)
    knn_model.fit(X_train, y_train)
    y_pred = knn_model.predict(X_test)
    print(f'for k = {i} ')
    printevals(y_test, y_pred)
    print("\n" + "=" * 40 + "\n")
for k = 1
MAPE: 48.93%
MAE: 302.74
========================================
for k = 2
MAPE: 47.72%
MAE: 303.45
========================================
for k = 3
MAPE: 48.31%
MAE: 306.37
========================================
for k = 4
MAPE: 48.41%
MAE: 306.15
========================================
for k = 5
MAPE: 49.35%
MAE: 310.94
========================================
for k = 6
MAPE: 50.54%
MAE: 315.49
========================================
for k = 7
MAPE: 50.36%
MAE: 315.71
========================================
for k = 8
MAPE: 50.62%
MAE: 317.13
========================================
for k = 9
MAPE: 50.04%
MAE: 315.95
========================================
knn_model = KNeighborsRegressor(n_neighbors=2)
knn_model.fit(X_train, y_train)
y_pred = knn_model.predict(X_test)
printevals(y_test,y_pred)
plotresults('Actual vs Forecasted GDP using KNN')
MAPE: 47.72%
MAE: 303.45
GDP Actual vs Forecast:
Date Actual GDP Forecasted GDP
231 2017-10-01 344.493 237.9050
232 2018-01-01 291.465 252.8530
233 2018-04-01 252.359 217.4445
234 2018-07-01 217.818 206.1615
235 2018-10-01 119.137 115.1375
236 2019-01-01 186.266 110.2265
237 2019-04-01 280.642 203.0175
238 2019-07-01 309.507 264.4540
239 2019-10-01 208.108 156.4880
240 2020-01-01 -195.877 1.0515
241 2020-04-01 -1793.370 -65.8135
242 2020-07-01 1734.497 227.1580
243 2020-10-01 376.862 129.9220
244 2021-01-01 575.683 167.6270
245 2021-04-01 692.177 124.7765
246 2021-07-01 536.611 227.1580
247 2021-10-01 825.630 227.1580
248 2022-01-01 374.513 193.9800
249 2022-04-01 515.157 167.6270
250 2022-07-01 450.366 185.8210
251 2022-10-01 413.766 167.6270
252 2023-01-01 405.196 183.3980
253 2023-04-01 249.411 264.4540
254 2023-07-01 547.116 167.6270
255 2023-10-01 346.870 221.4005
256 2024-01-01 298.930 198.0385
We can see that KNN performed very well in its own right. Despite being slightly outperformed by random forests in terms of error metrics, visually the model performed about the same and arguably captured the period before the pandemic (2018-2019) even better than random forests.
Conclusions#
Looking at all of our models, we can see that the one which performed best was random forests. This is most likely due to random forests generally being a very strong predictive model that can be fit to a variety of datasets. In general, the machine learning algorithms far outperformed the traditional statistical methods. Perhaps this can be explained by the fact that VAR and VECM both require a great amount of historical background data to work optimally, something we did not have much of given that our data comes in quarterly intervals. There may also be something to be said about how both of the machine learning models used were nonparametric; these models are governed by fewer assumptions than their counterparts and therefore may be more flexible for unique problem sets like the one here. Below is our final, best prediction, with the differencing transformation we previously used to fit the models removed.
rf_model = RandomForestRegressor(n_estimators=100, random_state=42)
rf_model.fit(X_train, y_train)
y_pred = rf_model.predict(X_test)
comparison_df = pd.DataFrame({'Date': Date_col, 'Actual GDP': y_test.values, 'Forecasted GDP': y_pred})
comparison_df.dropna(inplace=True)
last_actual_value = data.iloc[split_index]['GDP : Billions of dollars']  # GDP level in the quarter just before the first test-set difference (data_diff is shifted one row relative to data, so no -1 here)
comparison_df['Actual GDP'] = comparison_df['Actual GDP'].cumsum() + last_actual_value
comparison_df['Forecasted GDP'] = comparison_df['Forecasted GDP'].cumsum() + last_actual_value
plt.figure(figsize=(12, 6))
plt.plot(comparison_df['Date'], comparison_df['Actual GDP'], label='Actual GDP', color='blue')
plt.plot(comparison_df['Date'], comparison_df['Forecasted GDP'], label='Forecasted GDP', color='red', linestyle='--')
plt.xlabel('Date')
plt.ylabel('GDP : Billions of dollars')
plt.title('Actual vs Forecasted GDP (Differencing Removed) using Random Forest')
plt.legend()
plt.show()
Challenges and Areas of Improvement#
By far the greatest challenge in this forecasting problem was handling the massive outlier caused by the pandemic, along with the instability that followed it. Our forecasting methods obviously cannot predict that such an event would occur, ultimately decreasing the accuracy of each approach. Had our goal been to forecast the previous decade, our models would most likely have had a much easier time finding and predicting trends. In terms of improvement and further research, a possible next step would be to perform some sort of normalization and outlier-smoothing technique on the 2020-2024 interval, and then evaluate the fully trained model on new quarterly data as it comes in. In addition, it may be beneficial to incorporate new features which have a heavy influence on GDP, such as quarterly inflation and personal asset evaluations.
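As a rough illustration of the outlier-smoothing idea, one could winsorize the differenced series, clipping each column to chosen quantiles before refitting the models. The sketch below is hypothetical and untested here, and the 1%/99% thresholds are illustrative rather than tuned:

```python
# Hypothetical smoothing step: clip extreme quarterly changes (such as the 2020
# swings) to each differenced column's 1st and 99th percentiles.
def winsorize_diff(df, lower=0.01, upper=0.99):
    clipped = df.copy()
    for col in clipped.columns:
        lo, hi = clipped[col].quantile([lower, upper])
        clipped[col] = clipped[col].clip(lower=lo, upper=hi)
    return clipped

data_diff_smoothed = winsorize_diff(data_diff)
```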
References#
For traditional statistical methods: https://link.springer.com/book/10.1007/978-1-4842-7150-6
For machine learning methods: https://www.statlearning.com/