Predicting various tickers from yfinance#

Author: Scott Sibley

Course Project, UC Irvine, Math 10, Fall 24

I would like to post my notebook on the course’s website. Yes

Introduction#

The purpose of this project is to analyze the behavior of some stocks and commodities, and to compare different prediction models that could help optimise a portfolio spread across several investments.

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import yfinance as yf
import seaborn as sns

Importing Data#

I will use the yfinance library to pull data directly, as it gives a convenient way to retrieve large quantities of market data.

tickers=['AAPL','SPY','GC=F','SI=F']
n_tickers=len(tickers)
Close=pd.DataFrame()
FullData=[]
for ticker in tickers:
    # Download each ticker's history once and reuse it for both containers
    data = yf.Ticker(ticker).history(start='2001-01-01', end='2025-06-01')
    Close[ticker] = data['Close']
    FullData.append(data)

Close.head()

Clean and Understand our Data#

Check our dataset for missing or NaN values#

print(Close.isna().sum())
for i in range(n_tickers):
    print(FullData[i].isna().sum())
Close.dropna(inplace=True)
AAPL     0
SPY      0
GC=F    18
SI=F    17
dtype: int64
Open            0
High            0
Low             0
Close           0
Volume          0
Dividends       0
Stock Splits    0
dtype: int64
Open             0
High             0
Low              0
Close            0
Volume           0
Dividends        0
Stock Splits     0
Capital Gains    0
dtype: int64
Open            0
High            0
Low             0
Close           0
Volume          0
Dividends       0
Stock Splits    0
dtype: int64
Open            0
High            0
Low             0
Close           0
Volume          0
Dividends       0
Stock Splits    0
dtype: int64

As we can see, there are very few NaN values, meaning we are safe to proceed by just dropping them.

Find log returns#

Log returns are additive over time: the log return over several days is the sum of the daily log returns. This may be useful later.

log_returns = np.log(Close/Close.shift(1))
log_returns = log_returns.dropna()
sns.histplot(log_returns,bins=100,element='step',stat='percent',fill=False)
plt.title('Distributions of Log Returns for selected tickers')
plt.xlabel('Log Returns')
plt.ylabel('Relative Frequency')
plt.show()
[Figure: Distributions of Log Returns for selected tickers]
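The additivity of log returns can be checked on a tiny example. The sketch below uses made-up prices (not data from this project) and compares the sum of daily log returns with the log return over the whole period.

```python
import numpy as np
import pandas as pd

# Hypothetical closing prices, used only for illustration
prices = pd.Series([100.0, 102.0, 99.0, 105.0])

# Daily log returns: log(P_t / P_{t-1})
log_r = np.log(prices / prices.shift(1)).dropna()

# The sum of daily log returns equals the log return over the whole period
total_from_sum = log_r.sum()
total_direct = np.log(prices.iloc[-1] / prices.iloc[0])
print(np.isclose(total_from_sum, total_direct))  # True

# Simple returns do not add this way; they compound multiplicatively
simple_r = prices.pct_change().dropna()
compounded = (1 + simple_r).prod() - 1
print(np.isclose(compounded, prices.iloc[-1] / prices.iloc[0] - 1))  # True
```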

Finding Covariance and Correlations#

We will use the covariance and correlation of different stocks to hedge our risk

Corr=log_returns.corr()
Cov=log_returns.cov()
sns.heatmap(Cov,annot=True)
plt.title('Covariance Heatmap of selected tickers')
plt.figure()
sns.heatmap(Corr,annot=True)
plt.title('Correlation Heatmap of selected tickers')
plt.show()
[Figures: Covariance Heatmap of selected tickers; Correlation Heatmap of selected tickers]
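To make the link between covariance, correlation and hedging concrete, here is a minimal sketch on synthetic returns for two hypothetical assets `A` and `B` (not the tickers above). It shows that correlation is covariance rescaled by the standard deviations, and that a portfolio of two negatively related assets can have lower variance than either asset alone. The equal weights are an assumption chosen for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Synthetic daily returns for two hypothetical assets, for illustration only
a = rng.normal(0, 0.01, 1000)
b = -0.5 * a + rng.normal(0, 0.01, 1000)   # b tends to move against a
returns = pd.DataFrame({'A': a, 'B': b})

cov = returns.cov()
std = returns.std()

# Correlation is covariance divided by the product of the standard deviations
corr_from_cov = cov.values / np.outer(std.values, std.values)
print(np.allclose(corr_from_cov, returns.corr().values))  # True

# Variance of an equal-weight portfolio: w^T Sigma w
w = np.array([0.5, 0.5])
port_var = w @ cov.values @ w
print(port_var, cov.loc['A', 'A'], cov.loc['B', 'B'])
```

With these synthetic inputs the portfolio variance comes out below both individual variances, which is the effect a hedge relies on.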

Visualizations#

We now want to visualize our data and see its moving averages. I will use mplfinance, a library built on matplotlib, to show these plots.

import mplfinance as fplt
print("Stock or commodity prices with 7,30 and 365 day moving averages \n")
for i in range(n_tickers):
    fplt.plot(FullData[i],volume=True,type='line',mav=(7,30,365),title=f'{tickers[i]} Price',xlabel='Date',ylabel='Adjusted Price',ylabel_lower='Volume')
Stock or commodity prices with 7,30 and 365 day moving averages 
[Figures: price and volume charts with 7, 30 and 365 day moving averages for AAPL, SPY, GC=F and SI=F]
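For reference, a moving average over a window of n rows can be reproduced with pandas. This is a small sketch on made-up prices, assuming the `mav` option above draws a simple (unweighted) rolling mean. Note that an integer window counts rows, so on daily market data it counts trading days rather than calendar days.

```python
import pandas as pd

# Hypothetical prices, used only for illustration
prices = pd.Series([10.0, 11.0, 12.0, 13.0, 14.0])

# 3-row simple moving average; the first two rows have too little history
ma3 = prices.rolling(window=3).mean()
print(ma3.tolist())  # [nan, nan, 11.0, 12.0, 13.0]
```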

Linear Regression of a single ticker#

For these, change the ticker list to only 1 or 2 tickers, else many plots and regressions will be run

#Workaround to prevent too many plots and regressions from happening
#Forces us to use only the first two tickers no matter how many were selected earlier
n_tickers=2

Imports#

from sklearn.model_selection import train_test_split, KFold,cross_val_score
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.preprocessing import StandardScaler

Get Features/ Feature Engineering#

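We can use lagged prices and moving averages to predict the closing price. Note that rolling(...).mean() as written below includes the current day's close, so the moving-average features contain part of the value being predicted.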

Regression_features=[]
for i in range(len(tickers)):
    Lag1=Close[tickers[i]].shift(1).rename('Lag1')
    Lag5=Close[tickers[i]].shift(5).rename('Lag5')
    Lag30=Close[tickers[i]].shift(30).rename('Lag30')
    MA5=Close[tickers[i]].rolling(window=5).mean().rename('MA5')
    MA30=Close[tickers[i]].rolling(window=30).mean().rename('MA30')
    MA90=Close[tickers[i]].rolling(window=90).mean().rename('MA90')
    Target=Close[tickers[i]].rename('Close')
    Vol=FullData[i]['Volume'].rename('Vol')
    Regression_features.append(pd.concat([Lag1,Lag5,Lag30,MA5,MA30,MA90,Vol,Target],axis=1).dropna())
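One way to keep the moving-average features strictly in the past is to shift the rolling mean by one row. The sketch below uses a made-up series and a hypothetical 3-row window to show the difference between the same-day and the lagged average. It is an illustration, not the feature set used for the results in this notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical closing prices standing in for one column of Close
close = pd.Series(np.arange(1.0, 11.0), name='Close')

# rolling().mean() at row t includes close[t], the value being predicted
ma3_same_day = close.rolling(window=3).mean().rename('MA3_same_day')

# Shifting by one row restricts the average to closes up to row t-1
ma3_lagged = close.rolling(window=3).mean().shift(1).rename('MA3_lagged')

frame = pd.concat([close, ma3_same_day, ma3_lagged], axis=1).dropna()
print(frame.head(3))
```

At the row where `Close` is 4.0, the same-day average is 3.0 (mean of 2, 3, 4) while the lagged average is 2.0 (mean of 1, 2, 3).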

Train-test split and k-fold cross validation on Linear Model#

We use k-fold cross-validation to check how consistently the linear model fits across different subsets of the training data

features=['Lag1','Lag5','Lag30','MA5','MA30','MA90','Vol']
goal='Close'
for i in range(n_tickers):
    X=Regression_features[i][features]
    y=Regression_features[i][goal]
    X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.2,random_state=42)
    kf = KFold(n_splits=5, shuffle=True, random_state=42)
    # 5-fold cross-validated R^2 on the training portion
    scores=cross_val_score(LinearRegression(),X_train,y_train,cv=kf,scoring='r2')
    # Final fit on the full training portion, scored on the held-out 20%
    linear=LinearRegression()
    linear.fit(X_train,y_train)
    y_pred=linear.predict(X_test)
    mse=mean_squared_error(y_test,y_pred)
    r2test=r2_score(y_test,y_pred)
    print(f"For Linear Regression {tickers[i]} R^2 scores for each split: {scores} \n mean squared error: {mse} \n r2 score: {r2test}")
For Linear Regression AAPL R^2 scores for each split: [0.99955373 0.99945564 0.99950088 0.99961959 0.99947949] 
 mean squared error: 1.235744198839389 
 r2 score: 0.9997012962273906
For Linear Regression SPY R^2 scores for each split: [0.99971156 0.99967349 0.99968147 0.99973096 0.99961794] 
 mean squared error: 4.735684177047314 
 r2 score: 0.9997645278074918

Results#

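  • These are very good results, with very high \(R^2\) values and low MSE. Much of this comes from the strong day-to-day persistence of prices: yesterday's close (Lag1) is already very close to today's close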
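Because train_test_split and KFold(shuffle=True) mix past and future days, a chronological alternative is worth sketching. The example below uses scikit-learn's TimeSeriesSplit on a synthetic random walk (an assumption standing in for a real price series), with yesterday's price as the only feature. Each fold trains only on rows that come before its test rows.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import TimeSeriesSplit, cross_val_score

rng = np.random.default_rng(0)
# Synthetic random-walk "price" series, for illustration only
price = 100 + np.cumsum(rng.normal(0, 1, 500))
X = price[:-1].reshape(-1, 1)   # yesterday's price
y = price[1:]                   # today's price

tscv = TimeSeriesSplit(n_splits=5)
for train_idx, test_idx in tscv.split(X):
    # Every training index precedes every test index
    assert train_idx.max() < test_idx.min()

scores = cross_val_score(LinearRegression(), X, y, cv=tscv, scoring='r2')
print(scores)
```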

Ridge Regression and Lasso Regression#

  • We must scale the features so that their differing magnitudes do not affect the penalty term of the model

for i in range(n_tickers):
    X=Regression_features[i][features]
    y=Regression_features[i][goal]
    X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.2,random_state=42)
    scaler=StandardScaler()
    X_train=scaler.fit_transform(X_train)
    X_test=scaler.transform(X_test)
    model=Ridge()
    model.fit(X_train,y_train)
    mse=mean_squared_error(y_test,model.predict(X_test))
    r2test=r2_score(y_test,model.predict(X_test))
    print(f"For Ridge Regression {tickers[i]} mean squared error: {mse} \n r2 score: {r2test}")
for i in range(n_tickers):
    X=Regression_features[i][features]
    y=Regression_features[i][goal]
    X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.2,random_state=42)
    scaler=StandardScaler()
    X_train=scaler.fit_transform(X_train)
    X_test=scaler.transform(X_test)
    model=Lasso()
    model.fit(X_train,y_train)
    mse=mean_squared_error(y_test,model.predict(X_test))
    r2test=r2_score(y_test,model.predict(X_test))
    print(f"For Lasso Regression {tickers[i]} mean squared error: {mse} \n r2 score: {r2test}")
For Ridge Regression AAPL mean squared error: 1.3631638844383196 
 r2 score: 0.9996704963734816
For Ridge Regression SPY mean squared error: 5.276298278192233 
 r2 score: 0.9997376468790055
For Lasso Regression AAPL mean squared error: 2.1932754848568483 
 r2 score: 0.9994698420091198
For Lasso Regression SPY mean squared error: 6.296458811704498 
 r2 score: 0.9996869214867378

Results#

  • These are slightly worse than the linear model, likely because of the low dimensionality of our data: with only seven features, there is little overfitting for the penalty to remove
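The reason for scaling before a penalized fit can be seen on synthetic data. In this sketch the two features, their scales and alpha=100 are all assumptions chosen to make the effect visible. Without scaling, the penalty shrinks the small-scale feature's coefficient while leaving the large-scale feature almost untouched. Inside a pipeline with StandardScaler, both are shrunk on a comparable footing.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 200
# Two synthetic features on very different scales (think price vs. volume)
small = rng.normal(0, 1, n)
large = rng.normal(0, 1e3, n)
X = np.column_stack([small, large])
y = 3 * small + 2e-3 * large + rng.normal(0, 0.1, n)

# Same penalty strength, with and without standardizing the features first
raw = Ridge(alpha=100.0).fit(X, y)
scaled = make_pipeline(StandardScaler(), Ridge(alpha=100.0)).fit(X, y)

raw_coef = raw.coef_
scaled_coef = scaled.named_steps['ridge'].coef_
print(raw_coef)      # first coefficient shrunk well below 3, second stays near 2e-3
print(scaled_coef)   # both coefficients shrunk by a similar factor
```

Fitting the scaler inside the pipeline also means it only ever sees the training rows, which matches the fit_transform/transform pattern used in the cells above.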

Plotting the Linear Model#

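for i in range(n_tickers):
    # Rebuild the split and scaling for this ticker so each plot uses its own data
    X=Regression_features[i][features]
    y=Regression_features[i][goal]
    X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.2,random_state=42)
    scaler=StandardScaler()
    X_train=scaler.fit_transform(X_train)
    X_test=scaler.transform(X_test)
    for modeln in [LinearRegression, Ridge, Lasso]:
        model=modeln()
        model.fit(X_train,y_train)
        y_pred=model.predict(X_test)
        # Diagonal y=x spanning the full range of actual and predicted prices
        lims=[min(y_test.min(), y_pred.min()), max(y_test.max(), y_pred.max())]
        plt.scatter(y_test, y_pred, alpha=0.6, color='blue')
        plt.plot(lims, lims, 'r--', linewidth=2)
        plt.title(f'{tickers[i]} \n {modeln.__name__} Regression Results')
        plt.xlabel('Actual Price')
        plt.ylabel('Predicted Price')
        plt.grid(True)
        plt.show()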
[Figures: predicted vs. actual price scatter plots for the LinearRegression, Ridge and Lasso models]

These plots compare the predictions with the ground truth. The dashed diagonal line \(y=x\) marks a perfect prediction, and we can see that most predictions lie very close to this line.

Summary:#

  • Linear regression, with or without a penalty term, performs well in predicting future stock prices, but this method is limited to predicting only one day ahead.

  • The bias of this model over a longer span is high because, in reality, market data is not driven linearly. The variance is not as high, as we have fit our models to a large amount of data
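A useful point of comparison for one-day-ahead results is the "persistence" forecast, which predicts today's price as yesterday's price. The sketch below runs it on a synthetic random walk with drift (an assumption, not real market data). It illustrates that a very high \(R^2\) on price levels can coexist with no ability to predict the daily change.

```python
import numpy as np
from sklearn.metrics import r2_score

rng = np.random.default_rng(1)
# Synthetic random walk with drift as a stand-in for a price series
price = 100 + np.cumsum(rng.normal(0.05, 1.0, 2000))

today = price[1:]
yesterday = price[:-1]

# Persistence forecast judged on price levels
r2_level = r2_score(today, yesterday)

# The same forecast judged on daily changes: it predicts a change of zero every day
r2_change = r2_score(np.diff(price), np.zeros(len(price) - 1))
print(r2_level, r2_change)
```

Here `r2_level` is close to 1 while `r2_change` is at or below 0, so scoring models on returns or changes gives a stricter test than scoring them on levels.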

k-NN prediction models#

Imports#

from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import confusion_matrix, accuracy_score, f1_score

Return to using all tickers

n_tickers=len(tickers) #If desired
for i in range(n_tickers):
    X=Regression_features[i][features]
    y=Regression_features[i][goal]
    X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.2,random_state=42)
    k=7
    knn = KNeighborsClassifier(n_neighbors=k)
    ytr_discrete= y_train > y_train.shift(-1)
    yts_discrete= y_test > y_test.shift(1)
    #We want to see if we go up or down in the next day
    knn.fit(X_train, ytr_discrete)
    # Predict on the test set
    y_pred_knn = knn.predict(X_test)
    conf_matrix = confusion_matrix(yts_discrete, y_pred_knn)
    sns.heatmap(conf_matrix, annot=True, fmt='d', cmap='coolwarm', xticklabels=['Decrease', 'Increase'], yticklabels=['Decrease', 'Increase'])
    plt.title(f'Confusion Matrix for K-NN Classifier for {tickers[i]}')
    plt.xlabel('Predicted')
    plt.ylabel('Actual')
    plt.show()
    print(f'Accuracy score for {tickers[i]} : {accuracy_score(yts_discrete, y_pred_knn)} \n F1 score for {tickers[i]} : {f1_score(yts_discrete, y_pred_knn)}')
[Figure: Confusion Matrix for K-NN Classifier for AAPL]
Accuracy score for AAPL : 0.6619718309859155 
 F1 score for AAPL : 0.6408450704225352
[Figure: Confusion Matrix for K-NN Classifier for SPY]
Accuracy score for SPY : 0.6014913007456504 
 F1 score for SPY : 0.6041152263374485
[Figure: Confusion Matrix for K-NN Classifier for GC=F]
Accuracy score for GC=F : 0.7125103562551781 
 F1 score for GC=F : 0.7120331950207469
[Figure: Confusion Matrix for K-NN Classifier for SI=F]
Accuracy score for SI=F : 0.704225352112676 
 F1 score for SI=F : 0.7007544006705784

Results#

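  • The k-NN method reaches accuracies of about 0.60 to 0.71 in classifying whether the price goes up or down. These scores should be read with caution: the labels are computed after a shuffled train-test split, so shift compares each price with a non-adjacent day, and the training and test labels use opposite shift directions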
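In the code above, the up/down labels are created after train_test_split has shuffled the rows, so shift no longer lines up consecutive days. A minimal sketch of one alternative is given below: build the next-day label on the time-ordered series first, then split chronologically. The synthetic random walk and the return-based features `ret1` and `ret5` are assumptions for illustration, not the features used in this notebook.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import accuracy_score
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
# Synthetic random-walk closes, for illustration only
close = pd.Series(200 + np.cumsum(rng.normal(0, 1, 1000)))

# Features known at the end of day t (returns rather than raw price levels)
ret = close.pct_change()
X = pd.concat([ret.rename('ret1'), ret.rolling(5).mean().rename('ret5')], axis=1)

# Label for day t: does the close rise on day t+1? Computed before any splitting.
up_next = (close.shift(-1) > close).rename('up_next')

# Drop the last row (no next day) and the warm-up rows with missing features
data = pd.concat([X, up_next], axis=1).iloc[:-1].dropna()

split = int(len(data) * 0.8)           # chronological split, no shuffling
train, test = data.iloc[:split], data.iloc[split:]

knn = KNeighborsClassifier(n_neighbors=7)
knn.fit(train[['ret1', 'ret5']], train['up_next'])
acc = accuracy_score(test['up_next'], knn.predict(test[['ret1', 'ret5']]))
print(acc)
```

On a pure random walk the accuracy should hover around 0.5, which gives a sense of what "no signal" looks like for this kind of classifier.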

Random Forest based model#

  • These models work by generating many decision trees and averaging their answers to determine the output

Imports

from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor

Fitting the basic Random Forest

  • Generates trees randomly from subsets of the training data

RF_models=[]
for i in range(n_tickers):
    X=Regression_features[i][features]
    y=Regression_features[i][goal]
    X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.2,random_state=42)
    rf = RandomForestRegressor(n_estimators=100, random_state=42)
    rf.fit(X_train, y_train)
    y_pred_rf = rf.predict(X_test)
    RF_models.append(rf)
    print(f"Random Forest mse for {tickers[i]}:{mean_squared_error(y_test, y_pred_rf)}")
    print(f"Random Forest R2 for {tickers[i]}: {r2_score(y_test, y_pred_rf)}")
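The averaging behind a random forest can be verified directly: the forest's prediction is the mean of its individual trees' predictions. The sketch below checks this on synthetic data (an assumption for illustration) using the fitted trees exposed in `estimators_`.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
# Synthetic regression data, for illustration only
X = rng.uniform(0, 10, size=(300, 2))
y = np.sin(X[:, 0]) + 0.1 * X[:, 1] + rng.normal(0, 0.1, 300)

rf = RandomForestRegressor(n_estimators=50, random_state=42).fit(X, y)

X_new = rng.uniform(0, 10, size=(5, 2))
# One row of predictions per tree, shape (n_trees, n_samples)
tree_preds = np.array([tree.predict(X_new) for tree in rf.estimators_])
print(np.allclose(tree_preds.mean(axis=0), rf.predict(X_new)))  # True
```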

Fitting the Gradient Boosting model

  • Generates successive trees to try to improve on past trees

GB_models=[]
for i in range(n_tickers):
    X=Regression_features[i][features]
    y=Regression_features[i][goal]
    X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.2,random_state=42)
    gb = GradientBoostingRegressor(n_estimators=100, learning_rate=0.1, random_state=42)
    gb.fit(X_train, y_train)
    y_pred_gb = gb.predict(X_test)
    GB_models.append(gb)
    print(f"Gradient boosting mse for {tickers[i]}:{mean_squared_error(y_test, y_pred_gb)}")
    print(f"Gradient boosting R2 for {tickers[i]}: {r2_score(y_test, y_pred_gb)}")
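To see successive trees improving on earlier ones, scikit-learn's staged_predict yields the ensemble's prediction after each boosting stage. This sketch, on synthetic data chosen for illustration, tracks the training error as trees are added.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
# Synthetic one-feature regression problem, for illustration only
X = rng.uniform(0, 10, size=(300, 1))
y = np.sin(X[:, 0]) + rng.normal(0, 0.1, 300)

gb = GradientBoostingRegressor(n_estimators=100, learning_rate=0.1, random_state=42).fit(X, y)

# staged_predict yields predictions after 1, 2, ..., n_estimators trees
train_mse = [mean_squared_error(y, pred) for pred in gb.staged_predict(X)]
print(train_mse[0], train_mse[9], train_mse[-1])
```

The training error falls as stages are added. Computing the same curve on a held-out set is a common way to judge how many trees are worth keeping.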

Read into the models to see which features are used most#

  • Find the importances of the features in each model

for i in range(n_tickers):
    for models in [RF_models, GB_models]:
        importances = models[i].feature_importances_
        # Sort by importance while keeping each feature name attached to its value
        order = np.argsort(importances)[::-1]
        plt.bar(np.array(features)[order], importances[order])
        plt.title(f'Feature Importances for {tickers[i]} using {type(models[i]).__name__}')
        plt.ylabel('Importance')
        plt.show()
for i in range(n_tickers):
    # Recreate this ticker's split so each model is evaluated on its own test set
    X=Regression_features[i][features]
    y=Regression_features[i][goal]
    X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.2,random_state=42)
    for models in [RF_models, GB_models]:
        y_pred=models[i].predict(X_test)
        plt.scatter(y_test, y_pred, alpha=0.6, color='green', label='Predicted vs Actual')
        plt.plot([y_pred.min(), y_pred.max()], [y_pred.min(), y_pred.max()], 'r--', linewidth=2)
        plt.title(f'{tickers[i]} Predictions vs. Actual using {type(models[i]).__name__}')
        plt.xlabel('Actual Prices')
        plt.ylabel('Predicted Prices')
        plt.legend()
        plt.grid(True)
        plt.show()

Results using Random Forest and Gradient Boosting#

  • These have a similar MSE and \(R^2\) to the linear models, but may be better because they can capture nonlinear relationships between the features and the price

  • We can also see that the price from 1 day ago is the best predictor of the price today, which makes sense

  • However, for AAPL, there is misfitting at higher values, which is a drawback of tree-based designs: they cannot extrapolate beyond the range of prices seen in training
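The extrapolation limit of tree-based models can be demonstrated on a toy problem. In this sketch (synthetic data, chosen only for illustration) a random forest and a linear regression are both trained on an exactly linear relationship and then asked about an input far outside the training range. Whether this mechanism is what drives the AAPL plot is not established here.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression

# Exactly linear toy data: targets range from 0 to 20
X_train = np.linspace(0, 10, 200).reshape(-1, 1)
y_train = 2 * X_train[:, 0]

rf = RandomForestRegressor(n_estimators=50, random_state=42).fit(X_train, y_train)
lin = LinearRegression().fit(X_train, y_train)

X_out = np.array([[20.0]])             # far outside the training range
rf_pred = rf.predict(X_out)[0]
lin_pred = lin.predict(X_out)[0]
print(rf_pred)    # stays at or below max(y_train) = 20
print(lin_pred)   # about 40, following the linear trend
```

Each tree predicts an average of training targets from one of its leaves, so the forest's output cannot exceed the largest target it was trained on.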

Summary and Conclusion#

  • Using Linear, k-NN and Random Forest/Gradient Boosting algorithms, we were able to get very accurate predictions for stock prices one day in the future from lagging and moving average data.

  • The linear models, though they fit well, have high bias, as stocks cannot be modeled linearly

  • The k-NN models are decently accurate at predicting the discrete outcome of whether the stock will go up or down.

  • The tree-based models have lower bias but higher variance, since they can overfit or underfit and are sensitive to hyperparameters.

References#

  • I took inspiration from previous Math 10 projects done by Hailili Subinuer and James Cho

  • I used the documentation for yfinance, matplotlib, mplfinance, seaborn, and sklearn to aid in plotting and doing regression