The Application of Machine Learning in Gold Price Prediction#

Author: Hailili Subinuer

Course Project, UC Irvine, Math 10, Fall 24

I would like to post my notebook on the course’s website. [Yes]

The dataset I am analyzing contains gold price and volume data, along with related financial indicators. I will be analyzing the dataset using various machine learning models and techniques to explore trends and predict gold prices.

1. Importing and Cleaning Data#

In this step, I import the libraries required for data processing and set the float format to display numbers with two decimal places, which makes the output easier to read. After that, I load the gold price dataset from a CSV file and print its contents to inspect the data structure.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import Ridge, LogisticRegression, LinearRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score, classification_report, confusion_matrix
import seaborn as sns

pd.options.display.float_format = '{:.2f}'.format
file_path = 'financial_regression.csv'  
gold_data = pd.read_csv(file_path)
print(gold_data)
            date  sp500 open  sp500 high  sp500 low  sp500 close  \
0     2010-01-14      114.49      115.14     114.42       114.93   
1     2010-01-15      114.73      114.84     113.20       113.64   
2     2010-01-18         NaN         NaN        NaN          NaN   
3     2010-01-19      113.62      115.13     113.59       115.06   
4     2010-01-20      114.28      114.45     112.98       113.89   
...          ...         ...         ...        ...          ...   
3899  2024-10-17      585.91      586.12     582.16       582.35   
3900  2024-10-18      584.07      585.39     582.58       584.59   
3901  2024-10-21      583.85      584.85     580.60       583.63   
3902  2024-10-22      581.05      584.50     580.38       583.32   
3903  2024-10-23      581.26      581.71     574.41       577.99   

      sp500 volume  sp500 high-low  nasdaq open  nasdaq high  nasdaq low  ...  \
0     115646960.00            0.72        46.26        46.52       46.22  ...   
1     212252769.00            1.64        46.46        46.55       45.65  ...   
2              NaN             NaN          NaN          NaN         NaN  ...   
3     138671890.00            1.54        45.96        46.64       45.95  ...   
4     216330645.00            1.47        46.27        46.60       45.43  ...   
...            ...             ...          ...          ...         ...  ...   
3899   34393714.00            3.96       496.44       496.49      491.19  ...   
3900   37416801.00            2.81       494.06       495.57      493.30  ...   
3901   36439010.00            4.25       493.25       496.23      491.31  ...   
3902   34183835.00            4.12       492.73       497.44      491.97  ...   
3903   47444991.00            7.29       493.59       494.25      485.05  ...   

      palladium high  palladium low  palladium close  palladium volume  \
0              45.02          43.86            44.84         364528.00   
1              45.76          44.40            45.76         442210.00   
2                NaN            NaN              NaN               NaN   
3              47.08          45.70            46.94         629150.00   
4              47.31          45.17            47.05         643198.00   
...              ...            ...              ...               ...   
3899           96.03          94.54            95.68          52414.00   
3900           99.66          97.27            99.46         205027.00   
3901           98.35          95.89            97.35         227394.00   
3902           99.59          97.96            99.41         136431.00   
3903           98.55          96.46            97.53          81133.00   

      palladium high-low  gold open  gold high  gold low  gold close  \
0                   1.16     111.51     112.37    110.79      112.03   
1                   1.36     111.35     112.01    110.38      110.86   
2                    NaN        NaN        NaN       NaN         NaN   
3                   1.38     110.95     111.75    110.83      111.52   
4                   2.14     109.97     110.05    108.46      108.94   
...                  ...        ...        ...       ...         ...   
3899                1.49     247.75     249.06    247.62      248.63   
3900                2.39     250.00     251.37    249.90      251.27   
3901                2.46     252.74     253.14    250.73      251.22   
3902                1.63     253.06     253.94    252.52      253.93   
3903                2.09     253.08     253.18    250.20      250.87   

      gold volume  
0     18305238.00  
1     18000724.00  
2             NaN  
3     10467927.00  
4     17534231.00  
...           ...  
3899   5176170.00  
3900   7833614.00  
3901   9258590.00  
3902   5756321.00  
3903   7899995.00  

[3904 rows x 47 columns]
print(gold_data.columns)  # Check all column names (the date column may be named differently)
Index(['date', 'sp500 open', 'sp500 high', 'sp500 low', 'sp500 close',
       'sp500 volume', 'sp500 high-low', 'nasdaq open', 'nasdaq high',
       'nasdaq low', 'nasdaq close', 'nasdaq volume', 'nasdaq high-low',
       'us_rates_%', 'CPI', 'usd_chf', 'eur_usd', 'GDP', 'silver open',
       'silver high', 'silver low', 'silver close', 'silver volume',
       'silver high-low', 'oil open', 'oil high', 'oil low', 'oil close',
       'oil volume', 'oil high-low', 'platinum open', 'platinum high',
       'platinum low', 'platinum close', 'platinum volume',
       'platinum high-low', 'palladium open', 'palladium high',
       'palladium low', 'palladium close', 'palladium volume',
       'palladium high-low', 'gold open', 'gold high', 'gold low',
       'gold close', 'gold volume'],
      dtype='object')
gold_data['date'] = pd.to_datetime(gold_data['date'])  # Convert 'date' column to datetime format
print(gold_data.isnull().sum())  # Count missing values in each column
print(gold_data.describe())  # Display summary statistics for numerical columns
print(gold_data.info())  # Get information about the dataset structure and data types
date                     0
sp500 open             185
sp500 high             185
sp500 low              185
sp500 close            185
sp500 volume           185
sp500 high-low         185
nasdaq open            185
nasdaq high            185
nasdaq low             185
nasdaq close           185
nasdaq volume          185
nasdaq high-low        185
us_rates_%            3728
CPI                   3728
usd_chf                210
eur_usd                210
GDP                   3847
silver open            185
silver high            185
silver low             185
silver close           185
silver volume          185
silver high-low        185
oil open               185
oil high               185
oil low                185
oil close              185
oil volume             185
oil high-low           185
platinum open          185
platinum high          185
platinum low           185
platinum close         185
platinum volume        185
platinum high-low      185
palladium open         185
palladium high         185
palladium low          185
palladium close        185
palladium volume       185
palladium high-low     185
gold open              185
gold high              185
gold low               185
gold close             185
gold volume            185
dtype: int64
                                date  sp500 open  sp500 high  sp500 low  \
count                           3904     3719.00     3719.00    3719.00   
mean   2017-06-03 14:57:02.950819584      268.73      270.18     267.16   
min              2010-01-14 00:00:00      103.11      103.42     101.13   
25%              2013-09-24 18:00:00      169.71      170.18     169.03   
50%              2017-06-03 12:00:00      241.18      242.08     239.45   
75%              2021-02-10 06:00:00      374.46      377.80     371.05   
max              2024-10-23 00:00:00      585.91      586.12     582.58   
std                              NaN      121.37      122.02     120.66   

       sp500 close  sp500 volume  sp500 high-low  nasdaq open  nasdaq high  \
count      3719.00       3719.00         3719.00      3719.00      3719.00   
mean        268.78  112420568.23            3.02       181.39       182.69   
min         102.20     170817.00            0.30        42.67        42.82   
25%         169.65   66991123.00            1.28        78.70        79.15   
50%         240.61   92399074.00            2.09       138.72       139.42   
75%         374.25  136714558.00            3.81       284.16       287.59   
max         584.59  709504454.00           22.96       503.07       503.52   
std         121.39   69029012.54            2.66       122.47       123.39   

       nasdaq low  ...  palladium high  palladium low  palladium close  \
count     3719.00  ...         3719.00        3719.00          3719.00   
mean       179.98  ...          110.18         107.93           109.07   
min         41.55  ...           40.55          38.49            40.09   
25%         78.31  ...           69.40          68.27            68.82   
50%        138.02  ...           83.90          82.65            83.21   
75%        281.23  ...          143.78         140.48           142.04   
max        498.39  ...          298.21         277.00           295.00   
std        121.44  ...           58.45          56.88            57.69   

       palladium volume  palladium high-low  gold open  gold high  gold low  \
count           3719.00             3719.00    3719.00    3719.00   3719.00   
mean           71695.56                2.25     145.45     146.10    144.76   
min             3157.00                0.23     100.92     100.99    100.23   
25%            20859.00                0.87     120.56     121.03    120.17   
50%            38295.00                1.41     137.62     138.14    136.99   
75%            84207.00                2.68     167.77     168.41    167.06   
max          1199042.00               45.39     253.08     253.94    252.52   
std            97907.03                2.54      29.60      29.75     29.42   

       gold close  gold volume  
count     3719.00      3719.00  
mean       145.45   9658138.41  
min        100.50   1436508.00  
25%        120.59   5795309.50  
50%        137.71   8087993.00  
75%        167.84  11567291.00  
max        253.93  93698108.00  
std         29.61   6182342.40  

[8 rows x 47 columns]
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3904 entries, 0 to 3903
Data columns (total 47 columns):
 #   Column              Non-Null Count  Dtype         
---  ------              --------------  -----         
 0   date                3904 non-null   datetime64[ns]
 1   sp500 open          3719 non-null   float64       
 2   sp500 high          3719 non-null   float64       
 3   sp500 low           3719 non-null   float64       
 4   sp500 close         3719 non-null   float64       
 5   sp500 volume        3719 non-null   float64       
 6   sp500 high-low      3719 non-null   float64       
 7   nasdaq open         3719 non-null   float64       
 8   nasdaq high         3719 non-null   float64       
 9   nasdaq low          3719 non-null   float64       
 10  nasdaq close        3719 non-null   float64       
 11  nasdaq volume       3719 non-null   float64       
 12  nasdaq high-low     3719 non-null   float64       
 13  us_rates_%          176 non-null    float64       
 14  CPI                 176 non-null    float64       
 15  usd_chf             3694 non-null   float64       
 16  eur_usd             3694 non-null   float64       
 17  GDP                 57 non-null     float64       
 18  silver open         3719 non-null   float64       
 19  silver high         3719 non-null   float64       
 20  silver low          3719 non-null   float64       
 21  silver close        3719 non-null   float64       
 22  silver volume       3719 non-null   float64       
 23  silver high-low     3719 non-null   float64       
 24  oil open            3719 non-null   float64       
 25  oil high            3719 non-null   float64       
 26  oil low             3719 non-null   float64       
 27  oil close           3719 non-null   float64       
 28  oil volume          3719 non-null   float64       
 29  oil high-low        3719 non-null   float64       
 30  platinum open       3719 non-null   float64       
 31  platinum high       3719 non-null   float64       
 32  platinum low        3719 non-null   float64       
 33  platinum close      3719 non-null   float64       
 34  platinum volume     3719 non-null   float64       
 35  platinum high-low   3719 non-null   float64       
 36  palladium open      3719 non-null   float64       
 37  palladium high      3719 non-null   float64       
 38  palladium low       3719 non-null   float64       
 39  palladium close     3719 non-null   float64       
 40  palladium volume    3719 non-null   float64       
 41  palladium high-low  3719 non-null   float64       
 42  gold open           3719 non-null   float64       
 43  gold high           3719 non-null   float64       
 44  gold low            3719 non-null   float64       
 45  gold close          3719 non-null   float64       
 46  gold volume         3719 non-null   float64       
dtypes: datetime64[ns](1), float64(46)
memory usage: 1.4 MB
None
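
The missing-value counts above are easier to compare as fractions: in the output, us_rates_% and CPI are missing in 3728 of 3904 rows (about 95%) and GDP in 3847 of 3904 (about 99%). The sketch below shows one way to rank columns by their share of missing values. It uses a made-up toy frame, and the 0.5 threshold is a hypothetical choice rather than something taken from this project.

```python
import numpy as np
import pandas as pd

# Toy frame with one dense and one sparse column (made-up values).
demo = pd.DataFrame({
    "gold close": [110.0, np.nan, 111.5, 112.0],
    "GDP": [np.nan, np.nan, np.nan, 1.9],
})

# Fraction of NaN per column, largest first.
missing_share = demo.isnull().mean().sort_values(ascending=False)
print(missing_share)

# Hypothetical threshold: flag columns that are more than half empty.
sparse_cols = missing_share[missing_share > 0.5].index.tolist()
print(sparse_cols)
```

Columns flagged this way deserve care later, because any row-wise cleaning that considers them will discard most of the data.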

2. Data Preparation#

2.1 Feature Selection#

In this step, I select the key features from the dataset (gold open, gold high, gold low, and gold volume) to create the feature matrix X. The target variable y for regression is the gold closing price (gold close). For classification, I also create a binary target that indicates whether the next closing price is higher than the current one.

# Select relevant features for analysis
selected_features = ['gold open', 'gold high', 'gold low', 'gold volume']

# Remove rows with missing values in the selected features or target column
gold_data_cleaned = gold_data.dropna(subset=selected_features + ['gold close']).reset_index(drop=True)

# Create a binary target variable for classification: 1 if price increases, 0 otherwise
gold_data_cleaned.loc[:, 'Price Change'] = (gold_data_cleaned['gold close'].shift(-1) > gold_data_cleaned['gold close']).astype(int)

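# Drop every row that still has a NaN in any column. Note: the label above contains no NaN
# (astype(int) turns the last row's comparison into 0), so this call mainly removes rows that
# lack the sparse us_rates_%, CPI or GDP values, leaving only about 30 rows (the test sets below have 9).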
gold_data_cleaned = gold_data_cleaned.dropna().reset_index(drop=True)

# Prepare feature matrix and target variables
X = gold_data_cleaned[selected_features]  # Feature matrix
y_regression = gold_data_cleaned['gold close']  # Target variable for regression
y_classification = gold_data_cleaned['Price Change']  # Target variable for classification
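
A minimal sketch of an alternative way to build the next-day label, assuming the intent is to discard only the final row (which has no following close) while keeping rows whose unrelated columns are empty. The toy frame and the helper column name "next close" are made up for illustration and are not part of the original notebook.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the real data (values are made up for illustration).
toy = pd.DataFrame({
    "gold close": [110.0, 111.5, 111.0, 112.2, 112.0],
    "GDP": [np.nan, np.nan, 1.9, np.nan, np.nan],  # sparse, like the real GDP column
})

# Next-day close; the last row has no "tomorrow", so it is NaN there.
toy["next close"] = toy["gold close"].shift(-1)

# Drop only rows where the label cannot be defined, ignoring unrelated sparse columns.
labeled = toy.dropna(subset=["next close"]).copy()
labeled["Price Change"] = (labeled["next close"] > labeled["gold close"]).astype(int)

print(len(toy.dropna()), "row(s) survive a blanket dropna()")
print(len(labeled), "row(s) survive dropna(subset=['next close'])")
print(labeled["Price Change"].tolist())
```

Restricting dropna to a subset keeps the sample size close to the number of trading days, which matters for the classification models later on.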

2.2 Split data#

In this step, I split the data 70:30 (70% for training, 30% for testing). Setting random_state=42 makes the split reproducible, and using the same value in both calls gives the regression and classification targets the same row partition.

# Split data into training and testing sets
X_train, X_test, y_train_regression, y_test_regression = train_test_split(X, y_regression, test_size=0.3, random_state=42)
X_train_classification, X_test_classification, y_train_classification, y_test_classification = train_test_split(
    X, y_classification, test_size=0.3, random_state=42)
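
The scaled training features are later reused for the classification models, which only works if both calls select the same rows. The sketch below checks this on made-up toy data and also shows a single call that splits both targets at once; all variable names here are illustrative.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data: 10 rows, one feature, two different targets (made-up values).
X_toy = pd.DataFrame({"feature": np.arange(10, dtype=float)})
y_reg = pd.Series(np.arange(10, dtype=float) * 2.0, name="price")
y_clf = pd.Series([0, 1] * 5, name="direction")

# Two separate calls with the same test_size and random_state.
Xa_train, Xa_test, ya_train, ya_test = train_test_split(
    X_toy, y_reg, test_size=0.3, random_state=42)
Xb_train, Xb_test, yb_train, yb_test = train_test_split(
    X_toy, y_clf, test_size=0.3, random_state=42)

same_rows = Xa_train.index.equals(Xb_train.index) and Xa_test.index.equals(Xb_test.index)
print("Same row partition:", same_rows)

# Alternative: one call that splits both targets alongside X.
X_tr, X_te, yr_tr, yr_te, yc_tr, yc_te = train_test_split(
    X_toy, y_reg, y_clf, test_size=0.3, random_state=42)
```

The single-call form removes the need to keep two sets of arguments in sync.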

2.3 Standardize Feature Data#

In this step, I standardize the feature data to put all features on the same scale. This is important because it prevents features with larger magnitudes, such as gold volume, from dominating the model. The scaler is fitted on the training set only and then applied to the test set.

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
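
To make the transformation concrete, the sketch below reproduces StandardScaler by hand on made-up numbers: each column has its training mean subtracted and is divided by its training standard deviation. The arrays and names are illustrative only.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up training and test features on very different scales (price, volume).
X_tr = np.array([[110.0, 1.8e7], [112.0, 1.0e7], [115.0, 1.5e7], [120.0, 0.7e7]])
X_te = np.array([[118.0, 0.9e7]])

scaler_demo = StandardScaler().fit(X_tr)   # learns mean and std from training rows only
Z_tr = scaler_demo.transform(X_tr)
Z_te = scaler_demo.transform(X_te)         # test rows reuse the training statistics

# The same transform written out: z = (x - mean) / std, with population std (ddof=0).
Z_manual = (X_tr - X_tr.mean(axis=0)) / X_tr.std(axis=0)

print(np.allclose(Z_tr, Z_manual))
print(Z_tr.mean(axis=0).round(6), Z_tr.std(axis=0).round(6))
```

After scaling, every training column has mean 0 and standard deviation 1, so price and volume contribute on comparable scales.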

3. Ridge Regression#

3.1 Model Training and Prediction#

In this step, I train the Ridge Regression model to learn the relationship between the features and the target variable. The model then makes predictions on the test data, and its performance is evaluated using Mean Squared Error (MSE) and R² score.

# Create the Ridge Regression model and train it on the standardized training data
ridge = Ridge(alpha=1.0)
ridge.fit(X_train_scaled, y_train_regression)

# Predict on the test set
y_pred_ridge = ridge.predict(X_test_scaled)

# Evaluate the model
ridge_mse = mean_squared_error(y_test_regression, y_pred_ridge)
ridge_r2 = r2_score(y_test_regression, y_pred_ridge)

# Display evaluation metrics
print(f"Ridge Regression - MSE: {ridge_mse}, R²: {ridge_r2}")
Ridge Regression - MSE: 0.28427203807539414, R²: 0.9996172673947531
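
For reference, both metrics can be computed directly from the residuals. The sketch below uses made-up prices, not the project's data, and checks the manual formulas against scikit-learn.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Made-up actual and predicted prices, for illustration only.
y_true = np.array([110.0, 125.0, 140.0, 170.0, 250.0])
y_hat = np.array([110.5, 124.0, 141.0, 169.0, 251.0])

residuals = y_true - y_hat
mse_manual = np.mean(residuals ** 2)             # average squared error
ss_res = np.sum(residuals ** 2)                  # unexplained variation
ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total variation around the mean
r2_manual = 1.0 - ss_res / ss_tot

print(mse_manual, r2_manual)
print(mean_squared_error(y_true, y_hat), r2_score(y_true, y_hat))
```

Because R² compares the squared error with the total variance of the prices, a small MSE relative to a widely spread target gives a value close to 1.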

3.2 Visualization of Results#

In this step, I visualize the Ridge Regression results by plotting the predicted values against the actual values from the test set. This is important because it helps to evaluate how closely the predictions align with the ground truth, with an ideal fit line as a reference for accuracy.

# Plot actual vs predicted prices
plt.figure(figsize=(10, 6))
plt.scatter(y_test_regression, y_pred_ridge, alpha=0.6, color='blue', label='Predicted vs Actual')
plt.plot([y_test_regression.min(), y_test_regression.max()], [y_test_regression.min(), y_test_regression.max()], 'r--', linewidth=2)
plt.title('Ridge Regression: Predicted vs Actual Gold Prices')
plt.xlabel('Actual Prices')
plt.ylabel('Predicted Prices')
plt.legend()
plt.grid(True)
plt.show()
../_images/44f63b645e4d12be625a972e7d12d8a9bbbc2f89e51766da720eb8a83e40b558.png

Result:#

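The visualization shows that the predicted values lie close to the ideal fit line, consistent with the test MSE of about 0.28 and R² of about 0.9996. Such a close fit is expected, because the features include the same-day open, high, and low, and the closing price always lies between the low and the high. Ridge Regression is therefore well suited to an initial analysis of gold prices.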

4. K-NN Classification for Gold Price Analysis#

In this step, I used the K-NN model to predict whether the gold closing price would be higher on the next trading day. After training the model, I evaluated it on the test set with a confusion matrix and a classification report to see how well it classified the price changes.

4.1 Model Training and Prediction (K-NN)#

k = 5  # Number of neighbors
knn = KNeighborsClassifier(n_neighbors=k)
knn.fit(X_train_scaled, y_train_classification)

# Predict on the test set
y_pred_knn = knn.predict(X_test_scaled)

4.2 Evaluation and Visualization (K-NN)#

# Evaluate the model
conf_matrix = confusion_matrix(y_test_classification, y_pred_knn)
class_report = classification_report(y_test_classification, y_pred_knn)

print("K-NN Classification - Confusion Matrix:")
print(conf_matrix)
print("\nClassification Report:")
print(class_report)
K-NN Classification - Confusion Matrix:
[[2 3]
 [2 2]]

Classification Report:
              precision    recall  f1-score   support

           0       0.50      0.40      0.44         5
           1       0.40      0.50      0.44         4

    accuracy                           0.44         9
   macro avg       0.45      0.45      0.44         9
weighted avg       0.46      0.44      0.44         9
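
The figures in the classification report follow directly from the confusion matrix. As a short worked check, the sketch below recomputes them from the matrix printed above (rows are actual classes, columns are predicted classes).

```python
import numpy as np

# Confusion matrix reported above: rows are actual classes, columns are predicted classes.
cm = np.array([[2, 3],
               [2, 2]])

accuracy = np.trace(cm) / cm.sum()
precision = np.diag(cm) / cm.sum(axis=0)   # correct predictions / all predictions of that class
recall = np.diag(cm) / cm.sum(axis=1)      # correct predictions / all actual members of that class
f1 = 2 * precision * recall / (precision + recall)

print(accuracy.round(2), precision.round(2), recall.round(2), f1.round(2))
```

The accuracy of 4/9 (about 0.44) means fewer than half of the nine test samples were classified correctly.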

4.3 Graph#

# Visualize Confusion Matrix
plt.figure(figsize=(8, 6))
sns.heatmap(conf_matrix, annot=True, fmt='d', cmap='Blues', xticklabels=['Decrease', 'Increase'], yticklabels=['Decrease', 'Increase'])
plt.title('Confusion Matrix for K-NN Classifier')
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.show()
../_images/d39d0049347b4179f63ec38189780487e9686f48ad41be2d150ff32beaaea67b.png

Result#

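On this test set, K-NN does not separate rising from falling prices: it classifies only 4 of the 9 test samples correctly (accuracy 0.44), which is no better than guessing. With only 9 test samples the estimate is also very noisy, so these results do not show that K-NN is a good choice for analyzing short-term trends in gold prices.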

5. Logistic Regression for Gold Price Analysis#

I trained a Logistic Regression model to figure out if the gold price would go up or down. Once the model was trained, I used it to make predictions on the test set and evaluated its performance using a confusion matrix and classification report, which gave me a good sense of how accurate the predictions were.

5.1 Model Training and Prediction (Logistic Regression)#

# Logistic Regression Model
log_reg = LogisticRegression()
log_reg.fit(X_train_scaled, y_train_classification)

# Predict on the test set
y_pred_log_reg = log_reg.predict(X_test_scaled)

5.2 Evaluation of Logistic Regression#

# Evaluate the model
conf_matrix_log = confusion_matrix(y_test_classification, y_pred_log_reg)
class_report_log = classification_report(y_test_classification, y_pred_log_reg)

print("Logistic Regression - Confusion Matrix:")
print(conf_matrix_log)
print("\nClassification Report:")
print(class_report_log)
Logistic Regression - Confusion Matrix:
[[1 4]
 [2 2]]

Classification Report:
              precision    recall  f1-score   support

           0       0.33      0.20      0.25         5
           1       0.33      0.50      0.40         4

    accuracy                           0.33         9
   macro avg       0.33      0.35      0.33         9
weighted avg       0.33      0.33      0.32         9
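
A useful yardstick for these accuracies is a baseline that always predicts the most frequent class. The sketch below rebuilds test labels with the class counts from the report (five 0s and four 1s). The training labels are hypothetical, and the assumption that class 0 is also the training majority is not confirmed by the notebook.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# Test labels with the class counts from the report above: five 0s and four 1s.
y_test_demo = np.array([0] * 5 + [1] * 4)
X_test_dummy = np.zeros((len(y_test_demo), 1))   # features are ignored by this baseline

# Hypothetical training labels; assumed here to share the same majority class (0).
y_train_demo = np.array([0] * 11 + [1] * 10)
X_train_dummy = np.zeros((len(y_train_demo), 1))

baseline = DummyClassifier(strategy="most_frequent").fit(X_train_dummy, y_train_demo)
baseline_acc = accuracy_score(y_test_demo, baseline.predict(X_test_dummy))
print(round(baseline_acc, 2))
```

Under that assumption the baseline reaches 5/9 (about 0.56), so a classifier needs to exceed that figure before it can be said to add information.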

5.3 Graph (Logistic Regression)#

plt.figure(figsize=(8, 6))
sns.heatmap(conf_matrix_log, annot=True, fmt='d', cmap='Blues', xticklabels=['Decrease', 'Increase'], yticklabels=['Decrease', 'Increase'])
plt.title('Confusion Matrix for Logistic Regression')
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.show()
../_images/d39d0049347b4179f63ec38189780487e9686f48ad41be2d150ff32beaaea67b.png

Result:#

The confusion matrix shows that Logistic Regression classified only 3 of the 9 test samples correctly (accuracy 0.33), with a precision of 0.33 for both classes and recall of 0.20 and 0.50. On this small test set, Logistic Regression is therefore not a reliable predictor of whether gold prices will increase or decrease.

6. Random Forest Regression#

Random Forest Regression is a flexible model that combines multiple decision trees to make accurate predictions. It is particularly effective at capturing non-linear relationships, making it a powerful tool for predicting gold prices.

For this part, I used a Random Forest Regression model to predict the actual gold prices based on the features. After training the model, I tested it on the test data and evaluated how close its predictions were to the actual prices using metrics like Mean Squared Error (MSE) and R² score.

6.1 Model Training and Prediction (Random Forest Regression)#

rf = RandomForestRegressor(n_estimators=100, random_state=42)  # Initialize Random Forest Regressor with 100 trees
rf.fit(X_train_scaled, y_train_regression)

# Predict on the test set
y_pred_rf = rf.predict(X_test_scaled)

6.2 Evaluation of Random Forest Regression#

rf_mse = mean_squared_error(y_test_regression, y_pred_rf)  # Calculate the Mean Squared Error (MSE)
rf_r2 = r2_score(y_test_regression, y_pred_rf)  # Calculate the R² score for model performance
print(f"Random Forest Regression - MSE: {rf_mse}, R²: {rf_r2}")
Random Forest Regression - MSE: 20.038521527221764, R²: 0.9730209288210895
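
One plausible contributor to the larger error, not verified on the real data, is that a random forest predicts by averaging training targets, so it produces step-like outputs and cannot go beyond the range of prices it was trained on. The sketch below illustrates this on synthetic, exactly linear data; all names and values are made up.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression

# Synthetic, exactly linear data (made up): y = 2x + 1.
X_train_demo = np.arange(0, 20, dtype=float).reshape(-1, 1)
y_train_demo = 2.0 * X_train_demo.ravel() + 1.0
X_new = np.array([[25.0], [30.0]])   # outside the training range

forest = RandomForestRegressor(n_estimators=100, random_state=42).fit(X_train_demo, y_train_demo)
line = LinearRegression().fit(X_train_demo, y_train_demo)

forest_pred = forest.predict(X_new)
line_pred = line.predict(X_new)

print("forest:", forest_pred)   # cannot exceed the largest training target
print("linear:", line_pred)     # follows the line beyond the training range
```

When the underlying relationship is close to linear, as it is between a day's open, high, low, and close, a linear model has a structural advantage, especially with few training rows.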

6.3 Graph (Random Forest Regression)#

plt.figure(figsize=(10, 6))
plt.scatter(y_test_regression, y_pred_rf, alpha=0.6, color='green', label='Predicted vs Actual')
plt.plot([y_test_regression.min(), y_test_regression.max()], [y_test_regression.min(), y_test_regression.max()], 'r--', linewidth=2)
plt.title('Random Forest Regression: Predicted vs Actual Gold Prices')
plt.xlabel('Actual Prices')
plt.ylabel('Predicted Prices')
plt.legend()
plt.grid(True)
plt.show()
../_images/003a7abca95b38726dc3d1368174eb87d6f9beada5d1ed4cd52d42645c6e0112.png

Result#

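The points in this graph follow the ideal fit line but scatter more widely than in the Ridge Regression plot, in line with the test MSE of about 20.04 and R² of about 0.973. The Random Forest model captures the overall relationship between the features and gold prices, but on this dataset it is less accurate than Ridge Regression.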

7. Linear Regression#

Finally, I applied a Linear Regression model to predict gold prices. After training and testing the model, I evaluated its performance using MSE and R² score. Despite its simplicity, the model performed almost as well as Ridge Regression and clearly better than Random Forest on this dataset, which suggests that the relationship between the selected features and the closing price is close to linear.

7.1 Model Training and Prediction (Linear Regression)#

# Linear Regression Model
linear_reg = LinearRegression()
linear_reg.fit(X_train_scaled, y_train_regression)

# Predict on the test set
y_pred_linear = linear_reg.predict(X_test_scaled)

7.2 Evaluation of Linear Regression#

# Evaluate the model
linear_mse = mean_squared_error(y_test_regression, y_pred_linear)
linear_r2 = r2_score(y_test_regression, y_pred_linear)
print(f"Linear Regression - MSE: {linear_mse}, R²: {linear_r2}")
Linear Regression - MSE: 0.4175571796371035, R²: 0.9994378175627683
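
Ridge Regression differs from plain Linear Regression only by a penalty on the size of the coefficients, which matters most when features are strongly correlated, as open, high, and low are. The sketch below compares the two on synthetic data with three nearly identical columns; the data and names are made up and are not the project's results.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic stand-ins for open/high/low: three nearly identical columns (made-up data).
base = rng.normal(150.0, 30.0, size=40)
X_demo = np.column_stack([base + rng.normal(0, 0.5, 40) for _ in range(3)])
y_demo = base + rng.normal(0, 0.5, 40)

Z = StandardScaler().fit_transform(X_demo)
ols = LinearRegression().fit(Z, y_demo)
ridge_demo = Ridge(alpha=1.0).fit(Z, y_demo)

ols_norm = np.linalg.norm(ols.coef_)
ridge_norm = np.linalg.norm(ridge_demo.coef_)

print("OLS coefficients:  ", ols.coef_.round(2))
print("Ridge coefficients:", ridge_demo.coef_.round(2))
print("Coefficient norms: ", ols_norm.round(2), ridge_norm.round(2))
```

The Ridge coefficient vector is never longer than the ordinary least squares one, which tends to stabilize the fit when the training set is small and the features overlap.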

7.3 Graph#

# Visualization
plt.figure(figsize=(10, 6))
plt.scatter(y_test_regression, y_pred_linear, alpha=0.6, color='orange', label='Predicted vs Actual')
plt.plot([y_test_regression.min(), y_test_regression.max()], [y_test_regression.min(), y_test_regression.max()], 'r--', linewidth=2)
plt.title('Linear Regression: Predicted vs Actual Gold Prices')
plt.xlabel('Actual Prices')
plt.ylabel('Predicted Prices')
plt.legend()
plt.grid(True)
plt.show()
../_images/ae603f467c1954324b9a59705a91a03ebd0f3692dd4a78a824fb4ffa67fb855e.png

Result#

This graph shows that the Linear Regression predictions lie very close to the ideal fit line, consistent with the test MSE of about 0.42 and R² of about 0.9994. The model provides a simple and interpretable baseline that is only slightly less accurate than Ridge Regression (MSE of about 0.28) and far more accurate than Random Forest Regression (MSE of about 20.04).

8. Summary#

Linear Regression and Ridge Regression show nearly perfect alignment between the predicted and actual values along the ideal fit line, with test MSE of about 0.42 and 0.28 and R² above 0.999 in both cases. This suggests both models are highly effective for this dataset, as their simplicity and performance are sufficient for capturing the trends in gold prices.

Random Forest Regression, while still accurate (R² of about 0.973), shows noticeably more deviation between predicted and actual prices, with a test MSE of about 20.04. Its strength lies in handling complex and non-linear data, but for this particular dataset it is less suitable than Ridge Regression or Linear Regression.

The K-NN and Logistic Regression classifiers reached accuracies of 0.44 and 0.33 on a test set of only 9 samples, so neither predicted the direction of price changes better than chance.
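
To compare the regression models side by side, the test metrics printed in sections 3, 6, and 7 can be collected into one table. The sketch below copies those reported values by hand; the RMSE column is an added convenience that expresses the error in the same units as the price.

```python
import pandas as pd

# Test-set metrics copied from the printed outputs in sections 3, 6 and 7.
results = pd.DataFrame(
    {
        "MSE": [0.28427203807539414, 20.038521527221764, 0.4175571796371035],
        "R2": [0.9996172673947531, 0.9730209288210895, 0.9994378175627683],
    },
    index=["Ridge", "Random Forest", "Linear"],
)
results["RMSE"] = results["MSE"] ** 0.5   # error in the same units as the price
ranked = results.sort_values("MSE")
print(ranked)
```

Sorted by MSE, the order is Ridge, Linear, Random Forest, with a typical error of roughly 0.5 price units for the two linear models and roughly 4.5 for the forest.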

9. Reference#

The dataset was downloaded from https://www.kaggle.com/datasets/franciscogcc/financial-data?resource=download