Predicting NBA Game Outcomes Project#

Author: Si Bao#

Course Project, UC Irvine, Math 10, S24#

I would like to post my notebook on the course’s website. Yes#

Table of Contents#

  1. Introduction

  2. Data Preparation

  3. Helper Function

  4. Linear Models

    1. Linear Regression

    2. Ridge Regression

    3. Lasso Regression

  5. Logistic Regression

  6. K-Nearest Neighbor Regression

  7. Decision Trees

  8. Support Vector Machines

  9. Conclusion

Introduction#

The National Basketball Association (NBA) is a professional basketball league in North America composed of 30 teams (29 in the United States and 1 in Canada). It is one of the major professional sports leagues in the United States and Canada and is considered the premier professional basketball league in the world. In my project, I will utilize the game data from 2014 to 2021 to predict the points and winning rate of each team

Data Preparation#

The section imports all necessary tools, generate all the necessary data, and drop data with missing values

import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge, Lasso, LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.model_selection import KFold, train_test_split
from sklearn.metrics import r2_score, accuracy_score, roc_curve, auc
import matplotlib.pyplot as plt
from copy import deepcopy
games = pd.read_csv('games.csv')
teams = pd.read_csv('teams.csv')
Team = teams['NICKNAME'].tolist()
Team_abb = teams['ABBREVIATION'].tolist()
ID = teams['TEAM_ID'].tolist()

The following two cells describe the teams and the corresponding team ID

Team
['Hawks',
 'Celtics',
 'Pelicans',
 'Bulls',
 'Mavericks',
 'Nuggets',
 'Rockets',
 'Clippers',
 'Lakers',
 'Heat',
 'Bucks',
 'Timberwolves',
 'Nets',
 'Knicks',
 'Magic',
 'Pacers',
 '76ers',
 'Suns',
 'Trail Blazers',
 'Kings',
 'Spurs',
 'Thunder',
 'Raptors',
 'Jazz',
 'Grizzlies',
 'Wizards',
 'Pistons',
 'Hornets',
 'Cavaliers',
 'Warriors']
ID
[1610612737,
 1610612738,
 1610612740,
 1610612741,
 1610612742,
 1610612743,
 1610612745,
 1610612746,
 1610612747,
 1610612748,
 1610612749,
 1610612750,
 1610612751,
 1610612752,
 1610612753,
 1610612754,
 1610612755,
 1610612756,
 1610612757,
 1610612758,
 1610612759,
 1610612760,
 1610612761,
 1610612762,
 1610612763,
 1610612764,
 1610612765,
 1610612766,
 1610612739,
 1610612744]
# This section drops the rows with missing values
games_clean = games.dropna()
# This section generates the data of each team in each season

Team_data = [] # This list will store the performance of each team from 2014 to 2021
for i in range(0, len(Team)):
    temp_team = games_clean[(games_clean['TEAM_ID_home'] == ID[i]) | (games_clean['TEAM_ID_away'] == ID[i])]
    temp_team_away = temp_team.loc[temp_team.TEAM_ID_home == ID[i],['SEASON', 'PTS_home', 'FG_PCT_home', 'FT_PCT_home', 'FG3_PCT_home', 'AST_home', 'REB_home']]
    temp_team_away.rename(columns = {'PTS_home': 'PTS', 'FG_PCT_home': 'FG_PCT', 'FT_PCT_home': 'FT_PCT', 'FG3_PCT_home': 'FG3_PCT', 'AST_home': 'AST', 'REB_home': 'REB'}, inplace = True)
    temp_team_home = temp_team.loc[temp_team.TEAM_ID_away == ID[i], ['SEASON', 'PTS_away', 'FG_PCT_away', 'FT_PCT_away', 'FG3_PCT_away', 'AST_away', 'REB_away']]
    temp_team_home.rename(columns = {'PTS_away': 'PTS', 'FG_PCT_away': 'FG_PCT', 'FT_PCT_away': 'FT_PCT', 'FG3_PCT_away': 'FG3_PCT', 'AST_away': 'AST', 'REB_away': 'REB'}, inplace = True)
    com_team = pd.concat([temp_team_away, temp_team_home], ignore_index = True)
    Team_data.append(com_team)

Feature Explanation:

PTS_home: Points by home team

FG_PCT_home: ratio of field goals to field goals attempts by home team

FT_PCT_home: ratio of free throws to free throws attempts by home team

FG3_PCT_home: ratio of three-point field goals to three-point field goals attempts by home team

AST_home: number of assists made home team

REB_home: number of rebounds for home team

PTS_away: Points by away team

FG_PCT_away: ratio of field goals to field goals attempts by away team

FT_PCT_away: ratio of free throws to free throws attempts by away team

FG3_PCT_away: ratio of three-point field goals to three-point field goals attempts by away team

AST_away: number of assists made away team

REB_away: number of rebounds for away team

# This section creates a new column called "PTS_DIFF_home_away",
# which is determined by home team points minus away team points
new_games_clean = games_clean.copy()
new_games_clean['PTS_DIFF_home_away'] = new_games_clean.apply(lambda row: row['PTS_home'] - row['PTS_away'], axis = 1)
new_games_clean.head()
GAME_DATE_EST GAME_ID GAME_STATUS_TEXT HOME_TEAM_ID VISITOR_TEAM_ID SEASON TEAM_ID_home PTS_home FG_PCT_home FT_PCT_home ... REB_home TEAM_ID_away PTS_away FG_PCT_away FT_PCT_away FG3_PCT_away AST_away REB_away HOME_TEAM_WINS PTS_DIFF_home_away
0 2022-03-12 22101005 Final 1610612748 1610612750 2021 1610612748 104.0 0.398 0.760 ... 53.0 1610612750 113.0 0.422 0.875 0.357 21.0 46.0 0 -9.0
1 2022-03-12 22101006 Final 1610612741 1610612739 2021 1610612741 101.0 0.443 0.933 ... 46.0 1610612739 91.0 0.419 0.824 0.208 19.0 40.0 1 10.0
2 2022-03-12 22101007 Final 1610612759 1610612754 2021 1610612759 108.0 0.412 0.813 ... 52.0 1610612754 119.0 0.489 1.000 0.389 23.0 47.0 0 -11.0
3 2022-03-12 22101008 Final 1610612744 1610612749 2021 1610612744 122.0 0.484 0.933 ... 55.0 1610612749 109.0 0.413 0.696 0.386 27.0 39.0 1 13.0
4 2022-03-12 22101009 Final 1610612743 1610612761 2021 1610612743 115.0 0.551 0.750 ... 39.0 1610612761 127.0 0.471 0.760 0.387 28.0 50.0 0 -12.0

5 rows × 22 columns

Helper Function#

This section defines several functions that will later be used for approximation and plotting in order to avoid repeated and long contents

The ‘prediction_with_Linear_Models’ function is used for the ‘Linear Models’ section. It takes the input of home team name, away team name, and one of the three linear model names: Linear Regression, Ridge Regression, and Lasso Regression

def prediction_with_Linear_Models(home, away, model_name):
    home_index = Team.index(home)
    away_index = Team.index(away)
    home_id = ID[home_index]
    away_id = ID[away_index]
    
    # set the models for home and away team according to the model_name input
    if model_name == 'Linear Regression':
        model_home = LinearRegression()
        model_away = LinearRegression()
    elif model_name == 'Ridge Regression':
        model_home = Ridge()
        model_away = Ridge()
    elif model_name == 'Lasso Regression':
        model_home = Lasso()
        model_away = Lasso()
    else:
        return 'This model is not included'
    
    # set the features(X) and target(y) for approximation
    home_features = ['FG_PCT_home', 'FT_PCT_home', 'FG3_PCT_home', 'AST_home', 'REB_home']
    home_target = ['PTS_home']
    away_features = ['FG_PCT_away', 'FT_PCT_away', 'FG3_PCT_away', 'AST_away', 'REB_away']
    away_target = ['PTS_away']
    
    # find the correspoding dataset containing both home team and away team from the cleaned big dataset
    temp_team = games_clean[(games_clean['TEAM_ID_home'] == home_id) & (games_clean['TEAM_ID_away'] == away_id)]
    
    X_home = temp_team[home_features]
    y_home = temp_team[home_target]
    
    X_away = temp_team[away_features]
    y_away = temp_team[away_target]
    
    # fit the model
    model_home.fit(X_home, y_home)
    model_away.fit(X_away, y_away)
    
    # predict the points that home and away team will get using the fitted model
    pred_home = model_home.predict(X_home)
    pred_away = model_away.predict(X_away)
    
    # duplicate the dataset so that the original one will not be modified
    temp_team_copy = temp_team.copy()
    
    # create a new column of predicted points for both teams
    temp_team_copy['PTS_home_pred'] = pred_home
    temp_team_copy['PTS_home_away'] = pred_away
    
    # create a new column of predicted home win using the predicted points column above
    temp_team_copy['HOME_TEAM_WINS_pred'] = temp_team_copy.apply(lambda row: row['PTS_home_pred'] > row['PTS_home_away'], axis=1)
    temp_team_copy['HOME_TEAM_WINS_pred'] = temp_team_copy['HOME_TEAM_WINS_pred'].astype(int)
    
    # compare the actual and predicts home win and create a new column to record the True/False results
    temp_team_copy['CORRECT_pred'] = temp_team_copy.apply(lambda row: row['HOME_TEAM_WINS'] == row['HOME_TEAM_WINS_pred'], axis=1)
    temp_team_copy['CORRECT_pred'] = temp_team_copy['CORRECT_pred'].astype(int)
    
    # find the accuracy
    acc = accuracy_score(temp_team_copy['HOME_TEAM_WINS'], temp_team_copy['HOME_TEAM_WINS_pred'])
    
    # return a dataset for reporting and graphing
    return {
        'mean_HOME_TEAM_WINS_pred': np.mean(temp_team_copy['HOME_TEAM_WINS_pred']),
        'accuracy': acc,
        'actual_home_points': y_home,
        'predicted_home_points': pred_home,
        'actual_away_points': y_away,
        'predicted_away_points': pred_away,
        'home_coefficients': model_home.coef_,
        'home_intercept': model_away.intercept_,
        'away_coefficients': model_home.coef_,
        'away_intercept': model_away.intercept_
    }

The ‘prediction_with_Other_Models’ function is used for the ‘Logistic Regression’, ‘Decision Trees’, and ‘Support Vector Machines’ section. It takes the input of home team name, away team name, and one of the three model names: Logistic Regression, Decision Trees, and Support Vector Machines

def prediction_with_Other_Models(home, away, model_name):
    accuracy = []
    home_index = Team.index(home)
    away_index = Team.index(away)
    home_id = ID[home_index]
    away_id = ID[away_index]
    
    # find the correspoding dataset containing both home team and away team from the cleaned big dataset
    temp = games_clean[(games_clean['TEAM_ID_home'] == home_id) & (games_clean['TEAM_ID_away'] == away_id)]
    temp_dummy = pd.get_dummies(temp, columns = ['TEAM_ID_home', 'TEAM_ID_away'])
    
    features = ['FG_PCT_home', 'FT_PCT_home', 'FG3_PCT_home', 'AST_home', 'REB_home', 'FG_PCT_away', 'FT_PCT_away', 'FG3_PCT_away', 'AST_away', 'REB_away']
    target = ['HOME_TEAM_WINS']
    
    # set the models for home and away team according to the model_name input
    if model_name == 'Logistic Regression':
        model = LogisticRegression(max_iter = 20000, solver = 'lbfgs')
        X = temp[features].values
        y = temp[target].values.ravel()
    elif model_name == 'Decision Trees':
        model = DecisionTreeClassifier(random_state = 42)
        X = temp[features].values
        y = temp[target].values.ravel()
    elif model_name == 'Support Vector Machines':
        model = SVC(random_state = 42)
        X = temp_dummy.drop(columns = ['GAME_DATE_EST', 'GAME_ID', 'GAME_STATUS_TEXT', 'HOME_TEAM_ID', 'SEASON', 'PTS_home', 'VISITOR_TEAM_ID', 'PTS_away', 'HOME_TEAM_WINS'])
        y = temp_dummy['HOME_TEAM_WINS']
    else:
        return 'This model is not included'
    
    # split data for training and tesing
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 42)
    
    # fit the data
    model.fit(X_train, y_train)    
    y_pred_test = model.predict(X_test)
    y_pred = model.predict(X)
    
    accuracy = accuracy_score(y_test, y_pred_test)
    
    # return dataset for reporting and graphing
    return {
        'accuracy': accuracy,
        'mean_prediction': np.mean(y_pred),
        'actual_values': y,
        'predicted_values': y_pred,
    }

The ‘plot_score’ function is used for plotting the relationship betweenthe predicted and actual points for home and away team

def plot_score(home, away, model_name):
    # get the required data for plotting
    results = prediction_with_Linear_Models(home, away, model_name)
    
    # store the predicted and actual points for home and away team
    actual_home_points = results['actual_home_points'].to_numpy().flatten()
    predicted_home_points = results['predicted_home_points'].flatten()
    actual_away_points = results['actual_away_points'].to_numpy().flatten()
    predicted_away_points = results['predicted_away_points'].flatten()

    # plot the graph for home team
    plt.subplot(1, 2, 1)
    plt.scatter(actual_home_points, predicted_home_points, color = 'blue', label = 'Home Team')
    plt.plot([actual_home_points.min(), actual_home_points.max()],
             [actual_home_points.min(), actual_home_points.max()],
             color = 'black', linestyle = '--', label = 'Perfect Prediction')
    plt.xlabel('Actual Home Points')
    plt.ylabel('Predicted Home Points')
    plt.title('Home Team')
    plt.legend()
    
    # plot the graph for away team
    plt.subplot(1, 2, 2)
    plt.scatter(actual_away_points, predicted_away_points, color = 'red', label = 'Away Team')
    plt.plot([actual_away_points.min(), actual_away_points.max()],
             [actual_away_points.min(), actual_away_points.max()],
             color = 'black', linestyle = '--', label = 'Perfect Prediction')
    plt.xlabel('Actual Away Points')
    plt.ylabel('Predicted Away Points')
    plt.title('Away Team')
    plt.legend()
    
    plt.suptitle('Actual vs Predicted Points for Home and Away Teams')
    plt.tight_layout(rect = [0, 0.03, 1, 0.95])
    plt.show()

The ‘plot_win’ function is used for plotting the relationship between actual and predicted home wins

def plot_win(results):
    plt.figure(figsize = (5, 3))
    plt.hexbin(results['actual_values'], results['predicted_values'], gridsize = 30, cmap = 'Blues', mincnt = 1, edgecolors = 'grey')
    plt.colorbar(label = 'Count')
    plt.xlabel('Actual Home Win (0 or 1)')
    plt.ylabel('Predicted Home Win (0 or 1)')
    plt.title('Actual Home Win vs. Predicted Home Win')
    plt.grid(True)
    plt.show()

Linear Models#

Linear Regression#

# This section predicts the points each team will have depending on the following features:
# FG_PCT, PT_PCT, FG3_PCT, AST, REB
# Model used: Linear Regression

reg = LinearRegression()
kf = KFold(n_splits = 5, shuffle = True, random_state = 1)
scores = []

features = ['FG_PCT', 'FT_PCT', 'FG3_PCT', 'AST','REB']
target = ['PTS']

k = 1
a = 0
for i in range(0, len(Team_data)):
    for train_index, test_index in kf.split(Team_data[i]): # Test the prediction using 5-fold cross-validation
        temp = Team_data[i]
        X_train, X_test = temp[features].iloc[train_index], temp[features].iloc[test_index]
        y_train, y_test = temp[target].iloc[train_index], temp[target].iloc[test_index]
        
        reg.fit(X_train, y_train)
        y_pred = reg.predict(X_test)
        score = r2_score(y_test, y_pred)
        scores.append(score)
        
print(np.mean(score))
0.7116344958573725

This average number show that linear regression has an overall high accuracy in predicting total points

# This section gives an exmaple of predicting the winning possibility of
# the home team Dallas Mavericks against the away team Boston Celtics using Linear Regression

results = prediction_with_Linear_Models('Mavericks', 'Celtics', 'Linear Regression')
ave = results['mean_HOME_TEAM_WINS_pred']
acc = results['accuracy']

print(f'The predicted possibility that home team Mavericks wins the game against away team Celtics is {ave}')
print(f'The accuracy is {acc}')
The predicted possibility that home team Mavericks wins the game against away team Celtics is 0.7894736842105263
The accuracy is 0.8947368421052632

This example shows that linear regression is quite accurate in predicting the game result between Dallas Mavericks and Boston Celtics

# This section visualizes the approximation by Linear Regression

home = 'Mavericks'
away = 'Celtics'

plot_score(home, away, 'Linear Regression')
../_images/88f09627d616003ac10b3938cab3bfff602254304c81e412f7e54875d1f6b1df.png

The two graphs show that for both teams, the predicted and actual points are pretty close for over 70 percent of the samples

Ridge Regression#

# This section predicts the points each team will have depending on the following features:
# FG_PCT, PT_PCT, FG3_PCT, AST, REB
# Model used: Ridge Regression

ridge = Ridge(alpha = 0.1)
kf = KFold(n_splits = 5, shuffle = True, random_state = 1)
scores = []

features = ['FG_PCT', 'FT_PCT', 'FG3_PCT', 'AST','REB']
target = ['PTS']

k = 1
a = 0
for i in range(0, len(Team_data)):
    for train_index, test_index in kf.split(Team_data[i]): # Test the prediction using 5-fold cross-validation
        temp = Team_data[i]
        X_train, X_test = temp[features].iloc[train_index], temp[features].iloc[test_index]
        y_train, y_test = temp[target].iloc[train_index], temp[target].iloc[test_index]
        
        ridge.fit(X_train, y_train)
        y_pred = ridge.predict(X_test)
        score = r2_score(y_test, y_pred)
        scores.append(score)
        
print(np.mean(score))
0.7099268435047021

This average number show that ridge regression has an overall high accuracy in predicting total points

# This section gives an exmaple of predicting the winning possibility of
# the home team Dallas Mavericks against the away team Boston Celtics using Ridge Regression

results = prediction_with_Linear_Models('Mavericks', 'Celtics', 'Ridge Regression')
ave = results['mean_HOME_TEAM_WINS_pred']
acc = results['accuracy']

print(f'The predicted possibility that home team Mavericks wins the game against away team Celtics is {ave}')
print(f'The accuracy is {acc}')
The predicted possibility that home team Mavericks wins the game against away team Celtics is 0.631578947368421
The accuracy is 0.631578947368421

This example shows that ridge regression is relatively less accurate in predicting the game result between Dallas Mavericks and Boston Celtics

# This section visualizes the approximation by Ridge Regression

home = 'Mavericks'
away = 'Celtics'

plot_score(home, away, 'Ridge Regression')
../_images/10edd36ba58a7a62a6a61a055cfa48b89efee2c3a4c52bfb5127fb312e7c8670.png

The two graphs show that for both teams, the predicted and actual points by ridge regression are more scattered than the graph of linear regression

Lasso Regression#

# This section predicts the points each team will have depending on the following features:
# FG_PCT, PT_PCT, FG3_PCT, AST, REB
# Model used: Lasso Regression

lasso = Lasso(alpha = 0.1)
kf = KFold(n_splits = 5, shuffle = True, random_state = 1)
scores = []

features = ['FG_PCT', 'FT_PCT', 'FG3_PCT', 'AST','REB']
target = ['PTS']

k = 1
a = 0
for i in range(0, len(Team_data)):
    for train_index, test_index in kf.split(Team_data[i]): # Test the prediction using 5-fold cross-validation
        temp = Team_data[i]
        X_train, X_test = temp[features].iloc[train_index], temp[features].iloc[test_index]
        y_train, y_test = temp[target].iloc[train_index], temp[target].iloc[test_index]
        
        lasso.fit(X_train, y_train)
        y_pred = lasso.predict(X_test)
        score = r2_score(y_test, y_pred)
        scores.append(score)
        
print(np.mean(score))
0.6612318106475936

This number shows that lasso regression has an above-average accuracy

# This section gives an exmaple of predicting the winning possibility of
# the home team Dallas Mavericks against the away team Boston Celtics using Lasso Regression

results = prediction_with_Linear_Models('Mavericks', 'Celtics', 'Lasso Regression')
ave = results['mean_HOME_TEAM_WINS_pred']
acc = results['accuracy']

print(f'The predicted possibility that home team Mavericks wins the game against away team Celtics is {ave}')
print(f'The accuracy is {acc}')
The predicted possibility that home team Mavericks wins the game against away team Celtics is 0.631578947368421
The accuracy is 0.631578947368421

This example shows that the accuracy is basically the same as the overall accuracy

# This section visualizes the approximation by Lasso Regression

home = 'Mavericks'
away = 'Celtics'

plot_score(home, away, 'Lasso Regression')
../_images/5ce2592885c72b28ae3ff7858390c3d9bd824365b1cf3e7aef996c31d29cf255.png

These two graph are similar to those of ridge regression, with minor differences.

Summarizing the three regression tools, we can conclude that linear regression has the best approximation accuracy in predicting the game and points results

Logistic Regression#

# This section generates the overall accuracy by feeding the model with the data of all 30 teams

overall_accuracy = []
for i in range(0, len(Team)):
    temp_team = deepcopy(Team)
    temp_team.remove(Team[i])
    for j in range(0, len(temp_team)):
        results = prediction_with_Other_Models(Team[i], temp_team[j], 'Logistic Regression')
        overall_accuracy.append(results['accuracy'])
        
av_acc = np.mean(overall_accuracy)
print(f'The overall accuracy of predicting the winning rate of each team is {av_acc}')
The overall accuracy of predicting the winning rate of each team is 0.6978113802657413
# This section gives an example of predicting the result between two specific games.
# As an example, we want to predict the result between
# Boston celtics and Dallas Mavericks using Logistic Regression

results = prediction_with_Other_Models('Mavericks', 'Celtics', 'Logistic Regression')
possi = results['mean_prediction']
acc = results['accuracy']

print(f'The predicted winning rate of home team Celtics against away team Mavericks is {possi}')
print(f'The overall accuracy is {acc}')
The predicted winning rate of home team Celtics against away team Mavericks is 0.5789473684210527
The overall accuracy is 0.0

It turns out that logistic regression is not useful in this example, since the accuracu is 0

# This section visualizes the approximation by Logistic Regression

home = 'Mavericks'
away = 'Celtics'
results = prediction_with_Other_Models(home, away, 'Logistic Regression')

plot_win(results)
../_images/f7f69181444e7afadc88efc51a57ea607bcb1866d4856eec26bef100fbfdaf0f.png

K-Nearest Neighbor Regression#

# This function is individually set in this section because we need to find the best k first
# I have to change the input 'n' in a for loop,
# so I can't combine it with the 'prediction_with_Other_Models' function

def prediction_with_KNN_home(home, away, n):
    accuracy = []
    home_index = Team.index(home)
    away_index = Team.index(away)
    home_id = ID[home_index]
    away_id = ID[away_index]
    
    temp = games_clean[(games_clean['TEAM_ID_home'] == home_id) & (games_clean['TEAM_ID_away'] == away_id)]
    temp_dummy = pd.get_dummies(temp, columns = ['TEAM_ID_home', 'TEAM_ID_away'])
    
    X = temp_dummy.drop(columns = ['GAME_DATE_EST', 'GAME_ID', 'GAME_STATUS_TEXT', 'HOME_TEAM_ID', 'SEASON', 'PTS_home', 'VISITOR_TEAM_ID', 'PTS_away', 'HOME_TEAM_WINS'])
    y = temp_dummy['HOME_TEAM_WINS']
    
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

    knn = KNeighborsClassifier(n_neighbors = n)
    knn.fit(X_train, y_train)
    
    y_pred_test = knn.predict(X_test)
    y_pred = knn.predict(X)
    acc = accuracy_score(y_test, y_pred_test)
    
    return acc, np.mean(y_pred)
# This section tests the accuracy of KNN model using k-value from 1 to 12 

overall_accuracy = []
ave_acc = []
for k in range(1, 13):
    for i in range(0, len(Team)):
        temp_team = deepcopy(Team)
        temp_team.remove(Team[i])
        
        for j in range(0, len(temp_team)):
            acc, ave_pred = prediction_with_KNN_home(Team[i], temp_team[j], k)
            overall_accuracy.append(acc)
            
    ave_acc.append(np.mean(overall_accuracy))
    
ave_acc
[0.6341139639415501,
 0.6213082591099833,
 0.6350885980771038,
 0.635928071768589,
 0.640380678708265,
 0.6408117159553941,
 0.6428693515023564,
 0.6433556469871125,
 0.6436807594278858,
 0.6434954470816538,
 0.6431105841215559,
 0.642111453893063]
# This part calculates the best value of k for approximation
max_acc = max(ave_acc)
best_k = ave_acc.index(max_acc) + 1
print(f'The best k is {best_k}')
print(f'The overall accuracy of predicting the winning rate of each team is {max_acc}')
The best k is 9
The overall accuracy of predicting the winning rate of each team is 0.6436807594278858

KNN model returns an average accuracy, and the best best k in this circumstance is 9

# This section gives an example of predicting the result between two specific games.
# As an example, we want to predict the result between
# Dallas Mavericks and Boston Celtics using KNN with the best k value we obtained from the cell above

acc, possi = prediction_with_KNN_home('Mavericks', 'Celtics', best_k)
print(f'The predicted winning rate of home team Dallas Mavericks against away team Boston Celtics is {possi}')
print(f'The overall accuracy is {acc}')
The predicted winning rate of home team Dallas Mavericks against away team Boston Celtics is 0.7894736842105263
The overall accuracy is 0.75

KNN has a pretty good accuracy in this exmaple

Decision Trees#

# This section generates the overall accuracy by feeding the model with the data of all 30 teams

overall_accuracy = []
for i in range(0, len(Team)):
    temp_team = deepcopy(Team)
    temp_team.remove(Team[i])
    for j in range(0, len(temp_team)):
        results = prediction_with_Other_Models(Team[i], temp_team[j], 'Decision Trees')
        overall_accuracy.append(results['accuracy'])
        
av_acc = np.mean(overall_accuracy)
print(f'The overall accuracy of predicting the winning rate of each team is {av_acc}')
The overall accuracy of predicting the winning rate of each team is 0.6718327363231015

The Decision Trees model returns an average accuracy

# This section gives an example of predicting the result between two specific games.
# As an example, we want to predict the result between
# Dallas Mavericks and Minnesota Timberwolves using Decision Trees

results = prediction_with_Other_Models('Mavericks', 'Celtics', 'Decision Trees')

possi = results['mean_prediction']
acc = results['accuracy']

print(f'The predicted winning rate of home team Dallas Maverick against away team Minnesota Timberwolves is {possi}')
print(f'The overall accuracy is {acc}')
The predicted winning rate of home team Dallas Maverick against away team Minnesota Timberwolves is 0.631578947368421
The overall accuracy is 0.5

The Decision Trees model does not give a vey good accuracy in this example

# This section visualizes the approximation of home wins using Decision Trees

home = 'Mavericks'
away = 'Celtics'
results = prediction_with_Other_Models(home, away, 'Decision Trees')

plot_win(results)
../_images/e953d9a52bce34fa42cbca5d787429f011b120d5195a4691c4627a710239b099.png

Support Vector Machines#

# This section generates the overall accuracy by feeding the model with the data of all 30 teams

overall_accuracy = []
for i in range(0, len(Team)):
    temp_team = deepcopy(Team)
    temp_team.remove(Team[i])
    for j in range(0, len(temp_team)):
        results = prediction_with_Other_Models(Team[i], temp_team[j], 'Support Vector Machines')
        overall_accuracy.append(results['accuracy'])
        
av_acc = np.mean(overall_accuracy)
print(f'The overall accuracy of predicting the winning rate of each team is {av_acc}')
The overall accuracy of predicting the winning rate of each team is 0.585768289257133

SVM gives a relatively lowest accuracy, which implies that SVM may not be suitable for prediction of NBA game results

# This section gives an example of predicting the result between two specific games.
# As an example, we want to predict the result between
# Dallas Mavericks and Minnesota Timberwolves using Support Vector Machine

results = prediction_with_Other_Models('Mavericks', 'Celtics', 'Support Vector Machines')

possi = results['mean_prediction']
acc = results['accuracy']

print(f'The predicted winning rate of home team Dallas Mavericks against away team Minnesota Timberwolves is {possi}')
print(f'The overall accuracy is {acc}')
The predicted winning rate of home team Dallas Mavericks against away team Minnesota Timberwolves is 1.0
The overall accuracy is 0.6666666666666666

SVM has an average performance in this example

# This section visualizes the approximation of home wins using Support Vector Machines

home = 'Mavericks'
away = 'Celtics'
results = prediction_with_Other_Models(home, away, 'Support Vector Machines')

plot_win(results)
../_images/d0e962bfdb04767ccfc097fcc6f6f521efc8fccacda9505abfb83164d570abd3.png

Conclusion#

In this project, I try to predict the NBA game outcomes using 7 Machine Learning tools. Three of the tools are used for predicting total points of home and away team, and four of them are used for predicting home team wins. Linear Regression gives the best accuracy score: 71.16%, while Support Vector Machines yields the worst: 58.58%. Nonetheless, the outcomes emphasize that Machine Learning is promising in the field of basketball game predictions.