Top Valorant Players#

Author: Ethan Tran (ethankt1@uci.edu)
Course Project
UC Irvine
Math 10 | Spring 2024
I would like to post my notebook on the course’s website. [Yes]

Introduction#

This is a dataset of the highest-ranking players in Riot Games' Valorant. The game features 5v5 objective-based matches whose mechanics include an economy system for purchasing weapons (emphasizing precise gunplay) and strategic planning. The goal of this analysis is to visually explore whether there are any relationships in the data among the top players in the world.

Import libraries#

import numpy as np
import pandas as pd 
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.linear_model import LinearRegression, LogisticRegression, ElasticNet
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, mean_squared_error, r2_score
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestRegressor

Load Data#

df = pd.read_csv("val_stats.csv")
df
/var/folders/_h/2nwpq18d7c13g7t5z53xhw6r0000gn/T/ipykernel_77938/3110898118.py:1: DtypeWarning: Columns (0,10) have mixed types. Specify dtype option on import or set low_memory=False.
  df = pd.read_csv("val_stats.csv")
region name tag rating damage_round headshots headshot_percent aces clutches flawless ... gun2_name gun2_head gun2_body gun2_legs gun2_kills gun3_name gun3_head gun3_body gun3_legs gun3_kills
0 NaN ShimmyXD #NA1 Radiant 135.8 992 24.9 0 140 80 ... Phantom 33 62 5 220 Classic 36 60 3 147
1 NaN XSET Cryo #cells Radiant 170.3 879 28.3 2 122 94 ... Operator 8 91 0 226 Phantom 32 63 5 137
2 NaN PuRelittleone #yoruW Radiant 147.5 720 24.0 3 117 59 ... Phantom 36 61 3 231 Operator 8 91 1 102
3 NaN Boba #0068 Radiant 178.2 856 37.3 3 83 49 ... Sheriff 48 51 1 48 Phantom 44 56 0 36
4 NaN i love mina #kelly Radiant 149.8 534 24.4 2 71 38 ... Spectre 21 71 8 65 Operator 8 92 0 64
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
85673 LAT Kazutora #img0d Radiant 138.2 342 21.4 0 58 54 ... Vandal 28 69 2 175 Classic 39 59 2 71
85674 LAT el lobo marino #uthur Radiant 182.9 650 30.1 4 77 42 ... Vandal 40 57 3 212 Spectre 33 63 5 139
85675 LAT p9pzet #666x Radiant 158.8 613 30.2 0 70 54 ... Phantom 40 56 4 159 Operator 10 89 1 87
85676 LAT EZ4TGD EnSBuwu #kmeve Immortal 3 155.9 132 22.2 0 23 18 ... Phantom 39 57 3 37 Spectre 27 69 4 17
85677 LAT Neon #SSJ Radiant 164.4 1,127 20.7 3 136 82 ... Operator 9 88 2 132 Spectre 27 70 2 108

85678 rows × 38 columns
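
As a side note, the DtypeWarning above is raised because some columns mix numeric values with strings (for example, numbers stored with thousands separators such as "1,127", visible in the last row). Below is a minimal sketch of how the load could be made quieter; the commented cleanup line uses a hypothetical column name col and is an assumption about the file's formatting, not part of the original notebook.

# Read the whole file in one pass so pandas infers a single dtype per column
df = pd.read_csv("val_stats.csv", low_memory=False)
# Comma-formatted numbers could then be converted, e.g. for a hypothetical column `col`:
# df[col] = pd.to_numeric(df[col].astype(str).str.replace(",", ""), errors="coerce")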

df.columns
Index(['region', 'name', 'tag', 'rating', 'damage_round', 'headshots',
       'headshot_percent', 'aces', 'clutches', 'flawless', 'first_bloods',
       'kills', 'deaths', 'assists', 'kd_ratio', 'kills_round', 'most_kills',
       'score_round', 'wins', 'win_percent', 'agent_1', 'agent_2', 'agent_3',
       'gun1_name', 'gun1_head', 'gun1_body', 'gun1_legs', 'gun1_kills',
       'gun2_name', 'gun2_head', 'gun2_body', 'gun2_legs', 'gun2_kills',
       'gun3_name', 'gun3_head', 'gun3_body', 'gun3_legs', 'gun3_kills'],
      dtype='object')

Copy Data (Numeric-only)#

Most of the techniques below (correlation, regression) only work with numeric columns, so we make a numeric-only copy of the data. Dropping the text columns also keeps the analysis simpler and the computations faster.

numeric_df = df.select_dtypes(include=['int64', 'float64'])
numeric_df.to_csv("val_stats_numeric.csv", index=False)

Visualize Data#

Correlation Heatmap#

fig, ax = plt.subplots(figsize=(13, 8))
sns.heatmap(numeric_df.corr())
plt.title('Correlation Heatmap of Top Valorant Players', fontsize=20)
plt.show()
[Figure: correlation heatmap of the numeric columns]
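
To make the heatmap easier to read, the strongest relationships can also be listed numerically. A minimal sketch using the numeric_df defined above, sorting correlations with kd_ratio:

# Correlations of every numeric column with kd_ratio, strongest positive first
print(numeric_df.corr()['kd_ratio'].sort_values(ascending=False))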

Context:#

In Valorant, a player's weapon choice significantly impacts gameplay. Among the 17 weapons in the game, the dataset records each player's three most-used weapons in the gun1, gun2, and gun3 columns (gun1 being the most-favored weapon, gun3 the third most-favored).

colors = ["red", "blue", "green"]
for i in range(1, 4):
    fig, ax = plt.subplots(figsize=(15, 5))
    # Each bar counts how many players list this weapon as their i-th most-used gun
    plt.hist(df["gun{}_name".format(i)], bins=64, edgecolor="black", facecolor=colors[i - 1])
    plt.title("Gun {} Usage Among Top Players".format(i), fontsize=20)
    plt.xlabel("Gun name", fontsize=15)
    plt.ylabel("Number of players", fontsize=15)
    plt.show()
[Figures: histograms of gun 1, gun 2, and gun 3 usage among top players]

As the histograms show, the most preferred weapon is the Vandal, the second most preferred is the Phantom, and the Spectre comes third.
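
The same ranking can be confirmed directly from the data rather than read off the bars; a quick check on the most-favored weapon column:

# Count how many top players list each weapon as their primary (gun 1) choice
print(df['gun1_name'].value_counts().head(5))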

Linear Regression#

X = df[['damage_round']]
y = df['kd_ratio']

model = LinearRegression()
model.fit(X, y)
lm = model.predict(X)

print(f"Coefficient: {model.coef_[0]}")
print(f"Intercept: {model.intercept_}")
print(f"R-squared: {model.score(X, y)}")

plt.figure(figsize=(10, 7))
sns.scatterplot(x='damage_round', y='kd_ratio', data=numeric_df, alpha=0.5, label='Actual data')
plt.plot(df['damage_round'], lm, color='red', label='Regression line')
plt.xlabel('Damage per Round')
plt.ylabel('K/D Ratio')
plt.title('Linear Regression: Damage per Round vs. K/D Ratio')
plt.legend()
plt.show()
Coefficient: 0.0077748624852823425
Intercept: -0.05643181253842333
R-squared: 0.6629070657865264
[Figure: scatter of damage per round vs. K/D ratio with the fitted regression line]

As the linear regression model shows, these players' ability to consistently deal high damage while maintaining a high kill/death ratio reflects their overall impact in competitive play. Since kills and damage are strongly correlated, a positive trend was to be expected; what is more interesting, once deaths are taken into account, is whether there are any outliers among the top players.
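
One way to look for those outliers is to inspect the residuals of the fitted line. A minimal sketch; the three-standard-deviation cutoff is an arbitrary choice for illustration:

# Residuals: how far each player's K/D ratio sits from the fitted line
residuals = y - lm
threshold = 3 * residuals.std()  # arbitrary cutoff for flagging outliers
outliers = df[np.abs(residuals) > threshold]
print("Players more than 3 standard deviations from the line:", len(outliers))
print(outliers[['name', 'damage_round', 'kd_ratio']].head())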

Logistic Regression#

We will use logistic regression to predict whether a player is a "duelist" based on their average number of kills per round. Duelists are agents designed to engage in direct combat and secure kills; stereotypically, they are expected to have the highest number of kills.

duelist_agents = ['Jett', 'Phoenix', 'Raze', 'Reyna', 'Yoru', 'Neon']
df['isDuelist'] = df['agent_1'].isin(duelist_agents).astype(int)

Train Test Split & Accuracy#

X = df[['kills_round']]
y = df['isDuelist']      

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=1)
model = LogisticRegression()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

accuracy = accuracy_score(y_test, y_pred)
print("Accuracy of logistic regression model:", accuracy)
Accuracy of logistic regression model: 0.6604729335418661

An accuracy of about 66% suggests that kills per round alone is only a modest predictor of whether a player mains a duelist; to judge how modest, it helps to compare against a majority-class baseline, as checked below.
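
To judge how much the model actually learns from KPR, its accuracy can be compared with the accuracy of always predicting the more common class. A quick baseline check:

# Majority-class baseline: accuracy from always predicting the most frequent label
baseline = y_test.value_counts(normalize=True).max()
print("Majority-class baseline accuracy:", baseline)
print("Improvement over baseline:", accuracy - baseline)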

Sigmoid Function#

By mapping the output to a probability between 0 and 1, the sigmoid function makes the results of logistic regression interpretable. The graph below shows the predicted probability of being a duelist as a function of KPR.

X_range = np.linspace(X['kills_round'].min(), X['kills_round'].max(), 300)
# Linear term z = coef * KPR + intercept, then mapped through the sigmoid 1 / (1 + e^(-z))
z = model.coef_ * X_range.reshape(-1, 1) + model.intercept_
y_prob = 1 / (1 + np.exp(-z))
plt.figure(figsize=(10, 6))
sns.scatterplot(x='kills_round', y='isDuelist', data=df, alpha=0.5, label='Actual data')
plt.plot(X_range, y_prob, color='red', label='Logistic regression curve (Sigmoid)')
plt.xlabel('Kills per Round (KPR)')
plt.ylabel('Probability of Being a Duelist')
plt.title('Logistic Regression: Probability of Being a Duelist vs. Kills per Round')
plt.legend()
plt.show()
[Figure: logistic regression (sigmoid) curve of duelist probability vs. kills per round]

This logistic regression shows that a high number of kills per round (KPR) does not necessarily mean a player is a duelist. While duelists are often associated with high kill counts, other roles also contribute heavily to kills and to team success. At this level of play, it appears very difficult to predict whether someone is playing a duelist from KPR alone. One caveat is that KPR is recorded only to the nearest tenth, so we cannot pinpoint the decision boundary precisely; this coarseness, however, does not change the overall conclusion.
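
For a single-feature logistic regression, the KPR at which the predicted probability crosses 0.5 (the decision boundary) follows directly from the fitted coefficients. A small sketch; if the classes are imbalanced, this value may fall outside the observed KPR range:

# The sigmoid equals 0.5 exactly where z = coef * KPR + intercept = 0
boundary_kpr = -model.intercept_[0] / model.coef_[0][0]
print("Predicted duelist probability exceeds 0.5 above roughly", round(boundary_kpr, 2), "kills per round")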

Box Plot#

Since KPR versus the probability of being a duelist was not especially revealing, how about viewing kills per round through the lens of all agents?

plt.figure(figsize=(14, 8))
sns.boxplot(x='kills_round', y='agent_1', data=df)
plt.xlabel('Kills per Round')
plt.ylabel('Agent')
plt.title('Box Plot of Kills per Round by Agent')
plt.show()
[Figure: box plot of kills per round by agent]

Here we can see that the median KPR for every agent sits around 0.75; what stands out is the spread and number of outliers for each agent. The box plots support our earlier finding that the agent a player mains does not have a large effect on kills.
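
The claim about the medians can be checked numerically by grouping on the agent column:

# Median kills per round for each most-played agent, highest first
print(df.groupby('agent_1')['kills_round'].median().sort_values(ascending=False))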

ElasticNet#

ElasticNet is a linear model that combines the properties of both Lasso (L1) and Ridge (L2) regression.

X = df[['kills_round', 'damage_round', 'headshot_percent', 'score_round']]
y = df['win_percent']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=1)

Some of these features may add little on their own. However, since KPR and score per round (SPR) are highly correlated, as are damage per round and headshot percentage, ElasticNet is a reasonable choice here because its combined penalty can handle multicollinearity.
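
The multicollinearity claim can be verified by inspecting the correlation matrix of the chosen predictors before fitting:

# Pairwise correlations among the predictors; values near 1 indicate multicollinearity
print(X_train.corr())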

Feature Scaling#

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

Because ElasticNet combines the Lasso and Ridge penalties, it is important to account for the regularization that comes with them. Scaling lets all features contribute on an equal footing, so that no single feature dominates the penalty simply because of its units.
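
A quick sanity check that scaling behaved as expected; each scaled training column should have a mean of roughly 0 and a standard deviation of roughly 1:

# Column means and standard deviations of the scaled training features
print(X_train_scaled.mean(axis=0).round(3))
print(X_train_scaled.std(axis=0).round(3))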

elastic_net = ElasticNet(alpha=0.1, l1_ratio=0.5)
elastic_net.fit(X_train_scaled, y_train)
y_pred = elastic_net.predict(X_test_scaled)
r2 = r2_score(y_test, y_pred)
mse = mean_squared_error(y_test, y_pred)
print("ElasticNet Coefficients:", elastic_net.coef_)
print("ElasticNet Intercept:", elastic_net.intercept_)
print("ElasticNet R-squared:", r2)
print("ElasticNet Mean Squared Error:", mse)
ElasticNet Coefficients: [ 2.18967798  0.33484552 -0.53189074  0.        ]
ElasticNet Intercept: 53.83205957188543
ElasticNet R-squared: 0.04880163146877592
ElasticNet Mean Squared Error: 125.29573107476482

The high MSE (about 125, i.e., an RMSE of roughly 11 percentage points) and the low R-squared show that the model explains very little of the variation in win percentage. This is not surprising: many factors beyond individual statistics determine whether a team wins. In a 5v5 matchmaking system among top players, even one teammate having an off day can swing the result.
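
Putting the error on the same scale as the target makes this easier to interpret; the RMSE can be compared with the spread of win percentages in the test set:

# RMSE is in the same units as win_percent; compare it with the target's own spread
rmse = np.sqrt(mse)
print("RMSE (percentage points):", round(rmse, 2))
print("Standard deviation of win_percent in the test set:", round(y_test.std(), 2))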

plt.figure(figsize=(10, 6))
plt.scatter(y_test, y_pred, alpha=0.5)
# Reference line y = x: points on this line would be predicted perfectly
plt.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()],
         color='red', label='Perfect prediction (y = x)')
plt.xlabel('Actual Win Percentage')
plt.ylabel('Predicted Win Percentage')
plt.title('ElasticNet Regression: Actual vs Predicted Win Percentage')
plt.legend()
plt.show()
[Figure: scatter of actual vs. predicted win percentage for ElasticNet]

Random Forest Regressor#

Unlike ElasticNet, which produces linear coefficients, a random forest fits a number of decision tree regressors on random subsamples of the data and averages their predictions to improve accuracy.

rf = RandomForestRegressor(n_estimators=10, random_state=1)
rf.fit(X_train_scaled, y_train)
y_pred_rf = rf.predict(X_test_scaled)
r2_rf = r2_score(y_test, y_pred_rf)
mse_rf = mean_squared_error(y_test, y_pred_rf)
print("Random Forest R-squared:", r2_rf)
print("Random Forest Mean Squared Error:", mse_rf)
plt.figure(figsize=(10, 6))
plt.scatter(y_test, y_pred_rf, color='green', alpha=0.5)
plt.xlabel('Actual Win Percentage')
plt.ylabel('Predicted Win Percentage')
plt.title('Random Forest Regression: Actual vs Predicted Win Percentage')
plt.show()
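
Random forests do not produce coefficients, but feature importances play a comparable role; a short sketch of how they could be inspected for the four predictors used here:

# Relative importance of each predictor in the fitted random forest
importances = pd.Series(rf.feature_importances_, index=X.columns).sort_values(ascending=False)
print(importances)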

Overall, with so few features, these regression techniques are not enough to predict win percentage accurately. The problem here is underfitting rather than overfitting, so regularization, which discourages model complexity, does not help.

Summary#

Overall, the analysis provides useful insight into the Top Valorant Players dataset, helping us understand player behavior, performance factors, and how well common stereotypes hold up. These findings can inform strategies for improving player mindset and optimizing gameplay tactics in Valorant. It should be kept in mind that win percentage in Valorant is shaped not only by individual gameplay statistics, but also by effective communication, teamwork, and other factors.

References#