Classifying Songs by Genre using its Features#

Author: Adam Miller

Course Project, UC Irvine, Math 10, S24

I would like to post my notebook on the course’s website. Yes

Introduction:#

This data is from over 30,000 songs from the Spotify API. The songs come from six main genres (EDM, Latin, Pop, R&B, Rap, and Rock), but there are also some subgenres categorized by Spotify included in the data. Both the data and the data dictionary below were taken from Kaggle. As we can see from the data dictionary, the data includes many features of each song, including the popularity, genre, danceability, energy, speechiness, acousticness, tempo, and duration; I plan to explore any possible correlation between these features (e.g. are songs with higher tempo more danceable? Does track length affect popularity?). I also plan to use k-means clustering to see if we can classify songs by genre based on the other features, and also to see if we are able to find similar songs to an input song (as the other songs in the same cluster as the input should be similar).

variable

class

description

track_id

character

Song unique ID

track_name

character

Song Name

track_artist

character

Song Artist

track_popularity

double

Song Popularity (0-100) where higher is better

track_album_id

character

Album unique ID

track_album_name

character

Song album name

track_album_release_date

character

Date when album released

playlist_name

character

Name of playlist

playlist_id

character

Playlist ID

playlist_genre

character

Playlist genre

playlist_subgenre

character

Playlist subgenre

danceability

double

Danceability describes how suitable a track is for dancing based on a combination of musical elements including tempo, rhythm stability, beat strength, and overall regularity. A value of 0.0 is least danceable and 1.0 is most danceable.

energy

double

Energy is a measure from 0.0 to 1.0 and represents a perceptual measure of intensity and activity. Typically, energetic tracks feel fast, loud, and noisy. For example, death metal has high energy, while a Bach prelude scores low on the scale. Perceptual features contributing to this attribute include dynamic range, perceived loudness, timbre, onset rate, and general entropy.

key

double

The estimated overall key of the track. Integers map to pitches using standard Pitch Class notation . E.g. 0 = C, 1 = C♯/D♭, 2 = D, and so on. If no key was detected, the value is -1.

loudness

double

The overall loudness of a track in decibels (dB). Loudness values are averaged across the entire track and are useful for comparing relative loudness of tracks. Loudness is the quality of a sound that is the primary psychological correlate of physical strength (amplitude). Values typical range between -60 and 0 db.

mode

double

Mode indicates the modality (major or minor) of a track, the type of scale from which its melodic content is derived. Major is represented by 1 and minor is 0.

speechiness

double

Speechiness detects the presence of spoken words in a track. The more exclusively speech-like the recording (e.g. talk show, audio book, poetry), the closer to 1.0 the attribute value. Values above 0.66 describe tracks that are probably made entirely of spoken words. Values between 0.33 and 0.66 describe tracks that may contain both music and speech, either in sections or layered, including such cases as rap music. Values below 0.33 most likely represent music and other non-speech-like tracks.

acousticness

double

A confidence measure from 0.0 to 1.0 of whether the track is acoustic. 1.0 represents high confidence the track is acoustic.

instrumentalness

double

Predicts whether a track contains no vocals. “Ooh” and “aah” sounds are treated as instrumental in this context. Rap or spoken word tracks are clearly “vocal”. The closer the instrumentalness value is to 1.0, the greater likelihood the track contains no vocal content. Values above 0.5 are intended to represent instrumental tracks, but confidence is higher as the value approaches 1.0.

liveness

double

Detects the presence of an audience in the recording. Higher liveness values represent an increased probability that the track was performed live. A value above 0.8 provides strong likelihood that the track is live.

valence

double

A measure from 0.0 to 1.0 describing the musical positiveness conveyed by a track. Tracks with high valence sound more positive (e.g. happy, cheerful, euphoric), while tracks with low valence sound more negative (e.g. sad, depressed, angry).

tempo

double

The overall estimated tempo of a track in beats per minute (BPM). In musical terminology, tempo is the speed or pace of a given piece and derives directly from the average beat duration.

duration_ms

double

Duration of song in milliseconds

import pandas as pd
data = pd.read_csv('./spotify_songs.csv')

Cleaning Up the Data:#

Since the way the data is organized is by Spotify’s playlisting, or the autogenerated playlists created by Spotify, it’s possible that some songs show up more than once in the dataframe, since one song could be placed in multiple playlists. The following code computes how many unique entries there are in the 'track_id' column.

data['track_id'].nunique()
28356

So, there are only 28,356 unique tracks in the data set. In order to get the best results when we perform regression and classification methods on the data set, we should update the dataframe so that each track appears once, and drop the duplicate values. Also, we’ll drop any rows with missing values:

data_unique = data.drop_duplicates(subset=['track_id'])
df = data_unique.dropna()
df.shape
(28352, 23)

Data Exploration:#

Now, we have a dataframe df which consists of 28,352 unique songs (and no missing values). For this section, we focus on visualizing the data in a sort of exploratory data analysis. We’ll begin by printing the first five rows of the dataframe:

df.head(5)
track_id track_name track_artist track_popularity track_album_id track_album_name track_album_release_date playlist_name playlist_id playlist_genre ... key loudness mode speechiness acousticness instrumentalness liveness valence tempo duration_ms
0 6f807x0ima9a1j3VPbc7VN I Don't Care (with Justin Bieber) - Loud Luxur... Ed Sheeran 66 2oCs0DGTsRO98Gh5ZSl2Cx I Don't Care (with Justin Bieber) [Loud Luxury... 2019-06-14 Pop Remix 37i9dQZF1DXcZDD7cfEKhW pop ... 6 -2.634 1 0.0583 0.1020 0.000000 0.0653 0.518 122.036 194754
1 0r7CVbZTWZgbTCYdfa2P31 Memories - Dillon Francis Remix Maroon 5 67 63rPSO264uRjW1X5E6cWv6 Memories (Dillon Francis Remix) 2019-12-13 Pop Remix 37i9dQZF1DXcZDD7cfEKhW pop ... 11 -4.969 1 0.0373 0.0724 0.004210 0.3570 0.693 99.972 162600
2 1z1Hg7Vb0AhHDiEmnDE79l All the Time - Don Diablo Remix Zara Larsson 70 1HoSmj2eLcsrR0vE9gThr4 All the Time (Don Diablo Remix) 2019-07-05 Pop Remix 37i9dQZF1DXcZDD7cfEKhW pop ... 1 -3.432 0 0.0742 0.0794 0.000023 0.1100 0.613 124.008 176616
3 75FpbthrwQmzHlBJLuGdC7 Call You Mine - Keanu Silva Remix The Chainsmokers 60 1nqYsOef1yKKuGOVchbsk6 Call You Mine - The Remixes 2019-07-19 Pop Remix 37i9dQZF1DXcZDD7cfEKhW pop ... 7 -3.778 1 0.1020 0.0287 0.000009 0.2040 0.277 121.956 169093
4 1e8PAfcKUYoKkxPhrHqw4x Someone You Loved - Future Humans Remix Lewis Capaldi 69 7m7vv9wlQ4i0LFuJiE2zsQ Someone You Loved (Future Humans Remix) 2019-03-05 Pop Remix 37i9dQZF1DXcZDD7cfEKhW pop ... 1 -4.672 1 0.0359 0.0803 0.000000 0.0833 0.725 123.976 189052

5 rows × 23 columns

Next, we compute the mean, standard deviation, min, max, and 25th, 50th, and 75th quartiles of each of the features that have numerical values.

df.select_dtypes(['float64', 'int64']).describe()
track_popularity danceability energy key loudness mode speechiness acousticness instrumentalness liveness valence tempo duration_ms
count 28352.000000 28352.000000 28352.000000 28352.000000 28352.000000 28352.000000 28352.000000 28352.000000 28352.000000 28352.000000 28352.000000 28352.000000 28352.000000
mean 39.335320 0.653375 0.698373 5.367417 -6.817777 0.565533 0.107939 0.177192 0.091129 0.190955 0.510386 120.958219 226574.631102
std 23.699443 0.145791 0.183508 3.613743 3.036433 0.495696 0.102547 0.222814 0.232562 0.155888 0.234344 26.954502 61081.363704
min 0.000000 0.000000 0.000175 0.000000 -46.448000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 4000.000000
25% 21.000000 0.561000 0.579000 2.000000 -8.310250 0.000000 0.041000 0.014300 0.000000 0.092600 0.329000 99.972000 187741.250000
50% 42.000000 0.670000 0.722000 6.000000 -6.261000 1.000000 0.062600 0.079700 0.000021 0.127000 0.512000 121.993500 216933.000000
75% 58.000000 0.760000 0.843000 9.000000 -4.708750 1.000000 0.133000 0.260000 0.006573 0.249000 0.695000 133.999000 254975.250000
max 100.000000 0.983000 1.000000 11.000000 1.275000 1.000000 0.918000 0.994000 0.994000 0.996000 0.991000 239.440000 517810.000000

This is a lot of information to process just by looking at it, so we’ll make a few graphs to make some of the relationships between the features a bit more visually appealing. First, let’s plot the number of songs in each genre:

import matplotlib.pyplot as plt
import seaborn as sns

fig, ax = plt.subplots()
sns.countplot(data = df, x='playlist_genre', order = df['playlist_genre'].value_counts().index)
plt.title('Number of songs by genre')
plt.xlabel('Genre')
plt.ylabel('Number of Songs')
plt.show()
../_images/9108b42eed6cb221fdce1bacc4cab4a2ecda8a6c68e14a27a9985364c72273b7.png

We can see that there are over 4000 songs in each genre, so it’s a pretty uniform distribution, where there isn’t too much or too little of one genre. This makes the 'playlist_genre' feature of the data a good canditate to use for multiclass classfication, which we will focus on later.

Linear Regression:#

Before really analyzing the data, I have a hunch that loudness and energy will be correlated, in that louder songs will have more energy. To see this relationship, let’s plot energy as a function of loudness. Also, we’ll apply linear regression and plot the linear regression line on the same graph:

from sklearn.linear_model import LinearRegression
lm = LinearRegression()
X = df[['loudness']]
y = df['energy']
lm.fit(X, y)
fig, ax = plt.subplots()
ax.scatter(X, y)
ax.plot(X, X * lm.coef_[0] + lm.intercept_, color='red')
ax.set_ylim([-0.05, 1.05])
ax.set_xlabel("Loudness")
ax.set_ylabel("Energy")
ax.set_title(f"Loudness as a function of Energy, R^2 = {lm.score(X, y):.4f}");
../_images/476634dd07d4e938dbf3eb7575b8626e8b0570d27cd2c89b78c5e445a8890c09.png

So, we can see that the R^2 value is 0.4653.

The above graph shows only the relationship between 'loudness' and 'energy'. If we want to visualize the relationships between many more of the features, particularly the features which take on numerical values, we can use the Correlation Matrix, and plot it using a heatmap:

numerical_features = df.select_dtypes(['float64', 'int64'])
corr = numerical_features.corr()
plt.figure(figsize=(15, 8))
sns.heatmap(corr, annot=True)
<Axes: >
../_images/d9d46eb92b875637a080c52d3d9c482d41a760d282cb8b6701d444fbf8b82a05.png

From this, we can see that the vast majority of the features are not correlated at all. In fact, 'loudness' and 'energy', which we plotted above, happen to be the highest correlated. Now, this may mean that we may get some low accuracy scores when doing classification, but that doesn’t mean that these efforts are fruitless. However, this goes a bit against our intuition: one may guess that a song’s energy and danceability may be very correlated, or that a song’s energy and tempo may be correlated, etc. and this correlation matrix shows that this is actually not the case: There are a couple of explanations for this that I can come up with. One is that we are missing a few key features, including the types of instruments in the song or the lyrical content of the song. Another possibility is that multiple features together are better at predicting one feature than only considering one at a time. To test this, we can use multiple linear regression and compare the \(R^2\) scores for different combinations of features predicting one feature. We’ll test how combinations of the features 'danceability', 'loudness', 'tempo', 'valence', and 'liveness' predict 'energy'. We’ll only use combinations of these five features just for simplicity, so we don’t get too many combinations:

from sklearn.model_selection import train_test_split
df_train, df_test = train_test_split(df, test_size=0.5, random_state=0, stratify=df['playlist_genre'])

First we split the data into training and testing data, and the stratify=df['playlist_genre'] code ensures that each group has the same proportion of genres.

from itertools import combinations
features = ['danceability', 'loudness',	'tempo', 'valence', 'liveness']
from sklearn.metrics import r2_score
features_comb = list(combinations(features, 1)) + list(combinations(features, 2)) + list(combinations(features, 3)) + list(combinations(features, 4)) + list(combinations(features, 5))
R2_train = []
R2_test = []
for combination in features_comb:
    x_train = df_train[list(combination)]
    y_train = df_train['energy']
    x_test = df_test[list(combination)]
    y_test = df_test['energy']

    lm = LinearRegression()
    lm.fit(x_train, y_train)
    y_train_pred = lm.predict(x_train)
    y_test_pred = lm.predict(x_test)

    r2_train = r2_score(y_train, y_train_pred)
    r2_test = r2_score(y_test, y_test_pred)

    R2_train.append(r2_train)
    R2_test.append(r2_test)

results = pd.DataFrame({'Feature': [list(comb) for comb in features_comb],'R2_train': R2_train,'R2_test': R2_test})
results
Feature R2_train R2_test
0 [danceability] 0.006296 0.006805
1 [loudness] 0.463760 0.466767
2 [tempo] 0.027840 0.017513
3 [valence] 0.023161 0.021459
4 [liveness] 0.028556 0.024941
5 [danceability, loudness] 0.472100 0.475298
6 [danceability, tempo] 0.030229 0.020985
7 [danceability, valence] 0.042003 0.041449
8 [danceability, liveness] 0.031961 0.028985
9 [loudness, tempo] 0.472914 0.472255
10 [loudness, valence] 0.478530 0.478818
11 [loudness, liveness] 0.476912 0.477022
12 [tempo, valence] 0.052022 0.040490
13 [tempo, liveness] 0.055106 0.041784
14 [valence, liveness] 0.052503 0.047761
15 [danceability, loudness, tempo] 0.478456 0.478603
16 [danceability, loudness, valence] 0.497897 0.497790
17 [danceability, loudness, liveness] 0.482871 0.483401
18 [danceability, tempo, valence] 0.063390 0.054302
19 [danceability, tempo, liveness] 0.055917 0.043291
20 [danceability, valence, liveness] 0.065980 0.062597
21 [loudness, tempo, valence] 0.488272 0.484990
22 [loudness, tempo, liveness] 0.485740 0.482326
23 [loudness, valence, liveness] 0.492216 0.489798
24 [tempo, valence, liveness] 0.080049 0.065952
25 [danceability, loudness, tempo, valence] 0.503250 0.500700
26 [danceability, loudness, tempo, liveness] 0.489390 0.486824
27 [danceability, loudness, valence, liveness] 0.507846 0.505543
28 [danceability, tempo, valence, liveness] 0.087447 0.075575
29 [loudness, tempo, valence, liveness] 0.501625 0.495785
30 [danceability, loudness, tempo, valence, liven... 0.513358 0.508561

From this, we see that the combination of all five features has the best \(R^2\) score, and is therefore using the combination of danceability, loudness, tempo, valence, and liveness is the best at predicting the energy of the song, with a training \(R^2\) score of \(0.513358\) and a testing \(R^2\) score of \(0.508561\).

Classification of Songs by Genre:#

In this section, I want to see if we are able to accurately classify the songs by Genre using methods of multiclass classification. In order to do this, I will use k-nearest neighbor classification, k-means clustering, and another method of classification called Random Forest Classification, which we did not discuss in Math 10. From the scikit-learn website, the random forest classifier is a machine learning algorithm which fits decision tree classifiers on subsamples of the data set, and then uses averaging to imporve the predictive accuracy and control overfitting.

Before we start using these classification methods, we have to:

  1. Scale the features using a min-max scaler

  2. Use the Label Encoder to encode the genres pop, rap, edm, latin, r&b, and rock into the classes 1, 2, 3, 4, 5, 6

Scaling the Data:#

First, we need to choose which features we want to use to predict the genre of the song.

from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
features = ['danceability', 'energy', 'key', 'loudness', 'mode', 'speechiness', 'acousticness', 'instrumentalness', 'liveness', 'valence', 'tempo', 'duration_ms']
label = 'playlist_genre'
scaled_features = scaler.fit_transform(df[features])
df_scaled = pd.DataFrame(scaled_features, columns = df[features].columns)
X = df_scaled
y = df[label]

Label Encoding:#

from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
y = le.fit_transform(y)

Now, we’re ready to start using the classification methods to see if we can classify songs by genre using the features 'danceability', 'energy', 'key', 'loudness', 'mode', 'speechiness', 'acousticness', 'instrumentalness', 'liveness', 'valence', 'tempo', 'duration_ms'. Note that since we are using a total of \(12\) features to make the prediction, we are not going to be able to graph and visualize the decision boundary, since we are not able to plot a 12-dimensional graph. We are still going to be able to get accuracy scores.

kNN Classification:#

We will test different values of \(k\) to see which one gives us the best accuracy using 5-fold cross validation. Since there are a total of \(28352\) songs in the dataframe, we should test using a large number of neighbors, so we’ll test using the 100, 500, 2000, and 5000 nearest neighbors. Unfortunately, this will take a long time to run since there are so many songs and so many features for each song.

from sklearn.model_selection import KFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

num_neighbors = [10, 100, 500, 2000, 5000]
average_accuracy = []
for n in num_neighbors:
    kf = KFold(n_splits=5, shuffle=True, random_state=0)
    knn = KNeighborsClassifier(n_neighbors=n)
    score = cross_val_score(knn, X, y, cv=kf)
    accuracy = score.mean()
    average_accuracy.append(accuracy)
average_accuracy
[0.456793239654581,
 0.4690676939545452,
 0.44042765927207234,
 0.39394114118148676,
 0.32604477683887545]

From this, we can see that of the number of neighbors being 10, 100, 500, 2000, or 5000, using 100 neighbors gave the best prediction accuracy of \(0.4691\). Just this test does not mean that 100 is the ideal number of neighbors, but of the five values we tested, 100 neighbors gave the best accuracy.

k-means Clustering:#

Here, we want to choose the best \(k\) value, or the number of clusters which will be best at predicting the genres. To do this, we will plot the loss as a function of \(k\) and then find the “elbow” in the graph.

from sklearn.cluster import KMeans

k_values = range(1, 15)
inertias = []

for k in k_values:
    kmeans = KMeans(n_clusters = k, random_state=42)
    kmeans.fit(X)
    inertias.append(kmeans.inertia_)

fig, ax = plt.subplots(figsize=(10, 6))
ax.plot(k_values, inertias, marker='o')
ax.set_title("Loss vs. number of clusters k")
ax.set_xlabel("k")
ax.set_ylabel("Loss (Inertia)")
ax.grid(True)
plt.show()
../_images/c6286c59429f15b3bed66ad82d657ba5307c84ddd3326960121b53a1eaaa1b96.png

From this, we can see that the most decrease is at 2 clusters, but there is still a pretty steady decrease after 2 clusters. If I had to classify a true “elbow” for this graph, I would say that \(k=2\) is probably a good number to choose.

accuracy_scores = []
for n in range(1, 15):
    kmeans = KMeans(n_clusters = n, random_state=42)
    kmeans.fit(X)
    labels = kmeans.labels_
    correct_labels = sum(y==labels)
    accuracy_scores.append(correct_labels / y.size)
accuracy_scores
[0.1720160835214447,
 0.15282872460496613,
 0.15953019187358916,
 0.15244074492099322,
 0.15695541760722348,
 0.14637415349887134,
 0.13441732505643342,
 0.12521162528216703,
 0.1177341986455982,
 0.09600733634311512,
 0.07763120767494357,
 0.06867240406320542,
 0.061336060948081264,
 0.060983352144469524]

From this, we see that \(k=1\) gives the best accuracy score, but for \(k>1\), the best number of clusters is \(k=3\). Also, the accuracy of the k-means clustering is much lower (~0.16) than the accuracy we got using the kNN classification above (~0.47).

Random Forest Classification:#

Now, we use the Random Forest Classification, which is an algorithm that we did not discuss in Math 10.

from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

num_estimators = [10, 100, 250, 500, 1000]
accuracy_scores = []
for n in num_estimators:
    rf = RandomForestClassifier(n_estimators = n, random_state=42)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25, random_state = 0)
    rf.fit(X_train, y_train)
    y_pred = rf.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)
    accuracy_scores.append(accuracy)
accuracy_scores
[0.5132618510158014,
 0.5630643340857788,
 0.5633465011286681,
 0.5677200902934537,
 0.5660270880361173]

The Random Forest Classifier is a machine learning algorithm which creates an ensemble of decision trees, which are like flowcharts where the nodes represent the features, each branch is a decision rule, and each leaf node is the class label. The Random Forest classification takes a bunch of these decision trees and combines their predictions in a way which reduces the variance and increases the accuracy in comparison to just using a single decision tree. One of the main pros to using this algorithm is that it helps reduce the overfitting of data, especially when there are many features; in this case, since we are using a total of 12 features, this is especially helpful. In the above code, we tested the data using the Random Forest Classifier using 10, 100, 250, 500, and 1000 estimators (which is the number of trees in the forest). We obtained that of these, 500 trees in the forest gave the highest accuracy of classification, with an accuracy score of \(0.5677\). This is a higher accuracy than we were able to obtain using kNN and k-means clustering, which indicates that this algorithm is the best out of the three at classifying these songs by genre.

Conclusion:#

The data set we are analyzing consists of 28,352 unique songs and many features pulled from Spotify’s API. My goal with this project is to analyze any relationships between all of the features using linear regression, as well as to use multiclass classification methods to see how accurately we are able to use a song’s features to classify it by genre. Using multiple linear regression, we tested which combination of the features 'danceability', 'loudness', 'tempo', 'valence', and 'liveness' best predict 'energy', and we found that the combination of all five features was the best predictor, with a training \(R^2\) score of \(0.513358\) and a testing \(R^2\) score of \(0.508561\). Next, we used kNN classification, k-means clustering, and Random forest classification (which is an algorithm we didn’t discuss in class) to use all the numerical features in the dataset to predict a song’s genre. We found that the Random forest classification was the best classifier, where using 500 trees gave an accuracy of \(0.5677\) of correctly predicting the song’s genre.

References:#

The dataset was taken from Kaggle

The information for the Random Forest Classifier was taken from scikit-learn