Predicting ratings on Sephora based on price and number of ingredients

Author: Nicole Sonneville

Course Project, UC Irvine, Math 10, S24

I would like to post my notebook on the course’s website: Yes

Introduction

In this project, I will try to find a relationship between a product’s ratings and its price and/or number of ingredients. This data set tracks a product’s reception in two ways: ratings and loves. Rating is the standard 5-star scale, where customers can rate a product 5 stars, 4.5 stars, 4 stars, and so on, all the way down to 0 stars. Loves are the number of people who have hearted the product, which saves it to their favorites for easier repurchase later.

I think it’s important to recognize the different reasons someone may leave a rating versus hitting love on a product. People tend to leave ratings for one of two reasons: either they strongly dislike the product and want to cathartically express their distaste online, or a brand offers some sort of coupon on their next order in exchange for a good rating. Because of this, I expect ratings to be very polarized: many 4-5 star ratings, quite a few 0-1 star ratings, and very few 2-3 star ratings. Loves, on the other hand, are not made public and exist strictly for ease of repurchase. Because of this, I would expect cheaper products to have more loves, as they may be repurchased more frequently than expensive products.

Imports:

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor
from sklearn.metrics import r2_score
from sklearn.tree import DecisionTreeRegressor, plot_tree

df = pd.read_csv("sephora_website_dataset.csv")

Data cleaning:

# Drop text-heavy columns that won't be used in the analysis
df_1 = df.drop(['details', 'how_to_use', 'URL', 'MarketingFlags_content'], axis=1)
# Work with a random sample of 1,000 products to keep the analysis manageable
df = df_1.sample(n=1000, random_state=42)

This was a large data set I found on Kaggle, so I dropped the columns I didn’t find useful. Also, since the data set was so large, I chose 1,000 random rows to analyze so as not to be overwhelmed.

print(df['ingredients'].head())
1501    Sucrose- Morrocan Lava Clay- Ammonium Lauryl S...
4708    Algae (Seaweed) Extract- Cyclopentasiloxane- P...
7859                                              unknown
4347     -Squalane: Helps restore skin’s natural moist...
2420     -Hyaluronic Chronospheres: Provide time-relea...
Name: ingredients, dtype: object

I wanted to analyze the number of ingredients, but I realized that the ingredients column stored all of a product’s ingredients in one long string.

# Function to count the number of ingredients in the ingredient string
def count_ingredients(ingredient_list):
    # Treat non-string entries and the "unknown" placeholder as missing
    if not isinstance(ingredient_list, str) or ingredient_list.lower() == 'unknown':
        return np.nan
    # Ingredients in this data set are separated by "-"
    return len(ingredient_list.split('-'))

# Apply the function to create the num_ingredients column
df['num_ingredients'] = df['ingredients'].apply(count_ingredients)

print(f"Sampled DataFrame size: {df.shape}")
df.sample(10)
Sampled DataFrame size: (1000, 18)
id brand category name size rating number_of_reviews love price value_price MarketingFlags options ingredients online_only exclusive limited_edition limited_time_offer num_ingredients
1393 2115558 Charlotte Tilbury Eyeliner Rock 'N' Kohl Eyeliner Pencil .04 oz 4.0 85 16800 27.00 27.00 False no options Isododecane- Synthetic Wax- Hydrogenated Polyd... 0 0 0 0 28.0
2982 2148971 FOREO Face Masks Glow Addict Activated Mask 6 masks 4.5 6 2400 19.99 19.99 False no options Water- Glycerin- Butylene Glycol- Dipropylene... 0 0 0 0 39.0
8974 2086981 SEPHORA COLLECTION Face Masks Acai Smoothie Mask - Anti-Pollution 1.69oz/ 50 mL 5.0 16 2200 8.00 8.00 True no options -Açaí Berry Extract: Acts like a shield again... 0 1 0 0 44.0
5623 2030351 Moroccanoil Shampoo Moisture Repair Shampoo no size 4.5 912 12600 24.00 24.00 False no options -Argan Oil: Helps to nurture hair. -Lavender- ... 0 0 0 0 72.0
5111 2346328 Living Proof Value & Gift Sets Full Silicone Detox Kit no size 0.0 0 75 29.00 60.00 True no options -Healthy Hair Molecule: Keeps hair cleaner lo... 1 0 1 0 144.0
6552 2283349 philosophy Moisturizers Renewed Hope in A Jar Water Cream 2 oz/ 60 mL 4.5 135 1800 39.00 39.00 False no options Aqua/Water/Eau- Glycerin- Pentylene Glycol- Is... 0 0 0 0 40.0
6997 2295814 Sakara Life Beauty Supplements Sakara Life Super Powder no size 4.5 12 884 90.00 135.00 True no options -Plant Protein Blend: Derived from four organ... 1 0 0 0 29.0
2516 2313005 Drunk Elephant Face Masks F-Balm™ Electrolyte Waterfacial Mask 1.69 oz/ 50 mL 4.0 797 38700 52.00 52.00 True no options -4-Electrolyte Blend: Antioxidant-rich and po... 0 1 0 0 131.0
6001 2174050 NEST New York Candles & Home Scents Velvet Pear Candle 8.1oz/230g 5.0 1 303 42.00 42.00 False - 8.1oz/230g Candle - 21.1oz/ 600g Candle ... unknown 0 0 0 0 NaN
8084 2339760 Tatcha Value & Gift Sets The Japanese Ritual for Glowing Skin no size 5.0 8 16100 60.00 75.00 True no options -Tatcha's Signature Hadasei-3: Restores a hea... 1 1 1 0 141.0

I created a helper function that splits the ingredient string on the “-” character and registers “unknown” as NaN. With these separated ingredients, I created a new column, num_ingredients, to track how many ingredients each product has. (Note that this count is approximate: ingredient names that themselves contain hyphens get split and over-counted.) Above are 10 rows showing the new num_ingredients column.
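As a quick sanity check, the helper can be tried on a few hand-written strings (these examples are for illustration only and are not rows from the data set):

# Hand-written example strings (not from the data set) to sanity-check count_ingredients
print(count_ingredients("Water- Glycerin- Butylene Glycol"))  # 3
print(count_ingredients("unknown"))                           # nan
# A hyphen inside an ingredient name gets over-counted:
print(count_ingredients("Aqua/Water/Eau- Alpha-Arbutin"))     # 3, though there are only 2 ingredients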

Exploratory Data Analysis:

# Basic statistics about the number of ingredients
print(df['num_ingredients'].describe())
print(f"Number of missing values in 'num_ingredients': {df['num_ingredients'].isna().sum()}")

# Replace missing ingredient counts with the mean so every row can be used in numerical analysis
df['num_ingredients'] = df['num_ingredients'].fillna(df['num_ingredients'].mean())

plt.figure(figsize=(10, 6))
sns.histplot(df['num_ingredients'], bins=30, kde=True)
plt.title('Distribution of Number of Ingredients')
plt.xlabel('Number of Ingredients')
plt.ylabel('Frequency')
plt.show()

plt.figure(figsize=(10, 6))
sns.histplot(df['rating'], bins=30, kde=True)
plt.title('Distribution of Ratings')
plt.xlabel('Ratings')
plt.ylabel('Frequency')
plt.show()
count    819.000000
mean      47.882784
std       36.664186
min        1.000000
25%       23.000000
50%       38.000000
75%       62.000000
max      354.000000
Name: num_ingredients, dtype: float64
Number of missing values in 'num_ingredients': 181
[Figures: histogram of the distribution of number of ingredients; histogram of the distribution of ratings]

First I created a histogram of the frequency of the number of ingredients in the products. Overall, the mean was around 48 ingredients, with some products going as high as the 350s. I was shocked that the mean was this high, and even more shocked to see that several products have hundreds of ingredients.

I also wanted to check the distribution of ratings to see if it fit my earlier hypothesis. Sure enough, the majority of ratings are 4-5 stars, with very few 1-3 star ratings; the tabulation below quantifies this.
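To put numbers behind the histogram, a quick tabulation of the rating column (a small check, not one of the original plots) shows how lopsided the distribution is:

# Count how many products received each rating value
print(df['rating'].value_counts().sort_index())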

plt.figure(figsize=(10, 6))
sns.scatterplot(data=df, x='num_ingredients', y='love', alpha=0.7)
plt.title('Number of Ingredients vs. Number of Loves')
plt.xlabel('Number of Ingredients')
plt.ylabel('Number of Loves')
plt.show()

plt.figure(figsize=(10, 6))
sns.scatterplot(data=df, x='num_ingredients', y='rating', alpha=0.7)
plt.title('Number of Ingredients vs. Rating')
plt.xlabel('Number of Ingredients')
plt.ylabel('Rating')
plt.show()
[Figures: scatter plot of number of ingredients vs. number of loves; scatter plot of number of ingredients vs. rating]

Next I wanted to see how the number of ingredients relates to loves and ratings. There doesn’t seem to be a noticeable relationship. The higher ratings (4’s and 5’s) had a larger spread in the number of ingredients, whereas the mediocre and bad ratings tended to have a smaller range; the correlation matrix below makes this concrete.
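A Pearson correlation matrix (a quick sketch using pandas’ built-in corr) quantifies the lack of a linear relationship:

# Pairwise Pearson correlations between ingredient count, loves, and rating
print(df[['num_ingredients', 'love', 'rating']].corr())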

# Group by brand and calculate the average rating and love count
brand_stats = df.groupby('brand').agg({
    'rating': 'mean',
    'love': 'mean',
    'num_ingredients': 'mean'
}).reset_index()

top_brands_love = brand_stats.sort_values(by='love', ascending=False).head(10)

plt.figure(figsize=(12, 6))
sns.barplot(x='brand', y='love', data=top_brands_love, palette='coolwarm')
plt.title('Top 10 Brands by Average Number of Loves')
plt.xlabel('Brand')
plt.ylabel('Average Number of Loves')
plt.xticks(rotation=45)
plt.show()

# Relationship between price and loves
plt.figure(figsize=(12, 6))
sns.scatterplot(data=df, x='price', y='love', alpha=0.7)
plt.title('Price vs. Number of Loves')
plt.xlabel('Price')
plt.ylabel('Number of Loves')
plt.show()

# Relationship between price and rating
plt.figure(figsize=(12, 6))
sns.scatterplot(data=df, x='price', y='rating', alpha=0.7)
plt.title('Price vs. Rating')
plt.xlabel('Price')
plt.ylabel('Rating')
plt.show()
[Figures: bar chart of top 10 brands by average number of loves; scatter plot of price vs. number of loves; scatter plot of price vs. rating]

Next I did a similar analysis between price and loves/ratings. As with the number of ingredients, the higher ratings had a much larger spread in prices. The lower ratings tended to belong to less expensive products, but this relationship is not very strong.
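Because loves span several orders of magnitude, most points in the raw scatter plot get squashed near the x-axis. A log-scaled y-axis, sketched below as a variation on the plot above (products with zero loves cannot be shown on a log axis), can make any price trend easier to see:

# Same price-vs-loves scatter, but with a log-scaled y-axis to spread out
# the heavily skewed love counts (zero-love products drop off a log axis)
plt.figure(figsize=(12, 6))
sns.scatterplot(data=df, x='price', y='love', alpha=0.7)
plt.yscale('log')
plt.title('Price vs. Number of Loves (log scale)')
plt.xlabel('Price')
plt.ylabel('Number of Loves (log scale)')
plt.show()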

Linear Regression:

X = df[['price']]  
y = df['love']  


X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

lin_reg = LinearRegression()
lin_reg.fit(X_train, y_train)

y_pred = lin_reg.predict(X_test)

plt.figure(figsize=(10, 6))
plt.scatter(X_test, y_test, color='blue', alpha=0.5, label='Actual Number of Loves')
plt.plot(X_test, y_pred, color='red', linewidth=2, label='Line of Best Fit')
plt.xlabel('Price')
plt.ylabel('Number of Loves')
plt.title('Actual vs Predicted Number of Loves with Line of Best Fit')
plt.legend()
plt.show()

r2 = r2_score(y_test, y_pred)
print(f"R² Score: {r2}")
[Figure: price vs. number of loves with line of best fit]
R² Score: 0.0009509966720172569

Here I performed linear regression of the number of loves on price. The fitted line has a slight negative slope, but with an R² score of roughly 0.001, price explains essentially none of the variation in loves.
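The direction of that slope can be read directly off the fitted model (a quick check; the exact values depend on the random split):

# The sign of the coefficient gives the direction of the fitted trend
print(f"Slope: {lin_reg.coef_[0]:.4f}, Intercept: {lin_reg.intercept_:.2f}")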

X = df[['num_ingredients']]
y = df['love']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

lin_reg = LinearRegression()
lin_reg.fit(X_train, y_train)

y_pred = lin_reg.predict(X_test)

plt.figure(figsize=(10, 6))
plt.scatter(X_test, y_test, color='blue', alpha=0.5, label='Actual Number of Loves')
plt.plot(X_test, y_pred, color='red', linewidth=2, label='Line of Best Fit')
plt.xlabel('Number of Ingredients')
plt.ylabel('Number of Loves')
plt.title('Actual vs Predicted Number of Loves with Line of Best Fit')
plt.legend()
plt.show()

r2 = r2_score(y_test, y_pred)
print(f"R² Score: {r2}")
[Figure: number of ingredients vs. number of loves with line of best fit]
R² Score: -0.021064954183508178

I also ran a linear regression of loves on the number of ingredients. This time the fitted line had a slight positive slope, but the R² score on the test set is actually negative, meaning the model predicts worse than a constant baseline.
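A negative R² on the test set means the model predicts worse than a constant. The sketch below makes that baseline explicit by predicting the training-set mean for every test product:

# Baseline: predict the training-set mean love count for every test product
baseline_pred = np.full(len(y_test), y_train.mean())
print(f"Baseline R²: {r2_score(y_test, baseline_pred):.4f}")
# A model whose test R² falls below this constant predictor is adding noise, not signal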

K-Nearest Neighbors:

X = df[['num_ingredients']]
y = df['love']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

knn_regressor = KNeighborsRegressor(n_neighbors=5)
knn_regressor.fit(X_train, y_train)

y_pred = knn_regressor.predict(X_test)

X_test_sorted, y_pred_sorted = zip(*sorted(zip(X_test.values.flatten(), y_pred)))

plt.figure(figsize=(10, 6))
plt.scatter(X_test, y_test, color='blue', alpha=0.5, label='Actual Number of Loves')
plt.plot(X_test_sorted, y_pred_sorted, color='red', linewidth=2, label='K-Nearest Neighbors Fit Line')
plt.xlabel('Number of Ingredients')
plt.ylabel('Number of Loves')
plt.title('Actual vs Predicted Number of Loves with K-Nearest Neighbors Fit Line')
plt.legend()
plt.show()

r2 = r2_score(y_test, y_pred)
print(f"R² Score: {r2}")
[Figure: number of ingredients vs. number of loves with k-nearest neighbors fit line]
R² Score: -0.23177714981810005

I performed k-nearest neighbors regression with number of ingredients and loves. I chose n_neighbors = 5 to avoid overfitting the data, as lower values of k tend to do so. This means each prediction on the fit line is the average of the 5 closest training points.
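To check how sensitive the fit is to this choice, a small sweep over k (a sketch; the scores will vary with the sampled rows) can be run on the same split:

# Compare test R² for a few values of k on the same train/test split
for k in (1, 5, 15, 50):
    knn = KNeighborsRegressor(n_neighbors=k).fit(X_train, y_train)
    print(f"k={k}: R² = {r2_score(y_test, knn.predict(X_test)):.4f}")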

X = df[['price']]
y = df['love']  # Target: Number of Loves

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

knn_regressor = KNeighborsRegressor(n_neighbors=5)
knn_regressor.fit(X_train, y_train)

y_pred = knn_regressor.predict(X_test)

X_test_sorted, y_pred_sorted = zip(*sorted(zip(X_test.values.flatten(), y_pred)))

plt.figure(figsize=(10, 6))
plt.scatter(X_test, y_test, color='blue', alpha=0.5, label='Actual Number of Loves')
plt.plot(X_test_sorted, y_pred_sorted, color='red', linewidth=2, label='K-Nearest Neighbors Fit Line')
plt.xlabel('Price')
plt.ylabel('Number of Loves')
plt.title('Actual vs Predicted Number of Loves with K-Nearest Neighbors Fit Line')
plt.legend()
plt.show()

r2 = r2_score(y_test, y_pred)
print(f"R² Score: {r2}")
[Figure: price vs. number of loves with k-nearest neighbors fit line]
R² Score: -0.1738229878440276

Similarly, I performed k-nearest neighbors regression on price and number of loves; the R² score is again negative.

Extra: Decision Tree Regression

I learned about another tool that could be used to analyze the data: decision tree regression (https://www.geeksforgeeks.org/python-decision-tree-regression-using-sklearn/).

X = df[['num_ingredients']] 
y = df['love']  

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

dt_regressor = DecisionTreeRegressor(max_depth=4, random_state=42)
dt_regressor.fit(X_train, y_train)

y_pred = dt_regressor.predict(X_test)

plt.figure(figsize=(20, 10))
plot_tree(dt_regressor, filled=True, feature_names=['num_ingredients'], proportion=True)
plt.title('Decision Tree for Predicting Number of Loves based on Number of Ingredients')
plt.show()

r2 = r2_score(y_test, y_pred)
print(f"R² Score: {r2}")
[Figure: decision tree for predicting number of loves based on number of ingredients]
R² Score: -0.14020743893570753

Here, I chose a maximum tree depth of 4 to avoid overfitting the data, since deeper trees can memorize the training set. Even so, the R² score remains negative.
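As with k in k-nearest neighbors, max_depth can be tuned; the sketch below compares a few depths on the same split (None lets the tree grow until its leaves are pure, and the scores will vary with the sample):

# Compare test R² for a few tree depths (max_depth=None grows the tree fully)
for depth in (2, 4, 8, None):
    dt = DecisionTreeRegressor(max_depth=depth, random_state=42).fit(X_train, y_train)
    print(f"max_depth={depth}: R² = {r2_score(y_test, dt.predict(X_test)):.4f}")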

In Conclusion:

Despite the use of several tools, including linear regression, k-nearest neighbors, and decision tree regression, none of these models accurately captured the relationship between loves and the number of ingredients or price. The vast majority of the models produced a negative R² value, meaning they predicted the number of loves worse than a constant baseline would. The only model with some promise was the linear regression of loves on price, whose slight negative slope hints that lower-priced products may collect more loves.

Future Research:

This analysis just scratches the surface of predicting makeup sales. I think it would be interesting to look specifically at celebrity makeup brands, as their marketing is completely different from other brands: ratings reflect not only the products but also how the celebrity is perceived and marketed. It would also be interesting to look at how marketing tactics such as “limited time” offers affect ratings.