Predicting computer prices using multiple linear regression and decision trees#

Author: Yu Chen

Course Project, UC Irvine, Math 10, S24

I would like to post my notebook on the course’s website. [Yes]

Introduction#

In this project, I analyze how the hardware of different laptops affects their price, and which parameters have the greatest impact. On Kaggle (https://www.kaggle.com/datasets/jacksondivakarr/laptop-price-prediction-dataset) I found a dataset with information about more than 800 laptop models, including price, RAM, ROM, screen size, CPU, etc.

The goal of this project is to predict the price of a laptop from its hardware specifications, which may serve as a price reference for shoppers who don’t know much about laptops.

Importing#

import pandas as pd
import altair as alt
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn import metrics
import warnings
import seaborn as sb
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score, explained_variance_score
from sklearn.tree import DecisionTreeRegressor, plot_tree
warnings.filterwarnings("ignore")

Basic information about the data and Visualization#

data=pd.read_csv('LaptopPricePredictionDataset.csv')
data.head()
Unnamed: 0.1 Unnamed: 0 brand name price spec_rating processor CPU Ram Ram_type ROM ROM_type GPU display_size resolution_width resolution_height OS warranty
0 0 0 HP Victus 15-fb0157AX Gaming Laptop 49900 73.000000 5th Gen AMD Ryzen 5 5600H Hexa Core, 12 Threads 8GB DDR4 512GB SSD 4GB AMD Radeon RX 6500M 15.6 1920.0 1080.0 Windows 11 OS 1
1 1 1 HP 15s-fq5007TU Laptop 39900 60.000000 12th Gen Intel Core i3 1215U Hexa Core (2P + 4E), 8 Threads 8GB DDR4 512GB SSD Intel UHD Graphics 15.6 1920.0 1080.0 Windows 11 OS 1
2 2 2 Acer One 14 Z8-415 Laptop 26990 69.323529 11th Gen Intel Core i3 1115G4 Dual Core, 4 Threads 8GB DDR4 512GB SSD Intel Iris Xe Graphics 14.0 1920.0 1080.0 Windows 11 OS 1
3 3 3 Lenovo Yoga Slim 6 14IAP8 82WU0095IN Laptop 59729 66.000000 12th Gen Intel Core i5 1240P 12 Cores (4P + 8E), 16 Threads 16GB LPDDR5 512GB SSD Intel Integrated Iris Xe 14.0 2240.0 1400.0 Windows 11 OS 1
4 4 4 Apple MacBook Air 2020 MGND3HN Laptop 69990 69.323529 Apple M1 Octa Core (4P + 4E) 8GB DDR4 256GB SSD Apple M1 Integrated Graphics 13.3 2560.0 1600.0 Mac OS 1
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 893 entries, 0 to 892
Data columns (total 18 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   Unnamed: 0.1       893 non-null    int64  
 1   Unnamed: 0         893 non-null    int64  
 2   brand              893 non-null    object 
 3   name               893 non-null    object 
 4   price              893 non-null    int64  
 5   spec_rating        893 non-null    float64
 6   processor          893 non-null    object 
 7   CPU                893 non-null    object 
 8   Ram                893 non-null    object 
 9   Ram_type           893 non-null    object 
 10  ROM                893 non-null    object 
 11  ROM_type           893 non-null    object 
 12  GPU                893 non-null    object 
 13  display_size       893 non-null    float64
 14  resolution_width   893 non-null    float64
 15  resolution_height  893 non-null    float64
 16  OS                 893 non-null    object 
 17  warranty           893 non-null    int64  
dtypes: float64(4), int64(4), object(10)
memory usage: 125.7+ KB

There are no missing values in the data set and no data needs to be deleted.
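
One way to confirm this directly, rather than reading it off the `info()` output, is to count the missing values per column. The sketch below uses a small made-up DataFrame (`toy` is a hypothetical stand-in for `data`, with one value deliberately left out), so its numbers are only illustrative.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for `data`; one price is deliberately missing.
toy = pd.DataFrame({
    "brand": ["HP", "Acer", "Lenovo"],
    "price": [49900, 26990, np.nan],
})

# Count missing values per column; all zeros would mean nothing needs cleaning.
missing_per_column = toy.isna().sum()
print(missing_per_column)

# If some rows were incomplete, they could be dropped like this.
toy_clean = toy.dropna()
print(len(toy_clean))
```

On the laptop data, the same `isna().sum()` call should report zero for every column if the statement above holds.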

data.describe().round(2)
Unnamed: 0.1 Unnamed: 0 price spec_rating display_size resolution_width resolution_height warranty
count 893.00 893.00 893.00 893.00 893.00 893.00 893.00 893.00
mean 467.14 521.38 79907.41 69.38 15.17 2035.39 1218.32 1.08
std 270.21 299.92 60880.04 5.54 0.94 426.08 326.76 0.33
min 0.00 0.00 9999.00 60.00 11.60 1080.00 768.00 0.00
25% 235.00 265.00 44500.00 66.00 14.00 1920.00 1080.00 1.00
50% 467.00 531.00 61990.00 69.32 15.60 1920.00 1080.00 1.00
75% 702.00 784.00 90990.00 71.00 15.60 1920.00 1200.00 1.00
max 930.00 1019.00 450039.00 89.00 18.00 3840.00 3456.00 3.00
# Define a helper that draws a bar chart with the count written above each bar
def draw_bar(df, title):
    fig, ax = plt.subplots(figsize=(12, 5))
    ax.bar(df.index, df.values)
    ax.set_title(title)
    for i,j in enumerate(df.values):
        ax.text(i,j+2,j,va='bottom',ha='center')
    plt.xticks(rotation=60)
    plt.show()
#Apple computers have fixed prices and different operating systems, so Apple computer data is removed here
data=data[~data['brand'].str.contains('Apple')]
df_count = data['brand'].value_counts()
draw_bar(df_count, 'Quantity of each brand')
[Figure: bar chart of the number of laptops per brand]

Popular brands include HP, Lenovo, and Asus.

df_count = data['Ram_type'].value_counts()
draw_bar(df_count, 'Quantity of each RAM type')
[Figure: bar chart of the number of laptops per RAM type]

The mainstream memory type is DDR4.

df_system = data['OS'].value_counts()
draw_bar(df_system,'Quantity of each operating system')
[Figure: bar chart of the number of laptops per operating system]

Optimize data#

In this step, I convert several text columns into numbers so they are easier to analyze. For example, storage given as 1TB is converted to 1024GB. Most of the CPU descriptions are free text, so the number of cores and threads has to be extracted from them.
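
As an illustration of the unit conversion described here, the following is a minimal sketch of a parser that reads the unit explicitly. The function name `parse_storage_gb` is hypothetical and is not part of the notebook; it assumes the values look like `512GB` or `1TB`.

```python
import re

def parse_storage_gb(text):
    """Convert strings such as '512GB' or '1TB' to a size in GB."""
    match = re.fullmatch(r"\s*(\d+)\s*(GB|TB)\s*", text, flags=re.IGNORECASE)
    if match is None:
        raise ValueError(f"Unrecognized storage string: {text!r}")
    size = int(match.group(1))
    unit = match.group(2).upper()
    # 1 TB is treated as 1024 GB, matching the convention used in this project.
    return size * 1024 if unit == "TB" else size

print(parse_storage_gb("512GB"))
print(parse_storage_gb("1TB"))
```

Reading the unit from the string avoids having to guess from the magnitude of the number whether a value was given in TB.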

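# Strip the unit suffix and keep the integer part of RAM and ROM
data['Ram']=data['Ram'].astype(str).apply(lambda r:int( r.replace("GB","").replace("TB","") ))
data['ROM']=data['ROM'].astype(str).apply(lambda r:int( r.replace("GB","").replace("TB","") ))
# Values below 10 were given in TB (e.g. 1TB, 2TB), so convert them to GB
data['ROM']=data['ROM'].apply(lambda r: r*1024 if r < 10 else r)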

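def getThreads(cpu):
    # The thread count, when present, follows the last comma, e.g. "Hexa Core, 12 Threads"
    l=0
    if len(cpu.split(","))>1:
        l=cpu.split(",")[-1].replace("Threads","")
    return int(l)
def getCores(cpu):
    # The core count is the first token: a number ("12 Cores") or a word ("Hexa Core")
    cpu = cpu.strip()
    l=cpu.split(",")[0].split(" ")[0]
    try:
        return int(l)
    except ValueError:
        if "hexa" in l.lower():
            return 6
        elif "octa" in l.lower():
            return 8
        elif "dual" in l.lower():
            return 2
        elif "quad" in l.lower():
            return 4
        # Unrecognized core description
        return 0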
    
data['Thread']=data['CPU'].apply(getThreads)
data['Cores']=data['CPU'].apply(getCores)

data.drop('Unnamed: 0.1',axis = 1,inplace = True)
data.drop('Unnamed: 0',axis = 1,inplace = True)
data.head()
brand name price spec_rating processor CPU Ram Ram_type ROM ROM_type GPU display_size resolution_width resolution_height OS warranty Thread Cores
0 HP Victus 15-fb0157AX Gaming Laptop 49900 73.000000 5th Gen AMD Ryzen 5 5600H Hexa Core, 12 Threads 8 DDR4 512 SSD 4GB AMD Radeon RX 6500M 15.6 1920.0 1080.0 Windows 11 OS 1 12 6
1 HP 15s-fq5007TU Laptop 39900 60.000000 12th Gen Intel Core i3 1215U Hexa Core (2P + 4E), 8 Threads 8 DDR4 512 SSD Intel UHD Graphics 15.6 1920.0 1080.0 Windows 11 OS 1 8 6
2 Acer One 14 Z8-415 Laptop 26990 69.323529 11th Gen Intel Core i3 1115G4 Dual Core, 4 Threads 8 DDR4 512 SSD Intel Iris Xe Graphics 14.0 1920.0 1080.0 Windows 11 OS 1 4 2
3 Lenovo Yoga Slim 6 14IAP8 82WU0095IN Laptop 59729 66.000000 12th Gen Intel Core i5 1240P 12 Cores (4P + 8E), 16 Threads 16 LPDDR5 512 SSD Intel Integrated Iris Xe 14.0 2240.0 1400.0 Windows 11 OS 1 16 12
5 Acer Extensa EX214-53 Laptop 39990 62.000000 12th Gen Intel Core i5 1240P 12 Cores (4P + 8E), 16 Threads 8 DDR4 512 SSD Intel Iris Xe Graphics 14.0 1920.0 1080.0 Windows 11 OS 1 16 12
data.info()
<class 'pandas.core.frame.DataFrame'>
Index: 877 entries, 0 to 892
Data columns (total 18 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   brand              877 non-null    object 
 1   name               877 non-null    object 
 2   price              877 non-null    int64  
 3   spec_rating        877 non-null    float64
 4   processor          877 non-null    object 
 5   CPU                877 non-null    object 
 6   Ram                877 non-null    int64  
 7   Ram_type           877 non-null    object 
 8   ROM                877 non-null    int64  
 9   ROM_type           877 non-null    object 
 10  GPU                877 non-null    object 
 11  display_size       877 non-null    float64
 12  resolution_width   877 non-null    float64
 13  resolution_height  877 non-null    float64
 14  OS                 877 non-null    object 
 15  warranty           877 non-null    int64  
 16  Thread             877 non-null    int64  
 17  Cores              877 non-null    int64  
dtypes: float64(4), int64(6), object(8)
memory usage: 130.2+ KB
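
The core and thread extraction in this section can also be written with a lookup table and a regular expression. This is only a sketch of an alternative, not the code used above: `parse_cpu` and `CORE_WORDS` are hypothetical names, and the function assumes CPU strings shaped like the ones in the table (for example `Hexa Core, 12 Threads` or `12 Cores (4P + 8E), 16 Threads`).

```python
import re

# Core counts that appear as words rather than digits.
CORE_WORDS = {"dual": 2, "quad": 4, "hexa": 6, "octa": 8}

def parse_cpu(cpu):
    """Return (cores, threads) from a CPU description; 0 means 'not found'."""
    tokens = cpu.strip().split(",")[0].split()
    first = tokens[0].lower() if tokens else ""
    # The first token is either a number or a word such as "Hexa".
    cores = int(first) if first.isdigit() else CORE_WORDS.get(first, 0)
    # The thread count, if present, is the number directly before "Threads".
    thread_match = re.search(r"(\d+)\s*Threads", cpu)
    threads = int(thread_match.group(1)) if thread_match else 0
    return cores, threads

for example in ["Hexa Core, 12 Threads",
                "12 Cores (4P + 8E), 16 Threads",
                "Octa Core (4P + 4E)"]:
    print(example, "->", parse_cpu(example))
```

Returning 0 for a missing value mirrors the behavior of the functions above; a missing-value marker such as `None` would be another reasonable choice.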

Correlation test#

numColForCoef=data.select_dtypes(exclude=["object"])
plot=sb.heatmap(numColForCoef.corr(),annot=True)
plt.show()
[Figure: correlation heatmap of all numeric columns]

It can be seen that the price is strongly correlated with Cores, Thread, resolution, ROM, RAM, and spec rating.

TrainingSet = data.drop(columns=['warranty','display_size'])
numColForCoef=TrainingSet.select_dtypes(exclude=["object"])
plot=sb.heatmap(numColForCoef.corr(),annot=True)
plt.show()
[Figure: correlation heatmap after dropping warranty and display_size]

After dropping warranty and display_size, the remaining numeric variables are the ones most strongly correlated with price.
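
The same selection can be read off numerically by ranking the columns by the absolute value of their correlation with price. The sketch below uses synthetic data (the `toy` frame, its values, and the `threshold` of 0.3 are all assumptions made for illustration), so it shows the technique rather than the result for the laptop data.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 200
ram = rng.choice([4, 8, 16, 32], size=n)
warranty = rng.choice([1, 2, 3], size=n)
# Synthetic price that depends on RAM but not on warranty.
price = 5000 * ram + rng.normal(0, 10000, size=n)
toy = pd.DataFrame({"price": price, "Ram": ram, "warranty": warranty})

# Absolute correlation of every other column with price, strongest first.
corr_with_price = (
    toy.corr()["price"].drop("price").abs().sort_values(ascending=False)
)
print(corr_with_price)

# Keep the columns above an arbitrary cutoff.
threshold = 0.3
selected = corr_with_price[corr_with_price >= threshold].index.tolist()
print(selected)
```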

TrainingSet['price'].skew()
2.865017269749211
TrainingSet['price_ln'] = np.log(TrainingSet['price'])
TrainingSet['price_ln'].skew()
0.4124667391274515
from scipy import stats
from scipy.stats import norm, skew 

def norm_test(data):
    fig, ax = plt.subplots(nrows=1, ncols=2, figsize=(14, 5))
    sns.distplot(data, fit=norm, ax=ax[0])
    stats.probplot(data, plot=ax[1])
    plt.show()

norm_test(TrainingSet.price)
[Figure: histogram with fitted normal curve and probability plot for price]
norm_test(TrainingSet.price_ln)
[Figure: histogram with fitted normal curve and probability plot for price_ln]

After the logarithmic transformation, the skewness drops from about 2.87 to about 0.41, and the distribution of the price is much closer to a normal distribution.
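
The effect of the log transform on skewness can be reproduced on synthetic data. The sketch below draws made-up right-skewed "prices" from a log-normal distribution (the parameters are arbitrary assumptions, not estimates from the laptop data) and compares the skewness before and after taking the logarithm.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Synthetic right-skewed values; not the laptop prices.
toy_price = pd.Series(rng.lognormal(mean=11, sigma=0.6, size=1000))

skew_raw = toy_price.skew()
skew_log = np.log(toy_price).skew()
print(f"skew before log: {skew_raw:.2f}, after log: {skew_log:.2f}")
```

A skewness close to 0 after the transform is what makes the log price a more comfortable target for linear regression.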

Machine Learning#

In this step, I use linear regression and a decision tree to predict the log-transformed price from the selected features.

features = ['spec_rating','Ram','ROM','resolution_width','resolution_height','Thread','Cores']
X = TrainingSet[features]
y = TrainingSet['price_ln']

# Note: train_size=0.2 fits on only 20% of the rows and tests on the other 80%
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.2, random_state=42)
reg=LinearRegression()
reg.fit(X_train, y_train)
print('Train Score: ' + str(reg.score(X_train, y_train)))
print('Test Score: ' + str(reg.score(X_test, y_test)))
print('Score: ' + str(reg.score(TrainingSet[features], TrainingSet['price_ln'])))
Train Score: 0.7882271170948115
Test Score: 0.7612986644120081
Score: 0.7680228504845487
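
Note that the split above passes `train_size=0.2`, so the model is fitted on about a fifth of the rows and evaluated on the rest. The sketch below compares this with the more common `test_size=0.2` on a plain index array of the same length as the filtered data (877 rows). Whether an 80/20 split was intended is an assumption; the scores reported above come from the code as written.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rows = np.arange(877)  # same number of rows as the filtered data set

# train_size=0.2 keeps about 20% of the rows for fitting.
small_train, large_test = train_test_split(rows, train_size=0.2, random_state=42)

# test_size=0.2 keeps about 80% of the rows for fitting.
large_train, small_test = train_test_split(rows, test_size=0.2, random_state=42)

print("train_size=0.2:", len(small_train), "train rows,", len(large_test), "test rows")
print("test_size=0.2: ", len(large_train), "train rows,", len(small_test), "test rows")
```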

Here I convert the log-price predictions from the linear regression back to the original price scale.

TrainingSet['Pred_price'] = np.exp(reg.predict(TrainingSet[features]))
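
Because the model is fitted on the log price, the R² scores above are measured on the log scale. To express the error in currency units, the predictions and the targets can both be transformed back with `np.exp` before computing an error metric. The following is a sketch on synthetic data (the features, coefficients, and noise level are made up), not an evaluation of the laptop model.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
# Synthetic stand-in: log price is linear in two made-up features plus noise.
X_toy = rng.normal(size=(300, 2))
y_log = 11 + 0.4 * X_toy[:, 0] + 0.2 * X_toy[:, 1] + rng.normal(0, 0.1, 300)

X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_log, test_size=0.2, random_state=0)
model = LinearRegression().fit(X_tr, y_tr)

# Back-transform both the targets and the predictions before measuring error.
price_true = np.exp(y_te)
price_pred = np.exp(model.predict(X_te))

rmse = np.sqrt(mean_squared_error(price_true, price_pred))
mae = mean_absolute_error(price_true, price_pred)
print(f"RMSE: {rmse:.0f}, MAE: {mae:.0f}")
```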

Below is the decision tree. I tried different values of max_depth to get the best fit.

DT = DecisionTreeRegressor(max_depth=10, min_samples_split=5, min_samples_leaf=6, max_features='sqrt',random_state=10)
DT.fit(X_train, y_train)
print('Train Score: ' + str(DT.score(X_train, y_train)))
print('Test Score: ' + str(DT.score(X_test, y_test)))
print('Score: ' + str(DT.score(TrainingSet[features], TrainingSet['price_ln'])))
Train Score: 0.8526644856850252
Test Score: 0.7197774115465023
Score: 0.7528334930481183
plt.figure(figsize=(200,100))
plot_tree(DT, feature_names=features, filled=True)
plt.show()
[Figure: plot of the fitted decision tree]
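
Trying different depths can be automated with a loop and cross-validation. The sketch below is one possible way to do it, not the procedure used for the tree above: the data are synthetic, and the depth grid and the 5-fold setting are arbitrary assumptions.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
# Synthetic regression problem with three made-up features.
X_toy = rng.uniform(0, 10, size=(300, 3))
y_toy = np.sin(X_toy[:, 0]) + 0.5 * X_toy[:, 1] + rng.normal(0, 0.2, 300)

depth_scores = {}
for depth in range(2, 13, 2):
    tree = DecisionTreeRegressor(max_depth=depth, min_samples_leaf=6, random_state=10)
    # Mean R^2 over 5 cross-validation folds for this depth.
    depth_scores[depth] = cross_val_score(tree, X_toy, y_toy, cv=5).mean()

best_depth = max(depth_scores, key=depth_scores.get)
print(depth_scores)
print("best max_depth:", best_depth)
```

Comparing cross-validated scores rather than training scores helps avoid choosing a depth that only memorizes the training rows.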

Here I convert the log-price predictions from the decision tree back to the original price scale.

TrainingSet['DT_Pred_price']=np.exp(DT.predict(TrainingSet[features]))
TrainingSet['counter'] = range(len(TrainingSet))
P1 = alt.Chart(TrainingSet.head(100)).mark_circle(color = 'red').encode(
    y = 'price',
    x = 'counter'
).properties(
    width = 600,
    height = 300
)
P2 = alt.Chart(TrainingSet.head(100)).mark_circle(color = 'blue').encode(
    y = 'Pred_price',
    x = 'counter'
).properties(
    width = 600,
    height = 300
)
P3 = alt.Chart(TrainingSet.head(100)).mark_circle(color = 'green').encode(
    y = 'DT_Pred_price',
    x = 'counter'
).properties(
    width = 600,
    height = 300
)

Comparison of the linear regression predictions with the true values for the first 100 laptops.

Red is the actual price, blue is the predicted price.

P1+P2

Comparison of the decision tree predictions with the true values for the first 100 laptops.

Red is the actual price, green is the predicted price.

P1+P3

Summary#

In this project, I analyzed the impact of different hardware on laptop prices. The analysis shows that price is strongly correlated with the hardware specifications, and the predictions are reasonably close to the real prices: on the test set, the R² on the log price is about 0.76 for linear regression and about 0.72 for the decision tree. The most important factors are RAM and threads. Graphics cards also have a large impact on price, but the GPU column in this dataset is free text that is difficult to analyze, so it was left out of the models, which limits the accuracy of the predictions.
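
One way to check which features a fitted tree relies on most is its `feature_importances_` attribute. The sketch below shows this on synthetic data: the feature names are borrowed only as labels, and the values and coefficients are made up, so it does not reproduce the ranking for the laptop data.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
names = ["Ram", "Thread", "noise"]  # labels for made-up columns
X_toy = rng.uniform(0, 1, size=(400, 3))
# The target depends strongly on column 0, weakly on column 1, not on column 2.
y_toy = 3.0 * X_toy[:, 0] + 1.0 * X_toy[:, 1] + rng.normal(0, 0.05, 400)

tree = DecisionTreeRegressor(max_depth=6, random_state=0).fit(X_toy, y_toy)

# Importances are non-negative and sum to 1 across features.
importance = dict(zip(names, tree.feature_importances_))
for name, value in sorted(importance.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {value:.3f}")
```

Applied to the notebook, the same attribute on `DT` together with the `features` list would give the ranking for the actual model.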

References#

Dataset: https://www.kaggle.com/datasets/jacksondivakarr/laptop-price-prediction-dataset

Decision Tree Regression: https://scikit-learn.org/stable/auto_examples/tree/plot_tree_regression.html