Predicting computer prices using multiple linear regression and decision trees


Author: Yu Chen

Course Project, UC Irvine, Math 10, S24

I would like to post my notebook on the course’s website. [Yes]

Introduction

In this project, I analyze how the hardware of different laptops affects their price, and which parameters have the greatest impact. On Kaggle (https://www.kaggle.com/datasets/jacksondivakarr/laptop-price-prediction-dataset), I found a dataset with information about more than 800 computer models, including price, RAM, ROM, screen size, CPU, etc.

The goal of this project is to predict the price of a computer from its hardware specifications, which may serve as a price reference for shoppers who don’t know much about laptops.

Importing

import pandas as pd
import altair as alt
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn import metrics
import warnings
import seaborn as sb
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score, explained_variance_score
from sklearn.tree import DecisionTreeRegressor, plot_tree
warnings.filterwarnings("ignore")

Basic information about the data and visualization

data=pd.read_csv('LaptopPricePredictionDataset.csv')
data.head()

	Unnamed: 0.1	Unnamed: 0	brand	name	price	spec_rating	processor	CPU	Ram	Ram_type	ROM	ROM_type	GPU	display_size	resolution_width	resolution_height	OS	warranty
0	0	0	HP	Victus 15-fb0157AX Gaming Laptop	49900	73.000000	5th Gen AMD Ryzen 5 5600H	Hexa Core, 12 Threads	8GB	DDR4	512GB	SSD	4GB AMD Radeon RX 6500M	15.6	1920.0	1080.0	Windows 11 OS	1
1	1	1	HP	15s-fq5007TU Laptop	39900	60.000000	12th Gen Intel Core i3 1215U	Hexa Core (2P + 4E), 8 Threads	8GB	DDR4	512GB	SSD	Intel UHD Graphics	15.6	1920.0	1080.0	Windows 11 OS	1
2	2	2	Acer	One 14 Z8-415 Laptop	26990	69.323529	11th Gen Intel Core i3 1115G4	Dual Core, 4 Threads	8GB	DDR4	512GB	SSD	Intel Iris Xe Graphics	14.0	1920.0	1080.0	Windows 11 OS	1
3	3	3	Lenovo	Yoga Slim 6 14IAP8 82WU0095IN Laptop	59729	66.000000	12th Gen Intel Core i5 1240P	12 Cores (4P + 8E), 16 Threads	16GB	LPDDR5	512GB	SSD	Intel Integrated Iris Xe	14.0	2240.0	1400.0	Windows 11 OS	1
4	4	4	Apple	MacBook Air 2020 MGND3HN Laptop	69990	69.323529	Apple M1	Octa Core (4P + 4E)	8GB	DDR4	256GB	SSD	Apple M1 Integrated Graphics	13.3	2560.0	1600.0	Mac OS	1

data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 893 entries, 0 to 892
Data columns (total 18 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   Unnamed: 0.1       893 non-null    int64  
 1   Unnamed: 0         893 non-null    int64  
 2   brand              893 non-null    object 
 3   name               893 non-null    object 
 4   price              893 non-null    int64  
 5   spec_rating        893 non-null    float64
 6   processor          893 non-null    object 
 7   CPU                893 non-null    object 
 8   Ram                893 non-null    object 
 9   Ram_type           893 non-null    object 
 10  ROM                893 non-null    object 
 11  ROM_type           893 non-null    object 
 12  GPU                893 non-null    object 
 13  display_size       893 non-null    float64
 14  resolution_width   893 non-null    float64
 15  resolution_height  893 non-null    float64
 16  OS                 893 non-null    object 
 17  warranty           893 non-null    int64  
dtypes: float64(4), int64(4), object(10)
memory usage: 125.7+ KB

There are no missing values in the data set and no data needs to be deleted.

data.describe().round(2)

	Unnamed: 0.1	Unnamed: 0	price	spec_rating	display_size	resolution_width	resolution_height	warranty
count	893.00	893.00	893.00	893.00	893.00	893.00	893.00	893.00
mean	467.14	521.38	79907.41	69.38	15.17	2035.39	1218.32	1.08
std	270.21	299.92	60880.04	5.54	0.94	426.08	326.76	0.33
min	0.00	0.00	9999.00	60.00	11.60	1080.00	768.00	0.00
25%	235.00	265.00	44500.00	66.00	14.00	1920.00	1080.00	1.00
50%	467.00	531.00	61990.00	69.32	15.60	1920.00	1080.00	1.00
75%	702.00	784.00	90990.00	71.00	15.60	1920.00	1200.00	1.00
max	930.00	1019.00	450039.00	89.00	18.00	3840.00	3456.00	3.00

# Helper that draws a bar chart from a value_counts() Series and labels each bar with its count
def draw_bar(df, title):
    fig, ax = plt.subplots(figsize=(12, 5))
    ax.bar(df.index, df.values)
    ax.set_title(title)
    # Write the count just above each bar
    for i,j in enumerate(df.values):
        ax.text(i,j+2,str(j),va='bottom',ha='center')
    plt.xticks(rotation=60)
    plt.show()

# Apple computers have fixed prices and a different operating system, so Apple rows are removed here
data=data[~data['brand'].str.contains('Apple')]
df_count = data['brand'].value_counts()
draw_bar(df_count, 'Quantity of each brand')

../_images/691e926b4b39d17d95382065bc422c38e8e279d6608f54d1ea237c5aa46b6620.png

The most common brands include HP, Lenovo, and Asus.

df_count = data['Ram_type'].value_counts()
draw_bar(df_count, 'Quantity of each RAM type')

../_images/0a453575aa7a888c8dabcaf207d333b23cfcdab8b75aee8f650524ae34ad676e.png

The most common memory type is DDR4.

df_system = data['OS'].value_counts()
draw_bar(df_system,'Quantity of each operating system')

../_images/07d9bbe2d5188d3ec24cd21f45404c2f2927cdccf685e135de50c6c8aaa320c5.png

Optimize data

In this step, I convert some columns into numeric types that are easier to analyze. For example, 1TB needs to be converted into 1024GB before analysis. Most of the CPU entries are free text, so the number of cores and threads has to be extracted from them.

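# Strip the unit suffix so that Ram and ROM become integers
data['Ram']=data['Ram'].astype(str).apply(lambda r:int( r.replace("GB","").replace("TB","") ))
data['ROM']=data['ROM'].astype(str).apply(lambda r:int( r.replace("GB","").replace("TB","") ))
# Values below 10 are treated as TB entries (e.g. 1TB) and converted to GB
data['ROM']=data['ROM'].apply(lambda r: r*1024 if r < 10 else r)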
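The unit conversion above relies on every value below 10 being a TB entry. As a rough alternative sketch, the unit can be read directly from the string. The helper name `parse_storage_gb` is hypothetical and not part of the original notebook, and it assumes every entry looks like `512GB` or `1TB`.

```python
import re

def parse_storage_gb(text):
    """Convert a capacity string such as '512GB' or '1TB' to gigabytes.

    Hypothetical helper: assumes the value is an integer followed by GB or TB.
    """
    match = re.fullmatch(r"\s*(\d+)\s*(GB|TB)\s*", text, flags=re.IGNORECASE)
    if match is None:
        raise ValueError(f"Unrecognized capacity: {text!r}")
    value = int(match.group(1))
    unit = match.group(2).upper()
    # 1 TB is counted as 1024 GB, as in the text above
    return value * 1024 if unit == "TB" else value

print(parse_storage_gb("512GB"))
print(parse_storage_gb("1TB"))
```

With this approach, an unexpected format raises an error instead of being silently converted.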

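def getThreads(cpu):
    # The thread count is the last comma-separated part, e.g. "Hexa Core, 12 Threads" -> 12
    l=0
    if len(cpu.split(","))>1:
        l=cpu.split(",")[-1].replace("Threads","")
    # Entries without a thread part, e.g. "Octa Core (4P + 4E)", give 0
    return int(l)
def getCores(cpu):
    # The core count is the first word: a number ("12 Cores") or a word ("Hexa Core")
    cpu = cpu.strip()
    l=cpu.split(",")[0].split(" ")[0]
    try:
        return int(l)
    except ValueError:
        if "hexa" in l.lower():
            return 6
        elif "octa" in l.lower():
            return 8
        elif "dual" in l.lower():
            return 2
        elif "quad" in l.lower():
            return 4
        # Unknown wording is recorded as 0 cores
        return 0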
    
data['Thread']=data['CPU'].apply(getThreads)
data['Cores']=data['CPU'].apply(getCores)
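To check the extraction logic on a few of the CPU strings shown in the table above, a compact variant can be sketched as follows. The names `CORE_WORDS` and `parse_cpu` are hypothetical, and the function matches the core word exactly rather than as a substring, so it is an illustration and not a drop-in replacement.

```python
import re

# Word-to-number table for core counts that are spelled out in the CPU column
CORE_WORDS = {"dual": 2, "quad": 4, "hexa": 6, "octa": 8}

def parse_cpu(cpu):
    """Return (cores, threads) from a string like 'Hexa Core, 12 Threads'."""
    first_word = cpu.strip().split(",")[0].split(" ")[0].lower()
    if first_word.isdigit():
        cores = int(first_word)
    else:
        cores = CORE_WORDS.get(first_word, 0)  # 0 marks an unknown wording
    thread_match = re.search(r"(\d+)\s*Threads", cpu)
    threads = int(thread_match.group(1)) if thread_match else 0
    return cores, threads

for cpu in ["Hexa Core, 12 Threads",
            "12 Cores (4P + 8E), 16 Threads",
            "Octa Core (4P + 4E)"]:
    print(cpu, "->", parse_cpu(cpu))
```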

data.drop('Unnamed: 0.1',axis = 1,inplace = True)
data.drop('Unnamed: 0',axis = 1,inplace = True)

data.head()

	brand	name	price	spec_rating	processor	CPU	Ram	Ram_type	ROM	ROM_type	GPU	display_size	resolution_width	resolution_height	OS	warranty	Thread	Cores
0	HP	Victus 15-fb0157AX Gaming Laptop	49900	73.000000	5th Gen AMD Ryzen 5 5600H	Hexa Core, 12 Threads	8	DDR4	512	SSD	4GB AMD Radeon RX 6500M	15.6	1920.0	1080.0	Windows 11 OS	1	12	6
1	HP	15s-fq5007TU Laptop	39900	60.000000	12th Gen Intel Core i3 1215U	Hexa Core (2P + 4E), 8 Threads	8	DDR4	512	SSD	Intel UHD Graphics	15.6	1920.0	1080.0	Windows 11 OS	1	8	6
2	Acer	One 14 Z8-415 Laptop	26990	69.323529	11th Gen Intel Core i3 1115G4	Dual Core, 4 Threads	8	DDR4	512	SSD	Intel Iris Xe Graphics	14.0	1920.0	1080.0	Windows 11 OS	1	4	2
3	Lenovo	Yoga Slim 6 14IAP8 82WU0095IN Laptop	59729	66.000000	12th Gen Intel Core i5 1240P	12 Cores (4P + 8E), 16 Threads	16	LPDDR5	512	SSD	Intel Integrated Iris Xe	14.0	2240.0	1400.0	Windows 11 OS	1	16	12
5	Acer	Extensa EX214-53 Laptop	39990	62.000000	12th Gen Intel Core i5 1240P	12 Cores (4P + 8E), 16 Threads	8	DDR4	512	SSD	Intel Iris Xe Graphics	14.0	1920.0	1080.0	Windows 11 OS	1	16	12

data.info()

<class 'pandas.core.frame.DataFrame'>
Index: 877 entries, 0 to 892
Data columns (total 18 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   brand              877 non-null    object 
 1   name               877 non-null    object 
 2   price              877 non-null    int64  
 3   spec_rating        877 non-null    float64
 4   processor          877 non-null    object 
 5   CPU                877 non-null    object 
 6   Ram                877 non-null    int64  
 7   Ram_type           877 non-null    object 
 8   ROM                877 non-null    int64  
 9   ROM_type           877 non-null    object 
 10  GPU                877 non-null    object 
 11  display_size       877 non-null    float64
 12  resolution_width   877 non-null    float64
 13  resolution_height  877 non-null    float64
 14  OS                 877 non-null    object 
 15  warranty           877 non-null    int64  
 16  Thread             877 non-null    int64  
 17  Cores              877 non-null    int64  
dtypes: float64(4), int64(6), object(8)
memory usage: 130.2+ KB

Correlation test

numColForCoef=data.select_dtypes(exclude=["object"])
plot=sb.heatmap(numColForCoef.corr(),annot=True)
plt.show()

../_images/c9ae95422c79a37d7d233e525afefc9dd94dfe56574e7f7a92a12baaf375aa5b.png

The heatmap shows that price is strongly correlated with Cores, Thread, resolution, ROM, RAM, and spec rating. The other two numeric columns, warranty and display_size, are dropped in the next step.
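A heatmap can be hard to read when there are many columns. One way to make the ranking explicit is to sort the correlations with price, sketched here on a small synthetic table. The data below is made up purely for illustration and only borrows two column names from the dataset.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 500
# Synthetic stand-in for the laptop table; these numbers are invented
demo = pd.DataFrame({
    "Ram": rng.choice([4, 8, 16, 32], size=n),
    "warranty": rng.choice([1, 2, 3], size=n),
})
demo["price"] = 3000 * demo["Ram"] + rng.normal(0, 10000, size=n)

# Correlation of every numeric column with price, strongest first
price_corr = (
    demo.select_dtypes(exclude=["object"])
    .corr()["price"]
    .drop("price")
    .sort_values(key=abs, ascending=False)
)
print(price_corr)
```

Applied to the notebook's `data`, the same chain would list the numeric columns in order of their linear association with price.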

TrainingSet = data.drop(columns=['warranty','display_size'])

numColForCoef=TrainingSet.select_dtypes(exclude=["object"])
plot=sb.heatmap(numColForCoef.corr(),annot=True)
plt.show()

../_images/033b6570bc09cf72ebc141a43b51c61ce1efaced51a6b55f572a4f96b74d9d21.png

After this step, only the numeric variables that are highly correlated with price remain. Next, I check the skewness of the price distribution.

TrainingSet['price'].skew()

2.865017269749211

TrainingSet['price_ln'] = np.log(TrainingSet['price'])
TrainingSet['price_ln'].skew()

0.4124667391274515

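from scipy import stats
from scipy.stats import norm

# Left panel: histogram with a fitted normal curve; right panel: normal Q-Q plot
def norm_test(data):
    fig, ax = plt.subplots(nrows=1, ncols=2, figsize=(14, 5))
    # Note: sns.distplot is deprecated in newer seaborn releases
    sns.distplot(data, fit=norm, ax=ax[0])
    stats.probplot(data, plot=ax[1])
    plt.show()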

norm_test(TrainingSet.price)

../_images/ae3cf379c1cac16bc67377db6af2a5df1677019cfe476021e95b071b17a67566.png

norm_test(TrainingSet.price_ln)

../_images/d048e1ab27c1e672ecb1b71ac612cf33ecc1ca9b86f1cc7c668ec9e0b33ac07d.png

After the logarithmic transformation, the data is closer to a normal distribution.
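The effect of the log transform on skewness can be reproduced on synthetic data. The sketch below draws right-skewed values from a log-normal distribution (an assumption made for illustration, not a claim about the laptop prices) and compares the skewness before and after taking the logarithm.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Synthetic right-skewed "prices"; this is not the laptop data
price = pd.Series(rng.lognormal(mean=11.0, sigma=0.6, size=5000))

raw_skew = price.skew()
log_skew = np.log(price).skew()
print(f"skew before log: {raw_skew:.2f}, after log: {log_skew:.2f}")
```

A skewness near zero after the transform indicates a roughly symmetric distribution, which is the same pattern seen in the drop from about 2.87 to about 0.41 above.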

Machine Learning

In this step, I will use linear regression and decision trees to analyze the variables.

features = ['spec_rating','Ram','ROM','resolution_width','resolution_height','Thread','Cores']
X = TrainingSet[features]
y = TrainingSet['price_ln']

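# Note: train_size=0.2 fits the models on 20% of the rows and evaluates them on the remaining 80%;
# all scores reported below come from this split
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.2, random_state=42)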
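Because `train_size` and `test_size` are easy to mix up, the following sketch shows how each argument divides a placeholder array of 1000 rows. The arrays `X_demo` and `y_demo` are hypothetical and exist only to show the resulting sizes.

```python
import numpy as np
from sklearn.model_selection import train_test_split

X_demo = np.arange(1000).reshape(-1, 1)  # 1000 placeholder rows
y_demo = np.arange(1000)

# train_size=0.2: the model is fitted on 20% and evaluated on 80%
a_train, a_test, _, _ = train_test_split(X_demo, y_demo, train_size=0.2, random_state=42)
# test_size=0.2: the more common arrangement, 80% for fitting and 20% held out
b_train, b_test, _, _ = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

print(len(a_train), len(a_test))
print(len(b_train), len(b_test))
```

Which split is preferable depends on the goal; a larger training share usually gives the model more data to learn from, at the cost of a smaller evaluation set.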

reg=LinearRegression()
reg.fit(X_train, y_train)
print('Train Score: ' + str(reg.score(X_train, y_train)))
print('Test Score: ' + str(reg.score(X_test, y_test)))
print('Score: ' + str(reg.score(TrainingSet[features], TrainingSet['price_ln'])))

Train Score: 0.7882271170948115
Test Score: 0.7612986644120081
Score: 0.7680228504845487

Here I convert the linear regression predictions from log price back to price.

TrainingSet['Pred_price'] = np.exp(reg.predict(TrainingSet[features]))
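Since the regression is fitted on `price_ln`, the exponential undoes the logarithm. Writing the seven features as $x_1, \dots, x_7$ and the fitted coefficients as $\beta_0, \dots, \beta_7$ (notation introduced here for illustration), the relationship is:

```latex
\ln(\widehat{\text{price}}) = \beta_0 + \sum_{i=1}^{7} \beta_i x_i
\quad\Longrightarrow\quad
\widehat{\text{price}} = \exp\!\Big(\beta_0 + \sum_{i=1}^{7} \beta_i x_i\Big)
```

One consequence is that the model is multiplicative on the price scale: increasing $x_i$ by one unit multiplies the predicted price by $e^{\beta_i}$.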

Below is the decision tree I kept after trying different tree depths to get the best fit.

DT = DecisionTreeRegressor(max_depth=10, min_samples_split=5, min_samples_leaf=6, max_features='sqrt',random_state=10)
DT.fit(X_train, y_train)
print('Train Score: ' + str(DT.score(X_train, y_train)))
print('Test Score: ' + str(DT.score(X_test, y_test)))
print('Score: ' + str(DT.score(TrainingSet[features], TrainingSet['price_ln'])))

Train Score: 0.8526644856850252
Test Score: 0.7197774115465023
Score: 0.7528334930481183
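The depth search itself is not shown in the notebook. A minimal sketch of such a search, on synthetic data and with assumed settings (the candidate depths, `min_samples_leaf=6`, and the data-generating formula are all illustrative choices), could look like this:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
# Synthetic regression problem standing in for X and y from the notebook
X_demo = rng.uniform(0, 10, size=(600, 3))
y_demo = np.sin(X_demo[:, 0]) + 0.5 * X_demo[:, 1] + rng.normal(0, 0.3, size=600)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

results = {}
for depth in [1, 2, 4, 6, 8, 10]:
    tree = DecisionTreeRegressor(max_depth=depth, min_samples_leaf=6, random_state=10)
    tree.fit(X_tr, y_tr)
    # Store (train R^2, held-out R^2) for each depth
    results[depth] = (tree.score(X_tr, y_tr), tree.score(X_te, y_te))
    print(depth, round(results[depth][0], 3), round(results[depth][1], 3))

best_depth = max(results, key=lambda d: results[d][1])
print("best depth on held-out data:", best_depth)
```

A growing gap between the train and held-out scores, like the 0.85 versus 0.72 reported above, is a typical sign of overfitting. Choosing the depth on the same data used for the final score is optimistic, so cross-validation would be a more careful option.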

plt.figure(figsize=(200,100))
plot_tree(DT, feature_names=features, filled=True)
plt.show()

../_images/43036ad918aede9c22b2130895d525a37fa3f38015c7e54cb3a45dab06b13dc1.png

Here I convert the decision tree predictions from log price back to price.

TrainingSet['DT_Pred_price']=np.exp(DT.predict(TrainingSet[features]))
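One way to see which features a fitted tree relies on is its `feature_importances_` attribute. The sketch below uses synthetic data with made-up coefficients, so the ranking it prints says nothing about the laptop dataset; it only illustrates the mechanics.

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
# Synthetic features named after two notebook columns; the values are invented
X_demo = pd.DataFrame({
    "Ram": rng.choice([4, 8, 16, 32], size=400),
    "Thread": rng.choice([4, 8, 12, 16], size=400),
    "noise": rng.normal(size=400),
})
y_demo = 0.05 * X_demo["Ram"] + 0.02 * X_demo["Thread"] + rng.normal(0, 0.05, size=400)

tree = DecisionTreeRegressor(max_depth=5, random_state=10).fit(X_demo, y_demo)
# Importances are normalized so that they sum to 1
importances = pd.Series(tree.feature_importances_, index=X_demo.columns)
print(importances.sort_values(ascending=False))
```

In the notebook, the equivalent call would be `pd.Series(DT.feature_importances_, index=features)`.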

TrainingSet['counter'] = range(len(TrainingSet))

P1 = alt.Chart(TrainingSet.head(100)).mark_circle(color = 'red').encode(
    y = 'price',
    x = 'counter'
).properties(
    width = 600,
    height = 300
)
P2 = alt.Chart(TrainingSet.head(100)).mark_circle(color = 'blue').encode(
    y = 'Pred_price',
    x = 'counter'
).properties(
    width = 600,
    height = 300
)
P3 = alt.Chart(TrainingSet.head(100)).mark_circle(color = 'green').encode(
    y = 'DT_Pred_price',
    x = 'counter'
).properties(
    width = 600,
    height = 300
)

Comparison of linear regression predictions and true values for the first 100 rows.

Red is the actual value, blue is the predicted value.

P1+P2

Comparison of decision tree predictions and true values for the first 100 rows.

Red is the actual value, green is the predicted value.

P1+P3

Summary

In this project, I analyzed the impact of different hardware on laptop prices. The analysis shows that price is strongly correlated with hardware specifications, and the predictions are relatively close to the real prices: on log price, the linear regression reaches a test R² of about 0.76 and the decision tree about 0.72. The most important factors are RAM and threads. Graphics cards also have a large impact on price, but they are difficult to analyze in this dataset, which introduces some inaccuracy in the predictions.

References

Dataset: https://www.kaggle.com/datasets/jacksondivakarr/laptop-price-prediction-dataset

Decision Tree Regression: https://scikit-learn.org/stable/auto_examples/tree/plot_tree_regression.html