Youtube Video Trending Prediction

Introduction#

In this project, I tried to conduct a comprehensive analysis of trending YouTube videos in Japan. The goal was to identify key factors that influence the popularity of these videos. Through a systematic approach that included data cleaning, exploratory data analysis, and predictive modeling, I was able to gain several important insights.

Exploratory Data Analysis#

import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import sklearn as sk 
import random

SEED = 123
np.random.seed(SEED)

# videos = pd.read_csv("JPvideos.csv", encoding='shift_jis', errors='replace')
videos = pd.read_csv('JPvideos.csv', encoding='utf-8', quotechar='"', delimiter=',', encoding_errors='ignore')
videos['category_id'] = pd.to_numeric(videos['category_id'])
videos['views'] = pd.to_numeric(videos['views'])
videos['tags'] = videos['tags'].str.split('|')
videos['like_dislike_ratio'] = videos['likes']/videos['dislikes']
videos['like_views_ratio'] = videos['likes']/videos['views']
videos['publish_time'] = pd.to_datetime(videos['publish_time'])

videos.head()

	video_id	trending_date	title	channel_title	category_id	publish_time	tags	views	likes	dislikes	comment_count	thumbnail_link	comments_disabled	ratings_disabled	video_error_or_removed	description	like_dislike_ratio	like_views_ratio
0	5ugKfHgsmYw	18.07.02	陸自ヘリ、垂直に落下＝路上の車が撮影	時事通信映像センター	25	2018-02-06 03:04:37+00:00	[事故, "佐賀", "佐賀県", "ヘリコプター", "ヘリ", "自衛隊", "墜落",...	188085	591	189	0	https://i.ytimg.com/vi/5ugKfHgsmYw/default.jpg	True	False	False	佐賀県神埼市の民家に墜落した陸上自衛隊のＡＨ６４Ｄ戦闘ヘリコプターが垂直に落下する様子を、近...	3.126984	0.003142
1	ohObafdd34Y	18.07.02	イッテQ お祭り男宮川×手越巨大ブランコ②	神谷えりな Kamiya Erina 2	1	2018-02-06 04:01:56+00:00	[[none]]	90929	442	88	174	https://i.ytimg.com/vi/ohObafdd34Y/default.jpg	False	False	False	NaN	5.022727	0.004861
2	aBr2kKAHN6M	18.07.02	Live Views of Starman	SpaceX	28	2018-02-06 21:38:22+00:00	[[none]]	6408303	165892	2331	3006	https://i.ytimg.com/vi/aBr2kKAHN6M/default.jpg	False	False	False	NaN	71.167739	0.025887
3	5wNnwChvmsQ	18.07.02	東京ディズニーリゾートの元キャストが暴露した秘密5選	アシタノワダイ	25	2018-02-06 06:08:49+00:00	[アシタノワダイ]	96255	1165	277	545	https://i.ytimg.com/vi/5wNnwChvmsQ/default.jpg	False	False	False	東京ディズニーリゾートの元キャストが暴露した秘密5選\n\nかたまりクリエイトさま\n【検証...	4.205776	0.012103
4	B7J47qFvdsk	18.07.02	榮倉奈々、衝撃の死んだふり！映画『家に帰ると妻が必ず死んだふりをしています。』特報	シネマトゥデイ	1	2018-02-06 02:30:00+00:00	[[none]]	108408	1336	74	201	https://i.ytimg.com/vi/B7J47qFvdsk/default.jpg	False	False	False	家に帰ってきたサラリーマンのじゅん（安田顕）は、玄関で血を出して倒れている妻ちえ（榮倉奈々）...	18.054054	0.012324

videos['trending_date'] = pd.to_datetime(videos['trending_date'], format='%y.%d.%m', utc=True)
videos['time_till_trending'] = videos['trending_date'] - videos['publish_time']

# Align datetime object format for tableau analysis
videos['publish_time'] = videos['publish_time'].dt.strftime('%Y-%m-%dT%H:%M:%S')
videos['trending_date'] = videos['trending_date'].dt.strftime('%Y-%m-%dT%H:%M:%S')

videos['hours_till_trending'] = videos['time_till_trending'].map(lambda x: x.total_seconds()) / 3600
videos['days_till_trending'] = videos['time_till_trending'].map(lambda x: x.total_seconds()) / (3600*24)

videos[['trending_date','publish_time','time_till_trending','days_till_trending']].head()

	trending_date	publish_time	time_till_trending	days_till_trending
0	2018-02-07T00:00:00	2018-02-06T03:04:37	0 days 20:55:23	0.871794
1	2018-02-07T00:00:00	2018-02-06T04:01:56	0 days 19:58:04	0.831991
2	2018-02-07T00:00:00	2018-02-06T21:38:22	0 days 02:21:38	0.098356
3	2018-02-07T00:00:00	2018-02-06T06:08:49	0 days 17:51:11	0.743877
4	2018-02-07T00:00:00	2018-02-06T02:30:00	0 days 21:30:00	0.895833

videos.to_csv('JP_videos.csv')

category = pd.read_json("JP_category_id.json")
category = pd.json_normalize(category['items'])

category['id'] = pd.to_numeric(category['id'])
#category.to_csv('JP_videos_category.csv')
videos = videos.merge(category,how='inner',left_on='category_id', right_on='id')

videos.dropna()
#videos.to_csv('JP_videos_complete.csv')
videos.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20505 entries, 0 to 20504
Data columns (total 27 columns):
 #   Column                  Non-Null Count  Dtype          
---  ------                  --------------  -----          
 video_id                20505 non-null  object         
 trending_date           20505 non-null  object         
 title                   20505 non-null  object         
 channel_title           20505 non-null  object         
 category_id             20505 non-null  int64          
 publish_time            20505 non-null  object         
 tags                    20505 non-null  object         
 views                   20505 non-null  int64          
 likes                   20505 non-null  int64          
 dislikes                20505 non-null  int64          
comment_count           20505 non-null  int64          
thumbnail_link          20505 non-null  object         
comments_disabled       20505 non-null  bool           
ratings_disabled        20505 non-null  bool           
video_error_or_removed  20505 non-null  bool           
description             18381 non-null  object         
like_dislike_ratio      19114 non-null  float64        
like_views_ratio        20505 non-null  float64        
time_till_trending      20505 non-null  timedelta64[ns]
hours_till_trending     20505 non-null  float64        
days_till_trending      20505 non-null  float64        
kind                    20505 non-null  object         
etag                    20505 non-null  object         
id                      20505 non-null  int64          
snippet.channelId       20505 non-null  object         
snippet.title           20505 non-null  object         
snippet.assignable      20505 non-null  bool           
dtypes: bool(4), float64(4), int64(6), object(12), timedelta64[ns](1)
memory usage: 3.7+ MB

sns.pairplot(videos, vars=videos.columns[7:11])

<seaborn.axisgrid.PairGrid at 0x282b47eb200>

../_images/2aaeaaf45b8934e96b81852b68f592bc24f6dd1e8018b71e56c6be4c9e6154d3.png

Finding relationships between numeric variables and possible linear relationships between likes and dislikes

Multiple linear lines in a pair plot suggest that there may be linear relationships between Views, Likes, Dislikes, and # of Comments, but these relationships may be different within each video category. or dominated by video content.

videos.sort_values(by=['likes'], ascending=False).head()

	video_id	trending_date	title	channel_title	category_id	publish_time	tags	views	likes	dislikes	...	like_views_ratio	time_till_trending	hours_till_trending	days_till_trending	kind	etag	id	snippet.channelId	snippet.title	snippet.assignable
16331	7C2z4GqqS5E	2018-05-20T00:00:00	BTS (방탄소년단) 'FAKE LOVE' Official MV	ibighit	10	2018-05-18T09:00:02	[BIGHIT, "빅히트", "방탄소년단", "BTS", "BANGTAN", "방탄...	62796390	4470923	119053	...	0.071197	1 days 14:59:58	38.999444	1.624977	youtube#videoCategory	"XI7nbFXulYBIpL0ayR_gDh3eu1k/nqRIq97-xe5XRZTxb...	10	UCBR8-60-B28hp2BmDPdntcQ	Music	True
16317	7C2z4GqqS5E	2018-05-20T00:00:00	BTS (방탄소년단) 'FAKE LOVE' Official MV	ibighit	10	2018-05-18T09:00:02	[BIGHIT, "빅히트", "방탄소년단", "BTS", "BANGTAN", "방탄...	62796390	4470923	119053	...	0.071197	1 days 14:59:58	38.999444	1.624977	youtube#videoCategory	"XI7nbFXulYBIpL0ayR_gDh3eu1k/nqRIq97-xe5XRZTxb...	10	UCBR8-60-B28hp2BmDPdntcQ	Music	True
16303	7C2z4GqqS5E	2018-05-20T00:00:00	BTS (방탄소년단) 'FAKE LOVE' Official MV	ibighit	10	2018-05-18T09:00:02	[BIGHIT, "빅히트", "방탄소년단", "BTS", "BANGTAN", "방탄...	62796390	4470923	119053	...	0.071197	1 days 14:59:58	38.999444	1.624977	youtube#videoCategory	"XI7nbFXulYBIpL0ayR_gDh3eu1k/nqRIq97-xe5XRZTxb...	10	UCBR8-60-B28hp2BmDPdntcQ	Music	True
16289	7C2z4GqqS5E	2018-05-20T00:00:00	BTS (방탄소년단) 'FAKE LOVE' Official MV	ibighit	10	2018-05-18T09:00:02	[BIGHIT, "빅히트", "방탄소년단", "BTS", "BANGTAN", "방탄...	62796390	4470923	119053	...	0.071197	1 days 14:59:58	38.999444	1.624977	youtube#videoCategory	"XI7nbFXulYBIpL0ayR_gDh3eu1k/nqRIq97-xe5XRZTxb...	10	UCBR8-60-B28hp2BmDPdntcQ	Music	True
14599	p8npDG2ulKQ	2018-05-08T00:00:00	BTS (방탄소년단) LOVE YOURSELF 轉 Tear 'Singularity'...	ibighit	10	2018-05-06T15:00:02	[BIGHIT, "빅히트", "방탄소년단", "BTS", "BANGTAN", "방탄"]	15875379	2285436	19400	...	0.143961	1 days 08:59:58	32.999444	1.374977	youtube#videoCategory	"XI7nbFXulYBIpL0ayR_gDh3eu1k/nqRIq97-xe5XRZTxb...	10	UCBR8-60-B28hp2BmDPdntcQ	Music	True

5 rows × 27 columns

videos.sort_values(by=['views'],ascending=False).head()

	video_id	trending_date	title	channel_title	category_id	publish_time	tags	views	likes	dislikes	...	like_views_ratio	time_till_trending	hours_till_trending	days_till_trending	kind	etag	id	snippet.channelId	snippet.title	snippet.assignable
16289	7C2z4GqqS5E	2018-05-20T00:00:00	BTS (방탄소년단) 'FAKE LOVE' Official MV	ibighit	10	2018-05-18T09:00:02	[BIGHIT, "빅히트", "방탄소년단", "BTS", "BANGTAN", "방탄...	62796390	4470923	119053	...	0.071197	1 days 14:59:58	38.999444	1.624977	youtube#videoCategory	"XI7nbFXulYBIpL0ayR_gDh3eu1k/nqRIq97-xe5XRZTxb...	10	UCBR8-60-B28hp2BmDPdntcQ	Music	True
16303	7C2z4GqqS5E	2018-05-20T00:00:00	BTS (방탄소년단) 'FAKE LOVE' Official MV	ibighit	10	2018-05-18T09:00:02	[BIGHIT, "빅히트", "방탄소년단", "BTS", "BANGTAN", "방탄...	62796390	4470923	119053	...	0.071197	1 days 14:59:58	38.999444	1.624977	youtube#videoCategory	"XI7nbFXulYBIpL0ayR_gDh3eu1k/nqRIq97-xe5XRZTxb...	10	UCBR8-60-B28hp2BmDPdntcQ	Music	True
16317	7C2z4GqqS5E	2018-05-20T00:00:00	BTS (방탄소년단) 'FAKE LOVE' Official MV	ibighit	10	2018-05-18T09:00:02	[BIGHIT, "빅히트", "방탄소년단", "BTS", "BANGTAN", "방탄...	62796390	4470923	119053	...	0.071197	1 days 14:59:58	38.999444	1.624977	youtube#videoCategory	"XI7nbFXulYBIpL0ayR_gDh3eu1k/nqRIq97-xe5XRZTxb...	10	UCBR8-60-B28hp2BmDPdntcQ	Music	True
16331	7C2z4GqqS5E	2018-05-20T00:00:00	BTS (방탄소년단) 'FAKE LOVE' Official MV	ibighit	10	2018-05-18T09:00:02	[BIGHIT, "빅히트", "방탄소년단", "BTS", "BANGTAN", "방탄...	62796390	4470923	119053	...	0.071197	1 days 14:59:58	38.999444	1.624977	youtube#videoCategory	"XI7nbFXulYBIpL0ayR_gDh3eu1k/nqRIq97-xe5XRZTxb...	10	UCBR8-60-B28hp2BmDPdntcQ	Music	True
12338	u9Mv98Gr5pY	2018-04-26T00:00:00	VENOM - Official Trailer (HD)	Sony Pictures Entertainment	24	2018-04-24T03:45:03	[Venom, "Venom Movie", "Venom (2018)", "Marvel...	39128403	1077732	35764	...	0.027543	1 days 20:14:57	44.249167	1.843715	youtube#videoCategory	"XI7nbFXulYBIpL0ayR_gDh3eu1k/UVB9oxX2Bvqa_w_y3...	24	UCBR8-60-B28hp2BmDPdntcQ	Entertainment	True

5 rows × 27 columns

videos.groupby('snippet.title')['views'].mean().sort_values(ascending=False).head()

snippet.title
Science & Technology    1.213748e+06
Music                   8.297243e+05
Comedy                  3.781614e+05
Sports                  2.850889e+05
Entertainment           2.821966e+05
Name: views, dtype: float64

Based on the previous analysis, amongst the previously trending videos, a significant portion is in “music” category.

videos[videos['snippet.title']=='People & Blogs']['like_views_ratio'].describe()

count    3915.000000
mean        0.015004
std         0.020466
min         0.000000
25%         0.002627
50%         0.006597
75%         0.020536
max         0.194951
Name: like_views_ratio, dtype: float64

Now We try to determine the videos that were trending because of “fake views” (those with possibly flawed views), especially in “people and blog” catergory, by checking if some videos’ like-view ratio is considered as outlier.

fake = videos[(videos['snippet.title']=='People & Blogs')&(videos['like_views_ratio']<0.01)]
fake.shape

(2287, 27)

sns.jointplot(x='likes',y='views',data=videos[videos['snippet.title']=='Entertainment'])

<seaborn.axisgrid.JointGrid at 0x282e14f5ac0>

../_images/7ad3364f2accf7d2b84e2f7a85226609150cfd470b8dd0c2b919537ecfb07ec3.png

sns.jointplot(x='likes',y='views',data=videos[videos['snippet.title']=='People & Blogs'])

<seaborn.axisgrid.JointGrid at 0x282e14f7e60>

../_images/a3aee307a0204e3a2584128259f944be8e7daf89737b5c79f51a07642e42722b.png

sns.lmplot(x='likes',y='views',data=videos)
videos.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20505 entries, 0 to 20504
Data columns (total 27 columns):
 #   Column                  Non-Null Count  Dtype          
---  ------                  --------------  -----          
 video_id                20505 non-null  object         
 trending_date           20505 non-null  object         
 title                   20505 non-null  object         
 channel_title           20505 non-null  object         
 category_id             20505 non-null  int64          
 publish_time            20505 non-null  object         
 tags                    20505 non-null  object         
 views                   20505 non-null  int64          
 likes                   20505 non-null  int64          
 dislikes                20505 non-null  int64          
comment_count           20505 non-null  int64          
thumbnail_link          20505 non-null  object         
comments_disabled       20505 non-null  bool           
ratings_disabled        20505 non-null  bool           
video_error_or_removed  20505 non-null  bool           
description             18381 non-null  object         
like_dislike_ratio      19114 non-null  float64        
like_views_ratio        20505 non-null  float64        
time_till_trending      20505 non-null  timedelta64[ns]
hours_till_trending     20505 non-null  float64        
days_till_trending      20505 non-null  float64        
kind                    20505 non-null  object         
etag                    20505 non-null  object         
id                      20505 non-null  int64          
snippet.channelId       20505 non-null  object         
snippet.title           20505 non-null  object         
snippet.assignable      20505 non-null  bool           
dtypes: bool(4), float64(4), int64(6), object(12), timedelta64[ns](1)
memory usage: 3.7+ MB

../_images/65fa0c27b954c7ade37533aa056058b1d6517a209b5299a9a3d7008fbba1a110.png

What kind of videos are the “most popular” ones?#

Plots for Different Categories#

Here we use aggregate heatmap to analyze the popularity of each catergory

video_count_by_category = videos['category_id'].value_counts()

# Plotting the bar chart
plt.figure(figsize=(10, 6))
current_palette = sns.color_palette("tab20", n_colors=14) 
sns.barplot(x=video_count_by_category.index, y=video_count_by_category.values, hue=video_count_by_category.index, palette=current_palette) 
plt.title('Total Video Count by Different Category IDs')
plt.xlabel('Category ID')
plt.ylabel('Number of Videos')
plt.xticks(rotation=45)  # Rotates the category ID labels for better readability
plt.tight_layout()
plt.show()

../_images/30b807e046a4ad7e91c4ba7a9c6480e83720187b7c523f5fe0a2815f7fc04b53.png

# Aggregate total views by category_id
total_views_by_category = videos.groupby('category_id')['views'].sum()

# Sort the data for better visualization
total_views_by_category = total_views_by_category.sort_values(ascending=False)

# Plotting the bar chart
plt.figure(figsize=(10, 6))
sns.barplot(x=total_views_by_category.index, y=total_views_by_category.values, hue=total_views_by_category.index, palette=current_palette) 
plt.title('Total Views Count by Different Category IDs')
plt.xlabel('Category ID')
plt.ylabel('Total Views')
plt.xticks(rotation=45)  # Rotates the category ID labels for better readability
plt.show()

../_images/06b7511b6487514488e60f87cfe1438275760a8fa3bcde3f63b6ccc6c154b578.png

videos[videos['category_id'] == 24]['snippet.title'].unique()

array(['Entertainment'], dtype=object)

Based on the histogram, videos in “Music” category have the greatest total views.

# For an Aggregated Heatmap, we first need to aggregate the data.
# Let's create an aggregated DataFrame with mean views, likes, dislikes, and comment counts for each category

# Aggregating the data
agg_data = videos.groupby('category_id').agg({
    'views': 'mean',
    'likes': 'mean',
    'dislikes': 'mean',
    'comment_count': 'mean'
}).reset_index()

pivot_table = agg_data.pivot(index="category_id", columns="views", values="likes")

# Continue with your plotting code
plt.figure(figsize=(12, 8))
sns.heatmap(pivot_table, annot=True, fmt=".0f", cmap = 'coolwarm')
plt.title("Heatmap of Average Likes by Category ID and Views")
plt.ylabel("Category ID")
plt.xlabel("Average Views")
plt.show()

../_images/50e90ca49f851cff58ef2fd3fb3d49c812fcb3f6a4b5219cb51aa7ae6734c75f.png

videos[videos['category_id'] == 10]['snippet.title'].unique()

array(['Music'], dtype=object)

Based on the aggregate heatmap, videos in “Music” category tend to be those with the popular.

Linear Regression for Fraud Check#

# For music category
df = videos[videos['snippet.title']=='Music']
# df = videos

from sklearn.model_selection import train_test_split
y = df['views']
X = df[['likes','dislikes','comment_count']]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=123)

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.fit_transform(X_test)

from sklearn.linear_model import LinearRegression
lm = LinearRegression()
lm.fit(X_train,y_train)
lm.coef_, lm.intercept_

(array([1678710.4584304 , 1985409.18248571,  576865.44128654]),
 879282.0397286821)

pred_1 = lm.predict(X_test)
plt.scatter(y_test,pred_1,color="slateblue")
plt.xlabel('Y Test')
plt.ylabel('Predicted Y')

Text(0, 0.5, 'Predicted Y')

../_images/089d2ae4fe7625e5cbc25936f3bf2c161732d2afa5cfcee85a0c95ebf17909bf.png

## RMSE
from sklearn.metrics import mean_squared_error
print("MSE:",mean_squared_error(y_test,pred_1))
print("RMSE:",np.sqrt(mean_squared_error(y_test,pred_1)))

MSE: 791171311427.6908
RMSE: 889478.1118317025

## R^2
from sklearn.metrics import r2_score
r2_score(y_test, pred_1)

0.9525663278976524

Identify Outliers and Remove Them#

# Calculate residuals
pred_df = lm.predict(X.values)  # Use training data for predictions
residuals = y - pred_df  # Calculate residuals for the entire DataFrame

# Set a threshold for identifying outliers (e.g., 3 standard deviations)
threshold = 3 * np.std(residuals)

# Identify outliers
outliers = np.abs(residuals) > (threshold+np.mean(residuals))

# Create a cleaned DataFrame (excluding outliers)
df_clean = df[~outliers]

y1 = df_clean['views']
X1 = df_clean[['likes','dislikes','comment_count']]
X1_train, X1_test, y1_train, y1_test = train_test_split(X1, y1, test_size=0.2, random_state=123)
X1_train = scaler.fit_transform(X1_train)
X1_test = scaler.fit_transform(X1_test)
lm1 = LinearRegression()
lm1.fit(X1_train,y1_train)
pred_11 = lm1.predict(X1_test)
plt.scatter(y1_test,pred_11, color="mediumpurple")
plt.xlabel('Y Test')
plt.ylabel('Predicted Y')

print("MSE:",mean_squared_error(y1_test,pred_11))
print("R^2:",r2_score(y1_test, pred_11))

MSE: 954687579800.9286
R^2: 0.7166970758420361

../_images/7ec14cfa1c94c5ad9ce04005d7770ba8d0547c8b19a5451aa0a6644604b2d31e.png

We try to use different machine learning models to predict the trending status of videos. Models such as ridge regression, decision trees, and random forests were trained and evaluated. The performance of these models was measured using appropriate metrics to provide a robust framework for prediction.

Decision Tree Regression#

from sklearn.tree import DecisionTreeRegressor
dt = DecisionTreeRegressor(max_depth=10)
dt.fit(X_train, y_train)
pred_2 = dt.predict(X_test)

# MSE
print("MSE:",mean_squared_error(y_test,pred_2))

MSE: 757936993054.9852

# R^2
r2_score(y_test, pred_2)

0.9545588492864666

plt.figure()
plt.xlabel('Y Test')
plt.ylabel('Predicted Y')
plt.scatter(y_test, pred_2, color="m", label="max_depth=2", linewidth=2)

<matplotlib.collections.PathCollection at 0x28300eddd60>

../_images/2b28077e1ed8467b10a608db0f595342ed587d106917c3c02a658ea5152cf613.png

Random Forest Regression#

from sklearn.ensemble import RandomForestRegressor

rf = RandomForestRegressor(random_state=123)
rf.fit(X_train, y_train)
pred_3 = rf.predict(X_test)

plt.xlabel('Y Test')
plt.ylabel('Predicted Y')
plt.scatter(y_test, pred_3, color="purple", linewidth=2)

<matplotlib.collections.PathCollection at 0x28301031d00>

../_images/73d00997eed0e020d2b42e3feaafd9ce3d3ae74fb818e0bb4c9adc3f97467918.png

## R^2
r2_score(y_test, pred_3)

0.955550878511539

Ridge Regression#

from sklearn.linear_model import Ridge
from sklearn.metrics import accuracy_score, roc_auc_score, classification_report, confusion_matrix

# Standardize the data
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# Initialize the Ridge model
rg = Ridge(alpha=1.0)  # Adjust alpha as needed

# Train the model
rg.fit(X_train, y_train)

# Make predictions
pred_4 = rg.predict(X_test)

plt.xlabel('Y Test')
plt.ylabel('Predicted Y')
plt.scatter(y_test, pred_4, color="orchid", linewidth=2)

<matplotlib.collections.PathCollection at 0x2830358bd10>

../_images/db41b1d3291b75c9ec0448345f45bb44021c3759a7c6e35de8ec939989f3ec65.png

## R^2
r2_score(y_test, pred_4)

0.9525963263828874

10-fold Cross Validation#

from sklearn.model_selection import cross_val_score

# Linear model
score_lm = cross_val_score(lm,X,y,cv=10)
score_lm.mean()

0.7723659637449632

# Decision tree regression
score_dt = cross_val_score(dt,X,y,cv=10)
score_dt.mean()

0.5014532073170431

# Random forest regression
score_dt = cross_val_score(rf,X,y,cv=10)
score_dt.mean()

0.5874674659994525

# Lasso Regression
score_dt = cross_val_score(rf,X,y,cv=10)
score_dt.mean()

0.5874674659994525

All the R^2 scores of the models are above 0.5, which acceptable for our purposes, especially since most of the explanatory variables are statistically significant. The linear regression model, with a mean R^2 score greater than 0.7 after 10 folds validation, seems to be the one with considerable significance describing how the’likes’,’dislikes’,’comment_count’ of a video are correlated to its’view’.

When would the video be trending?#

popular = videos[ (videos['views']>=1000000)]

plt.hist(popular['publish_time'], alpha=0.5, label='publish_time')
plt.hist(popular['trending_date'], alpha=0.5, label='trending_date')
plt.legend(loc='upper right')

<matplotlib.legend.Legend at 0x2830352c0b0>

../_images/750e6423dfa8e21bf6e2aab1babe2eabcbb5d88ddb096d897c39e962a60cbe75.png

popular['days_till_trending'].hist()

<Axes: >

../_images/aa756e7fbae7a4a8a09af7db27df991a750c1a207cf0fbb555f6c29c14ea5ea2.png

Most trending videos in Japan were on the trending board within 3 days after its publish time.

popular['days_till_trending'].mean()

1.082795635604581

popular['days_till_trending'].median()

0.7291666666666666

popular['days_till_trending'].mode()

0    0.729167
Name: days_till_trending, dtype: float64

fig, ax = plt.subplots(figsize=(12, 6))
popular_time = []
target_categories = videos['snippet.title'].unique()

for category in target_categories:
    popular = videos[(videos['snippet.title']==category) & (videos['views']>=1000000)]
    if abs(popular['days_till_trending'].mean() - popular['days_till_trending'].median()) < 0.001:
        # Remove outliers
        continue
    sns.kdeplot(popular['days_till_trending'], fill = True)
    plt.legend(labels=videos['snippet.title'].unique())
    print(category.ljust(22," "), popular['days_till_trending'].median())
    popular_time.append([category, popular['days_till_trending'].median()])

News & Politics        0.8379629629629629
Film & Animation       0.687494212962963
Science & Technology   1.1228703703703704
People & Blogs         1.0109027777777777
Comedy                 0.8438888888888889
Entertainment          0.6840277777777778
Howto & Style          0.5371122685185186
Autos & Vehicles       1.597037037037037
Sports                 0.7689178240740742
Music                  0.49997685185185187
Gaming                 0.9085185185185185
Education              0.3283449074074074

../_images/0d2d28dadc69532603a4137e3412bc97f9398d88ebcafaa8898a8bb0606c8e62.png

popular_time = pd.DataFrame(popular_time)
popular_time

	0	1
0	News & Politics	0.837963
1	Film & Animation	0.687494
2	Science & Technology	1.122870
3	People & Blogs	1.010903
4	Comedy	0.843889
5	Entertainment	0.684028
6	Howto & Style	0.537112
7	Autos & Vehicles	1.597037
8	Sports	0.768918
9	Music	0.499977
10	Gaming	0.908519
11	Education	0.328345

Summary#

Previous analysis implies that Youtube videos in the music category with certain tags are more likely to trend in Japan. Engagement metrics such as likes, dislikes, and number of comments have a significant impact on a video’s trending potential. In addition, the time of release and the initial performance of the video play a critical role in its virality, as most popular videos start trending within 2 days of being uploaded.

Reference#

Data

kaggle

Algorithm and Methods

Decision Tree
Random Forest
kde plot
Math 10 Course website

Thanks for reading :)