Introduction

  • In this project, I analyze trending YouTube videos in Japan to identify the key factors that influence their popularity. The work proceeds through data cleaning, exploratory data analysis, and predictive modeling.

Exploratory Data Analysis

import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import sklearn as sk
import random

SEED = 123
np.random.seed(SEED)  # fix the NumPy seed for reproducibility
# videos = pd.read_csv("JPvideos.csv", encoding='shift_jis', errors='replace')
# Read as UTF-8; bytes that cannot be decoded are dropped instead of raising an error
videos = pd.read_csv('JPvideos.csv', encoding='utf-8', quotechar='"', delimiter=',', encoding_errors='ignore')
videos['category_id'] = pd.to_numeric(videos['category_id'])
videos['views'] = pd.to_numeric(videos['views'])
# Tags are stored as a single '|'-separated string; split them into a list
videos['tags'] = videos['tags'].str.split('|')
# Engagement ratios. Rows with dislikes == 0 give inf (or NaN when likes is also 0)
videos['like_dislike_ratio'] = videos['likes']/videos['dislikes']
videos['like_views_ratio'] = videos['likes']/videos['views']
videos['publish_time'] = pd.to_datetime(videos['publish_time'])
videos.head()
video_id trending_date title channel_title category_id publish_time tags views likes dislikes comment_count thumbnail_link comments_disabled ratings_disabled video_error_or_removed description like_dislike_ratio like_views_ratio
0 5ugKfHgsmYw 18.07.02 陸自ヘリ、垂直に落下=路上の車が撮影 時事通信映像センター 25 2018-02-06 03:04:37+00:00 [事故, "佐賀", "佐賀県", "ヘリコプター", "ヘリ", "自衛隊", "墜落",... 188085 591 189 0 https://i.ytimg.com/vi/5ugKfHgsmYw/default.jpg True False False 佐賀県神埼市の民家に墜落した陸上自衛隊のAH64D戦闘ヘリコプターが垂直に落下する様子を、近... 3.126984 0.003142
1 ohObafdd34Y 18.07.02 イッテQ お祭り男宮川×手越 巨大ブランコ② 神谷えりな Kamiya Erina 2 1 2018-02-06 04:01:56+00:00 [[none]] 90929 442 88 174 https://i.ytimg.com/vi/ohObafdd34Y/default.jpg False False False NaN 5.022727 0.004861
2 aBr2kKAHN6M 18.07.02 Live Views of Starman SpaceX 28 2018-02-06 21:38:22+00:00 [[none]] 6408303 165892 2331 3006 https://i.ytimg.com/vi/aBr2kKAHN6M/default.jpg False False False NaN 71.167739 0.025887
3 5wNnwChvmsQ 18.07.02 東京ディズニーリゾートの元キャストが暴露した秘密5選 アシタノワダイ 25 2018-02-06 06:08:49+00:00 [アシタノワダイ] 96255 1165 277 545 https://i.ytimg.com/vi/5wNnwChvmsQ/default.jpg False False False 東京ディズニーリゾートの元キャストが暴露した秘密5選\n\nかたまりクリエイトさま\n【検証... 4.205776 0.012103
4 B7J47qFvdsk 18.07.02 榮倉奈々、衝撃の死んだふり!映画『家に帰ると妻が必ず死んだふりをしています。』特報 シネマトゥデイ 1 2018-02-06 02:30:00+00:00 [[none]] 108408 1336 74 201 https://i.ytimg.com/vi/B7J47qFvdsk/default.jpg False False False 家に帰ってきたサラリーマンのじゅん(安田顕)は、玄関で血を出して倒れている妻ちえ(榮倉奈々)... 18.054054 0.012324
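The like_dislike_ratio column divides by dislikes, which can be zero. The following is a minimal sketch of one way to keep such rows as missing values instead of inf. The helper name safe_ratio and the toy values are assumptions made for illustration; they are not part of the original notebook.

```python
import numpy as np
import pandas as pd

def safe_ratio(numerator: pd.Series, denominator: pd.Series) -> pd.Series:
    """Divide two columns, returning NaN wherever the denominator is zero."""
    # Replacing 0 with NaN makes the division propagate NaN instead of inf
    return numerator / denominator.replace(0, np.nan)

# Toy data (hypothetical values chosen to cover the zero-dislike cases)
toy = pd.DataFrame({"likes": [591, 10, 0], "dislikes": [189, 0, 0]})
toy["like_dislike_ratio"] = safe_ratio(toy["likes"], toy["dislikes"])
print(toy)
```

With this approach both zero-dislike rows become NaN, so later summaries such as describe() are not distorted by infinite values.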
# trending_date is stored as YY.DD.MM (e.g. 18.07.02 is 7 February 2018)
videos['trending_date'] = pd.to_datetime(videos['trending_date'], format='%y.%d.%m', utc=True)
videos['time_till_trending'] = videos['trending_date'] - videos['publish_time']

# Align datetime format for Tableau analysis (both columns become strings here)
videos['publish_time'] = videos['publish_time'].dt.strftime('%Y-%m-%dT%H:%M:%S')
videos['trending_date'] = videos['trending_date'].dt.strftime('%Y-%m-%dT%H:%M:%S')
videos['hours_till_trending'] = videos['time_till_trending'].dt.total_seconds() / 3600
videos['days_till_trending'] = videos['time_till_trending'].dt.total_seconds() / (3600*24)
videos[['trending_date','publish_time','time_till_trending','days_till_trending']].head()
   trending_date        publish_time         time_till_trending  days_till_trending
0  2018-02-07T00:00:00  2018-02-06T03:04:37  0 days 20:55:23     0.871794
1  2018-02-07T00:00:00  2018-02-06T04:01:56  0 days 19:58:04     0.831991
2  2018-02-07T00:00:00  2018-02-06T21:38:22  0 days 02:21:38     0.098356
3  2018-02-07T00:00:00  2018-02-06T06:08:49  0 days 17:51:11     0.743877
4  2018-02-07T00:00:00  2018-02-06T02:30:00  0 days 21:30:00     0.895833
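As a worked check of the time-till-trending calculation, the sketch below recomputes the first row by hand. The two input strings are copied from row 0 of the data; the variable names are illustrative only.

```python
import pandas as pd

# Values copied from row 0: trending_date "18.07.02" is in YY.DD.MM order
trending = pd.to_datetime("18.07.02", format="%y.%d.%m", utc=True)
published = pd.to_datetime("2018-02-06 03:04:37", utc=True)

delta = trending - published                 # a pandas Timedelta
days = delta.total_seconds() / (3600 * 24)   # fractional days

print(trending.date())   # 2018-02-07
print(delta)             # 0 days 20:55:23
print(round(days, 6))    # 0.871794
```

The gap is 20 h 55 min 23 s, that is 75,323 seconds, and 75,323 / 86,400 ≈ 0.871794 days, which matches the table.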
videos.to_csv('JP_videos.csv')
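# Load the category lookup and flatten the nested 'items' records into columns
category = pd.read_json("JP_category_id.json")
category = pd.json_normalize(category['items'])
category['id'] = pd.to_numeric(category['id'])
#category.to_csv('JP_videos_category.csv')
# Inner join: videos whose category_id has no entry in the lookup are dropped
videos = videos.merge(category,how='inner',left_on='category_id', right_on='id')
# Note: dropna() returns a new DataFrame; the result is not assigned here,
# so rows with missing values remain in `videos`
videos.dropna()
#videos.to_csv('JP_videos_complete.csv')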
videos.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20505 entries, 0 to 20504
Data columns (total 27 columns):
 #   Column                  Non-Null Count  Dtype          
---  ------                  --------------  -----          
 0   video_id                20505 non-null  object         
 1   trending_date           20505 non-null  object         
 2   title                   20505 non-null  object         
 3   channel_title           20505 non-null  object         
 4   category_id             20505 non-null  int64          
 5   publish_time            20505 non-null  object         
 6   tags                    20505 non-null  object         
 7   views                   20505 non-null  int64          
 8   likes                   20505 non-null  int64          
 9   dislikes                20505 non-null  int64          
 10  comment_count           20505 non-null  int64          
 11  thumbnail_link          20505 non-null  object         
 12  comments_disabled       20505 non-null  bool           
 13  ratings_disabled        20505 non-null  bool           
 14  video_error_or_removed  20505 non-null  bool           
 15  description             18381 non-null  object         
 16  like_dislike_ratio      19114 non-null  float64        
 17  like_views_ratio        20505 non-null  float64        
 18  time_till_trending      20505 non-null  timedelta64[ns]
 19  hours_till_trending     20505 non-null  float64        
 20  days_till_trending      20505 non-null  float64        
 21  kind                    20505 non-null  object         
 22  etag                    20505 non-null  object         
 23  id                      20505 non-null  int64          
 24  snippet.channelId       20505 non-null  object         
 25  snippet.title           20505 non-null  object         
 26  snippet.assignable      20505 non-null  bool           
dtypes: bool(4), float64(4), int64(6), object(12), timedelta64[ns](1)
memory usage: 3.7+ MB
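# Name the columns explicitly instead of relying on their positions (7:11)
sns.pairplot(videos, vars=['views', 'likes', 'dislikes', 'comment_count'])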
[Figure: pair plot of views, likes, dislikes, and comment_count]

Relationships between the numeric variables, including a possible linear relationship between likes and dislikes:

  • The pair plot shows several distinct linear bands, which suggests that views, likes, dislikes, and comment counts are linearly related, but that the relationship may differ by video category or be driven by video content.
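One way to probe whether the relationship differs by category is to compare the pooled correlation with correlations computed within each category. The sketch below does this on toy data; the helper per_category_corr, the column name category, and all values are hypothetical. On the real data the grouping column would presumably be snippet.title.

```python
import pandas as pd

def per_category_corr(df: pd.DataFrame, group_col: str, x: str, y: str) -> pd.Series:
    """Pearson correlation between x and y, computed separately per group."""
    return df.groupby(group_col)[[x, y]].apply(lambda g: g[x].corr(g[y]))

# Toy data: two categories, each perfectly linear but with different slopes
toy = pd.DataFrame({
    "category": ["A"] * 3 + ["B"] * 3,
    "likes": [1, 2, 3, 1, 2, 3],
    "views": [10, 20, 30, 100, 200, 300],
})

pooled = toy["likes"].corr(toy["views"])              # all rows together
per_cat = per_category_corr(toy, "category", "likes", "views")

print(round(pooled, 3))
print(per_cat)
```

In this toy case each category is perfectly correlated on its own, while the pooled correlation is much weaker, which is the pattern that several separate bands in a pair plot would produce.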

videos.sort_values(by=['likes'], ascending=False).head()
video_id trending_date title channel_title category_id publish_time tags views likes dislikes ... like_views_ratio time_till_trending hours_till_trending days_till_trending kind etag id snippet.channelId snippet.title snippet.assignable
16331 7C2z4GqqS5E 2018-05-20T00:00:00 BTS (방탄소년단) 'FAKE LOVE' Official MV ibighit 10 2018-05-18T09:00:02 [BIGHIT, "빅히트", "방탄소년단", "BTS", "BANGTAN", "방탄... 62796390 4470923 119053 ... 0.071197 1 days 14:59:58 38.999444 1.624977 youtube#videoCategory "XI7nbFXulYBIpL0ayR_gDh3eu1k/nqRIq97-xe5XRZTxb... 10 UCBR8-60-B28hp2BmDPdntcQ Music True
16317 7C2z4GqqS5E 2018-05-20T00:00:00 BTS (방탄소년단) 'FAKE LOVE' Official MV ibighit 10 2018-05-18T09:00:02 [BIGHIT, "빅히트", "방탄소년단", "BTS", "BANGTAN", "방탄... 62796390 4470923 119053 ... 0.071197 1 days 14:59:58 38.999444 1.624977 youtube#videoCategory "XI7nbFXulYBIpL0ayR_gDh3eu1k/nqRIq97-xe5XRZTxb... 10 UCBR8-60-B28hp2BmDPdntcQ Music True
16303 7C2z4GqqS5E 2018-05-20T00:00:00 BTS (방탄소년단) 'FAKE LOVE' Official MV ibighit 10 2018-05-18T09:00:02 [BIGHIT, "빅히트", "방탄소년단", "BTS", "BANGTAN", "방탄... 62796390 4470923 119053 ... 0.071197 1 days 14:59:58 38.999444 1.624977 youtube#videoCategory "XI7nbFXulYBIpL0ayR_gDh3eu1k/nqRIq97-xe5XRZTxb... 10 UCBR8-60-B28hp2BmDPdntcQ Music True
16289 7C2z4GqqS5E 2018-05-20T00:00:00 BTS (방탄소년단) 'FAKE LOVE' Official MV ibighit 10 2018-05-18T09:00:02 [BIGHIT, "빅히트", "방탄소년단", "BTS", "BANGTAN", "방탄... 62796390 4470923 119053 ... 0.071197 1 days 14:59:58 38.999444 1.624977 youtube#videoCategory "XI7nbFXulYBIpL0ayR_gDh3eu1k/nqRIq97-xe5XRZTxb... 10 UCBR8-60-B28hp2BmDPdntcQ Music True
14599 p8npDG2ulKQ 2018-05-08T00:00:00 BTS (방탄소년단) LOVE YOURSELF 轉 Tear 'Singularity'... ibighit 10 2018-05-06T15:00:02 [BIGHIT, "빅히트", "방탄소년단", "BTS", "BANGTAN", "방탄"] 15875379 2285436 19400 ... 0.143961 1 days 08:59:58 32.999444 1.374977 youtube#videoCategory "XI7nbFXulYBIpL0ayR_gDh3eu1k/nqRIq97-xe5XRZTxb... 10 UCBR8-60-B28hp2BmDPdntcQ Music True

5 rows × 27 columns

videos.sort_values(by=['views'],ascending=False).head()
video_id trending_date title channel_title category_id publish_time tags views likes dislikes ... like_views_ratio time_till_trending hours_till_trending days_till_trending kind etag id snippet.channelId snippet.title snippet.assignable
16289 7C2z4GqqS5E 2018-05-20T00:00:00 BTS (방탄소년단) 'FAKE LOVE' Official MV ibighit 10 2018-05-18T09:00:02 [BIGHIT, "빅히트", "방탄소년단", "BTS", "BANGTAN", "방탄... 62796390 4470923 119053 ... 0.071197 1 days 14:59:58 38.999444 1.624977 youtube#videoCategory "XI7nbFXulYBIpL0ayR_gDh3eu1k/nqRIq97-xe5XRZTxb... 10 UCBR8-60-B28hp2BmDPdntcQ Music True
16303 7C2z4GqqS5E 2018-05-20T00:00:00 BTS (방탄소년단) 'FAKE LOVE' Official MV ibighit 10 2018-05-18T09:00:02 [BIGHIT, "빅히트", "방탄소년단", "BTS", "BANGTAN", "방탄... 62796390 4470923 119053 ... 0.071197 1 days 14:59:58 38.999444 1.624977 youtube#videoCategory "XI7nbFXulYBIpL0ayR_gDh3eu1k/nqRIq97-xe5XRZTxb... 10 UCBR8-60-B28hp2BmDPdntcQ Music True
16317 7C2z4GqqS5E 2018-05-20T00:00:00 BTS (방탄소년단) 'FAKE LOVE' Official MV ibighit 10 2018-05-18T09:00:02 [BIGHIT, "빅히트", "방탄소년단", "BTS", "BANGTAN", "방탄... 62796390 4470923 119053 ... 0.071197 1 days 14:59:58 38.999444 1.624977 youtube#videoCategory "XI7nbFXulYBIpL0ayR_gDh3eu1k/nqRIq97-xe5XRZTxb... 10 UCBR8-60-B28hp2BmDPdntcQ Music True
16331 7C2z4GqqS5E 2018-05-20T00:00:00 BTS (방탄소년단) 'FAKE LOVE' Official MV ibighit 10 2018-05-18T09:00:02 [BIGHIT, "빅히트", "방탄소년단", "BTS", "BANGTAN", "방탄... 62796390 4470923 119053 ... 0.071197 1 days 14:59:58 38.999444 1.624977 youtube#videoCategory "XI7nbFXulYBIpL0ayR_gDh3eu1k/nqRIq97-xe5XRZTxb... 10 UCBR8-60-B28hp2BmDPdntcQ Music True
12338 u9Mv98Gr5pY 2018-04-26T00:00:00 VENOM - Official Trailer (HD) Sony Pictures Entertainment 24 2018-04-24T03:45:03 [Venom, "Venom Movie", "Venom (2018)", "Marvel... 39128403 1077732 35764 ... 0.027543 1 days 20:14:57 44.249167 1.843715 youtube#videoCategory "XI7nbFXulYBIpL0ayR_gDh3eu1k/UVB9oxX2Bvqa_w_y3... 24 UCBR8-60-B28hp2BmDPdntcQ Entertainment True

5 rows × 27 columns
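In both rankings the same video_id and trending_date pair appears at four different index positions, so repeated rows crowd out other videos. The sketch below shows one way to drop such repeats before ranking. The toy frame reuses three rows' worth of values from the output above; where the repeats originate is not established here.

```python
import pandas as pd

# Toy frame with values taken from the rankings above (one row repeated)
toy = pd.DataFrame({
    "video_id": ["7C2z4GqqS5E", "7C2z4GqqS5E", "u9Mv98Gr5pY"],
    "trending_date": ["2018-05-20T00:00:00", "2018-05-20T00:00:00", "2018-04-26T00:00:00"],
    "views": [62796390, 62796390, 39128403],
})

# Keep one row per (video_id, trending_date) pair, then rank by views
deduped = toy.drop_duplicates(subset=["video_id", "trending_date"])
top = deduped.sort_values(by="views", ascending=False)
print(top)
```

A video that trends on several days would still appear once per day; adding a groupby on video_id would be needed to rank distinct videos.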

videos.groupby('snippet.title')['views'].mean().sort_values(ascending=False).head()
snippet.title
Science & Technology    1.213748e+06
Music                   8.297243e+05
Comedy                  3.781614e+05
Sports                  2.850889e+05
Entertainment           2.821966e+05
Name: views, dtype: float64

Based on the rankings above, the most-liked and most-viewed trending entries are led by Music videos, and Music has the second-highest mean view count (about 830,000) after Science & Technology (about 1.21 million).
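Mean views per category do not show how many trending entries each category holds. A minimal sketch of measuring that share directly is given below, using value_counts with normalize=True. The toy rows are hypothetical; only the column name snippet.title comes from the notebook.

```python
import pandas as pd

# Toy data standing in for the merged videos frame (hypothetical rows)
toy = pd.DataFrame({
    "snippet.title": ["Music", "Music", "Entertainment", "Sports"],
})

# Fraction of trending entries that fall in each category
share = toy["snippet.title"].value_counts(normalize=True)
print(share)
```

Applied to the full data, the same call would show whether a category with high mean views also accounts for a large fraction of trending entries.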

videos[videos['snippet.title']=='People & Blogs']['like_views_ratio'].describe()
count    3915.000000
mean        0.015004
std         0.020466
min         0.000000
25%         0.002627
50%         0.006597
75%         0.020536
max         0.194951
Name: like_views_ratio, dtype: float64

Next, we try to identify videos that may have trended because of “fake views” (inflated or otherwise flawed view counts), especially in the “People & Blogs” category, by checking whether a video’s like-to-view ratio is low enough to be considered an outlier.

fake = videos[(videos['snippet.title']=='People & Blogs')&(videos['like_views_ratio']<0.01)]
fake.shape
(2287, 27)
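The fixed cutoff of 0.01 lies above the category median of 0.006597, so it selects 2287 of 3915 rows, more than half the category, rather than only outliers. As an alternative, the sketch below flags values under a low quantile. The helper flag_low_ratio, the 10% level, and the toy ratios are assumptions for illustration, not the notebook's method.

```python
import pandas as pd

def flag_low_ratio(ratios: pd.Series, q: float = 0.05) -> pd.Series:
    """Boolean mask marking values strictly below the q-th quantile."""
    return ratios < ratios.quantile(q)

# Toy like-to-view ratios (hypothetical values)
ratios = pd.Series([0.0001, 0.003, 0.006, 0.007, 0.015,
                    0.02, 0.03, 0.05, 0.08, 0.19])

mask = flag_low_ratio(ratios, q=0.10)
print(ratios[mask])
```

A quantile-based cutoff adapts to each category's own distribution, though a low ratio alone does not prove that views were inflated.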
sns.jointplot(x='likes',y='views',data=videos[videos['snippet.title']=='Entertainment'])
[Figure: joint plot of likes (x) against views (y) for the Entertainment category]
sns.jointplot(x='likes',y='views',data=videos[videos['snippet.title']=='People & Blogs'])
[Figure: joint plot of likes (x) against views (y) for the People & Blogs category]
sns.lmplot(x='likes',y='views',data=videos)
[Figure: scatter plot with linear regression fit of likes (x) against views (y) for all videos]