Introduction#

This project analyzes the relationship between global CO2 emissions and renewable energy production using proxy datasets (seaborn's flights and diamonds), combining exploratory data analysis, several regression models, and time series forecasting with Prophet.

Data Loading#

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.neighbors import KNeighborsRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import SelectKBest, f_regression
from prophet import Prophet
import warnings

warnings.filterwarnings('ignore')
# Load the example CO2 emissions dataset (using flights as a proxy)
flights = sns.load_dataset('flights')
flights.to_csv('flights.csv', index=False)
co2_data = flights.rename(columns={'passengers': 'co2'})  # Rename for consistency
# Load the example renewable energy dataset (using diamonds as a proxy)
diamonds = sns.load_dataset('diamonds')
diamonds.to_csv('diamonds.csv', index=False)
# Treat 'carat' as energy production, with 'depth' and 'price' as auxiliary features
energy_data = diamonds[['carat', 'depth', 'price']].rename(columns={'carat': 'energy'})

# Add a 'year' column to energy_data for merging; only the first 144 rows (one per
# co2_data row) receive a year, the rest become NaN and are dropped during cleaning
energy_data['year'] = co2_data['year'].iloc[:len(energy_data)]

# Display the first few rows of the CO2 dataset
co2_data.head()
   year month  co2
0  1949   Jan  112
1  1949   Feb  118
2  1949   Mar  132
3  1949   Apr  129
4  1949   May  121
# Display the first few rows of the energy dataset
energy_data.head()
   energy  depth  price    year
0    0.23   61.5    326  1949.0
1    0.21   59.8    326  1949.0
2    0.23   56.9    327  1949.0
3    0.29   62.4    334  1949.0
4    0.31   63.3    335  1949.0
# Display the shape of the dataset
co2_data.shape
(144, 3)
# Display the shape of the dataset
energy_data.shape
(53940, 4)
# Display summary of the dataset
co2_data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 144 entries, 0 to 143
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype   
---  ------  --------------  -----   
 0   year    144 non-null    int64   
 1   month   144 non-null    category
 2   co2     144 non-null    int64   
dtypes: category(1), int64(2)
memory usage: 2.9 KB
# Display summary of the dataset
energy_data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 53940 entries, 0 to 53939
Data columns (total 4 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   energy  53940 non-null  float64
 1   depth   53940 non-null  float64
 2   price   53940 non-null  int64  
 3   year    144 non-null    float64
dtypes: float64(3), int64(1)
memory usage: 1.6 MB
co2_data.dtypes
year        int64
month    category
co2         int64
dtype: object
energy_data.dtypes
energy    float64
depth     float64
price       int64
year      float64
dtype: object

Data cleaning#

# Check for missing values
co2_data.isnull().sum()
year     0
month    0
co2      0
dtype: int64
# Check for missing values
energy_data.isnull().sum()
energy        0
depth         0
price         0
year      53796
dtype: int64
# Drop rows with missing values (this trims energy_data to the 144 rows that received a year)
co2_data.dropna(inplace=True)
energy_data.dropna(inplace=True)
# Convert data types if necessary
co2_data['year'] = co2_data['year'].astype(int)
energy_data['year'] = energy_data['year'].astype(int)
# Display summary of the cleaned data
co2_data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 144 entries, 0 to 143
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype   
---  ------  --------------  -----   
 0   year    144 non-null    int32   
 1   month   144 non-null    category
 2   co2     144 non-null    int64   
dtypes: category(1), int32(1), int64(1)
memory usage: 2.3 KB
energy_data.info()
<class 'pandas.core.frame.DataFrame'>
Index: 144 entries, 0 to 143
Data columns (total 4 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   energy  144 non-null    float64
 1   depth   144 non-null    float64
 2   price   144 non-null    int64  
 3   year    144 non-null    int32  
dtypes: float64(2), int32(1), int64(1)
memory usage: 5.1 KB
# Merge the datasets on the 'year' column
data = pd.merge(co2_data, energy_data, on='year')
data.head()
   year month  co2  energy  depth  price
0  1949   Jan  112    0.23   61.5    326
1  1949   Jan  112    0.21   59.8    326
2  1949   Jan  112    0.23   56.9    327
3  1949   Jan  112    0.29   62.4    334
4  1949   Jan  112    0.31   63.3    335
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1728 entries, 0 to 1727
Data columns (total 6 columns):
 #   Column  Non-Null Count  Dtype   
---  ------  --------------  -----   
 0   year    1728 non-null   int32   
 1   month   1728 non-null   category
 2   co2     1728 non-null   int64   
 3   energy  1728 non-null   float64 
 4   depth   1728 non-null   float64 
 5   price   1728 non-null   int64   
dtypes: category(1), float64(2), int32(1), int64(2)
memory usage: 62.9 KB
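Because the merge is many-to-many on 'year' (12 monthly CO2 rows and 12 energy rows per year), each year contributes 12 x 12 = 144 merged rows, for 12 x 144 = 1728 rows in total. A quick sanity check (a minimal sketch reusing the data frame built above):

# Verify the many-to-many merge: each year should contribute 12 * 12 = 144 rows
rows_per_year = data.groupby('year').size()
print(rows_per_year.unique())  # expected: [144]
print(rows_per_year.sum())     # expected: 1728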

Visual analysis#

# Pairplot to visualize relationships between variables
sns.pairplot(data)
plt.show()
[Figure: pairplot of all numeric variables in the merged dataset]
co2_data_grouped = co2_data.groupby('year', as_index=False)['co2'].mean()
co2_data_grouped
    year         co2
0   1949  126.666667
1   1950  139.666667
2   1951  170.166667
3   1952  197.000000
4   1953  225.000000
5   1954  238.916667
6   1955  284.000000
7   1956  328.250000
8   1957  368.416667
9   1958  381.000000
10  1959  428.333333
11  1960  476.166667
co2_data_grouped.dtypes
year      int32
co2     float64
dtype: object
# Line plot of CO2 emissions over time
plt.figure(figsize=(10, 6))
plt.plot(co2_data_grouped['year'], co2_data_grouped['co2'], marker='o')
plt.title('Global CO2 Emissions Over Time')
plt.xlabel('Year')
plt.ylabel('CO2 Emissions (Metric Tons)')
plt.show()
[Figure: line plot of global CO2 emissions over time]
energy_data_grouped = energy_data.groupby('year', as_index=False)['energy'].mean()
energy_data_grouped
    year    energy
0   1949  0.249167
1   1950  0.276667
2   1951  0.243333
3   1952  0.296667
4   1953  0.272500
5   1954  0.305833
6   1955  0.280000
7   1956  0.499167
8   1957  0.772500
9   1958  0.754167
10  1959  0.755833
11  1960  0.704167
# Line plot of renewable energy production over time
plt.figure(figsize=(10, 6))
plt.plot(energy_data_grouped['year'], energy_data_grouped['energy'], marker='o')
plt.title('Global Renewable Energy Production Over Time')
plt.xlabel('Year')
plt.ylabel('Renewable Energy Production (Metric Tons)')
plt.show()
[Figure: line plot of global renewable energy production over time]
# Distribution plots for CO2 emissions and renewable energy production
plt.figure(figsize=(12, 6))
plt.subplot(1, 2, 1)
sns.histplot(data['co2'], kde=True)
plt.title('Distribution of CO2 Emissions')
plt.xlabel('CO2 Emissions (Metric Tons)')

plt.subplot(1, 2, 2)
sns.histplot(data['energy'], kde=True)
plt.title('Distribution of Renewable Energy Production')
plt.xlabel('Renewable Energy Production (Metric Tons)')
plt.show()
[Figure: distributions of CO2 emissions and renewable energy production]
# Correlation matrix (numeric columns only; include='number' also picks up the int32 'year' column)
numeric_data = data.select_dtypes(include='number')
plt.figure(figsize=(12, 10))
sns.heatmap(numeric_data.corr(), annot=True, cmap='coolwarm')
plt.title('Correlation Matrix')
plt.show()
[Figure: correlation matrix heatmap]
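To complement the heatmap, the numeric correlations with the target can be listed directly (a small sketch reusing the numeric_data frame from above):

# Rank the numeric features by their linear correlation with CO2 emissions
print(numeric_data.corr()['co2'].drop('co2').sort_values(ascending=False))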

Feature selection#

# Select the most relevant features using SelectKBest
X = data[['year', 'energy', 'depth', 'price']]
y = data['co2']

# Use f_regression to select the 3 most relevant features
selector = SelectKBest(score_func=f_regression, k=3)
X_new = selector.fit_transform(X, y)

# Display the selected features
selected_features = X.columns[selector.get_support()]
print("Selected features:", selected_features)
Selected features: Index(['year', 'energy', 'price'], dtype='object')
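For transparency, the univariate F-statistics that drive the selection can be inspected on the fitted selector (a minimal sketch; selector.scores_ is populated by the fit_transform call above):

# Show the f_regression score each candidate feature received
feature_scores = pd.Series(selector.scores_, index=X.columns).sort_values(ascending=False)
print(feature_scores)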

Feature scaling#

# Scale features using StandardScaler
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_new)

# Display the first few scaled rows
X_scaled[:5]
array([[-1.59325501, -0.93808395, -0.86848677],
       [-1.59325501, -1.0230425 , -0.86848677],
       [-1.59325501, -0.93808395, -0.86760127],
       [-1.59325501, -0.68320831, -0.86140271],
       [-1.59325501, -0.59824977, -0.8605172 ]])
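As a quick check, the scaled matrix should have roughly zero mean and unit variance per column (a minimal sketch; numpy is assumed to be available):

import numpy as np

# Each column of X_scaled should be standardized to mean ~0 and std ~1
print(np.round(X_scaled.mean(axis=0), 6))  # expected: ~[0. 0. 0.]
print(np.round(X_scaled.std(axis=0), 6))   # expected: ~[1. 1. 1.]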

Data preparation#

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.2, random_state=42)

# Check the shape of the training and test sets
print("X_train shape:", X_train.shape)
print("X_test shape:", X_test.shape)
print("y_train shape:", y_train.shape)
print("y_test shape:", y_test.shape)
X_train shape: (1382, 3)
X_test shape: (346, 3)
y_train shape: (1382,)
y_test shape: (346,)
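Before fitting any model, a naive baseline gives the MSE numbers below a point of reference (a minimal sketch: always predict the training-set mean of the target):

import numpy as np

# Naive baseline: predict the mean of y_train for every test sample
baseline_pred = np.full(len(y_test), y_train.mean())
print(f'Baseline (mean predictor) MSE: {mean_squared_error(y_test, baseline_pred):.4f}')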

Model#

# Linear Regression model
lin_reg = LinearRegression()
lin_reg.fit(X_train, y_train)

# Predictions and accuracy
y_pred = lin_reg.predict(X_test)
mse = mean_squared_error(y_test, y_pred)
print(f'Linear Regression Mean Squared Error: {mse:.4f}')

# Cross-validation (here and in the model cells below, CV is run on the full, unscaled feature set X)
cv_scores = cross_val_score(lin_reg, X, y, cv=5, scoring='neg_mean_squared_error')
print(f'Linear Regression Cross-Validation Mean Squared Error: {-cv_scores.mean():.4f}')

# Regression plot
plt.figure(figsize=(10, 6))
sns.regplot(x='energy', y='co2', data=data)
plt.title('Linear Regression: CO2 Emissions vs Renewable Energy Production')
plt.xlabel('Renewable Energy Production (Metric Tons)')
plt.ylabel('CO2 Emissions (Metric Tons)')
plt.show()
Linear Regression Mean Squared Error: 2453.6332
Linear Regression Cross-Validation Mean Squared Error: 2340.3392
[Figure: regression plot of CO2 emissions vs renewable energy production]
# Decision Tree Regressor model
tree = DecisionTreeRegressor(max_leaf_nodes=5)
tree.fit(X_train, y_train)

# Predictions and accuracy
tree_pred = tree.predict(X_test)
tree_mse = mean_squared_error(y_test, tree_pred)
print(f'Decision Tree Mean Squared Error: {tree_mse:.4f}')

# Cross-validation
cv_scores_tree = cross_val_score(tree, X, y, cv=5, scoring='neg_mean_squared_error')
print(f'Decision Tree Cross-Validation Mean Squared Error: {-cv_scores_tree.mean():.4f}')

# Visualize the decision tree (label nodes with the features actually selected, not all of X's columns)
from sklearn.tree import plot_tree
fig = plt.figure(figsize=(20, 10))
_ = plot_tree(tree, feature_names=list(selected_features), filled=True)
plt.show()
Decision Tree Mean Squared Error: 2786.9683
Decision Tree Cross-Validation Mean Squared Error: 8093.9584
[Figure: decision tree visualization]
# Random Forest Regressor model
rf = RandomForestRegressor(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)

# Predictions and accuracy
rf_pred = rf.predict(X_test)
rf_mse = mean_squared_error(y_test, rf_pred)
print(f'Random Forest Mean Squared Error: {rf_mse:.4f}')

# Cross-validation
cv_scores_rf = cross_val_score(rf, X, y, cv=5, scoring='neg_mean_squared_error')
print(f'Random Forest Cross-Validation Mean Squared Error: {-cv_scores_rf.mean():.4f}')
Random Forest Mean Squared Error: 2507.1670
Random Forest Cross-Validation Mean Squared Error: 4630.6372
# K-Nearest Neighbors Regressor model
knn = KNeighborsRegressor()
knn.fit(X_train, y_train)

# Predictions and accuracy
knn_pred = knn.predict(X_test)
knn_mse = mean_squared_error(y_test, knn_pred)
print(f'K-Nearest Neighbors Mean Squared Error: {knn_mse:.4f}')

# Cross-validation
cv_scores_knn = cross_val_score(knn, X, y, cv=5, scoring='neg_mean_squared_error')
print(f'K-Nearest Neighbors Cross-Validation Mean Squared Error: {-cv_scores_knn.mean():.4f}')
K-Nearest Neighbors Mean Squared Error: 2931.5133
K-Nearest Neighbors Cross-Validation Mean Squared Error: 5311.0993
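With four models evaluated, it helps to collect the test-set errors into a single table (a minimal sketch reusing the mse variables computed in the cells above):

# Summarize the test-set MSE of each model fitted above
results = pd.DataFrame({
    'model': ['Linear Regression', 'Decision Tree', 'Random Forest', 'KNN'],
    'test_mse': [mse, tree_mse, rf_mse, knn_mse],
}).sort_values('test_mse')
print(results.to_string(index=False))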
# Hyperparameter tuning for Random Forest using GridSearchCV
param_grid = {
    'n_estimators': [50, 100, 200],
    'max_features': [1.0, 'sqrt', 'log2'],  # 1.0 (all features) replaces the removed 'auto' option
    'max_depth': [10, 20, None],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4]
}

# Initialize GridSearchCV
grid_search = GridSearchCV(estimator=rf, param_grid=param_grid, cv=5, scoring='neg_mean_squared_error', n_jobs=-1, verbose=2)
grid_search.fit(X_train, y_train)

# Best parameters and model evaluation
best_rf = grid_search.best_estimator_
best_rf_pred = best_rf.predict(X_test)
best_rf_mse = mean_squared_error(y_test, best_rf_pred)
print(f'Best Random Forest Mean Squared Error: {best_rf_mse:.4f}')
print(f'Best Random Forest Parameters: {grid_search.best_params_}')
Fitting 5 folds for each of 243 candidates, totalling 1215 fits
Best Random Forest Mean Squared Error: 2467.9437
Best Random Forest Parameters: {'max_depth': 10, 'max_features': 'sqrt', 'min_samples_leaf': 4, 'min_samples_split': 2, 'n_estimators': 50}
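Beyond the single best estimator, GridSearchCV records every combination it tried in cv_results_; a short sketch to inspect the top-ranked candidates:

# Inspect the three best parameter combinations found by the grid search
cv_results = pd.DataFrame(grid_search.cv_results_)
top3 = cv_results.sort_values('rank_test_score').head(3)
print(top3[['params', 'mean_test_score']])  # mean_test_score is the negated MSE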
# Prepare the annual data for Prophet (it requires a datetime 'ds' column and a numeric 'y')
prophet_data = co2_data_grouped.rename(columns={'year': 'ds', 'co2': 'y'})
prophet_data['ds'] = pd.to_datetime(prophet_data['ds'].astype(str), format='%Y')

# Initialize and fit the Prophet model
model = Prophet()
model.fit(prophet_data)

# Make a future dataframe for predictions
future = model.make_future_dataframe(periods=10, freq='Y')
forecast = model.predict(future)

forecast['year'] = forecast['ds'].dt.year

# Plot the forecast
plt.figure(figsize=(10, 6))
plt.plot(prophet_data['ds'], prophet_data['y'], label='Historical Data')
plt.plot(forecast['ds'], forecast['yhat'], label='Forecast', color='orange')

plt.title('CO2 Emissions Forecast using Prophet')
plt.xlabel('Year')
plt.ylabel('CO2 Emissions (Metric Tons)')
plt.legend()
plt.show()
14:16:02 - cmdstanpy - INFO - Chain [1] start processing
14:16:02 - cmdstanpy - INFO - Chain [1] done processing
[Figure: CO2 emissions forecast using Prophet]
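Prophet also returns uncertainty intervals (yhat_lower, yhat_upper) alongside the point forecast; a small sketch to inspect the end of the forecast horizon:

# Show the last 10 forecast rows with Prophet's default uncertainty intervals
print(forecast[['ds', 'yhat', 'yhat_lower', 'yhat_upper']].tail(10))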

Conclusion#

In this project, we analyzed global CO2 emissions and renewable energy production trends using a range of data science techniques. We performed exploratory data analysis to identify key trends and correlations, applied several machine learning models (linear regression, decision tree, random forest, and k-nearest neighbors) to predict CO2 emissions, and used the Prophet library for time series forecasting. The analysis offered insight into how renewable energy production relates to CO2 emissions in this dataset. The models achieved comparable accuracy: linear regression produced the lowest test MSE (2453.63), with the tuned Random Forest close behind (2467.94) after hyperparameter tuning. Limitations include the use of proxy datasets, potential biases in the data, and the small number of features; future work could explore more complex models, incorporate additional data sources, and tune the models further to improve performance.

This project demonstrated the application of data science techniques learned in Math 10 to a real-world environmental dataset.

References#