An Analysis of the Likelihood of People Changing Their Occupation#

Author: Nikolina Sentovich

Course Project, UC Irvine, Math 10, F24

I would like to post my notebook on the course’s website. [Yes]

Introduction#

In this project, I will analyze what causes people to change jobs. As a college student who is not 100% set on what I want to do, I am curious to find out which factors have the biggest impact on prompting a career switch. Are there certain fields of study whose graduates are more likely to change careers? What about certain occupations? Are there other important factors?

The goal of this project is to perform data cleaning and explore the dataset by creating different models in order to get a better understanding of industry, the job market, and the likelihood of changing occupation. I found a dataset on Kaggle which, according to its creator, has 30,000+ records and 22 features.

Kaggle Dataset: Field Of Study vs Occupation

Libraries and Imports#

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from IPython.display import display
from sklearn.preprocessing import OneHotEncoder
from sklearn.model_selection import train_test_split
from sklearn.linear_model import Lasso
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_squared_error
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import cross_val_score
from sklearn.metrics import ConfusionMatrixDisplay, confusion_matrix
from pandas.plotting import scatter_matrix

Dataset#

df = pd.read_csv('career_change_prediction_dataset.csv').copy()
df.head()
Field of Study Current Occupation Age Gender Years of Experience Education Level Industry Growth Rate Job Satisfaction Work-Life Balance Job Opportunities ... Skills Gap Family Influence Mentorship Available Certifications Freelancing Experience Geographic Mobility Professional Networks Career Change Events Technology Adoption Likely to Change Occupation
0 Medicine Business Analyst 48 Male 7 High School High 7 10 83 ... 8 High 0 0 0 1 2 0 1 0
1 Education Economist 44 Male 26 Master's Low 10 3 55 ... 3 Medium 0 0 1 1 2 1 9 0
2 Education Biologist 21 Female 27 Master's Low 8 3 78 ... 4 Low 0 0 0 0 2 1 2 0
3 Education Business Analyst 33 Male 14 PhD Medium 7 9 62 ... 2 Medium 1 0 0 0 9 0 1 0
4 Arts Doctor 28 Female 0 PhD Low 3 1 8 ... 5 Low 0 0 1 0 2 0 7 1

5 rows × 23 columns

Features#

The following features and their descriptions are taken from the Kaggle dataset.

Field of Study:

Type: Categorical (String)
Description: The area of academic focus during the individual’s education.

Current Occupation:

Type: Categorical (String)
Description: The individual’s current job or industry they are employed in.

Age:

Type: Integer
Description: The age of the individual.

Gender:

Type: Categorical (String)
Description: The gender of the individual.

Years of Experience:

Type: Integer
Description: The number of years the individual has been in the workforce.

Education Level:

Type: Categorical (String)
Description: The highest level of education completed by the individual.

Industry Growth Rate:

Type: Categorical (String)
Description: The growth rate of the industry the individual works in.

Job Satisfaction:

Type: Integer (1-10 scale)
Description: A rating of the individual’s job satisfaction.

Work-Life Balance:

Type: Integer (1-10 scale)
Description: A rating of the individual’s perceived work-life balance.

Job Opportunities:

Type: Integer
Description: The number of available job opportunities in the individual’s field.

Salary:

Type: Integer
Description: The annual salary of the individual (in USD or local currency equivalent).

Job Security:

Type: Integer (1-10 scale)
Description: A rating of the individual’s perceived job security.

Career Change Interest:

Type: Integer (0 or 1)
Description: Whether the individual is interested in changing their occupation (1 for yes, 0 for no).

Skills Gap:

Type: Integer (1-10 scale)
Description: A measure of how well the individual’s current skills match their job requirements.

Family Influence:

Type: Categorical (String)
Description: The degree of influence the individual’s family has on their career choice.

Mentorship Available:

Type: Integer (0 or 1)
Description: Whether the individual has access to a mentor in their current job.

Certifications:

Type: Integer (0 or 1)
Description: Whether the individual holds any certifications relevant to their occupation.

Freelancing Experience:

Type: Integer (0 or 1)
Description: Whether the individual has freelanced in the past.

Geographic Mobility:

Type: Integer (0 or 1)
Description: Whether the individual is willing to relocate for a job.

Professional Networks:

Type: Integer (1-10 scale)
Description: A measure of how strong the individual’s professional network is.

Career Change Events:

Type: Integer
Description: The number of career changes the individual has made in the past.

Technology Adoption:

Type: Integer (1-10 scale)
Description: A measure of the individual’s comfort level with adopting new technologies.

Target Variable: Likely to Change Occupation:

Type: Integer (0 or 1)
Description: The target variable indicating whether an individual is likely to change their occupation.

Data Cleaning#

I am first going to convert all of my categorical data into numerical data. This way, it will be easier to clean the data and reduce the dimensionality. I did some research and found that one-hot encoding would be a good fit for my dataset. The code is inspired by Geeks for Geeks: One Hot Encoding using Scikit Learn Library

# Selecting the categorical features
categorical_features = ['Field of Study', 'Current Occupation', 'Gender', 
                        'Education Level', 'Industry Growth Rate', 'Family Influence']

# Initialize the encoder
encoder = OneHotEncoder(sparse_output=False)

# Applying one-hot encoding to the categorical features
one_hot_encoded = encoder.fit_transform(df[categorical_features])

# Create a df with the one-hot encoded/dummy features
one_hot_df = pd.DataFrame(one_hot_encoded, columns=encoder.get_feature_names_out(categorical_features), index=df.index)

# Concatenate the one-hot encoded df with the original df
df_encoded = pd.concat([df, one_hot_df], axis=1)

# drop the original categorical columns (since they are now encoded)
df_encoded = df_encoded.drop(columns=categorical_features)


display(df_encoded.head())
Age Years of Experience Job Satisfaction Work-Life Balance Job Opportunities Salary Job Security Career Change Interest Skills Gap Mentorship Available ... Education Level_High School Education Level_Master's Education Level_PhD Industry Growth Rate_High Industry Growth Rate_Low Industry Growth Rate_Medium Family Influence_High Family Influence_Low Family Influence_Medium Family Influence_nan
0 48 7 7 10 83 198266 8 0 8 0 ... 1.0 0.0 0.0 1.0 0.0 0.0 1.0 0.0 0.0 0.0
1 44 26 10 3 55 96803 9 0 3 0 ... 0.0 1.0 0.0 0.0 1.0 0.0 0.0 0.0 1.0 0.0
2 21 27 8 3 78 65920 4 0 4 0 ... 0.0 1.0 0.0 0.0 1.0 0.0 0.0 1.0 0.0 0.0
3 33 14 7 9 62 85591 5 0 2 1 ... 0.0 0.0 1.0 0.0 0.0 1.0 0.0 0.0 1.0 0.0
4 28 0 3 1 8 43986 3 0 5 0 ... 0.0 0.0 1.0 0.0 1.0 0.0 0.0 1.0 0.0 0.0

5 rows × 50 columns

We proceed with data cleaning by checking for columns with missing values. The results show that there are no columns with missing values in the encoded dataframe (note that the missing entries in Family Influence were already captured by the Family Influence_nan indicator column during encoding).

missing_values = df_encoded.isnull().sum()
print(missing_values[missing_values > 0])
Series([], dtype: int64)

Now that my dataframe consists only of numerical data, I am going to run a Lasso regression to see which features have the greatest significance for the target variable (Likely to Change Occupation).
The following code is inspired by the Math 10 lecture notes on Lasso regression. I modified the plot to use bars for easier viewing.

# Define the target variable and features
target = 'Likely to Change Occupation'  
X = df_encoded.drop(columns=[target])
y = df_encoded[target]

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Scale the features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

#Lasso regression
lasso = Lasso(alpha=0.001, random_state=42)  
lasso.fit(X_train_scaled, y_train)

# predictions
y_pred = lasso.predict(X_test_scaled)

# Calculate Mean Squared Error (MSE)
mse = mean_squared_error(y_test, y_pred)
print(f"Mean Squared Error: {mse:.4f}")

# Get coefficients
coefficients = pd.Series(lasso.coef_, index=X.columns)

plt.figure(figsize=(12, 8))
coefficients.sort_values(ascending=False).plot(kind='bar', color='skyblue', edgecolor='black')
plt.title("Lasso Regression: Feature Coefficients")
plt.xlabel("Features")
plt.ylabel("Coefficient Value")
plt.xticks(rotation=90)  # Rotate feature names for better readability
plt.grid(axis='y', linestyle='--', alpha=0.7)
plt.tight_layout()  # Adjust layout to prevent clipping
plt.show()
Mean Squared Error: 0.1054
../_images/a23d408410b693e3c6c0b6d976441602da7c419c185428308b3389bd9a33ce02.png

Based on the results of the Lasso regression, we can infer from this plot that the three features with the greatest impact on the target variable are Career Change Interest, Salary, and Job Satisfaction.
Career Change Interest has a positive relationship with the likelihood of changing occupation: as career change interest increases, the likelihood of changing occupation increases. Salary and Job Satisfaction, however, have a negative relationship with the likelihood of changing occupation, meaning that as salary or job satisfaction increases (decreases), the likelihood of changing occupation decreases (increases).
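As a quick check on that reading of the bar chart, the largest coefficients can also be printed directly. This is a small sketch using the coefficients Series defined in the cell above, not part of the original analysis.

# Rank the Lasso coefficients by absolute magnitude and print the three largest
top_three = coefficients.abs().sort_values(ascending=False).head(3)
print("Three most influential features (signed coefficients):")
print(coefficients[top_three.index])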

So, now we know the features that have the most significant impact on the target variable, but what about the other features? I want to keep only the features that have some impact on the target variable to keep this project more interesting.

In the following code, I plot the Lasso coefficients again, but I drop Career Change Interest, Salary, and Job Satisfaction so I can zoom in and determine which features are the next most significant.

# Define features to drop
excluded_features = ['Career Change Interest', 'Salary', 'Job Satisfaction']

# Filter the coefficients
filtered_coefficients = coefficients.drop(index=excluded_features)


# Sort filtered coefficients by absolute value
sorted_filtered_coefficients = filtered_coefficients.reindex(
    filtered_coefficients.abs().sort_values(ascending=False).index
)
print("Filtered Coefficients (Sorted by Absolute Value):\n", sorted_filtered_coefficients)


# y-axis limits
y_min, y_max = filtered_coefficients.min() * 1.1, filtered_coefficients.max() * 1.1  # Add margin

# Plot the filtered coefficients
plt.figure(figsize=(12, 8))
filtered_coefficients.sort_values(ascending=False).plot(kind='bar', color='lightcoral', edgecolor='black')
plt.title("Lasso Regression: Feature Coefficients (Excluding Selected Features)")
plt.ylim(y_min, y_max)  
plt.xlabel("Features")
plt.ylabel("Coefficient Value")
plt.xticks(rotation=90)  # Rotate feature names for better readability
plt.grid(axis='y', linestyle='--', alpha=0.5)  
plt.tight_layout()  # Adjust layout to prevent clipping
plt.show()
Filtered Coefficients (Sorted by Absolute Value):
 Field of Study_Education                  3.295490e-03
Industry Growth Rate_Medium               3.239855e-03
Family Influence_High                     2.469651e-03
Professional Networks                     2.251828e-03
Years of Experience                      -2.205886e-03
Education Level_High School              -1.962927e-03
Freelancing Experience                    1.766056e-03
Field of Study_Psychology                -1.623722e-03
Current Occupation_Teacher               -1.544169e-03
Field of Study_Arts                      -1.290953e-03
Mentorship Available                     -1.142822e-03
Current Occupation_Software Developer    -1.049295e-03
Job Opportunities                        -8.464807e-04
Gender_Female                            -8.162868e-04
Age                                      -6.384907e-04
Field of Study_Medicine                   5.987319e-04
Field of Study_Biology                   -4.552262e-04
Current Occupation_Biologist              4.497016e-04
Field of Study_Business                   4.444105e-04
Current Occupation_Mechanical Engineer    3.598012e-04
Career Change Events                      3.188301e-04
Certifications                            2.205796e-04
Current Occupation_Lawyer                 2.968502e-05
Geographic Mobility                      -5.533156e-06
Gender_Male                               2.044644e-17
Education Level_Master's                  0.000000e+00
Education Level_PhD                       0.000000e+00
Family Influence_Low                     -0.000000e+00
Education Level_Bachelor's                0.000000e+00
Industry Growth Rate_High                -0.000000e+00
Family Influence_Medium                  -0.000000e+00
Industry Growth Rate_Low                 -0.000000e+00
Current Occupation_Artist                 0.000000e+00
Current Occupation_Psychologist          -0.000000e+00
Current Occupation_Economist             -0.000000e+00
Current Occupation_Doctor                 0.000000e+00
Current Occupation_Business Analyst      -0.000000e+00
Field of Study_Mechanical Engineering     0.000000e+00
Field of Study_Law                        0.000000e+00
Field of Study_Economics                 -0.000000e+00
Field of Study_Computer Science          -0.000000e+00
Technology Adoption                      -0.000000e+00
Skills Gap                               -0.000000e+00
Job Security                              0.000000e+00
Work-Life Balance                        -0.000000e+00
Family Influence_nan                     -0.000000e+00
dtype: float64
../_images/9aca1af77035918ddc6456d8ee3169509a0cf8512d2181d5a8c748df4788a425.png

Based on the results of this Lasso regression, I will drop all of the features whose coefficients the Lasso regression drives to zero, since that means they do not hold much significance for the target variable. I will create a new dataframe (features_reduced_df) to work with.

# Identify the features whose Lasso coefficient is zero
zero_coefficient_features = filtered_coefficients[filtered_coefficients == 0].index.tolist()

# Dropping zero coefficient features from the dataset
features_reduced_df = X.drop(columns=zero_coefficient_features)


#Adding back the target variable
features_reduced_df['Likely to Change Occupation'] = df_encoded['Likely to Change Occupation']


display(features_reduced_df)
Age Years of Experience Job Satisfaction Job Opportunities Salary Career Change Interest Mentorship Available Certifications Freelancing Experience Geographic Mobility ... Current Occupation_Lawyer Current Occupation_Mechanical Engineer Current Occupation_Software Developer Current Occupation_Teacher Gender_Female Gender_Male Education Level_High School Industry Growth Rate_Medium Family Influence_High Likely to Change Occupation
0 48 7 7 83 198266 0 0 0 0 1 ... 0.0 0.0 0.0 0.0 0.0 1.0 1.0 0.0 1.0 0
1 44 26 10 55 96803 0 0 0 1 1 ... 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0
2 21 27 8 78 65920 0 0 0 0 0 ... 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0
3 33 14 7 62 85591 0 1 0 0 0 ... 0.0 0.0 0.0 0.0 0.0 1.0 0.0 1.0 0.0 0
4 28 0 3 8 43986 0 0 0 1 0 ... 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 1
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
38439 24 34 8 92 117728 1 0 0 0 0 ... 0.0 0.0 0.0 0.0 1.0 0.0 1.0 0.0 0.0 1
38440 21 24 2 73 132500 1 1 0 0 0 ... 0.0 0.0 0.0 0.0 1.0 0.0 1.0 0.0 0.0 1
38441 35 21 4 77 55301 0 1 1 1 0 ... 0.0 1.0 0.0 0.0 1.0 0.0 1.0 0.0 0.0 1
38442 35 11 9 63 171459 0 0 0 0 1 ... 0.0 0.0 0.0 0.0 0.0 1.0 0.0 1.0 0.0 0
38443 37 23 6 49 189967 0 1 0 0 1 ... 0.0 1.0 0.0 0.0 0.0 1.0 0.0 1.0 0.0 0

38444 rows × 29 columns

Here are some basic summary statistics for some of the features of the reduced-features dataframe that we will be using for the remainder of the project.

Please note that for Career Change Interest, 1 means YES and 0 means NO. Based on the statistics, the average (about 0.2) leans toward NO.

stat_columns = ['Age', 'Salary', 'Job Opportunities', 'Job Satisfaction', 'Years of Experience','Career Change Interest']
summary_stats = features_reduced_df[stat_columns].describe(include='all')

print(summary_stats)
                Age         Salary  Job Opportunities  Job Satisfaction  \
count  38444.000000   38444.000000       38444.000000      38444.000000   
mean      39.540422  114975.623999          50.308267          5.489673   
std       11.574509   48963.725598          28.877294          2.870407   
min       20.000000   30005.000000           1.000000          1.000000   
25%       30.000000   72701.500000          25.000000          3.000000   
50%       40.000000  114861.000000          50.000000          6.000000   
75%       50.000000  157241.000000          75.000000          8.000000   
max       59.000000  199996.000000         100.000000         10.000000   

       Years of Experience  Career Change Interest  
count         38444.000000            38444.000000  
mean             19.548200                0.199901  
std              11.552474                0.399931  
min               0.000000                0.000000  
25%              10.000000                0.000000  
50%              20.000000                0.000000  
75%              30.000000                0.000000  
max              39.000000                1.000000  

To further clarify the distribution of Career Change Interest, a bar chart of the 0/1 counts shows that most people reported no interest in changing their career, which is consistent with the mean of about 0.2.
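Since the chart itself is not shown above, here is a minimal sketch of how it could be produced from the reduced dataframe (a bar chart of the 0/1 counts serves the same purpose as a histogram here).

# Bar chart of Career Change Interest counts (0 = No, 1 = Yes)
features_reduced_df['Career Change Interest'].value_counts().sort_index().plot(
    kind='bar', figsize=(6, 4), color='skyblue', edgecolor='black')
plt.title('Distribution of Career Change Interest')
plt.xlabel('Career Change Interest (0 = No, 1 = Yes)')
plt.ylabel('Count')
plt.xticks(rotation=0)
plt.show()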

Field of Study/Occupation#

Even though Field of Study and Current Occupation generally do not have as large an impact on the target variable as other features, I still want to see which fields of study and occupations are associated with a higher likelihood of changing careers.

import matplotlib.pyplot as plt

# Selecting columns
field_of_study_df = filtered_coefficients[['Field of Study_Biology', 'Field of Study_Business', 'Field of Study_Medicine', 'Field of Study_Mechanical Engineering', 'Field of Study_Education', 'Field of Study_Law', 'Field of Study_Computer Science', 'Field of Study_Arts', 'Field of Study_Psychology', 'Field of Study_Economics']]

occupation_df = filtered_coefficients[['Current Occupation_Artist', 'Current Occupation_Biologist', 'Current Occupation_Business Analyst', 'Current Occupation_Doctor', 'Current Occupation_Economist', 'Current Occupation_Lawyer', 'Current Occupation_Mechanical Engineer', 'Current Occupation_Psychologist', 'Current Occupation_Software Developer', 'Current Occupation_Teacher']]

field_of_study_df.plot(kind="bar", figsize=(12, 6))
plt.title("Field of Study Coefficients")
plt.xlabel("Field of Study")
plt.ylabel("Coefficient Value")
plt.xticks(rotation=45, ha='right')
plt.tight_layout()
plt.grid(axis = 'y',alpha = 0.5)
plt.show()

occupation_df.plot(kind="bar", figsize=(12, 6))
plt.title("Occupation Coefficients")
plt.xlabel("Occupation")
plt.ylabel("Coefficient Value")
plt.xticks(rotation=45, ha='right')
plt.tight_layout()
plt.grid(axis = 'y', alpha = 0.5)
plt.show()
../_images/ad906feef1a6cf36d75840b77bba6a40055474e924671602cd3bac5a53da90dc.png ../_images/ffa12f7c690f5d44e8fed8d15fcdcd897b5ea43276fa32859d6b3ece96baca19.png

From these results, we can conclude that those who study Education are more likely to change occupations, while those who study Psychology are the least likely.
Likewise, Biologists are the most likely to change occupations, while Teachers are the least likely.
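These rankings can also be confirmed programmatically. This is a small sketch using the field_of_study_df and occupation_df Series defined in the previous cell.

# Largest and smallest Lasso coefficients within each group
print("Field of Study most / least associated with a change:", field_of_study_df.idxmax(), "/", field_of_study_df.idxmin())
print("Occupation most / least associated with a change:", occupation_df.idxmax(), "/", occupation_df.idxmin())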

Examining Relationships Between the Three Most Important Features#

Based on the results of the Lasso regression, we know the three features that have the greatest influence on the target variable, the likelihood of a career change: Salary, Career Change Interest, and Job Satisfaction.

# Making box plots
plt.figure()
sns.boxplot(x='Likely to Change Occupation', y='Job Satisfaction', data=features_reduced_df)
plt.title('Job Satisfaction vs. Likely to Change Occupation')
plt.show()

plt.figure()
sns.boxplot(x='Likely to Change Occupation', y='Salary', data=features_reduced_df)
plt.title('Salary vs. Likely to Change Occupation')
plt.show()
../_images/aab792d253b2db024960a2723c5b0dc1a72826d355e86b18eb9e9a5627d16e8b.png ../_images/544986a1c518a72adde2161b8ae4b2cf5a56f35b36c499b27dabf7e13f377855.png

The results suggest that the median salary is higher for those not likely to change jobs. Additionally, the median Job Satisfaction is lower for those likely to change jobs, and their satisfaction distribution is more variable and skewed downward.

As for spread, the range of Salary is more similar across the two groups of Likely to Change Occupation than the range of Job Satisfaction is. Since Job Satisfaction varies so much with respect to the target, this suggests that we should also examine how other variables affect the likelihood of changing occupations.
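The median comparison can also be checked numerically with a quick groupby. This is a small sketch on the reduced dataframe, not part of the original analysis.

# Median Salary and Job Satisfaction for each value of the target variable
print(features_reduced_df.groupby('Likely to Change Occupation')[['Salary', 'Job Satisfaction']].median())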

Examining Relationships Between More Features#

I want to see the relationships between more of the significant features we identified, not just the three most important, so I will construct a scatter plot matrix.
We can see from the results that many of the off-diagonal scatter plots show no clear linear relationship.

selected_columns = ['Age', 'Job Opportunities','Years of Experience','Career Change Interest','Salary','Job Satisfaction','Likely to Change Occupation']
scatter_matrix(features_reduced_df[selected_columns], alpha=0.5, figsize=(20, 20), diagonal='kde')


plt.show()
../_images/8a86240438eab2c9f3ed7bf3a84e7676b3042f865066193601eea2f08ea15c7f.png

To solidify the visual impression of the relationships between features, I will construct a correlation matrix to attach numerical values to them.
We can see that Career Change Interest and Likely to Change Occupation have a moderate positive linear relationship, and Job Satisfaction and Likely to Change Occupation have a moderate negative linear relationship. All of the other pairs have a correlation coefficient very close to zero, indicating little to no linear relationship.
I am glad I did both a correlation matrix and a scatter plot matrix, as together they made the results clearer.

correlation_matrix = df[['Age', 'Job Opportunities','Years of Experience','Career Change Interest','Salary','Job Satisfaction','Likely to Change Occupation']].corr()
sns.heatmap(correlation_matrix, annot=True, cmap="coolwarm")
plt.title("Correlation Matrix")
plt.show()
../_images/8619dedc3bbfca81c03257750042eb94d4be5e4c7c20716e92aae72189fb4736.png

Extra: Random Forest#

Since most of the relationships are not linear, it would be interesting to run a Random Forest algorithm, which handles non-linear relationships better, and compare its results to those of the Lasso regression, which is predominantly suited to linear relationships.

The information above and following code are adapted from Geeks for Geeks: Random Forest Algorithm in Machine Learning

# Extract feature names (excluding the target variable)
features_reduced = features_reduced_df.columns.drop('Likely to Change Occupation').tolist()

# Subset the dataset using the reduced feature list
X_train_reduced = X_train[features_reduced]
X_test_reduced = X_test[features_reduced]

# Train the Random Forest model
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train_reduced, y_train)

# Make predictions on the test set
y_pred_rf = rf.predict(X_test_reduced)

# Generate confusion matrix
cm = confusion_matrix(y_test, y_pred_rf)
disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=rf.classes_)
disp.plot(cmap="Blues")
plt.title('Confusion Matrix for Random Forest')
plt.show()
../_images/211ce466220d580bbcbf3433d5b1cb410e24d4565ac939e4c12dee5b64491101.png

To confirm the results of the confusion matrix in a different way, I will calculate the cross-validation accuracy and see if it tells the same story.

k=5
scores = cross_val_score(rf, X_train[features_reduced], y_train, cv=k, scoring='accuracy')
print(f"Cross-Validation Accuracy: {scores.mean():.4f} ± {scores.std():.4f}")
Cross-Validation Accuracy: 0.9999 ± 0.0001

The results of the confusion matrix and the cross-validation accuracy imply that the Random Forest classifies this dataset almost perfectly, with essentially no Type I or Type II errors.
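As an additional sanity check (using the accuracy_score import from the top of the notebook), the held-out test accuracy can be computed directly. This is a small sketch rather than part of the original analysis.

# Accuracy of the Random Forest on the held-out test set
test_accuracy = accuracy_score(y_test, y_pred_rf)
print(f"Test Set Accuracy: {test_accuracy:.4f}")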
Below is code inspired by Stack Overflow (this page and this one too) to compare the Lasso coefficients with the Random Forest feature importances.

# Compare Lasso coefficients and Random Forest feature importances
rf_importances = rf.feature_importances_
# Align the Lasso coefficients with the reduced feature set by name
lasso_coefficients = pd.Series(lasso.coef_, index=X.columns)[features_reduced].values

# Create a comparison df
comparison_df = pd.DataFrame({'Feature': features_reduced,
                              'Lasso Coefficient': lasso_coefficients,
                              'Random Forest Importance': rf_importances}).sort_values(by='Random Forest Importance', ascending=False)

# Plot the comparison (pandas creates its own figure, so no separate plt.figure() call is needed)
comparison_df.plot(x='Feature', kind='barh', stacked=False, figsize=(10, 8),
                   title='Feature Importance: Lasso vs Random Forest', alpha=0.7)
plt.gca().invert_yaxis()  # Reverse the order for better readability
plt.show()
../_images/3d18b9594d9a22f1635b540d62878dc7a9daef49eebc48b35cfff785fae25081.png

This graph shows that Random Forest assigns importance to more features than Lasso regression does, since there are more orange bars than blue. Also, for the features shared by Random Forest and Lasso, the Random Forest importance is greater than the Lasso coefficient in most cases. This suggests that Random Forest is more sensitive, and thus has a potential for overfitting, while Lasso is more selective, so it has a potential for underfitting. In terms of the bias-variance tradeoff, Lasso has high bias and low variance, while Random Forest has low bias and high variance, for this dataset.
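One simple way to probe the overfitting concern is to compare the Random Forest's training accuracy against the cross-validation accuracy computed earlier; a large gap between the two would point toward overfitting. This is a small sketch using the objects defined above, not part of the original analysis.

# Compare the Random Forest's training accuracy with the earlier cross-validation accuracy
train_accuracy = accuracy_score(y_train, rf.predict(X_train_reduced))
print(f"Training Accuracy: {train_accuracy:.4f}")
print(f"Cross-Validation Accuracy: {scores.mean():.4f}")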

Conclusion#

Through my project, I found that the three most significant factors are Career Change Interest, Salary, and Job Satisfaction. The median salary is higher for those not likely to change jobs. Additionally, the median Job Satisfaction is lower for those likely to change jobs. It was also interesting to see how Field of Study and Occupation affect the likelihood of changing jobs. Those who study Education are more likely to change occupations, while those who study Psychology are the least likely. Also, Biologists are the most likely to change occupations, while Teachers are the least likely. These results are heavily influenced by salary and job satisfaction. The fact that the majority of features formed non-linear relationships was predictable, but it made for a very interesting analysis when I compared the results of the Lasso regression to the Random Forest. Ultimately, the Random Forest performed better here, as it was more sensitive to the data.