Pandas dataframe#
Setup#
Download the penguins dataset from the internet.
If pandas is not installed, you can install it using the following command:
%conda install pandas
or
%pip install pandas
import pandas as pd
df = pd.read_csv('penguins.csv')
df.head()
# you need to put the penguins.csv file in the same directory as this notebook
# otherwise you need to specify the path to the file e.g. pd.read_csv('/path/to/penguins.csv')
# You can get the current working directory of the notebook by running the following command
# !pwd
species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex | |
---|---|---|---|---|---|---|---|
0 | Adelie | Torgersen | 39.1 | 18.7 | 181.0 | 3750.0 | MALE |
1 | Adelie | Torgersen | 39.5 | 17.4 | 186.0 | 3800.0 | FEMALE |
2 | Adelie | Torgersen | 40.3 | 18.0 | 195.0 | 3250.0 | FEMALE |
3 | Adelie | Torgersen | NaN | NaN | NaN | NaN | NaN |
4 | Adelie | Torgersen | 36.7 | 19.3 | 193.0 | 3450.0 | FEMALE |
Alternatively, you can use the following code to load the penguins dataset from the seaborn library:
import seaborn as sns
df = sns.load_dataset("penguins")
DataFrames#
type(df)
pandas.core.frame.DataFrame
# get the type of each column
df.dtypes
species object
island object
bill_length_mm float64
bill_depth_mm float64
flipper_length_mm float64
body_mass_g float64
sex object
dtype: object
# get the number of rows and columns
df.shape
(344, 7)
Indexing and slicing#
# get column
col = df['species']
print(col)
type(col)
0 Adelie
1 Adelie
2 Adelie
3 Adelie
4 Adelie
...
339 Gentoo
340 Gentoo
341 Gentoo
342 Gentoo
343 Gentoo
Name: species, Length: 344, dtype: object
pandas.core.series.Series
# althernatively
col = df.species
df.bill_length_mm.mean()
43.92192982456142
# get the first two rows
df[0:2]
species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex | |
---|---|---|---|---|---|---|---|
0 | Adelie | Torgersen | 39.1 | 18.7 | 181.0 | 3750.0 | MALE |
1 | Adelie | Torgersen | 39.5 | 17.4 | 186.0 | 3800.0 | FEMALE |
# use boolean indexing to filter the data
df[df['species'] == 'Adelie']
species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex | |
---|---|---|---|---|---|---|---|
0 | Adelie | Torgersen | 39.1 | 18.7 | 181.0 | 3750.0 | MALE |
1 | Adelie | Torgersen | 39.5 | 17.4 | 186.0 | 3800.0 | FEMALE |
2 | Adelie | Torgersen | 40.3 | 18.0 | 195.0 | 3250.0 | FEMALE |
3 | Adelie | Torgersen | NaN | NaN | NaN | NaN | NaN |
4 | Adelie | Torgersen | 36.7 | 19.3 | 193.0 | 3450.0 | FEMALE |
... | ... | ... | ... | ... | ... | ... | ... |
147 | Adelie | Dream | 36.6 | 18.4 | 184.0 | 3475.0 | FEMALE |
148 | Adelie | Dream | 36.0 | 17.8 | 195.0 | 3450.0 | FEMALE |
149 | Adelie | Dream | 37.8 | 18.1 | 193.0 | 3750.0 | MALE |
150 | Adelie | Dream | 36.0 | 17.1 | 187.0 | 3700.0 | FEMALE |
151 | Adelie | Dream | 41.5 | 18.5 | 201.0 | 4000.0 | MALE |
152 rows × 7 columns
# .loc is primarily label based, in this example, 0, 1, 2 are the labels of the rows.
# In some other cases, the row label may be a string such as 'a', 'b', 'c', etc.
# May also be used with a boolean array.
# for loc, end index is included
df.loc[0:2, ['species', 'island']]
species | island | |
---|---|---|
0 | Adelie | Torgersen |
1 | Adelie | Torgersen |
2 | Adelie | Torgersen |
df.loc[df.species == 'Adelie', ['species', 'island']]
species | island | |
---|---|---|
0 | Adelie | Torgersen |
1 | Adelie | Torgersen |
2 | Adelie | Torgersen |
3 | Adelie | Torgersen |
4 | Adelie | Torgersen |
... | ... | ... |
147 | Adelie | Dream |
148 | Adelie | Dream |
149 | Adelie | Dream |
150 | Adelie | Dream |
151 | Adelie | Dream |
152 rows × 2 columns
# get a subset of the data frame
df.loc[:2, 'bill_length_mm':'body_mass_g']
bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | |
---|---|---|---|---|
0 | 39.1 | 18.7 | 181.0 | 3750.0 |
1 | 39.5 | 17.4 | 186.0 | 3800.0 |
2 | 40.3 | 18.0 | 195.0 | 3250.0 |
# .iloc is primarily integer position based (from 0 to length-1 of the axis)
# may also be used with a boolean array.
df.iloc[0:2, 0:2]
species | island | |
---|---|---|
0 | Adelie | Torgersen |
1 | Adelie | Torgersen |
Descriptive statistics#
df.describe()
bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | |
---|---|---|---|---|
count | 342.000000 | 342.000000 | 342.000000 | 342.000000 |
mean | 43.921930 | 17.151170 | 200.915205 | 4201.754386 |
std | 5.459584 | 1.974793 | 14.061714 | 801.954536 |
min | 32.100000 | 13.100000 | 172.000000 | 2700.000000 |
25% | 39.225000 | 15.600000 | 190.000000 | 3550.000000 |
50% | 44.450000 | 17.300000 | 197.000000 | 4050.000000 |
75% | 48.500000 | 18.700000 | 213.000000 | 4750.000000 |
max | 59.600000 | 21.500000 | 231.000000 | 6300.000000 |
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 344 entries, 0 to 343
Data columns (total 7 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 species 344 non-null object
1 island 344 non-null object
2 bill_length_mm 342 non-null float64
3 bill_depth_mm 342 non-null float64
4 flipper_length_mm 342 non-null float64
5 body_mass_g 342 non-null float64
6 sex 333 non-null object
dtypes: float64(4), object(3)
memory usage: 18.9+ KB
df["species"].value_counts(dropna=False)
Adelie 152
Gentoo 124
Chinstrap 68
Name: species, dtype: int64
# compute the mean of one species
df[df['species'] == 'Adelie']['body_mass_g'].mean()
3700.662251655629
# compute the mean of all species
df.groupby("species")[["bill_length_mm", "bill_depth_mm", "flipper_length_mm", "body_mass_g"]].mean()
bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | |
---|---|---|---|---|
species | ||||
Adelie | 38.791391 | 18.346358 | 189.953642 | 3700.662252 |
Chinstrap | 48.833824 | 18.420588 | 195.823529 | 3733.088235 |
Gentoo | 47.504878 | 14.982114 | 217.186992 | 5076.016260 |
Missing values#
# show rows with missing values
df_missinig = df[df.isna().any(axis=1)]
df_missinig
species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex | |
---|---|---|---|---|---|---|---|
3 | Adelie | Torgersen | NaN | NaN | NaN | NaN | NaN |
8 | Adelie | Torgersen | 34.1 | 18.1 | 193.0 | 3475.0 | NaN |
9 | Adelie | Torgersen | 42.0 | 20.2 | 190.0 | 4250.0 | NaN |
10 | Adelie | Torgersen | 37.8 | 17.1 | 186.0 | 3300.0 | NaN |
11 | Adelie | Torgersen | 37.8 | 17.3 | 180.0 | 3700.0 | NaN |
47 | Adelie | Dream | 37.5 | 18.9 | 179.0 | 2975.0 | NaN |
246 | Gentoo | Biscoe | 44.5 | 14.3 | 216.0 | 4100.0 | NaN |
286 | Gentoo | Biscoe | 46.2 | 14.4 | 214.0 | 4650.0 | NaN |
324 | Gentoo | Biscoe | 47.3 | 13.8 | 216.0 | 4725.0 | NaN |
336 | Gentoo | Biscoe | 44.5 | 15.7 | 217.0 | 4875.0 | NaN |
339 | Gentoo | Biscoe | NaN | NaN | NaN | NaN | NaN |
How to handle missing values#
There is no one-size-fits-all solution for handling missing value. If the dataset is large and only a few values are missing, it might be practical to simply drop these data points. Alternatively, techniques such as imputation can be used to fill in the gaps. For instance, if temperature readings are missing in a time series, interpolating based on surrounding data points might be effective.
Sometimes retaining missing values is beneficial, as their absence itself can provide meaningful insights into the data. For example, if a survey question is left blank, it might indicate that the respondent was uncomfortable answering it. Certain algorithms are designed to interpret and learn from these missing patterns, adding an extra layer of analysis.
# Drop rows with missing values
# you can use inplace=True to modify the original data frame
# In our dataset, we can see that there are two samples with no numerical measurements
# There are 11 samples with only sex missing
# we can drop all rows with missing values, becasue that's a small fraction of the data
# maybe we can drop rows with missing numerical measurements and keep those with missing sex
# This can be achieved by thresh=6, which means that the row must have at least 6 non-missing values
df_clean = df.dropna(thresh=6)
df_clean
species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex | |
---|---|---|---|---|---|---|---|
0 | Adelie | Torgersen | 39.1 | 18.7 | 181.0 | 3750.0 | MALE |
1 | Adelie | Torgersen | 39.5 | 17.4 | 186.0 | 3800.0 | FEMALE |
2 | Adelie | Torgersen | 40.3 | 18.0 | 195.0 | 3250.0 | FEMALE |
4 | Adelie | Torgersen | 36.7 | 19.3 | 193.0 | 3450.0 | FEMALE |
5 | Adelie | Torgersen | 39.3 | 20.6 | 190.0 | 3650.0 | MALE |
... | ... | ... | ... | ... | ... | ... | ... |
338 | Gentoo | Biscoe | 47.2 | 13.7 | 214.0 | 4925.0 | FEMALE |
340 | Gentoo | Biscoe | 46.8 | 14.3 | 215.0 | 4850.0 | FEMALE |
341 | Gentoo | Biscoe | 50.4 | 15.7 | 222.0 | 5750.0 | MALE |
342 | Gentoo | Biscoe | 45.2 | 14.8 | 212.0 | 5200.0 | FEMALE |
343 | Gentoo | Biscoe | 49.9 | 16.1 | 213.0 | 5400.0 | MALE |
342 rows × 7 columns
Apply#
The apply
method is used to apply a function along the axis of a DataFrame or Series.
# Convert the body_mass_g column to kg
df['body_mass_g'] = df['body_mass_g'].apply(lambda x: x/1000)
# rename the column
df.rename(columns={'body_mass_g': 'body_mass_kg'}, inplace=True)
df.head(5)
species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_kg | sex | |
---|---|---|---|---|---|---|---|
0 | Adelie | Torgersen | 39.1 | 18.7 | 181.0 | 3.75 | MALE |
1 | Adelie | Torgersen | 39.5 | 17.4 | 186.0 | 3.80 | FEMALE |
2 | Adelie | Torgersen | 40.3 | 18.0 | 195.0 | 3.25 | FEMALE |
3 | Adelie | Torgersen | NaN | NaN | NaN | NaN | NaN |
4 | Adelie | Torgersen | 36.7 | 19.3 | 193.0 | 3.45 | FEMALE |
# compute the ratio of bill length to flipper length
df['bill_ratio'] = df.apply(lambda row: row['bill_length_mm'] / row['flipper_length_mm'], axis=1)
df.head(5)
species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_kg | sex | bill_ratio | |
---|---|---|---|---|---|---|---|---|
0 | Adelie | Torgersen | 39.1 | 18.7 | 181.0 | 3.75 | MALE | 0.216022 |
1 | Adelie | Torgersen | 39.5 | 17.4 | 186.0 | 3.80 | FEMALE | 0.212366 |
2 | Adelie | Torgersen | 40.3 | 18.0 | 195.0 | 3.25 | FEMALE | 0.206667 |
3 | Adelie | Torgersen | NaN | NaN | NaN | NaN | NaN | NaN |
4 | Adelie | Torgersen | 36.7 | 19.3 | 193.0 | 3.45 | FEMALE | 0.190155 |
View or Copy#
Use iloc or loc to modify the DataFrame.
Be careful when slice a dataframe, and modify it. It is better to use the copy()
method to avoid modifying the original DataFrame.
For more detailed disucssion, see here.
# the suggested way to modify the dataframe is to use .loc or .iloc
tmp = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6], 'C': [7, 8, 9]})
print(tmp)
tmp.loc[0, 'A'] = 10
print(tmp)
# Example data frame
tmp = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6], 'C': [7, 8, 9]})
print(tmp)
# this is a view of the original dataframe
col_A = tmp['A']
# this modify the original dataframe
col_A[0] = 10
print(tmp)
# Similarly, changing tmp also changes col_A
tmp = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6], 'C': [7, 8, 9]})
# this is a view of the original dataframe
col_A = tmp['A']
# this modify the original dataframe
tmp['A'][0] = 10
print(col_A)
# This gives a warning
tmp = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6], 'C': [7, 8, 9]})
tmp['A'][0] = 10
print(tmp)
tmp = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6], 'C': [7, 8, 9]})
# this is a copy of the column
col_A = tmp['A'].copy()
# change the copy does not change the original dataframe
col_A[0] = 10
print(tmp)