Pandas dataframe#

Setup#

Download the penguins dataset from the internet.

If pandas is not installed, you can install it using the following command:

%conda install pandas

or

%pip install pandas

import pandas as pd
df = pd.read_csv('penguins.csv')
df.head()
# you need to put the penguins.csv file in the same directory as this notebook
# otherwise you need to specify the path to the file e.g. pd.read_csv('/path/to/penguins.csv')
# You can get the current working directory of the notebook by running the following command
# !pwd
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex
0 Adelie Torgersen 39.1 18.7 181.0 3750.0 MALE
1 Adelie Torgersen 39.5 17.4 186.0 3800.0 FEMALE
2 Adelie Torgersen 40.3 18.0 195.0 3250.0 FEMALE
3 Adelie Torgersen NaN NaN NaN NaN NaN
4 Adelie Torgersen 36.7 19.3 193.0 3450.0 FEMALE

DataFrames#

type(df)
pandas.core.frame.DataFrame
# get the type of each column
df.dtypes
species               object
island                object
bill_length_mm       float64
bill_depth_mm        float64
flipper_length_mm    float64
body_mass_g          float64
sex                   object
dtype: object
# get the number of rows and columns
df.shape
(344, 7)

Indexing and slicing#

# get column
col = df['species']
print(col)
type(col)
0      Adelie
1      Adelie
2      Adelie
3      Adelie
4      Adelie
        ...  
339    Gentoo
340    Gentoo
341    Gentoo
342    Gentoo
343    Gentoo
Name: species, Length: 344, dtype: object
pandas.core.series.Series
# althernatively
col = df.species
df.bill_length_mm.mean()
43.92192982456142
# get the first two rows 
df[0:2]
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex
0 Adelie Torgersen 39.1 18.7 181.0 3750.0 MALE
1 Adelie Torgersen 39.5 17.4 186.0 3800.0 FEMALE
# use boolean indexing to filter the data
df[df['species'] == 'Adelie']
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex
0 Adelie Torgersen 39.1 18.7 181.0 3750.0 MALE
1 Adelie Torgersen 39.5 17.4 186.0 3800.0 FEMALE
2 Adelie Torgersen 40.3 18.0 195.0 3250.0 FEMALE
3 Adelie Torgersen NaN NaN NaN NaN NaN
4 Adelie Torgersen 36.7 19.3 193.0 3450.0 FEMALE
... ... ... ... ... ... ... ...
147 Adelie Dream 36.6 18.4 184.0 3475.0 FEMALE
148 Adelie Dream 36.0 17.8 195.0 3450.0 FEMALE
149 Adelie Dream 37.8 18.1 193.0 3750.0 MALE
150 Adelie Dream 36.0 17.1 187.0 3700.0 FEMALE
151 Adelie Dream 41.5 18.5 201.0 4000.0 MALE

152 rows × 7 columns

# .loc is primarily label based, in this example, 0, 1, 2 are the labels of the rows.
# In some other cases, the row label may be a string such as 'a', 'b', 'c', etc.
# May also be used with a boolean array.

# for loc, end index is included
df.loc[0:2, ['species', 'island']]
species island
0 Adelie Torgersen
1 Adelie Torgersen
2 Adelie Torgersen
df.loc[df.species == 'Adelie', ['species', 'island']]
species island
0 Adelie Torgersen
1 Adelie Torgersen
2 Adelie Torgersen
3 Adelie Torgersen
4 Adelie Torgersen
... ... ...
147 Adelie Dream
148 Adelie Dream
149 Adelie Dream
150 Adelie Dream
151 Adelie Dream

152 rows × 2 columns

# get a subset of the data frame
df.loc[:2, 'bill_length_mm':'body_mass_g']
bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
0 39.1 18.7 181.0 3750.0
1 39.5 17.4 186.0 3800.0
2 40.3 18.0 195.0 3250.0
# .iloc is primarily integer position based (from 0 to length-1 of the axis)
#  may also be used with a boolean array.
df.iloc[0:2, 0:2]
species island
0 Adelie Torgersen
1 Adelie Torgersen

Descriptive statistics#

df.describe()
bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
count 342.000000 342.000000 342.000000 342.000000
mean 43.921930 17.151170 200.915205 4201.754386
std 5.459584 1.974793 14.061714 801.954536
min 32.100000 13.100000 172.000000 2700.000000
25% 39.225000 15.600000 190.000000 3550.000000
50% 44.450000 17.300000 197.000000 4050.000000
75% 48.500000 18.700000 213.000000 4750.000000
max 59.600000 21.500000 231.000000 6300.000000
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 344 entries, 0 to 343
Data columns (total 7 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   species            344 non-null    object 
 1   island             344 non-null    object 
 2   bill_length_mm     342 non-null    float64
 3   bill_depth_mm      342 non-null    float64
 4   flipper_length_mm  342 non-null    float64
 5   body_mass_g        342 non-null    float64
 6   sex                333 non-null    object 
dtypes: float64(4), object(3)
memory usage: 18.9+ KB
df["species"].value_counts(dropna=False)
Adelie       152
Gentoo       124
Chinstrap     68
Name: species, dtype: int64
# compute the mean of one species

df[df['species'] == 'Adelie']['body_mass_g'].mean()
3700.662251655629
# compute the mean of all species
df.groupby("species")[["bill_length_mm", "bill_depth_mm", "flipper_length_mm", "body_mass_g"]].mean()
bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
species
Adelie 38.791391 18.346358 189.953642 3700.662252
Chinstrap 48.833824 18.420588 195.823529 3733.088235
Gentoo 47.504878 14.982114 217.186992 5076.016260

Missing values#

# show rows with missing values
df_missinig = df[df.isna().any(axis=1)]
df_missinig
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex
3 Adelie Torgersen NaN NaN NaN NaN NaN
8 Adelie Torgersen 34.1 18.1 193.0 3475.0 NaN
9 Adelie Torgersen 42.0 20.2 190.0 4250.0 NaN
10 Adelie Torgersen 37.8 17.1 186.0 3300.0 NaN
11 Adelie Torgersen 37.8 17.3 180.0 3700.0 NaN
47 Adelie Dream 37.5 18.9 179.0 2975.0 NaN
246 Gentoo Biscoe 44.5 14.3 216.0 4100.0 NaN
286 Gentoo Biscoe 46.2 14.4 214.0 4650.0 NaN
324 Gentoo Biscoe 47.3 13.8 216.0 4725.0 NaN
336 Gentoo Biscoe 44.5 15.7 217.0 4875.0 NaN
339 Gentoo Biscoe NaN NaN NaN NaN NaN

How to handle missing values#

There is no one-size-fits-all solution for handling missing value. If the dataset is large and only a few values are missing, it might be practical to simply drop these data points. Alternatively, techniques such as imputation can be used to fill in the gaps. For instance, if temperature readings are missing in a time series, interpolating based on surrounding data points might be effective.

Sometimes retaining missing values is beneficial, as their absence itself can provide meaningful insights into the data. For example, if a survey question is left blank, it might indicate that the respondent was uncomfortable answering it. Certain algorithms are designed to interpret and learn from these missing patterns, adding an extra layer of analysis.

# Drop rows with missing values
# you can use inplace=True to modify the original data frame

# In our dataset, we can see that there are two samples with no numerical measurements
# There are 11 samples with only sex missing

# we can drop all rows with missing values, becasue that's a small fraction of the data

# maybe we can drop rows with missing numerical measurements and keep those with missing sex
# This can be achieved by thresh=6, which means that the row must have at least 6 non-missing values

df_clean = df.dropna(thresh=6)
df_clean
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex
0 Adelie Torgersen 39.1 18.7 181.0 3750.0 MALE
1 Adelie Torgersen 39.5 17.4 186.0 3800.0 FEMALE
2 Adelie Torgersen 40.3 18.0 195.0 3250.0 FEMALE
4 Adelie Torgersen 36.7 19.3 193.0 3450.0 FEMALE
5 Adelie Torgersen 39.3 20.6 190.0 3650.0 MALE
... ... ... ... ... ... ... ...
338 Gentoo Biscoe 47.2 13.7 214.0 4925.0 FEMALE
340 Gentoo Biscoe 46.8 14.3 215.0 4850.0 FEMALE
341 Gentoo Biscoe 50.4 15.7 222.0 5750.0 MALE
342 Gentoo Biscoe 45.2 14.8 212.0 5200.0 FEMALE
343 Gentoo Biscoe 49.9 16.1 213.0 5400.0 MALE

342 rows × 7 columns

Apply#

The apply method is used to apply a function along the axis of a DataFrame or Series.

# Convert the body_mass_g column to kg
df['body_mass_g'] = df['body_mass_g'].apply(lambda x: x/1000)
# rename the column
df.rename(columns={'body_mass_g': 'body_mass_kg'}, inplace=True)
df.head(5)
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_kg sex
0 Adelie Torgersen 39.1 18.7 181.0 3.75 MALE
1 Adelie Torgersen 39.5 17.4 186.0 3.80 FEMALE
2 Adelie Torgersen 40.3 18.0 195.0 3.25 FEMALE
3 Adelie Torgersen NaN NaN NaN NaN NaN
4 Adelie Torgersen 36.7 19.3 193.0 3.45 FEMALE
# compute the ratio of bill length to flipper length
df['bill_ratio'] = df.apply(lambda row: row['bill_length_mm'] / row['flipper_length_mm'], axis=1)
df.head(5)
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_kg sex bill_ratio
0 Adelie Torgersen 39.1 18.7 181.0 3.75 MALE 0.216022
1 Adelie Torgersen 39.5 17.4 186.0 3.80 FEMALE 0.212366
2 Adelie Torgersen 40.3 18.0 195.0 3.25 FEMALE 0.206667
3 Adelie Torgersen NaN NaN NaN NaN NaN NaN
4 Adelie Torgersen 36.7 19.3 193.0 3.45 FEMALE 0.190155

View or Copy#

Use iloc or loc to modify the DataFrame.

Be careful when slice a dataframe, and modify it. It is better to use the copy() method to avoid modifying the original DataFrame.

For more detailed disucssion, see here.

# the suggested way to modify the dataframe is to use .loc or .iloc
tmp = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6], 'C': [7, 8, 9]})
print(tmp)
tmp.loc[0, 'A'] = 10
print(tmp)
# Example data frame
tmp = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6], 'C': [7, 8, 9]})
print(tmp)
# this is a view of the original dataframe
col_A = tmp['A']
# this modify the original dataframe
col_A[0] = 10
print(tmp)
# Similarly, changing tmp also changes col_A
tmp = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6], 'C': [7, 8, 9]})
# this is a view of the original dataframe
col_A = tmp['A']
# this modify the original dataframe
tmp['A'][0] = 10
print(col_A)
# This gives a warning
tmp = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6], 'C': [7, 8, 9]})
tmp['A'][0] = 10
print(tmp)
tmp = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6], 'C': [7, 8, 9]})
# this is a copy of the column
col_A = tmp['A'].copy()
# change the copy does not change the original dataframe
col_A[0] = 10
print(tmp)