Pandas dataframe

Setup#

Download the penguins dataset from the internet.

If pandas is not installed, you can install it using the following command:

%conda install pandas

or

%pip install pandas

import pandas as pd

# penguins.csv must be in the same directory as this notebook;
# otherwise pass the path to the file, e.g. pd.read_csv('/path/to/penguins.csv').
# To print the current working directory of the notebook, run: !pwd
df = pd.read_csv('penguins.csv')
df.head()

	species	island	bill_length_mm	bill_depth_mm	flipper_length_mm	body_mass_g	sex
0	Adelie	Torgersen	39.1	18.7	181.0	3750.0	MALE
1	Adelie	Torgersen	39.5	17.4	186.0	3800.0	FEMALE
2	Adelie	Torgersen	40.3	18.0	195.0	3250.0	FEMALE
3	Adelie	Torgersen	NaN	NaN	NaN	NaN	NaN
4	Adelie	Torgersen	36.7	19.3	193.0	3450.0	FEMALE

Alternatively, you can use the following code to load the penguins dataset from the seaborn library:

import seaborn as sns
df = sns.load_dataset("penguins")

DataFrames#

type(df)

pandas.core.frame.DataFrame

# get the type of each column
df.dtypes

species               object
island                object
bill_length_mm       float64
bill_depth_mm        float64
flipper_length_mm    float64
body_mass_g          float64
sex                   object
dtype: object

# get the number of rows and columns
df.shape

(344, 7)

Indexing and slicing#

# get column
col = df['species']
print(col)
type(col)

    Adelie
    Adelie
    Adelie
    Adelie
    Adelie
        ...  
  Gentoo
  Gentoo
  Gentoo
  Gentoo
  Gentoo
Name: species, Length: 344, dtype: object

pandas.core.series.Series

# alternatively, use attribute access (works when the column name is a valid Python identifier)
col = df.species

df.bill_length_mm.mean()

43.92192982456142

# get the first two rows 
df[0:2]

	species	island	bill_length_mm	bill_depth_mm	flipper_length_mm	body_mass_g	sex
0	Adelie	Torgersen	39.1	18.7	181.0	3750.0	MALE
1	Adelie	Torgersen	39.5	17.4	186.0	3800.0	FEMALE

# use boolean indexing to filter the data
df[df['species'] == 'Adelie']

	species	island	bill_length_mm	bill_depth_mm	flipper_length_mm	body_mass_g	sex
0	Adelie	Torgersen	39.1	18.7	181.0	3750.0	MALE
1	Adelie	Torgersen	39.5	17.4	186.0	3800.0	FEMALE
2	Adelie	Torgersen	40.3	18.0	195.0	3250.0	FEMALE
3	Adelie	Torgersen	NaN	NaN	NaN	NaN	NaN
4	Adelie	Torgersen	36.7	19.3	193.0	3450.0	FEMALE
...	...	...	...	...	...	...	...
147	Adelie	Dream	36.6	18.4	184.0	3475.0	FEMALE
148	Adelie	Dream	36.0	17.8	195.0	3450.0	FEMALE
149	Adelie	Dream	37.8	18.1	193.0	3750.0	MALE
150	Adelie	Dream	36.0	17.1	187.0	3700.0	FEMALE
151	Adelie	Dream	41.5	18.5	201.0	4000.0	MALE

152 rows × 7 columns
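Boolean masks can also be combined. The sketch below uses a small made-up frame (the name `toy` and its values are assumptions for illustration, not the real penguins data) to show `&`, `isin`, and `~`. Each comparison needs its own parentheses because `&` binds more tightly than `==` and `>`.

```python
import pandas as pd

# Hypothetical sample rows, not the real penguins data
toy = pd.DataFrame({
    "species": ["Adelie", "Adelie", "Gentoo", "Chinstrap"],
    "island": ["Torgersen", "Dream", "Biscoe", "Dream"],
    "body_mass_g": [3750.0, 3300.0, 5000.0, 3700.0],
})

# AND: both conditions must hold for a row to be kept
heavy_adelie = toy[(toy["species"] == "Adelie") & (toy["body_mass_g"] > 3500)]

# OR over several values of one column is easiest with isin
adelie_or_gentoo = toy[toy["species"].isin(["Adelie", "Gentoo"])]

# NOT: invert a mask with ~
not_dream = toy[~(toy["island"] == "Dream")]

print(heavy_adelie)
print(adelie_or_gentoo)
print(not_dream)
```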

# .loc is primarily label-based. In this example, 0, 1, 2 are the row labels.
# In other datasets the row labels may be strings such as 'a', 'b', 'c'.
# .loc can also be used with a boolean array.

# With .loc, the end label of a slice is included
df.loc[0:2, ['species', 'island']]

	species	island
0	Adelie	Torgersen
1	Adelie	Torgersen
2	Adelie	Torgersen

df.loc[df.species == 'Adelie', ['species', 'island']]

	species	island
0	Adelie	Torgersen
1	Adelie	Torgersen
2	Adelie	Torgersen
3	Adelie	Torgersen
4	Adelie	Torgersen
...	...	...
147	Adelie	Dream
148	Adelie	Dream
149	Adelie	Dream
150	Adelie	Dream
151	Adelie	Dream

152 rows × 2 columns

# get a subset of the data frame
df.loc[:2, 'bill_length_mm':'body_mass_g']

	bill_length_mm	bill_depth_mm	flipper_length_mm	body_mass_g
0	39.1	18.7	181.0	3750.0
1	39.5	17.4	186.0	3800.0
2	40.3	18.0	195.0	3250.0

# .iloc is primarily integer-position based (from 0 to length-1 of the axis);
# unlike .loc, the end position of a slice is excluded.
# .iloc can also be used with a boolean array.
df.iloc[0:2, 0:2]

	species	island
0	Adelie	Torgersen
1	Adelie	Torgersen
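The difference between labels and positions is easier to see when the row labels are not 0, 1, 2. A minimal sketch, assuming a made-up frame `toy` with string labels (not part of the penguins data):

```python
import pandas as pd

# Hypothetical frame whose row labels are strings
toy = pd.DataFrame(
    {
        "species": ["Adelie", "Gentoo", "Chinstrap"],
        "island": ["Torgersen", "Biscoe", "Dream"],
    },
    index=["a", "b", "c"],
)

# .loc slices by label and includes the end label 'c'
by_label = toy.loc["a":"c"]

# .iloc slices by position and excludes the end position 2
by_position = toy.iloc[0:2]

print(len(by_label), len(by_position))
```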

Descriptive statistics#

df.describe()

	bill_length_mm	bill_depth_mm	flipper_length_mm	body_mass_g
count	342.000000	342.000000	342.000000	342.000000
mean	43.921930	17.151170	200.915205	4201.754386
std	5.459584	1.974793	14.061714	801.954536
min	32.100000	13.100000	172.000000	2700.000000
25%	39.225000	15.600000	190.000000	3550.000000
50%	44.450000	17.300000	197.000000	4050.000000
75%	48.500000	18.700000	213.000000	4750.000000
max	59.600000	21.500000	231.000000	6300.000000

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 344 entries, 0 to 343
Data columns (total 7 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   species            344 non-null    object 
 1   island             344 non-null    object 
 2   bill_length_mm     342 non-null    float64
 3   bill_depth_mm      342 non-null    float64
 4   flipper_length_mm  342 non-null    float64
 5   body_mass_g        342 non-null    float64
 6   sex                333 non-null    object 
dtypes: float64(4), object(3)
memory usage: 18.9+ KB

df["species"].value_counts(dropna=False)

Adelie       152
Gentoo       124
Chinstrap     68
Name: species, dtype: int64

# compute the mean body mass of one species

df.loc[df['species'] == 'Adelie', 'body_mass_g'].mean()

3700.662251655629

# compute the mean of each numerical measurement for every species
df.groupby("species")[["bill_length_mm", "bill_depth_mm", "flipper_length_mm", "body_mass_g"]].mean()

	bill_length_mm	bill_depth_mm	flipper_length_mm	body_mass_g
species
Adelie	38.791391	18.346358	189.953642	3700.662252
Chinstrap	48.833824	18.420588	195.823529	3733.088235
Gentoo	47.504878	14.982114	217.186992	5076.016260
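A group-wise summary is not limited to the mean. As a sketch on made-up values (the frame `toy` is an assumption, not the penguins data), `agg` can compute several statistics per group at once. Missing values are skipped, so `count` shows how many values each mean is based on.

```python
import pandas as pd

# Hypothetical values; one Gentoo measurement is missing
toy = pd.DataFrame({
    "species": ["Adelie", "Adelie", "Gentoo", "Gentoo"],
    "body_mass_g": [3000.0, 4000.0, 5000.0, float("nan")],
})

# One row per species, one column per statistic
summary = toy.groupby("species")["body_mass_g"].agg(["mean", "count"])
print(summary)
```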

Missing values#

# show rows with missing values
df_missing = df[df.isna().any(axis=1)]
df_missing

	species	island	bill_length_mm	bill_depth_mm	flipper_length_mm	body_mass_g	sex
3	Adelie	Torgersen	NaN	NaN	NaN	NaN	NaN
8	Adelie	Torgersen	34.1	18.1	193.0	3475.0	NaN
9	Adelie	Torgersen	42.0	20.2	190.0	4250.0	NaN
10	Adelie	Torgersen	37.8	17.1	186.0	3300.0	NaN
11	Adelie	Torgersen	37.8	17.3	180.0	3700.0	NaN
47	Adelie	Dream	37.5	18.9	179.0	2975.0	NaN
246	Gentoo	Biscoe	44.5	14.3	216.0	4100.0	NaN
286	Gentoo	Biscoe	46.2	14.4	214.0	4650.0	NaN
324	Gentoo	Biscoe	47.3	13.8	216.0	4725.0	NaN
336	Gentoo	Biscoe	44.5	15.7	217.0	4875.0	NaN
339	Gentoo	Biscoe	NaN	NaN	NaN	NaN	NaN
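Before deciding what to do with these rows, it can help to count the missing values per column and per row. A minimal sketch on a made-up frame (`toy` and its values are assumptions for illustration):

```python
import pandas as pd

# Hypothetical sample with a few gaps
toy = pd.DataFrame({
    "species": ["Adelie", "Adelie", "Gentoo"],
    "bill_length_mm": [39.1, float("nan"), 47.3],
    "sex": ["MALE", None, None],
})

# isna() marks missing cells with True; summing counts them
missing_per_column = toy.isna().sum()
missing_per_row = toy.isna().sum(axis=1)

print(missing_per_column)
print(missing_per_row)
```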

How to handle missing values#

There is no one-size-fits-all solution for handling missing values. If the dataset is large and only a few values are missing, it might be practical to simply drop these data points. Alternatively, techniques such as imputation can be used to fill in the gaps. For instance, if temperature readings are missing in a time series, interpolating based on surrounding data points might be effective.

Sometimes retaining missing values is beneficial, as their absence itself can provide meaningful insights into the data. For example, if a survey question is left blank, it might indicate that the respondent was uncomfortable answering it. Certain algorithms are designed to interpret and learn from these missing patterns, adding an extra layer of analysis.
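The options described above can be sketched as follows. This is only an illustration on made-up values: the names `toy`, `temps`, and `sex_missing` are assumptions, and filling with the median is one possible imputation choice among many.

```python
import pandas as pd

nan = float("nan")

# Hypothetical measurements with gaps
toy = pd.DataFrame({
    "body_mass_g": [3000.0, nan, 5000.0],
    "sex": ["MALE", None, "FEMALE"],
})

filled = toy.copy()
# Imputation: replace a missing number with the column median
filled["body_mass_g"] = filled["body_mass_g"].fillna(filled["body_mass_g"].median())
# Keep the information that a value was missing as its own column
filled["sex_missing"] = toy["sex"].isna()

# Interpolation: fill a gap in an ordered series from its neighbours (linear by default)
temps = pd.Series([10.0, nan, 14.0])
interpolated = temps.interpolate()

print(filled)
print(interpolated)
```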

# Drop rows with missing values
# (pass inplace=True to modify the original data frame instead of returning a new one)

# In our dataset, two samples have no numerical measurements at all (their sex is missing too).
# Another 9 samples are missing only sex, so 11 rows in total contain a missing value.

# We could drop all rows with missing values, because that is a small fraction of the data.

# Alternatively, we can drop only the rows with missing numerical measurements and keep those missing only sex.
# This can be achieved with thresh=6, which keeps a row only if it has at least 6 non-missing values.

df_clean = df.dropna(thresh=6)
df_clean

	species	island	bill_length_mm	bill_depth_mm	flipper_length_mm	body_mass_g	sex
0	Adelie	Torgersen	39.1	18.7	181.0	3750.0	MALE
1	Adelie	Torgersen	39.5	17.4	186.0	3800.0	FEMALE
2	Adelie	Torgersen	40.3	18.0	195.0	3250.0	FEMALE
4	Adelie	Torgersen	36.7	19.3	193.0	3450.0	FEMALE
5	Adelie	Torgersen	39.3	20.6	190.0	3650.0	MALE
...	...	...	...	...	...	...	...
338	Gentoo	Biscoe	47.2	13.7	214.0	4925.0	FEMALE
340	Gentoo	Biscoe	46.8	14.3	215.0	4850.0	FEMALE
341	Gentoo	Biscoe	50.4	15.7	222.0	5750.0	MALE
342	Gentoo	Biscoe	45.2	14.8	212.0	5200.0	FEMALE
343	Gentoo	Biscoe	49.9	16.1	213.0	5400.0	MALE

342 rows × 7 columns
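Another way to express the same intent is to name the columns that must be present, using the `subset` parameter of `dropna`. A sketch on a made-up frame (`toy` and its values are assumptions, and only two measurement columns are included to keep it short):

```python
import pandas as pd

nan = float("nan")

# Hypothetical rows: complete, missing only sex, missing measurements and sex
toy = pd.DataFrame({
    "species": ["Adelie", "Adelie", "Gentoo"],
    "bill_length_mm": [39.1, 34.1, nan],
    "body_mass_g": [3750.0, 3475.0, nan],
    "sex": ["MALE", None, None],
})

# Drop a row only if one of the listed measurement columns is missing
measured = toy.dropna(subset=["bill_length_mm", "body_mass_g"])

# Drop a row if any column is missing
complete = toy.dropna()

print(measured)
print(complete)
```

Compared with `thresh`, `subset` does not depend on the total number of columns, so it keeps working if columns are added later.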

Apply#

The apply method applies a function along an axis of a DataFrame (to each column by default, or to each row with axis=1) or to each value of a Series.

# Convert the body_mass_g column to kg
df['body_mass_g'] = df['body_mass_g'].apply(lambda x: x/1000)

# rename the column
df.rename(columns={'body_mass_g': 'body_mass_kg'}, inplace=True)
df.head(5)

	species	island	bill_length_mm	bill_depth_mm	flipper_length_mm	body_mass_kg	sex
0	Adelie	Torgersen	39.1	18.7	181.0	3.75	MALE
1	Adelie	Torgersen	39.5	17.4	186.0	3.80	FEMALE
2	Adelie	Torgersen	40.3	18.0	195.0	3.25	FEMALE
3	Adelie	Torgersen	NaN	NaN	NaN	NaN	NaN
4	Adelie	Torgersen	36.7	19.3	193.0	3.45	FEMALE

# compute the ratio of bill length to flipper length
df['bill_ratio'] = df.apply(lambda row: row['bill_length_mm'] / row['flipper_length_mm'], axis=1)
df.head(5)

	species	island	bill_length_mm	bill_depth_mm	flipper_length_mm	body_mass_kg	sex	bill_ratio
0	Adelie	Torgersen	39.1	18.7	181.0	3.75	MALE	0.216022
1	Adelie	Torgersen	39.5	17.4	186.0	3.80	FEMALE	0.212366
2	Adelie	Torgersen	40.3	18.0	195.0	3.25	FEMALE	0.206667
3	Adelie	Torgersen	NaN	NaN	NaN	NaN	NaN	NaN
4	Adelie	Torgersen	36.7	19.3	193.0	3.45	FEMALE	0.190155
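For simple arithmetic like this, the same results can be obtained without apply by operating on whole columns, which is usually faster on large frames. A minimal sketch on made-up values (the frame `toy` is an assumption, not the penguins data):

```python
import pandas as pd

nan = float("nan")

# Hypothetical measurements, including one row of missing values
toy = pd.DataFrame({
    "bill_length_mm": [39.1, 39.5, nan],
    "flipper_length_mm": [181.0, 186.0, nan],
    "body_mass_g": [3750.0, 3800.0, nan],
})

# Column arithmetic works element by element; NaN stays NaN
mass_kg = toy["body_mass_g"] / 1000
ratio_vectorized = toy["bill_length_mm"] / toy["flipper_length_mm"]

# Row-wise apply, for comparison
ratio_apply = toy.apply(
    lambda row: row["bill_length_mm"] / row["flipper_length_mm"], axis=1
)

print(mass_kg)
print(ratio_vectorized)
```

apply remains useful when the function cannot be written as column arithmetic, for example when it branches on several values of a row.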

View or Copy#

Use iloc or loc to modify the DataFrame.

Be careful when you slice a DataFrame and then modify the slice. If you do not want to change the original DataFrame, call the copy() method on the slice first.

For a more detailed discussion, see here.

# the suggested way to modify the dataframe is to use .loc or .iloc
tmp = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6], 'C': [7, 8, 9]})
print(tmp)
tmp.loc[0, 'A'] = 10
print(tmp)

# Example data frame
tmp = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6], 'C': [7, 8, 9]})
print(tmp)
# without Copy-on-Write, this is a view of the original dataframe
col_A = tmp['A']
# so this modifies the original dataframe
# (with Copy-on-Write enabled, tmp is left unchanged)
col_A[0] = 10
print(tmp)

# Similarly, changing tmp can also change col_A
tmp = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6], 'C': [7, 8, 9]})
# without Copy-on-Write, this is a view of the original dataframe
col_A = tmp['A']
# this chained assignment modifies the original dataframe, and col_A with it
tmp['A'][0] = 10
print(col_A)

# Chained assignment like this gives a warning (which one depends on the pandas version);
# prefer tmp.loc[0, 'A'] = 10
tmp = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6], 'C': [7, 8, 9]})
tmp['A'][0] = 10
print(tmp)

tmp = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6], 'C': [7, 8, 9]})
# this is an independent copy of the column
col_A = tmp['A'].copy()
# changing the copy does not change the original dataframe
col_A[0] = 10
print(tmp)
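The same two habits apply to filtered subsets of rows. The sketch below (the names `subset` and `tmp2` are introduced here for illustration) copies a filtered slice before editing it, and separately shows how to change the original on purpose with a single .loc call.

```python
import pandas as pd

tmp = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})

# Explicit copy: later edits to subset cannot reach tmp
subset = tmp[tmp['A'] > 1].copy()
subset['B'] = 0
print(tmp)     # unchanged
print(subset)

# To change the original deliberately, select and assign in one .loc call
tmp2 = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})
tmp2.loc[tmp2['A'] > 1, 'B'] = 0
print(tmp2)
```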