Lecture Week 3 Wed 10/16#

Poll If I have some temperature measure in degree Celsius, compute the mean as 0 and variance 25. I convert the temperature to Fahrenheit, what will be the mean and variance of the temperature in Fahrenheit? recall that F = C * 9/5 + 32

E[aX+b] = a E[X] + b

Var[aX+b] = a^2 Var[X]

Covariance#

Covariance provides a measure of how two random variables change together. For example, if we treat weight and height as random variables, then knowing that someone is taller than average might make it more likely that they have above-average weight.

For random variables \(X\) and \(Y\), the covariance is defined as \(\text{Cov}(X, Y) = E[(X - E[X])(Y - E[Y])]\).

Correlation#

Correlation is a standardized measure of how two random variables change together, ranging from -1 to 1. \( \text{Corr}(X, Y) = \frac{\text{Cov}(X, Y)}{\sqrt{\text{Var}(X) \cdot \text{Var}(Y)}}\).

from matplotlib import pyplot as plt
import numpy as np

n = 100 # sample size

# X is a random variable uniformly distributed on [0, 1]
x = np.random.uniform(-1, 1, n)

# Y is a linear function of X plus noise
y = x + np.random.normal(0, 0.1, n)

# z is independent of X
z = np.random.uniform(-1, 1, n)

cov_xy = np.cov(x, y)[0, 1]
cov_xz = np.cov(x, z)[0, 1]

corr_xy = np.corrcoef(x, y)[0, 1]
corr_xz = np.corrcoef(x, z)[0, 1]

# scatter plot
fig, ax = plt.subplots()
# plot x vs y and x vs z, different colors, show name
ax.scatter(x, y, color='red', label='(X, Y), cov = %.2f, corr = %.2f' % (cov_xy, corr_xy))
ax.scatter(x, z, color='blue', label='(X, Z), cov = %.2f, corr = %.2f' % (cov_xz, corr_xz))
# show legend
ax.legend()
plt.show()
../_images/8d8dd3d779118d2bb61b918685a087ac2e5e9d8148559ddbead5c2a0dcb80bce.png
np.cov(x, y)
array([[0.39840063, 0.40003188],
       [0.40003188, 0.41129174]])
np.corrcoef(x, y)
array([[1.        , 0.98823356],
       [0.98823356, 1.        ]])
N = 1000
x = np.random.uniform(-1, 1, N)
y = x**2
plt.scatter(x, y, label='(X, Y)')
plt.legend()
plt.xlabel('X')
plt.ylabel('Y')
Text(0, 0.5, 'Y')
../_images/c64dcdaac68ee9b334803051a2f13955175317ab204841e057e103d9ab3b0d4c.png
np.corrcoef(x, y)
array([[ 1.00000000e+00, -8.62213152e-04],
       [-8.62213152e-04,  1.00000000e+00]])

Law of Large Numbers (the more the better)#

# In each experiment, we draw n samples and estimate the mean.
# For small n, each estimation might have large variability for different repetitions.
# As n increases, the variability decreases.
# For very large n, the estimates are almost the same as the true mean and the variability is very small.

# use n sample to estimate the mean
n = [10,100,1000,10000,100000]

# number of repetitions for each n
k = 10 

# store the estimates, each row is a different n, each column is a different repetition
estiamtes = np.zeros((len(n), k))

# sample sizes
logn = [] # log10 of n
means = [] # estimated mean

for ni in n:
    for k in range(10):
        # x = np.random.randint(1, 7, ni)
        x = np.random.uniform(0, 1, ni)
        logn.append(np.log10(ni))
        means.append(np.mean(x))

fig, ax = plt.subplots()
# scatter plot of 
ax.scatter(logn, means)
ax.set_xlabel('log10(n)')
ax.set_ylabel('estimated mean')
Text(0, 0.5, 'estimated mean')
../_images/b36010b0ac0fac067f6f0467b49e8fe402877fcb8759a98324d54470f08addb0.png

Central limit theorem (Why the normal distribution is so common)#

Roughly speaking, sample average are normally distributed for large sample sizes, regardless of the distribution of the underlying random variable!

If \(X_1, X_2, \ldots, X_n\) are independent and identically distributed random variables with mean \(\mu\) and variance \(\sigma^2\), if we define

\[ Z = \frac{\sqrt{n}(\bar{X} - \mu)}{\sigma}\]

then for large n, Z is approximately normally distributed with mean 0 and variance 1.

import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Gaussian density function
gaussian_density = lambda x, mu, sigma: (1 / (sigma * np.sqrt(2 * np.pi)) * np.exp(- (x - mu) ** 2 / (2 * sigma ** 2)))

# Parameters
N = 10000  # number of experiments
n = 1000   # number of samples per experiment

# Uniform Distribution
mu_uniform = (6 + 1) / 2  # mean of uniform distribution over [1, 6]
sigma_uniform = np.sqrt(((6 - 1 + 1) ** 2 - 1) / 12)  # standard deviation of uniform distribution over [1, 6]

uniform_samples = np.random.randint(1, 7, n)  # draw n samples from uniform integer distribution between 1 and 6
z_samples_uniform = []
for i in range(N):
    x = np.random.randint(1, 7, n)
    sample_mean = np.mean(x)
    z = np.sqrt(n) * (sample_mean - mu_uniform) / sigma_uniform
    z_samples_uniform.append(z)

# Exponential Distribution
lambda_exp = 1  # rate parameter for exponential distribution
mu_exp = 1 / lambda_exp
sigma_exp = 1 / lambda_exp

exp_samples = np.random.exponential(1 / lambda_exp, n)  # draw n samples from exponential distribution with rate 1
z_samples_exp = []
for i in range(N):
    x = np.random.exponential(1 / lambda_exp, n)
    sample_mean = np.mean(x)
    z = np.sqrt(n) * (sample_mean - mu_exp) / sigma_exp
    z_samples_exp.append(z)

# Plotting in 2-by-2 subfigures
fig, axs = plt.subplots(2, 2, figsize=(12, 10))

# Plot original Uniform Distribution
sns.histplot(uniform_samples, discrete=True, kde=False, ax=axs[0, 0], edgecolor='black', stat='density', alpha=0.6, label='Uniform Distribution')
axs[0, 0].set_xlabel('Value')
axs[0, 0].set_ylabel('Density')
axs[0, 0].set_title('Uniform Distribution Histogram')
axs[0, 0].legend()

# Plot histogram of Z-transformed Uniform Distribution
sns.histplot(z_samples_uniform, bins=30, kde=False, ax=axs[0, 1], edgecolor='black', stat='density', alpha=0.6, label='Z')
x = np.linspace(-5, 5, 1000)
y = gaussian_density(x, 0, 1)
axs[0, 1].plot(x, y, linewidth=2, color='r', label='Standard Normal')
axs[0, 1].set_xlim(-5, 5)
axs[0, 1].set_xlabel('Z')
axs[0, 1].set_ylabel('Density')
axs[0, 1].set_title('Distribution of Z')
axs[0, 1].legend()

# Plot original Exponential Distribution
sns.histplot(exp_samples, bins=30, kde=False, ax=axs[1, 0], edgecolor='black', stat='density', alpha=0.6, label='Exponential Distribution')
axs[1, 0].set_xlabel('Value')
axs[1, 0].set_ylabel('Density')
axs[1, 0].set_title('Exponential Distribution Histogram')
axs[1, 0].legend()

# Plot histogram of Z-transformed Exponential Distribution
sns.histplot(z_samples_exp, bins=30, kde=False, ax=axs[1, 1], edgecolor='black', stat='density', alpha=0.6, label='Z')
axs[1, 1].plot(x, y, linewidth=2, color='r', label='Standard Normal')
axs[1, 1].set_xlim(-5, 5)
axs[1, 1].set_xlabel('Z')
axs[1, 1].set_ylabel('Density')
axs[1, 1].set_title('Distribution of Z')
axs[1, 1].legend()

# Adjust layout and show plot
plt.tight_layout()
plt.show()
../_images/12cdc4f68e99a6a47596f43329f98baa0080601813e5976a91dc350fd25ef27f.png
import seaborn as sns
import pandas as pd

df = sns.load_dataset("penguins")
df
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex
0 Adelie Torgersen 39.1 18.7 181.0 3750.0 Male
1 Adelie Torgersen 39.5 17.4 186.0 3800.0 Female
2 Adelie Torgersen 40.3 18.0 195.0 3250.0 Female
3 Adelie Torgersen NaN NaN NaN NaN NaN
4 Adelie Torgersen 36.7 19.3 193.0 3450.0 Female
... ... ... ... ... ... ... ...
339 Gentoo Biscoe NaN NaN NaN NaN NaN
340 Gentoo Biscoe 46.8 14.3 215.0 4850.0 Female
341 Gentoo Biscoe 50.4 15.7 222.0 5750.0 Male
342 Gentoo Biscoe 45.2 14.8 212.0 5200.0 Female
343 Gentoo Biscoe 49.9 16.1 213.0 5400.0 Male

344 rows × 7 columns

type(df)
pandas.core.frame.DataFrame
df.shape
(344, 7)
df.head()
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex
0 Adelie Torgersen 39.1 18.7 181.0 3750.0 Male
1 Adelie Torgersen 39.5 17.4 186.0 3800.0 Female
2 Adelie Torgersen 40.3 18.0 195.0 3250.0 Female
3 Adelie Torgersen NaN NaN NaN NaN NaN
4 Adelie Torgersen 36.7 19.3 193.0 3450.0 Female
df['species']
0      Adelie
1      Adelie
2      Adelie
3      Adelie
4      Adelie
        ...  
339    Gentoo
340    Gentoo
341    Gentoo
342    Gentoo
343    Gentoo
Name: species, Length: 344, dtype: object
df.species
0      Adelie
1      Adelie
2      Adelie
3      Adelie
4      Adelie
        ...  
339    Gentoo
340    Gentoo
341    Gentoo
342    Gentoo
343    Gentoo
Name: species, Length: 344, dtype: object
type(df.species)
pandas.core.series.Series
df.bill_length_mm.mean()
43.9219298245614
df[0:2]
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex
0 Adelie Torgersen 39.1 18.7 181.0 3750.0 Male
1 Adelie Torgersen 39.5 17.4 186.0 3800.0 Female
df['species'] == 'Adelie'
0       True
1       True
2       True
3       True
4       True
       ...  
339    False
340    False
341    False
342    False
343    False
Name: species, Length: 344, dtype: bool
df[df['species'] == 'Adelie']
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex
0 Adelie Torgersen 39.1 18.7 181.0 3750.0 Male
1 Adelie Torgersen 39.5 17.4 186.0 3800.0 Female
2 Adelie Torgersen 40.3 18.0 195.0 3250.0 Female
3 Adelie Torgersen NaN NaN NaN NaN NaN
4 Adelie Torgersen 36.7 19.3 193.0 3450.0 Female
... ... ... ... ... ... ... ...
147 Adelie Dream 36.6 18.4 184.0 3475.0 Female
148 Adelie Dream 36.0 17.8 195.0 3450.0 Female
149 Adelie Dream 37.8 18.1 193.0 3750.0 Male
150 Adelie Dream 36.0 17.1 187.0 3700.0 Female
151 Adelie Dream 41.5 18.5 201.0 4000.0 Male

152 rows × 7 columns

df.loc[0:2, ['species', 'island']]
species island
0 Adelie Torgersen
1 Adelie Torgersen
2 Adelie Torgersen
df.iloc[0:2, [0,4]]
species flipper_length_mm
0 Adelie 181.0
1 Adelie 186.0