Exercises for Section 3.1 Estimates and estimators#

This notebook contains the solutions to the exercises from Section 3.1 Estimates and estimators in the No Bullshit Guide to Statistics.

Notebooks setup#

import os
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Pandas setup
pd.set_option("display.precision", 2)

# Plot helper functions
from ministats.utils import savefigure

# Figures setup
plt.clf()  # needed otherwise sns.set_theme doesn't work
from plot_helpers import RCPARAMS
RCPARAMS.update({'figure.figsize': (5, 1.6)})  # good for print
sns.set_theme(
context="paper",
style="whitegrid",
palette="colorblind",
rc=RCPARAMS,
)

# Useful colors
snspal = sns.color_palette()
blue, orange, purple = snspal[0], snspal[1], snspal[4]

%config InlineBackend.figure_format = 'retina'

# Where to store figures
DESTDIR = "figures/stats/estimators"

<Figure size 640x480 with 0 Axes>


Estimator functions defined in Section 3.1#

def mean(sample):
return sum(sample) / len(sample)

def var(sample):
xbar = mean(sample)
sumsqdevs = sum([(xi-xbar)**2 for xi in sample])
return sumsqdevs / (len(sample)-1)

def std(sample):
s2 = var(sample)
return np.sqrt(s2)

def dmeans(xsample, ysample):
dhat = mean(xsample) - mean(ysample)
return dhat


Exercises 1#

Exercise 3.1#

Compute the sample mean from Batch 01 and Batch 04 of the kombucha dataset datasets/kombucha.csv .

kombucha = pd.read_csv("datasets/kombucha.csv")
ksample01 = kombucha[kombucha["batch"]==1]["volume"]

# compute mean here

ksample04 = kombucha[kombucha["batch"]==4]["volume"]

# compute mean here


Exercise 3.2#

Compute the sample variance from Batch 02 and Batch 08 of the kombucha dataset datasets/kombucha.csv.

kombucha = pd.read_csv("datasets/kombucha.csv")

ksample02 = ...  # select Batch 02 here...

# compute variance here...

ksample08 = ... # select Batch 08 here...

# compute variance here


Exercise 3.3#

Compute the difference between the means of the sleep score s for the doctors working in rural locations and urban locations in the doctors dataset datasets/doctors.csv .

Hint: Use the code doctors[doctors["loc"]=="rur"] to select the subset of the doctors working in a rural location.

doctors = pd.read_csv("datasets/doctors.csv")
scoresR = doctors[doctors["loc"]=="rur"]["score"]
scoresU = ... # select doctors from "urb" location here...

# compute difference between scores


Exercises 2#

Exercise 3.4#

Generate $$N=10000$$ observations from the sampling distribution of the estimator std for samples of size $$n=20$$ from the population rvK $$= K \sim \mathcal{N}(1000,10)$$ . Plot a histogram.

from scipy.stats import norm
muK = 1000
sigmaK = 10
rvK = norm(muK, sigmaK)

# generate sampling distribution of the standard deviaiton
from ministats import gen_sampling_dist

# plot histogram of sampling distribution


Exercise 3.5#

Generate the sampling distributions of the estimator mean for samples of size $$n=10$$ , $$n=30$$ , and $$n=100$$ from the population rvK $$= K \sim \mathcal{N}(1000,10)$$ . Plot a histogram for each distribution. Plot also the probability density function of the normal approximations predicted by the central limit theorem for each case.

from scipy.stats import norm
muK = 1000
sigmaK = 10
rvK = norm(muK, sigmaK)

# simulations of Xbar for different sample sizes:
xbars10 = ...
xbars30 = ...
xbars100 = ...

# CASE n = 10

# Plot the histogram obtained from simulation
...

# Find the CLT approximation
seKbar10 = ...
rvKbarCLT10 = norm(..., ...)

# Plot the pdf of the CLT approximation
from ministats import plot_pdf
...

Ellipsis

# CASE n = 30

# CASE n = 100


Exercise 3.6#

Generate sampling distribution of the mean for samples of size $$n=30$$ from the standard uniform model rvU $$= U \sim \mathcal{U}(0,1)$$ . Plot a histogram. Compare your result with Figure~ \ref{fig:sampling_dist_of_Ubar} on page \pageref{fig:sampling_dist_of_Ubar} . Plot the prediction of the central limit theorem on the same axis.

from scipy.stats import uniform
rvU = uniform(0,1)

# the mean and standard deviation of the pupulation are
muU = rvU.mean()
sigmaU = rvU.std()
muU, sigmaU

(0.5, 0.28867513459481287)

# simulations of Ubar for samples of size 30
ubars30 = ...

# Plot the histogram obtained from simulation
...  # hint: ax = sns.histplot(ubars30, stat="density")

# Find the CLT approximation
...

# Plot the pdf of the CLT approximation
from ministats import plot_pdf
...

Ellipsis


Exercises 3#

Exercise 3.7#

Use bootstrap estimation to obtain an approximation to the sampling distribution of the mean from the sample of apple weights in the dataset datasets/apples.csv . Based on your result, what can you say about the unknown population mean $$\mu_A$$? What is the standard error of your estimate? State your answer as $$\overline{\mathbf{a}} \pm \stderrhat{\overline{\mathbf{a}}}$$ and interpret in words.

apples = pd.read_csv("datasets/apples.csv")
asample = apples["weight"]
# apples

from ministats import gen_boot_dist

abars_boot = ...

# compute mean and standard deviation of abars_boot


Exercise 3.8#

Compute bootstrap distribution for the sample mean from Batch 04 of the kombucha dataset, and plot a histogram. Do you think Batch 04 is a regular batch or an irregular batch of production?

kombucha = pd.read_csv("datasets/kombucha.csv")
ksample04 = kombucha[kombucha["batch"]==4]["volume"]


Exercises 4#

Exercise 3.9#

Describe your uncertainty about the unknown population mean $$\mu_A$$ based on sample of apple weights from the dataset datasets/apples.csv:
a) find an analytical formula in terms Student’s $$t$$ -distribution,
b) use bootstrap estimation,

apples = pd.read_csv("datasets/apples.csv")
asample = apples["weight"]

# a) Find the CLT approximation
from scipy.stats import t as tdist

n = asample.count()
df = ...
loc = ...
scale = ...
rvTAbar = tdist(df, loc=loc, scale=scale)

# Plot the pdf of the CLT approximation
...

Ellipsis

# b) obtain bootstrap approximation
abars_boot = ...


Exercises 5#

Exercise 3.10#

Under regular operations, the kombucha volume is described by the model rvK $$= K \sim \mathcal{N}(1000,10)$$ .
a) Plot the probability density function for the sampling distribution of the variance for samples of size $$n=40$$ from the population $$K$$ .
b) Compute the sample variance of Batch 08 from the kombucha dataset.
c) Compare the observed value in b) to the sampling distribution in a). Does it look like this is a regular batch or irregular batch?

# a) Find the sampling distribution of the sample variance
from scipy.stats import chi2

n = ...
df = ...
scale = ...
rvS2 = chi2(df, loc=0, scale=scale)

# Plot the pdf of the CLT approximation
...

Ellipsis

# b) the sample variance of Batch 04 is
ksample08 = ... # select Batch 08 here ...


c) The observed value $$s_{\mathbf{k}_{08}} = 167$$ is very unlikely to occur under the expected sampling distribution for the sample variance for regular samples. This suggests that Batch 08 is an irregular batch with abnormally high variance. Better check the machine.

Exercise 3.11#

Compute the bootstrap approximation for sampling distribution of the variance from Batch 08 of the kombucha dataset. Plot a histogram.

kombucha = pd.read_csv("datasets/kombucha.csv")
ksample08 = ... # select Batch 08 here ...

kvars08_boot = ...


The boostrap etimates we obverve are very far from the expected standard deviation for regular batches, which is $$\sigma_K^2 = 100$$, which tells us Batch 08 might be irregular (abnormally high variance).

Exercises 6#

Exercise 3.12#

Describe the uncertainty about the difference between means dmeans(scoresR,scoresU) , where scoresR and scoresU are sleep scores of the rural and urban doctors from the doctors dataset datasets/doctors.csv .
a) Find an analytical formula in terms Student’s $$t$$-distribution.
b) Use bootstrap estimation.

doctors = pd.read_csv("datasets/doctors.csv")
scoresR = doctors[doctors["loc"]=="rur"]["score"]
scoresU = ... # select urban doctors ...

# observed difference
...

Ellipsis

# a) analytical approximation

# obtain the sample sizes and stds of the two groups
...

# compute standard error of the difference between group means
...

# calculate the degrees of freedom
from ministats import calcdf
...

# probability model based on Student's t-distribution
from scipy.stats import t as tdist
...

Ellipsis

# b) bootstrap estimate
# compute bootstrap estimates for mean in each group
meanR_boot = ...
meanU_boot = ...

# compute the difference between means of bootstrap samples
dmeans_boot = ...

# c) plot the answers form a) and b)

# pdf of the analytical approximation
...

# histogtam of the bootstrap approximation
...

Ellipsis


Exercises 7#

Exercise 3.13#

Generate the sampling distribution of the median for samples of size $$n=30$$ from the population rvK $$= K \sim \mathcal{N}(1000,10)$$ .

from scipy.stats import norm
rvK = norm(1000, 10)

# simulations of the median from samples of size n=30
kmedians30 = ...

# compute se of median
...


Exercise 3.14#

Generate the sampling distribution of P90 , the 90th percentile estimator, from samples of size $$n=30$$ from rvK $$= K \sim \mathcal{N}(1000,10)$$ . Calculate the standard error $$\stderr{\tt{P90}}$$ of the estimator.

from scipy.stats import norm
rvK = norm(1000, 10)

# 90th percential estimator function
def p90(sample):
...

# simulations of p90 from samples of size n=30
kp90s30 = gen_sampling_dist(rvK, estfunc=p90, n=30)

# histogram
...

Ellipsis

# compute se of p90
...


Exercise 3.16#

Use the function scipy.stats.boostrap to compute the bootstrap distribution of the variance from the sample ksample02.

from scipy.stats import bootstrap


Ellipsis