# Exercises for Section 3.1 Estimates and estimators#

This notebook contains the solutions to the exercises
from Section 3.1 Estimates and estimators
in the **No Bullshit Guide to Statistics**.

## Notebooks setup#

```
import os
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
```

```
# Pandas setup
pd.set_option("display.precision", 2)
```

```
# Plot helper functions
from plot_helpers import savefigure
```

```
# Figures setup
plt.clf() # needed otherwise `sns.set_theme` doesn't work
from plot_helpers import RCPARAMS
RCPARAMS.update({'figure.figsize': (5, 1.6)}) # good for print
sns.set_theme(
context="paper",
style="whitegrid",
palette="colorblind",
rc=RCPARAMS,
)
# Useful colors
snspal = sns.color_palette()
blue, orange, purple = snspal[0], snspal[1], snspal[4]
# High-resolution please
%config InlineBackend.figure_format = 'retina'
# Where to store figures
DESTDIR = "figures/stats/estimators"
```

```
<Figure size 640x480 with 0 Axes>
```

## Estimator functions defined in Section 3.1#

```
def mean(sample):
return sum(sample) / len(sample)
def var(sample):
xbar = mean(sample)
sumsqdevs = sum([(xi-xbar)**2 for xi in sample])
return sumsqdevs / (len(sample)-1)
def std(sample):
s2 = var(sample)
return np.sqrt(s2)
def dmeans(xsample, ysample):
dhat = mean(xsample) - mean(ysample)
return dhat
```

## Exercises 1#

### Exercise 3.1#

Compute the sample mean from Batch 01 and Batch 04 of the kombucha dataset `datasets/kombucha.csv`

.

```
kombucha = pd.read_csv("datasets/kombucha.csv")
ksample01 = kombucha[kombucha["batch"]==1]["volume"]
# compute mean here
```

```
ksample04 = kombucha[kombucha["batch"]==4]["volume"]
# compute mean here
```

### Exercise 3.2#

Compute the sample variance from Batch 02 and Batch 08 of the kombucha dataset `datasets/kombucha.csv`

.

```
kombucha = pd.read_csv("datasets/kombucha.csv")
ksample02 = ... # select Batch 02 here...
# compute variance here...
```

```
ksample08 = ... # select Batch 08 here...
# compute variance here
```

### Exercise 3.3#

Compute the difference between the means of the sleep `score`

s
for the doctors working in `rural`

locations and `urban`

locations
in the doctors dataset `datasets/doctors.csv`

.

Hint: Use the code `doctors[doctors["location"]=="rural"]`

to select
the subset of the doctors working in a `rural`

location.

```
doctors = pd.read_csv("datasets/doctors.csv")
scoresR = doctors[doctors["location"]=="rural"]["score"]
scoresU = ... # select doctors from "urban" location here...
# compute difference between scores
```

## Exercises 2#

### Exercise 3.4#

Generate \(N=10000\) observations from the sampling distribution of the estimator `std`

for samples of size \(n=20\) from the population `rvK`

\(= K \sim \mathcal{N}(1000,10)\) .
Plot a histogram.

```
from scipy.stats import norm
muK = 1000
sigmaK = 10
rvK = norm(muK, sigmaK)
# generate sampling distribution of the standard deviaiton
from stats_helpers import gen_sampling_dist
# plot histogram of sampling distribution
```

### Exercise 3.5#

Generate the sampling distributions of the estimator `mean`

for samples of size \(n=10\) , \(n=30\) , and \(n=100\) from the population `rvK`

\(= K \sim \mathcal{N}(1000,10)\) .
Plot a histogram for each distribution.
Plot also the probability density function of the normal approximations
predicted by the central limit theorem for each case.

```
from scipy.stats import norm
muK = 1000
sigmaK = 10
rvK = norm(muK, sigmaK)
# simulations of Xbar for different sample sizes:
xbars10 = ...
xbars30 = ...
xbars100 = ...
```

```
# CASE n = 10
# Plot the histogram obtained from simulation
...
# Find the CLT approximation
seKbar10 = ...
rvKbarCLT10 = norm(..., ...)
# Plot the pdf of the CLT approximation
from plot_helpers import plot_pdf
...
```

```
Ellipsis
```

```
# CASE n = 30
```

```
# CASE n = 100
```

### Exercise 3.6#

Generate sampling distribution of the mean for samples of size \(n=30\) from the standard uniform model `rvU`

\(= U \sim \mathcal{U}(0,1)\) .
Plot a histogram.
Compare your result with Figure~ \ref{fig:sampling_dist_of_Ubar} on page \pageref{fig:sampling_dist_of_Ubar} .
Plot the prediction of the central limit theorem on the same axis.

```
from scipy.stats import uniform
rvU = uniform(0,1)
# the mean and standard deviation of the pupulation are
muU = rvU.mean()
sigmaU = rvU.std()
muU, sigmaU
```

```
(0.5, 0.28867513459481287)
```

```
# simulations of Ubar for samples of size 30
ubars30 = ...
# Plot the histogram obtained from simulation
... # hint: ax = sns.histplot(ubars30, stat="density")
# Find the CLT approximation
...
# Plot the pdf of the CLT approximation
from plot_helpers import plot_pdf
...
```

```
Ellipsis
```

## Exercises 3#

### Exercise 3.7#

Use bootstrap estimation to obtain an approximation to the sampling distribution of the mean
from the sample of apple weights in the dataset `datasets/apples.csv`

.
Based on your result,
what can you say about the unknown population mean \(\mu_A\)?
What is the standard error of your estimate?
State your answer as \(\overline{\mathbf{a}} \pm \stderrhat{\overline{\mathbf{a}}}\) and interpret in words.

```
apples = pd.read_csv("datasets/apples.csv")
asample = apples["weight"]
# apples
```

```
from stats_helpers import gen_boot_dist
abars_boot = ...
```

```
# compute mean and standard deviation of abars_boot
```

### Exercise 3.8#

Compute bootstrap distribution for the sample mean from Batch 04 of the kombucha dataset, and plot a histogram. Do you think Batch 04 is a regular batch or an irregular batch of production?

```
kombucha = pd.read_csv("datasets/kombucha.csv")
ksample04 = kombucha[kombucha["batch"]==4]["volume"]
```

## Exercises 4#

### Exercise 3.9#

Describe your uncertainty about the unknown population mean \(\mu_A\) based on sample of apple weights from the dataset `datasets/apples.csv`

:

**a)** find an analytical formula in terms Student’s \(t\) -distribution,

**b)** use bootstrap estimation,

**c)** compare your answers from part a) and b) graphically.

```
apples = pd.read_csv("datasets/apples.csv")
asample = apples["weight"]
```

```
# a) Find the CLT approximation
from scipy.stats import t as tdist
n = asample.count()
df = ...
loc = ...
scale = ...
rvTAbar = tdist(df, loc=loc, scale=scale)
# Plot the pdf of the CLT approximation
...
```

```
Ellipsis
```

```
# b) obtain bootstrap approximation
abars_boot = ...
```

## Exercises 5#

### Exercise 3.10#

Under regular operations,
the kombucha volume is described by the model `rvK`

\(= K \sim \mathcal{N}(1000,10)\) .

**a)** Plot the probability density function for the sampling distribution of the variance for samples of size \(n=40\) from the population \(K\) .

**b)** Compute the sample variance of Batch 08 from the kombucha dataset.

**c)** Compare the observed value in b) to the sampling distribution in a).
Does it look like this is a regular batch or irregular batch?

```
# a) Find the sampling distribution of the sample variance
from scipy.stats import chi2
n = ...
df = ...
scale = ...
rvS2 = chi2(df, loc=0, scale=scale)
# Plot the pdf of the CLT approximation
...
```

```
Ellipsis
```

```
# b) the sample variance of Batch 04 is
kombucha = pd.read_csv("datasets/kombucha.csv")
ksample08 = ... # select Batch 08 here ...
```

c) The observed value \(s_{\mathbf{k}_{08}} = 167\) is very unlikely to occur under the expected sampling distribution for the sample variance for regular samples. This suggests that Batch 08 is an irregular batch with abnormally high variance. Better check the machine.

### Exercise 3.11#

Compute the bootstrap approximation for sampling distribution of the variance from Batch 08 of the kombucha dataset. Plot a histogram.

```
kombucha = pd.read_csv("datasets/kombucha.csv")
ksample08 = ... # select Batch 08 here ...
kvars08_boot = ...
```

The boostrap etimates we obverve are very far from the expected standard deviation for regular batches, which is \(\sigma_K^2 = 100\), which tells us Batch 08 might be irregular (abnormally high variance).

## Exercises 6#

### Exercise 3.12#

Describe the uncertainty about the difference between means `dmeans(scoresR,scoresU)`

,
where `scoresR`

and `scoresU`

are sleep scores of the rural and urban doctors
from the doctors dataset `datasets/apples.csv`

.

**a)** Find an analytical formula in terms Student’s \(t\) -distribution.

**b)** Use bootstrap estimation.

**c)** Compare your answers from part a) and b) graphically.

```
doctors = pd.read_csv("datasets/doctors.csv")
scoresR = doctors[doctors["location"]=="rural"]["score"]
scoresU = ... # select urban doctors ...
# observed difference
...
```

```
Ellipsis
```

```
# a) analytical approximation
# obtain the sample sizes and stds of the two groups
...
# compute standard error of the difference between group means
...
# calculate the degrees of freedom
from stats_helpers import calcdf
...
# probability model based on Student's t-distribution
from scipy.stats import t as tdist
...
```

```
Ellipsis
```

```
# b) bootstrap estimate
# compute bootstrap estimates for mean in each group
meanR_boot = ...
meanU_boot = ...
# compute the difference between means of bootstrap samples
dmeans_boot = ...
```

```
# c) plot the answers form a) and b)
# pdf of the analytical approximation
...
# histogtam of the bootstrap approximation
...
```

```
Ellipsis
```

## Exercises 7#

### Exercise 3.13#

Generate the sampling distribution of the median for samples of size \(n=30\) from the population `rvK`

\(= K \sim \mathcal{N}(1000,10)\) .

```
from scipy.stats import norm
rvK = norm(1000, 10)
# simulations of the median from samples of size n=30
kmedians30 = ...
```

```
# compute se of median
...
```

### Exercise 3.14#

Generate the sampling distribution of `P90`

, the 90th percentile estimator,
from samples of size \(n=30\) from `rvK`

\(= K \sim \mathcal{N}(1000,10)\) .
Calculate the standard error \(\stderr{\tt{P90}}\) of the estimator.

```
from scipy.stats import norm
rvK = norm(1000, 10)
# 90th percential estimator function
def p90(sample):
...
# simulations of p90 from samples of size n=30
kp90s30 = gen_sampling_dist(rvK, estfunc=p90, n=30)
# histogram
...
```

```
Ellipsis
```

```
# compute se of p90
...
```

### Exercise 3.16#

Use the function `scipy.stats.boostrap`

to compute the bootstrap distribution
of the variance from the sample `ksample02`

.

```
from scipy.stats import bootstrap
kombucha = pd.read_csv("datasets/kombucha.csv")
ksample02 = ... # select Batch 02 here...
# call the function bootstrap
...
# plot histogram
...
```

```
Ellipsis
```