Chapter 1 problems#

This notebook contains the problems from Chapter 1 Data in the No Bullshit Guide to Statistics.

Notebooks setup#

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
# Figures setup
sns.set_theme(
    context="paper",
    style="whitegrid",
    palette="colorblind",
    rc={"figure.figsize": (5, 2)},

)

%config InlineBackend.figure_format = 'retina'
# Set pandas precision
pd.set_option("display.precision", 2)
# Simple float __repr__
import numpy as np
if int(np.__version__.split(".")[0]) >= 2:
    np.set_printoptions(legacy='1.25')
# Download datasets/ directory if necessary
from ministats import ensure_datasets
ensure_datasets()
datasets/ directory present and ready.

Problem {prob:bar-chart_score-vs-curriculum}#

import pandas as pd
students = pd.read_csv("datasets/students.csv")

# bar plot

The above bar plot compares the mean score between the debate and lecture curriculum variants. The error bars stretch from \(\textbf{mean} - \textbf{std}\) to \(\textbf{mean} + \textbf{std}\) within each group, and provide an idea of the data variability.

P?#

Melt wide data in datasets/markswide.csv into long data.

import pandas as pd
markswide = pd.read_csv("datasets/markswide.csv")

# use the .melt() method and set the options `id_vars`, `var_name`, and `value_name`

P?#

convert bpwide to bplong

import pandas as pd
bpwide = pd.read_csv("datasets/bpwide.csv")
bpwide.head(3)

# use the .melt() method and set the options `id_vars`, `var_name`, and `value_name`
patient sex agegrp bp_before bp_after
0 1 Male 30-45 143 153
1 2 Male 30-45 163 170
2 3 Male 30-45 153 168

The first three columns contain identifying variables for a particular individual

P#

z = 5
xs = pd.Series([10, 10-z, 10+z], name="x")
xs.mean(), xs.std()
(10.0, 5.0)

P?#

Draw a scatter plot for the following dataset of (x,y) pairs: { (2,2), (3,3), (4,3), (5,5), (6,4) }.

P#

We want to summarize the dataset \(\mathbf{x}_z = (1, 2, 3, 4, 5, z)\) by reporting a the mean \(\mathbf{mean}(\mathbf{x})\), the standard deviation \(\mathbf{std}(\mathbf{x})\), and the median \(\mathbf{med}(\mathbf{x})\).

import pandas as pd
xs10 = pd.Series([1, 2, 3, 4, 5, 10], name="x")

# a) compute the mean, the standard deviation, and the median

P?#

Load the doctors dataset and compute the following conditional relative frequencies.

a) rur | cli

b) cli | rur

doctors = pd.read_csv("datasets/doctors.csv")

P?#

The data file datasets/formats/students_meta.csv contains metadata as the first four lines of the file:

# title: Student's dataset with metadata rows
# description: A copy of students.csv with extra metadata rows at the top
# author: Ivan Savov
# date: 2026-03-25
student_ID,background,curriculum,effort,score
1,arts,debate,10.96,75
2,science,lecture,8.69,75
3,arts,debate,8.6,67
... 12 more lines ...

This is not a standard CSV file, and you’ll get an error if you try loading it using pd.read_csv with no options. Consult the help docs for the function pd.read_csv to find an option that skips the metadata rows, and load the data below them.

Hint: Use help(pd.read_csv), pd.read_csv?, or the online docs at https://pandas.pydata.org/docs/.

Hint: There are at least two options you can use to skip the metadata rows.

# # This doesn't work:
# students_meta = pd.read_csv("datasets/formats/students_meta.csv")

# Read the `pd.read_csv` help docs to find

One approach you can use is the add the option comment="#", which tells pd.read_csv to ignore all rows starting with #. This is similar to how Python ignores comments.

Another approach is to use the option skiprows=4, which tells pd.read_csv to skip the first four rows and start reading the data file only at line 5, which is the header row.

P?#

Old Faithful geyser

Eruption and waiting time of Old Faithful Geyser

faithful = pd.read_csv("datasets/faithful.csv")
faithful.shape
# faithful.sort_values("waiting")
faithful
duration waiting
0 3.60 79
1 1.80 54
2 3.33 74
3 2.28 62
4 4.53 85
... ... ...
267 4.12 81
268 2.15 46
269 4.42 90
270 1.82 46
271 4.47 74

272 rows × 2 columns

faithful["duration"].describe()
count    272.00
mean       3.49
std        1.14
min        1.60
25%        2.16
50%        4.00
75%        4.45
max        5.10
Name: duration, dtype: float64
sns.boxplot(data=faithful, x="duration");
sns.histplot(data=faithful, x="duration", bins=50);
faithful["duration"].min(), faithful["duration"].max()
(1.6, 5.1)
import numpy as np

# histogram
bins = np.linspace(1.5, 5.2, 50)
hist = pd.cut(faithful["duration"], bins=bins, right=False) \
         .value_counts(sort=False, normalize=True)
bins_left = hist.index.categories.left
bins_width = hist.index.categories.length
x = hist.index.categories.mid
y = hist.values


# fit a mixture of Gaussians to the histogram
from scipy.optimize import curve_fit

def gauss(x, mu, sigma, A):
    return A*np.exp(-0.5*(x-mu)**2/sigma**2)

def bimodal(x, mu1, sigma1, A1, mu2, sigma2, A2):
    return gauss(x,mu1,sigma1,A1) + gauss(x,mu2,sigma2,A2)

p0 = (2,0.5,0.4,  4.5,0.5,0.6)
params, cov = curve_fit(bimodal, x, y, p0=p0)
plt.bar(x=bins_left, width=bins_width, height=y, label='data')
plt.plot(x, bimodal(x,*params), color='red', label='model')
plt.legend();
# params
sns.scatterplot(data=faithful, x="duration", y="waiting");
sns.regplot(data=faithful, x="duration", y="waiting");