Section 5.5 — Hierarchical models#
This notebook contains the code examples from Section 5.5 Hierarchical models from the No Bullshit Guide to Statistics.
See also:
https://mc-stan.org/users/documentation/case-studies/radon_cmdstanpy_plotnine.html#data-prep
https://www.pymc.io/projects/examples/en/latest/generalized_linear_models/multilevel_modeling.html
Notebook setup#
# load Python modules
import os
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
# Figures setup
plt.clf()  # needed, otherwise `sns.set_theme` doesn't work
from plot_helpers import RCPARAMS
RCPARAMS.update({"figure.figsize": (5, 3)}) # good for screen
# RCPARAMS.update({"figure.figsize": (5, 1.6)}) # good for print
sns.set_theme(
context="paper",
style="whitegrid",
palette="colorblind",
rc=RCPARAMS,
)
# High-resolution please
%config InlineBackend.figure_format = "retina"
# Where to store figures
from ministats.utils import savefigure
DESTDIR = "figures/bayes/hierarchical"
# set random seed for repeatability
np.random.seed(42)
#######################################################
# silence statsmodels kurtosistest warning when using n < 20
import warnings
warnings.filterwarnings("ignore", category=FutureWarning)
warnings.filterwarnings("ignore", category=UserWarning)
Definitions#
Hierarchical (multilevel) models#
Model formula#
Radon dataset#
https://bambinos.github.io/bambi/notebooks/radon_example.html
Description: Contains measurements of radon levels in homes across various counties.
Source: Featured in Andrew Gelman and Jennifer Hill’s book Data Analysis Using Regression and Multilevel/Hierarchical Models.
Application: Demonstrates partial pooling and varying intercepts/slopes in hierarchical modeling.
Loading the data#
radon = pd.read_csv("../datasets/radon.csv")
radon.shape
(919, 6)
radon.head()
idnum | state | county | floor | log_radon | log_uranium | |
---|---|---|---|---|---|---|
0 | 5081 | MN | AITKIN | ground | 0.788457 | -0.689048 |
1 | 5082 | MN | AITKIN | basement | 0.788457 | -0.689048 |
2 | 5083 | MN | AITKIN | basement | 1.064711 | -0.689048 |
3 | 5084 | MN | AITKIN | basement | 0.000000 | -0.689048 |
4 | 5085 | MN | ANOKA | basement | 1.131402 | -0.847313 |
len(radon["county"].unique())
85
radon[["log_radon", "log_uranium"]].describe().T.round(2)
count | mean | std | min | 25% | 50% | 75% | max | |
---|---|---|---|---|---|---|---|---|
log_radon | 919.0 | 1.22 | 0.85 | -2.30 | 0.64 | 1.28 | 1.79 | 3.88 |
log_uranium | 919.0 | -0.13 | 0.37 | -0.88 | -0.47 | -0.10 | 0.18 | 0.53 |
#######################################################
from ministats.book.figures import plot_counties
sel_counties = [
"LAC QUI PARLE", "AITKIN", "KOOCHICHING", "DOUGLAS",
"HENNEPIN", "STEARNS", "RAMSEY", "ST LOUIS",
]
plot_counties(radon, counties=sel_counties);
# # FIGURES ONLY
# sel_counties = [
# "LAC QUI PARLE", "AITKIN", "KOOCHICHING", "DOUGLAS",
# "HENNEPIN", "STEARNS", "RAMSEY", "ST LOUIS",
# ]
# fig = plot_counties(radon, counties=sel_counties, figsize=(7,3.2))
# filename = os.path.join(DESTDIR, "sel_counties_scatter_only.pdf")
# savefigure(fig, filename)
Example 1: complete pooling model#
= common linear regression model for all counties
Bayesian model#
We can pool all the data and estimate one big regression to assess the influence of the floor variable on radon levels across all counties. The variable \(x\) corresponds to the floor column, with \(0\) representing basement and \(1\) representing ground floor. By ignoring the county feature, we do not differentiate between counties.
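In symbols, the complete pooling model can be sketched as follows (written to match the priors used in the Bambi model below):

```latex
\begin{align*}
\log(\texttt{radon}_i) &\sim \mathcal{N}(\mu_i, \, \sigma), \\
\mu_i &= \beta_0 + \beta_f \, x_i, \\
\beta_0 &\sim \mathcal{N}(1, 2), \qquad
\beta_f \sim \mathcal{N}(0, 5), \qquad
\sigma \sim \mathcal{T}^{+}_{\nu=4}(1),
\end{align*}
```

where \(\mathcal{T}^{+}\) denotes a half Student's \(t\) distribution. Every observation shares the same intercept \(\beta_0\) and slope \(\beta_f\).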
Bambi model#
import bambi as bmb
priors1 = {
"Intercept": bmb.Prior("Normal", mu=1, sigma=2),
"floor": bmb.Prior("Normal", mu=0, sigma=5),
"sigma": bmb.Prior("HalfStudentT", nu=4, sigma=1),
}
mod1 = bmb.Model(formula="log_radon ~ 1 + floor",
family="gaussian",
link="identity",
priors=priors1,
data=radon)
mod1
Formula: log_radon ~ 1 + floor
Family: gaussian
Link: mu = identity
Observations: 919
Priors:
target = mu
Common-level effects
Intercept ~ Normal(mu: 1.0, sigma: 2.0)
floor ~ Normal(mu: 0.0, sigma: 5.0)
Auxiliary parameters
sigma ~ HalfStudentT(nu: 4.0, sigma: 1.0)
mod1.build()
mod1.graph()
# # FIGURES ONLY
# filename = os.path.join(DESTDIR, "example1_complete_pooling_mod1_graph")
# mod1.graph(name=filename, fmt="png", dpi=300)
Model fitting and analysis#
idata1 = mod1.fit(random_seed=[42,43,44,45])
Auto-assigning NUTS sampler...
Initializing NUTS using jitter+adapt_diag...
Multiprocess sampling (4 chains in 4 jobs)
NUTS: [sigma, Intercept, floor]
Sampling 4 chains for 1_000 tune and 1_000 draw iterations (4_000 + 4_000 draws total) took 2 seconds.
import arviz as az
az.summary(idata1, kind="stats")
mean | sd | hdi_3% | hdi_97% | |
---|---|---|---|---|
Intercept | 1.326 | 0.029 | 1.271 | 1.381 |
floor[ground] | -0.612 | 0.073 | -0.753 | -0.479 |
sigma | 0.823 | 0.019 | 0.791 | 0.863 |
# # FIGURES ONLY
# az.plot_posterior(idata1, round_to=2, figsize=(6,1.8));
# filename = os.path.join(DESTDIR, "example1_posterior.pdf")
# savefigure(plt.gcf(), filename)
fig, axs = bmb.interpret.plot_predictions(mod1, idata1, conditional="floor")
means1 = az.summary(idata1, kind="stats")["mean"]
y0 = means1["Intercept"]
y1 = means1["Intercept"] + means1["floor[ground]"]
sns.lineplot(x=[0,1], y=[y0,y1], ax=axs[0]);
midpoint = [0.5, (y0+y1)/2 + 0.03]
slope = means1["floor[ground]"].round(3)
axs[0].annotate("$\\mu_{B_{f}}=%.3f$" % slope, midpoint);
# # FIGURES ONLY
# fig, axs = bmb.interpret.plot_predictions(mod1, idata1, conditional="floor",
# fig_kwargs={"figsize":(3,2)})
# means1 = az.summary(idata1, kind="stats")["mean"]
# y0 = means1["Intercept"]
# y1 = means1["Intercept"] + means1["floor[ground]"]
# sns.lineplot(x=[0,1], y=[y0,y1], ax=axs[0]);
# midpoint = [0.5, (y0+y1)/2 + 0.03]
# slope = means1["floor[ground]"].round(3)
# axs[0].annotate("$\\mu_{B_{f}}=%.3f$" % slope, midpoint);
# filename = os.path.join(DESTDIR, "example1_basement_ground_slope.pdf")
# savefigure(plt.gcf(), filename)
Conclusion#
Because we do not use group membership, the model has high bias: it cannot capture county-level differences.
Example 2: no pooling model#
= separate intercept for each county
Bayesian model#
If we treat different counties as independent, so each one gets an intercept term:
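In symbols (a sketch matching the priors below), with \(j[i]\) denoting the county of observation \(i\):

```latex
\begin{align*}
\log(\texttt{radon}_i) &\sim \mathcal{N}(\mu_i, \, \sigma), \\
\mu_i &= \alpha_{j[i]} + \beta_f \, x_i, \\
\alpha_j &\sim \mathcal{N}(0, 10) \quad \text{independently for each county } j, \\
\beta_f &\sim \mathcal{N}(0, 30), \qquad
\sigma \sim \mathcal{T}^{+}_{\nu=4}(1).
\end{align*}
```

The 85 county intercepts \(\alpha_j\) are estimated separately; no information is shared between counties.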
Bambi model#
priors2 = {
"county": bmb.Prior("Normal", mu=0, sigma=10),
"floor": bmb.Prior("Normal", mu=0, sigma=30),
"sigma": bmb.Prior("HalfStudentT", nu=4, sigma=1),
}
mod2 = bmb.Model("log_radon ~ 0 + county + floor",
family="gaussian",
link="identity",
priors=priors2,
data=radon)
mod2
Formula: log_radon ~ 0 + county + floor
Family: gaussian
Link: mu = identity
Observations: 919
Priors:
target = mu
Common-level effects
county ~ Normal(mu: 0.0, sigma: 10.0)
floor ~ Normal(mu: 0.0, sigma: 30.0)
Auxiliary parameters
sigma ~ HalfStudentT(nu: 4.0, sigma: 1.0)
mod2.build()
mod2.graph()
# # FIGURES ONLY
# filename = os.path.join(DESTDIR, "example2_no_pooling_mod2_graph")
# mod2.graph(name=filename, fmt="png", dpi=300)
Model fitting and analysis#
idata2 = mod2.fit(random_seed=[42,43,44,45])
Auto-assigning NUTS sampler...
Initializing NUTS using jitter+adapt_diag...
Multiprocess sampling (4 chains in 4 jobs)
NUTS: [sigma, county, floor]
Sampling 4 chains for 1_000 tune and 1_000 draw iterations (4_000 + 4_000 draws total) took 4 seconds.
idata2sel = idata2.sel(county_dim=sel_counties)
az.summary(idata2sel, kind="stats")
mean | sd | hdi_3% | hdi_97% | |
---|---|---|---|---|
county[LAC QUI PARLE] | 2.961 | 0.531 | 1.988 | 3.970 |
county[AITKIN] | 0.834 | 0.377 | 0.128 | 1.560 |
county[KOOCHICHING] | 0.814 | 0.289 | 0.261 | 1.347 |
county[DOUGLAS] | 1.736 | 0.256 | 1.253 | 2.224 |
county[HENNEPIN] | 1.362 | 0.075 | 1.227 | 1.510 |
county[STEARNS] | 1.489 | 0.151 | 1.217 | 1.773 |
county[RAMSEY] | 1.161 | 0.132 | 0.929 | 1.426 |
county[ST LOUIS] | 0.867 | 0.071 | 0.731 | 0.998 |
floor[ground] | -0.718 | 0.072 | -0.851 | -0.583 |
sigma | 0.757 | 0.019 | 0.723 | 0.793 |
axs = az.plot_forest(idata2sel, combined=True, figsize=(6,2.6))
axs[0].set_xticks(np.arange(-0.5,4.6,0.5))
axs[0].set_title(None);
# # FIGURES ONLY
# filename = os.path.join(DESTDIR, "forest_plot_mod2_sel_counties.pdf")
# savefigure(axs[0], filename)
# # FIGURES ONLY
# sel_counties = [
# "LAC QUI PARLE", "AITKIN", "KOOCHICHING", "DOUGLAS",
# "HENNEPIN", "STEARNS", "RAMSEY", "ST LOUIS",
# ]
# fig = plot_counties(radon, idata_cp=idata1, idata_np=idata2,
# counties=sel_counties, figsize=(7,3.2))
# filename = os.path.join(DESTDIR, "sel_counties_complete_and_no_pooling.pdf")
# savefigure(fig, filename)
# fig, axs = bmb.interpret.plot_predictions(mod2, idata2, ["floor", "county"]);
# axs[0].get_legend().remove()
# post2 = idata2["posterior"]
# unpooled_means = post2.mean(dim=("chain", "draw"))
# unpooled_hdi = az.hdi(idata2)
# unpooled_means_iter = unpooled_means.sortby("county")
# unpooled_hdi_iter = unpooled_hdi.sortby(unpooled_means_iter.county)
# _, ax = plt.subplots(figsize=(12, 5))
# unpooled_means_iter.plot.scatter(x="county_dim", y="county", ax=ax, alpha=0.9)
# ax.vlines(
# np.arange(len(radon["county"].unique())),
# unpooled_hdi_iter.county.sel(hdi="lower"),
# unpooled_hdi_iter.county.sel(hdi="higher"),
# color="orange",
# alpha=0.6,
# )
# ax.set(ylabel="Radon estimate", ylim=(-2, 4.5))
# ax.tick_params(rotation=90);
Conclusion#
Because we treat each group independently, the estimates have high variance: counties with few observations get noisy, unreliable estimates.
Example 3: hierarchical model#
= partial pooling model = varying intercepts model
Bayesian hierarchical model#
The partial pooling formula estimates per-county intercepts that are drawn from a common distribution, which is estimated jointly with the rest of the model parameters. The 1 is the intercept coefficient. The estimates across counties all share the same slope.
log_radon ~ 1 + (1|county_id) + floor
There is a middle ground to both of these extremes. Specifically, we may assume that the intercepts are different for each county as in the unpooled case, but they are drawn from the same distribution. The different counties are effectively sharing information through the common prior.
NOTE: some counties have very few samples; the hierarchical model provides “shrinkage” for these groups, pulling their estimates toward the global mean learned from all counties.
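To see what shrinkage does, here is a simplified sketch (not the model fitted below): with known within-county and between-county standard deviations, the partially pooled estimate is a precision-weighted average of each county's sample mean and the global mean, which is what the hierarchical model does implicitly during inference. All numbers here are illustrative stand-ins.

```python
import numpy as np

rng = np.random.default_rng(42)

# Simulated county sample means with very different sample sizes
true_global_mean = 1.2
sigma_y = 0.8        # within-county standard deviation (assumed known)
sigma_alpha = 0.3    # between-county standard deviation (assumed known)
ns = np.array([2, 5, 50, 200])                   # observations per county
county_means = rng.normal(true_global_mean, 0.6, size=len(ns))

# Partial pooling estimate: precision-weighted average of the
# county mean and the global mean
w = (ns / sigma_y**2) / (ns / sigma_y**2 + 1 / sigma_alpha**2)
shrunk = w * county_means + (1 - w) * true_global_mean

# Counties with few observations are pulled strongly toward the
# global mean; counties with many observations keep their own estimate.
for n, m, s in zip(ns, county_means, shrunk):
    print(f"n={n:3d}  raw={m:+.2f}  shrunk={s:+.2f}")
```

The weight \(w\) grows with the county sample size, so small counties borrow more strength from the rest of the data.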
radon["log_radon"].describe()
count 919.000000
mean 1.224623
std 0.853327
min -2.302585
25% 0.641854
50% 1.280934
75% 1.791759
max 3.875359
Name: log_radon, dtype: float64
radon.groupby("floor")["log_radon"].describe()
count | mean | std | min | 25% | 50% | 75% | max | |
---|---|---|---|---|---|---|---|---|
floor | ||||||||
basement | 766.0 | 1.326744 | 0.782709 | -2.302585 | 0.788457 | 1.360977 | 1.883253 | 3.875359 |
ground | 153.0 | 0.713349 | 0.999376 | -2.302585 | 0.095310 | 0.741937 | 1.308333 | 3.234749 |
Bambi model#
priors3 = {
"Intercept": bmb.Prior("Normal", mu=1, sigma=2),
"floor": bmb.Prior("Normal", mu=0, sigma=5),
"1|county": bmb.Prior("Normal", mu=0, sigma=bmb.Prior("Exponential", lam=1)),
# "1|county": bmb.Prior("Normal", mu=0, sigma=bmb.Prior("HalfNormal", sigma=2)),
"sigma": bmb.Prior("HalfStudentT", nu=4, sigma=1),
# "sigma": bmb.Prior("Exponential", lam=1),
# "sigma": bmb.Prior("HalfNormal", sigma=10), # from PyStan tutorial
# "sigma": bmb.Prior("Uniform", lower=0, upper=100), # from PyMC example
}
formula3 = "log_radon ~ 1 + (1|county) + floor"
#######################################################
mod3 = bmb.Model(formula=formula3,
family="gaussian",
link="identity",
priors=priors3,
data=radon,
noncentered=False)
mod3
Formula: log_radon ~ 1 + (1|county) + floor
Family: gaussian
Link: mu = identity
Observations: 919
Priors:
target = mu
Common-level effects
Intercept ~ Normal(mu: 1.0, sigma: 2.0)
floor ~ Normal(mu: 0.0, sigma: 5.0)
Group-level effects
1|county ~ Normal(mu: 0.0, sigma: Exponential(lam: 1.0))
Auxiliary parameters
sigma ~ HalfStudentT(nu: 4.0, sigma: 1.0)
mod3.build()
mod3.graph()
# # FIGURES ONLY
# filename = os.path.join(DESTDIR, "example3_partial_pooling_mod3_graph")
# mod3.graph(name=filename, fmt="png", dpi=300)
Model fitting and analysis#
idata3 = mod3.fit(random_seed=[42,43,44,45])
Auto-assigning NUTS sampler...
Initializing NUTS using jitter+adapt_diag...
Multiprocess sampling (4 chains in 4 jobs)
NUTS: [sigma, Intercept, floor, 1|county_sigma, 1|county]
Sampling 4 chains for 1_000 tune and 1_000 draw iterations (4_000 + 4_000 draws total) took 3 seconds.
The group-level parameters:
idata3sel = idata3.sel(county__factor_dim=sel_counties)
az.summary(idata3sel, kind="stats")
mean | sd | hdi_3% | hdi_97% | |
---|---|---|---|---|
1|county[LAC QUI PARLE] | 0.413 | 0.294 | -0.173 | 0.941 |
1|county[AITKIN] | -0.267 | 0.251 | -0.734 | 0.201 |
1|county[KOOCHICHING] | -0.376 | 0.227 | -0.835 | 0.022 |
1|county[DOUGLAS] | 0.170 | 0.206 | -0.219 | 0.545 |
1|county[HENNEPIN] | -0.099 | 0.085 | -0.260 | 0.063 |
1|county[STEARNS] | 0.017 | 0.144 | -0.237 | 0.301 |
1|county[RAMSEY] | -0.261 | 0.133 | -0.509 | -0.013 |
1|county[ST LOUIS] | -0.572 | 0.085 | -0.730 | -0.413 |
1|county_sigma | 0.333 | 0.046 | 0.250 | 0.421 |
Intercept | 1.463 | 0.052 | 1.371 | 1.567 |
floor[ground] | -0.694 | 0.070 | -0.829 | -0.568 |
sigma | 0.757 | 0.018 | 0.723 | 0.791 |
The intercept offsets for each county are:
# sum( idata3["posterior"]["1|county"].stack(sample=("chain","draw")).values.mean(axis=1) )
# az.plot_forest(idata3, combined=True, figsize=(7,2),
# var_names=["Intercept", "floor", "1|county_sigma", "sigma"]);
#######################################################
axs = az.plot_forest(idata3sel, var_names=["Intercept", "1|county", "floor", "1|county_sigma", "sigma"], combined=True, figsize=(6,3))
# axs = az.plot_forest(idata3sel, combined=True, figsize=(6,3))
axs[0].set_xticks(np.arange(-0.8,1.6,0.2))
axs[0].set_title(None);
# # FIGURES ONLY
# filename = os.path.join(DESTDIR, "forest_plot_mod3_sel_counties.pdf")
# savefigure(plt.gcf(), filename)
# az.plot_forest(idata3, var_names=["1|county"], combined=True);
Compare models#
Compare complete pooling, no pooling, and partial pooling models
# # FIGURES ONLY
# fig = plot_counties(radon, idata_cp=idata1, idata_np=idata2, idata_pp=idata3, figsize=(7,3.2))
# filename = os.path.join(DESTDIR, "sel_counties_all_models.pdf")
# savefigure(fig, filename)
Conclusions#
Explanations#
Prior selection for hierarchical models#
?
Varying intercepts and slopes model#
= group-specific slopes: we can also make the slope \(\beta_x\) group-specific
The varying-slope, varying-intercept model adds floor to the group-level coefficients. Now the estimates across counties each have their own slope.
log_radon ~ floor + (1 + floor | county_id)
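In symbols (a sketch; \(j[i]\) denotes the county of observation \(i\)):

```latex
\begin{align*}
\log(\texttt{radon}_i) &\sim \mathcal{N}(\mu_i, \, \sigma), \\
\mu_i &= \alpha_{j[i]} + \beta_{j[i]} \, x_i, \\
\alpha_j &\sim \mathcal{N}(\mu_\alpha, \, \sigma_\alpha), \qquad
\beta_j \sim \mathcal{N}(\mu_\beta, \, \sigma_\beta),
\end{align*}
```

with hyperpriors on \(\mu_\alpha\), \(\mu_\beta\), \(\sigma_\alpha\), and \(\sigma_\beta\). Both the intercept and the floor effect now vary by county, while still sharing information through the common distributions.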
#######################################################
formula4 = "log_radon ~ (1 + floor | county)"
priors4 = {
"Intercept": bmb.Prior("Normal", mu=1, sigma=2),
"1|county": bmb.Prior("Normal", mu=0, sigma=bmb.Prior("Exponential", lam=1)),
"floor|county": bmb.Prior("Normal", mu=0, sigma=bmb.Prior("Exponential", lam=1)),
"sigma": bmb.Prior("HalfStudentT", nu=4, sigma=1),
}
mod4 = bmb.Model(formula=formula4,
family="gaussian",
link="identity",
priors=priors4,
data=radon,
noncentered=False)
mod4
Formula: log_radon ~ (1 + floor | county)
Family: gaussian
Link: mu = identity
Observations: 919
Priors:
target = mu
Common-level effects
Intercept ~ Normal(mu: 1.0, sigma: 2.0)
Group-level effects
1|county ~ Normal(mu: 0.0, sigma: Exponential(lam: 1.0))
floor|county ~ Normal(mu: 0.0, sigma: Exponential(lam: 1.0))
Auxiliary parameters
sigma ~ HalfStudentT(nu: 4.0, sigma: 1.0)
mod4.build()
mod4.graph()
# # FIGURES ONLY
# filename = os.path.join(DESTDIR, "varying_int_and_slopes_mod4_graph")
# mod4.graph(name=filename, fmt="png", dpi=300)
idata4 = mod4.fit(random_seed=[42,43,44,45])
Auto-assigning NUTS sampler...
Initializing NUTS using jitter+adapt_diag...
Multiprocess sampling (4 chains in 4 jobs)
NUTS: [sigma, Intercept, 1|county_sigma, 1|county, floor|county_sigma, floor|county]
Sampling 4 chains for 1_000 tune and 1_000 draw iterations (4_000 + 4_000 draws total) took 5 seconds.
az.autocorr(idata4["posterior"]["sigma"].values.flatten())[0:10]
array([ 1. , -0.23618776, 0.09624438, -0.02344782, -0.00212682,
0.00192633, -0.01292044, 0.0159805 , 0.01103348, -0.01561947])
az.summary(idata4)
mean | sd | hdi_3% | hdi_97% | mcse_mean | mcse_sd | ess_bulk | ess_tail | r_hat | |
---|---|---|---|---|---|---|---|---|---|
1|county[AITKIN] | -0.309 | 0.265 | -0.817 | 0.171 | 0.003 | 0.003 | 6315.0 | 3243.0 | 1.0 |
1|county[ANOKA] | -0.469 | 0.114 | -0.681 | -0.261 | 0.002 | 0.001 | 5493.0 | 3306.0 | 1.0 |
1|county[BECKER] | -0.056 | 0.283 | -0.618 | 0.453 | 0.003 | 0.004 | 7326.0 | 3369.0 | 1.0 |
1|county[BELTRAMI] | 0.032 | 0.251 | -0.438 | 0.500 | 0.003 | 0.004 | 6540.0 | 3269.0 | 1.0 |
1|county[BENTON] | -0.050 | 0.255 | -0.529 | 0.415 | 0.003 | 0.004 | 7859.0 | 2649.0 | 1.0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
floor|county[ground, WINONA] | -1.321 | 0.408 | -2.053 | -0.550 | 0.006 | 0.004 | 4898.0 | 3255.0 | 1.0 |
floor|county[ground, WRIGHT] | -0.349 | 0.526 | -1.355 | 0.590 | 0.007 | 0.008 | 6462.0 | 2891.0 | 1.0 |
floor|county[ground, YELLOW MEDICINE] | -0.021 | 0.726 | -1.370 | 1.294 | 0.008 | 0.013 | 8360.0 | 2857.0 | 1.0 |
floor|county_sigma[ground] | 0.723 | 0.108 | 0.527 | 0.924 | 0.003 | 0.002 | 1111.0 | 1844.0 | 1.0 |
sigma | 0.748 | 0.018 | 0.713 | 0.782 | 0.000 | 0.000 | 6013.0 | 3475.0 | 1.0 |
174 rows × 9 columns
# idata4["posterior"]
idata4sel = idata4.sel(county__factor_dim=sel_counties)
var_names = ["Intercept",
"1|county",
"floor|county",
"1|county_sigma",
"floor|county_sigma",
"sigma"]
axs = az.plot_forest(idata4sel, combined=True, var_names=var_names, figsize=(6,4))
axs[0].set_xticks(np.arange(-1.6,1.6,0.2))
axs[0].set_title(None);
# FIGURES ONLY
# filename = os.path.join(DESTDIR, "forest_plot_mod4_sel_counties.pdf")
# savefigure(plt.gcf(), filename)
Comparing models#
idata1 = mod1.fit(idata_kwargs={"log_likelihood": True}, random_seed=[42,43,44,45])
idata2 = mod2.fit(idata_kwargs={"log_likelihood": True}, random_seed=[42,43,44,45])
idata3 = mod3.fit(idata_kwargs={"log_likelihood": True}, random_seed=[42,43,44,45])
idata4 = mod4.fit(idata_kwargs={"log_likelihood": True}, random_seed=[42,43,44,45])
Auto-assigning NUTS sampler...
Initializing NUTS using jitter+adapt_diag...
Multiprocess sampling (4 chains in 4 jobs)
NUTS: [sigma, Intercept, floor]
Sampling 4 chains for 1_000 tune and 1_000 draw iterations (4_000 + 4_000 draws total) took 2 seconds.
Auto-assigning NUTS sampler...
Initializing NUTS using jitter+adapt_diag...
Multiprocess sampling (4 chains in 4 jobs)
NUTS: [sigma, county, floor]
Sampling 4 chains for 1_000 tune and 1_000 draw iterations (4_000 + 4_000 draws total) took 3 seconds.
Auto-assigning NUTS sampler...
Initializing NUTS using jitter+adapt_diag...
Multiprocess sampling (4 chains in 4 jobs)
NUTS: [sigma, Intercept, floor, 1|county_sigma, 1|county]
Sampling 4 chains for 1_000 tune and 1_000 draw iterations (4_000 + 4_000 draws total) took 2 seconds.
Auto-assigning NUTS sampler...
Initializing NUTS using jitter+adapt_diag...
Multiprocess sampling (4 chains in 4 jobs)
NUTS: [sigma, Intercept, 1|county_sigma, 1|county, floor|county_sigma, floor|county]
Sampling 4 chains for 1_000 tune and 1_000 draw iterations (4_000 + 4_000 draws total) took 4 seconds.
We compare the models based on their expected log pointwise predictive density (ELPD). The ELPD is estimated either by Pareto-smoothed importance sampling leave-one-out cross-validation (LOO) or by the widely applicable information criterion (WAIC); we recommend LOO. For the underlying theory, see dx.doi.org/10.1111/1467-9868.00353, a paper by some of the leading authorities on model comparison.
radon_models = {
"mod1 (complete pooling)": idata1,
"mod2 (no pooling)": idata2,
"mod3 (varying intercepts)": idata3,
"mod4 (varying int. and slopes)": idata4,
}
compare_results = az.compare(radon_models)
compare_results
rank | elpd_loo | p_loo | elpd_diff | weight | se | dse | warning | scale | |
---|---|---|---|---|---|---|---|---|---|
mod3 (varying intercepts) | 0 | -1074.193864 | 49.481974 | 0.000000 | 7.140791e-01 | 28.159665 | 0.000000 | False | log |
mod4 (varying int. and slopes) | 1 | -1088.697153 | 88.144803 | 14.503289 | 2.112574e-01 | 30.093302 | 7.369674 | True | log |
mod2 (no pooling) | 2 | -1099.745752 | 86.771902 | 25.551887 | 8.261194e-14 | 28.185332 | 6.483849 | True | log |
mod1 (complete pooling) | 3 | -1126.868605 | 3.724237 | 52.674741 | 7.466341e-02 | 25.547441 | 10.576320 | False | log |
az.plot_compare(compare_results);
# # FIGURES ONLY
# ax = az.plot_compare(compare_results, figsize=(9,3))
# ax.set_title(None)
# filename = os.path.join(DESTDIR, "model_comparison_mod1_mod2_mod3_mod4.pdf")
# savefigure(ax, filename)
ELPD and elpd_loo#
The ELPD is the theoretical expected log pointwise predictive density for a new dataset (Eq 1 in VGG2017), which can be estimated, e.g., using cross-validation. elpd_loo is the Bayesian LOO estimate of the expected log pointwise predictive density (Eq 4 in VGG2017) and is a sum of N individual pointwise log predictive densities. Probability densities can be smaller or larger than 1, and thus log predictive densities can be negative or positive. For simplicity the ELPD acronym is used also for expected log pointwise predictive probabilities for discrete models. Probabilities are always equal or less than 1, and thus log predictive probabilities are 0 or negative.
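To make the “sum of \(N\) individual pointwise log predictive densities” concrete, here is a minimal sketch that computes the in-sample log pointwise predictive density (lppd) from a matrix of per-draw log-likelihoods. The variable names and simulated values are illustrative stand-ins, not the ArviZ API; LOO additionally corrects this quantity for overfitting.

```python
import numpy as np

# log_lik has shape (S, N): S posterior draws x N observations;
# each entry is the log-likelihood of observation n under draw s.
rng = np.random.default_rng(0)
S, N = 4000, 919
log_lik = rng.normal(-1.0, 0.1, size=(S, N))  # stand-in values

# lppd = sum_n log( (1/S) * sum_s exp(log_lik[s, n]) )
pointwise = np.log(np.exp(log_lik).mean(axis=0))  # one value per observation
lppd = pointwise.sum()
print(lppd)
```

Because each pointwise predictive density here is below 1, each log term is negative and the total lppd is negative, matching the negative `elpd_loo` values in the comparison table above.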
Frequentist multilevel models#
We can use statsmodels to fit multilevel models too.
Varying intercepts model using statsmodels#
import statsmodels.formula.api as smf
sm3 = smf.mixedlm("log_radon ~ 0 + floor", # Fixed effects (no intercept and floor as a fixed effect)
groups="county", # Grouping variable for random effects
re_formula="1", # Random effects = intercept
data=radon)
res3 = sm3.fit()
res3.summary().tables[1]
Coef. | Std.Err. | z | P>|z| | [0.025 | 0.975] | |
---|---|---|---|---|---|---|
floor[basement] | 1.462 | 0.052 | 28.164 | 0.000 | 1.360 | 1.563 |
floor[ground] | 0.769 | 0.074 | 10.339 | 0.000 | 0.623 | 0.914 |
county Var | 0.108 | 0.041 |
# slope
res3.params["floor[ground]"] - res3.params["floor[basement]"]
-0.6929937406558043
# sigma-hat
np.sqrt(res3.scale)
0.7555891484188184
# standard deviation of the variability among county-specific Intercepts
np.sqrt(res3.cov_re.values)
array([[0.32822242]])
# intercept for first county in res3
res3.random_effects["AITKIN"].values
array([-0.27009756])
This is very close to the mean of the random-effect coefficient for AITKIN in the Bayesian model mod3, which was \(-0.267\).
Varying intercepts and slopes model using statsmodels#
sm4 = smf.mixedlm("log_radon ~ 0 + floor", # Fixed effects (no intercept and floor as a fixed effect)
groups="county", # Grouping variable for random effects
re_formula="1 + floor", # Random effects: 1 for intercept, floor for slope
data=radon)
res4 = sm4.fit()
res4.summary().tables[1]
Coef. | Std.Err. | z | P>|z| | [0.025 | 0.975] | |
---|---|---|---|---|---|---|
floor[basement] | 1.463 | 0.054 | 26.977 | 0.000 | 1.356 | 1.569 |
floor[ground] | 0.782 | 0.085 | 9.208 | 0.000 | 0.615 | 0.948 |
county Var | 0.122 | 0.049 | ||||
county x floor[T.ground] Cov | -0.040 | 0.057 | ||||
floor[T.ground] Var | 0.118 | 0.120 |
# slope estimate
res4.params["floor[ground]"] - res4.params["floor[basement]"]
-0.6810977572101944
# sigma-hat
np.sqrt(res4.scale)
0.7461559982563548
# standard deviation of the variability among county-specific Intercepts
county_var_int = res4.cov_re.loc["county", "county"]
np.sqrt(county_var_int)
0.3486722110725776
# standard deviation of the variability among county-specific slopes
county_var_slopes = res4.cov_re.loc["floor[T.ground]", "floor[T.ground]"]
np.sqrt(county_var_slopes)
0.3435539748320278
# correlation between Intercept and slope group-level coefficients
county_floor_cov = res4.cov_re.loc["county", "floor[T.ground]"]
county_floor_cov / (np.sqrt(county_var_int)*np.sqrt(county_var_slopes))
-0.3372303607779023
Discussion#
Alternative notations for hierarchical models#
IMPORT FROM Gelman & Hill Section 12.5 printout
watch the subscripts!
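For reference, the varying-intercepts model in Gelman & Hill's notation (watch the subscripts: \(j[i]\) maps observation \(i\) to its group \(j\)):

```latex
\begin{align*}
y_i &\sim \mathcal{N}\!\left(\alpha_{j[i]} + \beta x_i, \; \sigma_y^2\right), \qquad i = 1, \dots, n, \\
\alpha_j &\sim \mathcal{N}\!\left(\mu_\alpha, \; \sigma_\alpha^2\right), \qquad j = 1, \dots, J.
\end{align*}
```

Note that this notation parameterizes the normal distribution by its variance, whereas the Bambi priors above are stated in terms of standard deviations.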
Computational challenges associated with hierarchical models#
centred vs. noncentred representations
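The difference can be sketched with plain NumPy (illustrative only; Bambi exposes this choice via the `noncentered` argument used in mod3 above, and the hyperparameter values here are stand-ins). In the centered form we sample \(\alpha_j \sim \mathcal{N}(\mu_\alpha, \sigma_\alpha)\) directly; in the noncentered form we sample a standardized offset and rescale:

```python
import numpy as np

rng = np.random.default_rng(42)
mu_alpha, sigma_alpha = 1.46, 0.33   # hyperparameters (illustrative values)
J = 85                                # number of counties

# Centered parameterization: sample group intercepts directly
alpha_centered = rng.normal(mu_alpha, sigma_alpha, size=J)

# Noncentered parameterization: sample standardized offsets, then rescale.
# Same distribution, but the sampler sees parameters that are a priori
# independent of sigma_alpha, which avoids the "funnel" geometry that
# arises when sigma_alpha is poorly identified (few groups or few
# observations per group).
alpha_offset = rng.normal(0, 1, size=J)
alpha_noncentered = mu_alpha + sigma_alpha * alpha_offset
```

This is why the sampler output for some models above lists `1|county_offset` variables: those fits use the noncentered representation internally.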
Benefits of multilevel models#
TODO LIST
Better than repeated-measures ANOVA because it:
tells you the direction and magnitude of the effect
can handle multiple grouping scenarios (e.g., by-item and by-student)
works for categorical predictors
Applications#
Need for hierarchical models often occurs in social sciences (better than ANOVA)
Hierarchical models are often used for Bayesian meta-analysis
Exercises#
Exercise: mod1u#
Same model as Example 3, but also including the predictor log_uranium.
import bambi as bmb
covariate_priors = {
"floor": bmb.Prior("Normal", mu=0, sigma=10),
"log_uranium": bmb.Prior("Normal", mu=0, sigma=10),
"1|county": bmb.Prior("Normal", mu=0, sigma=bmb.Prior("Exponential", lam=1)),
"sigma": bmb.Prior("Exponential", lam=1),
}
mod1u = bmb.Model(formula="log_radon ~ 1 + floor + (1|county) + log_uranium",
priors=covariate_priors,
data=radon)
mod1u
Formula: log_radon ~ 1 + floor + (1|county) + log_uranium
Family: gaussian
Link: mu = identity
Observations: 919
Priors:
target = mu
Common-level effects
Intercept ~ Normal(mu: 1.2246, sigma: 2.1322)
floor ~ Normal(mu: 0.0, sigma: 10.0)
log_uranium ~ Normal(mu: 0.0, sigma: 10.0)
Group-level effects
1|county ~ Normal(mu: 0.0, sigma: Exponential(lam: 1.0))
Auxiliary parameters
sigma ~ Exponential(lam: 1.0)
Exercise: pigs dataset#
https://bambinos.github.io/bambi/notebooks/multi-level_regression.html
import statsmodels.api as sm
dietox = sm.datasets.get_rdataset("dietox", "geepack").data
dietox
Pig | Evit | Cu | Litter | Start | Weight | Feed | Time | |
---|---|---|---|---|---|---|---|---|
0 | 4601 | Evit000 | Cu000 | 1 | 26.5 | 26.50000 | NaN | 1 |
1 | 4601 | Evit000 | Cu000 | 1 | 26.5 | 27.59999 | 5.200005 | 2 |
2 | 4601 | Evit000 | Cu000 | 1 | 26.5 | 36.50000 | 17.600000 | 3 |
3 | 4601 | Evit000 | Cu000 | 1 | 26.5 | 40.29999 | 28.500000 | 4 |
4 | 4601 | Evit000 | Cu000 | 1 | 26.5 | 49.09998 | 45.200001 | 5 |
... | ... | ... | ... | ... | ... | ... | ... | ... |
856 | 8442 | Evit000 | Cu175 | 24 | 25.7 | 73.19995 | 83.800003 | 8 |
857 | 8442 | Evit000 | Cu175 | 24 | 25.7 | 81.69995 | 99.800003 | 9 |
858 | 8442 | Evit000 | Cu175 | 24 | 25.7 | 90.29999 | 115.200001 | 10 |
859 | 8442 | Evit000 | Cu175 | 24 | 25.7 | 96.00000 | 133.200001 | 11 |
860 | 8442 | Evit000 | Cu175 | 24 | 25.7 | 103.50000 | 151.400002 | 12 |
861 rows × 8 columns
pigsmodel = bmb.Model("Weight ~ Time + (Time|Pig)", dietox)
pigsidata = pigsmodel.fit()
Auto-assigning NUTS sampler...
Initializing NUTS using jitter+adapt_diag...
Multiprocess sampling (4 chains in 4 jobs)
NUTS: [sigma, Intercept, Time, 1|Pig_sigma, 1|Pig_offset, Time|Pig_sigma, Time|Pig_offset]
Sampling 4 chains for 1_000 tune and 1_000 draw iterations (4_000 + 4_000 draws total) took 8 seconds.
az.summary(pigsidata, var_names=["Intercept", "Time", "1|Pig_sigma", "Time|Pig_sigma", "sigma"])
mean | sd | hdi_3% | hdi_97% | mcse_mean | mcse_sd | ess_bulk | ess_tail | r_hat | |
---|---|---|---|---|---|---|---|---|---|
Intercept | 15.744 | 0.569 | 14.722 | 16.836 | 0.020 | 0.014 | 794.0 | 1262.0 | 1.0 |
Time | 6.938 | 0.086 | 6.778 | 7.101 | 0.004 | 0.003 | 554.0 | 781.0 | 1.0 |
1|Pig_sigma | 4.558 | 0.438 | 3.773 | 5.414 | 0.013 | 0.009 | 1145.0 | 1911.0 | 1.0 |
Time|Pig_sigma | 0.664 | 0.064 | 0.547 | 0.784 | 0.002 | 0.001 | 1024.0 | 1750.0 | 1.0 |
sigma | 2.459 | 0.065 | 2.334 | 2.578 | 0.001 | 0.001 | 4541.0 | 3093.0 | 1.0 |
Exercise: educational data#
cf. https://mc-stan.org/users/documentation/case-studies/tutorial_rstanarm.html
We will be analyzing the Gcsemv dataset (Rasbash et al. 2000) from the mlmRev package in R. The data include the General Certificate of Secondary Education (GCSE) exam scores of 1,905 students from 73 schools in England on a science subject. The Gcsemv dataset consists of the following five variables:
school: school identifier
student: student identifier
gender: gender of a student (M: Male, F: Female)
written: total score on written paper
course: total score on coursework paper
import pyreadr
# Gcsemv_r = pyreadr.read_r('/Users/ivan/Downloads/mlmRev/data/Gcsemv.rda')
# Gcsemv_r["Gcsemv"].dropna().to_csv("../datasets/gcsemv.csv", index=False)
gcsemv = pd.read_csv("../datasets/gcsemv.csv")
gcsemv.head()
school | student | gender | written | course | |
---|---|---|---|---|---|
0 | 20920 | 27 | F | 39.0 | 76.8 |
1 | 20920 | 31 | F | 36.0 | 87.9 |
2 | 20920 | 42 | M | 16.0 | 44.4 |
3 | 20920 | 101 | F | 49.0 | 89.8 |
4 | 20920 | 113 | M | 25.0 | 17.5 |
import bambi as bmb
m1 = bmb.Model(formula="course ~ 1 + (1 | school)", data=gcsemv)
m1
Formula: course ~ 1 + (1 | school)
Family: gaussian
Link: mu = identity
Observations: 1523
Priors:
target = mu
Common-level effects
Intercept ~ Normal(mu: 73.3814, sigma: 41.0781)
Group-level effects
1|school ~ Normal(mu: 0.0, sigma: HalfNormal(sigma: 41.0781))
Auxiliary parameters
sigma ~ HalfStudentT(nu: 4.0, sigma: 16.4312)
idata_m1 = m1.fit()
Auto-assigning NUTS sampler...
Initializing NUTS using jitter+adapt_diag...
Multiprocess sampling (4 chains in 4 jobs)
NUTS: [sigma, Intercept, 1|school_sigma, 1|school_offset]
Sampling 4 chains for 1_000 tune and 1_000 draw iterations (4_000 + 4_000 draws total) took 3 seconds.
az.summary(idata_m1)
mean | sd | hdi_3% | hdi_97% | mcse_mean | mcse_sd | ess_bulk | ess_tail | r_hat | |
---|---|---|---|---|---|---|---|---|---|
1|school[20920] | -6.947 | 5.239 | -16.177 | 3.370 | 0.067 | 0.057 | 6201.0 | 3013.0 | 1.00 |
1|school[22520] | -15.590 | 2.181 | -20.054 | -11.804 | 0.041 | 0.029 | 2874.0 | 3060.0 | 1.00 |
1|school[22710] | 6.367 | 3.514 | -0.266 | 12.883 | 0.053 | 0.042 | 4367.0 | 2872.0 | 1.00 |
1|school[22738] | -0.708 | 4.315 | -8.809 | 7.303 | 0.057 | 0.072 | 5823.0 | 2984.0 | 1.00 |
1|school[22908] | -0.824 | 5.939 | -12.341 | 9.733 | 0.076 | 0.102 | 6018.0 | 2854.0 | 1.00 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
1|school[84707] | 4.102 | 7.388 | -10.026 | 17.715 | 0.094 | 0.107 | 6209.0 | 2690.0 | 1.00 |
1|school[84772] | 8.574 | 3.601 | 1.616 | 15.072 | 0.051 | 0.039 | 5054.0 | 2427.0 | 1.00 |
1|school_sigma | 8.829 | 0.917 | 7.177 | 10.539 | 0.027 | 0.019 | 1165.0 | 1640.0 | 1.00 |
Intercept | 73.778 | 1.146 | 71.707 | 75.915 | 0.035 | 0.025 | 1056.0 | 1718.0 | 1.01 |
sigma | 13.974 | 0.258 | 13.502 | 14.465 | 0.003 | 0.002 | 6298.0 | 3069.0 | 1.00 |
76 rows × 9 columns
m3 = bmb.Model(formula="course ~ gender + (1 + gender|school)", data=gcsemv)
m3
Formula: course ~ gender + (1 + gender|school)
Family: gaussian
Link: mu = identity
Observations: 1523
Priors:
target = mu
Common-level effects
Intercept ~ Normal(mu: 73.3814, sigma: 53.4663)
gender ~ Normal(mu: 0.0, sigma: 83.5292)
Group-level effects
1|school ~ Normal(mu: 0.0, sigma: HalfNormal(sigma: 53.4663))
gender|school ~ Normal(mu: 0.0, sigma: HalfNormal(sigma: 83.5292))
Auxiliary parameters
sigma ~ HalfStudentT(nu: 4.0, sigma: 16.4312)
idata_m3 = m3.fit()
Auto-assigning NUTS sampler...
Initializing NUTS using jitter+adapt_diag...
Multiprocess sampling (4 chains in 4 jobs)
NUTS: [sigma, Intercept, gender, 1|school_sigma, 1|school_offset, gender|school_sigma, gender|school_offset]
Sampling 4 chains for 1_000 tune and 1_000 draw iterations (4_000 + 4_000 draws total) took 6 seconds.
Exercise: sleepstudy dataset#
Description: Contains reaction times of subjects under sleep deprivation conditions.
Source: Featured in the R package lme4.
Application: Demonstrates linear mixed-effects modeling with random slopes and intercepts.
sleepstudy = bmb.load_data("sleepstudy")
sleepstudy
Reaction | Days | Subject | |
---|---|---|---|
0 | 249.5600 | 0 | 308 |
1 | 258.7047 | 1 | 308 |
2 | 250.8006 | 2 | 308 |
3 | 321.4398 | 3 | 308 |
4 | 356.8519 | 4 | 308 |
... | ... | ... | ... |
175 | 329.6076 | 5 | 372 |
176 | 334.4818 | 6 | 372 |
177 | 343.2199 | 7 | 372 |
178 | 369.1417 | 8 | 372 |
179 | 364.1236 | 9 | 372 |
180 rows × 3 columns
Exercise: tadpoles (BONUS)#
https://www.youtube.com/watch?v=iwVqiiXYeC4
logistic regression model