Section 4.5 — Model selection#

This notebook contains the code examples from Section 4.5 Model selection of the No Bullshit Guide to Statistics.

Notebook setup#

# load Python modules
import os
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
# Figures setup
plt.clf()  # needed otherwise `sns.set_theme` doesn't work
from plot_helpers import RCPARAMS
# RCPARAMS.update({'figure.figsize': (10, 3)})   # good for screen
RCPARAMS.update({'figure.figsize': (5, 1.6)})  # good for print
sns.set_theme(
    context="paper",
    style="whitegrid",
    palette="colorblind",
    rc=RCPARAMS,
)

# High-resolution please
%config InlineBackend.figure_format = 'retina'

# Where to store figures
DESTDIR = "figures/lm/modelselection"
# set random seed for repeatability
np.random.seed(42)

Players dataset#

players_full = pd.read_csv("../datasets/players_full.csv")

Plot of linear model for time ~ 1 + age#

sns.lmplot(x="age", y="time", data=players_full)
[Figure: scatter plot of time vs. age with the fitted regression line]
import statsmodels.formula.api as smf

model1 = smf.ols('time ~ 1 + age', data=players_full)
result1 = model1.fit()
result1.summary()
OLS Regression Results
----------------------
Dep. Variable:     time              R-squared:           0.260
Model:             OLS               Adj. R-squared:      0.186
Method:            Least Squares     F-statistic:         3.516
Date:              Sun, 14 Apr 2024  Prob (F-statistic):  0.0902
Time:              03:17:26         Log-Likelihood:      -72.909
No. Observations:  12                AIC:                 149.8
Df Residuals:      10                BIC:                 150.8
Df Model:          1
Covariance Type:   nonrobust

             coef      std err   t       P>|t|   [0.025    0.975]
Intercept    403.8531  88.666    4.555   0.001   206.292   601.414
age          -4.5658   2.435    -1.875   0.090   -9.991    0.859

Omnibus:        2.175   Durbin-Watson:     2.490
Prob(Omnibus):  0.337   Jarque-Bera (JB):  0.930
Skew:          -0.123   Prob(JB):          0.628
Kurtosis:       1.659   Cond. No.          97.0

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
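
A quick way to pull out just the age coefficient and its 95% confidence interval from the results object (a minimal sketch using the statsmodels results API):

# slope estimate and 95% confidence interval for the age coefficient
print(result1.params["age"])          # -4.5658
print(result1.conf_int().loc["age"])  # [-9.991, 0.859]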

Plot of linear model for time ~ 1 + age + jobstatus#

We can “control for jobstatus” by including the variable in the linear model. The additive model fits a common age slope with a separate intercept for each group, so in effect we get two parallel lines: one for jobstatus=0 players and one for jobstatus=1 players. (The seaborn plot below, which uses hue, goes one step further and fits a separate slope for each group.)

sns.lmplot(x="age", y="time", hue="jobstatus", data=players_full)
[Figure: scatter plot of time vs. age with a separate regression line per jobstatus group]
import statsmodels.formula.api as smf

model2 = smf.ols('time ~ 1 + age + C(jobstatus)', data=players_full)
result2 = model2.fit()
result2.params
# print(result2.summary().as_text())
Intercept            377.817172
C(jobstatus)[T.1]   -124.013781
age                   -1.650913
dtype: float64
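
To visualize the two parallel lines implied by the additive model, we can plot them directly from the fitted parameters (a minimal sketch; the plotting choices are arbitrary):

# plot the two parallel lines implied by model2
ages = np.linspace(players_full["age"].min(), players_full["age"].max())
slope = result2.params["age"]
b_job0 = result2.params["Intercept"]                    # intercept for jobstatus == 0
b_job1 = b_job0 + result2.params["C(jobstatus)[T.1]"]   # intercept for jobstatus == 1
ax = sns.scatterplot(x="age", y="time", hue="jobstatus", data=players_full)
sns.lineplot(x=ages, y=b_job0 + slope*ages, ax=ax)
sns.lineplot(x=ages, y=b_job1 + slope*ages, ax=ax)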

Manually select the subset with jobstatus 1#

# import statsmodels.formula.api as smf

# players_job1 = players_full[players_full["jobstatus"]==1]
# model3 = smf.ols('time ~ 1 + age', data=players_job1)
# result3 = model3.fit()
# result3.summary()
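A runnable version of the same idea, fitting a separate simple regression within each jobstatus group (a minimal sketch; the printed coefficients depend on the data):

# fit time ~ 1 + age separately within each jobstatus group
for status in (0, 1):
    subset = players_full[players_full["jobstatus"] == status]
    res = smf.ols('time ~ 1 + age', data=subset).fit()
    print(f"jobstatus={status}:")
    print(res.params)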

Confounder example 2#

via https://stats.stackexchange.com/a/17338/62481

import numpy as np
from scipy.stats import uniform, randint

covariate = randint(0,2).rvs(100)                    # binary confounder (0 or 1)
exposure  = uniform(0,1).rvs(100) + 0.3*covariate    # exposure depends on the confounder
outcome   = 2.0 + 0.5*exposure + 0.25*covariate      # no noise term, so the fits below are exact

df2 = pd.DataFrame({
    "covariate": covariate,
    "exposure": exposure,
    "outcome": outcome,
})
sns.lmplot(x="exposure", y="outcome", data=df2)
[Figure: pooled linear fit of outcome vs. exposure, ignoring the covariate]
import statsmodels.formula.api as smf

model2a = smf.ols('outcome ~ exposure', data=df2)
result2a = model2a.fit()
# result2a.summary()
result2a.params
Intercept    2.028636
exposure     0.670186
dtype: float64

The pooled slope 0.67 overestimates the true exposure effect of 0.5, because the covariate pushes both the exposure and the outcome upward.
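The size of the bias matches the omitted-variable bias identity: the pooled slope equals the direct effect 0.5 plus 0.25 times the slope from regressing the omitted covariate on the exposure (a minimal numerical check of that identity):

# omitted-variable bias: pooled slope = 0.5 + 0.25 * slope(covariate ~ exposure)
aux = smf.ols('covariate ~ exposure', data=df2).fit()
print(0.5 + 0.25*aux.params["exposure"])   # matches result2a.params["exposure"]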
fg = sns.lmplot(x="exposure", y="outcome", hue="covariate", data=df2)
[Figure: separate linear fits of outcome vs. exposure for covariate = 0 and covariate = 1]
model2b = smf.ols('outcome ~ exposure + C(covariate)', data=df2)
result2b = model2b.fit()
# result2b.summary()
result2b.params
Intercept            2.00
C(covariate)[T.1]    0.25
exposure             0.50
dtype: float64

Adjusting for the covariate recovers exactly the coefficients used to generate the data, since df2 contains no noise term.
x = np.linspace(0, 1.4)

m = result2b.params["exposure"]                   # common slope
b0 = result2b.params["Intercept"]                 # intercept for covariate == 0
b1 = b0 + result2b.params["C(covariate)[T.1]"]    # intercept for covariate == 1

b0, b1

y0 = b0 + m*x
y1 = b1 + m*x
ax = fg.figure.axes[0]
sns.lineplot(x=x, y=y0, ax=ax, color="r")
sns.lineplot(x=x, y=y1, ax=ax, color="m")
fg.figure
[Figure: the two parallel fitted lines overlaid on the grouped scatter plot]
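Alternatively, the same two lines can be computed with the result's predict method instead of assembling them by hand (a minimal sketch; the grid of exposure values is arbitrary):

# predictions on a grid of exposure values, one set per covariate group
grid = pd.DataFrame({
    "exposure": np.tile(np.linspace(0, 1.4, 50), 2),
    "covariate": np.repeat([0, 1], 50),
})
grid["predicted"] = result2b.predict(grid)
sns.lineplot(x="exposure", y="predicted", hue="covariate", data=grid)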
model2c = smf.ols('outcome ~ -1 + exposure*C(covariate)', data=df2)
result2c = model2c.fit()
result2c.summary()
# result2c.params
OLS Regression Results
----------------------
Dep. Variable:     outcome           R-squared:           1.000
Model:             OLS               Adj. R-squared:      1.000
Method:            Least Squares     F-statistic:         9.006e+30
Date:              Sun, 14 Apr 2024  Prob (F-statistic):  0.00
Time:              03:17:29         Log-Likelihood:      3386.6
No. Observations:  100               AIC:                 -6765.
Df Residuals:      96                BIC:                 -6755.
Df Model:          3
Covariance Type:   nonrobust

                            coef       std err    t         P>|t|   [0.025    0.975]
C(covariate)[0]             2.0000     1.34e-16   1.49e+16  0.000   2.000     2.000
C(covariate)[1]             2.2500     1.88e-16   1.2e+16   0.000   2.250     2.250
exposure                    0.5000     2.36e-16   2.12e+15  0.000   0.500     0.500
exposure:C(covariate)[T.1]  6.661e-16  3.24e-16   2.055     0.043   2.28e-17  1.31e-15

Omnibus:        0.401   Durbin-Watson:     0.632
Prob(Omnibus):  0.818   Jarque-Bera (JB):  0.562
Skew:           0.060   Prob(JB):          0.755
Kurtosis:       2.653   Cond. No.          9.70

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

Because df2 contains no noise, the fit is exact: R-squared is 1.000 and the interaction coefficient is zero up to floating-point error (6.661e-16).

Group-specific slopes and intercepts#

via https://patsy.readthedocs.io/en/latest/quickstart.html

You can even write interactions between categorical and numerical variables. Here we fit two different slope coefficients for x1; one for the a1 group, and one for the a2 group: dmatrix("a:x1", data)

This interaction syntax is what matches the seaborn plot when hue is used as an extra variable: each group gets its own slope.

model2d = smf.ols('outcome ~ 1 + C(covariate) + C(covariate):exposure', data=df2)
result2d = model2d.fit()
# result2d.summary()
result2d.params
Intercept                   2.00
C(covariate)[T.1]           0.25
C(covariate)[0]:exposure    0.50
C(covariate)[1]:exposure    0.50
dtype: float64

Both groups get the same slope of 0.50 because df2 was generated with a common exposure effect; only the intercepts differ. The next dataset, df3, adds a genuine slope difference.
import numpy as np
from scipy.stats import uniform, randint

covariate3 = randint(0,2).rvs(100)
exposure3  = uniform(0,1).rvs(100) + 0.3*covariate3
outcome3   = 2.0 + 0.25*covariate3 + (0.5 + 0.1*covariate3)*exposure3
#                  \_____________/   \__________________/
#               different intercepts   different slopes

df3 = pd.DataFrame({
    "covariate": covariate3,
    "exposure": exposure3,
    "outcome": outcome3,
})
fg = sns.lmplot(x="exposure", y="outcome", hue="covariate", data=df3)
[Figure: scatter plot of outcome vs. exposure for df3, with a separate regression line per covariate group]
model3d = smf.ols('outcome ~ 1 + C(covariate) + C(covariate):exposure', data=df3)
result3d = model3d.fit()
result3d.params
Intercept                   2.00
C(covariate)[T.1]           0.25
C(covariate)[0]:exposure    0.50
C(covariate)[1]:exposure    0.60
dtype: float64

The fit recovers the generating values exactly: a 0.25 intercept shift and group-specific slopes of 0.50 and 0.60.
# ALT. equivalent parameterization with one intercept per group
model3e = smf.ols('outcome ~ 0 + C(covariate) + C(covariate):exposure', data=df3)
result3e = model3e.fit()
result3e.params
C(covariate)[0]             2.00
C(covariate)[1]             2.25
C(covariate)[0]:exposure    0.50
C(covariate)[1]:exposure    0.60
dtype: float64
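
For comparison with the patsy quote above, the design matrix behind this last formula can be inspected directly (a minimal sketch; patsy is the formula engine that statsmodels uses under the hood):

from patsy import dmatrix

# column names of the design matrix for the group-specific model
dm = dmatrix("0 + C(covariate) + C(covariate):exposure", df3)
print(dm.design_info.column_names)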