Section 4.3 — Interpreting linear models

Section 4.3 — Interpreting linear models#

This notebook contains the code examples from Section 4.3 Interpreting linear models from the No Bullshit Guide to Statistics.

Notebook setup#

# load Python modules
import os
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Figures setup
plt.clf()  # needed otherwise `sns.set_theme` doesn't work
from plot_helpers import RCPARAMS
# RCPARAMS.update({'figure.figsize': (8, 5)})   # good for screen
RCPARAMS.update({'figure.figsize': (5, 3)})  # good for print
# RCPARAMS.update({'text.latex.preamble': r'\usepackage{amsmath}'})
# RCPARAMS.update({'text.usetex': True})
sns.set_theme(
    context="paper",
    style="whitegrid",
    palette="colorblind",
    rc=RCPARAMS,
)

# High-resolution please
%config InlineBackend.figure_format = 'retina'

# Where to store figures
DESTDIR = "figures/lm/interpreting"

<Figure size 640x480 with 0 Axes>

from ministats.utils import savefigure

# set random seed for repeatability
np.random.seed(42)
#######################################################

import statsmodels.formula.api as smf
import statsmodels.api as sm
import statsmodels.stats.api as sms

Introduction#

# load the dataset
doctors = pd.read_csv("../datasets/doctors.csv")
n = doctors.shape[0]

# fit the model
formula = "score ~ 1 + alc + weed + exrc"
lm2 = smf.ols(formula, data=doctors).fit()

# model degrees of freedom (number of predictors)
p = lm2.df_model

# display the summary table
lm2.summary()

OLS Regression Results
Dep. Variable:	score	R-squared:	0.842
Model:	OLS	Adj. R-squared:	0.839
Method:	Least Squares	F-statistic:	270.3
Date:	Fri, 03 May 2024	Prob (F-statistic):	1.05e-60
Time:	17:35:26	Log-Likelihood:	-547.63
No. Observations:	156	AIC:	1103.
Df Residuals:	152	BIC:	1115.
Df Model:	3
Covariance Type:	nonrobust

	coef	std err	t	P>\|t\|	[0.025	0.975]
Intercept	60.4529	1.289	46.885	0.000	57.905	63.000
alc	-1.8001	0.070	-25.726	0.000	-1.938	-1.662
weed	-1.0216	0.476	-2.145	0.034	-1.962	-0.081
exrc	1.7683	0.138	12.809	0.000	1.496	2.041

Omnibus:	1.140	Durbin-Watson:	1.828
Prob(Omnibus):	0.565	Jarque-Bera (JB):	0.900
Skew:	0.182	Prob(JB):	0.638
Kurtosis:	3.075	Cond. No.	31.2

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

Model fit metrics#

Coefficient of determination#

lm2.rsquared

0.8421649167873537

# ALT.
1 - lm2.ssr/lm2.centered_tss

0.8421649167873537

Adjusted coefficient of determination#

lm2.rsquared_adj

0.8390497506713147

# ALT.
1 - (lm2.ssr/(n-p-1)) / (lm2.centered_tss/(n-1))

0.8390497506713147

F-statistic and associated p-value#

lm2.fvalue, lm2.f_pvalue

(270.34350189265837, 1.0512133413866766e-60)

Log-likelihood#

lm2.llf

-547.6259042117637

# ALT.
from scipy.stats import norm
sigmahat = np.sqrt(lm2.scale)
np.sum(np.log(norm.pdf(lm2.resid,scale=sigmahat)))
# Q: Why not exactly the same as lm2.llf?

-547.6519921512181

# How lm2.llf is computed:
SSR = np.sum(lm2.resid**2)
nobs2 = n/2.0
llf = -np.log(SSR)*nobs2                # concentrated likelihood
llf -= ( 1+np.log(np.pi/nobs2) )*nobs2  # with likelihood constant
llf

-547.6259042117637

Information criteria#

lm2.aic, lm2.bic

(1103.2518084235273, 1115.4512324525256)

# ALT.
2*(p+1) - 2*lm2.llf,  np.log(n)*(p+1) - 2*lm2.llf

(1103.2518084235273, 1115.4512324525256)

Other model-fit metrics#

The mean squared error (MSE)

from statsmodels.tools.eval_measures import mse

scores = doctors["score"]
scorehats = lm2.fittedvalues
mse(scores,scorehats), lm2.ssr/n

(65.56013803717558, 65.56013803717558)

Parameter estimates#

# estimated parameters
lm2.params

Intercept    60.452901
alc          -1.800101
weed         -1.021552
exrc          1.768289
dtype: float64

# estimated sigma (std. of error term)
sigmahat = np.sqrt(lm2.scale)
sigmahat

8.202768119825622

# ALT.
np.sqrt(lm2.ssr/(n-p-1))

8.202768119825622

Confidence intervals for model parameters#

# standard errors
lm2.bse

Intercept    1.289380
alc          0.069973
weed         0.476166
exrc         0.138056
dtype: float64

lm2.conf_int(alpha=0.05)

	0	1
Intercept	57.905480	63.000321
alc	-1.938347	-1.661856
weed	-1.962309	-0.080794
exrc	1.495533	2.041044

Hypothesis testing for linear models#

T-tests for individual parameters#

Hypothesis testing for slope coefficient

Is there a non-zero slope coefficient?

Null hypothesis \(H_0\): weed has no effect on score, which is equivalent to \(\beta_{\texttt{weed}} = 0\):

\[ H_0: \quad S \;\sim\; \mathcal{N}( {\color{red} \beta_0 + \beta_{\texttt{alc}}\!\cdot\!\textrm{alc} + \beta_{\texttt{exrc}}\!\cdot\!\textrm{exrc} }, \ \sigma) \qquad \qquad \]
Alternative hypothesis \(H_A\): weed has an effect on score, and the slope is not zero, \(\beta_{\texttt{weed}} \neq 0\):

\[ H_A: \quad S \;\sim\; \mathcal{N}\left( {\color{blue} \beta_0 + \beta_{\texttt{alc}}\!\cdot\!\textrm{alc} + \beta_{\texttt{weed}}\!\cdot\!\textrm{weed} + \beta_{\texttt{exrc}}\!\cdot\!\textrm{exrc} }, \ \sigma \right) \]

lm2.tvalues["weed"], lm2.pvalues["weed"], n-p-1

(-2.1453705454368617, 0.03351156181342315, 152.0)

Calculate the \(t\) statistic:

obst_weed = (lm2.params["weed"] - 0) / lm2.bse["weed"]
obst_weed

-2.1453705454368617

Calculate the associated \(p\)-value

from scipy.stats import t as tdist

pleft = tdist(df=n-p-1).cdf(obst_weed)
pright = 1 - tdist(df=lm2.df_resid).cdf(obst_weed)
pvalue = 2 * min(pleft, pright)
pvalue

0.03351156181342315

lm2.tvalues

Intercept    46.885245
alc         -25.725654
weed         -2.145371
exrc         12.808529
dtype: float64

lm2.df_resid, n-p-1

(152.0, 152.0)

lm2.pvalues

Intercept    2.756807e-92
alc          2.985013e-57
weed         3.351156e-02
exrc         6.136296e-26
dtype: float64

Example#

F-test for the overall model#

lm2.fvalue

270.34350189265837

# ALT.
F = lm2.mse_model / lm2.mse_resid
F

270.34350189265837

lm2.f_pvalue

1.0512133413866766e-60

# ALT.
from scipy.stats import f as fdist
fdist(dfn=p, dfd=n-p-1).sf(F)

1.0512133413866766e-60

Assumptions checks and diagnostics#

Are the assumptions for the linear model satisfied?

lm2.diagn

{'jb': 0.8999138579592745,
 'jbpv': 0.6376556155083202,
 'skew': 0.18224568453683593,
 'kurtosis': 3.074795238556266,
 'omni': 1.1403999852814177,
 'omnipv': 0.5654123490825862,
 'condno': 31.229721453770164,
 'mineigval': 40.14996367643263}

Normality checks#

QQ-plots#

from statsmodels.graphics.api import qqplot
# qqplot(lm2.resid, line="s");

with plt.rc_context({"figure.figsize":(4,3)}):
    qqplot(lm2.resid, line="s");
filename = os.path.join(DESTDIR, "sleep_scores_resid_qqplot.pdf")
savefigure(plt.gcf(), filename)

Saved figure to figures/lm/interpreting/sleep_scores_resid_qqplot.pdf
Saved figure to figures/lm/interpreting/sleep_scores_resid_qqplot.png

../_images/f72777743718d83821df54672691ffcf783d8b2f845cfcff9c3061ac214e29c8.png

# BONUS
ax = sns.histplot(lm2.resid)
ax.set_xlim([-27,27]);

../_images/084fef4ff0d17b4709449c61f155d1bef7f5beb4782d7c0c1f0f8b3407457bd5.png

Skew and kurtosis#

from scipy.stats import skew
from scipy.stats import kurtosis

skew(lm2.resid), kurtosis(lm2.resid, fisher=False)

(0.18224568453683593, 3.074795238556266)

from statsmodels.stats.stattools import jarque_bera
# from statsmodels.stats.stattools import omni_normtest

jb, jbpv, skew, kurtosis = jarque_bera(lm2.resid)
jb, jbpv, skew, kurtosis

(0.8999138579592745,
6376556155083202,
18224568453683593,
074795238556266)

from statsmodels.stats.stattools import durbin_watson

durbin_watson(lm2.resid)

1.8284830211766734

Independence checks#

sns.scatterplot(x=doctors["alc"], y=lm2.resid);

../_images/3b42309b603d339eae198a91ee3806ee2da5f13401fc1fedb693a726293c74f9.png

with plt.rc_context({"figure.figsize":(6,4)}):
    fig, axs_matrix = plt.subplots(2,2, sharey=True)
    fig.subplots_adjust(hspace=.4)
    axs = [ax for row in axs_matrix for ax in row]

    sns.scatterplot(x=doctors["alc"],   y=lm2.resid, ax=axs[0])
    axs[0].set_ylabel("resid")
    
    sns.scatterplot(x=doctors["weed"],  y=lm2.resid, ax=axs[1])
    
    sns.scatterplot(x=doctors["exrc"],  y=lm2.resid, ax=axs[2])
    axs[2].set_ylabel("resid")
    
    sns.scatterplot(x=lm2.fittedvalues, y=lm2.resid, ax=axs[3])
    axs[3].set_xlabel("fitted values")

    for ax in axs:
        ax.axhline(y=0, color="b", linestyle="dashed")

    filename = os.path.join(DESTDIR, "sleep_scores_resid_vs_vars_2x2_panel.pdf")
    savefigure(fig, filename)

Saved figure to figures/lm/interpreting/sleep_scores_resid_vs_vars_2x2_panel.pdf

Saved figure to figures/lm/interpreting/sleep_scores_resid_vs_vars_2x2_panel.png

../_images/52830ded437163401bc6102b2dc3b43d77bf0ead78bdff93bb8fd2ac9b07b29c.png

Linearity checks#

sns.regplot(x=doctors["alc"], y=lm2.resid, lowess=True);

../_images/3db9dc2c49c23e3d2f356d03e1abea1b1e8a0b57ecd77e01066e95ae390fd4f0.png

with plt.rc_context({"figure.figsize":(6,4)}):
    fig, axs_matrix = plt.subplots(2,2, sharey=True)
    axs = [ax for row in axs_matrix for ax in row]

    sns.regplot(x=doctors["alc"],  y=lm2.resid, lowess=True, scatter_kws={'s':8}, ax=axs[0])
    axs[0].set_ylabel("resid")

    # workaround to avoid RuntimeWarning: invalid value encountered in divide
    nzwdocs = doctors[doctors["weed"] != 0]
    zwdocs = doctors[doctors["weed"] == 0]
    n_subsample = int(0.8*len(zwdocs))
    np.random.seed(42)
    zwdocs_subsample = zwdocs.sample(n_subsample)
    idxs = np.concatenate((nzwdocs.index, zwdocs_subsample.index))
    sns.regplot(x=doctors["weed"][idxs], y=lm2.resid[idxs], lowess=True, scatter_kws={'s':8}, ax=axs[1])

    sns.regplot(x=doctors["exrc"], y=lm2.resid, lowess=True, scatter_kws={'s':8}, ax=axs[2])
    axs[2].set_ylabel("resid")

    sns.regplot(x=lm2.fittedvalues, y=lm2.resid, lowess=True, scatter_kws={'s':8}, ax=axs[3])
    axs[3].set_xlabel("fitted values")

    for ax in axs:
        ax.axhline(y=0, color="b", linestyle="dashed")

    filename = os.path.join(DESTDIR, "sleep_scores_resid_plots_with_lowess.pdf")
    savefigure(fig, filename)

Saved figure to figures/lm/interpreting/sleep_scores_resid_plots_with_lowess.pdf

Saved figure to figures/lm/interpreting/sleep_scores_resid_plots_with_lowess.png

../_images/9392e5b839c08184e0cc6eff82661087fbc4ce3a00c292e9af6de5bd89ebefdf.png

Hypothesis test for linearity#

sms.linear_harvey_collier(lm2)

TtestResult(statistic=1.4317316989777105, pvalue=0.15427366179829397, df=152)

Homoscedasticity checks#

The scale-location plot allows us to see if pattern in the variance of the residuals exists.

# Standardized residuals a.k.a. Pearson residuals
# lm2.resid / np.sqrt(lm2.scale))

#######################################################
scorehats = lm2.fittedvalues
std_resids = lm2.resid / np.sqrt(lm2.scale)
sqrt_std_resids = np.sqrt(np.abs(std_resids))
# sns.regplot(x=scorehats, y=sqrt_std_resids, lowess=True)

# FIGURES ONLY
ax = sns.regplot(x=scorehats, y=sqrt_std_resids, lowess=True)
ax.set_ylabel(r"$\sqrt{|\text{standardized residuals}|}$")
ax.set_xlabel("fitted values");

filename = os.path.join(DESTDIR, "sleep_scores_scale-location_plot.pdf")
savefigure(plt.gcf(), filename)

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[45], line 13
ax.set_xlabel("fitted values");
filename = os.path.join(DESTDIR, "sleep_scores_scale-location_plot.pdf")
---> 13 savefigure(plt.gcf(), filename)

File /opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/site-packages/ministats/utils.py:63, in savefigure(obj, filename, tight_layout_kwargs)
   fig.tight_layout(**tight_layout_kwargs)
else:
---> 63     fig.tight_layout()
# save as PDF
fig.savefig(filename, dpi=300, bbox_inches="tight", pad_inches=0)

File /opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/site-packages/matplotlib/figure.py:3543, in Figure.tight_layout(self, pad, h_pad, w_pad, rect)
previous_engine = self.get_layout_engine()
self.set_layout_engine(engine)
-> 3543 engine.execute(self)
if previous_engine is not None and not isinstance(
   previous_engine, (TightLayoutEngine, PlaceHolderLayoutEngine)
):
   _api.warn_external('The figure layout has changed to tight')

File /opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/site-packages/matplotlib/layout_engine.py:184, in TightLayoutEngine.execute(self, fig)
renderer = fig._get_renderer()
with getattr(renderer, "_draw_disabled", nullcontext)():
--> 184     kwargs = get_tight_layout_figure(
       fig, fig.axes, get_subplotspec_list(fig.axes), renderer,
       pad=info['pad'], h_pad=info['h_pad'], w_pad=info['w_pad'],
       rect=info['rect'])
if kwargs:
   fig.subplots_adjust(**kwargs)

File /opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/site-packages/matplotlib/_tight_layout.py:266, in get_tight_layout_figure(fig, axes_list, subplotspec_list, renderer, pad, h_pad, w_pad, rect)
       return {}
   span_pairs.append((
       slice(ss.rowspan.start * div_row, ss.rowspan.stop * div_row),
       slice(ss.colspan.start * div_col, ss.colspan.stop * div_col)))
--> 266 kwargs = _auto_adjust_subplotpars(fig, renderer,
                                 shape=(max_nrows, max_ncols),
                                 span_pairs=span_pairs,
                                 subplot_list=subplot_list,
                                 ax_bbox_list=ax_bbox_list,
                                 pad=pad, h_pad=h_pad, w_pad=w_pad)
# kwargs can be none if tight_layout fails...
if rect is not None and kwargs is not None:
   # if rect is given, the whole subplots area (including
   # labels) will fit into the rect instead of the
   (...)
   # auto_adjust_subplotpars twice, where the second run
   # with adjusted rect parameters.

File /opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/site-packages/matplotlib/_tight_layout.py:82, in _auto_adjust_subplotpars(fig, renderer, shape, span_pairs, subplot_list, ax_bbox_list, pad, h_pad, w_pad, rect)
for ax in subplots:
   if ax.get_visible():
---> 82         bb += [martist._get_tightbbox_for_layout_only(ax, renderer)]
tight_bbox_raw = Bbox.union(bb)
tight_bbox = fig.transFigure.inverted().transform_bbox(tight_bbox_raw)

File /opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/site-packages/matplotlib/artist.py:1415, in _get_tightbbox_for_layout_only(obj, *args, **kwargs)
"""
Matplotlib's `.Axes.get_tightbbox` and `.Axis.get_tightbbox` support a
*for_layout_only* kwarg; this helper tries to use the kwarg but skips it
when encountering third-party subclasses that do not support it.
"""
try:
-> 1415     return obj.get_tightbbox(*args, **{**kwargs, "for_layout_only": True})
except TypeError:
   return obj.get_tightbbox(*args, **kwargs)

File /opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/site-packages/matplotlib/axes/_base.py:4387, in _AxesBase.get_tightbbox(self, renderer, call_axes_locator, bbox_extra_artists, for_layout_only)
for axis in self._axis_map.values():
   if self.axison and axis.get_visible():
-> 4387         ba = martist._get_tightbbox_for_layout_only(axis, renderer)
       if ba:
           bb.append(ba)

File /opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/site-packages/matplotlib/artist.py:1415, in _get_tightbbox_for_layout_only(obj, *args, **kwargs)
"""
Matplotlib's `.Axes.get_tightbbox` and `.Axis.get_tightbbox` support a
*for_layout_only* kwarg; this helper tries to use the kwarg but skips it
when encountering third-party subclasses that do not support it.
"""
try:
-> 1415     return obj.get_tightbbox(*args, **{**kwargs, "for_layout_only": True})
except TypeError:
   return obj.get_tightbbox(*args, **kwargs)

File /opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/site-packages/matplotlib/axis.py:1353, in Axis.get_tightbbox(self, renderer, for_layout_only)
# take care of label
if self.label.get_visible():
-> 1353     bb = self.label.get_window_extent(renderer)
   # for constrained/tight_layout, we want to ignore the label's
   # width/height because the adjustments they make can't be improved.
   # this code collapses the relevant direction
   if for_layout_only:

File /opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/site-packages/matplotlib/text.py:959, in Text.get_window_extent(self, renderer, dpi)
   raise RuntimeError(
       "Cannot get window extent of text w/o renderer. You likely "
       "want to call 'figure.draw_without_rendering()' first.")
with cbook._setattr_cm(self.figure, dpi=dpi):
--> 959     bbox, info, descent = self._get_layout(self._renderer)
   x, y = self.get_unitless_position()
   x, y = self.get_transform().transform((x, y))

File /opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/site-packages/matplotlib/text.py:386, in Text._get_layout(self, renderer)
clean_line, ismath = self._preprocess_math(line)
if clean_line:
--> 386     w, h, d = _get_text_metrics_with_cache(
       renderer, clean_line, self._fontproperties,
       ismath=ismath, dpi=self.figure.dpi)
else:
   w = h = d = 0

File /opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/site-packages/matplotlib/text.py:97, in _get_text_metrics_with_cache(renderer, text, fontprop, ismath, dpi)
"""Call ``renderer.get_text_width_height_descent``, caching the results."""
# Cached based on a copy of fontprop so that later in-place mutations of
# the passed-in argument do not mess up the cache.
---> 97 return _get_text_metrics_with_cache_impl(
   weakref.ref(renderer), text, fontprop.copy(), ismath, dpi)

File /opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/site-packages/matplotlib/text.py:105, in _get_text_metrics_with_cache_impl(renderer_ref, text, fontprop, ismath, dpi)
@functools.lru_cache(4096)
def _get_text_metrics_with_cache_impl(
       renderer_ref, text, fontprop, ismath, dpi):
   # dpi is unused, but participates in cache invalidation (via the renderer).
--> 105     return renderer_ref().get_text_width_height_descent(text, fontprop, ismath)

File /opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/site-packages/matplotlib/backends/backend_agg.py:230, in RendererAgg.get_text_width_height_descent(self, s, prop, ismath)
   return super().get_text_width_height_descent(s, prop, ismath)
if ismath:
   ox, oy, width, height, descent, font_image = \
--> 230         self.mathtext_parser.parse(s, self.dpi, prop)
   return width, height, descent
font = self._prepare_font(prop)

File /opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/site-packages/matplotlib/mathtext.py:226, in MathTextParser.parse(self, s, dpi, prop)
# lru_cache can't decorate parse() directly because prop
# is mutable; key the cache using an internal copy (see
# text._get_text_metrics_with_cache for a similar case).
prop = prop.copy() if prop is not None else None
--> 226 return self._parse_cached(s, dpi, prop)

File /opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/site-packages/matplotlib/mathtext.py:247, in MathTextParser._parse_cached(self, s, dpi, prop)
if self._parser is None:  # Cache the parser globally.
   self.__class__._parser = _mathtext.Parser()
--> 247 box = self._parser.parse(s, fontset, fontsize, dpi)
output = _mathtext.ship(box)
if self._output_type == "vector":

File /opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/site-packages/matplotlib/_mathtext.py:1971, in Parser.parse(self, s, fonts_object, fontsize, dpi)
   result = self._expression.parseString(s)
except ParseBaseException as err:
   # explain becomes a plain method on pyparsing 3 (err.explain(0)).
-> 1971     raise ValueError("\n" + ParseException.explain(err, 0)) from None
self._state_stack = None
self._in_subscript_or_superscript = False

ValueError: 
\sqrt{|\text{standardized residuals}|}
       ^
ParseSyntaxException: Expected required_group, found '\'  (at char 7), (line:1, col:8)

Error in callback <function _draw_all_if_interactive at 0x7f3b9b83c5e0> (for post_execute):

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
File /opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/site-packages/matplotlib/pyplot.py:120, in _draw_all_if_interactive()
    118 def _draw_all_if_interactive():
    119     if matplotlib.is_interactive():
--> 120         draw_all()

File /opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/site-packages/matplotlib/_pylab_helpers.py:132, in Gcf.draw_all(cls, force)
    130 for manager in cls.get_all_fig_managers():
    131     if force or manager.canvas.figure.stale:
--> 132         manager.canvas.draw_idle()

File /opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/site-packages/matplotlib/backend_bases.py:2082, in FigureCanvasBase.draw_idle(self, *args, **kwargs)
   2080 if not self._is_idle_drawing:
   2081     with self._idle_draw_cntx():
-> 2082         self.draw(*args, **kwargs)

File /opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/site-packages/matplotlib/backends/backend_agg.py:400, in FigureCanvasAgg.draw(self)
    396 # Acquire a lock on the shared font cache.
    397 with RendererAgg.lock, \
    398      (self.toolbar._wait_cursor_for_draw_cm() if self.toolbar
    399       else nullcontext()):
--> 400     self.figure.draw(self.renderer)
    401     # A GUI class may be need to update a window using this draw, so
    402     # don't forget to call the superclass.
    403     super().draw()

File /opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/site-packages/matplotlib/artist.py:95, in _finalize_rasterization.<locals>.draw_wrapper(artist, renderer, *args, **kwargs)
     93 @wraps(draw)
     94 def draw_wrapper(artist, renderer, *args, **kwargs):
---> 95     result = draw(artist, renderer, *args, **kwargs)
     96     if renderer._rasterizing:
     97         renderer.stop_rasterizing()

File /opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/site-packages/matplotlib/artist.py:72, in allow_rasterization.<locals>.draw_wrapper(artist, renderer)
     69     if artist.get_agg_filter() is not None:
     70         renderer.start_filter()
---> 72     return draw(artist, renderer)
     73 finally:
     74     if artist.get_agg_filter() is not None:

File /opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/site-packages/matplotlib/figure.py:3175, in Figure.draw(self, renderer)
   3172         # ValueError can occur when resizing a window.
   3174 self.patch.draw(renderer)
-> 3175 mimage._draw_list_compositing_images(
   3176     renderer, self, artists, self.suppressComposite)
   3178 for sfig in self.subfigs:
   3179     sfig.draw(renderer)

File /opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/site-packages/matplotlib/image.py:131, in _draw_list_compositing_images(renderer, parent, artists, suppress_composite)
    129 if not_composite or not has_images:
    130     for a in artists:
--> 131         a.draw(renderer)
    132 else:
    133     # Composite any adjacent images together
    134     image_group = []

File /opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/site-packages/matplotlib/artist.py:72, in allow_rasterization.<locals>.draw_wrapper(artist, renderer)
     69     if artist.get_agg_filter() is not None:
     70         renderer.start_filter()
---> 72     return draw(artist, renderer)
     73 finally:
     74     if artist.get_agg_filter() is not None:

File /opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/site-packages/matplotlib/axes/_base.py:3064, in _AxesBase.draw(self, renderer)
   3061 if artists_rasterized:
   3062     _draw_rasterized(self.figure, artists_rasterized, renderer)
-> 3064 mimage._draw_list_compositing_images(
   3065     renderer, self, artists, self.figure.suppressComposite)
   3067 renderer.close_group('axes')
   3068 self.stale = False

File /opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/site-packages/matplotlib/image.py:131, in _draw_list_compositing_images(renderer, parent, artists, suppress_composite)
    129 if not_composite or not has_images:
    130     for a in artists:
--> 131         a.draw(renderer)
    132 else:
    133     # Composite any adjacent images together
    134     image_group = []

File /opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/site-packages/matplotlib/artist.py:72, in allow_rasterization.<locals>.draw_wrapper(artist, renderer)
     69     if artist.get_agg_filter() is not None:
     70         renderer.start_filter()
---> 72     return draw(artist, renderer)
     73 finally:
     74     if artist.get_agg_filter() is not None:

File /opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/site-packages/matplotlib/axis.py:1396, in Axis.draw(self, renderer, *args, **kwargs)
   1394 # Shift label away from axes to avoid overlapping ticklabels.
   1395 self._update_label_position(renderer)
-> 1396 self.label.draw(renderer)
   1398 self._update_offset_text_position(tlb1, tlb2)
   1399 self.offsetText.set_text(self.major.formatter.get_offset())

File /opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/site-packages/matplotlib/artist.py:72, in allow_rasterization.<locals>.draw_wrapper(artist, renderer)
     69     if artist.get_agg_filter() is not None:
     70         renderer.start_filter()
---> 72     return draw(artist, renderer)
     73 finally:
     74     if artist.get_agg_filter() is not None:

File /opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/site-packages/matplotlib/text.py:752, in Text.draw(self, renderer)
    749 renderer.open_group('text', self.get_gid())
    751 with self._cm_set(text=self._get_wrapped_text()):
--> 752     bbox, info, descent = self._get_layout(renderer)
    753     trans = self.get_transform()
    755     # don't use self.get_position here, which refers to text
    756     # position in Text:

File /opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/site-packages/matplotlib/text.py:386, in Text._get_layout(self, renderer)
    384 clean_line, ismath = self._preprocess_math(line)
    385 if clean_line:
--> 386     w, h, d = _get_text_metrics_with_cache(
    387         renderer, clean_line, self._fontproperties,
    388         ismath=ismath, dpi=self.figure.dpi)
    389 else:
    390     w = h = d = 0

File /opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/site-packages/matplotlib/text.py:97, in _get_text_metrics_with_cache(renderer, text, fontprop, ismath, dpi)
     94 """Call ``renderer.get_text_width_height_descent``, caching the results."""
     95 # Cached based on a copy of fontprop so that later in-place mutations of
     96 # the passed-in argument do not mess up the cache.
---> 97 return _get_text_metrics_with_cache_impl(
     98     weakref.ref(renderer), text, fontprop.copy(), ismath, dpi)

File /opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/site-packages/matplotlib/text.py:105, in _get_text_metrics_with_cache_impl(renderer_ref, text, fontprop, ismath, dpi)
    101 @functools.lru_cache(4096)
    102 def _get_text_metrics_with_cache_impl(
    103         renderer_ref, text, fontprop, ismath, dpi):
    104     # dpi is unused, but participates in cache invalidation (via the renderer).
--> 105     return renderer_ref().get_text_width_height_descent(text, fontprop, ismath)

File /opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/site-packages/matplotlib/backends/backend_agg.py:230, in RendererAgg.get_text_width_height_descent(self, s, prop, ismath)
    226     return super().get_text_width_height_descent(s, prop, ismath)
    228 if ismath:
    229     ox, oy, width, height, descent, font_image = \
--> 230         self.mathtext_parser.parse(s, self.dpi, prop)
    231     return width, height, descent
    233 font = self._prepare_font(prop)

File /opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/site-packages/matplotlib/mathtext.py:226, in MathTextParser.parse(self, s, dpi, prop)
    222 # lru_cache can't decorate parse() directly because prop
    223 # is mutable; key the cache using an internal copy (see
    224 # text._get_text_metrics_with_cache for a similar case).
    225 prop = prop.copy() if prop is not None else None
--> 226 return self._parse_cached(s, dpi, prop)

File /opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/site-packages/matplotlib/mathtext.py:247, in MathTextParser._parse_cached(self, s, dpi, prop)
    244 if self._parser is None:  # Cache the parser globally.
    245     self.__class__._parser = _mathtext.Parser()
--> 247 box = self._parser.parse(s, fontset, fontsize, dpi)
    248 output = _mathtext.ship(box)
    249 if self._output_type == "vector":

File /opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/site-packages/matplotlib/_mathtext.py:1971, in Parser.parse(self, s, fonts_object, fontsize, dpi)
   1968     result = self._expression.parseString(s)
   1969 except ParseBaseException as err:
   1970     # explain becomes a plain method on pyparsing 3 (err.explain(0)).
-> 1971     raise ValueError("\n" + ParseException.explain(err, 0)) from None
   1972 self._state_stack = None
   1973 self._in_subscript_or_superscript = False

ValueError: 
\sqrt{|\text{standardized residuals}|}
       ^
ParseSyntaxException: Expected required_group, found '\'  (at char 7), (line:1, col:8)

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
File /opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/site-packages/IPython/core/formatters.py:340, in BaseFormatter.__call__(self, obj)
    338     pass
    339 else:
--> 340     return printer(obj)
    341 # Finally look for special method names
    342 method = get_real_method(obj, self.print_method)

File /opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/site-packages/IPython/core/pylabtools.py:169, in retina_figure(fig, base64, **kwargs)
    160 def retina_figure(fig, base64=False, **kwargs):
    161     """format a figure as a pixel-doubled (retina) PNG
    162 
    163     If `base64` is True, return base64-encoded str instead of raw bytes
   (...)
    167         base64 argument
    168     """
--> 169     pngdata = print_figure(fig, fmt="retina", base64=False, **kwargs)
    170     # Make sure that retina_figure acts just like print_figure and returns
    171     # None when the figure is empty.
    172     if pngdata is None:

File /opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/site-packages/IPython/core/pylabtools.py:152, in print_figure(fig, fmt, bbox_inches, base64, **kwargs)
    149     from matplotlib.backend_bases import FigureCanvasBase
    150     FigureCanvasBase(fig)
--> 152 fig.canvas.print_figure(bytes_io, **kw)
    153 data = bytes_io.getvalue()
    154 if fmt == 'svg':

File /opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/site-packages/matplotlib/backend_bases.py:2342, in FigureCanvasBase.print_figure(self, filename, dpi, facecolor, edgecolor, orientation, format, bbox_inches, pad_inches, bbox_extra_artists, backend, **kwargs)
   2336     renderer = _get_renderer(
   2337         self.figure,
   2338         functools.partial(
   2339             print_method, orientation=orientation)
   2340     )
   2341     with getattr(renderer, "_draw_disabled", nullcontext)():
-> 2342         self.figure.draw(renderer)
   2344 if bbox_inches:
   2345     if bbox_inches == "tight":

File /opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/site-packages/matplotlib/artist.py:95, in _finalize_rasterization.<locals>.draw_wrapper(artist, renderer, *args, **kwargs)
     93 @wraps(draw)
     94 def draw_wrapper(artist, renderer, *args, **kwargs):
---> 95     result = draw(artist, renderer, *args, **kwargs)
     96     if renderer._rasterizing:
     97         renderer.stop_rasterizing()

File /opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/site-packages/matplotlib/artist.py:72, in allow_rasterization.<locals>.draw_wrapper(artist, renderer)
     69     if artist.get_agg_filter() is not None:
     70         renderer.start_filter()
---> 72     return draw(artist, renderer)
     73 finally:
     74     if artist.get_agg_filter() is not None:

File /opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/site-packages/matplotlib/figure.py:3175, in Figure.draw(self, renderer)
   3172         # ValueError can occur when resizing a window.
   3174 self.patch.draw(renderer)
-> 3175 mimage._draw_list_compositing_images(
   3176     renderer, self, artists, self.suppressComposite)
   3178 for sfig in self.subfigs:
   3179     sfig.draw(renderer)

File /opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/site-packages/matplotlib/image.py:131, in _draw_list_compositing_images(renderer, parent, artists, suppress_composite)
    129 if not_composite or not has_images:
    130     for a in artists:
--> 131         a.draw(renderer)
    132 else:
    133     # Composite any adjacent images together
    134     image_group = []

File /opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/site-packages/matplotlib/artist.py:72, in allow_rasterization.<locals>.draw_wrapper(artist, renderer)
     69     if artist.get_agg_filter() is not None:
     70         renderer.start_filter()
---> 72     return draw(artist, renderer)
     73 finally:
     74     if artist.get_agg_filter() is not None:

File /opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/site-packages/matplotlib/axes/_base.py:3064, in _AxesBase.draw(self, renderer)
   3061 if artists_rasterized:
   3062     _draw_rasterized(self.figure, artists_rasterized, renderer)
-> 3064 mimage._draw_list_compositing_images(
   3065     renderer, self, artists, self.figure.suppressComposite)
   3067 renderer.close_group('axes')
   3068 self.stale = False

File /opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/site-packages/matplotlib/image.py:131, in _draw_list_compositing_images(renderer, parent, artists, suppress_composite)
    129 if not_composite or not has_images:
    130     for a in artists:
--> 131         a.draw(renderer)
    132 else:
    133     # Composite any adjacent images together
    134     image_group = []

File /opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/site-packages/matplotlib/artist.py:72, in allow_rasterization.<locals>.draw_wrapper(artist, renderer)
     69     if artist.get_agg_filter() is not None:
     70         renderer.start_filter()
---> 72     return draw(artist, renderer)
     73 finally:
     74     if artist.get_agg_filter() is not None:

File /opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/site-packages/matplotlib/axis.py:1396, in Axis.draw(self, renderer, *args, **kwargs)
   1394 # Shift label away from axes to avoid overlapping ticklabels.
   1395 self._update_label_position(renderer)
-> 1396 self.label.draw(renderer)
   1398 self._update_offset_text_position(tlb1, tlb2)
   1399 self.offsetText.set_text(self.major.formatter.get_offset())

File /opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/site-packages/matplotlib/artist.py:72, in allow_rasterization.<locals>.draw_wrapper(artist, renderer)
     69     if artist.get_agg_filter() is not None:
     70         renderer.start_filter()
---> 72     return draw(artist, renderer)
     73 finally:
     74     if artist.get_agg_filter() is not None:

File /opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/site-packages/matplotlib/text.py:752, in Text.draw(self, renderer)
    749 renderer.open_group('text', self.get_gid())
    751 with self._cm_set(text=self._get_wrapped_text()):
--> 752     bbox, info, descent = self._get_layout(renderer)
    753     trans = self.get_transform()
    755     # don't use self.get_position here, which refers to text
    756     # position in Text:

File /opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/site-packages/matplotlib/text.py:386, in Text._get_layout(self, renderer)
    384 clean_line, ismath = self._preprocess_math(line)
    385 if clean_line:
--> 386     w, h, d = _get_text_metrics_with_cache(
    387         renderer, clean_line, self._fontproperties,
    388         ismath=ismath, dpi=self.figure.dpi)
    389 else:
    390     w = h = d = 0

File /opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/site-packages/matplotlib/text.py:97, in _get_text_metrics_with_cache(renderer, text, fontprop, ismath, dpi)
     94 """Call ``renderer.get_text_width_height_descent``, caching the results."""
     95 # Cached based on a copy of fontprop so that later in-place mutations of
     96 # the passed-in argument do not mess up the cache.
---> 97 return _get_text_metrics_with_cache_impl(
     98     weakref.ref(renderer), text, fontprop.copy(), ismath, dpi)

File /opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/site-packages/matplotlib/text.py:105, in _get_text_metrics_with_cache_impl(renderer_ref, text, fontprop, ismath, dpi)
    101 @functools.lru_cache(4096)
    102 def _get_text_metrics_with_cache_impl(
    103         renderer_ref, text, fontprop, ismath, dpi):
    104     # dpi is unused, but participates in cache invalidation (via the renderer).
--> 105     return renderer_ref().get_text_width_height_descent(text, fontprop, ismath)

File /opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/site-packages/matplotlib/backends/backend_agg.py:230, in RendererAgg.get_text_width_height_descent(self, s, prop, ismath)
    226     return super().get_text_width_height_descent(s, prop, ismath)
    228 if ismath:
    229     ox, oy, width, height, descent, font_image = \
--> 230         self.mathtext_parser.parse(s, self.dpi, prop)
    231     return width, height, descent
    233 font = self._prepare_font(prop)

File /opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/site-packages/matplotlib/mathtext.py:226, in MathTextParser.parse(self, s, dpi, prop)
    222 # lru_cache can't decorate parse() directly because prop
    223 # is mutable; key the cache using an internal copy (see
    224 # text._get_text_metrics_with_cache for a similar case).
    225 prop = prop.copy() if prop is not None else None
--> 226 return self._parse_cached(s, dpi, prop)

File /opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/site-packages/matplotlib/mathtext.py:247, in MathTextParser._parse_cached(self, s, dpi, prop)
    244 if self._parser is None:  # Cache the parser globally.
    245     self.__class__._parser = _mathtext.Parser()
--> 247 box = self._parser.parse(s, fontset, fontsize, dpi)
    248 output = _mathtext.ship(box)
    249 if self._output_type == "vector":

File /opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/site-packages/matplotlib/_mathtext.py:1971, in Parser.parse(self, s, fonts_object, fontsize, dpi)
   1968     result = self._expression.parseString(s)
   1969 except ParseBaseException as err:
   1970     # explain becomes a plain method on pyparsing 3 (err.explain(0)).
-> 1971     raise ValueError("\n" + ParseException.explain(err, 0)) from None
   1972 self._state_stack = None
   1973 self._in_subscript_or_superscript = False

ValueError: 
\sqrt{|\text{standardized residuals}|}
       ^
ParseSyntaxException: Expected required_group, found '\'  (at char 7), (line:1, col:8)

<Figure size 500x300 with 1 Axes>

Breusch-Pagan homoskedasticity test#

https://en.wikipedia.org/wiki/Breusch-Pagan_test

Idea:

Lagrange Multiplier test for heteroscedasticity

# lm2.model.exog[0:3]

lm, p_lm, _, _ = sms.het_breuschpagan(lm2.resid, lm2.model.exog)
lm, p_lm

(6.254809752854315, 0.09985027888796541)

Goldfeld-Quandt homoskedasticity test#

https://en.wikipedia.org/wiki/Goldfeld-Quandt_test

# sort by score
idxs = doctors.sort_values("alc").index

# sort by scorehats
idxs = np.argsort(lm2.fittedvalues)
X = lm2.model.exog[idxs,:]
resid = lm2.resid[idxs]

# run the test
F, p_F, _ = sms.het_goldfeldquandt(resid, X, idx=1)
F, p_F

(1.5346515257083702, 0.03369473000286662)

Collinearity checks#

Check the correlation matrix#

corrM = doctors[["alc", "weed", "exrc"]].corr()
corrM.round(4)

	alc	weed	exrc
alc	1.0000	0.0422	0.0336
weed	0.0422	1.0000	0.0952
exrc	0.0336	0.0952	1.0000

Recall that the correlation coefficient \(r_{\mathbf{x}_1,\mathbf{x}_2}\) between two variables \(\mathbf{x}_1\) and \(\mathbf{x}_2\) is equivalent to the square root of the \(R^2\) metric for the linear model x1 ~ x2, which has the intuitive interpretation as the proportion of the variance in x1 that can be explained by the variability in x2.

Let’s verify this equivalence by computing the correlation coefficient \(r_{\texttt{alc},\texttt{exrc}}\) from a linear regression model of exrc on alc.

lm_alc_exrc = smf.ols("alc ~ exrc", data=doctors).fit()
R2_alc_exrc = lm_alc_exrc.rsquared
r_alc_exrc = np.sqrt(R2_alc_exrc).round(4)
r_alc_exrc, R2_alc_exrc

(0.0336, 0.0011306540709105084)

# # BONUS 1 (not covered anywhere earlier in the book)
# sns.pairplot(doctors[["alc", "weed", "exrc"]]);

# # BONUS 2 (not covered anywhere earlier in the book)
# sns.heatmap(corrM, annot=True, fmt='.2f', cmap="Grays");

Condition number#

lm2.condition_number

31.229721453770164

# ALT. (compute from scratch)
#######################################################
X = sm.add_constant(doctors[["alc","weed","exrc"]])
eigvals = np.linalg.eigvals(X.T @ X)
lam_max, lam_min = np.max(eigvals), np.min(eigvals)
np.sqrt(lam_max/lam_min)

31.229721453770264

Variance inflation factor#

from statsmodels.stats.outliers_influence \
     import variance_inflation_factor as statsmodels_vif

X = sm.add_constant(doctors[["alc","weed","exrc"]])
statsmodels_vif(X, 1)  # variation inflation factor of "alc" in lm2

1.0026692156304757

from statsmodels.stats.outliers_influence \
     import variance_inflation_factor as vif

X = sm.add_constant(doctors[["alc","weed","exrc"]])

for i, var in enumerate(["alc ", "weed", "exrc"]):
    print("VIF for", var, "=", vif(X, i+1))

VIF for alc  = 1.0026692156304757
VIF for weed = 1.010703145007322
VIF for exrc = 1.010048809433661

How VIF is calculated under the hood (optional)#

def calc_vif(dmatrix, pred_idx):
    """
    Calculate the variance inflation factor of the predictor
    in column `pred_idx` of the design matrix `dmatrix`.
    """
    n_cols = dmatrix.shape[1]
    x_i = dmatrix[:, pred_idx]
    mask = np.arange(n_cols) != pred_idx
    X_noti = dmatrix[:, mask]
    r_squared_i = sm.OLS(x_i, X_noti).fit().rsquared
    vif = 1. / (1. - r_squared_i)
    return vif

calc_vif(lm2.model.exog, 1), calc_vif(lm2.model.exog, 2), calc_vif(lm2.model.exog, 3)

(1.0026692156304757, 1.010703145007322, 1.010048809433661)

# ALT.
from ministats import calc_lm_vif

calc_lm_vif(lm2, "alc"), calc_lm_vif(lm2, "weed"), calc_lm_vif(lm2, "exrc")

(1.0026692156304757, 1.010703145007322, 1.010048809433661)

Statsmodels diagnostics plots#

Plot fit against one regressor#

see https://www.statsmodels.org/stable/generated/statsmodels.graphics.regressionplots.plot_fit.html

from statsmodels.graphics.api import plot_fit

with plt.rc_context({"figure.figsize":(9,2.6)}):
    fig, (ax1,ax2,ax3) = plt.subplots(1,3, sharey=True)
    plot_fit(lm2, "alc",  ax=ax1);
    plot_fit(lm2, "weed", ax=ax2);
    plot_fit(lm2, "exrc", ax=ax3)
    ax2.get_legend().remove()
    ax3.get_legend().remove()

filename = os.path.join(DESTDIR, "statsmodels_panel_plot_fit.pdf")
savefigure(fig, filename)

Saved figure to figures/lm/interpreting/statsmodels_panel_plot_fit.pdf
Saved figure to figures/lm/interpreting/statsmodels_panel_plot_fit.png

../_images/3a33af19e4aecc11549fa930265d23641af69f9a7560a5fc4872395dcd7f0e17.png

Partial regression plot#

https://www.statsmodels.org/stable/generated/statsmodels.graphics.regressionplots.plot_partregress.html

from statsmodels.graphics.api import plot_partregress

with plt.rc_context({"figure.figsize":(9,2.6)}):
    fig, (ax1,ax2,ax3) = plt.subplots(1,3, sharey=True)
    plot_partregress("score", "alc",  exog_others=[], data=doctors, obs_labels=False, ax=ax1)
    plot_partregress("score", "weed", exog_others=[], data=doctors, obs_labels=False, ax=ax2)
    plot_partregress("score", "exrc", exog_others=[], data=doctors, obs_labels=False, ax=ax3)

filename = os.path.join(DESTDIR, "statsmodels_panel_plot_partregress.pdf")
savefigure(fig, filename)

Saved figure to figures/lm/interpreting/statsmodels_panel_plot_partregress.pdf
Saved figure to figures/lm/interpreting/statsmodels_panel_plot_partregress.png

../_images/2100c359e7b63963654d46389832fdf5c863074993b8fe467f70052122ed19fa.png

CCPR plot#

component and component-plus-residual

see https://www.statsmodels.org/stable/generated/statsmodels.graphics.regressionplots.plot_ccpr.html

from statsmodels.graphics.api import plot_ccpr

with plt.rc_context({"figure.figsize":(9,2.6), "text.usetex":True}):
    fig, (ax1,ax2,ax3) = plt.subplots(1,3)
    fig.subplots_adjust(wspace=0.5)

    plot_ccpr(lm2, "alc", ax=ax1)
    ax1.set_title("")
    ax1.set_ylabel(r"$\texttt{alc} \cdot \widehat{\beta}_{\texttt{alc}} \; + \; \mathbf{r}$")

    plot_ccpr(lm2, "weed", ax=ax2)
    ax2.set_title("")
    ax2.set_ylabel(r"$\texttt{weed} \cdot \widehat{\beta}_{\texttt{weed}} \; + \; \mathbf{r}$", labelpad=-3)

    plot_ccpr(lm2, "exrc", ax=ax3)
    ax3.set_title("")
    ax3.set_ylabel(r"$\texttt{exrc} \cdot \widehat{\beta}_{\texttt{exrc}} \; + \; \mathbf{r}$", labelpad=-3)

filename = os.path.join(DESTDIR, "statsmodels_panel_plot_ccpr.pdf")
savefigure(fig, filename)

Saved figure to figures/lm/interpreting/statsmodels_panel_plot_ccpr.pdf
Saved figure to figures/lm/interpreting/statsmodels_panel_plot_ccpr.png

../_images/236d87701dad9a3a55500d9dab9d9e1ac83798756bef5eaf3e16e2e256e7ef4e.png

All-in-on convenience method#

from statsmodels.graphics.api import plot_regress_exog

with plt.rc_context({"figure.figsize":(6,4)}):
    plot_regress_exog(lm2, "alc");

../_images/b79d269baba3e72452ffa3231c3d9db49dca221a90d53c73765eb3c37b714820.png

Outliers and influential observations#

TODO: definitions

from scipy.stats import norm

npts = 30
beta_0 = 10
beta_x = 3
sigma = 1

np.random.seed(45)

xs = norm(loc=5, scale=2).rvs(npts)
ys = beta_0 + beta_x*xs + norm(0,sigma).rvs(npts)
dfxys = pd.DataFrame({"x":xs, "y":ys})

# outlier near middle
idx_mid = dfxys[dfxys["x"]>5].sort_values("x").head(1).index[0]
y1s = dfxys["y"].copy()
y1s[idx_mid] = 10
dfxys["y1"] = y1s

# high leverage points
dfxys["y2"] = ys
mask_hl = (dfxys["x"] < 1) | (dfxys["x"] > 7.5)
dfxys["hl"] = mask_hl
# dfxys.loc[idxs_hl,"hl"] = 1


# influential point near end
idx_last = dfxys.sort_values("x").tail(1).index[0]
y3s = dfxys["y"].copy()
y3s[idx_last] = 10
dfxys["y3"] = y3s


with plt.rc_context({"figure.figsize":(9,2.6), "text.usetex":True}):
    fig, (ax1,ax2,ax3) = plt.subplots(1,3, sharey=True)
    # fig.subplots_adjust(wspace=0.5)

    # Outlier
    # lmxy1 = smf.ols("y1 ~ x", data=dfxys).fit()
    # lmxy1.params
    sns.regplot(x="x", y="y1", ci=False, data=dfxys, ax=ax1)
    ax1.set_ylabel("$y_1$")
    ax1.set_xlabel("$x$")
    ax1.set_xticks(range(0,12,2))
    ax1.set_title("(a) Outlier")

    # (B) High leverage
    sns.regplot(x="x", y="y2", ci=False, data=dfxys, ax=ax2)
    dfhl = dfxys[dfxys["hl"]==True]
    sns.regplot(x="x", y="y2", ci=False, data=dfhl,
                fit_reg=False, color="red", marker="D", ax=ax2)
    ax2.set_ylabel("$y_2$")
    ax2.set_xlabel("$x$")
    ax2.set_xticks(range(0,12,2))
    ax2.set_title("(b) High leverage points")

    # (C) Influential
    sns.regplot(x="x", y="y3", ci=False, data=dfxys, ax=ax3)
    ax3.set_ylabel("$y_3$")
    ax3.set_xlabel("$x$")
    ax3.set_xticks(range(0,12,2))
    ax3.set_title("(c) Influential observation")

    filename = os.path.join(DESTDIR, "panel_outlier_hl_influential.pdf")
    savefigure(fig, filename)

Saved figure to figures/lm/interpreting/panel_outlier_hl_influential.pdf
Saved figure to figures/lm/interpreting/panel_outlier_hl_influential.png

../_images/053157df300c88b4725dcb2d87d6c27587e54aac7b0703bd093bf4d770852816.png

Leverage and influence metrics#

doctors.shape

(156, 12)

i = 0
doctors_no_i = doctors.drop(index=i)
doctors_no_i.shape

(155, 12)

Leverage score#

Manual calculations of leverage metric (optional)#

mtcars = sm.datasets.get_rdataset("mtcars", "datasets").data

# Fit model using all data points
lmcars = smf.ols("mpg ~ hp", data=mtcars).fit()

# Design matrix
X = sm.add_constant(mtcars["hp"]).values

# Hat matrix
H = X @ np.linalg.inv(X.T @ X) @ X.T

# First element of the diagonal form the hat matrix
h0 = np.diag(H)[0]
print(f"{h0 = }")

# Linear model fit without the first data point
lmcars_no0 = smf.ols("mpg ~ hp", data=mtcars[1:]).fit()

# Prediction of yhat0 from the model without the first data point
yhat0_no0 = lmcars_no0.predict({"hp":mtcars.iloc[0]["hp"]})[0]
print(f"{yhat0_no0 = }")

# Alternative formula   h0 = 1 - r0 / r0_no0
r0 = lmcars.resid.iloc[0]          # residual from the full model 
r0_no0 = lmcars_no0.resid.iloc[0]  # residual from model w/o first point
h0 = 1 - r0/r0_no0
print(f"{h0 = }")

# Alternative formula for h0 via `statsmodels`
infl_cars = lmcars.get_influence()
h0 = infl_cars.hat_matrix_diag[0]
print(f"{h0 = }")

h0 = 0.04048626926227573
yhat0_no0 = 22.660997545626714
h0 = 0.040486269262266505
h0 = 0.040486269262275755

Studentized residuals#

\[ t_{i} = \frac{r_i}{\hat{\sigma}_{(-i)} \sqrt{1-h_i}} \]

infl = lm2.get_influence()
infl.resid_studentized[0:5]  # = infl2.resid_studentized_internal

array([ 0.98097556, -2.01341843, -1.60234556, -0.21972192, -1.28811914])

# obtained using Jackknife
infl.resid_studentized_external[0:5]

array([ 0.98085317, -2.03409243, -1.61072775, -0.21903275, -1.29094028])

Cook’s distance#

\[ D_i = \frac{r_i^2}{p \widehat{\sigma}^2} \cdot \frac{h_{i}}{(1-h_{i})^2} = \frac{t_i^2 }{p}\frac{h_i}{1-h_i} \]

infl.cooks_distance[0][0:5]

array([0.02526116, 0.01331454, 0.01116562, 0.00017951, 0.0041123 ])

DFFITS diagnostic#

\[ TODO \]

Calculating leverage and influence metrics using `statsmodels`#

infl = lm2.get_influence()
infl_df = infl.summary_frame() 
list(infl_df.columns)

['dfb_Intercept',
 'dfb_alc',
 'dfb_weed',
 'dfb_exrc',
 'cooks_d',
 'standard_resid',
 'hat_diag',
 'dffits_internal',
 'student_resid',
 'dffits']

cols = ["student_resid", "hat_diag", "cooks_d", "dffits"]
print(infl_df[cols].round(3))

     student_resid  hat_diag  cooks_d  dffits
          0.981     0.095    0.025   0.318
         -2.034     0.013    0.013  -0.233
         -1.611     0.017    0.011  -0.212
         -0.219     0.015    0.000  -0.027
         -1.291     0.010    0.004  -0.129
..             ...       ...      ...     ...
        0.873     0.009    0.002   0.083
       -0.497     0.017    0.001  -0.065
        0.279     0.023    0.000   0.043
        2.320     0.040    0.055   0.475
        0.353     0.025    0.001   0.056

[156 rows x 4 columns]

Leverage plots#

import statsmodels.api as sm
sm.graphics.plot_leverage_resid2(lm2);

filename = os.path.join(DESTDIR, "lm2_plot_leverage_resid2.pdf")
savefigure(plt.gcf(), filename)

Saved figure to figures/lm/interpreting/lm2_plot_leverage_resid2.pdf
Saved figure to figures/lm/interpreting/lm2_plot_leverage_resid2.png

../_images/7e5e60a73b955103570eba1a31471d0a97bf197df45aa0cf708c88f8cde21380.png

Influence plots#

sm.graphics.influence_plot(lm2, criterion="cooks", size=4);

filename = os.path.join(DESTDIR, "lm2_influence_plot_cook.pdf")
savefigure(plt.gcf(), filename)

Saved figure to figures/lm/interpreting/lm2_influence_plot_cook.pdf
Saved figure to figures/lm/interpreting/lm2_influence_plot_cook.png

../_images/6f1f452286cfca67a30f06eaabfc49baf6d16e45bb08a0ef4b1eefb733b13aa6.png

# ALT. use DFFITS
# sm.graphics.influence_plot(lm2, criterion="DFFITS", size=4);

Outlier tests (BONUS TOPIC)#

cf. https://notebook.community/samuelsinayoko/kaggle-housing-prices/research/outlier_detection_statsmodels

lm2.outlier_test().head()

	student_resid	unadj_p	bonf(p)
0	0.980853	0.328234	1.0
1	-2.034092	0.043693	1.0
2	-1.610728	0.109327	1.0
3	-0.219033	0.826920	1.0
4	-1.290940	0.198697	1.0

Model prediction accuracy#

new_data = {"alc":3, "weed":1, "exrc":8}
lm2.predict(new_data)

0    68.177355
dtype: float64

pred_score = lm2.get_prediction(new_data)

In-sample prediction accuracy#

# Compute the in-sample mean squared error (MSE)
scores = doctors['score']
scorehats = lm2.fittedvalues
mse = np.mean( (scores - scorehats)**2 )
mse

65.56013803717558

# ALT.
lm2.ssr/n

65.56013803717559

# ALT2.
from statsmodels.tools.eval_measures import mse
mse(scores,scorehats)

65.56013803717558

Out-of-sample prediction accuracy#

Leave-one-out cross-validation#

#######################################################
loo_preds = np.zeros(n)
for i in range(n):
    doctors_no_i = doctors.drop(index=i)
    lm2_no_i = smf.ols(formula,data=doctors_no_i).fit()
    predictors_i = dict(doctors.loc[i,:])
    pred_i = lm2_no_i.predict(predictors_i)[0]
    loo_preds[i] = pred_i

# Calculate the out-of-sample mean squared error
mse_loo = np.mean( (doctors['score'] - loo_preds)**2 )
mse_loo

69.20054514710948

Compare with the in-sample MSE of the model which is lower.

lm2.ssr/n

65.56013803717559

Regularization#

doctors[["alc", "weed", "exrc"]].std()

alc     9.428506
weed    1.391068
exrc    4.796361
dtype: float64

# TRUE
# alc =  - 1.8
# weed = - 0.5
# exrc = + 1.9

# FITTED
lm2.params

Intercept    60.452901
alc          -1.800101
weed         -1.021552
exrc          1.768289
dtype: float64

L1 regularization (LASSO)#

lm2_L1reg = smf.ols(formula, data=doctors) \
               .fit_regularized(alpha=0.3, L1_wt=1.0)
lm2_L1reg.params

Intercept    58.930367
alc          -1.763306
weed          0.000000
exrc          1.795228
dtype: float64

L2 regularization (ridge)#

lm2_L2reg = smf.ols(formula, data=doctors) \
               .fit_regularized(alpha=0.02, L1_wt=0.0001)
lm2_L2reg.params

Intercept    56.127763
alc          -1.655051
weed         -0.769119
exrc          2.014540
dtype: float64

Explanations#

Calculating standard error of parameters#

lm2.bse

Intercept    1.289380
alc          0.069973
weed         0.476166
exrc         0.138056
dtype: float64

# construct the design matrix for the model 
X = sm.add_constant(doctors[["alc","weed","exrc"]])
# calculate the diagonal of the inverse-covariance matrix
inv_covs = np.diag(np.linalg.inv(X.T.dot(X)))
sigmahat*np.sqrt(inv_covs)

array([1.28938008, 0.06997301, 0.4761656 , 0.13805557])

# these lead to approx. same result because predictors are not correlated
# but the correct formula is to use the inv. covariance matrix as above.
sum_alc_dev2 = np.sum((doctors["alc"] - doctors["alc"].mean())**2)
sum_weed_dev2 = np.sum((doctors["weed"] - doctors["weed"].mean())**2)
sum_exrc_dev2 = np.sum((doctors["exrc"] - doctors["exrc"].mean())**2)
se_b_alc = sigmahat/np.sqrt(sum_alc_dev2)
se_b_weed = sigmahat/np.sqrt(sum_weed_dev2)
se_b_exrc = sigmahat/np.sqrt(sum_exrc_dev2)
se_b_alc, se_b_weed, se_b_exrc

(0.06987980522527788, 0.47363764322385443, 0.13736710473639846)

Discussion#

Towards machine learning#

The out-of-sample prediction accuracy is a common metric used in machine learning (ML) tasks.

Exercises#

E4.LOGNORM Non-normal error term#

from scipy.stats import uniform, lognorm
np.random.seed(42)
xs = uniform(0,10).rvs(100)
ys = 2*xs + lognorm(1).rvs(100)
df = pd.DataFrame({"x":xs, "y":ys})
lm = smf.ols("y ~ x", data=df).fit()

qqplot(lm.resid, line="s");

../_images/48aab3c383b81180d12857991126c28ab1ff6fa40da2fb8fd6041aae0aaf7ed0.png

# sns.scatterplot(x=xs, y=lm.resid);

# from ministats import plot_lm_scale_loc
# plot_lm_scale_loc(lm);

E4.DEP Dependent error term#

E4.NL Nonlinear relationship#

from scipy.stats import uniform, lognorm

np.random.seed(42)
xs = uniform(0,10).rvs(100)
ys = 2*xs + xs**2 + norm(0,1).rvs(100)
df = pd.DataFrame({"x":xs, "y":ys})
lm = smf.ols("y ~ x", data=df).fit()
lm.params

Intercept   -14.684585
x            11.688188
dtype: float64

sns.scatterplot(x=xs, y=ys)
sns.lineplot(x=xs, y=lm.fittedvalues);

../_images/c1e8edcdab1cb025a07167417fe7f2077baf447205b10667e321a5c87fed4f97.png

sns.regplot(x=xs, y=lm.resid, lowess=True);

../_images/0a265e1aaacce856c0c9fb378945e2c273f33fdfa74c60058f312287fbfaa3b8.png

qqplot(lm.resid, line="s");

../_images/eb356745931b197818b1b069b769ac04cc6056395a4ba97d036e40ecdeb3a26a.png

E4.H Heteroskedasticity#

from scipy.stats import uniform, norm

np.random.seed(43)
xs = np.sort(uniform(0,10).rvs(100))
sigmas = np.linspace(1,20,100)
ys = 2*xs + norm(loc=0,scale=sigmas).rvs(100)
df = pd.DataFrame({"x":xs, "y":ys})
lm = smf.ols("y ~ x", data=df).fit()

from ministats import plot_lm_scale_loc
plot_lm_scale_loc(lm);

../_images/558922b36bbd7bfee7b52973721a2fcaa71dfe03fa3e6813420b906d1e7f8160.png

E4.v Collinearity#

from scipy.stats import uniform, norm

np.random.seed(45)
x1s = uniform(0,10).rvs(100)
alpha = 0.8
x2s = alpha*x1s + (1-alpha)*uniform(0,10).rvs(100)
ys = 2*x1s + 3*x2s + norm(0,1).rvs(100)
df = pd.DataFrame({"x1":x1s, "x2":x2s, "y":ys})
lm = smf.ols("y ~ x1 + x2", data=df).fit()
lm.params, lm.bse

(Intercept    0.220927
 x1           1.985774
 x2           3.006225
 dtype: float64,
 Intercept    0.253988
 x1           0.137253
 x2           0.167492
 dtype: float64)

lm.condition_number

24.291898486251146

calc_lm_vif(lm, "x1"), calc_lm_vif(lm, "x2")

(17.17244980334442, 17.17244980334442)

Links#

More details about model checking https://ethanweed.github.io/pythonbook/05.04-regression.html#model-checking
Statistical Modeling: The Two Cultures paper that explains the importance of out-of-sample predictions for statistical modelling.
https://projecteuclid.org/journals/statistical-science/volume-16/issue-3/Statistical-Modeling–The-Two-Cultures-with-comments-and-a/10.1214/ss/1009213726.full

CUT MATERIAL#

import numpy as np
import matplotlib.pyplot as plt
import statsmodels.api as sm

# Generate some data
np.random.seed(0)
X = np.random.normal(size=100)
y = 2 * X + np.random.normal(size=100)

# Fit a regression model
model = sm.OLS(y, sm.add_constant(X))
results = model.fit()

# Compute residuals and fitted values
residuals = results.resid
fitted = results.fittedvalues
standardized_residuals = residuals / np.std(residuals)

# Create scale-location plot
plt.figure(figsize=(8, 6))
plt.scatter(fitted, np.sqrt(np.abs(standardized_residuals)), alpha=0.5)
plt.xlabel('Fitted values')
plt.ylabel(r'$\sqrt{|Standardized Residuals|}$')
plt.title('Scale-Location Plot')
plt.grid(True)
plt.show()

../_images/35ff796c3a3cb5b964e96ae77a64dd861d45b53f1476e0b8b59b934d2cc12d4f.png

Example dataset#

mtcars = sm.datasets.get_rdataset("mtcars", "datasets").data
mtcars

lmcars = smf.ols("mpg ~ hp", data=mtcars).fit()

sns.scatterplot(x=mtcars["hp"], y=mtcars["mpg"])
sns.lineplot(x=mtcars["hp"], y=lmcars.fittedvalues);

../_images/42b7b8ca2dc8aaaeddcbe9310db796f07f82cb90c682649af4f6767945469f72.png

Compute the variance/covariance matrix#

lm2.cov_params()
# == lm2.scale * np.linalg.inv(X.T @ X)
# where X = sm.add_constant(doctors[["alc","weed","exrc"]])

	Intercept	alc	weed	exrc
Intercept	1.662501	-0.055601	-0.093711	-0.095403
alc	-0.055601	0.004896	-0.001305	-0.000288
weed	-0.093711	-0.001305	0.226734	-0.006177
exrc	-0.095403	-0.000288	-0.006177	0.019059