Section 4.3 — Interpreting linear models#
This notebook contains the code examples from Section 4.3 Interpreting linear models from the No Bullshit Guide to Statistics.
Notebook setup#
# load Python modules
import os
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
# Figures setup
plt.clf() # needed otherwise `sns.set_theme` doesn't work
from plot_helpers import RCPARAMS
# RCPARAMS.update({'figure.figsize': (8, 5)}) # good for screen
RCPARAMS.update({'figure.figsize': (5, 3)}) # good for print
# RCPARAMS.update({'text.latex.preamble': r'\usepackage{amsmath}'})
# RCPARAMS.update({'text.usetex': True})
sns.set_theme(
context="paper",
style="whitegrid",
palette="colorblind",
rc=RCPARAMS,
)
# High-resolution please
%config InlineBackend.figure_format = 'retina'
# Where to store figures
DESTDIR = "figures/lm/interpreting"
<Figure size 640x480 with 0 Axes>
from ministats.utils import savefigure
# set random seed for repeatability
np.random.seed(42)
#######################################################
import statsmodels.formula.api as smf
import statsmodels.api as sm
import statsmodels.stats.api as sms
Introduction#
# load the dataset
doctors = pd.read_csv("../datasets/doctors.csv")
n = doctors.shape[0]
# fit the model
formula = "score ~ 1 + alc + weed + exrc"
lm2 = smf.ols(formula, data=doctors).fit()
# model degrees of freedom (number of predictors)
p = lm2.df_model
# display the summary table
lm2.summary()
Dep. Variable: | score | R-squared: | 0.842 |
---|---|---|---|
Model: | OLS | Adj. R-squared: | 0.839 |
Method: | Least Squares | F-statistic: | 270.3 |
Date: | Fri, 03 May 2024 | Prob (F-statistic): | 1.05e-60 |
Time: | 17:35:26 | Log-Likelihood: | -547.63 |
No. Observations: | 156 | AIC: | 1103. |
Df Residuals: | 152 | BIC: | 1115. |
Df Model: | 3 | ||
Covariance Type: | nonrobust |
coef | std err | t | P>|t| | [0.025 | 0.975] | |
---|---|---|---|---|---|---|
Intercept | 60.4529 | 1.289 | 46.885 | 0.000 | 57.905 | 63.000 |
alc | -1.8001 | 0.070 | -25.726 | 0.000 | -1.938 | -1.662 |
weed | -1.0216 | 0.476 | -2.145 | 0.034 | -1.962 | -0.081 |
exrc | 1.7683 | 0.138 | 12.809 | 0.000 | 1.496 | 2.041 |
Omnibus: | 1.140 | Durbin-Watson: | 1.828 |
---|---|---|---|
Prob(Omnibus): | 0.565 | Jarque-Bera (JB): | 0.900 |
Skew: | 0.182 | Prob(JB): | 0.638 |
Kurtosis: | 3.075 | Cond. No. | 31.2 |
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
Model fit metrics#
Coefficient of determination#
lm2.rsquared
0.8421649167873537
# ALT.
1 - lm2.ssr/lm2.centered_tss
0.8421649167873537
Adjusted coefficient of determination#
lm2.rsquared_adj
0.8390497506713147
# ALT.
1 - (lm2.ssr/(n-p-1)) / (lm2.centered_tss/(n-1))
0.8390497506713147
F-statistic and associated p-value#
lm2.fvalue, lm2.f_pvalue
(270.34350189265837, 1.0512133413866766e-60)
Log-likelihood#
lm2.llf
-547.6259042117637
# ALT.
from scipy.stats import norm
sigmahat = np.sqrt(lm2.scale)
np.sum(np.log(norm.pdf(lm2.resid,scale=sigmahat)))
# Q: Why not exactly the same as lm2.llf?
-547.6519921512181
# How lm2.llf is computed:
SSR = np.sum(lm2.resid**2)
nobs2 = n/2.0
llf = -np.log(SSR)*nobs2 # concentrated likelihood
llf -= ( 1+np.log(np.pi/nobs2) )*nobs2 # with likelihood constant
llf
-547.6259042117637
Information criteria#
lm2.aic, lm2.bic
(1103.2518084235273, 1115.4512324525256)
# ALT.
2*(p+1) - 2*lm2.llf, np.log(n)*(p+1) - 2*lm2.llf
(1103.2518084235273, 1115.4512324525256)
Other model-fit metrics#
The mean squared error (MSE)
from statsmodels.tools.eval_measures import mse
scores = doctors["score"]
scorehats = lm2.fittedvalues
mse(scores,scorehats), lm2.ssr/n
(65.56013803717558, 65.56013803717558)
Parameter estimates#
# estimated parameters
lm2.params
Intercept 60.452901
alc -1.800101
weed -1.021552
exrc 1.768289
dtype: float64
# estimated sigma (std. of error term)
sigmahat = np.sqrt(lm2.scale)
sigmahat
8.202768119825622
# ALT.
np.sqrt(lm2.ssr/(n-p-1))
8.202768119825622
Confidence intervals for model parameters#
# standard errors
lm2.bse
Intercept 1.289380
alc 0.069973
weed 0.476166
exrc 0.138056
dtype: float64
lm2.conf_int(alpha=0.05)
0 | 1 | |
---|---|---|
Intercept | 57.905480 | 63.000321 |
alc | -1.938347 | -1.661856 |
weed | -1.962309 | -0.080794 |
exrc | 1.495533 | 2.041044 |
Hypothesis testing for linear models#
T-tests for individual parameters#
Hypothesis testing for slope coefficient
Is there a non-zero slope coefficient?
Null hypothesis \(H_0\):
weed
has no effect onscore
, which is equivalent to \(\beta_{\texttt{weed}} = 0\):\[ H_0: \quad S \;\sim\; \mathcal{N}( {\color{red} \beta_0 + \beta_{\texttt{alc}}\!\cdot\!\textrm{alc} + \beta_{\texttt{exrc}}\!\cdot\!\textrm{exrc} }, \ \sigma) \qquad \qquad \]Alternative hypothesis \(H_A\):
weed
has an effect onscore
, and the slope is not zero, \(\beta_{\texttt{weed}} \neq 0\):\[ H_A: \quad S \;\sim\; \mathcal{N}\left( {\color{blue} \beta_0 + \beta_{\texttt{alc}}\!\cdot\!\textrm{alc} + \beta_{\texttt{weed}}\!\cdot\!\textrm{weed} + \beta_{\texttt{exrc}}\!\cdot\!\textrm{exrc} }, \ \sigma \right) \]
lm2.tvalues["weed"], lm2.pvalues["weed"], n-p-1
(-2.1453705454368617, 0.03351156181342315, 152.0)
Calculate the \(t\) statistic:
obst_weed = (lm2.params["weed"] - 0) / lm2.bse["weed"]
obst_weed
-2.1453705454368617
Calculate the associated \(p\)-value
from scipy.stats import t as tdist
pleft = tdist(df=n-p-1).cdf(obst_weed)
pright = 1 - tdist(df=lm2.df_resid).cdf(obst_weed)
pvalue = 2 * min(pleft, pright)
pvalue
0.03351156181342315
lm2.tvalues
Intercept 46.885245
alc -25.725654
weed -2.145371
exrc 12.808529
dtype: float64
lm2.df_resid, n-p-1
(152.0, 152.0)
lm2.pvalues
Intercept 2.756807e-92
alc 2.985013e-57
weed 3.351156e-02
exrc 6.136296e-26
dtype: float64
Example#
F-test for the overall model#
lm2.fvalue
270.34350189265837
# ALT.
F = lm2.mse_model / lm2.mse_resid
F
270.34350189265837
lm2.f_pvalue
1.0512133413866766e-60
# ALT.
from scipy.stats import f as fdist
fdist(dfn=p, dfd=n-p-1).sf(F)
1.0512133413866766e-60
Assumptions checks and diagnostics#
Are the assumptions for the linear model satisfied?
lm2.diagn
{'jb': 0.8999138579592745,
'jbpv': 0.6376556155083202,
'skew': 0.18224568453683593,
'kurtosis': 3.074795238556266,
'omni': 1.1403999852814177,
'omnipv': 0.5654123490825862,
'condno': 31.229721453770164,
'mineigval': 40.14996367643263}
Normality checks#
QQ-plots#
from statsmodels.graphics.api import qqplot
# qqplot(lm2.resid, line="s");
with plt.rc_context({"figure.figsize":(4,3)}):
qqplot(lm2.resid, line="s");
filename = os.path.join(DESTDIR, "sleep_scores_resid_qqplot.pdf")
savefigure(plt.gcf(), filename)
Saved figure to figures/lm/interpreting/sleep_scores_resid_qqplot.pdf
Saved figure to figures/lm/interpreting/sleep_scores_resid_qqplot.png
# BONUS
ax = sns.histplot(lm2.resid)
ax.set_xlim([-27,27]);
Skew and kurtosis#
from scipy.stats import skew
from scipy.stats import kurtosis
skew(lm2.resid), kurtosis(lm2.resid, fisher=False)
(0.18224568453683593, 3.074795238556266)
from statsmodels.stats.stattools import jarque_bera
# from statsmodels.stats.stattools import omni_normtest
jb, jbpv, skew, kurtosis = jarque_bera(lm2.resid)
jb, jbpv, skew, kurtosis
(0.8999138579592745,
0.6376556155083202,
0.18224568453683593,
3.074795238556266)
from statsmodels.stats.stattools import durbin_watson
durbin_watson(lm2.resid)
1.8284830211766734
Independence checks#
sns.scatterplot(x=doctors["alc"], y=lm2.resid);
with plt.rc_context({"figure.figsize":(6,4)}):
fig, axs_matrix = plt.subplots(2,2, sharey=True)
fig.subplots_adjust(hspace=.4)
axs = [ax for row in axs_matrix for ax in row]
sns.scatterplot(x=doctors["alc"], y=lm2.resid, ax=axs[0])
axs[0].set_ylabel("resid")
sns.scatterplot(x=doctors["weed"], y=lm2.resid, ax=axs[1])
sns.scatterplot(x=doctors["exrc"], y=lm2.resid, ax=axs[2])
axs[2].set_ylabel("resid")
sns.scatterplot(x=lm2.fittedvalues, y=lm2.resid, ax=axs[3])
axs[3].set_xlabel("fitted values")
for ax in axs:
ax.axhline(y=0, color="b", linestyle="dashed")
filename = os.path.join(DESTDIR, "sleep_scores_resid_vs_vars_2x2_panel.pdf")
savefigure(fig, filename)
Saved figure to figures/lm/interpreting/sleep_scores_resid_vs_vars_2x2_panel.pdf
Saved figure to figures/lm/interpreting/sleep_scores_resid_vs_vars_2x2_panel.png
Linearity checks#
sns.regplot(x=doctors["alc"], y=lm2.resid, lowess=True);
with plt.rc_context({"figure.figsize":(6,4)}):
fig, axs_matrix = plt.subplots(2,2, sharey=True)
axs = [ax for row in axs_matrix for ax in row]
sns.regplot(x=doctors["alc"], y=lm2.resid, lowess=True, scatter_kws={'s':8}, ax=axs[0])
axs[0].set_ylabel("resid")
# workaround to avoid RuntimeWarning: invalid value encountered in divide
nzwdocs = doctors[doctors["weed"] != 0]
zwdocs = doctors[doctors["weed"] == 0]
n_subsample = int(0.8*len(zwdocs))
np.random.seed(42)
zwdocs_subsample = zwdocs.sample(n_subsample)
idxs = np.concatenate((nzwdocs.index, zwdocs_subsample.index))
sns.regplot(x=doctors["weed"][idxs], y=lm2.resid[idxs], lowess=True, scatter_kws={'s':8}, ax=axs[1])
sns.regplot(x=doctors["exrc"], y=lm2.resid, lowess=True, scatter_kws={'s':8}, ax=axs[2])
axs[2].set_ylabel("resid")
sns.regplot(x=lm2.fittedvalues, y=lm2.resid, lowess=True, scatter_kws={'s':8}, ax=axs[3])
axs[3].set_xlabel("fitted values")
for ax in axs:
ax.axhline(y=0, color="b", linestyle="dashed")
filename = os.path.join(DESTDIR, "sleep_scores_resid_plots_with_lowess.pdf")
savefigure(fig, filename)
Saved figure to figures/lm/interpreting/sleep_scores_resid_plots_with_lowess.pdf
Saved figure to figures/lm/interpreting/sleep_scores_resid_plots_with_lowess.png
Hypothesis test for linearity#
sms.linear_harvey_collier(lm2)
TtestResult(statistic=1.4317316989777105, pvalue=0.15427366179829397, df=152)
Homoscedasticity checks#
The scale-location plot allows us to see if pattern in the variance of the residuals exists.
# Standardized residuals a.k.a. Pearson residuals
# lm2.resid / np.sqrt(lm2.scale))
#######################################################
scorehats = lm2.fittedvalues
std_resids = lm2.resid / np.sqrt(lm2.scale)
sqrt_std_resids = np.sqrt(np.abs(std_resids))
# sns.regplot(x=scorehats, y=sqrt_std_resids, lowess=True)
# FIGURES ONLY
ax = sns.regplot(x=scorehats, y=sqrt_std_resids, lowess=True)
ax.set_ylabel(r"$\sqrt{|\text{standardized residuals}|}$")
ax.set_xlabel("fitted values");
filename = os.path.join(DESTDIR, "sleep_scores_scale-location_plot.pdf")
savefigure(plt.gcf(), filename)
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
Cell In[45], line 13
10 ax.set_xlabel("fitted values");
12 filename = os.path.join(DESTDIR, "sleep_scores_scale-location_plot.pdf")
---> 13 savefigure(plt.gcf(), filename)
File /opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/site-packages/ministats/utils.py:63, in savefigure(obj, filename, tight_layout_kwargs)
61 fig.tight_layout(**tight_layout_kwargs)
62 else:
---> 63 fig.tight_layout()
65 # save as PDF
66 fig.savefig(filename, dpi=300, bbox_inches="tight", pad_inches=0)
File /opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/site-packages/matplotlib/figure.py:3543, in Figure.tight_layout(self, pad, h_pad, w_pad, rect)
3541 previous_engine = self.get_layout_engine()
3542 self.set_layout_engine(engine)
-> 3543 engine.execute(self)
3544 if previous_engine is not None and not isinstance(
3545 previous_engine, (TightLayoutEngine, PlaceHolderLayoutEngine)
3546 ):
3547 _api.warn_external('The figure layout has changed to tight')
File /opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/site-packages/matplotlib/layout_engine.py:184, in TightLayoutEngine.execute(self, fig)
182 renderer = fig._get_renderer()
183 with getattr(renderer, "_draw_disabled", nullcontext)():
--> 184 kwargs = get_tight_layout_figure(
185 fig, fig.axes, get_subplotspec_list(fig.axes), renderer,
186 pad=info['pad'], h_pad=info['h_pad'], w_pad=info['w_pad'],
187 rect=info['rect'])
188 if kwargs:
189 fig.subplots_adjust(**kwargs)
File /opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/site-packages/matplotlib/_tight_layout.py:266, in get_tight_layout_figure(fig, axes_list, subplotspec_list, renderer, pad, h_pad, w_pad, rect)
261 return {}
262 span_pairs.append((
263 slice(ss.rowspan.start * div_row, ss.rowspan.stop * div_row),
264 slice(ss.colspan.start * div_col, ss.colspan.stop * div_col)))
--> 266 kwargs = _auto_adjust_subplotpars(fig, renderer,
267 shape=(max_nrows, max_ncols),
268 span_pairs=span_pairs,
269 subplot_list=subplot_list,
270 ax_bbox_list=ax_bbox_list,
271 pad=pad, h_pad=h_pad, w_pad=w_pad)
273 # kwargs can be none if tight_layout fails...
274 if rect is not None and kwargs is not None:
275 # if rect is given, the whole subplots area (including
276 # labels) will fit into the rect instead of the
(...)
280 # auto_adjust_subplotpars twice, where the second run
281 # with adjusted rect parameters.
File /opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/site-packages/matplotlib/_tight_layout.py:82, in _auto_adjust_subplotpars(fig, renderer, shape, span_pairs, subplot_list, ax_bbox_list, pad, h_pad, w_pad, rect)
80 for ax in subplots:
81 if ax.get_visible():
---> 82 bb += [martist._get_tightbbox_for_layout_only(ax, renderer)]
84 tight_bbox_raw = Bbox.union(bb)
85 tight_bbox = fig.transFigure.inverted().transform_bbox(tight_bbox_raw)
File /opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/site-packages/matplotlib/artist.py:1415, in _get_tightbbox_for_layout_only(obj, *args, **kwargs)
1409 """
1410 Matplotlib's `.Axes.get_tightbbox` and `.Axis.get_tightbbox` support a
1411 *for_layout_only* kwarg; this helper tries to use the kwarg but skips it
1412 when encountering third-party subclasses that do not support it.
1413 """
1414 try:
-> 1415 return obj.get_tightbbox(*args, **{**kwargs, "for_layout_only": True})
1416 except TypeError:
1417 return obj.get_tightbbox(*args, **kwargs)
File /opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/site-packages/matplotlib/axes/_base.py:4387, in _AxesBase.get_tightbbox(self, renderer, call_axes_locator, bbox_extra_artists, for_layout_only)
4385 for axis in self._axis_map.values():
4386 if self.axison and axis.get_visible():
-> 4387 ba = martist._get_tightbbox_for_layout_only(axis, renderer)
4388 if ba:
4389 bb.append(ba)
File /opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/site-packages/matplotlib/artist.py:1415, in _get_tightbbox_for_layout_only(obj, *args, **kwargs)
1409 """
1410 Matplotlib's `.Axes.get_tightbbox` and `.Axis.get_tightbbox` support a
1411 *for_layout_only* kwarg; this helper tries to use the kwarg but skips it
1412 when encountering third-party subclasses that do not support it.
1413 """
1414 try:
-> 1415 return obj.get_tightbbox(*args, **{**kwargs, "for_layout_only": True})
1416 except TypeError:
1417 return obj.get_tightbbox(*args, **kwargs)
File /opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/site-packages/matplotlib/axis.py:1353, in Axis.get_tightbbox(self, renderer, for_layout_only)
1351 # take care of label
1352 if self.label.get_visible():
-> 1353 bb = self.label.get_window_extent(renderer)
1354 # for constrained/tight_layout, we want to ignore the label's
1355 # width/height because the adjustments they make can't be improved.
1356 # this code collapses the relevant direction
1357 if for_layout_only:
File /opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/site-packages/matplotlib/text.py:959, in Text.get_window_extent(self, renderer, dpi)
954 raise RuntimeError(
955 "Cannot get window extent of text w/o renderer. You likely "
956 "want to call 'figure.draw_without_rendering()' first.")
958 with cbook._setattr_cm(self.figure, dpi=dpi):
--> 959 bbox, info, descent = self._get_layout(self._renderer)
960 x, y = self.get_unitless_position()
961 x, y = self.get_transform().transform((x, y))
File /opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/site-packages/matplotlib/text.py:386, in Text._get_layout(self, renderer)
384 clean_line, ismath = self._preprocess_math(line)
385 if clean_line:
--> 386 w, h, d = _get_text_metrics_with_cache(
387 renderer, clean_line, self._fontproperties,
388 ismath=ismath, dpi=self.figure.dpi)
389 else:
390 w = h = d = 0
File /opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/site-packages/matplotlib/text.py:97, in _get_text_metrics_with_cache(renderer, text, fontprop, ismath, dpi)
94 """Call ``renderer.get_text_width_height_descent``, caching the results."""
95 # Cached based on a copy of fontprop so that later in-place mutations of
96 # the passed-in argument do not mess up the cache.
---> 97 return _get_text_metrics_with_cache_impl(
98 weakref.ref(renderer), text, fontprop.copy(), ismath, dpi)
File /opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/site-packages/matplotlib/text.py:105, in _get_text_metrics_with_cache_impl(renderer_ref, text, fontprop, ismath, dpi)
101 @functools.lru_cache(4096)
102 def _get_text_metrics_with_cache_impl(
103 renderer_ref, text, fontprop, ismath, dpi):
104 # dpi is unused, but participates in cache invalidation (via the renderer).
--> 105 return renderer_ref().get_text_width_height_descent(text, fontprop, ismath)
File /opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/site-packages/matplotlib/backends/backend_agg.py:230, in RendererAgg.get_text_width_height_descent(self, s, prop, ismath)
226 return super().get_text_width_height_descent(s, prop, ismath)
228 if ismath:
229 ox, oy, width, height, descent, font_image = \
--> 230 self.mathtext_parser.parse(s, self.dpi, prop)
231 return width, height, descent
233 font = self._prepare_font(prop)
File /opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/site-packages/matplotlib/mathtext.py:226, in MathTextParser.parse(self, s, dpi, prop)
222 # lru_cache can't decorate parse() directly because prop
223 # is mutable; key the cache using an internal copy (see
224 # text._get_text_metrics_with_cache for a similar case).
225 prop = prop.copy() if prop is not None else None
--> 226 return self._parse_cached(s, dpi, prop)
File /opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/site-packages/matplotlib/mathtext.py:247, in MathTextParser._parse_cached(self, s, dpi, prop)
244 if self._parser is None: # Cache the parser globally.
245 self.__class__._parser = _mathtext.Parser()
--> 247 box = self._parser.parse(s, fontset, fontsize, dpi)
248 output = _mathtext.ship(box)
249 if self._output_type == "vector":
File /opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/site-packages/matplotlib/_mathtext.py:1971, in Parser.parse(self, s, fonts_object, fontsize, dpi)
1968 result = self._expression.parseString(s)
1969 except ParseBaseException as err:
1970 # explain becomes a plain method on pyparsing 3 (err.explain(0)).
-> 1971 raise ValueError("\n" + ParseException.explain(err, 0)) from None
1972 self._state_stack = None
1973 self._in_subscript_or_superscript = False
ValueError:
\sqrt{|\text{standardized residuals}|}
^
ParseSyntaxException: Expected required_group, found '\' (at char 7), (line:1, col:8)
Error in callback <function _draw_all_if_interactive at 0x7f3b9b83c5e0> (for post_execute):
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
File /opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/site-packages/matplotlib/pyplot.py:120, in _draw_all_if_interactive()
118 def _draw_all_if_interactive():
119 if matplotlib.is_interactive():
--> 120 draw_all()
File /opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/site-packages/matplotlib/_pylab_helpers.py:132, in Gcf.draw_all(cls, force)
130 for manager in cls.get_all_fig_managers():
131 if force or manager.canvas.figure.stale:
--> 132 manager.canvas.draw_idle()
File /opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/site-packages/matplotlib/backend_bases.py:2082, in FigureCanvasBase.draw_idle(self, *args, **kwargs)
2080 if not self._is_idle_drawing:
2081 with self._idle_draw_cntx():
-> 2082 self.draw(*args, **kwargs)
File /opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/site-packages/matplotlib/backends/backend_agg.py:400, in FigureCanvasAgg.draw(self)
396 # Acquire a lock on the shared font cache.
397 with RendererAgg.lock, \
398 (self.toolbar._wait_cursor_for_draw_cm() if self.toolbar
399 else nullcontext()):
--> 400 self.figure.draw(self.renderer)
401 # A GUI class may be need to update a window using this draw, so
402 # don't forget to call the superclass.
403 super().draw()
File /opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/site-packages/matplotlib/artist.py:95, in _finalize_rasterization.<locals>.draw_wrapper(artist, renderer, *args, **kwargs)
93 @wraps(draw)
94 def draw_wrapper(artist, renderer, *args, **kwargs):
---> 95 result = draw(artist, renderer, *args, **kwargs)
96 if renderer._rasterizing:
97 renderer.stop_rasterizing()
File /opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/site-packages/matplotlib/artist.py:72, in allow_rasterization.<locals>.draw_wrapper(artist, renderer)
69 if artist.get_agg_filter() is not None:
70 renderer.start_filter()
---> 72 return draw(artist, renderer)
73 finally:
74 if artist.get_agg_filter() is not None:
File /opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/site-packages/matplotlib/figure.py:3175, in Figure.draw(self, renderer)
3172 # ValueError can occur when resizing a window.
3174 self.patch.draw(renderer)
-> 3175 mimage._draw_list_compositing_images(
3176 renderer, self, artists, self.suppressComposite)
3178 for sfig in self.subfigs:
3179 sfig.draw(renderer)
File /opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/site-packages/matplotlib/image.py:131, in _draw_list_compositing_images(renderer, parent, artists, suppress_composite)
129 if not_composite or not has_images:
130 for a in artists:
--> 131 a.draw(renderer)
132 else:
133 # Composite any adjacent images together
134 image_group = []
File /opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/site-packages/matplotlib/artist.py:72, in allow_rasterization.<locals>.draw_wrapper(artist, renderer)
69 if artist.get_agg_filter() is not None:
70 renderer.start_filter()
---> 72 return draw(artist, renderer)
73 finally:
74 if artist.get_agg_filter() is not None:
File /opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/site-packages/matplotlib/axes/_base.py:3064, in _AxesBase.draw(self, renderer)
3061 if artists_rasterized:
3062 _draw_rasterized(self.figure, artists_rasterized, renderer)
-> 3064 mimage._draw_list_compositing_images(
3065 renderer, self, artists, self.figure.suppressComposite)
3067 renderer.close_group('axes')
3068 self.stale = False
File /opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/site-packages/matplotlib/image.py:131, in _draw_list_compositing_images(renderer, parent, artists, suppress_composite)
129 if not_composite or not has_images:
130 for a in artists:
--> 131 a.draw(renderer)
132 else:
133 # Composite any adjacent images together
134 image_group = []
File /opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/site-packages/matplotlib/artist.py:72, in allow_rasterization.<locals>.draw_wrapper(artist, renderer)
69 if artist.get_agg_filter() is not None:
70 renderer.start_filter()
---> 72 return draw(artist, renderer)
73 finally:
74 if artist.get_agg_filter() is not None:
File /opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/site-packages/matplotlib/axis.py:1396, in Axis.draw(self, renderer, *args, **kwargs)
1394 # Shift label away from axes to avoid overlapping ticklabels.
1395 self._update_label_position(renderer)
-> 1396 self.label.draw(renderer)
1398 self._update_offset_text_position(tlb1, tlb2)
1399 self.offsetText.set_text(self.major.formatter.get_offset())
File /opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/site-packages/matplotlib/artist.py:72, in allow_rasterization.<locals>.draw_wrapper(artist, renderer)
69 if artist.get_agg_filter() is not None:
70 renderer.start_filter()
---> 72 return draw(artist, renderer)
73 finally:
74 if artist.get_agg_filter() is not None:
File /opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/site-packages/matplotlib/text.py:752, in Text.draw(self, renderer)
749 renderer.open_group('text', self.get_gid())
751 with self._cm_set(text=self._get_wrapped_text()):
--> 752 bbox, info, descent = self._get_layout(renderer)
753 trans = self.get_transform()
755 # don't use self.get_position here, which refers to text
756 # position in Text:
File /opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/site-packages/matplotlib/text.py:386, in Text._get_layout(self, renderer)
384 clean_line, ismath = self._preprocess_math(line)
385 if clean_line:
--> 386 w, h, d = _get_text_metrics_with_cache(
387 renderer, clean_line, self._fontproperties,
388 ismath=ismath, dpi=self.figure.dpi)
389 else:
390 w = h = d = 0
File /opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/site-packages/matplotlib/text.py:97, in _get_text_metrics_with_cache(renderer, text, fontprop, ismath, dpi)
94 """Call ``renderer.get_text_width_height_descent``, caching the results."""
95 # Cached based on a copy of fontprop so that later in-place mutations of
96 # the passed-in argument do not mess up the cache.
---> 97 return _get_text_metrics_with_cache_impl(
98 weakref.ref(renderer), text, fontprop.copy(), ismath, dpi)
File /opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/site-packages/matplotlib/text.py:105, in _get_text_metrics_with_cache_impl(renderer_ref, text, fontprop, ismath, dpi)
101 @functools.lru_cache(4096)
102 def _get_text_metrics_with_cache_impl(
103 renderer_ref, text, fontprop, ismath, dpi):
104 # dpi is unused, but participates in cache invalidation (via the renderer).
--> 105 return renderer_ref().get_text_width_height_descent(text, fontprop, ismath)
File /opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/site-packages/matplotlib/backends/backend_agg.py:230, in RendererAgg.get_text_width_height_descent(self, s, prop, ismath)
226 return super().get_text_width_height_descent(s, prop, ismath)
228 if ismath:
229 ox, oy, width, height, descent, font_image = \
--> 230 self.mathtext_parser.parse(s, self.dpi, prop)
231 return width, height, descent
233 font = self._prepare_font(prop)
File /opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/site-packages/matplotlib/mathtext.py:226, in MathTextParser.parse(self, s, dpi, prop)
222 # lru_cache can't decorate parse() directly because prop
223 # is mutable; key the cache using an internal copy (see
224 # text._get_text_metrics_with_cache for a similar case).
225 prop = prop.copy() if prop is not None else None
--> 226 return self._parse_cached(s, dpi, prop)
File /opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/site-packages/matplotlib/mathtext.py:247, in MathTextParser._parse_cached(self, s, dpi, prop)
244 if self._parser is None: # Cache the parser globally.
245 self.__class__._parser = _mathtext.Parser()
--> 247 box = self._parser.parse(s, fontset, fontsize, dpi)
248 output = _mathtext.ship(box)
249 if self._output_type == "vector":
File /opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/site-packages/matplotlib/_mathtext.py:1971, in Parser.parse(self, s, fonts_object, fontsize, dpi)
1968 result = self._expression.parseString(s)
1969 except ParseBaseException as err:
1970 # explain becomes a plain method on pyparsing 3 (err.explain(0)).
-> 1971 raise ValueError("\n" + ParseException.explain(err, 0)) from None
1972 self._state_stack = None
1973 self._in_subscript_or_superscript = False
ValueError:
\sqrt{|\text{standardized residuals}|}
^
ParseSyntaxException: Expected required_group, found '\' (at char 7), (line:1, col:8)
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
File /opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/site-packages/IPython/core/formatters.py:340, in BaseFormatter.__call__(self, obj)
338 pass
339 else:
--> 340 return printer(obj)
341 # Finally look for special method names
342 method = get_real_method(obj, self.print_method)
File /opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/site-packages/IPython/core/pylabtools.py:169, in retina_figure(fig, base64, **kwargs)
160 def retina_figure(fig, base64=False, **kwargs):
161 """format a figure as a pixel-doubled (retina) PNG
162
163 If `base64` is True, return base64-encoded str instead of raw bytes
(...)
167 base64 argument
168 """
--> 169 pngdata = print_figure(fig, fmt="retina", base64=False, **kwargs)
170 # Make sure that retina_figure acts just like print_figure and returns
171 # None when the figure is empty.
172 if pngdata is None:
File /opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/site-packages/IPython/core/pylabtools.py:152, in print_figure(fig, fmt, bbox_inches, base64, **kwargs)
149 from matplotlib.backend_bases import FigureCanvasBase
150 FigureCanvasBase(fig)
--> 152 fig.canvas.print_figure(bytes_io, **kw)
153 data = bytes_io.getvalue()
154 if fmt == 'svg':
File /opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/site-packages/matplotlib/backend_bases.py:2342, in FigureCanvasBase.print_figure(self, filename, dpi, facecolor, edgecolor, orientation, format, bbox_inches, pad_inches, bbox_extra_artists, backend, **kwargs)
2336 renderer = _get_renderer(
2337 self.figure,
2338 functools.partial(
2339 print_method, orientation=orientation)
2340 )
2341 with getattr(renderer, "_draw_disabled", nullcontext)():
-> 2342 self.figure.draw(renderer)
2344 if bbox_inches:
2345 if bbox_inches == "tight":
File /opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/site-packages/matplotlib/artist.py:95, in _finalize_rasterization.<locals>.draw_wrapper(artist, renderer, *args, **kwargs)
93 @wraps(draw)
94 def draw_wrapper(artist, renderer, *args, **kwargs):
---> 95 result = draw(artist, renderer, *args, **kwargs)
96 if renderer._rasterizing:
97 renderer.stop_rasterizing()
File /opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/site-packages/matplotlib/artist.py:72, in allow_rasterization.<locals>.draw_wrapper(artist, renderer)
69 if artist.get_agg_filter() is not None:
70 renderer.start_filter()
---> 72 return draw(artist, renderer)
73 finally:
74 if artist.get_agg_filter() is not None:
File /opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/site-packages/matplotlib/figure.py:3175, in Figure.draw(self, renderer)
3172 # ValueError can occur when resizing a window.
3174 self.patch.draw(renderer)
-> 3175 mimage._draw_list_compositing_images(
3176 renderer, self, artists, self.suppressComposite)
3178 for sfig in self.subfigs:
3179 sfig.draw(renderer)
File /opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/site-packages/matplotlib/image.py:131, in _draw_list_compositing_images(renderer, parent, artists, suppress_composite)
129 if not_composite or not has_images:
130 for a in artists:
--> 131 a.draw(renderer)
132 else:
133 # Composite any adjacent images together
134 image_group = []
File /opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/site-packages/matplotlib/artist.py:72, in allow_rasterization.<locals>.draw_wrapper(artist, renderer)
69 if artist.get_agg_filter() is not None:
70 renderer.start_filter()
---> 72 return draw(artist, renderer)
73 finally:
74 if artist.get_agg_filter() is not None:
File /opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/site-packages/matplotlib/axes/_base.py:3064, in _AxesBase.draw(self, renderer)
3061 if artists_rasterized:
3062 _draw_rasterized(self.figure, artists_rasterized, renderer)
-> 3064 mimage._draw_list_compositing_images(
3065 renderer, self, artists, self.figure.suppressComposite)
3067 renderer.close_group('axes')
3068 self.stale = False
File /opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/site-packages/matplotlib/image.py:131, in _draw_list_compositing_images(renderer, parent, artists, suppress_composite)
129 if not_composite or not has_images:
130 for a in artists:
--> 131 a.draw(renderer)
132 else:
133 # Composite any adjacent images together
134 image_group = []
File /opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/site-packages/matplotlib/artist.py:72, in allow_rasterization.<locals>.draw_wrapper(artist, renderer)
69 if artist.get_agg_filter() is not None:
70 renderer.start_filter()
---> 72 return draw(artist, renderer)
73 finally:
74 if artist.get_agg_filter() is not None:
File /opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/site-packages/matplotlib/axis.py:1396, in Axis.draw(self, renderer, *args, **kwargs)
1394 # Shift label away from axes to avoid overlapping ticklabels.
1395 self._update_label_position(renderer)
-> 1396 self.label.draw(renderer)
1398 self._update_offset_text_position(tlb1, tlb2)
1399 self.offsetText.set_text(self.major.formatter.get_offset())
File /opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/site-packages/matplotlib/artist.py:72, in allow_rasterization.<locals>.draw_wrapper(artist, renderer)
69 if artist.get_agg_filter() is not None:
70 renderer.start_filter()
---> 72 return draw(artist, renderer)
73 finally:
74 if artist.get_agg_filter() is not None:
File /opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/site-packages/matplotlib/text.py:752, in Text.draw(self, renderer)
749 renderer.open_group('text', self.get_gid())
751 with self._cm_set(text=self._get_wrapped_text()):
--> 752 bbox, info, descent = self._get_layout(renderer)
753 trans = self.get_transform()
755 # don't use self.get_position here, which refers to text
756 # position in Text:
File /opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/site-packages/matplotlib/text.py:386, in Text._get_layout(self, renderer)
384 clean_line, ismath = self._preprocess_math(line)
385 if clean_line:
--> 386 w, h, d = _get_text_metrics_with_cache(
387 renderer, clean_line, self._fontproperties,
388 ismath=ismath, dpi=self.figure.dpi)
389 else:
390 w = h = d = 0
File /opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/site-packages/matplotlib/text.py:97, in _get_text_metrics_with_cache(renderer, text, fontprop, ismath, dpi)
94 """Call ``renderer.get_text_width_height_descent``, caching the results."""
95 # Cached based on a copy of fontprop so that later in-place mutations of
96 # the passed-in argument do not mess up the cache.
---> 97 return _get_text_metrics_with_cache_impl(
98 weakref.ref(renderer), text, fontprop.copy(), ismath, dpi)
File /opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/site-packages/matplotlib/text.py:105, in _get_text_metrics_with_cache_impl(renderer_ref, text, fontprop, ismath, dpi)
101 @functools.lru_cache(4096)
102 def _get_text_metrics_with_cache_impl(
103 renderer_ref, text, fontprop, ismath, dpi):
104 # dpi is unused, but participates in cache invalidation (via the renderer).
--> 105 return renderer_ref().get_text_width_height_descent(text, fontprop, ismath)
File /opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/site-packages/matplotlib/backends/backend_agg.py:230, in RendererAgg.get_text_width_height_descent(self, s, prop, ismath)
226 return super().get_text_width_height_descent(s, prop, ismath)
228 if ismath:
229 ox, oy, width, height, descent, font_image = \
--> 230 self.mathtext_parser.parse(s, self.dpi, prop)
231 return width, height, descent
233 font = self._prepare_font(prop)
File /opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/site-packages/matplotlib/mathtext.py:226, in MathTextParser.parse(self, s, dpi, prop)
222 # lru_cache can't decorate parse() directly because prop
223 # is mutable; key the cache using an internal copy (see
224 # text._get_text_metrics_with_cache for a similar case).
225 prop = prop.copy() if prop is not None else None
--> 226 return self._parse_cached(s, dpi, prop)
File /opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/site-packages/matplotlib/mathtext.py:247, in MathTextParser._parse_cached(self, s, dpi, prop)
244 if self._parser is None: # Cache the parser globally.
245 self.__class__._parser = _mathtext.Parser()
--> 247 box = self._parser.parse(s, fontset, fontsize, dpi)
248 output = _mathtext.ship(box)
249 if self._output_type == "vector":
File /opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/site-packages/matplotlib/_mathtext.py:1971, in Parser.parse(self, s, fonts_object, fontsize, dpi)
1968 result = self._expression.parseString(s)
1969 except ParseBaseException as err:
1970 # explain becomes a plain method on pyparsing 3 (err.explain(0)).
-> 1971 raise ValueError("\n" + ParseException.explain(err, 0)) from None
1972 self._state_stack = None
1973 self._in_subscript_or_superscript = False
ValueError:
\sqrt{|\text{standardized residuals}|}
^
ParseSyntaxException: Expected required_group, found '\' (at char 7), (line:1, col:8)
<Figure size 500x300 with 1 Axes>
Breusch-Pagan homoskedasticity test#
https://en.wikipedia.org/wiki/Breusch-Pagan_test
Idea:
Lagrange Multiplier test for heteroscedasticity
# lm2.model.exog[0:3]
lm, p_lm, _, _ = sms.het_breuschpagan(lm2.resid, lm2.model.exog)
lm, p_lm
(6.254809752854315, 0.09985027888796541)
Goldfeld-Quandt homoskedasticity test#
https://en.wikipedia.org/wiki/Goldfeld-Quandt_test
# sort by score
idxs = doctors.sort_values("alc").index
# sort by scorehats
idxs = np.argsort(lm2.fittedvalues)
X = lm2.model.exog[idxs,:]
resid = lm2.resid[idxs]
# run the test
F, p_F, _ = sms.het_goldfeldquandt(resid, X, idx=1)
F, p_F
(1.5346515257083702, 0.03369473000286662)
Collinearity checks#
Check the correlation matrix#
corrM = doctors[["alc", "weed", "exrc"]].corr()
corrM.round(4)
alc | weed | exrc | |
---|---|---|---|
alc | 1.0000 | 0.0422 | 0.0336 |
weed | 0.0422 | 1.0000 | 0.0952 |
exrc | 0.0336 | 0.0952 | 1.0000 |
Recall that the correlation coefficient \(r_{\mathbf{x}_1,\mathbf{x}_2}\)
between two variables \(\mathbf{x}_1\) and \(\mathbf{x}_2\) is equivalent to
the square root of the \(R^2\) metric for the linear model x1 ~ x2
,
which has the intuitive interpretation as the proportion
of the variance in x1
that can be explained by the variability in x2
.
Let’s verify this equivalence by computing the correlation coefficient
\(r_{\texttt{alc},\texttt{exrc}}\) from a linear regression model of exrc
on alc
.
lm_alc_exrc = smf.ols("alc ~ exrc", data=doctors).fit()
R2_alc_exrc = lm_alc_exrc.rsquared
r_alc_exrc = np.sqrt(R2_alc_exrc).round(4)
r_alc_exrc, R2_alc_exrc
(0.0336, 0.0011306540709105084)
# # BONUS 1 (not covered anywhere earlier in the book)
# sns.pairplot(doctors[["alc", "weed", "exrc"]]);
# # BONUS 2 (not covered anywhere earlier in the book)
# sns.heatmap(corrM, annot=True, fmt='.2f', cmap="Grays");
Condition number#
lm2.condition_number
31.229721453770164
# ALT. (compute from scratch)
#######################################################
X = sm.add_constant(doctors[["alc","weed","exrc"]])
eigvals = np.linalg.eigvals(X.T @ X)
lam_max, lam_min = np.max(eigvals), np.min(eigvals)
np.sqrt(lam_max/lam_min)
31.229721453770264
Variance inflation factor#
from statsmodels.stats.outliers_influence \
import variance_inflation_factor as statsmodels_vif
X = sm.add_constant(doctors[["alc","weed","exrc"]])
statsmodels_vif(X, 1) # variation inflation factor of "alc" in lm2
1.0026692156304757
from statsmodels.stats.outliers_influence \
import variance_inflation_factor as vif
X = sm.add_constant(doctors[["alc","weed","exrc"]])
for i, var in enumerate(["alc ", "weed", "exrc"]):
print("VIF for", var, "=", vif(X, i+1))
VIF for alc = 1.0026692156304757
VIF for weed = 1.010703145007322
VIF for exrc = 1.010048809433661
How VIF is calculated under the hood (optional)#
def calc_vif(dmatrix, pred_idx):
"""
Calculate the variance inflation factor of the predictor
in column `pred_idx` of the design matrix `dmatrix`.
"""
n_cols = dmatrix.shape[1]
x_i = dmatrix[:, pred_idx]
mask = np.arange(n_cols) != pred_idx
X_noti = dmatrix[:, mask]
r_squared_i = sm.OLS(x_i, X_noti).fit().rsquared
vif = 1. / (1. - r_squared_i)
return vif
calc_vif(lm2.model.exog, 1), calc_vif(lm2.model.exog, 2), calc_vif(lm2.model.exog, 3)
(1.0026692156304757, 1.010703145007322, 1.010048809433661)
# ALT.
from ministats import calc_lm_vif
calc_lm_vif(lm2, "alc"), calc_lm_vif(lm2, "weed"), calc_lm_vif(lm2, "exrc")
(1.0026692156304757, 1.010703145007322, 1.010048809433661)
Statsmodels diagnostics plots#
Plot fit against one regressor#
see https://www.statsmodels.org/stable/generated/statsmodels.graphics.regressionplots.plot_fit.html
from statsmodels.graphics.api import plot_fit
with plt.rc_context({"figure.figsize":(9,2.6)}):
fig, (ax1,ax2,ax3) = plt.subplots(1,3, sharey=True)
plot_fit(lm2, "alc", ax=ax1);
plot_fit(lm2, "weed", ax=ax2);
plot_fit(lm2, "exrc", ax=ax3)
ax2.get_legend().remove()
ax3.get_legend().remove()
filename = os.path.join(DESTDIR, "statsmodels_panel_plot_fit.pdf")
savefigure(fig, filename)
Saved figure to figures/lm/interpreting/statsmodels_panel_plot_fit.pdf
Saved figure to figures/lm/interpreting/statsmodels_panel_plot_fit.png
Partial regression plot#
from statsmodels.graphics.api import plot_partregress
with plt.rc_context({"figure.figsize":(9,2.6)}):
fig, (ax1,ax2,ax3) = plt.subplots(1,3, sharey=True)
plot_partregress("score", "alc", exog_others=[], data=doctors, obs_labels=False, ax=ax1)
plot_partregress("score", "weed", exog_others=[], data=doctors, obs_labels=False, ax=ax2)
plot_partregress("score", "exrc", exog_others=[], data=doctors, obs_labels=False, ax=ax3)
filename = os.path.join(DESTDIR, "statsmodels_panel_plot_partregress.pdf")
savefigure(fig, filename)
Saved figure to figures/lm/interpreting/statsmodels_panel_plot_partregress.pdf
Saved figure to figures/lm/interpreting/statsmodels_panel_plot_partregress.png
CCPR plot#
component and component-plus-residual
see https://www.statsmodels.org/stable/generated/statsmodels.graphics.regressionplots.plot_ccpr.html
from statsmodels.graphics.api import plot_ccpr
with plt.rc_context({"figure.figsize":(9,2.6), "text.usetex":True}):
fig, (ax1,ax2,ax3) = plt.subplots(1,3)
fig.subplots_adjust(wspace=0.5)
plot_ccpr(lm2, "alc", ax=ax1)
ax1.set_title("")
ax1.set_ylabel(r"$\texttt{alc} \cdot \widehat{\beta}_{\texttt{alc}} \; + \; \mathbf{r}$")
plot_ccpr(lm2, "weed", ax=ax2)
ax2.set_title("")
ax2.set_ylabel(r"$\texttt{weed} \cdot \widehat{\beta}_{\texttt{weed}} \; + \; \mathbf{r}$", labelpad=-3)
plot_ccpr(lm2, "exrc", ax=ax3)
ax3.set_title("")
ax3.set_ylabel(r"$\texttt{exrc} \cdot \widehat{\beta}_{\texttt{exrc}} \; + \; \mathbf{r}$", labelpad=-3)
filename = os.path.join(DESTDIR, "statsmodels_panel_plot_ccpr.pdf")
savefigure(fig, filename)
Saved figure to figures/lm/interpreting/statsmodels_panel_plot_ccpr.pdf
Saved figure to figures/lm/interpreting/statsmodels_panel_plot_ccpr.png
All-in-on convenience method#
from statsmodels.graphics.api import plot_regress_exog
with plt.rc_context({"figure.figsize":(6,4)}):
plot_regress_exog(lm2, "alc");
Outliers and influential observations#
TODO: definitions
from scipy.stats import norm
npts = 30
beta_0 = 10
beta_x = 3
sigma = 1
np.random.seed(45)
xs = norm(loc=5, scale=2).rvs(npts)
ys = beta_0 + beta_x*xs + norm(0,sigma).rvs(npts)
dfxys = pd.DataFrame({"x":xs, "y":ys})
# outlier near middle
idx_mid = dfxys[dfxys["x"]>5].sort_values("x").head(1).index[0]
y1s = dfxys["y"].copy()
y1s[idx_mid] = 10
dfxys["y1"] = y1s
# high leverage points
dfxys["y2"] = ys
mask_hl = (dfxys["x"] < 1) | (dfxys["x"] > 7.5)
dfxys["hl"] = mask_hl
# dfxys.loc[idxs_hl,"hl"] = 1
# influential point near end
idx_last = dfxys.sort_values("x").tail(1).index[0]
y3s = dfxys["y"].copy()
y3s[idx_last] = 10
dfxys["y3"] = y3s
with plt.rc_context({"figure.figsize":(9,2.6), "text.usetex":True}):
fig, (ax1,ax2,ax3) = plt.subplots(1,3, sharey=True)
# fig.subplots_adjust(wspace=0.5)
# Outlier
# lmxy1 = smf.ols("y1 ~ x", data=dfxys).fit()
# lmxy1.params
sns.regplot(x="x", y="y1", ci=False, data=dfxys, ax=ax1)
ax1.set_ylabel("$y_1$")
ax1.set_xlabel("$x$")
ax1.set_xticks(range(0,12,2))
ax1.set_title("(a) Outlier")
# (B) High leverage
sns.regplot(x="x", y="y2", ci=False, data=dfxys, ax=ax2)
dfhl = dfxys[dfxys["hl"]==True]
sns.regplot(x="x", y="y2", ci=False, data=dfhl,
fit_reg=False, color="red", marker="D", ax=ax2)
ax2.set_ylabel("$y_2$")
ax2.set_xlabel("$x$")
ax2.set_xticks(range(0,12,2))
ax2.set_title("(b) High leverage points")
# (C) Influential
sns.regplot(x="x", y="y3", ci=False, data=dfxys, ax=ax3)
ax3.set_ylabel("$y_3$")
ax3.set_xlabel("$x$")
ax3.set_xticks(range(0,12,2))
ax3.set_title("(c) Influential observation")
filename = os.path.join(DESTDIR, "panel_outlier_hl_influential.pdf")
savefigure(fig, filename)
Saved figure to figures/lm/interpreting/panel_outlier_hl_influential.pdf
Saved figure to figures/lm/interpreting/panel_outlier_hl_influential.png
Leverage and influence metrics#
doctors.shape
(156, 12)
i = 0
doctors_no_i = doctors.drop(index=i)
doctors_no_i.shape
(155, 12)
Leverage score#
Manual calculations of leverage metric (optional)#
mtcars = sm.datasets.get_rdataset("mtcars", "datasets").data
# Fit model using all data points
lmcars = smf.ols("mpg ~ hp", data=mtcars).fit()
# Design matrix
X = sm.add_constant(mtcars["hp"]).values
# Hat matrix
H = X @ np.linalg.inv(X.T @ X) @ X.T
# First element of the diagonal form the hat matrix
h0 = np.diag(H)[0]
print(f"{h0 = }")
# Linear model fit without the first data point
lmcars_no0 = smf.ols("mpg ~ hp", data=mtcars[1:]).fit()
# Prediction of yhat0 from the model without the first data point
yhat0_no0 = lmcars_no0.predict({"hp":mtcars.iloc[0]["hp"]})[0]
print(f"{yhat0_no0 = }")
# Alternative formula h0 = 1 - r0 / r0_no0
r0 = lmcars.resid.iloc[0] # residual from the full model
r0_no0 = lmcars_no0.resid.iloc[0] # residual from model w/o first point
h0 = 1 - r0/r0_no0
print(f"{h0 = }")
# Alternative formula for h0 via `statsmodels`
infl_cars = lmcars.get_influence()
h0 = infl_cars.hat_matrix_diag[0]
print(f"{h0 = }")
h0 = 0.04048626926227573
yhat0_no0 = 22.660997545626714
h0 = 0.040486269262266505
h0 = 0.040486269262275755
Studentized residuals#
infl = lm2.get_influence()
infl.resid_studentized[0:5] # = infl2.resid_studentized_internal
array([ 0.98097556, -2.01341843, -1.60234556, -0.21972192, -1.28811914])
# obtained using Jackknife
infl.resid_studentized_external[0:5]
array([ 0.98085317, -2.03409243, -1.61072775, -0.21903275, -1.29094028])
Cook’s distance#
infl.cooks_distance[0][0:5]
array([0.02526116, 0.01331454, 0.01116562, 0.00017951, 0.0041123 ])
DFFITS diagnostic#
Calculating leverage and influence metrics using statsmodels
#
infl = lm2.get_influence()
infl_df = infl.summary_frame()
list(infl_df.columns)
['dfb_Intercept',
'dfb_alc',
'dfb_weed',
'dfb_exrc',
'cooks_d',
'standard_resid',
'hat_diag',
'dffits_internal',
'student_resid',
'dffits']
cols = ["student_resid", "hat_diag", "cooks_d", "dffits"]
print(infl_df[cols].round(3))
student_resid hat_diag cooks_d dffits
0 0.981 0.095 0.025 0.318
1 -2.034 0.013 0.013 -0.233
2 -1.611 0.017 0.011 -0.212
3 -0.219 0.015 0.000 -0.027
4 -1.291 0.010 0.004 -0.129
.. ... ... ... ...
151 0.873 0.009 0.002 0.083
152 -0.497 0.017 0.001 -0.065
153 0.279 0.023 0.000 0.043
154 2.320 0.040 0.055 0.475
155 0.353 0.025 0.001 0.056
[156 rows x 4 columns]
Leverage plots#
import statsmodels.api as sm
sm.graphics.plot_leverage_resid2(lm2);
filename = os.path.join(DESTDIR, "lm2_plot_leverage_resid2.pdf")
savefigure(plt.gcf(), filename)
Saved figure to figures/lm/interpreting/lm2_plot_leverage_resid2.pdf
Saved figure to figures/lm/interpreting/lm2_plot_leverage_resid2.png
Influence plots#
sm.graphics.influence_plot(lm2, criterion="cooks", size=4);
filename = os.path.join(DESTDIR, "lm2_influence_plot_cook.pdf")
savefigure(plt.gcf(), filename)
Saved figure to figures/lm/interpreting/lm2_influence_plot_cook.pdf
Saved figure to figures/lm/interpreting/lm2_influence_plot_cook.png
# ALT. use DFFITS
# sm.graphics.influence_plot(lm2, criterion="DFFITS", size=4);
Outlier tests (BONUS TOPIC)#
lm2.outlier_test().head()
student_resid | unadj_p | bonf(p) | |
---|---|---|---|
0 | 0.980853 | 0.328234 | 1.0 |
1 | -2.034092 | 0.043693 | 1.0 |
2 | -1.610728 | 0.109327 | 1.0 |
3 | -0.219033 | 0.826920 | 1.0 |
4 | -1.290940 | 0.198697 | 1.0 |
Model prediction accuracy#
new_data = {"alc":3, "weed":1, "exrc":8}
lm2.predict(new_data)
0 68.177355
dtype: float64
pred_score = lm2.get_prediction(new_data)
In-sample prediction accuracy#
# Compute the in-sample mean squared error (MSE)
scores = doctors['score']
scorehats = lm2.fittedvalues
mse = np.mean( (scores - scorehats)**2 )
mse
65.56013803717558
# ALT.
lm2.ssr/n
65.56013803717559
# ALT2.
from statsmodels.tools.eval_measures import mse
mse(scores,scorehats)
65.56013803717558
Out-of-sample prediction accuracy#
Leave-one-out cross-validation#
#######################################################
loo_preds = np.zeros(n)
for i in range(n):
doctors_no_i = doctors.drop(index=i)
lm2_no_i = smf.ols(formula,data=doctors_no_i).fit()
predictors_i = dict(doctors.loc[i,:])
pred_i = lm2_no_i.predict(predictors_i)[0]
loo_preds[i] = pred_i
# Calculate the out-of-sample mean squared error
mse_loo = np.mean( (doctors['score'] - loo_preds)**2 )
mse_loo
69.20054514710948
Compare with the in-sample MSE of the model which is lower.
lm2.ssr/n
65.56013803717559
Regularization#
doctors[["alc", "weed", "exrc"]].std()
alc 9.428506
weed 1.391068
exrc 4.796361
dtype: float64
# TRUE
# alc = - 1.8
# weed = - 0.5
# exrc = + 1.9
# FITTED
lm2.params
Intercept 60.452901
alc -1.800101
weed -1.021552
exrc 1.768289
dtype: float64
L1 regularization (LASSO)#
lm2_L1reg = smf.ols(formula, data=doctors) \
.fit_regularized(alpha=0.3, L1_wt=1.0)
lm2_L1reg.params
Intercept 58.930367
alc -1.763306
weed 0.000000
exrc 1.795228
dtype: float64
L2 regularization (ridge)#
lm2_L2reg = smf.ols(formula, data=doctors) \
.fit_regularized(alpha=0.02, L1_wt=0.0001)
lm2_L2reg.params
Intercept 56.127763
alc -1.655051
weed -0.769119
exrc 2.014540
dtype: float64
Explanations#
Calculating standard error of parameters#
lm2.bse
Intercept 1.289380
alc 0.069973
weed 0.476166
exrc 0.138056
dtype: float64
# construct the design matrix for the model
X = sm.add_constant(doctors[["alc","weed","exrc"]])
# calculate the diagonal of the inverse-covariance matrix
inv_covs = np.diag(np.linalg.inv(X.T.dot(X)))
sigmahat*np.sqrt(inv_covs)
array([1.28938008, 0.06997301, 0.4761656 , 0.13805557])
# these lead to approx. same result because predictors are not correlated
# but the correct formula is to use the inv. covariance matrix as above.
sum_alc_dev2 = np.sum((doctors["alc"] - doctors["alc"].mean())**2)
sum_weed_dev2 = np.sum((doctors["weed"] - doctors["weed"].mean())**2)
sum_exrc_dev2 = np.sum((doctors["exrc"] - doctors["exrc"].mean())**2)
se_b_alc = sigmahat/np.sqrt(sum_alc_dev2)
se_b_weed = sigmahat/np.sqrt(sum_weed_dev2)
se_b_exrc = sigmahat/np.sqrt(sum_exrc_dev2)
se_b_alc, se_b_weed, se_b_exrc
(0.06987980522527788, 0.47363764322385443, 0.13736710473639846)
Discussion#
Towards machine learning#
The out-of-sample prediction accuracy is a common metric used in machine learning (ML) tasks.
Exercises#
E4.LOGNORM Non-normal error term#
from scipy.stats import uniform, lognorm
np.random.seed(42)
xs = uniform(0,10).rvs(100)
ys = 2*xs + lognorm(1).rvs(100)
df = pd.DataFrame({"x":xs, "y":ys})
lm = smf.ols("y ~ x", data=df).fit()
qqplot(lm.resid, line="s");
# sns.scatterplot(x=xs, y=lm.resid);
# from ministats import plot_lm_scale_loc
# plot_lm_scale_loc(lm);
E4.DEP Dependent error term#
E4.NL Nonlinear relationship#
from scipy.stats import uniform, lognorm
np.random.seed(42)
xs = uniform(0,10).rvs(100)
ys = 2*xs + xs**2 + norm(0,1).rvs(100)
df = pd.DataFrame({"x":xs, "y":ys})
lm = smf.ols("y ~ x", data=df).fit()
lm.params
Intercept -14.684585
x 11.688188
dtype: float64
sns.scatterplot(x=xs, y=ys)
sns.lineplot(x=xs, y=lm.fittedvalues);
sns.regplot(x=xs, y=lm.resid, lowess=True);
qqplot(lm.resid, line="s");
E4.H Heteroskedasticity#
from scipy.stats import uniform, norm
np.random.seed(43)
xs = np.sort(uniform(0,10).rvs(100))
sigmas = np.linspace(1,20,100)
ys = 2*xs + norm(loc=0,scale=sigmas).rvs(100)
df = pd.DataFrame({"x":xs, "y":ys})
lm = smf.ols("y ~ x", data=df).fit()
from ministats import plot_lm_scale_loc
plot_lm_scale_loc(lm);
E4.v Collinearity#
from scipy.stats import uniform, norm
np.random.seed(45)
x1s = uniform(0,10).rvs(100)
alpha = 0.8
x2s = alpha*x1s + (1-alpha)*uniform(0,10).rvs(100)
ys = 2*x1s + 3*x2s + norm(0,1).rvs(100)
df = pd.DataFrame({"x1":x1s, "x2":x2s, "y":ys})
lm = smf.ols("y ~ x1 + x2", data=df).fit()
lm.params, lm.bse
(Intercept 0.220927
x1 1.985774
x2 3.006225
dtype: float64,
Intercept 0.253988
x1 0.137253
x2 0.167492
dtype: float64)
lm.condition_number
24.291898486251146
calc_lm_vif(lm, "x1"), calc_lm_vif(lm, "x2")
(17.17244980334442, 17.17244980334442)
Links#
More details about model checking https://ethanweed.github.io/pythonbook/05.04-regression.html#model-checking
Statistical Modeling: The Two Cultures paper that explains the importance of out-of-sample predictions for statistical modelling.
https://projecteuclid.org/journals/statistical-science/volume-16/issue-3/Statistical-Modeling–The-Two-Cultures-with-comments-and-a/10.1214/ss/1009213726.full
CUT MATERIAL#
import numpy as np
import matplotlib.pyplot as plt
import statsmodels.api as sm
# Generate some data
np.random.seed(0)
X = np.random.normal(size=100)
y = 2 * X + np.random.normal(size=100)
# Fit a regression model
model = sm.OLS(y, sm.add_constant(X))
results = model.fit()
# Compute residuals and fitted values
residuals = results.resid
fitted = results.fittedvalues
standardized_residuals = residuals / np.std(residuals)
# Create scale-location plot
plt.figure(figsize=(8, 6))
plt.scatter(fitted, np.sqrt(np.abs(standardized_residuals)), alpha=0.5)
plt.xlabel('Fitted values')
plt.ylabel(r'$\sqrt{|Standardized Residuals|}$')
plt.title('Scale-Location Plot')
plt.grid(True)
plt.show()
Example dataset#
mtcars = sm.datasets.get_rdataset("mtcars", "datasets").data
mtcars
lmcars = smf.ols("mpg ~ hp", data=mtcars).fit()
sns.scatterplot(x=mtcars["hp"], y=mtcars["mpg"])
sns.lineplot(x=mtcars["hp"], y=lmcars.fittedvalues);
Compute the variance/covariance matrix#
lm2.cov_params()
# == lm2.scale * np.linalg.inv(X.T @ X)
# where X = sm.add_constant(doctors[["alc","weed","exrc"]])
Intercept | alc | weed | exrc | |
---|---|---|---|---|
Intercept | 1.662501 | -0.055601 | -0.093711 | -0.095403 |
alc | -0.055601 | 0.004896 | -0.001305 | -0.000288 |
weed | -0.093711 | -0.001305 | 0.226734 | -0.006177 |
exrc | -0.095403 | -0.000288 | -0.006177 | 0.019059 |