Section 4.2 — Multiple linear regression#

This notebook contains the code examples from Section 4.2 Multiple linear regression from the No Bullshit Guide to Statistics.

Notebook setup#

# load Python modules
import os
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
# Figures setup
plt.clf()  # needed otherwise `sns.set_theme` doesn't work
from plot_helpers import RCPARAMS
# RCPARAMS.update({"figure.figsize": (10, 4)})   # good for screen
RCPARAMS.update({"figure.figsize": (5, 2.3)})  # good for print
sns.set_theme(
    context="paper",
    style="whitegrid",
    palette="colorblind",
    rc=RCPARAMS,
)

# High-resolution please
%config InlineBackend.figure_format = "retina"

# Where to store figures
DESTDIR = "figures/lm/multiple"
# set random seed for repeatability
np.random.seed(42)

Definitions#

Doctors dataset#

doctors = pd.read_csv("../datasets/doctors.csv")
doctors.shape
(156, 9)
doctors.head()
   permit  loc  work  hours  caf  alc  weed  exrc  score
0   93273  rur  hos      21    2    0   5.0   0.0     63
1   90852  urb  cli      74   26   20   0.0   4.5     16
2   92744  urb  hos      63   25    1   0.0   7.0     58
3   73553  urb  eld      77   36    4   0.0   2.0     55
4   82441  rur  cli      36   22    9   0.0   7.5     47
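For a quick overview of the numeric variables in the dataset, we can also print their summary statistics (a small addition to the original example):

# summary statistics of the numeric variables in the doctors dataset
doctors.describe().round(2)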

Multiple linear regression model#

\(\newcommand{\Err}{ {\Large \varepsilon}}\)

\[ Y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_p x_p + \Err, \]

where \(p\) is the number of predictors and \(\Err\) represents Gaussian noise \(\Err \sim \mathcal{N}(0,\sigma)\).
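To make the roles of the parameters concrete, here is a minimal sketch that simulates data from this model with \(p=2\) predictors; the coefficient values, noise level, and sample size below are made up for illustration.

# simulate n observations from a multiple linear regression model with p = 2
n = 100                                            # sample size (made up)
beta0, beta1, beta2, sigma = 2.0, 0.5, -1.3, 0.7   # made-up parameters
x1 = np.random.uniform(0, 10, size=n)              # first predictor
x2 = np.random.uniform(0, 5, size=n)               # second predictor
eps = np.random.normal(0, sigma, size=n)           # Gaussian noise
y = beta0 + beta1*x1 + beta2*x2 + eps              # outcome values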

Model assumptions#

  • (LIN) linearity: the outcome depends linearly on the predictors

  • (INDEPɛ) the noise terms are independent of each other

  • (NORMɛ) the noise is normally distributed

  • (EQVARɛ) the noise has the same variance for all values of the predictors

  • (NOCOL) the predictors are not collinear with each other (see the variance inflation factor check sketched below)
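One common way to screen for the no-collinearity assumption (NOCOL) is to compute variance inflation factors (VIFs). This check is an addition to the original example; a minimal sketch using statsmodels' variance_inflation_factor, where values near 1 suggest no collinearity and values above roughly 5 to 10 are usually considered problematic:

# check the (NOCOL) assumption using variance inflation factors
from statsmodels.stats.outliers_influence import variance_inflation_factor

# design matrix with the three predictors used below, plus an intercept column
X = doctors[["alc", "weed", "exrc"]].assign(const=1.0)
for i, name in enumerate(["alc", "weed", "exrc"]):
    vif = variance_inflation_factor(X.values, i)
    print(name, "VIF =", round(vif, 2))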

Example: linear model for doctors’ sleep scores#

We want to know the influence of drinking alcohol, smoking weed, and exercising on doctors' sleep scores.

import statsmodels.formula.api as smf

formula = "score ~ 1 + alc + weed + exrc"
lm2 = smf.ols(formula, data=doctors).fit()
lm2.params
Intercept    60.452901
alc          -1.800101
weed         -1.021552
exrc          1.768289
dtype: float64
sns.barplot(x=lm2.params.values[1:],
            y=lm2.params.index[1:]);
[Figure: bar plot of the fitted slope coefficients for alc, weed, and exrc]
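The fitted model can also be used to predict the sleep score of a new doctor from their alc, weed, and exrc values; the numbers below are made up for illustration.

# predicted sleep score for a hypothetical doctor
newdoc = pd.DataFrame({"alc": [3], "weed": [1], "exrc": [8]})
lm2.predict(newdoc)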

Partial regression plots#

Partial regression plot for the predictor alc#

Step 1: Obtain the data for the x-axis, residuals of the model alc ~ 1 + others

#######################################################
lm_alc = smf.ols("alc~1+weed+exrc", data=doctors).fit()
xrs = lm_alc.resid

Step 2: Obtain the data for the y-axis, residuals of the model score ~ 1 + others

#######################################################
lm_score=smf.ols("score~1+weed+exrc",data=doctors).fit()
yrs = lm_score.resid

Step 3: Fit a linear model for the y-residuals versus the x-residuals.

dfrs = pd.DataFrame({"xrs": xrs, "yrs": yrs})
lm_resids = smf.ols("yrs ~ 1 + xrs", data=dfrs).fit()

Step 4: Draw a scatter plot of the residuals and the best-fitting linear model to the residuals.

from ministats import plot_reg

ax = sns.scatterplot(x=xrs, y=yrs, color="C0")
plot_reg(lm_resids, ax=ax);
ax.set_xlabel("alc~1+weed+exrc  residuals")
ax.set_ylabel("score~1+weed+exrc  residuals")
ax.set_title("Partial regression plot for `alc`");
[Figure: partial regression plot for alc: score residuals versus alc residuals with the best-fit line]

The slope parameter for the residuals model is the same as the slope parameter of the alc predictor in the original model lm2.

lm_resids.params["xrs"], lm2.params["alc"]
(-1.8001013152459413, -1.8001013152459402)

Partial regression plot for the predictor weed#

from ministats import plot_partreg
plot_partreg(lm2, pred="weed");
[Figure: partial regression plot for weed]

Partial regression plot for the predictor exrc#

plot_partreg(lm2, pred="exrc");
[Figure: partial regression plot for exrc]

(BONUS TOPIC) Partial regression plots using statsmodels#

The function plot_partregress defined in statsmodels.graphics.api performs the same steps as the function plot_partreg we used above, but it is a little more awkward to use.

When calling the function plot_partregress, you must provide the following arguments:

  • the outcome variable (endog)

  • the predictor you’re interested in (exog_i)

  • the predictors you want to regress out (exog_others)

  • the data frame that contains all these variables (data)

from statsmodels.graphics.api import plot_partregress

with plt.rc_context({"figure.figsize":(12,3)}):
    fig, (ax1, ax2, ax3) = plt.subplots(1,3, sharey=True)
    plot_partregress("score", "alc",  exog_others=["weed", "exrc"], data=doctors, obs_labels=False, ax=ax1)
    plot_partregress("score", "weed", exog_others=["alc",  "exrc"], data=doctors, obs_labels=False, ax=ax2)
    plot_partregress("score", "exrc", exog_others=["alc",  "weed"], data=doctors, obs_labels=False, ax=ax3)
[Figure: partial regression plots for alc, weed, and exrc produced by statsmodels plot_partregress]

The notation \(|\textrm{X}\) you see in the axis labels stands for “given all other predictors,” and is different for each subplot. For example, in the leftmost partial regression plot, the predictor we are focussing on is alc, so the other variables (\(|\textrm{X}\)) are weed and exrc.

Plot residuals#

from ministats import plot_resid
plot_resid(lm2);
[Figure: residual plot for the model lm2]
fig, (ax1,ax2,ax3) = plt.subplots(1, 3, sharey=True, figsize=(9,2.5))
plot_resid(lm2, pred="alc",  ax=ax1)
plot_resid(lm2, pred="weed", ax=ax2)
plot_resid(lm2, pred="exrc", ax=ax3);
[Figure: residual plots for lm2 against each of the predictors alc, weed, and exrc]
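The same kind of plot can be drawn without the helper function, since the fitted model exposes its fitted values and residuals directly; a minimal sketch of a residuals-versus-fitted-values plot:

# residuals versus fitted values, drawn from the fitted model's attributes
ax = sns.scatterplot(x=lm2.fittedvalues, y=lm2.resid)
ax.axhline(0, color="gray", linestyle="--")
ax.set_xlabel("fitted values")
ax.set_ylabel("residuals");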

Model summary table#

lm2.summary()
                            OLS Regression Results
==============================================================================
Dep. Variable:                  score   R-squared:                       0.842
Model:                            OLS   Adj. R-squared:                  0.839
Method:                 Least Squares   F-statistic:                     270.3
Date:                Fri, 19 Jul 2024   Prob (F-statistic):           1.05e-60
Time:                        14:52:11   Log-Likelihood:                -547.63
No. Observations:                 156   AIC:                             1103.
Df Residuals:                     152   BIC:                             1115.
Df Model:                           3
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept     60.4529      1.289     46.885      0.000      57.905      63.000
alc           -1.8001      0.070    -25.726      0.000      -1.938      -1.662
weed          -1.0216      0.476     -2.145      0.034      -1.962      -0.081
exrc           1.7683      0.138     12.809      0.000       1.496       2.041
==============================================================================
Omnibus:                        1.140   Durbin-Watson:                   1.828
Prob(Omnibus):                  0.565   Jarque-Bera (JB):                0.900
Skew:                           0.182   Prob(JB):                        0.638
Kurtosis:                       3.075   Cond. No.                         31.2
==============================================================================


Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
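The quantities reported in the summary table are also available as attributes and methods of the fitted results object, which is convenient when you need them programmatically; a short sketch:

# selected quantities from the summary table, accessed programmatically
print("R-squared:     ", lm2.rsquared)
print("Adj. R-squared:", lm2.rsquared_adj)
print("F-statistic:   ", lm2.fvalue, "  p-value:", lm2.f_pvalue)
print("AIC:", lm2.aic, "  BIC:", lm2.bic)
lm2.conf_int()  # 95% confidence intervals for the model parameters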

Explanations#

Nonlinear terms in linear regression#

Example: polynomial regression#

howell30 = pd.read_csv("../datasets/howell30.csv")
len(howell30)
270
# Fit quadratic model
formula2 = "height ~ 1 + age + np.square(age)"
lmq = smf.ols(formula2, data=howell30).fit()
lmq.params
Intercept         64.708568
age                7.100854
np.square(age)    -0.137302
dtype: float64
# Plot the data
sns.scatterplot(data=howell30, x="age", y="height");

# Plot the best-fit quadratic model
intercept, b_lin, b_quad = lmq.params
ages = np.linspace(0.1, howell30["age"].max())
heighthats = intercept + b_lin*ages + b_quad*ages**2
sns.lineplot(x=ages, y=heighthats, color="b");
[Figure: scatter plot of height versus age with the best-fit quadratic model]
# ALT. using `plot_reg` function
from ministats import plot_reg
plot_reg(lmq);
[Figure: the same quadratic fit drawn using plot_reg]

Feature engineering and transformed variables#
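Model formulas accept arbitrary Python functions applied to the variables, which is the mechanism behind the np.square and np.power terms used in this section. Here is a minimal sketch of the syntax; the square-root transformation of age is chosen only to illustrate the idea, not because it is the best model for this data.

# a model with a transformed predictor: height as a linear function of sqrt(age)
formula_sqrt = "height ~ 1 + np.sqrt(age)"
lm_sqrt = smf.ols(formula_sqrt, data=howell30).fit()
lm_sqrt.params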

Bonus example: polynomial regression up to degree 3#

formula3 = "height ~ 1 + age + np.power(age,2) + np.power(age,3)"
exlm3 = smf.ols(formula3, data=howell30).fit()
exlm3.params
Intercept           63.461484
age                  7.636139
np.power(age, 2)    -0.183218
np.power(age, 3)     0.001033
dtype: float64
sns.scatterplot(data=howell30, x="age", y="height")
sns.lineplot(x=ages, y=exlm3.predict({"age":ages}));
[Figure: scatter plot of height versus age with the best-fit cubic model]
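To judge whether the cubic term improves on the quadratic model, one option is to compare the information criteria of the two fits (lower values indicate a better trade-off between fit and complexity); a quick sketch:

# compare the quadratic and cubic fits using information criteria
print("quadratic: AIC =", round(lmq.aic, 1),   " BIC =", round(lmq.bic, 1))
print("cubic:     AIC =", round(exlm3.aic, 1), " BIC =", round(exlm3.bic, 1))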

Discussion#

Exercises#

Exercise E??: marketing dataset#

marketing = pd.read_csv("../datasets/exercises/marketing.csv")
formula_mkt = "sales ~ 1 + youtube + facebook + newspaper"
lm_mkt2 = smf.ols(formula_mkt, data=marketing).fit()
lm_mkt2.params
Intercept    3.526667
youtube      0.045765
facebook     0.188530
newspaper   -0.001037
dtype: float64
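To explore the fit further, the diagnostic tools from this section can be reused on the marketing model, for example a partial regression plot for the newspaper predictor:

# partial regression plot for the `newspaper` predictor
plot_partreg(lm_mkt2, pred="newspaper");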