# Section 4.2 — Multiple linear regression#

This notebook contains the code examples from Section 4.2, Multiple linear regression, of the No Bullshit Guide to Statistics.

## Notebook setup#

# load Python modules
import os
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Figures setup
plt.clf()  # needed otherwise sns.set_theme doesn't work
from plot_helpers import RCPARAMS
# RCPARAMS.update({"figure.figsize": (10, 4)})   # good for screen
RCPARAMS.update({"figure.figsize": (5, 2.3)})  # good for print
sns.set_theme(
    context="paper",
    style="whitegrid",
    palette="colorblind",
    rc=RCPARAMS,
)

%config InlineBackend.figure_format = "retina"

# Where to store figures
DESTDIR = "figures/lm/multiple"


# set random seed for repeatability
np.random.seed(42)


## Doctors dataset#

doctors = pd.read_csv("../datasets/doctors.csv")
doctors.shape

(156, 9)

doctors.head()

   permit  loc  work  hours  caf  alc  weed  exrc  score
0   93273  rur  hos      21    2    0   5.0   0.0     63
1   90852  urb  cli      74   26   20   0.0   4.5     16
2   92744  urb  hos      63   25    1   0.0   7.0     58
3   73553  urb  eld      77   36    4   0.0   2.0     55
4   82441  rur  cli      36   22    9   0.0   7.5     47

## Multiple linear regression model#

$$\newcommand{\Err}{ {\Large \varepsilon}}$$

$Y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_p x_p + \Err,$

where $p$ is the number of predictors and $\Err$ represents Gaussian noise $\Err \sim \mathcal{N}(0,\sigma)$.

• (LIN) Linearity: the outcome is a linear function of the predictors.

• (INDEPɛ) Independent errors: the noise terms are independent of each other.

• (NORMɛ) Normally distributed errors: $\Err \sim \mathcal{N}(0,\sigma)$.

• (EQVARɛ) Equal variance of errors: the noise has the same standard deviation $\sigma$ for all values of the predictors.

• (NOCOL) No collinearity: no predictor is a linear combination of the other predictors.
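To make these assumptions concrete, here is a minimal self-contained sketch (synthetic data and made-up coefficients, using plain numpy least squares rather than statsmodels) that generates data satisfying the model and then recovers the parameters:

```python
import numpy as np

rng = np.random.default_rng(42)
n = 1000

# Two made-up predictors (not collinear: NOCOL)
x1 = rng.uniform(0, 10, n)
x2 = rng.uniform(0, 10, n)

# Made-up true parameters
beta0, beta1, beta2, sigma = 2.0, -1.5, 0.7, 1.0

# Gaussian noise: independent (INDEPe), normal (NORMe), equal variance (EQVARe)
eps = rng.normal(0, sigma, n)

# Linear model (LIN)
y = beta0 + beta1*x1 + beta2*x2 + eps

# Least squares fit recovers the true parameters (approximately)
X = np.column_stack([np.ones(n), x1, x2])
betahat, *_ = np.linalg.lstsq(X, y, rcond=None)
print(betahat)  # close to [2.0, -1.5, 0.7]
```

With a larger sample size n, the estimates converge to the true parameters.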

## Example: linear model for doctors’ sleep scores#

We want to know the influence of drinking alcohol, smoking weed, and exercising on doctors' sleep scores.

import statsmodels.formula.api as smf

formula = "score ~ 1 + alc + weed + exrc"
lm2 = smf.ols(formula, data=doctors).fit()
lm2.params

Intercept    60.452901
alc          -1.800101
weed         -1.021552
exrc          1.768289
dtype: float64
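Given the fitted coefficients, a prediction is just a linear combination of the predictor values. A quick hand computation for an illustrative (made-up) doctor who has 2 alcoholic drinks per week, no weed, and 8 hours of exercise:

```python
# Coefficients copied from the lm2.params output above
b0, b_alc, b_weed, b_exrc = 60.4529, -1.8001, -1.0216, 1.7683

# Illustrative predictor values: alc=2, weed=0, exrc=8
score_hat = b0 + b_alc*2 + b_weed*0 + b_exrc*8
print(round(score_hat, 1))  # 71.0
```

The same prediction can be obtained directly from the fitted model with `lm2.predict({"alc":[2], "weed":[0], "exrc":[8]})`.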

sns.barplot(x=lm2.params.values[1:],
            y=lm2.params.index[1:]);


### Partial regression plots#

#### Partial regression plot for the predictor alc#

Step 1: Obtain the data for the x-axis: the residuals of the model `alc ~ 1 + weed + exrc`, which regresses the predictor of interest on all the other predictors.

lm_alc = smf.ols("alc~1+weed+exrc", data=doctors).fit()
xrs = lm_alc.resid


Step 2: Obtain the data for the y-axis: the residuals of the model `score ~ 1 + weed + exrc`, which regresses the outcome on all the other predictors.

lm_score = smf.ols("score~1+weed+exrc", data=doctors).fit()
yrs = lm_score.resid


Step 3: Fit a linear model for the y-residuals versus the x-residuals.

dfrs = pd.DataFrame({"xrs": xrs, "yrs": yrs})
lm_resids = smf.ols("yrs ~ 1 + xrs", data=dfrs).fit()


Step 4: Draw a scatter plot of the y-residuals versus the x-residuals, together with the best-fitting line.

from ministats import plot_reg

ax = sns.scatterplot(x=xrs, y=yrs, color="C0")
plot_reg(lm_resids, ax=ax);
ax.set_xlabel("alc~1+weed+exrc  residuals")
ax.set_ylabel("score~1+weed+exrc  residuals")
ax.set_title("Partial regression plot for alc");


The slope parameter for the residuals model is the same as the slope parameter of the alc predictor in the original model lm2.

lm_resids.params["xrs"], lm2.params["alc"]

(-1.8001013152459413, -1.8001013152459402)
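This equality is not a coincidence: it is the Frisch–Waugh–Lovell theorem. A self-contained numpy check on synthetic data (made-up coefficients, plain least squares instead of statsmodels):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
x1 = rng.normal(size=n)
x2 = 0.5*x1 + rng.normal(size=n)      # predictors deliberately correlated
y = 3 + 2*x1 - 1*x2 + rng.normal(size=n)

ones = np.ones(n)

# Full model: y ~ 1 + x1 + x2
X = np.column_stack([ones, x1, x2])
b_full, *_ = np.linalg.lstsq(X, y, rcond=None)

# Residualize x1 and y on the other predictor x2
Z = np.column_stack([ones, x2])
xrs = x1 - Z @ np.linalg.lstsq(Z, x1, rcond=None)[0]
yrs = y - Z @ np.linalg.lstsq(Z, y, rcond=None)[0]

# Slope of y-residuals vs. x-residuals (both are centered, so no intercept needed)
slope = (xrs @ yrs) / (xrs @ xrs)

print(np.isclose(b_full[1], slope))  # True: same slope as in the full model
```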


#### Partial regression plot for the predictor weed#

from ministats import plot_partreg
plot_partreg(lm2, pred="weed");


#### Partial regression plot for the predictor exrc#

plot_partreg(lm2, pred="exrc");


#### (BONUS TOPIC) Partial regression plots using statsmodels#

The function plot_partregress defined in statsmodels.graphics.api performs the same steps as the function plot_partreg we used above, but is a little more awkward to use.

When calling the function plot_partregress, you must provide the following arguments:

• the outcome variable (endog)

• the predictor you’re interested in (exog_i)

• the predictors you want to regress out (exog_others)

• the data frame that contains all these variables (data)

from statsmodels.graphics.api import plot_partregress

with plt.rc_context({"figure.figsize":(12,3)}):
    fig, (ax1, ax2, ax3) = plt.subplots(1, 3, sharey=True)
    plot_partregress("score", "alc",  exog_others=["weed", "exrc"], data=doctors, obs_labels=False, ax=ax1)
    plot_partregress("score", "weed", exog_others=["alc",  "exrc"], data=doctors, obs_labels=False, ax=ax2)
    plot_partregress("score", "exrc", exog_others=["alc",  "weed"], data=doctors, obs_labels=False, ax=ax3)


The notation $|\textrm{X}$ you see in the axis labels stands for "given all other predictors," and is different for each subplot. For example, in the leftmost partial regression plot, the predictor we are focusing on is alc, so the other variables ($\textrm{X}$) are weed and exrc.

### Plot residuals#

from ministats import plot_resid
plot_resid(lm2);

fig, (ax1,ax2,ax3) = plt.subplots(1, 3, sharey=True, figsize=(9,2.5))
plot_resid(lm2, pred="alc",  ax=ax1)
plot_resid(lm2, pred="weed", ax=ax2)
plot_resid(lm2, pred="exrc", ax=ax3);
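The reason a well-specified model shows no trend in these plots: OLS residuals are, by construction, orthogonal to every column of the design matrix. A minimal numpy demonstration on synthetic data (made-up coefficients, plain least squares instead of statsmodels):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100
x = rng.uniform(0, 10, n)
y = 4 + 2*x + rng.normal(0, 1, n)

# Fit y ~ 1 + x by least squares and compute the residuals
X = np.column_stack([np.ones(n), x])
b, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ b

print(np.isclose(resid.sum(), 0, atol=1e-6))  # True: residuals sum to zero
print(np.isclose(resid @ x, 0, atol=1e-6))    # True: orthogonal to the predictor
```

Any visible trend in a residuals-versus-predictor plot therefore signals model misspecification, not a property of the fitting procedure.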


### Model summary table#

lm2.summary()

Dep. Variable:             score   R-squared:             0.842
Model:                       OLS   Adj. R-squared:        0.839
Method:            Least Squares   F-statistic:           270.3
Date:           Fri, 19 Jul 2024   Prob (F-statistic): 1.05e-60
Time:                   14:52:11   Log-Likelihood:      -547.63
No. Observations:            156   AIC:                   1103.
Df Residuals:                152   BIC:                   1115.
Df Model:                      3
Covariance Type:       nonrobust

              coef   std err         t     P>|t|    [0.025    0.975]
Intercept  60.4529     1.289    46.885     0.000    57.905    63.000
alc        -1.8001     0.070   -25.726     0.000    -1.938    -1.662
weed       -1.0216     0.476    -2.145     0.034    -1.962    -0.081
exrc        1.7683     0.138    12.809     0.000     1.496     2.041

Omnibus:           1.14   Durbin-Watson:     1.828
Prob(Omnibus):    0.565   Jarque-Bera (JB):    0.9
Skew:             0.182   Prob(JB):          0.638
Kurtosis:         3.075   Cond. No.           31.2

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
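Some quantities in the summary table can be reproduced by hand. For example, the adjusted R-squared penalizes $R^2$ for the number of predictors, $\bar{R}^2 = 1 - (1-R^2)\frac{n-1}{n-p-1}$:

```python
# Values taken from the summary table above
n, p, R2 = 156, 3, 0.842

# Adjusted R-squared: penalizes R2 for the number of predictors p
adjR2 = 1 - (1 - R2)*(n - 1)/(n - p - 1)
print(round(adjR2, 3))  # 0.839, matching the summary table
```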

## Explanations#

### Nonlinear terms in linear regression#

#### Example: polynomial regression#

howell30 = pd.read_csv("../datasets/howell30.csv")
len(howell30)

270

# Fit quadratic model
formula2 = "height ~ 1 + age + np.square(age)"
lmq = smf.ols(formula2, data=howell30).fit()
lmq.params

Intercept         64.708568
age                7.100854
np.square(age)    -0.137302
dtype: float64
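Since the quadratic coefficient is negative, the fitted curve is a downward-opening parabola with a maximum at age $-\beta_1/(2\beta_2)$. Using the coefficients printed above:

```python
# Coefficients copied from the lmq.params output above
b_lin, b_quad = 7.100854, -0.137302

# Age at which the fitted parabola reaches its maximum
age_peak = -b_lin / (2*b_quad)
print(round(age_peak, 1))  # 25.9
```

The model predicts that height starts decreasing after about age 26, which is an artifact of forcing a parabola onto growth data rather than a real biological effect.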

# Plot the data
sns.scatterplot(data=howell30, x="age", y="height");

# Plot the best-fit quadratic model
ages = np.linspace(0.1, howell30["age"].max())
intercept, b_lin, b_quad = lmq.params   # unpack the fitted coefficients
heighthats = intercept + b_lin*ages + b_quad*ages**2
sns.lineplot(x=ages, y=heighthats, color="b");

# ALT. using plot_reg function
from ministats import plot_reg
plot_reg(lmq);


### Feature engineering and transformed variables#

#### Bonus example: polynomial regression up to degree 3#

formula3 = "height ~ 1 + age + np.power(age,2) + np.power(age,3)"
exlm3 = smf.ols(formula3, data=howell30).fit()
exlm3.params

Intercept           63.461484
age                  7.636139
np.power(age, 2)    -0.183218
np.power(age, 3)     0.001033
dtype: float64
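A prediction from the cubic model is again just a polynomial evaluation. For example, using the coefficients printed above at an illustrative age of 20:

```python
# Coefficients copied from the exlm3.params output above
b0, b1, b2, b3 = 63.461484, 7.636139, -0.183218, 0.001033

age = 20
height_hat = b0 + b1*age + b2*age**2 + b3*age**3
print(round(height_hat, 1))  # 151.2
```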

sns.scatterplot(data=howell30, x="age", y="height")
sns.lineplot(x=ages, y=exlm3.predict({"age":ages}));


## Exercises#

### Exercise E??: marketing dataset#

marketing = pd.read_csv("../datasets/exercises/marketing.csv")
