Introduction to linear models#

import pandas as pd
import seaborn as sns
players_full = pd.read_csv("../datasets/players_full.csv")

Plot of linear model for time ~ 1 + age#

sns.lmplot(x="age", y="time", data=players_full)
[Figure: scatter plot of time versus age with a fitted regression line]
import statsmodels.formula.api as smf

model1 = smf.ols('time ~ 1 + age', data=players_full)
result1 = model1.fit()
result1.summary()
UserWarning: kurtosistest only valid for n>=20 ... continuing anyway, n=12
                            OLS Regression Results
==============================================================================
Dep. Variable:                   time   R-squared:                       0.260
Model:                            OLS   Adj. R-squared:                  0.186
Method:                 Least Squares   F-statistic:                     3.516
Date:                Fri, 01 Dec 2023   Prob (F-statistic):             0.0902
Time:                        04:06:10   Log-Likelihood:                -72.909
No. Observations:                  12   AIC:                             149.8
Df Residuals:                      10   BIC:                             150.8
Df Model:                           1
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept    403.8531     88.666      4.555      0.001     206.292     601.414
age           -4.5658      2.435     -1.875      0.090      -9.991       0.859
==============================================================================
Omnibus:                        2.175   Durbin-Watson:                   2.490
Prob(Omnibus):                  0.337   Jarque-Bera (JB):                0.930
Skew:                          -0.123   Prob(JB):                        0.628
Kurtosis:                       1.659   Cond. No.                         97.0
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
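The t value and the 95% interval in the coefficient table follow directly from coef, std err, and the residual degrees of freedom; a quick check with scipy, using the Intercept row copied from the table above:

```python
# Sketch: reproduce the summary's t value and 95% CI by hand
from scipy.stats import t

coef, se, df_resid = 403.8531, 88.666, 10        # Intercept row, Df Residuals = 10
t_val = coef / se                                # t statistic = coef / std err
t_crit = t.ppf(0.975, df_resid)                  # two-sided 95% critical value
ci = (coef - t_crit * se, coef + t_crit * se)    # matches the [0.025, 0.975] columns
print(round(t_val, 3), [round(v, 1) for v in ci])
```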

Plot of linear model for time ~ 1 + age + jobstatus#

We can “control for jobstatus” by including that variable in the linear model. The formula `time ~ 1 + age + C(jobstatus)` fits a common slope for age together with a separate intercept for each jobstatus group, so the result is two parallel fit lines rather than two fully independent models. (Seaborn's hue, by contrast, fits a separate regression per group, which is what the interaction models further below reproduce.)
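A sketch of what `C(jobstatus)` does to the design matrix, using patsy (the formula engine behind statsmodels) on a toy frame rather than players_full:

```python
# Sketch: C(jobstatus) becomes a 0/1 dummy column in the design matrix (toy data)
import pandas as pd
from patsy import dmatrix

toy = pd.DataFrame({"age": [20, 30, 40, 50], "jobstatus": [0, 1, 0, 1]})
dm = dmatrix("1 + age + C(jobstatus)", toy, return_type="dataframe")
print(dm.columns.tolist())
# the dummy C(jobstatus)[T.1] is 1 exactly for the jobstatus==1 rows,
# which is what shifts the intercept for that group
```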

sns.lmplot(x="age", y="time", hue="jobstatus", data=players_full)
[Figure: time versus age with a separate regression line per jobstatus group]
import statsmodels.formula.api as smf

model2 = smf.ols('time ~ 1 + age + C(jobstatus)', data=players_full)
result2 = model2.fit()
result2.summary()
                            OLS Regression Results
==============================================================================
Dep. Variable:                   time   R-squared:                       0.404
Model:                            OLS   Adj. R-squared:                  0.271
Method:                 Least Squares   F-statistic:                     3.045
Date:                Fri, 01 Dec 2023   Prob (F-statistic):             0.0977
Time:                        04:06:11   Log-Likelihood:                -71.616
No. Observations:                  12   AIC:                             149.2
Df Residuals:                       9   BIC:                             150.7
Df Model:                           2
Covariance Type:            nonrobust
=====================================================================================
                        coef    std err          t      P>|t|      [0.025      0.975]
-------------------------------------------------------------------------------------
Intercept           377.8172     85.764      4.405      0.002     183.806     571.828
C(jobstatus)[T.1]  -124.0138     84.307     -1.471      0.175    -314.728      66.701
age                  -1.6509      3.039     -0.543      0.600      -8.526       5.224
=====================================================================================
Omnibus:                        0.983   Durbin-Watson:                   1.910
Prob(Omnibus):                  0.612   Jarque-Bera (JB):                0.661
Skew:                          -0.002   Prob(JB):                        0.718
Kurtosis:                       1.850   Cond. No.                         108.
=====================================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

Manually selecting the subset with jobstatus 1#

# import statsmodels.formula.api as smf

# players_job1 = players_full[players_full["jobstatus"]==1]
# model3 = smf.ols('time ~ 1 + age', data=players_job1)
# result3 = model3.fit()
# result3.summary()
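The same per-group idea can be written as a loop over groupby. Since players_full.csv is not bundled here, the sketch below uses synthetic data with a known slope of -2 in both groups:

```python
# Sketch: fit a separate OLS per group (synthetic stand-in for players_full)
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 50
toy = pd.DataFrame({"age": rng.uniform(20, 50, 2 * n),
                    "jobstatus": np.repeat([0, 1], n)})
# true model: common slope -2, jobstatus shifts the intercept down by 100
toy["time"] = 400 - 2.0 * toy["age"] - 100 * toy["jobstatus"] + rng.normal(0, 5, 2 * n)

fits = {}
for status, grp in toy.groupby("jobstatus"):
    fits[status] = smf.ols("time ~ 1 + age", data=grp).fit()
    print(status, fits[status].params.to_dict())
```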

Example confounder 2#

via https://stats.stackexchange.com/a/17338/62481

import numpy as np
from scipy.stats import uniform, randint

covariate = randint(0, 2).rvs(100)                      # binary confounder: 0 or 1
exposure  = uniform(0, 1).rvs(100) + 0.3 * covariate    # confounder raises exposure
outcome   = 2.0 + 0.5*exposure + 0.25*covariate         # no noise term: exact linear DGP

# covariate, exposure, outcome
df2 = pd.DataFrame({
    "covariate":covariate,
    "exposure":exposure,
    "outcome":outcome
})
sns.lmplot(x="exposure", y="outcome", data=df2)
[Figure: outcome versus exposure with a single pooled regression line]
import statsmodels.formula.api as smf

model2a = smf.ols('outcome ~ exposure', data=df2)
result2a = model2a.fit()
# result2a.summary()
result2a.params
Intercept    1.973898
exposure     0.729637
dtype: float64
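The exposure coefficient comes out above the true value 0.5 because the covariate is omitted. In-sample, the simple-regression slope equals 0.5 + 0.25·Cov(covariate, exposure)/Var(exposure) exactly (the omitted-variable-bias formula). A check on a fresh, seeded simulation of the same data-generating process:

```python
# Sketch: omitted-variable bias, verified on a fresh simulation of the same DGP
import numpy as np

rng = np.random.default_rng(42)
covariate = rng.integers(0, 2, 10_000)
exposure = rng.uniform(0, 1, 10_000) + 0.3 * covariate
outcome = 2.0 + 0.5 * exposure + 0.25 * covariate

# slope from the simple regression outcome ~ exposure
slope_biased = np.polyfit(exposure, outcome, 1)[0]

# omitted-variable-bias formula: true slope + (omitted coef) * Cov(c, x) / Var(x)
ovb = 0.25 * np.cov(covariate, exposure)[0, 1] / np.var(exposure, ddof=1)
print(slope_biased, 0.5 + ovb)
```

The bias is positive here because the covariate pushes both exposure and outcome up, so the pooled slope absorbs part of the covariate's effect.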
fg = sns.lmplot(x="exposure", y="outcome", hue="covariate", data=df2)
[Figure: outcome versus exposure, colored by covariate, with per-group regression lines]
model2b = smf.ols('outcome ~ exposure + C(covariate)', data=df2)
result2b = model2b.fit()
# result2b.summary()
result2b.params
Intercept            2.00
C(covariate)[T.1]    0.25
exposure             0.50
dtype: float64
x = np.linspace(0, 1.4)

m = result2b.params["exposure"]                  # common slope
b0 = result2b.params["Intercept"]                # intercept for covariate=0
b1 = b0 + result2b.params["C(covariate)[T.1]"]   # intercept for covariate=1

y0 = b0 + m*x
y1 = b1 + m*x
ax = fg.figure.axes[0]
sns.lineplot(x=x, y=y0, ax=ax, color="r")
sns.lineplot(x=x, y=y1, ax=ax, color="m")
fg.figure
[Figure: the per-group regression plot with the manually computed parallel lines overlaid]
model2c = smf.ols('outcome ~ -1 + exposure*C(covariate)', data=df2)
result2c = model2c.fit()
result2c.summary()
# result2c.params
                            OLS Regression Results
==============================================================================
Dep. Variable:                outcome   R-squared:                       1.000
Model:                            OLS   Adj. R-squared:                  1.000
Method:                 Least Squares   F-statistic:                 2.864e+30
Date:                Fri, 01 Dec 2023   Prob (F-statistic):               0.00
Time:                        04:06:12   Log-Likelihood:                 3330.7
No. Observations:                 100   AIC:                            -6653.
Df Residuals:                      96   BIC:                            -6643.
Df Model:                           3
Covariance Type:            nonrobust
==============================================================================================
                                 coef    std err          t      P>|t|      [0.025      0.975]
----------------------------------------------------------------------------------------------
C(covariate)[0]                2.0000   2.94e-16   6.81e+15      0.000       2.000       2.000
C(covariate)[1]                2.2500   3.98e-16   5.65e+15      0.000       2.250       2.250
exposure                       0.5000   5.14e-16   9.73e+14      0.000       0.500       0.500
exposure:C(covariate)[T.1]  1.887e-15   6.75e-16      2.797      0.006    5.48e-16    3.23e-15
==============================================================================================
Omnibus:                       67.389   Durbin-Watson:                   1.410
Prob(Omnibus):                  0.000   Jarque-Bera (JB):                8.510
Skew:                           0.255   Prob(JB):                       0.0142
Kurtosis:                       1.665   Cond. No.                         12.4
==============================================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
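The column names in this summary come from how patsy expands the formula: `exposure*C(covariate)` is shorthand for `exposure + C(covariate) + exposure:C(covariate)`, and `-1` drops the intercept, so the categorical main effect keeps both level columns. A quick check on a toy frame:

```python
# Sketch: how patsy expands `-1 + exposure*C(covariate)` (toy data)
import pandas as pd
from patsy import dmatrix

toy = pd.DataFrame({"exposure": [0.1, 0.2, 0.3, 0.4], "covariate": [0, 1, 0, 1]})
dm = dmatrix("-1 + exposure*C(covariate)", toy, return_type="dataframe")
print(dm.columns.tolist())
# one intercept column per covariate level, a shared exposure slope,
# and an interaction column that lets the covariate=1 slope differ
```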
# import seaborn as sns
# import matplotlib.pyplot as plt
# df = sns.load_dataset('iris')
# sns.regplot(x=df["sepal_length"], y=df["sepal_width"], line_kws={"color":"r","alpha":0.7,"lw":5})

Varying slopes and varying intercepts#

via https://patsy.readthedocs.io/en/latest/quickstart.html

You can even write interactions between categorical and numerical variables. Here we fit two different slope coefficients for x1; one for the a1 group, and one for the a2 group: dmatrix("a:x1", data)

This interaction model is what matches the seaborn plot when hue is used as an extra variable.
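A minimal illustration of the quoted `a:x1` behaviour with patsy's dmatrix, on a made-up frame:

```python
# Sketch: dmatrix("a:x1") yields one slope column per level of a
import pandas as pd
from patsy import dmatrix

data = pd.DataFrame({"a": ["a1", "a2", "a1", "a2"], "x1": [1.0, 2.0, 3.0, 4.0]})
dm = dmatrix("a:x1", data, return_type="dataframe")
print(dm.columns.tolist())
# an intercept plus a separate x1 slope for the a1 group and the a2 group
```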

model2d = smf.ols('outcome ~ 1 + C(covariate) + C(covariate):exposure', data=df2)
result2d = model2d.fit()
result2d.summary()
# result2d.params
                            OLS Regression Results
==============================================================================
Dep. Variable:                outcome   R-squared:                       1.000
Model:                            OLS   Adj. R-squared:                  1.000
Method:                 Least Squares   F-statistic:                 1.886e+30
Date:                Fri, 01 Dec 2023   Prob (F-statistic):               0.00
Time:                        04:06:12   Log-Likelihood:                 3309.8
No. Observations:                 100   AIC:                            -6612.
Df Residuals:                      96   BIC:                            -6601.
Df Model:                           3
Covariance Type:            nonrobust
============================================================================================
                               coef    std err          t      P>|t|      [0.025      0.975]
--------------------------------------------------------------------------------------------
Intercept                    2.0000   3.62e-16   5.53e+15      0.000       2.000       2.000
C(covariate)[T.1]            0.2500    6.1e-16    4.1e+14      0.000       0.250       0.250
C(covariate)[0]:exposure     0.5000   6.33e-16    7.9e+14      0.000       0.500       0.500
C(covariate)[1]:exposure     0.5000   5.39e-16   9.28e+14      0.000       0.500       0.500
============================================================================================
Omnibus:                        0.571   Durbin-Watson:                   0.092
Prob(Omnibus):                  0.752   Jarque-Bera (JB):                0.706
Skew:                          -0.152   Prob(JB):                        0.702
Kurtosis:                       2.723   Cond. No.                         11.0
============================================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
import numpy as np
from scipy.stats import uniform, randint

covariate3 = randint(0, 2).rvs(100)
exposure3  = uniform(0, 1).rvs(100) + 0.3 * covariate3
outcome3   = 2.0 + 0.25*covariate3 + (0.5 + 0.1*covariate3)*exposure3
#                  \_____________/   \___________________/
#                different intercepts    different slopes

df3 = pd.DataFrame({
    "covariate":covariate3,
    "exposure":exposure3,
    "outcome":outcome3
})
fg = sns.lmplot(x="exposure", y="outcome", hue="covariate", data=df3)
[Figure: outcome versus exposure for df3, colored by covariate, with per-group regression lines]
model3d = smf.ols('outcome ~ 1 + C(covariate) + C(covariate):exposure', data=df3)
result3d = model3d.fit()
result3d.params
# outcome3 has no noise term, so the fit recovers the generating values exactly:
# intercepts 2.0 and 2.25, and per-group slopes 0.5 and 0.6
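Because outcome3 contains no noise term, a correctly specified varying-intercepts/varying-slopes fit recovers the generating parameters exactly. A self-contained, seeded re-run of the same data-generating process:

```python
# Sketch: the varying-slopes fit recovers the df3-style DGP exactly (no noise)
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
cov = rng.integers(0, 2, 100)
exp_ = rng.uniform(0, 1, 100) + 0.3 * cov
out = 2.0 + 0.25 * cov + (0.5 + 0.1 * cov) * exp_   # intercepts 2.0/2.25, slopes 0.5/0.6

df = pd.DataFrame({"covariate": cov, "exposure": exp_, "outcome": out})
res = smf.ols("outcome ~ 1 + C(covariate) + C(covariate):exposure", data=df).fit()
print(res.params.round(4))
```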