Section 4.1 — Simple linear regression¶

This notebook contains the code examples from Section 4.1 Simple linear regression from the No Bullshit Guide to Statistics.

Notebook setup¶

In [1]:

# load Python modules
import os
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

In [2]:

# Figures setup
plt.clf()  # needed otherwise `sns.set_theme` doesn't work
from plot_helpers import RCPARAMS
RCPARAMS.update({"figure.figsize": (5, 3)})   # good for screen
# RCPARAMS.update({"figure.figsize": (5, 1.6)})  # good for print
sns.set_theme(
    context="paper",
    style="whitegrid",
    palette="colorblind",
    rc=RCPARAMS,
)

# High-resolution please
%config InlineBackend.figure_format = "retina"

# Where to store figures
DESTDIR = "figures/lm/simple"

<Figure size 640x480 with 0 Axes>

In [3]:

from ministats.plots.figures import plot_residuals
from ministats.plots.figures import plot_residuals2

In [4]:

# set random seed for repeatability
np.random.seed(42)

In [5]:

import warnings
# silence kurtosistest warning when using n < 20
warnings.filterwarnings("ignore", category=UserWarning)

$\def\stderr#1{\mathbf{se}_{#1}}$ $\def\stderrhat#1{\hat{\mathbf{se}}_{#1}}$ $\newcommand{\Mean}{\textbf{Mean}}$ $\newcommand{\Var}{\textbf{Var}}$ $\newcommand{\Std}{\textbf{Std}}$ $\newcommand{\Freq}{\textbf{Freq}}$ $\newcommand{\RelFreq}{\textbf{RelFreq}}$ $\newcommand{\DMeans}{\textbf{DMeans}}$ $\newcommand{\Prop}{\textbf{Prop}}$ $\newcommand{\DProps}{\textbf{DProps}}$

$$ \newcommand{\CI}[1]{\textbf{CI}_{#1}} \newcommand{\CIL}[1]{\textbf{L}_{#1}} \newcommand{\CIU}[1]{\textbf{U}_{#1}} \newcommand{\ci}[1]{\textbf{ci}_{#1}} \newcommand{\cil}[1]{\textbf{l}_{#1}} \newcommand{\ciu}[1]{\textbf{u}_{#1}} $$

(this cell contains the macro definitions like $\stderr{\overline{\mathbf{x}}}$, $\stderrhat{}$, $\Mean$, ...)

Definitions¶

In [ ]:

Linear model¶

The regression line describes the expected value of the outcome variable Y at different values of the predictor x.
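To make this concrete, here is a minimal simulation sketch. It assumes a linear model with normally distributed noise; the values `beta0`, `beta1`, and `sigma` are hypothetical and chosen only for illustration (they are not estimates from the students dataset).

```python
import numpy as np

rng = np.random.default_rng(42)

# hypothetical model parameters (illustration only)
beta0, beta1, sigma = 30.0, 5.0, 4.0

# repeat each predictor value many times so we can average the outcomes
xs = np.repeat([6.0, 9.0, 12.0], 5000)
ys = beta0 + beta1*xs + rng.normal(0, sigma, size=len(xs))

# the average outcome at each x should be close to the line beta0 + beta1*x
for x in [6.0, 9.0, 12.0]:
    print(x, round(ys[xs == x].mean(), 2), beta0 + beta1*x)
```

Individual outcomes scatter around the line, but their average at each value of x tracks the line closely.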

In [ ]:

Example: students' scores as a function of effort¶

In [6]:

students = pd.read_csv("../datasets/students.csv")
students.head()

Out[6]:

	student_ID	background	curriculum	effort	score
0	1	arts	debate	10.96	75.0
1	2	science	lecture	8.69	75.0
2	3	arts	debate	8.60	67.0
3	4	arts	lecture	7.92	70.3
4	5	science	debate	9.90	76.1

In [7]:

efforts = students["effort"]
scores = students["score"]
sns.scatterplot(x=efforts, y=scores);


Compute the correlation¶

In [8]:

np.corrcoef(efforts, scores)[0,1]
# ALT. students[["effort","score"]].corr()
# np.corrcoef

Out[8]:

0.8794375135614695

In [ ]:

Parameter estimation using least squares¶

In [9]:

meaneffort = efforts.mean()
meanscore = scores.mean()
num = np.sum( (efforts-meaneffort)*(scores-meanscore) )
denom = np.sum( (efforts - meaneffort)**2 )
b1 = num / denom
b1

Out[9]:

4.504850344209071

In [10]:

b0 = meanscore - b1*meaneffort
b0

Out[10]:

32.46580930159963

In [11]:

es = np.linspace(5, 12)
scorehats = b0 + b1*es
sns.lineplot(x=es, y=scorehats)
sns.scatterplot(x=efforts, y=scores);

In [12]:

# # ALT.
# sns.regplot(x=efforts, y=scores, ci=None);

Least squares optimization for the parameters¶

How do we find the parameter estimates of the model?
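One way to build intuition for the answer is to compare the sum of squared residuals for a few candidate lines. The sketch below assumes effort and score values transcribed from the outputs printed in this notebook (the scores are reconstructed as fitted value plus residual), and the helper name `ssr_for_line` is introduced here for illustration only.

```python
import numpy as np

# values transcribed from this notebook's outputs (an assumption of this sketch)
efforts = np.array([10.96, 8.69, 8.60, 7.92, 9.90, 10.80, 7.81, 9.13,
                    5.21, 7.71, 9.82, 11.53, 7.10, 6.39, 12.00])
scores = np.array([75.0, 75.0, 67.0, 70.3, 76.1, 79.8, 72.7, 75.4,
                   57.0, 69.0, 70.4, 96.2, 62.9, 57.6, 84.3])

def ssr_for_line(b0, b1):
    """Sum of squared residuals for the candidate line b0 + b1*x."""
    residuals = scores - (b0 + b1*efforts)
    return np.sum(residuals**2)

# least squares estimates from the closed-form formulas
dx = efforts - efforts.mean()
b1 = np.sum(dx*(scores - scores.mean())) / np.sum(dx**2)
b0 = scores.mean() - b1*efforts.mean()

print(ssr_for_line(b0, b1))        # SSR at the least squares estimates
print(ssr_for_line(b0 + 1, b1))    # shifted intercept: larger SSR
print(ssr_for_line(b0, b1 + 0.2))  # changed slope: larger SSR
```

Any line other than the least squares line gives a larger sum of squared residuals, which is what the residual plots below visualize.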

In [13]:

plot_residuals(efforts, scores, b0, b1)
sns.scatterplot(x=efforts, y=scores)
es = np.linspace(5, 12.2)
scorehats = b0 + b1*es
sns.lineplot(x=es, y=scorehats, color="C4");

In [14]:

ax = sns.scatterplot(x=efforts, y=scores, zorder=4)
es = np.linspace(5, 12.2)
scorehats = b0 + b1*es
sns.lineplot(x=es, y=scorehats, color="C4", zorder=5)
plot_residuals2(efforts, scores, b0, b1, ax=ax);

Estimating the standard deviation parameter¶

In [15]:

scorehats = b0 + b1*efforts
residuals = scores - scorehats
residuals[0:4]

Out[15]:

0   -6.838969
1    3.387041
2   -4.207522
3    2.155776
dtype: float64

In [16]:

SSR = np.sum( residuals**2 )
n = len(students)
sigmahat = np.sqrt( SSR / (n-2) )
sigmahat

Out[16]:

4.929598282660258
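The denominator `n-2` reflects the two parameters (intercept and slope) estimated from the data. The following simulation sketch uses hypothetical true parameters and `np.polyfit` in place of the formulas above. It suggests that dividing SSR by `n-2` gives variance estimates that average close to the true variance, while dividing by `n` tends to underestimate it.

```python
import numpy as np

rng = np.random.default_rng(42)

# hypothetical true parameters (illustration only)
beta0, beta1, sigma = 30.0, 5.0, 5.0
n, reps = 15, 2000
xs = np.linspace(5, 12, n)

# simulate `reps` datasets at once; ys has shape (n, reps)
ys = beta0 + beta1*xs[:, np.newaxis] + rng.normal(0, sigma, size=(n, reps))

# fit a least squares line to every column of ys
slopes, intercepts = np.polyfit(xs, ys, deg=1)
residuals = ys - (intercepts + slopes*xs[:, np.newaxis])
ssrs = np.sum(residuals**2, axis=0)

print("true variance:    ", sigma**2)
print("mean of SSR/(n-2):", np.mean(ssrs/(n-2)))
print("mean of SSR/n:    ", np.mean(ssrs/n))
```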

In [ ]:

Model diagnostics¶

Scatter plots¶

In [17]:

sns.scatterplot(x="effort", y="score", data=students);

Examples of nonlinear patterns¶

Examples of scatter plots showing nonlinear patterns.

Residuals plots¶

In [18]:

scorehats = b0 + b1*efforts
residuals = scores - scorehats

Residuals versus the predicted values¶

In [19]:

ax = sns.scatterplot(x=scorehats, y=residuals)
ax.set_xlabel("model predictions ($\\hat{s}_i$)")
ax.set_ylabel("residuals ($r_i = s_i - \\hat{s}_i$)")
ax.axhline(y=0, color="b", linestyle="dotted");

Residuals versus the predictor (bonus)¶

In [20]:

# ax = sns.scatterplot(x=efforts, y=residuals)
# ax.set_xticks(range(5,12+1))
# ax.set_ylabel("residuals ($r_i = s_i - \\hat{s}_i$)")
# ax.axhline(y=0, color="b", linestyle="dotted");

QQ-plot of the residuals¶

In [21]:

from statsmodels.graphics.api import qqplot

qqplot(residuals, line="s");

Residual plots that show violated assumptions¶

Examples of residual plots showing violated modeling assumptions.
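As a rough illustration of one such violation, the sketch below generates hypothetical data from a quadratic relationship, fits a straight line with `np.polyfit`, and inspects the residuals. The data-generating values are assumptions made for this example only.

```python
import numpy as np

rng = np.random.default_rng(0)

# hypothetical data with a quadratic (nonlinear) relationship
xs = np.linspace(0, 10, 50)
ys = 2 + 0.5*xs**2 + rng.normal(0, 1, size=len(xs))

# fit a straight line anyway
b1, b0 = np.polyfit(xs, ys, deg=1)
residuals = ys - (b0 + b1*xs)

# average residual at the left end, in the middle, and at the right end
print(residuals[:5].mean(), residuals[22:27].mean(), residuals[-5:].mean())
```

The residuals are positive at both ends and negative in the middle. A residual plot of these values would show a U-shaped pattern instead of a structureless band around zero, which signals that the linearity assumption is violated.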

Sum of squares quantities¶

Sum of squared residuals¶

In [22]:

SSR = np.sum( residuals**2 )
SSR

Out[22]:

315.9122099692906

Explained sum of squares¶

In [23]:

meanscore = scores.mean()
ESS = np.sum( (scorehats-meanscore)**2 ) 
ESS

Out[23]:

1078.2917900307098

Total sum of squares¶

In [24]:

TSS = np.sum( (scores - meanscore)**2 )
TSS

Out[24]:

1394.2040000000002

In [25]:

SSR + ESS  # == TSS

Out[25]:

1394.2040000000004
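The two totals agree only up to floating-point rounding (compare the last digits of the outputs above), so an exact `==` comparison can fail. A small sketch of the check using `np.isclose` follows. It assumes data transcribed from the outputs of this notebook and uses `np.polyfit` as a stand-in for the least squares formulas.

```python
import numpy as np

# values transcribed from this notebook's outputs (an assumption of this sketch)
efforts = np.array([10.96, 8.69, 8.60, 7.92, 9.90, 10.80, 7.81, 9.13,
                    5.21, 7.71, 9.82, 11.53, 7.10, 6.39, 12.00])
scores = np.array([75.0, 75.0, 67.0, 70.3, 76.1, 79.8, 72.7, 75.4,
                   57.0, 69.0, 70.4, 96.2, 62.9, 57.6, 84.3])

# least squares fit (np.polyfit returns the slope first)
b1, b0 = np.polyfit(efforts, scores, deg=1)
scorehats = b0 + b1*efforts

SSR = np.sum((scores - scorehats)**2)
ESS = np.sum((scorehats - scores.mean())**2)
TSS = np.sum((scores - scores.mean())**2)

# compare with a tolerance instead of ==
print(np.isclose(SSR + ESS, TSS))
```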

In [ ]:

Coefficient of determination $R^2$¶

In [26]:

R2 = ESS / TSS
R2

Out[26]:

0.7734103402591799

In [ ]:

Using linear models to make predictions¶

In [27]:

def predict(x, b0, b1):
    yhat = b0 + b1*x
    return yhat

In [ ]:

Confidence interval for the mean¶

TODO: add formulas

Confidence interval for observations¶

TODO: add formulas
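As a sketch of the formulas for both intervals, the expressions below are read off from the calculations used in the students example of this section (`se_meanscore` and `se_score`). The notation $x_{\text{new}}$, $\hat{y}_{\text{new}}$, and $t^*$ is an assumption of this note and may differ from the book's.

$$ \widehat{\text{se}}_{\text{mean}} = \widehat{\sigma} \cdot \sqrt{ \frac{1}{n} + \frac{(x_{\text{new}} - \overline{x})^2}{\sum (x_i - \overline{x})^2} } \qquad \widehat{\text{se}}_{\text{obs}} = \widehat{\sigma} \cdot \sqrt{ 1 + \frac{1}{n} + \frac{(x_{\text{new}} - \overline{x})^2}{\sum (x_i - \overline{x})^2} } $$

In both cases the $(1-\alpha)$ interval is $\hat{y}_{\text{new}} \pm t^* \cdot \widehat{\text{se}}$, where $\hat{y}_{\text{new}} = b_0 + b_1 x_{\text{new}}$ and $t^*$ is the $(1-\alpha/2)$ quantile of Student's $t$-distribution with $n-2$ degrees of freedom.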

In [ ]:

Example: predicting students' scores¶

Predict the score of a new student who invests 9 hours of effort per week.

In [28]:

neweffort = 9
scorehat = predict(neweffort, b0=32.5, b1=4.5)
scorehat

Out[28]:

73.0

Confidence interval for the mean score¶

In [29]:

#######################################################
newdev = (neweffort - efforts.mean())**2
sum_dev2 = np.sum((efforts - efforts.mean())**2)
se_meanscore = sigmahat*np.sqrt(1/n + newdev/sum_dev2)
se_meanscore

Out[29]:

1.2744485881877106

In [30]:

from scipy.stats import t as tdist
alpha = 0.1
t_l, t_u = tdist(df=n-2).ppf([alpha/2, 1-alpha/2])
[scorehat + t_l*se_meanscore, scorehat + t_u*se_meanscore]

Out[30]:

[70.74303643371016, 75.25696356628984]

Prediction band for the mean score¶

Plot of the 90% confidence interval for the mean

Confidence interval for predicted scores¶

In [31]:

se_score = sigmahat*np.sqrt(1 + 1/n + newdev/sum_dev2)
se_score

Out[31]:

5.0916754052414435

In [32]:

alpha = 0.1
t_l, t_u = tdist(df=n-2).ppf([alpha/2, 1-alpha/2])
[scorehat + t_l*se_score, scorehat + t_u*se_score]

Out[32]:

[63.98298198333331, 82.0170180166667]

Prediction band for scores¶

Plot of the 90% confidence interval for the outcomes.

Prediction caveats¶

In [33]:

efforts.min(), efforts.max()

Out[33]:

(5.21, 12.0)

We should not extrapolate the model beyond the range of effort values observed in the data (here, 5.21 to 12.0 hours per week).

For example, there is no reason to believe in the model's predictions about an effort of 20 hours per week:

In [34]:

predict(20, b0=32.5, b1=4.5)

Out[34]:

122.5

Indeed, the model predicts a score above 100%, which is impossible.
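A small guard can make this caveat explicit in code. The helper below is hypothetical (the name `predict_in_range` and its `xmin`/`xmax` parameters are not from the book): it applies the same formula as `predict` and emits a warning when asked to extrapolate. It uses `RuntimeWarning` because this notebook's setup silences `UserWarning`.

```python
import warnings

def predict_in_range(x, b0, b1, xmin, xmax):
    """Return b0 + b1*x, warning when x lies outside [xmin, xmax]."""
    if not (xmin <= x <= xmax):
        warnings.warn(
            f"x={x} is outside the observed range [{xmin}, {xmax}]; "
            "the prediction is an extrapolation.",
            RuntimeWarning,
        )
    return b0 + b1*x

# within the observed range of efforts: no warning
print(predict_in_range(9, b0=32.5, b1=4.5, xmin=5.21, xmax=12.0))   # 73.0

# outside the observed range: a RuntimeWarning is emitted
print(predict_in_range(20, b0=32.5, b1=4.5, xmin=5.21, xmax=12.0))  # 122.5
```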

Explanations¶

Software for fitting linear models¶

scipy
statsmodels
scikit-learn

Fitting linear models with `statsmodels`¶

In [35]:

import statsmodels.formula.api as smf

lm1 = smf.ols("score ~ 1 + effort", data=students).fit()

In [36]:

type(lm1)

Out[36]:

statsmodels.regression.linear_model.RegressionResultsWrapper

Estimated parameters for the model¶

In [37]:

lm1.params

Out[37]:

Intercept    32.465809
effort        4.504850
dtype: float64

In [38]:

type(lm1.params)

Out[38]:

pandas.core.series.Series

We often want to extract the intercept and slope parameters for use in subsequent calculations.

In [39]:

b0 = lm1.params["Intercept"]  # = lm1.params.iloc[0]
b1 = lm1.params["effort"]     # = lm1.params.iloc[1]
b0, b1

Out[39]:

(32.465809301599606, 4.504850344209074)

The estimate $\widehat{\sigma}$ is obtained by taking the square root of the `.scale` attribute, which stores the residual variance estimate (`SSR/(n-2)` for this model).

In [40]:

sigmahat = np.sqrt(lm1.scale)
sigmahat

Out[40]:

4.929598282660258

Model fitted values¶

In [41]:

lm1.fittedvalues  # == scorehats

Out[41]:

0     81.838969
1     71.612959
2     71.207522
3     68.144224
4     77.063828
5     81.118193
6     67.648690
7     73.595093
8     55.936080
9     67.198205
10    76.703440
11    84.406734
12    64.450247
13    61.251803
14    86.524013
dtype: float64

Residuals¶

In [42]:

lm1.resid  # == scores - scorehats

Out[42]:

0     -6.838969
1      3.387041
2     -4.207522
3      2.155776
4     -0.963828
5     -1.318193
6      5.051310
7      1.804907
8      1.063920
9      1.801795
10    -6.303440
11    11.793266
12    -1.550247
13    -3.651803
14    -2.224013
dtype: float64

Sum of squares quantities¶

In [43]:

# SSR     # ESS     # TSS              # R2
lm1.ssr,  lm1.ess,  lm1.centered_tss,  lm1.rsquared

Out[43]:

(315.91220996929053,
 1078.2917900307098,
 1394.2040000000002,
 0.7734103402591798)

Predictions¶

Predict the score of a new student who invests 9 hours of effort per week.

In [44]:

lm1.predict({"effort":9})

Out[44]:

0    73.009462
dtype: float64

In [45]:

pred = lm1.get_prediction({"effort":9})
pred.se_mean, pred.conf_int(alpha=0.1)

Out[45]:

(array([1.27444859]), array([[70.75249883, 75.26642597]]))

In [46]:

pred.se_obs, pred.conf_int(obs=True, alpha=0.1)

Out[46]:

(array([5.09167541]), array([[63.99244438, 82.02648042]]))

In [ ]:

Model summary table¶

In [47]:

lm1.summary()

Out[47]:

OLS Regression Results
Dep. Variable:	score	R-squared:	0.773
Model:	OLS	Adj. R-squared:	0.756
Method:	Least Squares	F-statistic:	44.37
Date:	Thu, 18 Jul 2024	Prob (F-statistic):	1.56e-05
Time:	11:53:14	Log-Likelihood:	-44.140
No. Observations:	15	AIC:	92.28
Df Residuals:	13	BIC:	93.70
Df Model:	1
Covariance Type:	nonrobust

	coef	std err	t	P>\|t\|	[0.025	0.975]
Intercept	32.4658	6.155	5.275	0.000	19.169	45.763
effort	4.5049	0.676	6.661	0.000	3.044	5.966

Omnibus:	4.062	Durbin-Watson:	2.667
Prob(Omnibus):	0.131	Jarque-Bera (JB):	1.777
Skew:	0.772	Prob(JB):	0.411
Kurtosis:	3.677	Cond. No.	44.5

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

Helper functions for plotting linear model results¶

plot_reg(lm): generate a regression plot for the model lm
plot_resid(lm): generate a residuals plot for the model lm
plot_pred_bands(lm, ...): plot confidence bands for the model lm. Use ci_mean=True to plot the confidence interval for the mean, or ci_obs=True to plot the confidence interval for the observations.

Regression plot¶

In [48]:

from ministats import plot_reg

plot_reg(lm1);

Residuals plot¶

In [49]:

xs = np.array([1,2,3])
hasattr(xs, "name")

Out[49]:

False

In [50]:

xs = pd.Series([1,2,3])
hasattr(xs, "name") and xs.name is not None

Out[50]:

False

In [51]:

from ministats import plot_resid

plot_resid(lm1);

In [52]:

# BONUS plot residuals against predictor variable
plot_resid(lm1, pred="effort");

Prediction band plots¶

In [53]:

from ministats import plot_pred_bands

plot_reg(lm1)
plot_pred_bands(lm1, ci_mean=True, alpha_mean=0.1);

In [54]:

plot_reg(lm1)
plot_pred_bands(lm1, ci_obs=True, alpha_obs=0.1);

Seaborn functions for plotting linear models¶

Regression plot¶

In [55]:

sns.regplot(x="effort", y="score", ci=None, data=students);

Residual plot¶

In [56]:

sns.residplot(x="effort", y="score", data=students);

Pearson correlation coefficient revisited¶

In [57]:

efforts = students["effort"]
scores = students["score"]
pearson_r = efforts.corr(scores)
pearson_r

Out[57]:

0.8794375135614695

In [58]:

# # ALT.
# from scipy.stats import pearsonr
# pearson_r = pearsonr(efforts, scores)[0]

In [59]:

pearson_r**2

Out[59]:

0.7734103402591799

In [60]:

lm1 = smf.ols("score ~ 1 + effort", data=students).fit()
lm1.rsquared

Out[60]:

0.7734103402591798
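The outputs above show that, in simple linear regression, the squared Pearson correlation coincides with $R^2$. A related identity is that the least squares slope equals the correlation times the ratio of the sample standard deviations. The sketch below checks this on data transcribed from the outputs of this notebook (an assumption of this example), using `np.polyfit` as a stand-in for the fitted model.

```python
import numpy as np

# values transcribed from this notebook's outputs (an assumption of this sketch)
efforts = np.array([10.96, 8.69, 8.60, 7.92, 9.90, 10.80, 7.81, 9.13,
                    5.21, 7.71, 9.82, 11.53, 7.10, 6.39, 12.00])
scores = np.array([75.0, 75.0, 67.0, 70.3, 76.1, 79.8, 72.7, 75.4,
                   57.0, 69.0, 70.4, 96.2, 62.9, 57.6, 84.3])

r = np.corrcoef(efforts, scores)[0, 1]

# slope recovered from the correlation and the sample standard deviations
b1_from_r = r * scores.std(ddof=1) / efforts.std(ddof=1)

# slope from a least squares fit (np.polyfit returns the slope first)
b1, b0 = np.polyfit(efforts, scores, deg=1)

print(r**2)
print(b1_from_r, b1)
```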

In [ ]:

Alternative methods for fitting linear models (optional)¶

Numerical optimization¶

In [61]:

from scipy.optimize import minimize

def ssr(betas, xdata, ydata):
    yhat = betas[0] + betas[1]*xdata
    resid = ydata - yhat
    return np.sum(resid**2)

optres = minimize(ssr, x0=[0,0], args=(efforts,scores))
beta0, beta1 = optres.x
beta0, beta1

Out[61]:

(32.465809793102544, 4.504850301246796)

Linear algebra¶

Linear algebra solution using NumPy.

In [62]:

import numpy as np

# Prepare the design matrix
n = len(students)
X = np.empty((n,2))
X[:,0] = 1
X[:,1] = efforts
X

Out[62]:

array([[ 1.  , 10.96],
       [ 1.  ,  8.69],
       [ 1.  ,  8.6 ],
       [ 1.  ,  7.92],
       [ 1.  ,  9.9 ],
       [ 1.  , 10.8 ],
       [ 1.  ,  7.81],
       [ 1.  ,  9.13],
       [ 1.  ,  5.21],
       [ 1.  ,  7.71],
       [ 1.  ,  9.82],
       [ 1.  , 11.53],
       [ 1.  ,  7.1 ],
       [ 1.  ,  6.39],
       [ 1.  , 12.  ]])

We obtain the least squares solution using the Moore–Penrose inverse formula:

$$ \vec{\beta} = (X^{\sf T} X)^{-1}X^{\sf T}\; \mathbf{y} $$

In [63]:

lares = np.linalg.inv(X.T.dot(X)).dot(X.T).dot(scores)
beta0, beta1 = lares
beta0, beta1

Out[63]:

(32.46580930159923, 4.504850344209087)
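Explicitly inverting $X^{\sf T} X$ works for this small problem, but NumPy also provides routines that solve the same least squares problem without forming the inverse by hand. The sketch below uses `np.linalg.lstsq` and `np.linalg.pinv` on data transcribed from the outputs of this notebook (an assumption of this example).

```python
import numpy as np

# values transcribed from this notebook's outputs (an assumption of this sketch)
efforts = np.array([10.96, 8.69, 8.60, 7.92, 9.90, 10.80, 7.81, 9.13,
                    5.21, 7.71, 9.82, 11.53, 7.10, 6.39, 12.00])
scores = np.array([75.0, 75.0, 67.0, 70.3, 76.1, 79.8, 72.7, 75.4,
                   57.0, 69.0, 70.4, 96.2, 62.9, 57.6, 84.3])

# design matrix: a column of ones and a column of efforts
X = np.column_stack([np.ones(len(efforts)), efforts])

# least squares solver (returns solution, residuals, rank, singular values)
betas, *_ = np.linalg.lstsq(X, scores, rcond=None)
beta0, beta1 = betas

# same solution via the Moore-Penrose pseudoinverse
betas_pinv = np.linalg.pinv(X) @ scores

print(beta0, beta1)
print(betas_pinv)
```

Both approaches should agree with the inverse-based result above up to rounding.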

In [ ]:

Fitting linear models using `scipy`¶

The helper function scipy.stats.linregress ...

In [64]:

from scipy.stats import linregress

scipyres = linregress(efforts, scores)
scipyres.intercept, scipyres.slope

Out[64]:

(32.46580930159963, 4.504850344209071)

In [ ]:

Fitting linear models using `scikit-learn`¶

The class sklearn.linear_model.LinearRegression ...

In [65]:

from sklearn.linear_model import LinearRegression

sklmodel = LinearRegression()
sklmodel.fit(efforts.values[:,np.newaxis], scores)
sklmodel.intercept_, sklmodel.coef_

Out[65]:

(32.46580930159961, array([4.50485034]))

Using the low-level `statsmodels` API¶

In [66]:

import statsmodels.api as sm

X = sm.add_constant(efforts)
y = scores
smres = sm.OLS(y,X).fit()
smres.params["const"], smres.params["effort"]

Out[66]:

(32.465809301599606, 4.504850344209074)

In [ ]:

Discussion¶

In [ ]:

Examples of non-linear relationships¶

Here are some examples of the different possible relationships between the effort and score variables.

nonlinear relationships

In [ ]:

Exercises¶

Exercise E??: marketing dataset¶

In [67]:

marketing = pd.read_csv("../datasets/exercises/marketing.csv")
print(marketing.columns)
lm_mkt = smf.ols("sales ~ 1 + youtube", data=marketing).fit()
plot_reg(lm_mkt)

Index(['youtube', 'facebook', 'newspaper', 'sales'], dtype='object')

Out[67]:

<Axes: xlabel='youtube', ylabel='sales'>

In [68]:

from ministats import plot_resid
plot_resid(lm_mkt)

Out[68]:

<Axes: xlabel='fitted values', ylabel='residuals $r_i$'>

Links¶

In [ ]:

(BONUS MATERIAL) Formula for standard error of coefficients¶

In [69]:

lm1.bse

Out[69]:

Intercept    6.155051
effort       0.676276
dtype: float64

Formula using summations

$$ se(\beta_0) = \hat{\sigma} \cdot \sqrt{ \frac{1}{n} + \frac{\overline{x}^2}{\sum (x_i - \overline{x})^2} } \qquad se(\beta_1) = \hat{\sigma} \cdot \sqrt{\frac{1}{\sum (x_i - \overline{x})^2}} $$

TODO: show derivation why these formulas are equiv. to matrix formulas below when p=1

In [70]:

sum_dev2 = np.sum((efforts - efforts.mean())**2)
se_Intercept = sigmahat * np.sqrt(1/n + efforts.mean()**2/sum_dev2)
se_b_effort = sigmahat/np.sqrt(sum_dev2)
se_Intercept, se_b_effort

Out[70]:

(6.155051380977695, 0.6762756464968055)

Alternative formula using design matrix

$$ [se(\beta_0), se(\beta_1)] = \hat{\sigma} \cdot \sqrt{ \text{diag}\left( (X^{\sf T} X)^{-1} \right) } $$

where $X$ is the design matrix.

In [71]:

# construct the design matrix for the model 
X = sm.add_constant(students[["effort"]])
# extract the diagonal of the matrix (X^T X)^{-1}
inv_covs = np.diag(np.linalg.inv(X.T.dot(X)))
np.sqrt(sigmahat**2 * inv_covs)

Out[71]:

array([6.15505138, 0.67627565])

In [72]:

lm1.model.exog[:,1]

Out[72]:

array([10.96,  8.69,  8.6 ,  7.92,  9.9 , 10.8 ,  7.81,  9.13,  5.21,
        7.71,  9.82, 11.53,  7.1 ,  6.39, 12.  ])
