OLD Section 3.4 — Analytical approximation methods¶
Notebook setup¶
# load Python modules
import os
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
# Plot helper functions
from ministats import calc_prob_and_plot_tails
from ministats import plot_pdf
from ministats.utils import savefigure
# Figures setup
plt.clf() # needed otherwise `sns.set_theme` doesn't work
from plot_helpers import RCPARAMS
RCPARAMS.update({'figure.figsize': (10, 3)}) # good for screen
# RCPARAMS.update({'figure.figsize': (5, 1.6)}) # good for print
sns.set_theme(
    context="paper",
    style="whitegrid",
    palette="colorblind",
    rc=RCPARAMS,
)
# Useful colors
snspal = sns.color_palette()
blue, orange, purple = snspal[0], snspal[1], snspal[4]
# High-resolution please
%config InlineBackend.figure_format = 'retina'
# Where to store figures
DESTDIR = "figures/stats/NHST"
# set random seed for repeatability
np.random.seed(42)
Definitions¶
def mean(sample):
    """Compute the sample mean."""
    return sum(sample) / len(sample)

def var(sample):
    """Compute the sample variance (using the n-1 denominator)."""
    xbar = mean(sample)
    sumsqdevs = sum([(xi-xbar)**2 for xi in sample])
    return sumsqdevs / (len(sample)-1)

def std(sample):
    """Compute the sample standard deviation."""
    s2 = var(sample)
    return np.sqrt(s2)

def dmeans(xsample, ysample):
    """Compute the difference between two sample means."""
    dhat = mean(xsample) - mean(ysample)
    return dhat
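A quick sanity check of these helpers on a small sample, comparing against NumPy's built-in implementations. The helpers are re-stated here so the snippet runs on its own, and the sample values are arbitrary:

```python
import numpy as np

def mean(sample):
    return sum(sample) / len(sample)

def var(sample):
    xbar = mean(sample)
    sumsqdevs = sum([(xi-xbar)**2 for xi in sample])
    return sumsqdevs / (len(sample)-1)

def std(sample):
    return np.sqrt(var(sample))

def dmeans(xsample, ysample):
    return mean(xsample) - mean(ysample)

sample = [1, 3, 5, 7, 9]
print(mean(sample))   # 5.0
print(var(sample))    # 10.0  (note the n-1 denominator)
print(np.isclose(std(sample), np.std(sample, ddof=1)))  # True
print(dmeans([4, 6], [1, 3]))  # 3.0
```

Note `ddof=1` in the NumPy call: `np.std` defaults to the n denominator, while our `std` uses n-1.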
Formulas¶
Analytical approximation formulas for the ...
IMPORTED OLD WRITEUP¶
Student's t-test (pooled variance)¶
Student's $t$-test for comparing the difference between two group means is a procedure that makes use of the pooled variance $s^2_p$. Recall that $H_0$ states there is no difference between the two groups. This means we can think of $s_S^2$ and $s_{NS}^2$ as two independent estimates of the population variance, so we can combine them (pool them together) to obtain the estimate $s^2_p$.
Black-box approach¶
The scipy.stats function ttest_ind performs all the steps of Student's $t$-test procedure, without the need for us to understand the details.
# from scipy.stats import ttest_ind
# # extract data for two groups
# xS = data[data["group"]=="S"]['ELV']
# xNS = data[data["group"]=="NS"]['ELV']
# # run the complete t-test procedure for independent samples:
# result = ttest_ind(xS, xNS)
# result.pvalue
The $p$-value is less than 0.05 so our decision is to reject the null hypothesis.
Student's t-test under the hood¶
The computations hidden behind the function ttest_ind involve a six-step procedure that makes use of the pooled variance $s^2_p$.
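The six steps can be sketched on hypothetical samples in place of Amy's ELV data (the group means, spreads, sizes, and seed below are illustrative assumptions), and checked against `ttest_ind`:

```python
import numpy as np
from scipy.stats import t as tdist, ttest_ind

np.random.seed(42)
xS  = np.random.normal(300, 20, size=30)   # hypothetical "S" group
xNS = np.random.normal(290, 20, size=30)   # hypothetical "NS" group

# Step 1: observed difference between group means
d = np.mean(xS) - np.mean(xNS)
# Step 2: sample sizes and sample standard deviations
nS, nNS = len(xS), len(xNS)
sS, sNS = np.std(xS, ddof=1), np.std(xNS, ddof=1)
# Step 3: pooled variance
var_pooled = ((nS-1)*sS**2 + (nNS-1)*sNS**2) / (nS + nNS - 2)
# Step 4: standard error of the difference between means
std_err = np.sqrt(var_pooled) * np.sqrt(1/nS + 1/nNS)
# Step 5: t statistic and degrees of freedom
tstat = d / std_err
df = nS + nNS - 2
# Step 6: two-sided p-value from Student's t-distribution
pvalue = 2 * tdist(df).sf(abs(tstat))

# the manual computation matches the black-box function
res = ttest_ind(xS, xNS)
print(np.isclose(tstat, res.statistic), np.isclose(pvalue, res.pvalue))
```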
Summary of Question 1¶
We saw two ways to answer Question 1 (is there really a difference between group means) and obtain the $p$-value. We interpreted the small $p$-values as evidence that the observed difference, $d=130$, is unlikely to be due to chance under $H_0$, so we rejected the null hypothesis. Note this whole procedure is just a sanity check—we haven't touched the alternative hypothesis at all yet, and for all we know the stats training could have the effect of decreasing ELV!
Estimating the effect size¶
- Question 2 of Amy's statistical investigation is to estimate the difference in ELV gained by stats training.
- NEW CONCEPT: effect size is a measure of difference between intervention and control groups.
- We assume the data of Group S and Group NS come from different populations with means $\mu_S$ and $\mu_{NS}$.
- We're interested in estimating the difference between population means, denoted $\Delta = \mu_S - \mu_{NS}$.
- By analyzing the sample, we have obtained an estimate $d=130$ for the unknown $\Delta$, but we know our data contains lots of variability, so we know our estimate might be off.
- We want an answer to Question 2 (What is the estimated difference between group means?) that takes into account the variability of the data.
- NEW CONCEPT: confidence interval is a way to describe a range of values for an estimate that takes into account the variability of the data.
- We want to provide an answer to Question 2 in the form of a confidence interval that tells us a range of values where we believe the true value of $\Delta$ falls.
- Similar to how we showed two approaches for hypothesis testing, we'll work on effect size estimation using two approaches: bootstrap estimation and analytical approximation methods.
Approach 2: Confidence intervals using analytical approximations¶
Assumption: the variance of the two populations is the same (or approximately equal)
Using the theoretical model for the populations, we can obtain a formula for CI of effect size $\Delta$:
$$ \textrm{CI}_{(1-\alpha)} = \left[ d - t^*\!\cdot\!\sigma_D, \, d + t^*\!\cdot\!\sigma_D \right]. $$
The confidence interval is centred at $d$, with width proportional to the standard deviation $\sigma_D$. The constant $t^*$ denotes the value of the inverse CDF of Student's $t$-distribution with the appropriate number of degrees of freedom df, evaluated at $1-\frac{\alpha}{2}$. For a 90% confidence interval, we choose $\alpha=0.10$, which gives $(1-\frac{\alpha}{2}) = 0.95$, so $t^* = F_{T_{\textrm{df}}}^{-1}\!\left(0.95\right)$. We can use two different analytical approximations to obtain a formula for $\sigma_D$, just as we did for hypothesis testing:
- Pooled variance: $\sigma^2_p = \frac{(n_S-1)s_S^2 + (n_{NS}-1)s_{NS}^2}{n_S + n_{NS} - 2}$, with df $= n_S + n_{NS} - 2$.
- Unpooled variance: $\sigma^2_u = \tfrac{s^2_S}{n_S} + \tfrac{s^2_{NS}}{n_{NS}}$, with df given by the Welch–Satterthwaite formula $\textrm{df} = \frac{\left(\tfrac{s_S^2}{n_S} + \tfrac{s_{NS}^2}{n_{NS}}\right)^2}{\tfrac{(s_S^2/n_S)^2}{n_S-1} + \tfrac{(s_{NS}^2/n_{NS})^2}{n_{NS}-1}}$.

In the pooled case the standard deviation of the difference is $\sigma_D = \sigma_p\sqrt{\tfrac{1}{n_S} + \tfrac{1}{n_{NS}}}$; in the unpooled case $\sigma_D = \sigma_u$ directly.
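As a concrete check of the $t^*$ computation, here is the critical value for a 90% confidence interval, assuming df $=38$ (a hypothetical choice, corresponding to two samples of size 20 with pooled variance):

```python
from scipy.stats import t as tdist

alpha = 0.10   # 90% confidence level
df = 38        # hypothetical: nS + nNS - 2 with nS = nNS = 20
tstar = tdist(df).ppf(1 - alpha/2)   # inverse CDF evaluated at 0.95
# by symmetry of the t-distribution, abs(ppf(alpha/2)) gives the same value
print(tstar)
```

The value comes out a little larger than the standard normal's 1.645, reflecting the heavier tails of the $t$-distribution at finite df.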
Using pooled variance¶
The calculations are similar to those of Student's $t$-test for hypothesis testing.
# from scipy.stats import t
# d = np.mean(xS) - np.mean(xNS)
# nS, nNS = len(xS), len(xNS)
# stdS, stdNS = std(xS), std(xNS)
# var_pooled = ((nS-1)*stdS**2 + (nNS-1)*stdNS**2)/(nS + nNS - 2)
# std_pooled = np.sqrt(var_pooled)
# std_err = std_pooled * np.sqrt(1/nS + 1/nNS)
# df = nS + nNS - 2
# # for a 90% confidence interval, we need 10% in the tails
# alpha = 0.10
# # now use the inverse CDF of Student's t-distribution
# tstar = abs(t(df).ppf(alpha/2))
# CI_tpooled = [d - tstar*std_err, d + tstar*std_err]
# CI_tpooled
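The cell above is commented out because Amy's ELV dataset is not loaded in this notebook. Here is the same pooled-variance calculation run end to end on hypothetical normally distributed samples (group parameters and seed are illustrative assumptions):

```python
import numpy as np
from scipy.stats import t

np.random.seed(42)
xS  = np.random.normal(300, 20, size=20)   # hypothetical "S" group
xNS = np.random.normal(290, 20, size=20)   # hypothetical "NS" group

d = np.mean(xS) - np.mean(xNS)
nS, nNS = len(xS), len(xNS)
stdS, stdNS = np.std(xS, ddof=1), np.std(xNS, ddof=1)
var_pooled = ((nS-1)*stdS**2 + (nNS-1)*stdNS**2) / (nS + nNS - 2)
std_pooled = np.sqrt(var_pooled)
std_err = std_pooled * np.sqrt(1/nS + 1/nNS)
df = nS + nNS - 2
alpha = 0.10
tstar = abs(t(df).ppf(alpha/2))
CI_tpooled = [d - tstar*std_err, d + tstar*std_err]
print(CI_tpooled)   # 90% confidence interval, centred at d
```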
Using unpooled variance¶
The calculations are similar to those of Welch's $t$-test for hypothesis testing.
# d = np.mean(xS) - np.mean(xNS)
# nS, nNS = len(xS), len(xNS)
# stdS, stdNS = std(xS), std(xNS)
# stdD = np.sqrt(stdS**2/nS + stdNS**2/nNS)
# # Welch-Satterthwaite degrees of freedom
# df = (stdS**2/nS + stdNS**2/nNS)**2 / \
#      ((stdS**2/nS)**2/(nS-1) + (stdNS**2/nNS)**2/(nNS-1))
# # for a 90% confidence interval, we need 10% in the tails
# alpha = 0.10
# # now use the inverse CDF of Student's t-distribution
# tstar = abs(t(df).ppf(alpha/2))
# CI_tunpooled = [d - tstar*stdD, d + tstar*stdD]
# CI_tunpooled
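Again the cell above needs the ELV data, so here is a self-contained run of the Welch (unpooled) calculation on hypothetical samples with deliberately unequal variances and sizes, the situation where the Welch degrees-of-freedom correction matters:

```python
import numpy as np
from scipy.stats import t

np.random.seed(42)
xS  = np.random.normal(300, 40, size=20)   # hypothetical group, larger spread
xNS = np.random.normal(290, 10, size=30)   # hypothetical group, smaller spread

d = np.mean(xS) - np.mean(xNS)
nS, nNS = len(xS), len(xNS)
stdS, stdNS = np.std(xS, ddof=1), np.std(xNS, ddof=1)
stdD = np.sqrt(stdS**2/nS + stdNS**2/nNS)
# Welch-Satterthwaite degrees of freedom
df = (stdS**2/nS + stdNS**2/nNS)**2 / \
     ((stdS**2/nS)**2/(nS-1) + (stdNS**2/nNS)**2/(nNS-1))
alpha = 0.10
tstar = abs(t(df).ppf(alpha/2))
CI_tunpooled = [d - tstar*stdD, d + tstar*stdD]
print(df, CI_tunpooled)
```

Note that the Welch df comes out well below the pooled value $n_S + n_{NS} - 2$, which widens the interval relative to a naive pooled calculation.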