STA 101L - Summer I 2022
Raphael Morsomme
Tuesday: lecture + QA
Wednesday: work on project (online OH)
Thursday: work on project (online OH)
Friday: presentations
HT via simulation
CI via bootstrap
5 cases
one proportion
two proportions
one mean
two means
linear regression
\(\Rightarrow\) unimodal, symmetric, thin tails – bell-shaped
Source: IMS
The normal distribution describes the variability of the different statistics: \(\hat{p}\), \(\bar{x}\), \(\hat{\beta}\).
Simply look at all the histograms we have constructed from simulated samples (HT) and bootstrap samples (CI)!
Classical approach: instead of approximating the sampling distribution via simulation (HT) or bootstrapping (CI), we approximate it with a normal distribution.
We have seen that if a numerical variable \(X\) is normally distributed
\[ X\sim N(\mu, \sigma^2) \]
then the sample average is also normally distributed
\[ \bar{x} \sim N\left(\mu, \frac{\sigma^2}{n}\right) \]
In practice, we cannot assume that the variable \(X\) is exactly normally distributed.
But as long as
the sample is large, or
the variable is approximately normal: unimodal, roughly symmetric and no serious outlier
\(\bar{x}\) is well approximated by a normal distribution
\[ \bar{x} \sim N\left(\mu, \frac{\sigma^2}{n}\right) \]
See the numerous histograms for case 3 (one mean) where the distribution of \(\bar{x}\) always looks pretty normal.
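Not from the original slides, a minimal simulation sketch of this idea: even when the variable \(X\) is skewed (exponential here), the histogram of simulated sample means looks bell-shaped.

set.seed(1)
n <- 50                                             # sample size (arbitrary)
x_bars <- replicate(5000, mean(rexp(n, rate = 1)))  # 5,000 simulated sample means
hist(x_bars, breaks = 40, main = "Simulated sampling distribution of the mean")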
If
the observations are independent – the independence condition
\(p\) is not extreme and \(n\) is not small \((pn\ge 10 \text{ and } (1-p)n\ge 10)\) – the success-failure condition
the distribution of \(\hat{p}\) can be approximated by a normal distribution
\[ \hat{p} \sim N\left(p, \frac{p(1-p)}{n}\right) \]
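As a sketch (not from the slides), we can check this approximation by simulation; the values of n and p below are arbitrary but satisfy the success-failure condition.

set.seed(1)
n <- 100; p <- 0.3                               # np >= 10 and n(1 - p) >= 10
p_hats <- rbinom(5000, size = n, prob = p) / n   # 5,000 simulated sample proportions
hist(p_hats, breaks = 30, freq = FALSE)
curve(dnorm(x, mean = p, sd = sqrt(p * (1 - p) / n)), add = TRUE)  # normal approximation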
Step 1: we are interested in the distribution of the statistic under \(H_0\).
Modern approach: simulate from this distribution
Classical approach: approximate this distribution with a normal distribution
Step 2: we want to compute the p-value
Modern approach: the p-value is the proportion of simulations with a statistic at least as extreme as that of the observed sample
Classical approach: the p-value is the area under the curve of the normal distribution that is at least as extreme as the observed statistic.
R will compute the p-value for you. Here is what R does behind the scenes:
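A sketch of that calculation for the one-proportion example below (n = 1500, x = 780, \(H_0: p = 0.5\)), without the continuity correction that prop.test() applies, so the p-value differs slightly from the output further down:

n <- 1500; x <- 780
p_hat <- x / n                      # observed statistic: 0.52
p_0 <- 0.5                          # value under H0
se_0 <- sqrt(p_0 * (1 - p_0) / n)   # SE of p-hat under H0
z <- (p_hat - p_0) / se_0           # standardized statistic
2 * pnorm(-abs(z))                  # two-sided p-value, roughly 0.12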
Step 2: identify the upper and lower bounds of the CI
Modern approach: find the appropriate percentiles among the simulated values
Classical approach: find the appropriate percentiles of the normal approximation
R will compute the upper and lower bounds for you. Here is what R does behind the scenes:
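A sketch of that calculation for the same one-proportion example at the 99% level; prop.test() computes its CI a little differently (score interval with continuity correction), so the bounds will not match the output below exactly:

n <- 1500; x <- 780
p_hat <- x / n
se <- sqrt(p_hat * (1 - p_hat) / n)          # SE evaluated at p-hat
z_star <- qnorm(0.995)                       # 99% CI: 0.5% in each tail
c(p_hat - z_star * se, p_hat + z_star * se)  # roughly (0.487, 0.553)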
n <- 1500 # sample size
x <- 780 # number of successes
prop.test(
x, n, # observed data
p = 0.5, # value in the null hypothesis
conf.level = 0.99 # confidence level for CI
)
1-sample proportions test with continuity correction
data: x out of n, null probability 0.5
X-squared = 2.3207, df = 1, p-value = 0.1277
alternative hypothesis: true p is not equal to 0.5
99 percent confidence interval:
0.4864251 0.5533970
sample estimates:
p
0.52
The simulation-based HT yielded a p-value of 0.127.
Conditions: independence, success-failure condition
Consider the gender discrimination study.
n_m <- 24; n_f <- 24 # sample sizes
x_m <- 14; x_f <- 21 # numbers of promotions
prop.test(c(x_m, x_f), c(n_m, n_f))
2-sample test for equality of proportions with continuity correction
data: c(x_m, x_f) out of c(n_m, n_f)
X-squared = 3.7978, df = 1, p-value = 0.05132
alternative hypothesis: two.sided
95 percent confidence interval:
-0.57084188 -0.01249145
sample estimates:
prop 1 prop 2
0.5833333 0.8750000
Independence within groups (same as case 1)
Independence between groups
Success-failure condition for each group (10 successes and 10 failures in each group)
Using the simulation-based HT, we found a p-value of 0.0435.
Independence
Normality – can be relaxed for larger samples \((n\ge30)\)
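For reference (a sketch, not from the slides): a one-sample t-test in R, assuming the fuel-economy data (ggplot2::mpg) that appears to be used in the next case, and an arbitrary null value of 25 mpg.

library(ggplot2)          # for the mpg data (an assumption, not the slides' data)
t.test(mpg$hwy, mu = 25)  # H0: mu = 25, two-sided by default; also returns a 95% CI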
There are two implementations; which one is more convenient depends on the structure of the data.
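The first implementation uses the formula interface, which suits data in long form (one row per car, with a grouping variable). A call like the following sketch, assuming d is the ggplot2::mpg data, produces the first output below:

t.test(hwy ~ year, data = d)  # compare mean hwy between the 1999 and 2008 groups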
Welch Two Sample t-test
data: hwy by year
t = -0.032864, df = 231.64, p-value = 0.9738
alternative hypothesis: true difference in means between group 1999 and group 2008 is not equal to 0
95 percent confidence interval:
-1.562854 1.511572
sample estimates:
mean in group 1999 mean in group 2008
23.42735 23.45299
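The second implementation passes the two samples as separate vectors (wide form); again a sketch, assuming d is ggplot2::mpg:

t.test(d$cty, d$hwy)  # compare mean city and highway mileage as two samples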
Welch Two Sample t-test
data: d$cty and d$hwy
t = -13.755, df = 421.79, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-7.521683 -5.640710
sample estimates:
mean of x mean of y
16.85897 23.44017
Independence within groups
Independence between groups
Normality in each group (same as case 3 – one mean)
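The paired output below is consistent with a call like this sketch (assuming, as before, that d is ggplot2::mpg):

t.test(d$cty, d$hwy, paired = TRUE)  # each car measured twice, so analyze the differences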
Paired t-test
data: d$cty and d$hwy
t = -44.492, df = 233, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-6.872628 -6.289765
sample estimates:
mean of the differences
-6.581197
Paired observations
Independence between pairs
Normality
d <- heart_transplant %>% mutate(survived_binary = survived == "alive") # heart_transplant comes from the openintro package
m <- glm(survived_binary ~ age + transplant, family = "binomial", data = d) # logistic regression
tidy(m) # tidy() comes from the broom package
# A tibble: 3 x 5
term estimate std.error statistic p.value
<chr> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) 0.973 1.08 0.904 0.366
2 age -0.0763 0.0255 -2.99 0.00277
3 transplanttreatment 1.82 0.668 2.73 0.00635
Linearity
Independence
Normality
Equal variability (homoskedasticity)
\(\Rightarrow\) verify with a residual plot!
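As a sketch (not from the slides), a residual plot for a simple linear model, using the ggplot2::mpg data purely as an illustration:

library(ggplot2)
library(broom)

m_lm <- lm(hwy ~ displ, data = mpg)  # any fitted lm object works here

ggplot(augment(m_lm), aes(x = .fitted, y = .resid)) +  # augment() adds .fitted and .resid
  geom_point() +
  geom_hline(yintercept = 0, linetype = "dashed") +
  labs(x = "Fitted values", y = "Residuals")

Look for a patternless cloud with roughly constant spread around the dashed line.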
Standard error (SE): standard deviation of the normal approximation.
The SE measures the variability of the statistic.
\(SE(\hat{p})=\sqrt{\frac{p(1-p)}{n}}\)
\(SE(\hat{p}_{diff})=\sqrt{\frac{p_1(1-p_1)}{n_1}+\frac{p_2(1-p_2)}{n_2}}\)
\(SE(\bar{x}) = \sqrt{\frac{\sigma^2}{n}}\)
\(SE(\bar{x}_{diff}) = \sqrt{\frac{\sigma^2_1}{n_1}+\frac{\sigma^2_2}{n_2}}\)
\(SE(\hat{\beta})\) has a complicated form.
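For instance, a quick check using the one-proportion example from earlier, with n = 1500 and \(\hat{p} = 0.52\):

n <- 1500; p_hat <- 0.52
sqrt(p_hat * (1 - p_hat) / n)  # SE(p-hat), roughly 0.013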
ask <- openintro::ask %>%
mutate(
response = if_else(response == "disclose", "Disclose problem", "Hide problem"),
question_class = case_when(
question_class == "general" ~ "General",
question_class == "neg_assumption" ~ "Negative assumption",
question_class == "pos_assumption" ~ "Positive assumption"
),
question_class = fct_relevel(question_class, "General", "Positive assumption", "Negative assumption")
)
Question | Disclose problem | Hide problem | Total |
---|---|---|---|
General | 2 | 71 | 73 |
Positive assumption | 23 | 50 | 73 |
Negative assumption | 36 | 37 | 73 |
Total | 61 | 158 | 219 |
Source: IMS
\(H_0\): the response is independent of the question asked
\(H_a\): the response depends on the question asked
We will not quantify the differences between the three questions with CIs.
Question | Disclose problem | (expected) | Hide problem | (expected) | Total |
---|---|---|---|---|---|
General | 2 | (20.33) | 71 | (52.67) | 73 |
Positive assumption | 23 | (20.33) | 50 | (52.67) | 73 |
Negative assumption | 36 | (20.33) | 37 | (52.67) | 73 |
Total | 61 | NA | 158 | NA | 219 |
Source: IMS
Is the difference between the expected and observed counts due to
chance alone, or
the fact that the way people responded depended on the question asked?
\(\chi^2\) (“chi-squared”, pronounced “kai-squared”) statistic:
\[ \chi^2 = \dfrac{(O_{11} - E_{11})^2}{E_{11}} + \dfrac{(O_{21} - E_{21})^2}{E_{21}} + \dots + \dfrac{(O_{32} - E_{32})^2}{E_{32}} \]
\[ \begin{aligned} &\text{General formula} && \frac{(\text{observed count } - \text{expected count})^2} {\text{expected count}} \\ &\text{Row 1, Col 1} && \frac{(2 - 20.33)^2}{20.33} = 16.53 \\ &\text{Row 2, Col 1} && \frac{(23 - 20.33)^2}{20.33} = 0.35 \\ & \hspace{9mm}\vdots && \hspace{13mm}\vdots \\ &\text{Row 3, Col 2} && \frac{(37 - 52.67)^2}{52.67} = 4.66 \end{aligned} \]
\[\chi^2 = 16.53 + 0.35 + \dots + 4.66 = 40.13\]
Source: IMS
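The same arithmetic in R (a sketch, computing the statistic directly from the observed table above):

observed <- matrix(c( 2, 71,
                     23, 50,
                     36, 37),
                   nrow = 3, byrow = TRUE)
expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)  # row total x column total / grand total
sum((observed - expected)^2 / expected)  # roughly 40.13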
When the conditions of
independence
at least 5 expected counts in each cell
are satisfied, the \(\chi^2\) statistic approximately follows a \(\chi^2\) distribution with \((\text{number of rows} - 1) \times (\text{number of columns} - 1)\) degrees of freedom (here, \(df = (3-1)(2-1) = 2\)).
Source: IMS
# A tibble: 6 x 3
question_class question response
<fct> <chr> <chr>
1 General What can you tell me about it? Hide problem
2 Positive assumption It doesn't have any problems, does it? Hide problem
3 Positive assumption It doesn't have any problems, does it? Disclose problem
4 Negative assumption What problems does it have? Disclose problem
5 General What can you tell me about it? Hide problem
6 Negative assumption What problems does it have? Disclose problem
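The output below is consistent with a call like this sketch:

chisq.test(ask$response, ask$question_class)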
Pearson's Chi-squared test
data: ask$response and ask$question_class
X-squared = 40.128, df = 2, p-value = 0.000000001933
When the conditions are not met, you need to conduct a HT via simulation.
See Section 18.1 for an example.
Source: IMS
\(H_0: \mu_{OF} = \mu_{IF} = \mu_{C}\) (the mean batting performance is the same across all three positions)
\(H_a\): at least one mean is different
We will not quantify the differences between the three positions with CIs.
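The slides' R output is not reproduced here; as a sketch, the ANOVA could be run as follows, assuming a data frame d_mlb (a hypothetical name) with a batting measure OBP and a three-level position variable (OF, IF, C):

m_aov <- aov(OBP ~ position, data = d_mlb)  # one-way ANOVA (d_mlb is a hypothetical data frame)
summary(m_aov)                              # F statistic and p-value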
Independence within groups
Independence between groups
Normality (sample size and outliers)
Constant variance
Verify assumptions 3 and 4 with side-by-side histograms
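For example (a sketch, with the same hypothetical d_mlb as above):

library(ggplot2)
ggplot(d_mlb, aes(x = OBP)) +  # d_mlb is a hypothetical data frame
  geom_histogram(bins = 20) +
  facet_wrap(~ position)       # compare shape and spread across the three groups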
When the conditions are not met, you need to conduct a HT via simulation.
See Section 22.2 for an example.