
STA 101L - Summer I 2022

Raphael Morsomme

- Presentations are < 5 minutes
- Q&A: ask a question!
- Discussion

- Homework 5 due Sunday
- Wednesday’s schedule: 3:30-4:15 lecture; 4:15-5:45 (Roy’s OH)

Types of data and studies

Visualization and numerical summaries

Regression models

linear regression

logistic regression

model selection

Source: IMS

Statistical inference

proportions

means

linear and logistic regression

Statistical inference

Five cases

Hypothesis test

Confidence interval

A first glimpse of modern statistics

We want to learn about some (unknown) **parameter** of some **population** of interest from a (small) **sample** of observations

- Examples of parameters: proportion of vegetarians among Duke students, average weight gained by US adults during the Covid-19 pandemic, etc.

- In the remainder of the course, we will always assume that we have a *random* sample of the population.


**Inference**: estimating the population parameter from the sample.

**Statistical inference**: estimating the population parameter from the sample while rigorously quantifying our *certainty* in the estimate.

**Statistic**: any function of some data

- e.g., average, median, IQR, maximum, variance, etc.

**Sample statistic**: a statistic computed on the sample

**Summary statistic**: a statistic used to summarize a sample

**Test statistic**: a statistic used for statistical inference

To estimate a population parameter, we can simply

- obtain a representative sample, and
- use the corresponding sample statistic as an estimate

- to estimate the median age of Duke students, simply collect a sample of students and compute their median age.

A single point (the sample statistic) does not indicate how certain we are in our estimate.

If we have a

- *large* sample, we can be pretty confident that our estimate will be close to the true value of the parameter;
- *small* sample, we know that our estimate may be far from the truth.

e.g., free throws in basketball, penalty kicks in soccer.

Framework to make rigorous statements about uncertainty in our estimates

confidence intervals

- range of plausible values for the population parameter

hypothesis tests

- evaluate competing claims

What is the proportion of vegetarians among Duke students?

**Population parameter**: proportion of vegetarians among Duke students \((p)\)

**Sample statistic**: proportion of vegetarians in the sample \((\hat{p})\)

**Confidence interval**: a range of plausible values for the population parameter \(p\)

- for example, \((0.31, 0.43)\)

**Hypothesis test**: is the proportion of vegetarians among Duke students \(0.5\)?

- \(H_0:p=0.5, \qquad H_a:p\neq0.5\)

Is the proportion of vegetarians the same among Duke undergraduate and graduate students?

**Population parameter**: difference between the proportion of vegetarians among Duke undergraduate and graduate students \((p_{diff} = p_{undergrad}-p_{grad})\)

**Sample statistic**: difference in proportion of vegetarians in the sample \((\hat{p}_{diff} = \hat{p}_{undergrad} - \hat{p}_{grad})\)

**Confidence interval** for \(p_{diff}\): \((-0.05, 0.08)\)

**Hypothesis test**: is the proportion of vegetarians the same among Duke undergraduate and graduate students?

- \(H_0:p_{diff}=0, \qquad H_a:p_{diff}\neq0\)

How much time do Duke students sleep on average per night?

**Population parameter**: mean amount of time that Duke students sleep per night \((\mu)\)

**Sample statistic**: average amount of time slept in the sample \((\bar{x})\)

**Confidence interval** for \(\mu\): \((5.5, 7.5)\)

**Hypothesis test**: Do Duke students sleep on average \(8\) hours per night?

- \(H_0:\mu=8, \qquad H_a:\mu\neq8\)

Do Duke undergraduate and graduate students sleep on average the same amount of time per night?

**Population parameter**: difference between the mean amount of time that Duke undergraduate and graduate students sleep per night \((\mu_{diff} = \mu_{undergrad}-\mu_{grad})\)

**Sample statistic**: difference between the two sample averages \((\bar{x}_{diff} = \bar{x}_{undergrad} - \bar{x}_{grad})\)

**Confidence interval** for \(\mu_{diff}\) : \((-0.5, 1)\)

**Hypothesis test**: Do Duke undergraduate and graduate students sleep on average the same amount of time per night?

- \(H_0:\mu_{diff}=0, \qquad H_a:\mu_{diff}\neq0\)

What is the relation between fuel consumption in the city and on the highway?

**Population parameter**: the coefficient \(\beta_1\) in the equation \(\text{hwy} \approx \beta_0 + \beta_1 \text{cty}\).

**Sample statistic**: the least-squares estimate \(\hat{\beta}_1\).

**Confidence interval** for \(\beta_1\): \((1.05, 1.2)\)

**Hypothesis test**: are the variables \(\text{cty}\) and \(\text{hwy}\) independent?

- \(H_0:\beta_1=0, \qquad H_a:\beta_1\neq0\)

Two competing hypotheses:

- the **null hypothesis** \(H_0\): “nothing is going on”, i.e., there is no effect, no difference

- the **alternative hypothesis** \(H_a\): “something is going on”, i.e., there is an effect, there is a difference

Consider the 2nd case (difference in proportion of vegetarians between undergrad and grad students).

\(H_0:\) the proportion of vegetarians is the same among undergraduate and graduate students (“nothing is going on”)

$$H_0: p_{diff} = p_{undergrad} - p_{grad} = 0, \qquad \text{i.e., } p_{undergrad} = p_{grad}$$

\(H_a:\) the proportion of vegetarians among undergraduate and graduate students is not the same (“something is going on”)

$$H_a: p_{diff} = p_{undergrad} - p_{grad} \neq 0, \qquad \text{i.e., } p_{undergrad} \neq p_{grad}$$

- Are the Covid-19 vaccines equally effective?
- \(H_0\): the vaccines are all equally effective; \(H_a\): the vaccines are not all equally effective.

- Does caffeine consumption affect student participation in class
- \(H_0\): caffeine consumption does not affect student participation; \(H_a\): caffeine consumption affects student participation.

- Are men and women paid equally in the workplace?
- \(H_0\): men and women are paid equally; \(H_a\): men and women are not paid equally.

- Have Duke students gained weight since the start of the Covid-19 pandemic?
- \(H_0\): Duke students have not gained weight; \(H_a\): Duke students have gained weight.


Go back to the 2nd case and suppose that \(H_0\) is true.

- We’ll probably still observe a small difference between undergrad and grad students in the sample.

Now suppose that \(H_a\) is true.

- We’ll probably observe a larger difference in the sample,
- but we might also observe no difference at all,
- or observe a difference in the wrong direction!


**CI**: range of plausible values for the population parameter.

There always is natural variability in the data.

If we draw a second sample from the population, the two samples will differ and the sample statistics will not be the same (e.g., \(\hat{p}_1\neq\hat{p}_2\), \(\bar{x}_1\neq\bar{x}_2\) and \(\hat{\beta}^{(1)}_1\neq\hat{\beta}^{(2)}_1\)).

There is thus no reason to believe that the sample statistic in the first sample is exactly equal to the population parameter (e.g., that \(\hat{p}_1 = p\) or that \(\bar{x}_1=\mu\)).
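This sample-to-sample variability is easy to see by simulation. A minimal sketch, assuming a made-up true proportion of vegetarians \(p = 0.37\):

```r
set.seed(3)
# two independent samples of 100 students from the same population (1 = vegetarian)
sample_1 <- rbinom(100, size = 1, prob = 0.37)
sample_2 <- rbinom(100, size = 1, prob = 0.37)

mean(sample_1) # p-hat from the first sample
mean(sample_2) # p-hat from the second sample: typically not the same
```

Both sample proportions estimate the same \(p\), yet they will usually differ from each other and from \(p\) itself.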

What is the approval rate of the US president?

What proportion of Duke students are vegetarian?

How much weight have US adults gained since the start of the Covid-19 pandemic?

CI: range of plausible values based on a sample.

We will learn two approaches to statistical inference

**classical**: pen-and-paper, pre-computer era

- based on simple mathematical formulas
- requires the data to satisfy certain conditions

**modern**: computer-intensive

- models the variability in the data by repeating a procedure many times (for-loop)
- always applicable


The following for-loop does the previous experiment efficiently!

```
library(tidyverse) # for tibble(), rbernoulli() and %>%

set.seed(0)
results <- tibble(prop_heads = numeric())    # empty data frame to collect the results
for (i in 1:1e3) {                           # repeat the experiment 1,000 times
  flips <- rbernoulli(100, p = 0.5)          # flip 100 coins (sample)
  n_heads <- sum(flips)                      # count the number of heads
  prop_heads <- n_heads / 100                # proportion of heads (sample statistic)
  results <- results %>% add_row(prop_heads) # add the sample statistic to `results`
}
```

Distribution of the sample statistic \(\hat{p}\) when \(H_0\) is true.

\(H_0:p=0.5\) (the coin is fair)

\(H_a:p\neq0.5\) (the coin is not fair)

\(H_0:\) innocent (the coin is fair)

\(H_a:\) guilty (the coin is not fair)

Question: do the facts (the sample) provide sufficient evidence to reject the claim that the defendant is innocent (that the coin is fair)?

If so, we **reject** \(H_0\); otherwise, we **fail to reject** \(H_0\).

Suppose the sample consists of **55 heads** out of 100 flips.

\(\Rightarrow\) such a sample is plausible under \(H_0\); the observed data do not provide strong evidence against the null hypothesis; we fail to reject the claim that the coin is fair

- The coin might be unfair, but the data do not provide strong evidence against fairness.

Now suppose that out of 100 flips, you observe **65 heads**.

\(\Rightarrow\) this result is extremely unlikely under \(H_0\); the observed data provide strong evidence against the null hypothesis; we reject the claim that the coin is fair.

- The coin might be fair, but a fair coin will rarely give \(65\) heads out of \(100\) flips.
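The contrast between these two outcomes can be checked by simulation: generate many samples under \(H_0\) (a fair coin) and count how often a result at least as extreme as the observed one occurs. A base-R sketch, where the cutoffs 55 and 65 come from the two examples above:

```r
set.seed(1)
n_heads <- rbinom(1e5, size = 100, prob = 0.5) # 100,000 experiments: 100 fair flips each

# share of experiments at least as extreme as the observed count (two-sided)
mean(abs(n_heads - 50) >= 5)  # 55 heads or more extreme: common under H0
mean(abs(n_heads - 50) >= 15) # 65 heads or more extreme: very rare under H0
```

The first share is large (no evidence against \(H_0\)); the second is tiny (strong evidence against \(H_0\)).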

Due to the natural variability of the data, each sample is different.

In practice we only get to observe a single sample. But if we could observe other samples,

they would all be a bit different

\(\Rightarrow\) the sample statistics would also be different

\(\Rightarrow\) the estimates would also be different.

If the samples are very different

- we say that the sample-to-sample variability is large.
- we would not be very confident in the estimate

If the samples are all similar

- we say that the sample-to-sample variability is small
- we would be confident that the estimate is close to the truth

Problem: we only get to observe a single sample!

Solution: Use the sample to approximate the population and take repeated samples from the estimated population to simulate many samples.

- Equivalent to **sampling with replacement** from the sample.

Source: IMS

Computing the sample statistic of each bootstrap sample provides its **sampling distribution**.

To construct a 90% CI for some parameter, we simply identify the 5th and 95th percentiles of the sampling distribution of the corresponding sample statistic.

- the 5th and 95th percentiles of the sampling distribution of the median give the 90% CI for the median.

**Interpretation**: “We are 90% confident that the interval captures the true value of the population parameter”.
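As a sketch of the percentile method, here is a self-contained base-R version, using `replicate()` instead of a for-loop and a made-up sample `x` of 50 values (e.g., hours slept):

```r
set.seed(2)
x <- rnorm(50, mean = 6.5, sd = 1) # made-up sample of 50 observations

# bootstrap: resample x with replacement and recompute the mean, 10,000 times
boot_means <- replicate(1e4, mean(sample(x, replace = TRUE)))

# 90% percentile CI for the mean: 5th and 95th percentiles of the bootstrap means
quantile(boot_means, probs = c(0.05, 0.95))
```

Replacing `mean` with `median` (in both places) gives the 90% CI for the median instead.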

**Little variability** in the data

\(\Rightarrow\) little sample-to-sample variability

\(\Rightarrow\) little variability between bootstrap samples

\(\Rightarrow\) sampling distribution of mean/proportion is concentrated

\(\Rightarrow\) **CI is narrow**.

**Much variability** in the data

\(\Rightarrow\) much sample-to-sample variability

\(\Rightarrow\) much variability between bootstrap samples

\(\Rightarrow\) sampling distribution of mean/proportion is diffuse

\(\Rightarrow\) **CI is wide**.

```
library(tidyverse) # for tibble(), slice_sample() and %>%

d <- ggplot2::mpg
results <- tibble(mean = numeric(), sd = numeric())
for (i in 1:1e3) {
  d_boot <- slice_sample(d, n = nrow(d), replace = TRUE)  # sample from the sample, with replacement
  results <- results %>%
    add_row(mean = mean(d_boot$cty), sd = sd(d_boot$cty)) # sample statistics of the bootstrap sample
}
```


Statistical inference

Five cases

single proportion

difference between two proportions

single mean

difference between two means

regression parameters

Hypothesis test

Confidence interval

A first glimpse of modern statistics