STA 101L - Summer I 2022
Raphael Morsomme
Homework 6 due Thursday
Inference project
Remaining homework due on Mon/Thu instead of Wed/Sun?
Midterm survey
review holdout and CV
more live coding (?)
more individual assignments in class
OH before Sunday’s HW \(\Rightarrow\) HW due on Monday (OH Monday morning)
website
external links will open a new window
slide changes will not be pushed to browser history
see these steps to print the slides
One proportion
HT via simulation
CI via bootstrap
Two proportions
HT via simulation
CI via bootstrap
The distribution of a numerical variable is often modeled with a normal distribution.
The mean (\(\mu\)) – location
The standard deviation (\(\sigma\)) – spread
\[ N(\mu, \sigma) \]
The standard normal distribution \(N(\mu = 0, \sigma = 1)\) is plotted on the previous slide.
If the variable \(X\sim N(\mu, \sigma)\), then for any numbers \(a\) and \(b\)
\[ X + a \sim N(\mu+a, \sigma), \]
\[ bX \sim N(b\mu, |b| \sigma) \]
and
\[ bX + a \sim N(b\mu+a, |b|\sigma). \]
If the variables \(X\sim N(\mu_1, \sigma_1)\) and \(Y\sim N(\mu_2, \sigma_2)\) are independent, then
\[ X+Y \sim N(\mu_{tot} = \mu_1 + \mu_2, \sigma_{tot} = \sqrt{\sigma_1^2 + \sigma_2^2}) \]
The normal distribution is sometimes parameterized as \(N(\mu, \sigma^2)\), where \(\sigma^2\) is the variance.
Property I gives
\[ bX + a \sim N(b\mu + a, b^2 \sigma^2) \]
and property II gives
\[ X+Y \sim N(\mu_{tot} = \mu_1 + \mu_2, \sigma_{tot}^2 = \sigma_1^2 + \sigma_2^2) \]
The mean of the sum is the sum of the means.
The variance of the sum is the sum of the variances.
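These properties are easy to check numerically; a small base-R simulation (the parameter values here are illustrative, not from the slides):

```r
# A quick numerical check of properties I and II
set.seed(42)
n <- 1e5
x <- rnorm(n, mean = 2, sd = 3)   # X ~ N(2, 3)
y <- rnorm(n, mean = -1, sd = 4)  # Y ~ N(-1, 4), independent of X

# Property I: bX + a ~ N(b*mu + a, |b|*sigma)
z <- 5 * x + 1
c(mean(z), sd(z))  # should be close to 11 and 15

# Property II: X + Y ~ N(mu1 + mu2, sqrt(sigma1^2 + sigma2^2))
s <- x + y
c(mean(s), sd(s))  # should be close to 1 and sqrt(9 + 16) = 5
```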
Population parameter: mean \(\mu\)
Sample statistic: sample average \(\bar{x}\)
Hypothesis testing:
\(H_0:\mu=\mu_0\) where \(\mu_0\) is a fixed number
\(H_a:\mu\neq \mu_0\)
Confidence interval: range of plausible values for \(\mu\).
What is the average price of a car on Awesome Car?
\[\begin{align*} \bar{x} & = \dfrac{18,300+20,100+9,600+10,700+27,000}{5} \\ & = 17,140 \end{align*}\]
Source: IMS
Let us assume that car prices follow a normal distribution with mean \(\mu\) and sd \(\sigma\):
\[ \text{price} \sim N(\mu, \sigma) \]
\(H_0:\mu=10,000\)
\(H_a:\mu\neq 10,000\)
Simulate many samples under \(H_0\).
Determine if the observed data could have plausibly arisen under \(H_0\).
To simulate from the normal distribution, we need to specify both the mean \(\mu\) and the sd \(\sigma\).
Problem: under \(H_0\) we only know that \(\mu = 10,000\); we do not know what value to use for \(\sigma\)!
\(\Rightarrow\) We cannot conduct a hypothesis test via simulation!
Sample with replacement from the observed sample to construct many bootstrap samples.
Bootstrap samples \(\Rightarrow\) sampling distribution \(\Rightarrow\) CI
Source: IMS
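For the five car prices above, this bootstrap CI can be sketched in base R (the 95% level and 1,000 replicates are illustrative choices; the course code may differ):

```r
prices <- c(18300, 20100, 9600, 10700, 27000)  # observed sample of car prices

set.seed(0)
# resample with replacement and compute the average of each bootstrap sample
boot_means <- replicate(1000, mean(sample(prices, replace = TRUE)))

# percentile 95% confidence interval for mu
quantile(boot_means, probs = c(0.025, 0.975))
```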
A population divided into two groups.
Population parameter: difference in means
\[ \mu_{diff}=\mu_1-\mu_2 \]
Sample statistic: difference in sample averages
\[ \bar{x}_{diff}=\bar{x}_1-\bar{x}_2 \]
\(H_0:\mu_{diff}=0\) (no difference between the two groups)
\(H_a:\mu_{diff}\neq0\)
A professor considers two designs for an exam. Are the two types of exam equally difficult?
\(H_0:\mu_{diff}=0\)
\(H_a:\mu_{diff}\neq 0\)
Simulate many samples under \(H_0\) (no difference)
Determine if the observed data could have plausibly arisen under \(H_0\)
Under \(H_0\), there is no difference between the two exams
\(\Rightarrow\) the score is independent of the type of exam
\(\Rightarrow\) randomly re-assign the scores independently of the exam type.
Source: IMS
Using the usual significance level \(\alpha = 0.05\), we fail to reject the null hypothesis.
Same idea as before: sample with replacement from the observed data to construct many bootstrap samples.
Bootstrap samples \(\Rightarrow\) sampling distribution \(\Rightarrow\) CI
Source: IMS
sample_observed_a <- d %>% filter(exam == "a")
sample_observed_b <- d %>% filter(exam == "b")
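The helper `sample_bootstrap()` is defined in the course materials but not shown on this slide; a minimal base-R sketch consistent with how it is used below (the actual course version may differ):

```r
# resample the rows of a data frame with replacement,
# producing a bootstrap sample of the same size
sample_bootstrap <- function(data) {
  data[sample(nrow(data), replace = TRUE), , drop = FALSE]
}
```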
set.seed(0)
sample_bootstrap(sample_observed_a) # bootstrap sample
# A tibble: 58 x 2
score exam
<dbl> <fct>
1 58 a
2 59 a
3 81 a
4 71 a
5 67 a
6 58 a
7 80 a
8 72 a
9 58 a
10 72 a
# ... with 48 more rows
sample_bootstrap(sample_observed_b) # bootstrap sample
# A tibble: 55 x 2
score exam
<dbl> <fct>
1 61 b
2 83 b
3 68 b
4 71 b
5 82 b
6 46 b
7 78 b
8 72 b
9 64 b
10 83 b
# ... with 45 more rows
results <- tibble(x_diff_bar = numeric())
for(i in 1 : 1e3){
d_boot_a <- sample_bootstrap(sample_observed_a) # bootstrap sample
d_boot_b <- sample_bootstrap(sample_observed_b) # bootstrap sample
x_diff_bar <- compute_x_diff(rbind(d_boot_a, d_boot_b)) # bootstrap statistic
results <- results %>% add_row(x_diff_bar = x_diff_bar)
}
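The collected bootstrap statistics then give a percentile CI, e.g. `quantile(results$x_diff_bar, c(0.025, 0.975))`. A self-contained sketch with a simulated stand-in for `results` (since `sample_bootstrap` and the exam data live elsewhere):

```r
set.seed(1)
# stand-in for the 1,000 bootstrap differences collected in the loop above
results <- data.frame(x_diff_bar = rnorm(1000, mean = -2, sd = 2))

# 95% percentile confidence interval for mu_diff
quantile(results$x_diff_bar, probs = c(0.025, 0.975))
```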
Paired data: two groups in which each observation in one group has exactly one corresponding observation in the other group.
Example: pre/post-evaluations; supermarket items; batteries and electronic devices; tires and cars.
We want to compare the longevity of two brands of tire. The response variable is tire tread after 1000 miles.
25 cars drove 1000 miles. On each car, one tire was from Smooth Turn and another one was from Quick Spin. The other two tires were baseline tires.
set.seed(1)
bias <- runif(25, -0.01, 0.01)
brandA <- rnorm(25, bias + 0.310, 0.003)
brandB <- rnorm(25, bias + 0.308, 0.003)
car <- paste("car", 1:25)
miny <- min(brandA, brandB) - .003
maxy <- max(brandA, brandB) + .003
tires <- tibble(
tread = c(brandA, brandB),
car = rep(car, 2),
brand = c(rep("Smooth Turn", 25), rep("Quick Spin", 25))
) %>%
arrange(car)
head(tires)
# A tibble: 6 x 3
tread car brand
<dbl> <chr> <chr>
1 0.304 car 1 Smooth Turn
2 0.307 car 1 Quick Spin
3 0.302 car 10 Smooth Turn
4 0.303 car 10 Quick Spin
5 0.307 car 11 Smooth Turn
6 0.305 car 11 Quick Spin
Source: IMS
Exactly the same as with one mean (case 3).
tires_diff <- tires %>%
pivot_wider(names_from = brand, values_from = tread) %>%
mutate(tread_diff = `Smooth Turn` - `Quick Spin`) %>%
select(car, tread_diff)
head(tires_diff)
# A tibble: 6 x 2
car tread_diff
<chr> <dbl>
1 car 1 -0.00258
2 car 10 -0.00107
3 car 11 0.00192
4 car 12 0.00251
5 car 13 0.00271
6 car 14 0.00639
Simply construct a CI for the variable tread_diff with bootstrap samples.
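A base-R sketch of this bootstrap CI, using a simulated stand-in for `tires_diff$tread_diff` (the parameter values roughly mirror the simulation that generated the tire data):

```r
set.seed(1)
# stand-in for tires_diff$tread_diff: 25 paired differences in tread
tread_diff <- rnorm(25, mean = 0.002, sd = 0.004)

# bootstrap the differences and average each resample
boot_means <- replicate(1000, mean(sample(tread_diff, replace = TRUE)))

# 95% percentile confidence interval for mu_diff
quantile(boot_means, probs = c(0.025, 0.975))
```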
\(H_0:\mu_{diff}=0\)
\(H_a:\mu_{diff}\neq 0\)
Simulate many samples under \(H_0\) (no difference)
Determine if the observed data could have plausibly arisen under \(H_0\)
Under \(H_0\), there is no difference between the two tire brands
\(\Rightarrow\) tire tread is independent of tire brand
\(\Rightarrow\) randomly re-assign tire tread independently of tire brand.
function to shuffle data
shuffle_data <- function(data){
  data %>%
    group_by(car) %>%
    mutate(tread = sample(tread))
}
set.seed(0)
tires_shuffled <- shuffle_data(tires)
head(tires)
# A tibble: 6 x 3
tread car brand
<dbl> <chr> <chr>
1 0.304 car 1 Smooth Turn
2 0.307 car 1 Quick Spin
3 0.302 car 10 Smooth Turn
4 0.303 car 10 Quick Spin
5 0.307 car 11 Smooth Turn
6 0.305 car 11 Quick Spin
head(tires_shuffled)
# A tibble: 6 x 3
# Groups: car [3]
tread car brand
<dbl> <chr> <chr>
1 0.307 car 1 Smooth Turn
2 0.304 car 1 Quick Spin
3 0.303 car 10 Smooth Turn
4 0.302 car 10 Quick Spin
5 0.307 car 11 Smooth Turn
6 0.305 car 11 Quick Spin
function for computing the test statistic
results %>%
  mutate(is_more_extreme = abs(stat_sim) >= abs(stat_obs)) %>%
  summarize(p_value = mean(is_more_extreme))
# A tibble: 1 x 1
p_value
<dbl>
1 0.012
Using the usual significance level \(\alpha = 0.05\), we reject \(H_0\).
Let us conduct a hypothesis test for the tire data, but this time without pairing the data.
\(\Rightarrow\) This is simply a hypothesis test for two means.
compute_x_diff <- function(data){
x_bar <- data %>%
group_by(brand) %>%
summarize(x_bar = mean(tread))
x_diff_bar <- x_bar$x_bar[1] - x_bar$x_bar[2]
return(x_diff_bar)
}
compute_x_diff(tires)
[1] -0.0020934
x_diff_obs <- compute_x_diff(tires)
results %>%
mutate(is_more_extreme = abs(x_diff_bar) >= abs(x_diff_obs)) %>%
summarize(p_value = mean(is_more_extreme))
# A tibble: 1 x 1
p_value
<dbl>
1 0.229
The p-value is larger than \(\alpha=0.05\); we fail to reject the null hypothesis.