STA 101L - Summer I 2022
Raphael Morsomme
Find your teammate here.
If you have not started yet, start now. You can already do 80% of the project.
You will have the tools to do the remaining 20% after this lecture.
simple linear regression model \[ Y \approx \beta_0 + \beta_1 X \]
multiple linear regression model \[ Y \approx \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_p X_p \]
categorical predictor
feature engineering
Earlier, we saw that adding the predictor \(\dfrac{1}{\text{displ}}\) gave a better fit.
Let us see if the same idea works with the trees dataset.
Suppose we want to predict volume using only the variable girth.
One could argue that there is a slight nonlinear association between girth and volume.
We start with the simple model
\[ \text{Volume} \approx \beta_0 + \beta_1 \text{girth} \]
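The fitted object m1 used in the prediction code below is not shown on this slide; it presumably comes from a call along these lines (an assumption based on the model formula above):
# Assumed fit of the simple model (the slide's own command is not shown)
m1 <- lm(Volume ~ Girth, data = d_tree)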
# Grid of Girth values spanning the observed range, and the corresponding predictions of m1
Girth_new <- seq(min(d_tree$Girth), max(d_tree$Girth), by = 0.001)
new_data <- tibble(Girth = Girth_new)
Volume_pred <- predict(m1, new_data)
new_data <- mutate(new_data, Volume_pred = Volume_pred)
head(new_data)
# A tibble: 6 x 2
Girth Volume_pred
<dbl> <dbl>
1 8.44 5.51
2 8.44 5.52
3 8.44 5.52
4 8.44 5.53
5 8.44 5.53
6 8.44 5.54
To capture the nonlinear association between Girth and Volume, we consider the predictor \(\text{Girth}^2\).
\[ \text{Volume} \approx \beta_0 + \beta_1 \text{girth} + \beta_2 \text{girth}^2 \]
The commands to fit this model are:
d_tree2 <- mutate(d_tree, Girth2 = Girth^2)
m2 <- lm(Volume ~ Girth + Girth2, data = d_tree2)
compute_R2(m2)
[1] 0.958
\(R^2\) has increased! It went from 0.933 (model 1) to 0.958.
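The helper compute_R2 is not defined on these slides; a minimal sketch, assuming it simply computes \(R^2 = 1 - SSR/SST\) from a fitted lm object:
# Possible implementation of compute_R2 (an assumption, not the slides' own code)
compute_R2 <- function(m){
  y   <- fitted(m) + residuals(m)  # recover the observed response
  SSR <- sum(residuals(m)^2)       # sum of squared residuals
  SST <- sum((y - mean(y))^2)      # total sum of squares
  1 - SSR / SST
}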
Girth_new <- seq(min(d_tree$Girth), max(d_tree$Girth), by = 0.001)
new_data <- tibble(Girth = Girth_new) %>% mutate(Girth2 = Girth^2)
Volume_pred <- predict(m2, new_data)
new_data <- mutate(new_data, Volume_pred = Volume_pred)
head(new_data)
# A tibble: 6 x 3
Girth Girth2 Volume_pred
<dbl> <dbl> <dbl>
1 8.44 71.2 11.5
2 8.44 71.2 11.5
3 8.44 71.2 11.5
4 8.44 71.2 11.5
5 8.44 71.2 11.5
6 8.44 71.3 11.5
Let us also include the predictor \(\text{Girth}^3\).
\[ \text{Volume} \approx \beta_0 + \beta_1 \text{girth} + \beta_2 \text{girth}^2 + \beta_3 \text{girth}^3 \]
d_tree3 <- mutate(d_tree, Girth2 = Girth^2, Girth3 = Girth^3)
m3 <- lm(Volume ~ Girth + Girth2 + Girth3, data = d_tree3)
compute_R2(m3)
[1] 0.96
\(R^2\) has increased! It went from 0.958 (model 2) to 0.96.
Girth_new <- seq(min(d_tree$Girth), max(d_tree$Girth), by = 0.001)
new_data <- tibble(Girth = Girth_new) %>% mutate(Girth2 = Girth^2, Girth3 = Girth^3)
Volume_pred <- predict(m3, new_data)
new_data <- mutate(new_data, Volume_pred = Volume_pred)
head(new_data)
# A tibble: 6 x 4
Girth Girth2 Girth3 Volume_pred
<dbl> <dbl> <dbl> <dbl>
1 8.44 71.2 601. 9.83
2 8.44 71.2 601. 9.83
3 8.44 71.2 601. 9.83
4 8.44 71.2 601. 9.84
5 8.44 71.2 601. 9.84
6 8.44 71.3 602. 9.84
What if we also include the predictors \(\text{Girth}^4, \text{Girth}^5, \dots, \text{Girth}^{k}\) for some larger number \(k\)?
The model with \(k = 81\), that is,
\[ \text{Volume} \approx \beta_0 + \beta_1 \text{girth} + \beta_2 \text{girth}^2 + \dots + \beta_{81} \text{girth}^{81} \]
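can be fit in R with a command along the following lines (a sketch; the slide's own command is not reproduced here, and the poly() syntax is borrowed from later slides):
# Assumed fitting command for the degree-81 polynomial model.
# With only 31 trees, many of the 81 coefficients cannot be estimated; lm() reports them as NA.
m_extreme <- lm(Volume ~ poly(Girth, degree = 81, raw = TRUE), data = d_tree)
compute_R2(m_extreme)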
[1] 0.981
\(R^2\) has again increased! It went from 0.96 (model 3) to 0.981.
As we keep adding predictors, \(R^2\) will always increase.
Is the model m_extreme a good model?
Girth_new <- seq(min(d_tree$Girth), max(d_tree$Girth), by = 0.0001)
new_data <- tibble(Girth = Girth_new)
Volume_pred <- predict(m_extreme, new_data)
new_data <- mutate(new_data, Volume_pred = Volume_pred)
head(new_data)
# A tibble: 6 x 2
Girth Volume_pred
<dbl> <dbl>
1 8.44 10.3
2 8.44 10.4
3 8.44 10.4
4 8.44 10.4
5 8.44 10.5
6 8.44 10.5
The model m_extreme overfits the data.
A model overfits the data when it fits the sample extremely well but does a poor job for new data.
Remember that we want to learn about the population, not the sample!
We saw that \(R^2\) keeps increasing as we add predictors.
\[ R^2 = 1 - \dfrac{SSR}{SST} \]
Adding a predictor can only decrease the SSR; \(R^2\) can therefore not be used to identify models that overfit the data.
The adjusted-\(R^2\) is similar to \(R^2\), but penalizes large models:
\[ R^2_{\text{adj}} = 1 - \dfrac{SSR}{SST}\dfrac{n-1}{n-p-1} \]
where \(n\) is the sample size and \(p\) is the number of predictors (the model size).
The adjusted-\(R^2\) therefore balances goodness of fit and parsimony:
The model with the highest adjusted-\(R^2\) typically provides a good fit without overfitting.
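As a quick check with the trees data (\(n = 31\)), model 2 below has \(p = 2\) predictors and \(SSR/SST = 1 - R^2 = 1 - 0.958\), so
\[ R^2_{\text{adj}} = 1 - (1 - 0.958)\dfrac{31-1}{31-2-1} = 1 - 0.042 \times \dfrac{30}{28} \approx 0.955, \]
which matches the adj.r.squared value reported for m2 further below.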
Two other popular overall criteria that balance goodness of fit (small SSR) and parsimony (small \(p\)) are the Akaike information criterion (AIC) and the Bayesian information criterion (BIC).
The formulas for AIC and BIC are, respectively,
\[ AIC = 2p - \text{GoF}, \qquad BIC = \ln(n)\,p - \text{GoF} \]
where \(\text{GoF}\) measures the goodness of fit of the model; for both criteria, smaller values are better.
AIC and BIC are easily accessible in R with the command glance() from the broom package.
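For instance, the table below can be assembled by applying glance() to each model and stacking the resulting rows; a sketch (the slides' own command is not shown):
library(broom) # provides glance()
library(dplyr) # provides bind_rows()
# One row of fit criteria per model (a sketch, assuming the four models fit above)
bind_rows(glance(m1), glance(m2), glance(m3), glance(m_extreme))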
# A tibble: 4 x 12
r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 0.933 0.930 4.34 401. 1.57e-18 1 -88.5 183. 187.
2 0.958 0.955 3.50 317. 5.93e-20 2 -81.3 171. 176.
3 0.960 0.955 3.48 214. 6.39e-19 3 -80.5 171. 178.
4 0.981 0.960 3.28 46.3 2.32e- 9 16 -68.5 173. 199.
# ... with 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>
In this case, BIC favors m2, AIC cannot decide between m2 and m3, and the adjusted-\(R^2\) (wrongly) favors m_extreme.
The adjusted-\(R^2\), AIC and BIC all try to balance goodness of fit and parsimony.
They achieve this balance by favoring models with small SSR while penalizing models with larger \(p\).
…But the adjusted-\(R^2\) favored m_extreme, although that model was clearly overfitting the data.
Instead of using these criteria, we could look for the model with the best predictive performance.
That is, the model whose predictions for new observations are closest to the true values.
We will learn two approaches to accomplish this: the holdout method and cross-validation.
The holdout method is a simple way to evaluate the predictive performance of a model:
1. Randomly split the sample into a training set and a test set.
2. Fit the model to the training set.
3. Evaluate the prediction accuracy of the model on the test set.
Note that the test set consists of new observations for the model.
\(\Rightarrow\) Select the model with the best prediction accuracy in step 3.
[Figure: illustration of the holdout method. Source: IMS]
The following R function splits a sample into a training set and a test set:
construct_training_test <- function(sample, prop_training = 2/3){
  n          <- nrow(sample)
  n_training <- round(n * prop_training)          # size of the training set
  sample_random   <- slice_sample(sample, n = n)  # shuffle the rows
  sample_training <- slice(sample_random, 1 : n_training)     # first rows -> training set
  sample_test     <- slice(sample_random, -(1 : n_training))  # remaining rows -> test set
  return(list(
    training = sample_training, test = sample_test
  ))
}
Girth Height Volume
1 8.931477 70 10.3
2 8.436883 65 10.3
3 9.464900 63 10.2
4 11.136215 72 16.4
5 10.907321 81 18.8
6 10.030025 83 19.7
7 10.535716 66 15.6
8 10.852640 75 18.2
9 11.097116 80 22.6
10 12.402327 75 19.9
11 11.681797 79 24.2
12 11.000495 76 21.0
13 10.826171 76 21.4
14 11.555269 69 21.3
15 11.850392 75 19.1
16 12.694245 74 22.2
17 13.026112 85 33.8
18 12.854039 86 27.4
19 13.917842 71 25.7
20 13.181231 64 24.9
21 13.887866 78 34.5
22 14.388698 80 31.7
23 14.566668 74 36.3
24 16.402095 72 38.3
25 16.271447 77 42.6
26 17.551804 81 55.4
27 18.042885 82 55.7
28 17.554523 80 58.3
29 17.357700 80 51.5
30 18.023363 80 51.0
31 20.482147 87 77.0
set.seed(0) # set the seed of the random number generator to 0
training_test_sets <- construct_training_test(d_tree)
training_set <- training_test_sets[["training"]]
test_set     <- training_test_sets[["test"]] # also extract the test set (used below)
training_set
Girth Height Volume
1 11.555269 69 21.3
2 16.271447 77 42.6
3 11.136215 72 16.4
4 10.535716 66 15.6
5 8.931477 70 10.3
6 8.436883 65 10.3
7 14.566668 74 36.3
8 11.681797 79 24.2
9 20.482147 87 77.0
10 12.854039 86 27.4
11 13.917842 71 25.7
12 18.042885 82 55.7
13 12.402327 75 19.9
14 18.023363 80 51.0
15 13.887866 78 34.5
16 17.554523 80 58.3
17 11.097116 80 22.6
18 10.907321 81 18.8
19 14.388698 80 31.7
20 11.850392 75 19.1
21 11.000495 76 21.0
We simply fit our regression model to the training set.
To evaluate the prediction accuracy, we start by computing the predictions for the observations in the test set.
A good model will make predictions that are close to the true values of the response variable.
A good measure of prediction accuracy is the sum of squared errors (SSE) over the \(m\) observations of the test set:
\[ SSE = (y_1 - \hat{y}_1)^2 + (y_2 - \hat{y}_2)^2 + \dots + (y_m - \hat{y}_m)^2 \]
Small SSE is better.
In practice, the (root) mean squared error ((R)MSE) is often used.
\[ MSE = \dfrac{SSE}{m} = \dfrac{(y_1 - \hat{y}_1)^2 + (y_2 - \hat{y}_2)^2 + \dots + (y_m - \hat{y}_m)^2}{m} \]
\[ RMSE = \sqrt{\dfrac{SSE}{m}} = \sqrt{\dfrac{(y_1 - \hat{y}_1)^2 + (y_2 - \hat{y}_2)^2 + \dots + (y_m - \hat{y}_m)^2}{m}} \] Again, small (R)MSE is better.
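The helper compute_RMSE used below is not defined on these slides; a minimal sketch, assuming it takes the test set first, the fitted model second, and that the response is Volume:
# Possible implementation of compute_RMSE (an assumption, not the slides' own code)
compute_RMSE <- function(test_set, model){
  y_hat <- predict(model, newdata = test_set) # predictions for the new observations
  sqrt(mean((test_set$Volume - y_hat)^2))     # root mean squared error
}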
Apply steps 2 and 3 to different models.
\(\Rightarrow\) Choose the model with the lowest (R)MSE!
This model has the best predictive accuracy.
# Step 1 - partition sample into a training set and a test set
# already done
# Step 2 - fit models to training set
m1 <- lm(Volume ~ poly(Girth, degree = 1 , raw = TRUE), data = training_set)
m2 <- lm(Volume ~ poly(Girth, degree = 2 , raw = TRUE), data = training_set)
m3 <- lm(Volume ~ poly(Girth, degree = 3 , raw = TRUE), data = training_set)
m48 <- lm(Volume ~ poly(Girth, degree = 48, raw = TRUE), data = training_set)
# Step 3 - evaluate models on test set
compute_RMSE(test_set, m1)
compute_RMSE(test_set, m2)
compute_RMSE(test_set, m3) # best
compute_RMSE(test_set, m48)
[1] 4.598416
[1] 4.07333
[1] 3.957379
[1] 29.97533
We select m3.
The previous analysis was conducted with set.seed(0).
Note that models m2 and m3 were pretty close.
Could we have obtained a different result with a different seed?
# Step 1
set.seed(1) # new seed
training_test_sets <- construct_training_test(d_tree)
training_set <- training_test_sets[["training"]]
test_set <- training_test_sets[["test"]]
# Step 2
m1 <- lm(Volume ~ poly(Girth, degree = 1 , raw = TRUE), data = training_set)
m2 <- lm(Volume ~ poly(Girth, degree = 2 , raw = TRUE), data = training_set)
m3 <- lm(Volume ~ poly(Girth, degree = 3 , raw = TRUE), data = training_set)
m48 <- lm(Volume ~ poly(Girth, degree = 48, raw = TRUE), data = training_set)
# Step 3
compute_RMSE(test_set, m1)
compute_RMSE(test_set, m2) # best
compute_RMSE(test_set, m3)
compute_RMSE(test_set, m48)
[1] 6.097541
[1] 4.248172
[1] 10.37698
[1] 103753769
With set.seed(1), the holdout method indicates that m2 is the best model.
m3, our previous best model, is now even worse than m1!
A drawback of the holdout method is that the choice of test set matters a lot.
Cross-validation (CV) is a natural generalization of the holdout method. Partition the sample into \(k\) folds of (almost) equal size, then repeat the following steps \(k\) times:
1. Let fold 1 be the test set and the remaining folds be the training set.
2. Fit the model to the training set (like the holdout method).
3. Evaluate the prediction accuracy of the model on the test set (like the holdout method).
Go back to step 1, letting the next fold be the test set.
\(\Rightarrow\) Select the model with the best overall prediction accuracy in step 3.
[Figure: illustration of \(k\)-fold cross-validation. Source: towardsdatascience]
set.seed(0)
# Setup
n <- nrow(d_tree)
n_folds <- 10
# Fold assignment
folds <- c(rep(1:n_folds, n %/% n_folds), seq_along(n %% n_folds))
# you do not need to understand the details of the previous line
# just note that it creates folds of (almost) the same size
folds
[1] 1 2 3 4 5 6 7 8 9 10 1 2 3 4 5 6 7 8 9 10 1 2 3 4 5
[26] 6 7 8 9 10 1
# Step 1
d_tree_fold <- d_tree %>%
  slice_sample(n = n) %>% # shuffle the rows
  mutate(fold = folds)
head(d_tree_fold)
Girth Height Volume fold
1 11.555269 69 21.3 1
2 16.271447 77 42.6 2
3 11.136215 72 16.4 3
4 10.535716 66 15.6 4
5 8.931477 70 10.3 5
6 8.436883 65 10.3 6
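The helper create_empty_RMSE used below is not defined on these slides; presumably it just creates an empty tibble with one column per model, to be filled in with add_row() inside the loop:
# Possible implementation of create_empty_RMSE (an assumption, based on the add_row() call below)
create_empty_RMSE <- function(){
  tibble(fold     = numeric(),
         rmse_m1  = numeric(),
         rmse_m2  = numeric(),
         rmse_m3  = numeric(),
         rmse_m48 = numeric())
}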
RMSE <- create_empty_RMSE()
for(i in 1 : n_folds){
  # Step 1
  training_set <- filter(d_tree_fold, fold != i)
  test_set     <- filter(d_tree_fold, fold == i)
  # Step 2
  m1  <- lm(Volume ~ poly(Girth, degree = 1 , raw = TRUE), data = training_set)
  m2  <- lm(Volume ~ poly(Girth, degree = 2 , raw = TRUE), data = training_set)
  m3  <- lm(Volume ~ poly(Girth, degree = 3 , raw = TRUE), data = training_set)
  m48 <- lm(Volume ~ poly(Girth, degree = 48, raw = TRUE), data = training_set)
  # Step 3
  rmse_m1  <- compute_RMSE(test_set, m1 )
  rmse_m2  <- compute_RMSE(test_set, m2 )
  rmse_m3  <- compute_RMSE(test_set, m3 )
  rmse_m48 <- compute_RMSE(test_set, m48)
  RMSE <- RMSE %>%
    add_row(fold = i, rmse_m1, rmse_m2, rmse_m3, rmse_m48)
}
Which model has the lowest RMSE across all folds?
# A tibble: 10 x 5
fold rmse_m1 rmse_m2 rmse_m3 rmse_m48
<dbl> <dbl> <dbl> <dbl> <dbl>
1 1 5.07 3.50 3.54 4.25
2 2 2.89 2.34 1.95 10.8
3 3 4.78 5.02 5.11 5.54
4 4 2.82 3.49 3.45 3.15
5 5 1.67 2.99 2.67 72.5
6 6 6.38 4.45 5.38 412.
7 7 3.52 2.89 2.83 4.73
8 8 1.37 2.01 1.77 3.90
9 9 7.49 3.14 3.92 2481852.
10 10 5.60 4.53 4.27 4.47
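The two summaries that follow (the average and the total RMSE of each model across the ten folds) can be obtained with dplyr, for example (a sketch; the slides' own commands are not shown):
summarize_all(RMSE, mean) # average RMSE per model across the folds
summarize_all(RMSE, sum)  # total RMSE per model across the folds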
# A tibble: 1 x 5
fold rmse_m1 rmse_m2 rmse_m3 rmse_m48
<dbl> <dbl> <dbl> <dbl> <dbl>
1 5.5 4.16 3.44 3.49 248237.
# A tibble: 1 x 5
fold rmse_m1 rmse_m2 rmse_m3 rmse_m48
<dbl> <dbl> <dbl> <dbl> <dbl>
1 55 41.6 34.4 34.9 2482374.
m2 has a slightly better overall RMSE than m3.
Cross-validation is a central technique in machine learning.
Still not totally clear how it works? Check out this podcast to hear more about this important topic.
These two other podcasts on multiple linear regression and \(R^2\) are also excellent.
Leave-one-out cross-validation (LOOCV): set \(k=n\), that is, use \(n\) folds, each of size \(1\).
Each test set will therefore consist of a single observation and the corresponding training set of the remaining \(n-1\) observations.
RMSE <- create_empty_RMSE()
# Step 0
n <- nrow(d_tree)
d_tree_loocv <- mutate(d_tree, folds = 1 : n) # every observation belongs to a different fold
for(i in 1:n){
  # Step 1
  training_set <- filter(d_tree_loocv, folds != i)
  test_set     <- filter(d_tree_loocv, folds == i)
  # Step 2
  m1  <- lm(Volume ~ poly(Girth, degree = 1 , raw = TRUE), data = training_set)
  m2  <- lm(Volume ~ poly(Girth, degree = 2 , raw = TRUE), data = training_set)
  m3  <- lm(Volume ~ poly(Girth, degree = 3 , raw = TRUE), data = training_set)
  m48 <- lm(Volume ~ poly(Girth, degree = 48, raw = TRUE), data = training_set)
  # Step 3
  rmse_m1  <- compute_RMSE(test_set, m1 )
  rmse_m2  <- compute_RMSE(test_set, m2 )
  rmse_m3  <- compute_RMSE(test_set, m3 )
  rmse_m48 <- compute_RMSE(test_set, m48)
  RMSE <- RMSE %>%
    add_row(fold = i, rmse_m1, rmse_m2, rmse_m3, rmse_m48)
}
summarize_all(RMSE, mean)
# A tibble: 1 x 5
fold rmse_m1 rmse_m2 rmse_m3 rmse_m48
<dbl> <dbl> <dbl> <dbl> <dbl>
1 16 3.64 3.11 3.12 142697.
Stepwise selection is not my favorite method, but it is widely used.
Two types of stepwise procedures: (i) forward selection and (ii) backward elimination.
Choose an overall criterion that balances GoF (small SSR) and parsimony (small \(p\))
\(\Rightarrow\) adjusted-\(R^2\), AIC or BIC
Start with the empty model \(Y \approx \beta_0\), i.e. the model with no predictor. This is our current model.
Fit all possible models with one additional predictor.
Compute the criterion, say the AIC, of each of these models.
Identify the model with the smallest AIC. This is our candidate model.
If the AIC of the candidate model is smaller (better) than the AIC of the current model, the candidate model becomes the current model. Go back to step 3.
If the AIC of the candidate model is larger than the AIC of the current model (no new model improves on the current one), the procedure stops, and we select the current model.
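Forward selection with the AIC can be carried out in R with the built-in step() function; a minimal sketch for the trees data (this example is not from the slides, and the set of candidate predictors is an assumption):
# Forward selection with AIC using step() (a sketch; the candidate predictors are an assumption)
m_empty <- lm(Volume ~ 1, data = d_tree)                            # start from the empty model
m_full  <- lm(Volume ~ Girth + I(Girth^2) + Height, data = d_tree)  # largest model considered
step(m_empty, scope = formula(m_full), direction = "forward")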
Similar to forward selection, except that we start with the full model (all candidate predictors) and remove one predictor at a time, eliminating the predictor whose removal most improves the criterion, until no removal improves it.
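With the same step() function, backward elimination simply starts from the full model (again just a sketch, reusing m_full from the forward-selection example above):
# Backward elimination with AIC (a sketch): start from the full model and drop predictors one at a time
step(m_full, direction = "backward")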
To obtain a parsimonious model, we discussed:
overfitting
model selection through
overall criteria (adjusted-\(R^2\), AIC, BIC)
predictive performance (the holdout method, cross-validation)
stepwise procedures (forward selection, backward elimination)
You are now fully equipped for the prediction project!
🍀 Good luck! 🍀