(Builds on: Models with multiple variables)
library(tidyverse)
library(modelr)
A crucial part of the modeling process is determining the quality of your model. In this reading, we’ll explore how to compare different models to each other and pick the one that best approximates your data.
In order to illustrate the process of model evaluation, we’re going to use simulated data. To simulate data, you pick a function and use that function to generate a series of data points. We can then compare any models we make to the true function used to generate the data.
Note that, in practice, you almost never know the true functional representation underlying your data. If you did, you probably wouldn’t need a model in the first place.
For our simulated data, let’s assume there’s some phenomenon that has a functional representation of the form
y = f(x)
In other words, the phenomenon is purely a function of the variable x, and so, if you know x, you can figure out the behavior of the phenomenon. Here’s the function f() that we’ll use.
# True function
f <- function(x) x + 50 * sin((pi / 50) * x)
f() is the true function, but let’s also assume that any measurements of y will have errors. To simulate errors, we need a function that adds random noise to f(). We’ll call this function g().
# Function with measurement error
g <- function(x) f(x) + rnorm(n = length(x), mean = 0, sd = 20)
We can use g() to generate a random sample of data.
# Function that generates a random sample of data points, using g()
sim_data <- function(from, to, by) {
tibble(
x = seq(from = from, to = to, by = by),
y = g(x)
)
}
# Generate random sample of points to approximate
set.seed(439)
data_1 <- sim_data(0, 100, 0.5)
Here’s a plot of both the simulated data (data_1) and the true function. Our job is now to use the simulated data to create a function that closely approximates the true function.
tibble(x = seq(0, 100, 0.5), y = f(x)) %>%
ggplot(aes(x, y)) +
geom_line() +
geom_point(data = data_1) +
labs(title = "True function and simulated data")
loess() model
Given the shape of our data, the stats::loess() function is a good place to start. loess() works by creating a local model at each data point. Each of these local models only uses data within a given distance from that point. The span parameter controls how many points are included in each of these local models. The smaller the span, the fewer points are included in each local model.
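For a rough sense of scale, the sketch below approximates the number of points in each local fit as span * n when span ≤ 1. This is an approximation of how loess() chooses its neighborhoods, not its exact algorithm.
# Sketch: approximate number of points used in each local fit (assumes
# roughly span * n nearest neighbors when span <= 1)
tibble(span = c(1, 0.75, 0.1, 0.01)) %>%
  mutate(points_per_fit = ceiling(span * nrow(data_1)))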
To better understand the effect of the span parameter, let’s plot loess() models with different values of span.
The function plot_loess() fits a model using loess(), then plots the model predictions and the underlying data.
# Plot loess model for given span
plot_loess <- function(span) {
data_1 %>%
mutate(f = f(x)) %>%
add_predictions(loess(y ~ x, data = data_1, span = span)) %>%
ggplot(aes(x)) +
geom_line(aes(y = f)) +
geom_point(aes(y = y)) +
geom_line(aes(y = pred), color = "#3366FF", size = 1) +
labs(
title = str_glue("loess, span = {span}"),
y = "y"
)
}
Now, we can use purrr to apply plot_loess() to a vector of different span values.
c(10, 1, 0.75, 0.1, 0.01) %>%
map(plot_loess) %>%
walk(print)
The span parameter clearly makes a big difference in the model:

- span = 10 smooths too much and washes out the variability in the model.
- span = 1 is much better, but still has a bit too much smoothing.
- span = 0.75 is the default and does a reasonably good job.
- span = 0.1 has too little smoothing. The added complexity has started tracking the random variation in our data.
- span = 0.01 went way too far and started to fit individual random points.

It’s clear that the model with span = 0.75 comes closest to following the form of the true model. The models with span equal to 0.1 and 0.01 have overfit the data, and so won’t generalize well to new data.
Let’s return to the question of how to measure the quality of a model.
Above, we could easily visualize our different models because they each had only one predictor variable, x. It will be much harder to visualize models with many predictor variables, and so we need a more general way of determining model quality and comparing models to each other.
One way to measure model quality involves comparing the model’s predicted values to the actual values in the data. Ideally, we want to know the difference between predicted and actual values on new data. This difference is called the test error.
Unfortunately, you usually won’t have new data. All you’ll have is the data used to create the model. The difference between the predicted and actual values on the data used to train the model is called the training error.
Since the expected value for the error term of g() is 0 for each x, the best possible model for new data would be f() itself. The error on f() itself is called the Bayes error.
Usually, you won’t know the true function f(), and you won’t have access to new test data. In our case, we simulated data, so we know f() and can generate new data. The following code creates 50 different samples, each the same size. We’ll use these 50 samples to get an accurate estimate of our test error.
set.seed(886)
data_50 <-
1:50 %>%
map_dfr(~ sim_data(0, 100, 0.5))
Now, we need a way to measure the difference between a vector of predictions and a vector of actual values. Two common metrics are:

- Root-mean-square error (RMSE): sqrt(mean((y - pred)^2, na.rm = TRUE))
- Mean absolute error (MAE): mean(abs(y - pred), na.rm = TRUE)

Both measures are supported by modelr, as rmse() and mae(). We’ll use RMSE below.
Now, we’ll train a series of models on our original dataset, data_1, using a range of span values. Then, for each model, we’ll calculate the training error on data_1 and the test error on the new data in data_50.
First, we need to create a vector of different span values.
spans <- 2^seq(1, -2, -0.2)
Next, we can use purrr to iterate over the span values, fit a model for each one, and calculate the RMSE.
errors <-
tibble(span = spans) %>%
mutate(
train_error =
map_dbl(span, ~ rmse(loess(y ~ x, data = data_1, span = .), data_1)),
test_error =
map_dbl(span, ~ rmse(loess(y ~ x, data = data_1, span = .), data_50))
)
The result is a tibble with a training error and a test error for each span.
errors
## # A tibble: 16 x 3
## span train_error test_error
## <dbl> <dbl> <dbl>
## 1 2 27.7 27.4
## 2 1.74 26.9 26.7
## 3 1.52 25.9 25.7
## 4 1.32 24.6 24.6
## 5 1.15 22.8 23.0
## 6 1 21.3 21.8
## 7 0.871 19.3 20.4
## 8 0.758 18.7 20.2
## 9 0.660 18.5 20.2
## 10 0.574 18.4 20.3
## 11 0.5 18.4 20.3
## 12 0.435 18.3 20.4
## 13 0.379 18.1 20.4
## 14 0.330 17.8 20.4
## 15 0.287 17.4 20.6
## 16 0.25 17.2 20.8
It would be useful to compare these errors to the Bayes error, the error on the true function, which you can think of as the “best-case-scenario” error. The RMSE Bayes error is 20, the standard deviation of the random errors added by g(). We can check this with an approximation using data_50.
bayes_error <- 20
data_50 %>%
summarize(bayes_error = sqrt(mean((y - f(x))^2, na.rm = TRUE))) %>%
pull(bayes_error)
## [1] 19.9741
Now, let’s plot the training, test, and Bayes errors by span.
bayes <- tibble(
span = range(errors$span),
type = "bayes_error",
error = bayes_error
)
errors %>%
gather(key = type, value = error, -span) %>%
ggplot(aes(1 / span, error, color = type)) +
geom_line(data = bayes) +
geom_line() +
geom_point() +
labs(
title = "Errors as a function of increasing model complexity",
color = NULL
)
Mapping 1 / span to the x-axis makes the plot easier to interpret because smaller values of span create more complex models. The leftmost points indicate the simplest models, and the rightmost indicate the most complex.
The lowest test error is around 20.18 and came from the model with span ≈ 0.758. Note that the default value of span in loess() is 0.75, so the default here would have done pretty well.
You can see that as the model increases in complexity, the training error (blue line) continually declines, but the test error (green line) actually starts to increase. The test error increases for values of span that make the model so complex that it overfits the training data.
We used data_50, our new data, to calculate the actual test error. As we’ve said, though, you usually won’t have access to new data and so won’t be able to calculate the test error. You’ll need a way to estimate the test error using the training data. As our plot indicates, however, the training error is a poor estimate of the actual test error. In the next section, we’ll discuss a better way, called cross-validation, to estimate the test error from the training data.
There are two key ideas behind cross-validation. First, in the absence of new data, we can hold out a random portion of our original data (the test set), train the model on the rest of the data (the training set), and then test the model on the test set. Second, we can generate multiple train-test pairs to create a more robust estimate of the true test error.
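Before turning to modelr’s helpers, here’s a minimal hand-rolled sketch of the first idea (the seed and the name test_idx are just illustrative): hold out roughly 20% of data_1 as a test set, fit on the remaining rows, and compute the test-set RMSE by hand.
# Sketch: a single hand-rolled train-test split
set.seed(501)
test_idx <- sample(nrow(data_1), size = round(0.2 * nrow(data_1)))

data_1[test_idx, ] %>%
  add_predictions(loess(y ~ x, data = data_1[-test_idx, ], span = 0.75)) %>%
  summarize(test_rmse = sqrt(mean((y - pred)^2, na.rm = TRUE)))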
modelr provides two functions for generating these train-test pairs: crossv_mc() and crossv_kfold(). We’ll introduce these functions in the next section.
crossv_mc() and crossv_kfold()
crossv_mc() and crossv_kfold() both take your original data as input and produce a tibble that looks like this:
data_1 %>%
crossv_kfold()
## # A tibble: 5 x 3
## train test .id
## <list> <list> <chr>
## 1 <S3: resample> <S3: resample> 1
## 2 <S3: resample> <S3: resample> 2
## 3 <S3: resample> <S3: resample> 3
## 4 <S3: resample> <S3: resample> 4
## 5 <S3: resample> <S3: resample> 5
Each row represents one train-test pair. Eventually, we’re going to build a model on each train portion, and then test that model on each test portion.
crossv_mc() and crossv_kfold() differ in how they generate the train-test pairs.
crossv_mc() takes two parameters: n and test. n specifies how many train-test pairs to generate, and test specifies the portion of the data to hold out each time. Importantly, crossv_mc() pulls randomly from the entire dataset to create each train-test split. This means that the test sets are not mutually exclusive: a single point can be present in multiple test sets.
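As a quick sketch of this overlap, you can compare the test indices of two train-test pairs generated by crossv_mc(); their intersection will typically be nonempty. (cv_mc is just an illustrative name; each test element is a resample object whose idx field holds the row indices, as described below.)
# Sketch: test sets from crossv_mc() can share points
cv_mc <- crossv_mc(data_1, n = 2, test = 0.2)
length(intersect(cv_mc$test[[1]]$idx, cv_mc$test[[2]]$idx))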
crossv_kfold() takes one parameter, k, and splits the data into k exclusive partitions, or folds. It then uses each of these k folds as a test set, with the remaining k - 1 folds as the training set. The main difference between crossv_mc() and crossv_kfold() is that the crossv_kfold() test sets are mutually exclusive.
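As a small sketch, you can check that crossv_kfold() with k = 5 splits the 201 rows of data_1 into five non-overlapping test folds of roughly 201 / 5 ≈ 40 rows each (again peeking at the idx field of each resample object).
# Sketch: number of rows in each test fold when k = 5
crossv_kfold(data_1, k = 5)$test %>%
  map_int(~ length(.$idx))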
Now, let’s take a closer look at how these functions work. The following code uses crossv_kfold() to create 10 train-test pairs.
set.seed(122)
df <-
data_1 %>%
crossv_kfold(k = 10)
df
## # A tibble: 10 x 3
## train test .id
## <list> <list> <chr>
## 1 <S3: resample> <S3: resample> 01
## 2 <S3: resample> <S3: resample> 02
## 3 <S3: resample> <S3: resample> 03
## 4 <S3: resample> <S3: resample> 04
## 5 <S3: resample> <S3: resample> 05
## 6 <S3: resample> <S3: resample> 06
## 7 <S3: resample> <S3: resample> 07
## 8 <S3: resample> <S3: resample> 08
## 9 <S3: resample> <S3: resample> 09
## 10 <S3: resample> <S3: resample> 10
The result is a tibble with two list-columns: train and test. Each element of train and test is a resample object, which you might not have seen before. Let’s look at the first row, the first train-test pair.
train_01 <- df$train[[1]]
train_01
## <resample [180 x 2]> 1, 3, 4, 5, 6, 7, 9, 10, 11, 12, ...
test_01 <- df$test[[1]]
test_01
## <resample [21 x 2]> 2, 8, 21, 43, 53, 63, 78, 81, 85, 101, ...
Each resample object is actually a two-element list.
glimpse(train_01)
## List of 2
## $ data:Classes 'tbl_df', 'tbl' and 'data.frame': 201 obs. of 2 variables:
## ..$ x: num [1:201] 0 0.5 1 1.5 2 2.5 3 3.5 4 4.5 ...
## ..$ y: num [1:201] 22.95 36.64 -2.32 21.26 18.88 ...
## $ idx : int [1:180] 1 3 4 5 6 7 9 10 11 12 ...
## - attr(*, "class")= chr "resample"
The first element, data, is the original dataset. The second element, idx, is a set of indices that indicate the subset of the original data to use.
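As a sketch, the rows a resample object refers to are just the stored data indexed by idx; as_tibble(), used below, does roughly the same thing.
# Sketch: the first few training rows for the first train-test pair
train_01$data[train_01$idx, ] %>%
  slice(1:3)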
We can use a set operation to verify that the train and test sets are mutually exclusive. In other words, there is no point in a given train set that is also in its corresponding test set, and vice versa.
is_empty(intersect(train_01$idx, test_01$idx))
## [1] TRUE
The union of each train and test set is just the complete original dataset. We can verify this by checking that the set of all train and test indices contains every possible index in the original dataset.
setequal(union(train_01$idx, test_01$idx), 1:201)
## [1] TRUE
Resample objects are not tibbles. Some model functions, like lm(), can take resample objects as input, but loess() cannot. In order to use loess(), we’ll need to turn the resample objects into tibbles.
as_tibble(test_01)
## # A tibble: 21 x 2
## x y
## <dbl> <dbl>
## 1 0.5 36.6
## 2 3.5 18.5
## 3 10 26.8
## 4 21 79.3
## 5 26 68.5
## 6 31 102.
## 7 38.5 76.4
## 8 40 73.6
## 9 42 37.1
## 10 50 40.5
## # … with 11 more rows
Resample objects waste a lot of space, since each one contains the full dataset. A new package, rsample, is being developed to support tidy modeling in a more space-efficient way.
crossv_kfold() and crossv_mc() just generate train-test pairs. In order to carry out cross-validation, you still need to fit your model on each train set and then evaluate it on each corresponding test set. In this section, we’ll walk you through the process.
1 Create a function to calculate your test error
First, you’ll need a function that calculates the test error using some metric. We’ll use RMSE. The following function returns the RMSE for a given span, train set, and test set.
rmse_error <- function(span, train, test) {
rmse(loess(y ~ x, data = as_tibble(train), span = span), as_tibble(test))
}
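For example, here’s the test RMSE for the first k-fold train-test pair created above (a sketch; output not shown). Note that if a test fold happens to contain an x value outside its training fold’s range, loess() can’t extrapolate and the result will be NA, which is one reason the averaging step below uses na.rm = TRUE.
# Sketch: test error for the first train-test pair, with span = 0.75
rmse_error(span = 0.75, train = df$train[[1]], test = df$test[[1]])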
2 Create a function to calculate CV error
Next, you need a function that iterates over all your train-test pairs, calculates the test error for each pair, and then calculates the mean of the errors and the standard error of the mean.
span_error <- function(span, data_cv) {
  # Test error for each train-test pair, for the given span
  errors <-
    data_cv %>%
    select(-.id) %>%
    add_column(span = span) %>%
    pmap_dbl(rmse_error)

  # Summarize with the mean error and the standard error of the mean
  tibble(
    span = span,
    error_mean = mean(errors, na.rm = TRUE),
    error_se = sd(errors, na.rm = TRUE) / sqrt(length(errors))
  )
}
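And here’s a sketch of the CV error for a single span, using the ten k-fold pairs in df from above (output not shown).
# Sketch: CV error for span = 0.75 across the k-fold train-test pairs
span_error(span = 0.75, data_cv = df)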
3 Create your train-test pairs
Use either crossv_mc() or crossv_kfold() to generate your train-test pairs. We’ll use crossv_mc() to generate 10 train-test pairs, with test sets consisting of approximately 20% of the data.
set.seed(430)
data_mc <- crossv_mc(data_1, n = 10, test = 0.2)
4 Calculate the CV error for different values of your tuning parameter
We want to compare different values of span, so we also need to iterate over a vector of different span values. We’ll calculate the CV error for all spans using these train-test pairs.
errors_mc <-
spans %>%
map_dfr(span_error, data_cv = data_mc)
errors_mc %>%
knitr::kable()
| span | error_mean | error_se |
|---|---|---|
| 2.0000000 | 28.77162 | 0.4598847 |
| 1.7411011 | 27.90898 | 0.4590082 |
| 1.5157166 | 26.80809 | 0.4633976 |
| 1.3195079 | 25.42115 | 0.4763469 |
| 1.1486984 | 23.47827 | 0.5162414 |
| 1.0000000 | 21.89512 | 0.5574787 |
| 0.8705506 | 19.70633 | 0.6247474 |
| 0.7578583 | 19.00509 | 0.6083815 |
| 0.6597540 | 18.79978 | 0.5899732 |
| 0.5743492 | 18.80655 | 0.5736369 |
| 0.5000000 | 18.82917 | 0.5663217 |
| 0.4352753 | 18.87305 | 0.5575942 |
| 0.3789291 | 18.85135 | 0.5313073 |
| 0.3298770 | 18.78642 | 0.4918808 |
| 0.2871746 | 18.72327 | 0.5003602 |
| 0.2500000 | 18.70617 | 0.5152749 |
5 Choose the best model
Now, we need to compare the different models and choose the best value of span.
Because we simulated our data, we can compare the CV estimates of the test error with the actual test error.
errors_mc %>%
left_join(errors, by = "span") %>%
ggplot(aes(1 / span)) +
geom_line(aes(y = test_error, color = "test_error")) +
geom_point(aes(y = test_error, color = "test_error")) +
geom_line(aes(y = error_mean, color = "mc_error")) +
geom_pointrange(
aes(
y = error_mean,
ymin = error_mean - error_se,
ymax = error_mean + error_se,
color = "mc_error"
)
) +
labs(
title = "Cross-validation and test errors",
y = "error",
color = NULL
)
From this plot, we can see that the CV error estimates from crossv_mc() underestimate the true test error in the range where 1/span > 1. The line ranges reflect one standard error on either side of the mean error.
Usually, you won’t have access to the true test error, so we need a way to choose the best value of span using just the CV errors. Here’s the rule of thumb, often called the one-standard-error rule, for choosing a tuning parameter when you only know the CV errors:

- Find the model with the smallest mean CV error. Here, that’s span = 0.25 (on the plot, 1/span = 4), with a mean CV error of approximately 18.7.
- Choose the simplest model whose mean CV error is within one standard error of that minimum. The standard error at span = 0.25 is approximately 0.5, so we are seeking the simplest model with a mean CV error less than 19.2. This occurs for span ≈ 0.758 (on the plot, 1/span ≈ 1.32).

We saw above that if we knew the test error, we would choose span ≈ 0.758 as the optimal parameter. The one-standard-error rule of thumb would lead us to choose the same optimal parameter without having to know the test error. The test error for this span is very close to the Bayes error of 20.
errors %>%
top_n(-1, test_error) %>%
select(span, test_error) %>%
knitr::kable()
| span | test_error |
|---|---|
| 0.7578583 | 20.1787 |
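If you’d rather apply the one-standard-error rule programmatically than read it off the plot, here’s a minimal sketch using errors_mc; it follows the rule described above, and the names best and cutoff are just illustrative.
# Sketch: simplest (largest) span whose mean CV error is within one standard
# error of the smallest mean CV error
best <- errors_mc %>% top_n(-1, error_mean)
cutoff <- best$error_mean + best$error_se

errors_mc %>%
  filter(error_mean <= cutoff) %>%
  filter(span == max(span))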