Module 5: Explanatory and Predictive Modeling

PSYC 3032 M

Udi Alter

Things Udi Should Remember

Before we start

  • Congrats on finishing the 1st assignment? How was it?
  • Thanks for filling out the survey
  • Groups, babies?
  • Zoom/lecture recording
  • Last class topic

About Module 5

Goals for Today:

  • To explain or to predict? That is the question
  • Introduction to explanatory modeling
  • Causal inference
  • DAGitty DAG DAG
  • A bit on mediation
  • Predictive modeling
  • A bit on cross-validation


I spent most of my time during reading week:


A) Reading/school stuff, duh?!

B) Sleeping

C) Working (at a job)

D) Hanging out with family/friends

E) Binge-watching shows/movies



What’s your spirit animal? (and you can’t say platypus)




Explanatory vs. Predictive Modeling in Psychology

Explanatory vs. Predictive Modeling

  • The primary aim of scientific psychology is to gain a deeper understanding of human behavior

  • Traditionally, this involved both explaining behavior by identifying underlying causes and predicting behavior by forecasting future actions. In practice, however, these objectives are often treated as interchangeable when, in fact, they are not

  • The assumption is that models that best explain behavior should also best predict future behavior. Unfortunately, although explanation and prediction may be philosophically compatible, there are good reasons to think that they are often in statistical and pragmatic tension with one another

  • Statistically speaking, the model that most closely approximates the data-generating process will NOT necessarily be the most successful at predicting real-world outcomes

  • Similarly, models that do the best job at predicting outcomes of interest will NOT necessarily identify the underlying causes

  • As a result, researchers must make a conscious choice: to explain or to predict?




Explanatory Modeling in Psychology

Explanatory Modeling and Causal Inference

“In the explanatory approach to science, the ultimate goal is to develop a mechanistic model of the data-generating process that gives rise to the observed data.”
Yarkoni & Westfall, 2017

“In explanatory modeling, the focus is on minimizing bias to obtain the most accurate representation of the underlying theory”
Shmueli, 2010

  • Most of the research in psychology is explanatory; it’s meant to represent a theoretical account for the variation of the outcome variable

  • Typically this account (either explicitly or implicitly) represents the actual causal mechanisms, or a subset of potential causal mechanisms, that produce differences in the outcome

Explanatory Modeling and Causal Inference

  • In both the research design and analysis stages, researchers include the variables they believe explain or cause the variation of the effect or outcome of interest

  • Often, researchers will also include covariates (e.g., age, gender, income) deemed potentially important (either by prior research or their own theoretical interests)

  • Researchers include covariates in their models to reduce the risk of drawing invalid or misleading conclusions about the importance of an explanatory variable, ensuring that any observed effect is not simply due to its association with other related variables

  • But, what drives researchers to select these covariates that are supposed to prevent such spurious effects?

  • Researchers are advised to let theory guide decisions about the inclusion of important covariates. But such advice is usually vague and difficult to adhere to.

  • In practice, researchers haphazardly select covariates using their common sense and make such decisions at their own subjective discretion

  • Instead, we should use causal inference principles to make more sensible decisions about which covariates to account for in our explanatory models

Causal Inference

Causal Inference

Causal inference concerns what would happen to an outcome (Y) as a result of a treatment, intervention, or exposure (T), given pretreatment information (X) (Gelman et al., 2021)

  • To define causation formally, we start by recognizing that treatment effects vary from one unit of observation (e.g., one person) to another, implying that there is a separate causal effect \(\tau_i\) for each case, i, in a sample or population

  • Using the counterfactual framework (i.e., Rubin’s causal model; Rubin, 1974), assume that the treatment (T) indicator \(t_i=1\) when case i receives the treatment and \(t_i=0\) when case i does not receive the treatment (i.e., control condition)

  • Then, \(y^1_i\) is the outcome value that would occur if case i received the treatment, and \(y^0_i\) is the outcome value that would occur if case i DID NOT receive the treatment; these values are referred to as potential outcomes

  • Thus, the individual causal effect for case i is: \(\tau_i = y^1_i - y^0_i\)

Causal Inference

  • The “fundamental problem of causal inference” is that, for each person i, we can only observe either \(y^1_i\) or \(y^0_i\), but not both!

  • Consider these hypothetical data:

\[ \scriptsize \begin{array}{c c c c c} \hline & & {\textbf{Potential outcomes}} \\ \hline \text{Case } i & t_i & y_i^0, \text{ if } t_i = 0 & y_i^1, \text{ if } t_i = 1 & y_i \\ & \text{(treatment indicator)} & & & \text{(observed outcome)} \\ \hline 1 & 0 & 40 & ? & 40 \\ 2 & 0 & 40 & ? & 40 \\ 3 & 0 & 50 & ? & 50 \\ 4 & 1 & ? & 55 & 55 \\ 5 & 1 & ? & 55 & 55 \\ 6 & 1 & ? & 60 & 60 \\ \hline \end{array} \]

Causal Inference

  • The sample average treatment effect (ATE) is simply the mean of the individual treatment effects

\[\bar{\tau} = \frac{1}{N} \sum_{i=1}^{N} \tau_i = \frac{1}{N} \sum_{i=1}^{N} (y_i^1 - y_i^0) \\ = \frac{1}{N} \sum_{i=1}^{N} y_i^1 - \frac{1}{N} \sum_{i=1}^{N} y_i^0\]

  • Still, we cannot calculate the ATE directly because we observe only either \(y^1_i\) or \(y^0_i\); half of the data are missing!

  • But, if we assume that the potential outcomes that are observed are randomly sampled from the population, then their means, \(\bar{Y}_{t=1}\) and \(\bar{Y}_{t=0}\), will give us unbiased estimates of \(\frac{1}{N} \sum_{i=1}^{N} y_i^1\) and \(\frac{1}{N} \sum_{i=1}^{N} y_i^0\), respectively

Causal Inference

  • Under this strong assumption, we can estimate the ATE by subtracting the mean of the observed scores in the control condition from the mean of the observed scores in the treatment condition:

\[\hat{ATE}=\bar{Y}_{t=1}-\bar{Y}_{t=0}\]
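
  • To make this concrete, here is the estimate computed from the hypothetical table shown earlier:

\[\hat{ATE}=\bar{Y}_{t=1}-\bar{Y}_{t=0} = \frac{55+55+60}{3} - \frac{40+40+50}{3} \approx 56.67 - 43.33 \approx 13.33\]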

  • The above estimator of the ATE is unbiased with respect to the population ATE only when cases are randomly assigned to the levels of the treatment indicator T. Otherwise, the mean difference could be determined by some pretreatment characteristic other than the treatment itself (e.g., a covariate of sorts)

  • Yet, random assignment only guarantees that the treated and control groups are balanced on all relevant pretreatment characteristics on average across an infinite set of hypothetical samples from the same population (i.e., they are balanced in expectation)

Causal Inference

  • With any single sample, however, the treated and control groups may still differ with respect to pretreatment characteristics, especially if the sample is small, leading to sample mean differences that are potentially widely discrepant from the population ATE

  • In other words, random assignment may not effectively cancel out pretreatment or confounding differences between the treated and control groups in a single sample

  • Thus, it’s advisable to account for pretreatment group differences even with random assignment (Gelman et al., 2021)

  • That is, to obtain an unbiased causal effect, we have to carefully consider how other variables, namely confounders, influence the association between our explanatory variable(s) and the outcome, and potentially adjust for these in our GLM

DAG, DAGITTY, DAG DAG

  • A set of hypothesized causal relations is often represented using a directed acyclic graph (DAG)

  • In the simple case of a randomized treatment variable, T, having a causal influence on an outcome, Y, the DAG looks like this:

  • In a DAG, variables are known as nodes, whereas an arrow connecting two variables is known as an edge

Confining Confounders

  • DAGs are called acyclic to indicate that causes do not eventually have effects on themselves

  • In formal causal modeling, a confounder has a causal effect on both the focal explanatory variable and the outcome variable (remember this?)

  • In this DAG, we see a fork; age is “pointing” at (causing) both exercise and cholesterol

DAG Building Blocks

  • Every DAG, no matter how big and complicated, is built out of these four relations

Backdoor Paths

  • Using DAG heuristics, to obtain an unbiased estimate of the effect, it is necessary to adjust for a confounder Age (either statistically or with experimental design) because age opens a backdoor path between Exercise and Cholesterol:

\[Exercise \leftarrow Age \rightarrow Cholesterol\]

  • A backdoor path is an indirect path between an explanatory variable (T/Exercise) and an outcome variable (Y/Cholesterol) that does not go through the intended causal direction (i.e., it doesn’t follow the arrows forward); instead, it “enters” the explanatory variable through the backdoor—meaning it connects T and Y through a common cause (i.e., confounder; X/Age)
  • Adjusting for Age (either with research design or statistically) closes the backdoor path
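
  • As a quick preview of the dagitty package (introduced properly in a few slides), here is a sketch of how we could encode this fork and ask for the open paths and the adjustment set; the variable names are just those from the example above:

library(dagitty)
# Fork: Age causes both Exercise and Cholesterol; Exercise also affects Cholesterol
cholDAG <- dagitty("dag{
Age -> Exercise
Age -> Cholesterol
Exercise -> Cholesterol
}")
paths(cholDAG, from = "Exercise", to = "Cholesterol")                     # lists the direct path and the backdoor path
adjustmentSets(cholDAG, exposure = "Exercise", outcome = "Cholesterol")   # should return { Age }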

Shutting the Backdoor

  • Had we randomly assigned individuals to different levels of Exercise—i.e., manipulated Exercise, which means that Exercise would be \(\perp\) (orthogonal) to Age—then we would sever the causal effect Age has on Exercise and close the backdoor path

Shutting the Backdoor

  • “In general, then, causal effects can be estimated using regression if the model includes all confounding covariates (predictors that can affect treatment assignment or the outcome) and if the model is correct” (Gelman et al., 2021, p. 385)

  • But, actually, it’s only necessary to include in the model the confounders that block backdoor paths; if two confounders are causally associated with each other, we may only need to control one of them to obtain an unbiased causal effect (Morgan & Winship, 2015)

  • That is, as long as we close all backdoor paths, we can infer causality! Yes, we can say things like exercise lowers cholesterol. Don’t you feel powerful!?

  • Because DAGs provide an overview of the causal relationships among a set of theoretically relevant variables, they allow identification of a set of variables to adjust for in the analysis to remove confounding

STOP 🛑 Example Time

Causal Inference: Example

  • Let’s consider an actual dataset: a researcher is interested in estimating the causal effect of a movie’s Tomatometer score on its profit

  • The researcher collects data for N = 609 movies released between 2008 and 2012, including their net profit (i.e., measured as the difference between their gross income and budget in millions of US dollars)

  • They also record the movies’ numeric Tomatometer score, representing an aggregate of critics’ reviews of the film

  • Simply put, the researcher wants to devise a model for profit to explain how (and why) movies vary on this important outcome

year title gross budget leng score profit
2012 Django Unchained 162.80543 100 165 88 62.805434
2012 Parental Guidance 77.26493 25 105 18 52.264926
2012 The Impossible 19.01988 40 114 81 -20.980118
2012 Jack Reacher 80.07074 60 130 61 20.070736
2012 This is 40 67.54451 35 134 51 32.544505
2012 The Guilt Trip 37.13421 40 95 38 -2.865785

Causal Structure

  • There are a few variables and confounders we should consider: Let’s start with budget
    • Larger budgets allow for longer movies AND affect reviews
    • Longer movies also tend to have an effect on reviews (sometimes positive, other times negative)
    • Better reviews bring greater profit (what we’re after)
  • How do these variables come together in a DAG? Confirm there are no “cycles” AND identify all backdoor paths connecting review score and profit

Identify Backdoor Paths

  • There are two backdoor paths connecting review score and profit marked in red and orange (which confounders should we adjust for?)
  • Above, there are two backdoor paths: (1) \(Review \leftarrow Length \rightarrow Profit\) and (2) \(Review \leftarrow Budget \rightarrow Length \rightarrow Profit\)

Identify Backdoor Paths

  • We need to block ALL backdoor paths to get an unbiased estimate of the causal effect of review score on profit
  • As it happens, both backdoor paths go through Length, so if we control Length, we can close both! (Nice!)

No DAGitty, No Doubt

  • We can use the dagitty package to confirm everything!

dagitty Operations

  • First, let’s confirm this DAG is actually a proper DAG, i.e., acyclic
library(dagitty)
library(igraph)
library(dplyr)   # needed for the %>% pipe and select() below
# Define the DAG
movieDAG <- dagitty("dag{
Budget -> Length
Budget -> Review_Score
Length -> Profit
Length -> Review_Score
Review_Score -> Profit
}")
# Extract edges
DAG_dat <- as.data.frame(dagitty::edges(movieDAG), stringsAsFactors = FALSE)
colnames(DAG_dat) <- c("from", "to")  # Rename columns
DAG_dat <- DAG_dat %>% select(from, to) # include only from and to columns
# Convert to igraph object
igobj <- graph_from_data_frame(DAG_dat, directed = TRUE)
# Check for cycles using igraph's `is_dag()`
is_dag(igobj)  # TRUE if acyclic, FALSE if cycles exist
[1] TRUE

dagitty Operations

  • Now, let’s see what we need to control for:
adjustmentSets(movieDAG, exposure = "Review_Score", outcome = "Profit")
{ Length }
  • where exposure refers to the focal explanatory variable of interest

  • Therefore, if a linear functional form is correct, we can get this causal effect estimate from an MLR model of Profit with Review Score and Length as predictors; there is no need to include budget in the regression model!
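
  • A minimal sketch of what that regression might look like, assuming the movie data live in a data frame called movies (a hypothetical name) with the profit, score, and leng (length) columns shown in the earlier printout:

# Hypothetical sketch: the data frame name `movies` and its columns are assumed
causal_mod <- lm(profit ~ score + leng, data = movies)
summary(causal_mod)   # the coefficient on `score` is the adjusted estimate of the
                      # causal effect of Review Score on Profit, with Length
                      # closing both backdoor paths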

Example Extension

  • Suppose instead we were interested in the causal effect of Length on Profit

  • Identify the backdoor path/s. You can follow Pearl’s Backdoor Criterion:

  1. Identify all paths between T (now, Length) and Y (Profit)
  2. Check if any of them start with an incoming arrow into Length (i.e., a backdoor path)
  3. Assess whether any of these backdoor paths are open (i.e., not blocked by a collider or by adjustment), as open backdoor paths create confounding and introduce bias

Identify Backdoor Paths

  • We have only one backdoor path \(Length \leftarrow Budget \rightarrow Review \ Score \rightarrow Profit\)

  • Thus, we should condition on (control for) Budget to close that path
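
  • We can confirm this with dagitty, reusing the movieDAG object defined earlier:

adjustmentSets(movieDAG, exposure = "Length", outcome = "Profit")
# should return { Budget }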

Why should we not condition on Review Score in addition/instead?

Mediators

  • Gelman et al. (2021) offer the general advice “do not adjust for posttreatment variables”, but distinguishing posttreatment variables from other variables might be challenging in observational research

  • Carefully considering the (potential) causal roles of variables in a system and creating a DAG can clarify which variables should be controlled

  • Sometimes a focal explanatory variable, T, is hypothesized to exert an indirect effect on an outcome, Y, through a mediator, M

Mediators

  • Following the Gelman et al. advice, a mediator is a type of posttreatment variable, and if we want to estimate the causal effect of T on Y, we should NOT control for M

  • In the movie DAG, the axiom that posttreatment variables should not be controlled is the reason that Review Score is not listed as a variable to adjust for when estimating the causal effect of Length on Profit; the DAG suggests that Review Score is a mediator between Length and Profit

  • In psychological research, however, we are often interested in describing the mediation process itself; that is, we wish to partition the total effect of T on Y into a direct effect and an indirect (i.e., mediated) effect

  • Mediation analysis is popular in psychology because it addresses the question of how a treatment or intervention affects an outcome

    • For example, researchers may hypothesize that a certain type of therapy (treatment, T) improves a child’s behavioural outcome (e.g., externalizing behaviour, Y) by improving the child’s emotion regulation (mediator, M)
  • To be clear: Mediation is a causal process!

Mediators

  • Assuming that linear models are correct, the mediation DAG below indicates that we can get the direct and indirect effects with two regression models:
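
  • A sketch of those two models, assuming the standard linear mediation setup (intercepts and error terms included here for completeness):

\[M_i = \beta_{0M} + {\color{purple}{\beta_{MT}}}\,T_i + \epsilon_{Mi}\]

\[Y_i = \beta_{0Y} + {\color{deeppink}{\beta_{YT}}}\,T_i + {\color{gold}{\beta_{YM}}}\,M_i + \epsilon_{Yi}\]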

  • The coefficient \({\color{purple}{\beta_{MT}}}\) corresponds to edge \({\color{purple}{a}}\); \({\color{deeppink}{\beta_{YT}}}\) corresponds to \({\color{deeppink}{c}}\); and \({\color{gold}{\beta_{YM}}}\) corresponds to \({\color{gold}{b}}\)

  • The direct effect of T on Y is thus \({\color{deeppink}{\beta_{YT}}} = {\color{deeppink}{c}}\) , whereas the indirect effect is typically quantified as \({\color{purple}{\beta_{MT}}} \times {\color{gold}{\beta_{YM}}} = {\color{purple}{a}} \times {\color{gold}{b}}\)

  • The total causal effect is the sum of the direct and indirect effects: \({\color{deeppink}{\beta_{YT}}}+{\color{purple}{\beta_{MT}}} \times {\color{gold}{\beta_{YM}}}= {\color{deeppink}{c}}+{\color{purple}{a}} \times {\color{gold}{b}}\)

Mediators

  • The total causal effect is the sum of the direct and indirect effects: \({\color{deeppink}{\beta_{YT}}}+{\color{purple}{\beta_{MT}}} \times {\color{gold}{\beta_{YM}}}= {\color{deeppink}{c}}+{\color{purple}{a}} \times {\color{gold}{b}}\)

  • Returning to Gelman et al.’s point about not adjusting for post-treatment variables (in this case, a mediator), the total effect can also be obtained by simply regressing Y on T while ignoring M!

EXAMPLE, again…

  • Suppose movies with higher Tomatometer scores see an increased level of advertisement (“Critics agree! Avatar is brilliant!”), which in turn leads to greater profit

  • Of course, it is also the case that a bigger budget leads to more advertisement, but it doesn’t seem plausible that movie length has a direct effect on advertisement

Mediators, Another Example

  • The DAG shows that Advertisement is a (hypothesized) mediator between Review Score and Profit. Incidentally, Advertisement is also a mediator between Budget and Profit. Which variables need to be conditioned on?

Mediators, Another Example

movieDAG2 <- dagitty ("dag{
Budget -> Length
Budget -> Review_Score
Budget -> Advertisement
Length -> Profit
Length -> Review_Score
Review_Score -> Profit
Review_Score -> Advertisement
Advertisement -> Profit
}")
# Extract edges
DAG_dat2 <- as.data.frame(dagitty::edges(movieDAG2), stringsAsFactors = FALSE)
colnames(DAG_dat2) <- c("from", "to")  # Rename columns
DAG_dat2 <- DAG_dat2 %>% select(from, to) # include only from and to columns
# Convert to igraph object
igobj2 <- graph_from_data_frame(DAG_dat2, directed = TRUE)
# Check for cycles using igraph's `is_dag()`
is_dag(igobj2)  # TRUE if acyclic, FALSE if cycles exist
[1] TRUE
adjustmentSets(movieDAG2, exposure = "Review_Score", outcome = "Profit")
{ Budget, Length }

Mediators, Another Example

  • Now, we see that Advertisement was not listed as a variable to control when estimating the causal effect of Review Score on Profit
    • Consistent with the idea that mediators (or post-treatment variables generally) should not be controlled when estimating the total causal effect of an explanatory variable

Causal Inference: Key Takeaways




Stop! 🛑 Activity Time!

Class Activity

Draw a DAG to represent the following processes: Suppose we want to estimate the causal effect of professor rank on the outcome salary

  • We also have data for two covariates, years since PhD and years of service, which are obviously correlated (in fact, they are collinear, but one does not cause the other!)

  • We know that years of service leads to rank because a professor must work at a university for several years before becoming eligible for promotion

  • Years of service also leads to salary because professors typically get a small pay raise every year, regardless of their rank

  • Years since PhD may have a separate influence on rank and salary because when they are first hired by a university, professors who already have post-PhD experience are likely to be offered a higher starting salary and may get earlier tenure/promotion than professors who have only just completed their PhD

Other than rank (exposure), which predictors must be included in the model to obtain an unbiased causal estimate on salary?

Class Activity

Class Activity

salaryDAG <-dagitty("dag{
  Rank -> Salary
  ServiceYears -> Rank
  ServiceYears -> Salary
  YrsPhD -> Salary
  YrsPhD -> Rank
}")
DAG_salary <- as.data.frame(dagitty::edges(salaryDAG), stringsAsFactors = FALSE)
colnames(DAG_salary) <- c("from", "to")  # Rename columns
DAG_salary <- DAG_salary %>% select(from, to) # include only from and to columns
# Convert to igraph object
igobj_sal <- graph_from_data_frame(DAG_salary, directed = TRUE)
# Check for cycles using igraph's `is_dag()`
is_dag(igobj_sal)  # TRUE if acyclic, FALSE if cycles exist
[1] TRUE
adjustmentSets(salaryDAG, exposure="Rank",outcome="Salary")
{ ServiceYears, YrsPhD }

Class Activity

library(ggdag)
ggdag(salaryDAG)+theme_dag()




Predictive Modeling in Psychology

Predictive Modeling

  • Unlike explanatory modeling, predictive modeling is explicitly focused on developing a model that provides accurate forecasts, or predictions, for critical outcomes that have not yet been observed

  • Recall that identifying the data-generating mechanisms and causal relationships will not necessarily yield the most successful prediction in the real world and vice versa

  • Especially when it comes to incredibly complex topics, such as human behaviour and mental health, reducing a predictive model to a mere two to seven variables (a common number of variables in psychological research) is simply naïve

  • Perhaps the reason for the mainstream use of the explanatory approach in psychology is that, in the not-so-distant past, before powerful computing and the ability to collect “big data,” identifying the most critical factors affecting the variable of interest was our best attempt at a reasonable forecast

  • But, with the recent growth in machine learning and computational muscle, better predictions carried out by heavy-duty methods are right at our fingertips

Explaining and Predicting Are Different

  • The difference between explanatory and predictive modelling manifests itself in multiple ways:

    • Causation vs. association
    • Retrospective vs. prospective
    • Theory vs. data
    • Bias vs. expected prediction error (EPE). What does this mean?!
  • It means that explanatory modeling primarily seeks to minimize bias—that is, to make sure our estimated model is as close as possible to the true data-generating mechanism (model) that gave rise to the observed data—whereas, the goal behind predictive modeling is to minimize the EPE. Makes sense?

Expected Prediction Error

  • As its name suggests, the EPE measures how well our estimated model predicts new, unseen data

  • It represents the average (i.e., expected) squared difference between the true values, \(Y\), and the model’s predictions, \(\hat{f}(X)\), over all possible datasets and inputs

  • Mathematically,

\[EPE= \mathbb{E}[(Y - \hat{f}(X))^2]\]

  • Recall that \(\mathbb{E}\) refers to “expected,” which is just another word for “mean” or “average”

  • \(X\) is simply the input; you can think of it as a dataset from the population of interest with one or more variables

  • As it turns out, EPE is “made up” of a few components, two of which are bias and estimation variance

Bias and Variance

  • When we fit a model to predict an outcome, \(Y\), based on some input, \(X\), our model naturally makes errors

  • These errors can be broken down into bias and estimation variance (often referred to simply as variance)

  • Bias measures how much the model’s prediction, on average, differs systematically from the true values; bias is about the difference between the true function, \(f(X)\), and the average of the estimated function, \(\mathbb{E}[\hat{f}(X)]\), across different samples from the same population

\[\text{Bias} = \mathbb{E}[\hat{f}(X)] - f(X)\]

  • \(\hat{f}(X)\) = an estimated model from a single dataset/sample (i.e., \(\hat{Y}\))
  • \(\mathbb{E}[\hat{f}(X)]\) = mean of all our model estimations (across all datasets)
  • \(f(X)\) = the true model in the population (the data-generating mechanism), which we usually can’t actually know

Bias and Variance

  • Estimation variance refers to how much our model’s parameter estimates, and thus its predictions, \(\hat{f}(X)\), fluctuate across different datasets/samples

\[\text{Var}(\hat{f}(X)) = \mathbb{E} \left[ (\hat{f}(X) - \mathbb{E}[\hat{f}(X)])^2 \right]\]

  • Note that the Variance equation is similar to the variance equation you know…it’s the average squared difference between each estimated model and the mean of the estimated models, right?

The Bias-Variance Trade-Off

Bias and Variance

Yarkoni & Westfall, 2017

The Make-Up of EPE

  • Together, bias and estimation variance make up EPE, such that:

\[EPE= \mathbb{E}[(Y - \hat{f}(X))^2] = Bias^2 + Var(\hat{f}(X))+Var(Y)\]

  • where \(Var(Y)\) represents the noise in the outcome variable that no model can eliminate (e.g., naturally random errors, \(\epsilon\))

  • Recall that predictive modeling is “hell-bent” on minimizing EPE? So, to minimize EPE, we must try to minimize both bias and estimation variance

  • There’s only one problem, though: when bias goes down, variance goes up and vice versa…oopsy! This is often referred to as the bias-variance trade-off
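
  • For the curious, here is a sketch of why the decomposition above holds, assuming \(Y = f(X) + \epsilon\) with \(\mathbb{E}[\epsilon]=0\) and \(\epsilon\) independent of \(\hat{f}(X)\) (so the cross term vanishes); the last term is the irreducible noise referred to as \(Var(Y)\) above:

\[\begin{aligned} EPE &= \mathbb{E}[(Y - \hat{f}(X))^2] = \mathbb{E}[(f(X) + \epsilon - \hat{f}(X))^2]\\ &= \mathbb{E}[(f(X) - \hat{f}(X))^2] + \mathbb{E}[\epsilon^2]\\ &= \underbrace{\big(\mathbb{E}[\hat{f}(X)] - f(X)\big)^2}_{\text{Bias}^2} + \underbrace{\mathbb{E}\big[(\hat{f}(X) - \mathbb{E}[\hat{f}(X)])^2\big]}_{\text{Var}(\hat{f}(X))} + \underbrace{\text{Var}(\epsilon)}_{\text{irreducible noise}} \end{aligned}\]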

The Bias-Variance Trade-Off

  • The theory behind predictive modelling rests primarily on the trade-off between bias and variance

  • The Bias-Variance Trade-off is a key concept in statistical modeling and machine learning that describes the balance between underfitting and overfitting when making predictions

  • The bias-variance trade-off refers to the notion that more complex models (as indicated by the number of parameters) tend to fit the data better, but they also lead to greater variability in prediction accuracy when fitted to other samples (low bias-high variance; overfitting)

  • In contrast, a simpler model may fit the data less accurately, but its predictions vary to a lesser extent (higher bias-lower variance; underfitting)
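
  • A tiny simulation sketch in R (entirely hypothetical data, not from the course materials) makes this concrete: a 10th-degree polynomial fits the training sample better than a simple straight line, but typically predicts a new sample from the same population worse

set.seed(123)
f_true <- function(x) 2 + 0.5 * x                       # hypothetical "true" data-generating function
train <- data.frame(x = runif(30, 0, 10))
train$y <- f_true(train$x) + rnorm(30, sd = 2)
test <- data.frame(x = runif(1000, 0, 10))
test$y <- f_true(test$x) + rnorm(1000, sd = 2)
simple <- lm(y ~ x, data = train)                       # simpler model: more bias, less variance
complex <- lm(y ~ poly(x, 10), data = train)            # flexible model: less bias, more variance
mse <- function(model, data) mean((data$y - predict(model, newdata = data))^2)
c(train_simple = mse(simple, train), train_complex = mse(complex, train))  # complex always wins in-sample
c(test_simple = mse(simple, test), test_complex = mse(complex, test))      # simple typically wins out-of-sample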

The Bias-Variance Trade-Off


Minimizing EPE

  • So, if our goal is to minimize EPE, we must find a good balance between bias and variance, the optimal point, AKA the “sweet spot”; that is the goal of predictive modeling!

  • OK, but how?! Well, we have a bunch of options…in this Module, we’re only going to cover one: cross-validation

  • But there are many others, such as the lasso, ridge regression, and other regularization methods

Cross-Validation Methods

  • To minimize overfitting and obtain a model that will be capable of producing more precise predictions for observations beyond our sample (i.e., unobserved, future data), we can (and should) use cross-validation methods

  • The goal of cross-validation is to evaluate the predictive performance of a model by quantifying the out-of-sample prediction error

  • Cross-validation refers to a family of procedures which include “calibrating” a model by fitting it (or, “training” it) only on a portion of the data and “validating” it (or, “testing” it) on the remaining portion of the data

  • Practically speaking, the principles of cross-validation are as follows:

  1. The sample is randomly split into two smaller datasets, referred to as the training data and the test data (AKA a validation sample or hold-out sample)

Cross-Validation Methods

  2. A model is fitted to the training data and the fit of the training model is measured using multiple-\(R^2\) (i.e., the coefficient of determination)

  3. The same model, with the same parameters and predictors, is then used to generate predicted values, \(\hat{Y}\), for the test (validation) data

  4. The correlation between \(\hat{Y}\) and \(Y\) in the test data is called the cross-validation correlation (its square is denoted \(R_{CV}^2\)) and represents the extent to which the model can predict (or generalize to) out-of-sample data from the same population

    • That is, the cross-validation correlation statistic is a measure of the predictive performance of the model in question
  5. A useful statistic which is commonly used in cross-validation is the out-of-sample mean square error (MSE), calculated as

\[\text{MSE} = \frac{1}{n_{test}} \sum_{i=1}^{n_{test}} (y_i - \hat{y}_i)^2\] with lower MSE values indicating greater predictive ability

Cross-Validation Example

  • We’ll use the penguins dataset in the palmerpenguins package which includes a sample of N=344 penguins from Antarctica, describing their different species, flipper length, body mass, bill dimension, and sex
library(palmerpenguins)
peng <- na.omit(penguins)
nrow(peng)
[1] 333

STEP 1: Split the Data into Training and Test Subsets

set.seed(311)
train.data <- peng %>% slice_sample(prop = 0.5) # randomly select 50% of sample
test.data <- anti_join(peng, train.data) # select the remaining rows
nrow(train.data)
[1] 166
nrow(test.data)
[1] 167

Cross-Validation Example

STEP 2: Fit a Hypothesized Model to Training Data

  • Here, I use an MLR model, where penguins’ body mass (in grams) is regressed on both bill length and depth (in millimeters).
mod <- lm(body_mass_g~bill_length_mm+bill_depth_mm, data=train.data)
summary(mod) # NOTE THE R2 and adjusted-R2

Call:
lm(formula = body_mass_g ~ bill_length_mm + bill_depth_mm, data = train.data)

Residuals:
     Min       1Q   Median       3Q      Max 
-1406.52  -405.82     2.52   347.29  1498.38 

Coefficients:
               Estimate Std. Error t value Pr(>|t|)    
(Intercept)    2967.499    586.152   5.063 1.11e-06 ***
bill_length_mm   83.022      8.354   9.938  < 2e-16 ***
bill_depth_mm  -138.321     22.643  -6.109 7.12e-09 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 572.1 on 163 degrees of freedom
Multiple R-squared:  0.5068,    Adjusted R-squared:  0.5007 
F-statistic: 83.74 on 2 and 163 DF,  p-value: < 2.2e-16

Cross-Validation Example

STEP 3: Get Model Predictions for Test Data

head(Yhat <- predict(mod, test.data)) # calculating the predicted values on the TEST DATA
       1        2        3        4        5        6 
3823.496 3380.839 3734.930 3510.858 3945.242 3239.731 

STEPS 4 and 5: Calculate the Cross-validation Correlation and MSE

cor(Yhat, test.data$body_mass_g) # CV correlation
[1] 0.6564795
resid <- test.data$body_mass_g - Yhat # residuals
(MSE <- mean(resid^2)) # MSE
[1] 373759.3

Cross-Validation Example

  • Calculating the cross-validation correlation on the test data serves as a proxy for how well our model will predict future observations

  • The inverse of predictive power is the discrepancy between the predicted and observed values when applied to a new sample (from the same population); this discrepancy is referred to as the test error

  • Hypothetically, if we calculate the test error for each sample randomly selected from the population an infinite number of times and average them, we’ll get the EPE

  • The estimation variance introduced earlier, \(Var(\hat{f}(X))\), refers to how test error varies from one sample (\(X\)) to another

\(K\)-fold Cross-Validation

  • If cross-validation correlation and test error are our measures of predictive power (or lack thereof), how much confidence should we have in our model if it was trained on only half our sample?

  • The answer is…well, not a lot…This is where the \(k\)-fold cross-validation technique comes in handy

  • The idea behind \(k\)-fold cross-validation involves splitting the sample in random order across \(k\) unique subsets, referred to as folds, of (about) the same size

  • The first fold is “held out” to be used as the test data whereas the model is fitted to the rest of the \(k-1\) folds

  • Next, the estimated model is applied to the “hold-out” (i.e., test) fold to compute one estimate of the test error, MSE

  • This process is then repeated an additional \(k-1\) times; with each new iteration, a different fold is treated as the “hold-out,” or test, fold, resulting in \(k\) total estimates of the test error, i.e., \(\{MSE_1, MSE_2, MSE_3, \ldots, MSE_k\}\)

\(K\)-fold Cross-Validation

  • To get the \(k\)-fold cross-validation test error estimate, the \(k\) MSE statistics are averaged

\[CV_{(k)} = \frac{1}{k}\sum_{i=1}^k MSE_i\]
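
  • Before turning to caret (used in the example that follows), here is a minimal “by hand” sketch of the procedure, reusing the peng data defined earlier, just to make the mechanics concrete:

set.seed(311)
k <- 10
folds <- sample(rep(1:k, length.out = nrow(peng)))   # randomly assign each row to one of k folds
mse_k <- numeric(k)
for (j in 1:k) {
  train_j <- peng[folds != j, ]                      # fit on the other k-1 folds
  test_j <- peng[folds == j, ]                       # hold out fold j as the test data
  fit_j <- lm(body_mass_g ~ bill_length_mm + bill_depth_mm, data = train_j)
  yhat_j <- predict(fit_j, newdata = test_j)
  mse_k[j] <- mean((test_j$body_mass_g - yhat_j)^2)  # test error (MSE) for fold j
}
mean(mse_k)                                          # CV_(k): the average of the k MSEs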

\(K\)-fold Cross-Validation Example

library(caret)
# Define k-fold cross-validation settings
k <- 10  # Number of folds
train.control <- trainControl(method = "cv", number = k)
# Train the model using 10-fold cross-validation
train.model <- train(body_mass_g ~ bill_length_mm + bill_depth_mm, 
                     data = peng, 
                     method = "lm",
                     trControl = train.control)
# Print results
train.model 
Linear Regression 

333 samples
  2 predictor

No pre-processing
Resampling: Cross-Validated (10 fold) 
Summary of sample sizes: 300, 300, 300, 299, 299, 298, ... 
Resampling results:

  RMSE      Rsquared  MAE     
  590.3279  0.472769  484.3707

Tuning parameter 'intercept' was held constant at a value of TRUE
  • Note the average Rsquared and how it compares to the original (=0.51). Also note that \(RMSE=\sqrt{MSE}\)

\(K\)-fold Cross-Validation Example

# Extract CV-MSE (Mean Squared Error from cross-validation)
cv_mse <- train.model$resample$RMSE^2  # Squaring RMSE gives MSE for each fold
mean(cv_mse)  # Average MSE across folds
[1] 350412.9
  • In this example, \(k = 10\); setting \(k = N\) (the entire sample size) leads to a special case of \(k\)-fold cross-validation known as leave-one-out cross-validation (LOOCV)

  • With LOOCV, each iteration uses only \(n = 1\) for the test data

  • LOOCV tends to work better than cross-validation with fewer folds, but it is more computationally intensive

Leave-One-Out Cross-Validation (LOOCV)



LOOCV Cross-Validation Example

library(caret)
train.control.LOOCV <- trainControl(method = 'LOOCV')
train.model.LOOCV <- train(body_mass_g~bill_length_mm+bill_depth_mm, 
                           data = peng,
                           method = "lm",
                           trControl = train.control.LOOCV)
train.model.LOOCV 
Linear Regression 

333 samples
  2 predictor

No pre-processing
Resampling: Leave-One-Out Cross-Validation 
Summary of sample sizes: 332, 332, 332, 332, 332, 332, ... 
Resampling results:

  RMSE     Rsquared  MAE     
  591.936  0.458019  484.7524

Tuning parameter 'intercept' was held constant at a value of TRUE
  • Note the average Rsquared and how it compares to the original (=0.51) and to the 10-fold cv (=.47)

Cross-Validation Methods

  • Until now, we used cross-validation to evaluate the predictive performance or generalization capacity of a particular model (e.g., \(\text{body mass}_i= \beta_0+\beta_1\text{bill length}_i + \beta_2 \text{bill depth}_i + \epsilon_i\))

  • But, cross-validation is also a great tool for model comparison and selection!

  • For example, say you’re trying to decide which variables are better at predicting penguin body mass; you already tried bill length and depth, but what about sex and species? That is,

\[\text{MOD1: body mass}_i= \beta_0+\beta_1\text{bill length}_i + \beta_2 \text{bill depth}_i + \epsilon_i \\ \text{vs.} \\ \text{MOD2: body mass}_i= \beta_0+\beta_1\text{sex}_i + \beta_2 \text{species}_i + \epsilon_i\]

Cross-Validation Methods

train.control.LOOCV <- trainControl(method = 'LOOCV')
# Mod 1:
train.model1.LOOCV <- train(body_mass_g~bill_length_mm+bill_depth_mm, 
                      data = peng, method = "lm", trControl = train.control.LOOCV)
# Mod 2:
train.model2.LOOCV <- train(body_mass_g~sex+species,
                      data = peng, method = "lm", trControl = train.control.LOOCV)

library(knitr)  # for kable()
mod_compare <- data.frame(t(train.model1.LOOCV$results), t(train.model2.LOOCV$results))[-1,]
colnames(mod_compare) <- c("Mod1: bill_len+dep", "Mod2: sex+spec")
kable(mod_compare)
           Mod1: bill_len+dep   Mod2: sex+spec
RMSE               591.936030      318.6248079
Rsquared             0.458019        0.8429538
MAE                484.752447      256.4213241
  • Which model has better predictive ability?

  • Of course you can extend this to many more options at once, with different variables, different numbers of predictors, different analysis approaches (e.g., WLS, removing outliers), etc.

Cross-Validation Takeaway

  • Cross-validation techniques are simple but extremely powerful and popular tools in predictive modeling and machine learning.

  • Yes, that’s right, you just got a little taste of machine learning!

  • They give us a minimally biased way of gauging the true generalization performance of any model

  • Generally speaking, the overfitting observed when using the same data to both train and test a model will largely disappear when cross-validation is applied, and the cross-validated estimate of a model’s generalization performance will (on average) typically be very close to the true out-of-sample performance

  • Another purpose of cross-validation is model selection: you can compare and contrast the same model under each possible analysis approach and/or different models altogether and check which brings about the best predictive power!

  • Other relevant and interesting tools in predictive modeling and machine learning include regularization methods like Lasso (L1), Ridge (L2) regression, and more…