Module 5: Explanatory and Predictive Modeling

PSYC 3032 M

Udi Alter

Things Udi Should Remember

Before we start

  • Congrats on finishing the 1st assignment? How was it?
  • Thanks for filling out the survey
  • Groups, babies?
  • Zoom/lecture recording
  • Last class topic

About Module 5

Goals for Today:

  • To explain or to predict? That is the question
  • Introduction to explanatory modeling
  • Causal inference
  • DAGitty DAG DAG
  • A bit on mediation
  • Predictive modeling
  • A bit on cross-validation


I spent most of my time during reading week:


A) Reading/school stuff, duh?!

B) Sleeping

C) Working (at a job)

D) Hanging out with family/friends

E) Binge-watching shows/movies



What’s your spirit animal? (and you can’t say platypus)




Explanatory vs. Predictive Modeling in Psychology

Explanatory vs. Predictive Modeling

  • The primary aim of scientific psychology is to gain a deeper understanding of human behavior

  • Traditionally, this involved both explaining behavior by identifying underlying causes and predicting behavior by forecasting future actions. In practice, however, these objectives are often treated as interchangeable when, in fact, they are not

  • The assumption is that models that best explain behavior should also best predict future behavior. Unfortunately, although explanation and prediction may be philosophically compatible, there are good reasons to think that they are often in statistical and pragmatic tension with one another

  • Statistically speaking, the model that most closely approximates the data-generating process will NOT necessarily be the most successful at predicting real-world outcomes

  • Similarly, models that do the best job at predicting outcomes of interest will NOT necessarily identify the underlying causes

  • As a result, researchers must make a conscious choice: to explain or to predict?




Explanatory Modeling in Psychology

Explanatory Modeling and Causal Inference

“In the explanatory approach to science, the ultimate goal is to develop a mechanistic model of the data-generating process that gives rise to the observed data.”
Yarkoni & Westfall, 2017

“In explanatory modeling, the focus is on minimizing bias to obtain the most accurate representation of the underlying theory”
Shmueli, 2010

  • Most of the research in psychology is explanatory; it’s meant to represent a theoretical account for the variation of the outcome variable

  • Typically this account (either explicitly or implicitly) represents the actual causal mechanisms, or a subset of potential causal mechanisms, that produce differences in the outcome

Explanatory Modeling and Causal Inference

  • In both the research design and analysis stages, researchers include the variables they believe explain or cause the variation of the effect or outcome of interest

  • Often, researchers will also include covariates (e.g., age, gender, income) deemed potentially important (either by prior research or their own theoretical interests)

  • Researchers include covariates in their models to reduce the risk of drawing invalid or misleading conclusions about the importance of an explanatory variable, ensuring that any observed effect is not simply due to its association with other related variables

  • But, what drives researchers to select these covariates that are supposed to prevent such spurious effects?

  • Researchers are advised to let theory guide decisions about the inclusion of important covariates. But such advice is usually vague and difficult to adhere to.

  • In practice, researchers haphazardly select covariates using their common sense and make such decisions at their own subjective discretion

  • Instead, we should use causal inference principles to make more sensible decisions about which covariates to account for in our explanatory models

Causal Inference

Causal Inference

Causal inference concerns what would happen to an outcome (Y) as a result of a treatment, intervention, or exposure (T), given pretreatment information (X) (Gelman et al., 2021)

  • To define causation formally, we start by recognizing that treatment effects vary from one unit of observation (e.g., one person) to another, implying that there is a separate causal effect \(\tau_i\) for each case, i, in a sample or population

  • Using the counterfactual framework (i.e., Rubin’s causal model; Rubin, 1974), assume that the treatment (T) indicator \(t_i=1\) when case i receives the treatment and \(t_i=0\) when case i does not receive the treatment (i.e., control condition)

  • Then, \(y^1_i\) is the outcome value that would occur if case i received the treatment, and \(y^0_i\) is the outcome value that would occur if case i DID NOT receive the treatment; these values are referred to as potential outcomes

  • Thus, the individual causal effect for case i is: \(\tau_i = y^1_i - y^0_i\)

Causal Inference

  • The “fundamental problem of causal inference” is that, for each person i, we can only observe either \(y^1_i\) or \(y^0_i\), but not both!

  • Consider these hypothetical data:

\[ \scriptsize \begin{array}{c c c c c} \hline & & {\textbf{Potential outcomes}} \\ \hline \text{Case } i & t_i & y_i^0, \text{ if } t_i = 0 & y_i^1, \text{ if } t_i = 1 & y_i \\ & \text{(treatment indicator)} & & & \text{(observed outcome)} \\ \hline 1 & 0 & 40 & ? & 40 \\ 2 & 0 & 40 & ? & 40 \\ 3 & 0 & 50 & ? & 50 \\ 4 & 1 & ? & 55 & 55 \\ 5 & 1 & ? & 55 & 55 \\ 6 & 1 & ? & 60 & 60 \\ \hline \end{array} \]

Causal Inference

  • The sample average treatment effect (ATE) is simply the mean of the individual treatment effects

\[\bar{\tau} = \frac{1}{N} \sum_{i=1}^{N} \tau_i = \frac{1}{N} \sum_{i=1}^{N} (y_i^1 - y_i^0) \\ = \frac{1}{N} \sum_{i=1}^{N} y_i^1 - \frac{1}{N} \sum_{i=1}^{N} y_i^0\]

  • Still, we cannot calculate the ATE directly because we observe only either \(y^1_i\) or \(y^0_i\); half of the data are missing!

  • But, if we assume that the potential outcomes that are observed are randomly sampled from the population, then their means, \(\bar{Y}_{t=1}\) and \(\bar{Y}_{t=0}\), will give us unbiased estimates of \(\frac{1}{N} \sum_{i=1}^{N} y_i^1\) and \(\frac{1}{N} \sum_{i=1}^{N} y_i^0\), respectively

Causal Inference

  • Under this strong assumption, we can estimate the ATE by subtracting the mean of the observed scores in the control condition from the mean of the observed scores in the treatment condition:

\[\hat{ATE}=\bar{Y}_{t=1}-\bar{Y}_{t=0}\]
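
  • To make this concrete, here is the estimate computed from the hypothetical table shown earlier:

\[\hat{ATE}=\bar{Y}_{t=1}-\bar{Y}_{t=0} = \frac{55+55+60}{3} - \frac{40+40+50}{3} \approx 56.67 - 43.33 \approx 13.33\]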

  • The above estimator of the ATE is unbiased with respect to the population ATE only when cases are randomly assigned to the levels of the treatment indicator T. Otherwise, the mean difference could be determined by some pretreatment characteristic other than the treatment itself (e.g., a covariate of sorts)

  • Yet, random assignment only guarantees that the treated and control groups are balanced on all relevant pretreatment characteristics on average across an infinite set of hypothetical samples from the same population (i.e., they are balanced in expectation)

Causal Inference

  • With any single sample, however, the treated and control groups may still differ with respect to pretreatment characteristics, especially if the sample is small, leading to sample mean differences that are potentially widely discrepant from the population ATE

  • In other words, random assignment may not effectively cancel out pretreatment or confounding differences between the treated and control groups in a single sample

  • Thus, it’s advisable to account for pretreatment group differences even with random assignment (Gelman et al., 2021)

  • That is, to obtain an unbiased causal effect, we have to carefully consider how other variables, namely confounders, influence the association between our explanatory variable(s) and the outcome, and potentially adjust for these in our GLM

DAG, DAGITTY, DAG DAG

  • A set of hypothesized causal relations is often represented using a directed acyclic graph (DAG)

  • In the simple case of a randomized treatment variable, T, having a causal influence on an outcome, Y, the DAG looks like this:

  • In a DAG, variables are known as nodes, whereas an arrow connecting two variables is known as an edge

Confining Confounders

  • DAGs are called acyclic to indicate that causes do not eventually have effects on themselves

  • In formal causal modeling, a confounder has a causal effect on both the focal explanatory variable and the outcome variable (remember this?)

  • In this DAG, we see a fork; age is “pointing” at (causing) both exercise and cholesterol

DAG Building Blocks

  • Every DAG, no matter how big and complicated, is built out of these four relations

Backdoor Paths

  • Using DAG heuristics, to obtain an unbiased estimate of the effect, it is necessary to adjust for a confounder Age (either statistically or with experimental design) because age opens a backdoor path between Exercise and Cholesterol:

\[Exercise \leftarrow Age \rightarrow Cholesterol\]

  • A backdoor path is an indirect path between an explanatory variable (T/Exercise) and an outcome variable (Y/Cholesterol) that does not go through the intended causal direction (i.e., it doesn’t follow the arrows forward); instead, it “enters” the explanatory variable through the backdoor—meaning it connects T and Y through a common cause (i.e., confounder; X/Age)
  • Adjusting for Age (either with research design or statistically) closes the backdoor path
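
  • As a quick preview of the dagitty package (introduced properly in a few slides), here is a sketch of how we could encode this fork and ask for the open paths and the adjustment set; the variable names are just those from the example above:

library(dagitty)
# Fork: Age causes both Exercise and Cholesterol; Exercise also affects Cholesterol
cholDAG <- dagitty("dag{
Age -> Exercise
Age -> Cholesterol
Exercise -> Cholesterol
}")
paths(cholDAG, from = "Exercise", to = "Cholesterol")                     # lists the direct path and the backdoor path
adjustmentSets(cholDAG, exposure = "Exercise", outcome = "Cholesterol")   # should return { Age }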

Shutting the Backdoor

  • Had we randomly assigned individuals to different levels of Exercise—i.e., manipulated Exercise, which means that Exercise would be \(\perp\) (orthogonal) to Age—then we would sever the causal effect Age has on Exercise and close the backdoor path

Shutting the Backdoor

  • “In general, then, causal effects can be estimated using regression if the model includes all confounding covariates (predictors that can affect treatment assignment or the outcome) and if the model is correct” (Gelman et al., 2021, p. 385)

  • But, actually, it’s only necessary to include in the model the confounders that block backdoor paths; if two confounders are causally associated with each other, we may only need to control one of them to obtain an unbiased causal effect (Morgan & Winship, 2015)

  • That is, as long as we close all backdoor paths, we can infer causality! Yes, we can say things like exercise lowers cholesterol. Don’t you feel powerful!?

  • Because DAGs provide an overview of the causal relationships among a set of theoretically relevant variables, they allow identification of a set of variables to adjust for in the analysis to remove confounding

STOP 🛑 Example Time

Causal Inference: Example

  • Let’s consider an actual dataset: a researcher is interested in estimating the causal effect of a movie’s Tomatometer score on its profit

  • The researcher collects data for N = 609 movies released between 2008 and 2012, including their net profit (i.e., measured as the difference between their gross income and budget in millions of US dollars)

  • They also record the movies’ numeric Tomatometer score, representing an aggregate of critics’ reviews of the film

  • Simply put, the researcher wants to devise a model for profit to explain how (and why) movies vary on this important outcome

year title gross budget leng score profit
2012 Django Unchained 162.80543 100 165 88 62.805434
2012 Parental Guidance 77.26493 25 105 18 52.264926
2012 The Impossible 19.01988 40 114 81 -20.980118
2012 Jack Reacher 80.07074 60 130 61 20.070736
2012 This is 40 67.54451 35 134 51 32.544505
2012 The Guilt Trip 37.13421 40 95 38 -2.865785

Causal Structure

  • There are a few variables and confounders we should consider: Let’s start with budget
    • Larger budgets allow for longer movies AND affect reviews
    • Longer movies also tend to have an effect on reviews (sometimes positive, other times negative)
    • Better reviews bring greater profit (what we’re after)
  • How do these variables come together in a DAG? Confirm there are no “cycles” AND identify all backdoor paths connecting review score and profit

Identify Backdoor Paths

  • There are two backdoor paths connecting review score and profit marked in red and orange (which confounders should we adjust for?)
  • Above, there are two backdoor paths: (1) \(Review \leftarrow Length \rightarrow Profit\) and (2) \(Review \leftarrow Budget \rightarrow Length \rightarrow Profit\)

Identify Backdoor Paths

  • We need to block ALL backdoor paths to get an unbiased estimate of the causal effect of review score on profit
  • As it happens, both backdoor paths go through Length, so if we control Length, we can close both! (Nice!)

No DAGitty, No Doubt

  • We can use the dagitty package to confirm everything!

dagitty Operations

  • First, let’s confirm this DAG is actually a proper DAG, i.e., acyclic
library(dagitty)
library(igraph)
library(dplyr)   # needed for the %>% pipe and select() below
# Define the DAG
movieDAG <- dagitty("dag{
Budget -> Length
Budget -> Review_Score
Length -> Profit
Length -> Review_Score
Review_Score -> Profit
}")
# Extract edges
DAG_dat <- as.data.frame(dagitty::edges(movieDAG), stringsAsFactors = FALSE)
colnames(DAG_dat) <- c("from", "to")  # Rename columns
DAG_dat <- DAG_dat %>% select(from, to) # include only from and to columns
# Convert to igraph object
igobj <- graph_from_data_frame(DAG_dat, directed = TRUE)
# Check for cycles using igraph's `is_dag()`
is_dag(igobj)  # TRUE if acyclic, FALSE if cycles exist
[1] TRUE

dagitty Operations

  • Now, let’s see what we need to control for:
adjustmentSets(movieDAG, exposure = "Review_Score", outcome = "Profit")
{ Length }
  • where exposure refers to the focal explanatory variable of interest

  • Therefore, if a linear functional form is correct, we can get this causal effect estimate from an MLR model of Profit with Review Score and Length as predictors; there is no need to include budget in the regression model!
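
  • A minimal sketch of what that regression might look like, assuming the movie data live in a data frame called movies (a hypothetical name) with the profit, score, and leng (length) columns shown in the earlier printout:

# Hypothetical sketch: the data frame name `movies` and its columns are assumed
causal_mod <- lm(profit ~ score + leng, data = movies)
summary(causal_mod)   # the coefficient on `score` is the adjusted estimate of the
                      # causal effect of Review Score on Profit, with Length
                      # closing both backdoor paths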

Example Extension

  • Suppose instead we were interested in the causal effect of Length on Profit

  • Identify the backdoor path/s. You can follow Pearl’s Backdoor Criterion:

  1. Identify all paths between T (now, Length) and Y (Profit)
  2. Check if any of them start with an incoming arrow into Length (i.e., a backdoor path)
  3. Assess whether any of these backdoor paths are open (i.e., not blocked by a collider or by adjustment), as open backdoor paths create confounding and introduce bias

Identify Backdoor Paths

  • We have only one backdoor path \(Length \leftarrow Budget \rightarrow Review \ Score \rightarrow Profit\)

  • Thus, we should condition on (control for) Budget to close that path
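
  • We can confirm this with dagitty, reusing the movieDAG object defined earlier:

adjustmentSets(movieDAG, exposure = "Length", outcome = "Profit")
# should return { Budget }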

Why should we not condition on Review Score in addition/instead?

Mediators

  • Gelman et al. (2021) offer the general advice “do not adjust for posttreatment variables”, but distinguishing posttreatment variables from other variables might be challenging in observational research

  • Carefully considering the (potential) causal roles of variables in a system and creating a DAG can clarify which variables should be controlled

  • Sometimes a focal explanatory variable, T, is hypothesized to exert an indirect effect on an outcome, Y, through a mediator, M

Mediators

  • Following the Gelman et al. advice, a mediator is a type of posttreatment variable, and if we want to estimate the causal effect of T on Y, we should NOT control for M

  • In the movie DAG, the axiom that posttreatment variables should not be controlled is the reason that Review Score is not listed as a variable to adjust for when estimating the causal effect of Length on Profit; the DAG suggests that Review Score is a mediator between Length and Profit

  • In psychological research, however, we are often interested in describing the mediation process itself; that is, we wish to partition the total effect of T on Y into a direct effect and an indirect (i.e., mediated) effect

  • Mediation analysis is popular in psychology because it addresses the question of how a treatment or intervention affects an outcome

    • For example, researchers may hypothesize that a certain type of therapy (treatment, T) improves a child’s behavioural outcome (e.g., externalizing behaviour, Y) by improving the child’s emotion regulation (mediator, M)
  • To be clear: Mediation is a causal process!

Mediators

  • Assuming that linear models are correct, the mediation DAG below indicates that we can get the direct and indirect effects with two regression models:
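
  • A sketch of those two models, assuming the standard linear mediation setup (intercepts and error terms included here for completeness):

\[M_i = \beta_{0M} + {\color{purple}{\beta_{MT}}}\,T_i + \epsilon_{Mi}\]

\[Y_i = \beta_{0Y} + {\color{deeppink}{\beta_{YT}}}\,T_i + {\color{gold}{\beta_{YM}}}\,M_i + \epsilon_{Yi}\]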

  • The coefficient \({\color{purple}{\beta_{MT}}}\) corresponds to edge \({\color{purple}{a}}\); \({\color{deeppink}{\beta_{YT}}}\) corresponds to \({\color{deeppink}{c}}\); and \({\color{gold}{\beta_{YM}}}\) corresponds to \({\color{gold}{b}}\)

  • The direct effect of T on Y is thus \({\color{deeppink}{\beta_{YT}}} = {\color{deeppink}{c}}\) , whereas the indirect effect is typically quantified as \({\color{purple}{\beta_{MT}}} \times {\color{gold}{\beta_{YM}}} = {\color{purple}{a}} \times {\color{gold}{b}}\)

  • The total causal effect is the sum of the direct and indirect effects: \({\color{deeppink}{\beta_{YT}}}+{\color{purple}{\beta_{MT}}} \times {\color{gold}{\beta_{YM}}}= {\color{deeppink}{c}}+{\color{purple}{a}} \times {\color{gold}{b}}\)

Mediators

  • The total causal effect is the sum of the direct and indirect effects: \({\color{deeppink}{\beta_{YT}}}+{\color{purple}{\beta_{MT}}} \times {\color{gold}{\beta_{YM}}}= {\color{deeppink}{c}}+{\color{purple}{a}} \times {\color{gold}{b}}\)

  • Returning to Gelman et al.’s point about not adjusting for post-treatment variables (in this case, a mediator), the total effect can also be obtained by simply regressing Y on T while ignoring M!

EXAMPLE, again…

  • Suppose movies with higher Tomatometer scores see an increased level of advertisement (“Critics agree! Avatar is brilliant!”), which in turn leads to greater profit

  • Of course, it is also the case that a bigger budget leads to more advertisement, but it doesn’t seem plausible that movie length has a direct effect on advertisement

Mediators, Another Example

  • The DAG shows that Advertisement is a (hypothesized) mediator between Review Score and Profit. Incidentally, Advertisement is also a mediator between Budget and Profit. Which variables need to be conditioned on?

Mediators, Another Example

movieDAG2 <- dagitty ("dag{
Budget -> Length
Budget -> Review_Score
Budget -> Advertisement
Length -> Profit
Length -> Review_Score
Review_Score -> Profit
Review_Score -> Advertisement
Advertisement -> Profit
}")
# Extract edges
DAG_dat2 <- as.data.frame(dagitty::edges(movieDAG2), stringsAsFactors = FALSE)
colnames(DAG_dat2) <- c("from", "to")  # Rename columns
DAG_dat2 <- DAG_dat2 %>% select(from, to) # include only from and to columns
# Convert to igraph object
igobj2 <- graph_from_data_frame(DAG_dat2, directed = TRUE)
# Check for cycles using igraph's `is_dag()`
is_dag(igobj2)  # TRUE if acyclic, FALSE if cycles exist
[1] TRUE
adjustmentSets(movieDAG2, exposure = "Review_Score", outcome = "Profit")
{ Budget, Length }

Mediators, Another Example

  • Now, we see that Advertisement was not listed as a variable to control when estimating the causal effect of Review Score on Profit
    • Consistent with the idea that mediators (or post-treatment variables generally) should not be controlled when estimating the total causal effect of an explanatory variable

Causal Inference: Key Takeaways




Stop! 🛑 Activity Time!

Class Activity

Draw a DAG to represent the following processes: Suppose we want to estimate the causal effect of professor rank on the outcome salary

  • We also have data for two covariates, years since PhD and years of service, which are obviously correlated (in fact, they are collinear, but one does not cause the other!)

  • We know that years of service leads to rank because a professor must work at a university for several years before becoming eligible for promotion

  • Years of service also leads to salary because professors typically get a small pay raise every year, regardless of their rank

  • Years since PhD may have a separate influence on rank and salary because when they are first hired by a university, professors who already have post-PhD experience are likely to be offered a higher starting salary and may get earlier tenure/promotion than professors who have only just completed their PhD

Other than rank (exposure), which predictors must be included in the model to obtain an unbiased causal estimate on salary?

Class Activity

Class Activity

salaryDAG <-dagitty("dag{
  Rank -> Salary
  ServiceYears -> Rank
  ServiceYears -> Salary
  YrsPhD -> Salary
  YrsPhD -> Rank
}")
DAG_salary <- as.data.frame(dagitty::edges(salaryDAG), stringsAsFactors = FALSE)
colnames(DAG_salary) <- c("from", "to")  # Rename columns
DAG_salary <- DAG_salary %>% select(from, to) # include only from and to columns
# Convert to igraph object
igobj_sal <- graph_from_data_frame(DAG_salary, directed = TRUE)
# Check for cycles using igraph's `is_dag()`
is_dag(igobj_sal)  # TRUE if acyclic, FALSE if cycles exist
[1] TRUE
adjustmentSets(salaryDAG, exposure="Rank",outcome="Salary")
{ ServiceYears, YrsPhD }

Class Activity

library(ggdag)
ggdag(salaryDAG)+theme_dag()




Predictive Modeling in Psychology

Predictive Modeling

  • Unlike explanatory modeling, predictive modeling is explicitly focused on developing a model that provides accurate forecasts, or predictions, for critical outcomes that have not yet been observed

  • Recall that identifying the data-generating mechanisms and causal relationships will not necessarily yield the most successful prediction in the real world and vice versa

  • Especially when it comes to incredibly complex topics, such as human behaviour and mental health, reducing a predictive model to a mere two to seven variables (a common number of variables in psychological research) is simply naïve

  • Perhaps the reason for the mainstream use of the explanatory approach in psychology is that, in the not-so-distant past, before powerful computing and the ability to collect “big data,” identifying the most critical factors affecting the variable of interest was our best attempt at a reasonable forecast

  • But, with the recent growth in machine learning and computational muscle, better predictions carried out by heavy-duty methods are right at our fingertips

Explaining and Predicting Are Different

  • The difference between explanatory and predictive modelling manifests itself in multiple ways:

    • Causation vs. association
    • Retrospective vs. prospective
    • Theory vs. data
    • Bias vs. expected prediction error (EPE). What does this mean?!
  • It means that explanatory modeling primarily seeks to minimize bias—that is, to make sure our estimated model is as close as possible to the true data-generating mechanism (model) that gave rise to the observed data—whereas, the goal behind predictive modeling is to minimize the EPE. Makes sense?

Expected Prediction Error

  • As its name suggests, the EPE measures how well our estimated model predicts new, unseen data

  • It represents the average (i.e., expected) squared difference between the true values, \(Y\), and the model’s predictions, \(\hat{f}(X)\), over all possible datasets and inputs

  • Mathematically,

\[EPE= \mathbb{E}[(Y - \hat{f}(X))^2]\]

  • Recall that \(\mathbb{E}\) refers to “expected,” which is just another word for “mean” or “average”

  • \(X\) is simply the input; you can think of it as a dataset from the population of interest with one or more variables

  • As it turns out, EPE is “made up” of a few components, two of which are bias and estimation variance

Bias and Variance

  • When we fit a model to predict an outcome, \(Y\), based on some input, \(X\), our model naturally makes errors

  • These errors can be broken down into bias and estimation variance (often referred to simply as variance)

  • Bias measures how much the model’s prediction, on average, differs systematically from the true values; bias is about the difference between the true function, \(f(X)\), and the average of the estimated function, \(\mathbb{E}[\hat{f}(X)]\), across different samples from the same population

\[\text{Bias} = \mathbb{E}[\hat{f}(X)] - f(X)\]

  • \(\hat{f}(X)\) = an estimated model from a single dataset/sample (i.e., \(\hat{Y}\))
  • \(\mathbb{E}[\hat{f}(X)]\) = mean of all our model estimations (across all datasets)
  • \(f(X)\) = the true model in the population (the data-generating mechanism), which we usually can’t actually know

Bias and Variance

  • Estimation variance refers to how much our model’s parameter estimates, and thus its predictions, \(\hat{f}(X)\), fluctuate across different datasets/samples

\[\text{Var}(\hat{f}(X)) = \mathbb{E} \left[ (\hat{f}(X) - \mathbb{E}[\hat{f}(X)])^2 \right]\]

  • Note that the Variance equation is similar to the variance equation you know…it’s the average squared difference between each estimated model and the mean of the estimated models, right?

The Bias-Variance Trade-Off

Bias and Variance

Yarkoni & Westfall, 2017

The Make-Up of EPE

  • Together, bias and estimation variance make up EPE, such that:

\[EPE= \mathbb{E}[(Y - \hat{f}(X))^2] = Bias^2 + Var(\hat{f}(X))+Var(Y)\]

  • where \(Var(Y)\) represents the noise in the outcome variable that no model can eliminate (e.g., naturally random errors, \(\epsilon\))

  • Recall that predictive modeling is “hell-bent” on minimizing EPE? So, to minimize EPE, we must try to minimize both bias and estimation variance

  • There’s only one problem, though: when bias goes down, variance goes up and vice versa…oopsy! This is often referred to as the bias-variance trade-off
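
  • For the curious, here is a sketch of why the decomposition above holds, assuming \(Y = f(X) + \epsilon\) with \(\mathbb{E}[\epsilon]=0\) and \(\epsilon\) independent of \(\hat{f}(X)\) (so the cross term vanishes); the last term is the irreducible noise referred to as \(Var(Y)\) above:

\[\begin{aligned} EPE &= \mathbb{E}[(Y - \hat{f}(X))^2] = \mathbb{E}[(f(X) + \epsilon - \hat{f}(X))^2]\\ &= \mathbb{E}[(f(X) - \hat{f}(X))^2] + \mathbb{E}[\epsilon^2]\\ &= \underbrace{\big(\mathbb{E}[\hat{f}(X)] - f(X)\big)^2}_{\text{Bias}^2} + \underbrace{\mathbb{E}\big[(\hat{f}(X) - \mathbb{E}[\hat{f}(X)])^2\big]}_{\text{Var}(\hat{f}(X))} + \underbrace{\text{Var}(\epsilon)}_{\text{irreducible noise}} \end{aligned}\]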

The Bias-Variance Trade-Off

  • The theory behind predictive modelling rests primarily on the trade-off between bias and variance

  • The Bias-Variance Trade-off is a key concept in statistical modeling and machine learning that describes the balance between underfitting and overfitting when making predictions

  • The bias-variance trade-off refers to the notion that more complex models (as indicated by the number of parameters) tend to fit the data better, but they also lead to greater variability in prediction accuracy when fitted to other samples (low bias-high variance; overfitting)

  • In contrast, a simpler model may fit the data less accurately, but its predictions vary to a lesser extent (higher bias-lower variance; underfitting)
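
  • A tiny simulation sketch in R (entirely hypothetical data, not from the course materials) makes this concrete: a 10th-degree polynomial fits the training sample better than a simple straight line, but typically predicts a new sample from the same population worse

set.seed(123)
f_true <- function(x) 2 + 0.5 * x                       # hypothetical "true" data-generating function
train <- data.frame(x = runif(30, 0, 10))
train$y <- f_true(train$x) + rnorm(30, sd = 2)
test <- data.frame(x = runif(1000, 0, 10))
test$y <- f_true(test$x) + rnorm(1000, sd = 2)
simple <- lm(y ~ x, data = train)                       # simpler model: more bias, less variance
complex <- lm(y ~ poly(x, 10), data = train)            # flexible model: less bias, more variance
mse <- function(model, data) mean((data$y - predict(model, newdata = data))^2)
c(train_simple = mse(simple, train), train_complex = mse(complex, train))  # complex always wins in-sample
c(test_simple = mse(simple, test), test_complex = mse(complex, test))      # simple typically wins out-of-sample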

The Bias-Variance Trade-Off


Minimizing EPE

  • So, if our goal is to minimize EPE, we must find a good balance between bias and variance, the optimal point, AKA the “sweet spot”; that is the goal of predictive modeling!

  • OK, but how?! Well, we have a bunch of options…in this Module, we’re only going to cover one: cross-validation

  • But there are many others, such as the lasso, ridge regression, and other regularization methods

Cross-Validation Methods

  • To minimize overfitting and obtain a model that will be capable of producing more precise predictions for observations beyond our sample (i.e., unobserved, future data), we can (and should) use cross-validation methods

  • The goal of cross-validation is to evaluate the predictive performance of a model by quantifying the out-of-sample prediction error

  • Cross-validation refers to a family of procedures which include “calibrating” a model by fitting it (or, “training” it) only on a portion of the data and “validating” it (or, “testing” it) on the remaining portion of the data

  • Practically speaking, the principles of cross-validation are as follows:

  1. The sample is randomly split into two smaller datasets, referred to as the training data and the test data (AKA a validation sample or hold-out sample)

Cross-Validation Methods

  2. A model is fitted to the training data and the fit of the training model is measured using multiple-\(R^2\) (i.e., the coefficient of determination)

  3. The same model, with the same parameters and predictors, is then used to generate predicted values, \(\hat{Y}\), for the test (validation) data

  4. The correlation between \(\hat{Y}\) and \(Y\) in the test data is called the cross-validation correlation (its square is denoted \(R_{CV}^2\)) and represents the extent to which the model can predict (or generalize to) out-of-sample data from the same population

    • That is, the cross-validation correlation statistic is a measure of the predictive performance of the model in question
  5. A useful statistic which is commonly used in cross-validation is the out-of-sample mean square error (MSE), calculated as

\[\text{MSE} = \frac{1}{n_{test}} \sum_{i=1}^{n_{test}} (y_i - \hat{y}_i)^2\] with lower MSE values indicating greater predictive ability

Cross-Validation Example

  • We’ll use the penguins dataset in the palmerpenguins package which includes a sample of N=344 penguins from Antarctica, describing their different species, flipper length, body mass, bill dimension, and sex
library(palmerpenguins)
peng <- na.omit(penguins)
nrow(peng)
[1] 333

STEP 1: Split the Data into Training and Test Subsets

set.seed(311)
train.data <- peng %>% slice_sample(prop = 0.5) # randomly select 50% of sample
test.data <- anti_join(peng, train.data) # select the remaining rows
nrow(train.data)
[1] 166
nrow(test.data)
[1] 167

Cross-Validation Example

STEP 2: Fit a Hypothesized Model to Training Data

  • Here, I use an MLR model, where penguins’ body mass (in grams) is regressed on both bill length and depth (in millimeters).
mod <- lm(body_mass_g~bill_length_mm+bill_depth_mm, data=train.data)
summary(mod) # NOTE THE R2 and adjusted-R2

Call:
lm(formula = body_mass_g ~ bill_length_mm + bill_depth_mm, data = train.data)

Residuals:
     Min       1Q   Median       3Q      Max 
-1406.52  -405.82     2.52   347.29  1498.38 

Coefficients:
               Estimate Std. Error t value Pr(>|t|)    
(Intercept)    2967.499    586.152   5.063 1.11e-06 ***
bill_length_mm   83.022      8.354   9.938  < 2e-16 ***
bill_depth_mm  -138.321     22.643  -6.109 7.12e-09 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 572.1 on 163 degrees of freedom
Multiple R-squared:  0.5068,    Adjusted R-squared:  0.5007 
F-statistic: 83.74 on 2 and 163 DF,  p-value: < 2.2e-16

Cross-Validation Example

STEP 3: Get Model Predictions for Test Data

head(Yhat <- predict(mod, test.data)) # calculating the predicted values on the TEST DATA
       1        2        3        4        5        6 
3823.496 3380.839 3734.930 3510.858 3945.242 3239.731 

STEPS 4 and 5: Calculate the Cross-validation Correlation and MSE

cor(Yhat, test.data$body_mass_g) # CV correlation
[1] 0.6564795
resid <- test.data$body_mass_g - Yhat # residuals
(MSE <- mean(resid^2)) # MSE
[1] 373759.3

Cross-Validation Example

  • Calculating the cross-validation correlation on the test data serves as a proxy for how well our model will predict future observations

  • The inverse of predictive power is the discrepancy between the predicted and observed values when applied to a new sample (from the same population); this discrepancy is referred to as the test error

  • Hypothetically, if we calculate the test error for each sample randomly selected from the population an infinite number of times and average them, we’ll get the EPE

  • The estimation variance introduced earlier, \(Var(\hat{f}(X))\), refers to how test error varies from one sample (\(X\)) to another

\(K\)-fold Cross-Validation

  • If cross-validation correlation and test error are our measures of predictive power (or lack thereof), how much confidence should we have in our model if it was trained on only half our sample?

  • The answer is…well, not a lot…This is where the \(k\)-fold cross-validation technique comes in handy

  • The idea behind \(k\)-fold cross-validation involves splitting the sample in random order across \(k\) unique subsets, referred to as folds, of (about) the same size

  • The first fold is “held out” to be used as the test data whereas the model is fitted to the rest of the \(k-1\) folds

  • Next, the estimated model is applied to the “hold-out” (i.e., test) fold to compute one estimate of the test error, MSE

  • This process is then repeated an additional \(k-1\) times; with each new iteration, a different fold is treated as the “hold-out,” or test, fold, resulting in \(k\) total estimates of the test error, i.e., \(\{MSE_1, MSE_2, MSE_3, \ldots, MSE_k\}\)

\(K\)-fold Cross-Validation

  • To get the \(k\)-fold cross-validation test error estimate, the \(k\) MSE statistics are averaged

\[CV_{(k)} = \frac{1}{k}\sum_{i=1}^k MSE_i\]
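
  • Before turning to caret (used in the example that follows), here is a minimal “by hand” sketch of the procedure, reusing the peng data defined earlier, just to make the mechanics concrete:

set.seed(311)
k <- 10
folds <- sample(rep(1:k, length.out = nrow(peng)))   # randomly assign each row to one of k folds
mse_k <- numeric(k)
for (j in 1:k) {
  train_j <- peng[folds != j, ]                      # fit on the other k-1 folds
  test_j <- peng[folds == j, ]                       # hold out fold j as the test data
  fit_j <- lm(body_mass_g ~ bill_length_mm + bill_depth_mm, data = train_j)
  yhat_j <- predict(fit_j, newdata = test_j)
  mse_k[j] <- mean((test_j$body_mass_g - yhat_j)^2)  # test error (MSE) for fold j
}
mean(mse_k)                                          # CV_(k): the average of the k MSEs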

\(K\)-fold Cross-Validation Example

library(caret)
# Define k-fold cross-validation settings
k <- 10  # Number of folds
train.control <- trainControl(method = "cv", number = k)
# Train the model using 10-fold cross-validation
train.model <- train(body_mass_g ~ bill_length_mm + bill_depth_mm, 
                     data = peng, 
                     method = "lm",
                     trControl = train.control)
# Print results
train.model 
Linear Regression 

333 samples
  2 predictor

No pre-processing
Resampling: Cross-Validated (10 fold) 
Summary of sample sizes: 300, 300, 300, 299, 299, 298, ... 
Resampling results:

  RMSE      Rsquared  MAE     
  590.3279  0.472769  484.3707

Tuning parameter 'intercept' was held constant at a value of TRUE
  • Note the average Rsquared and how it compares to the original (=0.51). Also note that \(RMSE=\sqrt{MSE}\)

\(K\)-fold Cross-Validation Example

# Extract CV-MSE (Mean Squared Error from cross-validation)
cv_mse <- train.model$resample$RMSE^2  # Squaring RMSE gives MSE for each fold
mean(cv_mse)  # Average MSE across folds
[1] 350412.9
  • In this example, \(k = 10\); setting \(k = N\) (the entire sample size) leads to a special case of \(k\)-fold cross-validation known as leave-one-out cross-validation (LOOCV)

  • With LOOCV, each iteration uses only \(n = 1\) for the test data

  • LOOCV tends to work better than cross-validation with fewer folds, but it is more computationally intensive

Leave-One-Out Cross-Validation (LOOCV)



LOOCV Cross-Validation Example

library(caret)
train.control.LOOCV <- trainControl(method = 'LOOCV')
train.model.LOOCV <- train(body_mass_g~bill_length_mm+bill_depth_mm, 
                           data = peng,
                           method = "lm",
                           trControl = train.control.LOOCV)
train.model.LOOCV 
Linear Regression 

333 samples
  2 predictor

No pre-processing
Resampling: Leave-One-Out Cross-Validation 
Summary of sample sizes: 332, 332, 332, 332, 332, 332, ... 
Resampling results:

  RMSE     Rsquared  MAE     
  591.936  0.458019  484.7524

Tuning parameter 'intercept' was held constant at a value of TRUE
  • Note the average Rsquared and how it compares to the original (=0.51) and to the 10-fold cv (=.47)

Cross-Validation Methods

  • Until now, we used cross-validation to evaluate the predictive performance or generalization capacity of a particular model (e.g., \(\text{body mass}_i= \beta_0+\beta_1\text{bill length}_i + \beta_2 \text{bill depth}_i + \epsilon_i\))

  • But, cross-validation is also a great tool for model comparison and selection!

  • For example, say you’re trying to decide which variables are better at predicting penguin body mass; you already tried bill length and depth, but what about sex and species? That is,

\[\text{MOD1: body mass}_i= \beta_0+\beta_1\text{bill length}_i + \beta_2 \text{bill depth}_i + \epsilon_i \\ \text{vs.} \\ \text{MOD2: body mass}_i= \beta_0+\beta_1\text{sex}_i + \beta_2 \text{species}_i + \epsilon_i\]

Cross-Validation Methods

train.control.LOOCV <- trainControl(method = 'LOOCV')
# Mod 1:
train.model1.LOOCV <- train(body_mass_g~bill_length_mm+bill_depth_mm, 
                      data = peng, method = "lm", trControl = train.control.LOOCV)
# Mod 2:
train.model2.LOOCV <- train(body_mass_g~sex+species,
                      data = peng, method = "lm", trControl = train.control.LOOCV)

library(knitr)  # for kable()
mod_compare <- data.frame(t(train.model1.LOOCV$results), t(train.model2.LOOCV$results))[-1,]
colnames(mod_compare) <- c("Mod1: bill_len+dep", "Mod2: sex+spec")
kable(mod_compare)
           Mod1: bill_len+dep   Mod2: sex+spec
RMSE               591.936030      318.6248079
Rsquared             0.458019        0.8429538
MAE                484.752447      256.4213241
  • Which model has better predictive ability?

  • Of course you can extend this to many more options at once, with different variables, different numbers of predictors, different analysis approaches (e.g., WLS, removing outliers), etc.

Cross-Validation Takeaway

  • Cross-validation techniques are simple but extremely powerful and popular tools in predictive modeling and machine learning.

  • Yes, that’s right, you just got a little taste of machine learning!

  • They give us a minimally biased way of gauging the true generalization performance of any model

  • Generally speaking, the overfitting observed when using the same data to both train and test a model will largely disappear when cross-validation is applied, and the cross-validated estimate of a model’s generalization performance will (on average) typically be very close to the true out-of-sample performance

  • Another purpose of cross-validation is model selection: you can compare and contrast the same model under each possible analysis approach and/or different models altogether and check which brings about the best predictive power!

  • Other relevant and interesting tools in predictive modeling and machine learning include regularization methods like Lasso (L1), Ridge (L2) regression, and more…