PSYC 3032 M
Explanatory vs. Predictive Modeling in Psychology
The primary aim of scientific psychology is to gain a deeper understanding of human behavior
Traditionally, this involved both explaining behavior by identifying underlying causes and predicting behavior by forecasting future actions. In practice, however, these two objectives are often treated as interchangeable when, in fact, they are not
The assumption is that models that best explain behavior should also best predict future behavior. Unfortunately, although explanation and prediction may be philosophically compatible, there are good reasons to think that they are often in statistical and pragmatic tension with one another
Statistically speaking, the model that most closely approximates the data-generating process will NOT necessarily be the most successful at predicting real-world outcomes
Similarly, models that do the best job of predicting outcomes of interest will NOT necessarily identify the underlying causes
As a result, researchers must make a conscious choice: to explain or to predict?
Explanatory Modeling in Psychology
“In the explanatory approach to science, the ultimate goal is to develop a mechanistic model of the data-generating process that gives rise to the observed data.”
— Yarkoni & Westfall, 2017
“In explanatory modeling, the focus is on minimizing bias to obtain the most accurate representation of the underlying theory”
— Shmueli, 2010
Most of the research in psychology is explanatory; it’s meant to represent a theoretical account for the variation of the outcome variable
Typically this account (either explicitly or implicitly) represents the actual causal mechanisms, or a subset of potential causal mechanisms, that produce differences in the outcome
In both the research design and analysis stages, researchers include the variables they believe explain or cause the variation of the effect or outcome of interest
Often, researchers will also include covariates (e.g., age, gender, income) deemed potentially important (either by prior research or their own theoretical interests)
Researchers include covariates in their models to reduce the risk of drawing invalid or misleading conclusions about the importance of an explanatory variable, ensuring that any observed effect is not simply due to its association with other related variables
But, what drives researchers to select these covariates that are supposed to prevent such spurious effects?
Researchers are advised to let theory guide decisions about the inclusion of important covariates. But such advice is usually vague and difficult to adhere to.
In practice, researchers often select covariates haphazardly, relying on common sense and making such decisions at their own subjective discretion
Instead, we should use causal inference principles to make more sensible decisions about which covariates to account for in our explanatory models
Causal Inference
Causal inference concerns what would happen to an outcome (Y) as a result of a treatment, intervention, or exposure (T), given pretreatment information (X) (Gelman et al., 2021)
To define causation formally, we start by recognizing that treatment effects vary from one unit of observation (e.g., one person) to another, implying that there is a separate causal effect \(\tau_i\) for each case, i, in a sample or population
Using the counterfactual framework (i.e., Rubin’s causal model; Rubin, 1974), assume that the treatment (T) indicator \(t_i=1\) when case i receives the treatment and \(t_i=0\) when case i does not receive the treatment (i.e., control condition)
Then, \(y^1_i\) is the outcome value that would occur if case i received the treatment, and \(y^0_i\) is the outcome value that would occur if case i did NOT receive the treatment; these values are referred to as potential outcomes
Thus, the individual causal effect for case i is: \(\tau_i = y^1_i - y^0_i\)
The “fundamental problem of causal inference” is that, for each person i, we can only observe either \(y^1_i\) or \(y^0_i\), but not both!
Consider these hypothetical data:
\[ \scriptsize \begin{array}{c c c c c} \hline & & {\textbf{Potential outcomes}} \\ \hline \text{Case } i & t_i & y_i^0, \text{ if } t_i = 0 & y_i^1, \text{ if } t_i = 1 & y_i \\ & \text{(treatment indicator)} & & & \text{(observed outcome)} \\ \hline 1 & 0 & 40 & ? & 40 \\ 2 & 0 & 40 & ? & 40 \\ 3 & 0 & 50 & ? & 50 \\ 4 & 1 & ? & 55 & 55 \\ 5 & 1 & ? & 55 & 55 \\ 6 & 1 & ? & 60 & 60 \\ \hline \end{array} \]
The average treatment effect (ATE) is the mean of the individual causal effects:
\[\bar{\tau} = \frac{1}{N} \sum_{i=1}^{N} \tau_i = \frac{1}{N} \sum_{i=1}^{N} (y_i^1 - y_i^0) = \frac{1}{N} \sum_{i=1}^{N} y_i^1 - \frac{1}{N} \sum_{i=1}^{N} y_i^0\]
Still, we cannot calculate the ATE directly because we observe only either \(y^1_i\) or \(y^0_i\); half of the data are missing!
But, if we assume that the potential outcomes that are observed are randomly sampled from the population, then their means, \(\bar{Y}_{t=1}\) and \(\bar{Y}_{t=0}\), will give us unbiased estimates of \(\frac{1}{N} \sum_{i=1}^{N} y_i^1\) and \(\frac{1}{N} \sum_{i=1}^{N} y_i^0\), respectively
\[\hat{ATE}=\bar{Y}_{t=1}-\bar{Y}_{t=0}\]
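For the hypothetical data above, this estimator works out to:
\[\hat{ATE}=\bar{Y}_{t=1}-\bar{Y}_{t=0}=\frac{55+55+60}{3}-\frac{40+40+50}{3}=\frac{40}{3}\approx 13.3\]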
The above estimator of the ATE is unbiased with respect to the population ATE only when cases are randomly assigned to the levels of the treatment indicator T. Otherwise, the mean difference could be determined by some pretreatment characteristic other than the treatment itself (e.g., a covariate of sorts)
Yet, random assignment only guarantees that the treated and control groups are balanced on all relevant pretreatment characteristics on average across an infinite set of hypothetical samples from the same population (i.e., they are balanced in expectation)
With any single sample, however, the treated and control groups may still differ with respect to pretreatment characteristics, especially if the sample is small, leading to sample mean differences that are potentially widely discrepant from the population ATE
In other words, random assignment may not effectively cancel out pretreatment or confounding differences between the treated and control groups in a single sample
Thus, it’s advisable to account for pretreatment group differences even with random assignment (Gelman et al., 2021)
That is, to obtain an unbiased causal effect estimate, we have to carefully consider how other variables, namely confounders, influence the association between our explanatory variable(s) and the outcome, and potentially adjust for these in our GLM
A set of hypothesized causal relations is often represented using a directed acyclic graph (DAG)
In the simple case of a randomized treatment variable, T, having a causal influence on an outcome, Y, the DAG looks like this:
DAGs are called acyclic to indicate that causes do not eventually have effects on themselves
In formal causal modeling, a confounder has a causal effect on both the focal explanatory variable and the outcome variable (remember this?)
\[\text{Exercise} \leftarrow \text{Age} \rightarrow \text{Cholesterol}\]
“In general, then, causal effects can be estimated using regression if the model includes all confounding covariates (predictors that can affect treatment assignment or the outcome) and if the model is correct” (Gelman et al., 2021, p. 385)
But, actually, it’s only necessary to include in the model the confounders that block backdoor paths; if two confounders are causally associated with each other, we may only need to control one of them to obtain an unbiased causal effect (Morgan & Winship, 2015)
That is, as long as we close all backdoor paths, we can infer causality! Yes, we can say things like exercise causes lower cholesterol. Don’t you feel powerful!?
Because DAGs provide an overview of the causal relationships among a set of theoretically relevant variables, they allow identification of a set of variables to adjust for in the analysis to remove confounding
STOP 🛑 Example Time
Let’s consider an actual dataset: A researcher is interested in estimating the causal effect of a movie’s Tomatometer score on its profit
The researcher collects data for N = 609 movies released between 2008 and 2012, including their net profit (i.e., measured as the difference between their gross income and budget in millions of US dollars)
They also record the movies’ numeric Tomatometer score, representing an aggregate of critics’ reviews of the film
Simply put, the researcher wants to devise a model for profit to explain how (and why) movies vary on this important outcome
| year | title | gross ($M) | budget ($M) | length (min) | score | profit ($M) |
|---|---|---|---|---|---|---|
| 2012 | Django Unchained | 162.80543 | 100 | 165 | 88 | 62.805434 |
| 2012 | Parental Guidance | 77.26493 | 25 | 105 | 18 | 52.264926 |
| 2012 | The Impossible | 19.01988 | 40 | 114 | 81 | -20.980118 |
| 2012 | Jack Reacher | 80.07074 | 60 | 130 | 61 | 20.070736 |
| 2012 | This is 40 | 67.54451 | 35 | 134 | 51 | 32.544505 |
| 2012 | The Guilt Trip | 37.13421 | 40 | 95 | 38 | -2.865785 |
```r
library(dagitty)
library(igraph)
library(dplyr) # needed for %>% and select()

# Define the DAG
movieDAG <- dagitty("dag{
  Budget -> Length
  Budget -> Review_Score
  Length -> Profit
  Length -> Review_Score
  Review_Score -> Profit
}")

# Extract edges
DAG_dat <- as.data.frame(dagitty::edges(movieDAG), stringsAsFactors = FALSE)
colnames(DAG_dat) <- c("from", "to")    # Rename columns
DAG_dat <- DAG_dat %>% select(from, to) # include only from and to columns

# Convert to igraph object
igobj <- graph_from_data_frame(DAG_dat, directed = TRUE)

# Check for cycles using igraph's `is_dag()`
is_dag(igobj) # TRUE if acyclic, FALSE if cycles exist
```

```
[1] TRUE
```
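The adjustment set itself can be obtained with dagitty’s `adjustmentSets()` function. The exact call isn’t reproduced in these notes, but a minimal sketch consistent with the analysis below (treating Review_Score as the exposure and Profit as the outcome) would be:

```r
# Which covariates must be adjusted for to estimate the causal effect of
# Review_Score on Profit? For this DAG, the minimal set is { Length },
# consistent with the conclusion below
adjustmentSets(movieDAG, exposure = "Review_Score", outcome = "Profit")
```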
where `exposure` refers to the focal explanatory variable of interest
Therefore, if a linear functional form is correct, we can get this causal effect estimate from an MLR model of Profit with Review Score and Length as predictors; there is no need to include budget in the regression model!
Suppose instead we were interested in the causal effect of Length on Profit
Identify the backdoor path/s. You can follow Pearl’s Backdoor Criterion:
We have only one backdoor path \(Length \leftarrow Budget \rightarrow Review \ Score \rightarrow Profit\)
Thus, we should condition on (control for) Budget to close that path
Why should we not condition on Review Score in addition/instead?
Gelman et al. (2021) offer the general advice “do not adjust for posttreatment variables”, but distinguishing posttreatment variables from other variables might be challenging in observational research
Carefully considering the (potential) causal roles of variables in a system and creating a DAG can clarify which variables should be controlled
Sometimes a focal explanatory variable, T, is hypothesized to exert an indirect effect on an outcome, Y, through a mediator, M
Following the Gelman et al. advice, a mediator is a type of posttreatment variable, and if we want to estimate the causal effect of T on Y, we should NOT control for M
In the movie DAG, the axiom that posttreatment variables should not be controlled is the reason that Review Score is not labeled as a variable to adjust for when estimating the causal effect of Length on Profit; the DAG suggests that Review Score is a mediator between Length and Profit
In psychological research, however, we often are interested in describing the mediation process itself; that is, we wish to partition the total effect of T on Y into a direct effect and an indirect (i.e., mediated) effect
Mediation analysis is popular in psychology because it addresses the question of how a treatment or intervention affects an outcome
To be clear: Mediation is a causal process!
The coefficient \({\color{purple}{\beta_{MT}}}\) corresponds to edge \({\color{purple}{a}}\); \({\color{deeppink}{\beta_{YT}}}\) corresponds to \({\color{deeppink}{c}}\); and \({\color{gold}{\beta_{YM}}}\) corresponds to \({\color{gold}{b}}\)
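In regression form, these edges correspond to two equations (a sketch using the same notation, with intercepts and residuals added for completeness):
\[M_i = \beta_{0M} + {\color{purple}{\beta_{MT}}}\,T_i + \epsilon_{Mi} \qquad\qquad Y_i = \beta_{0Y} + {\color{deeppink}{\beta_{YT}}}\,T_i + {\color{gold}{\beta_{YM}}}\,M_i + \epsilon_{Yi}\]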
The direct effect of T on Y is thus \({\color{deeppink}{\beta_{YT}}} = {\color{deeppink}{c}}\) , whereas the indirect effect is typically quantified as \({\color{purple}{\beta_{MT}}} \times {\color{gold}{\beta_{YM}}} = {\color{purple}{a}} \times {\color{gold}{b}}\)
The total causal effect is the sum of the direct and indirect effects: \({\color{deeppink}{\beta_{YT}}}+{\color{purple}{\beta_{MT}}} \times {\color{gold}{\beta_{YM}}}= {\color{deeppink}{c}}+{\color{purple}{a}} \times {\color{gold}{b}}\)
Returning to Gelman et al.’s point about not adjusting for post-treatment variables (in this case, a mediator), the total effect can also be obtained by simply regressing Y on T while ignoring M!
EXAMPLE, again…
Suppose movies with higher Tomatometer scores see an increased level of advertisement (“Critics agree! Avatar is brilliant!”), which in turn leads to greater profit
Of course, it is also the case that a bigger budget leads to more advertisement, but it doesn’t seem plausible that movie length has a direct effect on advertisement
```r
movieDAG2 <- dagitty("dag{
  Budget -> Length
  Budget -> Review_Score
  Budget -> Advertisement
  Length -> Profit
  Length -> Review_Score
  Review_Score -> Profit
  Review_Score -> Advertisement
  Advertisement -> Profit
}")

# Extract edges
DAG_dat2 <- as.data.frame(dagitty::edges(movieDAG2), stringsAsFactors = FALSE)
colnames(DAG_dat2) <- c("from", "to")     # Rename columns
DAG_dat2 <- DAG_dat2 %>% select(from, to) # include only from and to columns

# Convert to igraph object
igobj2 <- graph_from_data_frame(DAG_dat2, directed = TRUE)

# Check for cycles using igraph's `is_dag()`
is_dag(igobj2) # TRUE if acyclic, FALSE if cycles exist
```

```
[1] TRUE
```
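The adjustment set reported below can be obtained the same way (a sketch; the original call isn’t shown, but Review_Score is still the exposure and Profit the outcome):

```r
# Covariates to adjust for to get the total causal effect of Review_Score on
# Profit; Advertisement, as a mediator (posttreatment variable), is NOT adjusted
adjustmentSets(movieDAG2, exposure = "Review_Score", outcome = "Profit")
```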
```
{ Budget, Length }
```
Keep in mind that statistical methods alone cannot distinguish between mediators, forks, confounders, etc.
Instead, we have to rely on logic, common sense, and well-formed theory to determine how variables should be organized in a DAG, and then use that DAG to guide model creation
This is just the tip of the iceberg when it comes to causal inference, DAGs, and mediation analysis. Here are some good, related reads if you’re curious to learn more:
Stop! 🛑 Activity Time!
Draw a DAG to represent the following processes: Suppose we want to estimate the causal effect of professor rank on the outcome salary
We also have data for two covariates, years since PhD and years of service, which are obviously correlated (in fact, they are collinear, but one does not cause the other!)
We know that years of service leads to rank because a professor must work at a university for several years before becoming eligible for promotion
Years of service also leads to salary because professors typically get a small pay raise every year, regardless of their rank
Years since PhD may have a separate influence on rank and salary because, when they are first hired by a university, professors who already have post-PhD experience are likely to be offered a higher starting salary and may get earlier tenure/promotion than professors who have only just completed their PhD
Other than rank (exposure), which predictors must be included in the model to obtain an unbiased causal estimate on salary?
```r
salaryDAG <- dagitty("dag{
  Rank -> Salary
  ServiceYears -> Rank
  ServiceYears -> Salary
  YrsPhD -> Salary
  YrsPhD -> Rank
}")

DAG_salary <- as.data.frame(dagitty::edges(salaryDAG), stringsAsFactors = FALSE)
colnames(DAG_salary) <- c("from", "to")       # Rename columns
DAG_salary <- DAG_salary %>% select(from, to) # include only from and to columns

# Convert to igraph object
igobj_sal <- graph_from_data_frame(DAG_salary, directed = TRUE)

# Check for cycles using igraph's `is_dag()`
is_dag(igobj_sal) # TRUE if acyclic, FALSE if cycles exist
```

```
[1] TRUE
```
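As before, a sketch of the dagitty call that would produce the adjustment set shown below (assuming Rank is the exposure and Salary the outcome):

```r
# Covariates that must be adjusted for to estimate the causal effect of Rank on Salary
adjustmentSets(salaryDAG, exposure = "Rank", outcome = "Salary")
```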
```
{ ServiceYears, YrsPhD }
```
Predictive Modeling in Psychology
Unlike explanatory modeling, predictive modeling is explicitly focused on developing a model that provides accurate forecasts, or predictions, for critical outcomes that have not yet been observed
Recall that identifying the data-generating mechanisms and causal relationships will not necessarily yield the most successful prediction in the real world and vice versa
Especially when it comes to incredibly complex topics, such as human behaviour and mental health, reducing a predictive model to a mere two to seven variables (a common number of variables in psychological research) is simply naïve
Perhaps the reason for the mainstream use of the explanatory approach in psychology is that, in the not-so-distant past, before powerful computing and the ability to collect “big data,” identifying the most critical factors affecting the variable of interest was our best attempt at a reasonable forecast
But, with the recent growth in machine learning and computational muscle, better predictions carried out by heavy-duty methods are right at our fingertips
The difference between explanatory and predictive modelling manifests itself in multiple ways:
Chief among these: explanatory modeling primarily seeks to minimize bias—that is, to make our estimated model as close as possible to the true data-generating mechanism (model) that gave rise to the observed data—whereas the goal of predictive modeling is to minimize the expected prediction error (EPE). Makes sense?
As its name suggests, the EPE measures how well our estimated model predicts new, unseen data
It represents the average (i.e., expected) squared difference between the true values, \(Y\), and the model’s predictions, \(\hat{f}(X)\) over all possible datasets and inputs
Mathematically,
\[EPE= \mathbb{E}[(Y - \hat{f}(X))^2]\]
Recall that \(\mathbb{E}\) refers to “expected,” which is just another word for “mean” or “average”
\(X\) is simply the input; you can think of it as a dataset from the population of interest with one or more variables
As it turns out, the EPE is “made up” of a few components, two of which are bias and estimation variance
When we fit a model to predict an outcome, \(Y\), based on some input, \(X\), our model naturally makes errors
These errors can be broken down into bias and estimation variance (often referred to simply as variance)
Bias measures how much the model’s prediction, on average, differs systematically from the true values; bias is about the difference between the true function, \(f(X)\), and the average of the estimated function, \(\mathbb{E}[\hat{f}(X)]\), across different samples from the same population
\[\text{Bias} = \mathbb{E}[\hat{f}(X)] - f(X)\]
Estimation variance measures how much the estimated function, \(\hat{f}(X)\), varies around its average across different samples:
\[\text{Var}(\hat{f}(X)) = \mathbb{E} \left[ (\hat{f}(X) - \mathbb{E}[\hat{f}(X)])^2 \right]\]
\[EPE= \mathbb{E}[(Y - \hat{f}(X))^2] = Bias^2 + Var(\hat{f}(x))+Var(Y)\]
where \(Var(Y)\) represents the noise in the outcome variable that no model can eliminate (e.g., naturally random errors, \(\epsilon\))
Recall that predictive modeling is “hell-bent” on minimizing the EPE; to do so, we must try to minimize both bias and estimation variance
There’s only one problem, though: when bias goes down, variance goes up, and vice versa…oopsy! This is often referred to as the bias-variance trade-off
The theory behind predictive modelling rests primarily on the trade-off between bias and variance
The Bias-Variance Trade-off is a key concept in statistical modeling and machine learning that describes the balance between underfitting and overfitting when making predictions
The bias-variance trade-off refers to the notion that more complex models (as indexed by the number of parameters) tend to fit the data at hand better, but their prediction accuracy also varies more when they are applied to other samples (low bias-high variance; overfitting)
In contrast, a simpler model may fit the data less accurately, but its predictions vary to a lesser extent (higher bias-lower variance; underfitting)
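To see the trade-off in action, here is a minimal simulation sketch (not from the original notes; the true function, sample size, noise level, and polynomial degree are all arbitrary assumptions) comparing a simple and a complex model fitted repeatedly to samples from the same population:

```r
set.seed(3032)
f <- function(x) sin(2 * pi * x)   # assumed "true" data-generating function
x0 <- 0.25                         # a single test point; true value is f(x0) = 1
n_reps <- 500

preds <- sapply(seq_len(n_reps), function(r) {
  x <- runif(30)
  y <- f(x) + rnorm(30, sd = 0.4)        # one sample of n = 30 from the "population"
  fit_simple  <- lm(y ~ x)               # low-complexity model (2 parameters)
  fit_complex <- lm(y ~ poly(x, 9))      # high-complexity model (10 parameters)
  c(simple  = unname(predict(fit_simple,  newdata = data.frame(x = x0))),
    complex = unname(predict(fit_complex, newdata = data.frame(x = x0))))
})

# Bias: how far the average prediction at x0 is from the true value
rowMeans(preds) - f(x0)  # the simple model is typically far more biased
# Estimation variance: how much the prediction at x0 varies across samples
apply(preds, 1, var)     # the complex model typically has the larger variance
```

With these (assumed) settings, the simple model typically misses the true value systematically (bias) but its prediction varies little from sample to sample, whereas the flexible model is nearly unbiased but its prediction varies far more (estimation variance)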
So, if our goal is to minimize EPE, we must find a good balance between bias and variance, the optimal point, AKA the “sweet spot”; that is the goal of predictive modeling!
OK, but how?! Well, we have a bunch of options…in this Module, we’re only going to cover one, cross-validation
But, there are many others, such as the lasso, ridge regression, and other regularization methods
To minimize overfitting and obtain a model that will be capable of producing more precise predictions for observations beyond our sample (i.e., unobserved, future data), we can (and should) use cross-validation methods
The goal of cross-validation is to evaluate the predictive performance of a model by quantifying the out-of-sample prediction error
Cross-validation refers to a family of procedures which include “calibrating” a model by fitting it (or, “training” it) only on a portion of the data and “validating” it (or, “testing” it) on the remaining portion of the data
Practically speaking, the principles of cross-validation are as follows:
A model is fitted to the training data and the fit of the training model is measured using multiple-\(R^2\) (i.e., coefficient of determination)
The same model, with the same parameters and predictors, is then used to generate predicted values, \(\hat{Y}\), for the test (validation) data
The correlation between \(\hat{Y}\) and \(Y\) in the test data is called the cross-validation correlation (its square is often denoted \(R_{CV}^2\)) and represents the extent to which the model can predict (or generalize to) out-of-sample data from the same population
A useful statistic which is commonly used in cross-validation is the out-of-sample mean square error (MSE), which is calculated by
\[\text{MSE} = \frac{1}{n_{test}} \sum_{i=1}^{n_{test}} (y_i - \hat{y}_i)^2\] with lower MSE values indicating greater prediction ability
For this example, we’ll use the penguins dataset in the palmerpenguins package, which includes a sample of N = 344 penguins from Antarctica, describing their species, flipper length, body mass, bill dimensions, and sex

STEP 1: Split the Data into Training and Test Subsets
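The split itself isn’t shown in these notes; a minimal sketch, assuming complete cases are kept (N = 333) and roughly half of them are used for training (consistent with the training sample size in the output below), might look like this:

```r
library(palmerpenguins)

peng <- na.omit(penguins)  # keep complete cases only (N = 333)
set.seed(3032)             # arbitrary seed; the original split will differ slightly
train_idx  <- sample(seq_len(nrow(peng)), size = floor(nrow(peng) / 2))
train.data <- peng[train_idx, ]  # used to fit ("train") the model in STEP 2
test.data  <- peng[-train_idx, ] # held out to validate ("test") the model in STEPS 3-5
```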
STEP 2: Fit a Hypothesized Model to Training Data
```r
mod <- lm(body_mass_g ~ bill_length_mm + bill_depth_mm, data = train.data)
summary(mod) # NOTE THE R2 and adjusted-R2
```

```
Call:
lm(formula = body_mass_g ~ bill_length_mm + bill_depth_mm, data = train.data)

Residuals:
     Min       1Q   Median       3Q      Max 
-1406.52  -405.82     2.52   347.29  1498.38 

Coefficients:
                Estimate Std. Error t value Pr(>|t|)    
(Intercept)    2967.499    586.152   5.063 1.11e-06 ***
bill_length_mm   83.022      8.354   9.938  < 2e-16 ***
bill_depth_mm  -138.321     22.643  -6.109 7.12e-09 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 572.1 on 163 degrees of freedom
Multiple R-squared:  0.5068,  Adjusted R-squared:  0.5007
F-statistic: 83.74 on 2 and 163 DF,  p-value: < 2.2e-16
```
STEP 3: Get Model Predictions for Test Data
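The prediction code itself isn’t reproduced here; a minimal sketch, assuming the test.data object created in STEP 1:

```r
# Apply the training model to the held-out test data
preds <- predict(mod, newdata = test.data)
head(preds) # first six predicted body masses in grams (values shown below are from the notes' own split)
```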
```
       1        2        3        4        5        6 
3823.496 3380.839 3734.930 3510.858 3945.242 3239.731 
```
STEPS 4 and 5: Calculate the Cross-validation Correlation and MSE
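Again, the computation isn’t shown in the notes; a sketch using the (assumed) objects from the previous steps:

```r
# STEP 4: cross-validation correlation between observed and predicted test values
r_cv <- cor(test.data$body_mass_g, preds)
r_cv^2 # compare this with the training R-squared from STEP 2

# STEP 5: out-of-sample (test) mean squared error
mse_test <- mean((test.data$body_mass_g - preds)^2)
mse_test
```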
Calculating the cross-validation correlation on the test data serves as a proxy for our ability to predict future observations
The inverse of predictive power is the discrepancy between the predicted and observed values when applied to a new sample (from the same population); this discrepancy is referred to as the test error
Hypothetically, if we calculate the test error for each sample randomly selected from the population an infinite number of times and average them, we’ll get the EPE
The estimation variance introduced earlier, \(Var(\hat{f}(X))\), refers to how test error varies from one sample (\(X\)) to another
If cross-validation correlation and test error are our measures of predictive power (or lack thereof), how much confidence should we have in our model if it was trained on only half our sample?
The answer is…well, not a lot…This is where the \(k\)-fold cross-validation technique comes in handy
The idea behind \(k\)-fold cross-validation involves splitting the sample in random order across \(k\) unique subsets, referred to as folds, of (about) the same size
The first fold is “held out” to be used as the test data whereas the model is fitted to the rest of the \(k-1\) folds
Next, the estimated model is applied to the “hold-out” (i.e., test) fold to compute one estimate of the test error, MSE
This process is then repeated an additional \(k-1\) times; with each new iteration, a different fold is treated as the “hold-out,” or test, fold, resulting in \(k\) total estimates of the test error, i.e., \(\{MSE_1, MSE_2, MSE_3, \ldots, MSE_k\}\)
\[CV_{(k)} = \frac{1}{k}\sum_{i=1}^k MSE_i\]
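Before turning to caret, here is a minimal “by hand” sketch of this procedure (assumed code, not from the original notes), using the peng data from STEP 1:

```r
set.seed(3032)
k <- 10
folds <- sample(rep(1:k, length.out = nrow(peng)))  # randomly assign each case to a fold

mse_k <- sapply(1:k, function(j) {
  fit  <- lm(body_mass_g ~ bill_length_mm + bill_depth_mm, data = peng[folds != j, ])
  pred <- predict(fit, newdata = peng[folds == j, ])
  mean((peng$body_mass_g[folds == j] - pred)^2)     # test MSE for fold j
})

mean(mse_k)  # CV(k): the average of the k test-error estimates
```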
```r
library(caret)

# Define k-fold cross-validation settings
k <- 10 # Number of folds
train.control <- trainControl(method = "cv", number = k)

# Train the model using 10-fold cross-validation
train.model <- train(body_mass_g ~ bill_length_mm + bill_depth_mm,
                     data = peng,
                     method = "lm",
                     trControl = train.control)

# Print results
train.model
```

```
Linear Regression 

333 samples
  2 predictor

No pre-processing
Resampling: Cross-Validated (10 fold) 
Summary of sample sizes: 300, 300, 300, 299, 299, 298, ... 
Resampling results:

  RMSE      Rsquared  MAE     
  590.3279  0.472769  484.3707

Tuning parameter 'intercept' was held constant at a value of TRUE
```

```r
# Extract CV-MSE (Mean Squared Error from cross-validation)
cv_mse <- train.model$resample$RMSE^2 # Squaring RMSE gives MSE for each fold
mean(cv_mse) # Average MSE across folds
```

```
[1] 350412.9
```
In this example, \(k = 10\). Setting \(k = N\) (the entire sample size) leads to a special case of \(k\)-fold cross-validation known as leave-one-out cross-validation (LOOCV)
With LOOCV, each iteration uses only \(n = 1\) for the test data
LOOCV tends to yield a less biased estimate of the test error than cross-validation with fewer folds, but it is more computationally intensive
```r
library(caret)

train.control.LOOCV <- trainControl(method = "LOOCV")
train.model.LOOCV <- train(body_mass_g ~ bill_length_mm + bill_depth_mm,
                           data = peng,
                           method = "lm",
                           trControl = train.control.LOOCV)
train.model.LOOCV
```

```
Linear Regression 

333 samples
  2 predictor

No pre-processing
Resampling: Leave-One-Out Cross-Validation 
Summary of sample sizes: 332, 332, 332, 332, 332, 332, ... 
Resampling results:

  RMSE     Rsquared  MAE     
  591.936  0.458019  484.7524

Tuning parameter 'intercept' was held constant at a value of TRUE
```
Until now, we used cross-validation to evaluate the predictive performance or generalization capacity of a particular model (e.g., \(\text{body mass}_i= \beta_0+\beta_1\text{bill length}_i + \beta_2 \text{bill depth}_i + \epsilon_i\))
But, cross-validation is also a great tool for model comparison and selection!
For example, say you’re trying to decide which variables are better at predicting penguin body mass; you’ve already tried bill length and depth, but what about sex and species? i.e.,
\[\text{MOD1: body mass}_i= \beta_0+\beta_1\text{bill length}_i + \beta_2 \text{bill depth}_i + \epsilon_i \\ \text{vs.} \\ \text{MOD2: body mass}_i= \beta_0+\beta_1\text{sex}_i + \beta_2 \text{species}_i + \epsilon_i\]
```r
train.control.LOOCV <- trainControl(method = "LOOCV")

# Mod 1:
train.model1.LOOCV <- train(body_mass_g ~ bill_length_mm + bill_depth_mm,
                            data = peng, method = "lm", trControl = train.control.LOOCV)

# Mod 2:
train.model2.LOOCV <- train(body_mass_g ~ sex + species,
                            data = peng, method = "lm", trControl = train.control.LOOCV)

# Combine the two sets of LOOCV results into a comparison table
mod_compare <- data.frame(t(train.model1.LOOCV$results), t(train.model2.LOOCV$results))[-1, ]
colnames(mod_compare) <- c("Mod1: bill_len+dep", "Mod2: sex+spec")

library(knitr) # for kable()
kable(mod_compare)
```

|          | Mod1: bill_len+dep | Mod2: sex+spec |
|----------|--------------------|----------------|
| RMSE     | 591.936030         | 318.6248079    |
| Rsquared | 0.458019           | 0.8429538      |
| MAE      | 484.752447         | 256.4213241    |
Which model has better predictive ability?
Of course, you can extend this to many more options at once, with different variables, different numbers of predictors, different analysis approaches (e.g., WLS, removing outliers), etc.
Cross-validation techniques are simple but extremely powerful and popular tools in predictive modeling and machine learning.
Yes, that’s right, you just got a little taste of machine learning!
Cross-validation gives us a minimally biased way of gauging the true generalization performance of any model
Generally speaking, the overfitting observed when using the same data to both train and test a model will largely disappear when cross-validation is applied, and the cross-validated estimate of a model’s generalization performance will (on average) typically be very close to the true out-of-sample performance
Another purpose of cross-validation is model selection: you can compare and contrast the same model under each possible analysis approach and/or different models all together and check which brings about the best predictive power!
Other relevant and interesting tools in predictive modeling and machine learning include regularization methods like Lasso (L1), Ridge (L2) regression, and more…