Module 3: Multiple Linear Regression

PSYC 3032 M

Udi Alter

Things Udi Should Remember

Before we start

  • Any questions so far (about anything)?
  • How was Lab 1?
  • Lab 1 solution will be available today, after class
  • Lab 2 due next week
  • Groups, baby!
  • Start thinking about:
    • Topic for group presentation
    • Topics for last class

About Module 3

Goals for Today:

  • Multiple linear regression (MLR; and some very cool visualizations!)
  • Two levels of a model
    • Micro/Predictor Level info (e.g., regression slopes interpretation, semipartial-\(r^2\))
    • Macro/Model Level info (e.g., \(R^2\), \(R_{adjusted}^2\))
  • Simultaneous regression
  • Hierarchical regression
  • Nested model comparison
  • Model selection & information criteria
  • Binary predictors in regression

Now, we’re really cooking!

iClicker



What is the strongest predictor in determining how much you tip at a restaurant?



  1. How good the food is
  2. How nice the waiting staff are
  3. How quickly the food came out
  4. How clean/sanitary the place is
  5. Whether the kitchen/waiting staff followed special instructions
  6. The value of the meal/amount owed



There are NO right or wrong answers

iClicker



What is the strongest predictor in deciding where to eat?



  1. Online reviews (e.g., Yelp, Google, UberEats)
  2. Recommendations from friends/family
  3. Social media posts/pictures
  4. Menu options and pricing
  5. Location and convenience
  6. The restaurant’s atmosphere (e.g., decor, vibe)



There are NO right or wrong answers

iClicker



What’s your favourite dining style?



  1. Ordering takeout or delivery and eating at home
  2. Going out to a fancy restaurant for a special occasion
  3. Dining at a casual, local spot with a cozy vibe
  4. Grabbing food from a food truck or street vendor
  5. Cooking and eating at home with family or friends
  6. Something else



There are NO right or wrong answers




Multiple Linear Regression (MLR)

Intro to Multiple Linear Regression (MLR)

  • Although simple linear regression (SLR) provided a pretty solid foundation for modelling, realistically, researchers are usually interested in more complicated relationships, ones that involve more than two variables simultaneously

    • i.e., It is uncommon that researchers will theorize that a single variable (predictor/explanatory variable) is able to help explain or account for a large degree of variability in another variable (outcome)
  • Thus, the statistical model for that outcome variable should be flexible enough to incorporate multiple predictors—enter multiple linear regression (MLR)

  • MLR is the natural extension to SLR, whereby a researcher models the impact of multiple predictor variables on a single outcome variable

  • As it turns out, MLR serves as the building block for the majority of statistical techniques commonly used to analyze psychological data (e.g., SEM, MLM)

Defining Multiple Linear Regression (MLR)

  • MLR is useful when we have two or more variables that jointly (i.e., together according to a linear model) can be used to predict/explain an outcome variable

  • Combining multiple sources of information will typically improve the accuracy of our predictions (even if that improvement is extremely small). This is a good thing!

  • The idea in MLR, however, is that although each of the predictor variables may independently relate to the outcome variable, they may contain redundant information (i.e., their variability overlaps with one another AND with the outcome)

    • E.g., predicting weight by height AND pants size. Obviously, both can predict weight independently, but height and pants size are clearly related as well, and have redundant information

Defining Multiple Linear Regression (MLR)

  • So, we can extend our SLR from one predictor…

\[y_i = \beta_0 + \beta_1 x1_{i} + \epsilon_i\]

  • to two predictors…

\[y_i = \beta_0 + \beta_1 x1_{i} + \beta_2 x2_{i} + \epsilon_i\]

  • to three predictors…

\[y_i = \beta_0 + \beta_1 x1_{i} + \beta_2 x2_{i} + \beta_3 x3_{i} + \epsilon_i\]

  • and so on, to include as many predictors as we’d like (theoretically, until \(\infty\))

\[y_i = \beta_0 + \beta_1 x1_{i} + \beta_2 x2_{i} + \dots + \beta_P xP_{i} + \epsilon_i \\ =\beta_0 + \sum_{p=1}^{P} \beta_p X_{p_i} + \epsilon_i\]
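In R, extending the model is just a matter of adding terms to the `lm()` formula. Here is a minimal sketch with simulated data (hypothetical values, not one of the course datasets) showing that `lm()` recovers the known coefficients of a three-predictor model:

```r
# Simulated data (hypothetical): y depends linearly on three predictors
set.seed(3032)
n  <- 500
x1 <- rnorm(n); x2 <- rnorm(n); x3 <- rnorm(n)
y  <- 2 + 0.5*x1 - 1.2*x2 + 0.8*x3 + rnorm(n)  # true betas: 2, 0.5, -1.2, 0.8

# Fit the MLR; each "+" adds another partial slope to the model
fit <- lm(y ~ x1 + x2 + x3)
round(coef(fit), 2)  # estimates land close to the true values
```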

Defining Multiple Linear Regression (MLR)

  • But, let’s start simple; let’s examine the “smallest” MLR model, which has one outcome variable and two predictors:

\[y_i = \beta_0 + \beta_1 x1_{i} + \beta_2 x2_{i} + \epsilon_i\]

  • You can also write it as:

\[\hat{y}_i = E(y_i|x1_i, x2_i)= \beta_0 + \beta_1 x1_{i} + \beta_2 x2_{i}\]

  • The parameters of interest are the same as in SLR (with an added slope):
    • \(\beta_0\) = intercept
    • \(\beta_1\) = partial regression coefficient (slope) for variable x1
    • \(\beta_2\) = partial regression coefficient (slope) for variable x2

Visualizing MLR (as a diagram)

  • We’re moving from this:
  • to this…
  • Direct arrows (\(\rightarrow\)) indicate the expected change information (i.e., \(\beta\) weights), while bi-directional dashed arrows (<- - - - >) reflect the correlation between the IVs; that is, the predictor variables covary with one another

Visualizing MLR (as a scatterplot)

  • While the relationship between x and y in a simple regression model can be visually displayed as a regression line, with two predictor variables, the line becomes a regression plane (like a sheet of paper). Why?

  • Because the relationship between three continuous variables cannot be adequately presented in a two-dimensional space.

  • The true data space would need to be three-dimensional to arrive at a unique solution for the predicted value of y.

What about 3 predictors?!

Visualizing MLR (as a scatterplot)

  • We’re moving from this:

\[ \hat{y}_i = {\color{red} {\beta_0}} + {\color{purple} {\beta_1}} X1_i \]

\[\hat{y}_i = {\color{red} {-0.09}} + {\color{purple} {0.28}} X1_i \]

Visualizing MLR (as a scatterplot)

  • To this:

\[ \hat{y}_i = {\color{red} {\beta_0}} + {\color{purple} {\beta_1}} X1_i + {\color{orange} {\beta_2}} X2_i \]

\[\hat{y}_i = {\color{red} {2.72}} + {\color{purple} {0.24}} X1_i - {\color{orange}{0.51}} X2_i\]

Code to create a 3D plot (totally optional)

library(haven)
library(plotly)
agrsn <- read_sav("aggression.sav")
# Fit a multiple linear regression model
model <- lm(BPAQ ~ AISS + BIS, data = agrsn)
# Generate data for the regression plane
grid <- expand.grid(
  AISS = seq(min(agrsn$AISS), max(agrsn$AISS), length.out = 50),
  BIS = seq(min(agrsn$BIS), max(agrsn$BIS), length.out = 50))
grid$BPAQ <- predict(model, newdata = grid)  # Predicted Y values for the plane
# Create the 3D plot
plot_ly() %>%
  # Add data points
  add_markers(
    x = agrsn$AISS, y = agrsn$BIS, z = agrsn$BPAQ,
    marker = list(color = 'deepskyblue3', size = 5),
    name = "Data Points",
    hovertemplate = "AISS: %{x}<br>BIS: %{y}<br>BPAQ: %{z}<extra></extra>") %>% # Custom hover text 
  add_surface(  # Add regression plane
    x = matrix(grid$AISS, nrow = 50), 
    y = matrix(grid$BIS, nrow = 50), 
    z = matrix(grid$BPAQ, nrow = 50), 
    opacity = 0.6,
    showscale = FALSE,
    hovertemplate = "AISS: %{x}<br>BIS: %{y}<br>BPAQ: %{z}<extra></extra>", # Custom hover text for the plane
    name = "Regression Plane") %>%
  layout(  # Layout settings
    scene = list(
      xaxis = list(title = "AISS"),
      yaxis = list(title = "BIS"),
      zaxis = list(title = "BPAQ")))
  • You can create a similar plot more easily using the car and rgl packages; click here to see how.

Visualizing MLR (as a scatterplot)

  • Here is another way to graphically demonstrate how regression slopes look (and change!!!) if we compare the “raw” relationship between x and y (i.e., no other variables), to the partial relationship between x and y (i.e., adding an additional variable)

What’s wrong with the plot below?

\(Cholesterol_i = \beta_0 + \beta_1 Exercise_i + \epsilon_i\)

Visualizing MLR (as a scatterplot)

  • Here is another way to graphically demonstrate how regression slopes look (and change!!!) if we compare the “raw” relationship between x and y (i.e., no other variables), to the partial relationship between x and y (i.e., adding an additional variable)

BEFORE: \(Cholesterol_i = \beta_0 + \beta_1 Exercise_i + \epsilon_i\)

AFTER: \(Cholesterol_i = \beta_0 + \beta_1 Exercise_i + \beta_2 Age_i+ \epsilon_i\)

Quick Detour: Simpson’s Paradox

  • Simpson’s paradox, AKA the reversal paradox or Yule-Simpson effect, occurs when a trend that appears in different groups of data reverses or disappears when the groups are combined

  • In other words, the aggregated data may lead to a conclusion that contradicts the conclusions drawn from analyzing the data within each subgroup

  • The paradox often arises due to the presence of a confounding variable that affects both the grouping and the outcome
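The paradox is easy to manufacture in simulation. In this hypothetical sketch (made-up numbers, echoing the cholesterol example on the next slide), exercise lowers cholesterol within every age group, but because older people both exercise more and have higher cholesterol, the aggregate slope comes out positive:

```r
# Simulated Simpson's paradox (hypothetical data): age confounds the
# exercise -> cholesterol relationship
set.seed(1)
n   <- 300
age <- rep(c(30, 50, 70), each = n/3)
exercise    <- rnorm(n, mean = age/10)               # older people exercise more
cholesterol <- 150 + 1.5*age - 5*exercise + rnorm(n, sd = 5)

coef(lm(cholesterol ~ exercise))["exercise"]         # aggregate slope: positive
coef(lm(cholesterol ~ exercise + age))["exercise"]   # conditioning on age: negative
```

The sign of the exercise slope flips once age is added to the model, which is exactly the "controlling for age" move discussed next.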

Quick Detour: Confounding Effects

  • This example illustrates why conditioning on another variable is often extremely important for understanding the causal structure (in this case, the effect of exercise is confounded by age). To “eliminate” the confounding effect, we conditioned on, or “controlled for,” age!

Quick Detour: Confounding Effects

  • Causal relationships can dictate whether a SLR or MLR is more appropriate (e.g., to understand the effect of age on exercise, you would NOT condition on cholesterol!)

  • Relationships like these are everywhere, which is why you often hear “correlation does not imply causation”

  • We can, however, get around confounding effects through randomized experimentation (e.g., randomly assigning individuals to an exercise level would ensure that age is evenly distributed, “blocking” the unwanted causal effect on exercise/cholesterol)

The Two Levels of a Model

  • When models have more than one predictor, there are multiple levels of information that a researcher would be interested in examining (some of which you already saw last week)

  • Let’s separate the different levels of information (and tests) into categories of inference that we’ll call Micro/Predictor Level and Macro/Model Level inferences




Micro/Predictor Level

Micro/Predictor Level: MLR Parameters

\[\hat{y}_i = E(y_i|x1_i, x2_i)= \beta_0 + \beta_1 x1_{i} + \beta_2 x2_{i}\]

  • The parameters of interest are the same as in SLR, but now the coefficients’ interpretation is slightly different once we incorporate more predictors into the model:

    • \(\beta_0\) = the predicted value of y when both x1 and x2 scores are 0
    • \(\beta_1\) = the predicted change in y for a 1-unit change in x1, holding x2 constant
    • \(\beta_2\) = the predicted change in y for a 1-unit change in x2, holding x1 constant

Micro/Predictor Level: Partial Slopes

  • These partial slopes’ interpretations sound pretty “scripted”; can we make things more practical? Yes, we can!

Example: Predicting Resting HR from Physical Activity and Sleep

  • With multiple variables now, we should be checking the relationships among all variables. This is where a scatterplot matrix becomes useful to inspect (at the very least) all bivariate relationships
summary(health_data)
 PhysAct_hrs_wk  Sleep_hrs_day    Resting_HR_bpm 
 Min.   : 0.00   Min.   : 3.000   Min.   :47.00  
 1st Qu.: 6.85   1st Qu.: 8.200   1st Qu.:54.00  
 Median :11.40   Median : 9.750   Median :59.00  
 Mean   :11.16   Mean   : 9.662   Mean   :60.29  
 3rd Qu.:15.93   3rd Qu.:11.425   3rd Qu.:67.00  
 Max.   :20.90   Max.   :13.000   Max.   :78.00  

Partial Slopes’ Interpretation: Example

library(GGally)
ggpairs(health_data)

Partial Slopes’ Interpretation: Example

Partial Slopes’ Interpretation: Example

  • Let’s fit the model:

\[\hat{RHR}_i = E(RHR| sleep, \ activity)=\beta_0 + \beta_1 sleep_i + \beta_2 activity_i\]

mlr.mod <- lm(Resting_HR_bpm ~ Sleep_hrs_day + PhysAct_hrs_wk, data = health_data)
mlr.mod$coefficients
   (Intercept)  Sleep_hrs_day PhysAct_hrs_wk 
    78.7227720     -0.7409841     -1.0098311 
  • (Intercept) = \(\hat{\beta}_0=78.72\): For a hypothetical person who sleeps 0 hours a night AND does 0 hours of physical activity a week, the expected resting HR is 78.72 bpm

  • Sleep_hrs_day = \(\hat{\beta}_1=-0.74\): For every additional hour of sleep per day, resting HR is expected to decrease by 0.74 bpm, holding physical activity constant

  • PhysAct_hrs_wk = \(\hat{\beta}_2=-1.01\): For every additional hour of physical activity per week, resting HR is expected to decrease by 1.01 bpm, holding sleep constant

Micro/Predictor Level: Partial Slopes

What does “holding X constant” even mean!?

  • The “holding all else constant” part of the partial slope interpretation is a little ambiguous, so we’ll take a closer look with something more hands-on!

  • To illustrate what these partial slopes/coefficients mean, we’ll take a hypothetical person, Terry Jeffords, and say they sleep 7 hours per night and do 15 hours of physical activity per week; let’s predict their resting HR:

predict(mlr.mod, data.frame(Sleep_hrs_day=7, PhysAct_hrs_wk=15))
       1 
58.38842 
  • Above is the expected resting HR for Terry (or anyone with 7 hours of sleep AND 15 hours of physical activity per week)

Micro/Predictor Level: Partial Slopes

  • Now, let’s take another person, Rosa Diaz, who, exactly like Terry, does 15 hours of physical activity per week (i.e., holding x2, physical activity, constant), but sleeps one hour more than Terry, i.e., a 1-unit change in x1, sleep
predict(mlr.mod, data.frame(Sleep_hrs_day=7+1, PhysAct_hrs_wk=15))
       1 
57.64743 
  • If we look at the difference in expected resting HR between Rosa and Terry, we will notice a familiar value:
predict(mlr.mod, data.frame(Sleep_hrs_day=7+1, PhysAct_hrs_wk=15))-predict(mlr.mod, data.frame(Sleep_hrs_day=7, PhysAct_hrs_wk=15))
         1 
-0.7409841 
  • It’s \(\hat{\beta}_1\)!!!

Micro/Predictor Level: Partial Slopes

  • We can do the same thing for \(\beta_2\): If we take yet another person, Doug Judy, who, exactly like Terry, sleeps 7 hours per night (i.e., holding x1, sleep, constant), but does an additional hour of physical activity per week, i.e., a 1-unit change in x2, physical activity
predict(mlr.mod, data.frame(Sleep_hrs_day=7, PhysAct_hrs_wk=15+1))
       1 
57.37859 
  • If we look at the difference in expected resting HR between Doug and Terry, we will notice another familiar value:
predict(mlr.mod, data.frame(Sleep_hrs_day=7, PhysAct_hrs_wk=15+1))-predict(mlr.mod, data.frame(Sleep_hrs_day=7, PhysAct_hrs_wk=15))
        1 
-1.009831 
  • It’s \(\hat{\beta}_2\)!!!

Micro/Predictor Level: Partial Slopes

  • So, we can also interpret the partial slopes as follows:

\(\hat{\beta}_1=-0.74\): If two random individuals with the same level of physical activity (i.e., holding physical activity constant) have a one-hour difference in sleep per night, we would expect the person who sleeps longer to have a lower resting heart rate by 0.74 bpm


\(\hat{\beta}_2=-1.01\): If two random individuals who are otherwise the same on all the variables have a one-hour difference in physical activity per week, we would expect the person who exercises one more hour a week to have a lower resting HR by 1.01 bpm

Micro/Predictor Level: Visualizing Partial Slopes

library(car)
avPlots(mlr.mod) 

Micro/Predictor Level: Semipartial-\(r^2\)

  • Another way to assess the contribution of predictors is to report their semipartial-\(r^2\) (\(spr^2\)) with the outcome (remember from the correlation Module?)

  • The \(spr^2\) for a predictor gives the proportion of variance in the outcome that is uniquely shared with that particular predictor

  • \(spr^2\) is another type of (standardized) effect size that allows a researcher to compare the relative importance/contribution of predictor variables to explaining/predicting the outcome

Micro/Predictor Level: Semipartial-\(r^2\)

Example in R:

library(DescTools)
EtaSq(mlr.mod) # eta.sq = semipartial-r^2; eta.sq.part = partial-r^2
                   eta.sq eta.sq.part
Sleep_hrs_day  0.02471666  0.07026386
PhysAct_hrs_wk 0.31121489  0.48759331
  • In relation to \(\hat{\beta}_1\), the number of daily sleep hours uniquely accounts for—or, explains—about 2.5% of the variability in resting HR

  • In relation to \(\hat{\beta}_2\), the number of hours a week an individual engages in physical activity uniquely accounts for—or, explains—about 31% of the variability in resting HR

Model Estimation

  • Just like in SLR, estimates for the model parameters are obtained through OLS (remember the 3D plot that looks like a folded sheet from last week?)

  • The formula to calculate each of the slopes is not too difficult with only two predictors, but it becomes increasingly complicated the more predictor variables are added to the model.

\[ \hat{\beta}_1 = \frac{r_{y x_1} - r_{y x_2} r_{x_1 x_2}}{1 - r_{x_1 x_2}^2} \left( \frac{SD_y}{SD_{x_1}} \right) \quad \quad \hat{\beta}_2 = \frac{r_{y x_2} - r_{y x_1} r_{x_1 x_2}}{1 - r_{x_1 x_2}^2} \left( \frac{SD_y}{SD_{x_2}} \right) \]

  • \(r_{y x_1}\), \(r_{y x_2}\), \(r_{x1 x_2}\) are the correlations between y & x1, y & x2, and x1 & x2, respectively
  • \(SD_y\), \(SD_{x1}\), \(SD_{x2}\) are the standard deviation for y, x1, and x2, respectively
  • Lastly, the parameter estimate for the intercept is calculated as:

\[\hat{\beta}_0=\bar{y}-\hat{\beta}_1\bar{x}_1-\hat{\beta}_2\bar{x}_2\]
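As a check, the closed-form formulas above can be computed by hand in R and compared against `lm()`. This sketch uses simulated data (hypothetical; the course .sav files are not loaded here):

```r
# Verify the two-predictor OLS formulas against lm() on simulated data
set.seed(42)
n  <- 200
x1 <- rnorm(n)
x2 <- 0.5*x1 + rnorm(n)            # predictors deliberately correlated
y  <- 1 + 2*x1 - 1*x2 + rnorm(n)

# Ingredients: three bivariate correlations and three SDs
r_y1 <- cor(y, x1); r_y2 <- cor(y, x2); r_12 <- cor(x1, x2)

b1 <- (r_y1 - r_y2*r_12) / (1 - r_12^2) * (sd(y)/sd(x1))
b2 <- (r_y2 - r_y1*r_12) / (1 - r_12^2) * (sd(y)/sd(x2))
b0 <- mean(y) - b1*mean(x1) - b2*mean(x2)

fit <- lm(y ~ x1 + x2)
round(c(b0, b1, b2) - coef(fit), 10)  # differences are essentially zero
```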

Micro/Predictor Level: Hypothesis Testing

  • Same as in SLR: evaluating the slope parameters. Only now, we have more than one (note that each slope has its own estimate and SE, but the same df, \(\alpha\), and critical value):

\[H_0:\beta_1=0 \quad \quad \quad H_0:\beta_2=0 \\ H_1:\beta_1\neq0 \quad \quad \quad H_1:\beta_2\neq0\]


\[t_1(df)=\frac{\hat{\beta}_1}{SE_{\hat{\beta_1}}}\quad \quad \quad t_2(df)=\frac{\hat{\beta}_2}{SE_{\hat{\beta_2}}}\]

\[CI_{(1-\alpha)100\%}=\hat{\beta}_1 \pm t_{(1 − \alpha /2, \ df)} \times SE_{\hat{\beta_1}} \quad \quad \quad CI_{(1-\alpha)100\%}=\hat{\beta}_2 \pm t_{(1 − \alpha /2, \ df)} \times SE_{\hat{\beta_2}}\]
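These t statistics and CIs can be rebuilt by hand from the `summary()` output and compared to R's built-ins. A sketch with simulated data (hypothetical, standing in for the health data):

```r
# Recompute a slope's t statistic and 95% CI manually and compare
# to summary() and confint()
set.seed(7)
n  <- 150
x1 <- rnorm(n); x2 <- rnorm(n)
y  <- 1 + 0.6*x1 - 0.4*x2 + rnorm(n)
fit <- lm(y ~ x1 + x2)

est <- coef(summary(fit))   # columns: Estimate, Std. Error, t value, Pr(>|t|)
df  <- fit$df.residual      # N - P - 1

t1  <- est["x1", "Estimate"] / est["x1", "Std. Error"]
ci1 <- est["x1", "Estimate"] + c(-1, 1) * qt(0.975, df) * est["x1", "Std. Error"]

all.equal(t1, est["x1", "t value"])            # TRUE
all.equal(ci1, unname(confint(fit)["x1", ]))   # TRUE
```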




Macro/Model Level

From \(r^2\) to \(R^2\) and beyond

  • In MLR, the macro level inferences pertain to the proportion of variance in the outcome explained by the linear combination of predictors, otherwise known as \(R^2\), the squared multiple correlation or the coefficient of multiple determination

  • \(R^2\), the squared multiple correlation is analogous to \(r^2\), the coefficient of determination, but instead of just two variables, it is the proportion of the variability in the outcome shared with multiple variables simultaneously

  • Remember this from the first week?

\[Variable \ of \ interest = Model + Error\]

  • We can think of \(R^2\) as the “Model” piece of this equation

  • The variability in the outcome is completely explained by the model variability (i.e., linear combination of predictors) plus the variability that is not explained by the model (error or residuals)

From \(r^2\) to \(R^2\) and beyond

In R:

summary(mlr.mod)$r.squared
[1] 0.6729475

i.e., the proportion of variance in the resting HR explained by the linear combination of sleep (h/d) and physical activity (h/w) is estimated at 67.29%; or, sleep (h/d) and physical activity (h/w), together as a set, account for about 67.29% of the variability in resting HR!

From \(r^2\) to \(R^2\) and beyond

  • Remember the partition of sums of squares (SS) in ANOVA into \(SS_{between}\) and \(SS_{within}\)?

ANOVA: \(SS_{between}+SS_{within}=SS_{total}\)

  • Similarly, in MLR, we have \(SS_{model}\) and \(SS_{error}\)

Regression: \(SS_{model}+SS_{error}=SS_{total}\)

  • Where \(SS_{model}\) (Model Sum of Squares, also called Regression Sum of Squares) represents the variability explained by the regression model (i.e., how well the predictor variables explain the dependent variable); \(SS_{model}=\sum (\hat{y}_i-\bar{y}_i)^2\)

  • And, \(SS_{error}\) (Error Sum of Squares, also called Residual Sum of Squares) represents the unexplained variability or error in the model (i.e., the deviation of observed values from the predicted values); \(SS_{error}=\sum (y_i-\hat{y}_i)^2\)

  • Finally, \(SS_{total}\) (Total Sum of Squares) represents the total variability in the dependent variable (the outcome variable) around its mean; \(SS_{total}=\sum (y_i-\bar{y}_i)^2\)
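The SS identities above are easy to verify numerically. A sketch with simulated data (hypothetical; a stand-in for health_data):

```r
# Verify SS_model + SS_error = SS_total and R^2 = SS_model / SS_total
set.seed(11)
n  <- 100
x1 <- rnorm(n); x2 <- rnorm(n)
y  <- 3 + x1 + 0.5*x2 + rnorm(n)

fit  <- lm(y ~ x1 + x2)
yhat <- fitted(fit)

ss_model <- sum((yhat - mean(y))^2)   # variability explained by the model
ss_error <- sum((y - yhat)^2)         # residual variability
ss_total <- sum((y - mean(y))^2)      # total variability around the mean

all.equal(ss_model + ss_error, ss_total)                # TRUE
all.equal(ss_model / ss_total, summary(fit)$r.squared)  # TRUE
```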

From \(r^2\) to \(R^2\) and beyond

  • Therefore:

\[R^2=\frac{SS_{model}}{SS_{total}} = 1- \frac{SS_{error}}{SS_{total}}\]

  • We can still interpret \(R^2\) as a type of squared bivariate correlation; however, we have to rethink what exactly is being correlated

  • If we conceptualize the bivariate correlation as existing between the observed variable, y, and its predictions, \(\hat{y}\), then we can see another nice definition of what \(R^2\) means:

\[R^2=COR(y, \hat{y})^2\]

cor(health_data$Resting_HR_bpm, predict(mlr.mod))^2 # predict() yields the predicted values from the model
[1] 0.6729475

See?

Hypothesis Testing for \(R^2\)

  • Much like other parameters, we can test whether \(R^2\) is statistically different from 0 using the F statistic

\[H_0: R^2 = 0 \quad \quad H_1: R^2 \neq 0\]

\[ \begin{array}{|l|l|l|l|l|} \hline \text{Source of Variation} & SS & df & MS & F \\ \hline \text{Regression} & SSR = \sum (\hat{Y}_i - \bar{Y})^2 & P \ (no. \ predictors) & MSR = \frac{SSR}{df_R} & \frac{MSR}{MSE} \\ \text{Error} & SSE = \sum (Y_i - \hat{Y}_i)^2 & N - P - 1 & MSE = \frac{SSE}{df_E} & \\ \text{Total} & SST = \sum (Y_i - \bar{Y})^2 & N - 1 & & \\ \hline \end{array} \]

  • As you know, the F distribution (Fisher) is a generalization of the t distribution (Student), but for jointly evaluating more than one parameter

  • Hence, its usefulness arises in modeling conditions with smaller sample sizes, and when we have multiple slopes to worry about

  • The F distribution has two \(df\) terms which control its shape and how thick the tail is

Hypothesis Testing for \(R^2\)

In R (see Multiple R-squared)

summary(mlr.mod)

Call:
lm(formula = Resting_HR_bpm ~ Sleep_hrs_day + PhysAct_hrs_wk, 
    data = health_data)

Residuals:
     Min       1Q   Median       3Q      Max 
-13.5182  -3.0236  -0.1513   3.4134  18.2300 

Coefficients:
               Estimate Std. Error t value Pr(>|t|)    
(Intercept)    78.72277    1.56088  50.435  < 2e-16 ***
Sleep_hrs_day  -0.74098    0.19204  -3.859 0.000155 ***
PhysAct_hrs_wk -1.00983    0.07376 -13.692  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 4.824 on 197 degrees of freedom
Multiple R-squared:  0.6729,    Adjusted R-squared:  0.6696 
F-statistic: 202.7 on 2 and 197 DF,  p-value: < 2.2e-16
  • Note: When writing up your results, don’t say “our model was statistically significant,” instead please use “our model accounted for a statistically significant proportion of variance in [OUTCOME VARIABLE NAME]” (followed by reporting the test statistics, F(df1, df2), p, etc.)

Adjusted \(R^2\)

  • There is, unfortunately, one notably problematic property of \(R^2\) . . .

  • The \(R^2\) estimate tends to be upwardly biased (i.e., higher than it should be)

    • As more predictors are added to the model, \(R^2\) will only increase regardless of whether these are good predictors of y

    • This is because of our form \(R^2= 1- \frac{SS_{error}}{SS_{total}}\)

    • The denominator, \(SS_{total}\), is constant across models, but \(SS_{error}\) will either stay the same (extremely unlikely) or decrease whenever we add new variables

  • The implication is that larger, more complicated models will always “appear” to explain more variance than simpler models (which is true, but isn’t necessarily optimal because it capitalizes on chance relationships that are not real in the population)
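You can watch this happen by adding a pure-noise predictor to a model, as in this simulated sketch (hypothetical data):

```r
# Adding a noise predictor can never lower R^2
set.seed(99)
n <- 80
x <- rnorm(n)
y <- 1 + 0.5*x + rnorm(n)
junk <- rnorm(n)            # unrelated to y by construction

m1 <- lm(y ~ x)
m2 <- lm(y ~ x + junk)

summary(m2)$r.squared >= summary(m1)$r.squared   # always TRUE
# adjusted R^2, by contrast, will usually drop for a noise predictor
c(summary(m1)$adj.r.squared, summary(m2)$adj.r.squared)
```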

Adjusted \(R^2\)

  • To counteract this upward bias property we can use the adjusted \(R^2\) statistic, which adjusts the original \(R^2\) by the associated degrees of freedom:

\[R_{\text{adj}}^2 = 1 - \frac{\hat{\sigma}_e^2}{\hat{\sigma}_t^2} = 1 - \frac{SS_e / (N - P - 1)}{SS_t / (N - 1)} = 1 - \frac{(1 - R^2)(N - 1)}{N - P - 1}\]

  • \(R^2_{adj}\) is always smaller than (or, in rare cases, equal to) \(R^2\), \(R^2_{adj} \le R^2\)
    • Some authors recommend reporting \(R^2_{adj}\) instead of \(R^2\) for this reason, though in practice the two estimates are often very close
    • That said, note that the test statistics (F and p values) reported in the output of the summary() function refer only to \(R^2\)
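The rightmost form of the adjustment formula can be checked directly against R's output, again with simulated data (hypothetical):

```r
# Recompute adjusted R^2 from R^2, N, and P and compare to summary()
set.seed(5)
n <- 120; P <- 2
x1 <- rnorm(n); x2 <- rnorm(n)
y  <- 2 + 0.3*x1 + 0.7*x2 + rnorm(n)

fit <- lm(y ~ x1 + x2)
R2  <- summary(fit)$r.squared
adj <- 1 - (1 - R2) * (n - 1) / (n - P - 1)   # formula from the slide

all.equal(adj, summary(fit)$adj.r.squared)    # TRUE
```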

Simultaneous Regression

  • In simultaneous regression, all predictor variables of interest are entered into the model at the same time.

  • This allows you to model the relationship between each predictor variable and the outcome while simultaneously “controlling” for all other predictors in the model

  • This is basically what we have been doing so far, and in the example to follow



Research Example Revisited: Model and Descriptives

  • With the foundations of a two-predictor-MLR model, we can return to the research example from your first Lab, with aggression (BPAQ; y) regressed on impulsivity (BIS; x1) and age (x2)

\[BPAQ_i = \beta_0 + \beta_1BIS_i + \beta_2age_i+\epsilon_i\]

library(haven)
library(tidyverse)
agrsn <- read_sav("aggression.sav")
misty::descript(agrsn %>% select(BPAQ, BIS, age))
 Descriptive Statistics

  Variable   n nNA   pNA     M   SD   Min   Max Skew  Kurt
   BPAQ    275   0 0.00%  2.61 0.52  1.34  4.03 0.01 -0.38
   BIS     275   0 0.00%  2.28 0.35  1.42  3.15 0.36 -0.19
   age     275   0 0.00% 20.21 4.96 17.00 50.00 3.74 15.87

Research Example Revisited: Descriptives

my_fn <- function(data, mapping, method1="lm", method2= "loess",...){
    ggplot(data = data, mapping = mapping) + geom_point(color="deepskyblue3") + geom_smooth(method=method1, size=2, colour="black")+ geom_smooth(method=method2, size=2, span = 0.7, colour="purple", alpha=0.2, linetype="dashed")+theme_minimal()}
# Using the ggpairs function from the GGally package
GGally::ggpairs(agrsn %>% select(BPAQ, BIS, age), aes(alpha = 0.5),
        columnLabels = c("Aggression", "Impulsivity", "age"), lower = list(continuous = my_fn), upper = list(continuous = 'cor'))

Research Example Revisited: Modeling

agrsn.mlr <- lm(BPAQ ~ BIS + age, data = agrsn)
summary(agrsn.mlr)

Call:
lm(formula = BPAQ ~ BIS + age, data = agrsn)

Residuals:
     Min       1Q   Median       3Q      Max 
-1.17933 -0.32561 -0.00454  0.35594  1.33877 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)  1.954392   0.244922   7.980 4.12e-14 ***
BIS          0.443607   0.085060   5.215 3.64e-07 ***
age         -0.017556   0.006032  -2.910  0.00391 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.4906 on 272 degrees of freedom
Multiple R-squared:  0.1299,    Adjusted R-squared:  0.1235 
F-statistic: 20.31 on 2 and 272 DF,  p-value: 6.016e-09

Research Example Revisited: Visualizing Results

Research Example Revisited: Visualizing Results

car::avPlots(agrsn.mlr) 

Research Example Revisited: Reporting Results

library(apaTables)
apa.reg.table(agrsn.mlr) # you can use the argument `filename=`, e.g., apa.reg.table(agrsn.mlr, filename="regtable.doc") and look at your working directory for a file under that name


Regression results using BPAQ as the criterion
 

   Predictor       b       b_95%_CI  beta    beta_95%_CI sr2  sr2_95%_CI      r
 (Intercept)  1.95**   [1.47, 2.44]                                            
         BIS  0.44**   [0.28, 0.61]  0.30   [0.19, 0.41] .09  [.02, .15]  .32**
         age -0.02** [-0.03, -0.01] -0.17 [-0.28, -0.05] .03 [-.01, .06] -.21**
                                                                               
                                                                               
                                                                               
             Fit
                
                
                
     R2 = .130**
 95% CI[.06,.20]
                

Note. A significant b-weight indicates the beta-weight and semi-partial correlation are also significant.
b represents unstandardized regression weights. beta indicates the standardized regression weights. 
sr2 represents the semi-partial correlation squared. r represents the zero-order correlation.
Square brackets are used to enclose the lower and upper limits of a confidence interval.
* indicates p < .05. ** indicates p < .01.
 
  • b = \(\hat{\beta}\)
  • sr2 = semipartial-\(r^2\)
  • beta = standardized \(\hat{\beta}\)

Nested Models

  • The research example above (BPAQ scores regressed on BIS and age) is an example of a simultaneous regression model because all variables of interest were included in “one go”

  • To set up a contrast with hierarchical regression, it is useful to consider another model with a larger number of predictors

  • Consequently, let’s add NEOo (openness) and NEOc (conscientiousness) as predictors of BPAQ

  • Two models are said to be nested if one model is a special case of the other, obtained by constraining some of its parameters (e.g., setting \(\beta_3\) and \(\beta_4\) to 0); that is, if one model’s predictors are a subset of the other’s, the two are called nested models

Example

NESTED: \({\color{deeppink} {\hat{BPAQ}_i = \beta_0 + \beta_1BIS_i + \beta_2age_i}}\)

FULL: \({\color{deeppink} {\hat{BPAQ}_i = \beta_0 + \beta_1BIS_i + \beta_2age_i}}+\beta_3NEOo_i +\beta_4NEOc_i\)

Nested Models

  • Let’s also estimate the “full,” larger model
agrsn.full <- lm(BPAQ ~ BIS + age + NEOo + NEOc, data = agrsn)
summary(agrsn.full)

Call:
lm(formula = BPAQ ~ BIS + age + NEOo + NEOc, data = agrsn)

Residuals:
     Min       1Q   Median       3Q      Max 
-1.20896 -0.32513 -0.00971  0.37144  1.28585 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)  2.415691   0.494262   4.887 1.75e-06 ***
BIS          0.369988   0.106467   3.475 0.000595 ***
age         -0.016666   0.006175  -2.699 0.007394 ** 
NEOo        -0.016360   0.059002  -0.277 0.781774    
NEOc        -0.072225   0.062888  -1.148 0.251790    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.4912 on 270 degrees of freedom
Multiple R-squared:  0.1342,    Adjusted R-squared:  0.1214 
F-statistic: 10.46 on 4 and 270 DF,  p-value: 6.788e-08

Hierarchical Regression

  • Hierarchical (also known as Sequential) regression allows for (nested) model comparisons

  • Specifically, an initial model with a smaller number of predictors is compared to a subsequent model with the same predictors plus one or more new predictors

  • You can think of this approach as testing whether the new variables in the larger model (as a set) account for a statistically non-zero amount of variance in the outcome variable over and above what has already been explained by the smaller model

  • Stated differently, does the new set of variables uniquely contribute to the total \(R^2\) beyond the first model’s \(R^2\)?

  • Going back to our aggression data example, the researcher might ask something like:

  • “Controlling for BIS and age, do NEOo and NEOc, taken together, predict aggression (BPAQ)?” OR

  • Do NEOo and NEOc, as a set, explain the variability in aggression (BPAQ), over and above age and BIS?

Hierarchical Regression

  • Hierarchical regression is all about \(R^2\) and therefore \(\sigma_{\epsilon}^2\), such that

\[H_0: \rho_{nested}^2 = \rho_{full}^2 \quad {\color{red} {OR}} \quad H_0: \sigma_{\epsilon_{nested}}^2 = \sigma_{\epsilon_{full}}^2 \quad {\color{red} {OR}} \quad H_0: \beta_3 = \beta_4 = 0\]

  • In English, the population proportion of variance explained is equal across the two models

  • OR, the population proportion of unexplained variance is equal across the two models

  • OR, the population partial regression slopes for the added predictors to the full model both equal 0

  • If \(R^2\) is significantly greater in the full model than in the nested model, we can conclude that the set of predictors unique to the full model (e.g., NEOo and NEOc) significantly predicts/explain the outcome (BPAQ), holding the rest of the predictors (i.e., BIS and age) constant

Hierarchical Regression: Model Comparison

\(R^2\) Model 1 (nested model) \(\hat{BPAQ}_i = \beta_0 + \beta_1BIS_i + \beta_2age_i\)

[1] 0.1299306

\(R^2\) Model 2 (full model) \(\hat{BPAQ}_i = \beta_0 + \beta_1BIS_i + \beta_2age_i+\beta_3NEOo_i +\beta_4NEOc_i\)

[1] 0.1342165

\(\Delta R^2\): The difference between the two models’ estimated proportion of explained variance in BPAQ:

summary(agrsn.full)$r.squared-summary(agrsn.mlr)$r.squared
[1] 0.004285952
  • Obviously, this is a very small difference, but for completeness, we can test its significance

Hierarchical Regression: Model Comparison

\[F(\Delta df, df_{\text{full}}) = \frac{\left( R_{\text{full}}^2 - R_{\text{nested}}^2 \right) / \Delta df}{\left( 1 - R_{\text{full}}^2 \right) / df_{\text{full}}}\]

where \(P\) is the number of predictors in a given model, \(df_{full}= N-P_{full}-1\), and \(\Delta df=P_{full}-P_{nested}\)

  • In our example, we have

\[F(2, 270) = \frac{\left( 0.134 - 0.129 \right) / (4 - 2)}{\left( 1 - 0.134 \right) / (275 - 4 - 1)}\approx 0.6683\]
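We can verify this arithmetic directly in R; a quick, self-contained check using the \(R^2\) values and degrees of freedom reported above:

```r
# R^2 values from the nested and full models above
R2_full   <- 0.1342165
R2_nested <- 0.1299306
N <- 275; P_full <- 4; P_nested <- 2

delta_df <- P_full - P_nested   # 2 added predictors
df_full  <- N - P_full - 1      # 270 residual df in the full model

# F statistic for the change in R^2
F_change <- ((R2_full - R2_nested) / delta_df) /
            ((1 - R2_full) / df_full)
F_change                                             # ~0.668
pf(F_change, delta_df, df_full, lower.tail = FALSE)  # ~0.513
```

Both numbers match the anova() output on the next slide, which does the same computation for us.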

Hierarchical Regression: Model Comparison

In R:

anova(agrsn.full, agrsn.mlr)
Analysis of Variance Table

Model 1: BPAQ ~ BIS + age + NEOo + NEOc
Model 2: BPAQ ~ BIS + age
  Res.Df    RSS Df Sum of Sq      F Pr(>F)
1    270 65.138                           
2    272 65.461 -2  -0.32246 0.6683 0.5134
  • Based on these data, the set of variables NEOo and NEOc does not account for a significant proportion of variance in BPAQ, over and above that already accounted for by BIS and age

  • But more important than statistical significance is practical significance: \(\Delta R^2=0.0043\); that is, a 0.43% difference between the two models in explained variance in aggression, which is very small and probably negligible in practice

Hierarchical Regression: Reporting Results

apa.reg.table(agrsn.mlr, agrsn.full) # you can use the argument `filename=`, e.g., apa.reg.table(agrsn.mlr, filename="regtable.doc") and look at your working directory for a file under that name


Regression results using BPAQ as the criterion
 

   Predictor       b       b_95%_CI  beta    beta_95%_CI sr2  sr2_95%_CI      r
 (Intercept)  1.95**   [1.47, 2.44]                                            
         BIS  0.44**   [0.28, 0.61]  0.30   [0.19, 0.41] .09  [.02, .15]  .32**
         age -0.02** [-0.03, -0.01] -0.17 [-0.28, -0.05] .03 [-.01, .06] -.21**
                                                                               
                                                                               
                                                                               
 (Intercept)  2.42**   [1.44, 3.39]                                            
         BIS  0.37**   [0.16, 0.58]  0.25   [0.11, 0.39] .04 [-.00, .08]  .32**
         age -0.02** [-0.03, -0.00] -0.16 [-0.27, -0.04] .02 [-.01, .06] -.21**
        NEOo   -0.02  [-0.13, 0.10] -0.02  [-0.13, 0.10] .00 [-.00, .00]   -.09
        NEOc   -0.07  [-0.20, 0.05] -0.08  [-0.22, 0.06] .00 [-.01, .02] -.25**
                                                                               
                                                                               
                                                                               
             Fit        Difference
                                  
                                  
                                  
     R2 = .130**                  
 95% CI[.06,.20]                  
                                  
                                  
                                  
                                  
                                  
                                  
     R2 = .134**   Delta R2 = .004
 95% CI[.06,.20] 95% CI[-.01, .02]
                                  

Note. A significant b-weight indicates the beta-weight and semi-partial correlation are also significant.
b represents unstandardized regression weights. beta indicates the standardized regression weights. 
sr2 represents the semi-partial correlation squared. r represents the zero-order correlation.
Square brackets are used to enclose the lower and upper limits of a confidence interval.
* indicates p < .05. ** indicates p < .01.
 

Hierarchical Regression: Reporting Results

\[ \begin{array}{lccccc} \hline \text{Variable} & \beta & se & p & \text{95 % CI} & sr^2 \\ \hline \underline{\textbf{Model 1}} & & & & & \\ \text{BIS} & 0.44 & 0.09 & < .001 & (0.28, 0.61) & 0.087 \\ \text{Age} & -0.02 & 0.01 & .004 & (-0.03, -0.01) & 0.027 \\ \\ \underline{\textbf{Model 2}} & & & & & \\ \text{BIS} & 0.37 & 0.11 & < .001 & (0.16, 0.58) & 0.039 \\ \text{Age} & -0.02 & 0.01 & .007 & (-0.03, -0.00) & 0.023 \\ \text{NEO}_{\text{o}} & -0.02 & 0.06 & .782 & (-0.13, 0.10) & < .001 \\ \text{NEO}_{\text{c}} & -0.07 & 0.06 & .252 & (-0.20, 0.05) & 0.004 \\ \hline \end{array} \]

\(\small{ \textit{Note.} \ Model \ 1 \ R^2 = .129 , \ F(2, 272) = 20.31, \ p < .001, \\ Model \ 2 \ R^2 = .134, F(4, 270) = 10.46, \ p < .001 , \\ \Delta R^2 = .004, F(2, 270) = 0.67, \ p = .51.}\)

Model Selection: Information Criteria

  • Model comparisons with the anova() function, which uses the F test (or likelihood ratio test; LRT), are hypothesis tests

  • As such, they are designed to detect statistical significance, not practical significance

  • When we’re doing explanatory research (as opposed to predictive or exploratory), we should avoid using p values as a model selection criterion; they can lead to misleading conclusions

  • Alternative approaches include information criteria, which assess model fit and complexity

Model Selection: Information Criteria

  • Information criteria summarize how well a model fits the data while accounting for model complexity (i.e., number of parameters)

  • They balance model fit (residual size) against complexity to avoid overfitting

  • Can be used to compare non-nested models!!!



Common Information Criteria Measures

  • Akaike’s Information Criterion (AIC)
  • Bayesian Information Criterion (BIC)

Model Selection: Information Criteria

Choosing the Best Model

  • The model with the smallest AIC or BIC value is considered the best, in a relative sense (the raw values carry no absolute meaning; only differences between models fit to the same data matter)
  • Both criteria penalize model complexity, meaning:
  • A more complex model (more predictors) will have higher AIC and BIC values unless it explains substantially more variability
  • The model must provide a substantial improvement in fit to justify the added complexity

Key Differences and Applications: AIC vs. BIC

  • AIC tends to favour more complex models, focusing on predictive accuracy
  • BIC penalizes complexity more strongly, often leading to simpler models

Model Selection: Information Criteria

AIC(agrsn.full, agrsn.mlr)
           df      AIC
agrsn.full  6 396.3449
agrsn.mlr   4 393.7029
BIC(agrsn.full, agrsn.mlr)
           df      BIC
agrsn.full  6 418.0455
agrsn.mlr   4 408.1700
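Both criteria are simple functions of the model's log-likelihood and its number of estimated parameters. As a sketch, the agrsn.mlr values above can be reproduced by hand (the log-likelihood here is back-solved from the reported AIC; with the fitted object you would simply call logLik(agrsn.mlr)):

```r
ll <- -192.8514  # log-likelihood of agrsn.mlr (back-solved from the AIC above)
k  <- 4          # estimated parameters: intercept, 2 slopes, residual variance
N  <- 275        # sample size

aic <- -2 * ll + 2 * k        # ~393.70, matches AIC(agrsn.mlr)
bic <- -2 * ll + log(N) * k   # ~408.17, matches BIC(agrsn.mlr)
```

Note how BIC's penalty, log(275) ≈ 5.62 per parameter, is much steeper than AIC's 2 per parameter; that is why BIC favours the simpler model by a wider margin here.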

Model Selection: Common Computer Algorithms

  • Researchers often aim to build an optimal model by selecting predictors that maximize \(R^2\) while keeping the model as simple as possible (model parsimony)

Common computer algorithms for model selection include:

  • Forward Selection

  • Backward Selection

  • Stepwise Selection
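In R, all three algorithms are available through the built-in step() function, which uses AIC as its default criterion. A sketch only, using the models fit earlier in these slides:

```r
# Backward: start from the full model and drop predictors one at a time
step(agrsn.full, direction = "backward")

# Forward: start from an intercept-only model and add predictors
null.mod <- lm(BPAQ ~ 1, data = agrsn)
step(null.mod, direction = "forward",
     scope = ~ BIS + age + NEOo + NEOc)

# Stepwise: consider both adding and dropping at each step
step(null.mod, direction = "both",
     scope = ~ BIS + age + NEOo + NEOc)
```

Keep the limitations discussed on the next slides in mind before trusting any model these algorithms return.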

Final Thoughts on Data-Driven Model Selection

Limitations of data-driven methods

  • Forward, backward, and stepwise selection ignore the theoretical importance of individual predictors
  • These methods capitalize on chance relationships in the sample, which:
    • May not replicate in another sample
    • Are unlikely to represent true relationships in the population

Cohen & Cohen (1983) on Automatic Regression Methods

  • Automatic methods, like stepwise regression, can be tempting for researchers with little theoretical guidance
  • These methods relieve researchers of the responsibility of selecting variables based on:
    • Logical relevance
    • Causal priority
  • However, they make interpretation more challenging

Model Selection: Key Takeaways

  • Avoid over-reliance on automated selection techniques for explanatory research
  • For predictive and/or exploratory research, however, these methods might be very useful!
  • For explanatory research: emphasize theory-driven variable selection to ensure meaningful and replicable findings
  • Use data-driven methods with caution, supplementing them with substantive knowledge




Binary Predictor Variables in Regression

Binary Predictors in Regression

  • The predictor variable in a regression model need not be continuous

  • While all of the examples in the previous lectures included continuous predictor variables, there is no reason that you cannot include categorical predictor variables or a mix of continuous and categorical

  • Binary or dichotomous variables are categorical variables with only two categories (i.e., treatment vs. control, homeowner vs. renter, etc.)

  • For example, we can use regression to look at the relation between a binary, dummy-coded gender variable and aggression scores (BPAQ)

  • Dummy coding is a method of converting categorical variables into numerical values (typically 0 and 1), representing group membership

  • In the aggression.sav dataset, the variable gender is binary and dummy-coded such that

attr(agrsn$gender, "labels")
  male female 
     0      1 
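If a grouping variable arrives as text rather than as 0s and 1s, dummy coding it yourself takes one line. A minimal, self-contained sketch (the character vector here is hypothetical, not from the aggression data):

```r
# Hypothetical character variable
gender_chr <- c("male", "female", "female", "male")

# Dummy code: female = 1, male = 0 (male becomes the reference group)
gender01 <- ifelse(gender_chr == "female", 1, 0)
gender01  # 0 1 1 0
```

(Alternatively, lm() will dummy-code a factor variable for you automatically; doing it by hand just makes the 0/1 coding, and hence the intercept's meaning, explicit.)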

Binary Predictors in Regression

  • We can use regression to look at the relationship between gender and BPAQ
ggplot(agrsn, aes(x=gender, y=BPAQ))+geom_point(size=3, colour="lightblue", alpha=0.75)+theme_minimal()

Binary Predictors in Regression

  • What happens if we fit an OLS regression line to that scatterplot?
ggplot(agrsn, aes(x=gender, y=BPAQ))+geom_point(size=3, colour="lightblue", alpha=0.75)+
  geom_smooth(method = "lm", size=2, colour="black", se=F)+theme_minimal()

Binary Predictors in Regression

  • In the previous plot, it turns out that the OLS regression line is the line that connects the means of the BPAQ values within each of the two gender groups

  • Now, let’s estimate this model:

tt.mod <- lm(BPAQ~gender, data = agrsn) # can you guess why I called it tt.mod?
summary(tt.mod)

Call:
lm(formula = BPAQ ~ gender, data = agrsn)

Residuals:
     Min       1Q   Median       3Q      Max 
-1.25482 -0.38936  0.02104  0.40035  1.43483 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  2.66183    0.06945  38.326   <2e-16 ***
gender      -0.06217    0.07801  -0.797    0.426    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.5244 on 273 degrees of freedom
Multiple R-squared:  0.002322,  Adjusted R-squared:  -0.001333 
F-statistic: 0.6353 on 1 and 273 DF,  p-value: 0.4261

Binary Predictors in Regression

  • The estimated OLS regression line describing the regression of BPAQ on the categorical gender variable is:

\[\hat{BPAQ}_i=2.66 -0.062 \times gender_i \]

  • Because \(\hat{\beta}_0\), the estimated intercept, is the predicted value of Y when X = 0, then 2.66 is the predicted value of BPAQ among males (because gender is coded 0 = male, 1 = female)

  • This value, \(\hat{\beta}_0=2.66\) is, in fact, the mean BPAQ score of males, see?!

agrsn %>% group_by(gender) %>% summarise(mean(BPAQ))
# A tibble: 2 × 2
  gender     `mean(BPAQ)`
  <dbl+lbl>         <dbl>
1 0 [male]           2.66
2 1 [female]         2.60

Binary Predictors in Regression

  • Now, \(\hat{\beta}_1=-0.062\), the estimated slope, is the predicted difference in Y when X differs by one unit (from 0 to 1); so -0.062 is the predicted difference in BPAQ between males and females. That is, “moving” from 0 (males) to 1 (females), we expect a decrease of 0.062 points on BPAQ

  • Thus, the estimated slope value is, in fact, the mean difference in BPAQ between males and females; does this sound familiar?!

  • When the predictor variable in SLR is dichotomous, the test of whether the slope significantly differs from zero is equivalent to the independent-groups t test comparing the dependent variable means of the two groups formed by the dichotomous predictor!

\[t(df) = \frac{\hat{\beta}_1}{SE_{\hat{\beta}_1}} = \frac{\bar{Y}_2 -\bar{Y}_1 }{SE_{\bar{Y}_2 -\bar{Y}_1}}\]

  • Yup!

Binary Predictors in Regression

  • Don’t believe me? Here’s the proof:
t.test(agrsn$BPAQ[agrsn$gender==0], agrsn$BPAQ[agrsn$gender==1], var.equal = TRUE)

    Two Sample t-test

data:  agrsn$BPAQ[agrsn$gender == 0] and agrsn$BPAQ[agrsn$gender == 1]
t = 0.79705, df = 273, p-value = 0.4261
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -0.09139479  0.21574473
sample estimates:
mean of x mean of y 
 2.661827  2.599652 
  • You can compare the t and p values, as well as the df, with the regression results (secret: they’re the same!)

  • This illustrates that the model underlying the independent groups t test is a special case of OLS regression!

Binary Predictors in Regression

  • But, what about MLR? Well, now, it will not be exactly like the t test, but it won’t be too complicated either

  • The interpretation of the slope associated with the binary predictor will change slightly, but you’re already familiar with this change!

  • This is no different from how the interpretation of any other predictor changes when you move from SLR to MLR

  • For example, say we estimate a model where aggression is regressed on gender and impulsivity, the estimate for the slope of gender will be the difference between males and females on BPAQ, holding BIS constant, or after adjusting for the effects of impulsivity!
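That model is fit exactly like any other MLR; a sketch assuming the agrsn data as before:

```r
# Aggression regressed on gender (dummy-coded) and impulsivity
mlr.gender <- lm(BPAQ ~ gender + BIS, data = agrsn)
summary(mlr.gender)

# The gender slope is now the predicted female-male difference in BPAQ
# among people with the same BIS score (i.e., adjusted for impulsivity)
```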

  • Cool, right? Very!

Class Activity/Check-in

Don’t forget to fill out this survey (also on eClass, under Module 3): Click Here