PSYC 3032 M
Now, we’re really cooking!
There are NO right or wrong answers
Multiple Linear Regression (MLR)
Although simple linear regression (SLR) provided a solid foundation for modelling, realistically, researchers are usually interested in more complicated relationships: ones that involve more than two variables simultaneously
Thus, the statistical model for that outcome variable should be flexible enough to incorporate multiple predictors—enter multiple linear regression (MLR)
MLR is the natural extension to SLR, whereby a researcher models the impact of multiple predictor variables on a single outcome variable
As it turns out, MLR serves as the building block for the majority of statistical techniques commonly used to analyze psychological data (e.g., SEM, MLM)
MLR is useful when we have two or more variables that jointly (i.e., together according to a linear model) can be used to predict/explain an outcome variable
Combining multiple sources of information will typically improve the accuracy of our predictions (even if that improvement is extremely small). This is a good thing!
The idea in MLR, however, is that although each of the predictor variables may independently relate to the outcome variable, they may contain redundant information (i.e., their variability overlaps with one another AND with the outcome)
\[y_i = \beta_0 + \beta_1 x1_{i} + \epsilon_i\]
\[y_i = \beta_0 + \beta_1 x1_{i} + \beta_2 x2_{i} + \epsilon_i\]
\[y_i = \beta_0 + \beta_1 x1_{i} + \beta_2 x2_{i} + \beta_3 x3_{i} + \epsilon_i\]
\[y_i = \beta_0 + \beta_1 x1_{i} + \beta_2 x2_{i} + \dots + \beta_P xP_{i} + \epsilon_i \\ = \beta_0 + \sum_{p=1}^{P} \beta_p xp_{i} + \epsilon_i\]
\[y_i = \beta_0 + \beta_1 x1_{i} + \beta_2 x2_{i} + \epsilon_i\]
\[\hat{y}_i = E(y_i|x1_i, x2_i)= \beta_0 + \beta_1 x1_{i} + \beta_2 x2_{i}\]
While the relationship between x and y in a simple regression model can be visually displayed as a regression line, with two predictor variables, the line becomes a regression plane (like a sheet of paper). Why?
Because the relationship between three continuous variables cannot be adequately presented in a two-dimensional space.
The data space would need to be three-dimensional to arrive at a unique solution for the predicted value of y.
What about 3 predictors?!
\[ \hat{y}_i = {\color{red} {\beta_0}} + {\color{purple} {\beta_1}} X1_i \]
\[\hat{y}_i = {\color{red} {-0.09}} + {\color{purple} {0.28}} X1_i \]
\[ \hat{y}_i = {\color{red} {\beta_0}} + {\color{purple} {\beta_1}} X1_i + {\color{orange} {\beta_2}} X2_i \]
\[\hat{y}_i = {\color{red} {2.72}} + {\color{purple} {0.24}} X1_i - {\color{orange}{0.51}} X2_i\]
library(haven)
library(plotly)
agrsn <- read_sav("aggression.sav")

# Fit a multiple linear regression model
model <- lm(BPAQ ~ AISS + BIS, data = agrsn)

# Generate a grid of predictor values for the regression plane
grid <- expand.grid(
  AISS = seq(min(agrsn$AISS), max(agrsn$AISS), length.out = 50),
  BIS  = seq(min(agrsn$BIS), max(agrsn$BIS), length.out = 50))
grid$BPAQ <- predict(model, newdata = grid) # Predicted Y values for the plane

# Create the 3D plot
plot_ly() %>%
  # Add data points
  add_markers(
    x = agrsn$AISS, y = agrsn$BIS, z = agrsn$BPAQ,
    marker = list(color = 'deepskyblue3', size = 5),
    name = "Data Points",
    hovertemplate = "AISS: %{x}<br>BIS: %{y}<br>BPAQ: %{z}<extra></extra>") %>% # Custom hover text
  add_surface( # Add regression plane
    x = matrix(grid$AISS, nrow = 50),
    y = matrix(grid$BIS, nrow = 50),
    z = matrix(grid$BPAQ, nrow = 50),
    opacity = 0.6,
    showscale = FALSE,
    hovertemplate = "AISS: %{x}<br>BIS: %{y}<br>BPAQ: %{z}<extra></extra>", # Custom hover text for the plane
    name = "Regression Plane") %>%
  layout( # Layout settings
    scene = list(
      xaxis = list(title = "AISS"),
      yaxis = list(title = "BIS"),
      zaxis = list(title = "BPAQ")))

A similar plot can also be made with the car and rgl packages.
BEFORE: \(Cholesterol_i = \beta_0 + \beta_1 Exercise_i + \epsilon_i\)
AFTER: \(Cholesterol_i = \beta_0 + \beta_1 Exercise_i + \beta_2 Age_i+ \epsilon_i\)
Simpson’s paradox, AKA the reversal paradox or Yule-Simpson effect, occurs when a trend that appears in different groups of data reverses or disappears when the groups are combined
In other words, the aggregated data may lead to a conclusion that contradicts the conclusions drawn from analyzing the data within each subgroup
The paradox often arises due to the presence of a confounding variable that affects both the grouping and the outcome
Causal relationships can dictate whether SLR or MLR is more appropriate (e.g., to understand the effect of age on exercise you would NOT condition on cholesterol!)
Relationships like these are everywhere, which is why you often hear “correlation does not imply causation”
We can, however, get around confounding effects through randomized experimentation (e.g., randomly assigning individuals to an exercise level would ensure that age is evenly distributed, “blocking” the unwanted causal effect on exercise/cholesterol)
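To make this concrete, here is a small hypothetical simulation (the variable names, sample size, and effect sizes are invented for illustration): when age drives both exercise and cholesterol, the simple regression slope for exercise is badly biased, while conditioning on age recovers the true (null) effect.

```r
# Hypothetical simulation (not real data): age confounds the
# exercise -> cholesterol relationship
set.seed(1)
n   <- 500
age <- rnorm(n, 50, 10)
# older people exercise less and have higher cholesterol;
# exercise itself has NO true effect on cholesterol here
exercise    <- 10 - 0.10 * age + rnorm(n, sd = 1)
cholesterol <- 150 + 2.0 * age + 0 * exercise + rnorm(n, sd = 5)

coef(lm(cholesterol ~ exercise))        # biased: picks up age's effect
coef(lm(cholesterol ~ exercise + age))  # adjusted: exercise slope near 0
```

The unadjusted slope is strongly negative even though exercise has no causal effect in this simulation; adding age to the model removes the bias.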
When models have more than one predictor, there are multiple levels of information that a researcher would be interested in examining (some of which you already saw last week)
Let’s separate the different levels of information (and tests) into categories of inference that we’ll call Micro/Predictor Level and Macro/Model Level inferences
Micro/Predictor Level
\[\hat{y}_i = E(y_i|x1_i, x2_i)= \beta_0 + \beta_1 x1_{i} + \beta_2 x2_{i}\]
The parameters of interest are the same as in SLR, but now the coefficients’ interpretation is slightly different once we incorporate more predictors into the model:
PhysAct_hrs_wk Sleep_hrs_day Resting_HR_bpm
Min. : 0.00 Min. : 3.000 Min. :47.00
1st Qu.: 6.85 1st Qu.: 8.200 1st Qu.:54.00
Median :11.40 Median : 9.750 Median :59.00
Mean :11.16 Mean : 9.662 Mean :60.29
3rd Qu.:15.93 3rd Qu.:11.425 3rd Qu.:67.00
Max. :20.90 Max. :13.000 Max. :78.00
\[\hat{RHR}_i = E(RHR| sleep, \ activity)=\beta_0 + \beta_1 sleep_i + \beta_2 activity_i\]
mlr.mod <- lm(Resting_HR_bpm ~ Sleep_hrs_day + PhysAct_hrs_wk, data = health_data)
mlr.mod$coefficients
   (Intercept)  Sleep_hrs_day PhysAct_hrs_wk
    78.7227720     -0.7409841     -1.0098311
(Intercept) = \(\hat{\beta}_0=78.72\): For a hypothetical person who sleeps 0 hours a night AND does 0 hours of physical activity a week, the expected resting HR is 78.72 bpm
Sleep_hrs_day = \(\hat{\beta}_1=-0.74\): For every additional hour of sleep per day, resting HR is expected to decrease by 0.74 bpm, holding physical activity constant
PhysAct_hrs_wk = \(\hat{\beta}_2=-1.01\): For every additional hour of physical activity per week, resting HR is expected to decrease by 1.01 bpm, holding sleep constant
What does “holding X constant” even mean!?
The “holding all else constant” part of the partial slope interpretation is a little ambiguous, so we’ll take a closer look with something more hands-on!
To illustrate what these partial slopes/coefficients mean, we’ll take a hypothetical person, Terry Jeffords, who sleeps 7 hours per night and does 15 hours of physical activity per week, and predict their resting HR:
predict(mlr.mod, data.frame(Sleep_hrs_day = 7 + 1, PhysAct_hrs_wk = 15)) -
  predict(mlr.mod, data.frame(Sleep_hrs_day = 7, PhysAct_hrs_wk = 15))
         1
-0.7409841
predict(mlr.mod, data.frame(Sleep_hrs_day = 7, PhysAct_hrs_wk = 15 + 1)) -
  predict(mlr.mod, data.frame(Sleep_hrs_day = 7, PhysAct_hrs_wk = 15))
        1
-1.009831
\(\hat{\beta}_1=-0.74\): If two random individuals with the same level of physical activity (i.e., holding physical activity constant) have a one-hour difference in sleep per night, we would expect the person who sleeps longer to have a lower resting heart rate by 0.74 bpm
\(\hat{\beta}_2=-1.01\): If two random individuals are otherwise the same on all the variables, but have a one-hour difference in physical activity per week, we would expect the person who exercises one more hour a week to have a lower resting HR by 1.01 bpm
Another way to assess the contribution of predictors is to report their semipartial-\(r^2\) (\(spr^2\)) with the outcome (remember from the correlation Module?)
The \(spr^2\) for a predictor gives the proportion of variance in the outcome that is uniquely shared with that particular predictor
\(spr^2\) is another type of (standardized) effect size that allows a researcher to compare the relative importance/contribution of predictor variables to explaining/predicting the outcome
eta.sq eta.sq.part
Sleep_hrs_day 0.02471666 0.07026386
PhysAct_hrs_wk 0.31121489 0.48759331
In relation to \(\hat{\beta}_1\), the number of daily sleep hours uniquely accounts for—or, explains—about 2.5% of the variability in resting HR
In relation to \(\hat{\beta}_2\), the number of hours a week an individual engages in physical activity uniquely accounts for—or, explains—about 31% of the variability in resting HR
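One hands-on way to see where a predictor’s \(spr^2\) comes from is to compute it as a difference in \(R^2\): fit the full model, drop the predictor, and subtract. The sketch below uses simulated stand-in data (the real health_data file is not reproduced here), so the resulting number will only roughly resemble the slide output.

```r
# Hypothetical stand-in for the health data used in the slides
set.seed(42)
health_sim <- data.frame(
  Sleep_hrs_day  = rnorm(200, 9.7, 2),
  PhysAct_hrs_wk = rnorm(200, 11, 5))
health_sim$Resting_HR_bpm <- 78.7 - 0.74 * health_sim$Sleep_hrs_day -
  1.01 * health_sim$PhysAct_hrs_wk + rnorm(200, sd = 4.8)

# sr^2 for sleep = R^2 of the full model minus R^2 of the model without sleep
r2_full    <- summary(lm(Resting_HR_bpm ~ Sleep_hrs_day + PhysAct_hrs_wk,
                         data = health_sim))$r.squared
r2_nosleep <- summary(lm(Resting_HR_bpm ~ PhysAct_hrs_wk,
                         data = health_sim))$r.squared
r2_full - r2_nosleep  # sleep's unique share of the outcome variance
```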
Just like in SLR, estimates for the model parameters are obtained through OLS (remember the 3D plot that looks like a folded sheet from last week?)
The formula to calculate each of the slopes is not too difficult with only two predictors, but it becomes increasingly complicated as more predictor variables are added to the model.
\[ \hat{\beta}_1 = \frac{r_{y x_1} - r_{y x_2} r_{x_1 x_2}}{1 - r_{x_1 x_2}^2} \left( \frac{SD_y}{SD_{x_1}} \right) \quad \quad \hat{\beta}_2 = \frac{r_{y x_2} - r_{y x_1} r_{x_1 x_2}}{1 - r_{x_1 x_2}^2} \left( \frac{SD_y}{SD_{x_2}} \right) \]
\[\hat{\beta}_0=\bar{y}-\hat{\beta}_1\bar{x}_1-\hat{\beta}_2\bar{x}_2\]
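As a sanity check of these formulas, the correlation-based expression for \(\hat{\beta}_1\) can be compared against what lm() returns (the data below are simulated; names and coefficients are arbitrary):

```r
# Checking the two-predictor slope formula against lm() on simulated data
set.seed(7)
x1 <- rnorm(300)
x2 <- 0.5 * x1 + rnorm(300)           # correlated predictors
y  <- 1 + 0.6 * x1 - 0.4 * x2 + rnorm(300)

r_y1 <- cor(y, x1); r_y2 <- cor(y, x2); r_12 <- cor(x1, x2)
b1_formula <- (r_y1 - r_y2 * r_12) / (1 - r_12^2) * (sd(y) / sd(x1))
b1_lm      <- coef(lm(y ~ x1 + x2))["x1"]
all.equal(unname(b1_lm), b1_formula)  # TRUE
```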
\[H_0:\beta_1=0 \quad \quad \quad H_0:\beta_2=0 \\ H_1:\beta_1\neq0 \quad \quad \quad H_1:\beta_2\neq0\]
\[t_1(df)=\frac{\hat{\beta}_1}{SE_{\hat{\beta_1}}}\quad \quad \quad t_2(df)=\frac{\hat{\beta}_2}{SE_{\hat{\beta_2}}}\]
\[CI_{(1-\alpha)100\%}=\hat{\beta}_1 \pm t_{(1 − \alpha /2, \ df)} \times SE_{\hat{\beta_1}} \quad \quad \quad CI_{(1-\alpha)100\%}=\hat{\beta}_2 \pm t_{(1 − \alpha /2, \ df)} \times SE_{\hat{\beta_2}}\]
Macro/Model Level
In MLR, the macro level inferences pertain to the proportion of variance in the outcome explained by the linear combination of predictors, otherwise known as \(R^2\), the squared multiple correlation or the coefficient of multiple determination
\(R^2\), the squared multiple correlation is analogous to \(r^2\), the coefficient of determination, but instead of just two variables, it is the proportion of the variability in the outcome shared with multiple variables simultaneously
Remember this from the first week?
\[Variable \ of \ interest = Model + Error\]
We can think of \(R^2\) as the “Model” piece of this equation
The variability in the outcome is completely explained by the model variability (i.e., linear combination of predictors) plus the variability that is not explained by the model (error or residuals)
In R:
i.e., the proportion of variance in the resting HR explained by the linear combination of sleep (h/d) and physical activity (h/w) is estimated at 67.29%; or, sleep (h/d) and physical activity (h/w), together as a set, account for about 67.29% of the variability in resting HR!
ANOVA: \(SS_{between}+SS_{within}=SS_{total}\)
Regression: \(SS_{model}+SS_{error}=SS_{total}\)
Where \(SS_{model}\) (Model Sum of Squares, also called Regression Sum of Squares) represents the variability explained by the regression model (i.e., how well the predictor variables explain the dependent variable); \(SS_{model}=\sum (\hat{y}_i-\bar{y})^2\)
And, \(SS_{error}\) (Error Sum of Squares, also called Residual Sum of Squares) represents the unexplained variability or error in the model (i.e., the deviation of observed values from the predicted values); \(SS_{error}=\sum (y_i-\hat{y}_i)^2\)
Finally, \(SS_{total}\) (Total Sum of Squares) represents the total variability in the dependent variable (the outcome variable) around its mean; \(SS_{total}=\sum (y_i-\bar{y}_i)^2\)
\[R^2=\frac{SS_{model}}{SS_{total}} = 1- \frac{SS_{error}}{SS_{total}}\]
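This decomposition is easy to verify in R on any fitted model; the data below are simulated just for the check.

```r
# Verifying SS_model + SS_error = SS_total and R^2 = SS_model / SS_total
set.seed(3)
x <- rnorm(100)
y <- 2 + 0.8 * x + rnorm(100)
fit <- lm(y ~ x)

ss_total <- sum((y - mean(y))^2)           # total variability around the mean
ss_model <- sum((fitted(fit) - mean(y))^2) # variability explained by the model
ss_error <- sum(resid(fit)^2)              # unexplained variability

all.equal(ss_model + ss_error, ss_total)           # decomposition holds
all.equal(ss_model / ss_total, summary(fit)$r.squared)
```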
We can still interpret \(R^2\) as a type of squared bivariate correlation, however we have to rethink what exactly is being correlated
If we conceptualize the bivariate correlation as existing between the observed variable, y, and its predictions, \(\hat{y}\), then we can see another nice definition of what \(R^2\) means:
\[R^2=COR(y, \hat{y})^2\]
cor(health_data$Resting_HR_bpm, predict(mlr.mod))^2 # predict() yields the predicted values from the model
[1] 0.6729475
See?
\[H_0: R^2 = 0 \quad \quad H_1: R^2 \neq 0\]
\[ \begin{array}{|l|l|l|l|l|} \hline \text{Source of Variation} & SS & df & MS & F \\ \hline \text{Regression} & SSR = \sum (\hat{Y}_i - \bar{Y})^2 & P \ (no. \ predictors) & MSR = \frac{SSR}{df_R} & \frac{MSR}{MSE} \\ \text{Error} & SSE = \sum (Y_i - \hat{Y}_i)^2 & N - P - 1 & MSE = \frac{SSE}{df_E} & \\ \text{Total} & SST = \sum (Y_i - \bar{Y})^2 & N - 1 & & \\ \hline \end{array} \]
As you know, the F distribution (Fisher) is a generalization of the t distribution (Student), but for jointly evaluating more than one parameter
Hence, its usefulness when we have multiple slopes to evaluate jointly
The F distribution has two \(df\) terms which control its shape and how thick the tail is
In R (see Multiple R-squared)
Call:
lm(formula = Resting_HR_bpm ~ Sleep_hrs_day + PhysAct_hrs_wk,
data = health_data)
Residuals:
Min 1Q Median 3Q Max
-13.5182 -3.0236 -0.1513 3.4134 18.2300
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 78.72277 1.56088 50.435 < 2e-16 ***
Sleep_hrs_day -0.74098 0.19204 -3.859 0.000155 ***
PhysAct_hrs_wk -1.00983 0.07376 -13.692 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 4.824 on 197 degrees of freedom
Multiple R-squared: 0.6729, Adjusted R-squared: 0.6696
F-statistic: 202.7 on 2 and 197 DF, p-value: < 2.2e-16
There is, unfortunately, one notably problematic property of \(R^2\) . . .
The \(R^2\) estimate tends to be upwardly biased (i.e., higher than it should be)
As more predictors are added to the model, \(R^2\) will only increase regardless of whether these are good predictors of y
This follows from the form \(R^2= 1- \frac{SS_{error}}{SS_{total}}\)
The denominator is constant across models, but \(SS_{error}\) will either stay the same (extremely unlikely) or decrease whenever we add new variables
The implication is that larger, more complicated models will always “appear” to explain more variance than simpler models (which is true, but isn’t necessarily optimal because it capitalizes on chance relationships that are not real in the population)
\[R_{\text{adj}}^2 = 1 - \frac{\hat{\sigma}_e^2}{\hat{\sigma}_t^2} = 1 - \frac{SS_e / (N - P - 1)}{SS_t / (N - 1)} = 1 - \frac{(1 - R^2)(N - 1)}{N - P - 1}\]
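As a check, the right-most form of \(R^2_{adj}\) can be computed by hand and compared to the Adjusted R-squared that summary() reports (data simulated for illustration; note x2 is pure noise, so the adjustment matters):

```r
# Computing adjusted R^2 by hand and checking it against summary()
set.seed(11)
d <- data.frame(x1 = rnorm(150), x2 = rnorm(150))
d$y <- 1 + 0.5 * d$x1 + rnorm(150)   # x2 is a useless predictor
fit <- lm(y ~ x1 + x2, data = d)

N <- nrow(d); P <- 2                  # sample size and number of predictors
r2     <- summary(fit)$r.squared
r2_adj <- 1 - (1 - r2) * (N - 1) / (N - P - 1)
all.equal(r2_adj, summary(fit)$adj.r.squared)  # TRUE
```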
Note: the \(R^2\) test reported by the summary() function refers only to \(R^2\), not \(R^2_{adj}\)

In simultaneous regression, all predictor variables of interest are entered into the model at the same time.
This allows you to model the relationship between each predictor variable and the outcome while simultaneously “controlling” for all other predictors in the model
This is basically what we have been doing so far, and in the example to follow
\[BPAQ_i = \beta_0 + \beta_1BIS_i + \beta_2age_i+\epsilon_i\]
library(haven)
library(tidyverse)
agrsn <- read_sav("aggression.sav")
misty::descript(agrsn %>% select(BPAQ, BIS, age))
 Descriptive Statistics
Variable n nNA pNA M SD Min Max Skew Kurt
BPAQ 275 0 0.00% 2.61 0.52 1.34 4.03 0.01 -0.38
BIS 275 0 0.00% 2.28 0.35 1.42 3.15 0.36 -0.19
age 275 0 0.00% 20.21 4.96 17.00 50.00 3.74 15.87
my_fn <- function(data, mapping, method1 = "lm", method2 = "loess", ...){
  ggplot(data = data, mapping = mapping) +
    geom_point(color = "deepskyblue3") +
    geom_smooth(method = method1, size = 2, colour = "black") +
    geom_smooth(method = method2, size = 2, span = 0.7, colour = "purple",
                alpha = 0.2, linetype = "dashed") +
    theme_minimal()}
# Using the ggpairs function from the GGally package
GGally::ggpairs(agrsn %>% select(BPAQ, BIS, age), aes(alpha = 0.5),
columnLabels = c("Aggression", "Impulsivity", "age"), lower = list(continuous = my_fn), upper = list(continuous = 'cor'))
Call:
lm(formula = BPAQ ~ BIS + age, data = agrsn)
Residuals:
Min 1Q Median 3Q Max
-1.17933 -0.32561 -0.00454 0.35594 1.33877
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.954392 0.244922 7.980 4.12e-14 ***
BIS 0.443607 0.085060 5.215 3.64e-07 ***
age -0.017556 0.006032 -2.910 0.00391 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.4906 on 272 degrees of freedom
Multiple R-squared: 0.1299, Adjusted R-squared: 0.1235
F-statistic: 20.31 on 2 and 272 DF, p-value: 6.016e-09
library(apaTables)
apa.reg.table(agrsn.mlr) # you can use the argument `filename=`, e.g., apa.reg.table(agrsn.mlr, filename="regtable.doc") and look at your working directory for a file under that name
Regression results using BPAQ as the criterion
Predictor b b_95%_CI beta beta_95%_CI sr2 sr2_95%_CI r
(Intercept) 1.95** [1.47, 2.44]
BIS 0.44** [0.28, 0.61] 0.30 [0.19, 0.41] .09 [.02, .15] .32**
age -0.02** [-0.03, -0.01] -0.17 [-0.28, -0.05] .03 [-.01, .06] -.21**
Fit
R2 = .130**
95% CI[.06,.20]
Note. A significant b-weight indicates the beta-weight and semi-partial correlation are also significant.
b represents unstandardized regression weights. beta indicates the standardized regression weights.
sr2 represents the semi-partial correlation squared. r represents the zero-order correlation.
Square brackets are used to enclose the lower and upper limits of a confidence interval.
* indicates p < .05. ** indicates p < .01.
b = \(\hat{\beta}\); beta = standardized \(\hat{\beta}\); sr2 = semipartial-\(r^2\)

The research example above (BPAQ scores regressed on BIS and age) is an example of a simultaneous regression model because all variables of interest were included in “one go”
To set up a contrast with hierarchical regression, it is useful to consider another model with a larger number of predictors
Consequently, let’s add NEOo (openness) and NEOc (conscientiousness) as predictors of BPAQ
Two models are said to be nested if both contain the same parameters, but one model is a special case of the other (e.g., the model with \(\beta_3\) and \(\beta_4\) fixed to 0); in other words, if one model is a subset of another model, the two are called nested models
NESTED: \({\color{deeppink} {\hat{BPAQ}_i = \beta_0 + \beta_1BIS_i + \beta_2age_i}}\)
FULL: \({\color{deeppink} {\hat{BPAQ}_i = \beta_0 + \beta_1BIS_i + \beta_2age_i}}+\beta_3NEOo_i +\beta_4NEOc_i\)
Call:
lm(formula = BPAQ ~ BIS + age + NEOo + NEOc, data = agrsn)
Residuals:
Min 1Q Median 3Q Max
-1.20896 -0.32513 -0.00971 0.37144 1.28585
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 2.415691 0.494262 4.887 1.75e-06 ***
BIS 0.369988 0.106467 3.475 0.000595 ***
age -0.016666 0.006175 -2.699 0.007394 **
NEOo -0.016360 0.059002 -0.277 0.781774
NEOc -0.072225 0.062888 -1.148 0.251790
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.4912 on 270 degrees of freedom
Multiple R-squared: 0.1342, Adjusted R-squared: 0.1214
F-statistic: 10.46 on 4 and 270 DF, p-value: 6.788e-08
Hierarchical (also known as Sequential) regression allows for (nested) model comparisons
Specifically, an initial model with a smaller number of predictors is compared to a subsequent model with the same predictors plus one or more new predictors
You can think of this approach as testing whether the new variables in the larger model (as a set) account for a statistically non-zero amount of variance in the outcome variable over and above what has already been explained by the smaller model
Stated differently, does the new set of variables uniquely contribute to the total \(R^2\) beyond the first model’s \(R^2\)?
Going back to our aggression data example, the researcher might ask something like:
“Controlling for BIS and age, do NEOo and NEOc, taken together, predict aggression (BPAQ)?” OR
Do NEOo and NEOc, as a set, explain the variability in aggression (BPAQ), over and above age and BIS?
\[H_0: \rho_{nested}^2 = \rho_{full}^2 \quad {\color{red} {OR}} \quad H_0: \sigma_{\epsilon_{nested}}^2 = \sigma_{\epsilon_{full}}^2 \quad {\color{red} {OR}} \quad H_0: \beta_3 = \beta_4 = 0\]
In English, the population proportion of variance explained is equal across the two models
OR, the population proportion of unexplained variance is equal across the two models
OR, the population partial regression slopes for the added predictors to the full model both equal 0
If \(R^2\) is significantly greater in the full model than in the nested model, we can conclude that the set of predictors unique to the full model (e.g., NEOo and NEOc) significantly predicts/explain the outcome (BPAQ), holding the rest of the predictors (i.e., BIS and age) constant
[1] 0.1299306
[1] 0.1342165
\(\Delta R^2\): The difference between the two models’ estimated proportion of explained variance in BPAQ:
\[F(\Delta df, df_{\text{full}}) = \frac{\left( R_{\text{full}}^2 - R_{\text{nested}}^2 \right) / \Delta df}{\left( 1 - R_{\text{full}}^2 \right) / df_{\text{full}}}\]
where \(P=\) number of predictors in either model, \(df_{full}= N-P_{full}-1\), \(\Delta df=P_{full}-P_{nested}\)
\[F(2, 270) = \frac{\left( 0.1342 - 0.1299 \right) / (4 - 2)}{\left( 1 - 0.1342 \right) / (275 - 4 - 1)}\approx 0.67\]
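The same arithmetic can be carried out directly in R, including the p value from the F distribution (numbers taken from the two models above):

```r
# Reproducing the Delta R^2 F test by hand with the slide's numbers
r2_full <- 0.1342165; r2_nested <- 0.1299306
N <- 275; P_full <- 4; P_nested <- 2
df_full  <- N - P_full - 1      # 270
delta_df <- P_full - P_nested   # 2

F_stat <- ((r2_full - r2_nested) / delta_df) / ((1 - r2_full) / df_full)
p_val  <- pf(F_stat, delta_df, df_full, lower.tail = FALSE)
round(c(F = F_stat, p = p_val), 4)  # matches anova(): F ~ 0.67, p ~ 0.51
```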
In R:
Analysis of Variance Table
Model 1: BPAQ ~ BIS + age + NEOo + NEOc
Model 2: BPAQ ~ BIS + age
Res.Df RSS Df Sum of Sq F Pr(>F)
1 270 65.138
2 272 65.461 -2 -0.32246 0.6683 0.5134
Based on these data, the set of variables NEOo and NEOc, does not account for a significant proportion of variance in BPAQ, over and above that already accounted for by BIS and age
But more important than statistical significance is practical significance: \(\Delta R^2=0.0043\), i.e., a 0.43% difference in explained variance in aggression between the two models, which is very small and probably negligible in practice
apa.reg.table(agrsn.mlr, agrsn.full) # you can use the argument `filename=`, e.g., apa.reg.table(agrsn.mlr, filename="regtable.doc") and look at your working directory for a file under that name
Regression results using BPAQ as the criterion
Predictor b b_95%_CI beta beta_95%_CI sr2 sr2_95%_CI r
(Intercept) 1.95** [1.47, 2.44]
BIS 0.44** [0.28, 0.61] 0.30 [0.19, 0.41] .09 [.02, .15] .32**
age -0.02** [-0.03, -0.01] -0.17 [-0.28, -0.05] .03 [-.01, .06] -.21**
(Intercept) 2.42** [1.44, 3.39]
BIS 0.37** [0.16, 0.58] 0.25 [0.11, 0.39] .04 [-.00, .08] .32**
age -0.02** [-0.03, -0.00] -0.16 [-0.27, -0.04] .02 [-.01, .06] -.21**
NEOo -0.02 [-0.13, 0.10] -0.02 [-0.13, 0.10] .00 [-.00, .00] -.09
NEOc -0.07 [-0.20, 0.05] -0.08 [-0.22, 0.06] .00 [-.01, .02] -.25**
Fit Difference
R2 = .130**
95% CI[.06,.20]
R2 = .134** Delta R2 = .004
95% CI[.06,.20] 95% CI[-.01, .02]
Note. A significant b-weight indicates the beta-weight and semi-partial correlation are also significant.
b represents unstandardized regression weights. beta indicates the standardized regression weights.
sr2 represents the semi-partial correlation squared. r represents the zero-order correlation.
Square brackets are used to enclose the lower and upper limits of a confidence interval.
* indicates p < .05. ** indicates p < .01.
\[ \begin{array}{lccccc} \hline \text{Variable} & \beta & se & p & \text{95 % CI} & sr^2 \\ \hline \underline{\textbf{Model 1}} & & & & & \\ \text{BIS} & 0.44 & 0.09 & < .001 & (0.28, 0.61) & 0.087 \\ \text{Age} & -0.02 & 0.01 & .004 & (-0.03, -0.01) & 0.027 \\ \\ \underline{\textbf{Model 2}} & & & & & \\ \text{BIS} & 0.37 & 0.11 & < .001 & (0.16, 0.58) & 0.039 \\ \text{Age} & -0.02 & 0.01 & .007 & (-0.03, -0.00) & 0.023 \\ \text{NEO}_{\text{o}} & -0.02 & 0.06 & .782 & (-0.13, 0.10) & < .001 \\ \text{NEO}_{\text{c}} & -0.07 & 0.06 & .252 & (-0.20, 0.05) & 0.004 \\ \hline \end{array} \]
\(\small{ \textit{Note.} \ Model \ 1 \ R^2 = .129 , \ F(2, 272) = 20.31, \ p < .001, \\ Model \ 2 \ R^2 = .134, F(4, 270) = 10.46, \ p < .001 , \\ \Delta R^2 = .004, F(2, 270) = 0.67, \ p = .51.}\)
Model comparison with the anova() function, which uses the F test (or likelihood ratio test; LRT), is a hypothesis test
As such, they are designed to detect statistical significance, not practical significance
When we’re doing explanatory research (as opposed to predictive or exploratory), we should avoid using p values as a model selection criterion; they can lead to misleading conclusions
Alternative approaches include information criteria, which assess model fit and complexity
Information criteria summarize how well a model fits the data while accounting for model complexity (i.e., number of parameters)
They balance model fit (residual size) against complexity to avoid overfitting
Can be used to compare non-nested models!!!
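A minimal sketch of an information-criterion comparison in R, using simulated data and two deliberately non-nested models (variable names are arbitrary):

```r
# Comparing models (nested or not) with AIC/BIC on simulated data
set.seed(5)
d <- data.frame(x1 = rnorm(200), x2 = rnorm(200), x3 = rnorm(200))
d$y <- 1 + 0.5 * d$x1 + rnorm(200)   # only x1 truly matters

m1 <- lm(y ~ x1,      data = d)      # smaller model
m2 <- lm(y ~ x2 + x3, data = d)      # non-nested alternative

AIC(m1, m2)  # lower is better; penalty of 2 per parameter
BIC(m1, m2)  # heavier penalty of log(N) per parameter
```

Here the model containing the true predictor should win on both criteria.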
Common Information Criteria Measures
Choosing the Best Model
Key Differences and Applications: AIC vs. BIC
Common computer algorithms for model selection include:
Forward Selection
Backward Selection
Stepwise Selection
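These algorithms are available in base R through the step() function, which by default selects on AIC. A small sketch on simulated data (purely illustrative; variable names are invented):

```r
# Automated selection with step() on simulated data
set.seed(9)
d <- data.frame(x1 = rnorm(150), x2 = rnorm(150), x3 = rnorm(150))
d$y <- 1 + 0.6 * d$x1 + rnorm(150)   # only x1 truly predicts y

full <- lm(y ~ x1 + x2 + x3, data = d)

# Backward elimination: start full, drop predictors that hurt AIC
back <- step(full, direction = "backward", trace = 0)

# Forward selection: start empty, add predictors that improve AIC
fwd <- step(lm(y ~ 1, data = d), scope = formula(full),
            direction = "forward", trace = 0)
```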
Limitations of data-driven methods
Cohen & Cohen (1983) on Automatic Regression Methods
Binary Predictor Variables in Regression
The predictor variable in a regression model need not be continuous
While all of the examples in the previous lectures included continuous predictor variables, there is no reason that you cannot include categorical predictor variables or a mix of continuous and categorical
Binary or dichotomous variables are categorical variables with only two categories (e.g., treatment vs. control, homeowner vs. renter)
For example, we can use regression to look at the relation between a dummy-coded binary gender variable and aggression scores (BPAQ)
Dummy coding is a method of converting categorical variables into numerical values (typically 0 and 1), representing group membership
In the aggression.sav dataset, the variable gender is binary and dummy-coded (0 = male, 1 = female)
In the previous plot, it turns out that the OLS regression line connects the mean BPAQ values within each of the two genders
Now, let’s estimate this model:
Call:
lm(formula = BPAQ ~ gender, data = agrsn)
Residuals:
Min 1Q Median 3Q Max
-1.25482 -0.38936 0.02104 0.40035 1.43483
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 2.66183 0.06945 38.326 <2e-16 ***
gender -0.06217 0.07801 -0.797 0.426
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.5244 on 273 degrees of freedom
Multiple R-squared: 0.002322, Adjusted R-squared: -0.001333
F-statistic: 0.6353 on 1 and 273 DF, p-value: 0.4261
\[\hat{BPAQ}_i=2.66 -0.062 \times gender \]
Because \(\hat{\beta}_0\), the estimated intercept, is the predicted value of Y when X = 0, then 2.66 is the predicted value of BPAQ among males (because gender is coded 0 = male, 1 = female)
This value, \(\hat{\beta}_0=2.66\) is, in fact, the mean BPAQ score of males, see?!
Now, \(\hat{\beta}_1=-0.062\), the estimated slope, is the predicted difference in Y when X differs by one unit (from 0 to 1), then -0.062 is the predicted difference in BPAQ between males and females; that is “moving” from 0 (males) to 1 (females), we expect a decrease of 0.062 points on BPAQ
Thus, the estimated slope value is, in fact, the mean difference in BPAQ between males and females; does this sound familiar?!
When the predictor variable in SLR is dichotomous, the test of whether the slope significantly differs from zero is equivalent to the independent-groups t test comparing the dependent variable means of the two groups formed by the dichotomous predictor!
\[t(df) = \frac{\hat{\beta}_1}{SE_{\hat{\beta}_1}} = \frac{\bar{Y}_2 -\bar{Y}_1 }{SE_{\bar{Y}_2 -\bar{Y}_1}}\]
Two Sample t-test
data: agrsn$BPAQ[agrsn$gender == 0] and agrsn$BPAQ[agrsn$gender == 1]
t = 0.79705, df = 273, p-value = 0.4261
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-0.09139479 0.21574473
sample estimates:
mean of x mean of y
2.661827 2.599652
You can compare the t and p values, as well as the df, with the regression results (secret: they’re the same!)
This illustrates that the model underlying the independent groups t test is a special case of OLS regression!
But, what about MLR? Well, now, it will not be exactly like the t test, but it won’t be too complicated either
The interpretation of the slope associated with the binary predictor will change slightly, but you’re already familiar with this change!
This is no different from how the interpretation of any other predictor changes when you move from SLR to MLR
For example, say we estimate a model where aggression is regressed on gender and impulsivity, the estimate for the slope of gender will be the difference between males and females on BPAQ, holding BIS constant, or after adjusting for the effects of impulsivity!
Cool, right? Very!
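As a sketch of that interpretation (on simulated data, not the aggression dataset; the coding and effect sizes are invented):

```r
# Illustrative simulation: dummy-coded gender plus continuous
# impulsivity (BIS) predicting aggression (BPAQ)
set.seed(8)
gender <- rbinom(300, 1, 0.5)    # 0 = male, 1 = female (hypothetical coding)
BIS    <- rnorm(300, 2.3, 0.35)
BPAQ   <- 2 + 0.3 * gender + 0.4 * BIS + rnorm(300, sd = 0.5)

# The gender slope is now the adjusted male-female mean difference
# in BPAQ, holding BIS constant
coef(lm(BPAQ ~ gender + BIS))["gender"]
```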
Don’t forget to fill out this survey (also on eClass, under Module 3): Click Here