PSYC 3032 M
Categorical Predictor Variables
In this course we’ve been focusing almost exclusively on the relationship between continuous (i.e., quantitative) variables
What’s exactly happening when we include categorical (a.k.a., qualitative) predictors in our regression model?
What happens when there are more than two categories!? How can we use and interpret multi-category, nominal, or ordinal predictors in our regression model?
In Module 3, we saw that it is straightforward to include a dichotomous/binary (i.e., categories=2) variable in a simple regression model
Because \(X\) has only two possible values, the relation between \(X\) and \(Y\) can only be linear!
When a dependent variable \(Y\) is regressed on a single, dichotomous predictor \(X\), the estimated regression line connects the mean of \(Y\) at one level of \(X\) to the mean of \(Y\) at the other level of \(X\)
The slope, \(\hat{\beta}_1\), of this line equals the difference between the two means, remember?
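In symbols, with \(X\) coded 0/1:
\[\hat{\beta}_0 = \bar{Y}_{x=0}, \qquad \hat{\beta}_1 = \bar{Y}_{x=1} - \bar{Y}_{x=0}\]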
But what if a discrete predictor has more than two categories (i.e., C > 2)?
The “trick” is to break down the categories into a set of dichotomous variables, and then include these sets as predictors in the model
Breaking categories into sets of dichotomous variables means creating pairwise comparisons of categories (e.g., New Democratic vs. Progressive Conservative, New Democratic vs. Green party)
There are two approaches to doing this (i.e., breaking categories into sets):
To illustrate MLR with categorical variables, we will use a new dataset (finally!) taken from Baumann et al. (1992)
Grade 4 kids (N = 66) were randomly assigned to one of three groups in an intervention designed to improve reading comprehension
The groups were evenly distributed such that each group had 22 participants
After each group received its respective intervention, participants completed a test designed to evaluate their reading comprehension
In this example, the experimental manipulation has a control group and two treatment groups (i.e., three categories; C = 3)
Test Yourself!
Considering this research design, which statistical approach would be suitable for this example?
Regression with a Categorical Predictor is a One-Way ANOVA
The one-way ANOVA is simply an omnibus test that tells you whether the variability between groups is significantly more than the variability within groups
We know that we get macro- and micro-level information from our regression model
At the macro level, we obtain an F test of the overall model fit and \(R^2\), which are exactly what we would have obtained from a one-way ANOVA with its omnibus F test and eta-squared, \(\eta^2\)
The analysis of regression variance table is exactly the same as the one-way ANOVA table, with \(SS_{model}\) corresponding to \(SS_{between}\) and \(SS_{error}\) corresponding to \(SS_{within}\)
Further, \(R^2=\eta^2\)!!! (No way, what!?!?!?)
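In symbols, the correspondence is:
\[F = \frac{SS_{model}/df_{model}}{SS_{error}/df_{error}} = \frac{MS_{between}}{MS_{within}}, \qquad R^2 = \frac{SS_{model}}{SS_{total}} = \frac{SS_{between}}{SS_{total}} = \eta^2\]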
We would like to develop a statistical model to represent the outcome variable (i.e., reading comprehension) as a function of the predictor variable (i.e., treatment type)
But, before modeling, we should plot the data and examine the descriptive statistics:
library(haven); library(e1071); library(kableExtra); library(dplyr); library(ggplot2) # dplyr is needed for %>%/group_by()/summarise(); ggplot2 for the plot below
read <- read_sav("reading.sav")
descriptive_stats <- read %>% group_by(Group=as_factor(group)) %>% # Converts to factor using haven labels
summarise(Count = n(),
Mean = mean(posttest1, na.rm = TRUE), SD = sd(posttest1, na.rm = TRUE),
Min = min(posttest1, na.rm = TRUE), Max = max(posttest1, na.rm = TRUE),
Skewness = skewness(posttest1, na.rm = TRUE), Kurtosis = kurtosis(posttest1, na.rm = TRUE),
.groups = 'drop') # Drops the grouping structure after summarizing
kable(descriptive_stats)
| Group | Count | Mean | SD | Min | Max | Skewness | Kurtosis |
|---|---|---|---|---|---|---|---|
| control | 22 | 6.681818 | 2.766920 | 2 | 12 | 0.3158039 | -0.9243863 |
| DRTA | 22 | 9.772727 | 2.724349 | 5 | 14 | -0.1089612 | -1.1941274 |
| TA | 22 | 7.772727 | 3.927095 | 1 | 15 | 0.2252122 | -1.2032715 |
ggplot(read, aes(x = as_factor(group), y = posttest1, fill = as_factor(group))) +
geom_violin(trim = FALSE, alpha = 0.4) +
geom_jitter(width = 0.15, color = "black", size = 1.5, alpha = 0.7) +
geom_boxplot(width = 0.1, alpha = 0.5, fill = "grey") +
labs(x = "Treatment Group", y = "Reading Comprehension", fill = "Treatment Group") +
theme_classic() + scale_fill_brewer(palette = "Dark2")
Membership in one of the three treatment groups is a discrete variable with a nominal scale of measurement
How can we include it as a predictor in a regression model? One approach is to use dummy-coded variables denoting group comparisons with 0s and 1s
We already saw the use of dummy codes when we included a gender variable in a regression model by setting female = 0 and male = 1
Because gender was a nominal variable scored with two levels (i.e., two categories), we needed one dummy variable to distinguish male from female
However, when a discrete independent variable has more than two levels, more than one dummy code is needed to compare the groups
In general, if a categorical variable has \(K\) levels, then \(K-1\) codes are needed to distinguish among them
In our reading example, the treatment variable has three levels (control, TA, and DRTA), so we will need two dummy codes to distinguish among them
In the creation of dummy codes, it is necessary to choose one group (i.e., one level of the independent variable) to represent a “reference” or “baseline” group, which is the group coded 0 on all dummy-coded variables
With the gender predictor, male was the level dummy-coded 0, and thus served as the reference group
In a simple regression model with dichotomous, dummy-coded \(X\), the intercept parameter represents the predicted mean of \(Y\) for the reference group
The slope parameter then represents the differences in means between the other group and the reference/baseline group
\[\hat{y}_i= {\color{deeppink} {\beta_0}} + {\color{darkcyan} {\beta_1}}x_i = {\color{deeppink} {\mu_{ref}}} + ({\color{darkcyan} {\mu_{other \ group}-\mu_{ref}}})x_i\]
In the new reading comprehension example, there is a control group
It makes sense and will be intuitive for interpretation to set the control group as the reference (although that choice is mostly arbitrary)
Thus, in this three-group situation, we need to create two new dummy-coded variables
\[ \begin{array}{lcc} \hline &{\textbf{Dummy-code variable values}} \\ \hline \textbf{Level of IV} & \textbf{D1} & \textbf{D2} \\ \hline \text{Control} & 0 & 0 \\ \text{DRTA} & 1 & 0 \\ \text{TA} & 0 & 1 \\ \hline \end{array} \]
\[ \begin{array}{ccccc} \hline i & \text{Reading score} & \text{Group} & D1 & D2 \\ \hline 1 & 5 & \text{control} & 0 & 0 \\ 2 & 9 & \text{control} & 0 & 0 \\ \vdots & \vdots & \vdots & \vdots & \vdots \\ 23 & 7 & \text{DRTA} & 1 & 0 \\ 24 & 5 & \text{DRTA} & 1 & 0 \\ \vdots & \vdots & \vdots & \vdots & \vdots \\ 45 & 11 & \text{TA} & 0 & 1 \\ 46 & 4 & \text{TA} & 0 & 1 \\ \hline \end{array} \]
Putting it together, we can specify a regression model for reading scores using the two dummy codes as predictors (rather than group):\[\hat{Reading \ score}_i = \beta_0 + \beta_1D1_i + \beta_2D2_i\]
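Given this coding, with the control group as the reference, the coefficients map directly onto the group means:
\[\beta_0 = \mu_{control}, \qquad \beta_1 = \mu_{DRTA} - \mu_{control}, \qquad \beta_2 = \mu_{TA} - \mu_{control}\]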
More generally, for a categorical predictor with \(k\) levels (and thus \(k-1\) dummy codes):\[\hat{y}_i = \beta_0 + \beta_1D1_i + \beta_2D2_i + \ ... \ + \beta_{k-1}D(k-1)_i\]
Let's first check how the grouping variable is stored in the data:
dbl+lbl [1:66] 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
@ format.spss: chr "F8.2"
@ labels : Named num [1:3] 0 1 2
..- attr(*, "names")= chr [1:3] "control" "DRTA" "TA"
group is a labelled numeric vector for now (numeric codes with SPSS value labels), let's convert it into a factor
If we want to dummy-code D1 and D2 ourselves, we can do:
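For example, a minimal sketch (as_factor() is used so the labels "DRTA" and "TA" can be matched directly, whether group is still a labelled vector or already a factor; the names D1 and D2 simply match the coding table above):
# Create the two dummy codes by hand; control is the reference group (0 on both)
read$D1 <- ifelse(as_factor(read$group) == "DRTA", 1, 0) # 1 = DRTA, 0 = otherwise
read$D2 <- ifelse(as_factor(read$group) == "TA", 1, 0)   # 1 = TA, 0 = otherwise
# Fit the regression with the two dummy codes as predictors
dummod <- lm(posttest1 ~ D1 + D2, data = read)
summary(dummod)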
Call:
lm(formula = posttest1 ~ D1 + D2, data = read)
Residuals:
Min 1Q Median 3Q Max
-6.7727 -2.6818 0.2273 2.2955 7.2273
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 6.6818 0.6798 9.829 2.44e-14 ***
D1 3.0909 0.9614 3.215 0.00206 **
D2 1.0909 0.9614 1.135 0.26078
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 3.189 on 63 degrees of freedom
Multiple R-squared: 0.1444, Adjusted R-squared: 0.1173
F-statistic: 5.317 on 2 and 63 DF, p-value: 0.007347
This model F test is the one-way ANOVA omnibus test of the null hypothesis that all three group means are equal:\[H_0: \mu_{control} = \mu_{DRTA} = \mu_{TA}\]
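One way to see this omnibus test explicitly in the regression framework is a nested model comparison against the intercept-only model (a quick sketch, assuming the fitted object is named dummod as above):
# Omnibus test of the group effect via model comparison
null_mod <- lm(posttest1 ~ 1, data = read) # intercept-only model
anova(null_mod, dummod) # reproduces the model F test: F(2, 63) = 5.32, p = .007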
Test Yourself (before you wreck yourself)!
What if we have more predictors in the model? How could we get the overall group effect then?
*X-files theme song playing*
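The coefficient table and confidence intervals shown below can be obtained with calls along these lines (a sketch, again assuming the fitted object is dummod):
summary(dummod)$coefficients # estimates, SEs, t values, and p values
confint(dummod)              # 95% confidence intervals for the coefficients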
Estimate Std. Error t value Pr(>|t|)
(Intercept) 6.681818 0.6797950 9.829167 2.443011e-14
D1 3.090909 0.9613753 3.215091 2.058348e-03
D2 1.090909 0.9613753 1.134738 2.607841e-01
2.5 % 97.5 %
(Intercept) 5.3233563 8.040280
D1 1.1697539 5.012064
D2 -0.8302461 3.012064
\(\hat{\beta}_0 = 6.68\), 95% CI [5.32, 8.04], which is statistically significant, t(63) = 9.83, p < .001
\(\hat{\beta}_1 = 3.09\), 95% CI [1.17, 5.01], which is significant, t(63) = 3.22, p = .002
\(\hat{\beta}_2 = 1.09\), 95% CI [-.83, 3.01], which is not significant, t(63) = 1.14, p = .26.
| Group | Mean |
|---|---|
| control | 6.68 |
| DRTA | 9.77 |
| TA | 7.77 |
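Notice how the coefficients reproduce these means exactly: \(\hat{\mu}_{control} = \hat{\beta}_0 = 6.68\), \(\hat{\mu}_{DRTA} = \hat{\beta}_0 + \hat{\beta}_1 = 6.68 + 3.09 = 9.77\), and \(\hat{\mu}_{TA} = \hat{\beta}_0 + \hat{\beta}_2 = 6.68 + 1.09 = 7.77\)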
Want to see something cool?
As long as your grouping variable is defined as a factor, R will dummy-code for you (but, look at the predictor names now!)
read$group <- as_factor(read$group) # converting the labelled variable to a factor
# Before
dummod <- lm(posttest1 ~ D1 + D2, data=read)
# After (it's the same!)
readmod <- lm(posttest1 ~ group, data=read)
summary(readmod)$coefficients # NOTE: the results are identical; the predictors represent the same things as D1 and D2, only their names are different
Estimate Std. Error t value Pr(>|t|)
(Intercept) 6.681818 0.6797950 9.829167 2.443011e-14
groupDRTA 3.090909 0.9613753 3.215091 2.058348e-03
groupTA 1.090909 0.9613753 1.134738 2.607841e-01
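If you want to see the coding R applied behind the scenes, you can inspect the factor's contrast matrix (treatment/dummy coding is the default for unordered factors):
contrasts(read$group) # a 0/1 dummy-coding matrix with control (the first level) as the reference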
Test Yourself (before you wreck yourself)!
How might we compare the DRTA group mean to the TA group mean?
Of course, we can redefine D1 and D2 manually
But, it’s easier to re-level the factored grouping variable
# Creating a new variable with a different reference group
read$group_relev <- relevel(read$group, ref = "TA")
# Refitting the model with the new variable
readmod_relev <- lm(posttest1 ~ group_relev, data=read)
summary(readmod_relev)$coefficients # NOTE: Now, group_relevDRTA is DRTA vs. TA
Estimate Std. Error t value Pr(>|t|)
(Intercept) 7.772727 0.6797950 11.433929 5.187793e-17
group_relevcontrol -1.090909 0.9613753 -1.134738 2.607841e-01
group_relevDRTA 2.000000 0.9613753 2.080353 4.156606e-02
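Note that the new intercept (7.77) is the TA group mean, and the group_relevDRTA estimate of 2.00 is exactly \(9.77 - 7.77\), the difference between the DRTA and TA group means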
We can also get every pairwise comparison at once with the pairs() function from the emmeans package
library(emmeans)
# Estimate the groups' marginal means
emm <- emmeans(readmod, ~ group)
# Pairwise comparisons for every unique pair
pairs(emm, adjust="none") # adjust= refers to multiplicity control (e.g., Tukey), but I don't know if I believe in MC
contrast estimate SE df t.ratio p.value
control - DRTA -3.09 0.961 63 -3.215 0.0021
control - TA -1.09 0.961 63 -1.135 0.2608
DRTA - TA 2.00 0.961 63 2.080 0.0416
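These three contrasts are just the comparisons we have already seen: control − DRTA = −3.09 (i.e., \(-\hat{\beta}_1\)), control − TA = −1.09 (\(-\hat{\beta}_2\)), and DRTA − TA = 2.00 (the releveled comparison), with identical SEs, t values, and p values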
As we’ve been discussing, ANOVA is part of the GLM, but many researchers operate solely within an ANOVA framework for many different reasons
It’s still surprisingly common for researchers to take a continuous variable (e.g., age) and artificially categorize it to run an ANOVA instead of a regression model
Module 6