Module 6: Categorical Predictor Variables

PSYC 3032 M

Udi Alter

Things Udi Should Remember



Before we start

  • Group presentation is coming up (March 25)
    • Post videos on eClass
  • Zoom/lecture recording
  • Assignment 1 is being graded!
  • Lab 4 is due next week

About Module 6

Goals for Today:

  • Finishing up predictive modeling
    • Expected prediction error (EPE)
    • The Bias-Variance Trade-off
    • Machine learning methods: Cross-validation techniques
  • Regression with categorical predictors
    • Categorical variables with more than 2 categories
    • Dummy-coded variables
    • Examplesssss
    • (Don’t) categorize continuous variables


How did your post-reading-week assignments/mid-terms for other classes go?


A) Excellent!

B) Pretty good

C) OK, not bad

D) Meh…

E) “Jesus, take the wheel!”


What type of movie would your life be right now?


A) Romance

B) Comedy

C) Horror/sad

D) Thriller

E) Drama


If you were a type of coffee, which one would you be?


A) Espresso – punchy and to the point

B) Latte – smooth and fancy

C) Americano – sophisticated and honest

D) Decaf – calm and collected

E) Frappuccino – sweet and cool




Categorical Predictor Variables

Categorical Predictor Variables

  • In this course we’ve been focusing almost exclusively on the relationship between continuous (i.e., quantitative) variables

    • But, we did learn about binary predictors (i.e., categories=2)
    • Not only that, but you’ve also experimented with that topic in Assignment 1
  • What’s exactly happening when we include categorical (a.k.a., qualitative) predictors in our regression model?

  • What happens when there are more than two categories!? How can we use and interpret multi-category, nominal, or ordinal predictors in our regression model?

Categorical Predictor Variables

  • In Module 3, we saw that it is straightforward to include a dichotomous/binary (i.e., categories=2) variable in a simple regression model

  • Because \(X\) has only two possible values, the relation between \(X\) and \(Y\) can only be linear!

  • When a dependent variable \(Y\) is regressed on a single, dichotomous predictor \(X\), the estimated regression line connects the mean of \(Y\) at one level of \(X\) to the mean of \(Y\) at the other level of \(X\)

  • The slope, \(\hat{\beta}_1\), of this line equals the difference between the two means, remember?
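The claim above is easy to verify with a quick sketch in R (made-up data, not the course dataset — the group means of 5 and 8 are arbitrary):

```r
# Toy data (hypothetical, for illustration only): a 0/1 binary predictor
set.seed(1)
y0 <- rnorm(30, mean = 5)                    # outcome in the group coded 0
y1 <- rnorm(30, mean = 8)                    # outcome in the group coded 1
dat <- data.frame(y = c(y0, y1), x = rep(c(0, 1), each = 30))

fit <- lm(y ~ x, data = dat)
# Intercept = mean of the group coded 0; slope = difference between means
all.equal(unname(coef(fit)[1]), mean(y0))             # TRUE
all.equal(unname(coef(fit)[2]), mean(y1) - mean(y0))  # TRUE
```

With only two predictor values, the "line of best fit" has no choice but to pass through the two group means, which is why the slope is exactly the mean difference.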

More Than 2 Categories

  • But what if a discrete predictor has more than two categories (i.e., C > 2)?

    • For example, an experimental manipulation could have a control group and two or more treatment groups
    • Or, given the recent elections in Ontario, a vote (or political party membership) can be a multi-category predictor variable
  • The “trick” is to break down the categories into a set of dichotomous variables, and then include these sets as predictors in the model

  • Breaking categories into sets of dichotomous variables means creating pairwise comparisons of categories (e.g., New Democratic vs. Progressive Conservative, New Democratic vs. Green party)

  • There are two approaches to doing this (i.e., breaking categories into sets):

    • Dummy Coding
    • Contrast Coding

Working Example

  • To illustrate MLR with categorical variables, we will use a new dataset (finally!) taken from Baumann et al. (1992)

  • Grade 4 kids (N = 66) were randomly assigned to one of three groups in an intervention designed to improve reading comprehension

    • The 1st group is a control condition,
    • The 2nd group is a directed-reading thinking activity (DRTA), and
    • The 3rd group is a “think aloud” (TA) approach
  • The groups were evenly distributed such that each group had 22 participants

  • After each group received its respective intervention, participants completed a test designed to evaluate their reading comprehension

  • In this example, the experimental manipulation has a control group and two treatment groups (i.e., three categories; C = 3)

Test Yourself!

Considering this research design, which statistical approach would be suitable for this example?

Small Detour

Regression with a Categorical Predictor is a One-Way ANOVA

  • The one-way ANOVA is simply an omnibus test that tells you whether the variability between groups is significantly more than the variability within groups

  • We know that we get macro- and micro-level information from our regression model

  • At the macro level, we obtain an F test for the model fit, \(R^2\), which is exactly the same as we would have obtained using a one-way ANOVA and eta-squared, \(\eta^2\)

  • The analysis of regression variance table is exactly the same as the one-way ANOVA table substituting \(SS_{between}\) for \(SS_{model}\) and \(SS_{within}\) for \(SS_{error}\)

  • Further, \(R^2=\eta^2\)!!! (No way, what!?!?!?)

    • Yes, both represent the proportion of variability in the outcome explained by the model; stated differently, are we able to reliably distinguish scores on the outcome by knowing which group a person belongs to?
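To see the \(R^2=\eta^2\) identity numerically, here is a minimal sketch using simulated three-group data (not the reading dataset; group means are arbitrary):

```r
# Three groups of 20 with different population means
set.seed(42)
g <- factor(rep(c("a", "b", "c"), each = 20))
y <- c(rnorm(20, 5), rnorm(20, 7), rnorm(20, 6))

r2 <- summary(lm(y ~ g))$r.squared           # R^2 from the regression

ss <- summary(aov(y ~ g))[[1]][["Sum Sq"]]   # SS_between, SS_within
eta2 <- ss[1] / sum(ss)                      # eta^2 = SS_between / SS_total

all.equal(r2, eta2)                          # TRUE: same quantity
```

Both divide the model (between-groups) sum of squares by the total sum of squares, so they can never disagree.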

Working Example

  • We would like to develop a statistical model to represent the outcome variable (i.e., reading comprehension) as a function of the predictor variable (i.e., treatment type)

  • But, before modeling, we should plot the data and examine the descriptive statistics:

library(haven); library(dplyr); library(e1071); library(kableExtra)
read <- read_sav("reading.sav")
descriptive_stats <- read %>% group_by(Group=as_factor(group)) %>%  # Converts to factor using haven labels
                              summarise(Count = n(), 
                              Mean = mean(posttest1, na.rm = TRUE), SD = sd(posttest1, na.rm = TRUE), 
                              Min = min(posttest1, na.rm = TRUE), Max = max(posttest1, na.rm = TRUE),
                              Skewness = skewness(posttest1, na.rm = TRUE), Kurtosis = kurtosis(posttest1, na.rm = TRUE),
                             .groups = 'drop') # Drops the grouping structure after summarizing
kable(descriptive_stats)
Group Count Mean SD Min Max Skewness Kurtosis
control 22 6.681818 2.766920 2 12 0.3158039 -0.9243863
DRTA 22 9.772727 2.724349 5 14 -0.1089612 -1.1941274
TA 22 7.772727 3.927095 1 15 0.2252122 -1.2032715

Working Example

library(ggplot2)
ggplot(read, aes(x = as_factor(group), y = posttest1, fill = as_factor(group))) +
  geom_violin(trim = FALSE, alpha = 0.4) +
  geom_jitter(width = 0.15, color = "black", size = 1.5, alpha = 0.7) +
  geom_boxplot(width = 0.1, alpha = 0.5, fill = "grey") +
  labs(x = "Treatment Group", y = "Reading Comprehension", fill = "Treatment Group") +
  theme_classic() + scale_fill_brewer(palette = "Dark2")

Dummy Coding

  • Membership in one of the three treatment groups is a discrete variable with a nominal scale of measurement

  • How can we include it as a predictor in a regression model? One approach is to use dummy-coded variables denoting group comparisons with 0s and 1s

  • We already saw the use of dummy codes when we included a gender variable in a regression model by setting female = 0 and male = 1

  • Because gender was a nominal variable scored with two levels (i.e., two categories), we needed one dummy variable to distinguish male from female

  • However, when a discrete independent variable has more than two levels, more than one dummy code is needed to compare the groups

  • In general, if a categorical variable has \(K\) levels, then \(K-1\) codes are needed to distinguish among them

How many dummy-coded variables do we need for our 3-treatment groups example?

Dummy Coding

  • In our reading example, the treatment variable has three levels (control, TA, and DRTA), so we will need two dummy codes to distinguish among them

  • In the creation of dummy codes, it is necessary to choose one group (i.e., one level of the independent variable) to represent a “reference” or “baseline” group, which is the group coded 0 on all dummy-coded variables

  • With the gender predictor, male was the level dummy-coded 0, and thus served as the reference group

  • In a simple regression model with dichotomous, dummy-coded \(X\), the intercept parameter represents the predicted mean of \(Y\) for the reference group

  • The slope parameter then represents the differences in means between the other group and the reference/baseline group

\[\hat{y}_i= {\color{deeppink} {\beta_0}} + {\color{darkcyan} {\beta_1}}x_i = {\color{deeppink} {\mu_{ref}}} + ({\color{darkcyan} {\mu_{other \ group}-\mu_{ref}}})x_i\]

Back to Example

  • In the new reading comprehension example, there is a control group

  • It makes sense and will be intuitive for interpretation to set the control group as the reference (although that choice is mostly arbitrary)

  • Thus, in this three-group situation, we need to create two new dummy-coded variables

    • One variable represents control vs. DRTA (variable D1)
    • Another variable represents control vs. TA (variable D2)
    • Consequently, the control group is the reference; it’s coded 0 on both D1 and D2

\[ \begin{array}{lcc} \hline &{\textbf{Dummy-code variable values}} \\ \hline \textbf{Level of IV} & \textbf{D1} & \textbf{D2} \\ \hline \text{Control} & 0 & 0 \\ \text{DRTA} & 1 & 0 \\ \text{TA} & 0 & 1 \\ \hline \end{array} \]

Reading Example

  • Once the dummy variables are created, the dataset will look something like this (here, showing data only for two participants in each group):

\[ \begin{array}{cccc} \hline i & \text{Reading score} & \text{Group} & D1 & D2 \\ \hline 1 & 5 & \text{control} & 0 & 0 \\ 2 & 9 & \text{control} & 0 & 0 \\ \vdots & \vdots & \vdots & \vdots & \vdots \\ 23 & 7 & \text{DRTA} & 1 & 0 \\ 24 & 5 & \text{DRTA} & 1 & 0 \\ \vdots & \vdots & \vdots & \vdots & \vdots \\ 45 & 11 & \text{TA} & 0 & 1 \\ 46 & 4 & \text{TA} & 0 & 1 \\ \hline \end{array} \]
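This 0/1 coding scheme is exactly what R's default "treatment" contrasts produce. A quick way to see it is `model.matrix()` on a toy factor (a hypothetical three-level variable; the first level is the reference):

```r
# First factor level ("control") becomes the reference group
grp <- factor(c("control", "DRTA", "TA"), levels = c("control", "DRTA", "TA"))
m <- model.matrix(~ grp)
m
# Row 1 (control): 0 on both dummy columns (reference)
# Row 2 (DRTA):    grpDRTA = 1, grpTA = 0
# Row 3 (TA):      grpDRTA = 0, grpTA = 1
```

The columns `grpDRTA` and `grpTA` are the D1 and D2 variables from the table above, just with auto-generated names.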

Reading Example

  • D1 and D2 then become the predictors in a multiple regression model (instead of the original grouping variable, e.g., group):

\[\widehat{\text{Reading score}}_i = \beta_0 + \beta_1D1_i + \beta_2D2_i\]

  • We can generalize to any hypothetical discrete independent variable with \(K\) levels:

\[\hat{y}_i = \beta_0 + \beta_1D1_i + \beta_2D2_i + \ ... \ + \beta_{K-1}D(K-1)_i\]

str(read$group)
 dbl+lbl [1:66] 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
 @ format.spss: chr "F8.2"
 @ labels     : Named num [1:3] 0 1 2
  ..- attr(*, "names")= chr [1:3] "control" "DRTA" "TA"
  • group is a labelled numeric vector for now (see the str() output above); let’s convert it into a factor

  • If we want to dummy-code D1 and D2 ourselves, we can do:

read$group <- as_factor(read$group) # converting labelled numeric to factor

# Option 1
read$D1 <- ifelse(read$group == "DRTA", 1, 0)
read$D2 <- ifelse(read$group == "TA", 1, 0)
# Option 2
read <- read %>%
  mutate(D1 = as.integer(group == "DRTA"),
         D2 = as.integer(group == "TA"))

  • We can estimate the model with:
dummod <- lm(posttest1 ~ D1 + D2, data=read)
summary(dummod)

Call:
lm(formula = posttest1 ~ D1 + D2, data = read)

Residuals:
    Min      1Q  Median      3Q     Max 
-6.7727 -2.6818  0.2273  2.2955  7.2273 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   6.6818     0.6798   9.829 2.44e-14 ***
D1            3.0909     0.9614   3.215  0.00206 ** 
D2            1.0909     0.9614   1.135  0.26078    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3.189 on 63 degrees of freedom
Multiple R-squared:  0.1444,    Adjusted R-squared:  0.1173 
F-statistic: 5.317 on 2 and 63 DF,  p-value: 0.007347

Overall Group Effect

  • We might first like to measure the overall group effect, in other words, how strong is the overall (omnibus) effect of treatment group on reading scores?
summary(dummod)$r.squared
[1] 0.1444271
  • Multiple \(R^2\) summarizes the effect of the two dummy variables taken together
    • Because the two dummy variables comprehensively represent the original three-level independent variable, \(R^2\) summarizes the extent to which the dependent variable varies based on intervention group membership
  • With the dummy-variable MLR model fitted to the reading comprehension data using OLS estimation, \(R^2\) = .14, a small, but likely meaningful/interesting effect, which is significant with F(2, 63) = 5.32, p = .007

Overall Group Effect

  • This F test is exactly the same as we would have obtained using a one-way ANOVA to test the omnibus null hypothesis that the population mean test scores of the three groups are equal:

\[H_0: \mu_{control} = \mu_{DRTA} = \mu_{TA}\]



Test Yourself (before you wreck yourself)!

What if we have more predictors in the model, how could we get the overall group effect then?

*X-files theme song playing*

Parameter Estimate Interpretation

summary(dummod)$coefficients
            Estimate Std. Error  t value     Pr(>|t|)
(Intercept) 6.681818  0.6797950 9.829167 2.443011e-14
D1          3.090909  0.9613753 3.215091 2.058348e-03
D2          1.090909  0.9613753 1.134738 2.607841e-01
confint(dummod)
                 2.5 %   97.5 %
(Intercept)  5.3233563 8.040280
D1           1.1697539 5.012064
D2          -0.8302461 3.012064
  • \(\hat{\beta}_0 = 6.68\), 95% CI [5.32, 8.04], which is statistically significant, t(63) = 9.83, p < .001

  • \(\hat{\beta}_1 = 3.09\), 95% CI [1.17, 5.01], which is significant, t(63) = 3.22, p = .002,

  • \(\hat{\beta}_2 = 1.09\), 95% CI [-.83, 3.01], which is not significant, t(63) = 1.14, p = .26.

Parameter Estimate Interpretation

  • \(\hat{\beta_0}=6.68\) gives the expected (i.e., mean) reading comprehension score for the control group
    • Which makes sense because the intercept represents the predicted value for a hypothetical participant whose values for all predictors equal 0
    • Here, a participant with a value of 0 for both predictors, D1 and D2, is a member of the control group, remember?
  • The t test for this intercept parameter tells us that the control group’s reading comprehension score is likely to be different from zero (not whether it is different from another group)

Parameter Estimate Interpretation

  • \(\hat{\beta_1}=3.09\) gives the expected difference in \(y\) (i.e., reading comprehension) when D1 increases by one unit, and therefore reflects the mean difference between the control and DRTA groups on reading comprehension
    • That is, children in the DRTA treatment scored on average 3.09 points higher on the reading comprehension scale than the children in the control group post-intervention (see the violin plots)
    • This mean difference is about 3 points on a scale from 1 to 15 which is about \(\frac{3}{15}=.2\) or 20% of the entire scale range; I’d say that’s somewhere between a small- and a medium-sized effect!
    • Additionally, because \(\hat{\beta_1}\) is statistically significant, it’s unlikely that the difference between the groups on the reading comprehension is zero

Parameter Estimate Interpretation

  • Similarly, \(\hat{\beta_2}=1.09\) gives the expected difference in \(y\) when D2 increases by one unit, which reflects the mean difference between the control and TA groups on reading comprehension
    • That is, children in the TA treatment scored on average 1.09 points higher on the reading comprehension scale than the children in the control group post-intervention (see table below)
    • The mean difference is ~1 point which is about \(\frac{1}{15}=.066\) or 6.6% of the entire scale range; a rather small effect
    • Additionally, because \(\hat{\beta_2}\) is not statistically significant, we have insufficient evidence to infer that the two populations (i.e., no intervention and TA) differ on reading comprehension
Group Mean
control 6.68
DRTA 9.77
TA 7.77
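The three means in the table can be rebuilt directly from the regression coefficients (values copied from the summary(dummod) output above):

```r
# Coefficient estimates copied from the earlier model output
b0 <- 6.681818   # intercept      = control mean
b1 <- 3.090909   # D1 coefficient = DRTA - control
b2 <- 1.090909   # D2 coefficient = TA - control

round(c(control = b0, DRTA = b0 + b1, TA = b0 + b2), 2)
# control    DRTA      TA
#    6.68    9.77    7.77   -- matches the descriptive-statistics table
```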

Let R Be the Dummy (Coder)

  • Want to see something cool?

  • As long as your grouping variable is defined as a factor, R will dummy-code for you (but, look at the predictor names now!)

read$group <- as_factor(read$group) # converting labelled numeric to factor
# Before
dummod <- lm(posttest1 ~ D1 + D2, data=read)
# After (it's the same!)
readmod <- lm(posttest1 ~ group, data=read)
summary(readmod)$coefficients # NOTE: the results are identical; the predictors represent the same things as D1 and D2, only their names are different
            Estimate Std. Error  t value     Pr(>|t|)
(Intercept) 6.681818  0.6797950 9.829167 2.443011e-14
groupDRTA   3.090909  0.9613753 3.215091 2.058348e-03
groupTA     1.090909  0.9613753 1.134738 2.607841e-01

Test Yourself (before you wreck yourself)!

How might we compare the DRTA group mean to the TA group mean?

Let R Be the Dummy (Coder)

  • Of course, we can redefine D1 and D2 manually

  • But, it’s easier to re-level the factored grouping variable

# Creating a new variable with a different reference group
read$group_relev <- relevel(read$group, ref = "TA") 
# Refitting the model with the new variable
readmod_relev <- lm(posttest1 ~ group_relev, data=read)
summary(readmod_relev)$coefficients # NOTE: Now, group_relevDRTA is DRTA vs. TA
                    Estimate Std. Error   t value     Pr(>|t|)
(Intercept)         7.772727  0.6797950 11.433929 5.187793e-17
group_relevcontrol -1.090909  0.9613753 -1.134738 2.607841e-01
group_relevDRTA     2.000000  0.9613753  2.080353 4.156606e-02

Let R Be the Dummy (Coder)

  • And perhaps even easier is to use the pairs() function from the emmeans package
library(emmeans)
# Estimate the groups' marginal means
emm <- emmeans(readmod, ~ group) 
# Pairwise comparisons for every unique pair
pairs(emm, adjust="none") # adjust= refers to multiplicity control (e.g., Tukey), but I don't know if I believe in MC
 contrast       estimate    SE df t.ratio p.value
 control - DRTA    -3.09 0.961 63  -3.215  0.0021
 control - TA      -1.09 0.961 63  -1.135  0.2608
 DRTA - TA          2.00 0.961 63   2.080  0.0416

Categorizing Continuous Variables (Just Don’t!)

  • As we’ve been discussing, ANOVA is part of the GLM, but many researchers operate solely within an ANOVA framework for many different reasons

  • It’s still surprisingly common for researchers to take a continuous variable (e.g., age) and artificially categorize it to run an ANOVA instead of a regression model

    • E.g., if you have ages between 13 to 80 years, “cut” it into categories of teens (13-19), young adults (20-35), adults (36-60), and older adults (> 61) and use the age-categories variable as predictor instead of age

Unless there is a good theoretical justification, YOU SHOULD NOT DO THIS!!!

  • Artificially categorizing variables (especially using techniques like median splits) creates measurement error, reduces variability and power, and can result in misleading results/conclusions!
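A small simulation makes the information loss concrete (hedged: arbitrary n and effect size, not real data) by comparing \(R^2\) for a continuous predictor before and after a median split:

```r
# Simulate a continuous predictor x linearly related to outcome y
set.seed(2025)
n <- 500
x <- rnorm(n)
y <- 0.4 * x + rnorm(n)

r2_cont  <- summary(lm(y ~ x))$r.squared          # full continuous predictor
x_split  <- as.integer(x > median(x))             # the dreaded median split
r2_split <- summary(lm(y ~ x_split))$r.squared    # dichotomized predictor

c(continuous = r2_cont, median_split = r2_split)
# the dichotomized predictor explains noticeably less variance
```

All the within-half variability in x is discarded by the split, so the dichotomized predictor systematically explains less variance and yields less powerful tests.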