Module 6: Categorical Predictor Variables

PSYC 3032 M

Udi Alter

Things Udi Should Remember



Before we start

  • Group presentation is coming up (March 25)
    • Post videos on eClass
  • Zoom/lecture recording
  • Assignment 1 is being graded!
  • Lab 4 is due next week

About Module 6

Goals for Today:

  • Finishing up predictive modeling
    • Expected prediction error (EPE)
    • The Bias-Variance Trade-off
    • Machine learning methods: Cross-validation techniques
  • Regression with categorical predictors
    • Categorical variables with more than 2 categories
    • Dummy-coded variables
    • Examplesssss
    • (Don’t) categorize continuous variables


How did your post-reading-week assignments/mid-terms for other classes go?


A) Excellent!

B) Pretty good

C) OK, not bad

D) Meh…

E) “Jesus, take the wheel!”


What type of movie would your life be right now?


A) Romance

B) Comedy

C) Horror/sad

D) Thriller

E) Drama


If you were a type of coffee, which one would you be?


A) Espresso – punchy and to the point

B) Latte – smooth and fancy

C) Americano – sophisticated and honest

D) Decaf – calm and collected

E) Frappuccino – sweet and cool




Categorical Predictor Variables

Categorical Predictor Variables

  • In this course we’ve been focusing almost exclusively on the relationship between continuous (i.e., quantitative) variables

    • But, we did learn about binary predictors (i.e., categories=2)
    • Not only that, but you’ve also experimented with that topic in Assignment 1
  • What’s exactly happening when we include categorical (a.k.a., qualitative) predictors in our regression model?

  • What happens when there are more than two categories!? How can we use and interpret multi-category, nominal, or ordinal predictors in our regression model?

Categorical Predictor Variables

  • In Module 3, we saw that it is straightforward to include a dichotomous/binary (i.e., categories=2) variable in a simple regression model

  • Because \(X\) has only two possible values, the relation between \(X\) and \(Y\) can only be linear!

  • When a dependent variable \(Y\) is regressed on a single, dichotomous predictor \(X\), the estimated regression line connects the mean of \(Y\) at one level of \(X\) to the mean of \(Y\) at the other level of \(X\)

  • The slope, \(\hat{\beta}_1\), of this line equals the difference between the two means, remember?
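The claim above is easy to verify with a quick sketch in R (made-up data, not the course dataset — the group means of 5 and 8 are arbitrary):

```r
# Toy data (hypothetical, for illustration only): a 0/1 binary predictor
set.seed(1)
y0 <- rnorm(30, mean = 5)                    # outcome in the group coded 0
y1 <- rnorm(30, mean = 8)                    # outcome in the group coded 1
dat <- data.frame(y = c(y0, y1), x = rep(c(0, 1), each = 30))

fit <- lm(y ~ x, data = dat)
# Intercept = mean of the group coded 0; slope = difference between means
all.equal(unname(coef(fit)[1]), mean(y0))             # TRUE
all.equal(unname(coef(fit)[2]), mean(y1) - mean(y0))  # TRUE
```

With only two predictor values, the "line of best fit" has no choice but to pass through the two group means, which is why the slope is exactly the mean difference.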

More Than 2 Categories

  • But what if a discrete predictor has more than two categories (i.e., C > 2)?

    • For example, an experimental manipulation could have a control group and two or more treatment groups
    • Or, given the recent elections in Ontario, a vote (or political party membership) can be a multi-category predictor variable
  • The “trick” is to break down the categories into a set of dichotomous variables, and then include these sets as predictors in the model

  • Breaking categories into sets of dichotomous variables means creating pairwise comparisons of categories (e.g., New Democratic vs. Progressive Conservative, New Democratic vs. Green party)

  • There are two approaches to doing this (i.e., breaking categories into sets):

    • Dummy Coding
    • Contrast Coding

Working Example

  • To illustrate MLR with categorical variables, we will use a new dataset (finally!) taken from Baumann et al. (1992)

  • Grade 4 kids (N = 66) were randomly assigned to one of three groups in an intervention designed to improve reading comprehension

    • The 1st group is a control condition,
    • The 2nd group is a directed-reading thinking activity (DRTA), and
    • The 3rd group is a “think aloud” (TA) approach
  • The groups were evenly distributed such that each group had 22 participants

  • After each group received its respective intervention, participants completed a test designed to evaluate their reading comprehension

  • In this example, the experimental manipulation has a control group and two treatment groups (i.e., three categories; C = 3)

Test Yourself!

Considering this research design, which statistical approach would be suitable for this example?

Small Detour

Regression with a Categorical Predictor is a One-Way ANOVA

  • The one-way ANOVA is simply an omnibus test that tells you whether the variability between groups is significantly more than the variability within groups

  • We know that we get macro- and micro-level information from our regression model

  • At the macro level, we obtain an F test for the model fit, \(R^2\), which is exactly the same as we would have obtained using a one-way ANOVA and eta-squared, \(\eta^2\)

  • The analysis of regression variance table is exactly the same as the one-way ANOVA table substituting \(SS_{between}\) for \(SS_{model}\) and \(SS_{within}\) for \(SS_{error}\)

  • Further, \(R^2=\eta^2\)!!! (No way, what!?!?!?)

    • Yes, both represent the proportion of variability in the outcome explained by the model; stated differently, are we able to reliably distinguish scores on the outcome by knowing which group a person belongs to?
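To see the \(R^2=\eta^2\) identity numerically, here is a minimal sketch using simulated three-group data (not the reading dataset; group means are arbitrary):

```r
# Three groups of 20 with different population means
set.seed(42)
g <- factor(rep(c("a", "b", "c"), each = 20))
y <- c(rnorm(20, 5), rnorm(20, 7), rnorm(20, 6))

r2 <- summary(lm(y ~ g))$r.squared           # R^2 from the regression

ss <- summary(aov(y ~ g))[[1]][["Sum Sq"]]   # SS_between, SS_within
eta2 <- ss[1] / sum(ss)                      # eta^2 = SS_between / SS_total

all.equal(r2, eta2)                          # TRUE: same quantity
```

Both divide the model (between-groups) sum of squares by the total sum of squares, so they can never disagree.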

Working Example

  • We would like to develop a statistical model to represent the outcome variable (i.e., reading comprehension) as a function of the predictor variable (i.e., treatment type)

  • But, before modeling, we should plot the data and examine the descriptive statistics:

library(haven); library(dplyr); library(e1071); library(kableExtra)
read <- read_sav("reading.sav")
descriptive_stats <- read %>% group_by(Group=as_factor(group)) %>%  # Converts to factor using haven labels
                              summarise(Count = n(), 
                              Mean = mean(posttest1, na.rm = TRUE), SD = sd(posttest1, na.rm = TRUE), 
                              Min = min(posttest1, na.rm = TRUE), Max = max(posttest1, na.rm = TRUE),
                              Skewness = skewness(posttest1, na.rm = TRUE), Kurtosis = kurtosis(posttest1, na.rm = TRUE),
                             .groups = 'drop') # Drops the grouping structure after summarizing
kable(descriptive_stats)
Group Count Mean SD Min Max Skewness Kurtosis
control 22 6.681818 2.766920 2 12 0.3158039 -0.9243863
DRTA 22 9.772727 2.724349 5 14 -0.1089612 -1.1941274
TA 22 7.772727 3.927095 1 15 0.2252122 -1.2032715

Working Example

library(ggplot2)
ggplot(read, aes(x = as_factor(group), y = posttest1, fill = as_factor(group))) +
  geom_violin(trim = FALSE, alpha = 0.4) +
  geom_jitter(width = 0.15, color = "black", size = 1.5, alpha = 0.7) +
  geom_boxplot(width = 0.1, alpha = 0.5, fill = "grey") +
  labs(x = "Treatment Group", y = "Reading Comprehension", fill = "Treatment Group") +
  theme_classic() + scale_fill_brewer(palette = "Dark2")

Dummy Coding

  • Membership in one of the three treatment groups is a discrete variable with a nominal scale of measurement

  • How can we include it as a predictor in a regression model? One approach is to use dummy-coded variables denoting group comparisons with 0s and 1s

  • We already saw the use of dummy codes when we included a gender variable in a regression model by setting female = 0 and male = 1

  • Because gender was a nominal variable scored with two levels (i.e., two categories), we needed one dummy variable to distinguish male from female

  • However, when a discrete independent variable has more than two levels, more than one dummy code is needed to compare the groups

  • In general, if a categorical variable has \(K\) levels, then \(K-1\) codes are needed to distinguish among them

How many dummy-coded variables do we need for our 3-treatment groups example?

Dummy Coding

  • In our reading example, the treatment variable has three levels (control, TA, and DRTA), so we will need two dummy codes to distinguish among them

  • In the creation of dummy codes, it is necessary to choose one group (i.e., one level of the independent variable) to represent a “reference” or “baseline” group, which is the group coded 0 on all dummy-coded variables

  • With the gender predictor, male was the level dummy-coded 0, and thus served as the reference group

  • In a simple regression model with dichotomous, dummy-coded \(X\), the intercept parameter represents the predicted mean of \(Y\) for the reference group

  • The slope parameter then represents the differences in means between the other group and the reference/baseline group

\[\hat{y}_i= {\color{deeppink} {\beta_0}} + {\color{darkcyan} {\beta_1}}x_i = {\color{deeppink} {\mu_{ref}}} + ({\color{darkcyan} {\mu_{other \ group}-\mu_{ref}}})x_i\]

Back to Example

  • In the new reading comprehension example, there is a control group

  • It makes sense and will be intuitive for interpretation to set the control group as the reference (although that choice is mostly arbitrary)

  • Thus, in this three-group situation, we need to create two new dummy-coded variables

    • One variable represents control vs. DRTA (variable D1)
    • Another variable represents control vs. TA (variable D2)
    • Consequently, the control group is the reference; it’s coded 0 on both D1 and D2

\[ \begin{array}{lcc} \hline &{\textbf{Dummy-code variable values}} \\ \hline \textbf{Level of IV} & \textbf{D1} & \textbf{D2} \\ \hline \text{Control} & 0 & 0 \\ \text{DRTA} & 1 & 0 \\ \text{TA} & 0 & 1 \\ \hline \end{array} \]

Reading Example

  • Once the dummy variables are created, the dataset will look something like this (here, showing data only for two participants in each group):

\[ \begin{array}{cccc} \hline i & \text{Reading score} & \text{Group} & D1 & D2 \\ \hline 1 & 5 & \text{control} & 0 & 0 \\ 2 & 9 & \text{control} & 0 & 0 \\ \vdots & \vdots & \vdots & \vdots & \vdots \\ 23 & 7 & \text{DRTA} & 1 & 0 \\ 24 & 5 & \text{DRTA} & 1 & 0 \\ \vdots & \vdots & \vdots & \vdots & \vdots \\ 45 & 11 & \text{TA} & 0 & 1 \\ 46 & 4 & \text{TA} & 0 & 1 \\ \hline \end{array} \]
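This 0/1 coding scheme is exactly what R's default "treatment" contrasts produce. A quick way to see it is `model.matrix()` on a toy factor (a hypothetical three-level variable; the first level is the reference):

```r
# First factor level ("control") becomes the reference group
grp <- factor(c("control", "DRTA", "TA"), levels = c("control", "DRTA", "TA"))
m <- model.matrix(~ grp)
m
# Row 1 (control): 0 on both dummy columns (reference)
# Row 2 (DRTA):    grpDRTA = 1, grpTA = 0
# Row 3 (TA):      grpDRTA = 0, grpTA = 1
```

The columns `grpDRTA` and `grpTA` are the D1 and D2 variables from the table above, just with auto-generated names.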

Reading Example

  • D1 and D2 then become the predictors in a multiple regression model (instead of the original grouping variable, e.g., group):

\[\widehat{\text{Reading score}}_i = \beta_0 + \beta_1D1_i + \beta_2D2_i\]

  • We can generalize to any hypothetical discrete independent variable with \(K\) levels:

\[\hat{y}_i = \beta_0 + \beta_1D1_i + \beta_2D2_i + \ ... \ + \beta_{K-1}D(K-1)_i\]

str(read$group)
 dbl+lbl [1:66] 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
 @ format.spss: chr "F8.2"
 @ labels     : Named num [1:3] 0 1 2
  ..- attr(*, "names")= chr [1:3] "control" "DRTA" "TA"
  • group is a labelled numeric vector for now (see the str() output above); let’s convert it into a factor

  • If we want to dummy-code D1 and D2 ourselves, we can do:

read$group <- as_factor(read$group) # converting labelled numeric to factor

# Option 1
read$D1 <- ifelse(read$group == "DRTA", 1, 0)
read$D2 <- ifelse(read$group == "TA", 1, 0)
# Option 2
read <- read %>%
  mutate(D1 = as.integer(group == "DRTA"),
         D2 = as.integer(group == "TA"))

  • We can estimate the model with:
dummod <- lm(posttest1 ~ D1 + D2, data=read)
summary(dummod)

Call:
lm(formula = posttest1 ~ D1 + D2, data = read)

Residuals:
    Min      1Q  Median      3Q     Max 
-6.7727 -2.6818  0.2273  2.2955  7.2273 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   6.6818     0.6798   9.829 2.44e-14 ***
D1            3.0909     0.9614   3.215  0.00206 ** 
D2            1.0909     0.9614   1.135  0.26078    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3.189 on 63 degrees of freedom
Multiple R-squared:  0.1444,    Adjusted R-squared:  0.1173 
F-statistic: 5.317 on 2 and 63 DF,  p-value: 0.007347

Overall Group Effect

  • We might first like to measure the overall group effect, in other words, how strong is the overall (omnibus) effect of treatment group on reading scores?
summary(dummod)$r.squared
[1] 0.1444271
  • Multiple \(R^2\) summarizes the effect of the two dummy variables taken together
    • Because the two dummy variables comprehensively represent the original three-level independent variable, \(R^2\) summarizes the extent to which the dependent variable varies based on intervention group membership
  • With the dummy-variable MLR model fitted to the reading comprehension data using OLS estimation, \(R^2\) = .14, a small, but likely meaningful/interesting effect, which is significant with F(2, 63) = 5.32, p = .007

Overall Group Effect

  • This F test is exactly the same as we would have obtained using a one-way ANOVA to test the omnibus null hypothesis that the population mean test scores of the three groups are equal:

\[H_0: \mu_{control} = \mu_{DRTA} = \mu_{TA}\]



Test Yourself (before you wreck yourself)!

What if we have more predictors in the model, how could we get the overall group effect then?

*X-files theme song playing*

Parameter Estimate Interpretation

summary(dummod)$coefficients
            Estimate Std. Error  t value     Pr(>|t|)
(Intercept) 6.681818  0.6797950 9.829167 2.443011e-14
D1          3.090909  0.9613753 3.215091 2.058348e-03
D2          1.090909  0.9613753 1.134738 2.607841e-01
confint(dummod)
                 2.5 %   97.5 %
(Intercept)  5.3233563 8.040280
D1           1.1697539 5.012064
D2          -0.8302461 3.012064
  • \(\hat{\beta}_0 = 6.68\), 95% CI [5.32, 8.04], which is statistically significant, t(63) = 9.83, p < .001

  • \(\hat{\beta}_1 = 3.09\), 95% CI [1.17, 5.01], which is significant, t(63) = 3.22, p = .002,

  • \(\hat{\beta}_2 = 1.09\), 95% CI [-.83, 3.01], which is not significant, t(63) = 1.14, p = .26.

Parameter Estimate Interpretation

  • \(\hat{\beta_0}=6.68\) gives the expected (i.e., mean) reading comprehension score for the control group
    • Which makes sense because the intercept represents the predicted value for a hypothetical participant whose values for all predictors equal 0
    • Here, a participant with a value of 0 for both predictors, D1 and D2, is a member of the control group, remember?
  • The t test for this intercept parameter tells us that the control group’s reading comprehension score is likely to be different from zero (not whether it is different from another group)

Parameter Estimate Interpretation

  • \(\hat{\beta_1}=3.09\) gives the expected difference in \(y\) (i.e., reading comprehension) when D1 increases by one unit, and therefore reflects the mean difference between the control and DRTA groups on reading comprehension
    • That is, children in the DRTA treatment scored on average 3.09 points higher on the reading comprehension scale than the children in the control group post-intervention (see the violin plots)
    • This mean difference is about 3 points on a scale from 1 to 15 which is about \(\frac{3}{15}=.2\) or 20% of the entire scale range; I’d say that’s somewhere between a small- and a medium-sized effect!
    • Additionally, because \(\hat{\beta_1}\) is statistically significant, it’s unlikely that the difference between the groups on the reading comprehension is zero

Parameter Estimate Interpretation

  • Similarly, \(\hat{\beta_2}=1.09\) gives the expected difference in \(y\) when D2 increases by one unit, which reflects the mean difference between the control and TA groups on reading comprehension
    • That is, children in the TA treatment scored on average 1.09 points higher on the reading comprehension scale than the children in the control group post-intervention (see table below)
    • The mean difference is ~1 point which is about \(\frac{1}{15}=.066\) or 6.6% of the entire scale range; a rather small effect
    • Additionally, because \(\hat{\beta_2}\) is not statistically significant, we have insufficient evidence to infer that the two populations (i.e., no intervention and TA) differ on reading comprehension
Group Mean
control 6.68
DRTA 9.77
TA 7.77
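The three means in the table can be rebuilt directly from the regression coefficients (values copied from the summary(dummod) output above):

```r
# Coefficient estimates copied from the earlier model output
b0 <- 6.681818   # intercept      = control mean
b1 <- 3.090909   # D1 coefficient = DRTA - control
b2 <- 1.090909   # D2 coefficient = TA - control

round(c(control = b0, DRTA = b0 + b1, TA = b0 + b2), 2)
# control    DRTA      TA
#    6.68    9.77    7.77   -- matches the descriptive-statistics table
```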

Let R Be the Dummy (Coder)

  • Want to see something cool?

  • As long as your grouping variable is defined as a factor, R will dummy-code for you (but, look at the predictor names now!)

read$group <- as_factor(read$group) # converting labelled numeric to factor
# Before
dummod <- lm(posttest1 ~ D1 + D2, data=read)
# After (it's the same!)
readmod <- lm(posttest1 ~ group, data=read)
summary(readmod)$coefficients # NOTE: the results are identical; the predictors represent the same things as D1 and D2, only their names are different
            Estimate Std. Error  t value     Pr(>|t|)
(Intercept) 6.681818  0.6797950 9.829167 2.443011e-14
groupDRTA   3.090909  0.9613753 3.215091 2.058348e-03
groupTA     1.090909  0.9613753 1.134738 2.607841e-01

Test Yourself (before you wreck yourself)!

How might we compare the DRTA group mean to the TA group mean?

Let R Be the Dummy (Coder)

  • Of course, we can redefine D1 and D2 manually

  • But, it’s easier to re-level the factored grouping variable

# Creating a new variable with a different reference group
read$group_relev <- relevel(read$group, ref = "TA") 
# Refitting the model with the new variable
readmod_relev <- lm(posttest1 ~ group_relev, data=read)
summary(readmod_relev)$coefficients # NOTE: Now, group_relevDRTA is DRTA vs. TA
                    Estimate Std. Error   t value     Pr(>|t|)
(Intercept)         7.772727  0.6797950 11.433929 5.187793e-17
group_relevcontrol -1.090909  0.9613753 -1.134738 2.607841e-01
group_relevDRTA     2.000000  0.9613753  2.080353 4.156606e-02

Let R Be the Dummy (Coder)

  • And perhaps even easier is to use the pairs() function from the emmeans package
library(emmeans)
# Estimate the groups' marginal means
emm <- emmeans(readmod, ~ group) 
# Pairwise comparisons for every unique pair
pairs(emm, adjust="none") # adjust= refers to multiplicity control (e.g., Tukey), but I don't know if I believe in MC
 contrast       estimate    SE df t.ratio p.value
 control - DRTA    -3.09 0.961 63  -3.215  0.0021
 control - TA      -1.09 0.961 63  -1.135  0.2608
 DRTA - TA          2.00 0.961 63   2.080  0.0416

Categorizing Continuous Variables (Just Don’t!)

  • As we’ve been discussing, ANOVA is part of the GLM, but many researchers operate solely within an ANOVA framework for many different reasons

  • It’s still surprisingly common for researchers to take a continuous variable (e.g., age) and artificially categorize it to run an ANOVA instead of a regression model

    • E.g., if you have ages between 13 to 80 years, “cut” it into categories of teens (13-19), young adults (20-35), adults (36-60), and older adults (> 61) and use the age-categories variable as predictor instead of age

Unless there is a good theoretical justification, YOU SHOULD NOT DO THIS!!!

  • Artificially categorizing variables (especially using techniques like median splits) creates measurement error, reduces variability and power, and can result in misleading results/conclusions!
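A small simulation makes the information loss concrete (hedged: arbitrary n and effect size, not real data) by comparing \(R^2\) for a continuous predictor before and after a median split:

```r
# Simulate a continuous predictor x linearly related to outcome y
set.seed(2025)
n <- 500
x <- rnorm(n)
y <- 0.4 * x + rnorm(n)

r2_cont  <- summary(lm(y ~ x))$r.squared          # full continuous predictor
x_split  <- as.integer(x > median(x))             # the dreaded median split
r2_split <- summary(lm(y ~ x_split))$r.squared    # dichotomized predictor

c(continuous = r2_cont, median_split = r2_split)
# the dichotomized predictor explains noticeably less variance
```

All the within-half variability in x is discarded by the split, so the dichotomized predictor systematically explains less variance and yields less powerful tests.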