Simpson’s Paradox (in 3D!)

Module 3

Udi Alter

2025-Jan-28

…Continuing from Module 3, slides 17-20

Here is another way to show graphically how a regression slope looks, and how it changes, when we compare the “raw” relationship between x and y (i.e., no other variables in the model) with the partial relationship between x and y (i.e., after adding another variable):

\[Cholesterol_i = \beta_0 + \beta_1 Exercise_i + \epsilon_i\]

\[Cholesterol_i = \beta_0 + \beta_1 Exercise_i + \beta_2 Age_i + \epsilon_i\]
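To make the two models concrete, here is a minimal simulation sketch in Python with NumPy. The data-generating numbers (sample size, coefficients, noise levels) and the helper name `ols` are made up for illustration and are not the data behind the slides. The sketch only assumes that older people in the sample both exercise more and have higher cholesterol.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 600

# Hypothetical data-generating process (all values are illustrative assumptions).
age = rng.uniform(20, 70, n)                      # years
exercise = 0.1 * age + rng.normal(0, 1, n)        # older people exercise more
cholesterol = 150 + 2.0 * age - 5.0 * exercise + rng.normal(0, 5, n)


def ols(y, *predictors):
    """Least-squares coefficients, intercept first, then one per predictor."""
    X = np.column_stack([np.ones_like(y), *predictors])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef


raw = ols(cholesterol, exercise)            # Cholesterol ~ Exercise
partial = ols(cholesterol, exercise, age)   # Cholesterol ~ Exercise + Age

print("raw exercise slope:    ", raw[1])
print("partial exercise slope:", partial[1])
print("age slope:             ", partial[2])
```

Under these assumed values, the exercise slope should come out positive in the first model and negative in the second, because the first model lets exercise absorb the effect of age.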

Simpson’s Paradox (in 3D!)

Simpson’s paradox, also known as the reversal paradox or the Yule-Simpson effect, occurs when a trend that appears in different groups of data reverses or disappears when the groups are combined. In other words, the aggregated data may lead to a conclusion that contradicts the conclusions drawn from analyzing the data within each subgroup. The paradox often arises because of a confounding variable that affects both the grouping and the outcome.

The 3D plot below illustrates how exercise, cholesterol, and age relate to one another (this is not an interaction in the statistical modeling sense, i.e., not moderation). When age is ignored, the relationship between exercise and cholesterol appears positive. Once age is included in the model, however, the partial relationship between exercise and cholesterol turns negative. This happens because age is a confounder: it influences both exercise and cholesterol levels.
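One way to see why the sign can flip is the standard omitted-variable identity for least-squares slopes. The symbols \(\tilde{\beta}_1\) (the exercise slope in the model without age) and \(\delta_1\) (the slope from regressing age on exercise) are notation introduced here for illustration, not taken from the slides:

\[\tilde{\beta}_1 = \beta_1 + \beta_2 \delta_1\]

If cholesterol rises with age (\(\beta_2 > 0\)) and older people in the sample tend to exercise more (\(\delta_1 > 0\)), the product \(\beta_2 \delta_1\) is positive and can outweigh a negative \(\beta_1\). The raw slope then looks positive even though the partial slope is negative.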

The different age groups form distinct “data clouds,” represented by different colors, and are spread out along the Z-axis (age). This makes it clear that cholesterol levels tend to increase with age, even when accounting for exercise. The black regression lines within each age group reveal the relationship between exercise and cholesterol specific to that age group. Meanwhile, the semi-transparent regression plane ties everything together, showing the overall predicted cholesterol levels based on both exercise and age.
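The within-group lines and the regression plane described above can be sketched numerically. This is a rough illustration on simulated data, not the code or data behind the plot: the three age groups, their labels, and every coefficient below are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Three hypothetical age groups (labels and mean ages are assumptions).
group_mean_age = {"30s": 35.0, "50s": 55.0, "70s": 75.0}

within_slopes = {}
all_age, all_exercise, all_chol = [], [], []

for label, mean_age in group_mean_age.items():
    age = rng.normal(mean_age, 2.0, 200)
    exercise = 0.1 * age + rng.normal(0, 1, 200)
    chol = 150 + 2.0 * age - 5.0 * exercise + rng.normal(0, 5, 200)

    # Line within one "data cloud": cholesterol on exercise for this group only.
    slope, intercept = np.polyfit(exercise, chol, 1)
    within_slopes[label] = slope

    all_age.append(age)
    all_exercise.append(exercise)
    all_chol.append(chol)

age = np.concatenate(all_age)
exercise = np.concatenate(all_exercise)
chol = np.concatenate(all_chol)

# Pooled line that ignores age entirely.
pooled_slope, pooled_intercept = np.polyfit(exercise, chol, 1)

# Regression plane: cholesterol on exercise and age together.
X = np.column_stack([np.ones_like(chol), exercise, age])
plane, *_ = np.linalg.lstsq(X, chol, rcond=None)

print("within-group exercise slopes:", within_slopes)
print("pooled exercise slope:", pooled_slope)
print("plane coefficients (intercept, exercise, age):", plane)
```

In this setup, each within-group slope should be negative while the pooled slope is positive. The plane's exercise coefficient should agree in sign with the within-group lines, and its age coefficient should be positive.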