Module 4: Assumptions & Diagnostics

PSYC 3032 M

Udi Alter

Things Udi Should Remember

Before we start

  • Any questions so far (about anything)?
  • How was Lab 2?
  • Lab 2 solution will be available today, after class
  • Lab 3 due next week
  • PLEASE LOOK CAREFULLY AT LAB SOLUTIONS
  • ASSIGNMENT 1 DUE ON FEB 25th (3 weeks away)
  • Groups, baby?

About Module 4

Goals for Today:

  • Assumptions
  • Some more assumptions…
  • Evaluating potential assumption violations through visualizations
  • Outliers and influential cases
  • Evaluating potential outliers and highly influential cases through visualizations



If I had to eat only one type of food for the rest of my life, it would be…


Which of the following would you choose as your superpower?


A) Immortality

B) Mind-reading

C) Super strength

D) Super intelligence

E) Time travel




Regression Diagnostics

Regression Diagnostics

  • Why do we model? To answer research questions about relationships, effects, or phenomena by representing them mathematically

  • How do we do that? We define these relationships/effects/phenomena as parameters in a model and then estimate them using data

  • But, how do we know these estimates are any good? Well… isn’t that the million-dollar question?

  • Parameter estimates and inferential statistics from a model will be reasonably accurate only when all of the statistical assumptions are met

  • Violating different types of assumptions will have different effects on your results, though; some will result in biased (i.e., inaccurate) effect size estimates, whereas others will result in inflated or deflated standard errors (which affect the calculation of CIs, p values, etc.)

Regression Diagnostics

  • Assumptions are important because if they are not met, OLS parameter estimates will be inaccurate (they will be biased and inconsistent) and inefficient (meaning that SEs and CIs will inflate)

  • These assumptions can be more difficult to evaluate with small samples. Ironically, the smaller the sample, the more important it is that assumptions are met

  • In practice, however, researchers underestimate the importance of statistical assumptions and rarely report or describe statistical assumption testing in published work (e.g., see Counsell & Harlow, 2017)

  • Thus, there is no doubt that results of many published studies using regression are misleading, if not completely wrong

Regression Diagnostics

  • Regression diagnostics are graphical and statistical methods for evaluating the extent to which a regression model fitted to data is an adequate representation of that data and for testing the trustworthiness of inferential conclusions about the model’s parameter estimates

  • In particular, regression diagnostics are used to check the extent to which the model’s assumptions have been violated and to examine how any unusual or outlying observations may impact results

  • The GLM assumptions are a LINE:

    • Linearity (the partial relation between each predictor and the outcome is linear)
    • Independence (of model residuals and therefore observations)
    • Normality (of model residuals)
    • Equal residual variance (across all levels of the predictor/s)
  • Although not technically assumptions, you should also assess for the presence of:

    • Multicollinearity
    • Outliers & influential cases

Regression Diagnostics

  • Did you notice that many of the assumptions relate to residuals?! Yeah, why?!

  • Residuals represent what the model fails to explain. If the residuals exhibit systematic patterns, it suggests that the model is misrepresenting the underlying data structure, which can lead to biased estimates, incorrect inferences, and bad conclusions

  • Recall that the residual, \(e_i\), is just the estimate of the true error, \(\epsilon_i\)

\[e_i = y_i-\hat{y}_i=y_i-[\hat{\beta}_0 + \sum_{p=1}^{P} \hat{\beta}_p x_{p_i}]\]

  • In checking our assumptions, we will often evaluate/plot the relationship between the residuals (\(e\)) and predicted values or predictors

  • The structure of residuals in relation to fitted/predictor values can often reveal systematic issues in the model

Aggression Example Model (again!?)

  • Let’s go back to the model we love so much…
library(haven)                               # for reading SPSS .sav files
agrsn <- read_sav("aggression.sav")          # load the aggression dataset
agrsn.mod <- lm(BPAQ~BIS+age, data = agrsn)  # regress BPAQ on BIS and age
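  • As a quick, optional check of the residual definition from the previous slide, you can confirm that R’s resid() is just the observed outcome minus the fitted values; a minimal sketch using agrsn.mod:
e_manual <- as.numeric(model.frame(agrsn.mod)$BPAQ) - fitted(agrsn.mod)  # y_i - yhat_i
all.equal(as.numeric(resid(agrsn.mod)), as.numeric(e_manual))            # should be TRUE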




Assumption 1: the partial relation between each predictor and the outcome is linear

Assumption 1: Linearity

  • This assumption states that the association between each predictor and the outcome, after adjusting for other predictors, can be adequately described by a linear relationship

  • Note that it is possible for the raw relation between y and x1 to be non-linear, but when x2 is added, the (partial) relationship between y and x1 is reasonably approximated by a straight line (or vice versa)

  • As such, examining bivariate scatterplots of each x and y will not provide sufficient evidence of the linearity assumption; thus, although it is informative to plot y against each predictor in a given model, plotting the model’s residuals against each predictor is a more direct way of assessing whether the partial relations are linear

  • If residuals plotted against a predictor show a systematic pattern (e.g., a curve), it means the model should have included a non-linear term for that predictor

  • Actually, it’s better to plot the Standardized, Pearson, or Studentized residuals because they remove non-constant variance effects (Assumption 4), making non-linearity easier to see
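  • For reference, here is a minimal sketch of how to extract each type of residual from a fitted lm object in R (using agrsn.mod from earlier):

raw_resid  <- resid(agrsn.mod)                        # raw residuals, e_i
pear_resid <- residuals(agrsn.mod, type = "pearson")  # Pearson residuals (same as raw for an unweighted lm)
std_resid  <- rstandard(agrsn.mod)                    # internally standardized residuals
stud_resid <- rstudent(agrsn.mod)                     # externally Studentized (deleted) residuals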

Assumption 1: Linearity

library(car)              # for residualPlots(), scatterplot(), qqPlot(), vif(), influencePlot()
residualPlots(agrsn.mod)  # Pearson residuals vs. each predictor and vs. fitted values, plus curvature tests

           Test stat Pr(>|Test stat|)
BIS          -0.4727           0.6368
age           0.1571           0.8753
Tukey test   -0.7619           0.4461

What do we want?! No clear pattern (e.g., curvature) and a flat, horizontal blue LOESS line! What are we looking for?! The opposite! See next slides

Assumption 1: Linearity

  • Here’s an example of what we’re “looking for” (i.e., non-linearity; but, we hope not to see it)

Assumption 1: Linearity

  • Here’s another example:

Assumption 1: Linearity

  • You can also try this code, which uses Studentized, as opposed to Pearson, residuals
stud_resid <- rstudent(agrsn.mod) # Studentizing the model residuals
scatterplot(stud_resid ~ agrsn$BIS, boxplot=T)
scatterplot(stud_resid ~ agrsn$age, boxplot=T)

What do you think about age?!

Assumption 1: Linearity

Why should I care about linearity?

  • Non-linear trends in residuals typically indicate model misspecification, usually an under-fitted, overly simplistic model

    • Could be missing an interaction effect or a polynomial term (e.g., \(x_1^2\))
  • Which means that the model does not fully capture the relationship between predictors and the outcome!

  • Which means that our results likely include biased estimates and incorrect predictions, leading to bad inferences and conclusions, YIKES!

What can I do?! (We’ll talk more about remedies later, though)

  • Assess influential cases
  • Maybe add a polynomial or interaction term
  • Maybe use piece-wise regression (fit multiple slopes at different values of x)
  • Maybe transform the predictor (e.g., take the natural log to “linearize” the relationship)
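  • For illustration only (our aggression model doesn’t appear to need these), here is a sketch of what two of these remedies look like in lm() syntax:

# hypothetical remedies -- only consider these if the diagnostics suggest non-linearity
agrsn.quad <- lm(BPAQ ~ BIS + I(BIS^2) + age, data = agrsn)  # add a quadratic (polynomial) term for BIS
agrsn.log  <- lm(BPAQ ~ BIS + log(age), data = agrsn)        # natural-log transform a predictor
# an interaction term would be written as: lm(BPAQ ~ BIS*age, data = agrsn)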




Assumption 2: Independent residuals

Assumption 2: Independence

  • Independence is a characteristic that comes into play at the research design stage rather than the statistical analysis stage

  • Independence simply means that the data for any one observation are uncorrelated with the data for any other observation

  • Examples where independence is violated include samples of clustered data (like individuals who see the same therapist), having participants from the same family or couples, etc.

  • Accordingly, it is NOT something that you formally test with the other model diagnostics, but is nonetheless an important assumption




Assumption 3: The errors are normally distributed

Assumption 3: Normality

  • The normality of errors assumption relates to ALL conditional distributions of \(y\), i.e., the distribution of errors around each conditional mean, \(E(y_i|x_i)\)

  • That is, the errors are assumed to be normally distributed at every x value, for each predictor!

  • This exact assumption is very hard to test because it’s often difficult to observe several \(y_i\) values associated with the exact same \(x_i\)

    • For example, of those who are exactly 2.1 on BIS, what is the distribution of BPAQ? You may only have 1-2 people with this exact BIS score, and therefore testing normality for this conditional distribution is hard
  • That said, we know from probability theory that if all the conditional error distributions are normal (with the same variance), then the marginal distribution of the errors will be normal as well

Assumption 3: Normality

Assumption 3: Normality

  • Unlike in univariate methods, the normality assumption in regression actually refers to normally distributed errors and not the distributions of particular variables in your model.

  • Of course, we don’t know the errors, but we have their estimates, the residuals!

  • As such, it’s inappropriate to test the normality assumption by examining univariate histograms of your independent or dependent variables

  • Instead, we should extract the residuals from a regression model, and use a histogram, QQplot, or boxplot to examine their distribution

  • For the most accurate assessment of residual normality, it is preferable to plot the Studentized residuals as opposed to the raw residuals (again, because they remove non-constant variance effects)

  • When the residuals evidence a non-normal distribution, the estimates of regression coefficients remain unbiased, but SEs will be incorrect, which in turn impacts hypothesis tests and CIs

Assumption 3: Normality

hist(stud_resid, breaks = 30)  # histogram of the Studentized residuals
plot(density(stud_resid))      # kernel density estimate of the Studentized residuals

Assumption 3: Normality

  • This Quantile-Quantile (Q-Q) plot compares the distribution of Studentized residuals to a theoretical t distribution along with a pointwise 95% confidence envelope

  • Normality = Points fall along the 45° reference line; the confidence band contains most of the points; no major deviations in the tails

car::qqPlot(agrsn.mod)

Assumption 3: Normality

  • Here is an example of non-normally distributed residuals

Assumption 3: Normality

  • And another…

Assumption 3: Normality

  • Another way you can gauge normality is when you use the summary() function on the model; at the top you’ll see descriptive statistics about the residuals (sans the Mean, which is always necessarily 0)

Call:
lm(formula = BPAQ ~ BIS + age, data = agrsn)

Residuals:
     Min       1Q   Median       3Q      Max 
-1.17933 -0.32561 -0.00454  0.35594  1.33877 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)  1.954392   0.244922   7.980 4.12e-14 ***
BIS          0.443607   0.085060   5.215 3.64e-07 ***
age         -0.017556   0.006032  -2.910  0.00391 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.4906 on 272 degrees of freedom
Multiple R-squared:  0.1299,    Adjusted R-squared:  0.1235 
F-statistic: 20.31 on 2 and 272 DF,  p-value: 6.016e-09




Assumption 4: Equal residual variance across all levels of the predictor/s

Assumption 4: Homoscedasticity

  • Equal/constant residual variance across all levels of the predictor/s means that at any given point on a particular predictor, x, the spread of the residuals is the same

  • This is also sometimes phrased as the errors being homoscedastic, meaning they show homogeneity of variance

  • A scatter plot of (Studentized or Pearson) residuals against a predictor can reveal unmodeled non-linearity, as shown earlier, but they are also useful for judging homogeneity of variance!

  • Since the predicted values reflect the influence of all predictors together, we can make things easier by checking the homoscedasticity of residuals across the range of the fitted (predicted) values

  • This assumption means that the model does an equally good job regardless of what the predicted outcome score is

  • When the residuals evidence a non-constant variance—or there is heteroscedasticity—the estimates of regression coefficients remain unbiased, but SEs will be incorrect, which in turn impacts hypothesis tests and CIs
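  • If you want a numeric companion to the plots on the next slides, the car package also offers a score test for non-constant error variance (optional; this course emphasizes the visual checks):

ncvTest(agrsn.mod)  # Breusch-Pagan-type score test; a small p value suggests heteroscedasticity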

Assumption 4: Homoscedasticity

Assumption 4: Homoscedasticity

scatterplot(rstudent(agrsn.mod) ~ fitted(agrsn.mod), boxplot=T)  # Studentized residuals vs. fitted values

  • What do we want?! A constant, equal-width confidence band across the range of fitted values

  • What are we looking for?! A horizontal “funnel” or “hourglass” shape

Assumption 4: Homoscedasticity

plot(agrsn.mod, which = 3)  # Scale-Location plot: sqrt(|standardized residuals|) vs. fitted values

  • What do we want?! A horizontal, straight, and flat red line from left to right (and constant spread of circles around it)
  • What are we looking for?! “Wiggles,” curves, or a diagonal red line (see next slides)

Assumption 4: Homoscedasticity

  • Here’s an example of heteroscedasticity

Assumption 4: Homoscedasticity

  • Here’s another

Playing Favourites

  • Are all assumptions equally important? All assumptions matter, but not all have the same impact

  • As we usually do, let’s again make the distinction between estimates and significance testing

  • The LINEARITY and INDEPENDENCE assumptions are critical for producing UNBIASED and CONSISTENT parameter estimates and thus are highly important

    • UNBIASED: Estimates accurately reflect the true population parameters
    • CONSISTENT: Estimates become increasingly accurate as the sample size grows
  • On the other hand, NORMALITY and HOMOSCEDASTICITY violations primarily affect SEs, CIs, and p values

    • However, with larger sample sizes, CLT “kicks in” and the sampling distribution of the coefficients becomes approximately normal (even if residuals aren’t)
    • Also, there are easier “fixes” for such violations so that we can get appropriate precision (SEs and CIs) and inference
      • Examples include Weighted Least Squares (WLS), robust standard errors (sandwich estimators), bootstrapping, and robust regression methods.
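  • As a taste of what one such fix looks like, here’s a sketch of re-testing our coefficients with heteroscedasticity-consistent (“sandwich”) SEs; this assumes the sandwich and lmtest packages are installed (we’ll talk more about remedies later):

library(sandwich)  # heteroscedasticity-consistent covariance estimators
library(lmtest)    # coeftest() for re-testing coefficients with a custom vcov
coeftest(agrsn.mod, vcov = vcovHC(agrsn.mod, type = "HC3"))  # robust (HC3) standard errors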




Multicollinearity

Multicollinearity

  • As we have learned, in MLR, the partial relationship between a predictor and the outcome is adjusted by the amount that the predictor is associated with the other predictors in the model

  • When one or more relationships among the predictors becomes large, there is a considerable amount of redundancy among predictors and thus any one predictor contributes very little unique information to the model; this situation is known as collinearity or multicollinearity

  • Including collinear predictors in a model causes SEs of slope estimates to inflate, leading to very wide CIs around the estimates and low power for the t tests of the estimates

  • Although it is always a good idea to look at the simple correlations among your predictors, often this is not enough to detect multicollinearity

  • There are statistics that can help determine whether multicollinearity is likely to be a problem for your analysis

  • The variance inflation factor (VIF) describes the amount that the (squared) SE of each regression coefficient is increased relative to a (hypothetical) situation in which all of the predictor variables are completely uncorrelated

Multicollinearity

Multicollinearity

  • There will be a VIF statistic for each predictor, p, in a model:

\[VIF_p = \frac{1}{1-R_p^2}\]

  • where \(R_p^2\) refers to the Coefficient of determination (Multiple R-squared) from regressing the predictor, p, on all other predictors in the model

  • If the pth predictor is perfectly uncorrelated with all other predictors (i.e., correlations = 0), then \(VIF_p = 1\), indicating that the SE of \(\beta_p\) is not affected by multicollinearity of variable p with the other predictors

  • Another commonly reported statistic is tolerance (TOL). TOL is merely 1/VIF, so values of TOL near zero are indicative of problematic multicollinearity

  • I’ve seen recommended cut-offs in the literature of 3, 5, or 10; 10 is actually way too large, so I would opt to be a bit more conservative, and examine the effects of treating multicollinearity with VIFs closer to 3
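  • To make the formula concrete, here’s a sketch of computing the VIF (and tolerance) for BIS “by hand” in our two-predictor model and comparing it to car::vif() (the values should match, assuming complete data):

r2_bis  <- summary(lm(BIS ~ age, data = agrsn))$r.squared  # R^2 from regressing BIS on the other predictor(s)
vif_bis <- 1 / (1 - r2_bis)  # VIF for BIS
tol_bis <- 1 / vif_bis       # tolerance (TOL) for BIS
vif(agrsn.mod)               # should match the hand computation for BIS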

Multicollinearity

  • With only two predictors the VIF scores are the same—but the values for each predictor will be different with three or more predictors in the model. So, here’s an example with three predictors
vif(lm(BPAQ~age+BIS+AISS, data=agrsn))
     age      BIS     AISS 
1.021521 1.101823 1.088405 
  • Because none of these values is much larger than 1, multicollinearity does not have much of an effect on the SEs for the regression slopes of any of these individual predictors. But, take a look at what happens if we do this:
agrsn$BIS_squared <- agrsn$BIS^2 # creating another variable, which is the square of BIS
vif(lm(BPAQ~age+BIS+BIS_squared, data=agrsn)) 
        age         BIS BIS_squared 
   1.019559  108.273947  108.298655 
  • YOWZA!

Additionally Assumed Assumptions

  • There are two “bonus” assumptions that are virtually never met in practice, but a researcher can dream…
  1. The model is correctly specified; in other words, all of the correct variables that should be in the model have been included (i.e., you are not missing important variables)

  2. The variables in your model are measured without error (i.e., perfectly reliable; yeah, right…)

  • In psychology, we often don’t have a true observed measurement of a construct of interest.
    • For example, you cannot directly observe “depression” and must use a scale or instrument to get a measurement of depression
    • So, if you use the Beck Depression Inventory (BDI) scores in a model, you assume that BDI is a perfect representation of the unobservable construct depression, but it’s clearly not…
  • With lower reliability in measured variables, we introduce extra noise to our variables of interest, which would bias our parameter estimates




Outliers and Influential Cases

Outliers and Influential Cases

  • There are multiple terms (and definitions) that are sometimes used interchangeably—such as outliers and influential cases—but, in fact, are not exactly the same thing!

  • Outliers are extreme observations compared to other observations in the data

    • But context matters: aberrant in relation to what? The mean of the variable? The regression slope (i.e., a large residual)?
  • Influence refers to the amount of “pull” each observation has on the regression slope (and thus parameter estimates); or, you can think about influence as the extent to which individual observations cause the slope (and estimates) to change

    • So, all observations are “influential,” but some have minimal effect on the estimates (i.e., if they were removed, the estimates would barely change, if at all), whereas others have a large impact on the coefficient estimates (AKA the “influencers”)
    • Some cases “pull” the regression estimates upward, some downward, and some have no impact at all

Outliers and Influential Cases

There are three characteristics researchers use to determine potential outliers/highly-influential cases:

  • Leverage
  • Discrepancy
  • Influence

Leverage

Leverage (sometimes called hat value) refers to the extent to which an observation differs from the other observations in terms of the predictor variables’ scores only; that is, how “far away” an observation is from the joint mean of all predictor values

  • In essence, a multivariate distribution is created (with P dimensions) and the leverage score (\(h_i\)) measures the distance of an observation from the center of this multivariate distribution (the “centroid”)

  • Example: A person with an extremely high income in a dataset predicting happiness will have high leverage in a group of grad students, even if their happiness aligns perfectly with the trend.

Leverage

  • To illustrate large distance from the centroid:

Leverage

  • To illustrate a high leverage point in SLR:

Leverage

  • To illustrate a high leverage point in MLR:

Leverage

  • One (conservative) rule of thumb to determine extreme leverage:

\[h_i > \frac{3(P+1)}{N}\]

  • where P is the number of predictors and N is the number of observations (sample size); in our example:

\[h_i > \frac{3(2+1)}{275}=\frac{9}{275}=0.0327\]

  • That said, this rule of thumb is too conservative; so use judgment when “flagging” any observations using this rule

  • We can plot the leverage values to see if any of them stand out. Here are the leverage scores for our two-predictor model in the research example:

Leverage

leverage <- hatvalues(agrsn.mod)  # hat values (leverage scores) for each observation
plot(leverage)                    # plot leverage against row number

  • There are some scores toward the end of the dataset (higher row number) that have higher leverage, but there is nothing that stands out too much; using the rule of thumb above, there are about 10 scores that fall outside of that cut-off.
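  • If you want to flag those observations explicitly, here is a small sketch applying the cut-off above:

h_cut <- 3 * (2 + 1) / length(leverage)  # 3(P+1)/N for our two-predictor model
which(leverage > h_cut)                  # row numbers of observations above the cut-off
sum(leverage > h_cut)                    # how many observations are flagged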

Discrepancy

Discrepancy refers to the size of the observed residual. It is the extent to which an observation is extreme relative to its predicted score, \(\hat{y}_i\); discrepancy is just the magnitude of the residual for case \(i\), i.e., \(y_i-\hat{y}_i\)

  • Example: A student with average study hours (x1) but an extremely high quiz score (y) has high discrepancy because their outcome is unusual compared to what the model predicts

Discrepancy

  • In SLR:

Discrepancy

  • In MLR:

Discrepancy

  • We usually assess discrepancy using externally Studentized residuals
  • The externally Studentized residual (or the “Studentized deleted residual”) for case \(i\) is based on the SD of the residuals from a regression slope estimated using all observations except for case i (\(SD_{e_{-i}}\)):

\[e_i^*=\frac{e_i}{SD_{e_{-i}}\sqrt{1-h_i}}\]

  • Because \(e^*_i\) is distributed as a t statistic, we expect most values to fall between approximately -2 and +2, with values well outside the range of -2.5 to +2.5 considered extreme
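  • If you’re curious, you can reproduce \(e^*_i\) from the formula above by hand; a sketch using lm.influence(), whose sigma element holds the residual SD with case \(i\) deleted:

e          <- resid(agrsn.mod)                # raw residuals
h          <- hatvalues(agrsn.mod)            # leverage values
sd_minus_i <- lm.influence(agrsn.mod)$sigma   # residual SD with case i deleted
e_star     <- e / (sd_minus_i * sqrt(1 - h))  # externally Studentized residuals, by hand
all.equal(as.numeric(e_star), as.numeric(rstudent(agrsn.mod)))  # should be TRUE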

Discrepancy

plot(rstudent(agrsn.mod), ylab="Studentized Residuals")
abline(2,0,lty=3)   # dotted reference line at +2
abline(-2,0, lty=3) # dotted reference line at -2
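  • And to list the specific cases beyond the “extreme” range:

which(abs(rstudent(agrsn.mod)) > 2.5)  # row numbers with |e*_i| > 2.5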

Influence

Influence combines both leverage and discrepancy to measure the extent to which a case impacts a model’s parameter estimates

Influence

Influence

  • One common measure of influence is Cook’s distance (or Cook’s D; Cook, 1977), defined as:

\[D_i = \frac{e_i^{*2}}{P + 1} \times \frac{h_i}{1 - h_i}\]

  • Thus, Cook’s distance is a function of both leverage (i.e., \(h_i\)) and discrepancy (\(e_i^*\)), where larger values of \(D_i\) are indicative of greater influence

  • Cut-offs for Cook’s D vary; some have argued in favour of using a cut-off of 1, but this is way too liberal (see Camilleri, Alter, & Cribbie, 2024), whereas a conservative approach is to flag cases where:

\[D_i > \frac{4}{N-P-1}\]
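  • A quick sketch of computing Cook’s D with the built-in cooks.distance() function and applying this cut-off:

cooks_d <- cooks.distance(agrsn.mod)
d_cut   <- 4 / (length(cooks_d) - 2 - 1)  # 4/(N - P - 1) with P = 2 predictors
which(cooks_d > d_cut)                    # cases flagged by the conservative cut-off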

Influence

  • Using the suggested cut-off from the previous slide, the Cook’s D cut-off for our model and data is \(\frac{4}{275-2-1}=0.0147\)

  • But, I typically start to worry about Cook’s D values of \(D_i>0.15\) (somewhere between the conservative and the too-liberal cut-offs, but much closer to the conservative one…)

  • Other equally useful measures of influence include the Standardized Difference of Fit (DFFITS) and Standardized Difference of Beta (DFBETAS) (Patrick and Greg talk about these in the Quantitude episode assigned this week)

  • We won’t get into these in this course due to time limitations, but I do want you to know about them

  • You can check out Camilleri, Alter, & Cribbie (2024) for how Cook’s D, DFFITS, and DFBETAS (and various cut-off values for each influence statistic) perform

Influence

plot(agrsn.mod, which=4)  # Cook's distance for each observation

Influence

  • Another way we could visually examine extreme scores is through an influence plot

  • This plot simultaneously provides information about the leverage, discrepancy and Cook’s distance, and is available from the influencePlot() function in the car package

  • This plot helps visually identify potential outliers and highly influential cases based on the size of the “bubbles”

  • The numbers on the graph correspond to the observation’s row number in the dataset

  • The dotted lines help give you a sense of what is in the “normal” range

  • You would be most concerned about larger bubbles (which is why their row numbers are identified), but especially so if they are close to the top-right or bottom-right corners (i.e., the observation has both high leverage and high discrepancy).

  • The influencePlot() function also provides a table for the most extreme scores listed in the plot. The table includes statistics for each of the outlier metrics: discrepancy (StudRes), leverage (Hat), and influence (CookD)

Influence

influencePlot(agrsn.mod)

      StudRes         Hat       CookD
47   2.437728 0.004400059 0.008598094
261 -1.453120 0.110608718 0.087177965
262  1.061081 0.135907121 0.059000607
263  1.069090 0.139881372 0.061927065
264  2.768212 0.004344552 0.010879350

Outliers/influential cases: Decisions, Decisions…

What to do with outlying and/or highly-influential cases?

  • Decisions depend on their context and empirical meaning:
  • Are they due to coding errors or data entry mistakes?
  • Do they represent real individuals from a different/subset population?
  • Are they legitimate but unusual cases that reveal something new?

Possible Actions

  • Remove cases if they’re clear errors
  • Down-weight their impact using robust regression techniques
  • Model them explicitly by adding predictors that explain their uniqueness

Outliers/influential cases: Decisions, Decisions…



Key Takeaway

  • Context matters—we need to “drill down” on these cases and try to understand each of them

  • Also, we must consider how much these cases affect results and how “surprising” they are

  • The best we can do is ensure transparent reporting, be honest about our data and decisions, avoid confirmation bias, and run some good old sensitivity analyses to test the robustness of our findings