Module 4: Remedies for Assumption Violations and Outliers

PSYC 3032 M

Udi Alter

Things Udi Should Remember

Before we start

  • Any questions so far (about anything)?
  • Group presentation topics
  • Lab 3 solution will be available today, after class
  • No lab 4 for a bit now! YAY! 😊
  • PLEASE LOOK CAREFULLY AT LAB SOLUTIONS
  • ASSIGNMENT 1 DUE ON FEB 25th (2 weeks away)

About Module 4 Part 2

Goals for Today:

  • Some more assumptions…
  • Outliers and influential cases
  • Potential remedies for assumption violations & highly influential cases
  • Discussing Assignment 1




Linearity Violation

Non-Linearity

Recall that linearity is one of the most critical assumptions!

  • Non-linear trends in residuals typically indicate model misspecification, which means that the model does not fully capture the relationship between predictors and the outcome!

What can I do?!

  • First, are there any obvious highly-influential or outlying cases that lead to the deviation from linearity? If so, skip to the section on dealing with outliers/highly-influential cases

  • If you don’t suspect this is the “work” of an outlier, identify which predictor is the “culprit.”

    • When assessing each predictor’s partial association with the residuals, determine which predictor demonstrates a non-linear trend
  • Now, you have a few options…

Non-Linearity: Adding Polynomial Term

Option 1: Model the curvature using a polynomial or higher-order regression model

  • A polynomial term in a GLM is a predictor variable raised to a power greater than one (e.g., \(x_1^2\), \(x_1^3\)) to model nonlinear relationships between the predictor and the response variable. So, if our linear model looked like this (the OLS slope in blue)

\[\hat{y}_i=\beta_0 + \beta_1x_{1_i}\]

Non-Linearity: Adding Polynomial Term

  • Our “new” polynomial model will look like this:

\[\hat{y}_i=\beta_0 + \beta_1x_{1_i} + \beta_2x_{1_i}^2\]

Non-Linearity: Adding Polynomial Term

  • Specifying a model with a quadratic (to the power of 2) term then allows us to proceed with OLS estimation, treating \(x^2\) as just another variable in the model

  • If \(\beta_2\) (the coefficient of the quadratic term) is large, then there is substantial curvature in the relationship between X and Y

  • With a quadratic term (e.g., \(x^2\)) there is a single “bend” in the relationship; with a cubic term (e.g., \(x^3\)) there can be up to two bends, etc.

  • In R, you can use lm(BPAQ~BIS+I(BIS^2)), for example, or lm(BPAQ~BIS+poly(BIS,2)) for numerical stability and to avoid extreme multicollinearity; it would also be wise to centre the predictor for a more useful interpretation (e.g., mean-centre it; but don’t worry about this). See the sketch below

  • When we add a polynomial term, the interpretation of our parameter estimates changes, too. But this is beyond the scope of this course
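A minimal sketch of both approaches, assuming the agrsn data set with BPAQ and BIS that is used later in these slides:

# Mean-centre the predictor first (optional, but aids interpretation)
agrsn$BIS_c <- agrsn$BIS - mean(agrsn$BIS, na.rm = TRUE)
# Quadratic term added "by hand" with I()
quad_mod1 <- lm(BPAQ ~ BIS_c + I(BIS_c^2), data = agrsn)
# Same idea using orthogonal polynomials (more numerically stable, less collinear)
quad_mod2 <- lm(BPAQ ~ poly(BIS, 2), data = agrsn)
summary(quad_mod1) # check whether the coefficient on the squared term is substantial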

Non-Linearity: Piecewise Regression

Option 2: Use a piecewise regression model

  • As an alternative to the polynomial model, we can use a piecewise linear regression model (also known as a spline regression model), namely an interrupted regression model

  • A piecewise regression model effectively partitions a regressor (explanatory or predictor variable) X into a set of two or more distinct intervals or pieces, and then allows a separate regression association (linear or polynomial) between X and Y within each piece

  • The value of X where the regression lines meet is known as a “knot” or breakpoint (see next slide)

  • There are two different ways of creating piecewise linear regressions: use dummy coding based on known breakpoint(s) and fit the model as usual, or use an iterative technique to identify the breakpoint(s) and then fit the model

  • Consequently, your model will include regression coefficients that allow you to calculate a slope for each of the two “pieces.” Note that there is an R package called segmented that facilitates piecewise regression (see the sketch below)
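A rough sketch with the segmented package, assuming a single unknown breakpoint in BIS (the starting value psi is just a guess):

library(segmented)
# Start from an ordinary OLS fit
ols_fit <- lm(BPAQ ~ BIS, data = agrsn)
# Iteratively estimate the breakpoint (knot) in BIS; psi is a starting guess
pw_fit <- segmented(ols_fit, seg.Z = ~ BIS, psi = 2)
summary(pw_fit) # estimated breakpoint and coefficients
slope(pw_fit)   # slope of BIS within each "piece"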

Non-Linearity: Piecewise Regression

Non-Linearity: Transformation of the Outcome

Option 3: Transform your outcome variable

  • Transformations modify variables to “linearize” relationships and improve model fit in regression analysis

  • There are multiple forms of transformations, such as log(Y) when we see exponential relationships, or sqrt(Y) when Y is extremely skewed

BUT, remember that log of 0 is undefined, as is the square root of a negative value!!!

So, for example, if some observed values of Y equal 0, you can create a new variable Y’ = ln(Y + 1). Or, for negative values, you can add a constant large enough to make every value non-negative, e.g., Y’ = sqrt(Y + 5000). Then you can use Y’ as the outcome variable in a regression model and proceed with OLS estimation (see the sketch below).

  • Another BUT: interpretations of coefficients from a model with a transformed variable can be cumbersome and unintuitive. Thus, I rarely use or recommend transforming your data; this is the absolute last resort for me
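A minimal sketch of the log transformation described above (purely illustrative; whether it is appropriate depends on your data):

# Add 1 before logging in case any outcome values are exactly 0
agrsn$BPAQ_log <- log(agrsn$BPAQ + 1)
log_mod <- lm(BPAQ_log ~ BIS + age, data = agrsn)
summary(log_mod) # coefficients are now on the log(BPAQ + 1) scale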




Normality of Residuals Violation

Non-Normality

  • Recall that when the residuals evidence a non-normal distribution, the estimates of regression coefficients remain unbiased, but the SEs will be incorrect, which in turn impacts hypothesis tests and CIs

  • You should also recall that non-normally distributed residuals are more of a problem when your sample size is small; the larger your sample size, the less worried you should be about potential violations of this assumption

  • That said, residuals are often (but not always) non-normal when the outcome variable is non-normal

What can we do?

  • But, first, is it possible that the outcome variable is not normal because it is discrete (e.g., binary, categorical, Likert-type scales) or count data?! If so, you might be using an inappropriate regression model!

  • Solution: Use generalized linear models instead (e.g., logistic regression for a binary outcome, multinomial logistic regression for a categorical outcome, and Poisson or negative binomial regression for count data); see the sketch below
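A hypothetical sketch of fitting generalized linear models in R (binary_outcome and count_outcome are placeholder variable names, not variables from these slides):

# Logistic regression for a binary outcome
logit_mod <- glm(binary_outcome ~ BIS + age, data = agrsn, family = binomial)
# Poisson regression for a count outcome
pois_mod <- glm(count_outcome ~ BIS + age, data = agrsn, family = poisson)
summary(logit_mod)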

Non-Normality: Solutions

Option 1: Apply a mathematical transformation to your DV

  • Transforming a non-normal outcome often improves the distribution of residuals, and can also reduce heteroscedasticity and the influence of extreme observations
    • Excess positive skewness can be reduced using a log-based transformation; excess negative skewness can be reduced by squaring the variable (among other possibilities)

Option 2: Remove outliers and/or re-specify your regression model

  • If you have a couple of “bad” outliers, removing them can improve the distribution of your residuals.

  • There also might be other predictor variables that need to be included and these variables could be higher order versions of your variables (e.g., quadratic terms; see section on nonlinearity), interaction terms, etc.

Non-Normality: Solutions

Option 3: Do nothing!

  • As long as you have a reasonably large sample size and other assumptions are not terribly violated, your SEs will not be biased enough for you to worry!

  • If you have a small sample size and heteroscedasticity is also present, use robust SEs (e.g., sandwich estimator, described soon in section on heteroscedasticity)




Heteroscedasticity

Heteroscedasticity

  • Remember that violating homoscedasticity results in biased SEs (the parameter estimates themselves remain unbiased)

What can we do?

Option 1: Re-estimate the model using weighted least squares (WLS)

  • One method for handling heteroscedasticity is to estimate the regression model using WLS instead of OLS

  • WLS can correct heteroscedasticity by assigning weights to each observation based on the inverse of the residual variance;

    • That is, observations with smaller residual variances will have greater influence (more weight), and observations with larger residual variances will have less influence (less weight)
  • Under heteroscedasticity, WLS estimates are more efficient than OLS estimates, meaning that significance tests have greater power and confidence intervals are narrower compared to OLS

Heteroscedasticity: WLS

# Step 1: Fit OLS model
ols_model <- lm(BPAQ ~ BIS + age, data = agrsn)
# Step 2: Compute weights (Example: using absolute residuals)
wts <- 1 / lm(abs(ols_model$residuals) ~ ols_model$fitted.values)$fitted.values^2
# Step 3: Fit WLS model
wls_model <- lm(BPAQ ~ BIS + age, data = agrsn,
                weights = wts) # Weights
summary(wls_model)

Call:
lm(formula = BPAQ ~ BIS + age, data = agrsn, weights = wts)

Weighted Residuals:
    Min      1Q  Median      3Q     Max 
-2.9578 -0.8154 -0.0152  0.9058  3.3457 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)  1.950256   0.242595   8.039 2.79e-14 ***
BIS          0.446007   0.085455   5.219 3.57e-07 ***
age         -0.017622   0.005812  -3.032  0.00266 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.235 on 272 degrees of freedom
Multiple R-squared:  0.1326,    Adjusted R-squared:  0.1262 
F-statistic:  20.8 on 2 and 272 DF,  p-value: 3.945e-09

Heteroscedasticity: Sandwich Estimator

Option 2: Use the Sandwich Estimator (RECOMMENDED!)

  • We can apply a correction to the SEs in our model using what is called a sandwich estimator (Huber-White Robust Standard Errors)

  • The Sandwich Estimator is a method for calculating SEs in regression models when the assumption of homoscedasticity (constant variance of errors) is violated

  • It provides heteroscedasticity-consistent SEs, making inferences (e.g., CIs and hypothesis tests) more reliable in the presence of heteroscedasticity; the adjustment will make the SEs smaller or larger depending on the pattern of the heteroscedasticity

Heteroscedasticity: Sandwich Estimator

library(lmtest)
library(car)
# Fit the OLS model as usual
agrsn.mod <- lm(BPAQ ~ BIS + age, data = agrsn)
# Heteroscedasticity-consistent (robust) covariance matrix from car
sandy <- hccm(agrsn.mod)
# Re-test the coefficients using the robust SEs
coeftest(agrsn.mod, vcov=sandy)

t test of coefficients:

              Estimate Std. Error t value  Pr(>|t|)    
(Intercept)  1.9543924  0.2556512  7.6448 3.612e-13 ***
BIS          0.4436069  0.0872800  5.0826 6.934e-07 ***
age         -0.0175563  0.0065245 -2.6908  0.007569 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
coefci(agrsn.mod, vcov=sandy)
                  2.5 %       97.5 %
(Intercept)  1.45108582  2.457698917
BIS          0.27177666  0.615437229
age         -0.03040116 -0.004711372




Multicollinearity

Multicollinearity

  • Identify the variables where multicollinearity is the problem

  • If the multicollinearity is between variables that are true covariates (i.e., control variables), then you do not need to address the issue, because you are not interpreting those coefficients or relying on their precision or significance tests

  • When your main substantive predictor variables are collinear with one another, there are a few different strategies to help combat this issue (a brief R sketch follows this list):

    • With overly redundant predictors, remove one (or more) of them and estimate the new model
    • Create a combination variable from collinear variables (e.g., sum scores together, take the mean of the variables, etc.)
    • Use alternative regression strategies that allow for shared variability of sets of variables to be modeled as a factor or latent variable (e.g., factor analysis, structural equation modeling)
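A brief sketch of checking and addressing collinearity (x1 and x2 are hypothetical stand-ins for two collinear predictors, not variables from these slides):

library(car)
# Variance inflation factors flag which predictors are collinear
vif(lm(BPAQ ~ x1 + x2 + age, data = agrsn))
# One remedy: combine the redundant predictors into a single composite
agrsn$x_comp <- rowMeans(agrsn[, c("x1", "x2")])
vif(lm(BPAQ ~ x_comp + age, data = agrsn))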




Outliers and Influential Cases

Outliers and Influential Cases

  • There is no correct answer to what you should do when you have identified multivariate outliers

  • The most common approaches are to leave them in or take them out…

  • BUT (and it’s a big one), I would be incredibly wary about removing any individual solely on the basis of being “outlying” without a reasonable justification!

  • Outliers often go hand in hand with assumption violations, so it’s important to consider the issues holistically

  • Regardless, you should investigate these cases and perform a sensitivity analysis…

Outliers and Influential Cases

Ask yourself (and the data, I guess) the following questions:

  • Is this a data entry error? (e.g., height is entered in feet instead of cm)
    • Check for typos, incorrect coding, or misplaced decimal points
  • Does this case truly belong to the population being studied? (e.g., older ppt in a group of youngsters)
    • Check for this case’s covariates and other variables’ values
      • E.g., is there a big difference in age/experience/status? Get to know this person (on paper, not IRL)
    • If not, consider whether it should be excluded.
  • Relatedly, is this an unusual but valid observation?
    • Some influential cases may be real and meaningful
    • Removing them without justification can distort results!

Sensitivity Analysis

Conduct sensitivity analysis

  • Sensitivity analysis is a method used to assess how robust a model’s results are to changes in assumptions, data, or model specifications; it helps determine whether key findings hold up under different conditions

  • So, instead of immediately removing influential cases, consider:

    • Running models with and without the influential cases to assess their impact

      • How much do the estimates and \(R^2\) change? A lot? A little? If only a little, I would keep the outliers in…
    • Using robust regression or bootstrapping to check the stability of results

    • Applying WLS where the weights are inversely proportional to Cook’s D or residual size (a brief sketch follows this list)

  • Report and discuss the findings of your sensitivity analysis in your report (perhaps include in Supplemental Materials) and justify your decisions
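A small sketch of the Cook’s-D-based weighting mentioned above (one of several reasonable weighting choices; the residual-size version appears in the examples that follow):

# Down-weight observations with large Cook's distance
cd <- cooks.distance(agrsn.mod)
wls_cd <- lm(BPAQ ~ BIS + age, data = agrsn, weights = 1 / (1 + cd))
summary(wls_cd) # compare with the original agrsn.mod estimates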

Sensitivity Analysis: Example

influencePlot(agrsn.mod) # 261

Sensitivity Analysis: Example

# Rerun model with "cleaned" data (without 261)
no.outly.mod <- lm(BPAQ ~ BIS + age, data = agrsn[-261,])
library(stargazer)
# Compare side by side (note that "Constant" means Intercept)
stargazer(agrsn.mod, no.outly.mod, type = "text", column.labels = c("Original", "Without ppt 261"))

===================================================================
                                  Dependent variable:              
                    -----------------------------------------------
                                         BPAQ                      
                           Original             Without ppt 261    
                              (1)                     (2)          
-------------------------------------------------------------------
BIS                        0.444***                0.447***        
                            (0.085)                 (0.085)        
                                                                   
age                        -0.018***               -0.015**        
                            (0.006)                 (0.006)        
                                                                   
Constant                   1.954***                1.889***        
                            (0.245)                 (0.249)        
                                                                   
-------------------------------------------------------------------
Observations                  275                     274          
R2                           0.130                   0.120         
Adjusted R2                  0.124                   0.113         
Residual Std. Error    0.491 (df = 272)        0.490 (df = 271)    
F Statistic         20.309*** (df = 2; 272) 18.440*** (df = 2; 271)
===================================================================
Note:                                   *p<0.1; **p<0.05; ***p<0.01

Sensitivity Analysis: Example

library(MASS)
# Fit robust regression model
robust_model <- rlm(BPAQ ~ BIS + age, data = agrsn)
# Compare results
stargazer(agrsn.mod, robust_model, type = "text")

================================================================
                                      Dependent variable:       
                               ---------------------------------
                                             BPAQ               
                                         OLS            robust  
                                                        linear  
                                         (1)              (2)   
----------------------------------------------------------------
BIS                                   0.444***         0.455*** 
                                       (0.085)          (0.090) 
                                                                
age                                   -0.018***        -0.018***
                                       (0.006)          (0.006) 
                                                                
Constant                              1.954***         1.937*** 
                                       (0.245)          (0.258) 
                                                                
----------------------------------------------------------------
Observations                             275              275   
R2                                      0.130                   
Adjusted R2                             0.124                   
Residual Std. Error (df = 272)          0.491            0.504  
F Statistic                    20.309*** (df = 2; 272)          
================================================================
Note:                                *p<0.1; **p<0.05; ***p<0.01

Sensitivity Analysis: Example

# Compute absolute residuals
abs_resid <- abs(residuals(agrsn.mod))
# Compute inverse residual weights 
weights <- 1 / abs_resid
# Fit Weighted Least Squares (WLS) model
wls_model <- lm(BPAQ ~ BIS + age, data = agrsn, weights = weights)
# Compare results
stargazer(agrsn.mod, no.outly.mod, robust_model, wls_model,type = "text", column.labels = c("Original", "Without ppt 261", "RLM", "WLS"))

=============================================================================================================
                                                       Dependent variable:                                   
                    -----------------------------------------------------------------------------------------
                                                              BPAQ                                           
                                          OLS                            robust                OLS           
                                                                         linear                              
                           Original             Without ppt 261           RLM                  WLS           
                              (1)                     (2)                 (3)                  (4)           
-------------------------------------------------------------------------------------------------------------
BIS                        0.444***                0.447***             0.455***             0.442***        
                            (0.085)                 (0.085)             (0.090)              (0.031)         
                                                                                                             
age                        -0.018***               -0.015**            -0.018***            -0.019***        
                            (0.006)                 (0.006)             (0.006)              (0.004)         
                                                                                                             
Constant                   1.954***                1.889***             1.937***             1.988***        
                            (0.245)                 (0.249)             (0.258)              (0.124)         
                                                                                                             
-------------------------------------------------------------------------------------------------------------
Observations                  275                     274                 275                  275           
R2                           0.130                   0.120                                    0.519          
Adjusted R2                  0.124                   0.113                                    0.515          
Residual Std. Error    0.491 (df = 272)        0.490 (df = 271)     0.504 (df = 272)     0.633 (df = 272)    
F Statistic         20.309*** (df = 2; 272) 18.440*** (df = 2; 271)                  146.677*** (df = 2; 272)
=============================================================================================================
Note:                                                                             *p<0.1; **p<0.05; ***p<0.01

Sensitivity Analysis: Guidelines

  • Sensitivity analyses can quickly balloon into an overwhelming number of possibilities, especially when multiple decisions interact. Here are some strategies to keep your sensitivity analysis meaningful and manageable:

  • Prioritize Conceptually Defensible Approaches. Rather than testing every possible combination, narrow the choices by focusing on methods that align with your research question, data characteristics, statistical assumptions, and common sense

  • Define a Small Set of Core Sensitivity Analyses. Select 2–4 key approaches that capture distinct, but reasonable scenarios

  • Use a Hierarchical Approach. Instead of treating all options as equally valid, consider a hierarchical approach (e.g., diagnose the issue, try the most obvious solution, then diagnose again, etc.)

  • Compare Options Systematically. Focus on effect sizes (reg. coefficients) and CIs, model fit (AIC/BIC/\(R^2\)), and plots (assumptions, partial slopes, etc.); see the sketch after this list

  • If sensitivity analyses yield different conclusions, acknowledge the uncertainty rather than cherry-picking the “best” model. If your main findings hold across reasonable approaches, that strengthens your conclusion
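A small sketch of lining up the models fit earlier in these slides (object names as defined in the examples above):

# Side-by-side coefficients from the original, robust, and weighted fits
round(cbind(OLS = coef(agrsn.mod), RLM = coef(robust_model), WLS = coef(wls_model)), 3)
# Confidence intervals for the original model, for comparison with the robust alternatives
confint(agrsn.mod)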

Have a GREAT reading week! 😎