PSYC 3032 M
Linearity Violation
Recall that linearity is one of the most critical assumptions!
What can I do?!
First, are there any obvious highly-influential or outlying cases that lead to the deviation from linearity? If so, skip to the section on dealing with outliers/highly-influential cases
If you don’t suspect this is the “work” of an outlier, identify which predictor is the “culprit.”
Now, you have a few options…
Option 1: Model the curvature using a polynomial or higher-order regression model
\[\hat{y}_i=\beta_0 + \beta_1x_{1_i}\]
\[\hat{y}_i=\beta_0 + \beta_1x_{1_i} + \beta_2x_{1_i}^2\]
Specifying a model with a quadratic (to the power of 2) term then allows us to proceed with OLS estimation, treating \(x^2\) as just another variable in the model
If \(\beta_2\) (the coefficient of the quadratic term) is large, then there is substantial curvature in the relationship between X and Y
With a quadratic term (e.g., \(x^2\)) there is a single “bend” in the relationship; with a cubic term (e.g., \(x^3\)) there are two bends, etc.
In R, you can use lm(BPAQ~BIS+I(BIS^2)), for example, or lm(BPAQ~BIS+poly(BIS,2)) for numerical stability and to avoid extreme multicollinearity; though, it would be wise to also centre the predictor for more useful interpretation (e.g., mean-centre; but don’t worry about this)
When we add a polynomial term, the interpretation of our parameter estimates changes, too. But this is beyond the scope of this course
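A minimal sketch of what this could look like in R (borrowing the agrsn data and the BPAQ/BIS variables used later in this module; whether a quadratic term is actually needed for these variables is purely for illustration):
# Fit a linear and a quadratic model, then compare them
lin.mod  <- lm(BPAQ ~ BIS, data = agrsn)
quad.mod <- lm(BPAQ ~ poly(BIS, 2), data = agrsn)  # orthogonal polynomial for numerical stability
summary(quad.mod)         # inspect the quadratic term's estimate
anova(lin.mod, quad.mod)  # does adding curvature improve model fit?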
Option 2: Use a piecewise regression model
As an alternative to the polynomial model, we can use a piecewise linear regression model (also known as a spline regression model), namely an interrupted regression model
A piecewise regression model effectively partitions a regressor (explanatory or predictor variable) X into a set of two or more distinct intervals or pieces, and then allows a separate regression association (linear or polynomial) between X and Y within each piece
The value of X where the regression lines meet is known as a “knot” or breakpoint (see next slide)
There are two different ways of creating piecewise linear regressions: Use dummy coding based on the known breakpoint(s) and fit the model as usual. Or, use an iterative technique to identify the breakpoint(s) and fit the model
Consequently, your model will include regression coefficients that allow you to calculate a slope for each of the two “pieces.” Note that there is an R package called segmented that will facilitate piecewise regression (see the sketch below)
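A minimal sketch of both approaches, again borrowing the BPAQ/BIS example used elsewhere in this module (the “known” breakpoint value below is hypothetical):
# Approach 1: known breakpoint, dummy-coded (interrupted) regression
k <- 2                                    # hypothetical "known" breakpoint for BIS
agrsn$post  <- as.numeric(agrsn$BIS > k)  # dummy: 0 before the knot, 1 after
agrsn$BIS_k <- agrsn$BIS - k              # centre BIS at the knot
int.mod <- lm(BPAQ ~ BIS_k * post, data = agrsn)
summary(int.mod)  # slope before knot = BIS_k; slope after = BIS_k + interaction
# Approach 2: let segmented() estimate the breakpoint iteratively
library(segmented)
base.mod <- lm(BPAQ ~ BIS, data = agrsn)
seg.mod  <- segmented(base.mod, seg.Z = ~BIS, psi = median(agrsn$BIS))  # psi = starting guess
summary(seg.mod)  # coefficients and estimated breakpoint
slope(seg.mod)    # slope of each "piece"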
Option 3: Transform your outcome variable
Transformations modify variables to “linearize” relationships and improve model fit in regression analysis
There are multiple forms of transformations, such as log(Y) when we see exponential relationships, or sqrt(Y) when Y is extremely skewed
BUT, remember that log of 0 is undefined, as is the square root of a negative value!!!
So, for example, if some observed values of Y equal 0, you can create a new variable Y’ = ln(Y + 1). Or, for negative values, you can use Y’ = sqrt(Y + 5000). Then you can use Y’ as the outcome variable in a regression model and proceed with OLS estimation.
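A minimal sketch using the agrsn example from this module (the BPAQ outcome here is just to show the syntax; it does not necessarily need transforming):
# Log-transform the outcome (the +1 guards against zeros) and refit
agrsn$BPAQ_log <- log(agrsn$BPAQ + 1)
log.mod <- lm(BPAQ_log ~ BIS + age, data = agrsn)
summary(log.mod)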
Normality of Residuals Violation
Recall that when the residuals evidence a non-normal distribution, the estimates of regression coefficients remain unbiased, but the SEs will be incorrect, which in turn impacts hypothesis tests and CIs
You should also recall that non-normally distributed residuals are more of a problem when your sample size is small; the larger your sample size, the less worried you should be about potential violations of this assumption
That said, residuals are often (but not always) non-normal when the outcome variable is non-normal
What can we do?
But, first, is it possible that the outcome variable is not normal because it is discrete (e.g., binary, categorical, Likert-type scales) or count data?! If so, you might be using an inappropriate regression model!
Solution: Use Generalized linear models instead (e.g., logistic regression for binary outcome, multinomial logistic regression for categorical outcome, and Poisson or negative binomial regression for count data)
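For instance, a minimal sketch of the glm() syntax (the outcome variables ever_fought and n_fights are hypothetical and only illustrate the calls):
# Logistic regression for a (hypothetical) binary outcome
logit.mod <- glm(ever_fought ~ BIS + age, data = agrsn, family = binomial)
# Poisson regression for a (hypothetical) count outcome
pois.mod <- glm(n_fights ~ BIS + age, data = agrsn, family = poisson)
summary(logit.mod)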
Option 1: Apply a mathematical transformation to your DV
Option 2: Remove outliers and/or re-specify your regression model
If you have a couple of “bad” outliers, removing them can improve the distribution of your residuals.
There also might be other predictor variables that need to be included and these variables could be higher order versions of your variables (e.g., quadratic terms; see section on nonlinearity), interaction terms, etc.
Option 3: Do nothing!
As long as you have a reasonably large sample size and other assumptions are not terribly violated, your SEs will not be biased enough for you to worry!
If you have a small sample size and heteroscedasticity is also present, use robust SEs (e.g., sandwich estimator, described soon in section on heteroscedasticity)
Heteroscedasticity
What can we do?
Option 1: Re-estimate the model using weighted least squares (WLS)
One method for handling heteroscedasticity is to estimate the regression model using WLS instead of OLS
WLS can correct for heteroscedasticity by assigning each observation a weight based on the inverse of its residual variance
When the heteroscedasticity is modeled appropriately by the weights, WLS estimates are more efficient than OLS estimates, meaning that there is greater power for significance tests and confidence intervals are narrower compared to OLS
# Step 1: Fit OLS model
ols_model <- lm(BPAQ ~ BIS + age, data = agrsn)
# Step 2: Compute weights (Example: using absolute residuals)
wts <- 1 / lm(abs(ols_model$residuals) ~ ols_model$fitted.values)$fitted.values^2
# Step 3: Fit WLS model
wls_model <- lm(BPAQ ~ BIS + age, data = agrsn,
weights = wts) # Weights
summary(wls_model)
Call:
lm(formula = BPAQ ~ BIS + age, data = agrsn, weights = wts)
Weighted Residuals:
Min 1Q Median 3Q Max
-2.9578 -0.8154 -0.0152 0.9058 3.3457
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.950256 0.242595 8.039 2.79e-14 ***
BIS 0.446007 0.085455 5.219 3.57e-07 ***
age -0.017622 0.005812 -3.032 0.00266 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 1.235 on 272 degrees of freedom
Multiple R-squared: 0.1326, Adjusted R-squared: 0.1262
F-statistic: 20.8 on 2 and 272 DF, p-value: 3.945e-09
Option 2: Use the Sandwich Estimator (RECOMMENDED!)
We can apply a correction to the SEs in our model using what is called a sandwich estimator (Huber-White Robust Standard Errors)
The Sandwich Estimator is a method for calculating SEs in regression models when the assumption of homoscedasticity (constant variance of errors) is violated
It provides heteroscedasticity-consistent SEs, making inferences (e.g., CIs and hypothesis tests) more reliable in the presence of heteroscedasticity; the adjustment will make the SEs smaller or larger depending on the pattern of the heteroscedasticity
library(lmtest)
library(car)
# Fit the OLS model as usual
agrsn.mod <- lm(BPAQ ~ BIS + age, data = agrsn)
# Heteroscedasticity-consistent (sandwich) covariance matrix
sandy <- hccm(agrsn.mod)
# Re-test the coefficients using the robust SEs
coeftest(agrsn.mod, vcov=sandy)
t test of coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.9543924 0.2556512 7.6448 3.612e-13 ***
BIS 0.4436069 0.0872800 5.0826 6.934e-07 ***
age -0.0175563 0.0065245 -2.6908 0.007569 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
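The 95% CIs shown next are based on the same sandwich matrix; presumably they were produced with something like coefci() from lmtest (the exact call is an assumption):
# Heteroscedasticity-consistent 95% CIs using the same sandwich matrix
coefci(agrsn.mod, vcov = sandy)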
2.5 % 97.5 %
(Intercept) 1.45108582 2.457698917
BIS 0.27177666 0.615437229
age -0.03040116 -0.004711372
Multicollinearity
Identify the variables where multicollinearity is the problem (e.g., via variance inflation factors; see the sketch below)
If the multicollinearity is between variables that are true covariates (i.e., control variables), then you do not need to address the issue, because you are not interpreting those coefficients or relying on their precision or significance tests
When your main substantive predictor variables are collinear with one another, there are a few different strategies to help combat this issue
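To help identify the culprit variables (as noted above), variance inflation factors (VIFs) are a quick check; a minimal sketch using the agrsn.mod model fitted earlier (that model has only two predictors, so this is just to show the syntax):
library(car)
# Variance inflation factors: values well above ~5-10 flag collinear predictors
vif(agrsn.mod)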
Outliers and Influential Cases
There is no correct answer to what you should do when you have identified multivariate outliers
The most common approaches are to leave them in or take them out…
BUT (and it’s a big one), I would be incredibly wary about removing any individual solely on the basis of being “outlying” without a reasonable justification!
Outliers often go hand in hand with assumption violations, so it’s important to consider the issues holistically
Regardless, you should investigate these cases and perform a sensitivity analysis…
Ask yourself (and the data, I guess) the following questions:
Conduct sensitivity analysis
Sensitivity analysis is a method used to assess how robust a model’s results are to changes in assumptions, data, or model specifications; it helps determine whether key findings hold up under different conditions
So, instead of immediately removing influential cases, consider:
Running models with and without the influential cases to assess their impact
Using robust regression or bootstrapping to check the stability of results
Applying WLS where the weights are inversely proportional to Cook’s D or residual size
Report and discuss the findings of your sensitivity analysis in your report (perhaps include in Supplemental Materials) and justify your decisions
# Rerun model with "cleaned" data (without 261)
no.outly.mod <- lm(BPAQ ~ BIS + age, data = agrsn[-261,])
library(stargazer)
# Compare side by side (note that "Constant" means Intercept)
stargazer(agrsn.mod, no.outly.mod, type = "text", column.labels = c("Original", "Without ppt 261"))
===================================================================
Dependent variable:
-----------------------------------------------
BPAQ
Original Without ppt 261
(1) (2)
-------------------------------------------------------------------
BIS 0.444*** 0.447***
(0.085) (0.085)
age -0.018*** -0.015**
(0.006) (0.006)
Constant 1.954*** 1.889***
(0.245) (0.249)
-------------------------------------------------------------------
Observations 275 274
R2 0.130 0.120
Adjusted R2 0.124 0.113
Residual Std. Error 0.491 (df = 272) 0.490 (df = 271)
F Statistic 20.309*** (df = 2; 272) 18.440*** (df = 2; 271)
===================================================================
Note: *p<0.1; **p<0.05; ***p<0.01
library(MASS)
# Fit robust regression model
robust_model <- rlm(BPAQ ~ BIS + age, data = agrsn)
# Compare results
stargazer(agrsn.mod, robust_model, type = "text")
================================================================
Dependent variable:
---------------------------------
BPAQ
OLS robust
linear
(1) (2)
----------------------------------------------------------------
BIS 0.444*** 0.455***
(0.085) (0.090)
age -0.018*** -0.018***
(0.006) (0.006)
Constant 1.954*** 1.937***
(0.245) (0.258)
----------------------------------------------------------------
Observations 275 275
R2 0.130
Adjusted R2 0.124
Residual Std. Error (df = 272) 0.491 0.504
F Statistic 20.309*** (df = 2; 272)
================================================================
Note: *p<0.1; **p<0.05; ***p<0.01
# Compute absolute residuals
abs_resid <- abs(residuals(agrsn.mod))
# Compute inverse residual weights
weights <- 1 / abs_resid
# Fit Weighted Least Squares (WLS) model
wls_model <- lm(BPAQ ~ BIS + age, data = agrsn, weights = weights)
# Compare results
stargazer(agrsn.mod, no.outly.mod, robust_model, wls_model,type = "text", column.labels = c("Original", "Without ppt 261", "RLM", "WLS"))
=============================================================================================================
Dependent variable:
-----------------------------------------------------------------------------------------
BPAQ
OLS robust OLS
linear
Original Without ppt 261 RLM WLS
(1) (2) (3) (4)
-------------------------------------------------------------------------------------------------------------
BIS 0.444*** 0.447*** 0.455*** 0.442***
(0.085) (0.085) (0.090) (0.031)
age -0.018*** -0.015** -0.018*** -0.019***
(0.006) (0.006) (0.006) (0.004)
Constant 1.954*** 1.889*** 1.937*** 1.988***
(0.245) (0.249) (0.258) (0.124)
-------------------------------------------------------------------------------------------------------------
Observations 275 274 275 275
R2 0.130 0.120 0.519
Adjusted R2 0.124 0.113 0.515
Residual Std. Error 0.491 (df = 272) 0.490 (df = 271) 0.504 (df = 272) 0.633 (df = 272)
F Statistic 20.309*** (df = 2; 272) 18.440*** (df = 2; 271) 146.677*** (df = 2; 272)
=============================================================================================================
Note: *p<0.1; **p<0.05; ***p<0.01
Sensitivity analyses can quickly balloon into an overwhelming number of possibilities, especially when multiple decisions interact. Here are some strategies to keep your sensitivity analysis meaningful and manageable:
Prioritize Conceptually Defensible Approaches. Rather than testing every possible combination, narrow the choices by focusing on methods that align with your research question, data characteristics, statistical assumptions, and common sense
Define a Small Set of Core Sensitivity Analyses. Select 2–4 key approaches that capture distinct, but reasonable scenarios
Use a Hierarchical Approach. Instead of treating all options as equally valid, consider a hierarchical approach (e.g. diagnose issue, try the most obvious solution, then diagnose again, etc.)
Compare Options Systematically. Focus on effect sizes (regression coefficients) and CIs, model fit (AIC/BIC/\(R^2\)), and plots (assumptions, partial slopes, etc.); see the sketch after this list
If sensitivity analyses yield different conclusions, acknowledge the uncertainty rather than cherry-picking the “best” model. If your main findings hold across reasonable approaches, that strengthens your conclusion
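As one small sketch of a systematic comparison, you can pull the coefficients from the candidate models fitted above into a single table (CIs and fit indices can be tabulated similarly where they are defined for a given model type):
# Quick side-by-side of regression coefficients across the candidate models
sapply(list(OLS = agrsn.mod, No261 = no.outly.mod, RLM = robust_model, WLS = wls_model), coef)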
Module 4 (Part 2)