PSYC 3032 M
Regression Diagnostics
Why do we model? To answer research questions about relationships, effects, or phenomena by representing them mathematically
How do we do that? We define these relationships/effects/phenomena as parameters in a model and then estimate them using data
But, how do we know these estimates are any good? Well… isn’t that the million-dollar question?
Parameter estimates and inferential statistics from a model will be reasonably accurate only when all of the statistical assumptions are met
Violating different types of assumptions will have different effects on your results, though; some will result in biased (i.e., inaccurate) effect size estimates, whereas others will result in inflated or deflated standard errors (which affect the calculation of CIs, p values, etc.)
Assumptions are important because, when they are not met, OLS parameter estimates can be inaccurate (i.e., biased and inconsistent) and/or inefficient (meaning that SEs and CIs inflate)
These assumptions can be more difficult to evaluate with small samples. Ironically, the smaller the sample, the more important it is that assumptions are met
In practice, however, researchers underestimate the importance of statistical assumptions and rarely report or describe assumption testing in published work (e.g., see Counsell & Harlow, 2017)
Thus, there is no doubt that results of many published studies using regression are misleading, if not completely wrong
Regression diagnostics are graphical and statistical methods for evaluating the extent to which a regression model fitted to data is an adequate representation of that data and for testing the trustworthiness of inferential conclusions about the model’s parameter estimates
In particular, regression diagnostics are used to check the extent to which the model’s assumptions have been violated and to examine how any unusual or outlying observations may impact results
The GLM assumptions spell LINE: Linearity, Independence, Normality, and Equal variance of residuals
Although not technically assumptions, you should also assess for the presence of multicollinearity and outlying/influential cases
Did you notice that many of the assumptions relate to residuals?! Yeah, why?!
Residuals represent what the model fails to explain. If the residuals exhibit systematic patterns, it suggests that the model is misrepresenting the underlying data structure, which can lead to biased estimates, incorrect inferences, and bad conclusions
Recall that the residual, \(e_i\), is just the estimate of the true error, \(\epsilon_i\)
\[e_i = y_i-\hat{y}_i=y_i-[\hat{\beta}_0 + \sum_{p=1}^{P} \hat{\beta}_p x_{p_i}]\]
In checking our assumptions, we will often evaluate/plot the relationship between the residuals (\(e\)) and predicted values or predictors
The structure of residuals in relation to fitted/predictor values can often reveal systematic issues in the model
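Here is a minimal R sketch of pulling the residuals and fitted values out of a fitted model, assuming the agrsn data and the BPAQ ~ BIS + age model used in the research example later in these slides (the object name mod is just for illustration):

```r
# Fit the two-predictor model from the research example
mod <- lm(BPAQ ~ BIS + age, data = agrsn)

# Raw residuals and fitted (predicted) values
e    <- resid(mod)
yhat <- fitted(mod)

# The residuals are just the observed minus the predicted scores
y <- model.response(model.frame(mod))
all.equal(unname(e), unname(y - yhat))   # should be TRUE

# Plot residuals against fitted values to look for systematic structure
plot(yhat, e, xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, lty = 2)
```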
Assumption 1: The partial relation between each predictor and the outcome is linear
This assumption states that the association between each predictor and the outcome, after adjusting for other predictors, can be adequately described by a linear relationship
Note that it is possible for the raw relation between y and x1 to be non-linear, but when x2 is added, the (partial) relationship between y and x1 is reasonably approximated by a straight line (or vice versa)
As such, examining bivariate scatterplots of each x and y will not provide sufficient evidence of the linearity assumption; thus, although it is informative to plot y against each predictor in a given model, plotting the model’s residuals against each predictor is a more direct way of assessing whether the partial relations are linear
If residuals plotted against a predictor show a systematic pattern (e.g., a curve), it means the model should have included a non-linear term for that predictor
Actually, it’s better to plot the Standardized, Pearson, or Studentized residuals because they remove non-constant variance effects (Assumption 4), making non-linearity easier to see
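One convenient way to get these plots, along with a curvature test for each predictor, is the residualPlots() function from the car package; a minimal sketch, again assuming the research-example model (the test table below has the format this function prints):

```r
library(car)   # residualPlots() lives here

mod <- lm(BPAQ ~ BIS + age, data = agrsn)

# Plots Pearson residuals against each predictor and against the fitted
# values, and prints a curvature test for each predictor plus Tukey's
# test for the fitted values
residualPlots(mod)
```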
           Test stat Pr(>|Test stat|)
BIS          -0.4727           0.6368
age           0.1571           0.8753
Tukey test   -0.7619           0.4461
What do we want?! No clear pattern (e.g., curvature) and a flat, horizontal blue LOESS line! What are we watching out for?! The opposite! See the next slides
What do you think about age?!
Why should I care about linearity?
Non-linear trends in residuals typically indicate model misspecification, usually an under-fitted, overly simplistic model
Which means that the model does not fully capture the relationship between predictors and the outcome!
Which means that our results likely include biased estimates and incorrect predictions, leading to bad inferences and conclusions, YIKES!
What can I do?! (We’ll talk more about remedies later, though)
Assumption 2: Independent residuals
Independence is a characteristic that comes into play at the research design phase rather than the statistical phase
Independence simply means that the data for any one observation are uncorrelated with the data from any other observation
Examples where independence is violated include samples of clustered data (like individuals who see the same therapist), having participants from the same family or couples, etc.
Accordingly, it is NOT something that you formally test with the other model diagnostics, but is nonetheless an important assumption
Assumption 3: The errors are normally distributed
The normality of errors assumption relates to ALL of the conditional distributions of \(y_i\) given \(x_i\)
That is, the errors are assumed to be normally distributed at every x value, for each predictor!
This exact assumption is very hard to test because it’s often difficult to observe several \(y_i\) values associated with the exact same \(x_i\)
That said, we know from probability theory that if all of the conditional error distributions are normal, then the marginal distribution of the errors will be normal as well
Unlike in univariate methods, the normality assumption in regression actually refers to normally distributed errors and not the distributions of particular variables in your model.
Of course, we don’t know the errors, but we have their estimates, the residuals!
As such, it’s inappropriate to test the normality assumption by examining univariate histograms of your independent or dependent variables
Instead, we should extract the residuals from a regression model, and use a histogram, QQplot, or boxplot to examine their distribution
To test residual normality most accurately, it is preferable to plot the Studentized residuals rather than the raw residuals (again, because doing so removes non-constant variance effects)
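As a minimal sketch (same assumed model), Studentized residuals can be extracted with rstudent() and their distribution inspected directly:

```r
mod <- lm(BPAQ ~ BIS + age, data = agrsn)

# Externally Studentized residuals
rs <- rstudent(mod)

# Histogram and boxplot of the residual distribution
hist(rs, breaks = 20, xlab = "Studentized residual", main = "")
boxplot(rs, horizontal = TRUE)
```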
When the residuals evidence a non-normal distribution, the estimates of the regression coefficients remain unbiased, but the SEs will be incorrect, which in turn impacts hypothesis tests and CIs
This Quantile-Quantile (Q-Q) plot compares the distribution of Studentized residuals to a theoretical t distribution along with a pointwise 95% confidence envelope
Normality = points fall along the 45° reference line; the confidence band contains most of the points; there are no major deviations in the tails
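The qqPlot() function in the car package produces this kind of plot for an lm object (Studentized residuals against t quantiles, with a pointwise confidence envelope); a minimal sketch, assuming the same model:

```r
library(car)

mod <- lm(BPAQ ~ BIS + age, data = agrsn)

# Q-Q plot of Studentized residuals against the theoretical t distribution,
# with a pointwise 95% confidence envelope; extreme cases are labelled
qqPlot(mod)
```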
You can also run the summary() function on the model; at the top, you'll see descriptive statistics about the residuals (sans the mean, which is always necessarily 0)
Call:
lm(formula = BPAQ ~ BIS + age, data = agrsn)
Residuals:
Min 1Q Median 3Q Max
-1.17933 -0.32561 -0.00454 0.35594 1.33877
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.954392 0.244922 7.980 4.12e-14 ***
BIS 0.443607 0.085060 5.215 3.64e-07 ***
age -0.017556 0.006032 -2.910 0.00391 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.4906 on 272 degrees of freedom
Multiple R-squared: 0.1299, Adjusted R-squared: 0.1235
F-statistic: 20.31 on 2 and 272 DF, p-value: 6.016e-09
Assumption 4: Equal residual variance across all levels of the predictor/s
Equal/constant residual variance across all levels of the predictor/s means that at any given point on a particular predictor, x, the spread of the residuals is the same
This can also be phrased as the errors being homoscedastic, meaning they show homogeneity of variance
A scatter plot of (Studentized or Pearson) residuals against a predictor can reveal unmodeled non-linearity, as shown earlier, but they are also useful for judging homogeneity of variance!
Since the predicted values reflect the influence of all predictors together, we can make things easier by checking the homoscedasticity of residuals across the range of the fitted (predicted) values
This assumption means that the model does an equally good job regardless of what the predicted score on the dependent variable is
When the residuals evidence non-constant variance (i.e., heteroscedasticity), the estimates of the regression coefficients remain unbiased, but the SEs will be incorrect, which in turn impacts hypothesis tests and CIs
What do we want?! A constant, equal-width confidence band across the range of fitted values
What are we watching out for?! "Funnel" or "hourglass" shapes, where the spread of the residuals widens or narrows across the range of fitted values
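A minimal sketch of how constant variance might be checked in R; the plot is the main tool, and ncvTest() and spreadLevelPlot() from the car package (not shown elsewhere in these slides, but standard functions) provide complementary checks:

```r
library(car)

mod <- lm(BPAQ ~ BIS + age, data = agrsn)

# Studentized residuals against fitted values: the spread should look
# roughly constant across the x-axis
plot(fitted(mod), rstudent(mod),
     xlab = "Fitted values", ylab = "Studentized residuals")
abline(h = 0, lty = 2)

# Score test of the null hypothesis of constant error variance
ncvTest(mod)

# Spread-level plot of absolute Studentized residuals vs fitted values
spreadLevelPlot(mod)
```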
Are all assumptions equally important? All assumptions matter, but not all have the same impact
As we usually do, let’s again make the distinction between estimates and significance testing
The LINEARITY and INDEPENDENCE assumptions are critical for producing UNBIASED and CONSISTENT parameter estimates and thus are highly important
On the other hand, NORMALITY and HOMOSCEDASTICITY violations primarily affect SEs, CIs, and p values
Multicollinearity
As we have learned, in MLR, the partial relationship between a predictor and the outcome is adjusted by the amount that the predictor is associated with the other predictors in the model
When one or more relationships among the predictors becomes large, there is a considerable amount of redundancy among predictors and thus any one predictor contributes very little unique information to the model; this situation is known as collinearity or multicollinearity
Including collinear predictors in a model causes SEs of slope estimates to inflate, leading to very wide CIs around the estimates and low power for the t tests of the estimates
Although it is always a good idea to look at the simple correlations among your predictors, often this is not enough to detect multicollinearity
There are statistics that can help determine whether multicollinearity is likely to be a problem for your analysis
The variance inflation factor (VIF) describes the amount that the (squared) SE of each regression coefficient is increased relative to a (hypothetical) situation in which all of the predictor variables are completely uncorrelated
\[VIF_p = \frac{1}{1-R_p^2}\]
where \(R_p^2\) refers to the Coefficient of determination (Multiple R-squared) from regressing the predictor, p, on all other predictors in the model
If the pth predictor is perfectly uncorrelated with all other predictors (i.e., correlations = 0), then \(VIF_p = 1\), indicating that the SE of \(\beta_p\) is not affected by multicollinearity of variable p with the other predictors
Another commonly reported statistic is tolerance (TOL). TOL is merely 1/VIF, so values of TOL near zero are indicative of problematic multicollinearity
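To see where the formula comes from, here is a small sketch (assuming the agrsn data and the research-example model) that computes the VIF and tolerance for BIS "by hand" and compares them with car::vif():

```r
library(car)

mod <- lm(BPAQ ~ BIS + age, data = agrsn)

# Regress the predictor of interest (BIS) on all other predictors (here, just age)
r2_BIS <- summary(lm(BIS ~ age, data = agrsn))$r.squared

vif_BIS <- 1 / (1 - r2_BIS)   # VIF_p = 1 / (1 - R_p^2)
tol_BIS <- 1 / vif_BIS        # tolerance = 1 / VIF

c(VIF = vif_BIS, TOL = tol_BIS)

# Should match the VIF that car computes for BIS
vif(mod)
```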
I’ve seen recommended cut-offs in the literature of 3, 5, or 10; 10 is actually way too large, so I would opt to be a bit more conservative, and examine the effects of treating multicollinearity with VIFs closer to 3
agrsn$BIS_squared <- agrsn$BIS^2 # creating another variable that is the square of BIS
vif(lm(BPAQ ~ age + BIS + BIS_squared, data = agrsn))
        age         BIS BIS_squared 
   1.019559  108.273947  108.298655
Beyond the LINE assumptions, regression also implicitly assumes that the model is correctly specified; in other words, all of the correct variables that should be in the model have been included (i.e., you are not missing important variables)
It also assumes that the variables in your model are measured without error (i.e., they are perfectly reliable; yeah, right…)
Outliers and Influential Cases
There are multiple terms (and definitions) that are sometimes used interchangeably—such as outliers and influential cases—but, in fact, are not exactly the same thing!
Outliers are extreme observations compared to other observations in the data
Influence refers to the amount of "pull" each observation has on the regression slope (and thus the parameter estimates); or, you can think about influence as the extent to which individual observations cause the slope (and estimates) to change
There are three characteristics researchers use to determine potential outliers/highly-influential cases:
Leverage (sometimes called hat value) refers to the extent to which an observation differs from the other observations in terms of the predictor variables’ scores only; that is, how “far away” an observation is from the joint mean of all predictor values
In essence, a multivariate distribution is created (with P dimensions) and the leverage score (\(h_i\)) measures the distance of an observation from the center of this multivariate distribution (the “centroid”)
Example: A person with an extremely high income in a dataset predicting happiness will have high leverage in a group of grad students, even if their happiness aligns perfectly with the trend.
A common rule of thumb flags observations with
\[h_i > \frac{3(P+1)}{N}\]
For our research example, with P = 2 predictors and N = 275:
\[h_i > \frac{3(2+1)}{275}=\frac{9}{275}=0.0327\]
That said, this rule of thumb is too conservative; so use judgment when “flagging” any observations using this rule
We can plot the leverage values to see if any of them stand out. Here are the leverage scores for our two-predictor model in the research example:
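Here is a minimal sketch (same assumed model) of extracting the hat values and comparing them with the 3(P + 1)/N rule of thumb:

```r
mod <- lm(BPAQ ~ BIS + age, data = agrsn)

# Leverage (hat) values for each observation
h <- hatvalues(mod)

# Rule-of-thumb cut-off: 3(P + 1)/N, with P = 2 predictors here
P      <- 2
N      <- nobs(mod)
cutoff <- 3 * (P + 1) / N    # 9/275 = 0.0327 in the research example

plot(h, xlab = "Observation (row number)", ylab = "Leverage (hat value)")
abline(h = cutoff, lty = 2)

# Row numbers of cases exceeding the cut-off
which(h > cutoff)
```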
Discrepancy refers to the size of the observed residual. It is the extent to which an observation is extreme relative to its predicted score, \(\hat{y}_i\); discrepancy is just the magnitude of the residual for case \(i\), i.e., \(y_i-\hat{y}_i\)
In practice, discrepancy is usually assessed with the Studentized residual,
\[e_i^*=\frac{e_i}{SD_{e_{-i}}\sqrt{1-h_i}}\]
Influence combines both leverage and discrepancy to measure the extent to which a case impacts a model’s parameter estimates
A widely used influence statistic is Cook's distance:
\[D_i = \frac{e_i^{*2}}{P + 1} \times \frac{h_i}{1 - h_i}\]
Thus, Cook’s distance is a function of both leverage (i.e., \(h_i\)) and discrepancy (\(e_i^*\)), where larger values of \(D_i\) are indicative of greater influence
Cut-offs for Cook's D vary; some have argued in favour of using a cut-off of 1, but this is way too liberal (see Camilleri, Alter, & Cribbie, 2024), whereas a more conservative cut-off flags cases with
\[D_i > \frac{4}{N-P-1}\]
Using the suggested cut-off above, the Cook's D threshold for our model and data is \(\frac{4}{275-2-1}=0.0147\)
But I typically start to worry about Cook's D values \(D_i>0.15\) (somewhere between the conservative and the too-liberal cut-offs, but much closer to the conservative one…)
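A minimal sketch (same assumed model) of extracting Cook's distance and applying the cut-offs just discussed:

```r
mod <- lm(BPAQ ~ BIS + age, data = agrsn)

# Cook's distance for each observation
D <- cooks.distance(mod)

# Conservative cut-off: 4/(N - P - 1) = 4/(275 - 2 - 1) = 0.0147 here
P      <- 2
N      <- nobs(mod)
cutoff <- 4 / (N - P - 1)

plot(D, xlab = "Observation (row number)", ylab = "Cook's distance")
abline(h = cutoff, lty = 2)

# Cases above the conservative cut-off, and above the ~0.15 "worry" value
which(D > cutoff)
which(D > 0.15)
```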
Other equally useful measures of influence include the Standardized Difference of Fit (DFFITS) and Standardized Difference of Beta (DFBETAS) (Patrick and Greg talk about these in the Quantitude episode assigned this week)
We won’t get into these in this course due to time limitation, but I do want you to know about them
You can check out Camilleri, Alter, & Cribbie (2024) for how Cook's D, DFFITS, and DFBETAS, and the various cut-off values for each influence statistic, perform
Another way we could visually examine extreme scores is through an influence plot
This plot simultaneously provides information about the leverage, discrepancy and Cook’s distance, and is available from the influencePlot() function in the car package
This plot helps visually identify potential outliers and highly influential cases based on the size of the “bubbles”
The numbers on the graph correspond to the row number for the observation (row #)
The dotted lines help give you a sense of what is in the “normal” range
You would be most concerned about larger bubbles (which is why their row numbers are identified), but especially so if they are close to the top-right or bottom-right corners (i.e., the observation has both high leverage and high discrepancy).
The influencePlot() function also provides a table for the most extreme scores shown in the plot. The table includes statistics for each of the outlier metrics: discrepancy, leverage, and influence
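A minimal sketch of producing the plot and table described above, assuming the same model:

```r
library(car)

mod <- lm(BPAQ ~ BIS + age, data = agrsn)

# Bubble plot of Studentized residuals (discrepancy) against hat values
# (leverage), with bubble area proportional to Cook's distance (influence);
# also returns a table of statistics for the most extreme cases
influencePlot(mod)
```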
What to do with outlying and/or highly-influential cases?
Possible Actions
Key Takeaway
Context matters—we need to “drill down” on these cases and try to understand each of them
Also, we must consider how much these cases affect results and how “surprising” they are
The best we can do is ensure transparent reporting, be honest about our data and decisions, avoid confirmation bias, and do some good old sensitivity analyses to test the robustness of our findings
Module 4 (Part 1)