PSYC 3032 Module 2 (Part 1)
Module 2’s topics relate to modelling (linear) relationships between variables to help address interesting questions like…
How does education affect earnings?
To what extent does listening to ska music relate to wearing black-and-white checkered clothing?
How strong is the relationship between taking PSYC3032 and being a billionaire?
What is the association between OCD and depression?
Does exercise influence psychological states, such as depression or anxiety?
Correlation and Simple Linear Regression
Before we dive into each topic separately, it’s useful to put both in the right context.
The terms correlation and simple regression are often used interchangeably, but there are key conceptual (though not mathematical) differences between them.
Correlation describes the strength of the (primarily linear) relationship or association between two variables
Correlation is used mainly as a descriptive statistic, to quantify an association; it says NOTHING about causation
Though, as it turns out, the math behind correlation estimates uses the same information as… you guessed it, simple linear regression!
Because correlation is cause-blind (association \(\neq\) causation), we often graph the relationship with double-headed arrows (i.e., we don’t know/care about why the variables relate, we just know they vary together)
Regression models, on the other hand, are necessarily directional (single-headed arrow), meaning we make a statement/assumption about what causes/affects what (e.g., X leads to Y)
Covariance and Correlation
Before we define correlation, which is, in fact, a standardized effect size measure, we should first talk about covariance, the unstandardized sibling of correlation
Covariance is the smallest building block of all GLMs and other advanced statistical techniques (e.g., SEM, MLM); you can think of it as an atom—the fundamental unit from which statistical models are built
By now you should know that each variable has its own variance—which describes the spread of the individual observations on that particular variable
\[VAR(X)=\frac{\sum (x_i-\bar{x})^2}{N-1}= \frac{\sum (x_i-\bar{x})(x_i-\bar{x})}{N-1} \]
where \(x_i\) is a particular observation’s score on X, \(\bar{x}\) is the mean of X, and \(N\) is the sample size.
The numerator represents the sum of squares (i.e., the sum of squared deviation scores from the mean)
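As a quick sanity check, here’s a minimal R sketch computing the variance by hand and comparing it with R’s built-in var() (the vector x is made up, purely for illustration):

# Made-up scores, just for illustration
x <- c(4, 7, 2, 9, 5)
# Variance by hand: sum of squared deviations from the mean, over N - 1
sum((x - mean(x))^2) / (length(x) - 1)
# Same value from R's built-in function
var(x)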
Extending this idea to two variables, replacing one of the \(X\) deviations with the corresponding deviation on a second variable \(Y\) gives the covariance:
\[COV(X, \ Y)= \frac{\sum (x_i-\bar{x})(y_i-\bar{y})}{N-1}\]
This measure describes how much the two variables co-vary: the covariance quantifies how \(X\) and \(Y\) are associated
If Y tends (i.e., on average) to be above its mean when X is above its mean, then \(COV(X, Y)\) is positive; if Y tends to be above its mean when X is below its mean, then \(COV(X, Y)\) is negative; and when \(COV(X, Y) = 0\), we say that X and Y are uncorrelated or orthogonal to one another
Covariance is an important statistic, but because it’s expressed in a hybrid metric (the product of the units of X and the units of Y), it’s hard to gauge its magnitude
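The matching sketch for covariance, again with made-up paired scores; the hand computation mirrors R’s cov():

# Made-up paired scores, just for illustration
x <- c(4, 7, 2, 9, 5)
y <- c(3, 8, 1, 10, 6)
# Covariance by hand: sum of cross-products of deviations, over N - 1
sum((x - mean(x)) * (y - mean(y))) / (length(x) - 1)
# Same value from R's built-in function
cov(x, y)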
With this definition of covariance we can now define Pearson’s correlation parameter
\[\rho = \frac{COV(X,Y)}{SD(X) \cdot SD(Y)}\]
where \(SD(X)\) and \(SD(Y)\) are the standard deviation of X and Y, respectively.
Dividing by the standard deviations of X and Y removes both sets of units, thereby standardizing the covariance and putting it on a comprehensible metric
Correlation is, thus, the standardized version of covariance
A correlation coefficient is a single numeric value representing the degree to which two variables are associated with one another
Because correlation is a standardized effect size measure, correlation coefficients are bounded by –1 and +1
The sign indicates the direction of the association, while the magnitude of the measure indicates the strength of the association
\(|r| = 1\) = perfect relationship; \(r = 0\) = no linear relationship
The correlation formula above is for the Pearson Product-Moment Correlation Coefficient between two continuous variables, but there are others (which we don’t discuss)
\(\rho\) or its estimate \(r\) do not provide a complete description of the two variables; you should always provide means and standard deviations.
Correlation measures the strength of the linear relationship between X and Y only; it’s inappropriate to use a correlation to describe nonlinear relationships
Pearson’s correlation assumes that both variables are normally distributed and that the spread of scores on one variable is constant across levels of the other
We can check all of that, and it’s ALWAYS a good idea to visualize the association (recall the first step in Modeling Steps 👣?)
How would you describe the relationship between age and height? Can you guess the correlation coefficient?
We could compute it ourselves from the covariance and the two standard deviations, or let R do it for us with cor(); both routes return the same value:
[1] 0.7824843
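Here is a minimal sketch of both routes (the original code isn’t shown, so assume the observations live in a hypothetical data frame dat with age and height columns):

# Correlation by hand: covariance over the product of the SDs
cov(dat$age, dat$height) / (sd(dat$age) * sd(dat$height))
# Same value from R's built-in function
cor(dat$age, dat$height)
# And, per the modeling steps, always visualize the association too
plot(dat$age, dat$height)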
A correlation coefficient is an effect size
It describes the magnitude and direction of the effect (association between two variables)
According to conventional benchmarks (based on Cohen’s rules of thumb), \(|r| \approx .10\) is a small effect, \(|r| \approx .30\) is medium, and \(|r| \approx .50\) is large
Another way to express the magnitude of the effect is to square the correlation to get the coefficient of determination, \(r^2\)
The coefficient of determination provides the proportion of variance in one variable that is shared or accounted for by the other
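For example, squaring the age–height correlation from above:

r <- 0.7824843   # age-height correlation computed earlier
r^2              # about 0.61: age and height share roughly 61% of their variance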
Partial and semi-partial correlations are used when we’re interested in the relationship between two variables while controlling for the effects of a third variable
Partial-\(r^2\): captures the relationship between X and Y, after controlling for the variance that Z shares with both X and Y
Semi-partial-\(r^2\): captures the unique contribution of X to Y, controlling for the effect of Z on X only, while retaining the total variance of Y (remember this for multiple regression!); see the sketch below
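To make the distinction concrete, here is a residual-based sketch with three simulated variables x, y, and z (hypothetical data; this is one standard way to compute these quantities, though a dedicated package can do the same):

# Simulate three related variables (hypothetical)
set.seed(1)
z <- rnorm(100)
x <- 0.5 * z + rnorm(100)
y <- 0.4 * x + 0.3 * z + rnorm(100)
# Partial r-squared: remove z from BOTH x and y, then correlate the residuals
x_res <- resid(lm(x ~ z))
y_res <- resid(lm(y ~ z))
cor(x_res, y_res)^2
# Semi-partial r-squared: remove z from x ONLY, keeping y's total variance
cor(y, x_res)^2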
Returning to our previous example of age and height
When adding a third variable, weight:
           height       age     weight
height 1.00000000 0.1764998 0.01696309
age    0.14121401 1.0000000 0.09108445
weight 0.01915631 0.1285635 1.00000000
Semi-partial-\(r^2=0.1764998\) means that when “controlling” for weight (or holding weight constant), age and height share about 18% of their variability
Or, you can say that about 18% of the total variability in height is shared uniquely with age
Say we’re interested in whether two random variables X and Y are correlated; we test the null
\[H_0: \rho = 0\]
using the t ratio:
\[t(df)=\frac{r-\rho_0}{SE_r}=\frac{r-0}{\sqrt{\frac{1-r^2}{N-2}}}=\frac{r\sqrt{N-2}}{\sqrt{1-r^2}}\]
with \(df = N − 2\) degrees of freedom, where r is the observed correlation, \(\rho_0\) is the specified correlation value under the null (i.e., 0), and N is the sample size
Rejecting this null would indicate that the r you observed was “surprising” given the position of ignorance, \(\rho = 0\)
A significant result on its own might not be very impressive (with a large sample size, even a tiny r will be “significant”), so we probably want to think in terms of CIs (what is \(\rho\) likely to be?)
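As a sanity check, here is the test by hand; the r and N come from the aggression example that follows, so you can compare the result against cor.test()’s output:

r <- 0.3206789   # observed correlation (from the example below)
N <- 275
t_stat <- r * sqrt(N - 2) / sqrt(1 - r^2)
t_stat                                               # about 5.59
2 * pt(abs(t_stat), df = N - 2, lower.tail = FALSE)  # two-sided p-value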
A researcher collected data from 275 undergraduate participants and hypothesized a relationship between aggression and impulsivity. She measured aggression using the Buss Perry Aggression Questionnaire (BPAQ) and impulsivity using the Barratt Impulsivity Scale (BIS)
Load the data in R
Examine descriptive information about your variables, e.g., means, SDs, histograms/boxplots and scatterplots
library(ggplot2)    # for the histograms
library(patchwork)  # for combining plots side by side

## Create plots
hist_BPAQ <- ggplot(agrsn, aes(x = BPAQ)) +
  geom_histogram(binwidth = 0.2, fill = "skyblue", color = "black") +
  labs(x = "BPAQ", y = "Frequency") +
  theme_minimal()

hist_BIS <- ggplot(agrsn, aes(x = BIS)) +
  geom_histogram(binwidth = 0.15, fill = "orange", color = "black") +
  labs(x = "BIS", y = "Frequency") +
  theme_minimal()

## Combine plots side by side
(hist_BPAQ | hist_BIS)
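Run the test with cor.test(); a call like the one below (inferred from the output, since the original code isn’t shown) produced the following:

# Test H0: rho = 0 between aggression (BPAQ) and impulsivity (BIS)
cor.test(agrsn$BPAQ, agrsn$BIS)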
Pearson's product-moment correlation
data: agrsn$BPAQ and agrsn$BIS
t = 5.5939, df = 273, p-value = 5.391e-08
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
0.2103747 0.4229210
sample estimates:
cor
0.3206789
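Squaring the sample correlation then gives the coefficient of determination:

# Proportion of variance shared by BPAQ and BIS
cor(agrsn$BPAQ, agrsn$BIS)^2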
[1] 0.102835
“In our sample of \(N = 275\) undergraduate students, there was a statistically significant relationship between scores on the BPAQ and BIS (\(r = 0.32\), 95% CI \([0.21, 0.42]\), \(t(273) = 5.59\), \(p < 0.001\)). These results suggest a moderate association between aggression and impulsivity. The narrow confidence interval (\([0.21, 0.42]\)) indicates a relatively precise estimate. Furthermore, aggression and impulsivity share about 10% of their variance (\(r^2 = 0.10\)), which is a modest yet meaningful amount.”