Module 2: Correlation & Simple Linear Regression

PSYC 3032 M

Udi Alter

About Module 2

Module 2’s topics relate to modelling (linear) relationships between variables to help address interesting questions like…

  • How does education affect earnings?

  • To what extent does listening to ska music relate to dressing in black-and-white checkered clothing?

  • How strong is the relationship between taking PSYC3032 and being a billionaire?

    • (nobody said the relationship is positive…)
  • What is the association between OCD and depression?

  • Does exercise influence psychological brain states, such as depression or anxiety?




Correlation and Simple Linear Regression

Prologue: Correlation vs. Regression

Before we dive into each topic separately, it’s useful to put both in the right context.

Correlation and simple regression are often used interchangeably, but there are key conceptual (though not mathematical) differences between them.

  • Correlation describes the strength of the (primarily linear) relationship or association between two variables

    • As one variable changes, what happens to the other variable? Does it go up as well (positive correlation)? Down (negative correlation)? Does it seem unaffected (no correlation)?
  • Correlation is used mainly as a descriptive statistic to quantify an association, but it says NOTHING about causation

  • Though, as it turns out, the math required to obtain correlation estimates uses the same information as… you guessed it, simple linear regression!

Prologue: Correlation vs. Regression

  • Simple regression, on the other hand, is about predicting or explaining an outcome/dependent variable using an independent/explanatory variable
    • Has more of a causal (or at minimum, suggestive) nature
  • Useful for experimental, quasi-experimental, and otherwise predictive empirical questions
  • In psychology, we often deal with observational data, meaning that we don’t manipulate the IVs directly (i.e., we don’t manipulate one’s education or earnings directly; we only passively record it)
    • Experimental studies do however try to manipulate the IVs explicitly, and we’ll talk much more about this later

Prologue: Correlation vs. Regression

Note that correlation is cause-blind (association \(\neq\) causation); we often graph the relationship with double-sided arrows (i.e., we don’t know/care about why they relate, we just know they vary together)

Regression models, on the other hand, are necessarily directional (one-sided arrow), meaning we make a statement/assumption about what causes/affects what (e.g., X leads to Y)




Covariance and Correlation

(Co)variance

  • Before we define correlation, which is, in fact, a standardized effect size measure, we should first talk about covariance, the unstandardized sibling of correlation

  • Covariance is the smallest building block of all GLMs and other advanced statistical techniques (e.g., SEM, MLM); you can think of it as an atom—the fundamental unit from which statistical models are built

  • By now you should know that each variable has its own variance—which describes the spread of the individual observations on that particular variable

\[VAR(X)=\frac{\sum (x_i-\bar{x})^2}{N-1}= \frac{\sum (x_i-\bar{x})(x_i-\bar{x})}{N-1} \]

  • where \(x_i\) is a particular observation’s score on X, \(\bar{x}\) is the mean of X, and \(N\) is the sample size.

  • The numerator represents the sum of squares (i.e., the sum of squared deviation scores from the mean)
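A minimal R sketch (using a small, hypothetical vector x) shows that this sum-of-squares formula matches R’s built-in var():

x <- c(4, 7, 6, 9, 3)                 # hypothetical scores on X
x_bar <- mean(x)                      # sample mean of X
sum((x - x_bar)^2) / (length(x) - 1)  # variance "by hand": sum of squares / (N - 1)
var(x)                                # same result from R's built-in function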

Covariance

  • If variance is how one variable varies (alone, with itself), then covariance is how one variable varies with another… Thus, the formula for the covariance between two variables is similar to that of variance (but, instead of X twice, we have X and Y):

\[COV(X, \ Y)= \frac{\sum (x_i-\bar{x})(y_i-\bar{y})}{N-1}\]

  • This measure describes how much the variables co-vary together; the covariance gives us a measure of how these two variables \(X\) and \(Y\) are associated

  • If Y tends (i.e., on average) to be above its mean when X is above its respective mean, then \(COV(X, Y)\) is positive; if Y tends to be above its mean when X is below its respective mean, then \(COV(X, Y)\) is negative; when \(COV(X, Y) = 0\), we say that X and Y are uncorrelated or orthogonal to one another

  • Covariance is an important statistic, but because it’s in a hybrid metric (the product of the units of X and the units of Y), it’s hard to gauge its magnitude

    • E.g., covariance of age and height might be in the ballpark of \(40 \ cm \times year\) (what does it even mean? Is it large, small, meh?)
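The same pattern works in R; here is a quick sketch with hypothetical x and y vectors, computing the covariance by hand and checking it against the built-in cov():

x <- c(4, 7, 6, 9, 3)                                  # hypothetical X scores
y <- c(110, 130, 125, 140, 105)                        # hypothetical Y scores
sum((x - mean(x)) * (y - mean(y))) / (length(x) - 1)   # covariance "by hand"
cov(x, y)                                              # same result from R's built-in function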

Correlation

With this definition of covariance we can now define Pearson’s correlation parameter

\[\rho = \frac{COV(X,Y)}{SD(X) \cdot SD(Y)}\]

  • where \(SD(X)\) and \(SD(Y)\) are the standard deviation of X and Y, respectively.

  • Dividing by the standard deviations of X and Y removes both metrics, thereby standardizing the covariance and placing it on a comprehensible metric

  • Correlation is, thus, the standardized version of covariance

  • A correlation coefficient is a single numeric value representing the degree to which two variables are associated with one another

  • Because correlation is a standardized effect size measure, correlation coefficients are bounded by –1 and +1

  • The sign indicates the direction of the association, while the magnitude of the measure indicates the strength of the association

  • \(|1|\) = perfect relationship; 0 = no relationship

Correlation Considerations

  • The correlation formula above is for the Pearson Product-Moment Correlation Coefficient between two continuous variables, but there are others (which we don’t discuss in detail; see the brief sketch after this list)

    • Rank-ordered variables (1st, 2nd, 3rd) = Spearman’s rank correlation
    • Dichotomous (e.g., [0,1]) measures = \(\phi\) (Phi) correlation
    • One dichotomous, one continuous = point-biserial correlation
    • etc.
  • \(\rho\) or its estimate \(r\) do not provide a complete description of the two variables; you should always provide means and standard deviations.

  • Correlation measures the strength of the linear relationship between X and Y only; it’s inappropriate to use a correlation to describe nonlinear relationships

  • Pearson’s correlation assumes that both variables are normally distributed and that the spread of scores on one variable is constant across levels of the other

  • We can check all of that, and it’s ALWAYS a good idea to visualize the association (recall the first step in Modeling Steps 👣?)
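As a brief illustration of those alternative coefficients (hypothetical vectors; the rank-based and dichotomous cases are all handled by R’s cor(), since phi and point-biserial are special cases of Pearson’s formula):

x <- c(2.1, 3.4, 2.8, 4.0, 3.1)   # hypothetical continuous variable
y <- c(1.5, 2.9, 2.2, 3.8, 2.6)   # hypothetical continuous variable
group <- c(0, 1, 0, 1, 1)         # hypothetical dichotomous (0/1) variable

cor(x, y, method = "spearman")    # Spearman's rank correlation
cor(group, c(0, 1, 1, 1, 0))      # phi: Pearson's formula applied to two 0/1 variables
cor(group, x)                     # point-biserial: Pearson's formula with one 0/1 variable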

Visualizing Relationships with Scatterplots

library(ggplot2) # load the package
# assuming `data` is a data frame with numeric columns age and height
ggplot(data, aes(x = age, y = height)) +
  geom_point(size = 3, color = "deepskyblue3") +
  labs(title = "Scatter Plot of Age vs. Height", x = "Age (years)", y = "Height (cm)") +
  theme_minimal()

Visualizing Relationships with Scatterplots

How would you describe the relationship between age and height? Can you guess the correlation coefficient?

Interpreting Scatterplots

  • Identify the general pattern of the observations
    • The pattern can be described by the form (non-/linear?), direction (goes up/down), and strength of the relationship (are the dots tightly clustered around a discernible trend line/curve?)
  • Identify obvious deviations from the pattern
    • Observations that clearly deviate from the overall pattern may be outliers

In R

We could do this ourselves (first block), or let R do this for us!

cov_ah <- cov(age, height)                  # covariance of age and height
sd_age <- sd(age) ; sd_height <- sd(height) # standard deviation of each variable
cov_ah/(sd_age*sd_height)                   # standardize: covariance / (SD of X * SD of Y)
[1] 0.7824843

which is the same as…

cor(age, height)
[1] 0.7824843

Guess the Correlation

[A series of four scatterplot exercise slides: each presents a scatterplot to eyeball, followed by an Answer slide revealing the correlation.]

The Importance of Visualizing Data

Effect Sizes

\(r\)

  • A correlation coefficient is an effect size

  • It describes the magnitude and direction of the effect (association between two variables)

  • According to conventional benchmarks (based on Cohen’s rules of thumb):

    • \(r \approx 0.1\) or smaller is a small effect
    • \(r \approx 0.3\) is a medium effect
    • \(r \approx 0.5\) or larger is a large effect

Effect Sizes

\(r^2\)

  • Another way to express the magnitude of the effect is to square correlation to get the coefficient of determination, \(r^2\)

  • The coefficient of determination provides the proportion of variance in one variable that is shared or accounted for by the other

Effect Sizes

Partial-\(r^2\) & Semi-partial-\(r^2\)
  • Used when interested in the relationship between two variables, controlling for the effects of a third variable

    • Partial-\(r^2\): captures the relationship between X and Y after controlling for the variance that Z shares with both X and Y

    • Semi-partial-\(r^2\): captures the unique contribution of X to Y, controlling for the effect of Z on X only, while retaining the total variance of Y (remember this for multiple regression!)

Effect Sizes in R

Returning to our previous example of age and height

cor(age, height) # correlation coefficient, r
[1] 0.7824843
  • \(r=0.78\), a rather large effect size—a strong, positive association



cor(age, height)^2 # coefficient of determination, r^2
[1] 0.6122817
  • \(r^2=0.61\) means that age and height share about 61% of the variance; that’s a lot! (but also expected, right?)

Effect Sizes in R

When adding a third variable, weight:

library(ppcor)
spcor(data)$estimate^2 # semi-partial-r^2
           height       age     weight
height 1.00000000 0.1764998 0.01696309
age    0.14121401 1.0000000 0.09108445
weight 0.01915631 0.1285635 1.00000000


  • Semi-partial-\(r^2=0.1764998\) means that when “controlling” for weight (or holding weight constant), age and height share about 18% of their variability

  • Or, you can say that about 18% of the total variability in height is shared uniquely with age
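For the partial-\(r^2\) described earlier, ppcor offers the analogous pcor() function; a minimal sketch (assuming the same data frame of height, age, and weight) would be:

pcor(data)$estimate^2 # partial-r^2: the [height, age] entry is the variance shared by age and
                      # height after removing weight's overlap with BOTH variables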

Statistical Inference and Hypothesis Testing

Say we’re interested in whether two random variables X and Y are correlated at all. We test the null

\[H_0: \rho = 0\]

using the t ratio:

\[t(df)=\frac{r-\rho_0}{SE_r}=\frac{r-0}{\sqrt{\frac{1-r^2}{N-2}}}=\frac{r\sqrt{N-2}}{\sqrt{1-r^2}}\]

with \(df = N − 2\) degrees of freedom, where r is the observed correlation, \(\rho_0\) is the specified correlation value under the null (i.e., 0), and N is the sample size

  • Rejecting this null would indicate that the r you observed was “surprising” given the position of ignorance, \(\rho = 0\)

  • Rejecting the null might not be very impressive on its own (e.g., with a large sample size, even a tiny correlation is “significant”), so you probably want to think in terms of CIs (what is \(\rho\) likely to be?)
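A minimal R sketch of this t ratio, using hypothetical values r = .30 and N = 100 (cor.test(), shown later, computes all of this for you):

r <- 0.30; N <- 100                                  # hypothetical correlation and sample size
t_stat <- r * sqrt(N - 2) / sqrt(1 - r^2)            # t ratio from the formula above
t_stat
2 * pt(abs(t_stat), df = N - 2, lower.tail = FALSE)  # two-tailed p-value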

Applied Research Example

A researcher collected data from 275 undergraduate participants and hypothesized a relationship between aggression and impulsivity. She measured aggression using the Buss Perry Aggression Questionnaire (BPAQ) and impulsivity using the Barratt Impulsivity Scale (BIS)


Load the data in R

library(haven)
agrsn <- read_sav("aggression.sav")

Descriptive Statistics and Graphs

Examine descriptive information about your variables, e.g., means, SDs, histograms/boxplots and scatterplots


library(misty)
library(tidyverse)
agrsn %>% select(BPAQ, BIS) %>% descript() # using the pipe operator, %>%, which reads "and then..." 
 Descriptive Statistics

  Variable   n nNA   pNA    M   SD  Min  Max Skew  Kurt
   BPAQ    275   0 0.00% 2.61 0.52 1.34 4.03 0.01 -0.38
   BIS     275   0 0.00% 2.28 0.35 1.42 3.15 0.36 -0.19

Descriptive Statistics and Graphs

library(patchwork)
## Create plots
hist_BPAQ <- ggplot(agrsn, aes(x = BPAQ)) +
  geom_histogram(binwidth = 0.2, fill = "skyblue", color = "black") +
  labs(x = "BPAQ", y = "Frequency") +
  theme_minimal()
hist_BIS <- ggplot(agrsn, aes(x = BIS)) +
  geom_histogram(binwidth = 0.15, fill = "orange", color = "black") +
  labs( x = "BIS", y = "Frequency") +
  theme_minimal()
# Combine plots
(hist_BPAQ | hist_BIS)

Descriptive Statistics and Graphs

Descriptive Statistics and Graphs

ggplot(agrsn, aes(x = BPAQ, y = BIS)) +
  geom_point(size=3, color="deepskyblue3") +
  geom_smooth(method = "lm", colour= "black", linewidth=2)+ # Corr/Reg line
  geom_smooth(colour= "purple", linewidth=2, linetype="dashed", se=F)+ # LOESS LINE
  theme_minimal()

Understanding the Plot

  • The dashed, purple line is the LOESS (Locally Estimated Scatterplot Smoothing) curve, a non-parametric method that fits a smooth curve to data, capturing non-linear, local trends in the data without assuming a predefined model
    • Why is it important? It can tell us if, where, and how there’s a deviation from linearity!
  • The solid, black line is the regression line, the slope of which represents the direction and strength of the correlation
    • The gray band corresponds to the uncertainty: the 95% confidence band around the regression line, which shows the range within which we expect the true regression line to lie (95% of the time, had we conducted this analysis repeatedly, each time with a different sample of the same size)

Hypothesis Testing in R

cor.test(agrsn$BPAQ, agrsn$BIS)

    Pearson's product-moment correlation

data:  agrsn$BPAQ and agrsn$BIS
t = 5.5939, df = 273, p-value = 5.391e-08
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 0.2103747 0.4229210
sample estimates:
      cor 
0.3206789 
cor(agrsn$BPAQ, agrsn$BIS)^2 # coefficient of determination, r^2
[1] 0.102835

Interpretation and Reporting

When reporting, we should include:

  • Sample size
  • Effect size (\(r = .32\) & \(r^2 = .1\))
  • Uncertainty (SE and/or 95% CI)
  • Interpretation of effect size and uncertainty
  • Analyses results (t and p values with \(df\))
  • If known, relate the results back to the research question and connect them to the larger body of literature

Example:

“In our sample of \(N = 275\) undergraduate students, there was a statistically significant relationship between scores on the BPAQ and BIS (\(r = 0.32\), 95% CI \([0.21, 0.42]\), \(t(273) = 5.59\), \(p < 0.001\)). These results suggest a moderate association between aggression and impulsivity. The narrow confidence interval (\([0.21, 0.42]\)) indicates a relatively precise estimate. Furthermore, aggression and impulsivity share about 10% of their variance (\(r^2 = 0.10\)), which is a modest yet meaningful amount.”