Module 2: Correlation & Simple Linear Regression

PSYC 3032 M

Udi Alter

Simple Linear Regression

Remember this slide from Module 1? 📈

Mathematically, the GLM can be expressed like this:

\[y_i = \beta_0 + \beta_1 x1_{i} + \beta_2 x2_{i} + \dots + \beta_p xp_{i} + \epsilon_i\]

where

  • \(y_i\): Outcome variable (sometimes called criterion, response, or DV) for participant i
  • \(\beta_0\): Intercept
  • \(\beta_1, \dots, \beta_p\): Coefficients (sometimes called regression coefficients, slopes or partial slopes, effects or fixed effects, estimates/parameters, or betas)
  • \(x1_i, x2_i, \dots, xp_i\): Predictor variables for participant i (sometimes called explanatory variables, regressors, covariates, or IVs)
  • \(\epsilon_i\): Error term

Simple Linear Regression (SLR)

To start, we’re going to focus on simple linear regression, meaning that we have one IV (predictor) and one DV (outcome)

Or, more simply, the SLR looks like this:

\[y_i = \beta_0 + \beta_1 x1_{i} + \epsilon_i\]

A regression model is a formal model for expressing the tendency of the outcome variable, Y, to vary conditionally on the predictor variable, X.


This SLR model has 3 parameters and 3 variables. Can you identify two of each?

Simple Linear Regression (SLR)

Population:

\[y_i = \beta_0 + \beta_1 x1_{i} + \epsilon_i\]

Sample: \[y_i = \hat{\beta}_0 + \hat{\beta}_1 x1_{i} + e_i\]

  • The variables in this model are:
    • The outcome variable, \(y_i\)
    • The predictor variable \(x1_i\)
    • The residual, \(e_i\)
  • The parameters in this model are the intercept (\(\beta_0\)), slope (\(\beta_1\)), and \(\sigma_{\epsilon}\), which is the SD of the errors, and tells us how closely, on average, the model predicts each \(y_i\) observation
    • \(\sigma_{\epsilon}\) is estimated after we obtain the \(\beta\)s via OLS; it’s added to the model so we can get SEs, CIs, and p values for our estimates
    • But, with SLR—much like in correlation—we focus primarily on the relationship between the variables (i.e., the slope parameter) and \(r^2\)

SLR’s Parameters & Variables 🎨

\[y_i = \hat{\beta}_0 + \hat{\beta}_1 x1_{i} + e_i\]

Regression(s) Consideration(s)

  • The linear regression model is designed to work with a continuous outcome variable

  • The residuals, \(e\), represent the inaccuracy of the model’s ability to reproduce (i.e., predict/explain) the value of \(y_i\) for a given person

  • \(\epsilon\) (and thus \(e\)) is assumed (for proper SE estimates, CIs, p values, etc.) to be normally distributed (across all levels of X) with a mean of 0 (the SD/variance is estimated)

    • which is written as \(e_i \sim N(0, \sigma^2_{\epsilon})\) or, alternatively, \(y_i \sim N(\hat{y}_i, \sigma^2_{\epsilon})\)
  • \(Y\) and \(\epsilon\) (and thus \(e\)) are treated as random variables, the errors are assumed to be uncorrelated with X, and X is assumed to be an error-free (yeah right…) component that we are using to predict values in Y

  • Parameters without the i subscript are constants—they do not vary across observations—\(\beta_0\) and \(\beta_1\) have one value for all observations/individuals

  • Regression models are all about conditional expectations (i.e., conditional means):

\[E(y_i|x1_i) = \beta_0 + \beta_1 x1_i\]
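
To make the conditional-mean idea concrete, here is a tiny generative sketch in R (my illustration, with made-up parameter values, not from the slides): y is drawn from a normal distribution centred on the conditional mean \(\beta_0 + \beta_1 x1_i\).

set.seed(1)
n <- 200
beta_0 <- 1; beta_1 <- 0.5; sigma_e <- 2                   # hypothetical parameter values
x1 <- runif(n, 0, 10)                                      # hypothetical predictor
y  <- rnorm(n, mean = beta_0 + beta_1 * x1, sd = sigma_e)  # y_i ~ N(E(y_i | x1_i), sigma_e^2)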

\(\beta\)s Interpretation

  • The intercept, \(\beta_0\), is the expected value on Y for a hypothetical observation with \(x1_i=0\)

  • The slope measures the strength of the linear relationship between X and Y and indicates that a one-unit increase in X corresponds to an expected/predicted change of \(\beta_1\) in Y


Example: say that you were studying the relationship between a final grade in PSYC 3032 (in %) and the combined score on the assignments (out of a total of 50):

  • You discover the relationship is

\[E( final| assignments) = \beta_0 + \beta_1 \times assignments\] \[E( final| assignments) = 42 + 1.16\times assignments\]

What this means: \(\beta_0 = 42\) indicates that a person who scored 0 on the assignments is expected/predicted to finish the course with 42%, while \(\beta_1 = 1.16\) means that every 1-point increase on the assignments is expected to increase the final grade by 1.16%

Model Predictions

What about a specific individual, for example, someone who obtained 40/50? Notice below that the value of 40 is put into the equation, because \(\beta_1\) is with respect to the raw assignments score.

\[E( final| assignments) = \hat{y}_i= 42 + 1.16\times 40 = 88.4\]

In other words, that individual is expected to obtain 88.4% as their final grade, given their assignment scores (how close this average guess is depends on how good the model is at prediction!).

\(\hat{y}_i\) is called the predicted value for the ith observation; so, for the individual above, the predicted/expected value, given their assignment score, is \(\hat{y}_i=88.4\).

  • But, let’s say that individual actually got 91% as their final grade; we can calculate the residual (error term) for that individual, \(e_i = y_i - \hat{y_i}= 91-88.4=2.6\) (the difference between what was actually observed versus what was expected under the model)
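
The same arithmetic, written out as a few lines of R (numbers taken directly from the grade example above):

b0 <- 42; b1 <- 1.16             # intercept and slope from the example
assignments <- 40                # this individual's assignment total (out of 50)
y_hat <- b0 + b1 * assignments   # predicted final grade: 88.4
91 - y_hat                       # residual if the observed final grade is 91%: 2.6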

…Let’s look again at the SLR plot

Model Estimation

The blue line, which is specified by our best \(\beta_0\) and \(\beta_1\) estimates, is called the “Line of Best Fit,” and it “cuts” right through all the observations. But how can we find it!?

Ordinary Least Squares (OLS)

OLS is a mathematical solution for finding out the “best” parameter estimates of a linear regression model.


But, what does “best” even mean?

“Best” parameter estimates means that the model (i.e., the regression line) gives us the most accurate prediction/explanation power! (in our sample)


As its name suggests, OLS gives us the values for the intercept and slope(s) that yield the least (minimum) sum of squared residuals (i.e., “best!”).


We can square the residuals and find their sum. But, wait! We would need to know the estimates for \(\beta_0\) and \(\beta_1\) to get the residuals, right? Recall, \[e_i = y_i-\hat{y}_i=y_i-(\beta_0+\beta_1 x1_i),\]

So what do we do!?

Ordinary Least Squares (OLS)

We can “build” a function from the calculation of all \(e\)s in the sample, but leave two unknowns (intercept and slope):

\[\sum_{i=1}^N{e^2}=\sum_{i=1}^N{(y_i-\hat{y}_i)^2}=\sum_{i=1}^N{(y_i-[\beta_0+\beta_1 x1_i])^2}\]

# Or, in R:
ols <- function(beta_0, beta_1, x, y) {
  residuals <- y - (beta_0 + beta_1 * x)  # calculate residuals
  RSS <- sum(residuals^2)                 # square and sum; RSS = Residual Sum of Squares
  return(RSS)
}

Now, with a little bit of calculus and linear algebra, we can find the solution (the minimum of the function):

  • Differentiate \(\sum{e^2}\) with respect to \(\beta_0\) and \(\beta_1\)
  • Set these partial derivatives equal to 0, and solve each equation to find each parameter
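
As an optional numerical sanity check (my illustration, not part of the calculus route above), you can hand the ols() function from the previous slide to base R’s general-purpose optimizer, optim(), and let it search for the minimum on simulated data:

set.seed(123)
x <- rnorm(100)
y <- 5 + 2.5 * x + rnorm(100)                   # hypothetical data with beta_0 = 5, beta_1 = 2.5
fit <- optim(par = c(beta_0 = 0, beta_1 = 0),   # starting values for the two unknowns
             fn  = function(par) ols(par[1], par[2], x, y))
fit$par                                         # should land very close to coef(lm(y ~ x))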

Check in question: Why do we square the residuals?

The OLS Function (don’t worry about this!)

Population parameters: \(\beta_0=5\) and \(\beta_1=2.5\)

Parameter estimates: \(\hat{\beta}_0=4.62\) and \(\hat{\beta}_1=2.56\)

Ordinary Least Squares (OLS)

Don’t worry too much about the math; all you need to know is that there is an “easy” solution for SLR:

\[\hat{\beta}_1=\frac{COV(X,Y)}{VAR(X)} =\frac{\sum_{i=1}^N (x1_i - \bar{x})(y_i-\bar{y})}{\sum_{i=1}^N (x1_i - \bar{x})^2}=r \times \frac{SD(Y)}{SD(X)}\]

With \(\hat{\beta}_1\), we can solve for the intercept:

\[\hat{\beta}_0=\bar{y}-\hat{\beta}_1 \bar{x}\]

And \(\hat{\sigma}_{\epsilon}\), which is the estimate of \(\sigma_{\epsilon}\) and is called the residual SE, can be calculated as:

\[\hat{\sigma}_{\epsilon} = \sqrt{\frac{\sum (y_i-\hat{y}_i)^2}{N-2}}= \sqrt{\frac{\sum e^2}{df}}\]
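
Here is a minimal sketch (my illustration, with made-up data) of these closed-form formulas computed “by hand” in R:

set.seed(42)
x <- rnorm(50); y <- 3 + 2 * x + rnorm(50)      # hypothetical X and Y
b1 <- cov(x, y) / var(x)                        # slope: COV(X,Y) / VAR(X)
b1_alt <- cor(x, y) * sd(y) / sd(x)             # same value via r * SD(Y)/SD(X)
b0 <- mean(y) - b1 * mean(x)                    # intercept
e  <- y - (b0 + b1 * x)                         # residuals
sigma_hat <- sqrt(sum(e^2) / (length(y) - 2))   # residual SE, df = N - 2
c(b0, b1, sigma_hat)                            # should match lm(y ~ x) and sigma(lm(y ~ x))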

Check in Question

If both X and Y are standardized, what does \(\hat{\beta}_1\) equal?

Statistical Inference and Hypothesis Testing

Often researchers seek to test the statistical significance of the slope parameter and of the proportion of variability in the outcome it shares/explains.

Specifically, we test the following null and alternative hypotheses:

\[H_0:\beta_1=0\] \[H_1:\beta_1\neq0\]

Rejecting this null, \(H_0:\beta_1=0\), would indicate that the \(\hat{\beta}_1\) you found was “surprising” given the position of ignorance, \(\beta_1 = 0\); that is, the population relationship between X and Y is unlikely to be zero…

Statistical Inference and Hypothesis Testing

…These hypotheses are tested using a type of t test. Specifically, each parameter has its own standard error such that we can calculate a ratio of the effect (estimated parameter) over noise (the standard error) and get a p value associated with the t statistic.

\[t(df)=\frac{\hat{\beta}_1-B}{SE_{\hat{\beta_1}}}\]

where B is the value of \(\beta\) under the null; some constant to test against (often, B = 0, which is the default in most software)

It is also easy to obtain CIs for the individual parameters using the following equation:

\[CI_{(1-\alpha)100\%}=\hat{\beta}_1 \pm t_{(1 − \alpha /2, \ df)} \times SE_{\hat{\beta_1}}\]

where \(\alpha\) is the nominal Type I error rate, and \(t_{(1 − \alpha /2, \ df)}\) is the critical value for the t distribution with \(df\) degrees of freedom and \(\alpha\) level of significance
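
Here is a short sketch of how these formulas map onto base-R functions, plugging in the slope estimate, standard error, and df from the applied example later in this module (0.4777, 0.0854, and 273, respectively):

b1 <- 0.4777; se_b1 <- 0.0854; df <- 273; alpha <- 0.05
t_stat <- (b1 - 0) / se_b1                              # B = 0 under the usual null; ~5.59
p_val  <- 2 * pt(abs(t_stat), df, lower.tail = FALSE)   # two-tailed p value
ci     <- b1 + c(-1, 1) * qt(1 - alpha/2, df) * se_b1   # 95% CI, roughly [0.31, 0.65]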

Effect Sizes in Simple Regression

The effect sizes of interest in SLR are the regression analogues of the effect sizes of interest in correlation.

  • Regression slope/coefficient: unstandardized regression coefficients are easy to understand because they have a clear metric.
    • Because they represent the predicted change in the outcome variable for each 1-unit change in the predictor variable, they provide information about the strength of their association
    • Standardized slope estimates are a different parameterization that can also be used as an effect size. In simple regression, a standardized slope is just the correlation (see the short R check after this list).
  • Shared variance, \(r^2\): the proportion of variability in Y explained/predicted by X
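
A quick check of the standardized-slope point (my illustration, with made-up data): regressing standardized Y on standardized X returns Pearson’s r as the slope.

set.seed(7)
x <- rnorm(100); y <- 0.4 * x + rnorm(100)   # hypothetical data
coef(lm(scale(y) ~ scale(x)))[2]             # standardized slope
cor(x, y)                                    # Pearson's r; matches the slope above
cor(x, y)^2                                  # r^2, the shared-variance effect size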

Applied Research Example

We will revisit the research example from last week examining the relationship between impulsivity and aggression: 275 undergraduates completed a questionnaire that included the BPAQ and BIS scales, among others.

Ultimately, the researcher is interested in predicting aggression from impulsivity

At a very simple level, the researcher wants to devise a model for aggression (operationalized with BPAQ scores) to explain or predict how and why people vary on this variable, given their BIS score.

Let’s revisit the example from last week to highlight more of a regression lens as opposed to correlation.


Load the data in R

library(haven)
agrsn <- read_sav("aggression.sav")

Descriptive Statistics and Graphs

Examine descriptive information about your variables, e.g., means, SDs, histograms/boxplots and scatterplots


library(misty)
library(tidyverse)
agrsn %>% select(BPAQ, BIS) %>% descript() # using the pipe operator, %>%, which reads "and then..." 
 Descriptive Statistics

  Variable   n nNA   pNA    M   SD  Min  Max Skew  Kurt
   BPAQ    275   0 0.00% 2.61 0.52 1.34 4.03 0.01 -0.38
   BIS     275   0 0.00% 2.28 0.35 1.42 3.15 0.36 -0.19

Descriptive Statistics and Graphs

The output is on the next slide…

my_fn <- function(data, mapping, method1 = "lm", method2 = "loess", ...) {
  ggplot(data = data, mapping = mapping) +
    geom_point(color = "deepskyblue3") +
    geom_smooth(method = method1, linewidth = 2, colour = "black") +   # linear (lm) fit
    geom_smooth(method = method2, linewidth = 2, span = 0.7, colour = "purple",
                alpha = 0.2, linetype = "dashed") +                    # LOESS fit
    theme_minimal()
}

# Using the ggpairs function from the GGally package
GGally::ggpairs(agrsn %>% select(BPAQ, BIS), aes(alpha = 0.5),
        columnLabels = c("Aggression",
                         "Impulsivity"),
        lower = list(continuous = my_fn),
        upper = list(continuous = 'cor'))

Descriptive Statistics and Graphs

The code is on the previous slide…

Descriptive Statistics and Graphs

ggplot(agrsn, aes(x = BIS, y = BPAQ)) +  # predictor (BIS) on x, outcome (BPAQ) on y
  geom_point(size = 3, color = "deepskyblue3") +
  geom_smooth(method = "lm", colour = "black", linewidth = 2) +   # Corr/Reg line
  geom_smooth(colour = "purple", linewidth = 2, linetype = "dashed",
              se = FALSE, span = 0.7) +                           # LOESS line
  theme_minimal()

Hypothesis Testing in R

SLR.mod <- lm(formula= BPAQ ~ BIS, data=agrsn)
summary(SLR.mod)

Call:
lm(formula = BPAQ ~ BIS, data = agrsn)

Residuals:
     Min       1Q   Median       3Q      Max 
-1.14134 -0.30470  0.00845  0.35500  1.35527 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   1.5217     0.1973   7.713 2.31e-13 ***
BIS           0.4777     0.0854   5.594 5.39e-08 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.4972 on 273 degrees of freedom
Multiple R-squared:  0.1028,    Adjusted R-squared:  0.09955 
F-statistic: 31.29 on 1 and 273 DF,  p-value: 5.391e-08

confint(SLR.mod)
                2.5 %    97.5 %
(Intercept) 1.1333249 1.9101360
BIS         0.3095758 0.6458088
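
If you prefer to pull these quantities out of the fitted model object programmatically, a small sketch using standard extractor functions:

coef(SLR.mod)                # beta-hat_0 and beta-hat_1
sigma(SLR.mod)               # residual standard error (~0.497)
summary(SLR.mod)$r.squared   # multiple R-squared (~0.103)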

Interpretation and Reporting

When reporting, we should include:

  • Sample size
  • Effect size (\(\hat{\beta}_1 = 0.4777\) and multiple \(R^2 = 0.1028\))
  • Uncertainty (SE and/or 95% CI)
  • Interpretation of effect size and uncertainty
  • Analyses results (t and p values with \(df\))
  • If relevant, tie the results back to the research question and connect them to the larger body of literature.

Example:

“In our sample of \(N = 275\) undergraduate students, impulsivity (BIS) was found to predict aggression (BPAQ). For every 1-point increase on BIS, aggression (BPAQ) is predicted to increase by approximately 0.48 points (\(\hat{\beta}_1 = 0.48\), 95% CI \([0.31, 0.65]\)). This association is statistically significant (\(t(273) = 5.59\), \(p < 0.001\)). The narrow confidence interval suggests a rather precise estimate of the effect size. Furthermore, impulsivity explained about 10% of the variability in aggression (\(R^2 = 0.10\)), indicating a modest yet meaningful proportion.”

In the next episodes of 3032…