PSYC 3032 M
Simple Linear Regression
Mathematically, the GLM can be expressed like this:
\[y_i = \beta_0 + \beta_1 x1_{i} + \beta_2 x2_{i} + \dots + \beta_p xp_{i} + \epsilon_i\]
where \(y_i\) is the value of the outcome variable for person \(i\), \(\beta_0\) is the intercept, \(\beta_1, \dots, \beta_p\) are the slopes for the predictors \(x1_i, \dots, xp_i\), and \(\epsilon_i\) is the error term
To start, we’re going to focus on simple linear regression, meaning that we have one IV (predictor) and one DV (outcome)
Or, less angrily, the SLR looks like this:
\[y_i = \beta_0 + \beta_1 x1_{i} + \epsilon_i\]
A regression model is a formal model for expressing the tendency of the outcome variable, Y, to vary conditionally on the predictor variable, X.
This SLR model has 3 parameters and 3 variables. Can you identify two of each?
Population:
\[y_i = \beta_0 + \beta_1 x1_{i} + \epsilon_i\]
Sample:
\[y_i = \hat{\beta}_0 + \hat{\beta}_1 x1_{i} + e_i\]
The linear regression model is designed to work with a continuous outcome variable
The residuals, \(e\), represent the inaccuracy of the model’s ability to reproduce (i.e., predict/explain) the value of \(y_i\) for a given person
\(\epsilon\) (and thus \(e\)) is assumed (for proper SE estimates, CIs, p values, etc.) to be normally distributed (across all levels of X) with a mean of 0 (the SD/variance is estimated)
\(Y\) and \(\epsilon\) (and thus \(e\)) are random variables, while X is assumed to be an error-free (yeah right…) component, uncorrelated with \(\epsilon\), that we are using to predict values in Y
Parameters without the i subscript are constants—they do not vary across observations—\(\beta_0\) and \(\beta_1\) have one value for all observations/individuals
Regression models are all about conditional expectations (i.e., conditional means):
\[E(y_i|x1_i) = \beta_0 + \beta_1 x1_i\]
The intercept, \(\beta_0\), is the expected value on Y for a hypothetical observation with \(x1_i=0\)
The slope measures the strength of the linear relationship between X and Y and indicates that a one-unit increase on X results in an expected/predicted change of \(\beta_1\) in Y
Example: say that you were studying the relationship between a final grade in PSYC 3032 (in %) and the combined score on the assignments (out of a total of 50):
\[E( final| assignments) = \beta_0 + \beta_1 \times assignments\] \[E( final| assignments) = 42 + 1.16\times assignments\]
What this means: \(\beta_0 = 42\) indicates that a person who scored 0 on the assignments is expected/predicted to finish the course with 42%, while \(\beta_1 = 1.16\) means that every 1-point increase on the assignments is expected to increase the final grade by 1.16%
What about for a specific individual; for example, someone who obtained 40/50? Notice below that the value of 40 is put into the equation, because the \(\beta_1\) is with respect to the raw assignments score.
\[E( final| assignments) = \hat{y}_i= 42 + 1.16\times 40 = 88.4\]
In other words, that individual is expected to obtain 88.4% as their final grade, given their assignment scores (how close this average guess is depends on how good the model is at prediction!).
\(\hat{y_i}\) is called the predicted value for the ith observation; so, for that individual above, the predicted/expected value, given their score on the assignments, is \(\hat{y_i}=88.4\).
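As a quick illustration (a sketch using the made-up coefficients from this example, not real course data), the same prediction can be computed in R:
b0 <- 42                          # intercept from the example above
b1 <- 1.16                        # slope from the example above
assignments <- 40                 # this person's assignment total (out of 50)
y_hat <- b0 + b1 * assignments    # predicted final grade
y_hat                             # 88.4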
…Let’s look again at the SLR plot
The blue line, which is specified by our best \(\beta_0\) and \(\beta_1\) estimates, is called the “Line of Best Fit,” and it “cuts” right through all the observations. But how can we find it!?
OLS is a mathematical solution for finding out the “best” parameter estimates of a linear regression model.
But, what does “best” even mean?
“Best” parameter estimates means that the model (i.e., the regression line) gives us the most accurate prediction/explanation power! (in our sample)
As its name suggests, OLS gives us the values for the intercept and slope(s) that yield the least (minimum) sum of squared residuals (i.e., the “best!”).
We can square the residuals and find their sum. But, wait! We would need to know the estimate for \(\beta_0\) and \(\beta_1\) to get the residuals, right? Recall, \[e_i = y_i-\hat{y}_i=y_i-(\beta_0+\beta_1 x1_i),\]
So what do we do!?
We can “build” a function from the calculation of all \(e\)s in the sample, but leave two unknowns (intercept and slope):
\[\sum_{i=1}^N{e_i^2}=\sum_{i=1}^N{(y_i-\hat{y}_i)^2}=\sum_{i=1}^N{(y_i-[\beta_0+\beta_1 x1_i])^2}\]
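As an illustration (not from the slides), this function can be written directly in R, with the intercept and slope left as arguments to be chosen:
# Sum of squared residuals for candidate values of the intercept (b0) and slope (b1)
sse <- function(b0, b1, x, y) {
  sum((y - (b0 + b1 * x))^2)
}

# Toy data, made up for illustration
x <- c(1, 2, 3, 4, 5)
y <- c(2.1, 3.9, 6.2, 8.1, 9.8)
sse(0, 2.0, x, y)   # one candidate pair
sse(1, 1.5, x, y)   # a different pair gives a larger sum here
OLS chooses the pair of values that makes this sum as small as possible.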
Now, with a little bit of calculus and linear algebra, we can find the solution (the minimum of the function).
Check in question: Why do we square the residuals?
Population parameters: \(\beta_0=5\) and \(\beta_1=2.5\)
Parameter estimates: \(\hat{\beta}_0=4.62\) and \(\hat{\beta}_1=2.56\)
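A minimal simulation sketch (the seed, N, and error SD below are arbitrary choices, not from the slides) of what is going on here: generate data from a known population model and let lm(), which uses OLS, recover the parameters:
set.seed(3032)                            # arbitrary seed
N  <- 200                                 # arbitrary sample size
x1 <- rnorm(N)                            # predictor
y  <- 5 + 2.5 * x1 + rnorm(N, sd = 2)     # population: beta0 = 5, beta1 = 2.5
coef(lm(y ~ x1))                          # estimates should land near 5 and 2.5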
Don’t worry too much about the math; all you need to know is that there is an “easy” solution for SLR:
\[\hat{\beta}_1=\frac{COV(X,Y)}{VAR(X)} =\frac{\sum_{i=1}^N (x1_i - \bar{x})(y_i-\bar{y})}{\sum_{i=1}^N (x1_i - \bar{x})^2}=r \times \frac{SD(Y)}{SD(X)}\]
With \(\hat{\beta}_1\), we can solve for the intercept:
\[\hat{\beta}_0=\bar{y}-\hat{\beta}_1 \bar{x}\]
And, \(\hat{\sigma}_{\epsilon}\), which is the estimate of \(\sigma_{\epsilon}\) and called the residual SE, can be calculated as:
\[\hat{\sigma}_{\epsilon} = \sqrt{\frac{\sum (y_i-\hat{y}_i)^2}{N-2}}= \sqrt{\frac{\sum e^2}{df}}\]
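A minimal sketch of these formulas in R (using the simulated x1 and y from the sketch above, or any paired numeric vectors); the results should match coef() and sigma() applied to the fitted lm object:
b1_hat <- cov(x1, y) / var(x1)            # slope: COV(X, Y) / VAR(X)
b0_hat <- mean(y) - b1_hat * mean(x1)     # intercept
e      <- y - (b0_hat + b1_hat * x1)      # residuals
sqrt(sum(e^2) / (length(y) - 2))          # residual SE, df = N - 2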
Check in Question
If both X and Y are standardized, what does \(\hat{\beta}_1\) equal?
Often researchers seek to test the statistical significance of the slope parameter and of the proportion of variability in the outcome it shares/explains.
Specifically, we test the following null and alternative hypotheses:
\[H_0:\beta_1=0\] \[H_1:\beta_1\neq0\]
Rejecting this null, \(H_0:\beta_1=0\), would indicate that the \(\hat{\beta}_1\) you found was “surprising” given the position of ignorance, \(\beta_1 = 0\); that the population relationship between X and Y is unlikely to be zero…
…These hypotheses are tested using a type of t test. Specifically, each parameter has its own standard error such that we can calculate a ratio of the effect (estimated parameter) over noise (the standard error) and get a p value associated with the t statistic.
\[t(df)=\frac{\hat{\beta}_1-B}{SE_{\hat{\beta_1}}}\]
where B is the value of \(\beta_1\) under the null; some constant to test against (often, B = 0, which is the default in most software)
It is also easy to obtain CIs for the individual parameters using the following equation:
\[CI_{(1-\alpha)100\%}=\hat{\beta}_1 \pm t_{(1 − \alpha /2, \ df)} \times SE_{\hat{\beta_1}}\]
where \(\alpha\) is the nominal Type I error rate, and \(t_{(1 − \alpha /2, \ df)}\) is the critical value for the t dist. with \(df\) degrees of freedom and \(\alpha\) level of significance
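A minimal sketch (again using the simulated data from the earlier sketch, so the object names here are assumptions) of how the t statistic, p value, and 95% CI reported by lm() can be reproduced from these formulas:
fit   <- lm(y ~ x1)
b1    <- coef(fit)["x1"]                                # estimated slope
se_b1 <- coef(summary(fit))["x1", "Std. Error"]         # its standard error
df    <- fit$df.residual                                # N - 2 in SLR

t_stat <- (b1 - 0) / se_b1                              # B = 0 under the usual null
p_val  <- 2 * pt(abs(t_stat), df, lower.tail = FALSE)   # two-tailed p value
b1 + c(-1, 1) * qt(1 - 0.05/2, df) * se_b1              # 95% CI, matches confint(fit)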
The effect sizes of interest in SLR are the regression analogues of the effect sizes of interest in correlation.
We will revisit the research example from last week examining the relationship between impulsivity and aggression: 275 undergraduates completed a questionnaire assessing scores on the BPAQ and BIS scales among others.
Ultimately, the researcher is interested in predicting aggression from impulsivity
At a very simple level, the researcher wants to devise a model for aggression (operationalized with BPAQ scores) to explain or predict how and why people vary on this variable, given their BIS score.
Let’s revisit the example from last week to highlight more of a regression lens as opposed to correlation.
Load the data in R
Examine descriptive information about your variables, e.g., means, SDs, histograms/boxplots and scatterplots
The output is the next slide…
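The slides don’t show the loading step itself; a minimal sketch (the file name here is hypothetical) might look like this:
agrsn <- read.csv("aggression.csv")       # hypothetical file name
summary(agrsn[, c("BPAQ", "BIS")])        # means, medians, quartiles
sapply(agrsn[, c("BPAQ", "BIS")], sd)     # standard deviations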
library(ggplot2)   # for ggplot(), geom_point(), geom_smooth()
library(dplyr)     # for %>% and select()

# Custom panel function: scatterplot with a linear (black, solid) and a
# loess (purple, dashed) smoother overlaid
my_fn <- function(data, mapping, method1 = "lm", method2 = "loess", ...){
  p <- ggplot(data = data, mapping = mapping) +
    geom_point(color = "deepskyblue3") +
    geom_smooth(method = method1, size = 2, colour = "black") +
    geom_smooth(method = method2, size = 2, span = 0.7, colour = "purple",
                alpha = 0.2, linetype = "dashed") +
    theme_minimal()
  p
}

# Using the ggpairs function from the GGally package
GGally::ggpairs(agrsn %>% select(BPAQ, BIS), aes(alpha = 0.5),
                columnLabels = c("Aggression",
                                 "Impulsivity"),
                lower = list(continuous = my_fn),    # scatterplots with smoothers
                upper = list(continuous = 'cor'))    # correlation coefficients
The code is the previous slide…
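The model-fitting code itself isn’t shown on the slides; given the Call line in the output below, it was presumably something along these lines:
mod <- lm(BPAQ ~ BIS, data = agrsn)   # regress aggression (BPAQ) on impulsivity (BIS)
summary(mod)                          # coefficients, SEs, t tests, R-squared
confint(mod)                          # 95% CIs (the last block of output below)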
Call:
lm(formula = BPAQ ~ BIS, data = agrsn)
Residuals:
Min 1Q Median 3Q Max
-1.14134 -0.30470 0.00845 0.35500 1.35527
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.5217 0.1973 7.713 2.31e-13 ***
BIS 0.4777 0.0854 5.594 5.39e-08 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.4972 on 273 degrees of freedom
Multiple R-squared: 0.1028, Adjusted R-squared: 0.09955
F-statistic: 31.29 on 1 and 273 DF, p-value: 5.391e-08
2.5 % 97.5 %
(Intercept) 1.1333249 1.9101360
BIS 0.3095758 0.6458088
“In our sample of \(N = 275\) undergraduate students, impulsivity (BIS) was found to predict aggression (BPAQ). For every 1-point increase on BIS, aggression (BPAQ) is predicted to increase by approximately 0.48 points (\(\hat{\beta}_1 = 0.48\), 95% CI \([0.31, 0.65]\)). This association is statistically significant (\(t(273) = 5.59\), \(p < 0.001\)). The narrow confidence interval suggests a rather precise estimate of the effect size. Furthermore, impulsivity explained about 10% of the variability in aggression (\(R^2 = 0.10\)), indicating a modest yet meaningful proportion.”
Module 2 (Part 2)