Module 1: Important Statistical Concepts

PSYC 3032 M

Udi Alter

Introduction

“He who loves practice without theory is like the sailor who boards ship without a rudder and compass and never knows where he may be cast.”
— Leonardo da Vinci, 1452-1519

  • Yes, this course is very applied; nonetheless, we’re not going to shy away from statistical theory when it helps us understand what is going on

  • At least, not unless I deem it unnecessary or generally less fruitful

What Do These Have in Common?

What is a Model?

  • The term model is used a lot in statistics and in the “real world” (e.g., fashion model, model ship, etc.)

  • It’s important to realize that it means the same thing inside and outside statistics

  • A model is a simplified representation or abstraction of reality used to understand, predict, or explain phenomena in the real world

  • Models allow us to focus on key aspects of a system, process, object or problem while ignoring irrelevant details

    • Think of a LEGO figure—it represents a person’s different body parts and proportions but doesn’t show how a real one moves or functions
  • In essence, models are valuable tools for learning and decision-making

Models in Statistics📊

Statistical models are very useful for understanding what’s going on in the world

  • When we describe the relationship between variables, we generally try to abstract out patterns in the data according to some well-thought-out structure (read: model). For example,

\[Variable \ of \ interest = Model + Error\] or, better yet:

\[DV = IV1 + IV2 +IV3...+ Error\]

Example: If I think social media influences depression in kids, I might use social media use to predict or explain depression to some extent:

\[Depression \ score = Hours \ spent \ on \ TikTok + Error\]
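To make "outcome = model + error" concrete, here is a minimal Python sketch. It fits a straight line to a handful of invented (hours, depression score) pairs and splits each observed score into a model part and an error part. The data, the variable names, and the choice of ordinary least squares are all assumptions made for illustration, not real results.

```python
from statistics import mean

# Hypothetical data (invented for illustration): daily TikTok hours and depression scores
hours = [0.5, 1.0, 2.0, 3.0, 4.0, 5.0]
depression = [8.0, 9.5, 11.0, 12.5, 15.0, 16.0]

# Ordinary least squares with one predictor: slope = Sxy / Sxx
x_bar, y_bar = mean(hours), mean(depression)
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(hours, depression))
sxx = sum((x - x_bar) ** 2 for x in hours)
slope = sxy / sxx
intercept = y_bar - slope * x_bar

# Model part (fitted values) and error part (residuals) for every child
fitted = [intercept + slope * x for x in hours]
errors = [y - f for y, f in zip(depression, fitted)]

for y, f, e in zip(depression, fitted, errors):
    print(f"observed {y:5.2f} = model {f:5.2f} + error {e:+5.2f}")
```

Each observed score is recovered exactly as its fitted value plus its residual, and the residuals of a least-squares line with an intercept sum to zero.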

Why Use Models in Statistics?

Statistical models are very useful for understanding what’s going on in the world. Here are THREE reasons we might want to use models in statistics:

  1. Explaining why and how a phenomenon, an effect, or a relationship between variables occurs in the real world (generally in a causal sense)
  • Example: Cognitive Behavioural Therapy (CBT) helps to reduce anxiety

\[CBT \rightarrow less \ anxiety\] (i.e., CBT causes less anxiety)

  • This usually involves approximating the Data Generating Process (DGP)
    • We can use a statistical model as an attempt to approximate how we think the data comes about in the world:

\[Model_{sample} \approx Model_{population}\]
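The idea that a model fitted to a sample approximates the population model can be sketched by simulation. In the Python sketch below we pretend to know the data generating process, draw a sample from it, and check how close the fitted line comes to the values we built in. The "population" intercept and slope, the noise level, and the sample size are all assumptions chosen for illustration.

```python
import random
from statistics import mean

random.seed(3032)  # fixed seed so the sketch is reproducible

# Assumed "population" model (the DGP): y = 2 + 0.5 * x + noise
TRUE_INTERCEPT, TRUE_SLOPE = 2.0, 0.5

def draw_sample(n):
    """Simulate n observations from the assumed data generating process."""
    xs = [random.uniform(0, 10) for _ in range(n)]
    ys = [TRUE_INTERCEPT + TRUE_SLOPE * x + random.gauss(0, 1) for x in xs]
    return xs, ys

def fit_line(xs, ys):
    """Least-squares intercept and slope for one predictor."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope

xs, ys = draw_sample(2000)
est_intercept, est_slope = fit_line(xs, ys)
print(f"population: intercept = {TRUE_INTERCEPT}, slope = {TRUE_SLOPE}")
print(f"sample fit: intercept = {est_intercept:.2f}, slope = {est_slope:.2f}")
```

In real research we never get to see the population values. The simulation is useful only because we chose them ourselves.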

Why Use Models in Statistics?

  2. Predicting the future. Given information about specific predictors of an outcome, we may be able to guess the outcome fairly accurately
  • Examples: University administrators predicting success in a program from high school GPA; a bank using credit scores to decide whether an applicant should be approved for a loan
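A prediction model can be sketched in a few lines of Python: fit a line to past cases, then plug in the predictor value of a new case. The student records, the grading scales, and the helper name `predict_gpa` are hypothetical and exist only for this illustration.

```python
from statistics import mean

# Hypothetical past students (invented): high school average (%) and first-year GPA
hs_avg = [70, 75, 80, 85, 90, 95]
uni_gpa = [5.0, 5.5, 6.2, 6.8, 7.5, 8.1]

# Fit a least-squares line to the past students
x_bar, y_bar = mean(hs_avg), mean(uni_gpa)
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(hs_avg, uni_gpa))
sxx = sum((x - x_bar) ** 2 for x in hs_avg)
slope = sxy / sxx
intercept = y_bar - slope * x_bar

def predict_gpa(high_school_average):
    """Predicted first-year GPA for an applicant we have not observed yet."""
    return intercept + slope * high_school_average

new_applicant = 88
print(f"Predicted first-year GPA for an average of {new_applicant}%: "
      f"{predict_gpa(new_applicant):.2f}")
```

The prediction is only a best guess under the fitted model. How far off such guesses tend to be is a separate question.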

Why Use Models in Statistics?

  3. Simplification. When a phenomenon or relationship between variables is complex, an endless number of factors can influence the outcome, but not all of them do so to a meaningful degree. We can use models to identify the variables with the greatest impact on the outcome, so that we can explain, predict, or even manipulate the outcome using a “simplified” model containing only the important variables
  • Example: To study academic performance, a psychologist narrows down key predictors—time spent studying, sleep quality, and motivation—among many potential factors (e.g., personality, immigration status, living arrangement). This streamlined model enables targeted interventions for meaningful improvements.

Why Use Models in Statistics?

  • This is really why we fit statistical models in practice: we want good, succinct descriptions of complex data after ruling out less probable competing models

  • There are more reasons to use models in statistics, but I’m giving you the most impactful ones, i.e., we are simplifying! 😉

  • And, remember…

“All models are wrong, but some are useful.”

— George Box, 1987

What is a Useful Model?👍

  • It’s one thing to make a model, but it’s another to know if it does a good job

  • Some models leave most of the variation in the data unexplained (underfitting), while others fit the sample so closely, noise included, that they don’t generalize to new cases (overfitting)

  • In this class we’ll learn to compare models and choose the most useful one

  • Which is the most useful model? 🧒️ + \(3\cdot\)🐻
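Underfitting and overfitting can be sketched by comparing three models on the data they were fitted to and on new cases from the same process. The Python sketch below assumes a simulated "truth" (a straight line plus noise) and three illustrative models: one too simple, one too flexible, and one in between. None of this is the course's own procedure for comparing models.

```python
import random
from statistics import mean

random.seed(1)  # fixed seed so the sketch is reproducible

def simulate(n):
    """Assumed truth for this sketch: y = 1 + 2x + random noise."""
    xs = [random.uniform(0, 10) for _ in range(n)]
    ys = [1 + 2 * x + random.gauss(0, 3) for x in xs]
    return xs, ys

train_x, train_y = simulate(30)    # the sample we fit on
test_x, test_y = simulate(1000)    # new cases from the same process

# Too simple: ignore x and always predict the training mean
y_bar = mean(train_y)
def mean_only(x):
    return y_bar

# In between: straight line fitted by least squares
x_bar = mean(train_x)
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(train_x, train_y))
sxx = sum((x - x_bar) ** 2 for x in train_x)
slope = sxy / sxx
intercept = y_bar - slope * x_bar
def line(x):
    return intercept + slope * x

# Too flexible: copy the outcome of the closest training case
def nearest_case(x):
    closest = min(range(len(train_x)), key=lambda i: abs(train_x[i] - x))
    return train_y[closest]

def mse(model, xs, ys):
    """Mean squared prediction error of a model on a data set."""
    return mean((y - model(x)) ** 2 for x, y in zip(xs, ys))

results = {}
for name, model in [("mean only", mean_only), ("line", line), ("nearest case", nearest_case)]:
    results[name] = (mse(model, train_x, train_y), mse(model, test_x, test_y))
    print(f"{name:12s} train MSE = {results[name][0]:6.2f}   "
          f"new-case MSE = {results[name][1]:6.2f}")
```

The nearest-case model reproduces its own training data perfectly, but a perfect fit to the sample says little about new cases. In a setup like this one, the straight line would typically predict new cases best.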

Modeling Steps 👣

Models in this course will almost always follow the same steps. Given some data:

You can see that statistical modeling is a dynamic process. It is more of a craft than an exact science, and because it requires decision-making, it can be subjective and is often easy to manipulate. Therefore, we should always use statistics responsibly and ethically, staying true to the facts rather than bending them to our needs (e.g., to reach p < .05).

The General Linear Model (GLM) 📈

  • In this class we’ll use a particular type of model, the “umbrella” model called the general linear model (GLM)

  • The GLM is a mathematical framework for representing relationships between variables

  • The GLM is used to explain/predict variation in a particular outcome from 1 or more explanatory/predictor variables

  • Specifically, the GLM models linear relationships between a continuous dependent variable and any combination of categorical or continuous predictor variables

Examples:

  • Predicting stats exam scores from study hours and sleep quality
  • Explaining the effect of different treatments (e.g., CBT, psychoanalysis, medication) on anxiety

The General Linear Model (GLM) 📈

Mathematically, the GLM can be expressed like this:

\[y_i = \beta_0 + \beta_1 x1_{i} + \beta_2 x2_{i} + \dots + \beta_p xp_{i} + \epsilon_i\]

where

  • \(y_i\): Outcome variable (sometimes called criterion, response, or DV) for participant i
  • \(\beta_0\): Intercept
  • \(\beta_1, \dots, \beta_p\): Coefficients (sometimes called regression coefficients, slopes or partial slopes, effects or fixed effects, estimates/parameters, or betas)
  • \(x1_i, x2_i, \dots, xp_i\): Predictor variables for participant i (sometimes called explanatory variables, regressors, covariates, or IVs)
  • \(\epsilon_i\): Error term
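Read as a recipe, the equation says: start at the intercept, add each coefficient times its predictor, and whatever is left over is the error. The Python sketch below does this for one participant. The coefficient values, the predictor values, and the helper name `glm_prediction` are hypothetical and chosen only to make the arithmetic easy to follow.

```python
def glm_prediction(intercept, coefficients, predictors):
    """Model part of the GLM for one participant: b0 + b1*x1 + ... + bp*xp."""
    if len(coefficients) != len(predictors):
        raise ValueError("need exactly one coefficient per predictor")
    return intercept + sum(b * x for b, x in zip(coefficients, predictors))

# Hypothetical values, chosen only for illustration
beta_0 = 50.0              # intercept: predicted exam score when all predictors are 0
betas = [2.0, 1.5]         # beta_1 (per study hour), beta_2 (per sleep-quality point)
participant = [10.0, 6.0]  # x1 = study hours, x2 = sleep quality
observed_score = 82.0      # y_i

predicted = glm_prediction(beta_0, betas, participant)  # 50 + 2*10 + 1.5*6 = 79
error = observed_score - predicted                      # epsilon_i = 82 - 79 = 3
print(predicted, error)  # prints: 79.0 3.0
```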

Visualizing the GLM 🎨

  • Simple Linear Regression: One predictor, one outcome. \[y_i = \beta_0 + \beta_1 x1_i + \epsilon_i\]

The General Linear Model (GLM) 📈

Did you know?

z test, t test, ANOVA, correlation, and regression are all part of the GLM!!!
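One way to see this for the t test is a minimal Python sketch with invented anxiety scores. An independent-samples t test (equal variances assumed) is computed the classic way, and then again as a regression with a dummy-coded group predictor (0 = control, 1 = CBT). The data and variable names are assumptions for illustration only.

```python
from math import sqrt
from statistics import mean, variance

# Hypothetical anxiety scores (invented for illustration)
control = [24.0, 27.0, 22.0, 30.0, 26.0, 25.0]
cbt = [20.0, 23.0, 19.0, 25.0, 21.0, 18.0]
n0, n1 = len(control), len(cbt)

# Classic independent-samples t test with a pooled variance
pooled_var = ((n0 - 1) * variance(control) + (n1 - 1) * variance(cbt)) / (n0 + n1 - 2)
t_classic = (mean(cbt) - mean(control)) / sqrt(pooled_var * (1 / n0 + 1 / n1))

# Same data as a regression: y = b0 + b1 * group + error
y = control + cbt
group = [0] * n0 + [1] * n1
g_bar, y_bar = mean(group), mean(y)
sxx = sum((g - g_bar) ** 2 for g in group)
b1 = sum((g - g_bar) * (v - y_bar) for g, v in zip(group, y)) / sxx
b0 = y_bar - b1 * g_bar

# t statistic for the slope: estimate divided by its standard error
residual_var = sum((v - (b0 + b1 * g)) ** 2 for g, v in zip(group, y)) / (len(y) - 2)
t_regression = b1 / sqrt(residual_var / sxx)

print(f"mean difference = {mean(cbt) - mean(control):.3f}, slope b1 = {b1:.3f}")
print(f"t (t test) = {t_classic:.3f}, t (regression slope) = {t_regression:.3f}")
```

With dummy coding, the intercept equals the control-group mean, the slope equals the difference between the group means, and the two t statistics coincide.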

What’s Next?

Even this course is a model! And, as with all models, there are assumptions. Here are mine:

  • You are familiar with R and RStudio
  • You have basic knowledge of statistics and statistical inference
  • You chose to learn about higher-level concepts that are useful for data analysis and scientific exploration

That said, not everyone had the same path here. So next week, we’ll review (though, not in depth) some old topics, such as

  • Descriptive statistics
  • Parameters
  • Inferential statistics
  • Estimation
  • Hypothesis testing

How Are You Feeling About This Course?

Homework for Next Week! 📝

  • Install or update R and RStudio!

  • Refresh your R/RStudio skills; see the R Stuff section on eClass

  • Review syllabus and take the quiz on eClass

  • Fill out the short pre-class survey

  • Make sure you have an iClicker account and the app (and the location feature)

  • Try to knit the lab .rmd template as an HTML file

  • Relax, you’re going to be just fine! 😊