Module 1: Important Statistical Concepts

PSYC 3032 M

Udi Alter

Welcome Back! 👋

How was your first week of the term?

  1. A
  2. B
  3. C
  4. D

Goals for Today! 🪩

…Still reviewing, or rather, calibrating! Continuing from last week, we will go over some more fundamental concepts:

  • Descriptive statistics
    • (Suuuuuper brief, because, remember? My assumption is that you learned this)
  • Parameters
  • Inferential statistics
  • Estimation
  • Hypothesis testing
  • Starting Module 2 (correlation + simple linear regression)

Descriptive Statistics

  • As humans, simply looking at rows and columns of numbers doesn’t give us much insight
  • Descriptive statistics are summaries of the data that humans can easily comprehend.
  • To describe data, we often rely on
    • measures of central tendency (e.g., mean, median)
    • measures of variability (e.g., standard deviation, variance)
    • visualizations (e.g., histograms, box plots)
  • When we use descriptive statistics, we can actually understand the data, making it easier to spot patterns, detect issues, and uncover potential insights
misty::descript(DATASET, print = "all")
 Descriptive Statistics

  Variable    n nNA   pNA     M SE.M     Var    SD   Min   p25   Med   p75     Max   Range  IQR  Skew   Kurt
   OCD     2636   0 0.00%  3.98 0.07   14.69  3.83  0.00  0.00  3.00  6.00   19.00   19.00 6.00  0.95   0.38
   age     2547  89 3.38% 22.49 1.34 4593.52 67.78 16.00 18.00 19.00 21.00 1995.00 1979.00 3.00 28.97 840.37
   MWD     2636   0 0.00%  4.50 0.03    2.07  1.44  1.00  3.50  4.50  5.50    7.00    6.00 2.00 -0.30  -0.43

Measures of Central Tendency

Mean

  • Average of scores \(\bar{x} = \frac{1}{n} \cdot \sum_{i=1}^{n}x_i\)

  • This particular statistic has special meaning in statistical theory in that it relates to the expectation of a variable, \(E(X)\)

nums <- rnorm(n = 10) # Generate 10 random numbers
sum(nums) # and add them up
0.749509174320657 + 0.663234872991503 + -0.529376311508162 + -1.68496884355631 + 0.0276574095133659 + 1.58798454685696 + -0.0760290827912198 + -0.77163594914945 + 1.24539362664122 + 1.70417362819081 =  2.915943 
sum(nums)/length(nums) # Divide the sum by the total number of elements to get the mean---or, simply use the function mean(nums)
x̅ =  0.292

Measures of Central Tendency

Mode

  • The most frequently occurring observation

  • Best to use with categorical or ordinal data (e.g., yes/no responses, Likert-type scales)

set.seed(311) # for replicability
(Letters <- sample(LETTERS[1:5], size = 10, replace = TRUE)) # Generate 10 random letters from A to E
 [1] "C" "B" "E" "E" "D" "E" "B" "D" "D" "A"
table(Letters)
Letters
A B C D E 
1 2 1 3 3 
  • What’s the mode?
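In R, mode() returns an object's storage mode rather than the statistical mode, so we have to compute it ourselves. A minimal sketch using table() on the sample above (the variable name counts is ours, just for illustration):

```r
# The sample drawn above
Letters <- c("C", "B", "E", "E", "D", "E", "B", "D", "D", "A")
counts <- table(Letters)                # tally each letter
names(counts)[counts == max(counts)]    # keep the most frequent value(s)
# [1] "D" "E"
```

Note that ties are possible: here both "D" and "E" occur three times, so the distribution is bimodal.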

Measures of Central Tendency

Median

  • Middle score in a distribution (i.e., 50% of observations above, 50% below)
  • For an even number of scores, take the average of the two middle scores
  • Best to use with data with non-symmetric distribution (e.g., mpg)
set.seed(3032)
(numbers <- sample(1:5, size = 11, replace = TRUE)) # Generate random whole numbers from 1 to 5
 [1] 5 1 4 3 4 2 1 1 2 4 3
  • Order the numbers from low to high
sort(numbers) 
 [1] 1 1 1 2 2 3 3 4 4 4 5
  • What’s the median? (Find the {(length(numbers)+1)/2}th element above)
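A quick check in R, reusing the sample drawn above:

```r
numbers <- c(5, 1, 4, 3, 4, 2, 1, 1, 2, 4, 3)   # the sample above
sort(numbers)[(length(numbers) + 1) / 2]        # the 6th ordered value
# [1] 3
median(numbers)                                  # the built-in gives the same answer
# [1] 3
```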

Measures of Variability

Range

  • highest \(x\) minus lowest \(x\), or \(max(x) - min(x)\)

  • Range is a pretty crude measure of variability because it relies exclusively on the most extreme observations and ignores the rest of the data.

  • Can be misleading at times

Interquartile Range (IQR)

  • The range of the middle 50% of the observations
    • i.e., ignoring the most extreme 25% of the observations from each tail


What’s the range & IQR of:

 [1] 1 1 1 2 2 3 3 4 4 4 5
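In R (note that IQR() relies on quantile(), whose default method interpolates between observed values, so the quartiles need not be values that actually occur in the data):

```r
numbers <- c(1, 1, 1, 2, 2, 3, 3, 4, 4, 4, 5)   # the sample above
max(numbers) - min(numbers)        # range: 5 - 1 = 4
diff(range(numbers))               # same thing; range() returns c(min, max)
quantile(numbers, c(.25, .75))     # the quartiles (1.5 and 4 with R's default method)
IQR(numbers)                       # 75th minus 25th percentile: 2.5
```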

Measures of Variability

Variance (VAR)

  • Noted \(\sigma^2\) for population variance, and \(s^2\) for sample variance

  • Variance is the average of the squared deviations from the mean \(VAR(x)= s^2 = \hat{\sigma^2} = \frac{\sum_{i=1}^{n}(x_i-\bar{x})^2 }{n-1} = \frac{SS}{n-1}\)

  • \(SS\) is referred to as the sums-of-squares.

  • We divide by \(n − 1\) instead of \(n\) because we must use the sample mean, \(\bar{x}\), as an estimate of the population mean, \(\mu\); deviations from \(\bar{x}\) are slightly smaller than deviations from \(\mu\), so dividing by \(n − 1\) corrects the resulting underestimation of variability

  • Although theoretically very important in statistical inference, as we’ll encounter many times in this course, the variance generally has some interpretation problems due to the fact that it is in the “squared” metric of X. To fix this, we can use the standard deviation.

Measures of Variability

Standard Deviation (SD)

  • Noted \(\sigma\) for the population SD, and \(SD\) or \(s\) for the sample SD
  • SD is roughly the average distance from the mean; it is the square root of the variance
  • The advantage of the SD over the variance is that it is in the same units as the original variable X
  • SD tells us about the spread or dispersion of data points in relation to the mean
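The variance and SD formulas can be verified by hand against R's built-ins; x here is just the sample from the previous slides:

```r
x <- c(1, 1, 1, 2, 2, 3, 3, 4, 4, 4, 5)   # the sample from the previous slides
SS <- sum((x - mean(x))^2)   # sums-of-squares: squared deviations from the mean
SS / (length(x) - 1)         # variance by hand, dividing by n - 1
var(x)                       # matches the built-in
sqrt(var(x))                 # SD is the square root of the variance...
sd(x)                        # ...which matches the built-in SD
```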

Measures of Shape

  • Things get trickier when we start talking about general shape of a distribution

    • Is a variable symmetric/skewed?
    • Uni/multi-modal?
    • Has thick/thin tails?
  • There are some helpful statistics in this area that are generally in reference to the normal (also called Gaussian) distribution

Measures of Shape

  • Kurtosis estimates have a similar property; 0 depicts a distribution as peaked as the normal, negative values a flatter shape with “thin” (lighter) tails, and positive values a more peaked shape with “thick” (heavier) tails
  • At the end of the day, it generally makes more sense just to plot the data to get a feel for the overall shape
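Skewness and kurtosis can be computed by hand from their moment-based definitions. A sketch on simulated right-skewed data (the data and the simple formulas here are illustrative; packages like misty apply small-sample corrections, so their values will differ slightly):

```r
set.seed(3032)                         # for replicability
x <- rexp(1000)                        # a right-skewed variable, for illustration
m <- mean(x); s <- sd(x); n <- length(x)
skew <- sum((x - m)^3) / n / s^3       # moment-based skewness; positive here (long right tail)
kurt <- sum((x - m)^4) / n / s^4 - 3   # excess kurtosis; 0 for a normal distribution
c(skew = skew, kurt = kurt)
```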

Measures of Shape

A (fun?) mnemonic to help you remember which is the Platykurtic and which is the Leptokurtic. You can fit a platypus in a platykurtic distribution.

Measures of Shape

  • Skewness estimates start at a value of 0, indicating symmetry (like the normal distribution), and can be positive or negative to indicate positive/negative skewness.

Measures of Shape

  • Here’s a nice mnemonic to help you remember which of the two distributions above is positively skewed and which is negatively skewed
  • Look at the yellow-ish distribution in the last slide (the positively skewed one), the one with its long tail on the right
  • Now, imagine this distribution rotating clockwise 90 degrees, which letter does this remind you of? P! for Positive!

P for positively skewed

Parameters 🌵

Parameters 🌵

  • Parameters are fixed numerical values that describe specific characteristics of an entire population, such as the true mean, variance, or proportion
  • Typically, parameters are what we wish to know or uncover about the population of interest (e.g., all university students, all researchers in North America, etc.)
  • Unlike statistics, which are derived from sample data and serve as estimates, parameters represent the actual values for the population, though they are often unknown and must be inferred using—you guessed it—inferential statistics
  • In research, parameters are the key quantities we aim to estimate, giving us a clearer understanding of the population as a whole

Parameters 🌵

  • A parameter is a single, objective, and fixed value, though typically unknown to us
  • When running an experiment, we calculate a statistic from the sample as a proxy (i.e., estimate) for the unknown population parameter
  • The population parameter is singular because there is only one true value it can take
  • It is objective because it represents an unequivocal truth, even if we cannot know its exact value
  • Finally, the population parameter is fixed: it remains constant at the time of observation (unlike in Bayesian statistics)

Parameters 🌵

  • By convention, these are typically expressed in Greek letters, for example:

    • \(\mu\) = (population) mean
    • \(\sigma\) = standard deviation
    • \(\sigma^2\) = variance
    • \(\beta\) = regression coefficient
  • But, clearly, there aren’t enough letters in the alphabet to describe all possible parameters (e.g., we could talk of a population median)

Inferential Statistics

Inferential Statistics

  • With descriptive statistics, we summarize what we do know; with inferential statistics, we make inferences about what we do not know from what we do.

  • The core idea of Inferential Statistics is to make inferences about population characteristics by studying information only available in sample data

    • e.g., “Which university’s students use more Apple computers, York or UoT?”
  • We want to be able to draw probabilistic conclusions about populations without actually collecting all the data in the population

    • e.g., “There is a good chance that a greater % of Yorkies have Macs than those at UoT, by about 10% to 15%”
  • Ideally, we would make population-level comparisons directly (collect all data from York and UoT to express \(\mu_{York} − \mu_{UoT} = \mu_d\)), but more often than not that is economically and logistically unfeasible

Inferential Statistics

  • We can divide inferential statistics into two “big ideas,” estimation and hypothesis testing

Estimation

  • Estimates are the values we calculate from sample data that serve as stand-ins for the unknown population parameters
  • Because we can’t usually measure the entire population, we rely on estimates and make informed guesses about the true population values
  • These estimates aren’t perfect, but with the right sample and methods, they get us pretty close (or at least closer…)
  • They help bridge the gap between the data we have and the broader, often elusive, population parameters we’re trying to figure out

Estimation

Often, estimates are given a special “hat” to indicate that they are estimates, for example:

  • \(\hat{\mu}\) = mean estimate
  • \(\hat{\sigma}\) = estimated standard deviation
  • \(\hat{\sigma^2}\) = estimated variance
  • \(\hat{\beta}\) = estimated regression coefficient

That said, many researchers and educators prefer to use different notations altogether, for example:

  • \(M\) = mean estimate
  • \(s\) = estimated standard deviation
  • \(s^2\) = estimated variance
  • \(b\) = estimated regression coefficient

Estimation Precision and Uncertainty

  • If sampling behaviour of a statistic can be reasonably assumed (e.g., via the Central Limit Theorem), then it’s possible to make statements about sampling precision (or more appropriately, sampling uncertainty)

  • This is where standard errors and confidence intervals are useful, which are numeric statements about where a population parameter probably is, given what was observed

  • Standard error (SE) is the standard deviation of the sampling distribution; often thought of as the “noise” or uncertainty in the observed statistic due to sampling

  • Confidence interval (CI) interpretation: if we repeat this experiment over and over, each time with a different sample of the same size and calculate the 95% CI, then 95% of these 95% CIs will include the true population parameter. Not so fun, innit?

    • That’s why we often use a much more relaxed interpretation (though, not completely justifiable)
    • e.g., \(M_{York} − M_{UoT} = 11.93\), with 95% CI of [8.5, 13.1], says “our best guess about the mean difference is ~12; and, there’s a good chance the population parameter is somewhere between 8.5 and 13.1 (it’s possible for it to be higher or lower, just less likely!)”
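As a sketch of how an SE and a 95% CI are computed for a simple mean (the simulated data below are purely illustrative, not the York/UoT numbers):

```r
set.seed(3032)                           # for replicability
x <- rnorm(100, mean = 12, sd = 4)       # illustrative sample, not the York/UoT data
se <- sd(x) / sqrt(length(x))            # SE of the mean: SD of the sampling distribution
crit <- qt(.975, df = length(x) - 1)     # critical t value for 95% coverage
(ci <- mean(x) + c(-1, 1) * crit * se)   # 95% CI: estimate +/- critical value x SE
```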

Hypothesis Testing

Definition

Null hypothesis significance testing (NHST) is a statistical approach used to assess the strength of evidence against a specific claim (the null hypothesis) about a population based on sample data. The ultimate goal is to be able to make decisions and to make explicit inference statements about the world, having only a limited view of it.

It involves two competing hypotheses: the null hypothesis (\(H_0\)), which typically states that there is no effect or difference, and the alternative hypothesis (\(H_1\)), which suggests the presence of an effect or difference.

The process uses sample data to determine whether to reject the null hypothesis, typically based on a test statistic (e.g., t or F values) and p values.

  • The important thing to recognize is that the goal of a hypothesis test is NOT to show that the alternative hypothesis is (probably) true

  • Goal is to show that the null hypothesis is (probably) false

  • …Most people find this counter-intuitive (because it so is!)

Hypothesis Testing

A clever way to think about it is to imagine that a hypothesis test is a criminal trial…the trial of the null hypothesis! (Navarro, 2014)

  • \(H_0\) is assumed true unless proven otherwise
  • Researchers gather evidence to disprove \(H_0\) (i.e., evidence against the null)
  • We try to maximize the chance that the data will yield a conviction… for the crime of being false!
  • The catch is that the NHST sets the rules of the trial, and those rules are designed to protect the null hypothesis
    • Specifically, to ensure that if \(H_0\) is actually true, the chances of a false conviction (Type I error) are guaranteed to be low
    • After all, \(H_0\) doesn’t have a lawyer. And someone has to protect it because no (or negligible) effects are important, too, right? (yes!)

Real Example (Seli et al., 2017)

Research Question

What is the association between OCD symptoms and deliberate mind wandering (DMW)?



Hypotheses

  • \(H_1\): DMW is associated with OCD symptoms; \(\rho \neq0\)

  • \(H_0\): No association exists between DMW and OCD symptoms; \(\rho = 0\)


What should we do next?

In R

# Using the ggpairs function from the GGally package
library(dplyr) # loaded for the %>% pipe
GGally::ggpairs(ocd_dat %>% dplyr::select(OCD, MWD), ggplot2::aes(alpha = 0.5), columnLabels = c("OCD", "MWD"))

cor(ocd_dat$OCD, ocd_dat$MWD)
[1] 0.1198966

Real Example (Seli et al., 2017)

The hypothesis expression for this example is mainly about the following conditional probability:

  • \(P(data \ |\ H_0)\) or \(P(r = 0.12 \ | \ \rho=0)\), which reads:

  • “the probability of the data given the null hypothesis,” which actually translates to:

  • Given the (statistical summary of the) data (i.e., \(r = 0.12\)), how likely it was to observe these (or more extreme) results, assuming the null is true?

  • This \(\uparrow\), friends, is the definition of the p value!

    • If \(P(data \ |\ H_0)\) is closer to 0 (smaller p value), the probability of observing these (or more extreme) results is small—making the null less likely

    • If \(P(data \ |\ H_0)\) is closer to 1 (larger p value), the probability of observing these (or more extreme) results is large—making the null more likely

But, this tells us nothing yet about a decision (e.g., to reject or not to reject the null)…

Decision Table

Decisions based on the p value are directly related to our selection of \(\alpha\), the value at which we feel it is acceptable to make a Type I error


We generally set this reflexively at \(\alpha=.05\) (a 5% chance of obtaining some p < .05 result when the null is true)

  • If p < .05, we decide to reject the null hypothesis

    • We can say that we have sufficient evidence to reject the null or in support of the alternative
  • If p \(\ge .05\), we fail to reject the null

    • We can say that we have insufficient evidence to reject the null—but, we can never accept the null!

Hypothesis Testing

Remember: These outcomes always assume we know the truth. In reality, we are not privy to that information, and have to make educated guesses (again, leaning on assumptions) based on evidence to discuss the likelihood of each outcome.

Hypothesis Testing

Most hypothesis tests involve some variant of the following:

\[test \ ratio = \frac{sample \ statistic - [ parameter \ value \ under \ null]}{standard \ error}\]

  • Sample statistic: the observed value from the data (e.g., r = 0.12)

  • Parameter value under null: the value expected under the null hypothesis (e.g., \(\rho = 0\))

  • Standard error: the expected variability of the statistic under the null. It accounts for sampling noise or uncertainty in the observed statistic due to sampling

  • Test ratio: this ratio becomes the test statistic (e.g., t value, F value). We compare it to a critical value from a theoretical sampling distribution (e.g., t distribution) that corresponds to our chosen significance level (\(\alpha\))

    • If the test statistic exceeds the critical value, it suggests that the observed effect is more extreme than would be expected by random chance, leading us to potentially reject the null hypothesis

Basically, this formula represents the signal-to-sampling-noise ratio
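Applied to the correlation example, this ratio reproduces the t value that cor.test() reports. The sketch below uses the standard SE of r under the null, \(\sqrt{(1-r^2)/(n-2)}\), with r and n taken from the Seli et al. example:

```r
r <- 0.1198966                      # observed correlation from the example
n <- 2636                           # sample size from the example
se_r <- sqrt((1 - r^2) / (n - 2))   # SE of r under the null hypothesis
(t_stat <- (r - 0) / se_r)          # (statistic - null value) / SE -> ~6.198
2 * pt(-abs(t_stat), df = n - 2)    # two-tailed p value -> ~6.6e-10
```

These match the t = 6.1981 and p-value = 6.612e-10 that cor.test() prints on the next slide.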

Back to the Example

  • \(H_1\): DMW is associated with OCD symptoms; \(\rho \neq0\)

  • \(H_0\): No association exists between DMW and OCD symptoms; \(\rho = 0\)

cor.test(ocd_dat$OCD, ocd_dat$MWD)

    Pearson's product-moment correlation

data:  ocd_dat$OCD and ocd_dat$MWD
t = 6.1981, df = 2634, p-value = 6.612e-10
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 0.08209456 0.15735421
sample estimates:
      cor 
0.1198966 

Inferential statistics in this course

  • The goal is to use inferential information (e.g., effect sizes, CIs, SDs, SEs, p values) to make informed decisions and cope with uncertainty

  • By and large, we’ll make heavy use of interpreting the utility of models by way of the sample estimates and their respective sampling uncertainty

  • Hence, parameter estimates, effect size estimates, confidence intervals, and standard errors are always going to be important and useful!

  • Where applicable, we’ll also make use of hypothesis tests, but we should not and will not put all our faith in null (or dull) hypothesis significance testing!

  • Read:

PAUSE ⏸️

How are you feeling so far about the material?


A. Easy and boring…c’mon, I’ve done this already!

B. Pretty easy, but I’m glad we’re reviewing

C. Some stuff I knew, some I didn’t, but I’m able to follow along

D. I have no idea what’s going on…