Let’s use the ToothGrowth dataset. Make sure you first
load the tidyverse package, along with the other packages used below (psych, car, effectsize, lsr, gplots, haven, ppcor, GGally, and plotly).
ToothGrowth {datasets}
The Effect of Vitamin C on Tooth Growth in Guinea Pigs
The outcome (dependent variable) is the length of odontoblasts (cells responsible for tooth growth) in 60 guinea pigs. Each animal received one of three dose levels of vitamin C (0.5, 1, and 2 mg/day) by one of two delivery methods, orange juice or ascorbic acid (a form of vitamin C and coded as VC).
We’ll first look at the difference in length between the two supplement options, OJ and VC. Shall we start with some descriptive statistics and assumption checks?
tg <- ToothGrowth # renaming to a shorter name for easier typing
dim(tg) # dimensions of the dataset
[1] 60 3
summary(tg)
      len        supp         dose
Min. : 4.20 OJ:30 Min. :0.500
1st Qu.:13.07 VC:30 1st Qu.:0.500
Median :19.25 Median :1.000
Mean :18.81 Mean :1.167
3rd Qu.:25.27 3rd Qu.:2.000
Max. :33.90 Max. :2.000
describe(tg) # from the psych package
     vars  n  mean   sd median trimmed  mad min  max range  skew kurtosis   se
len 1 60 18.81 7.65 19.25 18.95 9.04 4.2 33.9 29.7 -0.14 -1.04 0.99
supp* 2 60 1.50 0.50 1.50 1.50 0.74 1.0 2.0 1.0 0.00 -2.03 0.07
dose 3 60 1.17 0.63 1.00 1.15 0.74 0.5 2.0 1.5 0.37 -1.55 0.08
tg %>%
group_by(supp) %>%
summarise(sample_size = n(), mean_length = mean(len), SD = sd(len))
# A tibble: 2 × 4
supp sample_size mean_length SD
<fct> <int> <dbl> <dbl>
1 OJ 30 20.7 6.61
2 VC 30 17.0 8.27
qplot(supp, len, data = tg, main = "Tooth growth of guinea pigs by supplement type",
    xlab = "Supplement type", ylab = "Tooth length") + geom_boxplot(aes(fill = supp))
ggplot(tg, aes(x = len, fill = supp, color = supp)) + geom_histogram(bins = 8) +
    facet_wrap(~supp)
Is there a statistically significant difference in tooth length between the two delivery methods, orange juice (OJ) and ascorbic acid (VC)?
# Making a one-column data frame with only the OJ lengths
OJ <- tg %>%
filter(supp == "OJ") %>%
dplyr::select(len) %>%
rename(OJ = len)
# Making a one-column data frame with only the VC lengths
VC <- tg %>%
filter(supp == "VC") %>%
dplyr::select(len) %>%
rename(VC = len)
# Combining into a simpler dataset
data <- data.frame(OJ, VC)
# NORMALITY
qqnorm(data$OJ)
qqnorm(data$VC)
hist(data$OJ, breaks = 5)
hist(data$VC, breaks = 10)
# Relying on the Shapiro-Wilk test alone is not recommended, but it can be used as a
# supplement
shapiro.test(data$OJ)
Shapiro-Wilk normality test
data: data$OJ
W = 0.91784, p-value = 0.02359
shapiro.test(data$VC)
Shapiro-Wilk normality test
data: data$VC
W = 0.96567, p-value = 0.4284
# Homogeneity of Variance
ggplot(tg, aes(x = len, colour = supp)) + geom_density()
# NHSTs - Looking for statistical NON-significance
leveneTest(data = tg, y = tg$len, group = tg$supp)
Levene's Test for Homogeneity of Variance (center = median)
Df F value Pr(>F)
group 1 1.2136 0.2752
58
var.test(len ~ supp, data = tg) # sensitive to non-normal data
F test to compare two variances
data: len by supp
F = 0.6386, num df = 29, denom df = 29, p-value = 0.2331
alternative hypothesis: true ratio of variances is not equal to 1
95 percent confidence interval:
0.3039488 1.3416857
sample estimates:
ratio of variances
0.6385951
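The F statistic that var.test() reports is simply the ratio of the two sample variances. As a sanity check, here is that ratio computed by hand in base R (using the built-in ToothGrowth data directly):

```r
# Pull the two groups straight from the built-in dataset
oj <- ToothGrowth$len[ToothGrowth$supp == "OJ"]
vc <- ToothGrowth$len[ToothGrowth$supp == "VC"]

# var.test()'s F statistic is just var(x) / var(y)
f_manual <- var(oj) / var(vc)
f_manual  # ~0.639, matching the var.test() output above
```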
# Note above that the outcome/dependent variable goes to the left of the ~ and
# independent variable goes to the right of the ~
# Fligner-Killeen test
fligner.test(len ~ supp, data = tg)
Fligner-Killeen test of homogeneity of variances
data: len by supp
Fligner-Killeen:med chi-squared = 0.97034, df = 1, p-value = 0.3246
# When assumptions are met
t.test(x = data$OJ, y = data$VC, var.equal = T)
Two Sample t-test
data: data$OJ and data$VC
t = 1.9153, df = 58, p-value = 0.06039
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-0.1670064 7.5670064
sample estimates:
mean of x mean of y
20.66333 16.96333
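To see where t = 1.9153 comes from, here is the pooled two-sample t statistic computed by hand (a base-R sketch):

```r
oj <- ToothGrowth$len[ToothGrowth$supp == "OJ"]
vc <- ToothGrowth$len[ToothGrowth$supp == "VC"]
n1 <- length(oj); n2 <- length(vc)

# Pooled variance: weighted average of the two sample variances
sp2 <- ((n1 - 1) * var(oj) + (n2 - 1) * var(vc)) / (n1 + n2 - 2)

# t = mean difference / standard error of the difference
t_manual <- (mean(oj) - mean(vc)) / sqrt(sp2 * (1 / n1 + 1 / n2))
t_manual  # ~1.92, matching the t.test() output above
```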
# Raw effect size, very important! (often more important than the standardized effect size)
mean(data$OJ) - mean(data$VC)
[1] 3.7
# a 3.7-micron difference in length between OJ and VC...
# Standardized Effect Size: Cohen's d
cohens_d(len ~ supp, data = tg, pooled_sd = T)
Cohen's d |        95% CI
-------------------------
0.49 | [-0.02, 1.01]
- Estimated using pooled SD.
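Cohen's d with a pooled SD is just the raw mean difference divided by the pooled standard deviation, so the 0.49 above can be reproduced by hand:

```r
oj <- ToothGrowth$len[ToothGrowth$supp == "OJ"]
vc <- ToothGrowth$len[ToothGrowth$supp == "VC"]
n1 <- length(oj); n2 <- length(vc)

# Pooled standard deviation
sp <- sqrt(((n1 - 1) * var(oj) + (n2 - 1) * var(vc)) / (n1 + n2 - 2))

# Cohen's d: the raw effect size expressed in SD units
d_manual <- (mean(oj) - mean(vc)) / sp
d_manual  # ~0.49
```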
# When the assumption of homogeneity of variance is NOT met
t.test(x = data$OJ, y = data$VC, var.equal = F) # unequal variances
Welch Two Sample t-test
data: data$OJ and data$VC
t = 1.9153, df = 55.309, p-value = 0.06063
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-0.1710156 7.5710156
sample estimates:
mean of x mean of y
20.66333 16.96333
cohens_d(len ~ supp, data = tg, pooled_sd = F) # pooled_sd=F
Cohen's d |        95% CI
-------------------------
0.49 | [-0.02, 1.01]
- Estimated using un-pooled SD.
# Two sample Wilcoxon test (AKA the Mann-Whitney test), for when the data are
# non-normally distributed
wilcox.test(x = data$OJ, y = data$VC, exact = F)
Wilcoxon rank sum test with continuity correction
data: data$OJ and data$VC
W = 575.5, p-value = 0.06449
alternative hypothesis: true location shift is not equal to 0
# paired t-test (dependent samples): Doesn't apply here, but for your reference
t.test(x = data$OJ, y = data$VC, var.equal = T, paired = T) # paired=T
Paired t-test
data: data$OJ and data$VC
t = 3.3026, df = 29, p-value = 0.00255
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
1.408659 5.991341
sample estimates:
mean of the differences
3.7
cohens_d(len ~ supp, data = tg, pooled_sd = T, paired = T) # paired=T
Cohen's d |       95% CI
------------------------
0.60 | [0.21, 1.01]
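For reference, the paired t statistic above is just a one-sample t test on the pairwise differences; a by-hand sketch (remember the pairing is artificial here, since the 60 animals are independent):

```r
oj <- ToothGrowth$len[ToothGrowth$supp == "OJ"]
vc <- ToothGrowth$len[ToothGrowth$supp == "VC"]

# Paired t test = one-sample t test on the differences
diffs <- oj - vc
t_manual <- mean(diffs) / (sd(diffs) / sqrt(length(diffs)))
t_manual  # ~3.30, matching the paired t.test() output above
```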
# let's make the dose variable into a factor:
tg$dose <- as.factor(tg$dose)
# Descriptives
tg %>%
group_by(dose) %>%
summarise(sample_size = n(), mean_length = mean(len),
          SD = sd(len))
# A tibble: 3 × 4
dose sample_size mean_length SD
<fct> <int> <dbl> <dbl>
1 0.5 20 10.6 4.50
2 1 20 19.7 4.42
3 2 20 26.1 3.77
plotmeans( formula = len ~ dose, # plot length by dose
data = tg, # the data frame
xlab = "Vit C Dose", # x-axis label
ylab = "Tooth Length", # y-axis label
           n.label = T ) # display sample size
anova.results <- aov(len ~ dose, data = tg)
# Note that the outcome (DV) is to the left of the ~ and the IV is to the right
# HOV: Equal Variance
leveneTest(anova.results) # HOV - Looking for NON-significant results!
Levene's Test for Homogeneity of Variance (center = median)
Df F value Pr(>F)
group 2 0.6457 0.5281
57
# Normality
anova.residuals <- residuals(object = anova.results) # extract the residuals
hist(anova.residuals) # plot a histogram
qqnorm(anova.residuals)
shapiro.test(anova.residuals) # run Shapiro-Wilk test
Shapiro-Wilk normality test
data: anova.residuals
W = 0.96731, p-value = 0.1076
summary(anova.results)
            Df Sum Sq Mean Sq F value   Pr(>F)
dose 2 2426 1213 67.42 9.53e-16 ***
Residuals 57 1026 18
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# Effect Sizes
etaSquared(anova.results)
        eta.sq eta.sq.part
dose 0.7028642 0.7028642
omega_squared(anova.results)
# Effect Size for ANOVA
Parameter | Omega2 | 95% CI
---------------------------------
dose | 0.69 | [0.57, 1.00]
- One-sided CIs: upper bound fixed at (1).
eta_squared(anova.results, ci = 0.95)
# Effect Size for ANOVA
Parameter | Eta2 | 95% CI
-------------------------------
dose | 0.70 | [0.59, 1.00]
- One-sided CIs: upper bound fixed at (1).
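Both effect sizes can be recovered from the ANOVA table itself: eta-squared is SS_between / SS_total, and omega-squared adjusts that ratio for the number of groups. A base-R sketch:

```r
tg <- ToothGrowth
tg$dose <- as.factor(tg$dose)
anova.results <- aov(len ~ dose, data = tg)

tab <- summary(anova.results)[[1]]
ss <- tab[["Sum Sq"]]   # row 1 = dose (between), row 2 = Residuals (within)
df <- tab[["Df"]]
ms <- tab[["Mean Sq"]]

eta2   <- ss[1] / sum(ss)                            # ~0.70
omega2 <- (ss[1] - df[1] * ms[2]) / (sum(ss) + ms[2]) # ~0.69
c(eta2 = eta2, omega2 = omega2)
```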
posthocPairwiseT(anova.results, p.adjust.method = "bonferroni")
Pairwise comparisons using t tests with pooled SD
data: len and dose
0.5 1
1 2.0e-08 -
2 4.4e-16 4.3e-05
P value adjustment method: bonferroni
# posthocPairwiseT() is a wrapper around the same base function:
pairwise.t.test( x = tg$len, # outcome variable
g = tg$dose , # grouping variable
p.adjust.method = "holm") # which correction to use?
Pairwise comparisons using t tests with pooled SD
data: tg$len and tg$dose
0.5 1
1 1.3e-08 -
2 4.4e-16 1.4e-05
P value adjustment method: holm
# Tukey HSD
TukeyHSD(anova.results)
  Tukey multiple comparisons of means
95% family-wise confidence level
Fit: aov(formula = len ~ dose, data = tg)
$dose
diff lwr upr p adj
1-0.5 9.130 5.901805 12.358195 0.00e+00
2-0.5 15.495 12.266805 18.723195 0.00e+00
2-1 6.365 3.136805 9.593195 4.25e-05
# If the HOV assumption is violated, you can run the Welch one-way test
oneway.test(len ~ dose, data = tg)
One-way analysis of means (not assuming equal variances)
data: len and dose
F = 68.401, num df = 2.000, denom df = 37.743, p-value = 2.812e-13
# If normality assumption is violated
kruskal.test(len ~ dose, data = tg)
Kruskal-Wallis rank sum test
data: len by dose
Kruskal-Wallis chi-squared = 40.669, df = 2, p-value = 1.475e-09
# Visualizing the relationship
plot(data)
ggplot(data, aes(x = OJ, y = VC)) + geom_point() + geom_smooth(method = lm, se = T,
    linetype = "dashed", color = "darkred")
# Correlation test NHST
cor.test(data$OJ, data$VC)
Pearson's product-moment correlation
data: data$OJ and data$VC
t = 4.9134, df = 28, p-value = 3.515e-05
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
0.4241432 0.8358147
sample estimates:
cor
0.6804377
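The t statistic in cor.test() follows directly from r: t = r * sqrt(n - 2) / sqrt(1 - r^2), with n - 2 degrees of freedom. By hand:

```r
oj <- ToothGrowth$len[ToothGrowth$supp == "OJ"]
vc <- ToothGrowth$len[ToothGrowth$supp == "VC"]

r <- cor(oj, vc)
n <- length(oj)
t_manual <- r * sqrt(n - 2) / sqrt(1 - r^2)
c(r = r, t = t_manual)  # r ~0.68, t ~4.91, matching cor.test() above
```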
# If more than 2 variables -- Using a subset of the mtcars dataset
library(GGally)
sub.mtcars <- mtcars %>%
dplyr::select(mpg, disp, hp, drat, wt, qsec)
# Investigate pairs
ggpairs(sub.mtcars, diag = list(continuous = "barDiag"), title = "Correlations between variables in mtcars")
# Spearman's rank correlation coefficient (Spearman's ρ)
# for ordinal data
cor.test(mtcars$hp, mtcars$cyl, method = "spearman", exact = F)
Spearman's rank correlation rho
data: mtcars$hp and mtcars$cyl
S = 535.83, p-value = 1.868e-12
alternative hypothesis: true rho is not equal to 0
sample estimates:
rho
0.9017909
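Spearman's ρ is just Pearson's r computed on the ranks (ties receive average ranks), which we can verify directly:

```r
x <- mtcars$hp
y <- mtcars$cyl

rho_builtin <- cor(x, y, method = "spearman")
rho_manual  <- cor(rank(x), rank(y))  # Pearson's r on the ranks
c(builtin = rho_builtin, manual = rho_manual)  # both ~0.90
```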
Let’s load the aggression dataset
Aggression is measured using the Buss Perry Aggression Questionnaire (BPAQ) and impulsivity is measured by the Barratt Impulsivity Scale (BIS).
BPAQ = Aggression score
BIS = Impulsivity score
aggression <- read_sav("aggression.sav") # read_sav() is from the haven package
str(aggression)
tibble [275 × 8] (S3: tbl_df/tbl/data.frame)
$ age : num [1:275] 18 18 20 17 17 17 17 17 17 17 ...
..- attr(*, "format.spss")= chr "F2.0"
$ BPAQ : num [1:275] 2.62 2.24 2.72 1.93 2.72 ...
..- attr(*, "label")= chr "Aggression total score"
..- attr(*, "format.spss")= chr "F12.10"
..- attr(*, "display_width")= int 14
$ AISS : num [1:275] 2.65 2.85 3.05 2.65 2.95 1.95 2.55 2.3 2 2.15 ...
..- attr(*, "label")= chr "sensation seeking total score"
..- attr(*, "format.spss")= chr "F4.2"
$ alcohol: num [1:275] 28 NA 80 28 10 12 21 3 21 0 ...
..- attr(*, "label")= chr "alcohol consumption (in drinks)"
..- attr(*, "format.spss")= chr "F2.0"
$ BIS : num [1:275] 2.15 3.08 3 1.85 2.08 ...
..- attr(*, "label")= chr "Impulsivity total score"
..- attr(*, "format.spss")= chr "F12.10"
..- attr(*, "display_width")= int 14
$ NEOc : num [1:275] 2.83 2.5 2.75 3.42 3.58 ...
..- attr(*, "label")= chr "Conscientiousness total score"
..- attr(*, "format.spss")= chr "F12.10"
..- attr(*, "display_width")= int 14
$ gender : dbl+lbl [1:275] 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 0, 1, 0, 1, 1, 1, ...
..@ label : chr "biological sex of participant"
..@ format.spss: chr "F1.0"
..@ labels : Named num [1:2] 0 1
.. ..- attr(*, "names")= chr [1:2] "male" "female"
$ NEOo : num [1:275] 2.92 4.17 3.92 4.17 3.5 ...
..- attr(*, "label")= chr "openness to experience total score"
..- attr(*, "format.spss")= chr "F12.10"
..- attr(*, "display_width")= int 14
describe(aggression[, c(1, 2, 5)]) # 1st, 2nd, 5th columns are the 3 variables of interest
     vars   n  mean   sd median trimmed  mad   min   max range skew kurtosis
age 1 275 20.21 4.96 18.00 19.01 1.48 17.00 50.00 33.00 3.70 15.43
BPAQ 2 275 2.61 0.52 2.62 2.61 0.56 1.34 4.03 2.69 0.01 -0.41
BIS 3 275 2.28 0.35 2.27 2.27 0.40 1.42 3.15 1.73 0.36 -0.22
se
age 0.30
BPAQ 0.03
BIS 0.02
corr.test(aggression[, c(1, 2, 5)])
Call: corr.test(x = aggression[, c(1, 2, 5)])
Correlation matrix
age BPAQ BIS
age 1.00 -0.21 -0.14
BPAQ -0.21 1.00 0.32
BIS -0.14 0.32 1.00
Sample Size
[1] 275
Probability values (Entries above the diagonal are adjusted for multiple tests.)
age BPAQ BIS
age 0.00 0 0.02
BPAQ 0.00 0 0.00
BIS 0.02 0 0.00
To see confidence intervals of the correlations, print with the short=FALSE option
car::scatter3d(BPAQ ~ BIS + age, data = aggression, rgl.printRglwidget = TRUE)
# may not work well in Markdown since interactive - try the command below
rglwidget()
library(plotly)
fig <- plot_ly(aggression, x = ~BIS, y = ~BPAQ, z = ~age)
fig <- fig %>%
add_markers(size = 0.02)
fig <- fig %>%
layout(scene = list(xaxis = list(title = "Impulsivity"), yaxis = list(title = "Aggression"),
zaxis = list(title = "Age")))
fig
agrs.mod <- lm(BPAQ ~ age + BIS, data = aggression)
stud_res <- rstudent(agrs.mod) # Studentized residuals (t distribution) * recommended
# Normality
hist(resid(agrs.mod))
plot(agrs.mod, which = 2)
shapiro.test(stud_res) # Looking for nonsignificant results
Shapiro-Wilk normality test
data: stud_res
W = 0.99497, p-value = 0.507
# Linearity
plot(agrs.mod, which = 1) # looking for a flat red line
residualPlots(agrs.mod) # looking for flat blue lines
           Test stat Pr(>|Test stat|)
age 0.1571 0.8753
BIS -0.4727 0.6368
Tukey test -0.7619 0.4461
# Homoscedasticity
plot(agrs.mod, which = 3) # looking for a flat red line
ncvTest(agrs.mod) # looking for nonsignificance
Non-constant Variance Score Test
Variance formula: ~ fitted.values
Chisquare = 0.5141766, Df = 1, p = 0.47334
# Multicollinearity
vif(agrs.mod) # from car package - looking for below 3 (conservative) or 10 (extremely liberal)
     age      BIS
1.019323 1.019323
# Outliers
plot(agrs.mod, which = 4)
# looking for below (extremely conservative) cutoff of 4/(n-P-1) [in our case
# 4/(275-2-1) = ~ 0.015] or below (extremely liberal) cutoff of 1.0
plot(cooks.distance(agrs.mod), ylab = "Cook's Distance", ylim = c(0, 0.2))
influencePlot(agrs.mod)
       StudRes         Hat       CookD
47 2.437728 0.004400059 0.008598094
261 -1.453120 0.110608718 0.087177965
262 1.061081 0.135907121 0.059000607
263 1.069090 0.139881372 0.061927065
264 2.768212 0.004344552 0.010879350
# You would be most concerned about larger bubbles (which is why their row
# numbers are identified), but especially so if they are close to the top-right
# and bottom-right corners (i.e., the observation has both high leverage and
# high discrepancy).
summary(agrs.mod)
Call:
lm(formula = BPAQ ~ age + BIS, data = aggression)
Residuals:
Min 1Q Median 3Q Max
-1.17933 -0.32561 -0.00454 0.35594 1.33877
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.954392 0.244922 7.980 4.12e-14 ***
age -0.017556 0.006032 -2.910 0.00391 **
BIS 0.443607 0.085060 5.215 3.64e-07 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.4906 on 272 degrees of freedom
Multiple R-squared: 0.1299, Adjusted R-squared: 0.1235
F-statistic: 20.31 on 2 and 272 DF, p-value: 6.016e-09
confint(agrs.mod)
                  2.5 %      97.5 %
(Intercept) 1.47220945 2.43657529
age -0.02943198 -0.00568055
BIS 0.27614693 0.61106696
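confint() is computed as estimate ± t_crit × SE, with t_crit taken from the residual degrees of freedom. Since the aggression data file may not be at hand, here is a sketch on mtcars (an illustrative model, not the one fitted above):

```r
# Illustrative model (mtcars stands in for the aggression data here)
m <- lm(mpg ~ wt + hp, data = mtcars)

est   <- coef(summary(m))[, "Estimate"]
se    <- coef(summary(m))[, "Std. Error"]
tcrit <- qt(0.975, df = df.residual(m))  # critical t for a two-sided 95% CI

ci_manual <- cbind(est - tcrit * se, est + tcrit * se)
ci_manual  # identical to confint(m)
```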
spcor(aggression[, c(1, 2, 5)]) # from ppcor package
$estimate
age BPAQ BIS
age 1.00000000 -0.1721304 -0.07521432
BPAQ -0.16460751 1.0000000 0.29496094
BIS -0.07282232 0.2986320 1.00000000
$p.value
age BPAQ BIS
age 0.000000000 0.0042686065 2.145742e-01
BPAQ 0.006315477 0.0000000000 6.654300e-07
BIS 0.229553815 0.0000004753 0.000000e+00
$statistic
age BPAQ BIS
age 0.000000 -2.881861 -1.243990
BPAQ -2.752321 0.000000 5.091128
BIS -1.204214 5.160655 0.000000
$n
[1] 275
$gp
[1] 1
$method
[1] "pearson"
# to get the semi-partial correlations squared you could use
spcors <- spcor(aggression[, c(1, 2, 5)])$estimate
spcors^2
            age       BPAQ         BIS
age 1.00000000 0.02962887 0.005657193
BPAQ 0.02709563 1.00000000 0.087001957
BIS 0.00530309 0.08918109 1.000000000
# The squared semi-partial correlation between X and Y tells you how much unique
# variance in Y X alone can explain.
# look at the row where the variable that is the DV has a 1.00 (i.e., 2nd row
# in these data)
mod1 <- lm(BPAQ ~ BIS + age, data = aggression)
summary(mod1)
Call:
lm(formula = BPAQ ~ BIS + age, data = aggression)
Residuals:
Min 1Q Median 3Q Max
-1.17933 -0.32561 -0.00454 0.35594 1.33877
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.954392 0.244922 7.980 4.12e-14 ***
BIS 0.443607 0.085060 5.215 3.64e-07 ***
age -0.017556 0.006032 -2.910 0.00391 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.4906 on 272 degrees of freedom
Multiple R-squared: 0.1299, Adjusted R-squared: 0.1235
F-statistic: 20.31 on 2 and 272 DF, p-value: 6.016e-09
confint(mod1)
                  2.5 %      97.5 %
(Intercept) 1.47220945 2.43657529
BIS 0.27614693 0.61106696
age -0.02943198 -0.00568055
mod2 <- lm(BPAQ ~ BIS + age + NEOc + NEOo, data = aggression)
# Model 2 information
summary(mod2)
Call:
lm(formula = BPAQ ~ BIS + age + NEOc + NEOo, data = aggression)
Residuals:
Min 1Q Median 3Q Max
-1.20896 -0.32513 -0.00971 0.37144 1.28585
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 2.415691 0.494262 4.887 1.75e-06 ***
BIS 0.369988 0.106467 3.475 0.000595 ***
age -0.016666 0.006175 -2.699 0.007394 **
NEOc -0.072225 0.062888 -1.148 0.251790
NEOo -0.016360 0.059002 -0.277 0.781774
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.4912 on 270 degrees of freedom
Multiple R-squared: 0.1342, Adjusted R-squared: 0.1214
F-statistic: 10.46 on 4 and 270 DF, p-value: 6.788e-08
confint(mod2)
                  2.5 %      97.5 %
(Intercept) 1.44259357 3.38878829
BIS 0.16037625 0.57959935
age -0.02882404 -0.00450888
NEOc -0.19603852 0.05158817
NEOo -0.13252309 0.09980236
spcor(aggression[, c(1, 2, 5, 6, 8)])
$estimate
age BPAQ BIS NEOc NEOo
age 1.000000000 -0.15701239 0.009032315 0.08803045 0.18171659
BPAQ -0.152833352 1.00000000 0.196785971 -0.06503439 -0.01570175
BIS 0.007324224 0.16393530 1.000000000 -0.52385023 -0.14475321
NEOc 0.073454407 -0.05574986 -0.539050578 1.00000000 -0.10266802
NEOo 0.182757037 -0.01622346 -0.179533369 -0.12374560 1.00000000
$p.value
age BPAQ BIS NEOc NEOo
age 0.000000000 0.009495037 8.821204e-01 1.476278e-01 0.002627745
BPAQ 0.011608464 0.000000000 1.104381e-03 2.851747e-01 0.796574189
BIS 0.904293418 0.006736558 0.000000e+00 1.429276e-20 0.016897433
NEOc 0.227242009 0.359704405 6.648811e-22 0.000000e+00 0.091044269
NEOo 0.002480182 0.789970076 2.963634e-03 4.141905e-02 0.000000000
$statistic
age BPAQ BIS NEOc NEOo
age 0.0000000 -2.6123792 0.1484221 1.452125 3.036462
BPAQ -2.5411619 0.0000000 3.2980112 -1.070891 -0.258038
BIS 0.1203525 2.7306749 0.0000000 -10.105227 -2.403856
NEOc 1.2102485 -0.9174907 -10.5161874 0.000000 -1.695970
NEOo 3.0544472 -0.2666137 -2.9987585 -2.049097 0.000000
$n
[1] 275
$gp
[1] 3
$method
[1] "pearson"
# to get the semi-partial correlations squared you could use
spcors2 <- spcor(aggression[, c(1, 2, 5, 6, 8)])$estimate
round(spcors2^2, 2)
     age BPAQ  BIS NEOc NEOo
age 1.00 0.02 0.00 0.01 0.03
BPAQ 0.02 1.00 0.04 0.00 0.00
BIS 0.00 0.03 1.00 0.27 0.02
NEOc 0.01 0.00 0.29 1.00 0.01
NEOo 0.03 0.00 0.03 0.02 1.00
# look at the row where the variable that is the DV has a 1.00 (i.e., 2nd row
# in these data)
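A useful way to see what a semi-partial correlation is: residualize the predictor on the other predictors, then correlate those residuals with the DV. Its square equals the R² increment from adding that predictor to the model. A sketch on mtcars (illustrative variables, not the aggression data):

```r
# Semi-partial r of wt with mpg, with hp partialled out of wt only
full    <- lm(mpg ~ wt + hp, data = mtcars)
reduced <- lm(mpg ~ hp, data = mtcars)

wt_resid <- resid(lm(wt ~ hp, data = mtcars))  # the part of wt independent of hp
sr <- cor(mtcars$mpg, wt_resid)

# Squared semi-partial = unique variance wt explains = R^2 increment
c(sr2 = sr^2,
  delta_r2 = summary(full)$r.squared - summary(reduced)$r.squared)
```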
# Comparing the 2 models
anova(mod1, mod2)
Analysis of Variance Table
Model 1: BPAQ ~ BIS + age
Model 2: BPAQ ~ BIS + age + NEOc + NEOo
Res.Df RSS Df Sum of Sq F Pr(>F)
1 272 65.461
2 270 65.138 2 0.32246 0.6683 0.5134
# R2 of .1342 - .1299 = .0043
# have R do the math:
summary(mod2)$r.squared - summary(mod1)$r.squared
[1] 0.004285952
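The F in anova(mod1, mod2) is the standard nested-model comparison, F = [(RSS1 - RSS2)/q] / [RSS2/df2], where q is the number of added predictors. A by-hand sketch on mtcars (illustrative models):

```r
m1 <- lm(mpg ~ wt, data = mtcars)              # reduced model
m2 <- lm(mpg ~ wt + hp + qsec, data = mtcars)  # full model

rss1 <- sum(resid(m1)^2)
rss2 <- sum(resid(m2)^2)
q    <- df.residual(m1) - df.residual(m2)  # number of added predictors

f_manual <- ((rss1 - rss2) / q) / (rss2 / df.residual(m2))
f_manual  # identical to the F reported by anova(m1, m2)
```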
# information criteria
AIC(mod1)
[1] 393.7029
BIC(mod1)
[1] 408.17
AIC(mod2)
[1] 396.3449
BIC(mod2)
[1] 418.0455
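AIC and BIC both penalize the model's log-likelihood for its parameter count; only the penalty differs (2 per parameter vs log(n) per parameter). A quick check on an illustrative mtcars model:

```r
m  <- lm(mpg ~ wt, data = mtcars)
ll <- logLik(m)
k  <- attr(ll, "df")  # number of estimated parameters (incl. the error variance)

aic_manual <- -2 * as.numeric(ll) + 2 * k
bic_manual <- -2 * as.numeric(ll) + log(nobs(m)) * k
c(AIC = aic_manual, BIC = bic_manual)  # match AIC(m) and BIC(m)
```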