Week 2: Descriptive Statistics and Data Manipulation
….Just a little bit more on Data Types and Structured
Last week we saw different atomic types like:
- Integer
- Numeric
- Logical
- Character
And various data structures such as:
- vector
- matrix
- list
- data frame
- tibble
Today we will briefly add these:
- Factor
- Missing Data
Factors
In R, factors are used to work with categorical variables, variables that have a fixed and known set of possible values. Factor is a special case of vector (data structure).
More than in other software, in R we need to explicitly think about the type of the variables when we do statistical analysis. The following types of variables can be defined or coerced as factor:
Categorical Variables - These are variables that have distinct, non-ordered categories like marital status (e.g., single, married, divorced, widowed), names (e.g., Jack, Mimi, Amy, Zoidberg), fruit (e.g., apple, orange, watermelon) etc. When a variable that has numeric values as placeholders for levels is imported into R, it is good to transform it into a factor so that R doesn’t use it as continuous variable in statistical analysis. For example, 1= married, 2 = divorced, 3 = single, etc.
Ordinal Variables - These are categorical variables that have an order like letter grades (e.g., A+, A, B+,B), tax/income brackets, shoe sizes (e.g., 7, 7.5, 8, 8,5) and Likert scales (e.g., from ‘strongly disagree’ to ‘strongly agree’). In R, this “order” is called levels.
Missing Values (NA)
Missing values are coded as NA in R. Different functions will handle missing data differently by default. Some will spit out an NA (like mean), others will rely on a default for handling the missing data (e.g., listwise deletion, pairwise deletion, etc.). It is therefore important for you to check out the defaults in functions you’re using because you might be handling missing data differently than you thought.
Load Packages
library(tidyverse)
library(haven)
library(babynames)
library(dslabs)
library(psych)
library(car)Descriptive Statistics
we will be working with the mtcars dataset.
Get the names of columns and view the top few rows of the dataset
data(mtcars)
names(mtcars) [1] "mpg" "cyl" "disp" "hp" "drat" "wt" "qsec" "vs" "am" "gear"
[11] "carb"
head(mtcars) mpg cyl disp hp drat wt qsec vs am gear carb
Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
You can also look at specific variables using the name of the dataframe object followed by the ‘$’ operator and the name of the specific column you would like to inspect.
Descriptives (Base R)
Look at just the mpg column
mtcars$mpg [1] 21.0 21.0 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 17.8 16.4 17.3 15.2 10.4
[16] 10.4 14.7 32.4 30.4 33.9 21.5 15.5 15.2 13.3 19.2 27.3 26.0 30.4 15.8 19.7
[31] 15.0 21.4
This is particularly useful if you want to inspect the descriptives for only one or a subset of columns. So we could find the mean of a specific column by calling on the mean and star function and specifying the column we are interested in. We can use this
Find the mean and standard deviation of the mpg column
mean(mtcars$mpg)[1] 20.09062
sd(mtcars$mpg)[1] 6.026948
We can find most of the common descriptive statistics in base R, just
by coding the function and specifying the specific column we want to
summarize. Other common descriptives might be range(),
median(), mode(), etc. However, it can be
incredibly cumbersome to analyze multiple variables using a line of code
for each fit stat.
This dataset has 11 variables, so even to find the mean and standard deviation for each would require 22 lines of code without use of a simpler method. This method is most useful for quickly pulling specific stats from specific columns.
If you want to describe your whole dataset, it would be a lot more
efficient to use a single function. The describe function
in the psych package is one of the simplest ways to take
care of this.
Descriptives (psych and car packages)
psych::describe
By default, missing values will be removed. If you would like missing
values to be included, you can add na.rm = FALSE to the
code.
The describe() function will provide all of the most
common descriptive statistics including mean, median, mode, standard
deviation, etc. as well as skew and kurtosis to provide an idea of the
shape of the distribution
describe(mtcars) vars n mean sd median trimmed mad min max range skew
mpg 1 32 20.09 6.03 19.20 19.70 5.41 10.40 33.90 23.50 0.61
cyl 2 32 6.19 1.79 6.00 6.23 2.97 4.00 8.00 4.00 -0.17
disp 3 32 230.72 123.94 196.30 222.52 140.48 71.10 472.00 400.90 0.38
hp 4 32 146.69 68.56 123.00 141.19 77.10 52.00 335.00 283.00 0.73
drat 5 32 3.60 0.53 3.70 3.58 0.70 2.76 4.93 2.17 0.27
wt 6 32 3.22 0.98 3.33 3.15 0.77 1.51 5.42 3.91 0.42
qsec 7 32 17.85 1.79 17.71 17.83 1.42 14.50 22.90 8.40 0.37
kurtosis se
mpg -0.37 1.07
cyl -1.76 0.32
disp -1.21 21.91
hp -0.14 12.12
drat -0.71 0.09
wt -0.02 0.17
qsec 0.34 0.32
[ reached 'max' / getOption("max.print") -- omitted 4 rows ]
Skewness
Negative skewness values indicate that the distribution is skewed to the left, while positive indicates skew to the right.
Kurtosis
Negative kurtosis indicates thin tails (platykurtic distributions), and positive values represent longer tails (leptokurtic distributions).
For a more brief list of descriptives, you can add
fast = T, which will delete columns like skew, kurtosis,
trimmed, etc.
Descriptives by Grouping Variable
If your dataset includes a grouping variable, you can also find descriptives by group level. So in this dataset, the ‘am’ variable is coded as 0 or 1 to indicate the car has an automatic (0) or manual (1) transmission. This ‘dummy coding’ means that it can be used as a grouping variable. We can therefore look at the means of other variables between types of transmission.
describeBy(mtcars, group = 'am', digits = 2)
Descriptive statistics by group
am: 0
vars n mean sd median trimmed mad min max range skew
mpg 1 19 17.15 3.83 17.30 17.12 3.11 10.40 24.40 14.00 0.01
cyl 2 19 6.95 1.54 8.00 7.06 0.00 4.00 8.00 4.00 -0.95
disp 3 19 290.38 110.17 275.80 289.71 124.83 120.10 472.00 351.90 0.05
hp 4 19 160.26 53.91 175.00 161.06 77.10 62.00 245.00 183.00 -0.01
drat 5 19 3.29 0.39 3.15 3.28 0.22 2.76 3.92 1.16 0.50
wt 6 19 3.77 0.78 3.52 3.75 0.45 2.46 5.42 2.96 0.98
qsec 7 19 18.18 1.75 17.82 18.07 1.19 15.41 22.90 7.49 0.85
kurtosis se
mpg -0.80 0.88
cyl -0.74 0.35
disp -1.26 25.28
hp -1.21 12.37
drat -1.30 0.09
wt 0.14 0.18
qsec 0.55 0.40
[ reached 'max' / getOption("max.print") -- omitted 4 rows ]
------------------------------------------------------------
am: 1
vars n mean sd median trimmed mad min max range skew
mpg 1 13 24.39 6.17 22.80 24.38 6.67 15.00 33.90 18.90 0.05
cyl 2 13 5.08 1.55 4.00 4.91 0.00 4.00 8.00 4.00 0.87
disp 3 13 143.53 87.20 120.30 131.25 58.86 71.10 351.00 279.90 1.33
hp 4 13 126.85 84.06 109.00 114.73 63.75 52.00 335.00 283.00 1.36
drat 5 13 4.05 0.36 4.08 4.02 0.27 3.54 4.93 1.39 0.79
wt 6 13 2.41 0.62 2.32 2.39 0.68 1.51 3.57 2.06 0.21
qsec 7 13 17.36 1.79 17.02 17.39 2.34 14.50 19.90 5.40 -0.23
kurtosis se
mpg -1.46 1.71
cyl -0.90 0.43
disp 0.40 24.19
hp 0.56 23.31
drat 0.21 0.10
wt -1.17 0.17
qsec -1.42 0.50
[ reached 'max' / getOption("max.print") -- omitted 4 rows ]
As you can see from the output of the above code, we now have 2 sets of descriptive statistics, one for am: 0 and another for am: 1.
Another common data inspection tool is graphs and distributions. There are numerous kinds of graphs, and some of these will be explored further using GGPlot; however, below, we will go over a few of the more common graph types.
To build off of our investigation of descriptive statistics, we may
wish to generate a frequency distribution such as a histogram
(continuous data) or bar graph (categorical data). This can be done
easily in base R without installing any new packages. In this case, we
will be using the mpg variable as an example of continuous
data, and cyl as an example of categorical/discrete
data.
Basic Histogram for Continuous Variables
Titles are optional, but improve readability
hist(mtcars$mpg, xlab = "Miles Per Gallon", ylab = "Frequency",
main = "MPG Frequency Distribution", breaks=10)Bar Plot for Discrete Variables
counts <- table(mtcars$gear)
barplot(counts, main="Car Distribution",
xlab="Number of Gears", ylab = "Count",
col = c("dark blue", "red", "yellow")) #"col =" will define the colours you wish to useIf you wish to create a stacked or grouped bar plot, you may define the variables you want to graph and call upon the table object in your code.
Stacked barplot
“legend =” will create a legend so that it is clear which variable corresponds to each colour
stackedvars <- table(mtcars$am, mtcars$cyl)
stackedvars
4 6 8
0 3 4 12
1 8 3 2
barplot(stackedvars, xlab = "Number of Cylinders",
ylab = "Frequency", main = "Frequency Distribution of Cylinders by Transmission",
col = c("blue", "green"), legend = c("automatic", "manual"))Grouped Barplot
# Side by Side
# Add "beside = T" to the above code
barplot(stackedvars, xlab = "Number of Cylinders", ylab = "Frequency",
main = "Frequency Distribution of Cylinders by Transmission",
col = c("blue", "green"), legend = c("automatic", "manual"), beside = T)You can look up either the hist or barplot
functions to see more possible additions. You can also adjust the
placement of your figure on the plot using the par()
functionhttps://www.datamentor.io/r-programming/subplot/.
Say we want to visually inspect the relationship between miles per
gallon and weight of a vehicle. Simple scatterplots are also relatively
easily created in base R using the plot function. This can
similarly be customized for aspects like colour and placement, but we
will be focusing here on a basic introduction to plotting functions.
Using the plot function, we will first input the
variable to plot on the x-axis, then the variable we want on the y-axis.
We should also add in titles for interpretability.
Scatterplot
plot(mtcars$wt, mtcars$mpg, xlab = "Weight",
ylab = "Miles per Gallon",
main = "Miles per Gallon by Weight")Scatterplot with line of best fit (linear)
plot(mtcars$wt, mtcars$mpg, xlab = "Weight",
ylab = "Miles per Gallon",
main = "Miles per Gallon by Weight") + # plus sign
abline(lm(mpg ~ wt, data = mtcars), col = "blue") integer(0)
# Add a line of best fit (linear)Scatterplot with lowess line
plot(mtcars$wt, mtcars$mpg, xlab = "Weight",
ylab = "Miles per Gallon",
main = "Miles per Gallon by Weight") + # plus sign
lines(lowess(mtcars$wt, mtcars$mpg),
col = "red")integer(0)
You may also be interested in inspecting quickly the correlations between variables to see if any relationship appears to exist or to quickly eyeball the data for linearity (although this should be examined much more in depth through other testing - this method will only indicate major violations).
Scatterplor Matrix
sub.mtcars <- mtcars %>%
select(mpg, disp, hp, drat, wt, qsec)
scatterplotMatrix(sub.mtcars)# OR in psych
psych::pairs.panels(sub.mtcars,scale = FALSE)It is important to note that in order to create scatterplots, you
must have equal numbers of cases between x and y. If there are missing
values in either categories, you will need to remove missing values.
This can be done using complete.cases or
na.omit. Although unnecessary for this dataset, the code
would look like this:
Removing Missing Values
mtcars.complete <- mtcars[complete.cases(mtcars),]
# OR
mtcars.complete <- na.omit(mtcars)When doing this, I would suggest creating a new dataframe
object, like mtcars.complete so that the original data
frame and its information isn’t lost or deleted.
Data Manipulation Using the Tidyverse
The powerhouse package for doing data transformation and manipulation
is called dplyr (part of tidyverse).
We will be learning how to use the dplyr package to find
different ways to present and transform our data.
There are 6 primary functions in the dplyr toolkit:
Main Functions
filter() … filters rows (creates a subset of individuals
but preserves all variables)
select() … selects columns (creates a subset of
variables but preserves all individuals)
arrange() … sorts the data in ascending or descending
order
mutate() … adds new variables to your dataset
group_by() … creates a grouping variable so that you can
obtain summary statistics separately for each group on a particular
variable (usually used with summarise)
summarise() … calculates summary statistics - this is
typically used in conjuction with some combination of the other 5
functions above to calculate means, SDs, frequencies etc.
Introduction of the ‘pipe’ %>%
There are a couple of different ways to use the dplyr
functions. All of them take as their first argument, the name of the
dataset that you want to manipulate.
Example: Your data is called “dataset” and you have variables x1, x2, x3
to sort the dataset in increasing order for on x1 you would use:
arrange(dataset, x1)See ?arrange for the help file.
But instead of having to start every function with the name of the
dataset, dplyr can also use a ‘pipe’ to list the dataset outside of the
function. The pipe is represented by the symbol: %>%
(percent, greater than, percent).
Here’s the approach using the pipe, that achieves the same result as the code example above.
dataset %>%
arrange(x1)With a single function, there is minimal benefit to this approach, but it is meant to be used in a chain of commands. So if for example, you wanted to sort the dataset using arrange, then filter based on some criteria, then create a new variable using mutate, you can use the pipe in a chain of commands like:
dataset %>%
arrange(...) %>%
filter(... ) %>%
mutate(...)In English, you could read the %>% symbol as the word
“and then”. E.g., take “dataset” and then arrange based
on some variable, and then filter based on some other
variable, and then mutate based on some third
variable.
Dataset
We will be using the “murders” dataset. If you installed and loaded
the dslabs packages, this dataset is aleady imported to
your environemnt.
The murders dataset shows the US gun murders by state in 2010. From the R Documentation: “Gun murder data from FBI reports. Also contains the population of each state.”
# murders dataset
head(murders) state abb region population total
1 Alabama AL South 4779736 135
2 Alaska AK West 710231 19
3 Arizona AZ West 6392017 232
4 Arkansas AR South 2915918 93
5 California CA West 37253956 1257
6 Colorado CO West 5029196 65
describe(murders) # descriptives vars n mean sd median trimmed mad min
state* 1 51 26.00 14.87 26 26.00 19.27 1
abb* 2 51 26.00 14.87 26 26.00 19.27 1
region* 3 51 2.57 1.06 2 2.59 1.48 1
population 4 51 6075769.18 6860669.07 4339367 4692546.63 3725812.35 563626
total 5 51 184.37 236.13 97 138.41 126.02 2
max range skew kurtosis se
state* 51 50 0.00 -1.27 2.08
abb* 51 50 0.00 -1.27 2.08
region* 4 3 0.02 -1.29 0.15
population 37253956 36690330 2.48 7.26 960685.83
total 1257 1255 2.36 6.88 33.06
df <- murders # making a copy of the dataset under an easier name to type, 'df' as in dataframearrange()
Sorting in ascending or descending order is pretty straightforward
arrange(df, total) # sort on age from lowest to highest state abb region population total
1 Vermont VT Northeast 625741 2
2 North Dakota ND North Central 672591 4
3 New Hampshire NH Northeast 1316470 5
4 Wyoming WY West 563626 5
5 Hawaii HI West 1360301 7
6 South Dakota SD North Central 814180 8
7 Maine ME Northeast 1328361 11
8 Idaho ID West 1567582 12
9 Montana MT West 989415 12
10 Rhode Island RI Northeast 1052567 16
11 Alaska AK West 710231 19
12 Iowa IA North Central 3046355 21
13 Utah UT West 2763885 22
14 West Virginia WV South 1852994 27
15 Nebraska NE North Central 1826341 32
16 Oregon OR West 3831074 36
17 Delaware DE South 897934 38
18 Minnesota MN North Central 5303925 53
19 Kansas KS North Central 2853118 63
20 Colorado CO West 5029196 65
[ reached 'max' / getOption("max.print") -- omitted 31 rows ]
arrange(df, desc(total)) # sort on age from highest to lowest state abb region population total
1 California CA West 37253956 1257
2 Texas TX South 25145561 805
3 Florida FL South 19687653 669
4 New York NY Northeast 19378102 517
5 Pennsylvania PA Northeast 12702379 457
6 Michigan MI North Central 9883640 413
7 Georgia GA South 9920000 376
8 Illinois IL North Central 12830632 364
9 Louisiana LA South 4533372 351
10 Missouri MO North Central 5988927 321
11 Ohio OH North Central 11536504 310
12 Maryland MD South 5773552 293
13 North Carolina NC South 9535483 286
14 Virginia VA South 8001024 250
15 New Jersey NJ Northeast 8791894 246
16 Arizona AZ West 6392017 232
17 Tennessee TN South 6346105 219
18 South Carolina SC South 4625364 207
19 Indiana IN North Central 6483802 142
20 Alabama AL South 4779736 135
[ reached 'max' / getOption("max.print") -- omitted 31 rows ]
# here is the same code using the pipe instead
df %>%
arrange(total) state abb region population total
1 Vermont VT Northeast 625741 2
2 North Dakota ND North Central 672591 4
3 New Hampshire NH Northeast 1316470 5
4 Wyoming WY West 563626 5
5 Hawaii HI West 1360301 7
6 South Dakota SD North Central 814180 8
7 Maine ME Northeast 1328361 11
8 Idaho ID West 1567582 12
9 Montana MT West 989415 12
10 Rhode Island RI Northeast 1052567 16
11 Alaska AK West 710231 19
12 Iowa IA North Central 3046355 21
13 Utah UT West 2763885 22
14 West Virginia WV South 1852994 27
15 Nebraska NE North Central 1826341 32
16 Oregon OR West 3831074 36
17 Delaware DE South 897934 38
18 Minnesota MN North Central 5303925 53
19 Kansas KS North Central 2853118 63
20 Colorado CO West 5029196 65
[ reached 'max' / getOption("max.print") -- omitted 31 rows ]
df %>%
arrange(desc(total)) state abb region population total
1 California CA West 37253956 1257
2 Texas TX South 25145561 805
3 Florida FL South 19687653 669
4 New York NY Northeast 19378102 517
5 Pennsylvania PA Northeast 12702379 457
6 Michigan MI North Central 9883640 413
7 Georgia GA South 9920000 376
8 Illinois IL North Central 12830632 364
9 Louisiana LA South 4533372 351
10 Missouri MO North Central 5988927 321
11 Ohio OH North Central 11536504 310
12 Maryland MD South 5773552 293
13 North Carolina NC South 9535483 286
14 Virginia VA South 8001024 250
15 New Jersey NJ Northeast 8791894 246
16 Arizona AZ West 6392017 232
17 Tennessee TN South 6346105 219
18 South Carolina SC South 4625364 207
19 Indiana IN North Central 6483802 142
20 Alabama AL South 4779736 135
[ reached 'max' / getOption("max.print") -- omitted 31 rows ]
Note that while the output was sorted on total in any of these situations above, it does not change the sorting in your original dataset!
Why?
Remember that R is object based, and we did not create a new object that saves the sorting.
To preserve the sorting in our dataset we have to either over-ride our df dataset or create a second version of our df dataset. E.g.,
df2 <- df %>%
arrange(total)With arrange, saving a new dataset is trivial because
there will be no effects on your statistics, but it non-trivial if you
start filtering and transforming variables, and then want to use this
transformed dataset in an analysis.
Note you can also sort across multiple variables. Note that the order matters because it sorts on the first variable first, then the next one, then the next one…etc.
df %>%
arrange(total,population) # sort total from highest to lowest and on population highest to lowest state abb region population total
1 Vermont VT Northeast 625741 2
2 North Dakota ND North Central 672591 4
3 Wyoming WY West 563626 5
4 New Hampshire NH Northeast 1316470 5
5 Hawaii HI West 1360301 7
6 South Dakota SD North Central 814180 8
7 Maine ME Northeast 1328361 11
8 Montana MT West 989415 12
9 Idaho ID West 1567582 12
10 Rhode Island RI Northeast 1052567 16
11 Alaska AK West 710231 19
12 Iowa IA North Central 3046355 21
13 Utah UT West 2763885 22
14 West Virginia WV South 1852994 27
15 Nebraska NE North Central 1826341 32
16 Oregon OR West 3831074 36
17 Delaware DE South 897934 38
18 Minnesota MN North Central 5303925 53
19 Kansas KS North Central 2853118 63
20 Colorado CO West 5029196 65
[ reached 'max' / getOption("max.print") -- omitted 31 rows ]
df %>%
arrange(total,desc(population)) state abb region population total
1 Vermont VT Northeast 625741 2
2 North Dakota ND North Central 672591 4
3 New Hampshire NH Northeast 1316470 5
4 Wyoming WY West 563626 5
5 Hawaii HI West 1360301 7
6 South Dakota SD North Central 814180 8
7 Maine ME Northeast 1328361 11
8 Idaho ID West 1567582 12
9 Montana MT West 989415 12
10 Rhode Island RI Northeast 1052567 16
11 Alaska AK West 710231 19
12 Iowa IA North Central 3046355 21
13 Utah UT West 2763885 22
14 West Virginia WV South 1852994 27
15 Nebraska NE North Central 1826341 32
16 Oregon OR West 3831074 36
17 Delaware DE South 897934 38
18 Minnesota MN North Central 5303925 53
19 Kansas KS North Central 2853118 63
20 Colorado CO West 5029196 65
[ reached 'max' / getOption("max.print") -- omitted 31 rows ]
select()
This function helps you work with a smaller version of your dataset (on the variable side). This can save you time so you don’t have to scroll through to find the exact names of your variables (remember R is case sensitive so it has to be a perfect match).
Our dataset has a small number of variables so it is already manageable but we will still see how to use the select function.
NOTE: There is one really annoying issue with select() that may come up for you. select() is also the name of a function from the MASS package and MASS is loaded by default so R often wants to use that function instead.
If you use ?select you will see it come up for two packages
There are two ways you can fix this:
sub.df <- df %>%
dplyr::select(abb, total) # tells R to use the select function from dplyr in this case only
select<-dplyr::select # tells R that you want to use the select function from dplyr EVERY time within this R session. If you close R, you will need to run this again
sub.df <- df %>%
select(abb, total)
# other ways to use select
sub.df <- df %>%
select(-state) # all variables EXCEPT these ones
sub.df abb region population total
1 AL South 4779736 135
2 AK West 710231 19
3 AZ West 6392017 232
4 AR South 2915918 93
5 CA West 37253956 1257
6 CO West 5029196 65
7 CT Northeast 3574097 97
8 DE South 897934 38
9 DC South 601723 99
10 FL South 19687653 669
11 GA South 9920000 376
12 HI West 1360301 7
13 ID West 1567582 12
14 IL North Central 12830632 364
15 IN North Central 6483802 142
16 IA North Central 3046355 21
17 KS North Central 2853118 63
18 KY South 4339367 116
19 LA South 4533372 351
20 ME Northeast 1328361 11
21 MD South 5773552 293
22 MA Northeast 6547629 118
23 MI North Central 9883640 413
24 MN North Central 5303925 53
25 MS South 2967297 120
[ reached 'max' / getOption("max.print") -- omitted 26 rows ]
df %>%
select(region:total) # select everything starting in between region and total (inclusive) region population total
1 South 4779736 135
2 West 710231 19
3 West 6392017 232
4 South 2915918 93
5 West 37253956 1257
6 West 5029196 65
7 Northeast 3574097 97
8 South 897934 38
9 South 601723 99
10 South 19687653 669
11 South 9920000 376
12 West 1360301 7
13 West 1567582 12
14 North Central 12830632 364
15 North Central 6483802 142
16 North Central 3046355 21
17 North Central 2853118 63
18 South 4339367 116
19 South 4533372 351
20 Northeast 1328361 11
21 South 5773552 293
22 Northeast 6547629 118
23 North Central 9883640 413
24 North Central 5303925 53
25 South 2967297 120
26 North Central 5988927 321
27 West 989415 12
28 North Central 1826341 32
29 West 2700551 84
30 Northeast 1316470 5
31 Northeast 8791894 246
32 West 2059179 67
33 Northeast 19378102 517
[ reached 'max' / getOption("max.print") -- omitted 18 rows ]
# note if you want to use excluding a range you need parentheses around the variables with a minus sign in front of it
df %>%
select(-(region:total)) # select everything excluding variables in between region and total (inclusive) state abb
1 Alabama AL
2 Alaska AK
3 Arizona AZ
4 Arkansas AR
5 California CA
6 Colorado CO
7 Connecticut CT
8 Delaware DE
9 District of Columbia DC
10 Florida FL
11 Georgia GA
12 Hawaii HI
13 Idaho ID
14 Illinois IL
15 Indiana IN
16 Iowa IA
17 Kansas KS
18 Kentucky KY
19 Louisiana LA
20 Maine ME
21 Maryland MD
22 Massachusetts MA
23 Michigan MI
24 Minnesota MN
25 Mississippi MS
26 Missouri MO
27 Montana MT
28 Nebraska NE
29 Nevada NV
30 New Hampshire NH
31 New Jersey NJ
32 New Mexico NM
33 New York NY
34 North Carolina NC
35 North Dakota ND
36 Ohio OH
37 Oklahoma OK
38 Oregon OR
39 Pennsylvania PA
40 Rhode Island RI
41 South Carolina SC
42 South Dakota SD
43 Tennessee TN
44 Texas TX
45 Utah UT
46 Vermont VT
47 Virginia VA
48 Washington WA
49 West Virginia WV
50 Wisconsin WI
[ reached 'max' / getOption("max.print") -- omitted 1 rows ]
filter()
filter() allows you to work with a smaller number of rows in your dataset based on a commonality. You can also think of this as subsetting your data to include participants with particular characteristics.
Here is where you will be using logical data because you need to tell filter() not just the variable you want to filter on, but also the particular conditions or values.
Here are some examples:
df$state <- factor(df$state)
df$abb <- factor(df$abb)
# only west region
df %>%
filter(region == "West") # note that you need two equal signs state abb region population total
1 Alaska AK West 710231 19
2 Arizona AZ West 6392017 232
3 California CA West 37253956 1257
4 Colorado CO West 5029196 65
5 Hawaii HI West 1360301 7
6 Idaho ID West 1567582 12
7 Montana MT West 989415 12
8 Nevada NV West 2700551 84
9 New Mexico NM West 2059179 67
10 Oregon OR West 3831074 36
11 Utah UT West 2763885 22
12 Washington WA West 6724540 93
13 Wyoming WY West 563626 5
# at or more than 50 murders
df %>%
filter(total >= 50) # greater than or equal to state abb region population total
1 Alabama AL South 4779736 135
2 Arizona AZ West 6392017 232
3 Arkansas AR South 2915918 93
4 California CA West 37253956 1257
5 Colorado CO West 5029196 65
6 Connecticut CT Northeast 3574097 97
7 District of Columbia DC South 601723 99
8 Florida FL South 19687653 669
9 Georgia GA South 9920000 376
10 Illinois IL North Central 12830632 364
11 Indiana IN North Central 6483802 142
12 Kansas KS North Central 2853118 63
13 Kentucky KY South 4339367 116
14 Louisiana LA South 4533372 351
15 Maryland MD South 5773552 293
16 Massachusetts MA Northeast 6547629 118
17 Michigan MI North Central 9883640 413
18 Minnesota MN North Central 5303925 53
19 Mississippi MS South 2967297 120
20 Missouri MO North Central 5988927 321
[ reached 'max' / getOption("max.print") -- omitted 14 rows ]
# population smaller than 1M
df %>%
filter(population < 1000000) # less than 5 (doesn't include 5) state abb region population total
1 Alaska AK West 710231 19
2 Delaware DE South 897934 38
3 District of Columbia DC South 601723 99
4 Montana MT West 989415 12
5 North Dakota ND North Central 672591 4
6 South Dakota SD North Central 814180 8
7 Vermont VT Northeast 625741 2
8 Wyoming WY West 563626 5
You can also filter on multiple conditions
# Only Northeast AND have more than 100 murders
df %>%
filter(region == "Northeast" & total > 100) # both of these have to be true state abb region population total
1 Massachusetts MA Northeast 6547629 118
2 New Jersey NJ Northeast 8791894 246
3 New York NY Northeast 19378102 517
4 Pennsylvania PA Northeast 12702379 457
# Northwest or less than 100
df %>%
filter(region == "Northwest" | total < 100) # only one of these needs to be true state abb region population total
1 Alaska AK West 710231 19
2 Arkansas AR South 2915918 93
3 Colorado CO West 5029196 65
4 Connecticut CT Northeast 3574097 97
5 Delaware DE South 897934 38
6 District of Columbia DC South 601723 99
7 Hawaii HI West 1360301 7
8 Idaho ID West 1567582 12
9 Iowa IA North Central 3046355 21
10 Kansas KS North Central 2853118 63
11 Maine ME Northeast 1328361 11
12 Minnesota MN North Central 5303925 53
13 Montana MT West 989415 12
14 Nebraska NE North Central 1826341 32
15 Nevada NV West 2700551 84
16 New Hampshire NH Northeast 1316470 5
17 New Mexico NM West 2059179 67
18 North Dakota ND North Central 672591 4
19 Oregon OR West 3831074 36
20 Rhode Island RI Northeast 1052567 16
[ reached 'max' / getOption("max.print") -- omitted 7 rows ]
# you could also say what you do NOT want
df %>%
filter(region != "East") # != means not equal to state abb region population total
1 Alabama AL South 4779736 135
2 Alaska AK West 710231 19
3 Arizona AZ West 6392017 232
4 Arkansas AR South 2915918 93
5 California CA West 37253956 1257
6 Colorado CO West 5029196 65
7 Connecticut CT Northeast 3574097 97
8 Delaware DE South 897934 38
9 District of Columbia DC South 601723 99
10 Florida FL South 19687653 669
11 Georgia GA South 9920000 376
12 Hawaii HI West 1360301 7
13 Idaho ID West 1567582 12
14 Illinois IL North Central 12830632 364
15 Indiana IN North Central 6483802 142
16 Iowa IA North Central 3046355 21
17 Kansas KS North Central 2853118 63
18 Kentucky KY South 4339367 116
19 Louisiana LA South 4533372 351
20 Maine ME Northeast 1328361 11
[ reached 'max' / getOption("max.print") -- omitted 31 rows ]
Revisiting the dplyr ‘pipeline’
One of the convenient features of dplyr is that the functions and code were designed to be used together for limitless opportunities to analyze different versions of the dataset. You may not need to use some of the more advanced features but you will likely need to string at least a couple of them together at some point.
In SPSS, you simply click through the different buttons and apply each of them separately to get all of your conditions applied.
In R, you set up a “pipeline” using the symbol %>% as the pipe that joins everything together. You can think read this symbol as the word “then”
Here’s an example pipeline in plain English:
Start with the df dataset and then Create a smaller subset of the dataset for only the variables state, abb, region, total and then Filter the dataset to only include West regions and then Sort the dataset by total from highest to lowest
In R this would look like:
df %>%
select(state, abb, region, total) %>%
filter(region == "West") %>%
arrange(desc(total)) state abb region total
1 California CA West 1257
2 Arizona AZ West 232
3 Washington WA West 93
4 Nevada NV West 84
5 New Mexico NM West 67
6 Colorado CO West 65
7 Oregon OR West 36
8 Utah UT West 22
9 Alaska AK West 19
10 Idaho ID West 12
11 Montana MT West 12
12 Hawaii HI West 7
13 Wyoming WY West 5
Now we can easily find the highest (highest two, three, etc.) number of murders for west regions
mutate()
mutate() is used when you want to create a new variable in your dataset.
For example, we may want to create a new murder ratio variable that is the total / population
new_df<-df %>%
mutate(murderatio =total/population)
new_df state abb region population total murderatio
1 Alabama AL South 4779736 135 2.824424e-05
2 Alaska AK West 710231 19 2.675186e-05
3 Arizona AZ West 6392017 232 3.629527e-05
4 Arkansas AR South 2915918 93 3.189390e-05
5 California CA West 37253956 1257 3.374138e-05
6 Colorado CO West 5029196 65 1.292453e-05
7 Connecticut CT Northeast 3574097 97 2.713972e-05
8 Delaware DE South 897934 38 4.231937e-05
9 District of Columbia DC South 601723 99 1.645275e-04
10 Florida FL South 19687653 669 3.398069e-05
11 Georgia GA South 9920000 376 3.790323e-05
12 Hawaii HI West 1360301 7 5.145920e-06
13 Idaho ID West 1567582 12 7.655102e-06
14 Illinois IL North Central 12830632 364 2.836961e-05
15 Indiana IN North Central 6483802 142 2.190073e-05
16 Iowa IA North Central 3046355 21 6.893484e-06
[ reached 'max' / getOption("max.print") -- omitted 35 rows ]
new_df <- new_df %>%
mutate(murderatio_percent = murderatio*100)
new_df state abb region population total murderatio
1 Alabama AL South 4779736 135 2.824424e-05
2 Alaska AK West 710231 19 2.675186e-05
3 Arizona AZ West 6392017 232 3.629527e-05
4 Arkansas AR South 2915918 93 3.189390e-05
5 California CA West 37253956 1257 3.374138e-05
6 Colorado CO West 5029196 65 1.292453e-05
7 Connecticut CT Northeast 3574097 97 2.713972e-05
8 Delaware DE South 897934 38 4.231937e-05
9 District of Columbia DC South 601723 99 1.645275e-04
10 Florida FL South 19687653 669 3.398069e-05
11 Georgia GA South 9920000 376 3.790323e-05
12 Hawaii HI West 1360301 7 5.145920e-06
13 Idaho ID West 1567582 12 7.655102e-06
14 Illinois IL North Central 12830632 364 2.836961e-05
murderatio_percent
1 0.0028244238
2 0.0026751860
3 0.0036295273
4 0.0031893901
5 0.0033741383
6 0.0012924531
7 0.0027139722
8 0.0042319369
9 0.0164527532
10 0.0033980688
11 0.0037903226
12 0.0005145920
13 0.0007655102
14 0.0028369608
[ reached 'max' / getOption("max.print") -- omitted 37 rows ]
You can also apply square root, log, etc. transformations to variables and apply mutate multiple variables in one step.
summarise() and group_by()
summarise() is most often used to get descriptive statistics at the end of the pipeline for particular subgroups, aggregates of groups, frequency counts, etc.
How summarise() differs from output from regular summary statistics from functions like summary(), describe(), etc. is that the output itself is also a tibble that you could use in future analysis.
Generally speaking, where I would find it more useful is if you need to use a series of data manipulation functions in the pipeline to get exploratory statistics for different groups or want to check something quickly.
You can use it for general descriptive statistics if you want very specific pieces of information as well (especially used in conjunction with group_by() )
You can find a list of useful functions to use in summarise here: https://dplyr.tidyverse.org/reference/summarise.html
df %>%
summarise(mean_cases=mean(total, na.rm=TRUE)) mean_cases
1 184.3725
# with group_by
new_df %>%
group_by(state) %>%
summarise(mean_percent=mean(murderatio_percent, na.rm=TRUE))# A tibble: 51 × 2
state mean_percent
<fct> <dbl>
1 Alabama 0.00282
2 Alaska 0.00268
3 Arizona 0.00363
4 Arkansas 0.00319
5 California 0.00337
6 Colorado 0.00129
7 Connecticut 0.00271
8 Delaware 0.00423
9 District of Columbia 0.0165
10 Florida 0.00340
# … with 41 more rows
new_df %>%
group_by(region) %>%
summarise(N=n()) # A tibble: 4 × 2
region N
<fct> <int>
1 Northeast 9
2 South 17
3 North Central 12
4 West 13
relocate()
relocate allows you to change the position of some of your variables (move to front, end, before or after certain other variables, etc.)
Changing variable order doesn’t affect your analyses or statistics in any way but it may make things more convenient or neater for you.
# move a variable to front of data set (default action)
df2<-df %>%
relocate(total)
df2 total state abb region population
1 135 Alabama AL South 4779736
2 19 Alaska AK West 710231
3 232 Arizona AZ West 6392017
4 93 Arkansas AR South 2915918
5 1257 California CA West 37253956
6 65 Colorado CO West 5029196
7 97 Connecticut CT Northeast 3574097
8 38 Delaware DE South 897934
9 99 District of Columbia DC South 601723
10 669 Florida FL South 19687653
11 376 Georgia GA South 9920000
12 7 Hawaii HI West 1360301
13 12 Idaho ID West 1567582
14 364 Illinois IL North Central 12830632
15 142 Indiana IN North Central 6483802
16 21 Iowa IA North Central 3046355
17 63 Kansas KS North Central 2853118
18 116 Kentucky KY South 4339367
19 351 Louisiana LA South 4533372
20 11 Maine ME Northeast 1328361
[ reached 'max' / getOption("max.print") -- omitted 31 rows ]
# move multiple variables to the front of data set
df2<-df %>%
relocate(total, population)
df2 total population state abb region
1 135 4779736 Alabama AL South
2 19 710231 Alaska AK West
3 232 6392017 Arizona AZ West
4 93 2915918 Arkansas AR South
5 1257 37253956 California CA West
6 65 5029196 Colorado CO West
7 97 3574097 Connecticut CT Northeast
8 38 897934 Delaware DE South
9 99 601723 District of Columbia DC South
10 669 19687653 Florida FL South
11 376 9920000 Georgia GA South
12 7 1360301 Hawaii HI West
13 12 1567582 Idaho ID West
14 364 12830632 Illinois IL North Central
15 142 6483802 Indiana IN North Central
16 21 3046355 Iowa IA North Central
17 63 2853118 Kansas KS North Central
18 116 4339367 Kentucky KY South
19 351 4533372 Louisiana LA South
20 11 1328361 Maine ME Northeast
[ reached 'max' / getOption("max.print") -- omitted 31 rows ]
df state abb region population total
1 Alabama AL South 4779736 135
2 Alaska AK West 710231 19
3 Arizona AZ West 6392017 232
4 Arkansas AR South 2915918 93
5 California CA West 37253956 1257
6 Colorado CO West 5029196 65
7 Connecticut CT Northeast 3574097 97
8 Delaware DE South 897934 38
9 District of Columbia DC South 601723 99
10 Florida FL South 19687653 669
11 Georgia GA South 9920000 376
12 Hawaii HI West 1360301 7
13 Idaho ID West 1567582 12
14 Illinois IL North Central 12830632 364
15 Indiana IN North Central 6483802 142
16 Iowa IA North Central 3046355 21
17 Kansas KS North Central 2853118 63
18 Kentucky KY South 4339367 116
19 Louisiana LA South 4533372 351
20 Maine ME Northeast 1328361 11
[ reached 'max' / getOption("max.print") -- omitted 31 rows ]
# move variable to a particular location
# total between abb and region
df3<- df %>%
relocate(total, .before=region)
df3 state abb total region population
1 Alabama AL 135 South 4779736
2 Alaska AK 19 West 710231
3 Arizona AZ 232 West 6392017
4 Arkansas AR 93 South 2915918
5 California CA 1257 West 37253956
6 Colorado CO 65 West 5029196
7 Connecticut CT 97 Northeast 3574097
8 Delaware DE 38 South 897934
9 District of Columbia DC 99 South 601723
10 Florida FL 669 South 19687653
11 Georgia GA 376 South 9920000
12 Hawaii HI 7 West 1360301
13 Idaho ID 12 West 1567582
14 Illinois IL 364 North Central 12830632
15 Indiana IN 142 North Central 6483802
16 Iowa IA 21 North Central 3046355
17 Kansas KS 63 North Central 2853118
18 Kentucky KY 116 South 4339367
19 Louisiana LA 351 South 4533372
20 Maine ME 11 Northeast 1328361
[ reached 'max' / getOption("max.print") -- omitted 31 rows ]