Week 2: Descriptive Statistics and Data Manipulation

….Just a little bit more on Data Types and Structured

Last week we saw different atomic types like:

  • Integer
  • Numeric
  • Logical
  • Character

And various data structures such as:

  • vector
  • matrix
  • list
  • data frame
  • tibble

Today we will briefly add these:

  • Factor
  • Missing Data

Factors

In R, factors are used to work with categorical variables, variables that have a fixed and known set of possible values. Factor is a special case of vector (data structure).

More than in other software, in R we need to explicitly think about the type of the variables when we do statistical analysis. The following types of variables can be defined or coerced as factor:

Categorical Variables - These are variables that have distinct, non-ordered categories like marital status (e.g., single, married, divorced, widowed), names (e.g., Jack, Mimi, Amy, Zoidberg), fruit (e.g., apple, orange, watermelon) etc. When a variable that has numeric values as placeholders for levels is imported into R, it is good to transform it into a factor so that R doesn’t use it as continuous variable in statistical analysis. For example, 1= married, 2 = divorced, 3 = single, etc.

Ordinal Variables - These are categorical variables that have an order like letter grades (e.g., A+, A, B+,B), tax/income brackets, shoe sizes (e.g., 7, 7.5, 8, 8,5) and Likert scales (e.g., from ‘strongly disagree’ to ‘strongly agree’). In R, this “order” is called levels.

Missing Values (NA)

Missing values are coded as NA in R. Different functions will handle missing data differently by default. Some will spit out an NA (like mean), others will rely on a default for handling the missing data (e.g., listwise deletion, pairwise deletion, etc.). It is therefore important for you to check out the defaults in functions you’re using because you might be handling missing data differently than you thought.

Load Packages

library(tidyverse)
library(haven)
library(babynames)
library(dslabs)
library(psych)
library(car)

Descriptive Statistics

we will be working with the mtcars dataset.

Get the names of columns and view the top few rows of the dataset

data(mtcars)
names(mtcars)
 [1] "mpg"  "cyl"  "disp" "hp"   "drat" "wt"   "qsec" "vs"   "am"   "gear"
[11] "carb"
head(mtcars)
                   mpg cyl disp  hp drat    wt  qsec vs am gear carb
Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1

You can also look at specific variables using the name of the dataframe object followed by the ‘$’ operator and the name of the specific column you would like to inspect.

Descriptives (Base R)

Look at just the mpg column

mtcars$mpg
 [1] 21.0 21.0 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 17.8 16.4 17.3 15.2 10.4
[16] 10.4 14.7 32.4 30.4 33.9 21.5 15.5 15.2 13.3 19.2 27.3 26.0 30.4 15.8 19.7
[31] 15.0 21.4

This is particularly useful if you want to inspect the descriptives for only one or a subset of columns. So we could find the mean of a specific column by calling on the mean and star function and specifying the column we are interested in. We can use this

Find the mean and standard deviation of the mpg column

mean(mtcars$mpg)
[1] 20.09062
sd(mtcars$mpg)
[1] 6.026948

We can find most of the common descriptive statistics in base R, just by coding the function and specifying the specific column we want to summarize. Other common descriptives might be range(), median(), mode(), etc. However, it can be incredibly cumbersome to analyze multiple variables using a line of code for each fit stat.

This dataset has 11 variables, so even to find the mean and standard deviation for each would require 22 lines of code without use of a simpler method. This method is most useful for quickly pulling specific stats from specific columns.

If you want to describe your whole dataset, it would be a lot more efficient to use a single function. The describe function in the psych package is one of the simplest ways to take care of this.

Descriptives (psych and car packages)

psych::describe

By default, missing values will be removed. If you would like missing values to be included, you can add na.rm = FALSE to the code.

The describe() function will provide all of the most common descriptive statistics including mean, median, mode, standard deviation, etc. as well as skew and kurtosis to provide an idea of the shape of the distribution

describe(mtcars)
     vars  n   mean     sd median trimmed    mad   min    max  range  skew
mpg     1 32  20.09   6.03  19.20   19.70   5.41 10.40  33.90  23.50  0.61
cyl     2 32   6.19   1.79   6.00    6.23   2.97  4.00   8.00   4.00 -0.17
disp    3 32 230.72 123.94 196.30  222.52 140.48 71.10 472.00 400.90  0.38
hp      4 32 146.69  68.56 123.00  141.19  77.10 52.00 335.00 283.00  0.73
drat    5 32   3.60   0.53   3.70    3.58   0.70  2.76   4.93   2.17  0.27
wt      6 32   3.22   0.98   3.33    3.15   0.77  1.51   5.42   3.91  0.42
qsec    7 32  17.85   1.79  17.71   17.83   1.42 14.50  22.90   8.40  0.37
     kurtosis    se
mpg     -0.37  1.07
cyl     -1.76  0.32
disp    -1.21 21.91
hp      -0.14 12.12
drat    -0.71  0.09
wt      -0.02  0.17
qsec     0.34  0.32
 [ reached 'max' / getOption("max.print") -- omitted 4 rows ]

Skewness

Negative skewness values indicate that the distribution is skewed to the left, while positive indicates skew to the right.

Kurtosis

Negative kurtosis indicates thin tails (platykurtic distributions), and positive values represent longer tails (leptokurtic distributions).

For a more brief list of descriptives, you can add fast = T, which will delete columns like skew, kurtosis, trimmed, etc.

Descriptives by Grouping Variable

If your dataset includes a grouping variable, you can also find descriptives by group level. So in this dataset, the ‘am’ variable is coded as 0 or 1 to indicate the car has an automatic (0) or manual (1) transmission. This ‘dummy coding’ means that it can be used as a grouping variable. We can therefore look at the means of other variables between types of transmission.

describeBy(mtcars, group = 'am', digits = 2)

 Descriptive statistics by group 
am: 0
     vars  n   mean     sd median trimmed    mad    min    max  range  skew
mpg     1 19  17.15   3.83  17.30   17.12   3.11  10.40  24.40  14.00  0.01
cyl     2 19   6.95   1.54   8.00    7.06   0.00   4.00   8.00   4.00 -0.95
disp    3 19 290.38 110.17 275.80  289.71 124.83 120.10 472.00 351.90  0.05
hp      4 19 160.26  53.91 175.00  161.06  77.10  62.00 245.00 183.00 -0.01
drat    5 19   3.29   0.39   3.15    3.28   0.22   2.76   3.92   1.16  0.50
wt      6 19   3.77   0.78   3.52    3.75   0.45   2.46   5.42   2.96  0.98
qsec    7 19  18.18   1.75  17.82   18.07   1.19  15.41  22.90   7.49  0.85
     kurtosis    se
mpg     -0.80  0.88
cyl     -0.74  0.35
disp    -1.26 25.28
hp      -1.21 12.37
drat    -1.30  0.09
wt       0.14  0.18
qsec     0.55  0.40
 [ reached 'max' / getOption("max.print") -- omitted 4 rows ]
------------------------------------------------------------ 
am: 1
     vars  n   mean    sd median trimmed   mad   min    max  range  skew
mpg     1 13  24.39  6.17  22.80   24.38  6.67 15.00  33.90  18.90  0.05
cyl     2 13   5.08  1.55   4.00    4.91  0.00  4.00   8.00   4.00  0.87
disp    3 13 143.53 87.20 120.30  131.25 58.86 71.10 351.00 279.90  1.33
hp      4 13 126.85 84.06 109.00  114.73 63.75 52.00 335.00 283.00  1.36
drat    5 13   4.05  0.36   4.08    4.02  0.27  3.54   4.93   1.39  0.79
wt      6 13   2.41  0.62   2.32    2.39  0.68  1.51   3.57   2.06  0.21
qsec    7 13  17.36  1.79  17.02   17.39  2.34 14.50  19.90   5.40 -0.23
     kurtosis    se
mpg     -1.46  1.71
cyl     -0.90  0.43
disp     0.40 24.19
hp       0.56 23.31
drat     0.21  0.10
wt      -1.17  0.17
qsec    -1.42  0.50
 [ reached 'max' / getOption("max.print") -- omitted 4 rows ]

As you can see from the output of the above code, we now have 2 sets of descriptive statistics, one for am: 0 and another for am: 1.

Another common data inspection tool is graphs and distributions. There are numerous kinds of graphs, and some of these will be explored further using GGPlot; however, below, we will go over a few of the more common graph types.

To build off of our investigation of descriptive statistics, we may wish to generate a frequency distribution such as a histogram (continuous data) or bar graph (categorical data). This can be done easily in base R without installing any new packages. In this case, we will be using the mpg variable as an example of continuous data, and cyl as an example of categorical/discrete data.

Basic Histogram for Continuous Variables

Titles are optional, but improve readability

hist(mtcars$mpg, xlab = "Miles Per Gallon", ylab = "Frequency", 
     main = "MPG Frequency Distribution", breaks=10)

Bar Plot for Discrete Variables

counts <- table(mtcars$gear)

barplot(counts, main="Car Distribution",
   xlab="Number of Gears", ylab = "Count",
   col = c("dark blue", "red", "yellow")) 

#"col =" will define the colours you wish to use

If you wish to create a stacked or grouped bar plot, you may define the variables you want to graph and call upon the table object in your code.

Stacked barplot

“legend =” will create a legend so that it is clear which variable corresponds to each colour

stackedvars <- table(mtcars$am, mtcars$cyl)
stackedvars
   
     4  6  8
  0  3  4 12
  1  8  3  2
barplot(stackedvars, xlab = "Number of Cylinders", 
        ylab = "Frequency", main = "Frequency Distribution of Cylinders by Transmission", 
        col = c("blue", "green"), legend = c("automatic", "manual"))

Grouped Barplot

# Side by Side
# Add "beside = T" to the above code
barplot(stackedvars, xlab = "Number of Cylinders", ylab = "Frequency", 
        main = "Frequency Distribution of Cylinders by Transmission", 
        col = c("blue", "green"), legend = c("automatic", "manual"), beside = T)

You can look up either the hist or barplot functions to see more possible additions. You can also adjust the placement of your figure on the plot using the par() functionhttps://www.datamentor.io/r-programming/subplot/.

Say we want to visually inspect the relationship between miles per gallon and weight of a vehicle. Simple scatterplots are also relatively easily created in base R using the plot function. This can similarly be customized for aspects like colour and placement, but we will be focusing here on a basic introduction to plotting functions.

Using the plot function, we will first input the variable to plot on the x-axis, then the variable we want on the y-axis. We should also add in titles for interpretability.

Scatterplot

plot(mtcars$wt, mtcars$mpg, xlab = "Weight", 
     ylab = "Miles per Gallon", 
     main = "Miles per Gallon by Weight")

Scatterplot with line of best fit (linear)

plot(mtcars$wt, mtcars$mpg, xlab = "Weight", 
     ylab = "Miles per Gallon", 
     main = "Miles per Gallon by Weight") + # plus sign
abline(lm(mpg ~ wt, data = mtcars), col = "blue") 

integer(0)
# Add a line of best fit (linear)

Scatterplot with lowess line

plot(mtcars$wt, mtcars$mpg, xlab = "Weight", 
     ylab = "Miles per Gallon", 
     main = "Miles per Gallon by Weight") + # plus sign
lines(lowess(mtcars$wt, mtcars$mpg), 
      col = "red")

integer(0)

You may also be interested in inspecting quickly the correlations between variables to see if any relationship appears to exist or to quickly eyeball the data for linearity (although this should be examined much more in depth through other testing - this method will only indicate major violations).

Scatterplor Matrix

sub.mtcars <- mtcars %>%
  select(mpg, disp, hp, drat, wt, qsec)

scatterplotMatrix(sub.mtcars)

# OR in psych

  psych::pairs.panels(sub.mtcars,scale = FALSE)

It is important to note that in order to create scatterplots, you must have equal numbers of cases between x and y. If there are missing values in either categories, you will need to remove missing values. This can be done using complete.cases or na.omit. Although unnecessary for this dataset, the code would look like this:

Removing Missing Values

mtcars.complete <- mtcars[complete.cases(mtcars),]
# OR
mtcars.complete <- na.omit(mtcars)

When doing this, I would suggest creating a new dataframe object, like mtcars.complete so that the original data frame and its information isn’t lost or deleted.

Data Manipulation Using the Tidyverse

The powerhouse package for doing data transformation and manipulation is called dplyr (part of tidyverse).

We will be learning how to use the dplyr package to find different ways to present and transform our data.

There are 6 primary functions in the dplyr toolkit:

Main Functions

filter() … filters rows (creates a subset of individuals but preserves all variables)

select() … selects columns (creates a subset of variables but preserves all individuals)

arrange() … sorts the data in ascending or descending order

mutate() … adds new variables to your dataset

group_by() … creates a grouping variable so that you can obtain summary statistics separately for each group on a particular variable (usually used with summarise)

summarise() … calculates summary statistics - this is typically used in conjuction with some combination of the other 5 functions above to calculate means, SDs, frequencies etc.

Introduction of the ‘pipe’ %>%

There are a couple of different ways to use the dplyr functions. All of them take as their first argument, the name of the dataset that you want to manipulate.

Example: Your data is called “dataset” and you have variables x1, x2, x3

to sort the dataset in increasing order for on x1 you would use:

arrange(dataset, x1)

See ?arrange for the help file.

But instead of having to start every function with the name of the dataset, dplyr can also use a ‘pipe’ to list the dataset outside of the function. The pipe is represented by the symbol: %>% (percent, greater than, percent).

Here’s the approach using the pipe, that achieves the same result as the code example above.

dataset %>%
  arrange(x1)

With a single function, there is minimal benefit to this approach, but it is meant to be used in a chain of commands. So if for example, you wanted to sort the dataset using arrange, then filter based on some criteria, then create a new variable using mutate, you can use the pipe in a chain of commands like:

dataset %>%
  arrange(...)  %>%
  filter(... ) %>%
  mutate(...)

In English, you could read the %>% symbol as the word “and then”. E.g., take “dataset” and then arrange based on some variable, and then filter based on some other variable, and then mutate based on some third variable.

Dataset

We will be using the “murders” dataset. If you installed and loaded the dslabs packages, this dataset is aleady imported to your environemnt.

The murders dataset shows the US gun murders by state in 2010. From the R Documentation: “Gun murder data from FBI reports. Also contains the population of each state.”

# murders dataset
head(murders)
       state abb region population total
1    Alabama  AL  South    4779736   135
2     Alaska  AK   West     710231    19
3    Arizona  AZ   West    6392017   232
4   Arkansas  AR  South    2915918    93
5 California  CA   West   37253956  1257
6   Colorado  CO   West    5029196    65
describe(murders) # descriptives
           vars  n       mean         sd  median    trimmed        mad    min
state*        1 51      26.00      14.87      26      26.00      19.27      1
abb*          2 51      26.00      14.87      26      26.00      19.27      1
region*       3 51       2.57       1.06       2       2.59       1.48      1
population    4 51 6075769.18 6860669.07 4339367 4692546.63 3725812.35 563626
total         5 51     184.37     236.13      97     138.41     126.02      2
                max    range skew kurtosis        se
state*           51       50 0.00    -1.27      2.08
abb*             51       50 0.00    -1.27      2.08
region*           4        3 0.02    -1.29      0.15
population 37253956 36690330 2.48     7.26 960685.83
total          1257     1255 2.36     6.88     33.06
df <- murders # making a copy of the dataset under an easier name to type, 'df' as in dataframe

arrange()

Sorting in ascending or descending order is pretty straightforward

arrange(df, total) # sort on age from lowest to highest
           state abb        region population total
1        Vermont  VT     Northeast     625741     2
2   North Dakota  ND North Central     672591     4
3  New Hampshire  NH     Northeast    1316470     5
4        Wyoming  WY          West     563626     5
5         Hawaii  HI          West    1360301     7
6   South Dakota  SD North Central     814180     8
7          Maine  ME     Northeast    1328361    11
8          Idaho  ID          West    1567582    12
9        Montana  MT          West     989415    12
10  Rhode Island  RI     Northeast    1052567    16
11        Alaska  AK          West     710231    19
12          Iowa  IA North Central    3046355    21
13          Utah  UT          West    2763885    22
14 West Virginia  WV         South    1852994    27
15      Nebraska  NE North Central    1826341    32
16        Oregon  OR          West    3831074    36
17      Delaware  DE         South     897934    38
18     Minnesota  MN North Central    5303925    53
19        Kansas  KS North Central    2853118    63
20      Colorado  CO          West    5029196    65
 [ reached 'max' / getOption("max.print") -- omitted 31 rows ]
arrange(df, desc(total)) # sort on age from highest to lowest
            state abb        region population total
1      California  CA          West   37253956  1257
2           Texas  TX         South   25145561   805
3         Florida  FL         South   19687653   669
4        New York  NY     Northeast   19378102   517
5    Pennsylvania  PA     Northeast   12702379   457
6        Michigan  MI North Central    9883640   413
7         Georgia  GA         South    9920000   376
8        Illinois  IL North Central   12830632   364
9       Louisiana  LA         South    4533372   351
10       Missouri  MO North Central    5988927   321
11           Ohio  OH North Central   11536504   310
12       Maryland  MD         South    5773552   293
13 North Carolina  NC         South    9535483   286
14       Virginia  VA         South    8001024   250
15     New Jersey  NJ     Northeast    8791894   246
16        Arizona  AZ          West    6392017   232
17      Tennessee  TN         South    6346105   219
18 South Carolina  SC         South    4625364   207
19        Indiana  IN North Central    6483802   142
20        Alabama  AL         South    4779736   135
 [ reached 'max' / getOption("max.print") -- omitted 31 rows ]
# here is the same code using the pipe instead

df %>%
    arrange(total)
           state abb        region population total
1        Vermont  VT     Northeast     625741     2
2   North Dakota  ND North Central     672591     4
3  New Hampshire  NH     Northeast    1316470     5
4        Wyoming  WY          West     563626     5
5         Hawaii  HI          West    1360301     7
6   South Dakota  SD North Central     814180     8
7          Maine  ME     Northeast    1328361    11
8          Idaho  ID          West    1567582    12
9        Montana  MT          West     989415    12
10  Rhode Island  RI     Northeast    1052567    16
11        Alaska  AK          West     710231    19
12          Iowa  IA North Central    3046355    21
13          Utah  UT          West    2763885    22
14 West Virginia  WV         South    1852994    27
15      Nebraska  NE North Central    1826341    32
16        Oregon  OR          West    3831074    36
17      Delaware  DE         South     897934    38
18     Minnesota  MN North Central    5303925    53
19        Kansas  KS North Central    2853118    63
20      Colorado  CO          West    5029196    65
 [ reached 'max' / getOption("max.print") -- omitted 31 rows ]
df %>%
    arrange(desc(total))
            state abb        region population total
1      California  CA          West   37253956  1257
2           Texas  TX         South   25145561   805
3         Florida  FL         South   19687653   669
4        New York  NY     Northeast   19378102   517
5    Pennsylvania  PA     Northeast   12702379   457
6        Michigan  MI North Central    9883640   413
7         Georgia  GA         South    9920000   376
8        Illinois  IL North Central   12830632   364
9       Louisiana  LA         South    4533372   351
10       Missouri  MO North Central    5988927   321
11           Ohio  OH North Central   11536504   310
12       Maryland  MD         South    5773552   293
13 North Carolina  NC         South    9535483   286
14       Virginia  VA         South    8001024   250
15     New Jersey  NJ     Northeast    8791894   246
16        Arizona  AZ          West    6392017   232
17      Tennessee  TN         South    6346105   219
18 South Carolina  SC         South    4625364   207
19        Indiana  IN North Central    6483802   142
20        Alabama  AL         South    4779736   135
 [ reached 'max' / getOption("max.print") -- omitted 31 rows ]

Note that while the output was sorted on total in any of these situations above, it does not change the sorting in your original dataset!

Why?

Remember that R is object based, and we did not create a new object that saves the sorting.

To preserve the sorting in our dataset we have to either over-ride our df dataset or create a second version of our df dataset. E.g.,

df2 <- df %>%
    arrange(total)

With arrange, saving a new dataset is trivial because there will be no effects on your statistics, but it non-trivial if you start filtering and transforming variables, and then want to use this transformed dataset in an analysis.

Note you can also sort across multiple variables. Note that the order matters because it sorts on the first variable first, then the next one, then the next one…etc.

df %>% 
    arrange(total,population) # sort total from highest to lowest and on population highest to lowest
           state abb        region population total
1        Vermont  VT     Northeast     625741     2
2   North Dakota  ND North Central     672591     4
3        Wyoming  WY          West     563626     5
4  New Hampshire  NH     Northeast    1316470     5
5         Hawaii  HI          West    1360301     7
6   South Dakota  SD North Central     814180     8
7          Maine  ME     Northeast    1328361    11
8        Montana  MT          West     989415    12
9          Idaho  ID          West    1567582    12
10  Rhode Island  RI     Northeast    1052567    16
11        Alaska  AK          West     710231    19
12          Iowa  IA North Central    3046355    21
13          Utah  UT          West    2763885    22
14 West Virginia  WV         South    1852994    27
15      Nebraska  NE North Central    1826341    32
16        Oregon  OR          West    3831074    36
17      Delaware  DE         South     897934    38
18     Minnesota  MN North Central    5303925    53
19        Kansas  KS North Central    2853118    63
20      Colorado  CO          West    5029196    65
 [ reached 'max' / getOption("max.print") -- omitted 31 rows ]
df %>%
    arrange(total,desc(population)) 
           state abb        region population total
1        Vermont  VT     Northeast     625741     2
2   North Dakota  ND North Central     672591     4
3  New Hampshire  NH     Northeast    1316470     5
4        Wyoming  WY          West     563626     5
5         Hawaii  HI          West    1360301     7
6   South Dakota  SD North Central     814180     8
7          Maine  ME     Northeast    1328361    11
8          Idaho  ID          West    1567582    12
9        Montana  MT          West     989415    12
10  Rhode Island  RI     Northeast    1052567    16
11        Alaska  AK          West     710231    19
12          Iowa  IA North Central    3046355    21
13          Utah  UT          West    2763885    22
14 West Virginia  WV         South    1852994    27
15      Nebraska  NE North Central    1826341    32
16        Oregon  OR          West    3831074    36
17      Delaware  DE         South     897934    38
18     Minnesota  MN North Central    5303925    53
19        Kansas  KS North Central    2853118    63
20      Colorado  CO          West    5029196    65
 [ reached 'max' / getOption("max.print") -- omitted 31 rows ]

select()

This function helps you work with a smaller version of your dataset (on the variable side). This can save you time so you don’t have to scroll through to find the exact names of your variables (remember R is case sensitive so it has to be a perfect match).

Our dataset has a small number of variables so it is already manageable but we will still see how to use the select function.

NOTE: There is one really annoying issue with select() that may come up for you. select() is also the name of a function from the MASS package and MASS is loaded by default so R often wants to use that function instead.

If you use ?select you will see it come up for two packages

There are two ways you can fix this:

sub.df <-  df %>%
    dplyr::select(abb, total) # tells R to use the select function from dplyr in this case only

select<-dplyr::select # tells R that you want to use the select function from dplyr EVERY time within this R session. If you close R, you will need to run this again

sub.df <- df %>%
    select(abb, total)

# other ways to use select

sub.df <- df %>% 
    select(-state) # all variables EXCEPT these ones

sub.df
   abb        region population total
1   AL         South    4779736   135
2   AK          West     710231    19
3   AZ          West    6392017   232
4   AR         South    2915918    93
5   CA          West   37253956  1257
6   CO          West    5029196    65
7   CT     Northeast    3574097    97
8   DE         South     897934    38
9   DC         South     601723    99
10  FL         South   19687653   669
11  GA         South    9920000   376
12  HI          West    1360301     7
13  ID          West    1567582    12
14  IL North Central   12830632   364
15  IN North Central    6483802   142
16  IA North Central    3046355    21
17  KS North Central    2853118    63
18  KY         South    4339367   116
19  LA         South    4533372   351
20  ME     Northeast    1328361    11
21  MD         South    5773552   293
22  MA     Northeast    6547629   118
23  MI North Central    9883640   413
24  MN North Central    5303925    53
25  MS         South    2967297   120
 [ reached 'max' / getOption("max.print") -- omitted 26 rows ]
df %>%
    select(region:total) # select everything starting in between region and total (inclusive)
          region population total
1          South    4779736   135
2           West     710231    19
3           West    6392017   232
4          South    2915918    93
5           West   37253956  1257
6           West    5029196    65
7      Northeast    3574097    97
8          South     897934    38
9          South     601723    99
10         South   19687653   669
11         South    9920000   376
12          West    1360301     7
13          West    1567582    12
14 North Central   12830632   364
15 North Central    6483802   142
16 North Central    3046355    21
17 North Central    2853118    63
18         South    4339367   116
19         South    4533372   351
20     Northeast    1328361    11
21         South    5773552   293
22     Northeast    6547629   118
23 North Central    9883640   413
24 North Central    5303925    53
25         South    2967297   120
26 North Central    5988927   321
27          West     989415    12
28 North Central    1826341    32
29          West    2700551    84
30     Northeast    1316470     5
31     Northeast    8791894   246
32          West    2059179    67
33     Northeast   19378102   517
 [ reached 'max' / getOption("max.print") -- omitted 18 rows ]
# note if you want to use excluding a range you need parentheses around the variables with a minus sign in front of it
df %>%
    select(-(region:total)) # select everything excluding variables in between region and total (inclusive)
                  state abb
1               Alabama  AL
2                Alaska  AK
3               Arizona  AZ
4              Arkansas  AR
5            California  CA
6              Colorado  CO
7           Connecticut  CT
8              Delaware  DE
9  District of Columbia  DC
10              Florida  FL
11              Georgia  GA
12               Hawaii  HI
13                Idaho  ID
14             Illinois  IL
15              Indiana  IN
16                 Iowa  IA
17               Kansas  KS
18             Kentucky  KY
19            Louisiana  LA
20                Maine  ME
21             Maryland  MD
22        Massachusetts  MA
23             Michigan  MI
24            Minnesota  MN
25          Mississippi  MS
26             Missouri  MO
27              Montana  MT
28             Nebraska  NE
29               Nevada  NV
30        New Hampshire  NH
31           New Jersey  NJ
32           New Mexico  NM
33             New York  NY
34       North Carolina  NC
35         North Dakota  ND
36                 Ohio  OH
37             Oklahoma  OK
38               Oregon  OR
39         Pennsylvania  PA
40         Rhode Island  RI
41       South Carolina  SC
42         South Dakota  SD
43            Tennessee  TN
44                Texas  TX
45                 Utah  UT
46              Vermont  VT
47             Virginia  VA
48           Washington  WA
49        West Virginia  WV
50            Wisconsin  WI
 [ reached 'max' / getOption("max.print") -- omitted 1 rows ]

filter()

filter() allows you to work with a smaller number of rows in your dataset based on a commonality. You can also think of this as subsetting your data to include participants with particular characteristics.

Here is where you will be using logical data because you need to tell filter() not just the variable you want to filter on, but also the particular conditions or values.

Here are some examples:

df$state <- factor(df$state)
df$abb <- factor(df$abb)

# only west region

df %>%
    filter(region == "West")  # note that you need two equal signs
        state abb region population total
1      Alaska  AK   West     710231    19
2     Arizona  AZ   West    6392017   232
3  California  CA   West   37253956  1257
4    Colorado  CO   West    5029196    65
5      Hawaii  HI   West    1360301     7
6       Idaho  ID   West    1567582    12
7     Montana  MT   West     989415    12
8      Nevada  NV   West    2700551    84
9  New Mexico  NM   West    2059179    67
10     Oregon  OR   West    3831074    36
11       Utah  UT   West    2763885    22
12 Washington  WA   West    6724540    93
13    Wyoming  WY   West     563626     5
# at or more than 50 murders 

df %>%
    filter(total >= 50) # greater than or equal to
                  state abb        region population total
1               Alabama  AL         South    4779736   135
2               Arizona  AZ          West    6392017   232
3              Arkansas  AR         South    2915918    93
4            California  CA          West   37253956  1257
5              Colorado  CO          West    5029196    65
6           Connecticut  CT     Northeast    3574097    97
7  District of Columbia  DC         South     601723    99
8               Florida  FL         South   19687653   669
9               Georgia  GA         South    9920000   376
10             Illinois  IL North Central   12830632   364
11              Indiana  IN North Central    6483802   142
12               Kansas  KS North Central    2853118    63
13             Kentucky  KY         South    4339367   116
14            Louisiana  LA         South    4533372   351
15             Maryland  MD         South    5773552   293
16        Massachusetts  MA     Northeast    6547629   118
17             Michigan  MI North Central    9883640   413
18            Minnesota  MN North Central    5303925    53
19          Mississippi  MS         South    2967297   120
20             Missouri  MO North Central    5988927   321
 [ reached 'max' / getOption("max.print") -- omitted 14 rows ]
# population smaller than 1M

df %>%
    filter(population < 1000000) # less than 5 (doesn't include 5)
                 state abb        region population total
1               Alaska  AK          West     710231    19
2             Delaware  DE         South     897934    38
3 District of Columbia  DC         South     601723    99
4              Montana  MT          West     989415    12
5         North Dakota  ND North Central     672591     4
6         South Dakota  SD North Central     814180     8
7              Vermont  VT     Northeast     625741     2
8              Wyoming  WY          West     563626     5

You can also filter on multiple conditions

# Only Northeast AND have more than 100 murders
df %>%
    filter(region == "Northeast" & total > 100)  # both of these have to be true
          state abb    region population total
1 Massachusetts  MA Northeast    6547629   118
2    New Jersey  NJ Northeast    8791894   246
3      New York  NY Northeast   19378102   517
4  Pennsylvania  PA Northeast   12702379   457
# Northwest or less than 100

df %>%
    filter(region == "Northwest" | total < 100) # only one of these needs to be true
                  state abb        region population total
1                Alaska  AK          West     710231    19
2              Arkansas  AR         South    2915918    93
3              Colorado  CO          West    5029196    65
4           Connecticut  CT     Northeast    3574097    97
5              Delaware  DE         South     897934    38
6  District of Columbia  DC         South     601723    99
7                Hawaii  HI          West    1360301     7
8                 Idaho  ID          West    1567582    12
9                  Iowa  IA North Central    3046355    21
10               Kansas  KS North Central    2853118    63
11                Maine  ME     Northeast    1328361    11
12            Minnesota  MN North Central    5303925    53
13              Montana  MT          West     989415    12
14             Nebraska  NE North Central    1826341    32
15               Nevada  NV          West    2700551    84
16        New Hampshire  NH     Northeast    1316470     5
17           New Mexico  NM          West    2059179    67
18         North Dakota  ND North Central     672591     4
19               Oregon  OR          West    3831074    36
20         Rhode Island  RI     Northeast    1052567    16
 [ reached 'max' / getOption("max.print") -- omitted 7 rows ]
# you could also say what you do NOT want

df %>%
    filter(region != "East") # != means not equal to
                  state abb        region population total
1               Alabama  AL         South    4779736   135
2                Alaska  AK          West     710231    19
3               Arizona  AZ          West    6392017   232
4              Arkansas  AR         South    2915918    93
5            California  CA          West   37253956  1257
6              Colorado  CO          West    5029196    65
7           Connecticut  CT     Northeast    3574097    97
8              Delaware  DE         South     897934    38
9  District of Columbia  DC         South     601723    99
10              Florida  FL         South   19687653   669
11              Georgia  GA         South    9920000   376
12               Hawaii  HI          West    1360301     7
13                Idaho  ID          West    1567582    12
14             Illinois  IL North Central   12830632   364
15              Indiana  IN North Central    6483802   142
16                 Iowa  IA North Central    3046355    21
17               Kansas  KS North Central    2853118    63
18             Kentucky  KY         South    4339367   116
19            Louisiana  LA         South    4533372   351
20                Maine  ME     Northeast    1328361    11
 [ reached 'max' / getOption("max.print") -- omitted 31 rows ]

Revisiting the dplyr ‘pipeline’

One of the convenient features of dplyr is that the functions and code were designed to be used together for limitless opportunities to analyze different versions of the dataset. You may not need to use some of the more advanced features but you will likely need to string at least a couple of them together at some point.

In SPSS, you simply click through the different buttons and apply each of them separately to get all of your conditions applied.

In R, you set up a “pipeline” using the symbol %>% as the pipe that joins everything together. You can think read this symbol as the word “then”

Here’s an example pipeline in plain English:

Start with the df dataset and then Create a smaller subset of the dataset for only the variables state, abb, region, total and then Filter the dataset to only include West regions and then Sort the dataset by total from highest to lowest

In R this would look like:

df %>%
    select(state, abb, region, total) %>%
    filter(region == "West") %>%
    arrange(desc(total)) 
        state abb region total
1  California  CA   West  1257
2     Arizona  AZ   West   232
3  Washington  WA   West    93
4      Nevada  NV   West    84
5  New Mexico  NM   West    67
6    Colorado  CO   West    65
7      Oregon  OR   West    36
8        Utah  UT   West    22
9      Alaska  AK   West    19
10      Idaho  ID   West    12
11    Montana  MT   West    12
12     Hawaii  HI   West     7
13    Wyoming  WY   West     5

Now we can easily find the highest (highest two, three, etc.) number of murders for west regions

mutate()

mutate() is used when you want to create a new variable in your dataset.

For example, we may want to create a new murder ratio variable that is the total / population

new_df<-df %>%
    mutate(murderatio =total/population)

new_df
                  state abb        region population total   murderatio
1               Alabama  AL         South    4779736   135 2.824424e-05
2                Alaska  AK          West     710231    19 2.675186e-05
3               Arizona  AZ          West    6392017   232 3.629527e-05
4              Arkansas  AR         South    2915918    93 3.189390e-05
5            California  CA          West   37253956  1257 3.374138e-05
6              Colorado  CO          West    5029196    65 1.292453e-05
7           Connecticut  CT     Northeast    3574097    97 2.713972e-05
8              Delaware  DE         South     897934    38 4.231937e-05
9  District of Columbia  DC         South     601723    99 1.645275e-04
10              Florida  FL         South   19687653   669 3.398069e-05
11              Georgia  GA         South    9920000   376 3.790323e-05
12               Hawaii  HI          West    1360301     7 5.145920e-06
13                Idaho  ID          West    1567582    12 7.655102e-06
14             Illinois  IL North Central   12830632   364 2.836961e-05
15              Indiana  IN North Central    6483802   142 2.190073e-05
16                 Iowa  IA North Central    3046355    21 6.893484e-06
 [ reached 'max' / getOption("max.print") -- omitted 35 rows ]
new_df <- new_df %>%
  mutate(murderatio_percent = murderatio*100)

new_df
                  state abb        region population total   murderatio
1               Alabama  AL         South    4779736   135 2.824424e-05
2                Alaska  AK          West     710231    19 2.675186e-05
3               Arizona  AZ          West    6392017   232 3.629527e-05
4              Arkansas  AR         South    2915918    93 3.189390e-05
5            California  CA          West   37253956  1257 3.374138e-05
6              Colorado  CO          West    5029196    65 1.292453e-05
7           Connecticut  CT     Northeast    3574097    97 2.713972e-05
8              Delaware  DE         South     897934    38 4.231937e-05
9  District of Columbia  DC         South     601723    99 1.645275e-04
10              Florida  FL         South   19687653   669 3.398069e-05
11              Georgia  GA         South    9920000   376 3.790323e-05
12               Hawaii  HI          West    1360301     7 5.145920e-06
13                Idaho  ID          West    1567582    12 7.655102e-06
14             Illinois  IL North Central   12830632   364 2.836961e-05
   murderatio_percent
1        0.0028244238
2        0.0026751860
3        0.0036295273
4        0.0031893901
5        0.0033741383
6        0.0012924531
7        0.0027139722
8        0.0042319369
9        0.0164527532
10       0.0033980688
11       0.0037903226
12       0.0005145920
13       0.0007655102
14       0.0028369608
 [ reached 'max' / getOption("max.print") -- omitted 37 rows ]

You can also apply square root, log, etc. transformations to variables and apply mutate multiple variables in one step.

summarise() and group_by()

summarise() is most often used to get descriptive statistics at the end of the pipeline for particular subgroups, aggregates of groups, frequency counts, etc.

How summarise() differs from output from regular summary statistics from functions like summary(), describe(), etc. is that the output itself is also a tibble that you could use in future analysis.

Generally speaking, where I would find it more useful is if you need to use a series of data manipulation functions in the pipeline to get exploratory statistics for different groups or want to check something quickly.

You can use it for general descriptive statistics if you want very specific pieces of information as well (especially used in conjunction with group_by() )

You can find a list of useful functions to use in summarise here: https://dplyr.tidyverse.org/reference/summarise.html

df %>%
    summarise(mean_cases=mean(total, na.rm=TRUE))
  mean_cases
1   184.3725
# with group_by

new_df %>%
    group_by(state) %>%
    summarise(mean_percent=mean(murderatio_percent, na.rm=TRUE))
# A tibble: 51 × 2
   state                mean_percent
   <fct>                       <dbl>
 1 Alabama                   0.00282
 2 Alaska                    0.00268
 3 Arizona                   0.00363
 4 Arkansas                  0.00319
 5 California                0.00337
 6 Colorado                  0.00129
 7 Connecticut               0.00271
 8 Delaware                  0.00423
 9 District of Columbia      0.0165 
10 Florida                   0.00340
# … with 41 more rows
new_df %>%
    group_by(region) %>%
    summarise(N=n()) 
# A tibble: 4 × 2
  region            N
  <fct>         <int>
1 Northeast         9
2 South            17
3 North Central    12
4 West             13

relocate()

relocate allows you to change the position of some of your variables (move to front, end, before or after certain other variables, etc.)

Changing variable order doesn’t affect your analyses or statistics in any way but it may make things more convenient or neater for you.

# move a variable to front of data set (default action)

df2<-df %>%
    relocate(total)
df2
   total                state abb        region population
1    135              Alabama  AL         South    4779736
2     19               Alaska  AK          West     710231
3    232              Arizona  AZ          West    6392017
4     93             Arkansas  AR         South    2915918
5   1257           California  CA          West   37253956
6     65             Colorado  CO          West    5029196
7     97          Connecticut  CT     Northeast    3574097
8     38             Delaware  DE         South     897934
9     99 District of Columbia  DC         South     601723
10   669              Florida  FL         South   19687653
11   376              Georgia  GA         South    9920000
12     7               Hawaii  HI          West    1360301
13    12                Idaho  ID          West    1567582
14   364             Illinois  IL North Central   12830632
15   142              Indiana  IN North Central    6483802
16    21                 Iowa  IA North Central    3046355
17    63               Kansas  KS North Central    2853118
18   116             Kentucky  KY         South    4339367
19   351            Louisiana  LA         South    4533372
20    11                Maine  ME     Northeast    1328361
 [ reached 'max' / getOption("max.print") -- omitted 31 rows ]
# move multiple variables to the front of data set

df2<-df %>%
    relocate(total, population)
df2
   total population                state abb        region
1    135    4779736              Alabama  AL         South
2     19     710231               Alaska  AK          West
3    232    6392017              Arizona  AZ          West
4     93    2915918             Arkansas  AR         South
5   1257   37253956           California  CA          West
6     65    5029196             Colorado  CO          West
7     97    3574097          Connecticut  CT     Northeast
8     38     897934             Delaware  DE         South
9     99     601723 District of Columbia  DC         South
10   669   19687653              Florida  FL         South
11   376    9920000              Georgia  GA         South
12     7    1360301               Hawaii  HI          West
13    12    1567582                Idaho  ID          West
14   364   12830632             Illinois  IL North Central
15   142    6483802              Indiana  IN North Central
16    21    3046355                 Iowa  IA North Central
17    63    2853118               Kansas  KS North Central
18   116    4339367             Kentucky  KY         South
19   351    4533372            Louisiana  LA         South
20    11    1328361                Maine  ME     Northeast
 [ reached 'max' / getOption("max.print") -- omitted 31 rows ]
df
                  state abb        region population total
1               Alabama  AL         South    4779736   135
2                Alaska  AK          West     710231    19
3               Arizona  AZ          West    6392017   232
4              Arkansas  AR         South    2915918    93
5            California  CA          West   37253956  1257
6              Colorado  CO          West    5029196    65
7           Connecticut  CT     Northeast    3574097    97
8              Delaware  DE         South     897934    38
9  District of Columbia  DC         South     601723    99
10              Florida  FL         South   19687653   669
11              Georgia  GA         South    9920000   376
12               Hawaii  HI          West    1360301     7
13                Idaho  ID          West    1567582    12
14             Illinois  IL North Central   12830632   364
15              Indiana  IN North Central    6483802   142
16                 Iowa  IA North Central    3046355    21
17               Kansas  KS North Central    2853118    63
18             Kentucky  KY         South    4339367   116
19            Louisiana  LA         South    4533372   351
20                Maine  ME     Northeast    1328361    11
 [ reached 'max' / getOption("max.print") -- omitted 31 rows ]
# move variable to a particular location

# total between abb and region

df3<- df %>%
    relocate(total, .before=region)

df3
                  state abb total        region population
1               Alabama  AL   135         South    4779736
2                Alaska  AK    19          West     710231
3               Arizona  AZ   232          West    6392017
4              Arkansas  AR    93         South    2915918
5            California  CA  1257          West   37253956
6              Colorado  CO    65          West    5029196
7           Connecticut  CT    97     Northeast    3574097
8              Delaware  DE    38         South     897934
9  District of Columbia  DC    99         South     601723
10              Florida  FL   669         South   19687653
11              Georgia  GA   376         South    9920000
12               Hawaii  HI     7          West    1360301
13                Idaho  ID    12          West    1567582
14             Illinois  IL   364 North Central   12830632
15              Indiana  IN   142 North Central    6483802
16                 Iowa  IA    21 North Central    3046355
17               Kansas  KS    63 North Central    2853118
18             Kentucky  KY   116         South    4339367
19            Louisiana  LA   351         South    4533372
20                Maine  ME    11     Northeast    1328361
 [ reached 'max' / getOption("max.print") -- omitted 31 rows ]

A little teaser for next week

This is an interactive scatterplot, try to rotate, zoom, pan, and hover

library(plotly)

mtcars$am[which(mtcars$am == 0)] <- 'Automatic'
mtcars$am[which(mtcars$am == 1)] <- 'Manual'
mtcars$am <- as.factor(mtcars$am)

fig <- plot_ly(mtcars, x = ~wt, y = ~hp, z = ~qsec, color = ~am, colors = c('#BF382A', '#0C4B8E'))
fig <- fig %>% add_markers()
fig <- fig %>% layout(scene = list(xaxis = list(title = 'Weight'),
                     yaxis = list(title = 'Gross horsepower'),
                     zaxis = list(title = '1/4 mile time')))

fig
---
title: "Week 2: Descriptive Statistics and Data Manipulation"
author: "Udi Alter"
date: "`r Sys.Date()`"
output: 
  rmdformats::readthedown:
    code_download: true
    highlight: kate
---

```{r setup, echo=FALSE, cache=FALSE, warning=F, message=F}
library(knitr)
library(rmdformats)
library(bslib)

# Global options
options(max.print=100)
opts_chunk$set(echo=TRUE,
	             cache=TRUE,
               prompt=FALSE,
               tidy=TRUE,
               comment=NA,
               message=FALSE,
               warning=FALSE)
opts_knit$set(width=75)

```


## ....Just a little bit more on Data Types and Structured

Last week we saw different atomic types like:

* Integer
* Numeric
* Logical
* Character

And various data structures such as:

* vector
* matrix
* list
* data frame
* tibble


Today we will briefly add these:

* **Factor** 
* **Missing Data**

## Factors

In R, factors are used to work with categorical variables, variables that have a fixed and known set of possible values. **Factor is a special case of vector** (data structure).

More than in other software, in R we need to explicitly think about the type of the variables when we do statistical analysis. The following types of variables can be defined or coerced as **factor**:


*Categorical Variables* - These are variables that have distinct, *non-ordered* categories like marital status (e.g., single, married, divorced, widowed), names (e.g., Jack, Mimi, Amy, Zoidberg), fruit (e.g., apple, orange, watermelon) etc. When a variable that has numeric values as placeholders for levels is imported into R, it is good to transform it into a factor so that R doesn't use it as continuous variable in statistical analysis. For example, 1= married, 2 = divorced, 3 = single, etc.

*Ordinal Variables* - These are categorical variables that have an order like letter grades (e.g., A+, A, B+,B), tax/income brackets, shoe sizes (e.g., 7, 7.5, 8, 8,5) and Likert scales (e.g., from 'strongly disagree' to 'strongly agree'). In R, this "order" is called **levels**.


## Missing Values (NA)

Missing values are coded as NA in R. Different functions will handle missing data differently by default. Some will spit out an NA (like mean), others will rely on a default for handling the missing data (e.g., listwise deletion, pairwise deletion, etc.). **It is therefore important for you to check out the defaults in functions you're using because you might be handling missing data differently than you thought.**

### Load Packages

```{r packages}
library(tidyverse)
library(haven)
library(babynames)
library(dslabs)
library(psych)
library(car)
```

# Descriptive Statistics
we will be working with the `mtcars` dataset.

Get the names of columns and view the top few rows of the dataset
```{r descriptives1}
data(mtcars)
names(mtcars)
head(mtcars)
```

You can also look at specific variables using the name of the dataframe object followed by the '$' operator and the name of the specific column you would like to inspect.

## Descriptives (Base R)

Look at just the mpg column
```{r col}
mtcars$mpg
```

This is particularly useful if you want to inspect the descriptives for only one or a subset of columns. So we could find the mean of a specific column by calling on the mean and star function and specifying the column we are interested in. We can use this 

Find the mean and standard deviation of the mpg column
```{r}
mean(mtcars$mpg)
sd(mtcars$mpg)
```

We can find most of the common descriptive statistics in base R, just by coding the function and specifying the specific column we want to summarize. Other common descriptives might be `range()`, `median()`, `mode()`, etc. However, it can be incredibly cumbersome to analyze multiple variables using a line of code for each fit stat. 

This dataset has 11 variables, so even to find the mean and standard deviation for each would require 22 lines of code without use of a simpler method. This method is most useful for quickly pulling specific stats from specific columns. 

If you want to describe your whole dataset, it would be a lot more efficient to use a single function. The `describe` function in the `psych` package is one of the simplest ways to take care of this.

## Descriptives (psych and car packages)

### psych::describe
By default, missing values will be removed. If you would like missing values to be included, you can add `na.rm = FALSE` to the code.

The `describe()` function will provide all of the most common descriptive statistics including mean, median, mode, standard deviation, etc. as well as skew and kurtosis to provide an idea of the shape of the distribution

```{r describe}
describe(mtcars)
```

### Skewness
Negative skewness values indicate that the distribution is skewed to the left, while positive indicates skew to the right.

### Kurtosis
Negative kurtosis indicates thin tails (platykurtic distributions), and positive values represent longer tails (leptokurtic distributions).

For a more brief list of descriptives, you can add `fast = T`, which will delete columns like skew, kurtosis, trimmed, etc.

## Descriptives by Grouping Variable

If your dataset includes a grouping variable, you can also find descriptives by group level. So in this dataset, the 'am' variable is coded as 0 or 1 to indicate the car has an automatic (0) or manual (1) transmission. This 'dummy coding' means that it can be used as a grouping variable. We can therefore look at the means of other variables between types of transmission.


```{r}
describeBy(mtcars, group = 'am', digits = 2)
```
As you can see from the output of the above code, we now have 2 sets of descriptive statistics, one for **am: 0** and another for **am: 1**.

Another common data inspection tool is graphs and distributions. There are numerous kinds of graphs, and some of these will be explored further using GGPlot; however, below, we will go over a few of the more common graph types.

To build off of our investigation of descriptive statistics, we may wish to generate a frequency distribution such as a histogram (continuous data) or bar graph (categorical data). This can be done easily in base R without installing any new packages. In this case, we will be using the `mpg` variable as an example of continuous data, and `cyl` as an example of categorical/discrete data. 

## Basic Histogram for Continuous Variables
Titles are optional, but improve readability
```{r hist}
hist(mtcars$mpg, xlab = "Miles Per Gallon", ylab = "Frequency", 
     main = "MPG Frequency Distribution", breaks=10)
```

## Bar Plot for Discrete Variables
```{r barplot}

counts <- table(mtcars$gear)

barplot(counts, main="Car Distribution",
   xlab="Number of Gears", ylab = "Count",
   col = c("dark blue", "red", "yellow")) 
#"col =" will define the colours you wish to use


```


If you wish to create a stacked or grouped bar plot, you may define the variables you want to graph and call upon the table object in your code.

## Stacked barplot

"legend =" will create a legend so that it is clear which variable corresponds to each colour
```{r stackeed barplot}
stackedvars <- table(mtcars$am, mtcars$cyl)
stackedvars
barplot(stackedvars, xlab = "Number of Cylinders", 
        ylab = "Frequency", main = "Frequency Distribution of Cylinders by Transmission", 
        col = c("blue", "green"), legend = c("automatic", "manual"))
```

## Grouped Barplot

```{r grouped bar graph}
# Side by Side
# Add "beside = T" to the above code
barplot(stackedvars, xlab = "Number of Cylinders", ylab = "Frequency", 
        main = "Frequency Distribution of Cylinders by Transmission", 
        col = c("blue", "green"), legend = c("automatic", "manual"), beside = T)
```

You can look up either the `hist` or `barplot` functions to see more possible additions. You can also adjust the placement of your figure on the plot using the `par()` function[https://www.datamentor.io/r-programming/subplot/](https://www.datamentor.io/r-programming/subplot/).

Say we want to visually inspect the relationship between miles per gallon and weight of a vehicle. Simple scatterplots are also relatively easily created in base R using the `plot` function. This can similarly be customized for aspects like colour and placement, but we will be focusing here on a basic introduction to plotting functions.

Using the `plot` function, we will first input the variable to plot on the x-axis, then the variable we want on the y-axis. We should also add in titles for interpretability.

## Scatterplot
```{r scatterplot1}
plot(mtcars$wt, mtcars$mpg, xlab = "Weight", 
     ylab = "Miles per Gallon", 
     main = "Miles per Gallon by Weight")

```



## Scatterplot with line of best fit (linear)


```{r line of best fit}
plot(mtcars$wt, mtcars$mpg, xlab = "Weight", 
     ylab = "Miles per Gallon", 
     main = "Miles per Gallon by Weight") + # plus sign
abline(lm(mpg ~ wt, data = mtcars), col = "blue") 
# Add a line of best fit (linear)

```



## Scatterplot with lowess line


```{r lowess}
plot(mtcars$wt, mtcars$mpg, xlab = "Weight", 
     ylab = "Miles per Gallon", 
     main = "Miles per Gallon by Weight") + # plus sign
lines(lowess(mtcars$wt, mtcars$mpg), 
      col = "red")
```


You may also be interested in inspecting quickly the correlations between variables to see if any relationship appears to exist or to quickly eyeball the data for linearity (although this should be examined much more in depth through other testing - this method will only indicate major violations).


## Scatterplor Matrix 


```{r }


sub.mtcars <- mtcars %>%
  select(mpg, disp, hp, drat, wt, qsec)

scatterplotMatrix(sub.mtcars)


# OR in psych

  psych::pairs.panels(sub.mtcars,scale = FALSE)
```

It is important to note that in order to create scatterplots, you must have equal numbers of cases between x and y. If there are missing values in either categories, you will need to remove missing values. This can be done using `complete.cases` or `na.omit`. Although unnecessary for this dataset, the code would look like this:

# Removing Missing Values


```{r NA}
mtcars.complete <- mtcars[complete.cases(mtcars),]
# OR
mtcars.complete <- na.omit(mtcars)
```

When doing this, I would suggest creating a *new* dataframe object, like `mtcars.complete` so that the original data frame and its information isn't lost or deleted.


# Data Manipulation Using the Tidyverse

The powerhouse package for doing data transformation and manipulation is called `dplyr` (part of `tidyverse`).

We will be learning how to use the `dplyr` package to find different ways to present and transform our data. 

There are 6 primary functions in the dplyr toolkit:

## Main Functions

`filter()` … filters rows (creates a subset of individuals but preserves all variables)

`select()` … selects columns (creates a subset of variables but preserves all individuals)

`arrange()` …  sorts the data in ascending or descending order

`mutate()` … adds new variables to your dataset 

`group_by()` … creates a grouping variable so that you can obtain summary statistics separately for each group on a particular variable (usually used with `summarise`)

`summarise()` … calculates summary statistics - this is typically used in conjuction with some combination of the other 5 functions above to calculate means, SDs, frequencies etc. 

## Introduction of the 'pipe' `%>%`

There are a couple of different ways to use the `dplyr` functions. All of them take as their first argument, the name of the dataset that you want to manipulate. 

Example:
Your data is called "dataset" and you have variables x1, x2, x3

to sort the dataset in increasing order for on x1 you would use:

```{r eval=F}
arrange(dataset, x1)
```

See `?arrange` for the help file.

But instead of having to start every function with the name of the dataset, dplyr can also use a 'pipe' to list the dataset outside of the function. The pipe is represented by the symbol:  `%>%` (percent, greater than, percent). 

Here's the approach using the pipe, that achieves the same result as the code example above. 

```{r eval=F}
dataset %>%
  arrange(x1)
```

With a single function, there is minimal benefit to this approach, but it is meant to be used in a chain of commands. So if for example, you wanted to sort the dataset using arrange, then filter based on some criteria, then create a new variable using mutate, you can use the pipe in a chain of commands like: 

```{r eval=F}
dataset %>%
  arrange(...)  %>%
  filter(... ) %>%
  mutate(...)
```

In English, you could read the `%>%` symbol as the word "and then". E.g., take "dataset" **and then** arrange based on some variable, **and then** filter based on some other variable, **and then** mutate based on some third variable.


### Dataset
We will be using the "murders" dataset. If you installed and loaded the `dslabs` packages, this dataset is aleady imported to your environemnt.

The murders dataset shows the US gun murders by state in 2010. From the R Documentation: "Gun murder data from FBI reports. Also contains the population of each state."


```{r data}
# murders dataset
head(murders)
describe(murders) # descriptives

df <- murders # making a copy of the dataset under an easier name to type, 'df' as in dataframe
```

# arrange()

Sorting in ascending or descending order is pretty straightforward

```{r arrange1}
arrange(df, total) # sort on age from lowest to highest
arrange(df, desc(total)) # sort on age from highest to lowest

# here is the same code using the pipe instead

df %>%
    arrange(total)

df %>%
    arrange(desc(total))

```

**Note that while the output was sorted on total in any of these situations above, it does not change the sorting in your original dataset!**

**Why?**

Remember that R is object based, and we did not create a new object that *saves the sorting*. 

To preserve the sorting in our dataset we have to either over-ride our df dataset or create a second version of our df dataset. E.g., 

```{r new_arrange}

df2 <- df %>%
    arrange(total)


```

With `arrange`, saving a new dataset is trivial because there will be no effects on your statistics, but it non-trivial if you start filtering and transforming variables, and then want to use this transformed dataset in an analysis. 

Note you can also sort across multiple variables. Note that the order matters because it sorts on the first variable first, then the next one, then the next one...etc.

```{r arrange2}

df %>% 
    arrange(total,population) # sort total from highest to lowest and on population highest to lowest

df %>%
    arrange(total,desc(population)) 
```

# select()

This function helps you work with a smaller version of your dataset (on the variable side). This can save you time so you don't have to scroll through to find the exact names of your variables (remember R is case sensitive so it has to be a perfect match). 

Our dataset has a small number of variables so it is already manageable but we will still see how to use the select function.

**NOTE:**
There is one really annoying issue with select() that may come up for you. select() is also the name of a function from the MASS package and MASS is loaded by default so R often wants to use that function instead. 

If you use ?select you will see it come up for two packages

There are two ways you can fix this:

```{r select}

sub.df <-  df %>%
    dplyr::select(abb, total) # tells R to use the select function from dplyr in this case only

select<-dplyr::select # tells R that you want to use the select function from dplyr EVERY time within this R session. If you close R, you will need to run this again

sub.df <- df %>%
    select(abb, total)

# other ways to use select

sub.df <- df %>% 
    select(-state) # all variables EXCEPT these ones

sub.df

df %>%
    select(region:total) # select everything starting in between region and total (inclusive)

# note if you want to use excluding a range you need parentheses around the variables with a minus sign in front of it
df %>%
    select(-(region:total)) # select everything excluding variables in between region and total (inclusive)

```

# filter()

filter() allows you to work with a smaller number of rows in your dataset based on a commonality. You can also think of this as subsetting your data to include participants with particular characteristics. 

Here is where you will be using logical data because you need to tell filter() not just the variable you want to filter on, but also the particular conditions or values. 

Here are some examples:

```{r filter}

df$state <- factor(df$state)
df$abb <- factor(df$abb)

# only west region

df %>%
    filter(region == "West")  # note that you need two equal signs

# at or more than 50 murders 

df %>%
    filter(total >= 50) # greater than or equal to

# population smaller than 1M

df %>%
    filter(population < 1000000) # less than 5 (doesn't include 5)
```

You can also filter on multiple conditions

```{r filter2}

# Only Northeast AND have more than 100 murders
df %>%
    filter(region == "Northeast" & total > 100)  # both of these have to be true

# Northwest or less than 100

df %>%
    filter(region == "Northwest" | total < 100) # only one of these needs to be true

# you could also say what you do NOT want

df %>%
    filter(region != "East") # != means not equal to

```

## Revisiting the dplyr 'pipeline'

One of the convenient features of dplyr is that the functions and code were designed to be used together for limitless opportunities to analyze different versions of the dataset. You may not need to use some of the more advanced features but you will likely need to string at least a couple of them together at some point. 

In SPSS, you simply click through the different buttons and apply each of them separately to get all of your conditions applied.

In R, you set up a "pipeline" using the symbol %>% as the pipe that joins everything together. You can think read this symbol as the word "then"

Here's an example pipeline in plain English:

Start with the df dataset **and then**
Create a smaller subset of the dataset for only the variables  state, abb, region, total **and then** 
Filter the dataset to only include West regions **and then** 
Sort the dataset by total from highest to lowest

In R this would look like:

```{r pipeline}
df %>%
    select(state, abb, region, total) %>%
    filter(region == "West") %>%
    arrange(desc(total)) 
```

Now we can easily find the highest (highest two, three, etc.) number of murders for west regions

# mutate()

mutate() is used when you want to create a **new** variable in your dataset. 


For example, we may want to create a new murder ratio variable that is the total / population

```{r mutate}
new_df<-df %>%
    mutate(murderatio =total/population)

new_df

new_df <- new_df %>%
  mutate(murderatio_percent = murderatio*100)

new_df
```

You can also apply square root, log, etc. transformations to variables and apply mutate multiple variables in one step. 

# summarise() and group_by()

summarise() is most often used to get descriptive statistics at the end of the pipeline for particular subgroups, aggregates of groups, frequency counts, etc. 

How summarise() differs from output from regular summary statistics from functions like summary(), describe(), etc. is that the output itself is also a tibble that you could use in future analysis. 

Generally speaking, where I would find it more useful is if you need to use a series of data manipulation functions in the pipeline to get exploratory statistics for different groups or want to check something quickly. 

You can use it for general descriptive statistics if you want very specific pieces of information as well (especially used in conjunction with group_by() )

You can find a list of useful functions to use in summarise here:
<https://dplyr.tidyverse.org/reference/summarise.html> 

```{r summarise}
df %>%
    summarise(mean_cases=mean(total, na.rm=TRUE))


# with group_by

new_df %>%
    group_by(state) %>%
    summarise(mean_percent=mean(murderatio_percent, na.rm=TRUE))

new_df %>%
    group_by(region) %>%
    summarise(N=n()) 

```


# relocate()

relocate allows you to change the position of some of your variables (move to front, end, before or after certain other variables, etc.)

Changing variable order doesn't affect your analyses or statistics in any way but it may make things more convenient or neater for you. 

```{r relocate}

# move a variable to front of data set (default action)

df2<-df %>%
    relocate(total)
df2

# move multiple variables to the front of data set

df2<-df %>%
    relocate(total, population)
df2

df
# move variable to a particular location

# total between abb and region

df3<- df %>%
    relocate(total, .before=region)

df3

```


# A little teaser for next week

This is an interactive scatterplot, try to rotate, zoom, pan, and hover

```{r}
library(plotly)

mtcars$am[which(mtcars$am == 0)] <- 'Automatic'
mtcars$am[which(mtcars$am == 1)] <- 'Manual'
mtcars$am <- as.factor(mtcars$am)

fig <- plot_ly(mtcars, x = ~wt, y = ~hp, z = ~qsec, color = ~am, colors = c('#BF382A', '#0C4B8E'))
fig <- fig %>% add_markers()
fig <- fig %>% layout(scene = list(xaxis = list(title = 'Weight'),
                     yaxis = list(title = 'Gross horsepower'),
                     zaxis = list(title = '1/4 mile time')))

fig

```
