1 R and RStudio Installation
1.1 Step 1: Installing R (should be done before installing RStudio)
To install R, visit cran.r-project.org and click Download R for [your operating system]. For example, because I’m using an Apple computer, I clicked on Download R for macOS.
Then, under ‘Latest release’, click on the first/top one. For macOS users this should be R-4.2.0.pkg (if you don’t have an Apple silicon chip; usually computers older than 2) or R-4.2.0-arm64.pkg (if your Apple computer has an M1 or newer Apple silicon chip).
For Windows, you’ll need to click on ‘base’ (or ‘install R for the first time,’ same thing) to download the most recent version of R. To finish installing R on your computer, all that is left to do is to run the .exe file.
1.2 Step 2: Installing RStudio (should be done after installing R)
Once R is installed, you can proceed to install RStudio at www.rstudio.com/products/rstudio/download/#download. Under ‘Download RStudio Desktop’, click on ‘DOWNLOAD RSTUDIO FOR [your operating system]’.
For Windows: Run the .exe file to finish installing RStudio on your computer.
To access R from here on, you only need to open RStudio. Feel free to explore it, but don’t worry about this too much; I will explain RStudio when we first meet.
Here’s a brief video with R and RStudio installation instructions for your reference.
If you’re unable to install R and RStudio before class, you can use RStudio Cloud instead: rstudio.cloud. RStudio Cloud is a cloud-based solution that allows anyone to use R and RStudio without any prior installation. Signing up is free! After you log in, all you need to do is go to New Project > New RStudio Project. From then on, you can save, share, and download all R projects created in your workspace.
Note: If you run into any issues with the installation process or have any questions before we start, please feel free to e-mail me at udialter@yorku.ca.
2 Firing Up RStudio
2.0.1 Before a new data analysis:
- Create a folder for each data analysis project.
- Open RStudio.
- Go to File > New Project > Existing Directory
- Navigate to the folder you created
- Click Create Project. A “.Rproj” file will be created in the folder. Next time, simply double-click this file to open your project.
- Go to File > New File > R Script and click OK.
- Save the new R Script file.
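A quick way to confirm that the project opened in the folder you expect is to ask R for its working directory. This is a minimal sketch using base R functions only; the output depends on your computer, so none is shown.

```r
# Print the folder R is currently working in; inside an RStudio project
# this should be the project folder you created
getwd()

# List the files in that folder (the .Rproj file should be among them)
list.files()
```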
3 Basic Operations
- Hashtags # are used to comment
3.1 Math Operations
1+2
[1] 3
2-3
[1] -1
5*6
[1] 30
1/6
[1] 0.1666667
3^2 # to the power of 2
[1] 9
(2+3)*5 #BEDMAS
[1] 25
3.2 Logical Operations
1>2 # greater than
[1] FALSE
2<17 # less than
[1] TRUE
2==7 # equals
[1] FALSE
2<=7 # less or equal to
[1] TRUE
7>=6 # greater or equal to
[1] TRUE
2*2 != 3 # does not equal
[1] TRUE
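(2 | 3) < 7 # or: 2 | 3 is TRUE (non-zero numbers count as TRUE), and TRUE < 7 is treated as 1 < 7
[1] TRUE
(2 & 3) == 3 # and: 2 & 3 is TRUE, and TRUE == 3 is treated as 1 == 3
[1] FALSE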
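The | and & operators are most often applied to logical values (TRUE/FALSE) rather than to plain numbers. A short sketch of that more typical use, combining two comparisons:

```r
# "or": TRUE if at least one comparison is TRUE
(2 < 7) | (3 > 7)
# [1] TRUE

# "and": TRUE only if both comparisons are TRUE
(2 < 7) & (3 > 7)
# [1] FALSE

# "not": flips TRUE to FALSE and vice versa
!(2 < 7)
# [1] FALSE
```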
4 Functions
Functions are little programs. Almost everything we do in R requires
functions. Functions can import data, manipulate/clean our data, and
export our data to use elsewhere. They are the fundamental building
blocks within R. We can use functions by typing them directly into the
Console pane or into a script file. Here are a few examples of
functions in R:
NOTE: spaces do not matter, but R is case sensitive
sqrt(9) # square root of 9
[1] 3
factorial(3) # 3 factorial (3!), i.e., 3*2*1
[1] 6
seq(1,5) # sequence of numbers from 1 to 5, i.e., 1, 2, 3, 4, 5
[1] 1 2 3 4 5
class(1)
[1] "numeric"
class(2>=3)
[1] "logical"
#' You can even combine functions:
seq(1,3)*factorial(2) # multiply each number in the sequence of 1-3 by 2!
[1] 2 4 6
#' or incorporate them into one another:
factorial(sqrt(9)) # the factorial of the square root of 9
[1] 6
seq(1,sqrt(9)) # sequence from 1 to 3 (square root of 9)
[1] 1 2 3
# Functions' arguments
seq(from = 1, to = 5, by=1)
[1] 1 2 3 4 5
seq(from = 1, to = 5, by=2)
[1] 1 3 5
plot(x=1:10, y=11:20) # spaces do not matter, but R is case sensitive
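Arguments can be matched by position or by name. The sketch below, using only base R functions, illustrates both; the particular numbers are arbitrary.

```r
# The first two arguments are matched by position (from = 1, to = 9),
# the third one by name
seq(1, 9, by = 4)
# [1] 1 5 9

# Named arguments can be given in any order
seq(to = 5, from = 1)
# [1] 1 2 3 4 5

# args() prints the arguments a function accepts, with their default values
args(sd)
```

Naming arguments is slightly more typing, but it makes the code easier to read later.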
5 Getting Help with Functions
?mean # single ? opens the help page of a function from a loaded package
?seq
??neg.reg
# double ??, same as the search bar in the Help tab, searches the help pages of all installed packages
# Try the search box in Help tab, and look at what's inside the {}, that's the package name!
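A few more base R helpers can be useful when looking for help from a script rather than from the Help tab. This is only a sketch of functions that ship with R; the search terms are arbitrary examples.

```r
help("mean")       # same as ?mean
example("seq")     # runs the examples from the help page of seq()
apropos("median")  # lists available objects whose name contains "median"
```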
5.0.1 Google is programmers’ best friend!
It’s true!
R Help is not considered great, but as you get more familiar with R you will find yourself using it… Even expert R users often look for help; nobody knows and/or remembers all commands, functions, and packages (not to mention that these change and evolve all the time). The R Documentation is a bit dry and you will often find more comprehensive explanations and examples online.
Ask anyone: looking up code (including stuff that you know and tried before) is the bread and butter of R (and other software) users!
6 Objects
In R, we temporarily save the results of operations or functions inside an object. Think of objects as bins containing what you need. To create a bin, you should name it first. It is SO helpful (both for you and others) to select meaningful names for your bins so that they intuitively reveal their content. It is also helpful to define (or know) the type of content in your object, e.g., data frame, vector, integer, list, etc.
6.0.1 Assignment Operator <-
R uses a special operator for creating objects to hold our results: <-
It is frequently read as “gets” or “assign.”
NOTE: You cannot start your object name with a number (but numbers are fine in the middle or at the end of the name)
meow <- 1 # NOT a good name, why?
variable <- c(1,2,3,4,5, NA) # How about this?
calc_results <- 2*3 # Good object name!
# Also good name:
data <- data.frame(x=1:10, y=11:20) # creating an object called data that will be a dataframe with variables x and y
Take a look at the Environment tab! Remember that R is case sensitive, so calc_results will NOT be the same as Calc_Results. You can use either “snake case,” e.g., calc_results, or “CamelCase,” e.g., CalcResults, or “dot case,” e.g., calc.results.
You can call on the object to see its content, for example:
calc_results
[1] 6
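Objects can also be overwritten and removed. The following is a minimal sketch using base R only, reusing the calc_results name from above:

```r
calc_results <- 2 * 3              # create the object
calc_results <- calc_results + 1   # overwrite it using its current value
calc_results
# [1] 7

ls()               # names of all objects currently in the environment
rm(calc_results)   # remove an object you no longer need
exists("calc_results")
# [1] FALSE
```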
7 Udi’s Recommendation Corner
- Use # to comment on your code as much as you can. Think of comments as little post-its you are leaving for others and your future self! The better your comments, the easier it will be for you and others to understand your code.
- Always pick object names that are meaningful. An object name should be a good hint about what results are saved inside an object. The time you spend now choosing an appropriate object name is well worth it compared to the time you will likely spend later trying to figure out which object is which.
8 Packages
You can think of a package as a collection of functions, data and
help files collated into a well-defined standard structure which you can
download and install in R. These packages can be downloaded from a
variety of sources but the most popular are CRAN, Bioconductor and
GitHub. The base installation of R comes with many useful packages as
standard. These packages will contain many of the functions you will use
on a daily basis. However, as you start using R for more diverse
projects (and as your own use of R evolves) you will find that there
comes a time when you will need to extend R’s capabilities. Fortunately,
many thousands of R users have developed useful code and shared this
code as installable packages - this is what is meant by
open-source!
To use functions from a package, you must first:
8.0.1 Install the package:
install.packages("car")
But make sure to remove the install.packages() line (or
make it a comment) before saving. You can also install packages from the
Packages tab.
8.0.2 Load the package:
library(car)
8.0.3 Use any of the functions from this package:
Example:
data <- data.frame(x=1:10, y=11:20) # creating an object called data that will be a dataframe with variables x and y
scatterplotMatrix(data)
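One possible way to avoid reinstalling a package every time a script runs is to check whether it is already installed. The sketch below assumes the same car package as above; the message text is made up for illustration. It also shows the package::function form, which calls a single function without loading the whole package.

```r
# Load the package only if it is already installed
if (requireNamespace("car", quietly = TRUE)) {
  library(car)
} else {
  message("Package 'car' is not installed; run install.packages(\"car\") once.")
}

# Call one function directly from a package (stats ships with R)
stats::median(c(1, 3, 10))
# [1] 3
```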
9 Types of Data & Data Structure
9.1 Atomic (Data) Types
There are 6 atomic types of data in R, but, for now you only need to know (and use) 4 of them:
9.1.1 Logical (TRUE/FALSE)
1 > 2
[1] FALSE
2 > 1
[1] TRUE
9.1.2 Integer
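e.g., 1L, 96L, 78910L (the L suffix tells R to store a whole number as an integer; without it, R stores 1 as numeric). Missing values (NA) can also appear in integer data.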
9.1.3 Numeric
e.g., 2, 4.67, pi
9.1.4 Character
e.g., “hello”, “34”
"I like a lot of pulp in my orange juice"
[1] "I like a lot of pulp in my orange juice"
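The difference between these types is easy to check with class(). A small sketch, using base R only:

```r
class(1)       # a plain number is stored as numeric
# [1] "numeric"
class(1L)      # the L suffix asks for an integer
# [1] "integer"
class("34")    # quotes make it a character, even if it looks like a number
# [1] "character"
is.numeric("34")
# [1] FALSE
as.numeric("34") + 1   # convert the character to a number before doing math
# [1] 35
```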
9.2 Data Structures
Data structures are the ways in which R stores data. Just like the name implies, a data structure tells us exactly how our data is structured or organized within R. R, like most programming languages, uses a variety of data structures. The most common are:
9.2.1 Vectors
A series of data which must have the same data type. Vectors can be any of the basic data types, e.g., logical, integer, numeric, character
vector_example1 <- 1:5
vector_example2 <- c(1,2,3,4,5) # c function is meant to **combine** values to form a vector
vector_example3 <- c(T, F, T, F, T) # T and F are the same as TRUE and FALSE
vector_example4 <- c("My", "mom", "is", "a", "teacher")
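Because a vector holds a single data type, R silently converts mixed values to a common type. The sketch below illustrates that, along with basic indexing; the ages values are hypothetical.

```r
mixed <- c(1, "two", TRUE)   # mixing types: everything becomes character
mixed
# [1] "1"    "two"  "TRUE"
class(mixed)
# [1] "character"

ages <- c(18, 20, 17)   # hypothetical values
ages[2]                 # second element
# [1] 20
ages * 2                # operations apply to every element
# [1] 36 40 34
length(ages)
# [1] 3
```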
9.2.2 Matrices
Must be 2-dimensional, and all values must be of the same data type.
(matrix_example <- matrix(1:10, nrow = 1)) # nrow sets the number of rows
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
[1,] 1 2 3 4 5 6 7 8 9 10
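A slightly larger matrix makes the row and column layout easier to see. This is a minimal sketch with arbitrary numbers, using base R only:

```r
m <- matrix(1:6, nrow = 2, ncol = 3)   # filled column by column by default
m
#      [,1] [,2] [,3]
# [1,]    1    3    5
# [2,]    2    4    6
dim(m)     # number of rows, then number of columns
# [1] 2 3
m[2, 3]    # value in row 2, column 3
# [1] 6
matrix(1:6, nrow = 2, byrow = TRUE)    # fill row by row instead
#      [,1] [,2] [,3]
# [1,]    1    2    3
# [2,]    4    5    6
```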
9.2.3 Arrays
Similar to matrices, but they can have more than 2 dimensions
(array_example <- array(dim = c(3,4,5)))
, , 1
[,1] [,2] [,3] [,4]
[1,] NA NA NA NA
[2,] NA NA NA NA
[3,] NA NA NA NA
, , 2
[,1] [,2] [,3] [,4]
[1,] NA NA NA NA
[2,] NA NA NA NA
[3,] NA NA NA NA
, , 3
[,1] [,2] [,3] [,4]
[1,] NA NA NA NA
[2,] NA NA NA NA
[3,] NA NA NA NA
, , 4
[,1] [,2] [,3] [,4]
[1,] NA NA NA NA
[2,] NA NA NA NA
[3,] NA NA NA NA
, , 5
[,1] [,2] [,3] [,4]
[1,] NA NA NA NA
[2,] NA NA NA NA
[3,] NA NA NA NA
9.2.4 Lists
A list can have mixed data types, but they can be a bit trickier to use.
list_example <- list("cat", 3, TRUE)
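Part of what makes lists trickier is how their elements are accessed. A short sketch, where the pet list and its element names are made up for illustration:

```r
pet <- list(species = "cat", age = 3, vaccinated = TRUE)   # hypothetical content
pet$species          # access an element by name
# [1] "cat"
pet[["age"]]         # double brackets return the element itself
# [1] 3
class(pet["age"])    # single brackets return a (shorter) list
# [1] "list"

list_example <- list("cat", 3, TRUE)
list_example[[1]]    # unnamed elements are accessed by position
# [1] "cat"
```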
9.2.5 Data frames
Used for tabular data, think of them as a basic spreadsheet, but each column is a vector.
(dataframe_example <- data.frame(x=1:10, y=11:20))
x y
1 1 11
2 2 12
3 3 13
4 4 14
5 5 15
6 6 16
7 7 17
8 8 18
9 9 19
10 10 20
(dataframe_example2 <- data.frame(eyecolour= c("blue", "brown", "grey"),
height = c(1.75, 2.1, 1.56),
registered = c(T,F, T)))
eyecolour height registered
1 blue 1.75 TRUE
2 brown 2.10 FALSE
3 grey 1.56 TRUE
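Since each column is a vector, columns can be pulled out, created, and used to filter rows. The following sketch uses a small made-up data frame (the people object and its values are hypothetical):

```r
people <- data.frame(height = c(1.75, 2.10, 1.56),
                     registered = c(TRUE, FALSE, TRUE))   # hypothetical data
people$height                             # one column, returned as a vector
# [1] 1.75 2.10 1.56
people$height_cm <- people$height * 100   # add a new column
people[people$registered, ]               # keep only rows where registered is TRUE
#   height registered height_cm
# 1   1.75       TRUE       175
# 3   1.56       TRUE       156
nrow(people)
# [1] 3
ncol(people)
# [1] 3
```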
9.2.6 Tibbles
Similar to data frames with a few differences. Tibbles are the
standard data structures within the tidyverse.
library(tibble) # tibble() comes from the tibble package, which is part of the tidyverse
(tibble_example <- tibble(x=letters, y=1:length(letters)))
# A tibble: 26 × 2
x y
<chr> <int>
1 a 1
2 b 2
3 c 3
4 d 4
5 e 5
6 f 6
7 g 7
8 h 8
9 i 9
10 j 10
# … with 16 more rows
If you are not sure what type of data or data structure an object (or value) is, you can check using:
class(dataframe_example)
[1] "data.frame"
class(1.34)
[1] "numeric"
class(TRUE)
[1] "logical"
class(list_example)
[1] "list"
class(dataframe_example[1,2]) # dataframe_example[1,2] means the value in the 1st row, 2nd column
[1] "integer"
# It is always: [Rows, Columns]
dataframe_example
x y
1 1 11
2 2 12
3 3 13
4 4 14
5 5 15
6 6 16
7 7 17
8 8 18
9 9 19
10 10 20
dataframe_example[1,2]
[1] 11
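Leaving one side of the comma empty selects everything along that dimension. A brief sketch of the [Rows, Columns] rule, recreating the same dataframe_example so the block runs on its own:

```r
dataframe_example <- data.frame(x = 1:10, y = 11:20)
dataframe_example[3, ]        # entire 3rd row
#   x  y
# 3 3 13
dataframe_example[1:3, 2]     # rows 1 to 3 of the 2nd column, as a vector
# [1] 11 12 13
dataframe_example[1:2, "y"]   # columns can also be selected by name
# [1] 11 12
```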
10 Importing Data
R is very flexible and can import many data formats. RStudio will help you with that, using the necessary packages. In this workshop I will show you how to do it through RStudio and also through an R script. One thing to notice is that if you have an RStudio project in the same folder where your data sets are, you don’t need to specify path addresses to import your data. By default, R will read from and save to that folder.
10.0.1 R allows you to read in data from many different formats:
- SPSS
- Excel
- SAS
- .csv
- .txt
- .dat
The easiest way to get started with reading data into R is to go to the Environment tab, click on the Import Dataset button, and then read in the data accordingly (note that csv files are a type of text file).
Let’s start by importing an SPSS dataset (.sav) using the read_sav() function in the haven package:
10.0.2 Install the haven package
install.packages("haven")
10.0.3 Load the haven package
library(haven)
10.0.4 Use read_sav()
You want to make sure the file is saved in the same folder as your R Script (your project folder), or set the path to where the data is. And don’t forget to store the data in an object!
aggression_data <- read_sav("aggression.sav")
class(aggression_data) # what data structure is the dataset
[1] "tbl_df" "tbl" "data.frame"
str(aggression_data) # structure, variables and their data type and structure
tibble [275 × 8] (S3: tbl_df/tbl/data.frame)
$ age : num [1:275] 18 18 20 17 17 17 17 17 17 17 ...
..- attr(*, "format.spss")= chr "F2.0"
$ BPAQ : num [1:275] 2.62 2.24 2.72 1.93 2.72 ...
..- attr(*, "label")= chr "Aggression total score"
..- attr(*, "format.spss")= chr "F12.10"
..- attr(*, "display_width")= int 14
$ AISS : num [1:275] 2.65 2.85 3.05 2.65 2.95 1.95 2.55 2.3 2 2.15 ...
..- attr(*, "label")= chr "sensation seeking total score"
..- attr(*, "format.spss")= chr "F4.2"
$ alcohol: num [1:275] 28 NA 80 28 10 12 21 3 21 0 ...
..- attr(*, "label")= chr "alcohol consumption (in drinks)"
..- attr(*, "format.spss")= chr "F2.0"
$ BIS : num [1:275] 2.15 3.08 3 1.85 2.08 ...
..- attr(*, "label")= chr "Impulsivity total score"
..- attr(*, "format.spss")= chr "F12.10"
..- attr(*, "display_width")= int 14
$ NEOc : num [1:275] 2.83 2.5 2.75 3.42 3.58 ...
..- attr(*, "label")= chr "Conscientiousness total score"
..- attr(*, "format.spss")= chr "F12.10"
..- attr(*, "display_width")= int 14
$ gender : dbl+lbl [1:275] 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 0, 1, 0, 1, 1, 1, ...
..@ label : chr "biological sex of participant"
..@ format.spss: chr "F1.0"
..@ labels : Named num [1:2] 0 1
.. ..- attr(*, "names")= chr [1:2] "male" "female"
$ NEOo : num [1:275] 2.92 4.17 3.92 4.17 3.5 ...
..- attr(*, "label")= chr "openness to experience total score"
..- attr(*, "format.spss")= chr "F12.10"
..- attr(*, "display_width")= int 14
head(aggression_data) # inspect the top 6 (head) of the data
# A tibble: 6 × 8
age BPAQ AISS alcohol BIS NEOc gender NEOo
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl+lbl> <dbl>
1 18 2.62 2.65 28 2.15 2.83 1 [female] 2.92
2 18 2.24 2.85 NA 3.08 2.5 1 [female] 4.17
3 20 2.72 3.05 80 3 2.75 1 [female] 3.92
4 17 1.93 2.65 28 1.85 3.42 1 [female] 4.17
5 17 2.72 2.95 10 2.08 3.58 0 [male] 3.5
6 17 2.45 1.95 12 2.62 3.83 1 [female] 3.25
names(aggression_data) # variable/column names
[1] "age" "BPAQ" "AISS" "alcohol" "BIS" "NEOc" "gender"
[8] "NEOo"
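The same read-then-inspect pattern applies to other formats. As a minimal sketch, base R's read.csv() handles .csv files without any extra package. The file name and its contents below are made up so that the example runs on its own.

```r
# Create a small example file in a temporary folder
# (hypothetical file name and values, for illustration only)
csv_path <- file.path(tempdir(), "example_data.csv")
write.csv(data.frame(id = 1:3, score = c(4.5, 3.2, 5.0)),
          csv_path, row.names = FALSE)

# Read the file and store the result in an object
example_data <- read.csv(csv_path)
class(example_data)
# [1] "data.frame"
names(example_data)
# [1] "id"    "score"
nrow(example_data)
# [1] 3
```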
10.0.5 finding the value located in the 2nd row, 3rd column
aggression_data[2,3]
# A tibble: 1 × 1
AISS
<dbl>
1 2.85
10.0.6 looking only at the age variable/column
aggression_data$age
[1] 18 18 20 17 17 17 17 17 17 17 17 17 17 18 18 18 18 18 18 18 18 18 18 18 18
[26] 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18
[51] 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18
[ reached getOption("max.print") -- omitted 200 entries ]
attr(,"format.spss")
[1] "F2.0"
10.0.7 finding the age of the 3rd participant
aggression_data$age[3]
[1] 20
10.0.8 finding the mean age of all participants
mean(aggression_data$age)
[1] 20.21091
10.0.9 finding the standard deviation of age
sd(aggression_data$age)
[1] 4.960342
10.0.10 finding the median of age
median(aggression_data$age)
[1] 18
10.0.11 finding how many values are in age
length(aggression_data$age)
[1] 275
nrow(aggression_data)
[1] 275
10.0.12 plotting age vs. alcohol
plot(aggression_data$age, aggression_data$alcohol)
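Unlike age, the alcohol column contains a missing value (the NA in the second row of the head() output above). Functions such as mean() return NA when any value is missing, unless they are told to drop missing values first. The sketch below uses a short made-up vector rather than the real data, so the numbers are illustrative only.

```r
drinks <- c(28, NA, 80, 28, 10)   # hypothetical values, with one missing
mean(drinks)
# [1] NA
mean(drinks, na.rm = TRUE)        # drop missing values before averaging
# [1] 36.5
sum(is.na(drinks))                # how many values are missing
# [1] 1
```

The same na.rm = TRUE argument is accepted by sd() and median().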