R variables and data types: Introduction to R Programming

Overview

Teaching: 40 min
Exercises: 20 min
Questions
  • Which R data types will be useful in your own work?

Objectives
  • Assign values to objects in R.

  • Solve simple arithmetic operations in R.

  • Describe what a data frame is.

R variables and data types

First, we introduce the common variable types and data types that you’ll be working with in R. Many errors in R come from using the wrong variable type or data type for an operation.

Variable type   Description     Example
integer         Whole numbers   1, 100, -9
numeric         Decimals        0.1, -0.09, 234.567
character       Text            “A”, “hello”, “welcome”
logical         Booleans        TRUE or FALSE
factor          Categorical     “green”, “blue”, “red”, “purple”
missing         Logical         NA
empty           -               NULL

Data type    Description
vector       1D collection of variables of the same type
matrix       2D collection of variables of the same type
data.frame   2D collection of variables of multiple types


Getting Started

Working directory

We’ve created this project in a “working directory”. To check where this is, use:

getwd()
[1] "/Users/nicholasho/Projects/workshops/sih_bmc_r_workshop_2018/lessonbmc/_episodes_rmd"

Calculating things in R

Standard math functions work in R:

2+3
[1] 5
1/1000
[1] 0.001
sqrt(2)
[1] 1.414214
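
Beyond addition, division and square roots, R has operators for powers, integer division and remainders. The following is a small sketch with arbitrary illustrative values; the variable names are made up for this example.

```r
# Exponentiation
power <- 2^10              # 1024

# Integer division and remainder (modulo)
quotient <- 17 %/% 5       # 3
remainder <- 17 %% 5       # 2

# Parentheses control the order of operations
grouped <- (2 + 3) * 4     # 20
```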

We can store values in variables, which both hold data and give it a name. The assignment operator <- is the usual convention in R; = and -> also work, as the examples below show.

myvariable <- 3
myvariable
[1] 3
myvariable = 3
myvariable
[1] 3
3 -> myvariable
myvariable
[1] 3
myvariable^2
[1] 9

Variable and Data Types

There are several different types of data you can use in R. We’ll examine a few common ones in a little more detail.

Text

Strings are known as “character” data in R. Wrap a string in either double quotes " or single quotes '.

myname <- "nick"

We can use the class() function to see what data type a variable has:

class(myname)
[1] "character"

Numbers

Numbers have different classes. The most common two are integer and numeric. Integers are whole numbers:

favourite.integer <- as.integer(8)
print(favourite.integer)
[1] 8
class(favourite.integer)
[1] "integer"

Numbers can also be numeric, which allows decimal values:

favourite.numeric <- as.numeric(8.8)
print(favourite.numeric)
[1] 8.8
class(favourite.numeric)
[1] "numeric"
pvalue.threshold <- 0.05
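
One detail that can surprise newcomers: a number typed without a decimal point is still stored as numeric unless you ask for an integer. A brief sketch (the variable names here are illustrative only):

```r
# A bare number is stored as numeric, even if it looks like a whole number
plain.class <- class(8)        # "numeric"

# The L suffix creates an integer directly
suffix.class <- class(8L)      # "integer"

# as.integer() drops the decimal part rather than rounding
truncated <- as.integer(8.8)   # 8
```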

Logical (True/False)

Logical values are either TRUE or FALSE. Comparisons return logical values; for example, the == operator tests for equality:

class(TRUE)
[1] "logical"
favourite.numeric == 8.8
[1] TRUE
favourite.numeric == 9.9
[1] FALSE
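
A caution when comparing decimals with ==: numeric values are stored with limited precision, so results of arithmetic may differ from the value you expect by a tiny amount. The sketch below shows one common way to handle this, using all.equal(); the variable names are made up for illustration.

```r
total <- 0.1 + 0.2

# Exact comparison fails because of a tiny rounding error
exact.match <- total == 0.3                    # FALSE

# all.equal() compares within a small tolerance
near.match <- isTRUE(all.equal(total, 0.3))    # TRUE

# Other comparison operators also return logical values
not.equal <- 5 != 3                            # TRUE
at.least <- 5 >= 5                             # TRUE
```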

Vectors

We can create 1D data structures called “vectors”.

1:10
 [1]  1  2  3  4  5  6  7  8  9 10
2*(1:10)
 [1]  2  4  6  8 10 12 14 16 18 20
seq(0, 10, 2)
[1]  0  2  4  6  8 10

We can store vectors and perform operations on them.

myvector <- 1:10
myvector
 [1]  1  2  3  4  5  6  7  8  9 10
2^myvector
 [1]    2    4    8   16   32   64  128  256  512 1024
b <- c(3,4,5)
b^2
[1]  9 16 25
disorders <- c("autism","ocd", "depression", "ocd", "anxiety", "autism")
disorders
[1] "autism"     "ocd"        "depression" "ocd"        "anxiety"   
[6] "autism"    
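
Individual elements of a vector can be picked out with square brackets, using either positions or a logical condition. A minimal sketch, where the vector scores and its values are invented for illustration:

```r
# Hypothetical example vector (not part of the workshop data)
scores <- c(3, 4, 5, 10)

n.scores <- length(scores)           # number of elements: 4
first.score <- scores[1]             # R counts from 1, so this is 3
middle.scores <- scores[2:3]         # positions 2 and 3: 4 5
large.scores <- scores[scores > 4]   # elements matching a condition: 5 10
```

The same bracket syntax is used later to subset rows and columns of a data frame.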

Automatic type conversions

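A vector can only hold one type of data, so when you add an element of a different type R silently converts every element to a common type (logical to numeric, numeric to character). This is meant to be helpful, but it can cause unexpected problems: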

mybool <- c(rep(TRUE, 5), rep(FALSE, 5))
class(mybool)
[1] "logical"
mybool <- c(mybool, 10)
class(mybool)
[1] "numeric"
mybool <- c(mybool, "10")
class(mybool)
[1] "character"
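
It can help to look at the values, not just the class, after each conversion. The sketch below repeats the same steps on a shorter vector (the names flags, flags.num and flags.chr are made up for this example) and then converts back with as.numeric():

```r
flags <- c(TRUE, FALSE)

# Adding a number turns TRUE/FALSE into 1/0
flags.num <- c(flags, 10)             # 1 0 10

# Adding a string turns every element into text
flags.chr <- c(flags.num, "10")       # "1" "0" "10" "10"

# Text that looks like numbers can be converted back explicitly
recovered <- as.numeric(flags.chr)    # 1 0 10 10
```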

Factors

Factors store categorical data. Under the hood, factors are actually integers that have a string label attached to each unique integer. For example, if we have a long list of Male/Female labels for each of our patients, R stores this as a vector of integer codes (ones and twos, since the codes start at 1) plus a record of which label each code stands for.

disorders <- as.factor(disorders)
class(disorders)
[1] "factor"

How many categories are there for disorders and what are they?

levels(disorders)
[1] "anxiety"    "autism"     "depression" "ocd"       
nlevels(disorders)
[1] 4
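
To see the integer codes behind a factor, as.integer() can be applied to it, and table() counts how often each level occurs. A short sketch that rebuilds the disorders factor so it runs on its own (the names codes, labels and counts are illustrative):

```r
disorders <- factor(c("autism", "ocd", "depression", "ocd", "anxiety", "autism"))

# Each value is stored as the position of its label in levels(disorders)
codes <- as.integer(disorders)       # 2 4 3 4 1 2

# The labels can be recovered by looking the codes up in the levels
labels <- levels(disorders)[codes]

# Count the number of observations in each category
counts <- table(disorders)
```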

A factor can be ordered. This makes sense in the context of a ranking such as a survey response, e.g. from ‘Strongly agree’ to ‘Strongly disagree’.

responses <- c("low", "high", "medium", "low", "low", "high", "high", "medium", "medium")

myfactor <- factor(responses, levels = c("low", "medium", "high"))

myorderedfactor <- factor(responses, levels = c("low", "medium", "high"), ordered = TRUE)

levels(myfactor)
[1] "low"    "medium" "high"  

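When no levels are specified, factor levels are sorted alphabetically by default; above, we set the order explicitly with the levels argument.

Because myorderedfactor is an ordered factor, we can find its lowest category by using min(). The same call fails on myfactor, which has custom levels but is not ordered: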

min(myfactor) #this will fail
Error in Summary.factor(structure(c(1L, 3L, 2L, 1L, 1L, 3L, 3L, 2L, 2L: 'min' not meaningful for factors
min(myorderedfactor)
[1] low
Levels: low < medium < high
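
Ordered factors also support comparisons against a level name, which is useful for filtering. The sketch below uses a shortened responses vector and made-up variable names, and also shows the alphabetical default for contrast:

```r
responses <- c("low", "high", "medium", "low")

# Without a levels argument, levels are sorted alphabetically
default.levels <- levels(factor(responses))    # "high" "low" "medium"

ordered.responses <- factor(responses,
                            levels = c("low", "medium", "high"),
                            ordered = TRUE)

# Comparisons follow the declared order of the levels
above.low <- ordered.responses > "low"         # FALSE TRUE TRUE FALSE

# max() works for the same reason min() does
highest <- as.character(max(ordered.responses))  # "high"
```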

Working with data

A lot of the time in R, we are working with tables of data, which are stored in a special data structure called a “data frame”.

Commonly:

  • rows represent instances or individual observations, e.g. data points, patients, events, samples

  • columns represent the different types of data associated with each instance, e.g. name, ID, location, time, value

It is good practice to have a single row for every instance, and an individual, distinct measurement in each of the columns (not multiple measurements in one column, or redundant information spread across several columns). This is called tidy data, and it makes data frames much easier to work with. It’s also the source of the name “tidyverse”, a suite of packages we’ll be making extensive use of in the next few weeks to work with our data.

Here is an example data frame:

bmc.data <- data.frame(fname = c("Alice", "Bob", "Carol", "David"),
                       gender = as.factor(c("Female", "Male", "Female", "Male")),
                       disorder = c("autism", "anxiety", "autism", "depression"),
                       age = c(20, 45, 15, 12),
                       biomarker1 = c(5.70, 4.96, 1.37, 10.44),
                       clinicalstage = c("1b", "1a", "1a", "2"),
                       stringsAsFactors = FALSE)
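
After building a data frame it is worth checking which type each column ended up with. The sketch below uses a cut-down, hypothetical data frame called patients so that it runs on its own; str() prints a compact summary, and sapply() with class() returns the class of every column.

```r
# Hypothetical cut-down version of the example data
patients <- data.frame(fname = c("Alice", "Bob"),
                       gender = as.factor(c("Female", "Male")),
                       age = c(20, 45),
                       stringsAsFactors = FALSE)

# Compact overview: dimensions, column names, types and first values
str(patients)

# Class of each column as a named character vector
column.classes <- sapply(patients, class)
```

With stringsAsFactors = FALSE, text columns stay as character unless they are explicitly converted, as gender is here.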

Viewing The Data

Use the function View() to visually inspect the data in a new RStudio pane:

View(bmc.data)

How many rows and columns do we have?

nrow(bmc.data)
[1] 4
ncol(bmc.data)
[1] 6
dim(bmc.data)
[1] 4 6

Accessing Subsets

Return the first N rows of your data frame

head(bmc.data)
  fname gender   disorder age biomarker1 clinicalstage
1 Alice Female     autism  20       5.70            1b
2   Bob   Male    anxiety  45       4.96            1a
3 Carol Female     autism  15       1.37            1a
4 David   Male depression  12      10.44             2

The default for the head() function is to show the first 6 rows. How do we know this? Type ? in front of the function name in your console to open its help page:

?head

Return the first 3 rows of your data frame

head(bmc.data, n = 3)
  fname gender disorder age biomarker1 clinicalstage
1 Alice Female   autism  20       5.70            1b
2   Bob   Male  anxiety  45       4.96            1a
3 Carol Female   autism  15       1.37            1a
head(bmc.data, 3)
  fname gender disorder age biomarker1 clinicalstage
1 Alice Female   autism  20       5.70            1b
2   Bob   Male  anxiety  45       4.96            1a
3 Carol Female   autism  15       1.37            1a
bmc.data[1:3, ]
  fname gender disorder age biomarker1 clinicalstage
1 Alice Female   autism  20       5.70            1b
2   Bob   Male  anxiety  45       4.96            1a
3 Carol Female   autism  15       1.37            1a
bmc.data[c(1, 2, 3), ]
  fname gender disorder age biomarker1 clinicalstage
1 Alice Female   autism  20       5.70            1b
2   Bob   Male  anxiety  45       4.96            1a
3 Carol Female   autism  15       1.37            1a
bmc.data[c(TRUE, TRUE, TRUE, FALSE), ]
  fname gender disorder age biomarker1 clinicalstage
1 Alice Female   autism  20       5.70            1b
2   Bob   Male  anxiety  45       4.96            1a
3 Carol Female   autism  15       1.37            1a

As you can see, there are multiple ways to achieve the same result in R; this is very powerful for advanced users, but can be quite confusing for newcomers, since it’s not always clear what a particular chunk of code is doing.

Return the last 2 rows in a data set

tail(bmc.data, 2)
  fname gender   disorder age biomarker1 clinicalstage
3 Carol Female     autism  15       1.37            1a
4 David   Male depression  12      10.44             2

Return the “age” column in the data set

bmc.data$age
[1] 20 45 15 12
bmc.data[, 4]
[1] 20 45 15 12
bmc.data[, "age"]
[1] 20 45 15 12
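
Note that selecting a single column this way returns a plain vector rather than a data frame. If a one-column data frame is needed, drop = FALSE keeps the table structure. A small sketch using a hypothetical patients data frame:

```r
# Hypothetical example data
patients <- data.frame(fname = c("Alice", "Bob", "Carol"),
                       age = c(20, 45, 15),
                       stringsAsFactors = FALSE)

ages.vector <- patients[, "age"]                # numeric vector
ages.frame <- patients[, "age", drop = FALSE]   # data frame with one column
ages.double <- patients[["age"]]                # same vector, via [[ ]]
```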

Return only the first 3 rows and columns 2 and 5 of the data set

bmc.data[1:3, c(2,5)]
  gender biomarker1
1 Female       5.70
2   Male       4.96
3 Female       1.37

Return the columns named “fname” and “biomarker1”

bmc.data[,c("fname", "biomarker1")]
  fname biomarker1
1 Alice       5.70
2   Bob       4.96
3 Carol       1.37
4 David      10.44

Filtering the data

Return only the rows (patients) who are Female

bmc.data[bmc.data$gender == "Female", ]
  fname gender disorder age biomarker1 clinicalstage
1 Alice Female   autism  20       5.70            1b
3 Carol Female   autism  15       1.37            1a

What exactly happened here? The comparison produces a vector of TRUE/FALSE values, one per row, recording whether the condition holds for that row. Using this vector as the row index keeps only the rows where it is TRUE.

females <- bmc.data$gender == "Female"
females
[1]  TRUE FALSE  TRUE FALSE
bmc.data[females, ]
  fname gender disorder age biomarker1 clinicalstage
1 Alice Female   autism  20       5.70            1b
3 Carol Female   autism  15       1.37            1a

Another way to subset the patients is with the which() function. This returns the positions of the TRUE values in a logical vector.

females <- which(bmc.data$gender == "Female")
females
[1] 1 3
bmc.data[females, ]
  fname gender disorder age biomarker1 clinicalstage
1 Alice Female   autism  20       5.70            1b
3 Carol Female   autism  15       1.37            1a
bmc.data[which(bmc.data$gender == "Female"), ]
  fname gender disorder age biomarker1 clinicalstage
1 Alice Female   autism  20       5.70            1b
3 Carol Female   autism  15       1.37            1a

What if we want all patients older than 16 years of age?

bmc.data[bmc.data$age > 16, ]
  fname gender disorder age biomarker1 clinicalstage
1 Alice Female   autism  20       5.70            1b
2   Bob   Male  anxiety  45       4.96            1a
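
Conditions can be combined with & (and) and | (or) to filter on more than one column at once. A minimal sketch, using a hypothetical patients data frame with the same names and ages as the example data:

```r
# Hypothetical example data
patients <- data.frame(fname = c("Alice", "Bob", "Carol", "David"),
                       gender = c("Female", "Male", "Female", "Male"),
                       age = c(20, 45, 15, 12),
                       stringsAsFactors = FALSE)

# Both conditions must hold: female AND older than 16
older.females <- patients[patients$gender == "Female" & patients$age > 16, ]

# Either condition may hold: younger than 16 OR male
young.or.male <- patients[patients$age < 16 | patients$gender == "Male", ]
```

Here older.females contains only Alice, while young.or.male contains Bob, Carol and David.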

Adding records

Add a new row to the data set using the rbind() function:

new.person <- data.frame(fname = "Evelyn",
                         gender = "Female",
                         disorder = "anxiety",
                         age = 27,
                         biomarker1 = 40.8,
                         clinicalstage = "2")

bmc.data <- rbind(bmc.data, new.person)
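
After adding a row it is sensible to confirm that the data frame grew as expected. The sketch below uses small hypothetical data frames so that it runs on its own; it also illustrates that rbind() on data frames matches columns by name, so the new row's columns need the same names but not the same order.

```r
# Hypothetical example data
patients <- data.frame(fname = c("Alice", "Bob"),
                       age = c(20, 45),
                       stringsAsFactors = FALSE)

new.patient <- data.frame(fname = "Evelyn", age = 27,
                          stringsAsFactors = FALSE)

combined <- rbind(patients, new.patient)
n.combined <- nrow(combined)     # 3

# Columns listed in a different order are still matched by name
reordered <- data.frame(age = 30, fname = "Frank",
                        stringsAsFactors = FALSE)
combined <- rbind(combined, reordered)
```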

Section quiz

  1. Return those patients whose clinical stage is “1a”

  2. Return those patients whose biomarker1 value is less than 6.7

  3. Return just the first name of all patients older than 16 years of age

Solution

  1. Return those patients whose clinical stage is “1a”
bmc.data[bmc.data$clinicalstage == "1a", ]
  2. Return those patients whose biomarker1 value is less than 6.7
bmc.data[bmc.data$biomarker1 < 6.7, ]
  3. Return just the first name of all patients older than 16 years of age
bmc.data[bmc.data$age > 16, ]$fname
bmc.data[bmc.data$age > 16, "fname"]

Key Points

  • R supports multiple variable types

  • Errors often result from trying to perform an operation that a particular data type does not support

  • Errors can be cryptic to interpret

  • We can use helper packages to import and filter data in R