Data Structures
Overview
Teaching: 50 min
Exercises: 25 minQuestions
Working with more realistic data
Objectives
Assign values to objects in R.
Solve simple arithmetic operations in R.
Describe what a data frame is.
Working with more realistic Data
The dataset that we will be using today is taken from University of California, Irvine’s Machine Learning Repository. We will be examining the Autism Screening Adult Data Set. This is data collected by Dr Fadi Fayez Thabtah at NMIT via the mobile app ASD Tests.
Click here to open the data on your browser. Then, right-click and Save As “autism_data” in the “comma-separated values” or “csv” format into a data/ folder in your working directory.
Variable | Definition |
---|---|
id | The participant’s ID number |
AX_Score | Score based on the Autism Spectrum Quotient (AQ) 10 item screening tool AQ-10 |
age | Age in years |
gender | Male or Female |
ethnicity | Ethnicities in text form |
jaundice | Whether or not the participant was born with jaundice? |
autism | Whether or not anyone in tbe immediate family has been diagnosed with autism? |
country_of_res | Countries in text format |
used_app_before | Whether the participant has used a screening app |
result | Score from the AQ-10 screening tool |
age_desc | Age as categorical |
relation | Relation of person who completed the test |
Class/ASD | Participant classification |
Reading in Data
The first thing to do before we can begin an analysis is loading some data. To do this we can use the below command.
autism.data <- read.csv(file = "data/autism_data.csv",
header = TRUE,
sep = ",",
na.strings = "?")
Before going any further, let’s break down what this command does specifically.
- It reads the file, which could be
- a file path e.g.
"~/Documents/MyFolder/datafile.csv"
- a url like we have here e.g.
"http://www.website.com/subdomain/datafile.csv"
- a file path e.g.
- It specifies that the first row in the csv file contains “header” information i.e. column names
- It specifies that neighbouring columns are separated by a comma “,”
- It specifies that this CSV file uses “?” as the text for missing data
How would we find this out if we didn’t know already? Look at the help file:
?read.csv
Comma Separated Value (.csv) files are usually the standard, simplest format which is compatible with all sort of different software, e.g. R, python, Excel, MATLAB, …
But if we needed to read in a different format of data, there’s usually a function or a library for doing just that,
e.g. in base R:
read.csv()
read.table()
In the xlsx
library:
read.xlsx()
In the haven
library:
read_spss()
for SPSS files
In the tidyverse
family of libraries (more specifically, part of the readr
library):
read_csv()
read_tsv()
read_delim()
read_fwf()
read_table()
Viewing The Data
Remember, we use the function View()
to visually inspect at the data
View(autism.data)
If the dataset is large, it is usually more prudent to view only the first few rows, using the head command:
head(autism.data)
id A1_Score A2_Score A3_Score A4_Score A5_Score A6_Score A7_Score
1 1 1 1 1 1 0 0 1
2 2 1 1 0 1 0 0 0
3 3 1 1 0 1 1 0 1
4 4 1 1 0 1 0 0 1
5 5 1 0 0 0 0 0 0
6 6 1 1 1 1 1 0 1
A8_Score A9_Score A10_Score age gender ethnicity jaundice autism
1 1 0 0 26 f White-European no no
2 1 0 1 24 m Latino no yes
3 1 1 1 27 m Latino yes yes
4 1 0 1 35 f White-European no yes
5 1 0 0 40 f <NA> no no
6 1 1 1 36 m Others yes no
contry_of_res used_app_before result age_desc relation Class.ASD
1 United States no 6 18 and more Self NO
2 Brazil no 5 18 and more Self NO
3 Spain no 8 18 and more Parent YES
4 United States no 6 18 and more Self NO
5 Egypt no 2 18 and more <NA> NO
6 United States no 9 18 and more Self YES
What data type is this data?
class(autism.data)
[1] "data.frame"
What variables are included in this dataset?
colnames(autism.data)
[1] "id" "A1_Score" "A2_Score"
[4] "A3_Score" "A4_Score" "A5_Score"
[7] "A6_Score" "A7_Score" "A8_Score"
[10] "A9_Score" "A10_Score" "age"
[13] "gender" "ethnicity" "jaundice"
[16] "autism" "contry_of_res" "used_app_before"
[19] "result" "age_desc" "relation"
[22] "Class.ASD"
Changing The Column Names
We’ve a typo in the country of residence column. Let’s correct this by first finding the column with the typo.
colnames(autism.data)
[1] "id" "A1_Score" "A2_Score"
[4] "A3_Score" "A4_Score" "A5_Score"
[7] "A6_Score" "A7_Score" "A8_Score"
[10] "A9_Score" "A10_Score" "age"
[13] "gender" "ethnicity" "jaundice"
[16] "autism" "contry_of_res" "used_app_before"
[19] "result" "age_desc" "relation"
[22] "Class.ASD"
which(colnames(autism.data) == "contry_of_res")
[1] 17
colnames(autism.data)[17] <- "country"
Afterwards, view the data again to check the column names have been changed
View(autism.data)
Add a new column
Let’s create a better participant ID than just 1,2,3,…
The first function we will need is paste()
. This function concatenates strings. For example;
paste("hel", "lo")
[1] "hel lo"
This leaves an unwanted gap, hence we’ll use the sep
parameter
paste("hel", "lo", sep = "")
[1] "hello"
A shorthand for this command is called paste0, but we’re showing the paste()
command here, since you may want to use other delimeters, such as “@”, for example, in your own work.
Now let’s add the letters “PatientID_” to the front of the IDS. Then, we’ll add this as a new column
autism.data$pids <- paste( "PatientID_" , autism.data$id, sep = "")
Afterwards, view the data again to check the new “pids” column.
View(autism.data)
Accessing Subsets
Let’s look at all the records for those with a family history of autism:
family.autism.data <- autism.data[autism.data$autism == "yes",]
What is the average AQ-10 score for those with a familial history of autism?
mean(family.autism.data$result)
[1] 6.120879
What about for those with no familial history of autism?
no.family.autism.data <- autism.data[autism.data$autism == "no",]
mean(no.family.autism.data$result)
[1] 4.690049
Who is the oldest person in the dataset and what is their age?
max(autism.data$age)
[1] NA
# ignore missing data
max(autism.data$age, na.rm = TRUE)
[1] 383
which.max(autism.data$age)
[1] 53
Someone is apparently 383 years old! There are a number of possibilities why this value is here with a range of probabilities from the most likely (data entry error) to the least (the participant is yoda).
For our purposes, let’s remove this person. To do so, we’ll use the -
sign in front of the indices we want to remove.
autism.data <- autism.data[-which.max(autism.data$age), ]
Who is the youngest person in this dataset and how old are they?
min(autism.data$age, na.rm = TRUE)
[1] 17
which.min(autism.data$age)
[1] 7
Summarising Your Data
A handy way to inspect your data is to summarise it. To do so, use the summary()
function
summary(autism.data)
id A1_Score A2_Score A3_Score
Min. : 1.0 Min. :0.0000 Min. :0.0000 Min. :0.000
1st Qu.:177.5 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.000
Median :353.0 Median :1.0000 Median :0.0000 Median :0.000
Mean :352.9 Mean :0.7212 Mean :0.4538 Mean :0.458
3rd Qu.:528.5 3rd Qu.:1.0000 3rd Qu.:1.0000 3rd Qu.:1.000
Max. :704.0 Max. :1.0000 Max. :1.0000 Max. :1.000
A4_Score A5_Score A6_Score A7_Score
Min. :0.0000 Min. :0.0000 Min. :0.0000 Min. :0.0000
1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.0000
Median :0.0000 Median :0.0000 Median :0.0000 Median :0.0000
Mean :0.4964 Mean :0.4993 Mean :0.2845 Mean :0.4182
3rd Qu.:1.0000 3rd Qu.:1.0000 3rd Qu.:1.0000 3rd Qu.:1.0000
Max. :1.0000 Max. :1.0000 Max. :1.0000 Max. :1.0000
A8_Score A9_Score A10_Score age
Min. :0.0000 Min. :0.0000 Min. :0.0000 Min. :17.00
1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:21.00
Median :1.0000 Median :0.0000 Median :1.0000 Median :27.00
Mean :0.6501 Mean :0.3243 Mean :0.5747 Mean :29.19
3rd Qu.:1.0000 3rd Qu.:1.0000 3rd Qu.:1.0000 3rd Qu.:35.00
Max. :1.0000 Max. :1.0000 Max. :1.0000 Max. :64.00
NA's :2
gender ethnicity jaundice autism
f:336 White-European :233 no :634 no :612
m:367 Asian :123 yes: 69 yes: 91
Middle Eastern : 92
Black : 43
South Asian : 36
(Other) : 81
NA's : 95
country used_app_before result
United States :113 no :691 Min. : 0.000
United Arab Emirates: 82 yes: 12 1st Qu.: 3.000
India : 81 Median : 4.000
New Zealand : 80 Mean : 4.881
United Kingdom : 77 3rd Qu.: 7.000
Jordan : 47 Max. :10.000
(Other) :223
age_desc relation Class.ASD
18 and more:703 Health care professional: 4 NO :514
Others : 5 YES:189
Parent : 50
Relative : 28
Self :521
NA's : 95
pids
Length:703
Class :character
Mode :character
Use the summary function to return a quick summary of only the country column
summary(autism.data$country)
Afghanistan AmericanSamoa Angola
13 2 1
Argentina Armenia Aruba
2 2 1
Australia Austria Azerbaijan
27 4 1
Bahamas Bangladesh Belgium
2 3 3
Bolivia Brazil Burundi
1 9 1
Canada Chile China
15 1 1
Costa Rica Cyprus Czech Republic
1 1 1
Ecuador Egypt Ethiopia
1 3 2
Finland France Germany
1 11 4
Hong Kong Iceland India
1 2 81
Indonesia Iran Iraq
1 7 1
Ireland Italy Japan
5 5 1
Jordan Kazakhstan Lebanon
47 3 1
Malaysia Mexico Nepal
5 8 1
Netherlands New Zealand Nicaragua
10 80 1
Niger Oman Pakistan
1 1 3
Philippines Portugal Romania
4 1 3
Russia Saudi Arabia Serbia
7 4 1
Sierra Leone South Africa Spain
1 2 3
Sri Lanka Sweden Tonga
14 2 1
Turkey Ukraine United Arab Emirates
1 2 82
United Kingdom United States Uruguay
77 113 1
Viet Nam
5
Why does the summary function give you quartiles for some column and provide counts for others? Remember that data frames are 2D collections that can hold data of different types
class(autism.data$age)
[1] "integer"
summary(autism.data$age)
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
17.00 21.00 27.00 29.19 35.00 64.00 2
class(autism.data$country)
[1] "factor"
summary(autism.data$country)
Afghanistan AmericanSamoa Angola
13 2 1
Argentina Armenia Aruba
2 2 1
Australia Austria Azerbaijan
27 4 1
Bahamas Bangladesh Belgium
2 3 3
Bolivia Brazil Burundi
1 9 1
Canada Chile China
15 1 1
Costa Rica Cyprus Czech Republic
1 1 1
Ecuador Egypt Ethiopia
1 3 2
Finland France Germany
1 11 4
Hong Kong Iceland India
1 2 81
Indonesia Iran Iraq
1 7 1
Ireland Italy Japan
5 5 1
Jordan Kazakhstan Lebanon
47 3 1
Malaysia Mexico Nepal
5 8 1
Netherlands New Zealand Nicaragua
10 80 1
Niger Oman Pakistan
1 1 3
Philippines Portugal Romania
4 1 3
Russia Saudi Arabia Serbia
7 4 1
Sierra Leone South Africa Spain
1 2 3
Sri Lanka Sweden Tonga
14 2 1
Turkey Ukraine United Arab Emirates
1 2 82
United Kingdom United States Uruguay
77 113 1
Viet Nam
5
Another way to summarise a factor
column is to use the table()
function
table(autism.data$country)
Afghanistan AmericanSamoa Angola
13 2 1
Argentina Armenia Aruba
2 2 1
Australia Austria Azerbaijan
27 4 1
Bahamas Bangladesh Belgium
2 3 3
Bolivia Brazil Burundi
1 9 1
Canada Chile China
15 1 1
Costa Rica Cyprus Czech Republic
1 1 1
Ecuador Egypt Ethiopia
1 3 2
Finland France Germany
1 11 4
Hong Kong Iceland India
1 2 81
Indonesia Iran Iraq
1 7 1
Ireland Italy Japan
5 5 1
Jordan Kazakhstan Lebanon
47 3 1
Malaysia Mexico Nepal
5 8 1
Netherlands New Zealand Nicaragua
10 80 1
Niger Oman Pakistan
1 1 3
Philippines Portugal Romania
4 1 3
Russia Saudi Arabia Serbia
7 4 1
Sierra Leone South Africa Spain
1 2 3
Sri Lanka Sweden Tonga
14 2 1
Turkey Ukraine United Arab Emirates
1 2 82
United Kingdom United States Uruguay
77 113 1
Viet Nam
5
A difference between using the summary()
function over the table()
function for factors is that by default, the summary()
function also counts NA
values whereas in this must be specified for table()
summary(autism.data$ethnicity)
Asian Black Hispanic Latino
123 43 13 20
Middle Eastern others Others Pasifika
92 1 30 11
South Asian Turkish White-European NA's
36 6 233 95
table(autism.data$ethnicity)
Asian Black Hispanic Latino
123 43 13 20
Middle Eastern others Others Pasifika
92 1 30 11
South Asian Turkish White-European
36 6 233
table(autism.data$ethnicity, useNA = "ifany")
Asian Black Hispanic Latino
123 43 13 20
Middle Eastern others Others Pasifika
92 1 30 11
South Asian Turkish White-European <NA>
36 6 233 95
Quiz
Section quiz
What is the proportion of males to females in this dataset?
Return those participants whose
gender
is femaleReturn those participants whose
gender
is female into a new data frame called “females.autism.data”What is the mean
result
score for those classified as ASDWhat is the mean
result
score for those not classified as ASDSolution
- What is the proportion of males to females in this dataset?
table(autism.data$gender) prop.table(table(autism.data$gender))
- Return those participants whose ‘gender’ is female
autism.data[autism.data$gender == "f", ]
- Return those participants whose ‘gender’ is female into a new data frame called “females.autism.data”
females.autism.data <- autism.data[autism.data$gender == "f", ]
- What is the mean ‘result’ score for those classified as ASD
mean(autism.data[autism.data$Class.ASD == "YES" ,"result"])
- What is the mean ‘result’ score for those not classified as ASD
mean(autism.data[autism.data$Class.ASD == "NO" ,"result"])
Writing the data out
Use the write.table() function to write the data into a text file
write.table(x = autism.data,
file = "data/autism_pids.txt",
row.names = FALSE)
Use the write.csv() function to make the datafile a csv (comma separated values) file
write.csv(x = autism.data,
file = "data/autism_pids.csv",
row.names = FALSE)
[Optional] Homework
[Optional] Homework
Use the autism.data for this excercise
Amongst people who had used the app before, were more from Armenia or Brazil?
How many people annotated as parents were also autistic?
What is the mean age of males and females in the dataset? (Hint: you may need to use na.rm = TRUE as an arguement for the mean function)
[Extra credit] Using the
chisq.test
function, try to see whether there is a significant relationship between ASD and having jaundice as a child. What about gender and autism?
Key Points
R supports multiple variable types
Errors often result because of trying to perform an unsupported operation on a specific data type
Errors can be cryptic to interpret
We can use helper packages to import and filter data in R