Data Structures

Overview

Teaching: 50 min
Exercises: 25 min
Questions
  • Working with more realistic data

Objectives
  • Assign values to objects in R.

  • Solve simple arithmetic operations in R.

  • Describe what a data frame is.

Working with more realistic Data

The dataset that we will be using today is taken from University of California, Irvine’s Machine Learning Repository. We will be examining the Autism Screening Adult Data Set. This is data collected by Dr Fadi Fayez Thabtah at NMIT via the mobile app ASD Tests.

Click here to open the data on your browser. Then, right-click and Save As “autism_data” in the “comma-separated values” or “csv” format into a data/ folder in your working directory.

Variable Definition
id The participant’s ID number
AX_Score Score based on the Autism Spectrum Quotient (AQ) 10 item screening tool AQ-10
age Age in years
gender Male or Female
ethnicity Ethnicities in text form
jaundice Whether or not the participant was born with jaundice?
autism Whether or not anyone in tbe immediate family has been diagnosed with autism?
country_of_res Countries in text format
used_app_before Whether the participant has used a screening app
result Score from the AQ-10 screening tool
age_desc Age as categorical
relation Relation of person who completed the test
Class/ASD Participant classification

Reading in Data

The first thing to do before we can begin an analysis is loading some data. To do this we can use the below command.

autism.data <- read.csv(file = "data/autism_data.csv",
                        header = TRUE,
                        sep = ",",
                        na.strings = "?")

Before going any further, let’s break down what this command does specifically.

  1. It reads the file, which could be
    • a file path e.g. "~/Documents/MyFolder/datafile.csv"
    • a url like we have here e.g. "http://www.website.com/subdomain/datafile.csv"
  2. It specifies that the first row in the csv file contains “header” information i.e. column names
  3. It specifies that neighbouring columns are separated by a comma “,”
  4. It specifies that this CSV file uses “?” as the text for missing data

How would we find this out if we didn’t know already? Look at the help file:

?read.csv

Comma Separated Value (.csv) files are usually the standard, simplest format which is compatible with all sort of different software, e.g. R, python, Excel, MATLAB, …

But if we needed to read in a different format of data, there’s usually a function or a library for doing just that,

e.g. in base R:

In the xlsx library:

In the haven library:

In the tidyverse family of libraries (more specifically, part of the readr library):

Viewing The Data

Remember, we use the function View() to visually inspect at the data

View(autism.data)

If the dataset is large, it is usually more prudent to view only the first few rows, using the head command:

head(autism.data)
  id A1_Score A2_Score A3_Score A4_Score A5_Score A6_Score A7_Score
1  1        1        1        1        1        0        0        1
2  2        1        1        0        1        0        0        0
3  3        1        1        0        1        1        0        1
4  4        1        1        0        1        0        0        1
5  5        1        0        0        0        0        0        0
6  6        1        1        1        1        1        0        1
  A8_Score A9_Score A10_Score age gender      ethnicity jaundice autism
1        1        0         0  26      f White-European       no     no
2        1        0         1  24      m         Latino       no    yes
3        1        1         1  27      m         Latino      yes    yes
4        1        0         1  35      f White-European       no    yes
5        1        0         0  40      f           <NA>       no     no
6        1        1         1  36      m         Others      yes     no
  contry_of_res used_app_before result    age_desc relation Class.ASD
1 United States              no      6 18 and more     Self        NO
2        Brazil              no      5 18 and more     Self        NO
3         Spain              no      8 18 and more   Parent       YES
4 United States              no      6 18 and more     Self        NO
5         Egypt              no      2 18 and more     <NA>        NO
6 United States              no      9 18 and more     Self       YES

What data type is this data?

class(autism.data)
[1] "data.frame"

What variables are included in this dataset?

colnames(autism.data)
 [1] "id"              "A1_Score"        "A2_Score"       
 [4] "A3_Score"        "A4_Score"        "A5_Score"       
 [7] "A6_Score"        "A7_Score"        "A8_Score"       
[10] "A9_Score"        "A10_Score"       "age"            
[13] "gender"          "ethnicity"       "jaundice"       
[16] "autism"          "contry_of_res"   "used_app_before"
[19] "result"          "age_desc"        "relation"       
[22] "Class.ASD"      

Changing The Column Names

We’ve a typo in the country of residence column. Let’s correct this by first finding the column with the typo.

colnames(autism.data)
 [1] "id"              "A1_Score"        "A2_Score"       
 [4] "A3_Score"        "A4_Score"        "A5_Score"       
 [7] "A6_Score"        "A7_Score"        "A8_Score"       
[10] "A9_Score"        "A10_Score"       "age"            
[13] "gender"          "ethnicity"       "jaundice"       
[16] "autism"          "contry_of_res"   "used_app_before"
[19] "result"          "age_desc"        "relation"       
[22] "Class.ASD"      
which(colnames(autism.data) == "contry_of_res")
[1] 17
colnames(autism.data)[17] <- "country"

Afterwards, view the data again to check the column names have been changed

View(autism.data)

Add a new column

Let’s create a better participant ID than just 1,2,3,…

The first function we will need is paste(). This function concatenates strings. For example;

paste("hel", "lo")
[1] "hel lo"

This leaves an unwanted gap, hence we’ll use the sep parameter

paste("hel", "lo", sep = "")
[1] "hello"

A shorthand for this command is called paste0, but we’re showing the paste() command here, since you may want to use other delimeters, such as “@”, for example, in your own work.

Now let’s add the letters “PatientID_” to the front of the IDS. Then, we’ll add this as a new column

autism.data$pids <- paste( "PatientID_" , autism.data$id, sep = "")

Afterwards, view the data again to check the new “pids” column.

View(autism.data)

Accessing Subsets

Let’s look at all the records for those with a family history of autism:

family.autism.data <- autism.data[autism.data$autism == "yes",]

What is the average AQ-10 score for those with a familial history of autism?

mean(family.autism.data$result)
[1] 6.120879

What about for those with no familial history of autism?

no.family.autism.data <- autism.data[autism.data$autism == "no",]
mean(no.family.autism.data$result)
[1] 4.690049

Who is the oldest person in the dataset and what is their age?

max(autism.data$age)
[1] NA
# ignore missing data
max(autism.data$age, na.rm = TRUE)
[1] 383
which.max(autism.data$age)
[1] 53

Someone is apparently 383 years old! There are a number of possibilities why this value is here with a range of probabilities from the most likely (data entry error) to the least (the participant is yoda).

For our purposes, let’s remove this person. To do so, we’ll use the - sign in front of the indices we want to remove.

autism.data <- autism.data[-which.max(autism.data$age), ]

Who is the youngest person in this dataset and how old are they?

min(autism.data$age, na.rm = TRUE)
[1] 17
which.min(autism.data$age)
[1] 7

Summarising Your Data

A handy way to inspect your data is to summarise it. To do so, use the summary() function

summary(autism.data)
       id           A1_Score         A2_Score         A3_Score    
 Min.   :  1.0   Min.   :0.0000   Min.   :0.0000   Min.   :0.000  
 1st Qu.:177.5   1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:0.000  
 Median :353.0   Median :1.0000   Median :0.0000   Median :0.000  
 Mean   :352.9   Mean   :0.7212   Mean   :0.4538   Mean   :0.458  
 3rd Qu.:528.5   3rd Qu.:1.0000   3rd Qu.:1.0000   3rd Qu.:1.000  
 Max.   :704.0   Max.   :1.0000   Max.   :1.0000   Max.   :1.000  
                                                                  
    A4_Score         A5_Score         A6_Score         A7_Score     
 Min.   :0.0000   Min.   :0.0000   Min.   :0.0000   Min.   :0.0000  
 1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:0.0000  
 Median :0.0000   Median :0.0000   Median :0.0000   Median :0.0000  
 Mean   :0.4964   Mean   :0.4993   Mean   :0.2845   Mean   :0.4182  
 3rd Qu.:1.0000   3rd Qu.:1.0000   3rd Qu.:1.0000   3rd Qu.:1.0000  
 Max.   :1.0000   Max.   :1.0000   Max.   :1.0000   Max.   :1.0000  
                                                                    
    A8_Score         A9_Score        A10_Score           age       
 Min.   :0.0000   Min.   :0.0000   Min.   :0.0000   Min.   :17.00  
 1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:21.00  
 Median :1.0000   Median :0.0000   Median :1.0000   Median :27.00  
 Mean   :0.6501   Mean   :0.3243   Mean   :0.5747   Mean   :29.19  
 3rd Qu.:1.0000   3rd Qu.:1.0000   3rd Qu.:1.0000   3rd Qu.:35.00  
 Max.   :1.0000   Max.   :1.0000   Max.   :1.0000   Max.   :64.00  
                                                    NA's   :2      
 gender            ethnicity   jaundice  autism   
 f:336   White-European :233   no :634   no :612  
 m:367   Asian          :123   yes: 69   yes: 91  
         Middle Eastern : 92                      
         Black          : 43                      
         South Asian    : 36                      
         (Other)        : 81                      
         NA's           : 95                      
                 country    used_app_before     result      
 United States       :113   no :691         Min.   : 0.000  
 United Arab Emirates: 82   yes: 12         1st Qu.: 3.000  
 India               : 81                   Median : 4.000  
 New Zealand         : 80                   Mean   : 4.881  
 United Kingdom      : 77                   3rd Qu.: 7.000  
 Jordan              : 47                   Max.   :10.000  
 (Other)             :223                                   
        age_desc                       relation   Class.ASD
 18 and more:703   Health care professional:  4   NO :514  
                   Others                  :  5   YES:189  
                   Parent                  : 50            
                   Relative                : 28            
                   Self                    :521            
                   NA's                    : 95            
                                                           
     pids          
 Length:703        
 Class :character  
 Mode  :character  
                   
                   
                   
                   

Use the summary function to return a quick summary of only the country column

summary(autism.data$country)
         Afghanistan        AmericanSamoa               Angola 
                  13                    2                    1 
           Argentina              Armenia                Aruba 
                   2                    2                    1 
           Australia              Austria           Azerbaijan 
                  27                    4                    1 
             Bahamas           Bangladesh              Belgium 
                   2                    3                    3 
             Bolivia               Brazil              Burundi 
                   1                    9                    1 
              Canada                Chile                China 
                  15                    1                    1 
          Costa Rica               Cyprus       Czech Republic 
                   1                    1                    1 
             Ecuador                Egypt             Ethiopia 
                   1                    3                    2 
             Finland               France              Germany 
                   1                   11                    4 
           Hong Kong              Iceland                India 
                   1                    2                   81 
           Indonesia                 Iran                 Iraq 
                   1                    7                    1 
             Ireland                Italy                Japan 
                   5                    5                    1 
              Jordan           Kazakhstan              Lebanon 
                  47                    3                    1 
            Malaysia               Mexico                Nepal 
                   5                    8                    1 
         Netherlands          New Zealand            Nicaragua 
                  10                   80                    1 
               Niger                 Oman             Pakistan 
                   1                    1                    3 
         Philippines             Portugal              Romania 
                   4                    1                    3 
              Russia         Saudi Arabia               Serbia 
                   7                    4                    1 
        Sierra Leone         South Africa                Spain 
                   1                    2                    3 
           Sri Lanka               Sweden                Tonga 
                  14                    2                    1 
              Turkey              Ukraine United Arab Emirates 
                   1                    2                   82 
      United Kingdom        United States              Uruguay 
                  77                  113                    1 
            Viet Nam 
                   5 

Why does the summary function give you quartiles for some column and provide counts for others? Remember that data frames are 2D collections that can hold data of different types

class(autism.data$age)
[1] "integer"
summary(autism.data$age)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
  17.00   21.00   27.00   29.19   35.00   64.00       2 
class(autism.data$country)
[1] "factor"
summary(autism.data$country)
         Afghanistan        AmericanSamoa               Angola 
                  13                    2                    1 
           Argentina              Armenia                Aruba 
                   2                    2                    1 
           Australia              Austria           Azerbaijan 
                  27                    4                    1 
             Bahamas           Bangladesh              Belgium 
                   2                    3                    3 
             Bolivia               Brazil              Burundi 
                   1                    9                    1 
              Canada                Chile                China 
                  15                    1                    1 
          Costa Rica               Cyprus       Czech Republic 
                   1                    1                    1 
             Ecuador                Egypt             Ethiopia 
                   1                    3                    2 
             Finland               France              Germany 
                   1                   11                    4 
           Hong Kong              Iceland                India 
                   1                    2                   81 
           Indonesia                 Iran                 Iraq 
                   1                    7                    1 
             Ireland                Italy                Japan 
                   5                    5                    1 
              Jordan           Kazakhstan              Lebanon 
                  47                    3                    1 
            Malaysia               Mexico                Nepal 
                   5                    8                    1 
         Netherlands          New Zealand            Nicaragua 
                  10                   80                    1 
               Niger                 Oman             Pakistan 
                   1                    1                    3 
         Philippines             Portugal              Romania 
                   4                    1                    3 
              Russia         Saudi Arabia               Serbia 
                   7                    4                    1 
        Sierra Leone         South Africa                Spain 
                   1                    2                    3 
           Sri Lanka               Sweden                Tonga 
                  14                    2                    1 
              Turkey              Ukraine United Arab Emirates 
                   1                    2                   82 
      United Kingdom        United States              Uruguay 
                  77                  113                    1 
            Viet Nam 
                   5 

Another way to summarise a factor column is to use the table() function

table(autism.data$country)

         Afghanistan        AmericanSamoa               Angola 
                  13                    2                    1 
           Argentina              Armenia                Aruba 
                   2                    2                    1 
           Australia              Austria           Azerbaijan 
                  27                    4                    1 
             Bahamas           Bangladesh              Belgium 
                   2                    3                    3 
             Bolivia               Brazil              Burundi 
                   1                    9                    1 
              Canada                Chile                China 
                  15                    1                    1 
          Costa Rica               Cyprus       Czech Republic 
                   1                    1                    1 
             Ecuador                Egypt             Ethiopia 
                   1                    3                    2 
             Finland               France              Germany 
                   1                   11                    4 
           Hong Kong              Iceland                India 
                   1                    2                   81 
           Indonesia                 Iran                 Iraq 
                   1                    7                    1 
             Ireland                Italy                Japan 
                   5                    5                    1 
              Jordan           Kazakhstan              Lebanon 
                  47                    3                    1 
            Malaysia               Mexico                Nepal 
                   5                    8                    1 
         Netherlands          New Zealand            Nicaragua 
                  10                   80                    1 
               Niger                 Oman             Pakistan 
                   1                    1                    3 
         Philippines             Portugal              Romania 
                   4                    1                    3 
              Russia         Saudi Arabia               Serbia 
                   7                    4                    1 
        Sierra Leone         South Africa                Spain 
                   1                    2                    3 
           Sri Lanka               Sweden                Tonga 
                  14                    2                    1 
              Turkey              Ukraine United Arab Emirates 
                   1                    2                   82 
      United Kingdom        United States              Uruguay 
                  77                  113                    1 
            Viet Nam 
                   5 

A difference between using the summary() function over the table() function for factors is that by default, the summary() function also counts NA values whereas in this must be specified for table()

summary(autism.data$ethnicity)
          Asian           Black        Hispanic          Latino 
            123              43              13              20 
Middle Eastern           others          Others        Pasifika 
             92               1              30              11 
    South Asian         Turkish  White-European            NA's 
             36               6             233              95 
table(autism.data$ethnicity)

          Asian           Black        Hispanic          Latino 
            123              43              13              20 
Middle Eastern           others          Others        Pasifika 
             92               1              30              11 
    South Asian         Turkish  White-European 
             36               6             233 
table(autism.data$ethnicity, useNA = "ifany")

          Asian           Black        Hispanic          Latino 
            123              43              13              20 
Middle Eastern           others          Others        Pasifika 
             92               1              30              11 
    South Asian         Turkish  White-European            <NA> 
             36               6             233              95 

Quiz

Section quiz

  1. What is the proportion of males to females in this dataset?

  2. Return those participants whose gender is female

  3. Return those participants whose gender is female into a new data frame called “females.autism.data”

  4. What is the mean result score for those classified as ASD

  5. What is the mean result score for those not classified as ASD

Solution

  1. What is the proportion of males to females in this dataset?
table(autism.data$gender)
prop.table(table(autism.data$gender))
  1. Return those participants whose ‘gender’ is female
autism.data[autism.data$gender == "f", ]
  1. Return those participants whose ‘gender’ is female into a new data frame called “females.autism.data”
females.autism.data <- autism.data[autism.data$gender == "f", ]
  1. What is the mean ‘result’ score for those classified as ASD
mean(autism.data[autism.data$Class.ASD == "YES" ,"result"])
  1. What is the mean ‘result’ score for those not classified as ASD
mean(autism.data[autism.data$Class.ASD == "NO" ,"result"])

Writing the data out

Use the write.table() function to write the data into a text file

write.table(x = autism.data,
            file = "data/autism_pids.txt",
            row.names = FALSE)

Use the write.csv() function to make the datafile a csv (comma separated values) file

write.csv(x = autism.data,
          file = "data/autism_pids.csv",
          row.names = FALSE)

[Optional] Homework

[Optional] Homework

Use the autism.data for this excercise

  1. Amongst people who had used the app before, were more from Armenia or Brazil?

  2. How many people annotated as parents were also autistic?

  3. What is the mean age of males and females in the dataset? (Hint: you may need to use na.rm = TRUE as an arguement for the mean function)

  4. [Extra credit] Using the chisq.test function, try to see whether there is a significant relationship between ASD and having jaundice as a child. What about gender and autism?

Key Points

  • R supports multiple variable types

  • Errors often result because of trying to perform an unsupported operation on a specific data type

  • Errors can be cryptic to interpret

  • We can use helper packages to import and filter data in R