Data and statistical analysis
Overview
Teaching: 60 min
Exercises: 25 minQuestions
R has most common (and uncommon) statistical tests implemented as libraries
Objectives
Introduction into using R for statistical analysis and visualisation
Data and statistical analysis
We have previously shown you how to wrangle your data into the right shape with the tidyverse. Now we will use core R libraries to perform hypothesis testing and statistical analysis.
Our aim here is not to teach you statistics, but how to use R to perform the most popular statistical analysis on your data. If you’d like a statistical consultation for your specific project, please contact Sydney Informatics Hub’s statistics team (select the “less than a day” of help option).
To consult the statistician after an experiment is finished is often merely to ask him to conduct a post mortem examination. He can perhaps say what the experiment died of
- Ronald Fisher 1938
The data for this session can be downloaded from here. This is a selection of datapoints from a database of demographic and clinical measurements of Parkinson’s patients and controls. More information about this dataset can be found at http://physionet.org/physiobank/database/gaitpdb/.
Please download this .csv file into your working directory’s data
folder.
library(tidyverse)
gait <- read_csv("data/gait_clean.csv", na = "NaN")
# another way to read in this CSV file is to use the URL option in read.csv()
gait <- read_csv(url("https://raw.githubusercontent.com/Sydney-Informatics-Hub/lessonbmc/gh-pages/_episodes_rmd/data/gait_clean.csv"), na = "NaN")
# clean up the names to not have spaces
names(gait) <- make.names(names(gait), unique=TRUE)
# let's inspect the data
summary(gait)
ID Study Group Subjnum
Length:166 Length:166 Length:166 Min. : 1.00
Class :character Class :character Class :character 1st Qu.: 8.25
Mode :character Mode :character Mode :character Median :16.00
Mean :16.47
3rd Qu.:23.75
Max. :40.00
Gender Age Height..meters. Weight..kg.
Length:166 Min. :36.00 Min. : 1.450 Min. : 47.00
Class :character 1st Qu.:59.00 1st Qu.: 1.655 1st Qu.: 64.00
Mode :character Median :65.00 Median : 1.740 Median : 72.00
Mean :65.14 Mean : 57.634 Mean : 72.56
3rd Qu.:71.00 3rd Qu.:160.000 3rd Qu.: 80.00
Max. :86.00 Max. :185.000 Max. :105.00
NA's :3 NA's :3
HoehnYahr UPDRS UPDRSM TUAG
Min. :0.000 Min. : 0.00 Min. : 0.00 Min. : 6.23
1st Qu.:2.000 1st Qu.: 1.00 1st Qu.: 1.00 1st Qu.: 8.94
Median :2.000 Median :24.00 Median :14.00 Median :10.32
Mean :1.892 Mean :21.47 Mean :13.18 Mean :10.92
3rd Qu.:2.500 3rd Qu.:33.00 3rd Qu.:21.00 3rd Qu.:11.99
Max. :3.000 Max. :70.00 Max. :44.00 Max. :36.34
NA's :55 NA's :31 NA's :31 NA's :13
Speed_01..m.sec. Speed_10
Min. :0.360 Min. :0.2280
1st Qu.:1.013 1st Qu.:0.7930
Median :1.144 Median :0.9550
Mean :1.125 Mean :0.9675
3rd Qu.:1.261 3rd Qu.:1.1890
Max. :1.542 Max. :1.5320
NA's :1 NA's :146
Correlation
Let’s explore whether there is correlation between two clinical measures: UPDRS (Unified Parkinson’s Disease Rating Scale) and TUAG (Timed Up And Go Test). We will use the Spearman rank correlation here (as opposed to the default method, “pearson”). Remember that the Spearman correlation is useful when there is a monotonic relationship between two continuous or ordinal variables, i.e. the two variables tend change together, but not necessarily at a constant rate (if the rate is expected to be constant, use Pearson).
cor(gait$UPDRS, gait$TUAG, method = "spearman", use="pairwise.complete.obs")
[1] 0.5136702
ggplot(gait, aes(x = TUAG, y = UPDRS)) +
geom_point() + theme_bw()
Section quiz
- What’s the Spearman’s rank correlation coefficient between UPDRS and UPDRSM?
Solution
1.
cor(gait$UPDRS, gait$UPDRSM, method = "spearman", use="pairwise.complete.obs")
We’re interested to see if there’s any correlation between any of the clinical variables in the dataset. Let’s explore this by generating a correlation plot.
# please install corrplot prior to running the below code by running: install.packages("corrplot")
library(corrplot)
# keep the clinical variables
gait_clin <- gait %>%
select(HoehnYahr, UPDRS, UPDRSM, TUAG)
# create a correlation matrix of the clinical markers using Spearman correlation
correlations <- cor(gait_clin, method = "spearman" ,use="pairwise.complete.obs")
# create the correlation plot
corrplot(correlations, method = "color")
Hypothesis testing
2 samples: t-test (parametric)
Let’s say we want to simply compare the mean of the TUAG
score between Group
by using the 2-sample t-test This assumes that the variable in question is normally distributed in the two groups (although it is also relatively robust when this is not the case thanks to the Central Limit Theorem with sample sizes greater than 30).
Our null hypothesis is that there is no difference between the mean TUAG
of the Parkinson’s and Control participants. By default in R, the t.test()
function will perform Welch’s 2-sample t-test. We can visualise the distribution of this data using a violin plot:
t.test(TUAG ~ Group, data = gait)
Welch Two Sample t-test
data: TUAG by Group
t = -5.9645, df = 125.22, p-value = 2.339e-08
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-3.680347 -1.846478
sample estimates:
mean in group CO mean in group PD
9.292698 12.056111
# Let's visualise this data
ggplot(gait, aes(y=TUAG, x = Group)) +
geom_violin() + theme_bw()
Section quiz
- Perform the Welch 2-sample t-test for TUAG by gender? Generate violin plots of this.
Solution
1.
t.test(TUAG ~ Gender, data = gait) ggplot(gait, aes(y=TUAG, x = Gender, fill = Gender)) + geom_violin() +theme_bw()
2 samples: Wilcoxon-Mann-Whitney (non-parametric)
The Mann–Whitney U test, also called the Mann–Whitney–Wilcoxon, Wilcoxon rank-sum test, or Wilcoxon–Mann–Whitney test - is sometimes preferred to the 2 sample t-test when the distribution cannot be assumed to be normal; for a discussion of whether this is appropriate see here.
Thankfully, performing it in R is easier than trying to figure out if it’s appropriate…
Here the null hypothesis is that the two independent samples were selected from populations having a similar distribution (if the samples are dependent, use the Wilcoxon signed-rank test, by specifying the arguement paired=TRUE
).
wilcox.test(TUAG ~ Group, data = gait)
Wilcoxon rank sum test with continuity correction
data: TUAG by Group
W = 1122, p-value = 2.172e-10
alternative hypothesis: true location shift is not equal to 0
More than 2 comparisons: ANOVA
If we want to explore the relationship between a variable and multiple factors, we can use an ANOVA. Many experiments you’ve told us about are amenable to this analysis technique, for example determining levels of which miRNA are distinct in cancer vs normal tissue (miRNA1, miRNA2, miRNA3, miRNA4 as factor 1, and cancer vs normal as factor 2); whether the concentration of a compound is distinct in WT vs mutant cell lines at different ages (age1, age2, age3) etc.
One-way ANOVA
We can perform an ANOVA using the Study
grouping of patients and test the null hypothesis that the mean TUAG
is the same for all groups
aov_study <- aov(TUAG ~ Study , data = gait)
summary(aov_study)
Df Sum Sq Mean Sq F value Pr(>F)
Study 2 63.6 31.82 2.691 0.0711 .
Residuals 150 1773.7 11.82
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
13 observations deleted due to missingness
ggplot(gait, aes(y=TUAG, x = Study)) +
geom_violin() +
theme_bw()
Two-way ANOVA
We can perform an ANOVA using the Study
grouping of patients and the Group
, i.e. whether they are controls or have a PD diagnosis. Here, we are testing the null hypothesis that the mean TUAG
is the same for all groups.
aov_study <- aov(TUAG ~ Study + Group, data = gait)
summary(aov_study)
Df Sum Sq Mean Sq F value Pr(>F)
Study 2 63.6 31.82 3.108 0.0476 *
Group 1 248.2 248.25 24.248 2.22e-06 ***
Residuals 149 1525.4 10.24
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
13 observations deleted due to missingness
ggplot(gait, aes(y=TUAG, x = Study)) +
geom_violin() +
theme_bw() +
facet_grid(~Group)
We can also code for more complex designs, where we expect Height
and Weight
to not affect the mean TUAG
score independently (i.e. code for an interaction term).
names(gait)
[1] "ID" "Study" "Group"
[4] "Subjnum" "Gender" "Age"
[7] "Height..meters." "Weight..kg." "HoehnYahr"
[10] "UPDRS" "UPDRSM" "TUAG"
[13] "Speed_01..m.sec." "Speed_10"
aov_study2 <- aov(TUAG ~ Age + Height..meters. + Weight..kg. + Height..meters.:Weight..kg., data = gait)
summary(aov_study2)
Df Sum Sq Mean Sq F value Pr(>F)
Age 1 191.3 191.28 19.354 2.12e-05 ***
Height..meters. 1 8.3 8.32 0.842 0.360
Weight..kg. 1 9.4 9.38 0.949 0.332
Height..meters.:Weight..kg. 1 8.3 8.33 0.843 0.360
Residuals 142 1403.4 9.88
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
19 observations deleted due to missingness
aov_study3 <- aov(TUAG ~ Age + Height..meters.*Weight..kg., data = gait)
summary(aov_study3)
Df Sum Sq Mean Sq F value Pr(>F)
Age 1 191.3 191.28 19.354 2.12e-05 ***
Height..meters. 1 8.3 8.32 0.842 0.360
Weight..kg. 1 9.4 9.38 0.949 0.332
Height..meters.:Weight..kg. 1 8.3 8.33 0.843 0.360
Residuals 142 1403.4 9.88
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
19 observations deleted due to missingness
Let’s assume that the we had sufficient evidence from the ANOVA to reject the null. We would then perform post-hoc tests. For example, we can perform multiple pairwise-comparison between the means of groups using Tukey Honest Significant Differences:
aov_study <- aov(TUAG ~ Study , data = gait)
TukeyHSD(aov_study)
Tukey multiple comparisons of means
95% family-wise confidence level
Fit: aov(formula = TUAG ~ Study, data = gait)
$Study
diff lwr upr p adj
Ju-Ga -1.2964508 -3.040312 0.44741045 0.1867481
Si-Ga -1.6172115 -3.303214 0.06879145 0.0631204
Si-Ju -0.3207607 -1.837987 1.19646521 0.8711995
Many, many other post-hoc tests are available in R - if you know you’d like to use a specific test, chances are that R already has a package implemented.
Linear regression
Let’s say that for the Parkinson’s patients, we want to make a model to predict UPDRS
score using the UPDRSM
gait_pd <- gait %>%
filter(Group == "PD")
# Let's plot this first
ggplot(gait_pd, aes(x = UPDRSM, y = UPDRS)) +
geom_point() +
geom_smooth(method = 'lm', se = TRUE) +
theme_bw()
Fit a linear regression predicting UPDRS using UPDRSM, and look at the summary of the results
gait_pd_slr <- lm(UPDRS ~ UPDRSM, data = gait_pd)
summary(gait_pd_slr)
Call:
lm(formula = UPDRS ~ UPDRSM, data = gait_pd)
Residuals:
Min 1Q Median 3Q Max
-9.2820 -4.2972 -0.6608 3.0513 16.3392
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 4.97901 1.53207 3.25 0.00163 **
UPDRSM 1.37879 0.07378 18.69 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 5.359 on 89 degrees of freedom
(2 observations deleted due to missingness)
Multiple R-squared: 0.7969, Adjusted R-squared: 0.7946
F-statistic: 349.2 on 1 and 89 DF, p-value: < 2.2e-16
gait_pd_slr$coefficients
(Intercept) UPDRSM
4.979012 1.378788
plot(gait_pd_slr)
What if we now include Age
in our model?
gait_pd_mlr <- lm(UPDRS ~ UPDRSM + Age, data = gait_pd)
summary(gait_pd_mlr)
Call:
lm(formula = UPDRS ~ UPDRSM + Age, data = gait_pd)
Residuals:
Min 1Q Median 3Q Max
-9.8453 -4.1916 -0.5681 2.6996 16.2686
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.97473 4.48919 -0.217 0.829
UPDRSM 1.39551 0.07433 18.775 <2e-16 ***
Age 0.08490 0.06022 1.410 0.162
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 5.33 on 88 degrees of freedom
(2 observations deleted due to missingness)
Multiple R-squared: 0.8014, Adjusted R-squared: 0.7969
F-statistic: 177.5 on 2 and 88 DF, p-value: < 2.2e-16
gait_pd_mlr$coefficients
(Intercept) UPDRSM Age
-0.97473294 1.39551286 0.08490291
plot(gait_pd_mlr)
Visualisation with PCA
We will use PCA and plot the first 2 Principal Components to observe the variance in a microarray dataset. This is a famous dataset from 1999 where 27 acute lymphoblatic leukaemia (ALL) and 11 acute myeloid leukaemia (AML) patient’s gene expression data were measured for 3051 genes.
First, we need to install and load up the multtest
package which comes with the golub
data.
library(multtest)
data(golub)
# this is a vector of patient's diagnoses. ALL is 0 and AML is 1
table(golub.cl)
golub.cl
0 1
27 11
# let's try PCA in R for visualisation. First, we have to transpose the dataset because we want each patient to be recorded in rows and genes in columns
golub <- t(golub)
# perform PCA with the input data z-scored such that the mean for each gene is 0 and the SD for each gene is 1
pca <- prcomp(golub, scale = TRUE, center = TRUE)
# I'm just interested in plotting the Principal Components. This is in the "x" matrix of the pca object
pca_plotdata <- data.frame(pca$x)
# attach the patient diagnosis as we will be colouring in our plot with this
pca_plotdata <- pca_plotdata %>%
mutate(samples = as.factor(golub.cl))
# plot PC1 and PC2 with ggplot
ggplot(pca_plotdata, aes(x = PC1, y = PC2, colour = samples)) +
geom_point() +
theme_bw()
# inspect the cumulative proportion of variance explained by each PC
summary(pca)
Importance of components:
PC1 PC2 PC3 PC4 PC5 PC6
Standard deviation 21.7960 16.93278 14.28289 13.3825 11.48963 11.36482
Proportion of Variance 0.1557 0.09398 0.06686 0.0587 0.04327 0.04233
Cumulative Proportion 0.1557 0.24968 0.31655 0.3752 0.41851 0.46085
PC7 PC8 PC9 PC10 PC11 PC12
Standard deviation 10.67968 10.08086 9.75473 9.64132 8.70279 8.3228
Proportion of Variance 0.03738 0.03331 0.03119 0.03047 0.02482 0.0227
Cumulative Proportion 0.49823 0.53154 0.56273 0.59319 0.61802 0.6407
PC13 PC14 PC15 PC16 PC17 PC18
Standard deviation 8.24560 8.00956 7.9287 7.61089 7.41558 7.27317
Proportion of Variance 0.02228 0.02103 0.0206 0.01899 0.01802 0.01734
Cumulative Proportion 0.66301 0.68403 0.7046 0.72362 0.74165 0.75898
PC19 PC20 PC21 PC22 PC23 PC24
Standard deviation 7.10444 6.96890 6.90898 6.78065 6.65938 6.55649
Proportion of Variance 0.01654 0.01592 0.01565 0.01507 0.01454 0.01409
Cumulative Proportion 0.77553 0.79145 0.80709 0.82216 0.83670 0.85079
PC25 PC26 PC27 PC28 PC29 PC30
Standard deviation 6.4648 6.4170 6.26913 6.19409 6.12242 6.06556
Proportion of Variance 0.0137 0.0135 0.01288 0.01258 0.01229 0.01206
Cumulative Proportion 0.8645 0.8780 0.89086 0.90344 0.91572 0.92778
PC31 PC32 PC33 PC34 PC35 PC36
Standard deviation 5.90103 5.82533 5.77058 5.64872 5.51038 5.44628
Proportion of Variance 0.01141 0.01112 0.01091 0.01046 0.00995 0.00972
Cumulative Proportion 0.93920 0.95032 0.96123 0.97169 0.98164 0.99136
PC37 PC38
Standard deviation 5.13284 1.344e-14
Proportion of Variance 0.00864 0.000e+00
Cumulative Proportion 1.00000 1.000e+00
# quickly plot the proportion of variance explained by each component using base R's plotting function
plot(summary(pca)$importance["Proportion of Variance", ])
# and now, plot the cumulative proportion of variance explained
plot(summary(pca)$importance["Cumulative Proportion", ])
Clustering and heatmap
# please install gplots prior to running the below code by running: install.packages("gplots")
library(gplots)
# my favourite genes are in these columns
favgenes <- c(703,717,766,829,896,1037,1334,1665,1817,1834,2002,2124,2386,2600,2645,2801,2851,2939)
# let's make a heatmap of my favourite genes
heatmap.2(golub[ ,favgenes],
trace = "none",
scale = "column",
RowSideColors = c(rep("yellow", 27), rep("darkgreen", 11)),
dendrogram = "both",
col = "bluered")
Further reading/links
Several of you have asked about analysing so-called Likert data, i.e situations when you are trying to measure respondents attitudes to a particular question or statement (i.e. “very unsatisfied”, “unsatisifed”, “neither unsatisifed nor satisfied”, “satisfied”, “very satisfied”) etc.
While we don’t have the time to go into details of the complexities of analysing this type of data today, they can be analysed in R. Some ideas on how to do this can be found here, here, and in the likert package.
Key Points
R has a large community of developers creating power tools for data and statistical analysis