Exploratory analysis
Questions
- How can we evaluate the quality of our gene count matrix data using DESeq2?
- What quality control techniques can we use to tidy our data?
Before performing any DE analysis, it is good to explore and visualise our data. This helps us get a sense of the quality of our data at this stage of the analysis. We’ll be performing the following techniques in order to evaluate the quality of our raw counts data.
This session will be done in breakout rooms. Work through the code and challenge questions in the .Rmd file with your facilitators.
The total library size
For each sample, check the total library size. This is essentially the total number of reads which we have aligned to each sample.
# Sum raw gene counts for each sample (each column)
colSums(counttable)
Challenge
How do you think the differences in total library size could affect DE analysis?
Solution
Notice that the counts are different for each sample, which we will need to account for. Total library size can give us an indication of sensitivity (more reads = higher ability to detect lowly expressed genes).
Raw data distribution
Here we plot a boxplot of gene level raw counts (y axis) for each sample (x axis).
boxplot(counttable,
col="red")
Challenge
How would you describe what you are seeing in the plot?
Solution
For most genes, the total number of raw counts for each gene are relatively low. We can visualise distribution better if the counts were on a log2 scale.
# Add 1 to make sure all values are > 0
boxplot(log2((counttable)+1),
col = "red")
The distribution of counts across samples is not comparable (although not too dissimilar in this case!). This is a consideration to take if you plan to use any statistical tests that assume an equal distribution of counts across samples.
DESeq2
The DESeq2 package contains functions that perform normalisation, data transformation, visualisation and DE. This is a highly regarded and popular tool. We will use some of its functions to perform exploratory analysis.
This is a large package and we will only scratch the surface of key concepts in this workshop. We recommend you read the DESeq2 paper and manual before performing your own analysis.
Experimental design and the DESeqDataSet object
In order for DESeq2 to perform DE, we need to store data in a DESeqDataSet object (dds) which contains:
- Our count matrix file
- Our experimental information (“meta”) file
- Our design formula
For exploratory analysis, we set design = ~ 1 which tells DESeq2 to be blind to experimental groups. We do not want DESeq2 to account for any within group variability during exploratory analysis and quality checking. This will allow us to observe for any unexpected batch effects.
We will spend more time understanding the dds object later in this workshop.
# We will call this object by name 'dds' as this is a standard practice
<- DESeqDataSetFromMatrix(countData = counttable,
dds colData = meta,
design = ~1)
Data transformation
Count data is transformed with regularized log (rlog) or variance stabilising transformation (vst), required before performing exploratory data analysis such as visualisation and clustering (e.g. PCA). Both methods produce data on the log2 scale, and normalize for other factors such as library size.
rlog performs slightly better, but can take a lot longer than vst if you have many samples. We will set blind = TRUE so that DESeq2 is blind to experimental groups, for the same reasons as previously described.
# Calculate rlog and store it in a dds-like object
<- rlog(dds, blind = TRUE)
rlog <- assay(rlog) rlog.data
We can check the effect of transformation by again using boxplots.
boxplot(rlog.data,
col = "red")
Notice that the count distribution across samples is much more comparable with rlog transformed data.
Principal component analysis
Principal component analysis (PCA) is a dimensionality reduction method that summarises high dimensionality data into a number of principal components (PCs). For RNA-Seq, our highly dimensional data is our per sample gene-expression data and the variance that exists across samples. We can observe the relationship between these in a 2D space, by plotting two components at a time (usually the top two that account for most of the variance).
Create a PCA plot using rlog transformed data. By default, DESeq2::plotPCA()
will colour each dot by the sample’s experimental group, but we have included some additional code to remove these for our discussion.
# DESeq2's plotPCA function will create a PCA plot using an object that has rlog or vst values
<- plotPCA(rlog, returnData=TRUE)
pcaData
# Convert to percentages
<- round(100 * attr(pcaData, "percentVar"))
percentVar
# Plot table
ggplot(pcaData, aes(PC1, PC2)) +
geom_point(size=3) +
xlab(paste0("PC1: ",percentVar[1],"% variance")) +
ylab(paste0("PC2: ",percentVar[2],"% variance")) +
coord_fixed()
Challenge
Each dot represents one sample. Samples that appear closer together have a more similar gene expression profile. Can you speculate which samples belong to the same experimental group?
Solution
The 3 on the left are likely to belong to one experimental group, and the 3 on the right to another.
Let’s recreate the plot, now colouring samples by their experimental group.
# DESeq2's plotPCA function will create a PCA plot using an object that has rlog or vst values
::plotPCA(rlog) DESeq2
Challenge
- Can you comment on how the samples cluster together in the plot?
- If you saw one red dot cluster more closely with the blue dots, what might this suggest?
- Apart from experimental groups, what other relationships might be revealed when looking at PCA plots?
- How much of the overall variance is explained by PC1 & PC2?
Solution
- The wild samples cluster together nicely, as do the knockout, although the knockout samples are a little bit more spread out.
- The gene expression profile of that knockout is more similar to the wildtype mice and the knockout might not have worked. This is not the case here and we have checked our alignments - all knockouts do not have exon 1 of the Gtf2ird1 gene.
- Batch effects can be very tricky to deal with and ideally, you would implement strategies to avoid or control for unwanted batch effects before starting your experiment.
- 90%. As PC1 and PC2 explain most of the variance observed in the dataset, we do not check the other PCs.
Plotting the other PCs is something you may want to do until you have explored most of the variation in the dataset and what their potential sources might be. We will not have time to cover this in the workshop, but do recommend you look into scree plots and observing the genes contributing to each PC.
Sample-to-sample distances heatmap
Another way to visualise how similar or dissimilar our samples are is to plot sample distances in a heatmap and a hierarchical cluster.
# dist() calculates eucleadean distance, which requires data to be in a specific format
<- dist(t(assay(rlog)))
sampleDists <- as.matrix(sampleDists)
sampleDistMatrix
# Get some pretty blue colours
<- colorRampPalette( rev(brewer.pal(9, "Blues")) )(255)
colors
# Plot the sampleDistMatrix in a heatmap
# pheatmap also calculates and plots hierachical clusters
pheatmap(sampleDistMatrix,
clustering_distance_rows = sampleDists,
clustering_distance_cols = sampleDists,
col = colors)
Dark blue indicates samples that are more similar (distance = 0).
Challenge
What do you notice about sample KO3?
Solution
The hierarchical cluster groups KO3 more closely with the wild type samples. This could be an indication that our knockout has not worked - although we have confirmed that it has in part 1 of this workshop and are hence not too worried about it. Another reason could be that the knockout effect is not as strong in this sample as the other KO samples - for reasons we do not yet know!
Key points
- Performing exploratory analysis is required in order to understand our data and decide if the experimental variability is explained by the conditions of interest.
- Exploratory analysis can also help to identify and avoid possible outlier samples from the experiment.
All materials copyright Sydney Informatics Hub, University of Sydney