Exploratory data analysis of a count table

Overview

Teaching: 30 min
Exercises: 0 min

Questions

How do we filter out lowly expressed genes?

How can we reveal batch effects in our data?

Objectives

Filter out lowly expressed genes (and understandw why this is essential for DE)

Use PCA and clustering to assess batch effects and overall experiment reliability

Visualise PCA, MDS and heatmap plots

Now that we’ve identified that the library was generated in a strand-specific manner, we can drop the irrelevant (“unstranded”, “forwardstrand”) columns of our counttablefull dataframe. Check that we have 13 columns in the resulting dataframe (12 count, 1 gene name).

counttablefull <- counttablefull %>% 
  select(-ends_with("unstranded"), -ends_with("forwardstrand")) 
names(counttablefull) <- str_replace(names(counttablefull), "_reversestrand", "")

dim(counttablefull)

[1] 58780    13

Lets visualise the count data:

counttablefull %>%
  select(-ensembl_gene) %>%
  gather(library, value) %>%
  ggplot(aes(x = value)) + geom_histogram(bins = 100) + facet_wrap(~library)

plot of chunk Visi1

We can clearly see that BY FAR most genes have NO counts. Let’s filter these out first, and visualise again:

counttablefull %>%
  select(-ensembl_gene) %>%
  gather(library, value) %>%
  filter(value > 0) %>% 
  ggplot(aes(x = value)) + geom_density() + facet_wrap(~library) 

plot of chunk density

We still see that most genes have very low counts. One way to visualise this better is to log-transform the x-axis:

counttablefull %>%
  select(-ensembl_gene) %>%
  gather(library, value) %>%
  filter(value > 0) %>% 
  ggplot(aes(x = value)) + geom_density() + facet_wrap(~library) + scale_x_log10()

plot of chunk densityLog

This allows us to see that for most datasets, there are many genes that have ~10 counts each, and ~100 counts, as well as some genes with very low, and some genes with very high counts.

Challenge 1

Use ggplot to make a boxplot of the data above. Is the median number of in each sample what you thought it might be based on the density plots above?
Solution
counttablefull %>%
  select(-ensembl_gene) %>%
  gather(library, value) %>%
  filter(value > 0) %>% 
  ggplot(aes(y = value)) + geom_boxplot() + facet_wrap(~library) 

Genes with very low expression levels:

Are impossible to accurately assess for differential expression
Add to the multiple testing burden, thereby reducing the power to detect truly differentially expressed genes

Filtering out lowly expressed genes

We will use the cpm() function from the edgeR library to filter out lowly expressed genes. CPM stands for counts per million, and allows us to normalise for the very different library depths characteristic of our datasets.

# best to convert to a matrix, as edgeR expects a matrix of counts or a DGEList object as input
counttablematrix <- counttablefull %>%
  select(-ensembl_gene) %>%
  as.matrix()

row.names(counttablematrix) <- counttablefull$ensembl_gene
counttable_cpm <- cpm(counttablematrix)

head(counttable_cpm)

                  input29b_1 input29b_2 input29b_3 inputscram_1
ENSG00000223972.5 0.00000000   0.000000   0.000000     0.000000
ENSG00000227232.5 0.89137020   1.474682   1.968631     2.965814
ENSG00000278267.1 0.00000000   0.000000   0.000000     0.000000
ENSG00000243485.5 0.09904113   0.000000   0.000000     0.000000
ENSG00000284332.1 0.00000000   0.000000   0.000000     0.000000
ENSG00000237613.2 0.00000000   0.000000   0.000000     0.000000
                  inputscram_2 inputscram_3 pull29b_1 pull29b_2  pull29b_3
ENSG00000223972.5     0.000000     0.000000 0.0000000 0.0000000 0.05728526
ENSG00000227232.5     3.453934     3.517655 0.2293199 0.4191681 0.28642629
ENSG00000278267.1     0.000000     0.000000 0.0000000 0.0000000 0.00000000
ENSG00000243485.5     0.000000     0.000000 0.0000000 0.0000000 0.00000000
ENSG00000284332.1     0.000000     0.000000 0.0000000 0.0000000 0.00000000
ENSG00000237613.2     0.000000     0.000000 0.0000000 0.0000000 0.00000000
                  pullscram_1 pullscram_2 pullscram_3
ENSG00000223972.5  0.09678162  0.03468246   0.0000000
ENSG00000227232.5  0.29034487  0.48555444   0.5395529
ENSG00000278267.1  0.00000000  0.00000000   0.0000000
ENSG00000243485.5  0.00000000  0.00000000   0.0000000
ENSG00000284332.1  0.00000000  0.00000000   0.0000000
ENSG00000237613.2  0.00000000  0.00000000   0.0000000

As a good rule of thumb, we want to filter out genes that have counts less than 10-15 reads in our raw data.

Let’s figure out how many reads that would be given our library depth. Also, since we’re looking at a pulldown experiment, it makes sense for us to develop this cutoff on the input samples (!):

counttablefull %>%
  select(-ensembl_gene) %>%
  purrr::modify(~sum(.)) %>% 
  distinct() %>%
  gather(library, counts) %>%  # split library column into dataset and protocol
  mutate(millionreads = counts/(10^6)) %>%
  mutate(cpmThreshold = 15/millionreads) %>%
  filter(str_detect(library, 'input')) %>%
  mutate(mycutoff = 0.6 * millionreads)

       library   counts millionreads cpmThreshold  mycutoff
 input29b_1 20193630     20.19363    0.7428085 12.116178
 input29b_2 28480715     28.48072    0.5266722 17.088429
 input29b_3 23366497     23.36650    0.6419447 14.019898
inputscram_1 16521603     16.52160    0.9079022  9.912962
inputscram_2 15634347     15.63435    0.9594261  9.380608
inputscram_3 15066857     15.06686    0.9955626  9.040114

A cutoff of 0.6 seems reasonable.

Let’s generate a matrix with which we can subset the cpm matrix, which will have a value of TRUE for which at least 3 samples meet the required criteria:

subsetting_matrix <- counttable_cpm > 0.6
head(subsetting_matrix)

                  input29b_1 input29b_2 input29b_3 inputscram_1
ENSG00000223972.5      FALSE      FALSE      FALSE        FALSE
ENSG00000227232.5       TRUE       TRUE       TRUE         TRUE
ENSG00000278267.1      FALSE      FALSE      FALSE        FALSE
ENSG00000243485.5      FALSE      FALSE      FALSE        FALSE
ENSG00000284332.1      FALSE      FALSE      FALSE        FALSE
ENSG00000237613.2      FALSE      FALSE      FALSE        FALSE
                  inputscram_2 inputscram_3 pull29b_1 pull29b_2 pull29b_3
ENSG00000223972.5        FALSE        FALSE     FALSE     FALSE     FALSE
ENSG00000227232.5         TRUE         TRUE     FALSE     FALSE     FALSE
ENSG00000278267.1        FALSE        FALSE     FALSE     FALSE     FALSE
ENSG00000243485.5        FALSE        FALSE     FALSE     FALSE     FALSE
ENSG00000284332.1        FALSE        FALSE     FALSE     FALSE     FALSE
ENSG00000237613.2        FALSE        FALSE     FALSE     FALSE     FALSE
                  pullscram_1 pullscram_2 pullscram_3
ENSG00000223972.5       FALSE       FALSE       FALSE
ENSG00000227232.5       FALSE       FALSE       FALSE
ENSG00000278267.1       FALSE       FALSE       FALSE
ENSG00000243485.5       FALSE       FALSE       FALSE
ENSG00000284332.1       FALSE       FALSE       FALSE
ENSG00000237613.2       FALSE       FALSE       FALSE

subsetting_vector <- rowSums(subsetting_matrix) >= 3
head(subsetting_vector)

ENSG00000223972.5 ENSG00000227232.5 ENSG00000278267.1 ENSG00000243485.5 
            FALSE              TRUE             FALSE             FALSE 
ENSG00000284332.1 ENSG00000237613.2 
            FALSE             FALSE 

We have made a vector with which we can now subset the counttablematrix.

counttablematrix_filt <- counttablematrix[subsetting_vector,]
dim(counttablematrix_filt)

[1] 13795    12

This has resulted in 13795 of 58780 genes remaining for differential expression testing.

We can now convert it to a DGEList object (which stands for a “digital gene expression list”), to enable some critical quality control data analysis and subsequent differential gene expression.

Convert to DGEList

mydgelist <- DGEList(counttablematrix_filt)

mydgelist

An object of class "DGEList"
$counts
                   input29b_1 input29b_2 input29b_3 inputscram_1
ENSG00000227232.5          18         42         46           49
ENSG00000279457.4          26         39         58           25
ENSG00000230021.10          1         22         32            0
ENSG00000225972.1         180        191        298          105
ENSG00000225630.1        3093       8978       5747         1509
                   inputscram_2 inputscram_3 pull29b_1 pull29b_2 pull29b_3
ENSG00000227232.5            54           53        10        18        10
ENSG00000279457.4            16           29        17        23        22
ENSG00000230021.10            0            3        11         9        15
ENSG00000225972.1            45           70       419       322       451
ENSG00000225630.1          1073         1171      5274      5277      4595
                   pullscram_1 pullscram_2 pullscram_3
ENSG00000227232.5            9          14          15
ENSG00000279457.4           25          11          10
ENSG00000230021.10          45          16           2
ENSG00000225972.1          310         408         361
ENSG00000225630.1         5429        5396        5085
13790 more rows ...

$samples
             group lib.size norm.factors
input29b_1       1 20173578            1
input29b_2       1 28455783            1
input29b_3       1 23344076            1
inputscram_1     1 16504464            1
inputscram_2     1 15617845            1
7 more rows ...

head(mydgelist$samples)

             group lib.size norm.factors
input29b_1       1 20173578            1
input29b_2       1 28455783            1
input29b_3       1 23344076            1
inputscram_1     1 16504464            1
inputscram_2     1 15617845            1
inputscram_3     1 15050789            1

How different are the library depths after filtering?

mydgelist$samples %>% 
  as.data.frame() %>%
  rownames_to_column() %>%
  separate(rowname, into = c("treatment", "replicate"), sep="_") %>%
  ggplot(aes(y = lib.size, x = treatment, fill = replicate)) +geom_bar(stat = "identity", position = "dodge") 

plot of chunk unnamed-chunk-7

We can see that there is no substantial difference from our visualisation of the unfiltered data in the previous section.

To visualise whether there is a difference in the distribution of raw counts, we can normalise by library depth to counts per million, and apply a log2 transformation to better see the distribution:

mydgelist_cpm <- cpm(mydgelist, log = TRUE)
boxplot(mydgelist_cpm)
abline(h=median(mydgelist_cpm),col="blue")
title("Boxplots of logCPMs (unnormalised)")

plot of chunk unnamed-chunk-8

We can also use ggplot to create a publication-quality version of the above plot:

mydgelist_cpm %>% 
  as.data.frame() %>%
  rownames_to_column() %>%
  gather(library, log2cpm, -rowname) %>%
  separate(library, into = c("treatment", "replicate"), sep = "_") %>%
  ggplot(aes(x = treatment, y = log2cpm, fill = replicate)) + geom_boxplot() +
  geom_hline(yintercept = median(mydgelist_cpm), lty = 2)

plot of chunk unnamed-chunk-9

Essential quality control: most of the variability in your data should be explained by the conditions you are trying to investigate

PCA

Principle components analysis (PCA) is an approach to dimensionality reduction which aims to identify a combination of the set of variables in your data (in this case: a grouping of many, many genes), which together explain most of the variability in your data.

In the context of differential expression analysis (or, really, ANY kind of differential RNA analysis), we expect that most of the variability in our data is explained by the conditions we perturbed (i.e treatment vs control, disease vs normal etc). If this is not the case, and distinct samples are grouped together, this indicates that a differential expression analysis is likely to be unreliable, since this analysis specifically looks at the variance in the expression of genes in the different libraries.

PCA allows for the unbiased identification of batch effects, which then need to be taken into account in the experimental design, and potentially reveals samples to exclude, if only one or two are “grouping” the wrong way.

library(factoextra)

Welcome! Related Books: `Practical Guide To Cluster Analysis in R` at https://goo.gl/13EFCZ

res.pca <- prcomp(t(mydgelist_cpm))
colors <- as.factor(str_split(rownames(mydgelist$samples), "_", simplify = TRUE)[,1])
fviz_pca_ind(res.pca,
             col.ind = colors,
             palette = "Set1",
             repel = TRUE,    # Avoid text overlapping
             title = "PCA plot of human data"
             )

plot of chunk unnamed-chunk-10

We can also generate a “scree” plot, which shows what proportion of variance is explained by each of the components. We can see that PC1 captures by far the most variance.

fviz_eig(res.pca, addlabels = TRUE)

plot of chunk unnamed-chunk-11

We can also see that, as is common in most biological experiments (but not other types of data analysis), it’s not one gene that contributes strongly to each of the principle components, but many, many genes, that each contribute a tiny bit (< 0.1% !):

fviz_contrib(res.pca, choice = "var", axes = 1, top = 100)

plot of chunk fvizPCA

limma supports the creation of a multidimensional scaling (MDS) plot, which is similar to a PCA analysis. It takes into account only the 500 most variable genes in the dataset.

MDS

plotMDS(mydgelist)

plot of chunk unnamed-chunk-12

# We can also add labels and colour to this plot
plotMDS(mydgelist,col= c("red","purple","blue","orange")[colors])
legend("topright",fill=c("red","purple","blue","orange"),legend=levels(colors))

plot of chunk unnamed-chunk-13

The distance between each pair of samples in the MDS plot is called the “leading fold change”: the root-mean-square of the largest 500 log2-fold changes between that pair of samples. In this case, we can see that the distance between the pull_29b samples and the input ones is

In the MDS plot, the distance between the input samples (approximately -1 on logFC dim 1) and pull29b samples (+2.5) is approximately 3.5 units, which corresponds to a leading fold-change of 2^3.5 = 11.3 between input and specific pulldown. Differences between input and non-specific pulldown (~3 fold) are substantially less.

Reassuringly, all of the input samples group together, indicating that addition of a specific or non-specific oligonucleotide did not substantially affect the RNA sample (i.e. the input is truly an input!).

Challenge 2

The panel above shows the experimental results from a Bioconductor support question. Imagine that this was the PCA plot you got in one of your experiments.

Assume that:

you observe this result on your highly expressed genes (not all genes)

condition (color) is what you care about

batch (shape) indicates the days on which you did the experiment

Do you think there is a batch efect in the data?

The author has carried out batch correction. Do you think it was accurate?

Do you think the differential expression will be accurate in this sample?

If this was the PCA you observed, what would you do next?

Heirarchical clustering to generate heatmaps of the most variable genes

Another helpful visualisation that can assist with understanding the relationship between sequenced datasets uses heirarchical clustering, another unsupervised learning technique. It is designed to group together “more similar” samples based on a distance metric, calculated in n-dimensional space.

In our case, every sample (i.e. library) is captured in the n-dimensional (13795-dimensional!) space, where very gene is a dimension! /Talk about the need for dimensionality reduction!/

First, we need to generate a dataframe of most variable genes.

most_var_genes <- sort(apply(mydgelist_cpm, 1, var), decreasing=TRUE)[1:500]
mydgelist_cpm_most_var <- mydgelist_cpm[most_var_genes,] 

Use the pheatmap library to visualise:

library(pheatmap)

Warning: package 'pheatmap' was built under R version 3.5.2

pheatmap(mydgelist_cpm_most_var, show_rownames  = FALSE) # hiding rownames as there are 500 undreadable ones

plot of chunk unnamed-chunk-14

Calclulate a z-score instead of colouring by log2 expression level:

my_z_score <- function(x){
  return( (x - mean(x)) / sd(x))
}

mydgelist_cpm_most_var2 <-  t(apply(mydgelist_cpm_most_var, 1, my_z_score))
pheatmap(mydgelist_cpm_most_var2, show_rownames  = FALSE)

plot of chunk unnamed-chunk-15

Challenge 3

Consider what happens when we use all of the genes. How do the libraries cluster?
Solution
mydgelist_cpm2 <-  t(apply(mydgelist_cpm, 1, my_z_score))
pheatmap(mydgelist_cpm2, show_rownames  = FALSE)

Key Points

We must filter out lowly expressed genes prior to DGEA

Prior to any DGEA, we must take advantage of unsupervised learning techniques to ensure that no batch effects or other confounding issues affect our experiment as a whole

PCA and heirarchical clustering can be used to achieve this

If there are issues with the PCA and/or clustering, while a list of genes will still be reported in the differential expression, it will most NOT be reliable or accurate, as these issues need to be taken into account

There are tools in R to take into account both understood (different time points) and inexplicable batch effects

previous episode

RNA sequencing data analysis using R and the Artemis HPC

next episode

Exploratory data analysis of a count table

Overview

Challenge 1

Solution