Functional enrichment analysis in R

Learning objectives

Understand and apply functional enrichment analysis in the context of RNAseq DE experiments.
Define gene ontologies.
Perform functional enrichment analysis on a set of differentially expressed genes.

We now have a list of significant DE genes. To gain greater biological insights on the DE genes we can determine if there is enrichment of known biological functions, pathways or interactions. We are at the final stage of our RNA experimental workflow:

There are different methods for identifying functional enrichments, e.g.:

Over representation analysis (ORA): These methods group genes based on shared function, involvement in a biological pathway, presence in a specific cellular location, or other categories like functional pathways.
Functional class scoring (FCS): This method considers the full list of genes and their associated metrics, rather than just a pre-defined threshold for determining which genes are significant. It includes gene set enrichment analysis (GSEA).

Enrichment analysis can be done using some of the freely available web- and R-based tools. Today we will use a R-package clusterProfiler to perform Over representation analysis. You can read more about the GSEA method here

In this lesson, we will perform over representation analysis (ORA) using gene ontology databases.

2.4.1 Over-representation analysis

There are many functional enrichment tools that perform over-representation analysis using curated databases that categorise genes into groups based on shared function, or involvement in a pathway, or presence in a specific cellular location. These categories are independent of any organism, however each organism has distinct categories available.

To determine whether any categories are over-represented, you can determine the probability of having the observed proportion of genes associated with a specific category in your gene list based on the proportion of genes associated with the same category in the background gene set (gene category for the appropriate organism).

2.4.2 Gene Ontologies

One of the most widely-used categorisation databases is the Gene Ontology (GO) resource. To describe the functional roles of genes and gene products, they are grouped into GO terms. GO terms are organised into three independent controlled vocabularies (ontologies) in a species-independent manner:

Biological process: Refers to the biological role involving the gene or gene product, and could include “transcription”, “signal transduction”, and “apoptosis”. A biological process generally involves a chemical or physical change of the starting material or input.
Molecular function: Represents the biochemical activity of the gene product, such activities could include “ligand”, “GTPase”, and “transporter”.
Cellular component: Refers to the location in the cell of the gene product. Cellular components could include “nucleus”, “lysosome”, and “plasma membrane”.

GO enrichment analysis tools will determine GO terms that are enriched when you supply it will a list of DE genes.

2.4.3 Visualise functional profiles of genes

The clusterProfiler R package implements methods to analyse and visualise functional profiles of genes. The clusterProfiler R-library supports functional characteristics of both coding and non-coding genomics data for thousands of species with up-to-date gene annotation. It provides a tidy interface to access, manipulate, and visualise enrichment results to help users achieve efficient data interpretation.

2.4.4 Prepare DE genes for enrichment analysis

We will use clusterProfiler to perform enrichment analysis for upregulated and downregulated significant DE genes separately. We can use the resSig005_subset_lfc dataframe created earlier and prepare the data for clusterProfiler.

# We will use the resSig005_subset_lfc dataframe created earlier
# This has been filtered for padj < 0.05 and |LFC| > 1 DE genes

# Upregulated genes
# Filter for significant upregulated genes by log2 fold change > 1. Remove NAs.
sig.up <- resSig005_subset_lfc[resSig005_subset_lfc$log2FoldChange > 1, ]
sig.up <- na.omit(sig.up)
# Create list by decreasing log fold change for upregulated genes
sig.up.LFC <- sig.up$log2FoldChange
names(sig.up.LFC) <- rownames(sig.up)
# Sort by LFC, decreasing
sig.up.LFC <- sort(sig.up.LFC, decreasing = TRUE)

# Downregulated genes - let's do the same thing
sig.dn <- resSig005_subset_lfc[resSig005_subset_lfc$log2FoldChange < -1, ]
# Filter for significant upregulated genes by log2 fold change < -1. Remove NAs.
sig.dn <- na.omit(sig.dn)
# Create list by decreasing log fold change for upregulated genes
sig.dn.LFC <- sig.dn$log2FoldChange
names(sig.dn.LFC) <- rownames(sig.dn)
# Sort by LFC, decreasing
sig.dn.LFC <- sort(sig.dn.LFC, decreasing = TRUE)

You can check that you have correctly prepared the data by inspecting sig.up.LFC and sig.dn.LFC in the console. Use class() to check what type of R class we have stored the data in.

2.4.5 GO enrichment of upregulated genes

Clusterprofiler implements the function enrichGO() for gene ontology over-representation test. Let’s run the analysis for up-regulated genes. You can check what the different parameters in the function do by running the command ?enrichGO. The command will take a few minutes to run.

ego.up <- enrichGO(gene = names(sig.up.LFC),
                   OrgDb = org.Mm.eg.db, 
                   keyType = 'SYMBOL',
                   readable = FALSE,
                   ont = "ALL",
                   pAdjustMethod = "BH",
                   pvalueCutoff = 0.05, 
                   qvalueCutoff = 0.05)

Bar plot of upregulated enriched GO terms

The bar plot is a commonly used method to visualise enriched terms. It depicts the enrichment scores (e.g. p-adj values) and gene count or ratio as bar height and colour.

barplot(ego.up, 
        showCategory = 20)

Dot plot of upregulated enriched GO terms

A dot plot is similar to a scatter plot and bar plot with the capability to encode another score as dot size. In R the dot plot displays the index (each category) in the vertical axis and the corresponding value in the horizontal axis, so you can see the value of each observation following a horizontal line from the label.

dotplot(ego.up, 
        showCategory = 20,
        font.size = 8)

cnetplot of upregulated enriched GO terms

Both the barplot and dotplot only displayed most significant enriched terms, while users may want to know which genes are involved in these significant terms. The cnetplot depicts the linkages of genes and biological concepts (e.g. GO terms or KEGG pathways) as a network.

cnetplot(ego.up, 
         categorySize = "pvalue", 
         foldChange = sig.up.LFC,
         cex_label_gene = 0.7,
         showCategory = 5,
         cex_label_category = 1.2,
         shadowtext = 'category')

Heatmap-like functional classification of upregulated enriched GO terms

The heatplot is similar to cnetplot, while displaying the relationships as a heatmap. The gene-concept network may become too complicated if user want to show a large number significant terms. The heatplot can simplify the result and more easy to identify expression patterns.

heatplot(ego.up,
         showCategory = 20,
         foldChange = sig.up.LFC)

Challenge exercise

Try performing the above steps for downregulated genes. Please remember to rename the variable differently so as to not over-write them e.g. ego.dn in place of ego.up

This will be a good time to familiarise yourself with the various functions you have used. Remember, you can try and run ?Function_NAME e.g. ?enrichGO to get the help manual pages for that function. You can play with pvalue and qvalue cutoffs, categories etc. Try and interpret the results. We will discuss them at the end of this session.

Solution

ego.dn <- enrichGO(gene = names(sig.dn.LFC),
                   OrgDb = org.Mm.eg.db, 
                   keyType = 'SYMBOL',
                   readable = FALSE,
                   ont = "ALL",
                   pAdjustMethod = "BH",
                   pvalueCutoff = 0.05, 
                   qvalueCutoff = 0.05??q)

Bar plot of downregulated enriched GO terms

barplot(ego.dn,
        showCategory = 20)

Dot-plot of downregulated enriched GO terms

dotplot(ego.dn, 
        showCategory = 20,
        font.size = 10)

cnetplot of downregulated enriched GO terms

cnetplot(ego.dn,
         categorySize = "pvalue", 
         foldChange = sig.dn.LFC,
         cex_label_gene = 0.7,
         showCategory = 5,
         cex_label_category = 1.5,
         shadowtext = 'category')

Heatmap-like functional classification of downregulated enriched GO terms

heatplot(ego.dn,
         showCategory = 20,
         foldChange = sig.dn.LFC)

2.4.6 Other functional enrichment tools

Other enrich functions in clusterprofiler

clustrprofiler has other functions

enrichKEGG KEGG
enrichWP WikiPathways
enrichPathway Reactome
enrichr Enrichr resource

Open source tools

Pantherdb http://pantherdb.org/
Enrichr https://maayanlab.cloud/Enrichr/
David https://david.ncifcrf.gov/
gprofiler https://biit.cs.ut.ee/gprofiler/ And many more …

Commercial tools

Ingenuity Pathway Analysis (IPA) https://www.qiagen.com/us/
Metacore https://portal.genego.com/

2.4.7 sessionInfo()

It is good practice to record the version of R and all tools you are using for reproducibility purposes (and for the methods section in your paper!). R’s sessionInfo() prints all of this information.

sessionInfo()

Key points

Functional enrichments can guide our intepretation of results in their biological context.
Functional enrichments can include pathways and gene ontologies