Functional enrichment analysis in R
- Understand and apply functional enrichment analysis in the context of RNAseq DE experiments.
- Define gene ontologies.
- Perform functional enrichment analysis on a set of differentially expressed genes.
We now have a list of significant DE genes. To gain greater biological insights on the DE genes we can determine if there is enrichment of known biological functions, pathways or interactions. We are at the final stage of our RNA experimental workflow:
There are different methods for identifying functional enrichments, e.g.:
- Over representation analysis (ORA): These methods group genes based on shared function, involvement in a biological pathway, presence in a specific cellular location, or other categories like functional pathways.
- Functional class scoring (FCS): This method considers the full list of genes and their associated metrics, rather than just a pre-defined threshold for determining which genes are significant. It includes gene set enrichment analysis (GSEA).
Enrichment analysis can be done using some of the freely available web- and R-based tools. Today we will use a R-package clusterProfiler to perform Over representation analysis. You can read more about the GSEA method here
In this lesson, we will perform over representation analysis (ORA) using gene ontology databases.
2.4.1 Over-representation analysis
There are many functional enrichment tools that perform over-representation analysis using curated databases that categorise genes into groups based on shared function, or involvement in a pathway, or presence in a specific cellular location. These categories are independent of any organism, however each organism has distinct categories available.
To determine whether any categories are over-represented, you can determine the probability of having the observed proportion of genes associated with a specific category in your gene list based on the proportion of genes associated with the same category in the background gene set (gene category for the appropriate organism).
2.4.2 Gene Ontologies
One of the most widely-used categorisation databases is the Gene Ontology (GO) resource. To describe the functional roles of genes and gene products, they are grouped into GO terms. GO terms are organised into three independent controlled vocabularies (ontologies) in a species-independent manner:
- Biological process: Refers to the biological role involving the gene or gene product, and could include “transcription”, “signal transduction”, and “apoptosis”. A biological process generally involves a chemical or physical change of the starting material or input.
- Molecular function: Represents the biochemical activity of the gene product, such activities could include “ligand”, “GTPase”, and “transporter”.
- Cellular component: Refers to the location in the cell of the gene product. Cellular components could include “nucleus”, “lysosome”, and “plasma membrane”.
GO enrichment analysis tools will determine GO terms that are enriched when you supply it will a list of DE genes.
2.4.3 Visualise functional profiles of genes
The clusterProfiler R package implements methods to analyse and visualise functional profiles of genes. The clusterProfiler R-library supports functional characteristics of both coding and non-coding genomics data for thousands of species with up-to-date gene annotation. It provides a tidy interface to access, manipulate, and visualise enrichment results to help users achieve efficient data interpretation.
2.4.4 Prepare DE genes for enrichment analysis
We will use clusterProfiler to perform enrichment analysis for upregulated and downregulated significant DE genes separately. We can use the resSig005_subset_lfc
dataframe created earlier and prepare the data for clusterProfiler.
# We will use the resSig005_subset_lfc dataframe created earlier
# This has been filtered for padj < 0.05 and |LFC| > 1 DE genes
# Upregulated genes
# Filter for significant upregulated genes by log2 fold change > 1. Remove NAs.
<- resSig005_subset_lfc[resSig005_subset_lfc$log2FoldChange > 1, ]
sig.up <- na.omit(sig.up)
sig.up # Create list by decreasing log fold change for upregulated genes
<- sig.up$log2FoldChange
sig.up.LFC names(sig.up.LFC) <- rownames(sig.up)
# Sort by LFC, decreasing
<- sort(sig.up.LFC, decreasing = TRUE)
sig.up.LFC
# Downregulated genes - let's do the same thing
<- resSig005_subset_lfc[resSig005_subset_lfc$log2FoldChange < -1, ]
sig.dn # Filter for significant upregulated genes by log2 fold change < -1. Remove NAs.
<- na.omit(sig.dn)
sig.dn # Create list by decreasing log fold change for upregulated genes
<- sig.dn$log2FoldChange
sig.dn.LFC names(sig.dn.LFC) <- rownames(sig.dn)
# Sort by LFC, decreasing
<- sort(sig.dn.LFC, decreasing = TRUE) sig.dn.LFC
You can check that you have correctly prepared the data by inspecting sig.up.LFC and sig.dn.LFC in the console. Use class()
to check what type of R class we have stored the data in.
2.4.5 GO enrichment of upregulated genes
Clusterprofiler implements the function enrichGO() for gene ontology over-representation test. Let’s run the analysis for up-regulated genes. You can check what the different parameters in the function do by running the command ?enrichGO
. The command will take a few minutes to run.
<- enrichGO(gene = names(sig.up.LFC),
ego.up OrgDb = org.Mm.eg.db,
keyType = 'SYMBOL',
readable = FALSE,
ont = "ALL",
pAdjustMethod = "BH",
pvalueCutoff = 0.05,
qvalueCutoff = 0.05)
Bar plot of upregulated enriched GO terms
The bar plot is a commonly used method to visualise enriched terms. It depicts the enrichment scores (e.g. p-adj values) and gene count or ratio as bar height and colour.
barplot(ego.up,
showCategory = 20)
Dot plot of upregulated enriched GO terms
A dot plot is similar to a scatter plot and bar plot with the capability to encode another score as dot size. In R the dot plot displays the index (each category) in the vertical axis and the corresponding value in the horizontal axis, so you can see the value of each observation following a horizontal line from the label.
dotplot(ego.up,
showCategory = 20,
font.size = 8)
cnetplot of upregulated enriched GO terms
Both the barplot and dotplot only displayed most significant enriched terms, while users may want to know which genes are involved in these significant terms. The cnetplot depicts the linkages of genes and biological concepts (e.g. GO terms or KEGG pathways) as a network.
cnetplot(ego.up,
categorySize = "pvalue",
foldChange = sig.up.LFC,
cex_label_gene = 0.7,
showCategory = 5,
cex_label_category = 1.2,
shadowtext = 'category')
Heatmap-like functional classification of upregulated enriched GO terms
The heatplot is similar to cnetplot, while displaying the relationships as a heatmap. The gene-concept network may become too complicated if user want to show a large number significant terms. The heatplot can simplify the result and more easy to identify expression patterns.
heatplot(ego.up,
showCategory = 20,
foldChange = sig.up.LFC)
Try performing the above steps for downregulated genes. Please remember to rename the variable differently so as to not over-write them e.g. ego.dn
in place of ego.up
This will be a good time to familiarise yourself with the various functions you have used. Remember, you can try and run ?Function_NAME e.g. ?enrichGO to get the help manual pages for that function. You can play with pvalue and qvalue cutoffs, categories etc. Try and interpret the results. We will discuss them at the end of this session.
<- enrichGO(gene = names(sig.dn.LFC),
ego.dn OrgDb = org.Mm.eg.db,
keyType = 'SYMBOL',
readable = FALSE,
ont = "ALL",
pAdjustMethod = "BH",
pvalueCutoff = 0.05,
qvalueCutoff = 0.05??q)
Bar plot of downregulated enriched GO terms
barplot(ego.dn,
showCategory = 20)
Dot-plot of downregulated enriched GO terms
dotplot(ego.dn,
showCategory = 20,
font.size = 10)
cnetplot of downregulated enriched GO terms
cnetplot(ego.dn,
categorySize = "pvalue",
foldChange = sig.dn.LFC,
cex_label_gene = 0.7,
showCategory = 5,
cex_label_category = 1.5,
shadowtext = 'category')
Heatmap-like functional classification of downregulated enriched GO terms
heatplot(ego.dn,
showCategory = 20,
foldChange = sig.dn.LFC)
2.4.6 Other functional enrichment tools
Other enrich functions in clusterprofiler
clustrprofiler has other functions
- enrichKEGG KEGG
- enrichWP WikiPathways
- enrichPathway Reactome
- enrichr Enrichr resource
Open source tools
- Pantherdb http://pantherdb.org/
- Enrichr https://maayanlab.cloud/Enrichr/
- David https://david.ncifcrf.gov/
- gprofiler https://biit.cs.ut.ee/gprofiler/ And many more …
Commercial tools
- Ingenuity Pathway Analysis (IPA) https://www.qiagen.com/us/
- Metacore https://portal.genego.com/
2.4.7 sessionInfo()
It is good practice to record the version of R and all tools you are using for reproducibility purposes (and for the methods section in your paper!). R’s sessionInfo() prints all of this information.
sessionInfo()
Key points
- Functional enrichments can guide our intepretation of results in their biological context.
- Functional enrichments can include pathways and gene ontologies