Building a genome index

Overview

Teaching: 30 min
Exercises: 0 min
Questions
  • What is the first step of mapping data?

  • How do I find reference genomes and transcriptomes for my species?

Objectives
  • Use STAR to build a genome index for mapping.

Background

In order to quantitate how many reads map to which genes in the genome, we need 3 things:

  1. Good-quality read data (which we assessed in step 1)
  2. An aligner. When working with transcriptomic RNA-seq data, we recommend the following aligners:
    • STAR - developed by Alex Dobin - for short-read whole-transcriptome sequencing data, which we will use in this workshop
    • BWA - a genomic/DNA aligner, developed by Heng Li - for short-read sequencing data of small RNA (ex - miRNA libraries), where splicing is not expected to be
    • minimap2 - a transcriptomic aligner, also developed by Heng Li - for long read sequencing data such as Oxford Nanopore and PacBio. Unlike “standard” DNA/RNA aligners, minimap2 is able to handle the high error rate of these technologies.

Both STAR and minimap2 are what are called “splicing-aware” aligners, in that they are designed to align RNA-seq data, which needs to accomodate for (and not penalise too heavily) the natural “gaps” that occur when aligning RNA to genomic DNA sequence as a result of splicing.

  1. A reference genome The genomes of many species have already been sequenced, and RNA-seq is often done on samples that come from these. We will focus the analysis below on working with human data, for whom the genome was sequenced in 2001. Prior to mapping, most aligners require you to construct and index the genome, so that the aligner can quickly and efficiently retrieve reference sequence information.

Note

If the genome of your organism is not available, you can (1) carry out de novo transcripome assembly and analysis, for example using Trinity, and/or (2) use the genome of a closely related species, adjust the mapping parameters to allow for more mismatches, and map and quantitate to that organism as below.

Where do I get the genome for my organism of interest?

If you do bioinformatics for any length of time, you’ll find that many people working on similar ideas without an overarching rulebook to standardize their activity results in a LOT of challenges with formats. One of the best known, and most irritating, has to do with chromosomes and their naming conventions. The UCSC Genome Browser and NCBI, based in the US, use the prefix “chr” before the chromosome name, so chromosome 1 becomes “chr1”, chromosome 2 - “chr2” etc. Ensembl and EBI, based in the UK/EU, do not add a prefix to their chromosome names, so “1” and “2” are the names of the exact same bits of DNA… Now, you’d think we could use a tool to just add “chr” to Ensembl coordinates to get the UCSC ones, but they can also join the scaffolds together in a different way, call them by different names, and use 1 vs 0 based coordinates to specify a gene locations (which in turn are different between the two). So, because annotations (including gene coordinates) usually use a specific convention (depending on where you’re downloading the data from), they must be matched to your genome of interest. As a general rule of thumb, we recommend:

  1. For human and mouse data, use the GENCODE annotations. You can read more about the GENCODE project in detail here, but briefly it is a project that aims to create a high-quality, reliable annotation of mouse and human genes. They also provide genome reference fasta files which match the chromosome names in the transcriptome annotations. This is the genome/annotation source we will use in this workshop.
  2. If you’re not working with mouse or human, there are two primary repositories of genome information: ENSEMBL and UCSC. If your organism is represented in only one of them, it’s pretty simple to choose which one to use…

  3. If your organism is available in both, we would recommend using ENSEMBL for gene annotation files + read mapping + gene quantification, for differential expression and subsequent analysis (i.e. use Ensembl genome fasta and gtf file). If you plan to visualise your data in UCSC (which we highly recommend!), we would suggest downloading a genome fasta file from UCSC, map to that using STAR to generate a “wig” file, convert that to a bigWig, and visualise in UCSC. If you then make an AMAZING discovery using the differential gene expression with the Ensembl annotation, but don’t see it supported at ALL in UCSC, you’ll know that there might be something strange going on, and you should did deeper before publishing that Science paper…

To construct an index of the human reference genome using STAR, we need to carry out the following steps:

1. Download the data: fasta genome sequence and gtf annotation file

We will use the human gencode 29 comprehensive annotation, “PRI” from the primary chromosomes (this includes scaffolds, but not haplotypes and assembly patches). Depending on your project, you may want to use the smaller “CHR” annotation, which excludes scaffolds as well.

The following script (please do not run) would have downloaded the entire genome and the correct annotation for you:

mkdir genome
cd genome
# download the "Genome sequence, primary assembly (GRCh38)" fasta file
wget ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_29/GRCh38.primary_assembly.genome.fa.gz
# filter it as described in the note below
# download the annotations that correspond to it 
wget ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_29/gencode.v29.annotation.gtf.gz

However, as this is a training session, in the interests of having the tasks finish in a reasonable amount of time, we will use only the data/annotations for chromosome 19 to map to (not to worry: we will provide a full count table for you for the differential expression analysis).

# make sure we're in the right spot!
pwd
# should say something like "/project/Training/darya/"
ls
# should have "data" folder

Once you’re in the correct folder, make a genome directory, move into it, and download the necessary files from Figshare. Because of Figshare vagaries, you will need to rename the files you download prior to ungzipping them; you don’t need to do so when downloading from Ensembl/UCSC.

# make a new directory called genome 
mkdir genome
cd genome

# download the "Genome sequence, primary assembly (GRCh38)" fasta file
wget https://ndownloader.figshare.com/files/14669702
mv 14669702 GRCh38.primary_assembly.genome.fa.chr19.gz

# download the annotations that correspond to it 
wget https://ndownloader.figshare.com/files/14658830
mv 14658830 gencode.v29.primary_assembly.annotation.chr19.gtf.gz

Write a script to build the genome index file

We will use STAR to index the genome fasta file we just downloaded. We highly recommend you read and refer to the STAR manual when doing your own RNA-seq work, as it explains the meaning of all of the many parameters that are essential to produce an accurate, reliable STAR alignment.

For example, when generating a reference genome index file, it is useful to understand the --sjdbOverhang parameter, which “specifies the length of the genomic sequence around the annotated junction to be used in constructing the splice junctions database. Ideally, this length should be equal to the ReadLength-1, where ReadLength is the length of the reads. For instance, for Illumina 2x100b paired-end reads, the ideal value is 100-1=99. In case of reads of varying length, the ideal value is max(ReadLength)-1” (quoted from the STAR manual).

In our case, as we saw from fastqc, the read length was 100 nt, so the default value of 100 is suitable. However, for more “modern” Illumina data (PE 150+), this value is likely to be too short, and should be adjusted!

Unfortunately, as of right now, STAR needs us to ungzip the genome files. In order to save space, we recommend ungzipping both the fasta and the gtf files, and then re-gzipping the fasta once the genome generation step is done. We will need the unzipped gtf file later.

gunzip GRCh38.primary_assembly.genome.fa.chr19.gz
# wait
gunzip gencode.v29.primary_assembly.annotation.chr19.gtf.gz
# after genome has been generated
gzip GRCh38.primary_assembly.genome.fa.chr19

To make the genome index, we need to run the following pbs script (call it 1903_genomeGenerate.pbs and store it in a directory you create called scripts):

#!/bin/bash
# Building a star index file

#PBS -P RDS-CORE-SIH4HPC-RW
#PBS -N starIndex
#PBS -l select=1:ncpus=2:mem=10gb
#PBS -l walltime=24:0:0

# Load modules

module load star
# set the runThreadN to be one less than your NCPU request!
# STAR --runThreadN 1 --runMode genomeGenerate --genomeDir $GENOMEDIR/STAR --genomeFastaFiles $GENOMEDIR/GRCh38.primary_assembly.genome.chr19.fa --sjdbGTFfile $GENOMEDIR/gencode.v29.primary_assembly.annotation.chr19.gtf

We would then submit it to the queue as qsub 1903_genomeGenerate.pbs.

Challenge

Write a pbs script to build an index for chromosome 19, and submit it to the training queue

Solution

? Tips if needed

Once this job has successfully finished, we should have a STAR folder in the genome directory, with the following files:

chrLength.txt
chrNameLength.txt
chrName.txt
chrStart.txt
exonGeTrInfo.tab
exonInfo.tab
geneInfo.tab
Genome
genomeParameters.txt
SA
SAindex
sjdbInfo.txt
sjdbList.fromGTF.out.tab
sjdbList.out.tab
transcriptInfo.tab

Note

If we were building a fullscale human genome (DO NOT DO THIS IN THE CLASS), this script/resources would usually be suitable:

#!/bin/bash
# Building a star index file

#PBS -P YOURPROJECT
#PBS -N starIndex
#PBS -l select=1:ncpus=24:mem=60gb
#PBS -l walltime=6:0:0

# Load modules

module load star
GENOMEDIR="/home/username/scratch/190321_RNAseqR/genome/"
mdkir -p $GENOMEDIR/STAR
STAR --runThreadN 23 --runMode genomeGenerate --genomeDir $GENOMEDIR/STAR --genomeFastaFiles $GENOMEDIR/GRCh38.primary_assembly.genome.fa --sjdbGTFfile $GENOMEDIR/gencode.v29.primary_assembly.annotation.gtf

When I ran it, the following resources were sufficient:

-- Job Summary -------------------------------------------------------
Job Id: 3075585.pbsserver for user SIHuser in queue normal
Job Name: starIndex
Project: SIHproject
Exit Status: 0
Job run as chunks (hpc160:ncpus=24:mem=62914560kb)
Walltime requested:   6:00:00 :      Walltime used:   00:28:27
                               :   walltime percent:       8.0%
-- Nodes Summary -----------------------------------------------------
-- node hpc160 summary
    Cpus requested:         24 :          Cpus Used:      10.55
          Cpu Time:   05:00:09 :        Cpu percent:      44.0%
     Mem requested:     60.0GB :           Mem used:     56.4GB
                               :        Mem percent:      94.0%

The following is what we would expect as standard output in the output log file:

Mar 22 10:29:43 ..... started STAR run
Mar 22 10:29:43 ... starting to generate Genome files
Mar 22 10:30:47 ... starting to sort Suffix Array. This may take a long time...
Mar 22 10:31:01 ... sorting Suffix Array chunks and saving them to disk...
Mar 22 10:48:54 ... loading chunks from disk, packing SA...
Mar 22 10:50:00 ... finished generating suffix array
Mar 22 10:50:00 ... generating Suffix Array index
Mar 22 10:53:38 ... completed Suffix Array index
Mar 22 10:53:38 ..... processing annotations GTF
Mar 22 10:53:54 ..... inserting junctions into the genome indices
Mar 22 10:56:33 ... writing Genome to disk ...
Mar 22 10:56:59 ... writing Suffix Array to disk ...
Mar 22 10:58:00 ... writing SAindex to disk
Mar 22 10:58:07 ..... finished successfully

Key Points

  • Mapping RNA-seq data requires using splicing-aware mappers.

  • The first step of mapping sequencing data is to build a genome index.

  • This involves figuring out which reference file and annotation you need, and making sure the chromosome names in them match