Assessing the quality of your sequencing data

Overview

Teaching: 30 min
Exercises: 0 min
Questions
  • What is the first step of any sequencing data analysis?

Objectives
  • Assess the quality of our raw sequencing data

Can I use my data at all?

The first step of any next-generation sequencing (NGS) data analysis is to assess the quality of the generated reads, so that we know whether the sequencer was working properly and whether we can trust the data. We use a tool called fastqc to calculate summary statistics on base quality at each position across all of the generated reads. To run it:

  1. Open your terminal application
  2. Type the following to log on to the University’s Artemis HPC. Artemis runs a “standard” Linux environment, which we will navigate and interact with using basic Unix commands. [If you need additional support in working with such an interface, we highly recommend attending one of our Unix courses, followed by our HPC series].
    ssh yourunikey@hpc.sydney.edu.au
    #enter your password when prompted
    
  3. Type the following to add fastqc to the list of software that is available to us on the HPC:
    module load fastqc
    

Note for the curious

If you’ve used a terminal before, on your local machine or on a “normal” server, you may wonder why we couldn’t just run fastqc as an executable directly from the command line. On large, multi-user systems (like Artemis, or most other University HPC clusters you’ll encounter in your career), specialised software that is not used by everybody (so pretty much all bioinformatics tools) and software whose versions are updated frequently (like R or Python) are not “linked” to your environment by default, and have to be explicitly added to your $PATH via the module load command. This means that if you have developed a tool that works only with R version 3.1.2, for example, you can module load R/3.1.2 and be confident that your tool will work, while your colleague can use version 3.5, which is required for a package they need to run. Other useful commands for working with modules are module avail, which lists the modules installed on the system; module list, which lists the modules currently loaded in your environment; and module purge, which unloads all of the modules you currently have loaded. Purging can be useful if you’re having unexplained issues with your environment.
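
As a rough sketch of how these commands fit together in one session: the snippet below assumes the cluster provides the standard module command and a module named fastqc (as this lesson does for Artemis). The guard means it only calls module where that command exists, so it is harmless to paste on a machine without it. MODULE_STATUS is just an illustrative variable name.

```shell
# Sketch: inspect and change the modules in the current session.
# Assumes a module called "fastqc" exists on the cluster.
if type module >/dev/null 2>&1; then
    module avail fastqc 2>&1    # which fastqc versions are installed
    module load fastqc          # add the default version to $PATH
    module list 2>&1            # modules active in this session
    command -v fastqc           # confirm where the executable now lives
    MODULE_STATUS="available"
else
    echo "No 'module' command here; log in to the HPC first." >&2
    MODULE_STATUS="unavailable"
fi
```

If command -v fastqc prints nothing after the load, the module was probably not found under that name, and module avail is the place to look for the correct one.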

  4. Download the data to the HPC. In this exercise, we will store the data in the /project/Training/ folder. You will have a /project/RDS-YOURPROJECTID/ folder for your actual “production” projects, after you have submitted a Research Data Management Plan to access Artemis or RDS for HPC/storage of your research data.
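# move to the shared training area and create your own working directory
cd /project/Training/
mkdir yourtrainingkey
cd yourtrainingkey
# the PBS script further down expects the reads in a data/ subdirectory
mkdir data
cd data
wget FIXME
tar -xvzf FIXME
# return to /project/Training/yourtrainingkey
cd ../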
  5. To run fastqc we will use a PBS script, which specifies both the resources we’d like to use and the actual code we’d like to run. PBS stands for “portable batch system”, which is the name of the system (PBS Pro) that manages the queue of jobs on Artemis. Other queuing systems include SLURM and Torque, which use slightly different commands to achieve the same thing (so if you’ve used one, you can figure out how to use the others).
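
Before writing the job script, it can be worth confirming that the download from the previous step contains what you expect. The sketch below is one way to do that, not part of the original lesson: count_reads is a hypothetical helper that relies on each FASTQ record spanning exactly four lines. The demo file is made up so the snippet runs anywhere.

```shell
# count_reads FILE: print the number of reads in a gzipped FASTQ file.
# Each FASTQ record spans exactly four lines, so reads = lines / 4.
count_reads() {
    local lines
    lines=$(gzip -dc "$1" | wc -l)
    echo $(( lines / 4 ))
}

# Self-contained demo on a two-read file in a temporary directory
# (file name and contents are invented for illustration).
demo_dir=$(mktemp -d)
printf '@read1\nACGT\n+\nIIII\n@read2\nTTGA\n+\nIIII\n' | gzip > "$demo_dir/demo.fastq.gz"
count_reads "$demo_dir/demo.fastq.gz"    # prints 2

# On the HPC you might loop over the real files instead:
#   for f in /project/Training/yourtrainingkey/data/*fastq.gz; do
#       echo "$f: $(count_reads "$f") reads"
#   done
```

For paired-end data, the two files of a pair should report the same number of reads; a mismatch usually points to an incomplete download.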

Below is an example of a basic PBS script that will run fastqc on our dataset. Use vim or nano to save this as YYMMDD_chr19fastqc.pbs, where YYMMDD is the year, month and day of today’s date.

#! /bin/bash

# specific to pbs
#PBS -P YOURPROJECTNAME
#PBS -N chr19fastqc
#PBS -l select=1:ncpus=8:mem=10gb
#PBS -l walltime=12:0:0

module load fastqc

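# job-specific settings: change these paths for your own data
INPUTDIR="/project/Training/yourtrainingkey/data"
# keep this equal to the ncpus value requested above
NCPU=8
OUTDIR="/project/Training/yourtrainingkey/fastqc_chr19"

# creates the output directory if it's not already there
mkdir -p "$OUTDIR"
cd "$OUTDIR"

# -t: number of files to process in parallel; -o: where to write the reports
fastqc -t "$NCPU" -o "$OUTDIR" "$INPUTDIR"/*fastq.gz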
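
A note on choosing the thread count: according to fastqc's help text, -t sets how many files are processed simultaneously, so threads beyond the number of input files sit idle. Each thread also reserves a few hundred megabytes of memory (the exact figure depends on the fastqc version). The sketch below shows one possible way to cap the thread count; pick_threads is a hypothetical helper, not something fastqc or PBS provides.

```shell
# pick_threads NCPUS NFILES: print the smaller of the two numbers.
# fastqc works on one file per thread, so extra threads beyond the
# number of input files would not speed anything up.
pick_threads() {
    if [ "$1" -lt "$2" ]; then
        echo "$1"
    else
        echo "$2"
    fi
}

pick_threads 8 2     # two input files: prints 2
pick_threads 8 20    # twenty input files: prints 8, the CPUs requested

# Inside the PBS script this could be used as, for example:
#   NFILES=$(ls "$INPUTDIR"/*fastq.gz | wc -l)
#   fastqc -t "$(pick_threads "$NCPU" "$NFILES")" -o "$OUTDIR" "$INPUTDIR"/*fastq.gz
```

With only a handful of files, requesting fewer CPUs in the select line will often get the job through the queue sooner.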

To submit it to the training queue, we would do the following:

qsub YYMMDD_chr19fastqc.pbs

We can run the following command to see whether our job is in the queue, is running, or has completed successfully.

qstat
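
Once a job has finished it drops out of the default qstat listing, so it helps to know where its log files end up. By default, PBS writes the job's standard output and standard error to files named after the job name and the numeric part of the job ID, in the directory you submitted from. The sketch below derives those names; job_logs is a hypothetical helper and the job ID is a made-up example of what qsub prints.

```shell
# job_logs JOBNAME JOBID: print the default stdout and stderr file names
# for a PBS job, e.g. job ID "1234567.pbsserver" has sequence number 1234567.
job_logs() {
    local seq="${2%%.*}"    # strip the server suffix after the first dot
    echo "$1.o$seq $1.e$seq"
}

job_logs chr19fastqc 1234567.pbsserver
# prints: chr19fastqc.o1234567 chr19fastqc.e1234567

# On the HPC you could capture the real ID and narrow the qstat output:
#   JOBID=$(qsub YYMMDD_chr19fastqc.pbs)
#   qstat -u "$USER"     # only your jobs
#   qstat -x "$JOBID"    # include finished jobs, if job history is enabled
```

If the job ends sooner than expected, the .e file is usually the first place to look for the error message.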

After the job has completed, we can cd to the fastqc_chr19 directory and see the output files. The easiest way to view them is to either:

  1. Download them to your local machine and open them in your normal web browser:
# on your local machine
cd whereyouwantthefiles
scp "yourunikey@hpc.sydney.edu.au:/project/Training/yourtrainingkey/fastqc_chr19/*" .
  2. Use a tool like multiqc to visualise them together with other QC outputs.
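
For the second option, a minimal sketch of a multiqc run is shown below. It assumes multiqc is installed wherever you run it (on a cluster it may first need to be loaded as a module, which is an assumption here) and that the fastqc reports are in the directory used in this lesson. The guard skips the run when either is missing; FASTQC_DIR and MULTIQC_STATUS are illustrative variable names.

```shell
# Sketch: summarise every fastqc report in one directory with multiqc.
FASTQC_DIR="${FASTQC_DIR:-/project/Training/yourtrainingkey/fastqc_chr19}"

if command -v multiqc >/dev/null 2>&1 && [ -d "$FASTQC_DIR" ]; then
    # scans the directory for known QC outputs and writes a combined
    # multiqc_report.html into the chosen output directory
    multiqc "$FASTQC_DIR" -o "$FASTQC_DIR/multiqc"
    MULTIQC_STATUS="ran"
else
    echo "multiqc or $FASTQC_DIR not found; skipping." >&2
    MULTIQC_STATUS="skipped"
fi
```

The combined report is a single HTML file, so it can be downloaded with the same scp approach as the individual fastqc reports.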

Challenge

Download one of the fastqc reports.

  1. What do you think each of the components means? Refer to the documentation if needed.
  2. Do you think this dataset is of “good” quality? How can you tell?
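
As a command-line cross-check of what the report's pass/warn/fail icons show, each fastqc .zip archive also contains a summary.txt file with one tab-separated line per analysis module (verdict, module name, input file). The sketch below tallies the verdicts; tally_summary is a hypothetical helper, and the demo summary is invented, not a real result.

```shell
# tally_summary FILE: count PASS/WARN/FAIL verdicts in a fastqc summary.txt,
# whose lines are tab-separated: verdict, module name, input file name.
tally_summary() {
    cut -f1 "$1" | sort | uniq -c | awk '{print $2 "=" $1}' | paste -sd' ' -
}

# Demo on a made-up three-line summary.
demo_summary=$(mktemp)
printf 'PASS\tBasic Statistics\tdemo.fastq.gz\nWARN\tPer base sequence content\tdemo.fastq.gz\nPASS\tAdapter Content\tdemo.fastq.gz\n' > "$demo_summary"
tally_summary "$demo_summary"    # prints: PASS=2 WARN=1

# For a real report (the archive name here is a placeholder):
#   unzip -p yoursample_fastqc.zip '*/summary.txt' > summary.txt
#   tally_summary summary.txt
```

A WARN or FAIL is a prompt to look at the corresponding plot, not an automatic verdict on the data: some library types are expected to trip certain modules.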

Key Points

  • Running fastqc to check read quality is the first step in any sequencing data analysis that starts from fastq files