2.2 Running nf-core pipelines
Learning objectives
- Build a basic run script for a nf-core pipeline
- Run an nf-core pipeline
- Understand the need for HPC-specific configurations
2.2.1 Preparing the environment
As mentioned in the previous section, we will be running a small section of the overall sarek workflow: the markduplicates stage. This stage takes an alignment (.bam) file and a reference (.fasta) file as input, specified via a samplesheet: a .csv file that describes each sample, its name, and where to find its associated .bam file.
Exercise: Review the nf-core/sarek inputs
In the parent directory to your current working directory, there is a data/ folder containing all the input files we need for this workshop. Within the bams/ sub-directory is a pre-built CSV samplesheet (samplesheet.csv). Open it up in VSCode and review the contents:
```csv
patient,sample,bam,bai
NA12877,test_sample1,../data/bams/NA12877_chr20-22.bam,../data/bams/NA12877_chr20-22.bam.bai
NA12878,test_sample2,../data/bams/NA12878_chr20-22.bam,../data/bams/NA12878_chr20-22.bam.bai
NA12889,test_sample3,../data/bams/NA12889_chr20-22.bam,../data/bams/NA12889_chr20-22.bam.bai
```
The first line is a header line: it describes the "column names" of the table that the CSV represents. We have four columns in this file:
- `patient`: An identifier for the individual the sample came from. Note that more than one sample may originate from an individual, such as in the case of paired tumor and normal samples.
- `sample`: A unique identifier for the sample. Each line must have a unique value for this column.
- `bam`: The path to the BAM alignment file. The path can be relative to the working directory; we will run this from the part1 directory, so the relative path is ../data/bams/*.bam.
- `bai`: The path to the matched index file for the BAM file. Indexes are vital for being able to quickly search a large sequencing dataset for reads at a specific location in the genome.
We can also see that we have three samples defined in our samplesheet. Samplesheets make it very easy to add new samples to our workflow run, simply by adding a new line and filling out the relevant columns.
Let's now take a look at the BAM files themselves. These are binary files, so we can't view them directly; instead, we can use a tool called samtools to get a quick glance at their contents. Assuming samtools is available on your system (for example, via a module), a command like the following will do the job:
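```bash
# Print the first three alignment records (samtools is assumed to be
# available here, e.g. via "module load samtools")
samtools view ../data/bams/NA12878_chr20-22.bam | head -n 3
```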
This will print out the first three alignment records from the BAM file for sample NA12878_chr20-22:
```
HSQ1004:134:C0D8DACXX:1:1105:9389:51835 163 chr20 70768 6 9S81M1D11M = 70907 231 TTTGGATGAGGCAATCAGACAAGAGAAAGAAATAAAGATCATTTAAATAGGAAGAGAAGAAGTTAAACTATCCCTGTTGGCAGATGACATATCCTATATCT @C@FFFFFGHAHHGIJIIJJIJJJJJJJJIJJJJIGJJJIJJIIGGIIEIGHIFJIJJIGGIIIGGHIJJJGGHHHEEBEEFEDEDDEDCCDCDEDDDECC MD:Z:9G9A8T0G3C0C4G0A7G5C14T6C4^G11 PG:Z:MarkDuplicates RG:Z:C0D8DACXX.1 NM:i:13 AS:i:25 XS:i:36 MQ:i:22 MC:Z:5S92M4S
HSQ1004:134:C0D8DACXX:1:1105:9389:51835 83 chr20 70907 22 5S92M4S = 70768 -231 CTGATAACTTCAGCAAAGTCTCAGGATACAAAATCAATGTGCAAAAATCATTAACATTCTTATACACCAACAGTCAAGCCAGGAGCCAAATCAGGAACACA DDCEEDCEECDEEFFFDFFHGGFEHEIIJJJJIJJIJJIGGGIGJJIIIIIIGDIGIGJIGHF?GCIIHGJJJIJJJJJJJJJJJJJJHHGHHFFFFFCCC MD:Z:32A3T0G7C2G5C21A15 PG:Z:MarkDuplicates RG:Z:C0D8DACXX.1 NM:i:7 AS:i:57 XS:i:55 MQ:i:6 MC:Z:9S81M1D11M
HSQ1004:134:C0D8DACXX:1:2101:9406:116923 163 chr20 81255 60 101M = 81496 342 AATCATACTAGCTCTCATCAGATTGAAATGGCTGAAATGACAGACATAGAATTAATGATCTGGAAGGTAAGGAAGCTCAAGAACATTCAGGAGAAAGTTGA ?B<D;:B?2AF?AGHIBACHH@4C3EHH@BGCDDDGHH@FBD>FB?GHHIE4?*9*??BBBFGE8BE)=@@@GA(.7=AE3AD@>);)6@>CDABD(---; MD:Z:53C47 PG:Z:MarkDuplicates RG:Z:C0D8DACXX.1 NM:i:1 AS:i:96 XS:i:29 MQ:i:60 MC:Z:101M
```
You can see that these files have a tabular structure as well; briefly, the important columns to note are:
- 1: The read name
- 3: The chromosome it aligns to
- 4: The position on the chromosome it aligns to
- 5: The mapping quality: a measure of how confidently the read has been placed at this position in the genome
- 9: The template length, i.e. the distance spanned by the read and its mate (a negative value indicates that this read is the rightmost of the pair)
- 10: The sequence itself
- 11: The per-base quality of the sequence, encoded as ASCII characters
Finally, we'll take a look at the FASTA file. This is a plain text file that holds the full sequence of the reference genome that the sequencing data has been aligned to. You can open this file up in the VSCode editor: it's located at data/ref/Hg38.subsetchr20-22.fasta. You'll notice that the first line starts with a > character, followed by the name of the sequence that follows it; in this case, chr20. You'll also notice that the next 600 lines are just full of N characters; this is a "mask" character, used to block out regions of the genome that cannot be reliably sequenced, as is the case at the chromosome ends. It isn't until line 602 that we see some actual A's, C's, G's and T's. In total, there are more than half a million lines in the file, representing the sequences of human chromosomes 20 to 22. We have subset the genome to just these smallest chromosomes to speed up the example analyses in this workshop.
2.2.2 Write a simple run script
Exercise: Create a run script for nf-core/sarek
In the current working directory, create a new blank file and name it run.sh. You can do this via VSCode's interface by right-clicking on the current directory (part1) in the explorer side bar, clicking on "New File...", writing "run.sh" and hitting the Enter key. You can also do this via the terminal:
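```bash
# Create an empty file named run.sh
touch run.sh
```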
You should also make the file executable by running the following command:
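```bash
# Give the script executable permissions
chmod +x run.sh
```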
Next, we'll start the run command by adding the following lines:
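```bash
#!/bin/bash

# Make the nextflow command available and allow Nextflow to use Singularity
module load nextflow
module load singularity

# Run the main workflow script from the sarek repository
# (this path assumes sarek was cloned into the current working directory)
nextflow run sarek/main.nf
```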
So far, this is pretty straightforward: we first tell the OS that we're running a bash script with the #!/bin/bash shebang line. Then, we load the nextflow and singularity modules; these ensure that the nextflow command is available to us, and that Nextflow is able to use Singularity for running containers. Finally, we run the main.nf Nextflow script within the sarek repository.
As it stands, this won't run: the sarek pipeline requires a samplesheet as input, and we also need to configure a few other parameters to ensure that we only run our small markduplicates stage of the pipeline.
Add a space and a backslash (\) to the last line, indicating that the command continues on the following line, then add the following lines to define the inputs:
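```bash
    --input ../data/bams/samplesheet.csv \
    --fasta ../data/ref/Hg38.subsetchr20-22.fasta \
    --fasta_fai ../data/ref/Hg38.subsetchr20-22.fasta.fai
```

These paths assume the workshop's data/ layout described earlier, with the .fai index sitting alongside the FASTA file under the same name plus a .fai extension.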
Here we've supplied the samplesheet CSV file to the --input parameter, as well as the FASTA file and its paired index (FAI) file to the --fasta and --fasta_fai parameters.
Next, we'll specify that we want to run the pipeline from the markduplicates step, and skip several downstream tools, including baserecalibrator, mosdepth, and samtools. This will have the effect of just running the markduplicates step, followed by a final run of MultiQC; in total we will run only four distinct processes, which will take only a few minutes for our dataset:
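```bash
    --step markduplicates \
    --skip_tools baserecalibrator,mosdepth,samtools
```

As before, add a space and a backslash to the previous last line so the command continues onto these new lines.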
Wrapping up, we will tell the pipeline where to place our outputs, as well as configure a few additional parameters to make things run a bit smoother in our small example:
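```bash
    --outdir results \
    --no_intervals true \
    --igenomes_ignore true \
    -resume
```

Here, results is just an arbitrary name for the output directory; feel free to choose another. Note also that -resume takes a single dash: it is an option for Nextflow itself, whereas the double-dashed parameters belong to the sarek pipeline.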
The --no_intervals true parameter tells sarek that we don't want to split our data up into distinct genomic intervals. Interval splitting is a very useful feature for large datasets, as it lets us parallelise large genomic sequencing data into chunks that can be processed simultaneously - taking advantage of the parallel nature of HPCs! However, in our very small example, it would actually slow things down by adding more jobs that each need to be scheduled, started, and finished.
The next line, --igenomes_ignore true, stops the pipeline from downloading some additional reference files from public databases; again, this can be useful in a real run, but for our purposes it is more of a nuisance.
Finally, the -resume flag tells Nextflow to reuse the outputs of previously successful tasks where appropriate, rather than running them again every time we run the pipeline.
At the end, the run.sh script should look like the following:
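```bash
#!/bin/bash

module load nextflow
module load singularity

nextflow run sarek/main.nf \
    --input ../data/bams/samplesheet.csv \
    --fasta ../data/ref/Hg38.subsetchr20-22.fasta \
    --fasta_fai ../data/ref/Hg38.subsetchr20-22.fasta.fai \
    --step markduplicates \
    --skip_tools baserecalibrator,mosdepth,samtools \
    --outdir results \
    --no_intervals true \
    --igenomes_ignore true \
    -resume
```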
2.2.3 Running the pipeline
Exercise: Run the workflow
We've created a full run script for our pipeline, so let's try it out and see what happens!
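From the part1 directory, execute the script in your terminal:

```bash
./run.sh
```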
You should find that the pipeline quickly fails!
```
Execution cancelled -- Finishing pending tasks before exit
-[nf-core/sarek] Pipeline completed with errors-
ERROR ~ Error executing process > 'NFCORE_SAREK:SAREK:BAM_MARKDUPLICATES:GATK4_MARKDUPLICATES (test_sample2)'

Caused by:
  Process `NFCORE_SAREK:SAREK:BAM_MARKDUPLICATES:GATK4_MARKDUPLICATES (test_sample2)` terminated with an error exit status (127)

Command executed:

  gatk --java-options "-Xmx24576M -XX:-UsePerfData" \
      MarkDuplicates \
      --INPUT NA12878_chr20-22.bam \
      --OUTPUT test_sample2.md.bam \
      --METRICS_FILE test_sample2.md.cram.metrics \
      --TMP_DIR . \
      --REFERENCE_SEQUENCE Hg38.subsetchr20-22.fasta \
      -REMOVE_DUPLICATES false -VALIDATION_STRINGENCY LENIENT

  # If cram files are wished as output, the run samtools for conversion
  if [[ test_sample2.md.cram == *.cram ]]; then
      samtools view -Ch -T Hg38.subsetchr20-22.fasta -o test_sample2.md.cram test_sample2.md.bam
      rm test_sample2.md.bam
      samtools index test_sample2.md.cram
  fi

  cat <<-END_VERSIONS > versions.yml
  "NFCORE_SAREK:SAREK:BAM_MARKDUPLICATES:GATK4_MARKDUPLICATES":
      gatk4: $(echo $(gatk --version 2>&1) | sed 's/^.*(GATK) v//; s/ .*$//')
      samtools: $(echo $(samtools --version 2>&1) | sed 's/^.*samtools //; s/Using.*$//')
  END_VERSIONS

Command exit status:
  127

Command output:
  (empty)

Command error:
  .command.sh: line 8: gatk: command not found

Work dir:
  /scratch/pawsey1227/michaelgeaghan/nextflow-on-hpc-materials/part1/work/60/7651bb1e0034589dc27b50b829d53b

Tip: you can replicate the issue by changing to the process work dir and entering the command `bash .command.run`

 -- Check '.nextflow.log' file for details
ERROR ~ Pipeline failed. Please refer to troubleshooting docs: https://nf-co.re/docs/usage/troubleshooting

 -- Check '.nextflow.log' file for details
```
But why did this happen? You should see in the error output a message that one or more tools couldn't be found:
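```
.command.sh: line 8: gatk: command not found
```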
It's failing because we haven't yet told Nextflow to use Singularity, so each task is looking for its software installed directly on the host system and failing to find it.
There's another problem with our run, though: we haven't actually told Nextflow to use the HPC. Instead, it's trying to run everything on the login node, and that is not a good idea at all. Once again, login nodes have limited resources and should never be used for running anything computationally intensive. At best, our jobs will fail because not enough resources are available; at worst, we will slow down the entire login node for everyone else using the system!
In the next section, we will build up a configuration file that tells Nextflow both to use Singularity and to talk to the HPC's scheduler to submit jobs to the compute nodes.