Prepare your data for processing

Questions

How do I set up my base directory?
How do I download the required input files?

In this workshop series we are working with a real dataset from a study of a knockout mouse model by Corley et al. (2016). Before we can get started, we need to download the materials used in this workshop and organise our working space on the command line.

Copying the code from the grey boxes on training materials

In this workshop we need to copy code from the grey boxes in the training materials and run it in the terminal. If you hover your mouse over a grey box on the website, a clipboard icon will appear on the right side. Click on the clipboard logo to copy the code. Test it out with:

ssh training@###.###.###.###

Log in to Nimbus

Go to the terminal window in which you have logged on to Nimbus. If you have accidently closed it, you will need to repeat steps on the setup page again. Check where you are by typing:

pwd

It should return /home/training. To get back here at any point, you can always type:

cd ~

Make your base directory

This is where we will be working for the next 2 days. In your terminal once you have logged into Nimbus and are sure you’re in /home/training, type the following:

mkdir base_directory

Download the files to be used in this workshop

In your terminal, type the following to move into to the base_directory:

cd base_directory

Then copy and paste the following command into your terminal to download the input files:

wget -O working_directory.tar.gz https://cloudstor.aarnet.edu.au/plus/s/xveu7WCIdj7bk6c/download

The files are packaged up to save space. Let’s unpack it:

tar -zxvf working_directory.tar.gz

This creates a directory called working_directory. Move into this directory by running:

cd working_directory

You can list all the files in your current directory by running the list command:

ls -lh

It should return this:

-rw------- 1 training training  940838 Sep 26 08:29 FULL_count_matrix.txt
-rw------- 1 training training    1340 Sep 26 08:28 README
-rwxr-xr-x 1 training training 3566569 Sep 26 08:26 SRR3473984_selected.fastq.gz
-rwxr-xr-x 1 training training 3208125 Sep 26 08:26 SRR3473985_selected.fastq.gz
-rwxr-xr-x 1 training training 3170753 Sep 26 08:26 SRR3473986_selected.fastq.gz
-rwxr-xr-x 1 training training 3488119 Sep 26 08:26 SRR3473987_selected.fastq.gz
-rwxr-xr-x 1 training training 4370755 Sep 26 08:26 SRR3473988_selected.fastq.gz
-rwxr-xr-x 1 training training 3094352 Sep 26 08:26 SRR3473989_selected.fastq.gz
-rw------- 1 training training    4961 Sep 26 08:30 nextflow.config
-rw-r--r-- 1 training training   36091 Sep 27 00:57 rnaseq_DE_analysis_Day2.Rmd
-rw------- 1 training training     330 Sep 26 08:34 samplesheet.csv

The working_directory folder contains the following files:

6 gzipped fastq files: SRR*_selected.fastq.gz
A samplesheet file required by the nf-core/rnaseq pipeline: samplesheet.csv
The nf-core/rnaseq pipeline configuration file: nextflow.config
The genome-wide gene count matrix file for day 2: FULL_count_matrix.txt
The lessons and code for day 2: rnaseq_DE_analysis_Day2.Rmd

Look at the sample sheet

A sample sheet is a standard input rquirement of nf-core pipelines, including nf-core/rnaseq. This sheet contains information on each sample being processed with this pipeline. Take a look at the nf-core/rnaseq documentation for more information. For example:

The sample sheet we are using today contains a header row and one row per sample being processed. It has 4 columns, each separated by the comma. The pipeline will auto-detect whether a sample is single- or paired-end using the information provided in the samplesheet. The samplesheet can have as many columns as you desire, however, there is a strict requirement for the first 4 columns to match those defined in the table below:

Challenge

Can you view the provided sample sheet using the cat command and identify differences between our sheet the one shown in the example screenshot above?

Solution

View samplesheet.csv by running:

cat ~/base_directory/working_directory/samplesheet.csv

Our sample sheet looks like this: Today’s samplesheet

Both sample sheets contain 6 samples. However, the example samplesheet consists of both single- and paired-end data for the control samples and TREATMENT_REP3 has been sequenced twice.

Our sample sheet is a lot simpler, with only single-end reads. Note that the column for the reverse-reads is empty in our sample sheet.

Now that we have prepared our work space and downloaded the required files, we will proceed to running the nf-core/rnseq pipeline. Before proceeding:

Check your path by running:

pwd

This should return: /home/training/base_directory/working_directory. If you are not here, move into the above path by running:

cd /home/training/base_directory/working_directory

Keep your terminal window open and proceed to the next lesson by clicking on Day 1 kick off > Run the pipeline on the menu bar.

Key points

Organise your working space on the command line using directories
The nf-core/rnaseq pipeline requires raw sequence data in the fastq format
We can tell nf-core/rnaseq which fastq files belong to which sample using the sample sheet

All materials copyright Sydney Informatics Hub, University of Sydney