# Processing PacBio reads
This workflow takes two PacBio full-length 16S metabarcoding samples, in the format we receive from the sequencing facility, through adapter trimming and the Bigelow edna-dada2 workflow to generate an ASV table, a taxonomic assignment table, and a fasta file of ASV sequences. The workflow assumes you have a user account on Charlie and access to the Bigelow GitHub. You will need some familiarity with the command line and bash; for a good introduction, see Mike Lee's amazing site. If this is your first time working with metabarcoding data, we encourage you to go through the tutorial from last Fall's workshop before diving into HPCC analyses. If you need assistance, please use the issue tracker to post questions or bugs.
These instructions assume you have access to Charlie and are connected via the Bigelow network or VPN (most cases). For other connections please see instructions here.
This pipeline is under development, and important updates are made on a regular basis. When running it for the first time, or after a hiatus, you must update the packages before running the pipeline. This can be done with the following commands on Charlie (note: this can take several minutes to finish). Also note: the pipeline expects gzip-compressed samples, so please ensure your fastq files are gzip-compressed before starting.
Open a terminal (or Windows Subsystem for Linux, or Command Prompt) and connect to Charlie:
```shell
ssh username@cfe.bigelow.org  # replace username with your username
# enter your password
module load dada2
Rscript /mnt/storage/data/edna/packages/edna_user_installer.R charlier auntie dadautils
module purge
```
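Because the pipeline requires gzip-compressed fastq files, it can be worth checking a directory of raw reads before starting. The helper below is an illustrative sketch only (it is not part of the edna-dada2 pipeline); it relies on `gzip -t`, which exits non-zero for files that are not valid gzip data.

```shell
# check_gzipped: report uncompressed or corrupt fastq files in a directory.
# Illustrative helper only; not part of the edna-dada2 pipeline.
check_gzipped() {
  local dir="$1" bad=0
  for f in "$dir"/*.fastq; do
    [ -e "$f" ] || continue              # glob matched nothing
    echo "not compressed: $f"            # fix with: gzip "$f"
    bad=1
  done
  for f in "$dir"/*.fastq.gz; do
    [ -e "$f" ] || continue
    gzip -t "$f" 2>/dev/null || { echo "corrupt gzip: $f"; bad=1; }
  done
  return $bad
}
# usage: check_gzipped raw_reads
```

The function returns non-zero if any file needs attention, so it can also be used in a conditional (`check_gzipped raw_reads || echo "fix reads first"`).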
The workflow described below assumes you use the following directory structure:
```
dada_pacbio
├── raw_reads                  # directory with raw reads
├── process                    # output directory (created by the pipeline if it doesn't exist)
├── job_submission.sh          # script to submit the workflow to Charlie's job scheduler
├── pacbio_preprocess.R        # the edna-dada2 preprocess workflow R script
├── pacbio_preprocess.yaml     # the edna-dada2 preprocess workflow yaml configuration file
├── pacbio_postprocess.R       # the edna-dada2 postprocess workflow R script
└── pacbio_postprocess.yaml    # the edna-dada2 postprocess workflow yaml configuration file
```
If you need to create this structure, navigate to where you would like to work on Charlie (typically your home directory or a directory under your name in a shared space) and type (or copy and paste) the following commands:
```shell
mkdir dada_pacbio
cd dada_pacbio
mkdir raw_reads
mkdir process
cp /mnt/storage/data/edna/packages/edna-dada2/job_submission.sh .
cp /mnt/storage/data/edna/packages/edna-dada2/3step/pacbio_preprocess.R .
cp /mnt/storage/data/edna/packages/edna-dada2/3step/pacbio_preprocess.yaml .
cp /mnt/storage/data/edna/packages/edna-dada2/3step/pacbio_postprocess.R .
cp /mnt/storage/data/edna/packages/edna-dada2/3step/pacbio_postprocess.yaml .
cp /mnt/storage/data/edna/dada/example/pacbio/*.gz raw_reads/
```
You will need to edit several lines of the configuration file. Open it with nano or another editor:
```shell
pwd  # copy the full path this command returns for the next step
nano pacbio_preprocess.yaml
# edit data_path: /full/path/to/dada_pacbio
# edit email:
# edit the primer: fields to be sure your primer sequences are listed
# edit minLen and maxLen to match your amplicon size
```
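For orientation, the edited fields of pacbio_preprocess.yaml might look roughly like the sketch below. The primer sub-field names and all values shown are illustrative assumptions, not the authoritative schema: keep the keys already present in your copy of the file, and substitute the primer sequences and amplicon lengths for your own run.

```yaml
data_path: /full/path/to/dada_pacbio   # the path returned by pwd above
email: username@bigelow.org            # where job notifications go
primer:                                # sub-field names here are assumptions
  forward: AGRGTTYGATYMTGGCTCAG        # example full-length 16S forward primer (27F)
  reverse: RGYTACCTTGTTACGACTT         # example full-length 16S reverse primer (1492R)
minLen: 1000                           # adjust both to match your amplicon size
maxLen: 1600
```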
```shell
nano job_submission.sh
# edit #PBS -M to add your email
# edit data_path=/full/path/to/dada_pacbio
# edit the Rscript filenames so the two lines read:
#   Rscript $data_path"/pacbio_preprocess.R" $data_path"/pacbio_preprocess.yaml"
#   Rscript $data_path"/pacbio_postprocess.R" $data_path"/pacbio_postprocess.yaml"
# make sure the preprocess line is not commented out and the postprocess line is commented out
```
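After editing, the two Rscript lines of job_submission.sh should look like this for the preprocess run (a sketch of just those two lines; leave the rest of the script, including the #PBS directives, as it is):

```shell
Rscript $data_path"/pacbio_preprocess.R" $data_path"/pacbio_preprocess.yaml"
# Rscript $data_path"/pacbio_postprocess.R" $data_path"/pacbio_postprocess.yaml"
```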
```shell
qsub job_submission.sh
```
Check on the job by typing qstat. You should see your username listed next to your job. If many jobs are listed, use the following to show only yours (replace username with your username). The job will show R in the S (status) column if it is running as expected, or E if there are errors. If you get an error, check the files ending in run_dada2_pipeline.e[jobid]
```shell
qstat -u username
```
The job should finish in <10 minutes. Results and a log will appear in the process directory. To look at the results:
```shell
cd process
less filter_and_trim.csv  # check filtering results
# download and view quality_profiles.pdf (use sftp; see below for an example)
```
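filter_and_trim.csv from dada2's filterAndTrim step typically has one row per sample with the read counts before and after filtering. Assuming that layout (sample name in column 1, reads in before filtering in column 2, reads remaining in column 3 of a comma-separated file with one header row), a quick retention summary can be computed with awk. This is an illustrative helper, not part of the pipeline:

```shell
# retention_summary: percent of reads surviving filtering, per sample.
# Assumes columns: sample, reads.in, reads.out (comma-separated, one header row).
retention_summary() {
  awk -F, 'NR > 1 { printf "%s: %.1f%% of reads kept\n", $1, 100 * $3 / $2 }' "$1"
}
# usage: retention_summary filter_and_trim.csv
```

If retention is very low, revisit the minLen/maxLen and primer settings before moving on.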
If you are happy with the results so far you are ready to run the final postprocess step.
```shell
cd ..  # get back to dada_pacbio
nano pacbio_postprocess.yaml
# edit data_path: to point to the dada_pacbio directory
# edit input_path: to point to the filtN directory in your dada_pacbio/process directory
# edit output_path: to point to the dada_pacbio/process directory
# edit truncLen: to point to the truncLen.csv file in your dada_pacbio/process directory
# edit refFasta: to point to the taxonomic database of your choice
```
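The edited fields of pacbio_postprocess.yaml might look roughly like this sketch. All values are illustrative placeholders; keep the keys already present in your copy, and point refFasta at the reference database you actually want to use:

```yaml
data_path: /full/path/to/dada_pacbio
input_path: /full/path/to/dada_pacbio/process/filtN
output_path: /full/path/to/dada_pacbio/process
truncLen: /full/path/to/dada_pacbio/process/truncLen.csv
refFasta: /path/to/your/reference_database.fasta   # placeholder; your chosen taxonomic database
```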
```shell
nano job_submission.sh
# make sure the postprocess script is not commented out and the preprocess script is commented out
qsub job_submission.sh
```
The job should finish in <10 minutes. Results and log will appear in the process directory.
If you used the default settings, the first 10 lines of seqtab-nochim.csv (`head -10 seqtab-nochim.csv`) should be:
| ASV | 18S_green_algae_1.hifi_reads.fastq.gz | 18S_green_algae_2.hifi_reads.fastq.gz |
|---|---|---|
| ASV_01 | 19228 | 26 |
| ASV_02 | 2117 | 3365 |
| ASV_03 | 0 | 2594 |
| ASV_04 | 700 | 1742 |
| ASV_05 | 1531 | 0 |
| ASV_06 | 983 | 0 |
| ASV_07 | 256 | 623 |
| ASV_08 | 842 | 0 |
| ASV_09 | 723 | 0 |
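As a quick sanity check on a table like this, per-sample read totals can be summed directly from seqtab-nochim.csv with awk. The helper below is a sketch (not part of the pipeline) and assumes the layout shown above: ASV IDs in the first column, one column of counts per sample, comma-separated, with a header row:

```shell
# sample_totals: total read count per sample column in an ASV table CSV.
sample_totals() {
  awk -F, '
    NR == 1 { for (i = 2; i <= NF; i++) name[i] = $i; n = NF }
    NR > 1  { for (i = 2; i <= n; i++) sum[i] += $i }
    END     { for (i = 2; i <= n; i++) printf "%s: %d\n", name[i], sum[i] }
  ' "$1"
}
# usage: sample_totals seqtab-nochim.csv
```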
Files can be moved between your computer and Charlie using sftp. Open a new tab in terminal and type:
```shell
pwd  # copy the path to the process directory
sftp user@cfe.bigelow.org
get /full/path/to/dada_pacbio/process/seqtab-nochim.csv  # replace the path with your own
get /full/path/to/dada_pacbio/process/ASV_sequences.fasta
get /full/path/to/dada_pacbio/process/ASV_taxa.csv
```
Create a directory called dada_pacbio_local on your computer and move the files into it (sftp downloads to your user directory unless you specify a path). Download the info.csv file and visualization_script.R from GitHub and place them in the dada_pacbio_local directory. Open visualization_script.R in RStudio and follow the script to create barplots and explore the dataset.