# Processing PacBio reads
This workflow takes two PacBio full-length 16S metabarcoding samples, in the format we receive from the sequencing facility, through adapter trimming and the Bigelow edna-dada2 workflow to generate an ASV table, a taxonomic assignment table, and a fasta file of ASV sequences. The workflow assumes you have a user account on Charlie and access to the Bigelow GitHub. You will need some familiarity with the command line and bash; for a good introduction, see Mike Lee's amazing site. If this is your first time working with metabarcoding data, we encourage you to go through the tutorial from last Fall's workshop before diving into HPCC analyses. If you need assistance, please use the issue tracker to post questions or bugs.
These instructions assume you have access to Charlie and are connected via the Bigelow network or VPN (most cases). For other connections please see instructions here.
This pipeline is under development, and important updates are made on a regular basis. When running it for the first time, or after a hiatus, you must update the packages before running the pipeline. This can be done with the following commands on Charlie (note: this can take several minutes to finish). Also note: the pipeline expects gzip-compressed samples, so please ensure your fastq files are gzip-compressed before starting.
Open a terminal (or Windows Subsystem for Linux, or Command Prompt) and connect to Charlie:
```shell
ssh username@cfe.bigelow.org  # replace username with your username
# enter your password
module load dada2
Rscript /mnt/storage/data/edna/packages/edna_user_installer.R charlier auntie dadautils
module purge
```
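Because the pipeline requires gzip-compressed fastq files, it can be worth checking a directory of raw reads before starting. The helper below is an illustrative sketch only (it is not part of the edna-dada2 pipeline); it relies on `gzip -t`, which exits non-zero for files that are not valid gzip data.

```shell
# check_gzipped: report uncompressed or corrupt fastq files in a directory.
# Illustrative helper only; not part of the edna-dada2 pipeline.
check_gzipped() {
  local dir="$1" bad=0
  for f in "$dir"/*.fastq; do
    [ -e "$f" ] || continue              # glob matched nothing
    echo "not compressed: $f"            # fix with: gzip "$f"
    bad=1
  done
  for f in "$dir"/*.fastq.gz; do
    [ -e "$f" ] || continue
    gzip -t "$f" 2>/dev/null || { echo "corrupt gzip: $f"; bad=1; }
  done
  return $bad
}
# usage: check_gzipped raw_reads
```

The function returns non-zero if any file needs attention, so it can also be used in a conditional (`check_gzipped raw_reads || echo "fix reads first"`).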
The workflow described below assumes you use the following directory structure:
```
dada_pacbio
├── raw_reads                  # directory with raw reads
├── process                    # output directory (created by the pipeline if it doesn't exist)
├── job_submission.sh          # script to submit the workflow to Charlie's job scheduler
├── pacbio_preprocess.R        # the edna-dada2 preprocess workflow R script
├── pacbio_preprocess.yaml     # the edna-dada2 preprocess workflow yaml configuration file
├── pacbio_postprocess.R       # the edna-dada2 postprocess workflow R script
└── pacbio_postprocess.yaml    # the edna-dada2 postprocess workflow yaml configuration file
```
If you need to create this structure, navigate to where you would like to work on Charlie (typically your home directory or a directory under your name in a shared space) and type (or copy and paste) the following commands:
```shell
mkdir dada_pacbio
cd dada_pacbio
mkdir raw_reads
mkdir process
cp /mnt/storage/data/edna/packages/edna-dada2/job_submission.sh .
cp /mnt/storage/data/edna/packages/edna-dada2/3step/pacbio_preprocess.R .
cp /mnt/storage/data/edna/packages/edna-dada2/3step/pacbio_preprocess.yaml .
cp /mnt/storage/data/edna/packages/edna-dada2/3step/pacbio_postprocess.R .
cp /mnt/storage/data/edna/packages/edna-dada2/3step/pacbio_postprocess.yaml .
cp /mnt/storage/data/edna/dada/example/pacbio/*.gz raw_reads/
```
You will need to edit several lines of the configuration file. Open it with nano or another editor:
```shell
pwd  # copy the full path this command returns for the next step
nano pacbio_preprocess.yaml
# edit data_path: /full/path/to/dada_pacbio
# edit email:
# edit the primer: fields to be sure your primer sequences are listed
# edit minLen and maxLen to match your amplicon size
```
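For orientation, the edited fields of pacbio_preprocess.yaml might look roughly like the sketch below. The primer sub-field names and all values shown are illustrative assumptions, not the authoritative schema: keep the keys already present in your copy of the file, and substitute the primer sequences and amplicon lengths for your own run.

```yaml
data_path: /full/path/to/dada_pacbio   # the path returned by pwd above
email: username@bigelow.org            # where job notifications go
primer:                                # sub-field names here are assumptions
  forward: AGRGTTYGATYMTGGCTCAG        # example full-length 16S forward primer (27F)
  reverse: RGYTACCTTGTTACGACTT         # example full-length 16S reverse primer (1492R)
minLen: 1000                           # adjust both to match your amplicon size
maxLen: 1600
```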
```shell
nano job_submission.sh
# edit #PBS -M to add your email
# edit data_path=/full/path/to/dada_pacbio
# edit the Rscript filenames so the two lines read:
#   Rscript $data_path"/pacbio_preprocess.R" $data_path"/pacbio_preprocess.yaml"
#   Rscript $data_path"/pacbio_postprocess.R" $data_path"/pacbio_postprocess.yaml"
# make sure the preprocess line is not commented out and the postprocess line is commented out
```
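After editing, the two Rscript lines of job_submission.sh should look like this for the preprocess run (a sketch of just those two lines; leave the rest of the script, including the #PBS directives, as it is):

```shell
Rscript $data_path"/pacbio_preprocess.R" $data_path"/pacbio_preprocess.yaml"
# Rscript $data_path"/pacbio_postprocess.R" $data_path"/pacbio_postprocess.yaml"
```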
```shell
qsub job_submission.sh
```
Check on the job by typing qstat. You should see your username listed next to your job. If many jobs are listed, use the following to show only yours (replace username with your username). The job will show R in the S (status) column if it is running as expected, or E if there are errors. If you get an error, check the files ending in run_dada2_pipeline.e[jobid]
```shell
qstat -u username
```
The job should finish in <10 minutes. Results and a log will appear in the process directory. To look at the results:
```shell
cd process
less filter_and_trim.csv  # check filtering results
# download and view quality_profiles.pdf (use sftp; see below for an example)
```
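filter_and_trim.csv from dada2's filterAndTrim step typically has one row per sample with the read counts before and after filtering. Assuming that layout (sample name in column 1, reads in before filtering in column 2, reads remaining in column 3 of a comma-separated file with one header row), a quick retention summary can be computed with awk. This is an illustrative helper, not part of the pipeline:

```shell
# retention_summary: percent of reads surviving filtering, per sample.
# Assumes columns: sample, reads.in, reads.out (comma-separated, one header row).
retention_summary() {
  awk -F, 'NR > 1 { printf "%s: %.1f%% of reads kept\n", $1, 100 * $3 / $2 }' "$1"
}
# usage: retention_summary filter_and_trim.csv
```

If retention is very low, revisit the minLen/maxLen and primer settings before moving on.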
If you are happy with the results so far you are ready to run the final postprocess step.
```shell
cd ..  # get back to dada_pacbio
nano pacbio_postprocess.yaml
# edit data_path: to point to the dada_pacbio directory
# edit input_path: to point to the filtN directory in your dada_pacbio/process directory
# edit output_path: to point to the dada_pacbio/process directory
# edit truncLen: to point to the truncLen.csv file in your dada_pacbio/process directory
# edit refFasta: to point to the taxonomic database of your choice
```
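The edited fields of pacbio_postprocess.yaml might look roughly like this sketch. All values are illustrative placeholders; keep the keys already present in your copy, and point refFasta at the reference database you actually want to use:

```yaml
data_path: /full/path/to/dada_pacbio
input_path: /full/path/to/dada_pacbio/process/filtN
output_path: /full/path/to/dada_pacbio/process
truncLen: /full/path/to/dada_pacbio/process/truncLen.csv
refFasta: /path/to/your/reference_database.fasta   # placeholder; your chosen taxonomic database
```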
```shell
nano job_submission.sh
# make sure the postprocess script is not commented out and the preprocess script is commented out
qsub job_submission.sh
```
The job should finish in <10 minutes. Results and log will appear in the process directory.
If you used the default settings, the first 10 lines of seqtab-nochim.csv (`head -10 seqtab-nochim.csv`) should be:
| ASV | 18S_green_algae_1.hifi_reads.fastq.gz | 18S_green_algae_2.hifi_reads.fastq.gz |
|---|---|---|
| ASV_01 | 19228 | 26 |
| ASV_02 | 2117 | 3365 |
| ASV_03 | 0 | 2594 |
| ASV_04 | 700 | 1742 |
| ASV_05 | 1531 | 0 |
| ASV_06 | 983 | 0 |
| ASV_07 | 256 | 623 |
| ASV_08 | 842 | 0 |
| ASV_09 | 723 | 0 |
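As a quick sanity check on a table like this, per-sample read totals can be summed directly from seqtab-nochim.csv with awk. The helper below is a sketch (not part of the pipeline) and assumes the layout shown above: ASV IDs in the first column, one column of counts per sample, comma-separated, with a header row:

```shell
# sample_totals: total read count per sample column in an ASV table CSV.
sample_totals() {
  awk -F, '
    NR == 1 { for (i = 2; i <= NF; i++) name[i] = $i; n = NF }
    NR > 1  { for (i = 2; i <= n; i++) sum[i] += $i }
    END     { for (i = 2; i <= n; i++) printf "%s: %d\n", name[i], sum[i] }
  ' "$1"
}
# usage: sample_totals seqtab-nochim.csv
```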
Files can be moved between your computer and Charlie using sftp. Open a new tab in terminal and type:
```shell
pwd  # copy the path to the process directory
sftp user@cfe.bigelow.org
get /full/path/to/dada_pacbio/process/seqtab-nochim.csv  # replace the path with your own
get /full/path/to/dada_pacbio/process/ASV_sequences.fasta
get /full/path/to/dada_pacbio/process/ASV_taxa.csv
```
Create a directory called dada_pacbio_local on your computer and move the files into it (sftp downloads to your user directory unless you specify a path). Download the info.csv file and visualization_script.R from GitHub and place them in the dada_pacbio_local directory. Open visualization_script.R in RStudio and follow the script to create barplots and explore the dataset.