Goal: map paired-end
*.fastq.gzreads against aMAST4X.fastagenome, generate a sorted and indexed BAM, and (optionally) apply a 95% identity filter usingidfilter.pl.This is written as a tutorial / recipe. Copy and paste only the blocks you need.
bwa(genome indexing and mapping)samtools(SAM/BAM manipulation)perl+idfilter.pl(only if identity filtering is required)
On HPC systems these are often loaded via module load bwa samtools perl, but here we assume they are already available in your PATH.
Define the following:
GENOME: target genome FASTAR1andR2: paired-end reads (gzipped)THREADS: number of threadsOUTDIR: output directoryPREFIX: base prefix for output files
Example:
GENOME=../data/genomes/MAST4A.fasta
R1=../data/tara_samples/GGMM.reads/TARA11.SUR.GGMM.1.clean.fastq.gz
R2=../data/tara_samples/GGMM.reads/TARA11.SUR.GGMM.2.clean.fastq.gz
THREADS=24
OUTDIR=../analyses/mapping/MAST4A
PREFIX=TA11.SUR.GGMM.MAST4A
mkdir -p "$OUTDIR"Tip: for MAST-4A/B/C/E, only change the genome letter and output directory.
This step only needs to be done once for each MAST4X.fasta file.
bwa index "$GENOME"This creates the .amb, .ann, .bwt, .pac, and .sa index files next to the FASTA.
This block:
- decompresses
fastq.gzon the fly - converts SAM to BAM
- sorts alignments by coordinate
- indexes the BAM
bwa mem -t "$THREADS" "$GENOME" <(zcat "$R1") <(zcat "$R2") \
| samtools view -bh - \
| samtools sort -@ "$THREADS" -o "$OUTDIR/${PREFIX}.sorted.bam" -
samtools index "$OUTDIR/${PREFIX}.sorted.bam"Output:
.../${PREFIX}.sorted.bam.../${PREFIX}.sorted.bam.bai
If your pipeline uses idfilter.pl to retain alignments with ≥95% identity and ≥80% read coverage, a typical workflow is:
IDFILTER=./idfilter.pl
IDENTITY=0.95
FILTERED_DIR="$OUTDIR/filtered_95id"
mkdir -p "$FILTERED_DIR"
perl "$IDFILTER" \
-i "$OUTDIR/${PREFIX}.sorted.bam" \
-o "$FILTERED_DIR/${PREFIX}.95id.sorted.bam" \
-d "$IDENTITY" \
samtools index "$FILTERED_DIR/${PREFIX}.95id.sorted.bam"Output:
.../filtered_95id/${PREFIX}.95id.sorted.bam.../filtered_95id/${PREFIX}.95id.sorted.bam.bai
Note: if
idfilter.plsupports an explicit coverage parameter (-l, default 0.80), include it here.This is a custom script. Nowadays, there are software available to do this step that is actually supported by the developers, such as bam-filter
If you only need the filtered BAM, you can remove the unfiltered BAM:
rm -f "$OUTDIR/${PREFIX}.sorted.bam" "$OUTDIR/${PREFIX}.sorted.bam.bai"