A Snakemake pipeline that filters paired-end FASTQ files to retain only reads mapping to specified genomic regions.
Given aligned BAM files and a BED file defining regions of interest, the pipeline:
- Sorts and indexes the BAMs
- Extracts reads overlapping the target regions (via
samtools view -L) - Filters the original FASTQ files to keep only those read IDs
- Compresses the output FASTQs with
pigz
Edit the top of filter_fastqs.smk to set your paths:
aligned_bams_folder = "/path/to/aligned_bams" # STAR-style: <folder>/<sample>/Aligned.out.bam
fastq_folder = "/path/to/fastqs" # paired-end: <sample>_1.fq.gz, <sample>_2.fq.gz
bed_file = "/path/to/regions.bed" # regions to retain
output_folder = "/path/to/output" # created automaticallySamples are discovered automatically from the FASTQ filenames.
snakemake -s filter_fastqs.smk --cores <N><output_folder>/
sorted_bams/ # coordinate-sorted BAMs and indexes
filtered_bams/ # SAM files containing only reads in target regions
<sample>_1.fq.gz # filtered paired-end FASTQs
<sample>_2.fq.gz
scripts/get_regions.py extracts specific gene regions from a GTF into a BED file. Edit the gtf_file, output_bed_file, and the feature filter string inside the script, then run it directly with Python to generate the BED file needed for the pipeline.