Skip to content

simkin-bioinformatics/filter_fastqs

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

5 Commits
 
 
 
 
 
 
 
 

Repository files navigation

filter_fastqs

A Snakemake pipeline that filters paired-end FASTQ files to retain only reads mapping to specified genomic regions.

Overview

Given aligned BAM files and a BED file defining regions of interest, the pipeline:

  1. Sorts and indexes the BAMs
  2. Extracts reads overlapping the target regions (via samtools view -L)
  3. Filters the original FASTQ files to keep only those read IDs
  4. Compresses the output FASTQs with pigz

Dependencies

Configuration

Edit the top of filter_fastqs.smk to set your paths:

aligned_bams_folder = "/path/to/aligned_bams"   # STAR-style: <folder>/<sample>/Aligned.out.bam
fastq_folder        = "/path/to/fastqs"          # paired-end: <sample>_1.fq.gz, <sample>_2.fq.gz
bed_file            = "/path/to/regions.bed"     # regions to retain
output_folder       = "/path/to/output"          # created automatically

Samples are discovered automatically from the FASTQ filenames.

Usage

snakemake -s filter_fastqs.smk --cores <N>

Output

<output_folder>/
  sorted_bams/         # coordinate-sorted BAMs and indexes
  filtered_bams/       # SAM files containing only reads in target regions
  <sample>_1.fq.gz     # filtered paired-end FASTQs
  <sample>_2.fq.gz

Helper script

scripts/get_regions.py extracts specific gene regions from a GTF into a BED file. Edit the gtf_file, output_bed_file, and the feature filter string inside the script, then run it directly with Python to generate the BED file needed for the pipeline.

About

No description, website, or topics provided.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors

Languages