This is a Nextflow implementation of the WGS QC workflow, which performs quality control and metrics collection for whole genome sequencing (WGS) data.
The pipeline accepts aligned input in either BAM or CRAM format and can optionally convert the final BAM to CRAM.
graph TD
subgraph Input
A1[FASTQ R1] --> B[Fastp]
A2[FASTQ R2] --> B
C1[BAM/CRAM] --> D[Samtools]
C2[Reference FASTA] --> D
end
subgraph Processing
B --> E[Fastp Reports]
D --> F[Processed BAM]
F --> G1[Picard Multiple Metrics]
F --> G2[Picard WGS Metrics]
F --> H[Qualimap]
F --> I[CRAM Conversion]
end
subgraph Output
E --> J[MultiQC]
G1 --> J
G2 --> J
H --> J
I --> K[CRAM File]
J --> L[Final Report]
end
style Input fill:#f9f,stroke:#333,stroke-width:2px
style Processing fill:#bbf,stroke:#333,stroke-width:2px
style Output fill:#bfb,stroke:#333,stroke-width:2px
- Nextflow 22.10.0 or later
- Docker
- Java 8 or later
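
The requirements above can be checked before launching. The following is a minimal sketch, not part of the pipeline: the helper names `version_ge` and `missing_tools` are hypothetical, and the version comparison assumes a `sort` that supports `-V` (as in GNU coreutils).

```shell
# Return 0 when version $1 is greater than or equal to version $2.
# Assumes `sort -V` (version sort) is available.
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Print the name of every listed tool that is not found on PATH.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || echo "$tool"
    done
}

# Report which of the three prerequisites are absent (prints nothing if all exist).
missing_tools nextflow docker java
```

To enforce the minimum Nextflow version, pass the installed version string as the first argument, for example `version_ge "$installed" 22.10.0`. How `$installed` is obtained (for instance from the output of `nextflow -version`) is left open here, since the exact layout of that output is not assumed.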
- Clone this repository:
git clone <repository-url>
cd <repository-directory>
- Install Nextflow if it is not already available:
curl -s https://get.nextflow.io | bash

The pipeline can be run using the following command:
nextflow run main.nf \
--fastq_r1 "path/to/reads_R1.fastq.gz" \
--fastq_r2 "path/to/reads_R2.fastq.gz" \
--prefix "sample_name" \
--fasta "path/to/reference.fasta" \
--aligned_file "path/to/input.bam" \
--cram false

- --fastq_r1: Path to the first read file (R1) in FASTQ format
- --fastq_r2: Path to the second read file (R2) in FASTQ format
- --prefix: Sample prefix to be used in output creation
- --fasta: Path to the reference genome in FASTA format
- --aligned_file: Path to the input BAM or CRAM file
- --cram: Whether to convert final BAM to CRAM format (default: false)
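
Since `--aligned_file` accepts only BAM or CRAM and `--cram` is a boolean, a wrapper script can reject bad values before Nextflow starts. The sketch below is illustrative: `alignment_format` and `is_boolean_flag` are hypothetical helpers, and the format is judged by file extension alone, which may differ from how the pipeline itself detects it.

```shell
# Print the format of an aligned input file, judged by its extension only.
# Returns 1 (with a message on stderr) for anything other than .bam or .cram.
alignment_format() {
    case "$1" in
        *.bam)  echo "bam" ;;
        *.cram) echo "cram" ;;
        *)      echo "unsupported aligned file: $1" >&2; return 1 ;;
    esac
}

# Accept only the two values documented for --cram.
is_boolean_flag() {
    case "$1" in
        true|false) return 0 ;;
        *)          return 1 ;;
    esac
}

# Example with the placeholder path from the command above.
alignment_format "path/to/input.bam"
```

A wrapper could call both helpers and exit early on failure, so that a mistyped path or flag value is reported immediately rather than after the workflow has been scheduled.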
The pipeline generates the following outputs in the results directory:
- Fastp reports:
  - {prefix}_fastp.html
  - {prefix}_fastp.json
- Picard metrics:
  - Alignment summary metrics
  - Insert size metrics
  - Quality score distribution
  - Mean quality by cycle
  - Base distribution by cycle
  - WGS metrics
- Qualimap reports:
  - Coverage statistics
  - Mapping quality metrics
  - Insert size distribution
  - GC content analysis
- CRAM file (if --cram true):
  - {prefix}.cram
  - {prefix}.cram.crai
- MultiQC report:
  - {prefix}_multiqc_report.html
  - MultiQC data directory
The pipeline uses the following Docker containers:
- staphb/fastp:latest for read quality control
- quay.io/biocontainers/samtools:1.21--h96c455f_1 for BAM/CRAM processing
- broadinstitute/picard:latest for metrics collection
- quay.io/biocontainers/qualimap:2.3--hdfd78af_0 for alignment quality assessment
- ewels/multiqc:latest for report generation
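
Three of the images above use the `latest` tag, which can point to a different build over time, while the other two are pinned to a specific version. The sketch below separates the two groups. `is_pinned` is a hypothetical helper that inspects only the tag text and does not handle registry hosts that include a port number.

```shell
# Return 0 when an image reference carries an explicit tag other than "latest".
is_pinned() {
    case "$1" in
        *:latest) return 1 ;;
        *:*)      return 0 ;;
        *)        return 1 ;;   # no tag given: Docker falls back to "latest"
    esac
}

# Classify each container listed above.
for image in \
    staphb/fastp:latest \
    quay.io/biocontainers/samtools:1.21--h96c455f_1 \
    broadinstitute/picard:latest \
    quay.io/biocontainers/qualimap:2.3--hdfd78af_0 \
    ewels/multiqc:latest
do
    if is_pinned "$image"; then
        echo "pinned    $image"
    else
        echo "floating  $image"
    fi
done
```

Runs that must be reproducible may benefit from replacing the floating tags with fixed versions. Which versions to choose is not specified here.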
The pipeline also generates several execution reports:
- pipeline_report.html: Pipeline execution report
- timeline_report.html: Timeline of pipeline execution
- trace.txt: Detailed execution trace
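
The execution trace can be scanned for tasks that did not finish cleanly. The sketch below assumes that trace.txt is tab-separated with a header row containing `name` and `status` columns, as in Nextflow's default trace format. `unfinished_tasks` is a hypothetical helper, and the location of trace.txt depends on the pipeline configuration.

```shell
# Print the name and status of every task whose status is neither
# COMPLETED nor CACHED. Columns are located by header name, so their
# position in the file does not matter.
unfinished_tasks() {
    awk -F '\t' '
        NR == 1 {
            for (i = 1; i <= NF; i++) col[$i] = i
            if (!("name" in col) || !("status" in col)) {
                print "trace file lacks name/status columns" > "/dev/stderr"
                exit 1
            }
            next
        }
        $(col["status"]) != "COMPLETED" && $(col["status"]) != "CACHED" {
            print $(col["name"]) "\t" $(col["status"])
        }
    ' "$1"
}
```

Calling `unfinished_tasks trace.txt` after a run prints one line per problematic task and nothing when every task completed or was served from cache.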
results/
├── fastp/ # Fastp quality control reports
├── picard/ # Picard metrics
├── qualimap/ # Qualimap reports
├── cram/ # CRAM files (if --cram true)
└── multiqc/ # MultiQC report and data
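
After a run, the results directory can be checked against the file names documented above. This is a minimal sketch under one stated assumption: each file sits directly inside the subdirectory named after its tool, which the layout above suggests but does not confirm. `expected_outputs` and `missing_outputs` are hypothetical helpers, and Picard and Qualimap files are omitted because their exact names are not documented here.

```shell
# List the documented output files for a sample, relative to the results directory.
# $1 = sample prefix, $2 = value passed to --cram (default: false).
expected_outputs() {
    prefix=$1
    cram=${2:-false}
    echo "fastp/${prefix}_fastp.html"
    echo "fastp/${prefix}_fastp.json"
    echo "multiqc/${prefix}_multiqc_report.html"
    if [ "$cram" = "true" ]; then
        echo "cram/${prefix}.cram"
        echo "cram/${prefix}.cram.crai"
    fi
}

# Print every expected file that is absent from the results directory.
# $1 = results directory, $2 = sample prefix, $3 = value passed to --cram.
missing_outputs() {
    results_dir=$1
    expected_outputs "$2" "${3:-false}" | while IFS= read -r rel; do
        [ -f "$results_dir/$rel" ] || echo "$rel"
    done
}
```

For example, `missing_outputs results sample_name false` prints nothing when all three report files are present and otherwise lists the relative paths that were not found.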
This project is licensed under the terms of the license included in the repository.
George Carvalho