A comprehensive Nextflow-based RNA-seq analysis pipeline for processing and analyzing RNA sequencing data, featuring quality control, alignment, quantification, and advanced analysis capabilities including IRFinder, RNA-SeQC, and Kallisto quantification.
Current Version: v1.1.0 - Enhanced stability and coverage analysis capabilities
- Quality Control: Fastp preprocessing with rRNA and globin RNA contamination detection
- Alignment: STAR alignment with multiple read length optimizations (69-151bp)
- Quantification:
- STAR gene-level counts
- Kallisto transcript-level quantification
- Subread featureCounts for gene expression
- Quality Metrics: RNA-SeQC comprehensive quality assessment
- Advanced Analysis:
- IRFinder for intron retention detection
- Splice junction analysis with BAM2SJ
- Enhanced coverage analysis with dual Mosdepth processes (MOSDEPTH + MOSDEPTH_BED)
- CDS region analysis and filtering with BEDTOOLS
- Advanced variant calling with DEEPVARIANT
- Output Formats: CRAM compression for efficient storage
- Containerized: Docker-based execution for reproducibility
graph TB
subgraph Input
A1[FASTQ Files] --> B1[Fastp QC]
A2[Reference Data] --> B1
end
subgraph Contamination Check
B1 --> C1[BWA rRNA Check]
B1 --> C2[BWA Globin Check]
C1 --> D1[Samtools Flagstat]
C2 --> D2[Samtools Flagstat]
end
subgraph Alignment & Quantification
B1 --> E1[STAR Alignment]
B1 --> E2[Kallisto Quant]
E1 --> F1[Samtools Index]
F1 --> F2[Sambamba MarkDup]
F2 --> G1[FeatureCounts]
F2 --> G2[RNA-SeQC]
F2 --> G3[IRFinder]
F2 --> G4[Mosdepth Coverage Analysis]
F2 --> G4b[Enhanced Mosdepth Coverage]
F2 --> G5[BAM2SJ]
F2 --> G6[BEDTOOLS CDS Analysis]
G6 --> G7[DEEPVARIANT Variant Calling]
end
subgraph Output
G1 --> H1[Gene Counts]
G2 --> H2[QC Reports]
G3 --> H3[IRFinder Results]
G4 --> H4[Coverage Files]
G4b --> H4b[Enhanced Coverage Analysis]
G5 --> H5[Splice Junctions]
G6 --> H6[CDS Region Files]
G7 --> H7[Variant Call Files]
F2 --> H8[CRAM Files]
end
style Input fill:#e1f5fe,stroke:#01579b,stroke-width:2px
style Contamination Check fill:#f3e5f5,stroke:#4a148c,stroke-width:2px
style Alignment fill:#e8f5e8,stroke:#1b5e20,stroke-width:2px
style Output fill:#fff3e0,stroke:#e65100,stroke-width:2px
The pipeline follows a logical sequence where each step builds upon the previous:
- Quality Control & Alignment: FASTQ files are processed and aligned to generate BAM files
- Coverage Analysis: Mosdepth processes analyze coverage across the genome and coding regions
- CDS Region Identification: BEDTOOLS processes the coverage data to identify coding sequences with adequate coverage
- Variant Calling: DEEPVARIANT uses the BEDTOOLS-generated CDS bed file to focus variant calling on well-covered coding regions
Key Dependency: DEEPVARIANT requires the CDS bed file from BEDTOOLS to define the regions for variant calling, ensuring efficient and accurate variant detection in coding sequences.
- Operating System: Linux-based (Ubuntu 18.04+ recommended)
- CPU: Minimum 48 cores for optimal performance
- Memory: Minimum 192GB RAM for STAR alignment
- Storage: Sufficient space for RNA-seq data processing
- Docker: Required for containerized execution
# Install Docker
curl -fsSL https://get.docker.com -o get-docker.sh
sudo sh get-docker.sh
# Add user to docker group
sudo usermod -aG docker $USER
newgrp docker
# Verify installation
docker --version# Install Nextflow
curl -s https://get.nextflow.io | bash
sudo mv nextflow /usr/local/bin/
# Verify installation
nextflow -version- Clone the repository:
git clone https://github.com/uclanelsonlab/nl-rna-seq_wf.git
cd nl-rna-seq_wf/- Set up Docker images:
# Build or pull required Docker images
# Update nextflow.config with your Docker image paths- Verify installation:
nextflow run main.nf --help| Parameter | Description | Example |
|---|---|---|
--fastq_r1 |
Forward FASTQ file path | /path/to/sample_R1.fastq.gz |
--fastq_r2 |
Reverse FASTQ file path | /path/to/sample_R2.fastq.gz |
--sample_name |
Sample identifier | SAMPLE001 |
--rrna_reference |
rRNA reference genome | /path/to/rrna_ref.fa |
--globinrna_reference |
Globin RNA reference | /path/to/globin_ref.fa |
--gencode_gtf_path |
Gencode GTF annotation | /path/to/gencode.v43.gtf |
--human_fasta |
Human reference genome | /path/to/GRCh38.fa |
--star_index_151 |
STAR index for 151bp reads | /path/to/star_index_151 |
--kallisto_index |
Kallisto transcript index | /path/to/kallisto.idx |
The pipeline generates organized outputs in the following structure:
results/
├── QC/ # Quality control reports
│ ├── *.rrna.flagstat.txt # rRNA contamination stats
│ ├── *.globinrna.flagstat.txt # Globin contamination stats
│ ├── *.html # Fastp QC reports
│ └── *.json # Fastp QC metrics
├── ALIGNMENT/ # Alignment files
│ ├── *.ReadsPerGene.out.tab.gz # STAR gene counts
│ ├── *.Log.final.out # STAR alignment stats
│ ├── *.SJ.out.tab.gz # Splice junctions
│ ├── *.bam2SJ.out.tab.gz # Reconstructed junctions
│ └── *.cram # Compressed alignment files
├── COUNTS/ # Gene expression counts
│ ├── *.gene_id.exon.ct # FeatureCounts output
│ └── *.gene_id.exon.ct.summary # Counts summary
├── QUANT/ # Kallisto quantification
│ ├── *abundance.h5 # Transcript abundances h5
│ ├── *abundance.tsv # Transcript abundances
│ └── *run_info.json # Quantification info
├── IR/ # IRFinder results
│ ├── *.txt.gz # Intron retention data
│ └── *.bw # Coverage bigWig files
├── MOSDEPTH/ # Coverage analysis
│ ├── *.mosdepth.global.dist.txt # Global coverage
│ ├── *.regions.bed.gz # Region-specific coverage
│ └── *.per-base.bed.gz # Per-base coverage analysis
├── BEDTOOLS/ # CDS analysis and filtering
│ └── *_cds_3x.bed # Coding sequence regions with coverage filtering
├── DEEPVARIANT/ # Variant calling
│ ├── *.vcf.gz # Variant call format files
│ ├── *.g.vcf.gz # Genomic variant call format
│ └── *.visual_report.html # Variant calling reports
└── BAM2SJ/ # Splice junction analysis
├── *_rare_junctions_all.tsv # All detected junctions
└── *_rare_junctions_filtered.xlsx # Filtered junctions
Update nextflow.config with your Docker image paths:
params {
bwa_docker = "your-registry/bwa:latest"
star_docker = "your-registry/star:latest"
fastp_docker = "your-registry/fastp:latest"
// ... other images
}Modify process resources in nextflow.config:
process {
withLabel: 'star_alignreads' {
memory = 192.GB
cpus = 48
}
}- Adapter trimming
- Quality filtering
- Length filtering
- Duplicate detection
- rRNA contamination using BWA
- Globin RNA contamination analysis
- Statistical reporting
- Multi-read length optimization
- Splice-aware alignment
- Junction detection
- Gene-level quantification
- STAR: Gene-level read counts
- Kallisto: Transcript-level abundance
- FeatureCounts: Exon-level counting
- Coverage metrics
- GC bias analysis
- Mapping quality
- Strand specificity
- IRFinder: Intron retention detection
- BAM2SJ: Splice junction reconstruction
- Mosdepth: Enhanced coverage analysis with dual processes for comprehensive coverage assessment
- BEDTOOLS: CDS region identification and coverage-based filtering for variant calling
- DEEPVARIANT: Advanced variant calling with custom model support and comprehensive reporting
Workflow Dependencies: DEEPVARIANT requires BEDTOOLS to run first, as it uses the generated CDS bed file to define the regions for variant calling. This ensures variant calling is focused on coding sequences with adequate coverage.
-
Memory Errors:
- Increase memory allocation in
nextflow.config - Use
-resumeto restart from failed step
- Increase memory allocation in
-
Docker Issues:
- Verify Docker daemon is running
- Check image availability and permissions
-
Reference File Errors:
- Ensure all reference files are accessible
- Verify file formats and integrity
-
Channel Input Errors (Fixed in v1.1.0):
- All input channel mismatches have been resolved
- Reference file handling has been optimized
- Duplicate output channel issues have been fixed
- STAR Alignment: 48 cores, 192GB RAM
- IRFinder: 16 cores, 32GB RAM
- Other processes: 8-16 cores, 16-32GB RAM
- Small dataset (<50M reads): 2-4 hours
- Medium dataset (50-100M reads): 4-8 hours
- Large dataset (>100M reads): 8-16 hours
We welcome contributions! Please:
- Fork the repository
- Create a feature branch
- Make your changes
- Add tests if applicable
- Submit a pull request
This project is licensed under the MIT License - see the LICENSE file for details.
- George Carvalho - [email protected]
- UCLA Nelson Lab - https://nelsonlab.ucla.edu/
- Nextflow community for the workflow engine
- STAR developers for RNA-seq alignment
- IRFinder team for intron retention analysis
- All contributors to the bioinformatics tools used
For aberrant splicing analysis, see the FRASER configuration in drop_files/config.yaml.
For aberrant expression analysis, use the R script in script/run_outrider.R.
Use the provided sample annotation files for processing multiple samples.
- Fixed Channel Errors: Resolved all input channel count mismatches across samtools processes
- Eliminated Duplicates: Fixed duplicate output channel naming issues
- Optimized References: Streamlined reference file handling for better compatibility
- Dual Mosdepth Processes: Added comprehensive coverage analysis with both MOSDEPTH and MOSDEPTH_BED
- Improved Integration: Better workflow integration between coverage analysis and downstream processing
- Performance Optimization: Enhanced channel efficiency and process compatibility
- Enhanced DEEPVARIANT Integration: Improved variant calling pipeline with optimized input handling
- CDS Region Analysis: Advanced BEDTOOLS processing for coding sequence identification and coverage filtering
- Comprehensive Variant Reports: Enhanced VCF outputs with visual reporting and gVCF support
- Sequential Processing: BEDTOOLS runs first to generate CDS bed files, which DEEPVARIANT then uses for targeted variant calling
- Coverage-Based Filtering: Variant calling is restricted to coding regions with adequate coverage, improving accuracy and efficiency
- Pipeline now runs without channel-related errors
- Enhanced coverage analysis capabilities
- Improved stability and reliability
- Better resource utilization and workflow efficiency
- Advanced variant calling with DEEPVARIANT integration
- Enhanced CDS region analysis and filtering capabilities
- Comprehensive variant reporting and visualization