Some useful scripts for sc5Pseq project. Not everyone is well tested. 😀
Description
This script calculates and plots base frequencies from a FASTQ file. It can generate an interactive HTML plot. It can also specify base positions of interest.
Requirements
It could run in python2+ and python3+ enviroment, but need some libraries.
pip3 install pandas matplotlib plotly Bio argparse gzip osUsage:
usage: stat_base_content.py [-h] [-i INPUT_FILE] [-a TABLE_INPUT] -o OUTPUT_PREFIX
[-n NUM_BASES] [-s POS [POS ...]]
[-sp SPECIFIC_OUTPUT_PREFIX] [--interactive]
[--add_percentage]
Calculate and plot base frequencies in a FASTQ file or from an existing CSV table.
options:
-h, --help show this help message and exit
-i INPUT_FILE, --input_file INPUT_FILE
Path to the FASTQ file, support .gz
-a TABLE_INPUT, --table_input TABLE_INPUT
Path to an existing base frequencies table to use as input
instead of a FASTQ file.
-o OUTPUT_PREFIX, --output_prefix OUTPUT_PREFIX
Output file name prefix.
-n NUM_BASES, --num_bases NUM_BASES
Number of bases to analyze (default: 85).
-s POS [POS ...], --specific_positions POS [POS ...]
Specific base positions or range to analyze (e.g., -s 10 15)
-sp SPECIFIC_OUTPUT_PREFIX, --specific_output_prefix SPECIFIC_OUTPUT_PREFIX
Output file name prefix for specific positions plot.
--interactive Generate an interactive HTML plot.
--add_percentage Add percentage labels to the static image.
Example
module load python #(python/3.12.1)
- To analyze a FASTQ file and generate output files:
python script.py -i example.fastq -o output_prefix
- Specify bases at specific positions, generate dynamic interactive HTML, and add base percentage to PNG:
python script.py -i example.fastq -o output_prefix -s 8 16 --interactive --add_percentage -sp special_output_prefix
- Generate plots directly from an existing CSV table:
python script.py --table_input existing_frequencies.csv -o output_prefix --interactive
Description
Parses the .final.out file and extracts the relevant information from the STAR alignment result.
Usage:
python star_format.py DirectoryPath OutputNameDescription
Easily plot the format mapping stat results from STAR. It can generate bar plot and interactive html to easy compare the mulit samples. You could also specify the features you interest to plot.
The input are the format file from star_format.py
Usage
usage: star_plot.py [-h] -i INPUT [-f FEATURES [FEATURES ...]]
[-c {blue,orange,green,red,purple}] -o OUTPUT
[--interactive] [--sort-samples]
Plot features from formatted STAR log data.
options:
-h, --help show this help message and exit
-i INPUT, --input INPUT
Input CSV file containing formatted STAR
log data.
-f FEATURES [FEATURES ...], --features FEATURES [FEATURES ...]
List of features to compare (default is
'umr_pct'). Available features: input,
umr_num, umr_pct, mismatch_rate, avg_len,
multi_num, multi_pct, unmapped_short_num,
unmapped_short_pct.
-c {blue,orange,green,red,purple}, --color {blue,orange,green,red,purple}
Specify the color for the plots (default is
'blue'). Available colors: blue, orange,
green, red, purple.
-o OUTPUT, --output OUTPUT
Output file name prefix for the plots.
--interactive Generate interactive plots using Plotly.
--sort-samples Sort sample names in alphanumeric order.Example
module load python #python3.12.7
- plot umr_pct and umr_num feature with sort sample name and in red color.
python star_plot.py -i final.out.all -f "umr_pct" "umr_num" -o cdiff --interactive --sort-samples -c red
Description
This tools could analysis and filter the barcode you have. It allows you control the hamming distance, GC content and exclude any sequence based on the window sliding. It will generate the hamming distance matrix and filter information. It also visualizes the hamming distance matrix.
The input including the barcode file (fasta) and excluded fasta.
Usage
usage: barcode_filter.py [-h] -i INPUT [-e EXCLUDE] [-ht HAMMING] [-gmin GC_MIN] [-gmax GC_MAX] -of
OUTPUT_FILTERED -oi OUTPUT_INFO -hmf HAMMING_MATRIX_FILE -hpf HAMMING_PLOT_FILE
[-d]
Filter barcode sequences based on Hamming distance, GC content, and exclusions.
options:
-h, --help show this help message and exit
-i INPUT, --input INPUT
Input fasta file containing barcodes.
-e EXCLUDE, --exclude EXCLUDE
Fasta file containing sequences to exclude.
-ht HAMMING, --hamming HAMMING
Minimum Hamming distance threshold (default: 3).
-gmin GC_MIN, --gc_min GC_MIN
Minimum GC content percentage (default: 40.0).
-gmax GC_MAX, --gc_max GC_MAX
Maximum GC content percentage (default: 60.0).
-of OUTPUT_FILTERED, --output_filtered OUTPUT_FILTERED
Output fasta file for filtered barcodes.
-oi OUTPUT_INFO, --output_info OUTPUT_INFO
Output text file for detailed filtering information.
-hmf HAMMING_MATRIX_FILE, --hamming_matrix_file HAMMING_MATRIX_FILE
Output text file for Hamming distance matrix.
-hpf HAMMING_PLOT_FILE, --hamming_plot_file HAMMING_PLOT_FILE
Output image file for Hamming distance heatmap.
-d, --debug Enable debug logging.
Example
barcode_filter.py -i barcode.fa -e exclude_file.fa -of barcode.filter.fa -oi barcode.filterinfo.txt -hmf barcode.filter.matrix.txt -hpf barcode.filter.matrix.plotDescription
This script parses umi-tools log files and extracts statistics for R1 only. It extracts the following from each log file:
- Input Reads (only R1; lines mentioning read2 are skipped)
- regex matches read1
- regex does not match read1
The sample name is derived from the file name by taking the part before the first underscore (e.g., for A1_S188.umi.extract.log, the sample is A1). If multiple log files belong to the same sample, their statistics are merged. Additionally, an optional parameter (-a) allows you to append an extra column (for example, to indicate a version or any other field) immediately after the sample column in the output.
Requirements
This script requires Python 3 and uses only standard libraries:
argparsereos
No additional packages are needed.
Usage
usage: parse_umi_log.py -i INPUT [INPUT ...] [-o OUTPUT] [-a APPEND]
Parse umi-tools log files and output merged summary for R1 only.
options:
-i INPUT [INPUT ...], --input INPUT [INPUT ...]
List of .log files to process.
-o OUTPUT, --output OUTPUT
Output file name (default: merged_summary.txt).
-a APPEND, --append APPEND
Additional field to append after sample (e.g. version).Example
python parse_umi_log.py -i A1_S188.umi.extract.log B2_S189.umi.extract.log -o umi_summary.txt -a version1.0This command will:
- Process the log files
A1_S188.umi.extract.logandB2_S189.umi.extract.log. - Parse only the R1 statistics from each file.
- Merge statistics for samples with the same sample name.
- Append an extra column (here,
version1.0) after the sample column. - Write the output summary to
umi_summary.txt.
This toolset identifies, visualizes, and filters TSO-related artifacts in 5' end-mapping RNA-seq data. Input BAM should be pre-trimmed to remove TSO/barcode/primer, mapping only the RNA/cDNA true 5' ends. Without pre-trimmed are developing.
module load python/3.12.3 # if working on PDC
python3 detect_remove_artifactTSO.py -h
usage: detect_remove_artifactTSO.py [-h] {stat,plot,strand_invasion,missing_pairing} ...
A pipeline for strand invasion/missing pairing artifact analysis: stat, plot, filter.
positional arguments:
{stat,plot,strand_invasion,missing_pairing}
stat Stat mode: outputs TSV for TSO-hamming+G, rcTSO-hamming+C, logo table, and grouped percentage curves.
plot Plot mode: visualize heatmaps/logo/curve from stat .tsv
strand_invasion Filter by G in last 3 nt + TSO hamming
missing_pairing Filter by C in last 3 nt + rcTSO hamming
options:
-h, --help show this help message and exitFeatures:
stat: Output statistics (TSO/rcTSO hamming distance vs. last 3nt G/C count) and sequence logo table.plot: Flexible visualization (heatmaps, sequence logos, grouped line plots) from stat output.filtering: Filter BAM files by user-specified thresholds.
TSO sequence can include IUPAC ambiguity codes. Hamming distance is computed only on the TSO core (excluding the trailing GGG/CCC).
Installation
Python 3.7+ is required. Install dependencies with:
pip install pysam pyfaidx pandas matplotlib logomakerusage
- Stat: Generate all statistics and tables
python detect_remove_artifactTSO.py stat \
-b input.bam -f genome.fa -t TTTCTTATATGGG \
--g_out stat_g.tsv --c_out stat_c.tsv --logo_table logo_table.tsv \
--g_curve_out g_curve.tsv --c_curve_out c_curve.tsv- Plot: Visualize heatmap, sequence logo, or grouped line plot
python detect_remove_artifactTSO.py plot \
--g_out stat_g.tsv --c_out stat_c.tsv --logo_table logo_table.tsv \
--g_curve g_curve.tsv --logo_mode G --curve_png gcurve.png --logo_png logo.png- Filter: Remove artifact reads from BAM
python detect_remove_artifactTSO.py strand_invasion \
-b input.bam -f genome.fa -t TTTCTTATATGGG \
--min_g 2 --max_hamming 3 --outbam filtered.bamSingle-cell sequencing saturation curves for a general bam file with library and per-cell support.
Modes (mutually exclusive via --mode):
- global: Monte Carlo subsampling of the whole BAM and counting uniques --dedup cell -> unique (CB, UMI) --dedup library -> unique UMI (ignore CB & Gene) --dedup cell-gene -> unique (CB, Gene, UMI)
- per-cell: For each selected CB, compute EXPECTED unique molecules vs reads analytically (no Monte Carlo) --per-cell-dedup {cell, cell-gene}
Plot Axes:
- X axis is always "Number of reads (M)"; you can override both axes via --xlabel/--ylabel
- Journal style (no grid). Optional --logx, and --no-band (global only)
CSV (long-format):
- global : reads,reads_M,mean_unique,std_unique,dedup
- per-cell: cell,reads,reads_M,expected_unique,dedup
Requirements
If working on PDC: ml bioinfo-tools; ml python/3.12.3
usage
- Global, per-cell dedup, 25 points, CSV
python scSaturation.py --mode global --bam in.bam --out-prefix sample \
--dedup cell --num-points 25 --skip-secondary --csv- Global, library-level unique UMI (ignore CB/Gene), log-x, no band
python scSaturation.py --mode global --bam in.bam --out-prefix sample_lib \
--dedup library --logx --no-band- Global, cell-gene dedup using GX then GN
python scSaturation.py --mode global --bam in.bam --out-prefix sample_cg \
--dedup cell-gene --gene-tags GX,GN- Per-cell curves, per-cell dedup, top100 cells (>=2k reads), CSV
python scSaturation.py --mode per-cell --bam in.bam --out-prefix sample_cells \
--per-cell-dedup cell --per-cell-max 100 --per-cell-min-reads 2000 --csv- Per-cell curves, cell-gene dedup for a custom CB list
python scSaturation.py --mode per-cell --bam in.bam --out-prefix sample_cells_cg \
--per-cell-dedup cell-gene --cb-list cells.txtPipeline to score, visualize and filter 5' and 3' end artefacts in RNA-seq libraries
Usage
usage: libend_artifact_filter.py [-h] --mode {5p,3p} [--log-level {DEBUG,INFO,WARNING,ERROR}] {score,plot,filter} ...
Unified 5' (TSO) & 3' (oligo-dT) artefact analysis: score, plot, filter.
positional arguments:
{score,plot,filter}
score Compute per-read scores and threshold grids.
plot Plot heatmaps, coverage curves and logos.
filter Filter BAM by artefact rules.
options:
-h, --help show this help message and exit
--mode {5p,3p} Library end mode: 5p (TSO) or 3p (oligo-dT)
--log-level {DEBUG,INFO,WARNING,ERROR}
Logging verbosity.An example
python3 libend_artifact_filter.py --mode 3p score \
-b CCE78_SCFlu25r1t2_3P_dedup.bam \
-f /cfs/klemming/projects/supr/sllstore2017018/ref_share/fungi/SaCer_embl_74/Saccharomyces_cerevisiae.EF4.74.dna.toplevel.fa \
--oligos /cfs/klemming/home/h/huizhou/software/RNAEdgeFlow_dev/external/oligo_dT.tsv \
--min-mapq 255 \
--prefix CCE78_SCFlu25r1t2_3P_dedup
python3 libend_artifact_filter.py --mode 3p plot \
--prefix CCE78_SCFlu25r1t2_3P_dedup
python3 libend_artifact_filter.py --mode 3p plot \
--prefix CCE78_SCFlu25r1t2_3P_dedup \
--make-logo bits \
--logo-score1-max 2 \
--logo-score2-max 7
# Filter reads based on 3' artifact scores and logo plots
python3 libend_artifact_filter.py --mode 3p filter \
-b CCE78_SCFlu25r1t2_3P_dedup.bam \
-f /cfs/klemming/projects/supr/sllstore2017018/ref_share/fungi/SaCer_embl_74/Saccharomyces_cerevisiae.EF4.74.dna.toplevel.fa \
--oligos /cfs/klemming/home/h/huizhou/software/RNAEdgeFlow_dev/external/oligo_dT.tsv \
--read-window-len 34 --score1-thresh 2 --score2-thresh 7
-o CCE78_SCFlu25r1t2_3P_filtered.bam