eliottBo commented Dec 2, 2025

Add module whatshap/stats.

PR checklist

Closes #5787

  • This comment contains a description of changes (with reason).
  • If you've fixed a bug or added code that should be tested, add tests!
  • If you've added a new tool - have you followed the module conventions in the contribution docs
  • If necessary, include test data in your PR.
  • Remove all TODO statements.
  • Emit the versions.yml file.
  • Follow the naming conventions.
  • Follow the parameters requirements.
  • Follow the input/output options guidelines.
  • Add a resource label
  • Use BioConda and BioContainers if possible to fulfil software requirements.
  • Ensure that the test works with either Docker / Singularity. Conda CI tests can be quite flaky:
    • For modules:
      • nf-core modules test <MODULE> --profile docker
      • nf-core modules test <MODULE> --profile singularity
      • nf-core modules test <MODULE> --profile conda
    • For subworkflows:
      • nf-core subworkflows test <SUBWORKFLOW> --profile docker
      • nf-core subworkflows test <SUBWORKFLOW> --profile singularity
      • nf-core subworkflows test <SUBWORKFLOW> --profile conda

@eliottBo eliottBo self-assigned this Dec 2, 2025
@eliottBo eliottBo marked this pull request as ready for review December 2, 2025 14:11
nschan (Contributor) commented Dec 2, 2025

This looks overall fine. My concern is that whatshap stats has a bunch of arguments that are not clearly documented on readthedocs (https://github.com/whatshap/whatshap/blob/main/whatshap/cli/stats.py):

    add("--gtf", metavar="FILE", help="Write phased blocks as GTF with each block represented as a "
        "'gene'. If blocks are interleaved or nested, they are split into multiple 'exons'.")
    add("--block-list", metavar="FILE", help="Write list of all blocks to FILE (one block per "
        "line). Nested/interleaved blocks are not split.")
    add("--sample", metavar="SAMPLE", help="Name of the sample "
        "to process. If not given, use first sample found in VCF.")
    add("--chr-lengths", metavar="FILE",
        help="Override chromosome lengths in VCF with those from FILE (one line per chromosome, "
        "tab separated '<chr> <length>'). Lengths are used to compute NG50 values.")
    add("--tsv", metavar="FILE", help="Write statistics in tab-separated value format to FILE")
    add("--only-snvs", default=False, action="store_true", help="Only process SNVs "
        "and ignore all other variants.")
    add("--chromosome", dest="chromosomes", metavar="CHROMOSOME", default=[], action="append",
        help="Name of chromosome(s) to process. If not given, all chromosomes in the "
        "input VCF are considered. Can be used multiple times and accepts a comma-separated list. ")
    add("vcf", metavar="VCF", help="Phased VCF file")

And it looks like the current implementation uses --tsv, but I think it would be nice to support maybe an input switch for format, to toggle between gtf, block-list and tsv? I think the other arguments can come via $args but additional output formats would not be captured as it is now.
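
A minimal sketch of what such a switch could look like, written in Python purely for illustration (the module itself is Nextflow). The function `build_stats_command`, its `formats` parameter, and the output file extensions are hypothetical; only the `--tsv`, `--gtf`, and `--block-list` flags come from the argument list above.

```python
from typing import Iterable, List

# Output flags taken from the whatshap stats argument list quoted above.
FORMAT_FLAGS = {
    "tsv": "--tsv",
    "gtf": "--gtf",
    "block-list": "--block-list",
}

# Hypothetical file extensions, chosen only for this illustration.
FORMAT_EXTENSIONS = {"tsv": "tsv", "gtf": "gtf", "block-list": "txt"}


def build_stats_command(vcf: str, prefix: str, formats: Iterable[str]) -> List[str]:
    """Return an argv list with one output flag per requested format."""
    command = ["whatshap", "stats"]
    for fmt in formats:
        if fmt not in FORMAT_FLAGS:
            raise ValueError(f"unknown output format: {fmt!r}")
        command += [FORMAT_FLAGS[fmt], f"{prefix}.{FORMAT_EXTENSIONS[fmt]}"]
    # The phased VCF is the single positional argument.
    command.append(vcf)
    return command


if __name__ == "__main__":
    print(" ".join(build_stats_command("sample.vcf.gz", "sample", ["tsv", "gtf"])))
```

With an empty `formats` list the command carries no output flag at all, which is the case where only stdout would be available.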

eliottBo (Author) commented Dec 3, 2025

Thanks for your review, @nschan. I have now updated my PR so that the user can choose which outputs they want, which makes the module more flexible.

nschan (Contributor) commented Dec 3, 2025

Thanks, that looks very good to me. I have one additional question, probably because I have no experience with it: independent of any --gtf or --tsv setting, will it always print to stdout, i.e. can you capture the stats with > ${prefix}.txt even if a different output format is specified?

eliottBo (Author) commented Dec 3, 2025

I considered always capturing stdout in the .txt file, but from a user's point of view I thought it would be better to get only the outputs you ask for. As it stands, the user can definitely get the stdout, which ends up in the .txt file, and any combination of outputs is possible. If, for example, the user is only interested in the .tsv or .gtf, it may not be necessary to always create the .txt file. But if you think it is better to always have it, I can add it by default.

nschan (Contributor) commented Dec 3, 2025

No, that is not what I meant, I only wanted to clarify if whatshap stats will always output the same format to stdout independent of the format specified.
However, this made me remember something: nf-core guidelines recommend using tee when capturing stdout (https://nf-co.re/docs/guidelines/components/modules#capturing-stdout-and-stderr). I think this is more of a "nice to do right" thing, but maybe it would be useful?
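
To illustrate what `tee` does in that setup, here is a small Python sketch that pipes a command's stdout through `tee`, so the text lands in a file and is still passed along on stdout. The helper `run_with_tee` is hypothetical, and `echo` stands in for the real `whatshap stats` call; it assumes a POSIX shell with `tee` available.

```python
import shlex
import subprocess
import tempfile
from pathlib import Path


def run_with_tee(shell_command: str, log_path: str) -> str:
    """Run shell_command, copying its stdout to log_path via tee while still returning it."""
    completed = subprocess.run(
        f"{shell_command} | tee {shlex.quote(log_path)}",
        shell=True,
        check=True,  # without pipefail this reflects tee's exit status only
        capture_output=True,
        text=True,
    )
    return completed.stdout


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        log_file = Path(tmp) / "sample.txt"
        # `echo` is a stand-in for the real `whatshap stats` invocation.
        stdout = run_with_tee("echo placeholder-stats", str(log_file))
        # The same text is both written to the file and kept on stdout.
        print(stdout == log_file.read_text())
```

A plain `>` redirect would write the file but leave nothing on stdout, which is the difference the guideline is concerned with.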

nschan (Contributor) left a comment

I think it might be a bit confusing that the default behaviour now produces no output, since all include_* values default to false. Since this is documented, I am okay with it, but maybe it would be better to have some default output.

fellen31 (Contributor) left a comment

Some suggestions! Nice job!

Comment on lines +19 to +22
tuple val(meta), path("${prefix}_whap_stats.tsv"), emit: tsv, optional: true
tuple val(meta), path("${prefix}_whap_stats.gtf"), emit: gtf, optional: true
tuple val(meta), path("${prefix}_whap_stats_block.txt"), emit: block, optional: true
tuple val(meta), path("${prefix}_whap_stats.txt"), emit: txt, optional: true

I suggest letting the user of the module choose whether to include e.g. _whap_stats via the prefix, rather than hardcoding it. We would still need to distinguish the block output (which is also tab-separated) from the txt output, if it's kept, for example by calling the txt output a log, which I'd argue it is, especially when running with the --debug flag.

Suggested change
- tuple val(meta), path("${prefix}_whap_stats.tsv"), emit: tsv, optional: true
- tuple val(meta), path("${prefix}_whap_stats.gtf"), emit: gtf, optional: true
- tuple val(meta), path("${prefix}_whap_stats_block.txt"), emit: block, optional: true
- tuple val(meta), path("${prefix}_whap_stats.txt"), emit: txt, optional: true
+ tuple val(meta), path("${prefix}.tsv"), emit: tsv, optional: true
+ tuple val(meta), path("${prefix}.gtf"), emit: gtf, optional: true
+ tuple val(meta), path("${prefix}.txt"), emit: block, optional: true
+ tuple val(meta), path("${prefix}.log"), emit: log, optional: true

$output_gtf \\
$output_block \\
$vcf \\
$output_txt \\

Suggested change
- $output_txt \\
+ $output_txt

def output_tsv = include_tsv_output ? "--tsv ${prefix}_whap_stats.tsv" : ''
def output_gtf = include_gtf_output ? "--gtf ${prefix}_whap_stats.gtf" : ''
def output_block = inlude_block_output ? "--block-list ${prefix}_whap_stats_block.txt" : ''
def output_txt = include_txt_output ? "> ${prefix}_whap_stats.txt" : ''

I think having it like this is OK, but I would also be down to scrap the stdout output altogether. I think it's more of a log than an output, really. No one in their right mind would choose to process that over the TSV 😅 (used by e.g. MultiQC). The printed output would still be available in the stdout files.

If kept, since the tool will always write this to stdout, regardless of whether any of the other output formats are selected, I think it would be okay to always output this log by default as well. The user doesn't have to use it if they don't want to.

Comment on lines +49 to +57
def output_tsv = include_tsv_output ? "--tsv ${prefix}_whap_stats.tsv" : ''
def output_gtf = include_gtf_output ? "--gtf ${prefix}_whap_stats.gtf" : ''
def output_block = inlude_block_output ? "--block-list ${prefix}_whap_stats_block.txt" : ''
def output_txt = include_txt_output ? "> ${prefix}_whap_stats.txt" : ''
"""
touch ${prefix}_whap_stats.tsv
touch ${prefix}_whap_stats.gtf
touch ${prefix}_whap_stats_block.txt
touch ${prefix}_whap_stats.txt

Something like this would make the stub section mimic the script section correctly.

Suggested change
- def output_tsv = include_tsv_output ? "--tsv ${prefix}_whap_stats.tsv" : ''
- def output_gtf = include_gtf_output ? "--gtf ${prefix}_whap_stats.gtf" : ''
- def output_block = inlude_block_output ? "--block-list ${prefix}_whap_stats_block.txt" : ''
- def output_txt = include_txt_output ? "> ${prefix}_whap_stats.txt" : ''
- """
- touch ${prefix}_whap_stats.tsv
- touch ${prefix}_whap_stats.gtf
- touch ${prefix}_whap_stats_block.txt
- touch ${prefix}_whap_stats.txt
+ def tsv_touch_cmd = include_tsv_output ? "touch ${prefix}.tsv" : ''
+ def gtf_touch_cmd = include_gtf_output ? "touch ${prefix}.gtf" : ''
+ def block_touch_cmd = inlude_block_output ? "touch ${prefix}.txt" : ''
+ def log_touch_cmd = include_txt_output ? "touch ${prefix}.log" : ''
+ """
+ echo $args
+ $tsv_touch_cmd
+ $gtf_touch_cmd
+ $block_touch_cmd
+ $log_touch_cmd

Development

Successfully merging this pull request may close these issues.

new module: WHATSHAP/STATS

3 participants