Off-by-one error in postprocess pipeline #58

@stephaniesamm

Problem

In the postprocess pipeline script for making negatives, we attempt to sort the data by the miRNA family column, but we combine `nl -v 0` with `sort -k`. `nl -v 0` starts numbering at 0, whereas `sort -k` counts fields from 1, so the column selected for sorting is off by one. As a result, the data is sorted by miRNA name instead of miRNA family. This is present up to https://github.com/BioGeMT/miRBench_paper/releases/tag/v1.0.0.
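The mismatch can be reproduced on a toy table. This is only a sketch: the column names (`gene`, `mirna_name`, `mirna_fam`), the values, and the way the index is looked up (numbering the header fields with `nl`) are assumptions made for illustration, not the actual pipeline code.

```shell
# Toy tab-separated table; column names and values are hypothetical.
tmp=$(mktemp)
printf 'gene\tmirna_name\tmirna_fam\n' > "$tmp"
printf 'g1\tmiR-c\tfam1\n' >> "$tmp"
printf 'g2\tmiR-a\tfam1\n' >> "$tmp"
printf 'g3\tmiR-b\tfam2\n' >> "$tmp"

# Number the header fields starting from 0 and pick the family column.
col0=$(head -n 1 "$tmp" | tr '\t' '\n' | nl -v 0 | awk '$2 == "mirna_fam" {print $1}')

# sort -k counts fields from 1, so the 0-based index points one column too far left.
sorted_names=$(tail -n +2 "$tmp" | sort -t "$(printf '\t')" -k "${col0},${col0}" | cut -f 2 | paste -sd ' ' -)

echo "index from nl -v 0: $col0"
echo "rows ordered by: $sorted_names"
rm -f "$tmp"
```

On this toy table the lookup returns 2, which `sort -k` interprets as the second field, so the rows come out ordered by `mirna_name` rather than by `mirna_fam`.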

Consequences

Since the data is sorted by miRNA name while the make_neg_sets.py script processes it one miRNA family block at a time, the same miRNA family may be processed more than once. Additionally, because the blacklisted genes are derived from cluster IDs relative to the current block, which is now a miRNA name block rather than a miRNA family block, genes assigned to other miRNAs from the same miRNA family are pooled as candidate genes to be sampled from.

However, as a sanity check, the miraw_analysis was carried out on the gene column, and the evaluation metric was no better than random (APS = 0.50; refer to the miRBench publication), suggesting no notable effect.
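The repeated processing of a family described above can be illustrated by counting contiguous family blocks under each sort order. This is a minimal sketch on hypothetical data (three made-up rows, with the family in the third field); it does not reproduce the logic of make_neg_sets.py.

```shell
tab=$(printf '\t')
# Hypothetical rows: gene, miRNA name, miRNA family.
rows=$(printf 'g1\tmiR-c\tfam1\ng2\tmiR-a\tfam1\ng3\tmiR-b\tfam2\n')

# Sorted by name (field 2): count contiguous runs of the family column.
blocks_by_name=$(printf '%s\n' "$rows" | sort -t "$tab" -k 2,2 | cut -f 3 | uniq | wc -l | tr -d ' ')

# Sorted by family (field 3): each family forms exactly one run.
blocks_by_fam=$(printf '%s\n' "$rows" | sort -t "$tab" -k 3,3 | cut -f 3 | uniq | wc -l | tr -d ' ')

echo "family blocks when sorted by name:   $blocks_by_name"
echo "family blocks when sorted by family: $blocks_by_fam"
```

With only two families in the toy data, sorting by name yields three blocks because fam1 is split around fam2, whereas sorting by family yields two.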

Solution

In future versions, `nl -v 0` should be changed to `nl -v 1` so that the numbering matches the 1-based field numbering of `sort -k`.
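A quick check of the proposed change on the same kind of toy table might look like the following. As before, the column names, the values, and the header-numbering lookup are assumptions for illustration only.

```shell
# Toy tab-separated table; column names and values are hypothetical.
tmp=$(mktemp)
printf 'gene\tmirna_name\tmirna_fam\ng1\tmiR-c\tfam1\ng2\tmiR-a\tfam1\ng3\tmiR-b\tfam2\n' > "$tmp"

# Number the header fields starting from 1, matching sort -k.
col1=$(head -n 1 "$tmp" | tr '\t' '\n' | nl -v 1 | awk '$2 == "mirna_fam" {print $1}')

# The same index now selects the family column for both sort and cut.
sorted_fams=$(tail -n +2 "$tmp" | sort -t "$(printf '\t')" -k "${col1},${col1}" | cut -f "$col1" | paste -sd ' ' -)

echo "index from nl -v 1: $col1"
echo "family column after sorting: $sorted_fams"
rm -f "$tmp"
```

Here the lookup returns 3, and the rows are grouped by family, so each family forms a single contiguous block.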
