Problem
In the postprocess pipeline script for making negatives, we attempt to sort the data by the miRNA family column, but we use nl -v 0 together with sort -k. nl -v 0 starts numbering at 0, whereas sort -k expects 1-based field numbers, so the column selected for sorting is off by one. As a result, the data is sorted by miRNA name instead of miRNA family. This is present up to https://github.com/BioGeMT/miRBench_paper/releases/tag/v1.0.0.
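The mismatch can be reproduced in isolation. The following is a minimal sketch, not the pipeline's actual code: the header, the column names (gene, mirna_name, mirna_fam, label) and the way the index is looked up are all assumptions made for illustration. It uses cut -f, which counts fields from 1 just like sort -k, to show which column each index really selects.

```shell
# Hypothetical tab-separated header; the real column names and order may differ.
header=$(printf 'gene\tmirna_name\tmirna_fam\tlabel')

# Look up the position of the family column, numbering fields from 0 and from 1.
idx0=$(printf '%s\n' "$header" | tr '\t' '\n' | nl -v 0 | awk '$2 == "mirna_fam" {print $1}')
idx1=$(printf '%s\n' "$header" | tr '\t' '\n' | nl -v 1 | awk '$2 == "mirna_fam" {print $1}')

# cut -f, like sort -k, is 1-based, so it reveals which column each index points at.
col0=$(printf '%s\n' "$header" | cut -f "$idx0")
col1=$(printf '%s\n' "$header" | cut -f "$idx1")

echo "index from nl -v 0: $idx0 -> column read by a 1-based tool: $col0"
echo "index from nl -v 1: $idx1 -> column read by a 1-based tool: $col1"
```

In this toy layout the 0-based index lands on the column immediately before the family column, which matches the behaviour described above only if the miRNA name column directly precedes the miRNA family column.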
Consequences
Since the data is sorted by miRNA name, and the make_neg_sets.py script processes the data one miRNA family block at a time, the rows of a given miRNA family are no longer guaranteed to be contiguous, so the same miRNA family may be processed more than once. Additionally, since the blacklisted genes are based on cluster IDs that are not in a specific miRNA family (now miRNA name) block, genes assigned to miRNAs from the same miRNA family are pooled as candidate genes to be sampled from.
However, the miraw_analysis was carried out on the gene column as a sanity check, and the evaluation metric was at chance level (APS = 0.50; refer to the miRBench publication), suggesting no notable effect.
Solution
In future versions, nl -v 0 should be changed to nl -v 1.
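A rough sketch of the corrected behaviour on toy data is given below. The table contents, the column names and the lookup via tr and awk are hypothetical and only illustrate the 1-based numbering; the real script may locate the column differently.

```shell
# Hypothetical toy table (tab-separated); names and values are placeholders.
data=$(printf 'gene\tmirna_name\tmirna_fam\nG1\tmiR-a\tfam2\nG2\tmiR-b\tfam1\nG3\tmiR-c\tfam2\n')

header=$(printf '%s\n' "$data" | head -n 1)

# Number the header fields from 1 so the index matches what sort -k expects.
fam_idx=$(printf '%s\n' "$header" | tr '\t' '\n' | nl -v 1 | awk '$2 == "mirna_fam" {print $1}')

# Sort the body on the family column only, using a tab as the field separator.
sorted=$(printf '%s\n' "$data" | tail -n +2 | sort -t "$(printf '\t')" -k "${fam_idx},${fam_idx}")

# Print the header followed by the sorted rows.
printf '%s\n%s\n' "$header" "$sorted"
```

With this toy table, sorting on the family column keeps both fam2 rows adjacent, whereas sorting on the name column would place the fam1 row between them and split fam2 into two blocks.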