-
Notifications
You must be signed in to change notification settings - Fork 0
Description
Describe the bug
The query for transcripts against the UTA database is non-deterministic, and returns an arbitrary matching transcript each time it's run.
This is a result of a join between associated_accessions and tx_exon_aln_v which doesn't fully constrain the rows in the tx_exon_aln_v table, resulting in the database choosing one arbitrarily. The fields of this arbitrary row are used to order the results, resulting in a random transcript being selected and returned.
I ran fusor 50 times using a script, and it resulted in a variety of transcripts being selected for the 5' partner (columns are count and transcript ID):
8 refseq:NM_001320453.1
6 refseq:NM_001320454.1
4 refseq:NM_001320454.2
11 refseq:NM_001320455.1
3 refseq:NM_001320455.2
14 refseq:NM_006365.2
4 refseq:NM_006365.3
Steps to reproduce
- Download this arriba fusion.
- Use fusor to annotate this fusion.
- Check the 5' partner's transcript.
- Rerun fusor, and check the transcript again.
Expected behavior
The chosen transcript is deterministic and does not change from one run to another.
Current behavior
The chosen transcript is non-deterministic, and returns an arbitrary matching transcript each time it is run.
Possible reason(s)
The query joins the tables associated_accessions with tx_exon_aln_v on associated_accessions.tx_ac = tx_exon_aln_v.tx_ac (this happens transitively through a join on the transcript with the same column).
For a given value of associated_accessions.tx_ac, there may be several matching tx_exon_aln_v rows (example below).
As a result, the join chooses an arbitrary row from tx_exon_aln_v. These rows differ in their tx_start_i and tx_end_i, which is used to order the transcripts, resulting in the transcript order varying each time the query is run.
Example query of matching values of tx_exon_aln_v for a given tx_ac:
uta=> select tx_ac, tx_start_i, tx_end_i from tx_exon_aln_v where tx_ac = 'NM_001320453.1';
tx_ac | tx_start_i | tx_end_i
----------------+------------+----------
NM_001320453.1 | 0 | 173
NM_001320453.1 | 173 | 248
NM_001320453.1 | 248 | 344
NM_001320453.1 | 344 | 444
NM_001320453.1 | 444 | 794
NM_001320453.1 | 0 | 173
NM_001320453.1 | 173 | 248
NM_001320453.1 | 248 | 344
NM_001320453.1 | 344 | 444
NM_001320453.1 | 444 | 794
(10 rows)Suggested fix
Find a way to make the join from associated_accessions.tx_ac = tx_exon_aln_v.tx_ac deterministic.
Branch, commit, and/or version
main
Screenshots
No response
Environment details
Not relevant
Additional details
No response
Contribution
I can help prepare a PR, but I would like someone with a better understanding of the structure of the UTA database to determine how best to make the join deterministic.