Skip to content

UTA Query for transcript is non-deterministic #435

@zealws

Description

@zealws

Describe the bug

The query for transcripts against the UTA database is non-deterministic, and returns an arbitrary matching transcript each time it's run.

This is a result of a join between associated_accessions and tx_exon_aln_v which doesn't fully constrain the rows in the tx_exon_aln_v table, resulting in the database choosing one arbitrarily. The fields of this arbitrary row are used to order the results, resulting in a random transcript being selected and returned.

I ran fusor 50 times using a script, and it resulted in a variety of transcripts being selected for the 5' partner (columns are count and transcript ID):

   8 refseq:NM_001320453.1
   6 refseq:NM_001320454.1
   4 refseq:NM_001320454.2
  11 refseq:NM_001320455.1
   3 refseq:NM_001320455.2
  14 refseq:NM_006365.2
   4 refseq:NM_006365.3

Steps to reproduce

  1. Download this arriba fusion.
  2. Use fusor to annotate this fusion.
  3. Check the 5' partner's transcript.
  4. Rerun fusor, and check the transcript again.

Expected behavior

The chosen transcript is deterministic and does not change from one run to another.

Current behavior

The chosen transcript is non-deterministic, and returns an arbitrary matching transcript each time it is run.

Possible reason(s)

The query joins the tables associated_accessions with tx_exon_aln_v on associated_accessions.tx_ac = tx_exon_aln_v.tx_ac (this happens transitively through a join on the transcript with the same column).
For a given value of associated_accessions.tx_ac, there may be several matching tx_exon_aln_v rows (example below).

As a result, the join chooses an arbitrary row from tx_exon_aln_v. These rows differ in their tx_start_i and tx_end_i, which is used to order the transcripts, resulting in the transcript order varying each time the query is run.

Example query of matching values of tx_exon_aln_v for a given tx_ac:

uta=> select tx_ac, tx_start_i, tx_end_i from tx_exon_aln_v where tx_ac = 'NM_001320453.1';
     tx_ac      | tx_start_i | tx_end_i
----------------+------------+----------
 NM_001320453.1 |          0 |      173
 NM_001320453.1 |        173 |      248
 NM_001320453.1 |        248 |      344
 NM_001320453.1 |        344 |      444
 NM_001320453.1 |        444 |      794
 NM_001320453.1 |          0 |      173
 NM_001320453.1 |        173 |      248
 NM_001320453.1 |        248 |      344
 NM_001320453.1 |        344 |      444
 NM_001320453.1 |        444 |      794
(10 rows)

Suggested fix

Find a way to make the join from associated_accessions.tx_ac = tx_exon_aln_v.tx_ac deterministic.

Branch, commit, and/or version

main

Screenshots

No response

Environment details

Not relevant

Additional details

No response

Contribution

I can help prepare a PR, but I would like someone with a better understanding of the structure of the UTA database to determine how best to make the join deterministic.

Metadata

Metadata

Assignees

No one assigned

    Labels

    bugSomething isn't workingpriority:mediumMedium priority

    Type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions