DBmiRNA uses a dedicated normalization layer so canonical miRNA, gene, and transcript IDs do not have to be hard-coded inside each loader.
The active policy is stored in:
The current canonical position is:
- miRNAs: miRBase v22
- genes: Ensembl v115
- transcripts: Ensembl v115
This is the current BioGeMT DBmiRNA target namespace, not a claim that every source repository already uses those same releases internally.
DBmiRNA keeps track of the fact that source repositories do not all live in the same release world.
Current examples:
- `FuNmiRBench`: gene identifiers are currently configured with source release `v109`
- `genomic-region-annotator`: genes and transcripts are currently configured with source release `v115`
That source-side release context is preserved in each entity's normalization block.
DBmiRNA separates:
- canonical namespace choice
- source release context
- source identifier evidence
- normalization status
Current internal ID patterns are:
- miRNAs: `mirna:mirbase_v22:<miRNA_name>`
- genes: `gene:ensembl_v115:<ENSG...>`
- transcripts: `tx:ensembl_v115:<ENST...>`
For miRNAs, DBmiRNA currently uses the active miRBase release plus the miRNA name as the internal stable join key. If a miRBase accession is present, it is preserved in the entity document and in the normalization metadata. This avoids splitting equivalent miRNAs across sources when one source provides only the name and another provides both name and accession.
For genes and transcripts, the accession remains the canonical anchor, while the source release is tracked separately from the DBmiRNA canonical release.
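The ID patterns above can be sketched in a few lines of Python. This is a minimal illustration, not DBmiRNA's actual code: the helper names (`mirna_id`, `gene_id`, `transcript_id`), the record layout, and the source labels `source_a` and `source_b` are all hypothetical. It shows why keying miRNAs on release plus name lets a name-only record and a name-plus-accession record land on the same entity.

```python
def mirna_id(name: str, release: str = "v22") -> str:
    """Internal miRNA ID: keyed on the miRNA name, not the accession."""
    return f"mirna:mirbase_{release}:{name}"


def gene_id(accession: str, release: str = "v115") -> str:
    """Internal gene ID: the Ensembl gene accession is the anchor."""
    return f"gene:ensembl_{release}:{accession}"


def transcript_id(accession: str, release: str = "v115") -> str:
    """Internal transcript ID: the Ensembl transcript accession is the anchor."""
    return f"tx:ensembl_{release}:{accession}"


# Two hypothetical source records for the same miRNA:
# one carries only the name, the other carries name and accession.
records = [
    {"source": "source_a", "name": "hsa-miR-21-5p"},
    {"source": "source_b", "name": "hsa-miR-21-5p", "accession": "MIMAT0000076"},
]

merged = {}
for rec in records:
    key = mirna_id(rec["name"])
    entry = merged.setdefault(
        key, {"name": rec["name"], "accession": None, "sources": []}
    )
    entry["sources"].append(rec["source"])
    # Preserve the accession when any source provides one.
    if rec.get("accession"):
        entry["accession"] = rec["accession"]
```

Both records collapse into one entry under `mirna:mirbase_v22:hsa-miR-21-5p`, and the accession from the second source is kept alongside it.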
Each normalized entity can carry:
- `provider`
- `canonical_release`
- `source_release`
- `source_name` or `source_accession`
- `status`
This lets BioGeMT answer three important questions at any time:
- what DBmiRNA considers canonical right now
- what release the source data came from
- how strong the identifier normalization was for that record
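As a rough sketch of how a normalization block could answer those three questions, the fields can be modeled as a small dataclass. The field names come from the list above; everything else is an assumption made for illustration, including the class name `Normalization`, the provider string `"ensembl"`, the status value `"accession_match"`, and the `describe` helper.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Normalization:
    provider: str
    canonical_release: str
    source_release: str
    status: str
    # A record carries a source name, a source accession, or both.
    source_name: Optional[str] = None
    source_accession: Optional[str] = None


def describe(block: Normalization) -> dict:
    """Map a normalization block onto the three questions."""
    return {
        "canonical_now": f"{block.provider} {block.canonical_release}",
        "source_came_from": f"{block.provider} {block.source_release}",
        "normalization_strength": block.status,
    }


# Hypothetical gene record imported from a source configured at v109
# while the canonical target is v115.
block = Normalization(
    provider="ensembl",
    canonical_release="v115",
    source_release="v109",
    status="accession_match",  # hypothetical status value
    source_accession="ENSG00000141510",
)
summary = describe(block)
```

Keeping `canonical_release` and `source_release` as separate fields is what allows the two releases to differ without either one being overwritten.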
To change the active policy, edit `config/normalization.json`.
The most important keys are:
- `providers.mirna.canonical_release`
- `providers.gene.canonical_release`
- `providers.transcript.canonical_release`
- `source_defaults.<module>.mirna_release`
- `source_defaults.<module>.gene_release_default`
- `source_defaults.<module>.transcript_release_default`
If BioGeMT later wants to move to a different release, such as miRBase v23 or Ensembl v116, this config is the place to change the active target.
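To make the key paths concrete, here is a hedged sketch that reads them from a JSON document shaped the way the dotted keys suggest. The nesting, the `"v22"`-style value format, and the module key spellings are assumptions, not a copy of the real `config/normalization.json`; the `get_path` helper is hypothetical.

```python
import json

# Assumed layout: dotted keys read as nested JSON objects.
EXAMPLE_CONFIG = """
{
  "providers": {
    "mirna": {"canonical_release": "v22"},
    "gene": {"canonical_release": "v115"},
    "transcript": {"canonical_release": "v115"}
  },
  "source_defaults": {
    "FuNmiRBench": {"gene_release_default": "v109"},
    "genomic-region-annotator": {
      "gene_release_default": "v115",
      "transcript_release_default": "v115"
    }
  }
}
"""


def get_path(cfg: dict, dotted: str):
    """Walk a nested dict using a dotted key such as 'providers.gene.canonical_release'."""
    node = cfg
    for part in dotted.split("."):
        node = node[part]
    return node


config = json.loads(EXAMPLE_CONFIG)
canonical_gene = get_path(config, "providers.gene.canonical_release")
funmirbench_gene = get_path(config, "source_defaults.FuNmiRBench.gene_release_default")
```

Under this layout, moving the canonical gene target to another release would be a one-line change to `providers.gene.canonical_release`, while the per-module source defaults stay as they are.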
Changing the canonical release in config does not automatically remap historical data across releases.
It changes:
- what new exports consider canonical
- what namespace new IDs are written into
It does not yet do:
- Ensembl release-to-release remapping
- transcript coordinate liftovers
- miRBase accession history reconciliation across releases
Those need explicit bridge tables or external mapping resources.
Keep both of these for every imported entity:
- canonical ID under the active DBmiRNA policy
- source normalization metadata describing the original source world
That way BioGeMT always knows:
- where we stand now
- what source world the data came from
- what would need re-exporting or remapping if the canonical version changes later
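One way to act on the last point is to compare the release embedded in each stored canonical ID with the active policy. The sketch below assumes IDs follow the `<kind>:<provider>_<release>:<identifier>` patterns listed earlier; the function name `needs_reexport` and the `active` mapping are hypothetical. It only flags stale IDs and does not remap anything.

```python
def needs_reexport(canonical_id: str, active: dict) -> bool:
    """Return True when the ID was written under a release other than the active one."""
    # Split into kind, namespace, identifier; the identifier is left untouched.
    kind, namespace, _identifier = canonical_id.split(":", 2)
    # The namespace looks like 'ensembl_v115' or 'mirbase_v22'.
    _provider, release = namespace.rsplit("_", 1)
    return active[kind] != release


# Hypothetical policy after moving genes to a newer release.
active = {"mirna": "v22", "gene": "v116", "tx": "v115"}

stored_ids = [
    "mirna:mirbase_v22:hsa-miR-21-5p",
    "gene:ensembl_v115:ENSG00000141510",
    "tx:ensembl_v115:ENST00000269305",
]

stale = [cid for cid in stored_ids if needs_reexport(cid, active)]
```

Here only the gene ID is flagged, because its namespace still says `v115` while the hypothetical policy says `v116`. Deciding what the ID should become would still require a bridge table or an external mapping resource.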