@gaurav gaurav commented Oct 29, 2025

Changes:

| Filename | 2025sep1 | 2025nov4 | Diff | % Diff |
| --- | ---: | ---: | ---: | ---: |
| Count of CURIEs in all files | 688,983,999 | 637,487,209 | -51,496,790 | -7.47% |
| Count of cliques in all files | 490,293,340 | 438,399,813 | -51,893,527 | -10.58% |
| AnatomicalEntity | 249,584 | 249,612 | +28 | 0.01% |
| BiologicalProcess | 67,929 | 66,533 | -1,396 | -2.06% |
| Cell | 13,175 | 13,291 | +116 | 0.88% |
| CellLine | 38,810 | 38,810 | 0 | 0.00% |
| CellularComponent | 14,696 | 14,714 | +18 | 0.12% |
| ChemicalEntity | 4,046,131 | 4,132,795 | +86,664 | 2.14% |
| ChemicalMixture | 530 | 610 | +80 | 15.09% |
| ComplexMolecularMixture | 276 | 274 | -2 | -0.72% |
| Disease | 632,330 | 629,837 | -2,493 | -0.39% |
| Drug | 360,925 | 361,770 | +845 | 0.23% |
| Gene | 79,427,652 | 80,884,648 | +1,456,996 | 1.83% |
| GeneFamily | 28,050 | 28,137 | +87 | 0.31% |
| GrossAnatomicalStructure | 15,709 | 15,745 | +36 | 0.23% |
| MacromolecularComplex | 1,258 | 631 | -627 | -49.84% |
| MolecularActivity | 206,636 | 207,894 | +1,258 | 0.61% |
| MolecularMixture | 21,879,355 | 21,872,441 | -6,914 | -0.03% |
| OrganismTaxon | 3,543,867 | 3,558,912 | +15,045 | 0.42% |
| Pathway | 53,125 | 53,366 | +241 | 0.45% |
| PhenotypicFeature | 483,108 | 483,981 | +873 | 0.18% |
| Polypeptide | 166 | 164 | -2 | -1.20% |
| Protein | 275,514,857 | 221,451,194 | -54,063,663 | -19.62% |
| Publication | 79,773,973 | 80,814,784 | +1,040,811 | 1.30% |
| SmallMolecule | 221,734,011 | 221,709,221 | -24,790 | -0.01% |
| umls | 897,846 | 897,845 | -1 | -0.00% |

(Note that there is at least one error in the 2025sep1 report: it says there are 1,258 MacromolecularComplexes, but really there are 629)
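
For reference, here is a minimal sketch of how a Diff / % Diff pair can be recomputed for the MacromolecularComplex row, first with the reported 2025sep1 count and then with the corrected count of 629. The helper name `diff_row` is hypothetical and is not part of the report-generation code.

```python
def diff_row(old_count, new_count):
    """Return (absolute diff, percent diff rounded to 2 places) between two counts."""
    diff = new_count - old_count
    pct = diff / old_count * 100
    return diff, round(pct, 2)

# Using the count as reported for 2025sep1.
reported = diff_row(1258, 631)
# Using the corrected 2025sep1 count mentioned in the note above.
corrected = diff_row(629, 631)
print(reported, corrected)
```

With the corrected count, the change for MacromolecularComplex is a small increase rather than a drop of roughly half.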

@gaurav gaurav changed the base branch from master to drugchemical-improvements December 14, 2025 04:16
@gaurav gaurav marked this pull request as ready for review December 14, 2025 04:27
gaurav added a commit that referenced this pull request Dec 15, 2025
…mical conflation (#626)

Back in PR #506, I
implemented a fancy new DrugChemical conflation system that chose an
overall type for the entire conflated clique with a preference order
(e.g. SmallMolecule is preferred over ChemicalEntity), and then that
type would be used to choose the identifier order for the conflated
clique. However, while building Babel 1.14
(#606), I realized that
this approach was occasionally failing badly: most prominently,
CHEBI:5931 "insulin human" would get placed further down in the
conflation list than lots of other identifiers, including
UNII:AVT680JB39 "Insulin pork". This is partly because other changes
have caused more identifiers to be typed as ChemicalEntity, so
they end up lower in the list than some other identifiers.

This PR replaces that approach with a much simpler algorithm: we now
group identifiers by prefix rather than by Biolink type, and then order
them with [the preferred prefix order for
biolink:ChemicalEntity](https://biolink.github.io/biolink-model/ChemicalEntity/#valid-id-prefixes),
with RXCUI forced to the end of the list. This pushes a lot more CHEBIs
and UNIIs to the top of the conflations, which gives us better primary
identifiers and labels for the conflation.
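
A minimal sketch of the ordering described above, not the actual Babel implementation: identifiers are ranked by prefix, with RXCUI forced to the end. The prefix list `ASSUMED_PREFIX_ORDER` is an illustrative subset rather than the real Biolink order, the function names are hypothetical, the `MESH`, `EXAMPLE` and `RXCUI` identifiers are placeholders, and placing unknown prefixes just before RXCUI is an assumption.

```python
# Illustrative prefix order only (assumption); the real order comes from the
# valid ID prefixes listed for biolink:ChemicalEntity in the Biolink Model.
ASSUMED_PREFIX_ORDER = ["CHEBI", "UNII", "PUBCHEM.COMPOUND", "DRUGBANK", "MESH"]


def prefix_of(curie):
    """Return the prefix of a CURIE, e.g. 'CHEBI' for 'CHEBI:5931'."""
    return curie.split(":", 1)[0]


def order_conflation(curies, prefix_order=ASSUMED_PREFIX_ORDER, last_prefix="RXCUI"):
    """Order identifiers by prefix rank, forcing `last_prefix` to the end."""
    ranks = {p: i for i, p in enumerate(prefix_order) if p != last_prefix}
    unknown_rank = len(prefix_order)   # assumption: unknown prefixes go after known ones
    last_rank = unknown_rank + 1       # RXCUI sorts after everything else

    def sort_key(curie):
        prefix = prefix_of(curie)
        if prefix == last_prefix:
            return last_rank
        return ranks.get(prefix, unknown_rank)

    # sorted() is stable, so identifiers sharing a prefix keep their input order.
    return sorted(curies, key=sort_key)


ordered = order_conflation(
    ["RXCUI:12345", "UNII:AVT680JB39", "EXAMPLE:1", "MESH:D000000", "CHEBI:5931"]
)
print(ordered[0])  # the first identifier becomes the primary identifier
```

Because the sort key depends only on the prefix, an identifier's position no longer changes when its Biolink type changes, which is the failure described for CHEBI:5931.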

Additionally:
- We were occasionally including non-chemical identifiers (possibly
because we're pulling in proteins or something?). This PR modifies our
process so that we check whether an identifier is a chemical before we
add it to the glomming.
- Removes the preferred prefix order from the config.yaml file and puts
it back into the DrugChemical conflated file, commented out. We
might move it back in the future if we can make it work better/at all.
- Renames `load_cliques()` to `load_cliques_containing_rxcui()` to make
its purpose clearer.
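
The chemical check in the first bullet could look roughly like the sketch below, which drops non-chemical identifiers before overlapping groups are merged. This is not Babel's code: the set `ASSUMED_CHEMICAL_TYPES`, the functions `is_chemical` and `glom_chemicals`, the type lookup dictionary, and the `UniProtKB` and `RXCUI` identifiers are all hypothetical placeholders.

```python
# Assumption: which Biolink types count as "chemical" for this conflation.
ASSUMED_CHEMICAL_TYPES = {
    "biolink:ChemicalEntity",
    "biolink:SmallMolecule",
    "biolink:Drug",
    "biolink:MolecularMixture",
}


def is_chemical(curie, types_by_curie):
    """True if the identifier's known type is in the assumed chemical set."""
    return types_by_curie.get(curie) in ASSUMED_CHEMICAL_TYPES


def glom_chemicals(groups, types_by_curie):
    """Merge overlapping identifier groups, skipping non-chemical identifiers."""
    merged = []  # list of disjoint sets
    for group in groups:
        chemicals = {c for c in group if is_chemical(c, types_by_curie)}
        if not chemicals:
            continue  # nothing chemical in this group, so nothing to add
        # Fold every existing set that shares an identifier into this one.
        for existing in [s for s in merged if s & chemicals]:
            chemicals |= existing
            merged.remove(existing)
        merged.append(chemicals)
    return merged


types_by_curie = {
    "CHEBI:5931": "biolink:SmallMolecule",
    "UNII:AVT680JB39": "biolink:ChemicalEntity",
    "RXCUI:12345": "biolink:Drug",          # placeholder identifier
    "UniProtKB:P00000": "biolink:Protein",  # placeholder identifier
}
groups = [
    ["RXCUI:12345", "CHEBI:5931"],
    ["CHEBI:5931", "UniProtKB:P00000"],
    ["UNII:AVT680JB39"],
]
conflations = glom_chemicals(groups, types_by_curie)
print(conflations)
```

Filtering before merging means a non-chemical identifier can never act as a bridge that joins two otherwise unrelated chemical groups.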
Base automatically changed from drugchemical-improvements to master December 15, 2025 17:57
@gaurav gaurav merged commit 3131827 into master Dec 15, 2025
2 of 4 checks passed
@gaurav gaurav deleted the babel-1.14 branch December 15, 2025 19:03
Development

Successfully merging this pull request may close these issues.

GitHub Docker image needs fixing before it can be used as a Github repository
