Babel 1.14 #606

gaurav · 2025-10-29T22:20:02Z

Changes:

Updated versions (Biolink Model from 4.2.6-rc5 to 4.3.2, RxNorm version from 07072025 to 10062025)
In PR #604 (Add a taxon_specific flag to the synonyms file output), I added a taxon_specific flag to synonyms (closing #601, Add a taxon_specific flag to the NameRes database), but I forgot to add it to the UMLS leftover file or the conflated synonym files (DrugChemicalConflated.txt and GeneProteinConflated.txt). I've added it to those files in this PR.
Set persist-credentials: false when checking out code when building the Docker package. Closes #119 (GitHub Docker image needs fixing before it can be used as a Github repository).
Set fetch-depth: 1 when checking out code when building the Docker package: we don't need to include the full Git history in the Docker image, and we can fetch the other changes as needed.
Updated some references from https://github.com/TranslatorSRI/Babel to https://github.com/NCATSTranslator/Babel.
Set up a logger for synonymconflation.py.
Minor fixes to pull_via_wget().
Minor fixes to the Kubernetes files needed to start Babel.

| Filename | 2025sep1 | 2025nov4 | Diff | % Diff |
| --- | ---: | ---: | ---: | ---: |
| Count of CURIEs in all files | 688,983,999 | 637,487,209 | -51,496,790 | -7.47% |
| Count of cliques in all files | 490,293,340 | 438,399,813 | -51,893,527 | -10.58% |
| AnatomicalEntity | 249,584 | 249,612 | +28 | 0.01% |
| BiologicalProcess | 67,929 | 66,533 | -1,396 | -2.06% |
| Cell | 13,175 | 13,291 | +116 | 0.88% |
| CellLine | 38,810 | 38,810 | 0 | 0.00% |
| CellularComponent | 14,696 | 14,714 | +18 | 0.12% |
| ChemicalEntity | 4,046,131 | 4,132,795 | +86,664 | 2.14% |
| ChemicalMixture | 530 | 610 | +80 | 15.09% |
| ComplexMolecularMixture | 276 | 274 | -2 | -0.72% |
| Disease | 632,330 | 629,837 | -2,493 | -0.39% |
| Drug | 360,925 | 361,770 | +845 | 0.23% |
| Gene | 79,427,652 | 80,884,648 | +1,456,996 | 1.83% |
| GeneFamily | 28,050 | 28,137 | +87 | 0.31% |
| GrossAnatomicalStructure | 15,709 | 15,745 | +36 | 0.23% |
| MacromolecularComplex | 1,258 | 631 | -627 | -49.84% |
| MolecularActivity | 206,636 | 207,894 | +1,258 | 0.61% |
| MolecularMixture | 21,879,355 | 21,872,441 | -6,914 | -0.03% |
| OrganismTaxon | 3,543,867 | 3,558,912 | +15,045 | 0.42% |
| Pathway | 53,125 | 53,366 | +241 | 0.45% |
| PhenotypicFeature | 483,108 | 483,981 | +873 | 0.18% |
| Polypeptide | 166 | 164 | -2 | -1.20% |
| Protein | 275,514,857 | 221,451,194 | -54,063,663 | -19.62% |
| Publication | 79,773,973 | 80,814,784 | +1,040,811 | 1.30% |
| SmallMolecule | 221,734,011 | 221,709,221 | -24,790 | -0.01% |
| umls | 897,846 | 897,845 | -1 | -0.00% |

(Note that there is at least one error in the 2025sep1 report: it says there are 1,258 MacromolecularComplexes, but really there are 629.)
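
For anyone checking the table by hand, here is a minimal sketch of the arithmetic, assuming % Diff is the change relative to the 2025sep1 count (which is consistent with the rows above). The function name `diff_and_percent` is hypothetical and is not part of Babel.

```python
def diff_and_percent(before: int, after: int) -> tuple[int, float]:
    """Return (after - before, percent change relative to `before`).

    Hypothetical helper for illustration only; not taken from the Babel codebase.
    The percentage is rounded to two decimal places, as in the table.
    """
    diff = after - before
    if before == 0:
        # A percent change from zero is undefined; signal that with NaN.
        return diff, float("nan")
    return diff, round(diff / before * 100, 2)


# MacromolecularComplex as reported in the table: 1,258 -> 631.
reported = diff_and_percent(1258, 631)

# The same row using the corrected 2025sep1 count of 629 from the note.
corrected = diff_and_percent(629, 631)

print(reported, corrected)
```

With the corrected 2025sep1 count, the MacromolecularComplex row becomes a small increase rather than a drop of roughly half.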

…mical conflation (#626)

Back in PR #506, I implemented a fancy new DrugChemical conflation system that chose an overall type for the entire conflated clique with a preference order (e.g. SmallMolecule is preferred over ChemicalEntity), and then that type would be used to choose the identifier order for the conflated clique. However, while building Babel 1.14 (#606), I realized that this approach was occasionally failing badly: most prominently, CHEBI:5931 "insulin human" would get placed further down in the conflation list than lots of other identifiers, including UNII:AVT680JB39 "Insulin pork". Partially this is because other changes have caused more identifiers to be identified as ChemicalEntities, so they end up lower in the list than some other identifiers.

This PR replaces that approach with a much simpler algorithm: we now group identifiers by prefix rather than by Biolink type, and then order them with [the preferred prefix order for biolink:ChemicalEntity](https://biolink.github.io/biolink-model/ChemicalEntity/#valid-id-prefixes), with RXCUI forced to the end of the list. This pushes a lot more CHEBIs and UNIIs to the top of the conflations, which gives us better primary identifiers and labels for the conflation.

Additionally:

- We were occasionally including non-chemical identifiers (possibly because we're pulling in proteins or something?). This PR modifies our process so that we check whether an identifier is a chemical before we add it to the glomming.
- Removed the preferred prefix order from the config.yaml file, put it back into the DrugChemical conflated file, and commented it out. We might move it back in the future if we can make it work better/at all.
- Renamed `load_cliques()` to `load_cliques_containing_rxcui()` to make its purpose clearer.
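
The prefix-based ordering described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the code from #626: the function name `order_conflated_ids`, the prefix list `ILLUSTRATIVE_PREFIX_ORDER`, and every identifier other than CHEBI:5931 and UNII:AVT680JB39 are made up for the example. The real order comes from the Biolink Model's valid ID prefixes for biolink:ChemicalEntity.

```python
# ASSUMPTION: an illustrative prefix order only. The real list is the
# preferred prefix order for biolink:ChemicalEntity in the Biolink Model.
ILLUSTRATIVE_PREFIX_ORDER = ["CHEBI", "UNII", "PUBCHEM.COMPOUND", "DRUGBANK", "MESH"]


def order_conflated_ids(curies, prefix_order=ILLUSTRATIVE_PREFIX_ORDER, last_prefix="RXCUI"):
    """Order CURIEs by prefix preference, forcing `last_prefix` to the end.

    Hypothetical helper: identifiers with a listed prefix come first (in list
    order), then identifiers with unlisted prefixes, then `last_prefix`.
    """
    rank = {prefix: i for i, prefix in enumerate(prefix_order)}

    def sort_key(curie):
        prefix = curie.split(":", 1)[0]
        if prefix == last_prefix:
            return (2, 0)             # always sorted last
        if prefix in rank:
            return (0, rank[prefix])  # listed prefixes, in preference order
        return (1, 0)                 # unlisted prefixes, after listed ones

    # sorted() is stable, so identifiers sharing a prefix keep their input order.
    return sorted(curies, key=sort_key)


example = ["RXCUI:0000", "MESH:D000000", "UNII:AVT680JB39", "EXAMPLE:1", "CHEBI:5931"]
print(order_conflated_ids(example))
```

Because the sort key depends only on the prefix, the Biolink type assigned to each identifier no longer affects its position, which is the failure mode described for CHEBI:5931.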

gaurav mentioned this pull request Nov 5, 2025

Normalize Ubergraph hierarchy using DuckDB #509

Draft

2 tasks

gaurav mentioned this pull request Nov 14, 2025

Getting Babel to work on Hatteras #594

Open

This was referenced Dec 3, 2025

Bulk normalize #564

Draft

Babel 1.15 #620

Draft

gaurav mentioned this pull request Dec 14, 2025

Replaced type-based DrugChemical conflation with prefix-based DrugChemical conflation #626

Merged

gaurav and others added 12 commits December 13, 2025 23:14

Set fetch-depth=1 and turn off persist-credentials.

03f71c9

Added build block in config.yaml.

0635769

Also updated versions.

Added on:push trigger for testing.

a349136

Removed on:push trigger after testing.

1dfb877

Changed image tag to latest.

b71c00d

Updated ChEBI download URLs.

773b338

Added taxon_specific to conflated outputs.

cfc7c24

Replaced logging with logger in SynonymConflation.

4a307a7

Add a taxon_specific flag to leftover UMLS.

a37d65b

This should always be false, unless there is some standardized way in which this is stored.

Improved decompress when downloading with wget.

d7864e8

Updated references to TranslatorSRI to NCATSTranslator.

4d67f47

Updated some other references to TranslatorSRI to NCATSTranslator.

11b3331

gaurav force-pushed the babel-1.14 branch from 09c56b8 to 11b3331 December 14, 2025 04:14

gaurav changed the base branch from master to drugchemical-improvements December 14, 2025 04:16

gaurav marked this pull request as ready for review December 14, 2025 04:27

Base automatically changed from drugchemical-improvements to master December 15, 2025 17:57

Merge branch 'master' into babel-1.14

2e995ee

gaurav merged commit 3131827 into master Dec 15, 2025
2 of 4 checks passed

gaurav deleted the babel-1.14 branch December 15, 2025 19:03
