Babel 1.14 #606
Merged
Also updated versions.
This should always be false, unless there is some standardized way in which this is stored.
gaurav added a commit that referenced this pull request on Dec 15, 2025
…mical conflation (#626)

Back in PR #506, I implemented a fancy new DrugChemical conflation system that chose an overall type for the entire conflated clique using a preference order (e.g. SmallMolecule is preferred over ChemicalEntity); that type was then used to choose the identifier order for the conflated clique. However, while building Babel 1.14 (#606), I realized that this approach was occasionally failing badly: most prominently, CHEBI:5931 "insulin human" would get placed further down in the conflation list than lots of other identifiers, including UNII:AVT680JB39 "Insulin pork". This is partly because other changes have caused more identifiers to be typed as ChemicalEntities, so they end up lower in the list than some other identifiers.

This PR replaces that approach with a much simpler algorithm: we now group identifiers by prefix rather than by Biolink type, and then order them with [the preferred prefix order for biolink:ChemicalEntity](https://biolink.github.io/biolink-model/ChemicalEntity/#valid-id-prefixes), with RXCUI forced to the end of the list. This pushes a lot more CHEBIs and UNIIs to the top of the conflations, which gives us better primary identifiers and labels for the conflation.

Additionally:
- We were occasionally including non-chemical identifiers (possibly because we're pulling in proteins or something?). This PR modifies our process so that we check whether an identifier is a chemical before we add it to the glomming.
- Removes the preferred prefix order from the config.yaml file and puts it back, commented out, into the DrugChemical conflated file. We might move it back in the future if we can make it work better/at all.
- Renamed `load_cliques()` to `load_cliques_containing_rxcui()` to make its purpose clearer.
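The prefix-based ordering described in this commit message could be sketched roughly as follows. This is a minimal illustration, not Babel's actual code: the function names, the short `ILLUSTRATIVE_PREFIX_ORDER` list, the placeholder RXCUI and MESH identifiers, and the choice to put unlisted prefixes after listed ones but before RXCUI are all assumptions. The real order would come from the Biolink model's valid ID prefixes for biolink:ChemicalEntity.

```python
from typing import Iterable, List, Sequence

# Hypothetical, abbreviated prefix order used only for this illustration.
# It is NOT the actual Biolink order for biolink:ChemicalEntity.
ILLUSTRATIVE_PREFIX_ORDER = ["CHEBI", "UNII", "MESH"]


def prefix_of(curie: str) -> str:
    """Return the part of a CURIE before the first colon."""
    return curie.split(":", 1)[0]


def order_conflated_ids(
    curies: Iterable[str],
    prefix_order: Sequence[str] = ILLUSTRATIVE_PREFIX_ORDER,
    last_prefix: str = "RXCUI",
) -> List[str]:
    """Order identifiers by prefix preference, forcing `last_prefix` to the end.

    Assumption: prefixes missing from `prefix_order` are placed after the
    listed prefixes but before `last_prefix`.
    """
    rank = {prefix: index for index, prefix in enumerate(prefix_order)}

    def sort_key(curie: str):
        prefix = prefix_of(curie)
        if prefix == last_prefix:
            return (2, 0)  # always last
        if prefix in rank:
            return (0, rank[prefix])  # listed prefixes, in preference order
        return (1, 0)  # unlisted prefixes

    # sorted() is stable, so identifiers sharing a prefix keep their input order.
    return sorted(curies, key=sort_key)


# "RXCUI:0000" and "MESH:D000000" are placeholder identifiers.
example = ["RXCUI:0000", "UNII:AVT680JB39", "MESH:D000000", "CHEBI:5931"]
ordered = order_conflated_ids(example)
```

With this ordering, the first identifier in the result would serve as the primary identifier of the conflation, which is why moving CHEBI and UNII ahead of other prefixes changes the chosen label.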
Changes:
- In #604 (… `taxon_specific` flag to the synonyms file output), I added a `taxon_specific` flag to synonyms (closing #601, "Add a `taxon_specific` flag to the NameRes database"), but I forgot to add that to the UMLS leftover file or the conflated synonym files (DrugChemicalConflated.txt and GeneProteinConflated.txt). I've added them in this PR.
- Set `persist-credentials: false` when checking out code when building the Docker package. Closes #119 ("GitHub Docker image needs fixing before it can be used as a Github repository").
- Dropped `fetch-depth: 0` when checking out code when building the Docker package: we don't need to include the full Git history in the Docker image, and we can fetch the other changes as needed.

(Note that there is at least one error in the 2025sep1 report: it says there are 1,258 MacromolecularComplexes, but really there are 629.)