Changes from all commits (141 commits)
31aff98
added slurm folder to support running babel in a distributed fashion …
hyi Oct 11, 2025
f307765
Attempt to make a script to run Babel on Slurm.
gaurav Nov 14, 2025
71fa049
Added hashbang to sbatch script.
gaurav Nov 14, 2025
aa1d587
Increased memory on the leader node.
gaurav Nov 14, 2025
e5d3311
Tweaked some settings.
gaurav Nov 21, 2025
c7cf777
Tweaked some commands.
gaurav Nov 21, 2025
27d1b52
Added snakemake-executor-plugin-slurm.
gaurav Nov 21, 2025
df28656
Cleaned up Slurm Snakemake profile.
gaurav Nov 21, 2025
f904fd9
Cleaned up and documented the Slurm profile.
gaurav Nov 21, 2025
c42bb13
Added retries to OBO downloads.
gaurav Nov 21, 2025
eaea173
Added additional run-specific config file, which is initially empty.
gaurav Nov 21, 2025
1f1e693
That didn't work.
gaurav Nov 21, 2025
eca7a39
Made configfile a list.
gaurav Nov 21, 2025
560763a
Added more Slurm configuration options.
gaurav Nov 21, 2025
005a0d5
Added sleep between Ubergraph queries.
gaurav Nov 21, 2025
4d3082f
Removed old cluster config file that's no longer used.
gaurav Nov 22, 2025
710b4f6
Increased time for the Snakemake runner.
gaurav Nov 22, 2025
252425b
Upgraded packages.
gaurav Nov 22, 2025
0e7166e
Don't build anatomy, build whatever is on the command line.
gaurav Nov 22, 2025
720b18f
Removed anatomy, now build everything.
gaurav Nov 22, 2025
8b9156e
Removed unnecessary import.
gaurav Nov 22, 2025
924631e
Reduced requirements for each node.
gaurav Nov 22, 2025
341fd42
Added Hatteras partition information.
gaurav Nov 22, 2025
47209d5
Added a runtime of 4h for get_ensembl.
gaurav Nov 22, 2025
6b1cc81
Modified EFO to explicate OWL file dependency.
gaurav Nov 22, 2025
6421364
Added retries to get_protein_uniprotkb_ensembl_relationships.
gaurav Nov 22, 2025
ddca518
Added memory in GB to give a sense of what's available.
gaurav Nov 22, 2025
a188afb
Don't delete log files.
gaurav Nov 22, 2025
ec660d9
Reduced BIOMART_MAX_ATTRIBUTE_COUNT to 7.
gaurav Nov 22, 2025
2aaf73b
Increased runtime for generate_pubmed_concords.
gaurav Nov 22, 2025
cbc4e8c
Increased memory required by chembl_labels_and_smiles.
gaurav Nov 22, 2025
f49c41a
Reduced BIOMART_MAX_ATTRIBUTE_COUNT to 6.
gaurav Nov 22, 2025
498a724
Fixed bug in EFO OWL file reading.
gaurav Nov 22, 2025
f8db97f
Not sure why we're opening this file in binary mode.
gaurav Nov 22, 2025
18e063a
Increased runtime for generate_pubmed_concords.
gaurav Nov 23, 2025
9de118a
Increased timeout for generate_pubmed_concords to 12h.
gaurav Nov 23, 2025
3ff3b16
Increased memory for chemical_unichem_concordia.
gaurav Nov 23, 2025
7431f95
Added /usr/bin/time to Python code to monitor memory usage.
gaurav Nov 23, 2025
90688a0
Increased memory for untyped_chemical_compendia.
gaurav Nov 24, 2025
54955fb
Increased get_ensembl runtime to 6h.
gaurav Nov 24, 2025
271f59d
Increased untyped_chemical_compendia to 256G.
gaurav Nov 24, 2025
34713bb
Increased timeout on the initial node to 24h.
gaurav Nov 24, 2025
0c4d0ea
Increased memory for some chemical and protein tasks.
gaurav Nov 24, 2025
fcb5bdb
Increased timeout for generate_pubmed_concords to 24h.
gaurav Nov 25, 2025
3a9798f
Increased memory for generate_pubmed_concords.
gaurav Nov 25, 2025
e313d75
Added runtime and mem to gene_compendia and protein_compendia.
gaurav Nov 25, 2025
6a8c237
Increased gene_compendia memory and protein_compendia timeout.
gaurav Nov 27, 2025
02e6efe
Turn on --keep-going when building on Slurm.
gaurav Nov 27, 2025
569c09e
Increased memory for generate_pubmed_compendia (128G).
gaurav Nov 28, 2025
4b74709
Added runtime=6h to chemical_compendia.
gaurav Nov 28, 2025
7e5d135
Increased memory and runtime for export_compendia_to_duckdb.
gaurav Nov 28, 2025
930ba30
Increased CPUs per task and timeout for rule protein.
gaurav Nov 28, 2025
5e940dd
Increased runtime for drugchemical_conflated_synonyms.
gaurav Nov 28, 2025
31a5434
It's "runtime", not "timeout".
gaurav Nov 29, 2025
d62b4c6
Increased memory for export_compendia_to_duckdb.
gaurav Nov 29, 2025
940b8c8
Increased memory and runtime for geneprotein_conflated_synonyms.
gaurav Nov 30, 2025
3f40151
Increased memory for export_compendia_to_duckdb to 750G.
gaurav Nov 30, 2025
bc048f8
Increased runtime for generate_sapbert_training_data.
gaurav Dec 1, 2025
ac16b30
Added some settings to DuckDB initialization.
gaurav Dec 2, 2025
b123832
Attempt to improve compendium load in DuckDB.
gaurav Dec 3, 2025
b768e1f
Tried to reduce requirements.
gaurav Dec 3, 2025
293eaf4
Fixed cpus_per_task.
gaurav Dec 3, 2025
564ba9e
Tried to reduce the memory limit.
gaurav Dec 3, 2025
29ccc9c
Let's try increasing the memory limit again.
gaurav Dec 3, 2025
704395b
Reduce cpus_per_task.
gaurav Dec 3, 2025
5a6bb70
Increased memory further.
gaurav Dec 3, 2025
b403821
What if we reduce the memory limit?
gaurav Dec 3, 2025
0af5e40
Tried some new things, laid out a possible solution.
gaurav Dec 3, 2025
07dec38
Removed duckdb_config, fixed (?) Node generation.
gaurav Dec 5, 2025
c6f483b
Fully removed duckdb_config.
gaurav Dec 5, 2025
2710d37
Attempt to fix Node.parquet generation.
gaurav Dec 5, 2025
6e74a5d
Added an extraneous print to hopefully trigger a rebuild.
gaurav Dec 5, 2025
1684171
Attempted new syntax for Node generation.
gaurav Dec 5, 2025
672b624
Next attempt at fixing nodes.
gaurav Dec 5, 2025
8c471b8
Next attempt at getting the JSON export working correctly.
gaurav Dec 5, 2025
8da0389
Next attempt.
gaurav Dec 5, 2025
6271568
Moved Node generation to the top so it will fail first.
gaurav Dec 5, 2025
672cfb4
Added Node.parquet as an explicit output.
gaurav Dec 5, 2025
c91f64e
Tweaked rule to force a rebuild.
gaurav Dec 5, 2025
38d92e4
Okay, maybe this will work.
gaurav Dec 5, 2025
7844dbc
Maybe this works.
gaurav Dec 5, 2025
a4f55a4
First stab at a compendium splitter.
gaurav Dec 5, 2025
8853d48
Small improvements.
gaurav Dec 5, 2025
806dbd6
Fail if the row counts don't match.
gaurav Dec 5, 2025
cfc73c2
Improved outputs.
gaurav Dec 5, 2025
10564ea
Wait, we actually wouldn't expect these to line up.
gaurav Dec 5, 2025
49dbeac
Reorganized/cleaned up code a bit.
gaurav Dec 5, 2025
7821ff2
Add a TODO that will come in handy later.
gaurav Dec 5, 2025
6bc0726
Increased resources for DuckDB reports.
gaurav Dec 5, 2025
2a4f860
Oops, missed one.
gaurav Dec 5, 2025
8fb24ea
Further increased memory to generate_prefix_report.
gaurav Dec 5, 2025
90dd9b5
Increased memory for a bunch of DuckDB reports.
gaurav Dec 5, 2025
95725da
Maybe we're going to 1024G everywhere.
gaurav Dec 5, 2025
dd3d46d
Increased memory even further.
gaurav Dec 5, 2025
c0db3a1
Cleaned up code for CURIE prefix summary.
gaurav Dec 8, 2025
0572709
Improved check_for_identically_labeled_cliques and reduced memory.
gaurav Dec 8, 2025
cb2c8f1
Tried to improve check_for_duplicate_curies.
gaurav Dec 8, 2025
db88887
Tried to improve check_for_duplicate_clique_leaders.
gaurav Dec 8, 2025
93b9012
Gave generate_prefix_report 512G of memory.
gaurav Dec 8, 2025
d7d316a
Fixed SQL error.
gaurav Dec 8, 2025
dba7b41
Simplified check_for_duplicate_clique_leaders.
gaurav Dec 8, 2025
a3189bb
Increased all memory to its max.
gaurav Dec 8, 2025
7813dff
Turn off preserve_insertion_order=false for all reports.
gaurav Dec 8, 2025
9f424f2
Separated db.sql() from writing out the result.
gaurav Dec 8, 2025
a2e22c5
Added some memory tracking.
gaurav Dec 8, 2025
19cf893
Tried to improve logging.
gaurav Dec 8, 2025
3fb266d
More debugging.
gaurav Dec 8, 2025
2081e82
Incorporated additional settings.
gaurav Dec 8, 2025
862e711
Moved progress bar to overall DuckDB settings.
gaurav Dec 8, 2025
9b10bc6
Improved DuckDB configuration documentation.
gaurav Dec 8, 2025
682fb09
Improved DuckDB settings output.
gaurav Dec 8, 2025
801fde9
Moved progress bar as a local setting.
gaurav Dec 8, 2025
45ecc7c
Gah.
gaurav Dec 8, 2025
3abeccd
Increased memory for check_for_duplicate_curies.
gaurav Dec 8, 2025
e859e8a
Partially improved generate_prefix_report.
gaurav Dec 8, 2025
997210f
Removed orders from generate_prefix_report.
gaurav Dec 8, 2025
6c49983
Increased memory availability.
gaurav Dec 8, 2025
cb0c3c3
Split prefix report into by-clique and by-prefix reports.
gaurav Dec 8, 2025
4e809f5
Fixed Snakemake rules.
gaurav Dec 8, 2025
40eb14a
Cleaned up some `all` rules.
gaurav Dec 8, 2025
0903059
We don't want a list of all cliques labeled ''.
gaurav Dec 9, 2025
881cfbd
Look out for double-quotes too.
gaurav Dec 9, 2025
20e0803
Let's compress identically_labeled_cliques.tsv.
gaurav Dec 9, 2025
8b0f443
Fixed filename in dependencies.
gaurav Dec 9, 2025
82b4998
Improved script.
gaurav Dec 9, 2025
e7f88db
Simplified by-clique report.
gaurav Dec 9, 2025
7cbea97
Added retries to some Ubergraph queries.
gaurav Dec 9, 2025
cdb4b2e
Improved by-clique report.
gaurav Dec 9, 2025
0e332e3
Fixed script.
gaurav Dec 9, 2025
2920f65
More fixes maybe.
gaurav Dec 9, 2025
0b859e4
Fixed SQL error.
gaurav Dec 9, 2025
84bf4fa
For by_clique report, only look at non-conflated edges.
gaurav Dec 9, 2025
c2adc9b
For CURIE report, use only non-conflated edges.
gaurav Dec 9, 2025
857e3b6
Improved command line.
gaurav Dec 9, 2025
ddd01b3
Updated pull_via_urllib() to be able to verify a gzipped file.
gaurav Dec 10, 2025
8215381
Add verify_gzip to UniChem downloads because they are being wobbly.
gaurav Dec 10, 2025
203ddd0
Wow.
gaurav Dec 11, 2025
b3c85a1
Fail verification if the downloaded file is too small.
gaurav Dec 11, 2025
345e194
Replaced UniChem download with wget download.
gaurav Dec 11, 2025
393209e
Reverted to urllib, added retries.
gaurav Dec 11, 2025
4f1e93c
Fixed references to datetime.
gaurav Dec 14, 2025
17 changes: 11 additions & 6 deletions Snakefile
@@ -1,6 +1,5 @@
configfile: "config.yaml"


include: "src/snakefiles/datacollect.snakefile"
include: "src/snakefiles/anatomy.snakefile"
include: "src/snakefiles/cell_line.snakefile"
@@ -20,13 +19,15 @@ include: "src/snakefiles/duckdb.snakefile"
include: "src/snakefiles/reports.snakefile"
include: "src/snakefiles/exports.snakefile"

# Some general imports.
import shutil
from src.snakefiles.util import write_done

# Some global settings.
import os

os.environ["TMPDIR"] = config["tmp_directory"]


# Top-level rules.
rule all:
input:
@@ -41,10 +42,14 @@ rule all:
# Build all the exports.
config["output_directory"] + "/kgx/done",
config["output_directory"] + "/sapbert-training-data/done",
# Store the config.yaml file used to produce the output.
config_file = "config.yaml",
output:
x=config["output_directory"] + "/reports/all_done",
shell:
"echo 'done' >> {output.x}"
output_config_file=config["output_directory"] + "/config.yaml",
run:
shutil.copyfile(input.config_file, output.output_config_file)
write_done(output.x)


rule all_outputs:
@@ -65,8 +70,8 @@ rule all_outputs:
config["output_directory"] + "/reports/publications_done",
output:
x=config["output_directory"] + "/reports/outputs_done",
shell:
"echo 'done' >> {output.x}"
run:
write_done(output.x)


rule clean_compendia:
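The `rule all` and `rule all_outputs` changes above replace the old shell command `echo 'done' >> {output.x}` with a call to write_done(), imported from src/snakefiles/util. That helper is not part of this diff; the following is only a rough, hypothetical sketch of the behaviour it presumably provides, based on the shell command it replaces.

# Hypothetical sketch only: the real write_done lives in src/snakefiles/util and is not shown in this diff.
def write_done(path: str) -> None:
    # Append a "done" marker line, mirroring the old `>>` shell redirection.
    with open(path, "a") as marker_file:
        marker_file.write("done\n")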
9 changes: 7 additions & 2 deletions config.yaml
@@ -16,6 +16,13 @@ intermediate_directory: babel_outputs/intermediate
output_directory: babel_outputs
tmp_directory: babel_downloads/tmp

#
# SHARED
#

# DuckDB settings for use in all DuckDB connections:
duckdb_config: {}

#
# UMLS
#
@@ -413,7 +420,5 @@ ensembl_datasets_to_skip:
- otshawytscha_gene_ensembl
- aocellaris_gene_ensembl

duckdb_config: {}

demote_labels_longer_than: 25
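The new duckdb_config block moved above is an empty mapping by default. How Babel feeds it into its DuckDB connections is not shown in this part of the diff; the sketch below only illustrates how such a mapping can be handed to the duckdb Python API. The database filename and example settings are hypothetical.

import duckdb
import yaml

# Load the shared DuckDB settings from config.yaml (an empty dict by default).
with open("config.yaml") as config_file:
    duckdb_config = yaml.safe_load(config_file).get("duckdb_config") or {}

# duckdb_config might hold settings such as {"memory_limit": "64GB", "preserve_insertion_order": False};
# duckdb.connect() accepts them via its config argument. "babel.duckdb" is a made-up filename.
connection = duckdb.connect(database="babel.duckdb", config=duckdb_config)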

3 changes: 2 additions & 1 deletion pyproject.toml
@@ -4,7 +4,7 @@ version = "1.14"
description = "Babel creates cliques of equivalent identifiers across many biomedical vocabularies. "
readme = "README.md"
license = "MIT"
requires-python = ">=3.11"
requires-python = ">=3.11,<3.14"
dependencies = [
"apybiomart",
"beautifulsoup4>=4.14.2",
@@ -28,6 +28,7 @@ dependencies = [
"pyyaml>=6.0.3",
"requests>=2.32.5",
"snakemake>=9.13.3",
"snakemake-executor-plugin-slurm>=1.9.2",
"sparqlwrapper>=2.0.0",
"wheel>=0.45.1",
"xmltodict>=1.0.2",
37 changes: 37 additions & 0 deletions slurm/config.yaml
@@ -0,0 +1,37 @@
# This is a Snakemake profile (https://snakemake.readthedocs.io/en/stable/executing/cli.html#profiles) that provides
# configuration options that should be applied when Snakemake is run on the RENCI Hatteras cluster using SLURM.
#
# To use this profile, run:
# $ snakemake --profile slurm
#
executor: slurm
jobs: 50 # maximum number of parallel cluster jobs
latency-wait: 60 # seconds
slurm-delete-logfiles-older-than: 0 # Don't delete log files automatically.
rerun-incomplete: true # Re-run any jobs that failed with an incomplete status previously.
keep-going: true # Keep going with independent jobs if a job fails.

# Wrap Python execution with `time -v` to report on memory usage for each rule.
python:
executable: "/usr/bin/time -v python"
log_stderr: true

# Set up Hatteras partitions as per https://renci.atlassian.net/wiki/spaces/RENCI/pages/254443570/Cluster+Info#Hardware-Information
partitions:
batch:
max_mem_mb: 191000 # 191 GB
largemem:
max_mem_mb: 1530329 # 1.5 TB

# Default resource settings for all rules
default-resources:
mem: 64G
runtime: 120 # minutes
cpus_per_task: 4

# Set up the Slurm efficiency report.
slurm-efficiency-report: True
slurm-efficiency-report-path: babel_outputs/reports/slurm/slurm_efficiency_report.csv

# Write Slurm logs into a `logs/` directory so we can look at them later.
slurm-logdir: babel_outputs/logs
24 changes: 24 additions & 0 deletions slurm/job/run_babel_mutiple_nodes.job
@@ -0,0 +1,24 @@
#!/bin/bash -l
#SBATCH --job-name=babel-test-cluster
#SBATCH --output=babel-test-cluster.out
#SBATCH --time=1:00:00
#SBATCH --mem=2G
#SBATCH -n 1

source ~/.bashrc
conda activate babel

# Go to Babel project directory
cd /projects/babel/babel-ht-test/Babel

export UMLS_API_KEY="YOUR UMLS API KEY"
export PYTHONPATH=.

# Build anatomy-related compendia in a distributed fashion, as defined in the slurm/config.yaml profile.
#
# Note that because Snakemake supports the Slurm executor plugin natively, submitting this as a SLURM
# batch job is not normally recommended: it creates an outer SLURM job running Snakemake, which then
# submits inner SLURM jobs for the workflow rules as specified in the profile. The recommended approach
# is to run Snakemake directly on the login or head node. However, a long-running process on the
# login/head node is undesirable, so a good compromise is to keep the sbatch wrapper but request
# minimal resources for the outer job, as this job script does.
snakemake --profile slurm anatomy
18 changes: 18 additions & 0 deletions slurm/job/run_babel_one_node.job
@@ -0,0 +1,18 @@
#!/bin/bash -l
#SBATCH --job-name=babel-test-local
#SBATCH --output=babel-test-local.out
#SBATCH --time=5:00:00
#SBATCH --mem=256G
#SBATCH -n 1

source ~/.bashrc
conda activate babel

# Go to Babel project directory
cd /projects/babel/babel-ht-test/Babel_standalone

export UMLS_API_KEY="YOUR UMLS API KEY"
export PYTHONPATH=.

# Build anatomy-related compendia locally using 4 cores
snakemake --cores 4 anatomy --rerun-incomplete --latency-wait 60
30 changes: 30 additions & 0 deletions slurm/run-babel-on-slurm.sh
@@ -0,0 +1,30 @@
#!/bin/bash

sbatch <<EOF
#!/bin/bash
#SBATCH --job-name=babel-${BABEL_VERSION:-current}
#SBATCH --output=babel_outputs/logs/sbatch-${BABEL_VERSION:-babel-current}.out
#SBATCH --error=babel_outputs/logs/sbatch-${BABEL_VERSION:-babel-current}.err
#SBATCH --time=${BABEL_TIMEOUT:-24:00:00}
#SBATCH --mem=16G
#SBATCH --nodes=1
#SBATCH --chdir=$PWD

# Notes:
# --chdir: Change the directory to whatever directory the sbatch job was
# started from. So you should run: BABEL_VERSION=babel-1.14 bash slurm/run-babel-on-slurm.sh

source ~/.bashrc

# Run Babel in a distributed fashion, as defined in the slurm/config.yaml profile.
#
# Note that because Snakemake supports the Slurm executor plugin natively, submitting this as a SLURM
# batch job is not normally recommended: it creates an outer SLURM job running Snakemake, which then
# submits inner SLURM jobs for the workflow rules as specified in the profile. The recommended approach
# is to run Snakemake directly on the login or head node. However, a long-running process on the
# login/head node is undesirable, so a good compromise is to keep the sbatch wrapper but request
# minimal resources for the outer job, as this script does.

uv run snakemake --profile slurm $@

EOF
127 changes: 89 additions & 38 deletions src/babel_utils.py
@@ -4,8 +4,10 @@
from ftplib import FTP
from io import BytesIO
import gzip
from datetime import timedelta
from datetime import timedelta, datetime
import time
from pathlib import Path

import requests
import os
import urllib
@@ -23,6 +25,7 @@

# Configuration items
WRITE_COMPENDIUM_LOG_EVERY_X_CLIQUES = 1_000_000
MAX_DOWNLOAD_ERROR = 10

# Set up a logger.
logger = get_logger(__name__)
@@ -141,15 +144,15 @@
self.delta = timedelta(milliseconds=delta_ms)

def get(self, url):
now = dt.now()
now = datetime.now()
throttled = False
if self.last_time is not None:
cdelta = now - self.last_time
if cdelta < self.delta:
waittime = self.delta - cdelta
time.sleep(waittime.microseconds / 1e6)
throttled = True
self.last_time = dt.now()
self.last_time = datetime.now()
response = requests.get(url)
return response, throttled

@@ -166,16 +169,32 @@
ntries += 1


def pull_via_urllib(url: str, in_file_name: str, decompress=True, subpath=None):
def pull_via_urllib(url: str, in_file_name: str, decompress=True, subpath=None, verify_gzip=False):
"""
Retrieve files via urllib, optionally decompresses it, and writes it locally into downloads
url: str - the url with the correct version attached
in_file_name: str - the name of the target file to work
returns: str - the output file name
Download a file via the given URL, optionally decompress it, and save it
to the specified local path. Handles HTTP redirects gracefully.

:param url: The base URL of the remote server (e.g., "http://example.com/").
It is combined with the provided filename to determine the full file path.
:type url: str
:param in_file_name: The name of the file to download, specified as the filename
on the remote server.
:type in_file_name: str
:param decompress: Whether to decompress the downloaded file if it is gzipped.
Defaults to True.
:type decompress: bool, optional
:param subpath: An optional subpath under the main download directory to save the file.
If None, the file is saved directly in the download directory.
:type subpath: str, optional
:param verify_gzip: If downloading a Gzip file that isn't being decompressed, verify that the
file is valid (by reading it). Has no effect if decompress=True.
:type verify_gzip: bool, optional
:return: The path to the downloaded (and optionally decompressed) file.
:rtype: str
"""
# Everything goes in downloads
download_dir = get_config()["download_directory"]
working_dir = download_dir

[GitHub Actions check failure (Ruff F841) at src/babel_utils.py:197:5: Local variable `working_dir` is assigned to but never used]

# get the (local) download file name, derived from the input file name
if subpath is None:
@@ -187,38 +206,70 @@
opener = urllib.request.build_opener(urllib.request.HTTPRedirectHandler())

# get a handle to the ftp file
print(url + in_file_name)
handle = opener.open(url + in_file_name)
download_url = url + in_file_name
logger.info(f"Downloading {download_url}")
handle = opener.open(download_url)

# create the compressed file
with open(dl_file_name, "wb") as compressed_file:
# while there is data
while True:
# read a block of data
data = handle.read(1024)

# if nothing was read, we're done
if len(data) == 0:
break

# write out the data to the output file
compressed_file.write(data)

if decompress:
out_file_name = dl_file_name[:-3]

# create the output text file
with open(out_file_name, "w") as output_file:
# open the compressed file
with gzip.open(dl_file_name, "rt") as compressed_file:
for line in compressed_file:
# write the data to the output file
output_file.write(line)

# remove the compressed file
os.remove(dl_file_name)
else:
out_file_name = dl_file_name
download_verified = False
download_attempt = 0
while not download_verified:
Path(dl_file_name).unlink(missing_ok=True)
download_attempt += 1
if download_attempt > MAX_DOWNLOAD_ERROR:
raise RuntimeError(f"Could not download and verify {download_url}: more than {MAX_DOWNLOAD_ERROR} attempts.")
logger.info(f"Downloading {dl_file_name} using urllib, attempt {download_attempt}...")

with open(dl_file_name, "wb") as compressed_file:
# while there is data
while True:
# read a block of data
data = handle.read(1024)

# if nothing was read, we're done
if len(data) == 0:
break

# write out the data to the output file
compressed_file.write(data)

if decompress:
out_file_name = dl_file_name[:-3]

# create the output text file
with open(out_file_name, "w") as output_file:
# open the compressed file
with gzip.open(dl_file_name, "rt") as compressed_file:
for line in compressed_file:
# write the data to the output file
output_file.write(line)

# remove the compressed file
os.remove(dl_file_name)

download_verified = True
else:
out_file_name = dl_file_name

# Do we need to verify this gzip file?
download_verified = True
if verify_gzip:
# Is it blank/very small? If so, we immediately fail verification.
file_size = os.path.getsize(out_file_name)
if file_size < 1024:
logger.warning(f"Downloaded Gzip file {out_file_name} is too small ({file_size} bytes), skipping verification.")
download_verified = False
continue

# To verify a Gzip file, we need to read it entirely.
try:
with gzip.open(out_file_name, "rb") as f:
for _ in iter(lambda: f.read(1024 * 1024), b""):
pass
download_verified = True
except Exception as e:
logger.warning(f"Error while verifying downloaded Gzip file {out_file_name}: {e}")
download_verified = False

# return the filename to the caller
return out_file_name
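As a usage note (not part of the diff): with the new signature above, a caller that wants to keep a gzipped download on disk but still confirm it is readable might invoke the function as sketched below. The URL, filename, and subpath are placeholders, not values taken from this pull request.

from src.babel_utils import pull_via_urllib

# Placeholder URL, filename, and subpath for illustration only.
downloaded_path = pull_via_urllib(
    "https://example.org/downloads/",   # base URL (hypothetical)
    "some_table.tsv.gz",                # remote filename (hypothetical)
    decompress=False,                   # keep the .gz file as-is
    subpath="EXAMPLE",                  # save under <download_directory>/EXAMPLE
    verify_gzip=True,                   # re-read the gzip; a failure triggers a re-download
)
print(downloaded_path)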
@@ -538,11 +589,11 @@
possible_labels = map(lambda identifier: identifier.get("label", ""), node["identifiers"])

# Step 2. Filter out any suspicious labels.
filtered_possible_labels = [l for l in possible_labels if l] # Ignore blank or empty names.

[GitHub Actions check failure (Ruff E741) at src/babel_utils.py:592:51: Ambiguous variable name: `l`]

# Step 3. Filter out labels longer than config['demote_labels_longer_than'], but only if there is at
# least one label shorter than this limit.
labels_shorter_than_limit = [l for l in filtered_possible_labels if l and len(l) <= config["demote_labels_longer_than"]]

[GitHub Actions check failure (Ruff E741) at src/babel_utils.py:596:52: Ambiguous variable name: `l`]
if labels_shorter_than_limit:
filtered_possible_labels = labels_shorter_than_limit

@@ -731,7 +782,7 @@
shit_prefixes = set(["KEGG", "PUBCHEM"])
test_id = "xUBERON:0002262"
debugit = False
excised = set()

[GitHub Actions check failure (Ruff F841) at src/babel_utils.py:785:5: Local variable `excised` is assigned to but never used]
for xgroup in newgroups:
if isinstance(xgroup, frozenset):
group = set(xgroup)
@@ -751,7 +802,7 @@
existing_sets_w_x = [(conc_set[x], x) for x in group if x in conc_set]
# All of these sets are now going to be combined through the equivalence of our new set.
existing_sets = [es[0] for es in existing_sets_w_x]
x = [es[1] for es in existing_sets_w_x]

[GitHub Actions check failure (Ruff F841) at src/babel_utils.py:805:9: Local variable `x` is assigned to but never used]
newset = set().union(*existing_sets)
if debugit:
print("merges:", existing_sets)
@@ -779,7 +830,7 @@
for up in unique_prefixes:
if test_id in group:
print("up?", up)
idents = [e if type(e) == str else e.identifier for e in newset]

[GitHub Actions check failure (Ruff E721) at src/babel_utils.py:833:28: Use `is` and `is not` for type comparisons, or `isinstance()` for isinstance checks]
if len(set([e for e in idents if (e.split(":")[0] == up)])) > 1:
bad += 1
setok = False
@@ -789,14 +840,14 @@
wrote.add(fs)
for gel in group:
if Text.get_prefix_or_none(gel) == pref:
killer = gel

[GitHub Actions check failure (Ruff F841) at src/babel_utils.py:843:25: Local variable `killer` is assigned to but never used]
# for preset in wrote:
# print(f'{killer}\t{set(group).intersection(preset)}\t{preset}\n')
# print('------------')
NPC = sum(1 for s in newset if s.startswith("PUBCHEM.COMPOUND:"))
if ("PUBCHEM.COMPOUND:3100" in newset) and (NPC > 3):
if debugit:
l = sorted(list(newset))

[GitHub Actions check failure (Ruff E741) at src/babel_utils.py:850:17: Ambiguous variable name: `l`]
print("bad")
for li in l:
print(li)
7 changes: 0 additions & 7 deletions src/cluster_config.yml

This file was deleted.
