Refactor/cleanup admin_cleanup_datasets.py script #20819

Open

ccoulombe wants to merge 7 commits into galaxyproject:dev from ccoulombe:refactor/cleanup-datasets

Contributor

ccoulombe commented Aug 26, 2025 (edited)

Update the admin_cleanup_datasets.py script to work with SQLAlchemy 2.x, plus a few small improvements.

Changes:

Nicer time format [b6f5553]
Update to work with SA 2.x [3904721]
Refactored and updated the administrative_delete_datasets function to be compatible with SA 2.x and easier on the eye [860e2cd]
Refactored and updated the _get_tool_id_for_hda function to be compatible with SA 2.x and easier on the eye [651426a]
Add state of deletion to email subject [f0e63ff]
Add option to not send the email upon deletion [85d8e6a]
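
For readers unfamiliar with the SA 2.x change behind the "query results mapping" fix, here is a minimal sketch of how result rows are accessed in SQLAlchemy 1.4/2.x. It is not taken from the PR: the `dataset` table and its columns are a simplified, hypothetical stand-in, and the in-memory SQLite engine is only there to make the example runnable.

```python
from sqlalchemy import create_engine, text

# In-memory SQLite database with a hypothetical, simplified "dataset" table.
engine = create_engine("sqlite://")

with engine.begin() as conn:
    conn.execute(text("CREATE TABLE dataset (id INTEGER PRIMARY KEY, state TEXT)"))
    conn.execute(text("INSERT INTO dataset (id, state) VALUES (1, 'ok')"))

with engine.connect() as conn:
    row = conn.execute(text("SELECT id, state FROM dataset")).one()

# Rows behave like named tuples: positional and attribute access both work.
by_position = row[0]
by_attribute = row.state

# String keys go through the mapping view rather than the row itself.
by_key = row._mapping["state"]

# Unpacking works as it does for any tuple.
dataset_id, state = row
print(dataset_id, state)
```

Code that indexed rows by column name, dictionary style, is the kind of thing that typically needs adjusting when moving to 2.x; whether that is exactly what commit 3904721 changed is an assumption.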

How to test the changes?

(Select all options that apply)

I've included appropriate automated tests.
This is a refactoring of components with existing test coverage.
Instructions for manual testing are as follows:
1. On the Galaxy server, export GALAXY_CONFIG_FILE pointing to the Galaxy config file, activate the virtual environment, and change to the <root>/server/scripts/cleanup_datasets directory
2. Create dummy datasets, and create one with a specific tool
3. Run python admin_cleanup_datasets.py --days 0 --info_only --config $GALAXY_CONFIG_FILE --template admin_cleanup_warning_template.txt.sample
4. Run python admin_cleanup_datasets.py --days 0 --info_only --config $GALAXY_CONFIG_FILE --template admin_cleanup_warning_template.txt.sample --tool_id <id>
5. Run python admin_cleanup_datasets.py --days 0 --email_only --config $GALAXY_CONFIG_FILE --template admin_cleanup_warning_template.txt.sample --tool_id <id> and check for the email containing the datasets from the tool that generated them
6. Run python admin_cleanup_datasets.py --days 0 --config $GALAXY_CONFIG_FILE --template admin_cleanup_warning_template.txt.sample --tool_id <id> and check for deleted datasets
7. Run both the old and the new version of the script and compare their output

ccoulombe added 6 commits August 19, 2025 11:06

Format with 3 decimals and display seconds
b6f5553

Fixed query results mapping for compatibility with sqlalchemy 2.x

Refactored and updated administrative_delete_datasets function to be more readable but mainly compatible with sqlalchemy 2.x
860e2cd

Updated _get_tool_id_for_hda function to be more readable but mainly compatible with sqlalchemy 2.x
651426a

Added state to email subject. Differentiate between the information and the actual deletion email
f0e63ff

Added --no-send option, to allow not sending an email upon deletion
85d8e6a

github-actions bot added the area/scripts label

github-actions bot added this to the 25.1 milestone

ccoulombe changed the title ~~Refactor/cleanup datasets script~~ Refactor/cleanup admin_cleanup_datasets.py script


Removed unused imports

f14f34d

Member

jmchilton commented Aug 28, 2025

Can you run "make format" - this looks solid to me but the linter is unhappy about code formatting.

jdavcs self-requested a review

August 28, 2025 13:52

jdavcs reviewed

scripts/cleanup_datasets/admin_cleanup_datasets.py

    select(HDA.id)
    .join(Dataset, Dataset.id == HDA.dataset_id, isouter=True)
    .where(and_(
        Dataset.deleted.is_(False),

Member

jdavcs Aug 28, 2025

deleted.is_(False) is equivalent to deleted == false(). In the rest of the codebase we use the latter; we reserve the is_ construct for comparisons to null. Let's keep false() for consistency.

But replacing Foo.__table__.c.bar with Foo.bar is correct.

jdavcs reviewed

scripts/cleanup_datasets/admin_cleanup_datasets.py

    session = app.sa_session
    # Aliases for ORM-mapped classes
    HDA = aliased(app.model.HistoryDatasetAssociation)

Member

jdavcs Aug 28, 2025

We don't need aliased classes here. If you want to improve the readability of the code, you can import the classes from the model and then use just the class names:

from galaxy.model import HistoryDatasetAssociation
my_statement = select(HistoryDatasetAssociation).where(whatever...)

An extra benefit of this is that reading (and grepping) the code is easier: the string HistoryDatasetAssociation will always represent the same thing across the codebase.

jdavcs reviewed

scripts/cleanup_datasets/admin_cleanup_datasets.py

    def _get_tool_id_for_hda(app, hda_id):
        # TODO Some datasets don't seem to have an entry in jtod or a copied_from

Member

jdavcs Aug 28, 2025

Let's not delete this: this is a potentially useful comment which, maybe, hasn't been addressed yet. Yes, it's 12 years old, but it still could be helpful. (jtod and copied_from refer to database tables). Would be OK to delete if this particular item were addressed and deemed no longer relevant.

jdavcs reviewed

scripts/cleanup_datasets/admin_cleanup_datasets.py

    hda = session.get(app.model.HistoryDatasetAssociation, hda_id)
    if hda is None:
        return None

Member

jdavcs Aug 28, 2025

This changes the function's behavior: the previous version would raise an error here, which is correct.
Also, we don't need to check that an hda exists here.
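
As background for the behavior change, a minimal sketch of the two ways a missing row can surface in SQLAlchemy 1.4/2.x. This does not reproduce how the original function raised; the model is a hypothetical, simplified stand-in, and `scalar_one()` is used here only as one example of a lookup that fails loudly.

```python
from sqlalchemy import Column, Integer, String, create_engine, select
from sqlalchemy.exc import NoResultFound
from sqlalchemy.orm import Session, declarative_base

Base = declarative_base()


class HistoryDatasetAssociation(Base):
    # Hypothetical, heavily simplified stand-in for the Galaxy model.
    __tablename__ = "history_dataset_association"
    id = Column(Integer, primary_key=True)
    name = Column(String)


engine = create_engine("sqlite://")
Base.metadata.create_all(engine)

with Session(engine) as session:
    # Session.get() signals a missing row by returning None.
    missing = session.get(HistoryDatasetAssociation, 999)

    # scalar_one() signals a missing row by raising NoResultFound.
    stmt = select(HistoryDatasetAssociation).where(
        HistoryDatasetAssociation.id == 999
    )
    try:
        session.execute(stmt).scalar_one()
        raised = False
    except NoResultFound:
        raised = True

print(missing, raised)
```

Returning None early turns a failure into an ordinary return value, which is the kind of silent behavior change the comment above objects to.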

jdavcs reviewed

scripts/cleanup_datasets/admin_cleanup_datasets.py

    job_query = (
        select(Job.tool_id)
        .join(JTODA, JTODA.job_id == Job.id)

Member

jdavcs Aug 28, 2025

We don't need to specify the join criteria here: SQLAlchemy takes care of it for us.

jdavcs reviewed

scripts/cleanup_datasets/admin_cleanup_datasets.py

    )
    .select_from(sa.outerjoin(model.Dataset.__table__, model.HistoryDatasetAssociation.__table__))
    select(HDA.id)
    .join(Dataset, Dataset.id == HDA.dataset_id, isouter=True)

Member

jdavcs Aug 28, 2025

We don't need the join criteria.

Very nice refactoring here - thanks!

jdavcs reviewed

scripts/cleanup_datasets/admin_cleanup_datasets.py

            )
        )
    # Bind hda_id for current iteration
    rows = session.execute(

Member

jdavcs Aug 28, 2025

Same as above - this is very nice refactoring, thanks!

Contributor Author

ccoulombe commented Aug 29, 2025

@jmchilton Yes, will do.
@jdavcs Thanks for the comments, will tackle them.

... once I get back from vacation in a week!

ahmedhamidawan added the kind/refactoring label

ahmedhamidawan modified the milestones: 25.1, 26.0


Labels

area/scripts kind/refactoring