Skip to content

Conversation

@saikrishnanc-nv
Copy link
Collaborator

When Curator was first designed, we centralized a run_etl.py script in the core Curator package, which did most of the orchestration of the ETL pipeline.
However, this required users to add the examples directory (or any other directory where their data processing code lived) to the PYTHONPATH, so that the run_etl.py script could import the code necessary to execute it.
This is not ideal, and is generally error prone (and we have received multiple pieces of feedback around this).

Additionally, other PhysicsNeMo repositories do NOT have such a central run_*.py in the core package, and instead, they have similar scripts in each example directory (typically, train.py, infer.py, etc.).

This PR therefore removes the run_etl.py from the core package, and moves them into the respective examples directories.
Several other changes had to be made to the overall codebase as a result of this.

@saikrishnanc-nv saikrishnanc-nv marked this pull request as draft November 19, 2025 00:02
Alexey-Kamenev
Alexey-Kamenev previously approved these changes Nov 19, 2025
Copy link
Collaborator

@Alexey-Kamenev Alexey-Kamenev left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!


pytest:
pip install -e ".[dev]" && \
pytest
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can also do pytest ./tests/?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Aha - This is a good point to discuss.
Previously that's what was in effect happening, and that worked fine, because the imports were very explicit (full paths relative to the root of the repo).
Now however, that doesn't work because, we don't add examples to the PYTHONPATH anymore.
And therefore, the paths are being modified using sys.path.
Multiple examples have files with the same name (data_sources.py for example). And even if I remove the sys.path modifications at what I think are the right locations - This somehow creates some path pollution.
So my options were:

  1. Ensure file names are unique (hard to guarantee given that users might only care about one example)
  2. Separate the tests out and run them one module at a time

I opted for the 2nd option.
However, if you know of a way to overcome this, I'd be very grateful 😄

@saikrishnanc-nv saikrishnanc-nv self-assigned this Nov 21, 2025
@saikrishnanc-nv saikrishnanc-nv marked this pull request as ready for review November 21, 2025 19:53
@saikrishnanc-nv
Copy link
Collaborator Author

/blossom-ci

@saikrishnanc-nv saikrishnanc-nv merged commit 1cde7d8 into NVIDIA:main Nov 21, 2025
1 check passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants