Move central run_etl into each example #38
Merged
When Curator was first designed, we centralized a `run_etl.py` script in the core Curator package, which did most of the orchestration of the ETL pipeline.

However, this required users to add the `examples` directory (or any other directory where their data processing code lived) to the `PYTHONPATH`, so that the `run_etl.py` script could import the code it needed to execute. This is not ideal and is generally error-prone, and we have received multiple pieces of feedback about it.

Additionally, other PhysicsNeMo repositories do NOT have such a central `run_*.py` in the core package; instead, they have similar scripts in each example directory (typically `train.py`, `infer.py`, etc.).

This PR therefore removes `run_etl.py` from the core package and moves it into the respective example directories.

Several other changes had to be made to the overall codebase as a result of this.