Skip to content

Improved onboard_project.py #783

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 4 commits into
base: master
Choose a base branch
from

Conversation

TaperChipmunk32
Copy link
Collaborator

@TaperChipmunk32 TaperChipmunk32 commented Aug 1, 2025

-Added collect_verse_counts to onboard_project
-Replaced command line args with a yaml config


This change is Reviewable

-Added collect_verse_counts to onboard_project
-Replaced command line args with a yaml config
@TaperChipmunk32 TaperChipmunk32 linked an issue Aug 1, 2025 that may be closed by this pull request
Copy link
Collaborator

@benjaminking benjaminking left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like what you've done with this. As the onboarding script gets more complex, I think it will be important to make sure we're logging messages that describe the actions that the script is taking, at a high level at least. For example, if the configuration file says to collect verse count, then having a message that says "Collecting verse counts" or something similar.

Reviewed 3 of 3 files at r1, all commit messages.
Reviewable status: all files reviewed, 3 unresolved discussions (waiting on @ddaspit)


silnlp/common/onboard_project.py line 92 at r1 (raw file):

        copy_paratext_project_folder(Path(args.copy_from), paratext_project_dir, overwrite=args.overwrite)

    args = parser.parse_args()

I don't think you should need to parse the arguments again.


silnlp/common/onboard_project.py line 98 at r1 (raw file):

    paratext_project_dir: Path = create_paratext_project_folder_if_not_exists(project_name)

    if args.copy_from:

Maybe we should add an error if the user doesn't include either of copy_from or config, since the script will say it's onboarding the project, but then only create a new folder at the most.


silnlp/common/onboard_project.py line 106 at r1 (raw file):

            raise FileNotFoundError(f"Configuration file '{config_file}' does not exist.")
        with config_file.open("r", encoding="utf-8") as file:
            config = yaml.safe_load(file)

Now that we have a config file, I think we should start a page on the wiki (Home · sillsdev/silnlp Wiki) that we keep up to date as we add new configuration options.

Copy link
Collaborator Author

@TaperChipmunk32 TaperChipmunk32 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've now added some messages at a high level, extract_corpora and collect_verse_counts will both print more messages as they execute.

Reviewable status: 2 of 3 files reviewed, 3 unresolved discussions (waiting on @benjaminking and @ddaspit)


silnlp/common/onboard_project.py line 92 at r1 (raw file):

Previously, benjaminking (Ben King) wrote…

I don't think you should need to parse the arguments again.

I don't know how I did not catch these repeated lines, I've now removed all the copied lines.


silnlp/common/onboard_project.py line 98 at r1 (raw file):

Previously, benjaminking (Ben King) wrote…

Maybe we should add an error if the user doesn't include either of copy_from or config, since the script will say it's onboarding the project, but then only create a new folder at the most.

I've now added some error messages if they are not included.


silnlp/common/onboard_project.py line 106 at r1 (raw file):

Previously, benjaminking (Ben King) wrote…

Now that we have a config file, I think we should start a page on the wiki (Home · sillsdev/silnlp Wiki) that we keep up to date as we add new configuration options.

I'll start working on that now.

-Made --extract-corpora and --collect-verse-counts args without requiring a config, using default values to run both tasks.
@TaperChipmunk32
Copy link
Collaborator Author

@benjaminking I created a Wiki page here. I also made some changes to do the following:

  1. Add --extract-corpora and --collect-verse-counts as args, rather than checking the config
  2. Make --config optional
  3. Add a warning when running --collect-verse-counts without --extract-corpora

I think changing the top level tasks to args makes more intuitive sense, rather than always requiring a config when the default settings are okay.

Copy link
Collaborator

@benjaminking benjaminking left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You'll probably need to have Damien take over reviewing this while I'm gone. We're pretty close here though.

Reviewed 1 of 1 files at r2, all commit messages.
Reviewable status: all files reviewed, 4 unresolved discussions (waiting on @ddaspit)


silnlp/common/onboard_project.py line 123 at r2 (raw file):

        config = {}

    if args.extract_corpora:

I expect that the EITL team will want to have a standard config for onboarding projects and will just re-use it for each new project they onboard. What do you think about keeping extract_corpora and select_verse_counts in the config, but allowing the command line options to override what's in the config?


silnlp/common/onboard_project.py line 139 at r2 (raw file):

    if args.collect_verse_counts:
        if not args.extract_corpora:

Let me know if what I'm thinking isn't feasible, but could we check whether the files that we need for collect_verse_counts are present already? The scenario that I'm imagining is that someone runs this script, but later realizes that they forgot to collect verse counts. Would it be possible for them to run it again with only collect_verse_counts enabled (and overwrite equal to false)?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Allow the onboarding script to create verse counts
2 participants