Automate preparation of additional project characteristics data during onboarding #727

@mmartin9684-sil

Description

There are some additional project characteristics / metrics that we find helpful to know for each new project. Automatically getting this info as part of the existing process would be helpful.

  • The number of lines in the verse extract file that are range lines.
  • The number of characters in the verse extract file that are unknown (<unk>) to the NLLB tokenizer.
  • Tokenization stats for the verse extract file.
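A minimal sketch of how the three metrics above could be computed in one pass over a verse extract file. It is illustrative only, not the existing 'preprocess' or 'experiment' implementation. Assumptions: range lines appear as the literal marker `<range>`, the tokenizer is any callable that maps a string to a list of token strings, and a character counts as unknown if tokenizing it on its own yields `<unk>`. The names `count_range_lines`, `count_unknown_chars`, `token_stats`, and `toy_tokenize` are hypothetical.

```python
from collections import Counter
from typing import Callable, Dict, Iterable, List

Tokenizer = Callable[[str], List[str]]

# Assumption: a range line in the verse extract file is exactly this marker.
RANGE_MARKER = "<range>"


def _content_lines(lines: Iterable[str]) -> List[str]:
    """Return stripped lines, dropping empty lines and range lines."""
    stripped = (line.strip() for line in lines)
    return [s for s in stripped if s and s != RANGE_MARKER]


def count_range_lines(lines: Iterable[str]) -> int:
    """Count lines that consist solely of the range marker."""
    return sum(1 for line in lines if line.strip() == RANGE_MARKER)


def count_unknown_chars(
    lines: Iterable[str], tokenize: Tokenizer, unk_token: str = "<unk>"
) -> Counter:
    """Count occurrences of characters the tokenizer cannot represent.

    Approximation: each distinct character is tokenized in isolation, and it
    is treated as unknown if the result contains the unknown token.
    """
    char_counts = Counter(
        ch for line in _content_lines(lines) for ch in line if not ch.isspace()
    )
    return Counter(
        {ch: n for ch, n in char_counts.items() if unk_token in tokenize(ch)}
    )


def token_stats(lines: Iterable[str], tokenize: Tokenizer) -> Dict[str, float]:
    """Summarize token counts over the non-empty, non-range lines."""
    content = _content_lines(lines)
    per_line = [len(tokenize(line)) for line in content]
    total_tokens = sum(per_line)
    total_words = sum(len(line.split()) for line in content)
    return {
        "lines": len(content),
        "tokens": total_tokens,
        "mean_tokens_per_line": total_tokens / len(content) if content else 0.0,
        "max_tokens_per_line": max(per_line, default=0),
        "tokens_per_word": total_tokens / total_words if total_words else 0.0,
    }


def toy_tokenize(text: str) -> List[str]:
    """Stand-in tokenizer for demonstration: non-ASCII words become <unk>."""
    return [w if w.isascii() else "<unk>" for w in text.split()]
```

To run this against the real model, `toy_tokenize` could be swapped for the `tokenize` method of a Hugging Face tokenizer, for example one loaded with `AutoTokenizer.from_pretrained("facebook/nllb-200-distilled-600M")`. The checkpoint name is an assumption about which NLLB variant a project uses.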

This information is already available from the current tools by running the 'preprocess' command or the 'experiment' command, but 'preprocess' requires an extra step and 'experiment' produces it too late to guide the onboarding work. A streamlined method would reduce the effort of handling onboarding requests.

  • Wildebeest reports for the verse extract file.
