Automate preparation of additional project characteristics data during onboarding

There are some additional project characteristics / metrics that we find helpful to know for each new project. Collecting this information automatically as part of the existing onboarding process would save a manual step.

- The number of lines in the verse extract file that are `range` lines.
- The number of characters in the verse extract file that are unknown (`<unk>`) to the NLLB tokenizer.
- Tokenization stats for the verse extract file.

This information is already available from the current tools by running the 'preprocess' command or the 'experiment' command, but it either requires an extra step (preprocess) or arrives too late (experiment) to guide the onboarding work. A streamlined method will reduce effort when working on onboarding requests.

- Wildebeest reports for the verse extract file.
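
As a rough illustration of the first three metrics, here is a minimal sketch of a single pass over a verse extract file. It is not the existing 'preprocess' implementation. The names `collect_stats`, `ExtractStats` and `toy_tokenize` are hypothetical, the `<range>` marker text is an assumption about how range lines appear in the file, and the sketch counts unknown tokens as a stand-in for unknown characters. The tokenizer is passed in as a plain callable, so the real NLLB tokenizer would need a small wrapper that returns token strings.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List

RANGE_MARKER = "<range>"  # assumption: literal text of a range line
UNK_TOKEN = "<unk>"       # unknown-token symbol mentioned in this issue


@dataclass
class ExtractStats:
    total_lines: int = 0
    range_lines: int = 0
    tokenized_lines: int = 0
    total_tokens: int = 0
    unk_tokens: int = 0

    @property
    def mean_tokens_per_line(self) -> float:
        # Average over lines that were actually tokenized.
        if self.tokenized_lines == 0:
            return 0.0
        return self.total_tokens / self.tokenized_lines


def collect_stats(
    lines: Iterable[str], tokenize: Callable[[str], List[str]]
) -> ExtractStats:
    """Count range lines, unknown tokens and token totals in one pass."""
    stats = ExtractStats()
    for raw in lines:
        line = raw.strip()
        stats.total_lines += 1
        if line == RANGE_MARKER:
            stats.range_lines += 1
            continue  # range lines carry no text to tokenize
        if not line:
            continue  # empty verse, nothing to tokenize
        tokens = tokenize(line)
        stats.tokenized_lines += 1
        stats.total_tokens += len(tokens)
        stats.unk_tokens += tokens.count(UNK_TOKEN)
    return stats


def toy_tokenize(text: str) -> List[str]:
    """Hypothetical stand-in tokenizer: whitespace split with a tiny vocabulary."""
    vocab = {"in", "the", "beginning", "word"}
    return [w if w.lower() in vocab else UNK_TOKEN for w in text.split()]


if __name__ == "__main__":
    sample = ["In the beginning", "<range>", "", "the word xyz"]
    print(collect_stats(sample, toy_tokenize))
```

Taking the tokenizer as a parameter keeps the counting logic testable without downloading a model; the same function could be called from the onboarding step once the verse extract file exists.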

Automate preparation of additional project characteristics data during onboarding #727