
Conversation

@kinanmartin

Adds optional parameters to specify target split sizes by number of hours in the ReazonSpeech prepare recipe.

The user can now optionally specify target hours on the command line, like so:

lhotse prepare reazonspeech \
  -j $nj \
  --train-hours 100 \
  --dev-hours 2 \
  --test-hours 2 \
  $dl_dir/ReazonSpeech data/manifests

@pzelasko
Collaborator

What’s the rationale for this? Wouldn’t it cause different recipes to have different test and dev sets?

@kinanmartin
Author

kinanmartin commented Jul 25, 2025

@pzelasko Thanks for the comment!

The rationale is to let our icefall model training recipe create splits of whatever sizes are desired for each portion of the dataset.

We are currently working on a bilingual (English and Japanese) icefall recipe that relies on the data prepared via lhotse in the icefall prepare script here, as well as on data from an English dataset (icefall prepare script here). For the bilingual model, we want to use the icefall recipes to prepare equally sized train, dev, and test sets for both datasets, then combine them into a balanced dataset for the bilingual model. To do that, we need control over the split sizes rather than having them hardcoded.

The way I have written the code, if the new optional parameters are not specified, the train, dev, and test sets are generated identically to the current version. When the parameters are specified, the fixed random seed should ensure that the same parameter values always produce the same dev and test sets. Please let me know if I'm mistaken, though.
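The determinism argument above can be sketched as follows. This is a minimal illustration, not the actual recipe code; the function name `split_by_hours` and the `(id, duration_sec)` representation are hypothetical:

```python
import random

def split_by_hours(recordings, train_hours=None, dev_hours=None,
                   test_hours=None, seed=42):
    """Deterministically split (id, duration_sec) pairs into train/dev/test.

    A fixed seed means the same inputs and the same target hours always
    yield the same splits, so dev and test are reproducible across runs.
    """
    rng = random.Random(seed)   # fixed seed -> reproducible shuffle
    items = sorted(recordings)  # canonical order before shuffling
    rng.shuffle(items)

    def take(pool, hours):
        # Take items until the hour budget is met; None means "take the rest".
        if hours is None:
            return pool, []
        budget, taken, total = hours * 3600.0, [], 0.0
        for item in pool:
            if total >= budget:
                break
            taken.append(item)
            total += item[1]
        return taken, pool[len(taken):]

    dev, rest = take(items, dev_hours)
    test, rest = take(rest, test_hours)
    train, _ = take(rest, train_hours)
    return train, dev, test
```

With the same seed and the same target hours, two invocations return identical splits, and dev/test never overlap.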

@pzelasko
Collaborator

I meant that dev and test sets should always be the same regardless of the desired size for your training data. If you could modify it so that this property is preserved, I'd be OK to merge this. Otherwise I have concerns about non-stable test data.
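The property requested here (identical dev and test sets no matter how much training data is drawn) can be illustrated with a sketch: fix the dev/test sizes and carve them out of a seeded shuffle before selecting any training data. All names below are hypothetical, not lhotse's actual API:

```python
import random

DEV_HOURS, TEST_HOURS = 2, 2  # fixed constants, not user-tunable

def stable_splits(recordings, train_hours, seed=42):
    """Carve fixed-size dev/test sets first, then draw train from the rest.

    Because dev and test are taken before train_hours is even consulted,
    changing train_hours can never alter which utterances land in dev/test.
    """
    rng = random.Random(seed)
    items = sorted(recordings)  # (id, duration_sec) pairs
    rng.shuffle(items)

    def take(pool, hours):
        # Accumulate items until the hour budget is met.
        budget, total, n = hours * 3600.0, 0.0, 0
        while n < len(pool) and total < budget:
            total += pool[n][1]
            n += 1
        return pool[:n], pool[n:]

    dev, rest = take(items, DEV_HOURS)
    test, rest = take(rest, TEST_HOURS)
    train, _ = take(rest, train_hours)
    return train, dev, test
```

Varying `train_hours` then changes only the training set, which is the stability guarantee being asked for.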
