
Conversation

@FrenchKrab
Contributor

The balance option of the segmentation tasks accepts a list of ProtocolFile fields, e.g. ['database', 'foo']. When batches are sampled, it looks at all existing combinations of values for these fields in the task protocol.

For example, if the files come from databases aishell and ami, and their foo field is either a or b, we compute the cartesian product [('aishell', 'a'), ('aishell', 'b'), ('ami', 'a'), ('ami', 'b')]. Batches are then created by randomly selecting one of these tuples and picking a sample from a matching file.
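The uniform case described above can be sketched roughly as follows. This is only an illustration, not the actual pyannote code: `balance_values` is a hypothetical name for the values observed in the protocol for each balanced field.

```python
import itertools
import random

# Hypothetical: values observed in the protocol for each balanced field.
balance_values = {
    "database": ["aishell", "ami"],
    "foo": ["a", "b"],
}

# Every combination of field values, in field order.
combinations = list(itertools.product(*balance_values.values()))

# Uniform balancing: each combination is equally likely to be picked.
selected = random.choice(combinations)
```

A sample would then be drawn from any file whose fields match `selected`.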

This PR makes it possible to weight the random choice from the cartesian product. For example, with

balance_weights = {
  ('aishell',): 2.0,  # trailing comma: ('aishell') alone is a plain string, not a tuple
  ('ami', 'b'): 4.0,
}

we will sample from the cartesian product using random.choices with these weights:

selected = random.choices(
    population=[('aishell', 'a'), ('aishell', 'b'), ('ami', 'a'), ('ami', 'b')],
    weights=[2.0, 2.0, 1.0, 4.0],
    k=1,
)[0]

i.e. for each tuple of the cartesian product, we find the longest matching (tuple) prefix in balance_weights and use its weight. Tuples with no matching prefix, such as ('ami', 'a') above, get a weight of 1.0.
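A minimal sketch of that longest-prefix lookup, to make the rule concrete. It is not the code from this PR: `resolve_weight` is a hypothetical helper, and the default of 1.0 for unmatched tuples is an assumption inferred from the example weights above. The sketch keys the weights by tuples, so the single-field key is written `("aishell",)` with a trailing comma.

```python
import random


def resolve_weight(key, balance_weights, default=1.0):
    """Return the weight of the longest prefix of `key` present in `balance_weights`."""
    # Try the full tuple first, then progressively shorter prefixes.
    for length in range(len(key), 0, -1):
        prefix = key[:length]
        if prefix in balance_weights:
            return balance_weights[prefix]
    # No prefix matched: fall back to the (assumed) default weight.
    return default


population = [("aishell", "a"), ("aishell", "b"), ("ami", "a"), ("ami", "b")]
balance_weights = {("aishell",): 2.0, ("ami", "b"): 4.0}

weights = [resolve_weight(key, balance_weights) for key in population]
selected = random.choices(population=population, weights=weights, k=1)[0]
```

With these inputs, `weights` evaluates to `[2.0, 2.0, 1.0, 4.0]`, matching the call shown above.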

I'm not sure this approach is flexible/clean enough to be PR-ready, and it's hard to make the docstring concise, but I think it could be really useful :)

@FrenchKrab FrenchKrab marked this pull request as draft December 19, 2023 10:15
@stale

stale bot commented Jul 23, 2024

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the wontfix label Jul 23, 2024
@hbredin hbredin removed the wontfix label Jul 23, 2024
@stale

stale bot commented Jan 19, 2025

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the wontfix label Jan 19, 2025
@stale stale bot closed this Feb 21, 2025
@hbredin hbredin reopened this Feb 21, 2025
@stale stale bot removed the wontfix label Feb 21, 2025
@FrenchKrab
Contributor Author

I wiped out the old branch since it was created too many commits ago. This is my updated proposal.
The task now has the balance parameter typed as:

balance: TaskBalancingSpecifications | Sequence[str] | dict | None = None

TaskBalancingSpecifications is the new class that handles all logic related to weighting/balancing rules. Sequence[str] is kept for compatibility with old code (uniform balancing).
dict is redundant, but it lets the user specify balancing rules easily when using hydra (without having to instantiate the TaskBalancingSpecifications class)... maybe it is not necessary?

@FrenchKrab FrenchKrab marked this pull request as ready for review April 14, 2025 09:23
@FrenchKrab
Contributor Author

NOTE: it also implements #1787, which should be closed if this is merged

@stale

stale bot commented Oct 11, 2025

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the wontfix label Oct 11, 2025
@hbredin hbredin removed the wontfix label Oct 13, 2025