@Samoed (Member) commented Jan 3, 2026

Ref #3498

I’ve started integrating audio statistics. For now, I’ve come up with this format. Do you have any suggestions?

class AudioStatistics(TypedDict):
    """Class for descriptive statistics for audio.

    Attributes:
        total_audio_seconds_length: Total length of all audio clips in seconds
        min_audio_seconds_length: Minimum length of audio clip in seconds
        average_audio_seconds_length: Average length of audio clip in seconds
        max_audio_seconds_length: Maximum length of audio clip in seconds
        unique_audios: Number of unique audio clips
        average_sampling_rate: Average sampling rate
        sampling_rates: Dict of unique sampling rates and their frequencies
    """

    total_audio_seconds_length: float

    min_audio_seconds_length: float
    average_audio_seconds_length: float
    max_audio_seconds_length: float

    unique_audios: int

    average_sampling_rate: float
    sampling_rates: dict[int, int]
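
For illustration, here is a minimal sketch of how these fields could be filled in. It is not the code from this PR: the helper name `compute_audio_statistics` and the input shape (a list of `(samples, sampling_rate_hz)` pairs with mono samples) are assumptions made for the example, and uniqueness is approximated by comparing raw sample values.

```python
from collections import Counter
from collections.abc import Sequence
from typing import TypedDict


class AudioStatistics(TypedDict):
    total_audio_seconds_length: float
    min_audio_seconds_length: float
    average_audio_seconds_length: float
    max_audio_seconds_length: float
    unique_audios: int
    average_sampling_rate: float
    sampling_rates: dict[int, int]


def compute_audio_statistics(
    clips: Sequence[tuple[Sequence[float], int]],
) -> AudioStatistics:
    """Hypothetical helper: each clip is (mono samples, sampling rate in Hz)."""
    if not clips:
        raise ValueError("need at least one clip")

    # Duration in seconds = number of frames / sampling rate in Hz.
    durations = [len(samples) / rate for samples, rate in clips]
    rates = [rate for _, rate in clips]

    # Assumption: two clips are "the same" if samples and rate match exactly.
    unique = {(tuple(samples), rate) for samples, rate in clips}

    return AudioStatistics(
        total_audio_seconds_length=sum(durations),
        min_audio_seconds_length=min(durations),
        average_audio_seconds_length=sum(durations) / len(durations),
        max_audio_seconds_length=max(durations),
        unique_audios=len(unique),
        average_sampling_rate=sum(rates) / len(rates),
        # Maps each sampling rate (Hz) to the number of clips that use it.
        sampling_rates=dict(Counter(rates)),
    )


stats = compute_audio_statistics(
    [
        ([0.0] * 16000, 16000),  # 1.0 s
        ([0.0] * 8000, 16000),   # 0.5 s
        ([0.1] * 22050, 44100),  # 0.5 s
        ([0.0] * 16000, 16000),  # exact duplicate of the first clip
    ]
)
```

A real implementation would likely hash the raw audio bytes instead of building tuples, which is slow for long clips.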

@Samoed Samoed added the maeb Audio extension label Jan 3, 2026
@isaac-chung (Collaborator)

When I see length, I think in seconds. I like the frames approach too, and I'd like it spelled out explicitly (num_frames or whatever). I'd like to see:

  • the max/min/total number of seconds
  • the unique set of sampling rates (specify unit)

Would love to hear other feedback as well while I read into it a bit more.
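
As a small sketch of the frames versus seconds distinction raised above: the two are related through the sampling rate, so putting the unit in the name removes the ambiguity. The function names below are hypothetical and not part of the PR.

```python
def frames_to_seconds(num_frames: int, sampling_rate_hz: int) -> float:
    """Convert a frame count to a duration in seconds."""
    if sampling_rate_hz <= 0:
        raise ValueError("sampling rate must be positive")
    return num_frames / sampling_rate_hz


def seconds_to_frames(seconds: float, sampling_rate_hz: int) -> int:
    """Convert a duration in seconds to the nearest whole frame count."""
    if sampling_rate_hz <= 0:
        raise ValueError("sampling rate must be positive")
    return round(seconds * sampling_rate_hz)


# 48,000 frames recorded at 16 kHz last three seconds.
duration_s = frames_to_seconds(48_000, 16_000)
# Half a second at 44.1 kHz is 22,050 frames.
num_frames = seconds_to_frames(0.5, 44_100)
```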

@Samoed (Member, Author) commented Jan 3, 2026

Added seconds and sampling rates

@isaac-chung (Collaborator) left a comment

Sorry for adding more. Revisited some papers and maybe we should use the standard measure of audio dataset size.

@isaac-chung (Collaborator) left a comment

Just wanted to align with HF notation, plus some questions.

unique_audios: int

average_sampling_rate: float
sampling_rates: dict[int, int]
Collaborator

Could this just be a unique set of sampling rates? OK either way.

Suggested change
- sampling_rates: dict[int, int]
+ sampling_rates: list[int]

Collaborator

What do you think?

Member Author

I think it's better to keep the dict so that it shows the full distribution of sampling rates. If this becomes a problem, we can easily change it to a list of ints.
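
To illustrate the trade-off, here is a short sketch with made-up counts: the unique list of rates can always be derived from the dict, while the per-rate counts cannot be recovered from the list.

```python
# Hypothetical distribution: sampling rate in Hz -> number of clips.
sampling_rates: dict[int, int] = {16000: 120, 44100: 30}

# The list form suggested in the review is just the sorted keys.
unique_rates = sorted(sampling_rates)

# The dict also allows a clip-weighted average, which the list alone does not.
total_clips = sum(sampling_rates.values())
average_sampling_rate = (
    sum(rate * count for rate, count in sampling_rates.items()) / total_clips
)
```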

@Myahr208 Myahr208 mentioned this pull request Jan 3, 2026
@KennethEnevoldsen (Contributor) left a comment

Minor things - generally think this looks good (of course Isaac's comments still apply, but nothing more to add)

@Samoed (Member, Author) commented Jan 8, 2026

@isaac-chung Can you review this PR?

# Conflicts:
#	pyproject.toml
#	uv.lock
Collaborator

Judging from the size, does this incorporate changes from #3875 too?

Member Author

Yes, I used these changes here too

Collaborator

Lovely!

@Samoed Samoed merged commit 3d17dbc into maeb Jan 8, 2026
10 checks passed
@Samoed Samoed deleted the audio_statistics branch January 8, 2026 20:18