
Conversation

@isaac-chung
Collaborator

@isaac-chung isaac-chung commented Jan 5, 2026

See the draft benchmarks. (For audio-text I actually use the full collection, with no filtering.) You'll also find the filtering notebook and the script that generates "Table 1".

@KennethEnevoldsen @AdnanElAssadi56 maybe another one for environmental or something?

isaac-chung and others added 10 commits January 4, 2026 15:16
Implements new task selection approach using correlation analysis
and clustering for MAEB evaluation.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4 <[email protected]>
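For readers following along, a minimal sketch of what correlation-threshold pruning like this can look like. The synthetic `results_df`, the 0.9 threshold, and the greedy keep-first rule are illustrative assumptions, not the notebook's exact code:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Hypothetical results table: rows = models, columns = tasks, values = main scores.
# A shared "base" component makes the toy tasks highly correlated.
base = rng.normal(size=(20, 1))
results_df = pd.DataFrame(
    base + 0.1 * rng.normal(size=(20, 6)),
    columns=[f"Task{i}" for i in range(6)],
)

# Task-by-task Spearman correlation, computed over model scores.
corr = results_df.corr(method="spearman")

threshold = 0.9
to_remove: set[str] = set()
for i, a in enumerate(corr.columns):
    for b in corr.columns[i + 1:]:
        # If two tasks rank models nearly identically, keep only one of them.
        if corr.loc[a, b] >= threshold and a not in to_remove and b not in to_remove:
            to_remove.add(b)

kept = [t for t in corr.columns if t not in to_remove]
print(f"kept {len(kept)}/{len(corr.columns)} tasks:", kept)
```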
- Add domain, category, and language checks to is_candidate_valid_removal
  to preserve at least one task from each unique domain, category, and language
- Add top 5 longest tasks display for CLAP model reference timing
- Add diagnostic cell for tasks with many negative correlations
- Expand correlation thresholds to include 0.8 and 0.9
- Add Languages, Domains, Categories columns to summary table
- Comment out license filtering to include all tasks
- Handle empty model coverage gracefully with fallback logic

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <[email protected]>
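The helper name `is_candidate_valid_removal` comes from the commit above; the body below is a hedged guess at the described check, with made-up task metadata:

```python
# Sketch of the diversity check: a task may only be dropped if every domain,
# category, and language it carries is still covered by at least one
# remaining task. The task metadata here is illustrative.
tasks = {
    "TaskA": {"domains": {"speech"}, "category": "clf", "languages": {"eng"}},
    "TaskB": {"domains": {"speech"}, "category": "clf", "languages": {"eng", "deu"}},
    "TaskC": {"domains": {"music"}, "category": "clu", "languages": {"eng"}},
}

def is_candidate_valid_removal(candidate: str, remaining: set[str]) -> bool:
    others = remaining - {candidate}
    meta = tasks[candidate]
    covered_domains = set().union(*(tasks[t]["domains"] for t in others))
    covered_categories = {tasks[t]["category"] for t in others}
    covered_languages = set().union(*(tasks[t]["languages"] for t in others))
    return (
        meta["domains"] <= covered_domains
        and meta["category"] in covered_categories
        and meta["languages"] <= covered_languages
    )

print(is_candidate_valid_removal("TaskA", set(tasks)))  # True: TaskB covers it
print(is_candidate_valid_removal("TaskC", set(tasks)))  # False: only music/clu task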
…ased tasks_to_keep

- Move UMAP+HDBSCAN clustering right after initial correlation matrix
- Define tasks_to_keep from outlier cluster (label -1) instead of empty list
- Split function definitions to break circular dependency
- Add domain counts cell after results DataFrame
- Add model coverage distribution analysis (models at each task count)
- Use models with >= 50 tasks for runtime estimation
- Show task coverage in runtime output (N/M tasks with eval times)

🤖 Generated with [Claude Code](https://claude.ai/claude-code)

Co-Authored-By: Claude <[email protected]>
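A minimal sketch of the UMAP+HDBSCAN step as described, using the `umap-learn` and `hdbscan` packages on a synthetic correlation matrix (parameter choices are assumptions):

```python
import numpy as np
import umap     # pip install umap-learn
import hdbscan  # pip install hdbscan

rng = np.random.default_rng(0)
# Stand-in for the task-by-task correlation matrix (symmetric, values in [-1, 1]).
corr = np.corrcoef(rng.normal(size=(30, 15)))

# Embed tasks into 2D, using each task's correlation profile as its feature vector.
embedding = umap.UMAP(n_components=2, random_state=0).fit_transform(corr)

labels = hdbscan.HDBSCAN(min_cluster_size=3).fit_predict(embedding)

# Label -1 marks HDBSCAN outliers: tasks that correlate with nothing else,
# which the notebook protects from removal (tasks_to_keep).
tasks_to_keep = np.where(labels == -1)[0]
print("outlier tasks:", tasks_to_keep)
```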
- Add get_pairs_above_threshold helper to get all correlated pairs
- Track skipped_pairs where neither task can be removed
- Continue to next pair when current pair is protected
- Clear skipped_pairs when task set changes after removal
- Only stop when all pairs above threshold have been tried

🤖 Generated with [Claude Code](https://claude.ai/claude-code)

Co-Authored-By: Claude <[email protected]>
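A sketch of the pair-tracking loop this commit describes. `get_pairs_above_threshold` is named in the commit; `corr` (a pandas correlation DataFrame, as in the earlier sketch) and the `can_remove` callback (standing in for `is_candidate_valid_removal`) are assumptions:

```python
def get_pairs_above_threshold(corr, threshold):
    """All task pairs whose correlation meets the threshold, highest first."""
    cols = list(corr.columns)
    pairs = [
        (a, b, corr.loc[a, b])
        for i, a in enumerate(cols)
        for b in cols[i + 1:]
        if corr.loc[a, b] >= threshold
    ]
    return sorted(pairs, key=lambda p: -p[2])

def prune(corr, threshold, can_remove):
    active = set(corr.columns)
    skipped_pairs: set[tuple[str, str]] = set()
    while True:
        pairs = [
            (a, b, c) for a, b, c in get_pairs_above_threshold(corr, threshold)
            if a in active and b in active and (a, b) not in skipped_pairs
        ]
        if not pairs:
            return active  # every pair above threshold has been tried
        a, b, _ = pairs[0]
        for candidate in (b, a):
            if can_remove(candidate, active):
                active.discard(candidate)
                skipped_pairs.clear()  # task set changed; retry everything
                break
        else:
            skipped_pairs.add((a, b))  # neither side removable; move to next pair
```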
Visualizes results_df with:
- Blue gradient colormap (light to dark)
- White background for NaN values
- Adaptive text color (white for high scores, black for low)
- Dynamic figure sizing based on data dimensions

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <[email protected]>
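A hedged matplotlib sketch of the described heatmap; the 0.6 text-color cutoff and the sizing factors are guesses:

```python
import matplotlib as mpl
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
scores = pd.DataFrame(rng.uniform(0.2, 0.9, size=(8, 5)),
                      columns=[f"Task{j}" for j in range(5)])
scores.iloc[2, 3] = np.nan  # a missing (model, task) result

# Dynamic figure size based on data dimensions.
fig, ax = plt.subplots(figsize=(1.2 * scores.shape[1], 0.5 * scores.shape[0]))

cmap = mpl.colormaps["Blues"].copy()
cmap.set_bad("white")  # NaN cells render with a white background
ax.imshow(np.ma.masked_invalid(scores.to_numpy()), cmap=cmap, vmin=0, vmax=1)

for i in range(scores.shape[0]):
    for j in range(scores.shape[1]):
        v = scores.iat[i, j]
        if np.isnan(v):
            continue
        # Adaptive text color: white on dark (high-score) cells, black on light.
        ax.text(j, i, f"{v:.2f}", ha="center", va="center",
                color="white" if v > 0.6 else "black")
plt.show()
```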
- Add MAEB(audio-text) benchmark with 17 cross-modal retrieval tasks
  (8 audio-to-text, 9 text-to-audio) selected via correlation threshold 0.95
- Inline task lists directly in MAEB benchmark objects
- Add threshold 0.95 to task selection notebook
- Convert comparison plot from 1x5 to 2x3 layout for 6 thresholds
- Fix tasks_to_select_from to use modality-filtered tasks
- Use models with complete eval times for runtime estimation

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <[email protected]>
- Expand MAEB(audio-text) benchmark from 17 to 29 tasks (14 A2T + 15 T2A)
- Fix msclap model revision from "N/A" to "no_revision" to match results cache
- Update benchmark contacts

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <[email protected]>
Script generates top 10 model rankings for MAEB(audio) and MAEB(audio-text)
benchmarks using Borda count, with per-category averages.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <[email protected]>
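For reference, Borda count on a model-by-task score table is a couple of lines of pandas; whether the actual script sums or averages points, and how it breaks ties, is not specified here, so treat this as a sketch:

```python
import pandas as pd

# Toy results: rows = models, columns = tasks, values = main scores.
results = pd.DataFrame(
    {"TaskA": [0.9, 0.7, 0.5], "TaskB": [0.6, 0.8, 0.4]},
    index=["model1", "model2", "model3"],
)

# Borda count: on each task a model earns one point per model it beats
# (rank minus 1, average ranks on ties), summed across tasks.
borda = results.rank(axis=0, method="average").sub(1).sum(axis=1)
print(borda.sort_values(ascending=False))
```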
Contributor

I generally like marimo, but damn this is not the easiest thing to review. This is one of the cases where you really need the results to know what is filtered and why (having to git pull and run it to see seems like a big drawback). Is it possible to convert it to an .ipynb or .md for the results?

Collaborator Author

Ya, I can export a PDF or HTML or something?

@Samoed
Member

Samoed commented Jan 6, 2026

Created an overview table of tasks and where they're used. There's also a version on Google Sheets: https://docs.google.com/spreadsheets/d/1wyTvW0q6TIat7RMmfimlNKXri9O7cs_S0uebGTNya0c/edit?usp=sharing

Table
Task Name Task description Task type Task language(s) In MAEB(audio) In MAEB(audio-text)
0 AmbientAcousticContext The Ambient Acoustic Context dataset contains 1-second segments for activities that occur in a workplace setting. This is a downsampled version with ~100 train and ~50 test samples per class. AudioClassification eng-Latn No No
1 AmbientAcousticContextClustering Clustering task based on a subset of the Ambient Acoustic Context dataset containing 1-second segments for workplace activities. AudioClustering eng-Latn No No
2 AudioCapsA2TRetrieval Natural language description for any kind of audio in the wild. Any2AnyRetrieval eng-Latn No Yes
3 AudioCapsMiniReranking A smaller subset of AudioCaps dataset preprocessed for audio reranking AudioReranking eng-Latn Yes No
4 AudioCapsT2ARetrieval Natural language description for any kind of audio in the wild. Any2AnyRetrieval eng-Latn No Yes
5 AudioSetMini AudioSet consists of an expanding ontology of 632 audio event classes and a collection of 2,084,320 human-labeled 10-second sound clips drawn from YouTube videos. This is a mini version that is sampled from the original dataset. AudioMultilabelClassification eng-Latn No No
6 AudioSetStrongA2TRetrieval Retrieve all temporally-strong labeled events within 10s audio clips from the AudioSet Strongly-Labeled subset. Any2AnyRetrieval eng-Latn No Yes
7 AudioSetStrongT2ARetrieval Retrieve audio segments corresponding to a given sound event label from the AudioSet Strongly-Labeled 10s clips. Any2AnyRetrieval eng-Latn No Yes
8 BeijingOpera Audio classification of percussion instruments into one of 4 classes: Bangu, Naobo, Daluo, and Xiaoluo AudioClassification eng-Latn No No
9 BirdCLEF BirdCLEF+ 2025 dataset for species identification from audio, focused on birds, amphibians, mammals and insects from the Middle Magdalena Valley of Colombia. Downsampled to 50 classes with 20 samples each. AudioClassification eng-Latn No No
10 BirdSet BirdSet: A Large-Scale Dataset for Audio Classification in Avian Bioacoustics AudioClassification eng-Latn Yes No
11 CMUArcticA2TRetrieval Retrieve the correct transcription for an English speech segment. The dataset is derived from the phonetically balanced CMU Arctic single-speaker TTS corpora. The corpus contains 1150 samples based on read-aloud segments from books, which are out of copyright and derived from the Gutenberg project. Any2AnyRetrieval eng-Latn No Yes
12 CMUArcticT2ARetrieval Retrieve the correct audio segment for an English transcription. The dataset is derived from the phonetically balanced CMU Arctic single-speaker TTS corpora. The corpus contains 1150 audio-text pairs based on read-aloud segments from public domain books originally sourced from the Gutenberg project. Any2AnyRetrieval eng-Latn No Yes
13 CREMADPairClassification Classifying pairs as having same or different emotions in actor's voice recordings of text spoken in 6 different emotions AudioPairClassification eng-Latn Yes No
14 CREMA_D Emotion classification of audio into one of 6 classes: Anger, Disgust, Fear, Happy, Neutral, Sad. AudioClassification eng-Latn No No
15 CREMA_DClustering Emotion clustering task with audio data for 6 emotions: Anger, Disgust, Fear, Happy, Neutral, Sad. AudioClustering eng-Latn No No
16 ClothoA2TRetrieval An audio captioning dataset containing audio clips and their corresponding captions. Any2AnyRetrieval eng-Latn No Yes
17 ClothoT2ARetrieval An audio captioning dataset containing audio clips from the Freesound platform and their corresponding captions. Any2AnyRetrieval eng-Latn No Yes
18 CommonLanguageAgeDetection Age Classification. This is a stratified subsampled version of the original CommonLanguage dataset. AudioClassification eng-Latn Yes No
19 CommonLanguageGenderDetection Gender Classification. This is a stratified subsampled version of the original CommonLanguage dataset. AudioClassification eng-Latn No No
20 CommonLanguageLanguageDetection Language Classification. This is a stratified subsampled version of the original CommonLanguage dataset. AudioClassification eng-Latn No No
21 CommonVoice17A2TRetrieval Speech recordings with corresponding text transcriptions from CommonVoice dataset. Any2AnyRetrieval abk-Latn,afr-Latn,amh-Ethi,ara-Arab,asm-Beng,ast-Latn,aze-Latn,bak-Cyrl,bas-Latn,bel-Cyrl,ben-Beng,bre-Latn,bul-Cyrl,cat-Latn,ces-Latn,chv-Cyrl,ckb-Arab,cnh-Latn,cym-Latn,dan-Latn,deu-Latn,div-Thaa,dyu-Latn,ell-Grek,eng-Latn,epo-Latn,est-Latn,eus-Latn,fas-Arab,fin-Latn,fra-Latn,fry-Latn,gle-Latn,glg-Latn,grn-Latn,hau-Latn,heb-Hebr,hin-Deva,hsb-Latn,hun-Latn,hye-Armn,ibo-Latn,ina-Latn,ind-Latn,spa-Latn No No
22 CommonVoice17T2ARetrieval Speech recordings with corresponding text transcriptions from CommonVoice dataset. Any2AnyRetrieval abk-Latn,afr-Latn,amh-Ethi,ara-Arab,asm-Beng,ast-Latn,aze-Latn,bak-Cyrl,bas-Latn,bel-Cyrl,ben-Beng,bre-Latn,bul-Cyrl,cat-Latn,ces-Latn,chv-Cyrl,ckb-Arab,cnh-Latn,cym-Latn,dan-Latn,deu-Latn,div-Thaa,dyu-Latn,ell-Grek,eng-Latn,epo-Latn,est-Latn,eus-Latn,fas-Arab,fin-Latn,fra-Latn,fry-Latn,gle-Latn,glg-Latn,grn-Latn,hau-Latn,heb-Hebr,hin-Deva,hsb-Latn,hun-Latn,hye-Armn,ibo-Latn,ina-Latn,ind-Latn,spa-Latn No No
23 CommonVoice21A2TRetrieval Speech recordings with corresponding text transcriptions from CommonVoice dataset. Any2AnyRetrieval abk-Latn,afr-Latn,amh-Ethi,ara-Arab,asm-Beng,ast-Latn,aze-Latn,bak-Cyrl,bas-Latn,bel-Cyrl,ben-Beng,bre-Latn,bul-Cyrl,cat-Latn,ces-Latn,chv-Cyrl,ckb-Arab,cnh-Latn,cym-Latn,dan-Latn,deu-Latn,div-Thaa,dyu-Latn,ell-Grek,eng-Latn,epo-Latn,est-Latn,eus-Latn,fas-Arab,fin-Latn,fra-Latn,fry-Latn,gle-Latn,glg-Latn,grn-Latn,hau-Latn,heb-Hebr,hin-Deva,hsb-Latn,hun-Latn,hye-Armn,ibo-Latn,ina-Latn,ind-Latn,spa-Latn No No
24 CommonVoice21T2ARetrieval Speech recordings with corresponding text transcriptions from CommonVoice dataset. Any2AnyRetrieval abk-Latn,afr-Latn,amh-Ethi,ara-Arab,asm-Beng,ast-Latn,aze-Latn,bak-Cyrl,bas-Latn,bel-Cyrl,ben-Beng,bre-Latn,bul-Cyrl,cat-Latn,ces-Latn,chv-Cyrl,ckb-Arab,cnh-Latn,cym-Latn,dan-Latn,deu-Latn,div-Thaa,dyu-Latn,ell-Grek,eng-Latn,epo-Latn,est-Latn,eus-Latn,fas-Arab,fin-Latn,fra-Latn,fry-Latn,gle-Latn,glg-Latn,grn-Latn,hau-Latn,heb-Hebr,hin-Deva,hsb-Latn,hun-Latn,hye-Armn,ibo-Latn,ina-Latn,ind-Latn,spa-Latn No No
25 ESC50 Environmental Sound Classification Dataset. AudioClassification eng-Latn No No
26 ESC50AudioReranking ESC-50 environmental sound dataset adapted for audio reranking. Given a query audio of environmental sounds, rank 5 relevant audio samples higher than 16 irrelevant ones from different sound classes. Contains 200 queries across 50 environmental sound categories for robust evaluation. AudioReranking eng-Latn No No
27 ESC50Clustering The ESC-50 dataset contains 2,000 labeled environmental audio recordings evenly distributed across 50 classes (40 clips per class). These classes are organized into 5 broad categories: animal sounds, natural soundscapes & water sounds, human (non-speech) sounds, interior/domestic sounds, and exterior/urban noises. This task evaluates unsupervised clustering performance on environmental audio recordings. AudioClustering eng-Latn No No
28 ESC50PairClassification Environmental Sound Classification Dataset. AudioPairClassification eng-Latn No No
29 ESC50_Zeroshot Environmental Sound Classification Dataset. AudioZeroshotClassification eng-Latn No No
30 EmoVDBA2TRetrieval Natural language emotional captions for speech segments from the EmoV-DB emotional voices database. Any2AnyRetrieval eng-Latn No Yes
31 EmoVDBT2ARetrieval Natural language emotional captions for speech segments from the EmoV-DB emotional voices database. Any2AnyRetrieval eng-Latn No Yes
32 FSD2019Kaggle Multilabel Audio Classification. AudioMultilabelClassification eng-Latn Yes No
33 FSD50K Multilabel Audio Classification. AudioMultilabelClassification eng-Latn Yes No
34 FSDD Spoken digit classification of audio into one of 10 classes: 0-9 AudioClassification eng-Latn No No
35 FSDnoisy18kAudioReranking FSDnoisy18k sound event dataset adapted for audio reranking. Given a query audio with potential label noise, rank 4 relevant audio samples higher than 16 irrelevant ones from different sound classes. Contains 200 queries across 20 sound event categories. AudioReranking eng-Latn Yes No
36 FleursA2TRetrieval Speech recordings with corresponding text transcriptions from the FLEURS dataset. Any2AnyRetrieval afr-Latn,amh-Ethi,ara-Arab,asm-Beng,ast-Latn,aze-Latn,bel-Cyrl,ben-Beng,bos-Latn,bul-Cyrl,cat-Latn,ceb-Latn,ces-Latn,ckb-Arab,cmn-Hans,cym-Latn,dan-Latn,deu-Latn,ell-Grek,eng-Latn,est-Latn,fas-Arab,fil-Latn,fin-Latn,fra-Latn,ful-Latn,gle-Latn,glg-Latn,guj-Gujr,hau-Latn,heb-Hebr,hin-Deva,hrv-Latn,hun-Latn,hye-Armn,ibo-Latn,ind-Latn,isl-Latn,ita-Latn,jav-Latn,jpn-Jpan,kam-Latn,kan-Knda,kat-Geor,kaz-Cyrl,kea-Latn,khm-Khmr,kir-Cyrl,kor-Hang,lao-Laoo,lin-Latn,lit-Latn,ltz-Latn,lug-Latn,luo-Latn,lvs-Latn,mal-Mlym,mar-Deva,mkd-Cyrl,mlt-Latn,mon-Cyrl,mri-Latn,msa-Latn,mya-Mymr,nld-Latn,nob-Latn,npi-Deva,nso-Latn,nya-Latn,oci-Latn,ori-Orya,orm-Latn,pan-Guru,pol-Latn,por-Latn,pus-Arab,ron-Latn,rus-Cyrl,slk-Latn,slv-Latn,sna-Latn,snd-Arab,som-Latn,spa-Latn,srp-Cyrl,swe-Latn,swh-Latn,tam-Taml,tel-Telu,tgk-Cyrl,tha-Thai,tur-Latn,ukr-Cyrl,umb-Latn,urd-Arab,uzn-Latn,vie-Latn,wol-Latn,xho-Latn,yor-Latn,yue-Hant,zul-Latn No Yes
37 FleursT2ARetrieval Speech recordings with corresponding text transcriptions from the FLEURS dataset. Any2AnyRetrieval afr-Latn,amh-Ethi,ara-Arab,asm-Beng,ast-Latn,aze-Latn,bel-Cyrl,ben-Beng,bos-Latn,bul-Cyrl,cat-Latn,ceb-Latn,ces-Latn,ckb-Arab,cmn-Hans,cym-Latn,dan-Latn,deu-Latn,ell-Grek,eng-Latn,est-Latn,fas-Arab,fil-Latn,fin-Latn,fra-Latn,ful-Latn,gle-Latn,glg-Latn,guj-Gujr,hau-Latn,heb-Hebr,hin-Deva,hrv-Latn,hun-Latn,hye-Armn,ibo-Latn,ind-Latn,isl-Latn,ita-Latn,jav-Latn,jpn-Jpan,kam-Latn,kan-Knda,kat-Geor,kaz-Cyrl,kea-Latn,khm-Khmr,kir-Cyrl,kor-Hang,lao-Laoo,lin-Latn,lit-Latn,ltz-Latn,lug-Latn,luo-Latn,lvs-Latn,mal-Mlym,mar-Deva,mkd-Cyrl,mlt-Latn,mon-Cyrl,mri-Latn,msa-Latn,mya-Mymr,nld-Latn,nob-Latn,npi-Deva,nso-Latn,nya-Latn,oci-Latn,ori-Orya,orm-Latn,pan-Guru,pol-Latn,por-Latn,pus-Arab,ron-Latn,rus-Cyrl,slk-Latn,slv-Latn,sna-Latn,snd-Arab,som-Latn,spa-Latn,srp-Cyrl,swe-Latn,swh-Latn,tam-Taml,tel-Telu,tgk-Cyrl,tha-Thai,tur-Latn,ukr-Cyrl,umb-Latn,urd-Arab,uzn-Latn,vie-Latn,wol-Latn,xho-Latn,yor-Latn,yue-Hant,zul-Latn No Yes
38 GTZANAudioReranking GTZAN music genre dataset adapted for audio reranking. Given a query audio from one of 10 music genres, rank 3 relevant audio samples higher than 10 irrelevant ones from different genres. Contains 100 queries across 10 music genres for comprehensive evaluation. AudioReranking eng-Latn No No
39 GTZANGenre Music Genre Classification (10 classes) AudioClassification eng-Latn No No
40 GTZANGenreClustering Music genre clustering task based on GTZAN dataset with 10 music genres. AudioClustering eng-Latn No No
41 GigaSpeechA2TRetrieval Given an English speech segment, retrieve its correct transcription. Audio comes from the 10 000‑hour training subset of GigaSpeech, which originates from ≈40 000 hours of transcribed audiobooks, podcasts, and YouTube. Any2AnyRetrieval eng-Latn No Yes
42 GigaSpeechT2ARetrieval Given an English transcription, retrieve its corresponding audio segment. Audio comes from the 10 000‑hour training subset of GigaSpeech, sourced from ≈40 000 hours of transcribed audiobooks, podcasts, and YouTube. Any2AnyRetrieval eng-Latn No Yes
43 GunshotTriangulation Classifying a weapon based on its muzzle blast AudioClassification eng-Latn No No
44 HiFiTTSA2TRetrieval Sentence-level text captions aligned to 44.1 kHz audiobook speech segments from the Hi‑Fi Multi‑Speaker English TTS dataset. Dataset is based on public audiobooks from LibriVox and texts from Project Gutenberg. Any2AnyRetrieval eng-Latn No Yes
45 HiFiTTST2ARetrieval Sentence-level text captions aligned to 44.1 kHz audiobook speech segments from the Hi‑Fi Multi‑Speaker English TTS dataset. Dataset is based on public audiobooks from LibriVox and texts from Project Gutenberg. Any2AnyRetrieval eng-Latn No Yes
46 IEMOCAPEmotion Classification of speech samples into emotions (angry, happy, sad, neutral, frustrated, excited, fearful, surprised, disgusted) from interactive emotional dyadic conversations. AudioClassification eng-Latn No No
47 IEMOCAPGender Classification of speech samples by speaker gender (male/female) from the IEMOCAP database of interactive emotional dyadic conversations. AudioClassification eng-Latn No No
48 JLCorpusA2TRetrieval Emotional speech segments from the JL-Corpus, balanced over long vowels and annotated for primary and secondary emotions. Any2AnyRetrieval eng-Latn No Yes
49 JLCorpusT2ARetrieval Emotional speech segments from the JL-Corpus, balanced over long vowels and annotated for primary and secondary emotions. Any2AnyRetrieval eng-Latn No Yes
50 LibriCount Multiclass speaker count identification. Dataset contains audio recordings with between 0 to 10 speakers. AudioClassification eng-Latn Yes No
51 LibriTTSA2TRetrieval Given audiobook speech segments from the multi‑speaker LibriTTS corpus, retrieve the correct text transcription. LibriTTS is a 585‑hour, 24 kHz, multi‑speaker English TTS corpus derived from LibriVox (audio) and Project Gutenberg (text). Any2AnyRetrieval eng-Latn No Yes
52 LibriTTST2ARetrieval Given an English text transcription, retrieve its corresponding audiobook speech segment from the multi‑speaker LibriTTS corpus. LibriTTS is a 585‑hour, 24 kHz, multi‑speaker English TTS corpus derived from LibriVox and Project Gutenberg. Any2AnyRetrieval eng-Latn No Yes
53 MACSA2TRetrieval Audio captions and tags for urban acoustic scenes in TAU Urban Acoustic Scenes 2019 development dataset. Any2AnyRetrieval eng-Latn No Yes
54 MACST2ARetrieval Audio captions and tags for urban acoustic scenes in TAU Urban Acoustic Scenes 2019 development dataset. Any2AnyRetrieval eng-Latn No Yes
55 MInDS14 MInDS-14 is an evaluation resource for intent detection with spoken data in 14 diverse languages. AudioClassification ces-Latn,deu-Latn,eng-Latn,fra-Latn,ita-Latn,kor-Hang,nld-Latn,pol-Latn,por-Latn,rus-Cyrl,spa-Latn,zho-Hans Yes No
56 MridinghamStroke Stroke classification of Mridingham (a pitched percussion instrument) into one of 10 classes: ["bheem", "cha", "dheem", "dhin", "num", "tham", "ta", "tha", "thi", "thom"] AudioClassification eng-Latn Yes No
57 MridinghamTonic Tonic classification of Mridingham (a pitched percussion instrument) into one of 6 classes: B,C,C#,D,D#,E AudioClassification eng-Latn No No
58 MusicCapsA2TRetrieval Natural language description for music audio. Any2AnyRetrieval eng-Latn No Yes
59 MusicCapsT2ARetrieval Natural language description for music audio. Any2AnyRetrieval eng-Latn No Yes
60 MusicGenreClustering Clustering music recordings in 9 different genres. AudioClustering eng-Latn Yes No
61 NMSQAPairClassification A textless Q&A dataset. Given a pair of audio question and audio answer, is the answer relevant to the question? AudioPairClassification eng-Latn Yes No
62 NSynth Instrument Source Classification: one of acoustic, electronic, or synthetic. AudioClassification eng-Latn No No
63 RavdessZeroshot Emotion classification Dataset. RAVDESS contains 24 professional actors (12 female, 12 male), vocalizing two lexically-matched statements in a neutral North American accent. Speech emotions include neutral, calm, happy, sad, angry, fearful, surprise, and disgust expressions. These 8 emotions also serve as labels for the dataset. AudioZeroshotClassification eng-Latn Yes No
64 SIBFLEURS Topic Classification for multilingual audio dataset. This dataset is a stratified and downsampled subset of the SIBFLEURS dataset, which is a collection of 1000+ hours of audio data in 100+ languages. AudioMultilabelClassification afr-Latn,amh-Ethi,arb-Arab,asm-Beng,ast-Latn,azj-Latn,bel-Cyrl,ben-Beng,bos-Latn,bul-Cyrl,cat-Latn,ceb-Latn,ces-Latn,ckb-Arab,cym-Latn,dan-Latn,deu-Latn,ell-Grek,eng-Latn,est-Latn,fin-Latn,fra-Latn,fuv-Latn,gaz-Latn,gle-Latn,glg-Latn,guj-Gujr,hau-Latn,heb-Hebr,hin-Deva,hrv-Latn,hun-Latn,hye-Armn,ibo-Latn,ind-Latn,isl-Latn,ita-Latn,jav-Latn,jpn-Jpan,kam-Latn,kan-Knda,kat-Geor,kaz-Cyrl,kea-Latn,khk-Cyrl,khm-Khmr,kir-Cyrl,kor-Hang,lao-Laoo,lin-Latn,lit-Latn,ltz-Latn,lug-Latn,luo-Latn,lvs-Latn,mal-Mlym,mar-Deva,mkd-Cyrl,mlt-Latn,mri-Latn,mya-Mymr,nld-Latn,nob-Latn,npi-Deva,nso-Latn,nya-Latn,oci-Latn,ory-Orya,pan-Guru,pbt-Arab,pes-Arab,pol-Latn,por-Latn,ron-Latn,rus-Cyrl,slk-Latn,slv-Latn,sna-Latn,snd-Arab,som-Latn,spa-Latn,srp-Cyrl,swe-Latn,swh-Latn,tam-Taml,tel-Telu,tgk-Cyrl,tgl-Latn,tha-Thai,tur-Latn,ukr-Cyrl,umb-Latn,urd-Arab,uzn-Latn,vie-Latn,wol-Latn,xho-Latn,yor-Latn,zho-Hans,zho-Hant,zsm-Latn,zul-Latn Yes No
65 SoundDescsA2TRetrieval Natural language description for different audio sources from the BBC Sound Effects webpage. Any2AnyRetrieval eng-Latn No Yes
66 SoundDescsT2ARetrieval Natural language description for different audio sources from the BBC Sound Effects webpage. Any2AnyRetrieval eng-Latn No Yes
67 SpeechCommands A set of one-second .wav audio files, each containing a single spoken English word or background noise. To keep evaluation fast, we use a downsampled version of the original dataset by keeping ~50 samples per class for training. AudioClassification eng-Latn No No
68 SpeechCommandsZeroshotv0.01 Sound Classification/Keyword Spotting Dataset. This is a set of one-second audio clips containing a single spoken English word or background noise. These words are from a small set of commands such as 'yes', 'no', and 'stop' spoken by various speakers. With a total of 10 labels/commands for keyword spotting and a total of 30 labels for other auxiliary tasks AudioZeroshotClassification eng-Latn Yes No
69 SpokeNEnglish Human Sound Classification Dataset. AudioClassification eng-Latn Yes No
70 SpokenQAForIC SpokenQA dataset reformulated as Intent Classification (IC) task AudioClassification eng-Latn Yes No
71 SpokenSQuADT2ARetrieval Text-to-audio retrieval task based on SpokenSQuAD dataset. Given a text question, retrieve relevant audio segments that contain the answer. Questions are derived from SQuAD reading comprehension dataset with corresponding spoken passages. Any2AnyRetrieval eng-Latn No Yes
72 TUTAcousticScenes TUT Urban Acoustic Scenes 2018 dataset consists of 10-second audio segments from 10 acoustic scenes recorded in six European cities. This is a stratified subsampled version of the original dataset. AudioClassification eng-Latn Yes No
73 UrbanSound8KA2TRetrieval UrbanSound8K: Audio-to-text retrieval of urban sound events. Any2AnyRetrieval eng-Latn No Yes
74 UrbanSound8KAudioReranking UrbanSound8K urban sound dataset adapted for audio reranking. Given a query audio of urban sounds, rank 4 relevant audio samples higher than 16 irrelevant ones from different urban sound classes. Contains 200 queries across 10 urban sound categories for comprehensive evaluation. AudioReranking eng-Latn No No
75 UrbanSound8KT2ARetrieval UrbanSound8K: Text-to-audio retrieval of urban sound events. Any2AnyRetrieval eng-Latn No Yes
76 UrbanSound8kZeroshot Environmental Sound Classification Dataset. AudioZeroshotClassification eng-Latn No No
77 VehicleSoundClustering Clustering vehicle sounds recorded from smartphones (0 (car class), 1 (truck, bus and van class), 2 (motorcycle class)) AudioClustering eng-Latn No No
78 VocalSound Human Vocal Sound Classification Dataset. AudioClassification eng-Latn No No
79 VocalSoundAudioReranking VocalSound dataset adapted for audio reranking. Given a query vocal sound from one of 6 categories, rank 4 relevant vocal samples higher than 16 irrelevant ones from different vocal sound types. Contains 198 queries across 6 vocal sound categories for robust evaluation. AudioReranking eng-Latn Yes No
80 VocalSoundPairClassification Recognizing whether two audio clips are the same human vocal expression (laughing, sighing, etc.) AudioPairClassification eng-Latn Yes No
81 VoiceGenderClustering Clustering audio recordings based on gender (male vs female). AudioClustering eng-Latn Yes No
82 VoxCelebClustering Clustering task based on the VoxCeleb dataset for sentiment analysis, clustering by positive/negative sentiment. AudioClustering eng-Latn Yes No
83 VoxCelebSA VoxCeleb dataset augmented for Sentiment Analysis task AudioClassification eng-Latn Yes No
84 VoxLingua107_Top10 Spoken Language Identification for given audio samples (10 classes/languages) AudioClassification eng-Latn No No
85 VoxPopuliAccentClustering Clustering English speech samples by non-native accent from European Parliament recordings. AudioClustering eng-Latn Yes No
86 VoxPopuliAccentID Classification of English speech samples into one of 15 non-native accents from European Parliament recordings. This is a stratified subsampled version of the original VoxPopuli dataset. AudioClassification eng-Latn Yes No
87 VoxPopuliAccentPairClassification Classifying same or different regional accent of English AudioPairClassification eng-Latn No No
88 VoxPopuliGenderClustering Subsampled Dataset for clustering speech samples by speaker gender (male/female) from European Parliament recordings. AudioClustering deu-Latn,eng-Latn,fra-Latn,pol-Latn,spa-Latn No No
89 VoxPopuliGenderID Subsampled Dataset Classification of speech samples by speaker gender (male/female) from European Parliament recordings. AudioClassification deu-Latn,eng-Latn,fra-Latn,pol-Latn,spa-Latn Yes No
90 VoxPopuliLanguageID Subsampled Dataset for classification of speech samples into one of 5 European languages (English, German, French, Spanish, Polish) from European Parliament recordings. AudioClassification deu-Latn,eng-Latn,fra-Latn,pol-Latn,spa-Latn No No
script
import mteb
import pandas as pd

# Build the overview table: every audio-modality task, flagged by MAEB membership.
tasks = mteb.get_tasks(modalities=["audio"])

audio_task_names = [t.metadata.name for t in mteb.get_benchmark("MAEB(audio)")]
audio_text_task_names = [t.metadata.name for t in mteb.get_benchmark("MAEB(audio-text)")]

rows = []
for task in tasks:
    print(task.metadata.name)  # progress output
    rows.append(
        {
            "Task Name": task.metadata.name,
            "Task description": task.metadata.description,
            "Task type": task.metadata.type,
            # eval_langs is either a flat list or a dict of per-subset language lists.
            "Task language(s)": ", ".join(task.metadata.eval_langs)
            if isinstance(task.metadata.eval_langs, list)
            else ", ".join(v[0] for v in task.metadata.eval_langs.values()),
            "In MAEB(audio)": "Yes" if task.metadata.name in audio_task_names else "No",
            "In MAEB(audio-text)": "Yes" if task.metadata.name in audio_text_task_names else "No",
        }
    )

df = pd.DataFrame(rows)
df = df.sort_values(by=["Task Name", "Task type"]).reset_index(drop=True)
df.to_csv("audio_tasks_table.csv", index=False)
df.to_markdown("audio_tasks_table.md")

@Samoed
Member

Samoed commented Jan 6, 2026

We could probably create an English-only version, but I'm not sure it's relevant, because most of the tasks are English-only anyway.

@isaac-chung
Collaborator Author

Where are all the multilingual tasks?

@Samoed
Member

Samoed commented Jan 6, 2026

I think we can create

  1. MAEB(audio)
  2. MAEB(audio-text-multilingual)
  3. MAEB(audio-text-eng)

But this might be complicated for users to understand

@isaac-chung
Collaborator Author

I think we can create

  1. MAEB(audio)
  2. MAEB(audio-text-multilingual)
  3. MAEB(audio-text-eng)

But this might be complicated for users to understand

Why would it be complicated? Seems clear to me

@KennethEnevoldsen
Contributor

Hmm I would maybe do:

  • MAEB: the full MAEB, including audio, audio-text and multilingual
  • MAEB(audio): The audio-only subset of MAEB
  • MAEB(english): The English subset of MAEB

However, I would probably argue we could just make two columns that are audio-only and English and just maintain a single benchmark. WDYT? This simplifies use, the selection, and the paper itself.

PS: We have to fix the language annotations; BirdSet, for example, is not English.

@Samoed
Member

Samoed commented Jan 6, 2026

We have to fix the language annotations; BirdSet, for example, is not English

How should we name it? Just "other", or do you have something specific in mind? We probably also need to change it for the GTZAN (music classification), GunshotTriangulation, MridinghamTonic, and NSynth tasks. Added issue #3872

However, I would probably argue we could just make two columns that are audio-only and English and just maintain a single benchmark. WDYT? This both simplifies use, the selection and the paper itself.

For the leaderboard I agree, but for users I'm not sure, because this can create problems at inference time

@isaac-chung
Collaborator Author

isaac-chung commented Jan 6, 2026

we could just make two columns that are audio-only and English and just maintain a single benchmark.

Sorry it's been a long day, and for some reason I struggle to envision this. What would this look like? Would this need any change to the LB?

Ah, I get it now: only maintain MAEB. Do we bother filtering out similar tasks, or use the entire collection?

MAEB is the full Massive Audio Embedding Benchmark (v1), containing all
tasks with audio modality across 7 task types: classification (35),
clustering (10), pair classification (5), reranking (6), zero-shot
classification (5), audio-to-text retrieval (18), and text-to-audio
retrieval (17).

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <[email protected]>
@Samoed
Member

Samoed commented Jan 6, 2026

I'm a bit afraid that if we use only one benchmark, users who want to evaluate on only part of it, e.g. audio-only, would need to filter the tasks themselves

@isaac-chung
Collaborator Author

What if we have an English list, an audio list, and a "rest of the collection" list, and MAEB is English + audio + "the rest"? We can still have MAEB(eng)v1, MAEB(audio)v1, and MAEBv1?

isaac-chung and others added 5 commits January 7, 2026 00:20
Rename UrbanSound8kZeroshotClassification to UrbanSound8kClassification
in audio_classification module to avoid collision with the identically
named class in audio_zeroshot_classification module.

Both classes had the same Python name but different task names:
- audio_classification: task name "UrbanSound8k"
- audio_zeroshot_classification: task name "UrbanSound8kZeroshot"

The * imports caused the zeroshot version to overwrite the classification
version, leaving only "UrbanSound8kZeroshot" registered in the task
registry and breaking MAEB benchmarks that reference "UrbanSound8k".

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <[email protected]>
The dill/datasets library had a pickle incompatibility with Python 3.14.
Datasets v4+ resolves this issue.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <[email protected]>
🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <[email protected]>
The v0.02 task class was defined but not exported in __init__.py,
causing KeyError when referenced in benchmarks.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <[email protected]>
Renamed classes to match their metadata names so they can be found in the task registry:
- JamAltArtist → JamAltArtistA2ARetrieval
- JamAltLyricsT2A → JamAltLyricT2ARetrieval
- JamAltLyricsA2T → JamAltLyricA2TRetrieval

Also added explicit imports and exports for proper registration.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <[email protected]>
@isaac-chung isaac-chung force-pushed the maeb-task-selection branch 2 times, most recently from 2631fc8 to 411a4ce on January 6, 2026 at 23:21
@AdnanElAssadi56
Contributor

Possible Splits we could have:

  1. By Modality & Language

MAEB (English): The standard "default" benchmark.
Contains: English tasks and zxx (universal) tasks.

MAEB (Multilingual): For multilingual evaluation.
Contains: FLEURS, CommonVoice, VoxPopuli, MInDS-14.

MAEB (Audio-Only): Pure audio tasks (no text encoders required).
Contains: Classification, Clustering, PairClassification, etc.

MAEB (Audio-Text): Cross-modal tasks.
Contains: Retrieval, Zeroshot

  2. By Domain (Specialized)

MAEB (Environmental):
Contains: ESC50, UrbanSound, Gunshot, Vehicle, TUT Acoustic Scenes.

MAEB (Music):
Contains: GTZAN, MusicCaps, BeijingOpera, NSynth.

MAEB (Bioacoustics):
Contains: BirdCLEF, BirdSet.

Note on Language Tags: For the specialized domains (Music, Bio, Env), I suggest we use the language tag zxx (No Linguistic Content) instead of eng-Latn. This clarifies that they are universal and fit into both English and Multilingual suites.

@Samoed
Member

Samoed commented Jan 7, 2026

I don't think we should split between audio-text and multilingual/English

@isaac-chung
Collaborator Author

I don't think we can group a handful of tasks and call them "massive benchmarks". It's likely better to say we cover ABC domains in a single benchmark.

For practicality, modality will be the biggest driver, and I think we should keep the split by modalities.

isaac-chung and others added 3 commits January 7, 2026 11:14
- Export MAEB benchmark from benchmarks/__init__.py
- Add Audio tab after Image tab in benchmark selector with MAEB(audio),
  MAEB(audio-text), and MAEB benchmarks
- Add skip_cache_file option to load results from local cache path
- Configure local maeb-results path for development testing

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <[email protected]>
- Add 5 zero-shot classification tasks to MAEB_AUDIO_TEXT benchmark
- Update description to reflect 34 total tasks
- Add KennethEnevoldsen and Samoed to all MAEB benchmark contacts

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <[email protected]>
isaac-chung and others added 2 commits January 7, 2026 12:13
Zero-shot classification tasks require text modality and are now only
in MAEB(audio-text). MAEB(audio) now has 24 tasks across 4 task types.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <[email protected]>
@isaac-chung
Collaborator Author

I resolved #3877 and removed zeroshot tasks from the audio-only benchmark.

isaac-chung and others added 2 commits January 7, 2026 22:20
Resolved conflict in any_2_any_retrieval/__init__.py, keeping correct class names:
- JamAltArtistA2ARetrieval
- JamAltLyricA2TRetrieval
- JamAltLyricT2ARetrieval

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <[email protected]>
@isaac-chung
Collaborator Author

isaac-chung commented Jan 7, 2026

Just added the MAEB audio extended and lite benchmarks. These have 54 tasks / 38 models and 19 tasks / 44 models, respectively. MAEB audio-text lite has 30 tasks / 10 models.

This is done by finding the largest set of tasks for which the largest number of models have completed eval runs. No filtering is applied. @AdnanElAssadi56 @KennethEnevoldsen @Samoed would love a quick pair of eyes on these. I'd say we can probably start with these and start filling in the relevant paper subsections.


Audio, Extended

[screenshot: maeb-audio-extended-20260107]

Audio, Lite

[screenshot: maeb-audio-lite-20260107]

Audio-Text, Lite

[screenshot: maeb-audio-text-lite-20260107]

isaac-chung and others added 4 commits January 7, 2026 23:42
Replace MAEB(audio) and MAEB(audio-text) with new benchmarks optimized
for maximum model coverage:

- MAEB(audio, lite): 19 tasks, 44 models with complete results
- MAEB(audio, extended): 54 tasks, 38 models with complete results
- MAEB(audio-text, lite): 30 tasks, 10 models with complete results

Tasks selected via a greedy algorithm maximizing the number of models with complete results on all tasks.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <[email protected]>
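A toy sketch of this kind of greedy coverage selection; the availability matrix and the drop-one-task-at-a-time loop are assumptions, and the real script reads completions from the results cache:

```python
# has_result maps each model to the set of tasks it has completed results for.
has_result = {
    "model1": {"TaskA", "TaskB", "TaskC"},
    "model2": {"TaskA", "TaskB"},
    "model3": {"TaskA", "TaskC"},
}
tasks = {"TaskA", "TaskB", "TaskC"}

def complete_models(task_set):
    """Number of models with results on every task in task_set."""
    return sum(1 for done in has_result.values() if task_set <= done)

selection = set(tasks)
history = [(frozenset(selection), complete_models(selection))]
while len(selection) > 1:
    # Greedily drop the task whose removal gains the most fully-covered models.
    best = max(selection, key=lambda t: complete_models(selection - {t}))
    selection.discard(best)
    history.append((frozenset(selection), complete_models(selection)))

# The lite/extended benchmarks correspond to points on this trade-off curve.
for task_set, n_models in history:
    print(len(task_set), "tasks ->", n_models, "models with complete results")
```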
The prerun loop was calling on_benchmark_select() and update_task_list()
which return gr.update() objects, but then passing those objects to
functions expecting raw lists. This caused cache corruption and Gradio
validation errors when switching between benchmarks with different task
types (e.g., from MAEB(audio-text, lite) with Any2AnyRetrieval to
MAEB(audio, lite) without it).

Fix by calling the underlying cached functions directly:
- _cache_on_benchmark_select() instead of on_benchmark_select()
- _cache_update_task_list() instead of update_task_list()

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <[email protected]>
Leaderboard fixes:
- Cancel pending filter events when benchmark changes to prevent
  race conditions with stale values
- Make _update_description derive counts from benchmark tasks directly
  instead of filter selections to avoid validation errors

Benchmark changes:
- Remove AudioCapsMiniReranking from MAEB, MAEB(audio, lite), and
  MAEB(audio, extended)
- Update task counts in descriptions (96→95, 19→18, 54→53)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <[email protected]>
@Samoed
Member

Samoed commented Jan 8, 2026

Look great!

isaac-chung and others added 3 commits January 8, 2026 11:15
- Use MAEB(audio, lite) and MAEB(audio-text, lite) benchmarks
- Table 1: Classification, PairClassification, Reranking, Clustering
- Table 2: Retrieval, ZeroshotClassification
- Make table functions accept task_names and benchmark_name parameters

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <[email protected]>
Add alias mapping for task types that lose digits during column name
processing (e.g., Any2AnyRetrieval -> AnyAnyRetrieval). Also add more
audio models to annotation list.

Co-Authored-By: Claude Opus 4.5 <[email protected]>
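A guess at what such an alias mapping might look like (names and the digit-stripping step are hypothetical):

```python
import re

# Column-name processing strips digits, so map the mangled form back to the
# canonical task type.
TASK_TYPE_ALIASES = {"AnyAnyRetrieval": "Any2AnyRetrieval"}

def canonical_task_type(column_name: str) -> str:
    stripped = re.sub(r"\d+", "", column_name)
    return TASK_TYPE_ALIASES.get(stripped, stripped)

print(canonical_task_type("Any2AnyRetrieval"))  # -> Any2AnyRetrieval
```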
Marimo notebook for analyzing evaluation times across MAEB benchmarks.
Loads model metadata and task results to compare eval times between
large and small models for audio and audio-text benchmarks.

Co-Authored-By: Claude Opus 4.5 <[email protected]>
@AdnanElAssadi56
Contributor

Great work, @isaac-chung!

When we say "audio-text, lite" here, are we implying to readers that an extended version exists?

@isaac-chung
Collaborator Author

Great work, @isaac-chung!

When we say "audio-text, lite" here, are we implying to readers that an extended version exists?

It's the most complete collection based on what we have run. We're missing results for an extended version.

isaac-chung and others added 2 commits January 9, 2026 12:03
Resolve conflicts in pyproject.toml and uv.lock, taking maeb's
version for speechbrain dependency constraint.

Co-Authored-By: Claude Opus 4.5 <[email protected]>
- Add MAEB(audio-text, extended) benchmark with 36 tasks:
  - All 30 tasks from lite version
  - Clotho A2T/T2A for audio captioning
  - Fleurs A2T/T2A (102 languages)
  - CommonVoice 21 A2T/T2A (82+ languages)

- Refine MAEB(audio-text, lite) to 17 tasks:
  - Remove redundant A2T tasks that have T2A equivalents
  - Remove SpeechCommandsZeroshotv0.01 (keep only v0.02)
  - Keep 13 T2A retrieval + 4 zero-shot classification

- Add MAEB(audio-text, extended) to benchmark selector

Co-Authored-By: Claude Opus 4.5 <[email protected]>
@isaac-chung
Collaborator Author

Added MAEB(audio-text, extended) as well. Removed same-task A2T variants from MAEB(audio-text, lite).

isaac-chung and others added 2 commits January 9, 2026 13:36
New utility script that calculates total evaluation times for specified
benchmarks and models. Features:
- Takes --benchmarks and --models as required arguments
- Optional --results-dir for custom cache location
- Outputs formatted table with task coverage and times per benchmark
- Shows totals per model

Usage:
  python scripts/calculate_eval_times.py \
    -b "MAEB(audio-text, lite)" "MAEB(audio-text, extended)" \
    -m "OpenMuQ/MuQ-MuLan-large" "laion/clap-htsat-unfused" \
    -r /path/to/results

Co-Authored-By: Claude Opus 4.5 <[email protected]>
Computes Spearman and Pearson correlations between MAEB lite and extended
benchmark variants to validate that lite benchmarks preserve model rankings.
Outputs correlation values and scatter plots (PNG and PDF).

Co-Authored-By: Claude Opus 4.5 <[email protected]>
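The core of such a validation is small; a sketch with toy per-model mean scores (the real script aggregates scores from the results cache):

```python
from scipy.stats import pearsonr, spearmanr

# Toy mean scores per model on the two benchmark variants, same model order.
lite = [0.71, 0.64, 0.58, 0.52, 0.40]
extended = [0.69, 0.66, 0.55, 0.50, 0.43]

# Spearman checks rank agreement; Pearson checks linear score agreement.
rho, rho_p = spearmanr(lite, extended)
r, r_p = pearsonr(lite, extended)
print(f"Spearman rho={rho:.3f} (p={rho_p:.3g}), Pearson r={r:.3f} (p={r_p:.3g})")
```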