[WIP] MAEB task selection #3867
base: maeb
Conversation
Implements new task selection approach using correlation analysis and clustering for MAEB evaluation. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Sonnet 4 <[email protected]>
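The notebook itself isn't rendered in this thread, so for reviewers, here is a minimal sketch of the correlation step it describes. The `scores` DataFrame (models × tasks, one main score per task) and the choice of Spearman are assumptions, not necessarily what the notebook does.

```python
# Hedged sketch of correlation-based pair detection (not the notebook's code).
# Assumes `scores` is a DataFrame of shape (n_models, n_tasks) of main scores.
import pandas as pd


def correlated_pairs(scores: pd.DataFrame, threshold: float = 0.9) -> list[tuple[str, str, float]]:
    """Return task pairs whose score vectors correlate above `threshold`."""
    corr = scores.corr(method="spearman")  # task-by-task correlation across models
    pairs = []
    for i, a in enumerate(corr.columns):
        for b in corr.columns[i + 1:]:
            c = float(corr.loc[a, b])
            if c >= threshold:
                pairs.append((a, b, c))
    return sorted(pairs, key=lambda p: -p[2])
```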
- Add domain, category, and language checks to is_candidate_valid_removal to preserve at least one task from each unique domain, category, and language - Add top 5 longest tasks display for CLAP model reference timing - Add diagnostic cell for tasks with many negative correlations - Expand correlation thresholds to include 0.8 and 0.9 - Add Languages, Domains, Categories columns to summary table - Comment out license filtering to include all tasks - Handle empty model coverage gracefully with fallback logic 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <[email protected]>
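A rough sketch of the validity check this commit describes, assuming each task object exposes `domains`, `category`, and `languages` attributes (the attribute names are guesses for illustration, not mteb's actual metadata fields):

```python
# Hypothetical sketch of is_candidate_valid_removal; attribute names are assumptions.
def is_candidate_valid_removal(candidate, remaining) -> bool:
    """Reject removal if the candidate is the last remaining task covering any of
    its domains, categories, or languages."""
    others = [t for t in remaining if t is not candidate]
    for attr in ("domains", "category", "languages"):
        values = getattr(candidate, attr)
        cand_vals = set(values) if isinstance(values, (list, set, tuple)) else {values}
        other_vals = set()
        for t in others:
            values = getattr(t, attr)
            other_vals |= set(values) if isinstance(values, (list, set, tuple)) else {values}
        if not cand_vals <= other_vals:  # the candidate uniquely covers something
            return False
    return True
```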
…ased tasks_to_keep - Move UMAP+HDBSCAN clustering right after initial correlation matrix - Define tasks_to_keep from outlier cluster (label -1) instead of empty list - Split function definitions to break circular dependency - Add domain counts cell after results DataFrame - Add model coverage distribution analysis (models at each task count) - Use models with >= 50 tasks for runtime estimation - Show task coverage in runtime output (N/M tasks with eval times) 🤖 Generated with [Claude Code](https://claude.ai/claude-code) Co-Authored-By: Claude <[email protected]>
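For context, a minimal sketch of what the UMAP+HDBSCAN step could look like, keeping the outlier cluster (label -1); the parameters are illustrative assumptions, not the notebook's actual settings.

```python
# Rough sketch of the clustering step (parameters are assumptions).
import hdbscan
import pandas as pd
import umap


def outlier_tasks(corr: pd.DataFrame) -> list[str]:
    """Embed each task's correlation profile with UMAP, cluster with HDBSCAN,
    and keep the outliers (label -1): tasks that behave unlike any cluster."""
    embedding = umap.UMAP(n_components=2, random_state=42).fit_transform(corr.values)
    labels = hdbscan.HDBSCAN(min_cluster_size=3).fit_predict(embedding)
    return [task for task, label in zip(corr.columns, labels) if label == -1]
```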
- Add get_pairs_above_threshold helper to get all correlated pairs - Track skipped_pairs where neither task can be removed - Continue to next pair when current pair is protected - Clear skipped_pairs when task set changes after removal - Only stop when all pairs above threshold have been tried 🤖 Generated with [Claude Code](https://claude.ai/claude-code) Co-Authored-By: Claude <[email protected]>
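Putting the pieces together, the loop described here could look roughly like the sketch below, reusing the hypothetical `correlated_pairs` and `is_candidate_valid_removal` helpers sketched above (the `tasks` mapping and the removal preference are assumptions).

```python
# Hedged sketch of the removal loop; `tasks` is assumed to map name -> task object.
def prune_tasks(scores, tasks, threshold=0.9):
    remaining = dict(tasks)
    skipped = set()  # pairs above threshold where neither side may be removed
    while True:
        pairs = correlated_pairs(scores[list(remaining)], threshold)
        pairs = [(a, b) for a, b, _ in pairs if frozenset((a, b)) not in skipped]
        if not pairs:  # every pair above threshold has been tried
            break
        a, b = pairs[0]
        removed = None
        for name in (b, a):  # arbitrarily prefer dropping the second task of the pair
            if is_candidate_valid_removal(remaining[name], remaining.values()):
                removed = remaining.pop(name)
                break
        if removed is None:
            skipped.add(frozenset((a, b)))  # protected pair: move on to the next one
        else:
            skipped.clear()  # task set changed, so re-check previously skipped pairs
    return list(remaining.values())
```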
Visualizes results_df with: - Blue gradient colormap (light to dark) - White background for NaN values - Adaptive text color (white for high scores, black for low) - Dynamic figure sizing based on data dimensions 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <[email protected]>
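A small matplotlib sketch in the same spirit (the shape and 0-1 value range of `results_df` are assumptions):

```python
# Hedged sketch of the heatmap described above; not the notebook's actual cell.
import matplotlib
import matplotlib.pyplot as plt
import numpy as np


def plot_results(results_df):
    data = results_df.to_numpy(dtype=float)
    n_rows, n_cols = data.shape
    fig, ax = plt.subplots(figsize=(0.6 * n_cols + 2, 0.4 * n_rows + 2))  # dynamic sizing
    cmap = matplotlib.colormaps["Blues"].copy()
    cmap.set_bad("white")  # NaN cells stay white
    ax.imshow(np.ma.masked_invalid(data), cmap=cmap, vmin=0, vmax=1, aspect="auto")
    for i in range(n_rows):
        for j in range(n_cols):
            if not np.isnan(data[i, j]):
                color = "white" if data[i, j] > 0.6 else "black"  # adaptive text color
                ax.text(j, i, f"{data[i, j]:.2f}", ha="center", va="center", color=color)
    ax.set_xticks(range(n_cols), results_df.columns, rotation=90)
    ax.set_yticks(range(n_rows), results_df.index)
    fig.tight_layout()
    return fig
```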
- Add MAEB(audio-text) benchmark with 17 cross-modal retrieval tasks (8 audio-to-text, 9 text-to-audio) selected via correlation threshold 0.95 - Inline task lists directly in MAEB benchmark objects - Add threshold 0.95 to task selection notebook - Convert comparison plot from 1x5 to 2x3 layout for 6 thresholds - Fix tasks_to_select_from to use modality-filtered tasks - Use models with complete eval times for runtime estimation 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <[email protected]>
- Expand MAEB(audio-text) benchmark from 17 to 29 tasks (14 A2T + 15 T2A) - Fix msclap model revision from "N/A" to "no_revision" to match results cache - Update benchmark contacts 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <[email protected]>
Script generates top 10 model rankings for MAEB(audio) and MAEB(audio-text) benchmarks using Borda count, with per-category averages. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <[email protected]>
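For reference, Borda count over a (models × tasks) score table can be computed roughly like this (the `scores` frame and the tie handling are assumptions; the actual script may differ):

```python
# Minimal Borda-count sketch, not the repo's script.
import pandas as pd


def borda_ranking(scores: pd.DataFrame) -> pd.Series:
    """Each task ranks the models; a model earns (n_models - rank) points per task,
    and models are ordered by total points across tasks."""
    n_models = len(scores)
    ranks = scores.rank(axis=0, ascending=False, method="average")  # rank 1 = best
    points = (n_models - ranks).sum(axis=1)
    return points.sort_values(ascending=False)
```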
I generally like marimo, but damn this is not the easiest thing to review. This is one of the cases where you really need the results to know what is filtered and why (having to git pull and run it to see seems like a big drawback). Is it possible to convert it to an .ipynb or .md for the results?
Yeah, I can export a PDF or HTML or something?
Created an overview table of the tasks and where they're used. There's also a version for Google Sheets: https://docs.google.com/spreadsheets/d/1wyTvW0q6TIat7RMmfimlNKXri9O7cs_S0uebGTNya0c/edit?usp=sharing

Script used to generate the table:
```python
import mteb
import pandas as pd

tasks = mteb.get_tasks(modalities=["audio"])
audio_tasks_names = [t.metadata.name for t in mteb.get_benchmark("MAEB(audio)")]
audio_text_tasks_names = [t.metadata.name for t in mteb.get_benchmark("MAEB(audio-text)")]

row = []
for task in tasks:
    print(task.metadata.name)
    in_audio = task.metadata.name in audio_tasks_names
    in_audio_text = task.metadata.name in audio_text_tasks_names
    row.append(
        {
            "Task Name": task.metadata.name,
            "Task description": task.metadata.description,
            "Task type": task.metadata.type,
            "Task language(s)": ", ".join(task.metadata.eval_langs)
            if isinstance(task.metadata.eval_langs, list)
            else ", ".join(v[0] for v in task.metadata.eval_langs.values()),
            "In MAEB(audio)": "Yes" if in_audio else "No",
            "In MAEB(audio-text)": "Yes" if in_audio_text else "No",
        }
    )

df = pd.DataFrame(row)
df = df.sort_values(by=["Task Name", "Task type"]).reset_index(drop=True)
df.to_csv("audio_tasks_table.csv", index=False)
df.to_markdown("audio_tasks_table.md")
```
We could probably create an English-only version, but I'm not sure it's relevant, because most of the tasks are English-only.
Where are all the multilingual tasks?
I think we can create
But this might be complicated for users to understand.
Why would it be complicated? Seems clear to me.
Hmm I would maybe do:
However, I would probably argue we could just make two columns that are
PS: We have to fix the language annotations; birdset, for example, is not English.
How should we name it? Just
For the leaderboard I agree, but for users I'm not sure, because this can create problems at inference.
Ah, I get it now: only maintain MAEB. Do we bother filtering out similar tasks, or use the entire collection?
MAEB is the full Massive Audio Embedding Benchmark (v1), containing all tasks with audio modality across 7 task types: classification (35), clustering (10), pair classification (5), reranking (6), zero-shot classification (5), audio-to-text retrieval (18), and text-to-audio retrieval (17). 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <[email protected]>
I'm a bit afraid that if we use only one benchmark, users may want to evaluate on only part of it, e.g. audio only. They would then need to filter tasks themselves.
What if we have an English list, an audio list, and a "the rest of the collection" list, and MAEB is English + audio + "the rest"? We can still have MAEB(eng)v1, MAEB(audio)v1, and MAEBv1?
Rename UrbanSound8kZeroshotClassification to UrbanSound8kClassification in audio_classification module to avoid collision with the identically named class in audio_zeroshot_classification module. Both classes had the same Python name but different task names: - audio_classification: task name "UrbanSound8k" - audio_zeroshot_classification: task name "UrbanSound8kZeroshot" The * imports caused the zeroshot version to overwrite the classification version, leaving only "UrbanSound8kZeroshot" registered in the task registry and breaking MAEB benchmarks that reference "UrbanSound8k". 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <[email protected]>
The dill/datasets library had a pickle incompatibility with Python 3.14. Datasets v4+ resolves this issue. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <[email protected]>
🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <[email protected]>
The v0.02 task class was defined but not exported in __init__.py, causing KeyError when referenced in benchmarks. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <[email protected]>
Renamed classes to match their metadata names so they can be found in the task registry: - JamAltArtist → JamAltArtistA2ARetrieval - JamAltLyricsT2A → JamAltLyricT2ARetrieval - JamAltLyricsA2T → JamAltLyricA2TRetrieval Also added explicit imports and exports for proper registration. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <[email protected]>
Force-pushed from 2631fc8 to 411a4ce.
This reverts commit b244226.
This reverts commit 3147c20.
Possible splits we could have:

- MAEB (English): the standard "default" benchmark.
- MAEB (Multilingual): for multilingual evaluation.
- MAEB (Audio-Only): pure audio tasks (no text encoders required).
- MAEB (Audio-Text): cross-modal tasks.
- MAEB (Environmental)
- MAEB (Music)
- MAEB (Bioacoustics)

Note on language tags: for the specialized domains (Music, Bio, Env), I suggest we use the language tag zxx (No Linguistic Content) instead of eng-Latn. This clarifies that they are universal and fit into both the English and Multilingual suites.
I don't think we should split between audio-text and multilingual/English.
I don't think we can group a handful of tasks and call them "massive benchmarks". It's likely better to say we cover ABC domains in a single benchmark. For practicality, modality will be the biggest driver, and I think we should keep the split by modalities.
- Export MAEB benchmark from benchmarks/__init__.py - Add Audio tab after Image tab in benchmark selector with MAEB(audio), MAEB(audio-text), and MAEB benchmarks - Add skip_cache_file option to load results from local cache path - Configure local maeb-results path for development testing 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <[email protected]>
# Conflicts: # mteb/leaderboard/app.py
- Add 5 zero-shot classification tasks to MAEB_AUDIO_TEXT benchmark - Update description to reflect 34 total tasks - Add KennethEnevoldsen and Samoed to all MAEB benchmark contacts 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <[email protected]>
Zero-shot classification tasks require text modality and are now only in MAEB(audio-text). MAEB(audio) now has 24 tasks across 4 task types. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <[email protected]>
I resolved #3877 and removed zeroshot tasks from the audio-only benchmark.
Resolved conflict in any_2_any_retrieval/__init__.py, keeping correct class names: - JamAltArtistA2ARetrieval - JamAltLyricA2TRetrieval - JamAltLyricT2ARetrieval 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <[email protected]>
Just added the MAEB audio extended and lite benchmarks. These have 54 tasks / 38 models and 19 tasks / 44 models respectively. MAEB audio-text lite has 30 tasks and 10 models. This is done by finding the set of tasks that gives the most models with completed eval runs on all of them; no other filtering is applied. @AdnanElAssadi56 @KennethEnevoldsen @Samoed would love a quick pair of eyes on these. I'd say we can probably start with these and start filling in the relevant paper subsections.

- Audio, Extended
- Audio, Lite
- Audio-Text, Lite
Replace MAEB(audio) and MAEB(audio-text) with new benchmarks optimized for maximum model coverage: - MAEB(audio, lite): 19 tasks, 44 models with complete results - MAEB(audio, extended): 54 tasks, 38 models with complete results - MAEB(audio-text, lite): 30 tasks, 10 models with complete results Tasks selected via greedy algorithm maximizing models with all tasks. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <[email protected]>
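One way to implement the greedy selection mentioned here is sketched below; `has_result` is an assumed boolean (models × tasks) DataFrame marking which eval results exist, and the notebook's actual heuristic may differ.

```python
# Hedged sketch of a greedy coverage-maximizing task selection.
import pandas as pd


def greedy_task_selection(has_result: pd.DataFrame, n_tasks: int) -> list[str]:
    selected: list[str] = []
    remaining = list(has_result.columns)
    for _ in range(n_tasks):
        def coverage(task: str) -> int:
            # models that have results for every already-selected task plus this one
            return int(has_result[selected + [task]].all(axis=1).sum())
        best = max(remaining, key=coverage)
        selected.append(best)
        remaining.remove(best)
    return selected
```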
The prerun loop was calling on_benchmark_select() and update_task_list() which return gr.update() objects, but then passing those objects to functions expecting raw lists. This caused cache corruption and Gradio validation errors when switching between benchmarks with different task types (e.g., from MAEB(audio-text, lite) with Any2AnyRetrieval to MAEB(audio, lite) without it). Fix by calling the underlying cached functions directly: - _cache_on_benchmark_select() instead of on_benchmark_select() - _cache_update_task_list() instead of update_task_list() 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <[email protected]>
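For readers unfamiliar with the pattern, a simplified illustration of why the raw cached value and the event-handler return value must not be mixed (the function names below are stand-ins, not the leaderboard's real ones):

```python
# Simplified stand-in for the pattern in the fix; not the leaderboard's code.
import gradio as gr


def _cache_update_task_list(benchmark: str) -> list[str]:
    # Returns plain data that other code can consume directly.
    return ["Task A", "Task B"]


def update_task_list(benchmark: str):
    # UI event handler: wraps the list in a gr.update(), which is only meaningful
    # to Gradio as a component update, not as a list of tasks.
    return gr.update(choices=_cache_update_task_list(benchmark))


# A warm-up/prerun loop that needs the data itself should call the cached
# function directly instead of the handler.
tasks = _cache_update_task_list("MAEB(audio, lite)")
```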
Leaderboard fixes: - Cancel pending filter events when benchmark changes to prevent race conditions with stale values - Make _update_description derive counts from benchmark tasks directly instead of filter selections to avoid validation errors Benchmark changes: - Remove AudioCapsMiniReranking from MAEB, MAEB(audio, lite), and MAEB(audio, extended) - Update task counts in descriptions (96→95, 19→18, 54→53) 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <[email protected]>
Looks great!
- Use MAEB(audio, lite) and MAEB(audio-text, lite) benchmarks - Table 1: Classification, PairClassification, Reranking, Clustering - Table 2: Retrieval, ZeroshotClassification - Make table functions accept task_names and benchmark_name parameters 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <[email protected]>
Add alias mapping for task types that lose digits during column name processing (e.g., Any2AnyRetrieval -> AnyAnyRetrieval). Also add more audio models to annotation list. Co-Authored-By: Claude Opus 4.5 <[email protected]>
Marimo notebook for analyzing evaluation times across MAEB benchmarks. Loads model metadata and task results to compare eval times between large and small models for audio and audio-text benchmarks. Co-Authored-By: Claude Opus 4.5 <[email protected]>
Great work, @isaac-chung! When we say "audio-text, lite" here, are we implying an extended version to the readers?
It's the most complete collection based on what we have run. We're missing results for an extended version.
Resolve conflicts in pyproject.toml and uv.lock, taking maeb's version for speechbrain dependency constraint. Co-Authored-By: Claude Opus 4.5 <[email protected]>
- Add MAEB(audio-text, extended) benchmark with 36 tasks: - All 30 tasks from lite version - Clotho A2T/T2A for audio captioning - Fleurs A2T/T2A (102 languages) - CommonVoice 21 A2T/T2A (82+ languages) - Refine MAEB(audio-text, lite) to 17 tasks: - Remove redundant A2T tasks that have T2A equivalents - Remove SpeechCommandsZeroshotv0.01 (keep only v0.02) - Keep 13 T2A retrieval + 4 zero-shot classification - Add MAEB(audio-text, extended) to benchmark selector Co-Authored-By: Claude Opus 4.5 <[email protected]>
Added.
New utility script that calculates total evaluation times for specified
benchmarks and models. Features:
- Takes --benchmarks and --models as required arguments
- Optional --results-dir for custom cache location
- Outputs formatted table with task coverage and times per benchmark
- Shows totals per model
Usage:

```bash
python scripts/calculate_eval_times.py \
  -b "MAEB(audio-text, lite)" "MAEB(audio-text, extended)" \
  -m "OpenMuQ/MuQ-MuLan-large" "laion/clap-htsat-unfused" \
  -r /path/to/results
```
Co-Authored-By: Claude Opus 4.5 <[email protected]>
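A minimal argparse skeleton matching the interface described above; the actual script in scripts/calculate_eval_times.py may differ in its details.

```python
# Hedged sketch of the CLI surface, not the script's actual implementation.
import argparse


def parse_args() -> argparse.Namespace:
    parser = argparse.ArgumentParser(description="Total eval times per benchmark and model.")
    parser.add_argument("-b", "--benchmarks", nargs="+", required=True,
                        help='Benchmark names, e.g. "MAEB(audio-text, lite)"')
    parser.add_argument("-m", "--models", nargs="+", required=True,
                        help='Model names, e.g. "laion/clap-htsat-unfused"')
    parser.add_argument("-r", "--results-dir", default=None,
                        help="Optional custom results cache location")
    return parser.parse_args()
```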
Computes Spearman and Pearson correlations between MAEB lite and extended benchmark variants to validate that lite benchmarks preserve model rankings. Outputs correlation values and scatter plots (PNG and PDF). Co-Authored-By: Claude Opus 4.5 <[email protected]>
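The core of such a check is small; a sketch using scipy, assuming the per-model mean scores on the lite and extended variants are already aligned as two sequences:

```python
# Hedged sketch of the rank-agreement check, not the repo's script.
from scipy.stats import pearsonr, spearmanr


def ranking_agreement(lite_scores, extended_scores) -> dict[str, float]:
    """How well does the lite benchmark preserve the extended ranking?"""
    rho, rho_p = spearmanr(lite_scores, extended_scores)
    r, r_p = pearsonr(lite_scores, extended_scores)
    return {"spearman": rho, "spearman_p": rho_p, "pearson": r, "pearson_p": r_p}
```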



See the draft benchmarks. (For audio-text I actually use the full collection, no filtering) You'll also find the filtering notebook and the script to generate "Table 1".
@KennethEnevoldsen @AdnanElAssadi56 maybe another one for environmental or something?