[MAEB] merge from main again #3873
Conversation
…benchmark table (#3771)
* Update BenchmarkResults to output results of benchmark
* added score column and correct TYPE_CHECKING
* address comments
* address comments
* fix import
* fix tests
* fix tests
* change BenchmarkResults to Pydantic dataclass
* change benchmark to pydantic dataclass
* fix tests
* fix model
* fix
* lint
* remove future
* fix after review
* add test
* reapply comments from review
* remove mock benchmark
* add documentation
* added actual results
* Update docs/usage/loading_results.md
Co-authored-by: Kenneth Enevoldsen <[email protected]>
* add actual results
---------
Co-authored-by: Roman Solomatin <[email protected]>
Co-authored-by: Roman Solomatin <[email protected]>
Co-authored-by: Kenneth Enevoldsen <[email protected]>
fix clustering processing
…3787)
* docs: update MIEB contributing guide for MTEB v2 AbsTask structure
* Update docs/mieb/readme.md
* Update docs/mieb/readme.md
* model: add octen_models
* add issue link for document prompt
* feat: add detailed timing logs to leaderboard initialization
Add comprehensive timing information to track performance of each step in the leaderboard building process:
- Loading benchmark results (from cache or remote)
- Fetching and processing benchmarks
- Filtering models and generating tables
- Creating Gradio components and interface
- Prerun phase for cache population
Each step logs start and completion times with elapsed duration to help identify performance bottlenecks during leaderboard initialization.
* perf: optimize benchmark processing with caching and vectorized operations
Implemented 3 high-impact optimizations to reduce benchmark processing time:
1. Cache get_model_metas() calls using @functools.lru_cache
- Eliminates 59 redundant calls (once per benchmark)
- Now called once and cached for all benchmarks
2. Replace pandas groupby().apply() with vectorized operations
- Replaced deprecated .apply(keep_best) pattern
- Uses sort_values() + groupby().first() instead
- Avoids nested function calls per group
3. Cache version string parsing with @functools.lru_cache
- Eliminates redundant parsing of same version strings
- Uses LRU cache with 10,000 entry limit
Performance improvements:
- Benchmark processing: 131.17s → 44.73s (2.93x faster, 66% reduction)
- join_revisions(): 84.96s → 1.73s (49x faster, 98% reduction)
- Leaderboard Step 3: 121.28s → 48.23s (2.51x faster, 60% reduction)
This significantly improves leaderboard startup time by reducing the benchmark processing bottleneck.
* Update mteb/leaderboard/app.py
Co-authored-by: Copilot <[email protected]>
* fix: ensure deterministic revision grouping in join_revisions()
- Replace groupby(revision_clean) with groupby(revision)
- Remove non-deterministic iloc[0] access for revision selection
- Tasks with different original revisions (None vs external) now kept separate
- Each ModelResult has consistent revision across all its task_results
This resolves the issue where tasks with different original revisions that mapped to the same cleaned value would be grouped together non-deterministically.
* refactor: use default lru_cache maxsize for _get_cached_model_metas
* refactor: remove optimization markers from comments
* Apply suggestion from @isaac-chung
---------
Co-authored-by: Copilot <[email protected]>
* feat: add detailed timing logs to leaderboard initialization
Add comprehensive timing information to track performance of each step in the leaderboard building process:
- Loading benchmark results (from cache or remote)
- Fetching and processing benchmarks
- Filtering models and generating tables
- Creating Gradio components and interface
- Prerun phase for cache population
Each step logs start and completion times with elapsed duration to help identify performance bottlenecks during leaderboard initialization.
* perf: optimize benchmark processing with caching and vectorized operations
Implemented 3 high-impact optimizations to reduce benchmark processing time:
1. Cache get_model_metas() calls using @functools.lru_cache
- Eliminates 59 redundant calls (once per benchmark)
- Now called once and cached for all benchmarks
2. Replace pandas groupby().apply() with vectorized operations
- Replaced deprecated .apply(keep_best) pattern
- Uses sort_values() + groupby().first() instead
- Avoids nested function calls per group
3. Cache version string parsing with @functools.lru_cache
- Eliminates redundant parsing of same version strings
- Uses LRU cache with 10,000 entry limit
Performance improvements:
- Benchmark processing: 131.17s → 44.73s (2.93x faster, 66% reduction)
- join_revisions(): 84.96s → 1.73s (49x faster, 98% reduction)
- Leaderboard Step 3: 121.28s → 48.23s (2.51x faster, 60% reduction)
This significantly improves leaderboard startup time by reducing the benchmark processing bottleneck.
* perf: optimize validate_and_filter_scores filtering logic
* Update mteb/results/task_result.py
Co-authored-by: Copilot <[email protected]>
---------
Co-authored-by: Copilot <[email protected]>
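The caching and groupby optimizations above are generic patterns. A minimal sketch of what they might look like — the column names (`model_name`, `score`) and the `fetch_all_model_names` helper are illustrative assumptions, not mteb's exact code:

```python
from functools import lru_cache

import pandas as pd


def fetch_all_model_names() -> list[str]:
    # Placeholder for the expensive lookup (the real code calls get_model_metas()).
    return ["model-a", "model-b"]


@lru_cache  # default maxsize; the lookup now runs once and is reused for every benchmark
def _get_cached_model_names() -> tuple[str, ...]:
    return tuple(fetch_all_model_names())


def keep_best_per_model(df: pd.DataFrame) -> pd.DataFrame:
    # Vectorized replacement for groupby().apply(keep_best): sort so the preferred
    # row comes first, then keep each group's first entry (its best-scoring row,
    # assuming no NaNs in the grouped columns).
    return (
        df.sort_values("score", ascending=False)
        .groupby("model_name", as_index=False)
        .first()
    )
```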
* Add model_type in model_meta for all models
* added literal for model_type
* update jina embedding model type
* Added model_type to from_cross_encoder() method
* update test
* change location in model_meta to pass test
* update late_interaction model and fix test
* update late_interaction for colnomic models
* update test
* Update mteb/models/model_meta.py
Co-authored-by: Roman Solomatin <[email protected]>
* fix naming
* remove is_cross_encoder field and convert it into property
---------
Co-authored-by: Roman Solomatin <[email protected]>
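A rough sketch of the shape of this change — the field name comes from the commit titles, but the literal values and the pydantic base class are assumptions rather than mteb's exact definition:

```python
from typing import Literal

from pydantic import BaseModel

# Illustrative values only; the real ModelMeta defines its own set.
ModelType = Literal["dense", "cross-encoder", "late-interaction"]


class ModelMetaSketch(BaseModel):
    """Minimal stand-in for ModelMeta with just the fields relevant here."""

    name: str
    model_type: ModelType = "dense"

    @property
    def is_cross_encoder(self) -> bool:
        # The boolean field was dropped; it is now derived from model_type.
        return self.model_type == "cross-encoder"
```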
* Added warnings.warn when logging warnings
* address comments
* Added deprecation warning
* made better
* address comments
* address comments
* address comments
---------
Co-authored-by: Roman Solomatin <[email protected]>
Co-authored-by: Kenneth Enevoldsen <[email protected]>
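The pattern here is roughly the following (a sketch, not the exact mteb code): emit the message through both the logger and the warnings machinery, so that it reaches log readers as well as code that filters warnings programmatically.

```python
import logging
import warnings

logger = logging.getLogger(__name__)


def load_legacy_thing(name: str) -> str:
    # Hypothetical function illustrating the pattern: log for log readers,
    # warn with DeprecationWarning for callers that filter warnings.
    msg = f"'{name}' is deprecated and will be removed in a future release."
    logger.warning(msg)
    warnings.warn(msg, DeprecationWarning, stacklevel=2)
    return name
```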
* save kwargs passed to get_model in model_meta
* add save_kwargs to load_model
* removed copy of meta
* Update mteb/models/model_meta.py
* try to run with kwargs
* try to move kwargs
* add tests
* change model in tests
---------
Co-authored-by: Roman Solomatin <[email protected]>
Co-authored-by: Roman Solomatin <[email protected]>
* add pytyped
* start typing
* finish evaluators
* add more types
* Update mteb/results/benchmark_results.py
Co-authored-by: Kenneth Enevoldsen <[email protected]>
* apply comments
* continue typechecking
* fix typehint
* typechecking
* fix tests
* fix type errors again
* fix cache
* add more types
* fix method
* roll back pyproject
* activate PGH
* install more types
* almost finish
* fix search wrappers
* add ci
* fix tests
* fix 3.10 types
* rollback overload
* fixes after merge
* change to iterable
* add fixes
* remove summarization scores hint
* simplify deprecated_evaluator
* simplify model conversion
* add comment for typechecking
* remove casts
* remove duplicated function
---------
Co-authored-by: Kenneth Enevoldsen <[email protected]>
* add benchmark aliases
* split to aliases
* move aliases
* create aliases in separate function
* simplify a bit
* add test
* Apply suggestions from code review
Co-authored-by: Kenneth Enevoldsen <[email protected]>
* add default value
* add MTEB alias
---------
Co-authored-by: Kenneth Enevoldsen <[email protected]>
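Conceptually an alias is just a mapping from an alternative name to a canonical benchmark name. A hedged sketch — the mapping below is illustrative, not the real alias table or lookup function:

```python
# Illustrative alias mapping; the real table lives in mteb's benchmark registry.
BENCHMARK_ALIASES: dict[str, str] = {
    "MTEB": "MTEB(eng, v2)",
}


def resolve_benchmark_name(name: str) -> str:
    """Return the canonical benchmark name, falling back to the input unchanged."""
    return BENCHMARK_ALIASES.get(name, name)
```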
* create function for creating mock tasks
* add annotations
* docs: add benchmark filtering examples
* Apply suggestion from @Samoed
Co-authored-by: Roman Solomatin <[email protected]>
* docs: remove custom benchmarks subsection
* docs: expand filtering section with content tabs
* docs: fix code block indentation in content tabs
* build: include docs deps in dev group
---------
Co-authored-by: Roman Solomatin <[email protected]>
* update generate_model_card with get_benchmark_result()
* add support for list of benchmarks
* split parameters
* fix type
* generate card
* add tests
* add tests
* add tabulate to test dependencies
* correct tests
---------
Co-authored-by: Roman Solomatin <[email protected]>
* update reference website of Seed1.6-embedding-1215
* update Bytedance/Seed1.6-embedding-1215 model
* fix repo exists check
* add test
* feat: add leaderboard CLI command with cache-path option
* test: add comprehensive tests for leaderboard CLI command
* try to fix install
* fix: lazy-load leaderboard to avoid requiring deps for CLI
* Update mteb/cli/build_cli.py
Co-authored-by: Roman Solomatin <[email protected]>
* make lint
* remove AGENTS.md
* move import to top of file
* log the default cache path
* Improve leaderboard tests to verify actual cache paths
Address PR feedback by modifying leaderboard tests to verify the actual cache paths passed to get_leaderboard_app instead of mocking ResultCache.
- Updated test_leaderboard_custom_cache_path to create real ResultCache instances and verify the correct custom cache path is used
- Updated test_leaderboard_default_cache to verify the default cache path is used
- Removed ResultCache mocking in favor of testing actual cache behavior
- Used patch.dict to mock the leaderboard module import while preserving real cache functionality
This provides better test coverage by validating that the cache objects passed to the leaderboard app have the correct paths, as suggested in PR comment: #3802 (comment)
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Sonnet 4 <[email protected]>
* Combine leaderboard cache tests using pytest parametrize
Address PR feedback by combining test_leaderboard_custom_cache_path and test_leaderboard_default_cache into a single parametrized test.
- Created test_leaderboard_cache_paths with parametrize decorator
- Tests both custom cache path and default cache path scenarios
- Each test case covers different host, port, and share configurations
- Removed redundant test_leaderboard_args as functionality is now covered by the parametrized test
- Improved test maintainability by reducing code duplication
This addresses PR comment: #3802 (comment) "Can be combined with the following test using a parametrize argument"
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Sonnet 4 <[email protected]>
* Update make run-leaderboard to use new CLI and remove app.py main block
Address PR feedback by updating the project to use the new leaderboard CLI:
- Updated Makefile run-leaderboard target to use `python -m mteb leaderboard` instead of `python -m mteb.leaderboard.app`
- Removed the `if __name__ == "__main__":` block from mteb/leaderboard/app.py as this functionality is now handled by the CLI command
This completes the integration of the new leaderboard CLI command into the project's build system and removes deprecated direct module execution.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Sonnet 4 <[email protected]>
* feat: add theme and head parameters to leaderboard CLI
* fix: suppress leaderboard warnings on CLI launch
* test: update leaderboard tests for theme and head params
* Revert "Update make run-leaderboard to use new CLI and remove app.py main block"
This reverts commit d4df501.
* Update mteb/cli/build_cli.py
Co-authored-by: Kenneth Enevoldsen <[email protected]>
* docs: update leaderboard CLI usage
* update docs to show defaults
* fix: apply ruff formatting
---------
Co-authored-by: Roman Solomatin <[email protected]>
Co-authored-by: Claude Sonnet 4 <[email protected]>
Co-authored-by: Kenneth Enevoldsen <[email protected]>
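The lazy-loading commit above is the interesting design point: the heavy leaderboard dependencies (gradio etc.) are only imported when the subcommand actually runs. A minimal sketch with argparse — the import path, flag names, and the `get_leaderboard_app` signature here are assumptions, not the exact mteb CLI:

```python
import argparse


def _run_leaderboard(args: argparse.Namespace) -> None:
    # Import inside the handler so `mteb --help` and other subcommands
    # work without the leaderboard extras installed.
    from mteb.leaderboard import get_leaderboard_app  # hypothetical import path

    app = get_leaderboard_app(cache_path=args.cache_path)  # assumed signature
    app.launch(share=args.share)


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="mteb")
    parser.set_defaults(func=lambda _args: parser.print_help())
    subparsers = parser.add_subparsers(dest="command")

    lb = subparsers.add_parser("leaderboard", help="run the leaderboard locally")
    lb.add_argument("--cache-path", default=None, help="where cached results are stored")
    lb.add_argument("--share", action="store_true", help="create a public Gradio link")
    lb.set_defaults(func=_run_leaderboard)
    return parser


if __name__ == "__main__":
    cli_args = build_parser().parse_args()
    cli_args.func(cli_args)
```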
* Add filter for model type
* fix literal issue
* fix
* remove white space
* remove logic in filter_tasks
* remove info in leaderboard
* add tests
* update tests
* add default in model types
* fix model filter
---------
Co-authored-by: Roman Solomatin <[email protected]>
* Optimize leaderboard startup by downloading cached results from cached-data branch
- Modify _load_results() to first try downloading __cached_results.json.gz from the cached-data branch
- Only fallback to full repository clone if the direct download fails
- Add gzip decompression to handle the compressed cache file
- This reduces startup time significantly by avoiding full repo cloning when possible
- Added comprehensive logging to track download progress and fallback behavior
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Sonnet 4 <[email protected]>
* make lint
* Fix leaderboard stability test with enhanced debugging
- Remove prevent_thread_lock=True to keep Gradio process alive
- Add comprehensive exception handling for HTTP, gzip, and file operations
- Optimize test completion with HTTP 200 health checking (300s → ~140s)
- Add detailed logging and warning suppressions for better debugging
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Sonnet 4 <[email protected]>
* Update tests/test_leaderboard.py
Co-authored-by: Copilot <[email protected]>
* Add comprehensive tests for leaderboard caching exception handling
- Add 46 unit tests covering HTTP downloads, gzip decompression, file I/O, and JSON validation
- Reorganize leaderboard tests into focused modules for better maintainability
- Update Makefile with improved leaderboard test commands
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Sonnet 4 <[email protected]>
* Increase cached results download size limit to 500MB
The cached results file has grown to ~92.7MB, exceeding the previous 50MB limit. This change increases the limit to 500MB to accommodate current and future file sizes.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Sonnet 4 <[email protected]>
* Fix leaderboard tests by adding missing dependency to install-for-tests
GitHub Actions were failing because cachetools was not installed during CI test runs. The leaderboard extra was already defined with cachetools>=5.2.0 but wasn't included in the install-for-tests target used by CI.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Sonnet 4 <[email protected]>
* Remove LogFlusher functionality from leaderboard app
Addresses PR comment feedback indicating the log flushing optimization was unnecessary at this stage. Removes:
- LogFlusher class with batching logic
- Global _log_flusher instance
- _flush_logs() wrapper function
- All calls to _flush_logs() throughout the app
- Complete test file test_log_flushing.py
Leaderboard functionality remains unchanged and tests pass.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Sonnet 4 <[email protected]>
* remove _validate_benchmark_json
* Refactor leaderboard caching to use ResultCache and consolidate tests
Move download_cached_results_from_branch to ResultCache class and reduce TestDownloadCachedResultsFromBranch from 23 to 13 test cases while maintaining full coverage.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Sonnet 4 <[email protected]>
* Apply suggestions from code review
Co-authored-by: Copilot <[email protected]>
* lint and remove unreachable code
* Move shared test fixtures to parent conftest.py
- Created tests/conftest.py with shared fixtures (mock_benchmark_json, mock_invalid_json, mock_gzipped_content) for use across all tests
- Removed duplicate fixtures from tests/test_leaderboard/conftest.py
- Kept leaderboard-specific fixtures in test_leaderboard/conftest.py
- Fixes TestDownloadCachedResultsFromBranch test failures by making fixtures accessible to test_result_cache.py
All 25 tests now passing (23 in test_result_cache.py, 2 in test_integration.py)
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Sonnet 4.5 <[email protected]>
* make method private
* Fix content type validation test to match implementation behavior
The test_content_type_handling test was expecting warnings for unexpected content types, but the actual implementation raises exceptions. Updated test to use pytest.raises() for proper exception validation.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Sonnet 4 <[email protected]>
* update cache based on review comments
* type check
* Remove unused leaderboard_test_config fixture
* fix: remove unused mock_invalid_json fixture
* rm AGENTS/,d
* reduce number of excepts in app.py
---------
Co-authored-by: Claude Sonnet 4 <[email protected]>
Co-authored-by: Copilot <[email protected]>
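The caching idea above boils down to "try the small compressed file first, fall back to the slow full download on any failure". A hedged sketch, with a placeholder URL and a stand-in for the existing slow path:

```python
import gzip
import json
from typing import Any
from urllib.request import urlopen

# Placeholder URL; the real code fetches __cached_results.json.gz from the
# results repository's cached-data branch.
CACHED_RESULTS_URL = "https://example.org/cached-data/__cached_results.json.gz"


def load_results_from_full_repo() -> dict[str, Any]:
    # Stand-in for the slow fallback (cloning/downloading the full results repo).
    raise NotImplementedError


def load_cached_results() -> dict[str, Any]:
    """Prefer the small gzipped cache; fall back to the full repository on failure."""
    try:
        with urlopen(CACHED_RESULTS_URL, timeout=30) as response:
            payload = gzip.decompress(response.read())
        return json.loads(payload)
    except Exception:
        return load_results_from_full_repo()
```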
* use uv to all make commands
* read the docs a bit more...
* try out system flag
* fix: remove redundant pip install uv commands from Makefile
Removes duplicate uv installations that were conflicting with the properly configured uv from astral-sh/setup-uv GitHub Action. The GitHub Action already installs and configures uv correctly, so the Makefile pip installs were overwriting this configuration and causing "No system Python installation found" errors.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Sonnet 4 <[email protected]>
* fix: remove --system flag from uv pip install commands
The astral-sh/setup-uv GitHub Action configures uv to manage its own Python installations, not to use system Python. The --system flag was causing "No system Python installation found" errors because uv expects to use its managed Python environment.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Sonnet 4 <[email protected]>
* fix: migrate Makefile to use correct uv workflow
- Replace 'uv pip install' with 'uv sync' for dependency management
- Add proper --extra flags for all optional dependencies
- Use 'uv run' for all Python command executions
- Follow official uv GitHub Actions best practices
This aligns with uv's recommended project workflow and should resolve the CI environment issues we were experiencing.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Sonnet 4 <[email protected]>
* fix: update all GitHub Actions workflows to remove UV_SYSTEM_PYTHON
- Remove UV_SYSTEM_PYTHON: 1 from all workflow files
- Fix documentation.yml to use 'uv sync --group docs' instead of 'uv pip install'
- Fix leaderboard_build.yml to use 'uv sync --extra leaderboard --group dev'
- Ensures consistent uv workflow across all CI jobs
Updated workflows:
- lint.yml
- documentation.yml
- model_loading.yml
- dataset_loading.yml
- leaderboard_build.yml
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Sonnet 4 <[email protected]>
* fix: lint workflow to use correct dependency group
- Change from 'make install' to 'uv sync --group lint' since pre-commit is in the lint group
- Add explicit pre-commit install step
- Use 'uv run' for lint commands (ruff, typos) to ensure proper environment
- Fixes "pre-commit: No such file or directory" error in lint workflow
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Sonnet 4 <[email protected]>
* remove 3.14
* try out 3.14 again with python_full_version
* specify torch version for pylate dep
* try to skip colpali
* try split torch
* Add --no-sync flag and group/extra flags to uv run commands
Address review comments from PR #3702:
1. Add --no-sync to all uv run commands in Makefile for:
- Faster execution (avoids re-syncing on each command)
- pip compatibility (users can remove 'uv run' prefix)
2. Add appropriate group/extra flags to uv run commands:
- test commands: --group test
- docs commands: --group docs
- typecheck: --group typing
- leaderboard: --extra leaderboard
3. Update CI workflows to use --no-sync and appropriate groups:
- lint.yml: Add --no-sync --group lint to all uv run commands
- documentation.yml: Add uv run --no-sync --group docs to mkdocs gh-deploy
These changes improve performance while maintaining compatibility for contributors who prefer using pip directly.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Sonnet 4.5 <[email protected]>
* try removing install block
* add back install block
* remove install block in doc CI without --no-sync
* add uv lock file
* replace install-for-test with just install
* install pre-commit with uv
* fix doc workflow
* address review comments
* remove no-sync from run-leaderboard make command
* remove --no-sync from selected make commands
* update typechecking
* fix type checking
* sync to install
* fix tests
* test pre-commit setup
* remove test file
* fix: separate install and install-for-tests with uv commands
* fix: add leaderboard extra to typecheck command for gradio imports
* fix: add faiss-cpu extra to test targets
* fix: update CI workflows for uv dependency management
* docs: update all documentation for uv migration
- Add uv installation options alongside pip in README.md
- Update installation.md with comprehensive migration guide for contributors
- Add uv context to CONTRIBUTING.md for development setup
- Update all usage docs to include uv alternatives for extras: openai, leaderboard, image, xet, faiss-cpu dependencies
- Fix incorrect extra name: faiss -> faiss-cpu in retrieval_backend.md
- Ensure consistent dual-option approach (pip/uv) throughout documentation
This provides users and contributors with modern, fast uv tooling while maintaining backward compatibility with existing pip workflows.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Sonnet 4 <[email protected]>
---------
Co-authored-by: Claude Sonnet 4 <[email protected]>
Co-authored-by: Roman Solomatin <[email protected]>
Actually I'm curious whether a main merge would solve this issue.

No, it won't. It will be fixed only if maeb uses datasets v4.

So, can it? Looks like huggingface/datasets#7707 is solved now.
We're limited by transformers now (#3538), or more specifically huggingface/transformers#42103.

I see... maybe we shouldn't merge main for now, then. Is the fix now to invalidate the dependency cache in CI, or something else?

Can't you just delete the cache, or what's the problem?

There is no big problem. I've deleted it, but each run on main will overwrite it.
So what's the better fix?
- Remove unused LogOnce import from _create_dataloaders.py
- Use specific type ignore codes [arg-type] in mock_tasks.py for PGH003 compliance
- Fix type annotations in classification.py to use Array type instead of np.ndarray
- Remove unused Iterable import from classification.py
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.5 <[email protected]>
You can remove the HF cache from the tests in the maeb branch, but we would need to add it back before merging to main (see mteb/.github/workflows/test.yml, lines 52 to 57 at 1e78793).
Remove modalities from _common_mock_metadata since each ModelMeta instance specifies its own modalities, which caused a "got multiple values for keyword argument 'modalities'" error.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.5 <[email protected]>
We can change the cache key to

Yes, this would also work.
@Samoed I was hoping to verify my last commit's CI runs. Would you mind holding off on pushing commits to this branch, please?

Yes, of course. I've just noticed that you tried to fix
Force-pushed from b536656 to 2631fc8
I think this is running into the issue you mentioned earlier. We might have to just temporarily remove 3.14 in maeb if we want all tests to pass. WDYT, @Samoed?

We can do this, but we need to remember to add it back before the merge.

Update maeb branch with mteb 2.6.5