
Commit b77175d

Add: File Based Caching for lm_eval tests (#2051)
# LM Eval Caching System

## Overview

Caches base model evaluation results to speed up tests by ~30-50%. When multiple quantization tests share the same base model and evaluation parameters, cached results are reused instead of re-evaluating.

## Quick Start

### Enable (Default)

```bash
pytest tests/lmeval/test_lmeval.py
```

The first run caches results; subsequent runs reuse cached results for matching parameters.

### Disable

```bash
DISABLE_LMEVAL_CACHE=1 pytest tests/lmeval/test_lmeval.py
```

### Custom Cache Location

```bash
LMEVAL_CACHE_DIR=/tmp/cache pytest tests/lmeval/test_lmeval.py
```

## How It Works

Results are cached based on:

- Model ID (e.g., `TinyLlama/TinyLlama-1.1B-Chat-v1.0`)
- Task (e.g., `gsm8k`)
- Few-shot count
- Sample limit
- Batch size
- Model arguments (hashed)
- lm_eval version
- Seed (if set)

### Cache Storage

Single CSV file: `.lmeval_cache/cache.csv` (columns abridged below; the file also stores `lmeval_version` and `seed`):

```csv
model,task,num_fewshot,limit,batch_size,model_args_hash,result
TinyLlama/TinyLlama-1.1B-Chat-v1.0,gsm8k,5,1000,100,abc123def456,"{'results': {...}}"
```

## Usage in Tests

```python
from tests.testing_utils import cached_lm_eval_run

class TestLMEval:
    @cached_lm_eval_run
    def _eval_base_model(self) -> dict:
        return self._eval_model(self.model)
```

## Cache Management

### Inspect Cache

```bash
cat .lmeval_cache/cache.csv

# Or with pandas
python -c "import pandas as pd; print(pd.read_csv('.lmeval_cache/cache.csv'))"
```

### Clear Cache

```bash
rm -rf .lmeval_cache/
```

### Ignore in Git

Already added to `.gitignore`:

```gitignore
.lmeval_cache/
```

## Environment Variables

| Variable | Values | Default | Description |
|----------|--------|---------|-------------|
| `DISABLE_LMEVAL_CACHE` | `1`, `true`, `yes` | unset (caching enabled) | Disable caching |
| `LMEVAL_CACHE_DIR` | path | `.lmeval_cache` | Cache directory |

## Performance Example

**Without cache:** 175s total

```
Test 1 (W4A16): Base (60s) + Compressed (30s) = 90s
Test 2 (W8A8):  Base (60s) + Compressed (25s) = 85s
```

**With cache:** 115s total (34% faster)

```
Test 1 (W4A16): Base (60s)  + Compressed (30s) = 90s  [MISS]
Test 2 (W8A8):  Base (0.1s) + Compressed (25s) = 25s  [HIT]
```

## Example Cache Logs

When running tests, you'll see these log messages:

**First test (cache miss):**

```
LM-Eval cache MISS: meta-llama/Meta-Llama-3-8B-Instruct/gsm8k
# ... evaluation runs ...
LM-Eval cache WRITE: meta-llama/Meta-Llama-3-8B-Instruct/gsm8k
```

**Second test with same base model (cache hit):**

```
LM-Eval cache HIT: meta-llama/Meta-Llama-3-8B-Instruct/gsm8k
# ... evaluation skipped, cached result returned ...
```

### Testing the Cache

Run two tests that share the same base model:

```bash
# Clean cache
rm -rf .lmeval_cache/

# Test 1: FP8_DYNAMIC scheme - Cache MISS
CUDA_VISIBLE_DEVICES=5 CADENCE=weekly \
TEST_DATA_FILE=tests/lmeval/configs/fp8_dynamic_per_token.yaml \
.venv/bin/python -m pytest tests/lmeval/test_lmeval.py::TestLMEval::test_lm_eval -v

# Test 2: FP8 static scheme (same base model) - Cache HIT
CUDA_VISIBLE_DEVICES=5 CADENCE=weekly \
TEST_DATA_FILE=tests/lmeval/configs/fp8_static_per_tensor.yaml \
.venv/bin/python -m pytest tests/lmeval/test_lmeval.py::TestLMEval::test_lm_eval -v

# Inspect cache
cat .lmeval_cache/cache.csv
```

Expected output shows the second test completes much faster due to cached base model evaluation.
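The `model_args_hash` column is a short SHA-256 fingerprint of the evaluation's model arguments. If you want to tie a row in `cache.csv` back to a specific set of model arguments, the hash can be recomputed outside the test suite. A minimal sketch, mirroring the `_sha256_hash` helper this commit adds to `tests/testing_utils.py` (the `model_args` values below are hypothetical):

```python
import hashlib
import json
from typing import Optional

import pandas as pd


def sha256_hash(text: str, length: Optional[int] = None) -> str:
    """Hex SHA-256 digest, optionally truncated (the cache keeps 16 characters)."""
    digest = hashlib.sha256(text.encode()).hexdigest()
    return digest[:length] if length else digest


# Hypothetical model arguments; sort_keys makes the hash independent of dict order.
model_args = {"add_bos_token": True, "dtype": "bfloat16"}
target_hash = sha256_hash(json.dumps(model_args, sort_keys=True), 16)

# Show any cached rows carrying this fingerprint.
df = pd.read_csv(".lmeval_cache/cache.csv")
print(df[df["model_args_hash"] == target_hash])
```

Because the cache key serializes model arguments the same way (sorted keys, 16-character digest), the rows printed here are exactly the entries a test with those arguments would hit.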
## Troubleshooting

### Cache Not Working

Check the environment variable:

```bash
echo $DISABLE_LMEVAL_CACHE  # Should be empty
```

Verify the cache file exists:

```bash
ls -lh .lmeval_cache/cache.csv
```

### Cache Always Misses

Ensure an exact parameter match:

- Model ID (case-sensitive)
- Task name
- All numeric parameters (fewshot, limit, batch_size)
- Model arguments

### Corrupted Cache

Simply delete and recreate:

```bash
rm .lmeval_cache/cache.csv
pytest tests/lmeval/test_lmeval.py
```

## CI/CD Integration

### GitHub Actions

```yaml
- name: Run LM Eval tests with cache
  env:
    LMEVAL_CACHE_DIR: ${{ runner.temp }}/lm_cache
  run: pytest tests/lmeval/test_lmeval.py
```

## Design Notes

### Why CSV?

- Matches existing timing data format (`timings/*.csv`)
- Simple pandas integration

### Error Handling

Failures are logged but don't break tests: the cache simply falls back to uncached execution on any error.

Rebased version of #1900

---------

Signed-off-by: Rahul-Tuli <[email protected]>
1 parent 9cb1c6d commit b77175d

File tree

3 files changed: +169 lines, -8 lines

.gitignore

Lines changed: 3 additions & 0 deletions
```diff
@@ -805,6 +805,9 @@ timings/
 output_finetune/
 env_log.json
 
+# LM Eval cache
+.lmeval_cache/
+
 # uv artifacts
 uv.lock
 .venv/
```

tests/lmeval/test_lmeval.py

Lines changed: 8 additions & 7 deletions
```diff
@@ -15,7 +15,7 @@
 from llmcompressor.core import active_session
 from tests.e2e.e2e_utils import load_model, run_oneshot_for_e2e_testing
 from tests.test_timer.timer_utils import get_singleton_manager, log_time
-from tests.testing_utils import requires_gpu
+from tests.testing_utils import cached_lm_eval_run, requires_gpu
 
 
 class LmEvalConfig(BaseModel):
@@ -100,12 +100,12 @@ def set_up(self, test_data_file: str):
         self.recipe = eval_config.get("recipe")
         self.quant_type = eval_config.get("quant_type")
         self.save_dir = eval_config.get("save_dir")
+        self.seed = eval_config.get("seed", None)
 
-        seed = eval_config.get("seed", None)
-        if seed is not None:
-            random.seed(seed)
-            numpy.random.seed(seed)
-            torch.manual_seed(seed)
+        if self.seed is not None:
+            random.seed(self.seed)
+            numpy.random.seed(self.seed)
+            torch.manual_seed(self.seed)
 
         logger.info("========== RUNNING ==============")
         logger.info(self.scheme)
@@ -161,8 +161,9 @@ def test_lm_eval(self, test_data_file: str):
         self.tear_down()
 
     @log_time
+    @cached_lm_eval_run
     def _eval_base_model(self) -> dict:
-        """Evaluate the base (uncompressed) model."""
+        """Evaluate the base (uncompressed) model with caching."""
         return self._eval_model(self.model)
 
     @log_time
```
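One detail worth noting in the diff above: `@log_time` is stacked above `@cached_lm_eval_run`, so the timing wrapper measures whatever the cache wrapper does, and a cache hit shows up as a near-instant call. A minimal sketch of that stacking order, using simplified, hypothetical stand-ins for both decorators:

```python
import functools
import time

_cache = {}


def log_time(func):
    """Outer decorator: times whatever the wrapped callable does."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        result = func(*args, **kwargs)
        print(f"{func.__name__} took {time.perf_counter() - start:.3f}s")
        return result
    return wrapper


def cached(func):
    """Inner decorator: returns a stored result when one exists."""
    @functools.wraps(func)
    def wrapper(*args):
        if args not in _cache:
            _cache[args] = func(*args)
        return _cache[args]
    return wrapper


@log_time   # applied last, so it wraps (and times) the caching layer
@cached     # applied first, directly around the slow evaluation
def evaluate(x: int) -> int:
    time.sleep(0.2)  # stand-in for a slow base-model evaluation
    return x * 2


evaluate(3)  # cache miss: timed at roughly 0.2s
evaluate(3)  # cache hit: timed as near-zero
```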

tests/testing_utils.py

Lines changed: 158 additions & 1 deletion
```diff
@@ -1,20 +1,37 @@
 import dataclasses
 import enum
+import hashlib
+import json
 import logging
 import os
 from dataclasses import dataclass
 from enum import Enum
+from functools import wraps
 from pathlib import Path
 from subprocess import PIPE, STDOUT, run
-from typing import Callable, List, Optional, Union
+from typing import Any, Callable, Dict, List, Optional, Union
 
+import pandas as pd
 import pytest
 import torch
 import yaml
 from datasets import Dataset
+from loguru import logger
 from transformers import ProcessorMixin
 
 TEST_DATA_FILE = os.environ.get("TEST_DATA_FILE", None)
+DISABLE_LMEVAL_CACHE = os.environ.get("DISABLE_LMEVAL_CACHE", "").lower() in (
+    "1",
+    "true",
+    "yes",
+)
+LMEVAL_CACHE_DIR = Path(os.environ.get("LMEVAL_CACHE_DIR", ".lmeval_cache"))
+LMEVAL_CACHE_FILE = LMEVAL_CACHE_DIR / "cache.csv"
+
+
+def _sha256_hash(text: str, length: Optional[int] = None) -> str:
+    hash_result = hashlib.sha256(text.encode()).hexdigest()
+    return hash_result[:length] if length else hash_result
 
 
 # TODO: maybe test type as decorators?
@@ -292,3 +309,143 @@ def requires_cadence(cadence: Union[str, List[str]]) -> Callable:
     return pytest.mark.skipif(
         (current_cadence not in cadence), reason="cadence mismatch"
     )
+
+
+@dataclass(frozen=True)
+class LMEvalCacheKey:
+    """Cache key for LM Eval results based on evaluation parameters."""
+
+    model: str
+    task: str
+    num_fewshot: int
+    limit: int
+    batch_size: int
+    model_args_hash: str
+    lmeval_version: str
+    seed: Optional[int]
+
+    @classmethod
+    def from_test_instance(cls, test_instance: Any) -> "LMEvalCacheKey":
+        """Create cache key from test instance."""
+        try:
+            import lm_eval
+
+            lmeval_version = lm_eval.__version__
+        except (ImportError, AttributeError):
+            lmeval_version = "unknown"
+
+        lmeval = test_instance.lmeval
+        model_args_json = json.dumps(lmeval.model_args, sort_keys=True)
+        seed = getattr(test_instance, "seed", None)
+
+        return cls(
+            model=test_instance.model,
+            task=lmeval.task,
+            num_fewshot=lmeval.num_fewshot,
+            limit=lmeval.limit,
+            batch_size=lmeval.batch_size,
+            model_args_hash=_sha256_hash(model_args_json, 16),
+            lmeval_version=lmeval_version,
+            seed=seed,
+        )
+
+    def _matches(self, row: pd.Series) -> bool:
+        """Check if a DataFrame row matches this cache key."""
+        # Handle NaN for seed comparison (pandas reads None as NaN)
+        seed_matches = (pd.isna(row["seed"]) and self.seed is None) or (
+            row["seed"] == self.seed
+        )
+        return (
+            row["model"] == self.model
+            and row["task"] == self.task
+            and row["num_fewshot"] == self.num_fewshot
+            and row["limit"] == self.limit
+            and row["batch_size"] == self.batch_size
+            and row["model_args_hash"] == self.model_args_hash
+            and row["lmeval_version"] == self.lmeval_version
+            and seed_matches
+        )
+
+    def get_cached_result(self) -> Optional[Dict]:
+        """Load cached result from CSV file."""
+        if not LMEVAL_CACHE_FILE.exists():
+            return None
+
+        try:
+            df = pd.read_csv(LMEVAL_CACHE_FILE)
+            matches = df[df.apply(self._matches, axis=1)]
+
+            if matches.empty:
+                return None
+
+            return json.loads(matches.iloc[0]["result"])
+
+        except Exception as e:
+            logger.debug(f"Cache read failed: {e}")
+            return None
+
+    def store_result(self, result: Dict) -> None:
+        """Store result in CSV file."""
+        try:
+            LMEVAL_CACHE_DIR.mkdir(parents=True, exist_ok=True)
+
+            new_row = {
+                "model": self.model,
+                "task": self.task,
+                "num_fewshot": self.num_fewshot,
+                "limit": self.limit,
+                "batch_size": self.batch_size,
+                "model_args_hash": self.model_args_hash,
+                "lmeval_version": self.lmeval_version,
+                "seed": self.seed,
+                "result": json.dumps(result, default=str),
+            }
+
+            # Load existing cache or create new
+            if LMEVAL_CACHE_FILE.exists():
+                df = pd.read_csv(LMEVAL_CACHE_FILE)
+                # Remove duplicate entries for this key
+                df = df[~df.apply(self._matches, axis=1)]
+                df = pd.concat([df, pd.DataFrame([new_row])], ignore_index=True)
+            else:
+                df = pd.DataFrame([new_row])
+
+            df.to_csv(LMEVAL_CACHE_FILE, index=False)
+            logger.info(f"LM-Eval cache WRITE: {self.model}/{self.task}")
+
+        except Exception as e:
+            logger.debug(f"Cache write failed: {e}")
+
+
+def cached_lm_eval_run(func: Callable) -> Callable:
+    """
+    Decorator to cache lm_eval results in CSV format.
+
+    Caches results based on model, task, num_fewshot, limit, batch_size,
+    and model_args to avoid redundant base model evaluations.
+
+    Environment variables:
+        DISABLE_LMEVAL_CACHE: Set to "1"/"true"/"yes" to disable
+        LMEVAL_CACHE_DIR: Custom cache directory (default: .lmeval_cache)
+    """

+    @wraps(func)
+    def wrapper(self, *args, **kwargs):
+        # Skip caching if disabled
+        if DISABLE_LMEVAL_CACHE:
+            return func(self, *args, **kwargs)
+
+        # Try to get cached result
+        cache_key = LMEvalCacheKey.from_test_instance(self)
+        if (cached_result := cache_key.get_cached_result()) is not None:
+            logger.info(f"LM-Eval cache HIT: {cache_key.model}/{cache_key.task}")
+            return cached_result
+
+        # Run evaluation and cache result
+        logger.info(f"LM-Eval cache MISS: {cache_key.model}/{cache_key.task}")
+        result = func(self, *args, **kwargs)
+        cache_key.store_result(result)
+
+        return result
+
+    return wrapper
```
