Caching and reloading responses

This guide explains how to enable and use the built‑in JSONL cache API in lmms-eval so that repeated runs reload model responses instead of re‑calling the model. It also covers an optional legacy SQLite cache wrapper.

What gets cached

  • Scope: Per model instance and per task.
  • Unit: One record per document (doc_id) with the final string response.
  • Files: One JSONL file per task and process shard.

The cache is implemented in lmms_eval.api.model.lmms via:

  • load_cache() and load_jsonl_cache() to load cached responses at startup
  • get_response_from_cache() to split incoming requests into “already cached” vs “not cached”
  • add_request_response_to_cache() to append new results as they are produced

Models that call these APIs (for example async_openai_compatible_chat) benefit from caching automatically, with no changes to user scripts. If you are writing your own model class, call this API inside your generate_until to cache and reload responses.

Minimal example (inside a model's generate_until)

def generate_until(self, requests):
    # Load any existing JSONL cache files for this model/task.
    self.load_cache()
    # Split requests into already-cached records and still-pending requests.
    cached, pending = self.get_response_from_cache(requests)
    results = [c["response"] for c in cached]
    for req in pending:
        out = call_backend(req)  # your model inference
        # Append the new record to the JSONL cache file.
        self.add_request_response_to_cache(req, out)
        results.append(out)
    # Note: cached results come first here; if your harness requires outputs
    # in the original request order, reorder (e.g., by doc_id) before returning.
    return results

Enable the cache

Set an environment variable before running:

export LMMS_EVAL_USE_CACHE=True
# optional: set the base directory for caches (defaults to ~/.cache/lmms-eval)
export LMMS_EVAL_HOME="/path/to/cache_root"

Nothing else is required. When enabled, the model will:

  1. Load existing JSONL cache files at startup.
  2. Serve responses from the cache when possible.
  3. Append newly generated responses back to the JSONL files.

Where cache files live

  • Base directory: ${LMMS_EVAL_HOME:-~/.cache/lmms-eval}/eval_cache/<model_hash>/
  • File name per task and process shard: {task_name}_rank{rank}_world_size{world_size}.jsonl
  • Record format per line:
{"doc_id": <doc_id>, "response": <string>}
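To make the record and file layout concrete, the sketch below writes one record into a file named per the scheme above. The task name, rank, and world size are example values, not defaults.

import json

task_name, rank, world_size = "mytask", 0, 1  # example values
path = f"{task_name}_rank{rank}_world_size{world_size}.jsonl"

# One record per line, matching the documented format.
with open(path, "a") as f:
    f.write(json.dumps({"doc_id": 42, "response": "final answer string"}) + "\n")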

Notes:

  • The <model_hash> is derived from a best‑effort, human‑readable model identity (e.g., model_version) combined with the set of task names attached to the model, to avoid collisions (see the sketch after this list).
  • Separate files per rank and world_size make distributed runs safe to cache concurrently.
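As a rough illustration of the hashing idea, the sketch below combines the model identity with the sorted task set. The function name model_cache_hash and the SHA‑256 scheme are assumptions for this sketch; the actual derivation lives in lmms_eval.api.model.

import hashlib

def model_cache_hash(model_version: str, task_names: list[str]) -> str:
    # Assumption: combine a readable model identity with the sorted task set
    # so that different models or task mixes never share a cache directory.
    identity = model_version + "|" + ",".join(sorted(task_names))
    return hashlib.sha256(identity.encode()).hexdigest()[:16]

print(model_cache_hash("grok-2-latest", ["mytask"]))  # 16-hex-character directory name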

How it works at runtime

For models wired to the cache API (e.g., async_openai_compatible_chat):

  • At the beginning of generate_until(...) the model calls load_cache() and then get_response_from_cache(requests).
  • Cached items are returned immediately; only the remaining requests are forwarded to the backend.
  • After each response is produced, add_request_response_to_cache(...) appends a JSONL record.

The cache key is the tuple (task_name, doc_id). Ensure your task produces stable doc_ids across runs.
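Conceptually, the split performed by get_response_from_cache looks like the sketch below. This illustrates the keying scheme only; the request shape (a doc_id attribute) and the in‑memory dict are assumptions, not the library's actual code.

def split_by_cache(cache: dict, task_name: str, requests: list):
    # Illustration: partition requests by the (task_name, doc_id) key.
    cached, pending = [], []
    for req in requests:
        key = (task_name, req.doc_id)  # doc_id must be stable across runs
        if key in cache:
            cached.append({"doc_id": req.doc_id, "response": cache[key]})
        else:
            pending.append(req)
    return cached, pending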

Example: use with async_openai_compatible_chat

export OPENAI_API_BASE="http://localhost:8000/v1"
export OPENAI_API_KEY="EMPTY"          # if your server allows it
export LMMS_EVAL_USE_CACHE=True         # enable JSONL cache
# optional: export LMMS_EVAL_HOME to relocate cache root

python -m lmms_eval \
  --model async_openai \
  --model_args model_version=grok-2-latest,base_url=$OPENAI_API_BASE,api_key=$OPENAI_API_KEY \
  --tasks <your_task> \
  --batch_size 1 \
  --output_path ./logs/

On a second run with the same task/docs, cached responses will be loaded and only missing documents will call the model.

Inspect or clear the cache

  • Inspect: open the task JSONL file(s) under the model’s cache directory and view records (see the sketch after this list).
  • Clear: delete the corresponding JSONL file(s) or the entire <model_hash> directory to force re‑evaluation.
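For example, a small script along these lines dumps cached records. The glob pattern and task name are examples; substitute your actual <model_hash> and task file name.

import json
from pathlib import Path

cache_root = Path.home() / ".cache/lmms-eval/eval_cache"
for jsonl in cache_root.glob("*/mytask_rank*_world_size*.jsonl"):
    print(f"== {jsonl} ==")
    with jsonl.open() as f:
        for line in f:
            record = json.loads(line)
            print(record["doc_id"], record["response"][:80])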

Notes and limitations

  • The JSONL cache is keyed by task_name and doc_id. Changing task names or document IDs invalidates reuse.
  • Responses are cached as final strings. If your model emits intermediate tool calls, the final message (including any inline annotations) is what gets cached.
  • Distributed runs write to per‑rank files to avoid contention; reusing the cache works across single‑ and multi‑GPU as long as task_name/doc_id match.

Optional: legacy SQLite cache wrapper

There is also a separate, optional wrapper, CachingLMM (see lmms_eval.api.model.CachingLMM), which caches by hashing the entire call arguments into a SQLite database (via SqliteDict). It is independent of the JSONL cache above and can be useful for broader API‑level caching. For most users, enabling LMMS_EVAL_USE_CACHE=True is simpler and sufficient.
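As a rough illustration of that approach (not CachingLMM's actual code), an argument‑hashing SQLite cache can be built on SqliteDict as below. The helper cached_call and the pickle/SHA‑256 keying are assumptions for this sketch.

import hashlib
import pickle

from sqlitedict import SqliteDict

db = SqliteDict("api_cache.sqlite", autocommit=True)

def cached_call(fn, *args, **kwargs):
    # Key by a hash of the entire call, mirroring the wrapper's idea.
    key = hashlib.sha256(pickle.dumps((fn.__name__, args, kwargs))).hexdigest()
    if key in db:
        return db[key]
    result = fn(*args, **kwargs)
    db[key] = result
    return result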