Caching and reloading responses

This guide explains how to enable and use the built‑in JSONL cache API in lmms-eval so that repeated runs reload model responses instead of re‑calling the model. It also covers an optional legacy SQLite cache wrapper.

What gets cached

  • Scope: Per model instance and per task.
  • Unit: One record per document (doc_id) with the final string response.
  • Files: One JSONL file per task and process shard.

The cache is implemented in lmms_eval.api.model.lmms via:

  • load_cache() and load_jsonl_cache() to load cached responses at startup
  • get_response_from_cache() to split incoming requests into “already cached” vs “not cached”
  • add_request_response_to_cache() to append new results as they are produced

Models that call these APIs (for example async_openai_compatible_chat) benefit from caching automatically, with no changes to user scripts. If you are writing your own model class, call this API inside your generate_until to cache and reload responses.

Minimal example (inside a model's generate_until)

def generate_until(self, requests):
    # Load any existing JSONL cache files for this model/task.
    self.load_cache()
    # Split requests into already-cached records and still-pending requests.
    cached, pending = self.get_response_from_cache(requests)
    results = [c["response"] for c in cached]
    for req in pending:
        out = call_backend(req)  # your model inference
        # Append the new record to the JSONL cache file.
        self.add_request_response_to_cache(req, out)
        results.append(out)
    # Note: cached results come first here; if your harness requires outputs
    # in the original request order, reorder (e.g., by doc_id) before returning.
    return results

Enable the cache

Set an environment variable before running:

export LMMS_EVAL_USE_CACHE=True
# optional: set the base directory for caches (defaults to ~/.cache/lmms-eval)
export LMMS_EVAL_HOME="/path/to/cache_root"

Nothing else is required. When enabled, the model will:

  1. Load existing JSONL cache files at startup.
  2. Serve responses from the cache when possible.
  3. Append newly generated responses back to the JSONL files.

Where cache files live

  • Base directory: ${LMMS_EVAL_HOME:-~/.cache/lmms-eval}/eval_cache/<model_hash>/
  • File name per task and process shard: {task_name}_rank{rank}_world_size{world_size}.jsonl
  • Record format per line:
{"doc_id": <doc_id>, "response": <string>}
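To make the record and file layout concrete, the sketch below writes one record into a file named per the scheme above. The task name, rank, and world size are example values, not defaults.

import json

task_name, rank, world_size = "mytask", 0, 1  # example values
path = f"{task_name}_rank{rank}_world_size{world_size}.jsonl"

# One record per line, matching the documented format.
with open(path, "a") as f:
    f.write(json.dumps({"doc_id": 42, "response": "final answer string"}) + "\n")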

Notes:

  • The <model_hash> is derived from a best‑effort, human‑readable model identity (e.g., model_version) combined with the set of task names attached to the model, to avoid collisions (see the sketch after this list).
  • Separate files per rank and world_size make distributed runs safe to cache concurrently.
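As a rough illustration of the hashing idea, the sketch below combines the model identity with the sorted task set. The function name model_cache_hash and the SHA‑256 scheme are assumptions for this sketch; the actual derivation lives in lmms_eval.api.model.

import hashlib

def model_cache_hash(model_version: str, task_names: list[str]) -> str:
    # Assumption: combine a readable model identity with the sorted task set
    # so that different models or task mixes never share a cache directory.
    identity = model_version + "|" + ",".join(sorted(task_names))
    return hashlib.sha256(identity.encode()).hexdigest()[:16]

print(model_cache_hash("grok-2-latest", ["mytask"]))  # 16-hex-character directory name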

How it works at runtime

For models wired to the cache API (e.g., async_openai_compatible_chat):

  • At the beginning of generate_until(...) the model calls load_cache() and then get_response_from_cache(requests).
  • Cached items are returned immediately; only the remaining requests are forwarded to the backend.
  • After each response is produced, add_request_response_to_cache(...) appends a JSONL record.

The cache key is the tuple (task_name, doc_id). Ensure your task produces stable doc_ids across runs.
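Conceptually, the split performed by get_response_from_cache looks like the sketch below. This illustrates the keying scheme only; the request shape (a doc_id attribute) and the in‑memory dict are assumptions, not the library's actual code.

def split_by_cache(cache: dict, task_name: str, requests: list):
    # Illustration: partition requests by the (task_name, doc_id) key.
    cached, pending = [], []
    for req in requests:
        key = (task_name, req.doc_id)  # doc_id must be stable across runs
        if key in cache:
            cached.append({"doc_id": req.doc_id, "response": cache[key]})
        else:
            pending.append(req)
    return cached, pending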

Example: use with async_openai_compatible_chat

export OPENAI_API_BASE="http://localhost:8000/v1"
export OPENAI_API_KEY="EMPTY"          # if your server allows it
export LMMS_EVAL_USE_CACHE=True         # enable JSONL cache
# optional: export LMMS_EVAL_HOME to relocate cache root

python -m lmms_eval \
  --model async_openai \
  --model_args model_version=grok-2-latest,base_url=$OPENAI_API_BASE,api_key=$OPENAI_API_KEY \
  --tasks <your_task> \
  --batch_size 1 \
  --output_path ./logs/

On a second run with the same task/docs, cached responses will be loaded and only missing documents will call the model.

Inspect or clear the cache

  • Inspect: open the task JSONL file(s) under the model’s cache directory and view records (see the sketch after this list).
  • Clear: delete the corresponding JSONL file(s) or the entire <model_hash> directory to force re‑evaluation.
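For example, a small script along these lines dumps cached records. The glob pattern and task name are examples; substitute your actual <model_hash> and task file name.

import json
from pathlib import Path

cache_root = Path.home() / ".cache/lmms-eval/eval_cache"
for jsonl in cache_root.glob("*/mytask_rank*_world_size*.jsonl"):
    print(f"== {jsonl} ==")
    with jsonl.open() as f:
        for line in f:
            record = json.loads(line)
            print(record["doc_id"], record["response"][:80])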

Notes and limitations

  • The JSONL cache is keyed by task_name and doc_id. Changing task names or document IDs invalidates reuse.
  • Responses are cached as final strings. If your model emits intermediate tool calls, the final message (including any inline annotations) is what gets cached.
  • Distributed runs write to per‑rank files to avoid contention; reusing the cache works across single‑ and multi‑GPU as long as task_name/doc_id match.

Optional: legacy SQLite cache wrapper

There is also a separate, optional wrapper, CachingLMM (see lmms_eval.api.model.CachingLMM), which caches by hashing the entire call arguments into a SQLite database (via SqliteDict). It is independent of the JSONL cache above and can be useful for broader API‑level caching. For most users, enabling LMMS_EVAL_USE_CACHE=True is simpler and sufficient.
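As a rough illustration of that approach (not CachingLMM's actual code), an argument‑hashing SQLite cache can be built on SqliteDict as below. The helper cached_call and the pickle/SHA‑256 keying are assumptions for this sketch.

import hashlib
import pickle

from sqlitedict import SqliteDict

db = SqliteDict("api_cache.sqlite", autocommit=True)

def cached_call(fn, *args, **kwargs):
    # Key by a hash of the entire call, mirroring the wrapper's idea.
    key = hashlib.sha256(pickle.dumps((fn.__name__, args, kwargs))).hexdigest()
    if key in db:
        return db[key]
    result = fn(*args, **kwargs)
    db[key] = result
    return result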