# Caching and reloading responses

This guide explains how to enable and use the built-in JSONL cache API in lmms-eval so that repeated runs can reload model responses instead of re-calling the model. It also covers an optional legacy SQLite cache wrapper.
## What gets cached

- Scope: per model instance and per task.
- Unit: one record per document (`doc_id`) with the final string response.
- Files: one JSONL file per task and process shard.

The cache is implemented in `lmms_eval.api.model.lmms` via:

- `load_cache()` and `load_jsonl_cache()` to load cached responses at startup
- `get_response_from_cache()` to split incoming requests into "already cached" vs. "not cached"
- `add_request_response_to_cache()` to append new results as they are produced
Models that call these APIs (for example, `async_openai_compatible_chat`) benefit from caching automatically, with no changes to user scripts. If you are writing your own model, call these APIs from your `generate_until` implementation to cache responses and reload them on later runs.
## Minimal example (inside a model's `generate_until`)

```python
def generate_until(self, requests):
    # Load any existing JSONL cache files for this model/task.
    self.load_cache()

    # Split requests into already-cached records and still-pending requests.
    cached, pending = self.get_response_from_cache(requests)
    results = [c["response"] for c in cached]

    for req in pending:
        out = call_backend(req)  # your model inference
        # Append the new record so future runs can reload it.
        self.add_request_response_to_cache(req, out)
        results.append(out)

    # Note: this sketch returns cached responses first; if your harness
    # expects results in the original request order, reassemble by index.
    return results
```
## Enable the cache

Set an environment variable before running:

```bash
export LMMS_EVAL_USE_CACHE=True
# optional: set the base directory for caches (defaults to ~/.cache/lmms-eval)
export LMMS_EVAL_HOME="/path/to/cache_root"
```
Nothing else is required. When enabled, the model will:

1. load existing JSONL cache files at startup;
2. serve responses from the cache;
3. append newly generated responses back to the JSONL files.
## Where cache files live

- Base directory: `${LMMS_EVAL_HOME:-~/.cache/lmms-eval}/eval_cache/<model_hash>/`
- File name per task and process shard: `{task_name}_rank{rank}_world_size{world_size}.jsonl`
- Record format per line: `{"doc_id": <doc_id>, "response": <string>}`
Notes:

- The `<model_hash>` is derived from a best-effort, human-readable model identity (e.g., `model_version`) and the set of task names attached to the model, to avoid collisions.
- Separate files per `rank` and `world_size` make distributed runs safe to cache concurrently.
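
If you script around the cache (for example, to check whether a shard file already exists before launching a run), the on-disk location can be assembled from the pieces above. A minimal sketch; `cache_file_path` is a hypothetical helper, and the `model_hash` argument stands in for the hash the library computes, which you would read off an existing cache directory rather than derive yourself:

```python
import os
from pathlib import Path

def cache_file_path(model_hash: str, task_name: str, rank: int, world_size: int) -> Path:
    """Assemble the expected JSONL cache path for one task shard."""
    base = os.environ.get("LMMS_EVAL_HOME", os.path.expanduser("~/.cache/lmms-eval"))
    return Path(base) / "eval_cache" / model_hash / f"{task_name}_rank{rank}_world_size{world_size}.jsonl"

# e.g. the shard written by rank 0 of a 2-process run:
print(cache_file_path("<model_hash>", "your_task", rank=0, world_size=2))
```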
## How it works at runtime

For models wired to the cache API (e.g., `async_openai_compatible_chat`):

- At the beginning of `generate_until(...)`, the model calls `load_cache()` and then `get_response_from_cache(requests)`.
- Cached items are returned immediately; only the remaining requests are forwarded to the backend.
- After each response is produced, `add_request_response_to_cache(...)` appends a JSONL record.
The cache key is the tuple `(task_name, doc_id)`. Ensure your task produces stable `doc_id`s across runs.
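
Conceptually, the split performed by `get_response_from_cache()` amounts to a dictionary lookup keyed by `doc_id` within the current task's shard file. The real logic lives on the model class; the sketch below only illustrates the idea, and the `.doc_id` attribute on request objects is an assumption about their shape:

```python
import json

def split_requests(requests, cache_path):
    """Illustrative split into (cached_records, pending_requests)."""
    cache = {}
    try:
        with open(cache_path) as f:
            for line in f:
                record = json.loads(line)  # {"doc_id": ..., "response": ...}
                cache[record["doc_id"]] = record["response"]
    except FileNotFoundError:
        pass  # first run: no cache file yet

    cached, pending = [], []
    for req in requests:
        if req.doc_id in cache:
            cached.append({"doc_id": req.doc_id, "response": cache[req.doc_id]})
        else:
            pending.append(req)
    return cached, pending
```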
## Example: use with `async_openai_compatible_chat`

```bash
export OPENAI_API_BASE="http://localhost:8000/v1"
export OPENAI_API_KEY="EMPTY"          # if your server allows it
export LMMS_EVAL_USE_CACHE=True        # enable the JSONL cache
# optional: export LMMS_EVAL_HOME to relocate the cache root

python -m lmms_eval \
    --model async_openai \
    --model_args model_version=grok-2-latest,base_url=$OPENAI_API_BASE,api_key=$OPENAI_API_KEY \
    --tasks <your_task> \
    --batch_size 1 \
    --output_path ./logs/
```
On a second run with the same task and documents, cached responses are loaded and only the missing documents trigger model calls.
## Inspect or clear the cache

- Inspect: open the task JSONL file(s) under the model's cache directory and view the records.
- Clear: delete the corresponding JSONL file(s), or the entire `<model_hash>` directory, to force re-evaluation.
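
For a quick look without extra tooling, here is a small sketch that prints each cached record and, when uncommented, deletes the shard to force re-evaluation (the path below is illustrative; substitute your actual `<model_hash>` and task):

```python
import json
from pathlib import Path

# Substitute the real hash directory and shard file name from your cache root.
cache_path = Path("~/.cache/lmms-eval/eval_cache/<model_hash>/your_task_rank0_world_size1.jsonl").expanduser()

# Inspect: one JSON object per line.
with cache_path.open() as f:
    for line in f:
        record = json.loads(line)
        print(record["doc_id"], repr(record["response"])[:80])

# Clear: uncomment to delete this shard and force re-evaluation next run.
# cache_path.unlink()
```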
## Notes and limitations

- The JSONL cache is keyed by `task_name` and `doc_id`. Changing task names or document IDs invalidates reuse.
- Responses are cached as final strings. If your model emits intermediate tool calls, the final message (including any inline annotations) is what gets cached.
- Distributed runs write to per-rank files to avoid contention; reusing the cache works across single- and multi-GPU runs as long as `task_name`/`doc_id` match.
## Optional: legacy SQLite cache wrapper

There is also a separate, optional wrapper, `CachingLMM` (see `lmms_eval.api.model.CachingLMM`), that caches responses in a SQLite database (via `SqliteDict`), keyed by a hash of the entire call arguments. It is independent of the JSONL cache above and can be useful for broader API-level caching. For most users, enabling `LMMS_EVAL_USE_CACHE=True` is sufficient and simpler.
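
If you do want the SQLite wrapper, usage is likely along these lines. This is an unverified sketch: the constructor shown (a model instance plus a `cache_db` path) mirrors the analogous `CachingLM` wrapper in lm-evaluation-harness and may not match `CachingLMM` exactly, so check the class definition in `lmms_eval.api.model` before relying on it:

```python
from lmms_eval.api.model import CachingLMM

# Hypothetical usage: wrap an existing lmms model so each call is memoized
# to a SQLite database keyed by a hash of the full call arguments.
cached_model = CachingLMM(model, "/path/to/cache.db")  # signature assumed, not verified
```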