## Caching and reloading responses
This guide explains how to enable and use the built-in JSONL cache API in `lmms-eval`, so repeated runs can reload model responses instead of re-calling the model. It also notes an optional legacy SQLite cache wrapper.
### What gets cached
- **Scope**: per model instance and per task.
- **Unit**: one record per document (`doc_id`) with the final string response.
- **Files**: one JSONL file per task and per process shard.
The cache is implemented in `lmms_eval.api.model.lmms` via:
- `load_cache()` and `load_jsonl_cache()` to load cached responses at startup
- `get_response_from_cache()` to split incoming requests into "already cached" vs "not cached"
- `add_request_response_to_cache()` to append new results as they are produced
Models that call these APIs (for example `async_openai_compatible_chat`) benefit from caching automatically, with no code changes in user scripts. If you are writing your own model, call this API from your `generate_until` implementation to cache new responses and reload cached ones.
### Minimal example (inside a model's `generate_until`)
```python
def generate_until(self, requests):
    # Load any existing JSONL cache files for this model/task.
    self.load_cache()
    # Split requests into already-cached responses and still-pending requests.
    cached, pending = self.get_response_from_cache(requests)
    results = [c["response"] for c in cached]
    for req in pending:
        out = call_backend(req)  # your model inference
        # Persist the new response to the JSONL cache as soon as it is produced.
        self.add_request_response_to_cache(req, out)
        results.append(out)
    return results
```
### Enable the cache
Set an environment variable before running:
```bash
export LMMS_EVAL_USE_CACHE=True
# optional: set the base directory for caches (defaults to ~/.cache/lmms-eval)
export LMMS_EVAL_HOME="/path/to/cache_root"
```
Nothing else is required. When enabled, the model will:
1. load existing JSONL cache files at startup;
2. serve responses from the cache;
3. append newly generated responses back to the JSONL files.
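If you drive evaluations from a Python script instead of the shell, the same switches can be set programmatically. This is a minimal sketch, assuming the variables are read when the model is constructed:
```python
import os

# Equivalent to the shell exports above; set these before the model is created.
os.environ["LMMS_EVAL_USE_CACHE"] = "True"
# Optional: relocate the cache root (defaults to ~/.cache/lmms-eval).
os.environ["LMMS_EVAL_HOME"] = "/path/to/cache_root"
```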
### Where cache files live
- Base directory: `${LMMS_EVAL_HOME:-~/.cache/lmms-eval}/eval_cache/<model_hash>/`
- File name per task and process shard: `{task_name}_rank{rank}_world_size{world_size}.jsonl`
- Record format per line:
```json
{"doc_id": <doc_id>, "response": <string>}
```
Notes:
- The `<model_hash>` is derived from a best-effort, human-readable model identity (e.g., `model_version`) and the set of task names attached to the model, to avoid collisions.
- Separate files per `rank` and `world_size` make distributed runs safe to cache concurrently.
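To make this layout concrete, the snippet below reads one task shard back into a dict mapping `doc_id` to response. `load_cached_responses` is a hypothetical helper, not part of lmms-eval's API; it only assumes the directory structure and record format described above.
```python
import json
import os
from pathlib import Path

def load_cached_responses(model_hash: str, task_name: str, rank: int = 0, world_size: int = 1) -> dict:
    # Hypothetical reader for the JSONL layout described above (not an lmms-eval API).
    base = Path(os.environ.get("LMMS_EVAL_HOME", str(Path.home() / ".cache" / "lmms-eval")))
    cache_file = base / "eval_cache" / model_hash / f"{task_name}_rank{rank}_world_size{world_size}.jsonl"
    responses = {}
    if cache_file.exists():
        with cache_file.open() as f:
            for line in f:
                record = json.loads(line)
                responses[record["doc_id"]] = record["response"]
    return responses
```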
### How it works at runtime
For models wired to the cache API (e.g., `async_openai_compatible_chat`):
- At the beginning of `generate_until(...)`, the model calls `load_cache()` and then `get_response_from_cache(requests)`.
- Cached items are returned immediately; only the remaining requests are forwarded to the backend.
- After each response is produced, `add_request_response_to_cache(...)` appends a JSONL record.
The cache key is the tuple `(task_name, doc_id)`. Ensure your task produces stable `doc_id`s across runs.
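Conceptually, the split behaves like the sketch below. The request attribute names are illustrative assumptions and may not match lmms-eval's internal request objects.
```python
# Illustrative only: split requests using the (task_name, doc_id) cache key.
# Attribute names on the request objects are assumptions, not lmms-eval's actual fields.
def split_by_cache(requests, cache):
    cached, pending = [], []
    for req in requests:
        key = (req.task_name, req.doc_id)
        if key in cache:
            cached.append({"doc_id": req.doc_id, "response": cache[key]})
        else:
            pending.append(req)
    return cached, pending
```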
### Example: use with async_openai_compatible_chat
```bash
export OPENAI_API_BASE="http://localhost:8000/v1"
export OPENAI_API_KEY="EMPTY"     # if your server allows it
export LMMS_EVAL_USE_CACHE=True   # enable the JSONL cache
# optional: export LMMS_EVAL_HOME to relocate the cache root
python -m lmms_eval \
    --model async_openai \
    --model_args model_version=grok-2-latest,base_url=$OPENAI_API_BASE,api_key=$OPENAI_API_KEY \
    --tasks <your_task> \
    --batch_size 1 \
    --output_path ./logs/
```
On a second run with the same task/docs, cached responses will be loaded and only missing documents will call the model.
### Inspect or clear the cache
- Inspect: open the task's JSONL file(s) under the model's cache directory and view the records, as in the sketch below.
- Clear: delete the corresponding JSONL file(s), or the entire `<model_hash>` directory, to force re-evaluation.
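For example, a small script like the following can list records and remove a task's shards. The path assumes the default cache root and a placeholder `<model_hash>`; substitute your own values.
```python
import json
from pathlib import Path

# Assumes the default cache root; substitute your own LMMS_EVAL_HOME and model hash.
cache_dir = Path.home() / ".cache" / "lmms-eval" / "eval_cache" / "<model_hash>"

# Inspect: print a summary and the first few records of each task shard.
for jsonl_file in sorted(cache_dir.glob("*.jsonl")):
    records = [json.loads(line) for line in jsonl_file.read_text().splitlines() if line.strip()]
    print(f"{jsonl_file.name}: {len(records)} records")
    for record in records[:3]:
        print("  ", record["doc_id"], str(record["response"])[:60])

# Clear: uncomment to delete one task's shard files and force re-evaluation.
# for jsonl_file in cache_dir.glob("<your_task>_rank*_world_size*.jsonl"):
#     jsonl_file.unlink()
```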
### Notes and limitations
- The JSONL cache is keyed by `task_name` and `doc_id`. Changing task names or document IDs invalidates reuse.
- Responses are cached as final strings. If your model emits intermediate tool calls, the final message (including any inline annotations) is what gets cached.
- Distributed runs write to per-rank files to avoid contention; reusing the cache works across single- and multi-GPU runs as long as `task_name`/`doc_id` match.
### Optional: legacy SQLite cache wrapper
There is also a separate, optional wrapper, `CachingLMM` (see `lmms_eval.api.model.CachingLMM`), that caches by hashing the full call arguments and storing results in a SQLite database (via `SqliteDict`). It is independent of the JSONL cache above and can be useful for broader API-level caching. For most users, enabling `LMMS_EVAL_USE_CACHE=True` is sufficient and simpler.
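The underlying idea can be sketched as follows. This is a conceptual illustration of argument-hash caching with `SqliteDict`, not `CachingLMM`'s actual interface.
```python
import hashlib
import json
from sqlitedict import SqliteDict

def cached_call(db_path, fn, *args, **kwargs):
    # Key the cache on a hash of the full call arguments (conceptual sketch only).
    payload = json.dumps([fn.__name__, args, kwargs], sort_keys=True, default=str)
    key = hashlib.sha256(payload.encode()).hexdigest()
    with SqliteDict(db_path, autocommit=True) as db:
        if key in db:
            return db[key]
        result = fn(*args, **kwargs)
        db[key] = result
        return result
```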