	# Cartridges

	Cartridges are a prompt-learning method that stores a compressed long-context representation as a parameterized KV-cache
	prefix. The core idea comes from the paper
	[Cartridges: Lightweight and general-purpose long context representations via self-study](https://huggingface.co/papers/2506.06266).

	For a high-level overview and motivation, see the blog post
	[Cartridges: Storing long contexts in tiny caches with self-study](https://hazyresearch.stanford.edu/blog/2025-06-08-cartridges).

	## How Cartridges differ from Prefix Tuning

	Both Prefix Tuning and Cartridges are served by injecting `past_key_values` (a prefix KV cache) into the base model.

	- Prefix Tuning learns virtual token embeddings (and optionally an MLP projection) and produces a KV prefix.
	- Cartridges learn the KV prefix itself directly (the per-layer key/value vectors for `p` virtual tokens), and are
	designed to be initialized from real prefill KV (for example, the first `p` tokens of a corpus/system prompt).

	The paper also recommends freezing the first token as an attention sink for stability (`num_frozen_tokens=1` is the
	default).

	## Usage (inference)

	Load a trained CARTRIDGE adapter and run generation:

	```py
	from transformers import AutoModelForCausalLM, AutoTokenizer

	from peft import PeftModel

	model_id = "Qwen/Qwen2.5-0.5B-Instruct"
	adapter_path = "path/to/cartridge_adapter"

	base = AutoModelForCausalLM.from_pretrained(model_id)
	model = PeftModel.from_pretrained(base, adapter_path)

	tok = AutoTokenizer.from_pretrained(model_id)
	# Fall back to the EOS token when the tokenizer defines no pad token.
	if tok.pad_token is None:
	    tok.pad_token = tok.eos_token

	out = model.generate(**tok("Question about the corpus:", return_tensors="pt"), max_new_tokens=64)
	print(tok.decode(out[0], skip_special_tokens=True))
	```

	If you need to create and initialize a cartridge before training, see the initialization options below.

	## Initialization options

	The paper discusses a few practical initialization strategies:

	- Random KV (default): create a `CartridgeConfig` and start training. This initializes the KV prefix randomly.
	- KV from the first tokens of a prompt/corpus: use `initialize_kv_prefix_from_text(model, tokenizer, text=...)`. This
	runs a prefill on `text` and copies the resulting KV cache for the first `num_virtual_tokens` into the adapter.
	- KV from an existing cache: use `initialize_kv_prefix_from_past_key_values(model, past_key_values=...)` if you already
	have a `past_key_values` object from a base-model prefill.
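
The second option amounts to a slice-and-flatten step over a prefill cache. The sketch below is not PEFT's implementation: it uses nested lists in place of tensors, assumes each layer's `(key, value)` pair is indexed as `[head][position][dim]` with the batch dimension removed, and assumes a layer-major `[key, value]` ordering inside each flattened row. The helper name `flatten_first_tokens` is hypothetical.

```python
def flatten_first_tokens(past_key_values, num_virtual_tokens):
    """Copy the first `num_virtual_tokens` cached positions into flat rows.

    past_key_values: list of (key, value) pairs, one per layer; each entry
    is a nested list indexed as [head][position][dim].
    Returns `num_virtual_tokens` rows, each of length
    num_layers * 2 * num_heads * head_dim.
    """
    seq_len = len(past_key_values[0][0][0])
    if num_virtual_tokens > seq_len:
        raise ValueError("prefill is shorter than the requested prefix")

    rows = []
    for pos in range(num_virtual_tokens):
        row = []
        for key, value in past_key_values:      # layer-major ordering (assumed)
            for tensor in (key, value):         # key first, then value (assumed)
                for head in tensor:
                    row.extend(head[pos])
        rows.append(row)
    return rows


def toy(layer, kind):
    """Toy tensor: 1 head, 3 cached positions, head_dim 2."""
    return [[[layer * 100 + kind * 10 + pos, 0.5] for pos in range(3)]]


# Two layers, each with a key (kind=0) and a value (kind=1).
cache = [(toy(0, 0), toy(0, 1)), (toy(1, 0), toy(1, 1))]
prefix = flatten_first_tokens(cache, num_virtual_tokens=2)
```

Each row of `prefix` holds one virtual token's keys and values for every layer, which is why a longer prefill can be truncated to the first `num_virtual_tokens` positions without touching the per-row layout.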

	## Training

	The Cartridges paper proposes a SELF-STUDY distillation objective (a frozen base model provides teacher logits; the
	CARTRIDGE adapter is trained so the student matches the teacher’s next-token distribution over the target segment).
	PEFT keeps training logic out of the core library; see the
	[`cartridge_self_study` example](https://github.com/huggingface/peft/tree/main/examples/cartridge_self_study) for a reference workflow.
	The example scripts use the frozen base model as the teacher and the adapted model as the student, so both share the
	same underlying checkpoint.
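
As an illustration of the objective, a generic distillation loss is the KL divergence from the teacher's next-token distribution to the student's, averaged over target positions. The sketch below is plain Python rather than the loss code from the example scripts; the function names are hypothetical, and the paper's exact weighting or temperature may differ.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over one vocabulary-sized list."""
    peak = max(logits)
    log_norm = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return [x - log_norm for x in logits]


def distillation_loss(teacher_logits, student_logits):
    """Mean KL(teacher || student) over target positions.

    Both arguments are lists of per-position logit lists of equal shape.
    """
    total = 0.0
    for t_row, s_row in zip(teacher_logits, student_logits):
        t_logp = log_softmax(t_row)
        s_logp = log_softmax(s_row)
        total += sum(math.exp(tl) * (tl - sl) for tl, sl in zip(t_logp, s_logp))
    return total / len(teacher_logits)


teacher = [[2.0, 0.0, -1.0], [0.5, 0.5, 0.5]]
matching = distillation_loss(teacher, teacher)
mismatched = distillation_loss(teacher, [[0.0, 2.0, -1.0], [0.5, 0.5, 0.5]])
```

The loss is zero when the student reproduces the teacher's distribution and positive otherwise, so minimizing it pushes the adapted model toward the frozen base model's predictions on the target segment.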

	## Composition

	To concatenate independently trained cartridges into a single adapter, use `compose_cartridge_adapters(...)`.
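
Conceptually, concatenating cartridges means joining their KV prefixes along the virtual-token axis. The sketch below shows that idea on flattened prefixes represented as lists of rows. It is an assumption about what composition amounts to, not the body of `compose_cartridge_adapters`, and `concat_prefixes` is a hypothetical name.

```python
def concat_prefixes(*prefixes):
    """Concatenate flattened KV prefixes along the virtual-token axis.

    Each prefix is a list of rows. All rows must share the same width
    (num_layers * 2 * token_dim), which in practice means the cartridges
    were trained against the same base model.
    """
    widths = {len(row) for prefix in prefixes for row in prefix}
    if len(widths) > 1:
        raise ValueError(f"incompatible row widths: {sorted(widths)}")
    return [list(row) for prefix in prefixes for row in prefix]


cartridge_a = [[0.1] * 8 for _ in range(4)]  # 4 virtual tokens
cartridge_b = [[0.2] * 8 for _ in range(6)]  # 6 virtual tokens
combined = concat_prefixes(cartridge_a, cartridge_b)  # 10 virtual tokens
```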

	## CartridgeConfig[[peft.CartridgeConfig]]

	#### peft.CartridgeConfig[[peft.CartridgeConfig]]

	[Source](https://github.com/huggingface/peft/blob/vr_3207/src/peft/tuners/cartridge/config.py#L22)

	Configuration for CARTRIDGE, a KV-cache-parameterized prefix adapter.

	This is similar to prefix-tuning in how it is served (as `past_key_values`), but it stores the KV cache directly as
	trainable parameters instead of learning it via an MLP projection.

	Initialization:
	The Cartridges paper discusses multiple initialization options. In PEFT, initialization is a separate step
	from constructing the adapter config:

	- Random KV initialization (paper option 2): Create the adapter via `get_peft_model(...)`. The CARTRIDGE
	prompt encoder parameters are randomly initialized by PyTorch.

	- KV derived from the first tokens of a prompt/corpus (paper option 3): Run a no-grad prefill on the *base
	model* and copy the first `num_virtual_tokens` cached KV tokens into the adapter. PEFT provides utilities for
	this (importable from `peft` or from `peft.tuners.cartridge.utils`):

	  - `initialize_kv_prefix_from_text(model, tokenizer, text=...)`
	  - `initialize_kv_prefix_from_past_key_values(model, past_key_values=...)`

	If you already have a flattened KV-prefix tensor, you can load it directly via the prompt encoder’s
	`load_prompt_embeddings(...)` method.

	Parameters:

	num_frozen_tokens (`int`, defaults to 1) : Number of prefix tokens at the start of the cartridge to keep frozen (no gradients). The Cartridges paper recommends freezing the first token as an attention sink for stability (set this to `1`), as many LLMs use early tokens as attention sinks and changing them can harm training.

	## CartridgeEncoder[[peft.CartridgeEncoder]]

	#### peft.CartridgeEncoder[[peft.CartridgeEncoder]]

	[Source](https://github.com/huggingface/peft/blob/vr_3207/src/peft/tuners/cartridge/model.py#L20)

	A parameterized prefix KV cache.

	The parameters are stored in the same flattened layout as `PrefixEncoder` output:
	`[num_virtual_tokens, num_layers * 2 * token_dim]`, where `token_dim` is the per-head hidden size times the number
	of heads (after any GQA adjustment performed by `_prepare_prompt_learning_config`).

	If `num_frozen_tokens > 0`, the first `num_frozen_tokens` virtual tokens are stored as a non-trainable parameter,
	and the remaining tokens are trainable.
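
As a worked example of the layout arithmetic, the sketch below computes the flattened shape and the frozen/trainable row split. The model dimensions are illustrative values, not read from any checkpoint, and the helper names are hypothetical; `num_kv_heads` stands in for the head count after the GQA adjustment.

```python
def flattened_shape(num_virtual_tokens, num_layers, num_kv_heads, head_dim):
    """Return (rows, columns) of the flattened KV-prefix parameter."""
    token_dim = num_kv_heads * head_dim
    return num_virtual_tokens, num_layers * 2 * token_dim


def split_rows(rows, num_frozen_tokens):
    """Split prefix rows into (frozen, trainable) segments."""
    if not 0 <= num_frozen_tokens <= len(rows):
        raise ValueError("num_frozen_tokens out of range")
    return rows[:num_frozen_tokens], rows[num_frozen_tokens:]


# Illustrative dimensions only.
shape = flattened_shape(num_virtual_tokens=256, num_layers=24, num_kv_heads=2, head_dim=64)
frozen, trainable = split_rows(list(range(shape[0])), num_frozen_tokens=1)
```

With these numbers the parameter has 256 rows of 6144 values each, of which the first row is frozen and the remaining 255 are trainable.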

	#### load_prompt_embeddings[[peft.CartridgeEncoder.load_prompt_embeddings]]

	[Source](https://github.com/huggingface/peft/blob/vr_3207/src/peft/tuners/cartridge/model.py#L89)

	`load_prompt_embeddings(prompt_embeddings: torch.Tensor)`

	Load the flattened prompt embeddings saved by PEFT (`prompt_embeddings`).

	PEFT saves prompt-learning adapters as a single `prompt_embeddings` tensor. For CARTRIDGE, we split that tensor
	into frozen and trainable segments according to `self.num_frozen_tokens`.
