---
library_name: transformers
tags:
- custom_generate
---

## Description

Implementation of the KV cache quantization method introduced in the [SQuat paper (COLM 2025)](https://arxiv.org/abs/2503.24358). SQuat (Subspace-orthogonal KV cache quantization) reduces the memory and compute cost of storing the KV cache by carefully quantizing the key tensors. It constructs a task-relevant subspace and ensures that quantization errors remain orthogonal to it, thereby minimizing their impact on attention outputs. SQuat is training-free, calibration-free, and operates on-the-fly, with strong theoretical grounding and state-of-the-art empirical results.
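
For intuition, here is a toy numerical sketch (not the SQuat algorithm itself): attention scores are inner products between queries and keys, so only the component of a key's quantization error that lies in the span of the queries can perturb them. An error kept orthogonal to that subspace is effectively invisible to attention:

```py
# Toy illustration (NOT the actual SQuat algorithm): only the part of a key's
# quantization error inside the query subspace changes the attention scores q @ k.
import torch

torch.manual_seed(0)
d, r = 128, 10                                    # head dim, subspace dimension
queries = torch.randn(32, r) @ torch.randn(r, d)  # queries spanning an r-dim subspace
k = torch.randn(d)                                # a key vector we would like to quantize

# Orthonormal basis of the query subspace (top-r right singular vectors).
U = torch.linalg.svd(queries, full_matrices=False).Vh[:r].T  # (d, r)

e = 0.1 * torch.randn(d)    # a generic quantization error
e_orth = e - U @ (U.T @ e)  # the same error, projected orthogonal to the subspace

q = queries[0]
print(torch.abs(q @ (k + e) - q @ k))       # generic error perturbs the attention score
print(torch.abs(q @ (k + e_orth) - q @ k))  # orthogonal error leaves it essentially unchanged
```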

This repo provides a partial implementation of SQuat via a custom `SQuatCache` class. It requires passing an additional `query_states` input to `.update()`. To support this, you can monkey-patch the `LlamaAttention.forward` method; see the example usage below.
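
Concretely, the patched forward differs from the stock Llama attention only in its cache update call, where the rotated queries and the attention mask ride along in `cache_kwargs`. This is an excerpt of the full patch reproduced under "Example usage" below:

```py
# Excerpt from the patched LlamaAttention.forward (full version under "Example usage"):
cache_kwargs = {
    "sin": sin,
    "cos": cos,
    "cache_position": cache_position,
    "query_states": query_states,      # extra input consumed by SQuatCache.update()
    "attention_mask": attention_mask,  # extra input consumed by SQuatCache.update()
}
key_states, value_states = past_key_value.update(key_states, value_states, self.layer_idx, cache_kwargs)
```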

For the full implementation, please refer to the [original repository](https://github.com/Red-Hat-AI-Innovation-Team/SQuat).

## Base model

`meta-llama/Llama-3.1-8B-Instruct`

## Model compatibility

Most models. More specifically, any `transformers` LLM/VLM trained for causal language modeling. Note that the monkey patch shown below targets `LlamaAttention`; other architectures require an analogous patch so that `query_states` reaches the cache update.

## Additional Arguments

- `backend` (`str`, *optional*): quantization backend used for the KV cache. Defaults to `"quanto"`.
- `nbits` (`int`, *optional*): number of bits used for quantization. Defaults to `2`.
- `quant_group_size` (`int`, *optional*): quantization group size, i.e. the number of values that share quantization parameters. Defaults to `64`.
- `residual_length` (`int`, *optional*): number of most recent tokens kept in full precision before being quantized. Defaults to `32`.
- `squat_lambda` (`float`, *optional*): weight of SQuat's subspace-orthogonality term when quantizing keys. Defaults to `0.001`.
- `subspace_dim` (`int`, *optional*): dimension of the task-relevant subspace. Defaults to `10`.
- `shared_svd` (`bool`, *optional*): whether to use a shared SVD when constructing the subspace. Defaults to `True`.
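
These are passed as extra keyword arguments to `generate()` alongside `custom_generate`. As a sketch, reusing the `model` and `inputs` defined under "Example usage" below, and assuming the usual custom-generate convention of forwarding extra `generate()` kwargs to the custom decoding loop (all values shown are the documented defaults):

```py
gen_out = model.generate(
    **inputs,
    custom_generate="ligongh/squat",
    trust_remote_code=True,
    backend="quanto",        # quantization backend
    nbits=2,                 # bits per quantized value
    quant_group_size=64,     # quantization group size
    residual_length=32,      # most recent tokens kept in full precision
    squat_lambda=0.001,      # weight of the subspace-orthogonality term
    subspace_dim=10,         # dimension of the task-relevant subspace
    shared_svd=True,         # share the SVD used to build the subspace
)
```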

## Output Type changes

(none)

## Example usage

```py
import torch
from typing import Callable, Optional, Tuple

import transformers
from transformers import AutoTokenizer, AutoModelForCausalLM
from transformers.cache_utils import Cache
from transformers.modeling_flash_attention_utils import FlashAttentionKwargs
from transformers.modeling_utils import ALL_ATTENTION_FUNCTIONS
from transformers.models.llama.modeling_llama import apply_rotary_pos_emb, eager_attention_forward, logger
from transformers.processing_utils import Unpack


def llama_attn_forward(
    self,
    hidden_states: torch.Tensor,
    position_embeddings: Tuple[torch.Tensor, torch.Tensor],
    attention_mask: Optional[torch.Tensor],
    past_key_value: Optional[Cache] = None,
    cache_position: Optional[torch.LongTensor] = None,
    **kwargs: Unpack[FlashAttentionKwargs],
) -> Tuple[torch.Tensor, Optional[torch.Tensor], Optional[Tuple[torch.Tensor]]]:
    input_shape = hidden_states.shape[:-1]
    hidden_shape = (*input_shape, -1, self.head_dim)

    query_states = self.q_proj(hidden_states).view(hidden_shape).transpose(1, 2)
    key_states = self.k_proj(hidden_states).view(hidden_shape).transpose(1, 2)
    value_states = self.v_proj(hidden_states).view(hidden_shape).transpose(1, 2)

    cos, sin = position_embeddings
    query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin)

    if past_key_value is not None:
        # sin and cos are specific to RoPE models; cache_position is needed for the static cache.
        # The only change vs. the stock forward: query_states and attention_mask are also passed
        # along so that SQuatCache can use them in .update().
        cache_kwargs = {
            "sin": sin,
            "cos": cos,
            "cache_position": cache_position,
            "query_states": query_states,
            "attention_mask": attention_mask,
        }
        key_states, value_states = past_key_value.update(key_states, value_states, self.layer_idx, cache_kwargs)

    attention_interface: Callable = eager_attention_forward
    if self.config._attn_implementation != "eager":
        if self.config._attn_implementation == "sdpa" and kwargs.get("output_attentions", False):
            logger.warning_once(
                "`torch.nn.functional.scaled_dot_product_attention` does not support `output_attentions=True`. Falling back to "
                'eager attention. This warning can be removed using the argument `attn_implementation="eager"` when loading the model.'
            )
        else:
            attention_interface = ALL_ATTENTION_FUNCTIONS[self.config._attn_implementation]

    attn_output, attn_weights = attention_interface(
        self,
        query_states,
        key_states,
        value_states,
        attention_mask,
        dropout=0.0 if not self.training else self.attention_dropout,
        scaling=self.scaling,
        **kwargs,
    )

    attn_output = attn_output.reshape(*input_shape, -1).contiguous()
    attn_output = self.o_proj(attn_output)
    return attn_output, attn_weights


def replace_llama():
    # Monkey-patch LlamaAttention so every layer forwards query_states to the cache update.
    transformers.models.llama.modeling_llama.LlamaAttention.forward = llama_attn_forward


replace_llama()

tokenizer = AutoTokenizer.from_pretrained('meta-llama/Llama-3.1-8B-Instruct')
model = AutoModelForCausalLM.from_pretrained('meta-llama/Llama-3.1-8B-Instruct', device_map="auto")

inputs = tokenizer(["I like rock music because"], return_tensors="pt").to(model.device)

gen_out = model.generate(**inputs, custom_generate="ligongh/squat", trust_remote_code=True)
print(tokenizer.batch_decode(gen_out))
```
|