	---
	license: apache-2.0
	tags:
	- oscar
	- int2
	- kv-cache
	- quantization
	- rotation
	- sglang
	---

	<p align="center">
	<img src="https://huggingface.co/Zhongzhu/OSCAR-RotationZoo/resolve/main/oscar_logo_kv_transparent.png" alt="OSCAR INT2 KV-Cache" width="180"/>
	</p>

	# OSCAR RotationZoo

	Precomputed K/V rotation matrices for OSCAR INT2 KV-cache quantization.

	- 📄 Paper — [OSCAR: Offline Spectral Covariance-Aware Rotation for 2-bit KV Cache Quantization](https://arxiv.org/pdf/2605.17757)
	- 🌐 Website — https://oscar-quantize.github.io/
	- 💻 Code — https://github.com/FutureMLS-Lab/OSCAR

	OSCAR captures Q/K/V activations on a small calibration set, estimates
	attention-aware K/V covariance offline, and derives per-layer orthogonal
	rotations that align INT2 quantization with the directions attention actually
	consumes. The result is ~7× compression of the KV-cache memory footprint with
	a single-digit percentage-point accuracy drop on GPQA for dense reasoning models.

	This repo packages the rotations as drop-in `.pt` files so you don't need to
	re-run the Q/K/V dump and eigendecomposition yourself.

	## Available rotations

	| Model | Calibration | GPQA (BF16) | GPQA (OSCAR INT2) |
	|---|---|---:|---:|
	| `Qwen/Qwen3-4B-Thinking-2507` | `seq20000_prompt83_group128` | 67.27 | 67.17 |
	| `Qwen/Qwen3-4B-Thinking-2507` | `seq20000_prompt85_group128` (fresh re-dump) | 67.27 | — |
	| `Qwen/Qwen3-8B` | `seq20000_prompt83_group128` | 56.67 | 55.56 |
	| `Qwen/Qwen3-32B` | `seq16000_prompt69_group128` | 58.49 | 60.40 |
	| `zai-org/GLM-4.7-FP8` | `seq10000_prompt43_group128` | 73.23 | 73.57 |

	`seq<T>_prompt<N>_group<G>` notation: `T` = total calibration tokens,
	`N` = calibration prompt count, `G` = INT2 quant group size along head_dim.
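
If you script over several calibration directories, the notation can be parsed mechanically. The following is a minimal sketch; `CalibrationTag` and `parse_calibration_tag` are hypothetical helper names and are not part of the OSCAR codebase.

```python
import re
from typing import NamedTuple


class CalibrationTag(NamedTuple):
    total_tokens: int   # T: total calibration tokens
    prompt_count: int   # N: calibration prompt count
    group_size: int     # G: INT2 quant group size along head_dim


_TAG_RE = re.compile(r"^seq(\d+)_prompt(\d+)_group(\d+)$")


def parse_calibration_tag(tag: str) -> CalibrationTag:
    """Split a seq<T>_prompt<N>_group<G> directory name into its fields."""
    match = _TAG_RE.match(tag)
    if match is None:
        raise ValueError(f"not a seq<T>_prompt<N>_group<G> tag: {tag!r}")
    return CalibrationTag(*(int(g) for g in match.groups()))


print(parse_calibration_tag("seq20000_prompt83_group128"))
# CalibrationTag(total_tokens=20000, prompt_count=83, group_size=128)
```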

	## File format

	Each rotation directory contains:

	- `k_rotation_qqt_r_h_pbr.pt` — K-side rotation `R_K = R · H · P_br` where
	`R = eigvec(Σ_Q)` is fit on Q's attention-aware covariance, `H` is a
	head-dim Hadamard, and `P_br` is the eigenvalue-sorted bit-reversal
	permutation
	- `v_rotation_sst_r_h_pbr.pt` — V-side rotation built on the score-weighted
	V covariance `Σ_V = V^T diag(K^T (Q^T Q) K) V`
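
To make the K-side composition concrete, here is a small NumPy sketch of `R · H · P_br` on random data. It is an illustration under assumptions, not the repo's implementation: it treats rows as tokens, uses a plain `Q^T Q / T` covariance in place of the attention-aware estimate, sorts eigenvectors by descending eigenvalue, and uses a Sylvester-construction Hadamard. All function names are hypothetical. The property it checks is the general one: because each factor is orthogonal, applying the same matrix to queries and keys leaves the attention logits unchanged.

```python
import numpy as np


def hadamard(n: int) -> np.ndarray:
    """Normalized Sylvester Hadamard matrix; n must be a power of two."""
    if n < 1 or n & (n - 1):
        raise ValueError("n must be a power of two")
    h = np.ones((1, 1))
    while h.shape[0] < n:
        h = np.block([[h, h], [h, -h]])
    return h / np.sqrt(n)


def bit_reversal_permutation(n: int) -> np.ndarray:
    """Permutation matrix whose column order is the bit-reversed index order."""
    bits = n.bit_length() - 1
    perm = [int(format(i, f"0{bits}b")[::-1], 2) for i in range(n)]
    return np.eye(n)[:, perm]


def fit_k_rotation(q: np.ndarray) -> np.ndarray:
    """Sketch of R_K = R @ H @ P_br with R = eigvec(Sigma_Q)."""
    d = q.shape[1]
    sigma_q = q.T @ q / q.shape[0]            # assumption: plain second moment
    eigvals, eigvecs = np.linalg.eigh(sigma_q)  # eigh returns ascending order
    r = eigvecs[:, np.argsort(eigvals)[::-1]]   # assumption: sort descending
    return r @ hadamard(d) @ bit_reversal_permutation(d)


rng = np.random.default_rng(0)
head_dim = 8
q = rng.standard_normal((64, head_dim)) * np.linspace(0.1, 3.0, head_dim)
k = rng.standard_normal((64, head_dim))
r_k = fit_k_rotation(q)

# The product of orthogonal factors is orthogonal ...
assert np.allclose(r_k.T @ r_k, np.eye(head_dim))
# ... so rotating both Q and K preserves the attention logits Q K^T.
assert np.allclose((q @ r_k) @ (k @ r_k).T, q @ k.T)
```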

	File layout (PyTorch state-dict):
	```python
	{
	    "format_version": 1,
	    "objective": "qqt_r_h_pbr",  # or "sst_r_h_pbr" for V
	    "source_grouping": "layer",
	    "layers": {
	        0: {"layer_id": 0, "rotation": tensor(head_dim, head_dim)},
	        1: {"layer_id": 1, "rotation": tensor(head_dim, head_dim)},
	        ...
	    },
	}
	```
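
A quick sanity check after downloading can catch a truncated or mismatched file early. The sketch below assumes the layout shown above and that each stored rotation is orthogonal, as described in the introduction. `load_rotation_file` and `check_rotation_payload` are hypothetical helpers, and the tolerance `atol=1e-3` is a guess, since the stored dtype is not stated here.

```python
from typing import Any, Mapping


def load_rotation_file(path: str) -> Mapping[str, Any]:
    """Load one rotation file onto the CPU (torch is imported lazily)."""
    import torch

    return torch.load(path, map_location="cpu")


def check_rotation_payload(payload: Mapping[str, Any], atol: float = 1e-3) -> int:
    """Validate the documented layout and return the number of layers."""
    for key in ("format_version", "objective", "source_grouping", "layers"):
        if key not in payload:
            raise KeyError(f"missing top-level key: {key}")
    layers = payload["layers"]
    for layer_id, entry in layers.items():
        if entry["layer_id"] != layer_id:
            raise ValueError(f"layer {layer_id}: layer_id field does not match key")
        rot = entry["rotation"]
        rows, cols = rot.shape
        if rows != cols:
            raise ValueError(f"layer {layer_id}: rotation is not square")
        # Orthogonality: R^T R should be close to the identity matrix.
        gram = rot.T @ rot
        for i in range(rows):
            for j in range(cols):
                target = 1.0 if i == j else 0.0
                if abs(float(gram[i, j]) - target) > atol:
                    raise ValueError(f"layer {layer_id}: rotation is not orthogonal")
    return len(layers)
```

For example, `check_rotation_payload(load_rotation_file("k_rotation_qqt_r_h_pbr.pt"))` should return the model's layer count if the file matches the layout.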

	## How to use

	### 1. Download the rotation for your model

	```bash
	pip install huggingface_hub
	```

	```python
	from huggingface_hub import snapshot_download
	snapshot_download(
	    repo_id="Zhongzhu/OSCAR-RotationZoo",
	    allow_patterns="Qwen3-8B/**",
	    local_dir="./oscar_rotations",
	)
	# rotations now at ./oscar_rotations/Qwen3-8B/seq20000_prompt83_group128/
	```
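
To confirm the download produced a usable rotation directory, you can look for folders that contain both files named in the File format section. This is a small standard-library sketch; `find_rotation_dirs` is a hypothetical helper.

```python
from pathlib import Path

K_FILE = "k_rotation_qqt_r_h_pbr.pt"
V_FILE = "v_rotation_sst_r_h_pbr.pt"


def find_rotation_dirs(root):
    """Return every directory under root that holds both the K and V rotation."""
    root = Path(root)
    return sorted(
        k_path.parent
        for k_path in root.rglob(K_FILE)
        if (k_path.parent / V_FILE).is_file()
    )


for rot_dir in find_rotation_dirs("./oscar_rotations"):
    print(rot_dir)
```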

	### 2. Serve with sglang-research using the rotation

	Clone https://github.com/FutureMLS-Lab/OSCAR and set up the single `oscar`
	conda env, then point the eval driver at your downloaded rotation:

	```bash
	ROT_DIR=./oscar_rotations/Qwen3-8B/seq20000_prompt83_group128 \
	bash rotation/qwen3-8B/eval_gpqa.sh
	```

	The driver internally launches sglang with these flags:

	```bash
	SGLANG_ENABLE_MIXED_KV_WINDOWS=1 \
	SGLANG_OSCAR_ROTATION_MODE=oscar \
	SGLANG_OSCAR_K_ROTATION_PATH=$ROT_DIR/k_rotation_qqt_r_h_pbr.pt \
	SGLANG_OSCAR_V_ROTATION_PATH=$ROT_DIR/v_rotation_sst_r_h_pbr.pt \
	SGLANG_OSCAR_K_CLIP_RATIO=0.96 \
	SGLANG_OSCAR_V_CLIP_RATIO=0.92 \
	SGLANG_OSCAR_ABSORB_V_ROTATION=1 \
	SGLANG_MIXED_KV_PREFIX_TOKENS=64 \
	SGLANG_MIXED_KV_RECENT_TOKENS=256 \
	HADAMARD_ORDER=128 \
	python -m sglang.launch_server \
	  --model-path Qwen/Qwen3-8B \
	  --tensor-parallel-size 1 \
	  --kv-cache-dtype int2 \
	  --kv-cache-quant-group-size 128 \
	  --prefill-attention-backend fa3 \
	  --decode-attention-backend triton \
	  --disable-radix-cache \
	  --disable-custom-all-reduce \
	  --trust-remote-code
	```

	The first 64 tokens (attention sinks, `SGLANG_MIXED_KV_PREFIX_TOKENS=64`) and
	the most recent 256 tokens (`SGLANG_MIXED_KV_RECENT_TOKENS=256`) stay in BF16;
	the rest of the KV cache is rotated with these matrices and quantized to INT2
	in 128-element groups along head_dim.
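
As a back-of-the-envelope check on memory, the sketch below estimates the effective compression for a given sequence length. It rests on an assumption that is not stated in this README: each 128-element group carries 32 bits of metadata (for example a 16-bit scale and a 16-bit zero point). Under that assumption the INT2 region costs 2.25 bits per element, so the ratio approaches 16 / 2.25 ≈ 7.1× for long sequences and is lower for short ones, where the BF16 windows dominate. `kv_compression_ratio` is a hypothetical helper.

```python
def kv_compression_ratio(
    seq_len: int,
    prefix_tokens: int = 64,
    recent_tokens: int = 256,
    group_size: int = 128,
    bits_per_value: int = 2,
    metadata_bits_per_group: int = 32,  # assumption: fp16 scale + fp16 zero point
    baseline_bits: int = 16,            # BF16 baseline
) -> float:
    """Estimate the KV-cache compression ratio versus an all-BF16 cache."""
    # Tokens kept in BF16: the sink prefix plus the recent window.
    bf16_tokens = min(seq_len, prefix_tokens + recent_tokens)
    int2_tokens = seq_len - bf16_tokens
    # Per-element cost in the quantized region, including group metadata.
    quant_bits = bits_per_value + metadata_bits_per_group / group_size
    avg_bits = (bf16_tokens * baseline_bits + int2_tokens * quant_bits) / seq_len
    return baseline_bits / avg_bits


for n in (1_000, 20_000, 1_000_000):
    print(n, round(kv_compression_ratio(n), 2))
```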

	## Reproducing from scratch

	If you want to fit your own rotation on a different calibration set, the
	OSCAR pipeline is end-to-end reproducible:

	```bash
	git clone https://github.com/FutureMLS-Lab/OSCAR.git
	cd OSCAR
	bash rotation/qwen3-8B/save_qkv_8b.sh # phase 1 — dump Q/K/V
	bash rotation/qwen3-8B/compute_rotation.sh # phase 2 — fit R = eigvec(Σ_Q)
	```

	## Citation

	```bibtex
	@article{zhou2026oscar,
	  title = {OSCAR: Offline Spectral Covariance-Aware Rotation for 2-bit KV Cache Quantization},
	  author = {Zhou, Zhongzhu and Zhuang, Donglin and Li, Jisen and Chen, Ziyan and Song, Shuaiwen Leon and Athiwaratkun, Ben and Wu, Xiaoxia},
	  year = {2026},
	  note = {Together AI; University of Sydney; UIUC},
	}
	```