OSCAR INT2 KV-Cache

OSCAR RotationZoo

Precomputed K/V rotation matrices for OSCAR INT2 KV-cache quantization.

This repository contains the artifacts for the paper:

OSCAR: Offline Spectral Covariance-Aware Rotation for 2-bit KV Cache Quantization
Zhongzhu Zhou, Donglin Zhuang, Jisen Li, Ziyan Chen, Shuaiwen Leon Song, Ben Athiwaratkun, Xiaoxia Wu

📄 Paper — arXiv:2605.17757
🌐 Website — https://oscar-quantize.github.io/
💻 Code — https://github.com/FutureMLS-Lab/OSCAR

OSCAR captures Q/K/V activations on a small calibration set, estimates attention-aware K/V covariance offline, and derives per-layer orthogonal rotations that align INT2 quantization with the directions attention actually consumes. The result is roughly 7× compression of the KV-cache memory footprint with a single-digit percentage-point (pp) accuracy drop on GPQA for dense reasoning models (at most 1.11 pp for the models listed under "Available rotations").
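To see where a figure near 7× can come from, the arithmetic below compares 16-bit BF16 storage with 2-bit codes plus per-group metadata. It is a back-of-the-envelope sketch, not the paper's accounting: the 32 bits of metadata per group (one FP16 scale and one FP16 zero-point) are an assumption, and any tokens kept at higher precision are ignored.

```python
def effective_bits_per_element(code_bits: int, group_size: int,
                               metadata_bits_per_group: int) -> float:
    """Average storage per cached element, including per-group metadata."""
    return code_bits + metadata_bits_per_group / group_size


def compression_ratio(baseline_bits: int, code_bits: int, group_size: int,
                      metadata_bits_per_group: int) -> float:
    """How many times smaller the quantized cache is than the baseline."""
    return baseline_bits / effective_bits_per_element(
        code_bits, group_size, metadata_bits_per_group
    )


# Assumption: FP16 scale + FP16 zero-point per group of 128 elements (32 bits).
ratio = compression_ratio(baseline_bits=16, code_bits=2, group_size=128,
                          metadata_bits_per_group=32)
print(f"{ratio:.2f}x")  # 7.11x
```

Under these assumptions each element costs 2.25 bits on average, which is why the ratio lands slightly above 7 rather than at the ideal 8.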

This repo packages the rotations as drop-in .pt files so you don't need to re-run the Q/K/V dump and eigendecomposition yourself.

Available rotations

Model	Calibration	GPQA (BF16)	GPQA (OSCAR INT2)
`Qwen/Qwen3-4B-Thinking-2507`	`seq20000_prompt83_group128`	67.27	67.17
`Qwen/Qwen3-4B-Thinking-2507`	`seq20000_prompt85_group128` (fresh re-dump)	67.27	—
`Qwen/Qwen3-8B`	`seq20000_prompt83_group128`	56.67	55.56
`Qwen/Qwen3-32B`	`seq16000_prompt69_group128`	58.49	60.40
`zai-org/GLM-4.7-FP8`	`seq10000_prompt43_group128`	73.23	73.57

seq<T>_prompt<N>_group<G> notation: T = total calibration tokens, N = calibration prompt count, G = INT2 quant group size along head_dim.
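The directory names can be parsed mechanically when a script needs to pick a rotation by group size or calibration budget. The helper below is an illustrative sketch; `CalibrationTag` and `parse_calibration_tag` are hypothetical names, not part of the OSCAR codebase.

```python
import re
from typing import NamedTuple


class CalibrationTag(NamedTuple):
    total_tokens: int   # T: total calibration tokens
    prompt_count: int   # N: number of calibration prompts
    group_size: int     # G: INT2 quant group size along head_dim


_TAG_RE = re.compile(r"^seq(\d+)_prompt(\d+)_group(\d+)$")


def parse_calibration_tag(tag: str) -> CalibrationTag:
    """Split a seq<T>_prompt<N>_group<G> directory name into its fields."""
    match = _TAG_RE.match(tag)
    if match is None:
        raise ValueError(f"not a seq<T>_prompt<N>_group<G> tag: {tag!r}")
    return CalibrationTag(*(int(g) for g in match.groups()))


tag = parse_calibration_tag("seq20000_prompt83_group128")
print(tag.total_tokens, tag.prompt_count, tag.group_size)  # 20000 83 128
```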

File format

Each rotation directory contains:

k_rotation_qqt_r_h_pbr.pt — K-side rotation R_K = R · H · P_br where R = eigvec(Σ_Q) is fit on Q's attention-aware covariance, H is a head-dim Hadamard, and P_br is the eigenvalue-sorted bit-reversal permutation
v_rotation_sst_r_h_pbr.pt — V-side rotation built on the score-weighted V covariance Σ_V = V^T diag(K^T (Q^T Q) K) V
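The two data-independent factors of the composition, H and P_br, can be sketched in plain Python to show that their product stays orthogonal. This is a simplified illustration under stated assumptions: H is taken to be the normalized Sylvester Hadamard matrix, P_br is taken to be the plain index bit-reversal permutation (the eigenvalue sorting mentioned above is not modeled), and R is left out because it depends on the calibration covariance. All function names are hypothetical.

```python
import math


def sylvester_hadamard(n: int) -> list[list[float]]:
    """Orthonormal n x n Hadamard matrix; n must be a power of two."""
    if n < 1 or n & (n - 1):
        raise ValueError("n must be a power of two")
    h = [[1.0]]
    while len(h) < n:
        # Sylvester step: [[H, H], [H, -H]]
        h = [row + row for row in h] + [row + [-x for x in row] for row in h]
    scale = 1.0 / math.sqrt(n)
    return [[x * scale for x in row] for row in h]


def bit_reversal_permutation(n: int) -> list[int]:
    """Map each index to the index whose binary digits are reversed."""
    if n < 1 or n & (n - 1):
        raise ValueError("n must be a power of two")
    bits = n.bit_length() - 1
    return [int(format(i, f"0{bits}b")[::-1], 2) for i in range(n)]


def permute_columns(m: list[list[float]], perm: list[int]) -> list[list[float]]:
    """Right-multiply m by the permutation matrix described by perm."""
    return [[row[j] for j in perm] for row in m]


def is_orthogonal(m: list[list[float]], tol: float = 1e-9) -> bool:
    """Check that the rows of m are orthonormal, i.e. m @ m.T == I."""
    for i, row_i in enumerate(m):
        for j, row_j in enumerate(m):
            dot = sum(a * b for a, b in zip(row_i, row_j))
            if abs(dot - (1.0 if i == j else 0.0)) > tol:
                return False
    return True


print(bit_reversal_permutation(8))  # [0, 4, 2, 6, 1, 5, 3, 7]
hp = permute_columns(sylvester_hadamard(8), bit_reversal_permutation(8))
print(is_orthogonal(hp))  # True
```

Because a product of orthogonal matrices is orthogonal, multiplying this H · P_br by any orthogonal eigenvector basis R keeps the full rotation orthogonal.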

File layout (PyTorch state-dict):

{
  "format_version": 1,
  "objective":      "qqt_r_h_pbr",      # or "sst_r_h_pbr" for V
  "source_grouping": "layer",
  "layers": {
    0:  {"layer_id": 0,  "rotation": tensor(head_dim, head_dim)},
    1:  {"layer_id": 1,  "rotation": tensor(head_dim, head_dim)},
    ...
  }
}
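A quick sanity check of a downloaded file can be written against the layout above. The sketch below assumes each `layer_id` matches its dictionary key and that each rotation is square, as the listing suggests; `summarize_rotation_payload` and `load_rotation` are hypothetical helper names, and PyTorch is imported lazily so the checker itself has no dependencies.

```python
def summarize_rotation_payload(payload: dict) -> dict:
    """Validate the documented keys and return a short summary."""
    for key in ("format_version", "objective", "source_grouping", "layers"):
        if key not in payload:
            raise KeyError(f"missing top-level key: {key}")

    shapes = set()
    for layer_key, entry in payload["layers"].items():
        # Assumption: layer_id mirrors the dictionary key, as in the listing.
        if entry["layer_id"] != layer_key:
            raise ValueError(f"layer_id mismatch at key {layer_key!r}")
        shape = tuple(entry["rotation"].shape)
        if len(shape) != 2 or shape[0] != shape[1]:
            raise ValueError(f"layer {layer_key!r}: rotation is not square: {shape}")
        shapes.add(shape)

    return {
        "format_version": payload["format_version"],
        "objective": payload["objective"],
        "num_layers": len(payload["layers"]),
        "rotation_shapes": sorted(shapes),
    }


def load_rotation(path: str) -> dict:
    """Load a rotation file onto the CPU (requires PyTorch)."""
    import torch  # lazy import: only needed when actually reading a .pt file
    return torch.load(path, map_location="cpu")


# Example (needs PyTorch and a downloaded file):
# print(summarize_rotation_payload(load_rotation("k_rotation_qqt_r_h_pbr.pt")))
```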

How to use

1. Download the rotation for your model

pip install huggingface_hub

from huggingface_hub import snapshot_download
snapshot_download(
    repo_id="Zhongzhu/OSCAR-RotationZoo",
    allow_patterns="Qwen3-8B/**",
    local_dir="./oscar_rotations",
)
# rotations now at ./oscar_rotations/Qwen3-8B/seq20000_prompt83_group128/
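After the download it can help to confirm that both rotation files actually landed where expected before launching a server. The snippet below is a small sketch using only the standard library; `find_rotation_dirs` is a hypothetical helper, and the file names are the ones listed under "File format".

```python
from pathlib import Path

K_FILE = "k_rotation_qqt_r_h_pbr.pt"
V_FILE = "v_rotation_sst_r_h_pbr.pt"


def find_rotation_dirs(root: str) -> list[Path]:
    """Return every directory under root that holds both rotation files."""
    return sorted(
        k_path.parent
        for k_path in Path(root).rglob(K_FILE)
        if (k_path.parent / V_FILE).is_file()
    )


for rot_dir in find_rotation_dirs("./oscar_rotations"):
    print(rot_dir)
```

A directory that contains only one of the two files is skipped, so an interrupted download shows up as a missing entry.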

2. Serve with sglang-research using the rotation

Clone https://github.com/FutureMLS-Lab/OSCAR and set up the single oscar conda env, then point the eval driver at your downloaded rotation:

ROT_DIR=./oscar_rotations/Qwen3-8B/seq20000_prompt83_group128 \
  bash rotation/qwen3-8B/eval_gpqa.sh

The driver internally launches sglang with these flags:

SGLANG_ENABLE_MIXED_KV_WINDOWS=1 \
SGLANG_OSCAR_ROTATION_MODE=oscar \
SGLANG_OSCAR_K_ROTATION_PATH=$ROT_DIR/k_rotation_qqt_r_h_pbr.pt \
SGLANG_OSCAR_V_ROTATION_PATH=$ROT_DIR/v_rotation_sst_r_h_pbr.pt \
SGLANG_OSCAR_K_CLIP_RATIO=0.96 \
SGLANG_OSCAR_V_CLIP_RATIO=0.92 \
SGLANG_OSCAR_ABSORB_V_ROTATION=1 \
SGLANG_MIXED_KV_PREFIX_TOKENS=64 \
SGLANG_MIXED_KV_RECENT_TOKENS=256 \
HADAMARD_ORDER=128 \
python -m sglang.launch_server \
  --model-path Qwen/Qwen3-8B \
  --tensor-parallel-size 1 \
  --kv-cache-dtype int2 \
  --kv-cache-quant-group-size 128 \
  --prefill-attention-backend fa3 \
  --decode-attention-backend triton \
  --disable-radix-cache \
  --disable-custom-all-reduce \
  --trust-remote-code

Reproducing from scratch

If you want to fit your own rotation on a different calibration set, the OSCAR pipeline is end-to-end reproducible:

git clone https://github.com/FutureMLS-Lab/OSCAR.git
cd OSCAR
bash rotation/qwen3-8B/save_qkv_8b.sh        # phase 1 — dump Q/K/V
bash rotation/qwen3-8B/compute_rotation.sh   # phase 2 — fit R = eigvec(Σ_Q)

Citation

@article{zhou2026oscar,
  title  = {OSCAR: Offline Spectral Covariance-Aware Rotation for 2-bit KV Cache Quantization},
  author = {Zhou, Zhongzhu and Zhuang, Donglin and Li, Jisen and Chen, Ziyan and Song, Shuaiwen Leon and Athiwaratkun, Ben and Wu, Xiaoxia},
  year   = {2026},
  note   = {Together AI; University of Sydney; UIUC},
}

Paper: OSCAR: Offline Spectral Covariance-Aware Rotation for 2-bit KV Cache Quantization (arXiv:2605.17757, published May 18)