---
license: apache-2.0
tags:
- oscar
- int2
- kv-cache
- quantization
- rotation
- sglang
---

<p align="center">
  <img src="https://huggingface.co/Zhongzhu/OSCAR-RotationZoo/resolve/main/oscar_logo_kv_transparent.png" alt="OSCAR INT2 KV-Cache" width="180"/>
</p>

# OSCAR RotationZoo

Precomputed K/V rotation matrices for **OSCAR INT2 KV-cache quantization**.

- 📄 **Paper**: [*OSCAR: Offline Spectral Covariance-Aware Rotation for 2-bit KV Cache Quantization*](https://arxiv.org/pdf/2605.17757)
- 🌐 **Website**: https://oscar-quantize.github.io/
- 💻 **Code**: https://github.com/FutureMLS-Lab/OSCAR

OSCAR captures Q/K/V activations on a small calibration set, estimates
attention-aware K/V covariance offline, and derives per-layer orthogonal
rotations that align INT2 quantization with the directions attention actually
consumes. The result is ~7× compression of the KV-cache memory footprint with
a GPQA accuracy drop of at most about 1.1 percentage points (pp) across the
models listed below.

This repo packages the rotations as drop-in `.pt` files so you don't need to
re-run the Q/K/V dump and eigendecomposition yourself.

## Available rotations

| Model | Calibration | GPQA (BF16) | GPQA (OSCAR INT2) |
|---|---|---:|---:|
| `Qwen/Qwen3-4B-Thinking-2507` | `seq20000_prompt83_group128` | 67.27 | 67.17 |
| `Qwen/Qwen3-4B-Thinking-2507` | `seq20000_prompt85_group128` (fresh re-dump) | 67.27 | n/a |
| `Qwen/Qwen3-8B` | `seq20000_prompt83_group128` | 56.67 | 55.56 |
| `Qwen/Qwen3-32B` | `seq16000_prompt69_group128` | 58.49 | 60.40 |
| `zai-org/GLM-4.7-FP8` | `seq10000_prompt43_group128` | 73.23 | 73.57 |

`seq<T>_prompt<N>_group<G>` notation: `T` = total calibration tokens,
`N` = calibration prompt count, `G` = INT2 quant group size along `head_dim`.

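
The directory names can be parsed mechanically, which helps when scripting over several rotations. The sketch below is only an illustration of the `seq<T>_prompt<N>_group<G>` notation; the `CalibrationTag` type and `parse_calibration_tag` helper are hypothetical names, not part of the OSCAR code.

```python
import re
from typing import NamedTuple


class CalibrationTag(NamedTuple):
    """Fields of a seq<T>_prompt<N>_group<G> directory name (hypothetical helper type)."""
    total_tokens: int  # T: total calibration tokens
    num_prompts: int   # N: calibration prompt count
    group_size: int    # G: INT2 quant group size along head_dim


_TAG_RE = re.compile(r"seq(\d+)_prompt(\d+)_group(\d+)")


def parse_calibration_tag(tag: str) -> CalibrationTag:
    """Split a calibration tag into its three integer fields."""
    match = _TAG_RE.fullmatch(tag)
    if match is None:
        raise ValueError(f"not a seq<T>_prompt<N>_group<G> tag: {tag!r}")
    total_tokens, num_prompts, group_size = (int(x) for x in match.groups())
    return CalibrationTag(total_tokens, num_prompts, group_size)


print(parse_calibration_tag("seq20000_prompt83_group128"))
# CalibrationTag(total_tokens=20000, num_prompts=83, group_size=128)
```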
## File format

Each rotation directory contains:

- `k_rotation_qqt_r_h_pbr.pt`: K-side rotation `R_K = R · H · P_br`, where
  `R = eigvec(Σ_Q)` is fit on Q's attention-aware covariance, `H` is a
  head-dim Hadamard, and `P_br` is the eigenvalue-sorted bit-reversal
  permutation.
- `v_rotation_sst_r_h_pbr.pt`: V-side rotation built on the score-weighted
  V covariance `Σ_V = V^T diag(K^T (Q^T Q) K) V`.

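
To make the `H · P_br` part of the product concrete, here is a minimal pure-Python sketch. It assumes a Sylvester-construction Hadamard matrix scaled by `1/sqrt(head_dim)` and a plain bit-reversal permutation; the eigenvalue-based ordering and the `R = eigvec(Σ_Q)` factor are left out, so this is not the OSCAR construction itself. All function names are hypothetical. The check at the end illustrates the general property that applying the same orthogonal rotation to a query and a key leaves their dot product unchanged.

```python
import math


def hadamard(n):
    """Sylvester Hadamard matrix of order n (a power of two), scaled to be orthonormal."""
    if n < 1 or n & (n - 1):
        raise ValueError("n must be a power of two")
    h = [[1.0]]
    while len(h) < n:
        # [[H, H], [H, -H]] doubling step
        h = [row + row for row in h] + [row + [-x for x in row] for row in h]
    scale = 1.0 / math.sqrt(n)
    return [[x * scale for x in row] for row in h]


def bit_reversal_permutation(n):
    """Index list p where p[i] is i with its log2(n) bits reversed."""
    bits = n.bit_length() - 1
    if bits == 0:
        return [0]
    return [int(format(i, f"0{bits}b")[::-1], 2) for i in range(n)]


def permute_columns(m, perm):
    """Return m @ P: column j of the result is column perm[j] of m."""
    return [[row[j] for j in perm] for row in m]


def rotate(vec, rot):
    """Row vector times matrix: vec @ rot."""
    return [sum(vec[i] * rot[i][j] for i in range(len(vec))) for j in range(len(rot[0]))]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


head_dim = 8  # toy size; real heads are larger
rot = permute_columns(hadamard(head_dim), bit_reversal_permutation(head_dim))

q = [0.5, -1.0, 2.0, 0.0, 1.5, -0.5, 3.0, 1.0]
k = [1.0, 0.25, -2.0, 4.0, 0.0, 1.0, -1.0, 0.5]
score_before = dot(q, k)
score_after = dot(rotate(q, rot), rotate(k, rot))
print(abs(score_before - score_after) < 1e-9)  # True: the rotation is orthogonal
```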
File layout (PyTorch state-dict):

```python
{
    "format_version": 1,
    "objective": "qqt_r_h_pbr",  # or "sst_r_h_pbr" for V
    "source_grouping": "layer",
    "layers": {
        0: {"layer_id": 0, "rotation": tensor(head_dim, head_dim)},
        1: {"layer_id": 1, "rotation": tensor(head_dim, head_dim)},
        ...
    }
}
```

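
A quick sanity check of a downloaded file against this layout might look like the sketch below. The helper names `check_rotation_payload` and `load_rotation_file` are hypothetical; the only PyTorch call used is `torch.load(..., map_location="cpu")`, and the checks assume nothing beyond the keys and the square `(head_dim, head_dim)` shape shown above.

```python
def check_rotation_payload(payload):
    """Validate the documented layout and return {layer_id: (rows, cols)}."""
    for key in ("format_version", "objective", "source_grouping", "layers"):
        if key not in payload:
            raise KeyError(f"missing top-level key: {key}")
    shapes = {}
    for layer_key, entry in payload["layers"].items():
        if entry["layer_id"] != layer_key:
            raise ValueError(f"layer_id mismatch at key {layer_key}")
        rows, cols = tuple(entry["rotation"].shape)
        if rows != cols:
            raise ValueError(f"layer {layer_key}: rotation is not square")
        shapes[layer_key] = (rows, cols)
    return shapes


def load_rotation_file(path):
    """Load a rotation file on CPU and validate it (requires PyTorch)."""
    import torch  # imported lazily so the checker above works without torch

    payload = torch.load(path, map_location="cpu")
    return payload, check_rotation_payload(payload)
```

For example, `load_rotation_file("k_rotation_qqt_r_h_pbr.pt")` would return the loaded dict together with a per-layer shape summary.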
## How to use

### 1. Download the rotation for your model

```bash
pip install huggingface_hub
```

```python
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="Zhongzhu/OSCAR-RotationZoo",
    allow_patterns="Qwen3-8B/**",
    local_dir="./oscar_rotations",
)
# rotations now at ./oscar_rotations/Qwen3-8B/seq20000_prompt83_group128/
```

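
After the download, it can be useful to confirm that a directory holds both the K and V files before pointing a server at it. The following is a small sketch using only `pathlib`; `find_rotation_dirs` is a hypothetical helper, and the two file names are the ones listed under "File format".

```python
from pathlib import Path

K_FILE = "k_rotation_qqt_r_h_pbr.pt"
V_FILE = "v_rotation_sst_r_h_pbr.pt"


def find_rotation_dirs(root):
    """Return every directory under root that holds both rotation files."""
    root = Path(root)
    if not root.is_dir():
        return []
    return sorted(
        p.parent for p in root.rglob(K_FILE) if (p.parent / V_FILE).is_file()
    )


for rot_dir in find_rotation_dirs("./oscar_rotations"):
    print(rot_dir)
```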
### 2. Serve with sglang-research using the rotation

Clone https://github.com/FutureMLS-Lab/OSCAR and set up the single `oscar`
conda env, then point the eval driver at your downloaded rotation:

```bash
ROT_DIR=./oscar_rotations/Qwen3-8B/seq20000_prompt83_group128 \
    bash rotation/qwen3-8B/eval_gpqa.sh
```

The driver internally launches sglang with these flags:

```bash
SGLANG_ENABLE_MIXED_KV_WINDOWS=1 \
SGLANG_OSCAR_ROTATION_MODE=oscar \
SGLANG_OSCAR_K_ROTATION_PATH=$ROT_DIR/k_rotation_qqt_r_h_pbr.pt \
SGLANG_OSCAR_V_ROTATION_PATH=$ROT_DIR/v_rotation_sst_r_h_pbr.pt \
SGLANG_OSCAR_K_CLIP_RATIO=0.96 \
SGLANG_OSCAR_V_CLIP_RATIO=0.92 \
SGLANG_OSCAR_ABSORB_V_ROTATION=1 \
SGLANG_MIXED_KV_PREFIX_TOKENS=64 \
SGLANG_MIXED_KV_RECENT_TOKENS=256 \
HADAMARD_ORDER=128 \
python -m sglang.launch_server \
    --model-path Qwen/Qwen3-8B \
    --tensor-parallel-size 1 \
    --kv-cache-dtype int2 \
    --kv-cache-quant-group-size 128 \
    --prefill-attention-backend fa3 \
    --decode-attention-backend triton \
    --disable-radix-cache \
    --disable-custom-all-reduce \
    --trust-remote-code
```

The first 64 tokens (attention sinks, `SGLANG_MIXED_KV_PREFIX_TOKENS=64`) and
the most recent 256 tokens (`SGLANG_MIXED_KV_RECENT_TOKENS=256`) stay in BF16;
the rest of the KV cache is INT2-quantized in 128-element groups along
`head_dim` using these rotations.

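
The effect of the BF16 windows on the overall memory saving can be estimated with simple arithmetic. The sketch below is a back-of-the-envelope model, not a measurement: it assumes each 128-element INT2 group carries 32 bits of metadata (for instance an FP16 scale plus an FP16 zero point), which is a guess about the storage format rather than something stated here. The function name and the `metadata_bits_per_group` parameter are hypothetical.

```python
def kv_compression_ratio(context_tokens, prefix_tokens=64, recent_tokens=256,
                         group_size=128, metadata_bits_per_group=32):
    """Estimated ratio of all-BF16 KV-cache size to mixed BF16/INT2 size for one sequence.

    Sizes are counted in bits per cached element; the number of elements per
    token (layers, heads, head_dim, K and V) is the same in both cases and cancels.
    """
    if context_tokens <= 0:
        raise ValueError("context_tokens must be positive")
    bf16_bits = 16.0
    # ASSUMPTION: 2 payload bits plus per-group metadata amortized over the group.
    int2_bits = 2.0 + metadata_bits_per_group / group_size
    kept_in_bf16 = min(context_tokens, prefix_tokens + recent_tokens)
    quantized = context_tokens - kept_in_bf16
    mixed_bits = kept_in_bf16 * bf16_bits + quantized * int2_bits
    return context_tokens * bf16_bits / mixed_bits


for n in (1_000, 8_000, 32_000):
    print(n, round(kv_compression_ratio(n), 2))
```

Under these assumptions the ratio approaches 16 / 2.25 ≈ 7.1 for long contexts, and it is noticeably lower for short ones because the 320 BF16 tokens make up a larger share of the cache.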
## Reproducing from scratch

If you want to fit your own rotation on a different calibration set, the
OSCAR pipeline is end-to-end reproducible:

```bash
git clone https://github.com/FutureMLS-Lab/OSCAR.git
cd OSCAR
bash rotation/qwen3-8B/save_qkv_8b.sh       # phase 1: dump Q/K/V
bash rotation/qwen3-8B/compute_rotation.sh  # phase 2: fit R = eigvec(Σ_Q)
```

## Citation

```bibtex
@article{zhou2026oscar,
  title  = {OSCAR: Offline Spectral Covariance-Aware Rotation for 2-bit KV Cache Quantization},
  author = {Zhou, Zhongzhu and Zhuang, Donglin and Li, Jisen and Chen, Ziyan and Song, Shuaiwen Leon and Athiwaratkun, Ben and Wu, Xiaoxia},
  year   = {2026},
  note   = {Together AI; University of Sydney; UIUC},
}
```