---
license: apache-2.0
tags:
- oscar
- int2
- kv-cache
- quantization
- rotation
- sglang
---

OSCAR INT2 KV-Cache

# OSCAR RotationZoo

Precomputed K/V rotation matrices for **OSCAR INT2 KV-cache quantization**.

- 📄 **Paper**: [*OSCAR: Offline Spectral Covariance-Aware Rotation for 2-bit KV Cache Quantization*](https://arxiv.org/pdf/2605.17757)
- 🌐 **Website**: https://oscar-quantize.github.io/
- 💻 **Code**: https://github.com/FutureMLS-Lab/OSCAR

OSCAR captures Q/K/V activations on a small calibration set, estimates attention-aware K/V covariance offline, and derives per-layer orthogonal rotations that align INT2 quantization with the directions attention actually consumes. The result is ~7× compression of the KV-cache memory footprint with a single-digit percentage-point (pp) accuracy drop on GPQA for dense reasoning models.

This repo packages the rotations as drop-in `.pt` files so you don't need to re-run the Q/K/V dump and eigendecomposition yourself.

## Available rotations

| Model | Calibration | GPQA (BF16) | GPQA (OSCAR INT2) |
|---|---|---:|---:|
| `Qwen/Qwen3-4B-Thinking-2507` | `seq20000_prompt83_group128` | 67.27 | 67.17 |
| `Qwen/Qwen3-4B-Thinking-2507` | `seq20000_prompt85_group128` (fresh re-dump) | 67.27 | - |
| `Qwen/Qwen3-8B` | `seq20000_prompt83_group128` | 56.67 | 55.56 |
| `Qwen/Qwen3-32B` | `seq16000_prompt69_group128` | 58.49 | 60.40 |
| `zai-org/GLM-4.7-FP8` | `seq10000_prompt43_group128` | 73.23 | 73.57 |

`seq<T>_prompt<N>_group<G>` notation: `T` = total calibration tokens, `N` = calibration prompt count, `G` = INT2 quant group size along head_dim.

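The calibration directory names in the table follow a fixed pattern, so they can be decoded programmatically when scripting over several rotations. The following is a minimal sketch; `CalibrationTag` and `parse_calibration_tag` are hypothetical helper names used for illustration and are not part of the OSCAR codebase.

```python
import re
from typing import NamedTuple


class CalibrationTag(NamedTuple):
    """Fields encoded in a calibration directory name (hypothetical helper)."""
    total_tokens: int  # T: total calibration tokens
    num_prompts: int   # N: calibration prompt count
    group_size: int    # G: INT2 quant group size along head_dim


# Matches names such as "seq20000_prompt83_group128".
_TAG_RE = re.compile(r"^seq(\d+)_prompt(\d+)_group(\d+)$")


def parse_calibration_tag(tag: str) -> CalibrationTag:
    """Split a calibration directory name into its T, N and G fields."""
    match = _TAG_RE.match(tag)
    if match is None:
        raise ValueError(f"not a calibration tag: {tag!r}")
    return CalibrationTag(*(int(group) for group in match.groups()))


tag = parse_calibration_tag("seq20000_prompt83_group128")
print(tag.total_tokens, tag.num_prompts, tag.group_size)
```

The parsed `group_size` is the value that has to agree with the quantization group size used at serving time.
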
## File format

Each rotation directory contains:

- `k_rotation_qqt_r_h_pbr.pt`: K-side rotation `R_K = R · H · P_br`, where `R = eigvec(Σ_Q)` is fit on Q's attention-aware covariance, `H` is a head-dim Hadamard, and `P_br` is the eigenvalue-sorted bit-reversal permutation
- `v_rotation_sst_r_h_pbr.pt`: V-side rotation built on the score-weighted V covariance `Σ_V = V^T diag(K^T (Q^T Q) K) V`

File layout (PyTorch state-dict):

```python
# Schematic layout: tensor(head_dim, head_dim) denotes a tensor of that shape.
{
    "format_version": 1,
    "objective": "qqt_r_h_pbr",  # or "sst_r_h_pbr" for V
    "source_grouping": "layer",
    "layers": {
        0: {"layer_id": 0, "rotation": tensor(head_dim, head_dim)},
        1: {"layer_id": 1, "rotation": tensor(head_dim, head_dim)},
        ...
    }
}
```

## How to use

### 1. Download the rotation for your model

```bash
pip install huggingface_hub
```

```python
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="Zhongzhu/OSCAR-RotationZoo",
    allow_patterns="Qwen3-8B/**",
    local_dir="./oscar_rotations",
)
# rotations now at ./oscar_rotations/Qwen3-8B/seq20000_prompt83_group128/
```

### 2. Serve with sglang-research using the rotation

Clone https://github.com/FutureMLS-Lab/OSCAR and set up the single `oscar` conda env, then point the eval driver at your downloaded rotation:

```bash
ROT_DIR=./oscar_rotations/Qwen3-8B/seq20000_prompt83_group128 \
  bash rotation/qwen3-8B/eval_gpqa.sh
```

The driver internally launches sglang with these flags:

```bash
SGLANG_ENABLE_MIXED_KV_WINDOWS=1 \
SGLANG_OSCAR_ROTATION_MODE=oscar \
SGLANG_OSCAR_K_ROTATION_PATH=$ROT_DIR/k_rotation_qqt_r_h_pbr.pt \
SGLANG_OSCAR_V_ROTATION_PATH=$ROT_DIR/v_rotation_sst_r_h_pbr.pt \
SGLANG_OSCAR_K_CLIP_RATIO=0.96 \
SGLANG_OSCAR_V_CLIP_RATIO=0.92 \
SGLANG_OSCAR_ABSORB_V_ROTATION=1 \
SGLANG_MIXED_KV_PREFIX_TOKENS=64 \
SGLANG_MIXED_KV_RECENT_TOKENS=256 \
HADAMARD_ORDER=128 \
python -m sglang.launch_server \
  --model-path Qwen/Qwen3-8B \
  --tensor-parallel-size 1 \
  --kv-cache-dtype int2 \
  --kv-cache-quant-group-size 128 \
  --prefill-attention-backend fa3 \
  --decode-attention-backend triton \
  --disable-radix-cache \
  --disable-custom-all-reduce \
  --trust-remote-code
```

The first 64 sink tokens (`SGLANG_MIXED_KV_PREFIX_TOKENS=64`) and the 256 most recent tokens (`SGLANG_MIXED_KV_RECENT_TOKENS=256`) stay in BF16; the rest of the KV cache is INT2-quantized in 128-element groups along head_dim using these rotations.

## Reproducing from scratch

If you want to fit your own rotation on a different calibration set, the OSCAR pipeline is reproducible end to end:

```bash
git clone https://github.com/FutureMLS-Lab/OSCAR.git
cd OSCAR
bash rotation/qwen3-8B/save_qkv_8b.sh       # phase 1: dump Q/K/V
bash rotation/qwen3-8B/compute_rotation.sh  # phase 2: fit R = eigvec(Σ_Q)
```

## Citation

```bibtex
@article{zhou2026oscar,
  title  = {OSCAR: Offline Spectral Covariance-Aware Rotation for 2-bit KV Cache Quantization},
  author = {Zhou, Zhongzhu and Zhuang, Donglin and Li, Jisen and Chen, Ziyan and Song, Shuaiwen Leon and Athiwaratkun, Ben and Wu, Xiaoxia},
  year   = {2026},
  note   = {Together AI; University of Sydney; UIUC},
}
```
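
## Sanity-checking a rotation file

Before serving, it can be useful to confirm that a rotation file matches the layout described under File format and that each per-layer matrix is close to orthogonal. The sketch below is illustrative only: `orthogonality_errors` is a hypothetical helper, the demo file is synthetic (a random orthogonal matrix standing in for a real rotation), and the check assumes the stored matrices are 2-D and orthonormal, meaning the Hadamard factor is stored normalized.

```python
import os
import tempfile

import torch


def orthogonality_errors(state: dict) -> dict:
    """Return max |R @ R^T - I| for each layer's rotation (hypothetical helper)."""
    errors = {}
    for layer_id, entry in state["layers"].items():
        rot = entry["rotation"].to(torch.float64)
        eye = torch.eye(rot.shape[-1], dtype=torch.float64)
        errors[layer_id] = (rot @ rot.T - eye).abs().max().item()
    return errors


# Synthetic stand-in for a real file: a random orthogonal matrix from a QR factorization.
head_dim = 128
q, _ = torch.linalg.qr(torch.randn(head_dim, head_dim, dtype=torch.float64))
demo_state = {
    "format_version": 1,
    "objective": "qqt_r_h_pbr",
    "source_grouping": "layer",
    "layers": {0: {"layer_id": 0, "rotation": q.to(torch.float32)}},
}

with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "k_rotation_demo.pt")
    torch.save(demo_state, demo_path)
    loaded = torch.load(demo_path, map_location="cpu")

errors = orthogonality_errors(loaded)
print(errors)
```

To check a downloaded rotation, replace the synthetic file with a real path such as `./oscar_rotations/Qwen3-8B/seq20000_prompt83_group128/k_rotation_qqt_r_h_pbr.pt`. A large error would suggest either a corrupted download or that the orthonormality assumption above does not hold for that file.
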