Update KV-cache card with accurate template and fork requirements
README.md
CHANGED
**Removed from the previous card:** a quickstart that constructed the cache as `IsoQuantCache(model)`, headline claims of **5.3x faster prefill**, **28% faster decode**, equivalent memory savings, and slightly better perplexity, a memory table (BF16 ~40 GB, 8-bit ~20 GB, 4-bit ~12 GB, 2-bit ~6 GB), and direct links to the [TurboQuant](https://huggingface.co/majentik/gpt-oss-20b-TurboQuant), [MLX 8-bit](https://huggingface.co/majentik/gpt-oss-20b-RotorQuant-MLX-8bit), [MLX 4-bit](https://huggingface.co/majentik/gpt-oss-20b-RotorQuant-MLX-4bit), [MLX 2-bit](https://huggingface.co/majentik/gpt-oss-20b-RotorQuant-MLX-2bit), and [GGUF Q4_K_M](https://huggingface.co/majentik/gpt-oss-20b-RotorQuant-GGUF-Q4_K_M) variants. The updated card follows.
---
license: apache-2.0
base_model: openai/gpt-oss-20b
tags:
- rotorquant
- kv-cache-quantization
- openai
- moe
- quantized
library_name: transformers
pipeline_tag: text-generation
---

# gpt-oss-20b-RotorQuant

**RotorQuant KV cache compression** for [openai/gpt-oss-20b](https://huggingface.co/openai/gpt-oss-20b).

This is a **documentation repository** that explains how to combine gpt-oss-20b's weights with RotorQuant inference-time KV cache compression. No weights are stored here: use the base model directly and apply RotorQuant via the Python package or the llama.cpp fork.

## What is this?

KV cache compression reduces the memory used by the attention cache during inference. Unlike weight quantization (which is baked into the GGUF/MLX file), KV cache compression is applied at runtime, so the same base weights can be used with or without compression.

| Technique | Where it's applied | Savings |
|-----------|-------------------|---------|
| Weight quantization (GGUF/MLX/AWQ) | Baked into model file | Reduces disk + weight memory |
| **RotorQuant KV cache** | At inference time | Reduces attention memory (critical for long context) |

Both can be combined for maximum efficiency.
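For a rough sense of why the KV cache dominates at long context, here is a minimal back-of-the-envelope sketch; the layer/head/dimension numbers below are illustrative placeholders, not gpt-oss-20b's actual config (read the real values from the base model's `config.json`):

```python
# Rough KV-cache size: 2 (K and V) x layers x kv_heads x head_dim x tokens x bits / 8 bytes.
# The architecture numbers used below are placeholders for illustration only.
def kv_cache_gib(n_layers: int, n_kv_heads: int, head_dim: int, seq_len: int, bits: int) -> float:
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bits / 8 / 1024**3

for bits in (16, 8, 4, 2):
    print(f"{bits:>2}-bit cache: {kv_cache_gib(24, 8, 64, 131_072, bits):.2f} GiB")
```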
## Quickstart

### Option A: Python / transformers

Install the `rotorquant` package:

```bash
pip install rotorquant
```

Then use it with the base model:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from rotorquant import IsoQuantCache

tokenizer = AutoTokenizer.from_pretrained("openai/gpt-oss-20b", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    "openai/gpt-oss-20b",
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)

# Apply RotorQuant to the KV cache
cache = IsoQuantCache(bits=4)  # or bits=2 for more aggressive compression

inputs = tokenizer("Hello, how are you?", return_tensors="pt").to(model.device)
outputs = model.generate(
    **inputs,
    max_new_tokens=128,
    past_key_values=cache,
    use_cache=True,
)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True))
```
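To sanity-check the savings on your own hardware, one rough approach (a sketch assuming a single CUDA device and the `model`, `tokenizer`, and `IsoQuantCache` from the snippet above; not a rigorous benchmark) is to compare peak memory for a long prompt with and without the compressed cache:

```python
import torch

def peak_gib_during_generate(model, tokenizer, cache=None, n_words=4096):
    # Crude peak-memory probe on the current CUDA device.
    torch.cuda.empty_cache()
    torch.cuda.reset_peak_memory_stats()
    inputs = tokenizer("hello " * n_words, return_tensors="pt").to(model.device)
    kwargs = {"max_new_tokens": 32, "use_cache": True}
    if cache is not None:
        kwargs["past_key_values"] = cache
    model.generate(**inputs, **kwargs)
    return torch.cuda.max_memory_allocated() / 1024**3

# print(peak_gib_during_generate(model, tokenizer))                               # default cache
# print(peak_gib_during_generate(model, tokenizer, cache=IsoQuantCache(bits=4)))  # RotorQuant cache
```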
### Option B: llama.cpp / LM Studio / Ollama (with fork)

RotorQuant KV cache types (`iso3`) are **not** in upstream llama.cpp. They require:

- [llama-cpp-turboquant fork](https://github.com/johndpope/llama-cpp-turboquant/tree/feature/planarquant-kv-cache)

Once built:

```bash
llama-cli -m gpt-oss-20b.gguf \
  --cache-type-k iso3 --cache-type-v iso3 \
  -ngl 99 -fa \
  -p "Hello"
```

For standard runtimes (LM Studio, Ollama, upstream llama.cpp), use conventional KV cache types (`q8_0`, `q4_0`). You lose the RotorQuant-specific benefits but keep GGUF weight quantization.
## Model Specifications

| Property | Value |
|----------|-------|
| Base Model | [openai/gpt-oss-20b](https://huggingface.co/openai/gpt-oss-20b) |
| Architecture | Sparse MoE |
| Parameters | 20B total (MoE) |
| Context Length | 128K |
| BF16 Size | ~40 GB |
| Modalities | Text |
| License | apache-2.0 |
## What is RotorQuant?
[RotorQuant](https://github.com/scrya-com/rotorquant) is a KV cache compression method based on Clifford algebra (Cl(3,0)) rotors: a faster, more parameter-efficient alternative to Google's TurboQuant. It uses lightweight block-diagonal rotations (independent 2D/4D rotations per channel pair/quartet), achieving O(d) complexity instead of O(d log d), and is fully parallelisable with no inter-element dependencies.
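To make the block-diagonal idea concrete, here is a toy NumPy sketch of independent per-pair 2D rotations. It is not the actual RotorQuant implementation (the real Cl(3,0) rotors reportedly use 4 parameters each; this toy uses a single angle per channel pair), but it illustrates the O(d), fully parallel structure:

```python
import numpy as np

def rotate_pairs(x: np.ndarray, angles: np.ndarray) -> np.ndarray:
    """Toy illustration: rotate each consecutive channel pair by its own angle."""
    pairs = x.reshape(*x.shape[:-1], -1, 2)           # (..., d/2, 2)
    c, s = np.cos(angles), np.sin(angles)             # one angle per pair
    rotated = np.stack(
        (c * pairs[..., 0] - s * pairs[..., 1],
         s * pairs[..., 0] + c * pairs[..., 1]),
        axis=-1,
    )
    return rotated.reshape(x.shape)

keys = np.random.randn(1, 8, 64)                      # toy key states: (batch, tokens, head_dim)
angles = np.random.uniform(0.0, np.pi, size=64 // 2)  # d/2 independent rotations -> O(d) work
print(rotate_pairs(keys, angles).shape)               # (1, 8, 64); every pair is independent
```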
**Benchmarks** (from the RotorQuant repository, Llama 3.1 8B on an RTX 5090; results vary by model and hardware):

- Prefill: 3,822 tok/s (vs. TurboQuant 722 tok/s)
- Decode: 119 tok/s (vs. TurboQuant 93 tok/s)
- Perplexity: 6.91 (vs. TurboQuant 7.07)
- Parameters: 4 per rotor (vs. TurboQuant 16,384)

> Benchmarks are from the RotorQuant repository using Llama 3.1 8B. Performance on gpt-oss-20b will differ. Please open a discussion if you have independent results.
## Current Ecosystem Support

| Runtime | RotorQuant Support | Notes |
|---------|--------------------|-------|
| Python transformers + `rotorquant` | ✅ Full | Drop-in cache class |
| llama.cpp upstream | ❌ Not merged | Use fork below |
| llama-cpp-turboquant fork | ✅ `planar3`, `iso3` | [GitHub](https://github.com/johndpope/llama-cpp-turboquant/tree/feature/planarquant-kv-cache) |
| LM Studio | ❌ [Requested](https://github.com/lmstudio-ai/lmstudio-bug-tracker/issues/1719) | Use `q8_0` as alternative |
| Ollama | ❌ Not supported | Use `OLLAMA_KV_CACHE_TYPE=q8_0` |
| vLLM | ❌ Not supported | - |
| koboldcpp | ❌ Not supported | - |
## Pre-quantized weight variants

If you want combined weight + KV cache compression, majentik hosts pre-quantized versions:

- [MLX (Apple Silicon)](https://huggingface.co/majentik?search=gpt-oss-20b+MLX)
- [GGUF (llama.cpp / Ollama / LM Studio)](https://huggingface.co/majentik?search=gpt-oss-20b+GGUF)

## See Also
- [RotorQuant GitHub](https://github.com/scrya-com/rotorquant)
- [TurboQuant paper (arXiv 2504.19874)](https://arxiv.org/abs/2504.19874)
- [llama-cpp-turboquant fork](https://github.com/johndpope/llama-cpp-turboquant/tree/feature/planarquant-kv-cache)
- [Base model: openai/gpt-oss-20b](https://huggingface.co/openai/gpt-oss-20b)
- [gpt-oss-20b announcement](https://openai.com/blog/gpt-oss)