| --- |
| license: apache-2.0 |
| base_model: google/gemma-4-31B |
| tags: |
| - rotorquant |
| - kv-cache-quantization |
| - gemma |
| - gemma4 |
| - quantized |
| library_name: transformers |
| pipeline_tag: image-text-to-text |
| --- |
| |
| # gemma-4-31B-RotorQuant |
|
|
| **RotorQuant KV cache compression** for [google/gemma-4-31B](https://huggingface.co/google/gemma-4-31B). |
|
|
| This is a **documentation repository** that explains how to combine gemma-4-31B's weights with RotorQuant inference-time KV cache compression. No weights are stored here; use the base model directly and apply RotorQuant via the Python package or the llama.cpp fork. |
|
|
| ## What is this? |
|
|
| KV cache compression reduces the memory used by the attention cache during inference. Unlike weight quantization (which is baked into the GGUF/MLX file), KV cache compression is applied at runtime, so the same base weights can be used with or without compression. |
|
|
| | Technique | Where it's applied | Savings | |
| |-----------|-------------------|---------| |
| | Weight quantization (GGUF/MLX/AWQ) | Baked into model file | Reduces disk + weight memory | |
| | **RotorQuant KV cache** | At inference time | Reduces attention memory (critical for long context) | |
|
|
| Both can be combined for maximum efficiency. |
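To see why KV cache compression matters at long context, here is a back-of-envelope estimate. The layer/head counts below are hypothetical placeholders, not gemma-4-31B's actual configuration (check the base model's `config.json` for real values):

```python
# Back-of-envelope KV cache size for a dense transformer.
# Layer/head numbers are ILLUSTRATIVE, not gemma-4-31B's actual config.

def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, bytes_per_elem):
    # 2x accounts for keys and values
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem

layers, kv_heads, head_dim = 48, 8, 128   # hypothetical config
ctx = 128_000                             # 128K context

bf16 = kv_cache_bytes(layers, kv_heads, head_dim, ctx, 2)    # bf16 = 2 bytes/elem
int4 = kv_cache_bytes(layers, kv_heads, head_dim, ctx, 0.5)  # 4-bit = 0.5 bytes/elem

print(f"bf16 KV cache:  {bf16 / 1e9:.1f} GB")  # prints "bf16 KV cache:  25.2 GB"
print(f"4-bit KV cache: {int4 / 1e9:.1f} GB")  # prints "4-bit KV cache: 6.3 GB"
```

Under these assumed dimensions, a full 128K-context cache in bf16 would dwarf most consumer GPUs' VRAM on its own, which is why runtime cache compression stacks usefully with weight quantization.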
|
|
| ## Quickstart |
|
|
| ### Option A – Python / transformers |
|
|
| Install the `rotorquant` package: |
|
|
| ```bash |
| pip install rotorquant |
| ``` |
|
|
| Then use it with the base model: |
|
|
| ```python |
| import torch |
| from transformers import AutoModelForCausalLM, AutoTokenizer |
| from rotorquant import IsoQuantCache |
| |
| tokenizer = AutoTokenizer.from_pretrained("google/gemma-4-31B", trust_remote_code=True) |
| model = AutoModelForCausalLM.from_pretrained( |
| "google/gemma-4-31B", |
| torch_dtype=torch.bfloat16, |
| device_map="auto", |
| trust_remote_code=True, |
| ) |
| |
| # Apply RotorQuant to the KV cache |
| cache = IsoQuantCache(bits=4) # or bits=2 for more aggressive compression |
| |
| inputs = tokenizer("Hello, how are you?", return_tensors="pt").to(model.device) |
| outputs = model.generate( |
| **inputs, |
| max_new_tokens=128, |
| past_key_values=cache, |
| use_cache=True, |
| ) |
| print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True)) |
| ``` |
|
|
|
|
| ### Option B – llama.cpp / LM Studio / Ollama (with fork) |
|
|
| RotorQuant KV cache types (`iso3`) are **not** in upstream llama.cpp. They require: |
| - [llama-cpp-turboquant fork](https://github.com/johndpope/llama-cpp-turboquant/tree/feature/planarquant-kv-cache) |
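|  |
|  |
| Building the fork is assumed to follow upstream llama.cpp's standard CMake workflow (the branch name comes from the link above; the CUDA flag is optional and hardware-dependent): |
|  |
|  |
| ```bash |
| # Clone the fork on the RotorQuant branch and build it. |
| # Adjust backend flags (e.g. -DGGML_CUDA=ON) for your hardware. |
| git clone --branch feature/planarquant-kv-cache \ |
|   https://github.com/johndpope/llama-cpp-turboquant.git |
| cd llama-cpp-turboquant |
| cmake -B build -DGGML_CUDA=ON |
| cmake --build build --config Release -j |
| ``` |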
|
|
| Once built: |
|
|
| ```bash |
| llama-cli -m gemma-4-31B.gguf \ |
| --cache-type-k iso3 --cache-type-v iso3 \ |
| -ngl 99 -fa \ |
| -p "Hello" |
| ``` |
|
|
| For standard runtimes (LM Studio, Ollama, upstream llama.cpp), use conventional KV cache types (`q8_0`, `q4_0`). You lose the RotorQuant-specific benefits but keep GGUF weight quantization. |
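|  |
|  |
| For example, the same command as above works on upstream llama.cpp once the RotorQuant-specific cache types are swapped for a conventional one: |
|  |
|  |
| ```bash |
| # Upstream llama.cpp: standard quantized KV cache (no RotorQuant) |
| llama-cli -m gemma-4-31B.gguf \ |
|   --cache-type-k q8_0 --cache-type-v q8_0 \ |
|   -ngl 99 -fa \ |
|   -p "Hello" |
| ``` |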
|
|
| ## Model Specifications |
|
|
| | Property | Value | |
| |----------|-------| |
| | Base Model | [google/gemma-4-31B](https://huggingface.co/google/gemma-4-31B) | |
| | Architecture | Dense transformer | |
| | Parameters | 31B (dense) | |
| | Context Length | 128K | |
| | BF16 Size | ~62 GB | |
| | Modalities | Text + Image | |
| | License | apache-2.0 | |
|
|
| ## What is RotorQuant? |
|
|
| [RotorQuant](https://github.com/scrya-com/rotorquant) is a KV cache compression method based on Clifford algebra (Cl(3,0)) rotors, a faster and more parameter-efficient alternative to Google's TurboQuant. It uses lightweight block-diagonal rotations (independent 2D/4D rotations per pair/quartet), achieving O(d) complexity instead of O(d log d), and is fully parallelisable with no inter-element dependencies. |
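|  |
|  |
| The block-diagonal idea can be illustrated with a minimal NumPy sketch. This is not RotorQuant's actual implementation; it only shows the structural point: each consecutive pair of elements gets its own independent 2D rotation, so the transform is O(d) and every pair can be processed in parallel. |
|  |
|  |
| ```python |
| import numpy as np |
|  |
| def pairwise_rotate(x, angles): |
|     """Apply an independent 2D rotation to each consecutive pair of |
|     elements of x (len(angles) == len(x) // 2). O(d), with no |
|     inter-element dependencies, so all pairs can run in parallel.""" |
|     y = x.copy() |
|     c, s = np.cos(angles), np.sin(angles) |
|     a, b = x[0::2], x[1::2] |
|     y[0::2] = c * a - s * b |
|     y[1::2] = s * a + c * b |
|     return y |
|  |
| rng = np.random.default_rng(0) |
| x = rng.standard_normal(8) |
| theta = rng.standard_normal(4)  # one angle per pair |
|  |
| y = pairwise_rotate(x, theta) |
| # Rotations preserve the vector norm, which keeps quantization |
| # ranges well-behaved after the transform. |
| print(np.allclose(np.linalg.norm(x), np.linalg.norm(y)))  # prints "True" |
| ``` |
|  |
|  |
| Rotating by the negated angles inverts the transform exactly, which is what allows lossy quantization to happen in the rotated space and be undone at read time. |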
|
|
| **Benchmarks** (from the RotorQuant repository, Llama 3.1 8B on RTX 5090; results vary by model and hardware): |
|
|
| - Prefill: 3,822 tok/s (vs TurboQuant 722 tok/s) |
| - Decode: 119 tok/s (vs TurboQuant 93 tok/s) |
| - Perplexity: 6.91 (vs TurboQuant 7.07) |
| - Parameters: 4 per rotor (vs TurboQuant 16,384) |
|
|
| > Benchmarks are from the RotorQuant repository using Llama 3.1 8B. Performance on gemma-4-31B will differ. Please open a discussion if you have independent results. |
|
|
| ## Current Ecosystem Support |
|
|
| | Runtime | RotorQuant Support | Notes | |
| |---------|----------------------|-------| |
| Python transformers + `rotorquant` | ✅ Full | Drop-in cache class | |
| llama.cpp upstream | ❌ Not merged | Use fork below | |
| llama-cpp-turboquant fork | ✅ `planar3`, `iso3` | [GitHub](https://github.com/johndpope/llama-cpp-turboquant/tree/feature/planarquant-kv-cache) | |
| LM Studio | ❌ [Requested](https://github.com/lmstudio-ai/lmstudio-bug-tracker/issues/1719) | Use `q8_0` as alternative | |
| Ollama | ❌ Not supported | Use `OLLAMA_KV_CACHE_TYPE=q8_0` | |
| vLLM | ❌ Not supported | – | |
| koboldcpp | ❌ Not supported | – | |
|
|
| ## Pre-quantized weight variants |
|
|
| If you want combined weight + KV cache compression, majentik hosts pre-quantized versions: |
|
|
| - [MLX (Apple Silicon)](https://huggingface.co/majentik?search=gemma-4-31B+MLX) |
| - [GGUF (llama.cpp / Ollama / LM Studio)](https://huggingface.co/majentik?search=gemma-4-31B+GGUF) |
|
|
| ## See Also |
|
|
| - [RotorQuant GitHub](https://github.com/scrya-com/rotorquant) |
| - [TurboQuant paper (arXiv 2504.19874)](https://arxiv.org/abs/2504.19874) |
| - [llama-cpp-turboquant fork](https://github.com/johndpope/llama-cpp-turboquant/tree/feature/planarquant-kv-cache) |
| - [Base model: google/gemma-4-31B](https://huggingface.co/google/gemma-4-31B) |
| - [gemma-4-31B announcement](https://blog.google/technology/developers/gemma-4/) |
|
|