gemma-4-31B-RotorQuant-GGUF-Q8_0

GGUF Q8_0 weight-quantized variant of google/gemma-4-31B optimised for use with RotorQuant KV cache compression via a dedicated llama.cpp fork.

Important: RotorQuant KV cache types (planar3, iso3) are not available in upstream llama.cpp, standard Ollama, or LM Studio. They require a specific llama.cpp fork. The GGUF file itself is a standard GGUF and works with any llama.cpp-compatible runtime using normal KV cache types (f16, q8_0, q4_0, etc.).

Overview

This model combines two independent compression techniques:

| Technique | What it does | Requirement |
|---|---|---|
| GGUF Q8_0 weight quantization | Reduces model size from ~62 GB (BF16) to ~31.0 GB | Any llama.cpp-compatible runtime |
| RotorQuant KV cache compression | Block-diagonal Clifford-algebra rotors for 3-bit KV cache (`--cache-type-k iso3 --cache-type-v iso3`) | llama-cpp-turboquant fork only |
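As a rough sanity check on the quoted sizes: BF16 stores 2 bytes per weight, while GGUF Q8_0 packs 32 int8 weights plus one fp16 scale into 34-byte blocks (about 8.5 bits/weight). The sketch below is illustrative only; a real GGUF file mixes tensor types, which is why the actual file lands at ~31.0 GB rather than the naive estimate.

```python
# Back-of-envelope size estimate (illustrative; real GGUF files mix tensor types).
params = 31e9

bf16_gb = params * 2 / 1e9           # BF16: 2 bytes per weight
# Q8_0: blocks of 32 int8 weights + one fp16 scale = 34 bytes per 32 weights
q8_0_gb = params * (34 / 32) / 1e9

print(f"BF16: ~{bf16_gb:.0f} GB")    # ~62 GB
print(f"Q8_0: ~{q8_0_gb:.1f} GB")    # ~32.9 GB (actual file: ~31.0 GB)
```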

Quickstart

Option A — With RotorQuant KV cache (fork required)

You must build from the RotorQuant-enabled llama.cpp fork:

```bash
# Clone and build the fork
git clone https://github.com/johndpope/llama-cpp-turboquant.git
cd llama-cpp-turboquant && git checkout feature/planarquant-kv-cache

# CUDA (Windows/Linux)
cmake -B build -DGGML_CUDA=ON -DCMAKE_BUILD_TYPE=Release && cmake --build build -j

# Metal (Apple Silicon)
cmake -B build -DGGML_METAL=ON -DGGML_METAL_EMBED_LIBRARY=ON -DCMAKE_BUILD_TYPE=Release && cmake --build build -j

# Run with RotorQuant KV cache
./build/bin/llama-cli -m gemma-4-31B-RotorQuant-GGUF-Q8_0.gguf \
  --cache-type-k iso3 --cache-type-v iso3 \
  -ngl 99 -fa \
  -p "Explain quantum computing"

# Or run as a server
./build/bin/llama-server -m gemma-4-31B-RotorQuant-GGUF-Q8_0.gguf \
  --cache-type-k iso3 --cache-type-v iso3 \
  -ngl 99 -fa --jinja
```

Option B — With standard llama.cpp / LM Studio / Ollama

The GGUF works as a normal quantised model. You won't get RotorQuant-specific KV cache benefits, but standard KV cache quantization (q8_0, q4_0) still reduces VRAM significantly.

llama.cpp (upstream)

```bash
llama-cli -m gemma-4-31B-RotorQuant-GGUF-Q8_0.gguf \
  --cache-type-k q8_0 --cache-type-v q8_0 \
  -ngl 99 -fa \
  -p "Explain quantum computing"
```

LM Studio

  1. Download the GGUF file and load in LM Studio.
  2. Enable Developer Mode (Settings → Developer).
  3. In the model loader's advanced settings, set Flash Attention to ON.
  4. Set K Cache Quantization and V Cache Quantization to q8_0 (or q4_0 for more aggressive VRAM savings).
  5. Note: LM Studio does not currently support RotorQuant's iso3 cache types. Track this feature request for updates.

Ollama

```bash
# Standard Ollama does not support RotorQuant cache types.
# Use with default or q8_0 KV cache via OLLAMA_KV_CACHE_TYPE=q8_0
OLLAMA_KV_CACHE_TYPE=q8_0 OLLAMA_FLASH_ATTENTION=1 ollama run majentik/gemma-4-31B-RotorQuant-GGUF-Q8_0
```

Specifications

| Property | Value |
|---|---|
| Base Model | google/gemma-4-31B |
| Architecture | Dense transformer |
| Parameters | 31B (all active, dense) |
| Context Length | 128K |
| Weight Quantization | GGUF Q8_0 (near-lossless 8-bit, reference quality) |
| Original Size (BF16) | ~62 GB |
| Quantized File Size | ~31.0 GB |
| KV Cache (RotorQuant) | 3-bit via `--cache-type-k iso3 --cache-type-v iso3` (fork only) |
| KV Cache (standard) | q8_0, q4_0, f16, etc. (any llama.cpp runtime) |
| License | apache-2.0 |
| Modalities | Text + Image (image-text-to-text) |
| Compatible Runtimes | llama.cpp, LM Studio, Ollama, koboldcpp |

What is RotorQuant?

RotorQuant is a KV cache compression method based on Clifford algebra (Cl(3,0)) rotors. It was developed as a faster, more parameter-efficient alternative to Google's TurboQuant (ICLR 2026).

Instead of applying a dense d×d random orthogonal rotation matrix (as TurboQuant does), RotorQuant uses lightweight block-diagonal rotations (independent 2D/4D rotations per pair/quartet), achieving O(d) complexity instead of O(d log d), fully parallelisable with no inter-element dependencies.
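To make the block-diagonal idea concrete, here is a minimal NumPy sketch using independent 2-D (Givens-style) pair rotations followed by 3-bit quantization. This illustrates the general technique only: the actual fork uses Cl(3,0) rotors (4 parameters each), and its quantization layout will differ.

```python
import numpy as np

def rotate_pairs(x, angles):
    """Apply independent 2-D rotations to consecutive channel pairs.

    x: (..., d) with d even; angles: (d // 2,) rotation angles.
    Each pair is rotated in isolation, so the transform is O(d)
    and fully parallel, with no inter-pair dependencies.
    """
    c, s = np.cos(angles), np.sin(angles)
    pairs = x.reshape(*x.shape[:-1], -1, 2)
    out = np.empty_like(pairs)
    out[..., 0] = c * pairs[..., 0] - s * pairs[..., 1]
    out[..., 1] = s * pairs[..., 0] + c * pairs[..., 1]
    return out.reshape(x.shape)

def quant_3bit(x):
    """Toy symmetric 3-bit quantizer: 8 levels in [-4, 3], one scale per row."""
    scale = np.abs(x).max(axis=-1, keepdims=True) / 4 + 1e-12
    q = np.clip(np.round(x / scale), -4, 3)
    return q, scale

rng = np.random.default_rng(0)
k = rng.standard_normal((1, 128))         # one "key" vector, head_dim = 128
angles = rng.uniform(0, 2 * np.pi, 64)    # 64 pair rotations = 64 parameters

rotated = rotate_pairs(k, angles)
q, scale = quant_3bit(rotated)
recon = rotate_pairs(q * scale, -angles)  # dequantize, then inverse rotation

print("max abs reconstruction error:", np.abs(recon - k).max())
```

Rotating by the negated angles inverts the transform exactly, so the only error left is the 3-bit rounding itself.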

Benchmarks from the RotorQuant repository (Llama 3.1 8B, RTX 5090; results will vary by model and hardware):

| Metric | RotorQuant (iso3) | TurboQuant | Standard q4_0 |
|---|---|---|---|
| Prefill Speed | 3,822 tok/s | 722 tok/s | n/a |
| Decode Speed | 119 tok/s | 93 tok/s | n/a |
| Perplexity (PPL) | 6.91 | 7.07 | n/a |
| KV Compression | ~5× vs FP16 | ~5× vs FP16 | ~4× vs FP16 |
| Rotation Parameters | 4 per rotor | 16,384 per matrix | N/A |

Note: These benchmarks were measured on a different model and GPU; performance on gemma-4-31B will differ. Independent benchmarks for this specific model are welcome; please open a discussion if you have results to share.
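The compression figures in the table can be reproduced with simple arithmetic. The sketch below assumes one fp16 scale per 32-element block for the quantized formats (real layouts may differ); the quoted ~5×/~4× figures correspond to the raw bit-width ratios, before scale overhead.

```python
# How the ~5x / ~4x KV-cache compression figures arise, versus a 16-bit cache.
fp16_bits = 16
for name, qbits in [("iso3 (3-bit)", 3), ("q4_0", 4), ("q8_0", 8)]:
    raw = fp16_bits / qbits                     # ignoring scale overhead
    with_scale = fp16_bits / (qbits + 16 / 32)  # one fp16 scale per 32 elems (assumed)
    print(f"{name}: {raw:.1f}x raw, {with_scale:.1f}x with scale overhead")
```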

Current Status of RotorQuant in the Ecosystem

| Runtime | RotorQuant Support | Standard KV Quant |
|---|---|---|
| llama.cpp (upstream) | ❌ Not merged | ✅ q8_0, q4_0, q4_1, iq4_nl, q5_0, q5_1 |
| llama-cpp-turboquant fork | ✅ planar3, iso3 | ✅ All standard types |
| LM Studio | ❌ Requested | ✅ Via advanced settings |
| Ollama | ❌ Not supported | ✅ Via OLLAMA_KV_CACHE_TYPE |
| koboldcpp | ❌ Not supported | ✅ Standard types |

Recommended Settings

For VRAM-constrained setups, standard q8_0 KV cache quantization already halves KV cache memory with negligible quality impact. Flash Attention should always be enabled: it is required for V cache quantization and improves memory efficiency regardless.

| VRAM | Suggested Configuration |
|---|---|
| 24 GB (RTX 4090) | Q8_0 + q8_0 KV cache + Flash Attention, 8K–16K context |
| 16 GB | Q8_0 + q4_0 KV cache + Flash Attention, 4K–8K context |
| 48+ GB | Q8_0 + f16 KV cache, full 32K+ context |
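When picking a context length for a given VRAM budget, the KV cache footprint can be estimated from the architecture. The configuration values below (layer count, KV heads, head dimension) are placeholder assumptions, not the actual gemma-4-31B architecture; substitute the real values from the GGUF metadata.

```python
# KV cache size estimate. The config values below are illustrative
# placeholders, NOT the real gemma-4-31B architecture.
def kv_cache_gb(ctx, n_layers, n_kv_heads, head_dim, bits_per_elem):
    elems = 2 * ctx * n_layers * n_kv_heads * head_dim  # K and V tensors
    return elems * bits_per_elem / 8 / 1e9

cfg = dict(n_layers=48, n_kv_heads=8, head_dim=128)     # assumed values
for label, bits in [("f16", 16), ("q8_0", 8.5), ("3-bit", 3)]:
    gb = kv_cache_gb(32768, bits_per_elem=bits, **cfg)
    print(f"{label} @ 32K context: {gb:.2f} GB")
```

Under these assumed values, a 3-bit cache fits several times more context in the same memory than f16, which is the practical benefit of the iso3 cache types.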
