# Gemma 4 31B - RotorQuant AWQ 4-bit
4-bit AWQ-quantized version of google/gemma-4-31B (31B dense) with RotorQuant KV-cache quantization. AWQ (Activation-aware Weight Quantization) scales weights according to activation statistics to preserve the most salient channels, making it well suited to GPU inference. RotorQuant delivers 5.3x faster prefill and 28% faster decode than TurboQuant, which is especially valuable for long-context serving on a single 24 GB GPU.
Approximate model size: ~17 GB
Note: RotorQuant KV-cache modes (`planar3`, `iso3`) require the RotorQuant fork or the llama-cpp-turboquant fork. The AWQ weights themselves load cleanly in stock AutoAWQ / vLLM; the RotorQuant KV-cache kernels are opt-in.
## Model Specifications
| Property | Value |
|---|---|
| Base Model | google/gemma-4-31B |
| Parameters | ~31 billion |
| Architecture | Dense transformer |
| Modality | Multimodal: image + text input, text output |
| License | Apache 2.0 |
| Weight Quantization | AWQ 4-bit (~17 GB) |
| Group Size | 128 |
| KV-Cache Quantization | RotorQuant (planar3 / iso3) |
| Framework | transformers + AutoAWQ / vLLM |
## Quickstart

### AutoAWQ
```python
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model = AutoAWQForCausalLM.from_quantized(
    "majentik/gemma-4-31B-RotorQuant-AWQ-4bit",
    device_map="auto",
    fuse_layers=True,
)
tokenizer = AutoTokenizer.from_pretrained("majentik/gemma-4-31B-RotorQuant-AWQ-4bit")

prompt = "The history of artificial intelligence began"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=512)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```
### vLLM
```bash
vllm serve majentik/gemma-4-31B-RotorQuant-AWQ-4bit \
  --quantization awq_marlin \
  --max-model-len 8192
```
### With RotorQuant KV cache (fork)

```python
# Requires the scrya-com/rotorquant fork. `model` and `inputs` are the
# loaded AWQ model and tokenized prompt from the AutoAWQ quickstart above.
from rotorquant import RotorQuantCache

cache = RotorQuantCache(model, mode="iso3")  # or "planar3"
out = model.generate(**inputs, past_key_values=cache, max_new_tokens=512)
```
## What is RotorQuant?
RotorQuant is a high-performance KV-cache quantization method built on block-diagonal Clifford-algebra rotors. Combined with AWQ 4-bit weights, it compresses both the weights and the KV cache, keeping long-context GPU inference within a single-GPU memory budget.
Key advantages over TurboQuant:
- 5.3x faster prefill
- 28% faster decode
- Equivalent memory savings
- 3-bit KV-cache modes (`planar3`, `iso3`)
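The rotor idea can be sketched in a few lines of NumPy. This is an illustrative toy, not the fork's actual kernels: a block-diagonal matrix of 2x2 Givens rotations (the simplest rotors) mixes channel pairs before 3-bit uniform quantization, and the orthogonal inverse undoes the rotation on dequantization. The helper names (`block_rotor`, `quant_dequant`) are invented for this sketch.

```python
import numpy as np

def block_rotor(dim: int, seed: int = 0) -> np.ndarray:
    """Block-diagonal rotation: an independent 2x2 Givens rotation per
    channel pair. Orthogonal by construction, so the inverse is R.T."""
    rng = np.random.default_rng(seed)
    R = np.zeros((dim, dim))
    for i in range(0, dim, 2):
        t = rng.uniform(0.0, 2.0 * np.pi)
        c, s = np.cos(t), np.sin(t)
        R[i, i], R[i, i + 1] = c, -s
        R[i + 1, i], R[i + 1, i + 1] = s, c
    return R

def quant_dequant(x: np.ndarray, bits: int = 3) -> np.ndarray:
    """Symmetric uniform quantization with one per-vector scale."""
    qmax = 2 ** (bits - 1) - 1           # 3 positive levels at 3 bits
    scale = np.abs(x).max() / qmax
    return np.round(x / scale).clip(-qmax - 1, qmax) * scale

dim = 64
R = block_rotor(dim)
x = np.random.default_rng(1).normal(0.0, 0.1, dim)
x[0] = 8.0                                # outlier channel, common in attention keys

# Rotate, quantize to 3 bits for storage, un-rotate on the way out.
x_hat = R.T @ quant_dequant(R @ x, bits=3)
```

Because the rotation is orthogonal it costs nothing in exact arithmetic and is undone on dequantization; the point is that mixing channels before quantizing spreads outlier energy across pairs instead of letting one channel dominate the quantization scale.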
## KV-Cache Quantization Comparison
| Method | Prefill Speed | Decode Speed | Memory Savings | Reference |
|---|---|---|---|---|
| TurboQuant | 1x (baseline) | 1x (baseline) | High | arXiv: 2504.19874 |
| RotorQuant | 5.3x faster | 28% faster | High | GitHub |
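To see why KV-cache bit-width matters at long context, here is a back-of-envelope size calculation. The layer/head/dim numbers below are illustrative placeholders, not the real Gemma 4 31B configuration:

```python
def kv_cache_gb(seq_len: int, n_layers: int = 48, n_kv_heads: int = 8,
                head_dim: int = 128, bits: int = 16) -> float:
    """Per-sequence KV-cache size in GB: K and V tensors across all layers.
    Layer/head counts are illustrative, not the real Gemma 4 31B config."""
    elems = 2 * n_layers * n_kv_heads * head_dim * seq_len  # 2 = K and V
    return elems * bits / 8 / 1e9

fp16 = kv_cache_gb(131_072)             # full-precision cache at 128k context
kv3 = kv_cache_gb(131_072, bits=3)      # 3-bit planar3/iso3-style cache
```

Under these assumptions a 128k-token fp16 cache alone is roughly 26 GB, already past a 24 GB card, while a 3-bit cache is under 5 GB.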
## AWQ vs GGUF vs MLX
| Format | Target Hardware | Runtime | Best For |
|---|---|---|---|
| AWQ | NVIDIA / AMD GPU (CUDA/ROCm) | AutoAWQ, vLLM, TGI | GPU-native inference, production serving |
| GGUF | CPU + GPU (cross-platform) | llama.cpp, Ollama, LM Studio | Laptops, CPU-only boxes, mixed offload |
| MLX | Apple Silicon | MLX, mlx-lm, mlx-vlm | Macs with unified memory |
This repo ships AWQ. See the "See Also" section for GGUF and MLX siblings.
## Memory Estimates (Gemma 4 31B)
| Precision | Approximate Size | VRAM Tier |
|---|---|---|
| FP16 (original) | ~62 GB | 80 GB+ (A100/H100) |
| AWQ 8-bit | ~31 GB | 48 GB+ (L40S, A100 40GB tight) |
| AWQ 4-bit | ~17 GB | 24 GB+ (RTX 4090, A5000, A6000) |
Fits on a single RTX 4090 (24GB) with room for moderate contexts; comfortable on A6000 / L40 (48GB).
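The ~17 GB figure can be sanity-checked with a back-of-envelope estimate: packed 4-bit weights plus one fp16 scale and zero-point per group of 128, ignoring unquantized embeddings and norms. The helper below is hypothetical, not part of any library:

```python
def awq_size_gb(n_params: float, bits: int = 4, group_size: int = 128,
                scale_bits: int = 16, zero_bits: int = 16) -> float:
    """Back-of-envelope AWQ checkpoint size: packed low-bit weights plus one
    fp16 scale and zero-point per group. Embeddings and norms are ignored."""
    bits_per_weight = bits + (scale_bits + zero_bits) / group_size
    return n_params * bits_per_weight / 8 / 1e9

awq_size_gb(31e9)             # 4-bit, group size 128: ~16.5 GB
awq_size_gb(31e9, bits=8)     # 8-bit variant: ~32 GB
```

This lands near the quoted sizes; unquantized embeddings, norms, and file overhead account for the remaining gap.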
## Hardware Requirements
- NVIDIA GPU with >=24 GB VRAM (RTX 4090, A5000, A6000, L40, A100, H100)
- CUDA 12.x recommended
- For vLLM: compute capability >= 7.5 (Turing or newer) for Marlin kernels
- For RotorQuant KV cache: scrya-com/rotorquant fork
## See Also
- google/gemma-4-31B -- Base model
- majentik/gemma-4-31B-RotorQuant -- RotorQuant KV-cache only (transformers)
- majentik/gemma-4-31B-RotorQuant-AWQ-8bit -- AWQ 8-bit variant
- majentik/gemma-4-31B-TurboQuant-AWQ-4bit -- TurboQuant AWQ 4-bit variant
- majentik/gemma-4-31B-RotorQuant-MLX-4bit -- MLX variant (Apple Silicon)
- RotorQuant GitHub
- llama-cpp-turboquant fork
- AutoAWQ
- vLLM
## Quant trade-off (AWQ lane)
| Bits | Approx size | Use case | Recommendation |
|---|---|---|---|
| **4-bit** | **~17 GB** | **Activation-aware 4-bit weight quant** | **GPU inference (vLLM, transformers, AutoAWQ)** |
| 8-bit | ~31 GB | Activation-aware 8-bit weight quant | Quality-sensitive GPU inference |
(The current variant, 4-bit, is bolded.)
## Variants in this family

(Showing all 18 variants under majentik/gemma-4-31B-*. The current variant, RotorQuant-AWQ-4bit, is bolded.)
| Variant | Runtime | Approx size | Use case |
|---|---|---|---|
| RotorQuant | runtime modifier | n/a | KV-cache root (weight-agnostic) |
| **RotorQuant-AWQ-4bit** | **transformers** | **~19 GB** | **GPU 4-bit (AutoAWQ)** |
| RotorQuant-AWQ-8bit | transformers | ~34 GB | GPU 8-bit (AutoAWQ) |
| RotorQuant-GGUF-IQ4_XS | llama.cpp | ~27 GB | Lossy 4-bit, low-RAM CPU/edge |
| RotorQuant-GGUF-Q2_K | llama.cpp | ~19 GB | Lossy, low-RAM CPU/edge |
| RotorQuant-GGUF-Q3_K_M | llama.cpp | ~24 GB | Smaller 3-bit, CPU-friendly |
| RotorQuant-GGUF-Q4_K_M | llama.cpp | ~34 GB | Balanced default |
| RotorQuant-GGUF-Q5_K_M | llama.cpp | ~41 GB | Higher fidelity, more RAM |
| RotorQuant-GGUF-Q8_0 | llama.cpp | ~65 GB | Near-lossless reference |
| RotorQuant-MLX-2bit | mlx-lm | ~9.9 GB | Apple Silicon, smallest |
| RotorQuant-MLX-4bit | mlx-lm | ~19 GB | Apple Silicon balanced |
| RotorQuant-MLX-8bit | mlx-lm | ~37 GB | Apple Silicon reference |
| TurboQuant | runtime modifier | n/a | KV-cache root (weight-agnostic) |
| TurboQuant-AWQ-4bit | transformers | ~19 GB | GPU 4-bit (AutoAWQ) |
| TurboQuant-AWQ-8bit | transformers | ~34 GB | GPU 8-bit (AutoAWQ) |
| TurboQuant-MLX-2bit | mlx-lm | ~9.9 GB | Apple Silicon, smallest |
| TurboQuant-MLX-4bit | mlx-lm | ~19 GB | Apple Silicon balanced |
| TurboQuant-MLX-8bit | mlx-lm | ~37 GB | Apple Silicon reference |
## Model tree for majentik/gemma-4-31B-RotorQuant-AWQ-4bit

Base model: google/gemma-4-31B