# Gemma 4 31B-it - RotorQuant AWQ 8-bit
8-bit AWQ-quantized version of google/gemma-4-31B-it (31B dense, instruction-tuned) with RotorQuant KV-cache quantization. AWQ (Activation-aware Weight Quantization) scales weights according to activation statistics, which makes it well suited to GPU inference. The 8-bit variant keeps quality very close to FP16 while halving VRAM usage, and RotorQuant delivers 5.3x faster prefill and 28% faster decode vs TurboQuant.
Approximate model size: ~31 GB
Note: RotorQuant KV-cache modes (`planar3`, `iso3`) require the RotorQuant fork or the llama-cpp-turboquant fork. The AWQ weights themselves load cleanly in stock AutoAWQ / vLLM; the RotorQuant KV-cache kernels are opt-in.
## Model Specifications
| Property | Value |
|---|---|
| Base Model | google/gemma-4-31B-it |
| Parameters | ~31 billion |
| Architecture | Dense transformer, instruction-tuned |
| Modality | Multimodal: image + text input, text output |
| License | Apache 2.0 |
| Weight Quantization | AWQ 8-bit (~31 GB) |
| Group Size | 128 |
| KV-Cache Quantization | RotorQuant (planar3 / iso3) |
| Framework | transformers + AutoAWQ / vLLM |
## Quickstart

### AutoAWQ
```python
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model = AutoAWQForCausalLM.from_quantized(
    "majentik/gemma-4-31B-it-RotorQuant-AWQ-8bit",
    device_map="auto",
    fuse_layers=True,
)
tokenizer = AutoTokenizer.from_pretrained("majentik/gemma-4-31B-it-RotorQuant-AWQ-8bit")

messages = [{"role": "user", "content": "Draft a launch announcement for a new API product."}]
inputs = tokenizer.apply_chat_template(messages, return_tensors="pt", add_generation_prompt=True).to(model.device)
out = model.generate(inputs, max_new_tokens=512)
# Decode only the newly generated tokens, skipping the prompt.
print(tokenizer.decode(out[0][inputs.shape[-1]:], skip_special_tokens=True))
```
### vLLM

```bash
vllm serve majentik/gemma-4-31B-it-RotorQuant-AWQ-8bit \
  --quantization awq_marlin \
  --tensor-parallel-size 1 \
  --max-model-len 8192
```
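Once serving, the model is reachable over vLLM's OpenAI-compatible HTTP API (default port 8000). A minimal chat-completions request body looks like this; the prompt text is illustrative:

```python
import json

# Minimal OpenAI-style chat-completions payload for the vLLM server above.
payload = {
    "model": "majentik/gemma-4-31B-it-RotorQuant-AWQ-8bit",
    "messages": [
        {"role": "user", "content": "Summarize AWQ in one sentence."}
    ],
    "max_tokens": 256,
}

# POST this as JSON to http://localhost:8000/v1/chat/completions
# with header "Content-Type: application/json".
body = json.dumps(payload)
```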
### With RotorQuant KV cache (fork)

```python
from rotorquant import RotorQuantCache

cache = RotorQuantCache(model, mode="iso3")  # or mode="planar3"
```
## What is RotorQuant?
RotorQuant is a high-performance KV-cache quantization method using block-diagonal Clifford-algebra rotors. Combined with AWQ 8-bit weights, it delivers near-FP16 quality at roughly half the VRAM cost, with RotorQuant's compressed KV cache further reducing long-context memory.
Key advantages over TurboQuant:
- 5.3x faster prefill
- 28% faster decode
- Equivalent memory savings
- `planar3` / `iso3` 3-bit KV-cache modes
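RotorQuant's kernels themselves are in the fork, but the general rotate-then-quantize idea behind such KV-cache schemes can be sketched in a few lines of NumPy: apply a block-diagonal orthogonal rotation to spread activation outliers across channels, quantize to low-bit integers, then invert the rotation on dequantize. Everything here (block size, int8 instead of 3-bit, symmetric per-tensor scaling) is illustrative, not the actual RotorQuant implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

def block_rotation(dim, block=8):
    # Block-diagonal orthogonal matrix: one random rotation per block.
    R = np.zeros((dim, dim))
    for i in range(0, dim, block):
        q, _ = np.linalg.qr(rng.standard_normal((block, block)))
        R[i:i + block, i:i + block] = q
    return R

def quantize_int8(x):
    # Symmetric per-tensor scale; real schemes use per-channel/group scales.
    scale = np.abs(x).max() / 127.0
    q = np.round(x / scale).clip(-127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

dim = 64
R = block_rotation(dim)
kv = rng.standard_normal((16, dim)).astype(np.float32)  # toy KV-cache tile

rotated = kv @ R                    # rotate before quantizing
q, s = quantize_int8(rotated)       # store low-bit ints + one scale
restored = dequantize(q, s) @ R.T   # dequantize, rotate back
err = np.abs(restored - kv).max()
```

Because the rotation is orthogonal, it is exactly invertible and costs no extra quantization error of its own; the reconstruction error comes only from the integer rounding step.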
## KV-Cache Quantization Comparison
| Method | Prefill Speed | Decode Speed | Memory Savings | Reference |
|---|---|---|---|---|
| TurboQuant | 1x (baseline) | 1x (baseline) | High | arXiv: 2504.19874 |
| RotorQuant | 5.3x faster | 28% faster | High | GitHub |
## AWQ vs GGUF vs MLX
| Format | Target Hardware | Runtime | Best For |
|---|---|---|---|
| AWQ | NVIDIA / AMD GPU (CUDA/ROCm) | AutoAWQ, vLLM, TGI | GPU-native inference, production serving |
| GGUF | CPU + GPU (cross-platform) | llama.cpp, Ollama, LM Studio | Laptops, CPU-only boxes, mixed offload |
| MLX | Apple Silicon | MLX, mlx-lm, mlx-vlm | Macs with unified memory |
This repo ships AWQ. See the "See Also" section for GGUF and MLX siblings.
## Memory Estimates (Gemma 4 31B-it)
| Precision | Approximate Size | VRAM Tier |
|---|---|---|
| FP16 (original) | ~62 GB | 80 GB+ (A100/H100) |
| AWQ 8-bit | ~31 GB | 40 GB+ (A100 40/80GB, L40S, 2x RTX 4090) |
| AWQ 4-bit | ~17 GB | 24 GB+ |
Best deployed on server-class GPUs (A100 40/80GB, L40S, H100) or dual RTX 4090 with tensor parallelism.
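The table sizes follow directly from parameter count times bytes per parameter; a quick sanity check (raw weight memory only, ignoring the small overhead from embeddings, quantization scales, and zero points, and ignoring KV-cache growth at runtime):

```python
def weight_gb(params_billion: float, bits: int) -> float:
    # Raw weight memory in decimal GB: params * bits / 8 bytes each.
    return params_billion * 1e9 * bits / 8 / 1e9

print(weight_gb(31, 16))  # FP16     -> 62.0 GB
print(weight_gb(31, 8))   # AWQ 8-bit -> 31.0 GB
print(weight_gb(31, 4))   # AWQ 4-bit -> 15.5 GB (+ overhead ≈ the ~17 GB above)
```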
## Hardware Requirements
- A single NVIDIA GPU with >= 40 GB VRAM, or two 24 GB cards with tensor parallelism (TP=2)
- Recommended: A100 40GB, A100 80GB, L40S 48GB, H100 80GB
- CUDA 12.x recommended
- For vLLM: compute capability >= 7.5 (Turing or newer) for Marlin kernels
- For RotorQuant KV cache: scrya-com/rotorquant fork
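The Marlin-kernel requirement is a simple compute-capability threshold and can be checked programmatically; `supports_marlin` is a hypothetical helper name, and in practice you would feed it the tuple returned by `torch.cuda.get_device_capability()`:

```python
def supports_marlin(capability: tuple) -> bool:
    # vLLM's Marlin AWQ kernels need compute capability >= 7.5 (Turing+).
    return capability >= (7, 5)

print(supports_marlin((8, 0)))  # A100 -> True
print(supports_marlin((7, 5)))  # T4   -> True
print(supports_marlin((7, 0)))  # V100 -> False
```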
## See Also
- google/gemma-4-31B-it -- Base model
- majentik/gemma-4-31B-it-RotorQuant -- RotorQuant KV-cache only (transformers)
- majentik/gemma-4-31B-it-RotorQuant-AWQ-4bit -- AWQ 4-bit variant
- majentik/gemma-4-31B-it-TurboQuant-AWQ-8bit -- TurboQuant AWQ 8-bit variant
- majentik/gemma-4-31B-it-RotorQuant-MLX-8bit -- MLX variant (Apple Silicon)
- RotorQuant GitHub
- llama-cpp-turboquant fork
- AutoAWQ
- vLLM
## Quant trade-off (AWQ lane)
| Bits | Approx size | Description | Recommendation |
|---|---|---|---|
| 4-bit | ~17 GB | Activation-aware 4-bit weight quant | GPU inference (vLLM, transformers, AutoAWQ) |
| **8-bit** | **~31 GB** | **Activation-aware 8-bit weight quant** | **Quality-sensitive GPU inference** |

(The current variant, 8-bit, is bolded.)
## Variants in this family
(Showing 18 sibling variants under majentik/gemma-4-31B-it-*. The current variant, RotorQuant-AWQ-8bit, is bolded.)
| Variant | Runtime | Approx size | Use case |
|---|---|---|---|
| RotorQuant | runtime modifier | n/a | KV-cache root (weight-agnostic) |
| RotorQuant-AWQ-4bit | transformers | ~19 GB | GPU 4-bit (AutoAWQ) |
| **RotorQuant-AWQ-8bit** | **transformers** | **~34 GB** | **GPU 8-bit (AutoAWQ)** |
| RotorQuant-GGUF-IQ4_XS | llama.cpp | ~27 GB | Lossy 4-bit, low-RAM CPU/edge |
| RotorQuant-GGUF-Q2_K | llama.cpp | ~19 GB | Lossy, low-RAM CPU/edge |
| RotorQuant-GGUF-Q3_K_M | llama.cpp | ~24 GB | Smaller 3-bit, CPU-friendly |
| RotorQuant-GGUF-Q4_K_M | llama.cpp | ~34 GB | Balanced default |
| RotorQuant-GGUF-Q5_K_M | llama.cpp | ~41 GB | Higher fidelity, more RAM |
| RotorQuant-GGUF-Q8_0 | llama.cpp | ~65 GB | Near-lossless reference |
| RotorQuant-MLX-2bit | mlx-lm | ~9.9 GB | Apple Silicon, smallest |
| RotorQuant-MLX-4bit | mlx-lm | ~19 GB | Apple Silicon balanced |
| RotorQuant-MLX-8bit | mlx-lm | ~37 GB | Apple Silicon reference |
| TurboQuant | runtime modifier | n/a | KV-cache root (weight-agnostic) |
| TurboQuant-AWQ-4bit | transformers | ~19 GB | GPU 4-bit (AutoAWQ) |
| TurboQuant-AWQ-8bit | transformers | ~34 GB | GPU 8-bit (AutoAWQ) |
| TurboQuant-MLX-2bit | mlx-lm | ~9.9 GB | Apple Silicon, smallest |
| TurboQuant-MLX-4bit | mlx-lm | ~19 GB | Apple Silicon balanced |
| TurboQuant-MLX-8bit | mlx-lm | ~37 GB | Apple Silicon reference |