---
license: apache-2.0
base_model: google/gemma-4-E2B
tags:
- awq
- rotorquant
- kv-cache-quantization
- gemma
- gemma4
- quantized
- 4bit
library_name: transformers
pipeline_tag: image-text-to-text
---
# Gemma 4 E2B - RotorQuant AWQ 4-bit
**4-bit AWQ-quantized version** of [google/gemma-4-E2B](https://huggingface.co/google/gemma-4-E2B) with RotorQuant KV-cache quantization. AWQ (Activation-aware Weight Quantization) uses activation statistics to find and protect the salient weights that matter most to model outputs, which makes it a good fit for GPU inference. RotorQuant reports 5.3x faster prefill and 28% faster decode than TurboQuant.
Approximate model size: **~1.5 GB**
> **Note:** RotorQuant KV cache modes (`planar3`, `iso3`) require the [RotorQuant fork](https://github.com/scrya-com/rotorquant) or the [llama-cpp-turboquant fork](https://github.com/johndpope/llama-cpp-turboquant/tree/feature/planarquant-kv-cache) for llama.cpp workflows. The AWQ weights themselves load cleanly in stock AutoAWQ / vLLM; RotorQuant KV-cache kernels are opt-in.
## Model Specifications
| Property | Value |
|---|---|
| **Base Model** | [google/gemma-4-E2B](https://huggingface.co/google/gemma-4-E2B) |
| **Parameters** | ~2 billion |
| **Architecture** | Dense transformer |
| **Modality** | Multimodal: image + text input, text output |
| **License** | Apache 2.0 |
| **Weight Quantization** | AWQ 4-bit (~1.5 GB) |
| **Group Size** | 128 |
| **KV-Cache Quantization** | RotorQuant (`planar3` / `iso3`) |
| **Framework** | transformers + AutoAWQ / vLLM |
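To make the **Group Size** row concrete, here is a minimal sketch of group-wise 4-bit quantization in pure Python. It uses plain min-max (asymmetric) quantization per group of 128 values. It is not AWQ's activation-aware scale search, and the function names are illustrative rather than part of AutoAWQ.

```python
def quantize_group(values, bits=4):
    """Min-max quantize one group of floats to integer codes in [0, 2**bits - 1]."""
    qmax = (1 << bits) - 1
    lo, hi = min(values), max(values)
    # One scale and one offset are stored per group.
    scale = (hi - lo) / qmax if hi > lo else 1.0
    codes = [round((v - lo) / scale) for v in values]
    return codes, scale, lo


def dequantize_group(codes, scale, lo):
    """Map integer codes back to approximate float values."""
    return [c * scale + lo for c in codes]


def quantize_row(row, group_size=128, bits=4):
    """Split a weight row into consecutive groups and quantize each one."""
    return [
        quantize_group(row[start:start + group_size], bits)
        for start in range(0, len(row), group_size)
    ]


row = [i / 255 for i in range(256)]   # toy "weights"
groups = quantize_row(row)            # two groups of 128 values each
codes, scale, lo = groups[0]
restored = dequantize_group(codes, scale, lo)
max_err = max(abs(a - b) for a, b in zip(row[:128], restored))
print(len(groups), max_err <= scale / 2 + 1e-12)
```

A smaller group size tracks the local weight range more closely but stores more scales and offsets. A group size of 128 is a common middle ground.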
## Quickstart
### AutoAWQ
```python
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model = AutoAWQForCausalLM.from_quantized(
    "majentik/gemma-4-E2B-RotorQuant-AWQ-4bit",
    device_map="auto",
    fuse_layers=True,
)
tokenizer = AutoTokenizer.from_pretrained("majentik/gemma-4-E2B-RotorQuant-AWQ-4bit")

prompt = "The history of artificial intelligence began"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=512)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```
### vLLM
```bash
vllm serve majentik/gemma-4-E2B-RotorQuant-AWQ-4bit \
  --quantization awq_marlin \
  --max-model-len 8192
```
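Once the server is up, it exposes an OpenAI-compatible HTTP API. The sketch below builds a request for the `/v1/completions` endpoint using only the standard library. It assumes the default host and port (`localhost:8000`). The helper names are illustrative.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000/v1"  # assumed default; change if you pass --host/--port
MODEL = "majentik/gemma-4-E2B-RotorQuant-AWQ-4bit"


def build_completion_request(prompt, max_tokens=128):
    """Build a POST request for the OpenAI-compatible completions endpoint."""
    payload = {"model": MODEL, "prompt": prompt, "max_tokens": max_tokens}
    return urllib.request.Request(
        f"{BASE_URL}/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def complete(prompt, max_tokens=128):
    """Send the request to a running server and return the generated text."""
    with urllib.request.urlopen(build_completion_request(prompt, max_tokens)) as resp:
        return json.load(resp)["choices"][0]["text"]
```

With the server running, `complete("The history of artificial intelligence began")` returns the continuation as a string.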
### With RotorQuant KV cache (fork)
```python
from rotorquant import RotorQuantCache
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model = AutoAWQForCausalLM.from_quantized(
    "majentik/gemma-4-E2B-RotorQuant-AWQ-4bit", device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained("majentik/gemma-4-E2B-RotorQuant-AWQ-4bit")

# Quantized KV cache provided by the RotorQuant fork
cache = RotorQuantCache(model, mode="iso3")  # or "planar3"

inputs = tokenizer("Long-context prompt...", return_tensors="pt").to(model.device)
out = model.generate(**inputs, past_key_values=cache, max_new_tokens=1024)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```
## What is RotorQuant?
[RotorQuant](https://github.com/scrya-com/rotorquant) is a KV-cache quantization method that uses block-diagonal Clifford-algebra rotors and reports higher throughput than TurboQuant. Paired with AWQ 4-bit weights, it compresses both the model weights and the KV cache for GPU inference.
Key advantages over TurboQuant:
- **5.3x faster prefill**
- **28% faster decode**
- Equivalent memory savings
- `planar3` / `iso3` 3-bit KV cache modes
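To give a feel for what a 3-bit KV cache saves, here is a back-of-the-envelope sketch. The layer count, KV-head count, and head dimension are hypothetical placeholders, not values read from this model's config. The estimate also ignores per-block scales or other metadata that a real quantized cache stores.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, bits_per_value):
    """Estimate KV-cache size: one K and one V tensor per layer."""
    values = 2 * num_layers * num_kv_heads * head_dim * seq_len
    return values * bits_per_value / 8


# Hypothetical config values, for illustration only.
cfg = dict(num_layers=30, num_kv_heads=1, head_dim=256)

fp16 = kv_cache_bytes(**cfg, seq_len=8192, bits_per_value=16)
q3 = kv_cache_bytes(**cfg, seq_len=8192, bits_per_value=3)

print(f"FP16 cache : {fp16 / 2**20:.0f} MiB")
print(f"3-bit cache: {q3 / 2**20:.0f} MiB")
print(f"Reduction  : {fp16 / q3:.2f}x")
```

The ratio depends only on the bit widths (16 / 3), so it holds for any sequence length. The absolute savings grow linearly with context length.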
## KV-Cache Quantization Comparison
| Method | Prefill Speed | Decode Speed | Memory Savings | Reference |
|---|---|---|---|---|
| **TurboQuant** | 1x (baseline) | 1x (baseline) | High | [arXiv: 2504.19874](https://arxiv.org/abs/2504.19874) |
| **RotorQuant** | **5.3x faster** | **28% faster** | High | [GitHub](https://github.com/scrya-com/rotorquant) |
## AWQ vs GGUF vs MLX
| Format | Target Hardware | Runtime | Best For |
|---|---|---|---|
| **AWQ** | NVIDIA / AMD GPU (CUDA/ROCm) | AutoAWQ, vLLM, TGI | GPU-native inference, production serving |
| **GGUF** | CPU + GPU (cross-platform) | llama.cpp, Ollama, LM Studio | Laptops, CPU-only boxes, mixed offload |
| **MLX** | Apple Silicon | MLX, mlx-lm, mlx-vlm | Macs with unified memory |
This repo ships **AWQ**. See the "See Also" section for GGUF and MLX siblings.
## Memory Estimates (Gemma 4 E2B)
| Precision | Approximate Size | VRAM Tier |
|---|---|---|
| FP16 (original) | ~4 GB | 8 GB+ |
| AWQ 8-bit | ~2 GB | 4 GB+ |
| **AWQ 4-bit** | **~1.5 GB** | **4 GB+** |
Fits comfortably on entry-level GPUs (RTX 3050 / 4060 / A2000 and up).
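The sizes above can be sanity-checked with simple arithmetic. The sketch below estimates the storage for group-quantized weights alone, assuming one FP16 scale and one zero point per group of 128. It gives a lower bound, since real checkpoints usually keep some tensors (for example embeddings) at higher precision. That is an assumption about this checkpoint, not something stated by the card.

```python
def awq_weight_bytes(n_params, bits=4, group_size=128, scale_bits=16):
    """Rough lower bound on storage for group-quantized weights."""
    packed = n_params * bits / 8                  # the quantized weights themselves
    n_groups = n_params / group_size
    # Per group: one scale (scale_bits) plus one zero point (same width as a weight).
    metadata = n_groups * (scale_bits + bits) / 8
    return packed + metadata


n_params = 2e9  # "~2 billion" from the specifications table
for bits in (4, 8):
    size_gb = awq_weight_bytes(n_params, bits=bits) / 1e9
    print(f"{bits}-bit: ~{size_gb:.2f} GB for quantized layers only")
print(f"FP16 : ~{n_params * 2 / 1e9:.2f} GB")
```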
## Hardware Requirements
- NVIDIA GPU with >=4 GB VRAM (RTX 3050, 3060, 4060, A2000, T4)
- CUDA 12.x recommended
- For vLLM: compute capability >= 7.5 (Turing or newer) for Marlin kernels
- For RotorQuant KV cache: [scrya-com/rotorquant](https://github.com/scrya-com/rotorquant) fork
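A quick way to check the compute-capability requirement before installing anything heavy is sketched below. The `(7, 5)` default is taken from the list above. Kernel requirements can differ between vLLM releases, so treat it as a parameter rather than a guarantee. The helper names are illustrative.

```python
def meets_min_capability(capability, minimum=(7, 5)):
    """Compare a (major, minor) compute capability against a minimum."""
    return tuple(capability) >= tuple(minimum)


def local_gpu_capability():
    """Return (major, minor) for GPU 0, or None if PyTorch or CUDA is unavailable."""
    try:
        import torch
    except ImportError:
        return None
    if not torch.cuda.is_available():
        return None
    return torch.cuda.get_device_capability(0)


cap = local_gpu_capability()
if cap is None:
    print("No CUDA GPU detected (or PyTorch is not installed).")
else:
    print(f"Compute capability {cap[0]}.{cap[1]}; meets minimum: {meets_min_capability(cap)}")
```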
## See Also
- [google/gemma-4-E2B](https://huggingface.co/google/gemma-4-E2B) -- Base model
- [majentik/gemma-4-E2B-RotorQuant](https://huggingface.co/majentik/gemma-4-E2B-RotorQuant) -- RotorQuant KV-cache only (transformers)
- [majentik/gemma-4-E2B-RotorQuant-AWQ-8bit](https://huggingface.co/majentik/gemma-4-E2B-RotorQuant-AWQ-8bit) -- AWQ 8-bit variant
- [majentik/gemma-4-E2B-TurboQuant-AWQ-4bit](https://huggingface.co/majentik/gemma-4-E2B-TurboQuant-AWQ-4bit) -- TurboQuant AWQ 4-bit variant
- [majentik/gemma-4-E2B-RotorQuant-MLX-4bit](https://huggingface.co/majentik/gemma-4-E2B-RotorQuant-MLX-4bit) -- MLX variant (Apple Silicon)
- [RotorQuant GitHub](https://github.com/scrya-com/rotorquant)
- [llama-cpp-turboquant fork](https://github.com/johndpope/llama-cpp-turboquant/tree/feature/planarquant-kv-cache)
- [AutoAWQ](https://github.com/casper-hansen/AutoAWQ)
- [vLLM](https://github.com/vllm-project/vllm)
## Quant trade-off (AWQ lane)
| Bits | Approx size | Use case | Recommendation |
|---|---|---|---|
| **4-bit** | ~860 MB | Activation-aware 4-bit weight quant | **GPU inference (vLLM, transformers, AutoAWQ)** |
| 8-bit | ~1.5 GB | Activation-aware 8-bit weight quant | Quality-sensitive GPU inference |
(The current variant, **4-bit**, is bolded.)
## Variants in this family
(Showing 18 sibling variants under `majentik/gemma4-e2b-*`. The current variant, `RotorQuant-AWQ-4bit`, is **bolded**.)
| Variant | Runtime | Approx size | Use case |
|---|---|---|---|
| [RotorQuant](https://huggingface.co/majentik/gemma4-e2b-rotorquant) | runtime modifier | n/a | KV-cache root (weight-agnostic) |
| **RotorQuant-AWQ-4bit** | transformers | ~1.2 GB | GPU 4-bit (AutoAWQ) |
| [RotorQuant-AWQ-8bit](https://huggingface.co/majentik/gemma4-e2b-rotorquant-awq-8bit) | transformers | ~2.2 GB | GPU 8-bit (AutoAWQ) |
| [RotorQuant-GGUF-IQ4_XS](https://huggingface.co/majentik/gemma4-e2b-rotorquant-gguf-IQ4_XS) | llama.cpp | ~1.7 GB | Lossy 4-bit, low-RAM CPU/edge |
| [RotorQuant-GGUF-Q2_K](https://huggingface.co/majentik/gemma4-e2b-rotorquant-gguf-Q2_K) | llama.cpp | ~1.2 GB | Lossy, low-RAM CPU/edge |
| [RotorQuant-GGUF-Q3_K_M](https://huggingface.co/majentik/gemma4-e2b-rotorquant-gguf-Q3_K_M) | llama.cpp | ~1.6 GB | Smaller 3-bit, CPU-friendly |
| [RotorQuant-GGUF-Q4_K_M](https://huggingface.co/majentik/gemma4-e2b-rotorquant-gguf-Q4_K_M) | llama.cpp | ~2.2 GB | Balanced default |
| [RotorQuant-GGUF-Q5_K_M](https://huggingface.co/majentik/gemma4-e2b-rotorquant-gguf-Q5_K_M) | llama.cpp | ~2.6 GB | Higher fidelity, more RAM |
| [RotorQuant-GGUF-Q8_0](https://huggingface.co/majentik/gemma4-e2b-rotorquant-gguf-Q8_0) | llama.cpp | ~4.2 GB | Near-lossless reference |
| [RotorQuant-MLX-2bit](https://huggingface.co/majentik/gemma4-e2b-rotorquant-mlx-2bit) | mlx-lm | ~655 MB | Apple Silicon, smallest |
| [RotorQuant-MLX-4bit](https://huggingface.co/majentik/gemma4-e2b-rotorquant-mlx-4bit) | mlx-lm | ~1.2 GB | Apple Silicon balanced |
| [RotorQuant-MLX-8bit](https://huggingface.co/majentik/gemma4-e2b-rotorquant-mlx-8bit) | mlx-lm | ~2.4 GB | Apple Silicon reference |
| [TurboQuant](https://huggingface.co/majentik/gemma4-e2b-turboquant) | runtime modifier | n/a | KV-cache root (weight-agnostic) |
| [TurboQuant-AWQ-4bit](https://huggingface.co/majentik/gemma4-e2b-turboquant-awq-4bit) | transformers | ~1.2 GB | GPU 4-bit (AutoAWQ) |
| [TurboQuant-AWQ-8bit](https://huggingface.co/majentik/gemma4-e2b-turboquant-awq-8bit) | transformers | ~2.2 GB | GPU 8-bit (AutoAWQ) |
| [TurboQuant-MLX-2bit](https://huggingface.co/majentik/gemma4-e2b-turboquant-mlx-2bit) | mlx-lm | ~655 MB | Apple Silicon, smallest |
| [TurboQuant-MLX-4bit](https://huggingface.co/majentik/gemma4-e2b-turboquant-mlx-4bit) | mlx-lm | ~1.2 GB | Apple Silicon balanced |
| [TurboQuant-MLX-8bit](https://huggingface.co/majentik/gemma4-e2b-turboquant-mlx-8bit) | mlx-lm | ~2.4 GB | Apple Silicon reference |