---
license: apache-2.0
base_model: google/gemma-4-E4B
tags:
  - awq
  - rotorquant
  - kv-cache-quantization
  - gemma
  - gemma4
  - quantized
  - 4bit
library_name: transformers
pipeline_tag: image-text-to-text
---

# Gemma 4 E4B - RotorQuant AWQ 4-bit

**4-bit AWQ-quantized version** of [google/gemma-4-E4B](https://huggingface.co/google/gemma-4-E4B) with RotorQuant KV-cache quantization. AWQ (Activation-aware Weight Quantization) uses activation statistics to find the weight channels that matter most and protects them during quantization; it targets GPU inference. RotorQuant is reported to deliver 5.3x faster prefill and 28% faster decode than TurboQuant.

Approximate model size: **~2.5 GB**

> **Note:** RotorQuant KV cache modes (`planar3`, `iso3`) require the [RotorQuant fork](https://github.com/scrya-com/rotorquant) or the [llama-cpp-turboquant fork](https://github.com/johndpope/llama-cpp-turboquant/tree/feature/planarquant-kv-cache). The AWQ weights themselves load cleanly in stock AutoAWQ / vLLM; RotorQuant KV-cache kernels are opt-in.

## Model Specifications

| Property | Value |
|---|---|
| **Base Model** | [google/gemma-4-E4B](https://huggingface.co/google/gemma-4-E4B) |
| **Parameters** | ~4 billion |
| **Architecture** | Dense transformer |
| **Modality** | Multimodal: image + text input, text output |
| **License** | Apache 2.0 |
| **Weight Quantization** | AWQ 4-bit (~2.5 GB) |
| **Group Size** | 128 |
| **KV-Cache Quantization** | RotorQuant (`planar3` / `iso3`) |
| **Framework** | transformers + AutoAWQ / vLLM |

## Quickstart

### AutoAWQ

```python
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

MODEL_ID = "majentik/gemma-4-E4B-RotorQuant-AWQ-4bit"

# Load the 4-bit AWQ weights; fuse_layers=True enables AutoAWQ's fused modules.
model = AutoAWQForCausalLM.from_quantized(
    MODEL_ID,
    device_map="auto",
    fuse_layers=True,
)
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

prompt = "The history of artificial intelligence began"
# Move the inputs to the device that holds the model weights.
device = next(model.parameters()).device
inputs = tokenizer(prompt, return_tensors="pt").to(device)
out = model.generate(**inputs, max_new_tokens=512)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```

### vLLM

```bash
vllm serve majentik/gemma-4-E4B-RotorQuant-AWQ-4bit \
  --quantization awq_marlin \
  --max-model-len 8192
```
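
Once the server is running, it can be queried over vLLM's OpenAI-compatible HTTP API. The sketch below assumes the default address `http://localhost:8000` and uses the plain completions endpoint, since this is a base model. The helper names `build_completion_request` and `send_completion_request` are illustrative and not part of vLLM.

```python
import json
import urllib.request

MODEL_ID = "majentik/gemma-4-E4B-RotorQuant-AWQ-4bit"
BASE_URL = "http://localhost:8000"  # assumption: vLLM's default host and port


def build_completion_request(prompt: str, max_tokens: int = 128) -> urllib.request.Request:
    """Build (but do not send) a POST request for /v1/completions."""
    payload = {"model": MODEL_ID, "prompt": prompt, "max_tokens": max_tokens}
    return urllib.request.Request(
        f"{BASE_URL}/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send_completion_request(prompt: str, max_tokens: int = 128) -> str:
    """Send the request to a running server and return the generated text."""
    with urllib.request.urlopen(build_completion_request(prompt, max_tokens)) as resp:
        body = json.load(resp)
    return body["choices"][0]["text"]


# Usage, with `vllm serve` running in another terminal:
# print(send_completion_request("The history of artificial intelligence began"))
```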

### With RotorQuant KV cache (fork)

```python
from rotorquant import RotorQuantCache

# `model` is the AWQ model loaded in the AutoAWQ example above.
cache = RotorQuantCache(model, mode="iso3")  # or "planar3"
```

## What is RotorQuant?

[RotorQuant](https://github.com/scrya-com/rotorquant) is a KV-cache quantization method built on block-diagonal Clifford-algebra rotors. Pairing it with AWQ 4-bit weights compresses both the model weights and the KV cache for GPU inference.

Key advantages over TurboQuant:
- **5.3x faster prefill**
- **28% faster decode**
- Equivalent memory savings
- `planar3` / `iso3` 3-bit KV cache modes
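
To give a feel for what "block-diagonal rotors" can mean, here is a toy sketch: it rotates each consecutive pair of values in a vector by the same planar rotation, quantizes the result to 3 bits, and undoes the rotation after dequantizing. This is not RotorQuant's kernel. The function names, the fixed angle, and the uniform quantizer are all assumptions made for illustration; the actual rotors and the `planar3` / `iso3` layouts are defined in the fork.

```python
import math


def rotate_pairs(vec, angle, inverse=False):
    """Apply one 2D rotation to every consecutive pair (a block-diagonal rotation)."""
    a = -angle if inverse else angle
    c, s = math.cos(a), math.sin(a)
    out = []
    for i in range(0, len(vec), 2):
        x, y = vec[i], vec[i + 1]
        out += [c * x - s * y, s * x + c * y]
    return out


def quantize_3bit(vec):
    """Uniform 3-bit quantization: 8 integer codes spanning [-scale, scale]."""
    scale = max(abs(v) for v in vec) or 1.0
    codes = [round((v / scale + 1.0) * 3.5) for v in vec]  # integers in 0..7
    return codes, scale


def dequantize_3bit(codes, scale):
    """Map 3-bit codes back to floats in [-scale, scale]."""
    return [(c / 3.5 - 1.0) * scale for c in codes]


key = [0.9, -0.1, 0.05, 0.4, -0.7, 0.2, 0.0, -0.3]  # toy key vector (even length)
angle = math.pi / 5  # arbitrary illustrative angle, not a RotorQuant parameter

rotated = rotate_pairs(key, angle)
codes, scale = quantize_3bit(rotated)
restored = rotate_pairs(dequantize_3bit(codes, scale), angle, inverse=True)
max_err = max(abs(a - b) for a, b in zip(key, restored))
print(f"codes={codes} max reconstruction error={max_err:.4f}")
```

Because a rotation preserves vector length, the only error in `restored` comes from the 3-bit rounding step, not from the rotation itself.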

## KV-Cache Quantization Comparison

| Method | Prefill Speed (relative) | Decode Speed (relative) | Memory Savings | Reference |
|---|---|---|---|---|
| **TurboQuant** | 1x (baseline) | 1x (baseline) | High | [arXiv: 2504.19874](https://arxiv.org/abs/2504.19874) |
| **RotorQuant** | **5.3x** | **1.28x (28% faster)** | High | [GitHub](https://github.com/scrya-com/rotorquant) |
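
To put a rough number on the memory savings, KV-cache size scales with layers, KV heads, head dimension, sequence length, and bits per stored value. The sketch below compares FP16 with a 3-bit cache. The layer and head counts are hypothetical placeholders, not values read from the Gemma 4 E4B config, and the estimate ignores per-block scales and any sliding-window layers.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, bits_per_value):
    """Bytes for keys plus values of one sequence, ignoring quantization metadata."""
    values = 2 * num_layers * num_kv_heads * head_dim * seq_len  # 2 = keys and values
    return values * bits_per_value / 8


# Hypothetical shape for illustration only; check the model's config.json.
shape = dict(num_layers=32, num_kv_heads=4, head_dim=128, seq_len=8192)

fp16 = kv_cache_bytes(**shape, bits_per_value=16)
q3 = kv_cache_bytes(**shape, bits_per_value=3)

print(f"FP16 KV cache : {fp16 / 2**20:.0f} MiB")
print(f"3-bit KV cache: {q3 / 2**20:.0f} MiB")
print(f"Reduction     : {fp16 / q3:.2f}x")
```

Whatever the real shape is, the ratio between FP16 and a 3-bit cache is 16/3, about 5.3x, before metadata overhead.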

## AWQ vs GGUF vs MLX

| Format | Target Hardware | Runtime | Best For |
|---|---|---|---|
| **AWQ** | NVIDIA / AMD GPU (CUDA/ROCm) | AutoAWQ, vLLM, TGI | GPU-native inference, production serving |
| **GGUF** | CPU + GPU (cross-platform) | llama.cpp, Ollama, LM Studio | Laptops, CPU-only boxes, mixed offload |
| **MLX** | Apple Silicon | MLX, mlx-lm, mlx-vlm | Macs with unified memory |

This repo ships **AWQ**. See "See Also" for the MLX sibling and "Variants in this family" for the full list, including the GGUF variants.

## Memory Estimates (Gemma 4 E4B)

| Precision | Approximate Size | VRAM Tier |
|---|---|---|
| FP16 (original) | ~8 GB | 12 GB+ |
| AWQ 8-bit | ~4 GB | 8 GB+ |
| **AWQ 4-bit** | **~2.5 GB** | **6 GB+** |

Fits on mid-range consumer GPUs (RTX 3060 12GB, 4060 Ti, 4070 and up).
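
The sizes above can be sanity-checked with back-of-the-envelope arithmetic. The sketch below assumes every one of the ~4 billion parameters is quantized, with one FP16 scale and one packed zero point per group of 128 weights; `awq_weight_bytes` is an illustrative helper, not an AutoAWQ function.

```python
def awq_weight_bytes(num_params, weight_bits, group_size=128, scale_bits=16, zero_bits=None):
    """Estimate weight storage for group-wise quantization, in bytes."""
    if zero_bits is None:
        zero_bits = weight_bits  # assumption: zero points use the weight bit width
    bits_per_weight = weight_bits + (scale_bits + zero_bits) / group_size
    return num_params * bits_per_weight / 8


NUM_PARAMS = 4e9  # "~4 billion" from the specifications table

fp16_bytes = NUM_PARAMS * 16 / 8
awq4_bytes = awq_weight_bytes(NUM_PARAMS, weight_bits=4)
awq8_bytes = awq_weight_bytes(NUM_PARAMS, weight_bits=8)

for name, size in [("FP16", fp16_bytes), ("AWQ 4-bit", awq4_bytes), ("AWQ 8-bit", awq8_bytes)]:
    print(f"{name:10s} ~{size / 1e9:.2f} GB")
```

This estimate comes out a little below the ~2.5 GB checkpoint for 4-bit. One plausible reason is that some tensors, such as embeddings and normalization weights, are typically left in higher precision.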

## Hardware Requirements

- NVIDIA GPU with >=6 GB VRAM (RTX 3060, 4060, 4070, A4000, L4)
- CUDA 12.x recommended
- For vLLM: compute capability >= 8.0 (Ampere or newer) for the `awq_marlin` kernels; on 7.5 (Turing), use `--quantization awq` instead
- For RotorQuant KV cache: [scrya-com/rotorquant](https://github.com/scrya-com/rotorquant) fork

## See Also

- [google/gemma-4-E4B](https://huggingface.co/google/gemma-4-E4B) -- Base model
- [majentik/gemma-4-E4B-RotorQuant](https://huggingface.co/majentik/gemma-4-E4B-RotorQuant) -- RotorQuant KV-cache only (transformers)
- [majentik/gemma-4-E4B-RotorQuant-AWQ-8bit](https://huggingface.co/majentik/gemma-4-E4B-RotorQuant-AWQ-8bit) -- AWQ 8-bit variant
- [majentik/gemma-4-E4B-TurboQuant-AWQ-4bit](https://huggingface.co/majentik/gemma-4-E4B-TurboQuant-AWQ-4bit) -- TurboQuant AWQ 4-bit variant
- [majentik/gemma-4-E4B-RotorQuant-MLX-4bit](https://huggingface.co/majentik/gemma-4-E4B-RotorQuant-MLX-4bit) -- MLX variant (Apple Silicon)
- [RotorQuant GitHub](https://github.com/scrya-com/rotorquant)
- [llama-cpp-turboquant fork](https://github.com/johndpope/llama-cpp-turboquant/tree/feature/planarquant-kv-cache)
- [AutoAWQ](https://github.com/casper-hansen/AutoAWQ)
- [vLLM](https://github.com/vllm-project/vllm)

## Quant trade-off (AWQ lane)

| Bits | Approx size | Use case | Recommendation |
|---|---|---|---|
| **4-bit** | ~2.5 GB | Activation-aware 4-bit weight quant | **GPU inference (vLLM, transformers, AutoAWQ)** |
| 8-bit | ~4.4 GB | Activation-aware 8-bit weight quant | Quality-sensitive GPU inference |

(The current variant, **4-bit**, is bolded.)

## Variants in this family

(Listing the 18 variants under `majentik/gemma4-e4b-*`. The current variant, `RotorQuant-AWQ-4bit`, is **bolded**.)

| Variant | Runtime | Approx size | Use case |
|---|---|---|---|
| [RotorQuant](https://huggingface.co/majentik/gemma4-e4b-rotorquant) | runtime modifier | n/a | KV-cache root (weight-agnostic) |
| **RotorQuant-AWQ-4bit** | transformers | ~2.5 GB | GPU 4-bit (AutoAWQ) |
| [RotorQuant-AWQ-8bit](https://huggingface.co/majentik/gemma4-e4b-rotorquant-awq-8bit) | transformers | ~4.4 GB | GPU 8-bit (AutoAWQ) |
| [RotorQuant-GGUF-IQ4_XS](https://huggingface.co/majentik/gemma4-e4b-rotorquant-gguf-IQ4_XS) | llama.cpp | ~3.4 GB | Lossy 4-bit, low-RAM CPU/edge |
| [RotorQuant-GGUF-Q2_K](https://huggingface.co/majentik/gemma4-e4b-rotorquant-gguf-Q2_K) | llama.cpp | ~2.4 GB | Lossy, low-RAM CPU/edge |
| [RotorQuant-GGUF-Q3_K_M](https://huggingface.co/majentik/gemma4-e4b-rotorquant-gguf-Q3_K_M) | llama.cpp | ~3.1 GB | Smaller 3-bit, CPU-friendly |
| [RotorQuant-GGUF-Q4_K_M](https://huggingface.co/majentik/gemma4-e4b-rotorquant-gguf-Q4_K_M) | llama.cpp | ~4.4 GB | Balanced default |
| [RotorQuant-GGUF-Q5_K_M](https://huggingface.co/majentik/gemma4-e4b-rotorquant-gguf-Q5_K_M) | llama.cpp | ~5.3 GB | Higher fidelity, more RAM |
| [RotorQuant-GGUF-Q8_0](https://huggingface.co/majentik/gemma4-e4b-rotorquant-gguf-Q8_0) | llama.cpp | ~8.4 GB | Near-lossless reference |
| [RotorQuant-MLX-2bit](https://huggingface.co/majentik/gemma4-e4b-rotorquant-mlx-2bit) | mlx-lm | ~1.3 GB | Apple Silicon, smallest |
| [RotorQuant-MLX-4bit](https://huggingface.co/majentik/gemma4-e4b-rotorquant-mlx-4bit) | mlx-lm | ~2.5 GB | Apple Silicon balanced |
| [RotorQuant-MLX-8bit](https://huggingface.co/majentik/gemma4-e4b-rotorquant-mlx-8bit) | mlx-lm | ~4.7 GB | Apple Silicon reference |
| [TurboQuant](https://huggingface.co/majentik/gemma4-e4b-turboquant) | runtime modifier | n/a | KV-cache root (weight-agnostic) |
| [TurboQuant-AWQ-4bit](https://huggingface.co/majentik/gemma4-e4b-turboquant-awq-4bit) | transformers | ~2.5 GB | GPU 4-bit (AutoAWQ) |
| [TurboQuant-AWQ-8bit](https://huggingface.co/majentik/gemma4-e4b-turboquant-awq-8bit) | transformers | ~4.4 GB | GPU 8-bit (AutoAWQ) |
| [TurboQuant-MLX-2bit](https://huggingface.co/majentik/gemma4-e4b-turboquant-mlx-2bit) | mlx-lm | ~1.3 GB | Apple Silicon, smallest |
| [TurboQuant-MLX-4bit](https://huggingface.co/majentik/gemma4-e4b-turboquant-mlx-4bit) | mlx-lm | ~2.5 GB | Apple Silicon balanced |
| [TurboQuant-MLX-8bit](https://huggingface.co/majentik/gemma4-e4b-turboquant-mlx-8bit) | mlx-lm | ~4.7 GB | Apple Silicon reference |