---
license: apache-2.0
base_model: google/gemma-4-E4B
tags:
- awq
- rotorquant
- kv-cache-quantization
- gemma
- gemma4
- quantized
- 4bit
library_name: transformers
pipeline_tag: image-text-to-text
---

# Gemma 4 E4B - RotorQuant AWQ 4-bit

**4-bit AWQ-quantized version** of [google/gemma-4-E4B](https://huggingface.co/google/gemma-4-E4B) with RotorQuant KV-cache quantization.

AWQ (Activation-aware Weight Quantization) uses activation statistics to protect the most salient weights during quantization, and it targets GPU inference. RotorQuant delivers 5.3x faster prefill and 28% faster decode compared to TurboQuant.

Approximate model size: **~2.5 GB**

> **Note:** RotorQuant KV cache modes (`planar3`, `iso3`) require the [RotorQuant fork](https://github.com/scrya-com/rotorquant) or the [llama-cpp-turboquant fork](https://github.com/johndpope/llama-cpp-turboquant/tree/feature/planarquant-kv-cache). The AWQ weights themselves load cleanly in stock AutoAWQ / vLLM; RotorQuant KV-cache kernels are opt-in.

## Model Specifications

| Property | Value |
|---|---|
| **Base Model** | [google/gemma-4-E4B](https://huggingface.co/google/gemma-4-E4B) |
| **Parameters** | ~4 billion |
| **Architecture** | Dense transformer |
| **Modality** | Multimodal: image + text input, text output |
| **License** | Apache 2.0 |
| **Weight Quantization** | AWQ 4-bit (~2.5 GB) |
| **Group Size** | 128 |
| **KV-Cache Quantization** | RotorQuant (`planar3` / `iso3`) |
| **Framework** | transformers + AutoAWQ / vLLM |

## Quickstart

### AutoAWQ

```python
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

# Load the pre-quantized AWQ checkpoint; fuse_layers enables fused kernels.
model = AutoAWQForCausalLM.from_quantized(
    "majentik/gemma-4-E4B-RotorQuant-AWQ-4bit",
    device_map="auto",
    fuse_layers=True,
)
tokenizer = AutoTokenizer.from_pretrained("majentik/gemma-4-E4B-RotorQuant-AWQ-4bit")

prompt = "The history of artificial intelligence began"
# AWQ kernels run on the GPU, so move the tokenized inputs to CUDA.
inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
out = model.generate(**inputs, max_new_tokens=512)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```

### vLLM

```bash
vllm serve majentik/gemma-4-E4B-RotorQuant-AWQ-4bit \
  --quantization awq_marlin \
  --max-model-len 8192
```

### With RotorQuant KV cache (fork)

```python
from rotorquant import RotorQuantCache

cache = RotorQuantCache(model, mode="iso3")  # or "planar3"
```

## What is RotorQuant?

[RotorQuant](https://github.com/scrya-com/rotorquant) is a KV-cache quantization method that uses block-diagonal Clifford-algebra rotors. Combined with AWQ 4-bit weights, it compresses both the weights and the KV cache for GPU inference.

Key advantages over TurboQuant:

- **5.3x faster prefill**
- **28% faster decode**
- Equivalent memory savings
- `planar3` / `iso3` 3-bit KV cache modes

## KV-Cache Quantization Comparison

| Method | Prefill Speed | Decode Speed | Memory Savings | Reference |
|---|---|---|---|---|
| **TurboQuant** | 1x (baseline) | 1x (baseline) | High | [arXiv: 2504.19874](https://arxiv.org/abs/2504.19874) |
| **RotorQuant** | **5.3x faster** | **28% faster** | High | [GitHub](https://github.com/scrya-com/rotorquant) |

## AWQ vs GGUF vs MLX

| Format | Target Hardware | Runtime | Best For |
|---|---|---|---|
| **AWQ** | NVIDIA / AMD GPU (CUDA/ROCm) | AutoAWQ, vLLM, TGI | GPU-native inference, production serving |
| **GGUF** | CPU + GPU (cross-platform) | llama.cpp, Ollama, LM Studio | Laptops, CPU-only boxes, mixed offload |
| **MLX** | Apple Silicon | MLX, mlx-lm, mlx-vlm | Macs with unified memory |

This repo ships **AWQ**. See the "See Also" and "Variants in this family" sections for GGUF and MLX siblings.

## Memory Estimates (Gemma 4 E4B)

| Precision | Approximate Size | VRAM Tier |
|---|---|---|
| FP16 (original) | ~8 GB | 12 GB+ |
| AWQ 8-bit | ~4 GB | 8 GB+ |
| **AWQ 4-bit** | **~2.5 GB** | **6 GB+** |

The 4-bit build fits on mid-range consumer GPUs (RTX 3060 12 GB, RTX 4060 Ti, RTX 4070 and up).
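
The sizes above can be sanity-checked with a back-of-the-envelope calculation. The sketch below is illustrative only: it assumes exactly 4 billion quantized parameters, one FP16 scale plus one zero point per group of 128 weights, and decimal gigabytes. The helper `estimate_weight_gb` and its parameters are hypothetical names, not part of AutoAWQ or vLLM. It also ignores any components left unquantized (for example embeddings or the vision encoder), which likely accounts for part of the gap between the raw 4-bit estimate and the ~2.5 GB checkpoint.

```python
from typing import Optional


def estimate_weight_gb(
    n_params: float,
    bits: int,
    group_size: Optional[int] = None,
    scale_bits: int = 16,
) -> float:
    """Rough weight-only footprint in decimal GB (1 GB = 1e9 bytes)."""
    bits_per_weight = float(bits)
    if group_size is not None:
        # Assumed per-group overhead: one scale (scale_bits wide) plus one
        # zero point stored at the same width as the quantized weights.
        bits_per_weight += (scale_bits + bits) / group_size
    return n_params * bits_per_weight / 8 / 1e9


N_PARAMS = 4e9  # "~4 billion" from the spec table; an approximation

fp16_gb = estimate_weight_gb(N_PARAMS, 16)                 # no grouping for FP16
awq8_gb = estimate_weight_gb(N_PARAMS, 8, group_size=128)
awq4_gb = estimate_weight_gb(N_PARAMS, 4, group_size=128)

for name, gb in [("FP16", fp16_gb), ("AWQ 8-bit", awq8_gb), ("AWQ 4-bit", awq4_gb)]:
    print(f"{name}: ~{gb:.2f} GB")
```

Under these assumptions the estimate comes to about 8.00 GB for FP16, 4.09 GB for 8-bit, and 2.08 GB for 4-bit. KV cache and activations come on top of the weights, which is why the VRAM tiers in the table are larger than the file sizes.
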
## Hardware Requirements

- NVIDIA GPU with >=6 GB VRAM (RTX 3060, 4060, 4070, A4000, L4)
- CUDA 12.x recommended
- For vLLM: compute capability >= 7.5 (Turing or newer) for Marlin kernels
- For RotorQuant KV cache: [scrya-com/rotorquant](https://github.com/scrya-com/rotorquant) fork

## See Also

- [google/gemma-4-E4B](https://huggingface.co/google/gemma-4-E4B) -- Base model
- [majentik/gemma-4-E4B-RotorQuant](https://huggingface.co/majentik/gemma-4-E4B-RotorQuant) -- RotorQuant KV-cache only (transformers)
- [majentik/gemma-4-E4B-RotorQuant-AWQ-8bit](https://huggingface.co/majentik/gemma-4-E4B-RotorQuant-AWQ-8bit) -- AWQ 8-bit variant
- [majentik/gemma-4-E4B-TurboQuant-AWQ-4bit](https://huggingface.co/majentik/gemma-4-E4B-TurboQuant-AWQ-4bit) -- TurboQuant AWQ 4-bit variant
- [majentik/gemma-4-E4B-RotorQuant-MLX-4bit](https://huggingface.co/majentik/gemma-4-E4B-RotorQuant-MLX-4bit) -- MLX variant (Apple Silicon)
- [RotorQuant GitHub](https://github.com/scrya-com/rotorquant)
- [llama-cpp-turboquant fork](https://github.com/johndpope/llama-cpp-turboquant/tree/feature/planarquant-kv-cache)
- [AutoAWQ](https://github.com/casper-hansen/AutoAWQ)
- [vLLM](https://github.com/vllm-project/vllm)

## Quant trade-off (AWQ lane)

| Bits | Approx size | Use case | Recommendation |
|---|---|---|---|
| **4-bit** | ~1.7 GB | Activation-aware 4-bit weight quant | **GPU inference (vLLM, transformers, AutoAWQ)** |
| 8-bit | ~3.0 GB | Activation-aware 8-bit weight quant | Quality-sensitive GPU inference |

(The current variant, **4-bit**, is bolded.)

## Variants in this family

(Showing 18 sibling variants under `majentik/gemma4-e4b-*`. The current variant, `RotorQuant-AWQ-4bit`, is **bolded**.)

| Variant | Runtime | Approx size | Use case |
|---|---|---|---|
| [RotorQuant](https://huggingface.co/majentik/gemma4-e4b-rotorquant) | runtime modifier | n/a | KV-cache root (weight-agnostic) |
| **RotorQuant-AWQ-4bit** | transformers | ~2.5 GB | GPU 4-bit (AutoAWQ) |
| [RotorQuant-AWQ-8bit](https://huggingface.co/majentik/gemma4-e4b-rotorquant-awq-8bit) | transformers | ~4.4 GB | GPU 8-bit (AutoAWQ) |
| [RotorQuant-GGUF-IQ4_XS](https://huggingface.co/majentik/gemma4-e4b-rotorquant-gguf-IQ4_XS) | llama.cpp | ~3.4 GB | Lossy 4-bit, low-RAM CPU/edge |
| [RotorQuant-GGUF-Q2_K](https://huggingface.co/majentik/gemma4-e4b-rotorquant-gguf-Q2_K) | llama.cpp | ~2.4 GB | Lossy, low-RAM CPU/edge |
| [RotorQuant-GGUF-Q3_K_M](https://huggingface.co/majentik/gemma4-e4b-rotorquant-gguf-Q3_K_M) | llama.cpp | ~3.1 GB | Smaller 3-bit, CPU-friendly |
| [RotorQuant-GGUF-Q4_K_M](https://huggingface.co/majentik/gemma4-e4b-rotorquant-gguf-Q4_K_M) | llama.cpp | ~4.4 GB | Balanced default |
| [RotorQuant-GGUF-Q5_K_M](https://huggingface.co/majentik/gemma4-e4b-rotorquant-gguf-Q5_K_M) | llama.cpp | ~5.3 GB | Higher fidelity, more RAM |
| [RotorQuant-GGUF-Q8_0](https://huggingface.co/majentik/gemma4-e4b-rotorquant-gguf-Q8_0) | llama.cpp | ~8.4 GB | Near-lossless reference |
| [RotorQuant-MLX-2bit](https://huggingface.co/majentik/gemma4-e4b-rotorquant-mlx-2bit) | mlx-lm | ~1.3 GB | Apple Silicon, smallest |
| [RotorQuant-MLX-4bit](https://huggingface.co/majentik/gemma4-e4b-rotorquant-mlx-4bit) | mlx-lm | ~2.5 GB | Apple Silicon balanced |
| [RotorQuant-MLX-8bit](https://huggingface.co/majentik/gemma4-e4b-rotorquant-mlx-8bit) | mlx-lm | ~4.7 GB | Apple Silicon reference |
| [TurboQuant](https://huggingface.co/majentik/gemma4-e4b-turboquant) | runtime modifier | n/a | KV-cache root (weight-agnostic) |
| [TurboQuant-AWQ-4bit](https://huggingface.co/majentik/gemma4-e4b-turboquant-awq-4bit) | transformers | ~2.5 GB | GPU 4-bit (AutoAWQ) |
| [TurboQuant-AWQ-8bit](https://huggingface.co/majentik/gemma4-e4b-turboquant-awq-8bit) | transformers | ~4.4 GB | GPU 8-bit (AutoAWQ) |
| [TurboQuant-MLX-2bit](https://huggingface.co/majentik/gemma4-e4b-turboquant-mlx-2bit) | mlx-lm | ~1.3 GB | Apple Silicon, smallest |
| [TurboQuant-MLX-4bit](https://huggingface.co/majentik/gemma4-e4b-turboquant-mlx-4bit) | mlx-lm | ~2.5 GB | Apple Silicon balanced |
| [TurboQuant-MLX-8bit](https://huggingface.co/majentik/gemma4-e4b-turboquant-mlx-8bit) | mlx-lm | ~4.7 GB | Apple Silicon reference |