Gemma 4 E4B - RotorQuant AWQ 8-bit
An 8-bit AWQ-quantized version of google/gemma-4-E4B with RotorQuant KV-cache quantization. AWQ (Activation-aware Weight Quantization) uses activation statistics to decide which weights to protect during quantization and is well suited to GPU inference. The 8-bit variant keeps quality very close to FP16 while roughly halving weight VRAM usage, and RotorQuant delivers 5.3x faster prefill and 28% faster decode than TurboQuant.
Approximate model size: ~4 GB
Note: RotorQuant KV cache modes (planar3, iso3) require the RotorQuant fork or the llama-cpp-turboquant fork. The AWQ weights themselves load cleanly in stock AutoAWQ / vLLM; RotorQuant KV-cache kernels are opt-in.
Model Specifications
| Property | Value |
|---|---|
| Base Model | google/gemma-4-E4B |
| Parameters | ~4 billion |
| Architecture | Dense transformer |
| Modality | Multimodal: image + text input, text output |
| License | Apache 2.0 |
| Weight Quantization | AWQ 8-bit (~4 GB) |
| Group Size | 128 |
| KV-Cache Quantization | RotorQuant (planar3 / iso3) |
| Framework | transformers + AutoAWQ / vLLM |
Quickstart
AutoAWQ
```python
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_id = "majentik/gemma-4-E4B-RotorQuant-AWQ-8bit"

# Load the quantized weights; fuse_layers=True enables AutoAWQ's fused modules.
model = AutoAWQForCausalLM.from_quantized(
    model_id,
    device_map="auto",
    fuse_layers=True,
)
tokenizer = AutoTokenizer.from_pretrained(model_id)

prompt = "The history of artificial intelligence began"
# Move the inputs to the GPU (the AutoAWQ wrapper may not expose a .device attribute).
inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
out = model.generate(**inputs, max_new_tokens=512)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```
vLLM
```bash
vllm serve majentik/gemma-4-E4B-RotorQuant-AWQ-8bit \
  --quantization awq_marlin \
  --max-model-len 8192
```
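Once the server is up, it can be queried over vLLM's OpenAI-compatible HTTP API. The following is a minimal sketch using only the Python standard library. It assumes the server runs locally on vLLM's default port 8000; the helper names `build_completion_request` and `complete`, the prompt, and the sampling values are illustrative, not part of this repo.

```python
import json
import urllib.request

MODEL_ID = "majentik/gemma-4-E4B-RotorQuant-AWQ-8bit"
# Assumption: the server was started locally on vLLM's default port.
BASE_URL = "http://localhost:8000"


def build_completion_request(prompt, max_tokens=128, temperature=0.5):
    """Build a POST request for the OpenAI-compatible /v1/completions route."""
    payload = {
        "model": MODEL_ID,
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": temperature,
    }
    return urllib.request.Request(
        f"{BASE_URL}/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def complete(prompt, **kwargs):
    """Send the request and return the text of the first completion choice."""
    request = build_completion_request(prompt, **kwargs)
    with urllib.request.urlopen(request, timeout=120) as response:
        body = json.load(response)
    return body["choices"][0]["text"]


# Example (requires a running server):
#   print(complete("Once upon a time,", max_tokens=64))
```

Building the request separately from sending it keeps the payload easy to inspect or log before any network call is made.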
With RotorQuant KV cache (fork)
```python
from rotorquant import RotorQuantCache

cache = RotorQuantCache(model, mode="iso3")  # or "planar3"
```
What is RotorQuant?
RotorQuant is a high-performance KV-cache quantization method using block-diagonal Clifford-algebra rotors. Combined with AWQ 8-bit weights, it delivers near-FP16 quality at roughly half the VRAM cost, with RotorQuant's compressed KV cache further reducing long-context memory.
Key advantages over TurboQuant:
- 5.3x faster prefill
- 28% faster decode
- Equivalent memory savings
- planar3 / iso3 3-bit KV cache modes
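To give a feel for what "block-diagonal rotation followed by low-bit quantization" means, here is a generic, pure-Python illustration. It is not RotorQuant's actual algorithm or kernels: it assumes 2x2 rotation blocks, arbitrary fixed angles, and plain min-max uniform 3-bit quantization, and every function name is hypothetical.

```python
import math


def rotate_pairs(vec, angles, inverse=False):
    """Apply a block-diagonal rotation: one 2x2 rotation per consecutive pair."""
    out = []
    for i, theta in enumerate(angles):
        if inverse:
            theta = -theta
        x, y = vec[2 * i], vec[2 * i + 1]
        c, s = math.cos(theta), math.sin(theta)
        out.extend([c * x - s * y, s * x + c * y])
    return out


def quantize_3bit(vec):
    """Uniform min-max quantization to integer codes 0..7, plus offset and scale."""
    lo, hi = min(vec), max(vec)
    scale = (hi - lo) / 7 or 1.0  # fall back to 1.0 for a constant vector
    codes = [round((v - lo) / scale) for v in vec]
    return codes, lo, scale


def dequantize_3bit(codes, lo, scale):
    """Map integer codes back to approximate floating-point values."""
    return [lo + c * scale for c in codes]


# Toy key vector and hypothetical angles (one angle per pair of channels).
key = [0.9, -0.1, 0.05, 0.4, -0.7, 0.2, 0.3, -0.3]
angles = [0.3, 1.1, -0.8, 2.0]

rotated = rotate_pairs(key, angles)
codes, lo, scale = quantize_3bit(rotated)
restored = rotate_pairs(dequantize_3bit(codes, lo, scale), angles, inverse=True)
print(codes)
print([round(v, 3) for v in restored])
```

Because each block is an orthogonal rotation, it preserves vector norms and can be undone exactly; only the 3-bit rounding step loses information. A block-diagonal structure also costs far fewer multiplications than a dense rotation of the full head dimension.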
KV-Cache Quantization Comparison
| Method | Prefill Speed | Decode Speed | Memory Savings | Reference |
|---|---|---|---|---|
| TurboQuant | 1x (baseline) | 1x (baseline) | High | arXiv: 2504.19874 |
| RotorQuant | 5.3x faster | 28% faster | High | GitHub |
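The "Memory Savings" column can be made concrete with a back-of-the-envelope estimate of KV-cache size. The sketch below is an assumption-laden illustration: the layer count, KV-head count, and head dimension are hypothetical placeholders, not the actual Gemma 4 E4B configuration, and per-block scales or other metadata are ignored.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, bits_per_value):
    """Bytes needed to cache keys and values for one sequence (no metadata)."""
    values = 2 * num_layers * num_kv_heads * head_dim * seq_len  # 2 = keys + values
    return values * bits_per_value / 8


# Hypothetical configuration, chosen only to make the arithmetic easy to follow.
config = dict(num_layers=32, num_kv_heads=8, head_dim=128, seq_len=8192)

fp16_bytes = kv_cache_bytes(**config, bits_per_value=16)
three_bit_bytes = kv_cache_bytes(**config, bits_per_value=3)

print(f"FP16 KV cache:  {fp16_bytes / 1024**3:.3f} GiB")
print(f"3-bit KV cache: {three_bit_bytes / 1024**3:.3f} GiB")
print(f"Reduction:      {fp16_bytes / three_bit_bytes:.2f}x")
```

Whatever the real configuration is, the ratio between a 16-bit and a 3-bit cache is 16/3 before metadata overhead, and the absolute savings grow linearly with context length.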
AWQ vs GGUF vs MLX
| Format | Target Hardware | Runtime | Best For |
|---|---|---|---|
| AWQ | NVIDIA / AMD GPU (CUDA/ROCm) | AutoAWQ, vLLM, TGI | GPU-native inference, production serving |
| GGUF | CPU + GPU (cross-platform) | llama.cpp, Ollama, LM Studio | Laptops, CPU-only boxes, mixed offload |
| MLX | Apple Silicon | MLX, mlx-lm, mlx-vlm | Macs with unified memory |
This repo ships AWQ. See the "See Also" section for GGUF and MLX siblings.
Memory Estimates (Gemma 4 E4B)
| Precision | Approximate Size | VRAM Tier |
|---|---|---|
| FP16 (original) | ~8 GB | 12 GB+ |
| AWQ 8-bit | ~4 GB | 8 GB+ |
| AWQ 4-bit | ~2.5 GB | 6 GB+ |
Fits on mid-range consumer GPUs (RTX 3060 12GB, 4060 Ti 8/16GB, 4070 and up).
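The sizes in the table follow roughly from parameter count times bits per weight. A minimal sketch of that arithmetic, using the ~4 billion parameters and group size 128 listed under Model Specifications; the per-group overhead (one FP16 scale plus one zero point per group) is an assumption about the storage layout, and `weight_size_gb` is a hypothetical helper.

```python
def weight_size_gb(num_params, bits_per_weight, group_size=None):
    """Estimate weight storage in GB (1 GB = 1e9 bytes)."""
    bits = bits_per_weight
    if group_size:
        # Assumption: each group stores one FP16 scale and one zero point
        # that is as wide as a quantized weight.
        bits += (16 + bits_per_weight) / group_size
    return num_params * bits / 8 / 1e9


params = 4e9  # "~4 billion" from the specifications table

print(f"FP16:      {weight_size_gb(params, 16):.2f} GB")
print(f"AWQ 8-bit: {weight_size_gb(params, 8, group_size=128):.2f} GB")
print(f"AWQ 4-bit: {weight_size_gb(params, 4, group_size=128):.2f} GB")
```

Real checkpoints tend to come out somewhat larger than this estimate because some tensors, such as embeddings, are typically left unquantized, and runtime VRAM also has to cover activations and the KV cache.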
Hardware Requirements
- NVIDIA GPU with >=8 GB VRAM (RTX 3060 Ti, 3070, 4060 Ti, 4070, A4000, L4)
- CUDA 12.x recommended
- For vLLM: compute capability >= 7.5 (Turing or newer) for Marlin kernels
- For RotorQuant KV cache: scrya-com/rotorquant fork
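The VRAM and compute-capability thresholds above can be checked programmatically. The helper below is a hypothetical sketch that takes the values as plain arguments so it runs without a GPU; the comment at the end shows how they might be read from PyTorch.

```python
MIN_VRAM_GB = 8.0        # from the requirements list above
MIN_CAPABILITY = (7, 5)  # Turing or newer


def meets_requirements(total_memory_bytes, capability):
    """Return True if a GPU satisfies the VRAM and compute-capability minimums."""
    vram_gb = total_memory_bytes / 1e9  # decimal GB, so "8 GB" cards pass
    return vram_gb >= MIN_VRAM_GB and tuple(capability) >= MIN_CAPABILITY


# With PyTorch installed, the inputs could be read like this:
#   import torch
#   props = torch.cuda.get_device_properties(0)
#   ok = meets_requirements(props.total_memory, (props.major, props.minor))
```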
See Also
- google/gemma-4-E4B -- Base model
- majentik/gemma-4-E4B-RotorQuant -- RotorQuant KV-cache only (transformers)
- majentik/gemma-4-E4B-RotorQuant-AWQ-4bit -- AWQ 4-bit variant
- majentik/gemma-4-E4B-TurboQuant-AWQ-8bit -- TurboQuant AWQ 8-bit variant
- majentik/gemma-4-E4B-RotorQuant-MLX-8bit -- MLX variant (Apple Silicon)
- RotorQuant GitHub
- llama-cpp-turboquant fork
- AutoAWQ
- vLLM
Quant trade-off (AWQ lane)
| Bits | Approx size | Use case | Recommendation |
|---|---|---|---|
| 4-bit | ~1.7 GB | Activation-aware 4-bit weight quant | GPU inference (vLLM, transformers, AutoAWQ) |
| **8-bit** | ~3.0 GB | Activation-aware 8-bit weight quant | Quality-sensitive GPU inference |
(The current variant, 8-bit, is bolded.)
Variants in this family
(Showing 18 sibling variants under majentik/gemma4-e4b-*. The current variant, RotorQuant-AWQ-8bit, is bolded.)
| Variant | Runtime | Approx size | Use case |
|---|---|---|---|
| RotorQuant | runtime modifier | n/a | KV-cache root (weight-agnostic) |
| RotorQuant-AWQ-4bit | transformers | ~2.5 GB | GPU 4-bit (AutoAWQ) |
| **RotorQuant-AWQ-8bit** | transformers | ~4.4 GB | GPU 8-bit (AutoAWQ) |
| RotorQuant-GGUF-IQ4_XS | llama.cpp | ~3.4 GB | Lossy 4-bit, low-RAM CPU/edge |
| RotorQuant-GGUF-Q2_K | llama.cpp | ~2.4 GB | Lossy, low-RAM CPU/edge |
| RotorQuant-GGUF-Q3_K_M | llama.cpp | ~3.1 GB | Smaller 3-bit, CPU-friendly |
| RotorQuant-GGUF-Q4_K_M | llama.cpp | ~4.4 GB | Balanced default |
| RotorQuant-GGUF-Q5_K_M | llama.cpp | ~5.3 GB | Higher fidelity, more RAM |
| RotorQuant-GGUF-Q8_0 | llama.cpp | ~8.4 GB | Near-lossless reference |
| RotorQuant-MLX-2bit | mlx-lm | ~1.3 GB | Apple Silicon, smallest |
| RotorQuant-MLX-4bit | mlx-lm | ~2.5 GB | Apple Silicon balanced |
| RotorQuant-MLX-8bit | mlx-lm | ~4.7 GB | Apple Silicon reference |
| TurboQuant | runtime modifier | n/a | KV-cache root (weight-agnostic) |
| TurboQuant-AWQ-4bit | transformers | ~2.5 GB | GPU 4-bit (AutoAWQ) |
| TurboQuant-AWQ-8bit | transformers | ~4.4 GB | GPU 8-bit (AutoAWQ) |
| TurboQuant-MLX-2bit | mlx-lm | ~1.3 GB | Apple Silicon, smallest |
| TurboQuant-MLX-4bit | mlx-lm | ~2.5 GB | Apple Silicon balanced |
| TurboQuant-MLX-8bit | mlx-lm | ~4.7 GB | Apple Silicon reference |