Instructions to use majentik/gemma-4-E2B-RotorQuant-AWQ-4bit with libraries, inference providers, notebooks, and local apps. Follow these links to get started.

Libraries

How to use majentik/gemma-4-E2B-RotorQuant-AWQ-4bit with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("image-text-to-text", model="majentik/gemma-4-E2B-RotorQuant-AWQ-4bit")

# Load model directly
from transformers import AutoModel
model = AutoModel.from_pretrained("majentik/gemma-4-E2B-RotorQuant-AWQ-4bit", dtype="auto")

Notebooks
Google Colab
Kaggle
Local Apps Settings

vLLM

How to use majentik/gemma-4-E2B-RotorQuant-AWQ-4bit with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "majentik/gemma-4-E2B-RotorQuant-AWQ-4bit"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "majentik/gemma-4-E2B-RotorQuant-AWQ-4bit",
		"prompt": "Once upon a time,",
		"max_tokens": 512,
		"temperature": 0.5
	}'

Use Docker

docker model run hf.co/majentik/gemma-4-E2B-RotorQuant-AWQ-4bit

SGLang

How to use majentik/gemma-4-E2B-RotorQuant-AWQ-4bit with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "majentik/gemma-4-E2B-RotorQuant-AWQ-4bit" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "majentik/gemma-4-E2B-RotorQuant-AWQ-4bit",
		"prompt": "Once upon a time,",
		"max_tokens": 512,
		"temperature": 0.5
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "majentik/gemma-4-E2B-RotorQuant-AWQ-4bit" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "majentik/gemma-4-E2B-RotorQuant-AWQ-4bit",
		"prompt": "Once upon a time,",
		"max_tokens": 512,
		"temperature": 0.5
	}'

Docker Model Runner
How to use majentik/gemma-4-E2B-RotorQuant-AWQ-4bit with Docker Model Runner:
```
docker model run hf.co/majentik/gemma-4-E2B-RotorQuant-AWQ-4bit
```

gemma-4-E2B-RotorQuant-AWQ-4bit / README.md

majentik

docs: Tier 2 polish — variant matrix + quant trade-off

e0b2503 verified about 1 month ago

preview code

raw

history blame contribute delete

8.67 kB

	---
	license: apache-2.0
	base_model: google/gemma-4-E2B
	tags:
	- awq
	- rotorquant
	- kv-cache-quantization
	- gemma
	- gemma4
	- quantized
	- 4bit
	library_name: transformers
	pipeline_tag: image-text-to-text
	---

	# Gemma 4 E2B - RotorQuant AWQ 4-bit

	4-bit AWQ-quantized version of [google/gemma-4-E2B](https://huggingface.co/google/gemma-4-E2B) with RotorQuant KV-cache quantization. AWQ (Activation-aware Weight Quantization) is an activation-aware method optimal for GPU inference, preserving the salient weights most important to model outputs. RotorQuant delivers 5.3x faster prefill and 28% faster decode compared to TurboQuant.

	Approximate model size: ~1.5 GB

	> Note: RotorQuant KV cache modes (`planar3`, `iso3`) require the [RotorQuant fork](https://github.com/scrya-com/rotorquant) or the [llama-cpp-turboquant fork](https://github.com/johndpope/llama-cpp-turboquant/tree/feature/planarquant-kv-cache) for llama.cpp workflows. The AWQ weights themselves load cleanly in stock AutoAWQ / vLLM; RotorQuant KV-cache kernels are opt-in.

	## Model Specifications

	\| Property \| Value \|
	\|---\|---\|
	\| Base Model \| [google/gemma-4-E2B](https://huggingface.co/google/gemma-4-E2B) \|
	\| Parameters \| ~2 billion \|
	\| Architecture \| Dense transformer \|
	\| Modality \| Multimodal: image + text input, text output \|
	\| License \| Apache 2.0 \|
	\| Weight Quantization \| AWQ 4-bit (~1.5 GB) \|
	\| Group Size \| 128 \|
	\| KV-Cache Quantization \| RotorQuant (`planar3` / `iso3`) \|
	\| Framework \| transformers + AutoAWQ / vLLM \|

	## Quickstart

	### AutoAWQ

	```python
	from awq import AutoAWQForCausalLM
	from transformers import AutoTokenizer

	model = AutoAWQForCausalLM.from_quantized(
	"majentik/gemma-4-E2B-RotorQuant-AWQ-4bit",
	device_map="auto",
	fuse_layers=True,
	)
	tokenizer = AutoTokenizer.from_pretrained("majentik/gemma-4-E2B-RotorQuant-AWQ-4bit")

	prompt = "The history of artificial intelligence began"
	inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
	out = model.generate(**inputs, max_new_tokens=512)
	print(tokenizer.decode(out[0], skip_special_tokens=True))
	```

	### vLLM

	```bash
	vllm serve majentik/gemma-4-E2B-RotorQuant-AWQ-4bit \
	--quantization awq_marlin \
	--max-model-len 8192
	```

	### With RotorQuant KV cache (fork)

	```python
	from rotorquant import RotorQuantCache
	from awq import AutoAWQForCausalLM
	from transformers import AutoTokenizer

	model = AutoAWQForCausalLM.from_quantized(
	"majentik/gemma-4-E2B-RotorQuant-AWQ-4bit", device_map="auto"
	)
	tokenizer = AutoTokenizer.from_pretrained("majentik/gemma-4-E2B-RotorQuant-AWQ-4bit")

	cache = RotorQuantCache(model, mode="iso3") # or "planar3"
	inputs = tokenizer("Long-context prompt...", return_tensors="pt").to(model.device)
	out = model.generate(**inputs, past_key_values=cache, max_new_tokens=1024)
	```

	## What is RotorQuant?

	[RotorQuant](https://github.com/scrya-com/rotorquant) is a high-performance KV-cache quantization method that achieves significantly better throughput than TurboQuant using block-diagonal Clifford-algebra rotors. Combined with AWQ 4-bit weights, this delivers a dual compression strategy with superior KV-cache performance for GPU inference.

	Key advantages over TurboQuant:
	- 5.3x faster prefill
	- 28% faster decode
	- Equivalent memory savings
	- `planar3` / `iso3` 3-bit KV cache modes

	## KV-Cache Quantization Comparison

	\| Method \| Prefill Speed \| Decode Speed \| Memory Savings \| Reference \|
	\|---\|---\|---\|---\|---\|
	\| TurboQuant \| 1x (baseline) \| 1x (baseline) \| High \| [arXiv: 2504.19874](https://arxiv.org/abs/2504.19874) \|
	\| RotorQuant \| 5.3x faster \| 28% faster \| High \| [GitHub](https://github.com/scrya-com/rotorquant) \|

	## AWQ vs GGUF vs MLX

	\| Format \| Target Hardware \| Runtime \| Best For \|
	\|---\|---\|---\|---\|
	\| AWQ \| NVIDIA / AMD GPU (CUDA/ROCm) \| AutoAWQ, vLLM, TGI \| GPU-native inference, production serving \|
	\| GGUF \| CPU + GPU (cross-platform) \| llama.cpp, Ollama, LM Studio \| Laptops, CPU-only boxes, mixed offload \|
	\| MLX \| Apple Silicon \| MLX, mlx-lm, mlx-vlm \| Macs with unified memory \|

	This repo ships AWQ. See the "See Also" section for GGUF and MLX siblings.

	## Memory Estimates (Gemma 4 E2B)

	\| Precision \| Approximate Size \| VRAM Tier \|
	\|---\|---\|---\|
	\| FP16 (original) \| ~4 GB \| 8 GB+ \|
	\| AWQ 8-bit \| ~2 GB \| 4 GB+ \|
	\| AWQ 4-bit \| ~1.5 GB \| 4 GB+ \|

	Fits comfortably on entry-level GPUs (RTX 3050 / 4060 / A2000 and up).

	## Hardware Requirements

	- NVIDIA GPU with >=4 GB VRAM (RTX 3050, 3060, 4060, A2000, T4)
	- CUDA 12.x recommended
	- For vLLM: compute capability >= 7.5 (Turing or newer) for Marlin kernels
	- For RotorQuant KV cache: [scrya-com/rotorquant](https://github.com/scrya-com/rotorquant) fork

	## See Also

	- [google/gemma-4-E2B](https://huggingface.co/google/gemma-4-E2B) -- Base model
	- [majentik/gemma-4-E2B-RotorQuant](https://huggingface.co/majentik/gemma-4-E2B-RotorQuant) -- RotorQuant KV-cache only (transformers)
	- [majentik/gemma-4-E2B-RotorQuant-AWQ-8bit](https://huggingface.co/majentik/gemma-4-E2B-RotorQuant-AWQ-8bit) -- AWQ 8-bit variant
	- [majentik/gemma-4-E2B-TurboQuant-AWQ-4bit](https://huggingface.co/majentik/gemma-4-E2B-TurboQuant-AWQ-4bit) -- TurboQuant AWQ 4-bit variant
	- [majentik/gemma-4-E2B-RotorQuant-MLX-4bit](https://huggingface.co/majentik/gemma-4-E2B-RotorQuant-MLX-4bit) -- MLX variant (Apple Silicon)
	- [RotorQuant GitHub](https://github.com/scrya-com/rotorquant)
	- [llama-cpp-turboquant fork](https://github.com/johndpope/llama-cpp-turboquant/tree/feature/planarquant-kv-cache)
	- [AutoAWQ](https://github.com/casper-hansen/AutoAWQ)
	- [vLLM](https://github.com/vllm-project/vllm)

	## Quant trade-off (AWQ lane)

	\| Bits \| Approx size \| Use case \| Recommendation \|
	\|---\|---\|---\|---\|
	\| 4-bit \| ~860 MB \| Activation-aware 4-bit weight quant \| GPU inference (vLLM, transformers, AutoAWQ) \|
	\| 8-bit \| ~1.5 GB \| Activation-aware 8-bit weight quant \| Quality-sensitive GPU inference \|

	(Current variant — 4bit — is bolded.)

	## Variants in this family

	(Showing 18 sibling variants under `majentik/gemma4-e2b-`. The current variant — `RotorQuant-AWQ-4bit` — is bolded*.)

	\| Variant \| Runtime \| Approx size \| Use case \|
	\|---\|---\|---\|---\|
	\| [RotorQuant](https://huggingface.co/majentik/gemma4-e2b-rotorquant) \| runtime modifier \| n/a \| KV-cache root (weight-agnostic) \|
	\| RotorQuant-AWQ-4bit \| transformers \| ~1.2 GB \| GPU 4-bit (AutoAWQ) \|
	\| [RotorQuant-AWQ-8bit](https://huggingface.co/majentik/gemma4-e2b-rotorquant-awq-8bit) \| transformers \| ~2.2 GB \| GPU 8-bit (AutoAWQ) \|
	\| [RotorQuant-GGUF-IQ4_XS](https://huggingface.co/majentik/gemma4-e2b-rotorquant-gguf-IQ4_XS) \| llama.cpp \| ~1.7 GB \| Lossy 4-bit, low-RAM CPU/edge \|
	\| [RotorQuant-GGUF-Q2_K](https://huggingface.co/majentik/gemma4-e2b-rotorquant-gguf-Q2_K) \| llama.cpp \| ~1.2 GB \| Lossy, low-RAM CPU/edge \|
	\| [RotorQuant-GGUF-Q3_K_M](https://huggingface.co/majentik/gemma4-e2b-rotorquant-gguf-Q3_K_M) \| llama.cpp \| ~1.6 GB \| Smaller 3-bit, CPU-friendly \|
	\| [RotorQuant-GGUF-Q4_K_M](https://huggingface.co/majentik/gemma4-e2b-rotorquant-gguf-Q4_K_M) \| llama.cpp \| ~2.2 GB \| Balanced default \|
	\| [RotorQuant-GGUF-Q5_K_M](https://huggingface.co/majentik/gemma4-e2b-rotorquant-gguf-Q5_K_M) \| llama.cpp \| ~2.6 GB \| Higher fidelity, more RAM \|
	\| [RotorQuant-GGUF-Q8_0](https://huggingface.co/majentik/gemma4-e2b-rotorquant-gguf-Q8_0) \| llama.cpp \| ~4.2 GB \| Near-lossless reference \|
	\| [RotorQuant-MLX-2bit](https://huggingface.co/majentik/gemma4-e2b-rotorquant-mlx-2bit) \| mlx-lm \| ~655 MB \| Apple Silicon, smallest \|
	\| [RotorQuant-MLX-4bit](https://huggingface.co/majentik/gemma4-e2b-rotorquant-mlx-4bit) \| mlx-lm \| ~1.2 GB \| Apple Silicon balanced \|
	\| [RotorQuant-MLX-8bit](https://huggingface.co/majentik/gemma4-e2b-rotorquant-mlx-8bit) \| mlx-lm \| ~2.4 GB \| Apple Silicon reference \|
	\| [TurboQuant](https://huggingface.co/majentik/gemma4-e2b-turboquant) \| runtime modifier \| n/a \| KV-cache root (weight-agnostic) \|
	\| [TurboQuant-AWQ-4bit](https://huggingface.co/majentik/gemma4-e2b-turboquant-awq-4bit) \| transformers \| ~1.2 GB \| GPU 4-bit (AutoAWQ) \|
	\| [TurboQuant-AWQ-8bit](https://huggingface.co/majentik/gemma4-e2b-turboquant-awq-8bit) \| transformers \| ~2.2 GB \| GPU 8-bit (AutoAWQ) \|
	\| [TurboQuant-MLX-2bit](https://huggingface.co/majentik/gemma4-e2b-turboquant-mlx-2bit) \| mlx-lm \| ~655 MB \| Apple Silicon, smallest \|
	\| [TurboQuant-MLX-4bit](https://huggingface.co/majentik/gemma4-e2b-turboquant-mlx-4bit) \| mlx-lm \| ~1.2 GB \| Apple Silicon balanced \|
	\| [TurboQuant-MLX-8bit](https://huggingface.co/majentik/gemma4-e2b-turboquant-mlx-8bit) \| mlx-lm \| ~2.4 GB \| Apple Silicon reference \|