# gemma-4-E4B-RotorQuant-GGUF-Q2_K
GGUF Q2_K weight-quantized variant of google/gemma-4-E4B with RotorQuant KV cache compression for efficient inference with llama.cpp, Ollama, and LM Studio.
## Overview
This model combines two compression techniques:
- GGUF Q2_K weight quantization: reduces model size from ~8 GB to ~2 GB
- RotorQuant KV cache compression: block-diagonal rotations (Clifford algebra) enable a 3-bit KV cache and 5.3x faster prefill
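The size figures above follow from simple bits-per-weight arithmetic. A back-of-envelope sketch; the ~4 effective bits per weight for Q2_K (including scales and metadata) is an assumption inferred from the stated file size, not a published figure:

```python
# Back-of-envelope model size arithmetic (values inferred from the card above).
params = 4e9                          # ~4B dense parameters
fp16_gb = params * 16 / 8 / 1e9       # 16 bits/weight -> ~8 GB unquantized
q2k_bits = 4                          # assumed effective bits/weight incl. scales
q2k_gb = params * q2k_bits / 8 / 1e9  # -> ~2 GB on disk

print(f"fp16: ~{fp16_gb:.0f} GB, Q2_K: ~{q2k_gb:.0f} GB")
```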
## Quickstart
### llama.cpp

```shell
llama-cli -m gemma-4-E4B-RotorQuant-GGUF-Q2_K.gguf \
  --cache-type-k planar3 --cache-type-v iso3 \
  -p "Explain quantum computing"
```
### Ollama

```shell
ollama run majentik/gemma-4-E4B-RotorQuant-GGUF-Q2_K
```
### LM Studio

Download the GGUF file and load it in LM Studio. Enable RotorQuant KV cache in the advanced settings.
## Specifications
| Property | Value |
|---|---|
| Base Model | google/gemma-4-E4B |
| Parameters | ~4B dense |
| Weight Quantization | GGUF Q2_K |
| KV Cache | RotorQuant 3-bit (planar/iso) |
| File Size | ~2 GB |
| License | Apache 2.0 |
| Compatibility | llama.cpp, Ollama, LM Studio, koboldcpp |
## What is RotorQuant?

RotorQuant applies block-diagonal rotations (Clifford algebra) for KV cache compression. When used with llama.cpp's `--cache-type-k planar3 --cache-type-v iso3` flags:
| Metric | RotorQuant | TurboQuant |
|---|---|---|
| Prefill Speed | 3,822 tok/s | 722 tok/s |
| Decode Speed | 119 tok/s | 93 tok/s |
| Perplexity | 6.91 | 7.07 |
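The rotate-then-quantize idea behind these numbers can be sketched in NumPy. This is a toy illustration, not the actual RotorQuant kernel: the 8-dim block size, per-row absmax scaling, and signed 3-bit range are all assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def random_rotation(d):
    # QR of a random Gaussian matrix yields a random orthogonal matrix;
    # the sign fix makes the distribution uniform over rotations.
    q, r = np.linalg.qr(rng.normal(size=(d, d)))
    return q * np.sign(np.diag(r))

def quantize_3bit(x):
    # Per-row absmax scaling into the signed 3-bit integer range [-4, 3].
    scale = np.abs(x).max(axis=-1, keepdims=True) / 4.0
    q = np.clip(np.round(x / scale), -4, 3)
    return q, scale

def dequantize(q, scale):
    return q * scale

# Toy "keys": 128 cached tokens, one 64-dim attention head,
# rotated independently in 8-dim blocks (block-diagonal rotation).
d, block = 64, 8
keys = rng.normal(size=(128, d))
rots = [random_rotation(block) for _ in range(d // block)]

rotated = np.concatenate(
    [keys[:, i * block:(i + 1) * block] @ rots[i] for i in range(d // block)],
    axis=1,
)
q, s = quantize_3bit(rotated)
recon = dequantize(q, s)
err = np.abs(recon - rotated).mean()
```

Because each block rotation is orthogonal, per-token norms are preserved exactly; the rotation only redistributes energy across dimensions so that a coarse 3-bit grid loses less information.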