Instructions to use majentik/gemma-4-26B-A4B-it-RotorQuant-MLX-2bit with libraries, inference providers, notebooks, and local apps. Follow these links to get started.

Libraries

How to use majentik/gemma-4-26B-A4B-it-RotorQuant-MLX-2bit with MLX:

# Make sure mlx-vlm is installed
# pip install --upgrade mlx-vlm

from mlx_vlm import load, generate
from mlx_vlm.prompt_utils import apply_chat_template
from mlx_vlm.utils import load_config

# Load the model
model, processor = load("majentik/gemma-4-26B-A4B-it-RotorQuant-MLX-2bit")
config = load_config("majentik/gemma-4-26B-A4B-it-RotorQuant-MLX-2bit")

# Prepare input
image = ["http://images.cocodataset.org/val2017/000000039769.jpg"]
prompt = "Describe this image."

# Apply chat template
formatted_prompt = apply_chat_template(
    processor, config, prompt, num_images=1
)

# Generate output
output = generate(model, processor, formatted_prompt, image)
print(output)

Notebooks
Google Colab
Kaggle
Local Apps Settings
LM Studio

How to use majentik/gemma-4-26B-A4B-it-RotorQuant-MLX-2bit with Pi:

Start the MLX server

# Install MLX LM:
uv tool install mlx-lm
# Start a local OpenAI-compatible server:
mlx_lm.server --model "majentik/gemma-4-26B-A4B-it-RotorQuant-MLX-2bit"

Configure the model in Pi

# Install Pi:
npm install -g @mariozechner/pi-coding-agent
# Add to ~/.pi/agent/models.json:
{
  "providers": {
    "mlx-lm": {
      "baseUrl": "http://localhost:8080/v1",
      "api": "openai-completions",
      "apiKey": "none",
      "models": [
        {
          "id": "majentik/gemma-4-26B-A4B-it-RotorQuant-MLX-2bit"
        }
      ]
    }
  }
}

Run Pi

# Start Pi in your project directory:
pi

Hermes Agent new

How to use majentik/gemma-4-26B-A4B-it-RotorQuant-MLX-2bit with Hermes Agent:

Start the MLX server

# Install MLX LM:
uv tool install mlx-lm
# Start a local OpenAI-compatible server:
mlx_lm.server --model "majentik/gemma-4-26B-A4B-it-RotorQuant-MLX-2bit"

Configure Hermes

# Install Hermes:
curl -fsSL https://hermes-agent.nousresearch.com/install.sh | bash
hermes setup
# Point Hermes at the local server:
hermes config set model.provider custom
hermes config set model.base_url http://127.0.0.1:8080/v1
hermes config set model.default majentik/gemma-4-26B-A4B-it-RotorQuant-MLX-2bit

Run Hermes

hermes

OpenClaw new

How to use majentik/gemma-4-26B-A4B-it-RotorQuant-MLX-2bit with OpenClaw:

Start the MLX server

# Install MLX LM:
uv tool install mlx-lm
# Start a local OpenAI-compatible server:
mlx_lm.server --model "majentik/gemma-4-26B-A4B-it-RotorQuant-MLX-2bit"

Configure OpenClaw

# Install OpenClaw:
npm install -g openclaw@latest
# Register the local server and set it as the default model:
openclaw onboard --non-interactive --mode local \
  --auth-choice custom-api-key \
  --custom-base-url http://127.0.0.1:8080/v1 \
  --custom-model-id "majentik/gemma-4-26B-A4B-it-RotorQuant-MLX-2bit" \
  --custom-provider-id mlx-lm \
  --custom-compatibility openai \
  --custom-text-input \
  --accept-risk \
  --skip-health

Run OpenClaw

openclaw agent --local --agent main --message "Hello from Hugging Face"

KV-cache quantization without any fork (recommended, 2026): upstream llama.cpp/Ollama now cover this natively — use -ctk q8_0 -ctv q8_0 (~~half KV memory, negligible quality loss: perplexity +0.002–0.05) or -ctk q4_0 -ctv q4_0 (~~quarter memory, ≈7.6% perplexity increase). In Ollama: OLLAMA_KV_CACHE_TYPE=q8_0 with OLLAMA_FLASH_ATTENTION=1. Keep K and V types symmetric to stay on the fast fused Flash-Attention path. Since April 2026, mainline llama.cpp also applies Hadamard rotation to KV activations (PR #21038), which greatly improves low-bit KV quality (opt-out: LLAMA_ATTN_ROT_DISABLE=1).

The RotorQuant/TurboQuant fork flow below is experimental/legacy: the TurboQuant llama.cpp PR was closed without merging (June 2026) and the fork is unmaintained relative to mainline. It is NOT required to use this model.

Gemma 4 26B-A4B-it - RotorQuant MLX 2-bit

2-bit weight-quantized MLX version of google/gemma-4-26B-A4B-it with RotorQuant KV-cache quantization. Optimized for Apple Silicon inference via the MLX framework. RotorQuant delivers 5.3x faster prefill and 28% faster decode compared to TurboQuant. The most aggressive quantization, fitting the full model in the smallest possible footprint. Only 4B parameters are active per token despite 26B total, making this model significantly more efficient at inference time than its parameter count suggests.

Approximate model size: ~7 GB

Model Specifications

Property	Value
Base Model	google/gemma-4-26B-A4B-it
Parameters	26 billion total (4 billion active per token)
Architecture	Mixture-of-Experts (MoE) (4B active per token)
Modality	Multimodal: image + text input, text output
License	Apache 2.0
Weight Quantization	2-bit (~7 GB)
KV-Cache Quantization	RotorQuant
Framework	MLX (Apple Silicon)

Quickstart

import mlx.core as mx
from mlx_lm import load, generate

model, tokenizer = load("majentik/gemma-4-26B-A4B-it-RotorQuant-MLX-2bit")

prompt = "Describe this image in detail."
response = generate(model, tokenizer, prompt=prompt, max_tokens=512)
print(response)

For multimodal usage with images:

from mlx_vlm import load, generate

model, processor = load("majentik/gemma-4-26B-A4B-it-RotorQuant-MLX-2bit")

prompt = "What do you see in this image?"
output = generate(model, processor, prompt=prompt, image="path/to/image.jpg", max_tokens=512)
print(output)

What is RotorQuant?

RotorQuant is a high-performance KV-cache quantization method that achieves significantly better throughput than TurboQuant. Combined with 2-bit weight quantization in MLX, this provides maximum compression with the best available KV-cache performance: the smallest possible model footprint plus the fastest compressed KV cache for efficient long-context generation.

Key advantages over TurboQuant:

5.3x faster prefill
28% faster decode
Equivalent memory savings

Note: 2-bit quantization is the most aggressive option and may result in some quality degradation compared to higher-precision variants. It is best suited for experimentation, rapid prototyping, or hardware-constrained environments.

KV-Cache Quantization Comparison

Method	Prefill Speed	Decode Speed	Memory Savings	Reference
TurboQuant	1x (baseline)	1x (baseline)	High	arXiv: 2504.19874
RotorQuant	5.3x faster	28% faster	High	GitHub

Memory Estimates (Gemma 4 26B-A4B-it)

Precision	Approximate Size	MLX Variant
FP16 (original)	~52 GB	--
8-bit quantized	~26 GB	RotorQuant-MLX-8bit
4-bit quantized	~14 GB	RotorQuant-MLX-4bit
2-bit quantized	~7 GB	This model

Hardware Requirements

This model requires approximately 7 GB of unified memory. Recommended hardware:

Apple M1 (16 GB+)
Apple M2 (16 GB+)
Apple M3 (16 GB+)
Apple M4 (16 GB+)
Any Apple Silicon Mac with 16 GB+ unified memory

Quant trade-off (MLX lane)

Bits	Approx size	Use case	Recommendation
2-bit	~6.8 GB	Aggressive quantization	Very low-RAM Macs
3-bit	~9.4 GB	Lossy but small	Low-RAM Macs
4-bit	~11 GB	Balanced default	Recommended for most Macs
5-bit	~13 GB	Higher fidelity	Quality-sensitive
6-bit	~16 GB	Approaching FP16 quality	High-fidelity
8-bit	~20 GB	Near-lossless reference	Fidelity-critical work

(Current variant — 2bit — is bolded.)

Variants in this family

(Showing 14 sibling variants under majentik/gemma-4-26b-a4b-it-*. The current variant — RotorQuant-MLX-2bit — is bolded.)

Variant	Runtime	Approx size	Use case
RotorQuant	runtime modifier	n/a	KV-cache root (weight-agnostic)
RotorQuant-GGUF-IQ4_XS	llama.cpp	~22 GB	Lossy 4-bit, low-RAM CPU/edge
RotorQuant-GGUF-Q2_K	llama.cpp	~16 GB	Lossy, low-RAM CPU/edge
RotorQuant-GGUF-Q3_K_M	llama.cpp	~20 GB	Smaller 3-bit, CPU-friendly
RotorQuant-GGUF-Q4_K_M	llama.cpp	~29 GB	Balanced default
RotorQuant-GGUF-Q5_K_M	llama.cpp	~34 GB	Higher fidelity, more RAM
RotorQuant-GGUF-Q8_0	llama.cpp	~55 GB	Near-lossless reference
RotorQuant-MLX-2bit	mlx-lm	~8.3 GB	Apple Silicon, smallest
RotorQuant-MLX-4bit	mlx-lm	~16 GB	Apple Silicon balanced
RotorQuant-MLX-8bit	mlx-lm	~31 GB	Apple Silicon reference
TurboQuant	runtime modifier	n/a	KV-cache root (weight-agnostic)
TurboQuant-MLX-2bit	mlx-lm	~8.3 GB	Apple Silicon, smallest
TurboQuant-MLX-4bit	mlx-lm	~16 GB	Apple Silicon balanced
TurboQuant-MLX-8bit	mlx-lm	~31 GB	Apple Silicon reference

Downloads last month: 188

Safetensors

Model size

3B params

Tensor type

BF16

U32

MLX

Hardware compatibility

2-bit

Model tree for majentik/gemma-4-26B-A4B-it-RotorQuant-MLX-2bit

Base model

google/gemma-4-26B-A4B

Finetuned

google/gemma-4-26B-A4B-it

Quantized

(301)

this model

Collection including majentik/gemma-4-26B-A4B-it-RotorQuant-MLX-2bit

Gemma 4 — quantized (GGUF + MLX)

Collection

Quantized GGUF and MLX packs of the Gemma 4 family (dense and MoE variants) for local inference. • 60 items • Updated about 16 hours ago

Paper for majentik/gemma-4-26B-A4B-it-RotorQuant-MLX-2bit

TurboQuant: Online Vector Quantization with Near-optimal Distortion Rate

Paper • 2504.19874 • Published Apr 28, 2025 • 34

majentik
/

gemma-4-26B-A4B-it-RotorQuant-MLX-2bit

Gemma 4 26B-A4B-it - RotorQuant MLX 2-bit

Model Specifications

Quickstart

What is RotorQuant?

KV-Cache Quantization Comparison

Memory Estimates (Gemma 4 26B-A4B-it)

Hardware Requirements

See Also

Quant trade-off (MLX lane)

Variants in this family

Model tree for majentik/gemma-4-26B-A4B-it-RotorQuant-MLX-2bit

Collection including majentik/gemma-4-26B-A4B-it-RotorQuant-MLX-2bit

Gemma 4 — quantized (GGUF + MLX)

Paper for majentik/gemma-4-26B-A4B-it-RotorQuant-MLX-2bit

TurboQuant: Online Vector Quantization with Near-optimal Distortion Rate