Instructions for using majentik/Qwen3.6-35B-A3B-RotorQuant-MLX-3bit with libraries, inference providers, notebooks, and local apps.

Libraries

How to use majentik/Qwen3.6-35B-A3B-RotorQuant-MLX-3bit with MLX:

# Make sure mlx-lm is installed
# pip install --upgrade mlx-lm

# Generate text with mlx-lm
from mlx_lm import load, generate

model, tokenizer = load("majentik/Qwen3.6-35B-A3B-RotorQuant-MLX-3bit")

prompt = "Write a story about Einstein"
messages = [{"role": "user", "content": prompt}]
prompt = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True
)

text = generate(model, tokenizer, prompt=prompt, verbose=True)

Local Apps

Pi

How to use majentik/Qwen3.6-35B-A3B-RotorQuant-MLX-3bit with Pi:

Start the MLX server

# Install MLX LM:
uv tool install mlx-lm
# Start a local OpenAI-compatible server:
mlx_lm.server --model "majentik/Qwen3.6-35B-A3B-RotorQuant-MLX-3bit"
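
Loading the weights takes a while, and an agent that connects before the server is listening will fail. The following is a minimal sketch of a readiness check, assuming the server listens on 127.0.0.1:8080 (the address used in the configuration in this section). The helper name `wait_for_port` is hypothetical and not part of mlx-lm; it only confirms that the TCP port accepts connections, not that the model has finished loading.

```python
import socket
import time


def wait_for_port(host: str, port: int, timeout: float = 60.0, interval: float = 0.5) -> bool:
    """Return True once a TCP connection to host:port succeeds, False on timeout."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            # A successful connect means something is listening on the port.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            # Refused or timed out: retry until the deadline passes.
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval)


# Example (assumed address; adjust if the server was started with --host/--port):
# if wait_for_port("127.0.0.1", 8080):
#     print("server is accepting connections")
```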

Configure the model in Pi

# Install Pi:
npm install -g @mariozechner/pi-coding-agent
# Add to ~/.pi/agent/models.json:
{
  "providers": {
    "mlx-lm": {
      "baseUrl": "http://localhost:8080/v1",
      "api": "openai-completions",
      "apiKey": "none",
      "models": [
        {
          "id": "majentik/Qwen3.6-35B-A3B-RotorQuant-MLX-3bit"
        }
      ]
    }
  }
}
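
If ~/.pi/agent/models.json already lists other providers, pasting the snippet over it would drop them. Below is a rough sketch of merging the entry instead. It assumes only the fields shown above (`providers`, `baseUrl`, `api`, `apiKey`, `models`, `id`); the function names are hypothetical, and any other fields Pi may support are left untouched.

```python
import json
from pathlib import Path


def add_mlx_provider(config: dict, model_id: str,
                     base_url: str = "http://localhost:8080/v1") -> dict:
    """Return a copy of config with an "mlx-lm" provider that lists model_id."""
    merged = json.loads(json.dumps(config))  # deep copy via JSON round-trip
    providers = merged.setdefault("providers", {})
    provider = providers.setdefault("mlx-lm", {})
    # Same values as the snippet above; existing extra keys are preserved.
    provider.update({"baseUrl": base_url, "api": "openai-completions", "apiKey": "none"})
    models = provider.setdefault("models", [])
    if not any(m.get("id") == model_id for m in models):
        models.append({"id": model_id})
    return merged


def update_models_file(path: Path, model_id: str) -> None:
    """Read path if it exists, merge the provider entry, and write it back."""
    existing = json.loads(path.read_text()) if path.exists() else {}
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(add_mlx_provider(existing, model_id), indent=2) + "\n")


# Example:
# update_models_file(Path.home() / ".pi" / "agent" / "models.json",
#                    "majentik/Qwen3.6-35B-A3B-RotorQuant-MLX-3bit")
```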

Run Pi

# Start Pi in your project directory:
pi

Hermes Agent

How to use majentik/Qwen3.6-35B-A3B-RotorQuant-MLX-3bit with Hermes Agent:

Start the MLX server

# Install MLX LM:
uv tool install mlx-lm
# Start a local OpenAI-compatible server:
mlx_lm.server --model "majentik/Qwen3.6-35B-A3B-RotorQuant-MLX-3bit"

Configure Hermes

# Install Hermes:
curl -fsSL https://hermes-agent.nousresearch.com/install.sh | bash
hermes setup
# Point Hermes at the local server:
hermes config set model.provider custom
hermes config set model.base_url http://127.0.0.1:8080/v1
hermes config set model.default majentik/Qwen3.6-35B-A3B-RotorQuant-MLX-3bit

Run Hermes

hermes

OpenClaw

How to use majentik/Qwen3.6-35B-A3B-RotorQuant-MLX-3bit with OpenClaw:

Start the MLX server

# Install MLX LM:
uv tool install mlx-lm
# Start a local OpenAI-compatible server:
mlx_lm.server --model "majentik/Qwen3.6-35B-A3B-RotorQuant-MLX-3bit"

Configure OpenClaw

# Install OpenClaw:
npm install -g openclaw@latest
# Register the local server and set it as the default model:
openclaw onboard --non-interactive --mode local \
  --auth-choice custom-api-key \
  --custom-base-url http://127.0.0.1:8080/v1 \
  --custom-model-id "majentik/Qwen3.6-35B-A3B-RotorQuant-MLX-3bit" \
  --custom-provider-id mlx-lm \
  --custom-compatibility openai \
  --custom-text-input \
  --accept-risk \
  --skip-health

Run OpenClaw

openclaw agent --local --agent main --message "Hello from Hugging Face"

MLX LM

How to use majentik/Qwen3.6-35B-A3B-RotorQuant-MLX-3bit with MLX LM:

Generate or start a chat session

# Install MLX LM
uv tool install mlx-lm
# Interactive chat REPL
mlx_lm.chat --model "majentik/Qwen3.6-35B-A3B-RotorQuant-MLX-3bit"

Run an OpenAI-compatible server

# Install MLX LM
uv tool install mlx-lm
# Start the server
mlx_lm.server --model "majentik/Qwen3.6-35B-A3B-RotorQuant-MLX-3bit"
# Calling the OpenAI-compatible server with curl
curl -X POST "http://localhost:8080/v1/chat/completions" \
   -H "Content-Type: application/json" \
   --data '{
     "model": "majentik/Qwen3.6-35B-A3B-RotorQuant-MLX-3bit",
     "messages": [
       {"role": "user", "content": "Hello"}
     ]
   }'
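
The same request can be sent from Python with only the standard library. This is a sketch, assuming the server listens on port 8080 (the mlx_lm.server default and the port used in the agent configurations above) and returns the usual OpenAI-style `choices[0].message.content` field. The helper names are hypothetical; pass a different `base_url` if your server listens elsewhere.

```python
import json
import urllib.request

MODEL_ID = "majentik/Qwen3.6-35B-A3B-RotorQuant-MLX-3bit"
DEFAULT_BASE_URL = "http://localhost:8080/v1"  # assumption: default server port


def build_chat_request(user_message: str,
                       base_url: str = DEFAULT_BASE_URL) -> urllib.request.Request:
    """Build (but do not send) a POST request for the chat completions endpoint."""
    payload = {
        "model": MODEL_ID,
        "messages": [{"role": "user", "content": user_message}],
    }
    return urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(user_message: str, base_url: str = DEFAULT_BASE_URL) -> str:
    """Send the request and return the assistant's reply text."""
    with urllib.request.urlopen(build_chat_request(user_message, base_url)) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]


# Example (requires the server to be running):
# print(chat("Hello"))
```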

Qwen3.6 35B-A3B - RotorQuant MLX 3-bit

This is a 3-bit weight-quantized MLX version of Qwen/Qwen3.6-35B-A3B with RotorQuant KV-cache quantization, optimized for Apple Silicon inference via the MLX framework. RotorQuant delivers 5.3x faster prefill and 28% faster decode compared to TurboQuant. The 3-bit variant aims for a balance between model quality and memory efficiency. Only 3B of the 35B total parameters are active per token, so inference is considerably cheaper than the total parameter count suggests.

Model size: ~18 GB

Model Specifications

Property	Value
Base Model	Qwen/Qwen3.6-35B-A3B
Parameters	35 billion total (3 billion active per token)
Architecture	Mixture-of-Experts (MoE) (3B active per token)
Modality	Text-only (language tower extracted from a multimodal base; vision tower not included)
License	Apache 2.0
Weight Quantization	3-bit (~18 GB)
KV-Cache Quantization	RotorQuant
Framework	MLX (Apple Silicon)

Quickstart

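from mlx_lm import load, generate

# Download (on first use) and load the quantized weights and tokenizer
model, tokenizer = load("majentik/Qwen3.6-35B-A3B-RotorQuant-MLX-3bit")

# Wrap the request in the model's chat template so it is answered as a chat turn
messages = [{"role": "user", "content": "Give me a short introduction to Mixture-of-Experts models."}]
prompt = tokenizer.apply_chat_template(messages, add_generation_prompt=True)

response = generate(model, tokenizer, prompt=prompt, max_tokens=512)
print(response)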

Text-only extraction. This repo contains only the quantized language tower of Qwen3.6-35B-A3B. The upstream vision tower (333 tensors) and MTP head are not included, so image/video input does not work and mlx_vlm.load(...) fails with a Missing ... parameters error (the vision tower it expects is absent from the checkpoint). Load it with mlx_lm (recent version with qwen3_5_moe support) as shown above. For image/video input, use the upstream BF16 model Qwen/Qwen3.6-35B-A3B on a runtime that supports it.

What is RotorQuant?

RotorQuant is a KV-cache quantization method for which this card reports higher throughput than TurboQuant at equivalent memory savings. Combined with 3-bit weight quantization in MLX, it gives two layers of compression: smaller model weights and a compressed KV cache, which matters most for long-context generation.

Key advantages over TurboQuant:

5.3x faster prefill
28% faster decode
Equivalent memory savings

KV-Cache Quantization Comparison

Method	Prefill Speed	Decode Speed	Memory Savings	Reference
TurboQuant	1x (baseline)	1x (baseline)	High	arXiv: 2504.19874
RotorQuant	5.3x faster	28% faster	High	GitHub
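
To see what these ratios would mean in wall-clock terms, here is a small worked sketch. It assumes "5.3x faster" and "28% faster" are throughput ratios relative to TurboQuant, so time scales by 1/5.3 and 1/1.28; the card does not state the measurement setup, and the baseline timings in the example are hypothetical, not measured.

```python
PREFILL_SPEEDUP = 5.3   # from the table: RotorQuant prefill vs TurboQuant
DECODE_SPEEDUP = 1.28   # "28% faster" read as 1.28x throughput (assumption)


def rotorquant_times(turbo_prefill_s: float, turbo_decode_s: float) -> tuple[float, float]:
    """Scale TurboQuant timings (seconds) by the reported speedups."""
    return turbo_prefill_s / PREFILL_SPEEDUP, turbo_decode_s / DECODE_SPEEDUP


# Hypothetical baseline: 10 s prefill and 60 s decode under TurboQuant.
prefill_s, decode_s = rotorquant_times(10.0, 60.0)
print(f"prefill: {prefill_s:.1f} s, decode: {decode_s:.1f} s")
```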

Memory Estimates (Qwen3.6 35B-A3B)

Precision	Approximate Size	MLX Variant
FP16 (original)	~70 GB	--
8-bit quantized	~35 GB	RotorQuant-MLX-8bit
3-bit quantized	~18 GB	This model
2-bit quantized	~9 GB	RotorQuant-MLX-2bit
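
A first-order estimate of weight size is parameters times bits per weight, divided by 8 bits per byte. The sketch below applies that formula to the 35B total parameter count. It is a naive lower bound: it ignores quantization scales and any layers kept at higher precision. The FP16, 8-bit, and 2-bit rows are close to this estimate, while the ~18 GB quoted for 3-bit is higher than the roughly 13 GB it yields; the card does not break down where the difference comes from.

```python
TOTAL_PARAMS = 35e9  # total parameters, from the specifications above


def weight_size_gb(n_params: float, bits_per_weight: float) -> float:
    """Naive weight footprint in decimal GB: parameters x bits / 8 bytes."""
    return n_params * bits_per_weight / 8 / 1e9


for label, bits in [("FP16", 16), ("8-bit", 8), ("3-bit", 3), ("2-bit", 2)]:
    print(f"{label}: ~{weight_size_gb(TOTAL_PARAMS, bits):.1f} GB")
```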

Hardware Requirements

This model requires approximately 18 GB of unified memory. Recommended hardware:

Apple M2 Pro (24 GB+)
Apple M3 Pro (24 GB+)
Apple M4 Pro (24 GB+)
Any Apple Silicon Mac with 24 GB+ unified memory
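
A quick way to check a machine against the 24 GB recommendation is to read total physical memory through POSIX `sysconf`. This is a sketch, assuming a macOS or Linux host; it reports installed memory, not what is currently free, and the function names are hypothetical.

```python
import os

RECOMMENDED_GB = 24.0  # from the hardware list above


def total_memory_gb() -> float:
    """Total physical memory in GiB, read via POSIX sysconf."""
    return os.sysconf("SC_PAGE_SIZE") * os.sysconf("SC_PHYS_PAGES") / 2**30


def meets_recommendation(total_gb: float, recommended_gb: float = RECOMMENDED_GB) -> bool:
    """True if the machine has at least the recommended unified memory."""
    return total_gb >= recommended_gb


# Example:
# total = total_memory_gb()
# print(f"{total:.0f} GB installed, recommended met: {meets_recommendation(total)}")
```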

Quant trade-off (MLX lane)

Bits	Approx size	Use case	Recommendation
2-bit	~9.1 GB	Aggressive quantization	Very low-RAM Macs
3-bit	~13 GB	Lossy but small	Low-RAM Macs
4-bit	~15 GB	Balanced default	Recommended for most Macs
5-bit	~18 GB	Higher fidelity	Quality-sensitive
6-bit	~21 GB	Approaching FP16 quality	High-fidelity
8-bit	~27 GB	Near-lossless reference	Fidelity-critical work

(The current variant is 3-bit.)
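
The table can be turned into a simple selection rule: pick the highest bit width whose size still leaves headroom for the KV cache, the OS, and other apps. The sketch below uses the sizes from the table; the 6 GB default headroom is an arbitrary assumption for illustration, not a figure from this card, and the function name is hypothetical.

```python
from typing import Optional

# Approximate sizes in GB, copied from the trade-off table above.
MLX_QUANT_SIZES_GB = {2: 9.1, 3: 13.0, 4: 15.0, 5: 18.0, 6: 21.0, 8: 27.0}


def pick_quant(unified_memory_gb: float, headroom_gb: float = 6.0) -> Optional[int]:
    """Return the largest bit width that fits in memory minus headroom, or None."""
    budget = unified_memory_gb - headroom_gb
    fitting = [bits for bits, size in MLX_QUANT_SIZES_GB.items() if size <= budget]
    return max(fitting) if fitting else None


for ram in (16, 24, 36):
    print(f"{ram} GB Mac -> {pick_quant(ram)}-bit")
```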

Variants in this family

(Showing 24 sibling variants under majentik/qwen3.6-35b-a3b-*. The current variant is RotorQuant-MLX-3bit.)

Variant	Runtime	Approx size	Use case
RotorQuant	runtime modifier	n/a	KV-cache root (weight-agnostic)
RotorQuant-GGUF-IQ4_XS	llama.cpp	~30 GB	Lossy 4-bit, low-RAM CPU/edge
RotorQuant-GGUF-Q2_K	llama.cpp	~21 GB	Lossy, low-RAM CPU/edge
RotorQuant-GGUF-Q3_K_M	llama.cpp	~27 GB	Smaller 3-bit, CPU-friendly
RotorQuant-GGUF-Q4_K_M	llama.cpp	~38 GB	Balanced default
RotorQuant-GGUF-Q5_K_M	llama.cpp	~46 GB	Higher fidelity, more RAM
RotorQuant-GGUF-Q8_0	llama.cpp	~74 GB	Near-lossless reference
RotorQuant-MLX-2bit	mlx-lm	~11 GB	Apple Silicon, smallest
RotorQuant-MLX-3bit	mlx-lm	~16 GB	Apple Silicon, small
RotorQuant-MLX-4bit	mlx-lm	~22 GB	Apple Silicon balanced
RotorQuant-MLX-5bit	mlx-lm	~27 GB	Apple Silicon, higher fidelity
RotorQuant-MLX-6bit	mlx-lm	~32 GB	Apple Silicon, near-lossless
RotorQuant-MLX-8bit	mlx-lm	~41 GB	Apple Silicon reference
TurboQuant	runtime modifier	n/a	KV-cache root (weight-agnostic)
TurboQuant-MLX-2bit	mlx-lm	~11 GB	Apple Silicon, smallest
TurboQuant-MLX-3bit	mlx-lm	~16 GB	Apple Silicon, small
TurboQuant-MLX-4bit	mlx-lm	~22 GB	Apple Silicon balanced
TurboQuant-MLX-5bit	mlx-lm	~27 GB	Apple Silicon, higher fidelity
TurboQuant-MLX-6bit	mlx-lm	~32 GB	Apple Silicon, near-lossless
TurboQuant-MLX-8bit	mlx-lm	~41 GB	Apple Silicon reference


Paper for majentik/Qwen3.6-35B-A3B-RotorQuant-MLX-3bit

TurboQuant: Online Vector Quantization with Near-optimal Distortion Rate. arXiv: 2504.19874, published Apr 28, 2025.
