Instructions to use majentik/Qwen3.6-35B-A3B-RotorQuant-MLX-3bit with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- MLX
How to use majentik/Qwen3.6-35B-A3B-RotorQuant-MLX-3bit with MLX:
# Make sure mlx-vlm is installed # pip install --upgrade mlx-vlm from mlx_vlm import load, generate from mlx_vlm.prompt_utils import apply_chat_template from mlx_vlm.utils import load_config # Load the model model, processor = load("majentik/Qwen3.6-35B-A3B-RotorQuant-MLX-3bit") config = load_config("majentik/Qwen3.6-35B-A3B-RotorQuant-MLX-3bit") # Prepare input image = ["http://images.cocodataset.org/val2017/000000039769.jpg"] prompt = "Describe this image." # Apply chat template formatted_prompt = apply_chat_template( processor, config, prompt, num_images=1 ) # Generate output output = generate(model, processor, formatted_prompt, image) print(output) - Notebooks
- Google Colab
- Kaggle
- Local Apps
- LM Studio
- Pi new
How to use majentik/Qwen3.6-35B-A3B-RotorQuant-MLX-3bit with Pi:
Start the MLX server
# Install MLX LM: uv tool install mlx-lm # Start a local OpenAI-compatible server: mlx_lm.server --model "majentik/Qwen3.6-35B-A3B-RotorQuant-MLX-3bit"
Configure the model in Pi
# Install Pi: npm install -g @mariozechner/pi-coding-agent # Add to ~/.pi/agent/models.json: { "providers": { "mlx-lm": { "baseUrl": "http://localhost:8080/v1", "api": "openai-completions", "apiKey": "none", "models": [ { "id": "majentik/Qwen3.6-35B-A3B-RotorQuant-MLX-3bit" } ] } } }Run Pi
# Start Pi in your project directory: pi
- Hermes Agent new
How to use majentik/Qwen3.6-35B-A3B-RotorQuant-MLX-3bit with Hermes Agent:
Start the MLX server
# Install MLX LM: uv tool install mlx-lm # Start a local OpenAI-compatible server: mlx_lm.server --model "majentik/Qwen3.6-35B-A3B-RotorQuant-MLX-3bit"
Configure Hermes
# Install Hermes: curl -fsSL https://hermes-agent.nousresearch.com/install.sh | bash hermes setup # Point Hermes at the local server: hermes config set model.provider custom hermes config set model.base_url http://127.0.0.1:8080/v1 hermes config set model.default majentik/Qwen3.6-35B-A3B-RotorQuant-MLX-3bit
Run Hermes
hermes
Qwen3.6 35B-A3B - RotorQuant MLX 3-bit
3-bit weight-quantized MLX version of Qwen/Qwen3.6-35B-A3B with RotorQuant KV-cache quantization. Optimized for Apple Silicon inference via the MLX framework. RotorQuant delivers 5.3x faster prefill and 28% faster decode compared to TurboQuant. A good balance between model quality and memory efficiency. Only 3B parameters are active per token despite 26B total, making this model significantly more efficient at inference time than its parameter count suggests.
Approximate model size: ~18 GB
Model Specifications
| Property | Value |
|---|---|
| Base Model | Qwen/Qwen3.6-35B-A3B |
| Parameters | 35 billion total (3 billion active per token) |
| Architecture | Mixture-of-Experts (MoE) (3B active per token) |
| Modality | Multimodal: image + video + text input, text output |
| License | Apache 2.0 |
| Weight Quantization | 3-bit (~18 GB) |
| KV-Cache Quantization | RotorQuant |
| Framework | MLX (Apple Silicon) |
Quickstart
import mlx.core as mx
from mlx_lm import load, generate
model, tokenizer = load("majentik/Qwen3.6-35B-A3B-RotorQuant-MLX-3bit")
prompt = "Describe this image in detail."
response = generate(model, tokenizer, prompt=prompt, max_tokens=512)
print(response)
For multimodal usage with images:
from mlx_vlm import load, generate
model, processor = load("majentik/Qwen3.6-35B-A3B-RotorQuant-MLX-3bit")
prompt = "What do you see in this image?"
output = generate(model, processor, prompt=prompt, image="path/to/image.jpg", max_tokens=512)
print(output)
What is RotorQuant?
RotorQuant is a high-performance KV-cache quantization method that achieves significantly better throughput than TurboQuant. Combined with 3-bit weight quantization in MLX, this provides a dual compression strategy with superior KV-cache performance: smaller model weights plus faster compressed KV cache for efficient long-context generation.
Key advantages over TurboQuant:
- 5.3x faster prefill
- 28% faster decode
- Equivalent memory savings
KV-Cache Quantization Comparison
| Method | Prefill Speed | Decode Speed | Memory Savings | Reference |
|---|---|---|---|---|
| TurboQuant | 1x (baseline) | 1x (baseline) | High | arXiv: 2504.19874 |
| RotorQuant | 5.3x faster | 28% faster | High | GitHub |
Memory Estimates (Qwen3.6 35B-A3B)
| Precision | Approximate Size | MLX Variant |
|---|---|---|
| FP16 (original) | ~70 GB (approx.) | -- |
| 8-bit quantized | ~35 GB | RotorQuant-MLX-8bit |
| 3-bit quantized | ~18 GB | This model |
| 2-bit quantized | ~9 GB | RotorQuant-MLX-2bit |
Hardware Requirements
This model requires approximately 18 GB of unified memory. Recommended hardware:
- Apple M2 Pro (24 GB+)
- Apple M3 Pro (24 GB+)
- Apple M4 Pro (24 GB+)
- Any Apple Silicon Mac with 24 GB+ unified memory
See Also
- Qwen/Qwen3.6-35B-A3B -- Base model
- majentik/Qwen3.6-35B-A3B-RotorQuant -- RotorQuant KV-cache only (transformers)
- majentik/Qwen3.6-35B-A3B-RotorQuant-MLX-8bit -- MLX 8-bit variant
- majentik/Qwen3.6-35B-A3B-RotorQuant-MLX-2bit -- MLX 2-bit variant
- majentik/Qwen3.6-35B-A3B-TurboQuant-MLX-3bit -- TurboQuant MLX 3-bit variant
- RotorQuant GitHub
- MLX Framework
Quant trade-off (MLX lane)
| Bits | Approx size | Use case | Recommendation |
|---|---|---|---|
| 2-bit | ~9.1 GB | Aggressive quantization | Very low-RAM Macs |
| 3-bit | ~13 GB | Lossy but small | Low-RAM Macs |
| 4-bit | ~15 GB | Balanced default | Recommended for most Macs |
| 5-bit | ~18 GB | Higher fidelity | Quality-sensitive |
| 6-bit | ~21 GB | Approaching FP16 quality | High-fidelity |
| 8-bit | ~27 GB | Near-lossless reference | Fidelity-critical work |
(Current variant — 3bit — is bolded.)
Variants in this family
(Showing 24 sibling variants under majentik/qwen3.6-35b-a3b-*. The current variant — RotorQuant-MLX-3bit — is bolded.)
| Variant | Runtime | Approx size | Use case |
|---|---|---|---|
| RotorQuant | runtime modifier | n/a | KV-cache root (weight-agnostic) |
| RotorQuant-AWQ-4bit | transformers | ~22 GB | GPU 4-bit (AutoAWQ) |
| RotorQuant-AWQ-8bit | transformers | ~38 GB | GPU 8-bit (AutoAWQ) |
| RotorQuant-GGUF-IQ4_XS | llama.cpp | ~30 GB | Lossy 4-bit, low-RAM CPU/edge |
| RotorQuant-GGUF-Q2_K | llama.cpp | ~21 GB | Lossy, low-RAM CPU/edge |
| RotorQuant-GGUF-Q3_K_M | llama.cpp | ~27 GB | Smaller 3-bit, CPU-friendly |
| RotorQuant-GGUF-Q4_K_M | llama.cpp | ~38 GB | Balanced default |
| RotorQuant-GGUF-Q5_K_M | llama.cpp | ~46 GB | Higher fidelity, more RAM |
| RotorQuant-GGUF-Q8_0 | llama.cpp | ~74 GB | Near-lossless reference |
| RotorQuant-MLX-2bit | mlx-lm | ~11 GB | Apple Silicon, smallest |
| RotorQuant-MLX-3bit | mlx-lm | ~16 GB | Apple Silicon, small |
| RotorQuant-MLX-4bit | mlx-lm | ~22 GB | Apple Silicon balanced |
| RotorQuant-MLX-5bit | mlx-lm | ~27 GB | Apple Silicon, higher fidelity |
| RotorQuant-MLX-6bit | mlx-lm | ~32 GB | Apple Silicon, near-lossless |
| RotorQuant-MLX-8bit | mlx-lm | ~41 GB | Apple Silicon reference |
| TurboQuant | runtime modifier | n/a | KV-cache root (weight-agnostic) |
| TurboQuant-AWQ-4bit | transformers | ~22 GB | GPU 4-bit (AutoAWQ) |
| TurboQuant-AWQ-8bit | transformers | ~38 GB | GPU 8-bit (AutoAWQ) |
| TurboQuant-MLX-2bit | mlx-lm | ~11 GB | Apple Silicon, smallest |
| TurboQuant-MLX-3bit | mlx-lm | ~16 GB | Apple Silicon, small |
| TurboQuant-MLX-4bit | mlx-lm | ~22 GB | Apple Silicon balanced |
| TurboQuant-MLX-5bit | mlx-lm | ~27 GB | Apple Silicon, higher fidelity |
| TurboQuant-MLX-6bit | mlx-lm | ~32 GB | Apple Silicon, near-lossless |
| TurboQuant-MLX-8bit | mlx-lm | ~41 GB | Apple Silicon reference |
- Downloads last month
- 1,462
3-bit
Model tree for majentik/Qwen3.6-35B-A3B-RotorQuant-MLX-3bit
Base model
Qwen/Qwen3.6-35B-A3B