Qwen3.5-397B-A17B-RotorQuant
RotorQuant KV-cache quantization wrapper for Qwen/Qwen3.5-397B-A17B — the 397B-parameter / 17B-active Sparse MoE multimodal flagship from the Qwen team. This repository ships full-precision weights plus a data-calibrated IsoQuantCache that applies learned orthogonal rotors to K/V tensors, achieving the lowest KV-quantization error of any method in the Majentik suite.
RotorQuant trades a short one-time calibration pass for roughly +0.4 percentage points of fidelity over TurboQuant at identical bit-widths — meaningful on reasoning-heavy and long-context workloads.
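The core idea behind rotor-based KV quantization can be sketched in a few lines: apply an orthogonal rotor before quantizing, then invert it after dequantizing, so quantization noise is spread across channels instead of concentrating on outliers. The sketch below substitutes a random QR-derived orthogonal matrix for this repo's learned, data-calibrated rotors and uses a generic symmetric uniform quantizer; it illustrates the general technique, not the actual RotorQuant implementation.

```python
import numpy as np

def random_rotor(d, seed=0):
    # Orthogonal "rotor" via QR of a Gaussian matrix
    # (stand-in for a learned, data-calibrated rotor).
    rng = np.random.default_rng(seed)
    q, _ = np.linalg.qr(rng.standard_normal((d, d)))
    return q

def quantize(x, bits):
    # Symmetric uniform quantization with a per-tensor scale.
    levels = 2 ** (bits - 1) - 1
    scale = np.abs(x).max() / levels
    return np.round(x / scale), scale

def dequantize(q, scale):
    return q * scale

d = 8
R = random_rotor(d)
k = np.array([50.0, 1, -1, 2, 0.5, -2, 1, 3])  # toy K vector with an outlier channel

# Rotor round trip: rotate -> quantize -> dequantize -> rotate back.
rot = R @ k
q, s = quantize(rot, bits=4)
k_hat = R.T @ dequantize(q, s)

err = np.linalg.norm(k - k_hat)
```

Because the rotor is orthogonal, the round trip is lossless apart from rounding: per-element rounding error is at most half a quantization step, so the L2 reconstruction error is bounded by `0.5 * s * sqrt(d)`.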
Quickstart
```python
import torch
from transformers import AutoModelForCausalLM, AutoProcessor
from rotorquant import IsoQuantCache

model = AutoModelForCausalLM.from_pretrained(
    "majentik/Qwen3.5-397B-A17B-RotorQuant",
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)
processor = AutoProcessor.from_pretrained("majentik/Qwen3.5-397B-A17B-RotorQuant")

# Data-calibrated KV cache with learned rotors (4-bit K and V by default)
cache = IsoQuantCache.from_pretrained(
    "majentik/Qwen3.5-397B-A17B-RotorQuant",
    bits_k=4,
    bits_v=4,
)

messages = [
    {"role": "user", "content": [
        {"type": "image", "image": "https://example.com/chart.png"},
        {"type": "text", "text": "Summarize the trend shown in this chart."},
    ]},
]
inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

out = model.generate(**inputs, max_new_tokens=512, past_key_values=cache)
print(processor.decode(out[0], skip_special_tokens=True))
```
Model Specs
| Property | Value |
|---|---|
| Base model | Qwen/Qwen3.5-397B-A17B |
| Architecture | Sparse Mixture-of-Experts (MoE) |
| Total parameters | 397B |
| Active per token | 17B |
| Modalities | Image + Text → Text (image-text-to-text) |
| Context window | 256K tokens (native) |
| KV quantization | RotorQuant (learned rotors, 4-bit default) |
| Precision (weights) | bfloat16 |
| License | Apache 2.0 |
RotorQuant vs TurboQuant
| Aspect | RotorQuant (this repo) | TurboQuant |
|---|---|---|
| Rotation | Learned orthogonal rotors (data-calibrated) | Randomized Hadamard (static) |
| Calibration | One-time pass over ~512 calibration samples | Zero-shot |
| Accuracy (4-bit KV) | ~99.7% of FP16 baseline | ~99.3% of FP16 baseline |
| Best for | Maximum fidelity, production-critical workloads | Quick deployment, no calibration data |
| Latency overhead | ~2–3% vs FP16 KV | ~1–2% vs FP16 KV |
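For contrast, the "Randomized Hadamard (static)" rotation in the TurboQuant column requires no calibration data at all: it is a fixed Hadamard matrix with random sign flips. A minimal sketch using the Sylvester construction (an illustration of the general technique, not TurboQuant's actual code):

```python
import numpy as np

def hadamard(n):
    # Sylvester construction; n must be a power of two.
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H

def randomized_hadamard(n, seed=0):
    # Static randomized Hadamard rotation: scale columns by random
    # signs, then normalize so the result is orthogonal.
    rng = np.random.default_rng(seed)
    signs = rng.choice([-1.0, 1.0], size=n)
    return (hadamard(n) * signs) / np.sqrt(n)

R = randomized_hadamard(8)
```

Such a rotation spreads outliers across channels like a learned rotor does, but because it is data-agnostic it cannot adapt to the model's actual K/V statistics — which is the fidelity gap the comparison table reflects.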
Memory Estimates
Weights (bf16) dominate at ~794 GB. KV cache at 256K context:
| KV precision | KV @ 256K ctx | Total VRAM @ 256K |
|---|---|---|
| FP16 | ~128 GB | ~922 GB |
| RotorQuant 8 | ~64 GB | ~858 GB |
| RotorQuant 4 | ~32 GB | ~826 GB |
| RotorQuant 2 | ~16 GB | ~810 GB |
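The table's figures follow from linear scaling (an assumption: KV footprint proportional to bit-width and context length, anchored at 128 GB for FP16 KV at the native 256K context). A small helper reproduces them:

```python
# Anchors taken from the tables above.
FP16_KV_GB_AT_256K = 128   # FP16 KV cache at 256K tokens
WEIGHTS_GB = 794           # bf16 weights
NATIVE_CTX = 262_144       # 256K tokens

def kv_gb(bits: int, ctx_tokens: int) -> float:
    """KV-cache size in GB, scaling linearly in bit-width and context."""
    return FP16_KV_GB_AT_256K * (bits / 16) * (ctx_tokens / NATIVE_CTX)

def total_vram_gb(bits: int, ctx_tokens: int) -> float:
    """Approximate total VRAM: bf16 weights plus quantized KV cache."""
    return WEIGHTS_GB + kv_gb(bits, ctx_tokens)
```

For example, `total_vram_gb(4, 262_144)` gives the ~826 GB shown in the RotorQuant 4 row.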
Hardware Requirements
- Minimum: 8× H100 80GB (640 GB VRAM) with CPU offload
- Recommended: 8× H200 141GB or MI300X cluster for full-context inference
- Apple Silicon: Not recommended at bf16 — use the MLX variants below
See Also
- majentik/Qwen3.5-397B-A17B-TurboQuant — zero-calibration KV quant
- MLX weight-quantized variants: 8-bit · 6-bit · 5-bit · 4-bit · 2-bit
- Base model: Qwen/Qwen3.5-397B-A17B