Qwen3.5-397B-A17B-RotorQuant

RotorQuant KV-cache quantization wrapper for Qwen/Qwen3.5-397B-A17B, the 397B-parameter (17B active) sparse-MoE multimodal flagship from the Qwen team. This repository ships the full-precision weights plus a data-calibrated IsoQuantCache that applies learned orthogonal rotors to K/V tensors, achieving the lowest KV-quantization error of any method in the Majentik suite.

RotorQuant trades a short one-time calibration pass for roughly +0.4 percentage points of fidelity over TurboQuant at identical bit-widths — meaningful on reasoning-heavy and long-context workloads.

Quickstart

```python
import torch
from transformers import AutoModelForCausalLM, AutoProcessor
from rotorquant import IsoQuantCache

model = AutoModelForCausalLM.from_pretrained(
    "majentik/Qwen3.5-397B-A17B-RotorQuant",
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)
processor = AutoProcessor.from_pretrained("majentik/Qwen3.5-397B-A17B-RotorQuant")

# Load the calibrated rotors and build a 4-bit K / 4-bit V quantized cache.
cache = IsoQuantCache.from_pretrained(
    "majentik/Qwen3.5-397B-A17B-RotorQuant",
    bits_k=4, bits_v=4,
)

messages = [
    {"role": "user", "content": [
        {"type": "image", "image": "https://example.com/chart.png"},
        {"type": "text",  "text": "Summarize the trend shown in this chart."},
    ]},
]
inputs = processor.apply_chat_template(
    messages, add_generation_prompt=True, tokenize=True,
    return_dict=True, return_tensors="pt",
).to(model.device)

out = model.generate(**inputs, max_new_tokens=512, past_key_values=cache)
print(processor.decode(out[0], skip_special_tokens=True))
```

Model Specs

| Property | Value |
|---|---|
| Base model | Qwen/Qwen3.5-397B-A17B |
| Architecture | Sparse Mixture-of-Experts (MoE) |
| Total parameters | 397B |
| Active per token | 17B |
| Modalities | Image + Text → Text (image-text-to-text) |
| Context window | 256K tokens (native) |
| KV quantization | RotorQuant (learned rotors, 4-bit default) |
| Precision (weights) | bfloat16 |
| License | Apache 2.0 |

RotorQuant vs TurboQuant

| Aspect | RotorQuant (this repo) | TurboQuant |
|---|---|---|
| Rotation | Learned orthogonal rotors (data-calibrated) | Randomized Hadamard (static) |
| Calibration | ~512-sample calibration pass | Zero-shot |
| Accuracy (4-bit KV) | ~99.7% of FP16 baseline | ~99.3% of FP16 baseline |
| Best for | Maximum fidelity, production-critical workloads | Quick deployment, no calibration data |
| Latency overhead | ~2–3% vs FP16 KV | ~1–2% vs FP16 KV |
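The rotate-then-quantize idea shared by both methods fits in a few lines. Below is a minimal NumPy sketch, not the shipped kernel: a random orthogonal matrix from a QR decomposition stands in for the rotor that RotorQuant would learn during calibration, followed by per-row symmetric 4-bit quantization.

```python
import numpy as np

def quantize_with_rotor(x, R, bits=4):
    """Rotate along the head dimension, then per-row symmetric int quantization."""
    x_rot = x @ R                                # apply orthogonal rotor
    qmax = 2 ** (bits - 1) - 1                   # e.g. 7 for 4-bit
    scale = np.abs(x_rot).max(axis=-1, keepdims=True) / qmax
    q = np.clip(np.round(x_rot / scale), -qmax - 1, qmax).astype(np.int8)
    return q, scale

def dequantize_with_rotor(q, scale, R):
    """Undo the quantization scale, then undo the rotation (R is orthogonal)."""
    return (q.astype(np.float32) * scale) @ R.T

rng = np.random.default_rng(0)
head_dim = 128
# Stand-in rotor: random orthogonal matrix (calibration would learn this).
R, _ = np.linalg.qr(rng.standard_normal((head_dim, head_dim)))
k = rng.standard_normal((16, head_dim))          # fake K rows for one head
q, scale = quantize_with_rotor(k, R, bits=4)
k_hat = dequantize_with_rotor(q, scale, R)
rel_err = np.linalg.norm(k - k_hat) / np.linalg.norm(k)
print(f"relative reconstruction error: {rel_err:.3f}")
```

Because the rotor is orthogonal, the rotation itself is lossless; the benefit comes from the rotated coordinates quantizing with smaller error than the raw ones, and a data-calibrated rotor tightens that error further than a static Hadamard transform.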

Memory Estimates

Weights (bf16) dominate at ~794 GB. KV cache at 256K context:

| KV precision | KV @ 256K ctx | Total VRAM @ 256K |
|---|---|---|
| FP16 | ~128 GB | ~922 GB |
| RotorQuant 8-bit | ~64 GB | ~858 GB |
| RotorQuant 4-bit | ~32 GB | ~826 GB |
| RotorQuant 2-bit | ~16 GB | ~810 GB |
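A quick back-of-envelope check of the table: KV memory scales linearly with bit-width, so each row is the FP16 figure (~128 GB at 256K context) times bits/16, on top of the fixed ~794 GB of bf16 weights.

```python
WEIGHTS_GB = 794   # 397B params x 2 bytes (bf16)
KV_FP16_GB = 128   # KV footprint at 256K context in FP16

for bits in (16, 8, 4, 2):
    kv = KV_FP16_GB * bits / 16          # KV memory scales linearly with bits
    print(f"{bits:>2}-bit KV: {kv:>5.0f} GB | total ~{WEIGHTS_GB + kv:.0f} GB")
```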

Hardware Requirements

- Minimum: 8× H100 80GB (640 GB VRAM), with weight offload to CPU
- Recommended: 8× H200 141GB or an MI300X cluster for full-context inference
- Apple Silicon: not recommended at bf16; use an MLX variant instead
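On the minimum configuration, the ~794 GB of weights exceed the 640 GB of VRAM, so part of the model must spill to CPU RAM. A hedged sketch of how that can be set up with the standard `max_memory` / `offload_folder` arguments from `transformers` + `accelerate`; the per-device budgets below are illustrative, not tuned:

```python
# Illustrative memory budgets: cap each of 8 GPUs below its 80 GB so the
# "auto" device map leaves headroom for activations and the KV cache,
# and give CPU RAM the overflow. Adjust to your actual hardware.
max_memory = {i: "75GiB" for i in range(8)}
max_memory["cpu"] = "256GiB"

# model = AutoModelForCausalLM.from_pretrained(
#     "majentik/Qwen3.5-397B-A17B-RotorQuant",
#     torch_dtype=torch.bfloat16,
#     device_map="auto",
#     max_memory=max_memory,
#     offload_folder="offload",   # disk spill for weights that fit nowhere else
#     trust_remote_code=True,
# )
```

Expect offloaded layers to be markedly slower than resident ones; the quantized KV cache helps here by keeping long-context state small enough to stay on-GPU.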
