Kimi-K2.6-MXFP4
MXFP4-quantized version of moonshotai/Kimi-K2.6, produced with AMD Quark via the quanto toolkit.
Benchmark Results
MMLU 5-shot log-likelihood evaluation. All runs use the same prompt format (no chat template) for direct comparability.
| Precision | Model | MMLU 5-shot (acc) | Δ vs W4A16 |
|---|---|---|---|
| W4A16 (compressed-tensors) | moonshotai/Kimi-K2.6 (official) | 89.62% | — |
| MXFP4 (OCP MX, Quark RTN) | This model | 89.05% | -0.57 pp |
The official moonshotai/Kimi-K2.6 release uses W4A16 compressed-tensors quantization (WNA16 MoE method).
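As a rough illustration of what a log-likelihood multiple-choice evaluation computes, the sketch below picks the answer letter with the highest log-likelihood and reports accuracy. The log-probability values are made up for illustration, the helper names are hypothetical, and the actual evaluation harness used for the table is not specified in this card.

```python
def pick_answer(choice_logprobs):
    """Return the answer label whose continuation has the highest log-likelihood."""
    return max(choice_logprobs, key=choice_logprobs.get)


def accuracy(examples):
    """Fraction of examples where the highest-likelihood label equals the gold label."""
    correct = sum(pick_answer(logprobs) == gold for logprobs, gold in examples)
    return correct / len(examples)


# Illustrative (made-up) per-choice log-likelihoods and gold labels.
examples = [
    ({"A": -1.2, "B": -0.3, "C": -2.5, "D": -3.1}, "B"),
    ({"A": -0.9, "B": -1.4, "C": -0.8, "D": -2.0}, "C"),
    ({"A": -0.5, "B": -1.1, "C": -1.7, "D": -0.6}, "D"),
]
acc = accuracy(examples)

# Gap between the two table rows, in percentage points.
delta_pp = round(89.05 - 89.62, 2)
print(f"accuracy={acc:.3f}, delta={delta_pp} pp")
```

The gap in the table is an absolute difference in percentage points, not a relative change.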
Quantization Details
| Property | Value |
|---|---|
| Method | MXFP4 (MX Floating Point 4-bit) |
| Algorithm | RTN (Round-to-Nearest) |
| Weight dtype | FP4 (E2M1), OCP MX format |
| Activation dtype | FP4 (E2M1), dynamic per-group |
| Scale format | E8M0 (per-group of 32) |
| Group size | 32 |
| Tool | AMD Quark 0.11.1 + quanto |
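To make the table concrete, here is a minimal pure-Python sketch of round-to-nearest MXFP4 quantization for one group of 32 values: a shared power-of-two scale (the E8M0 exponent) plus one E2M1 element per value. It follows the general OCP MX recipe and is not Quark's implementation; the function names are hypothetical, and tie-breaking here is simplified (ties go to the smaller magnitude, whereas real kernels typically round half to even).

```python
import math

# Magnitudes representable by FP4 E2M1 (1 sign, 2 exponent, 1 mantissa bit).
E2M1_VALUES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
E2M1_EMAX = 2      # exponent of the largest E2M1 binade (4.0 == 2**2)
GROUP_SIZE = 32


def quantize_group(values):
    """Return (scale_exponent, elements) for one group using round-to-nearest."""
    amax = max(abs(v) for v in values)
    if amax == 0.0:
        return 0, [0.0] * len(values)
    # frexp gives amax == m * 2**e with 0.5 <= m < 1, so floor(log2(amax)) == e - 1.
    _, e = math.frexp(amax)
    scale_exp = (e - 1) - E2M1_EMAX
    scale_exp = max(-127, min(127, scale_exp))  # E8M0 stores only a biased exponent
    scale = 2.0 ** scale_exp
    elements = []
    for v in values:
        scaled = abs(v) / scale
        # Nearest grid point; values above 6.0 saturate to 6.0.
        nearest = min(E2M1_VALUES, key=lambda g: abs(g - scaled))
        elements.append(math.copysign(nearest, v))
    return scale_exp, elements


def dequantize_group(scale_exp, elements):
    """Reconstruct approximate values from the shared scale and FP4 elements."""
    scale = 2.0 ** scale_exp
    return [scale * x for x in elements]


group = [0.0] * (GROUP_SIZE - 4) + [12.0, -3.0, 1.0, 5.5]
scale_exp, elements = quantize_group(group)
restored = dequantize_group(scale_exp, elements)
print(scale_exp, restored[-4:])
```

Each group stores 32 four-bit elements plus one 8-bit scale, which works out to 4.25 bits per quantized value.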
Model Architecture
Kimi-K2.6 is a 1-trillion-parameter Mixture-of-Experts language model with:
- Total parameters: ~1T
- Active parameters per token: ~32B
- Architecture: MoE with Multi-head Latent Attention (MLA), 61 transformer layers
- Experts: 384 routed + 1 shared expert per MoE layer, top-8 routing
- Context length: 128K tokens
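The top-8 routing over 384 routed experts can be pictured with a generic top-k selection sketch. This is only an illustration of the idea: the scoring function, any expert grouping, and the weight normalization that Kimi-K2.6 actually uses are not described in this card, so the random scores and the sum-to-one normalization below are assumptions, and `route_token` is a hypothetical helper.

```python
import heapq
import random

NUM_ROUTED_EXPERTS = 384
TOP_K = 8


def route_token(router_scores, top_k=TOP_K):
    """Pick the top_k experts for one token and normalize their weights to sum to 1."""
    top = heapq.nlargest(top_k, range(len(router_scores)), key=router_scores.__getitem__)
    total = sum(router_scores[i] for i in top)
    return [(i, router_scores[i] / total) for i in top]


random.seed(0)
# Stand-in for positive per-expert router scores of a single token.
scores = [random.random() for _ in range(NUM_ROUTED_EXPERTS)]
selected = route_token(scores)
print(selected)
```

Only the selected routed experts (plus the shared expert) run for a given token, which is why the active parameter count is far below the total.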
Usage
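from vllm import LLM, SamplingParams

llm = LLM(
    model="haanjack/Kimi-K2.6-MXFP4",
    tensor_parallel_size=4,
    trust_remote_code=True,
    max_model_len=32768,
    enforce_eager=True,  # required: avoids HIP kernel crash during graph capture
    gpu_memory_utilization=0.85,
)

# Illustrative prompt and sampling settings
outputs = llm.generate(["The capital of France is"], SamplingParams(max_tokens=32))
print(outputs[0].outputs[0].text)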
Required environment variables (AMD ROCm):
export QUARK_MXFP4_IMPL=triton # use Triton kernel (avoids HIP C++ kernel crash on gfx950)
export PYTORCH_ROCM_ARCH=gfx950 # set to your GPU architecture for fast kernel compilation
Note: This model requires AMD Quark and a recent vLLM build with Quark support (quantization=quark). Tested with rocm/vllm-dev:nightly (vLLM 0.20.1rc1, ROCm 7.2, AMD MI355).
Serving with vLLM
QUARK_MXFP4_IMPL=triton PYTORCH_ROCM_ARCH=gfx950 \
  python -m vllm.entrypoints.openai.api_server \
    --model haanjack/Kimi-K2.6-MXFP4 \
    --tensor-parallel-size 4 \
    --trust-remote-code \
    --max-model-len 32768 \
    --gpu-memory-utilization 0.85 \
    --enforce-eager
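Once the server is up, it can be queried through the OpenAI-compatible completions endpoint. The following is a minimal standard-library sketch; the host and port (localhost:8000) are assumed defaults and should be adjusted to your deployment, and the helper names are hypothetical.

```python
import json
import urllib.request

MODEL_ID = "haanjack/Kimi-K2.6-MXFP4"


def build_completion_request(prompt, base_url="http://localhost:8000", max_tokens=64):
    """Build a POST request for the /v1/completions endpoint (base_url is an assumption)."""
    payload = {
        "model": MODEL_ID,
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": 0.0,
    }
    return urllib.request.Request(
        f"{base_url}/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def complete(prompt, **kwargs):
    """Send the request and return the generated text (requires a running server)."""
    with urllib.request.urlopen(build_completion_request(prompt, **kwargs)) as resp:
        return json.loads(resp.read())["choices"][0]["text"]
```

Calling `complete("The capital of France is")` returns the generated continuation when the server from the command above is reachable.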
Quantization Recipe
Quantized using the quanto CLI:
python -m quanto \
  --model_path moonshotai/Kimi-K2.6 \
  --output_dir ./kimi-k2.6-mxfp4 \
  --precision mxfp4 \
  --exclude_layers lm_head "*self_attn*" "*.gate" "*shared_experts*" "*embed*" "*norm*"
The mxfp4 precision triggers Quark's quantize_model_per_safetensor (file-to-file) path, which processes each safetensors shard independently without loading the full model into GPU memory.
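To get a feel for what the exclusion patterns leave in higher precision, the sketch below applies them to a few example module names with shell-style glob matching. Both the matching semantics (Python's `fnmatchcase`) and the module names are assumptions for illustration; the tool's actual matcher and the checkpoint's real names may differ.

```python
from fnmatch import fnmatchcase

# Patterns copied from the --exclude_layers argument above.
EXCLUDE_PATTERNS = ["lm_head", "*self_attn*", "*.gate", "*shared_experts*", "*embed*", "*norm*"]


def is_excluded(module_name, patterns=EXCLUDE_PATTERNS):
    """True if the module name matches any exclusion glob (assumed semantics)."""
    return any(fnmatchcase(module_name, p) for p in patterns)


# Hypothetical module names, for illustration only.
names = [
    "lm_head",
    "model.embed_tokens",
    "model.layers.3.self_attn.q_a_proj",
    "model.layers.3.mlp.gate",
    "model.layers.3.mlp.shared_experts.up_proj",
    "model.layers.3.input_layernorm",
    "model.layers.3.mlp.experts.0.gate_proj",
    "model.layers.3.mlp.experts.0.down_proj",
]
quantized = [n for n in names if not is_excluded(n)]
print(quantized)
```

Under these assumptions only the routed-expert projections are quantized, while attention, the router gate, shared experts, embeddings, norms, and the output head are left untouched.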
Known Limitations
- Requires the --enforce-eager flag in vLLM (CUDA graph capture triggers a kernel crash with the Quark MXFP4 emulation backend on ROCm)
- QUARK_MXFP4_IMPL=triton is required on gfx950 (MI355) hardware; the default HIP C++ kernel has a memory access bug on this architecture
- Native MXFP4 compute kernels (AITER) are not yet available for the w_mxfp4_a_mxfp4 scheme; weights are dequantized to BF16 on-the-fly during inference (emulation mode)