MiniMax-M2.5-CPU-NUMA4-AMXINT8

MiniMaxAI/MiniMax-M2.5 quantized to the AMXINT8 format for inference with sglang + ktransformers, packed specifically for machines with 4 NUMA nodes.

To run this model, ensure that your CPU supports the AMX instruction set (Intel Xeon, Sapphire Rapids or newer) and make note of your NUMA node count. Install kt-kernel and sglang-kt following the official documentation.
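Both prerequisites can be verified from the shell before installing anything. This is a sketch assuming a Linux system where the AMX CPU flags are exposed in /proc/cpuinfo and lscpu is available:

```shell
# AMX INT8 support shows up as the amx_int8 CPU flag on Sapphire Rapids and newer
grep -m1 -o 'amx_int8' /proc/cpuinfo || echo "AMX INT8 not detected"

# NUMA node count; this package expects 4
lscpu | awk -F: '/NUMA node\(s\)/ {gsub(/ /,"",$2); print "NUMA nodes: " $2}'
```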

Then download both the official FP8 weights of MiniMaxAI/MiniMax-M2.5 and this CPU-optimized quantized model, and prepare your launch command:
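One way to fetch both repositories is with huggingface-cli; the local directory layout below is illustrative, not required:

```shell
# Download both the FP8 base weights and this AMXINT8 pack (large downloads)
MODEL_DIR=/models
huggingface-cli download MiniMaxAI/MiniMax-M2.5 \
  --local-dir "$MODEL_DIR/MiniMax-M2.5"
huggingface-cli download CPU-Hybrid-MoE/MiniMax-M2.5-CPU-NUMA4-AMXINT8 \
  --local-dir "$MODEL_DIR/MiniMax-M2.5-CPU-NUMA4-AMXINT8"
```

The two local directories then become the values of --model and --kt-weight-path in the launch command.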

```shell
PYTORCH_ALLOC_CONF=expandable_segments:True \
SGLANG_ENABLE_JIT_DEEPGEMM=0 \
python -m sglang.launch_server \
  --model /path/to/MiniMax-M2.5 \
  --kt-method AMXINT8 \
  --kt-weight-path /path/to/MiniMax-M2.5-CPU-NUMA4-AMXINT8 \
  --kt-cpuinfer 128 \
  --kt-threadpool-count 4 \
  --kt-num-gpu-experts 64 \
  --kt-max-deferred-experts-per-token 0 \
  --kt-expert-placement-strategy uniform \
  --trust-remote-code \
  --mem-fraction-static 0.98 \
  --served-model-name MiniMaxAI/MiniMax-M2.5 \
  --enable-mixed-chunk \
  --tensor-parallel-size 1 \
  --enable-p2p-check \
  --disable-shared-experts-fusion \
  --chunked-prefill-size 4096 \
  --context-length 131072 \
  --max-total-tokens 131072 \
  --max-running-requests 1 \
  --attention-backend flashinfer \
  --fp8-gemm-backend cutlass \
  --reasoning-parser minimax \
  --tool-call-parser minimax-m2
```
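Once the server is up, a quick smoke test can be run against its OpenAI-compatible endpoint. The host and port below assume the sglang defaults (localhost:30000); adjust them if you passed --host or --port:

```shell
# Minimal chat completion request against the running server
curl -s http://localhost:30000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "MiniMaxAI/MiniMax-M2.5",
        "messages": [{"role": "user", "content": "Hello"}],
        "max_tokens": 32
      }'
```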

Notes:

  • --kt-cpuinfer should be set to the total number of physical CPU cores across all NUMA nodes
  • --tensor-parallel-size should be set to the number of GPUs you are using
  • The optimal choices for --attention-backend and --fp8-gemm-backend depend on the CUDA architecture of your GPUs; check the sglang documentation
  • --kt-num-gpu-experts, --mem-fraction-static, --chunked-prefill-size, --context-length, --max-total-tokens, and --max-running-requests should be adjusted to the constraints of your hardware
  • Please review the official kt-kernel documentation for details
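The physical core count needed for --kt-cpuinfer can be derived from lscpu (sockets times cores per socket, excluding hyperthreads). A sketch assuming the standard lscpu output format:

```shell
# Physical cores = Socket(s) * Core(s) per socket; hyperthreads are excluded
SOCKETS=$(lscpu | awk -F: '/^Socket\(s\)/ {gsub(/ /,"",$2); print $2}')
CORES=$(lscpu | awk -F: '/^Core\(s\) per socket/ {gsub(/ /,"",$2); print $2}')
echo "Set --kt-cpuinfer to: $((SOCKETS * CORES))"
```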