0% acceptance rate when using MTP

#5
by ydemartino - opened

I'm using MTP with vLLM 0.20.1:

export VLLM_ROCM_USE_AITER=1
export SAFETENSORS_FAST_GPU=1

vllm serve \
    --model amd/GLM-4.7-MXFP4 \
    --port 8000 \
    --tool-call-parser glm47 \
    --reasoning-parser glm45 \
    --enable-auto-tool-choice \
    --tensor-parallel-size 2 \
    --enable-prefix-caching \
    --performance-mode throughput \
    --default-chat-template-kwargs '{"enable_thinking": true, "clear_thinking": false}' \
    --gpu-memory-utilization 0.97

Here is the result:

---------------Speculative Decoding---------------
Acceptance rate (%):                     0.00
Acceptance length:                       1.00
Drafts:                                  2794
Draft tokens:                            2794
Accepted tokens:                         0
Per-position acceptance (%):
  Position 0:                            0.00
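
To make clear how I read these numbers, here is a minimal sketch of how the two summary figures appear to follow from the raw counters. It assumes the acceptance rate is accepted tokens divided by draft tokens, and the acceptance length is one plus the accepted tokens per draft (the one being the token the target model always emits). The function name `spec_decode_summary` is my own for illustration and is not a vLLM API.

```python
def spec_decode_summary(num_drafts: int, num_draft_tokens: int, num_accepted_tokens: int):
    """Return (acceptance_rate_percent, mean_acceptance_length).

    Assumed definitions (not taken from vLLM source):
      rate   = 100 * accepted / draft_tokens
      length = 1 + accepted / drafts
    """
    rate = 100.0 * num_accepted_tokens / num_draft_tokens if num_draft_tokens else 0.0
    length = 1.0 + num_accepted_tokens / num_drafts if num_drafts else 1.0
    return rate, length


# Counters copied from the report above.
rate, length = spec_decode_summary(
    num_drafts=2794, num_draft_tokens=2794, num_accepted_tokens=0
)
print(f"Acceptance rate (%): {rate:.2f}")
print(f"Acceptance length:   {length:.2f}")
```

Since Drafts equals Draft tokens, each step proposes a single draft token, and with zero accepted tokens every one of those proposals was rejected.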

Do I need to do something special to have MTP work with this model?
