0% acceptance rate when using MTP

by ydemartino - opened May 6

I'm using MTP (multi-token prediction) speculative decoding with vLLM 0.20.1:

export VLLM_ROCM_USE_AITER=1
export SAFETENSORS_FAST_GPU=1

vllm serve \
    --model amd/GLM-4.7-MXFP4 \
    --port 8000 \
    --tool-call-parser glm47 \
    --reasoning-parser glm45 \
    --enable-auto-tool-choice \
    --tensor-parallel-size 2 \
    --enable-prefix-caching \
    --performance-mode throughput \
    --default-chat-template-kwargs '{"enable_thinking": true, "clear_thinking": false}' \
    --gpu-memory-utilization 0.97

Here is the result:

---------------Speculative Decoding---------------
Acceptance rate (%):                     0.00
Acceptance length:                       1.00
Drafts:                                  2794
Draft tokens:                            2794
Accepted tokens:                         0
Per-position acceptance (%):
  Position 0:                            0.00
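
As a sanity check, the reported figures are internally consistent. This is a minimal sketch that assumes the usual definitions behind this summary: acceptance rate is accepted tokens divided by draft tokens, and acceptance length is one plus accepted tokens per draft. The function name `spec_decode_summary` is hypothetical and not part of vLLM.

```python
def spec_decode_summary(drafts: int, draft_tokens: int, accepted_tokens: int) -> dict:
    """Recompute the summary figures from the raw counters.

    Assumed definitions (not taken from the post itself):
      acceptance rate   = 100 * accepted_tokens / draft_tokens
      acceptance length = 1 + accepted_tokens / drafts
    """
    # Guard against division by zero when no drafts were produced.
    rate = 100.0 * accepted_tokens / draft_tokens if draft_tokens else 0.0
    length = 1.0 + accepted_tokens / drafts if drafts else 1.0
    per_draft = draft_tokens / drafts if drafts else 0.0
    return {
        "acceptance_rate_pct": rate,
        "acceptance_length": length,
        "draft_tokens_per_draft": per_draft,
    }


# Counters copied from the output above.
summary = spec_decode_summary(drafts=2794, draft_tokens=2794, accepted_tokens=0)
print(summary)
```

Under these assumptions, 2794 draft tokens over 2794 drafts means one speculative token per step, and an acceptance length of exactly 1.00 means every drafted token was rejected, so only the target model's own token was emitted each step.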

Do I need to do something special to get MTP working with this model?