0% acceptance rate when using MTP
#5
by ydemartino - opened
I'm using MTP with vLLM 0.20.1:
export VLLM_ROCM_USE_AITER=1
export SAFETENSORS_FAST_GPU=1
vllm serve \
--model amd/GLM-4.7-MXFP4 \
--port 8000 \
--tool-call-parser glm47 \
--reasoning-parser glm45 \
--enable-auto-tool-choice \
--tensor-parallel-size 2 \
--enable-prefix-caching \
--performance-mode throughput \
--default-chat-template-kwargs '{"enable_thinking": true, "clear_thinking": false}' \
--gpu-memory-utilization 0.97
Here is the result:
---------------Speculative Decoding---------------
Acceptance rate (%): 0.00
Acceptance length: 1.00
Drafts: 2794
Draft tokens: 2794
Accepted tokens: 0
Per-position acceptance (%):
Position 0: 0.00
Do I need to do something special to have MTP work with this model?
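
As a sanity check on the numbers above, here is a minimal sketch of how the summary values relate to the raw counters. It assumes that the acceptance rate is accepted tokens divided by draft tokens, and that the acceptance length is one (the token the target model always emits per step) plus accepted tokens per draft. The helper name `spec_decode_summary` is made up for illustration and is not part of vLLM.

```python
def spec_decode_summary(num_drafts, draft_tokens, accepted_tokens):
    """Recompute the summary metrics from the raw counters.

    Assumed definitions (not taken from the vLLM source):
      acceptance rate   = 100 * accepted_tokens / draft_tokens
      acceptance length = 1 + accepted_tokens / num_drafts
    """
    rate = 100.0 * accepted_tokens / draft_tokens if draft_tokens else 0.0
    length = 1.0 + accepted_tokens / num_drafts if num_drafts else 1.0
    return rate, length


# Counters copied from the benchmark output above.
rate, length = spec_decode_summary(
    num_drafts=2794, draft_tokens=2794, accepted_tokens=0
)
print(f"Acceptance rate (%): {rate:.2f}")
print(f"Acceptance length: {length:.2f}")
print(f"Draft tokens per draft: {2794 / 2794:.0f}")
```

Under these assumptions the reported values are self-consistent: one draft token was proposed per step, and every one of them was rejected, so each step produced only the single token from the target model.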