Tool hallucinations and bad tool calls
When I run this model (or the bf16 version), I get hallucinated tools, and the tool calls show up in the output stream.
This is an example of what it returns when I ask what tools it has:
- File operations: read, write, and search through files in the project
- Terminal/bash: execute shell commands
- Web browsing/searching: only when explicitly requested
It has none of these tools.
This is what the output looks like when it attempts to call one of these non-existent tools:
<minimax:tool_call>
<invoke name="cli-mcp-server_run_command">
<parameter name="command">ls -la</parameter>
</invoke>
</minimax:tool_call>
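As a client-side workaround while this gets sorted out, the leaked markup can be parsed out of the content stream. This is just a sketch based on the tag names visible in the output above (it is not vLLM's `minimax_m2` parser, and the format may differ in edge cases):

```python
import re

# Raw tool-call markup as it leaks into the content stream
# (format copied from the example output above).
raw = """<minimax:tool_call>
<invoke name="cli-mcp-server_run_command">
<parameter name="command">ls -la</parameter>
</invoke>
</minimax:tool_call>"""

TOOL_CALL_RE = re.compile(r"<minimax:tool_call>(.*?)</minimax:tool_call>", re.DOTALL)
INVOKE_RE = re.compile(r'<invoke name="([^"]+)">(.*?)</invoke>', re.DOTALL)
PARAM_RE = re.compile(r'<parameter name="([^"]+)">(.*?)</parameter>', re.DOTALL)

def extract_tool_calls(text):
    """Split leaked markup into (cleaned_text, list of {name, arguments})."""
    calls = []
    for block in TOOL_CALL_RE.findall(text):
        for name, body in INVOKE_RE.findall(block):
            args = dict(PARAM_RE.findall(body))
            calls.append({"name": name, "arguments": args})
    cleaned = TOOL_CALL_RE.sub("", text).strip()
    return cleaned, calls

cleaned, calls = extract_tool_calls(raw)
print(calls)
# [{'name': 'cli-mcp-server_run_command', 'arguments': {'command': 'ls -la'}}]
```

If the parser were working server-side, these would arrive as structured `tool_calls` entries instead of text.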
Care to also share your vllm command line, so I can get a proper understanding of what's going on? I have used the model extensively, and aside from the infamous "plurals" bug, this quant of mratsim's has been flawless so far!
This is my fp8 config; the `--kv-cache-dtype` flag is removed for bf16:
SAMPLER_OVERRIDE='{"temperature": 1, "top_p": 0.95, "top_k": 40, "repetition_penalty": 1.1, "frequency_penalty": 0.40}'
vllm serve /models/MiniMax-M2.5-FP8-INT4-AWQ \
--gpu-memory-utilization 0.92 \
--enable-expert-parallel \
--kv-cache-dtype fp8_e4m3 \
--override-generation-config "${SAMPLER_OVERRIDE}" \
--max-model-len 195000 \
--async-scheduling \
--no-enable-prefix-caching \
--tensor-parallel-size 4 \
--enable-auto-tool-choice --tool-call-parser minimax_m2 \
--reasoning-parser minimax_m2 \
--served-model-name minimax-m2.5 \
--compilation-config '{"pass_config":{"fuse_allreduce_rms":false}}' \
--port 9000
Running on 4x H100.
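One quick sanity check worth doing: a malformed `--override-generation-config` string can fail at server startup, so it's cheap to validate the JSON locally first. A minimal sketch using the override string from the command above:

```python
import json

# Same string passed via --override-generation-config above;
# json.loads raises ValueError if the quoting/syntax is broken.
SAMPLER_OVERRIDE = '{"temperature": 1, "top_p": 0.95, "top_k": 40, "repetition_penalty": 1.1, "frequency_penalty": 0.40}'

cfg = json.loads(SAMPLER_OVERRIDE)
assert 0.0 < cfg["top_p"] <= 1.0   # top_p is a probability mass
assert cfg["top_k"] > 0            # top_k must be a positive cutoff
print(cfg["temperature"])          # 1
```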
8x RTX 3090 (P2P enabled - patched vllm) --> vllm 0.15.1 works fine with:
vllm serve mratsim/MiniMax-M2.5-FP8-INT4-AWQ \
--served-model-name "MiniMax-M2.5-FP8-INT4-AWQ" \
--tensor-parallel-size 8 \
--max-model-len 196608 \
--gpu-memory-utilization 0.90 \
--max-num-seqs 4 \
--kv-cache-dtype fp8 \
--reasoning-parser minimax_m2 \
--tool-call-parser minimax_m2 \
--enable-auto-tool-choice \
--host 0.0.0.0 \
--port 5005 \
--disable-log-requests \
--disable-uvicorn-access-log \
--swap-space 8 \
--override-generation-config '{"temperature": 1, "top_p": 0.95, "top_k": 40, "repetition_penalty": 1.1, "frequency_penalty": 0.40}' \
--trust-remote-code
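Also worth checking how the client is asking about tools: with `--enable-auto-tool-choice`, the model only learns about tools from the `tools` array in each request, so if the client sends none, any tool inventory the model describes is pure hallucination. A minimal OpenAI-compatible payload sketch (the `run_command` tool here is a made-up example, not something your client necessarily defines):

```python
import json

# Hypothetical request body for an OpenAI-compatible /v1/chat/completions call.
# Only tools declared in "tools" should ever be callable by the model.
payload = {
    "model": "minimax-m2.5",
    "messages": [
        {"role": "user", "content": "List the files in the current directory."}
    ],
    "tools": [
        {
            "type": "function",
            "function": {
                "name": "run_command",  # example tool name, not from the thread
                "description": "Execute a shell command",
                "parameters": {
                    "type": "object",
                    "properties": {"command": {"type": "string"}},
                    "required": ["command"],
                },
            },
        }
    ],
}
body = json.dumps(payload)
```

If responses reference tools that are not in this array, the problem is on the model/template side rather than the client.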
Not sure which vllm version you're running (probably a newer one) that lets you skip `--trust-remote-code`.
I was running a vllm nightly build from a few days before my post.
When you say "patched vllm", what are you referring to?