Works great on 3090 except for weird (...) generation
Running it on 6x3090 with vLLM, works great, but for some reason the output is full of (...), as if the model doesn't want to write long stretches of text and abbreviates everything. Does anybody know what the reason could be? I also get random Chinese characters that the model itself seems surprised to have written, and it apologizes for them. I believe it might also be a problem in the vLLM implementation of Step-3.5.
Command to run it with vLLM 0.17.1 on Ampere (3090):
python -m vllm.entrypoints.openai.api_server \
  --model Intel_Step-3.5-Flash-int4-mixed-AutoRound \
  --host 0.0.0.0 \
  --port 8001 \
  --gpu-memory-utilization 0.85 \
  --pipeline-parallel-size 6 \
  --tensor-parallel-size 1 \
  --swap-space 4 \
  --reasoning-parser step3p5 \
  --chat-template Intel_Step-3.5-Flash-int4-mixed-AutoRound/chat_template.jinja \
  --enable-auto-tool-choice \
  --tool-call-parser step3p5 \
  --trust_remote_code
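For anyone wanting to reproduce the (...) behavior: the command above exposes an OpenAI-compatible endpoint, so a request like the following should trigger it. A minimal sketch, assuming the server is up at localhost:8001 and the model name matches the --model value (the max_tokens and temperature values here are just illustrative):

```python
# Sketch of a chat completion request against the server started above.
# Assumes localhost:8001 and the AutoRound checkpoint name from --model.
import json
import urllib.request

payload = {
    "model": "Intel_Step-3.5-Flash-int4-mixed-AutoRound",
    "messages": [
        {"role": "user", "content": "Write a long, detailed answer in English."}
    ],
    "max_tokens": 1024,   # give it room, so abbreviation isn't a length issue
    "temperature": 0.7,   # illustrative value, not from the original post
}

def build_request(url="http://localhost:8001/v1/chat/completions"):
    # Builds the HTTP request; actually sending it needs the server running.
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )

# resp = urllib.request.urlopen(build_request())  # uncomment with server running
```

Even with a long max_tokens the output comes back full of (...) placeholders, which is why I suspect the serving side rather than the prompt.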
This model sometimes thinks in Chinese but should answer in English; similar behavior has been seen in the GGUFs.
I think that getting to parity with the better llama.cpp quants would need a properly tuned quant rather than plain RTN (round-to-nearest). Maybe that's in the pipeline.