My recipe for deployment
Using vLLM, you need to manually install transformers>=5.3 to support the new RoPE embedding.
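For example, the upgrade can be done with pip (version pin taken from the note above; adjust to whatever release actually ships the new RoPE embedding):

```shell
# Upgrade transformers in the same environment as vLLM
pip install --upgrade "transformers>=5.3"
```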
My hardware: RTX 3090 + RTX 4090
Deployment script:
# Enable the memory profiler's CUDA-graphs estimate (v0.19 functionality)
export VLLM_MEMORY_PROFILER_ESTIMATE_CUDAGRAPHS=1
export MODEL_NAME="mconcat/Qwopus3.5-27B-v3-FP8-Dynamic"
# Start vLLM with reduced swap space
vllm serve $MODEL_NAME \
--served-model-name vllm/Qwen3.5-27B \
--trust-remote-code \
--tensor-parallel-size 2 \
--max-model-len 219520 \
--gpu-memory-utilization 0.92 \
--enable-auto-tool-choice \
--enable-chunked-prefill \
--enable-prefix-caching \
--max-num-batched-tokens 4096 \
--max-num-seqs 4 \
--kv-cache-dtype fp8 \
--tool-call-parser hermes \
--reasoning-parser qwen3 \
--no-use-tqdm-on-load \
--host 0.0.0.0 \
--port 8000 \
--language-model-only
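Once the server is up, you can smoke-test it against the OpenAI-compatible endpoint vLLM exposes (a minimal sketch, assuming the default host/port from the script above; note that requests must use the --served-model-name value, not the Hugging Face repo name):

```shell
# Query vLLM's OpenAI-compatible chat completions endpoint.
# "vllm/Qwen3.5-27B" matches the --served-model-name flag above.
curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "vllm/Qwen3.5-27B",
    "messages": [{"role": "user", "content": "Say hello in one word."}],
    "max_tokens": 16
  }'
```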
I feel the stability is a bit improved: the original 27B failed and malfunctioned on tool calling, but this one has been fine so far. However, I haven't tried long-context conversation / agentic coding yet. Anyone have data on it?
Thanks for sharing!
Quick question – why use --language-model-only? This model supports vision (image-text-to-text).
I use Claude Code. After running a few rounds, it would suddenly stop and then say "Continue" before resuming the run.
> Quick question – why use --language-model-only? This model supports vision (image-text-to-text).
Simply because I don't need the image part, and skipping it saves some RAM.
> I use Claude Code. After running a few rounds, it would suddenly stop and then say "Continue" before resuming the run.
Do you mean Claude Code automatically types "Continue" in the text box and lets it run, or do you have to manually type "Continue" to make the model continue?
