QuantTrio/Qwen3.5-397B-A17B-AWQ response is !

#5
by duyuting - opened

After the model is launched, every output it returns is just exclamation marks (!). What could be causing this? The a10-awq model works normally in the same environment.

Device: A800

export CONTEXT_LENGTH=32768
export CUDA_VISIBLE_DEVICES="0,1,2,3"
vllm serve \
  //models/Qwen3.5/Qwen3.5-397B-A17B-AWQ \
  --served-model-name Qwen3.5-397B-A17B-AWQ \
  --enable-expert-parallel \
  --swap-space 16 \
  --max-num-seqs 32 \
  --max-model-len $CONTEXT_LENGTH \
  --gpu-memory-utilization 0.9 \
  --tensor-parallel-size 4 \
  --reasoning-parser qwen3 \
  --mm-processor-cache-type shm \
  --mm-encoder-tp-mode data \
  --enable-prefix-caching \
  --host 0.0.0.0 \
  --port 8086
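Once the server above is up, a quick smoke test can tell you whether you are hitting the degenerate all-`!` output described in this thread. This is a minimal sketch, not an official diagnostic: the base URL and model name are taken from the command above, the endpoint is the standard OpenAI-compatible `/v1/chat/completions` route that `vllm serve` exposes, and `is_degenerate` is a hypothetical helper based on the symptom reported here.

```python
import json
import urllib.request


def is_degenerate(text: str) -> bool:
    """Heuristic from this thread: the broken server returns only '!' characters."""
    stripped = text.strip()
    return bool(stripped) and set(stripped) <= {"!"}


def smoke_test(base_url: str = "http://localhost:8086") -> bool:
    """Send one small request and report whether the reply looks sane."""
    payload = {
        "model": "Qwen3.5-397B-A17B-AWQ",
        "messages": [{"role": "user", "content": "Say hello."}],
        "max_tokens": 16,
    }
    req = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        content = json.loads(resp.read())["choices"][0]["message"]["content"]
    return not is_degenerate(content)
```

If `smoke_test()` returns `False`, you are seeing the same failure mode as the original poster.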

QuantTrio org

https://huggingface.co/QuantTrio/Qwen3.5-397B-A17B-AWQ/discussions/3

Double-check that you're not using CUDA 12.8/13.0.

I'm using CUDA 13.0 and am getting "!!!!!". I cleared the cache and still get nothing but "!!!!!!!!".

QuantTrio org

> I'm using CUDA 13.0 and am getting "!!!!!". I cleared the cache and still get nothing but "!!!!!!!!".

did you use the docker image same as others?

I was using vllm/vllm-openai:cu130-nightly

> https://huggingface.co/QuantTrio/Qwen3.5-397B-A17B-AWQ/discussions/3
>
> Double-check that you're not using CUDA 12.8/13.0.

My torch CUDA version is 12.8:

>>> torch.version.cuda
'12.8'

QuantTrio org
edited Feb 28

Please download / replace with the new config.json file from this repo, and have a try one more time. Let me know if this can help resolve the issue.

> Please download / replace with the new config.json file from this repo, and have a try one more time. Let me know if this can help resolve the issue.

The new config.json works!

> Please download / replace with the new config.json file from this repo, and have a try one more time. Let me know if this can help resolve the issue.

The model works as-is, without updating anything, on my 8xA6000 server. But on my Turing server (8x Quadro RTX 8000), which doesn't support BF16, it still gives me "!!!!!!" even with the new config.json.

My command:

VLLM_USE_ATOMIC_ADD=1 VLLM_ALLOW_LONG_MAX_MODEL_LEN=1 vllm serve Qwen3.5-397B-A17B-AWQ/ \
  --reasoning-parser qwen3 \
  --enable-prefix-caching \
  --enable-auto-tool-choice \
  --tool-call-parser qwen3_coder \
  --attention-backend FLASHINFER \
  -tp 8 \
  --enable-expert-parallel \
  --gpu-memory-utilization 0.9 \
  --trust-remote-code \
  --limit-mm-per-prompt '{"video":0}' \
  --speculative-config '{"method":"qwen3_next_mtp","num_speculative_tokens":1}' \
  --mm-processor-cache-type shm

NVCC reports CUDA 13.2.
Torch reports 2.10.0+cu128.

Honestly out of ideas.
