QuantTrio/Qwen3.5-397B-A17B-AWQ response is !

#5
by duyuting - opened

After the model is launched, every output it returns is just exclamation marks (!). What could be causing this? The a10-awq model works normally in the same environment.

Device: A800

export CONTEXT_LENGTH=32768
export CUDA_VISIBLE_DEVICES="0,1,2,3"
vllm serve \
  //models/Qwen3.5/Qwen3.5-397B-A17B-AWQ \
  --served-model-name Qwen3.5-397B-A17B-AWQ \
  --enable-expert-parallel \
  --swap-space 16 \
  --max-num-seqs 32 \
  --max-model-len $CONTEXT_LENGTH \
  --gpu-memory-utilization 0.9 \
  --tensor-parallel-size 4 \
  --reasoning-parser qwen3 \
  --mm-processor-cache-type shm \
  --mm-encoder-tp-mode data \
  --enable-prefix-caching \
  --host 0.0.0.0 \
  --port 8086
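Once the server above is up, a quick smoke test can tell you whether you are hitting the degenerate all-`!` output described in this thread. This is a minimal sketch, not an official diagnostic: the base URL and model name are taken from the command above, the endpoint is the standard OpenAI-compatible `/v1/chat/completions` route that `vllm serve` exposes, and `is_degenerate` is a hypothetical helper based on the symptom reported here.

```python
import json
import urllib.request


def is_degenerate(text: str) -> bool:
    """Heuristic from this thread: the broken server returns only '!' characters."""
    stripped = text.strip()
    return bool(stripped) and set(stripped) <= {"!"}


def smoke_test(base_url: str = "http://localhost:8086") -> bool:
    """Send one small request and report whether the reply looks sane."""
    payload = {
        "model": "Qwen3.5-397B-A17B-AWQ",
        "messages": [{"role": "user", "content": "Say hello."}],
        "max_tokens": 16,
    }
    req = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        content = json.loads(resp.read())["choices"][0]["message"]["content"]
    return not is_degenerate(content)
```

If `smoke_test()` returns `False`, you are seeing the same failure mode as the original poster.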

QuantTrio org

https://huggingface.co/QuantTrio/Qwen3.5-397B-A17B-AWQ/discussions/3

Double-check that you're not using CUDA 12.8/13.0.

I'm using CUDA 13.0 and am getting "!!!!!". I cleared the cache and still get nothing but "!!!!!!!!".

QuantTrio org

> I'm using CUDA 13.0 and am getting "!!!!!". I cleared the cache and still get nothing but "!!!!!!!!".

did you use the docker image same as others?

I was using vllm/vllm-openai:cu130-nightly

> https://huggingface.co/QuantTrio/Qwen3.5-397B-A17B-AWQ/discussions/3
>
> Double-check that you're not using CUDA 12.8/13.0.

My torch CUDA version is 12.8:

>>> torch.version.cuda
'12.8'

QuantTrio org
edited Feb 28

Please download / replace with the new config.json file from this repo, and have a try one more time. Let me know if this can help resolve the issue.

> Please download / replace with the new config.json file from this repo, and have a try one more time. Let me know if this can help resolve the issue.

The new config.json works!

> Please download / replace with the new config.json file from this repo, and have a try one more time. Let me know if this can help resolve the issue.

The model works as-is, without updating anything, on my 8xA6000 server. But on my Turing server (8x Quadro RTX 8000), which doesn't support BF16, it still gives me "!!!!!!" even with the new config.json.

My command:

VLLM_USE_ATOMIC_ADD=1 VLLM_ALLOW_LONG_MAX_MODEL_LEN=1 vllm serve Qwen3.5-397B-A17B-AWQ/ \
  --reasoning-parser qwen3 \
  --enable-prefix-caching \
  --enable-auto-tool-choice \
  --tool-call-parser qwen3_coder \
  --attention-backend FLASHINFER \
  -tp 8 \
  --enable-expert-parallel \
  --gpu-memory-utilization 0.9 \
  --trust-remote-code \
  --limit-mm-per-prompt '{"video":0}' \
  --speculative-config '{"method":"qwen3_next_mtp","num_speculative_tokens":1}' \
  --mm-processor-cache-type shm

NVCC reports CUDA 13.2.
Torch reports 2.10.0+cu128.

Honestly out of ideas.
