bullpoint/GLM-4.6-AWQ · endless response

endless response

by ramidahbash - opened Dec 15, 2025

Dec 15, 2025

I have tried this AWQ version.
I deployed it using vllm 0.10.2 and 4 H100 GPUs and the response never ends, it looks like he in a conversation with itself so the response is the a question to himself and he answer it in a never ending loop.
Setting the temperature to 1.0 doesn't help.

bullpoint

Owner Dec 15, 2025

Does every prompt result in endless looping?

ramidahbash

Dec 15, 2025

yes

bullpoint

Owner Dec 15, 2025

Could you give me a sample prompt I could try to see what happens on my machine? It does not happen for any prompt I give.

ramidahbash

Dec 15, 2025

for every prompt I give, even "Hello how are you?", it happens.
is it deployed locally on your machine with vLLM?

bullpoint

Owner Dec 15, 2025

Yes -- with 4xRTX PRO 6000 -- so blackwell instead of hopper.

Here's my docker-compose.yaml using vllm's nightly. I just pulled it today, so maybe you could give the same a try and see if any errors when it starts up?

services:
  inference:
    image: vllm/vllm-openai:nightly
    container_name: inference
    privileged: true
    userns_mode: host
    ipc: host
    shm_size: "32gb"
    ulimits:
      memlock: -1
      stack: 67108864
    ports:
      - "0.0.0.0:8000:8000"
    deploy:
      resources:
        limits:
          memory: 32g
          cpus: '32'
        reservations:
          memory: 32g
          cpus: '32'
          devices:
            - driver: nvidia
              count: -1
              capabilities: [gpu]
    environment:
      - NVIDIA_VISIBLE_DEVICES=all
      - NVIDIA_DRIVER_CAPABILITIES=compute,utility
      - CUDA_LAUNCH_BLOCKING=0
      - NCCL_IB_DISABLE=1
      - NCCL_NVLS_ENABLE=0
      - NCCL_P2P_DISABLE=0
      - NCCL_SHM_DISABLE=0
      - VLLM_USE_V1=0
      - OMP_NUM_THREADS=8
      - TORCH_FLOAT32_MATMUL_PRECISION=high
    volumes:
      - /path/to/GLM-4.6-AWQ:/models/GLM-4.6-AWQ:ro
    entrypoint: ["/bin/bash", "-c"]
    command:
      - |
        # Run vLLM serve - TP=4, NO expert parallelism (for single-request speed)
        exec vllm serve /models/GLM-4.6-AWQ \
          --tensor-parallel-size 4 \
          --attention-backend FLASHINFER \
          --max-num-batched-tokens 16384 \
          --max-num-seqs 1 \
          --served-model-name GLM-4.6-AWQ \
          --enable-auto-tool-choice \
          --tool-call-parser glm45 \
          --reasoning-parser glm45 \
          --host 0.0.0.0 \
          --port 8000

ramidahbash

Dec 16, 2025

I configured my vllm the same as you, and it still doing this.

what version of vllm are you using? i tried 0.11.2 as well.

bullpoint

Owner Dec 16, 2025

I'm using vllm nightly, but v0.12 also works. Have you tried some other GLM 4.6 quants? https://huggingface.co/cyankiwi/GLM-4.6-AWQ-4bit is a good one.

Upload images, audio, and videos by dragging in the text input, pasting, or clicking here.

Tap or paste here to upload images

· Sign up or log in to comment