TensorRT-LLM 1.2.0rc6 Endless Stream

#5
by mbatuhanunverdi - opened

```shell
docker run \
  --gpus '"device=4,5,6,7"' \
  --rm \
  --ipc=host \
  --ulimit memlock=-1:-1 \
  --ulimit stack=67108864 \
  --shm-size=64G \
  -p 8051:8000 \
  -v /raid/huggingface:/root/.cache/huggingface \
  -w /app/tensorrt_llm \
  -e HF_HUB_OFFLINE=1 \
  -e TRANSFORMERS_OFFLINE=1 \
  -e HF_DATASETS_OFFLINE=1 \
  nvcr.io/nvidia/tensorrt-llm/release:1.2.0rc6 \
  trtllm-serve \
  Salyut1/GLM-4.7-NVFP4 \
  --host 0.0.0.0 \
  --port 8000 \
  --backend pytorch \
  --max_batch_size 32 \
  --max_num_tokens 8192 \
  --max_seq_len 128000 \
  --tp_size 4 \
  --kv_cache_free_gpu_memory_fraction 0.9
```
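For reference, this is a minimal client sketch of how I query the server. `trtllm-serve` exposes an OpenAI-compatible `/v1/chat/completions` endpoint; the port (8051, from the `-p 8051:8000` mapping above) and the `"<|endoftext|>"` stop string are assumptions for illustration, and the actual stop tokens should be checked against the model's tokenizer config. Capping `max_tokens` at least bounds the runaway output while debugging:

```python
# Hypothetical client sketch; port 8051 and the stop string are assumptions.
import json
import urllib.request


def build_request(prompt: str, max_tokens: int = 256) -> dict:
    """Build a chat-completions payload with a hard cap on generated tokens."""
    return {
        "model": "Salyut1/GLM-4.7-NVFP4",
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,   # hard upper bound, works around endless output
        "stop": ["<|endoftext|>"],  # assumed stop string; verify in tokenizer_config.json
    }


def send(payload: dict, url: str = "http://localhost:8051/v1/chat/completions") -> dict:
    """POST the payload to the OpenAI-compatible endpoint and return the JSON reply."""
    req = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


if __name__ == "__main__":
    print(json.dumps(build_request("Hello"), indent=2))
```

Even with `max_tokens` set, the model keeps generating until it hits the cap instead of emitting an end-of-sequence token.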

I serve the model with TRT-LLM 1.2.0rc6 using the command above. When I send a request, the response never stops; the model generates endlessly.
Could anyone help me?
