TensorRT-LLM 1.2.0rc6 Endless Stream

#5
by mbatuhanunverdi - opened

```shell
docker run \
  --gpus '"device=4,5,6,7"' \
  --rm \
  --ipc=host \
  --ulimit memlock=-1:-1 \
  --ulimit stack=67108864 \
  --shm-size=64G \
  -p 8051:8000 \
  -v /raid/huggingface:/root/.cache/huggingface \
  -w /app/tensorrt_llm \
  -e HF_HUB_OFFLINE=1 \
  -e TRANSFORMERS_OFFLINE=1 \
  -e HF_DATASETS_OFFLINE=1 \
  nvcr.io/nvidia/tensorrt-llm/release:1.2.0rc6 \
  trtllm-serve \
  Salyut1/GLM-4.7-NVFP4 \
  --host 0.0.0.0 \
  --port 8000 \
  --backend pytorch \
  --max_batch_size 32 \
  --max_num_tokens 8192 \
  --max_seq_len 128000 \
  --tp_size 4 \
  --kv_cache_free_gpu_memory_fraction 0.9
```
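For reference, this is a minimal client sketch of how I query the server. `trtllm-serve` exposes an OpenAI-compatible `/v1/chat/completions` endpoint; the port (8051, from the `-p 8051:8000` mapping above) and the `"<|endoftext|>"` stop string are assumptions for illustration, and the actual stop tokens should be checked against the model's tokenizer config. Capping `max_tokens` at least bounds the runaway output while debugging:

```python
# Hypothetical client sketch; port 8051 and the stop string are assumptions.
import json
import urllib.request


def build_request(prompt: str, max_tokens: int = 256) -> dict:
    """Build a chat-completions payload with a hard cap on generated tokens."""
    return {
        "model": "Salyut1/GLM-4.7-NVFP4",
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,   # hard upper bound, works around endless output
        "stop": ["<|endoftext|>"],  # assumed stop string; verify in tokenizer_config.json
    }


def send(payload: dict, url: str = "http://localhost:8051/v1/chat/completions") -> dict:
    """POST the payload to the OpenAI-compatible endpoint and return the JSON reply."""
    req = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


if __name__ == "__main__":
    print(json.dumps(build_request("Hello"), indent=2))
```

Even with `max_tokens` set, the model keeps generating until it hits the cap instead of emitting an end-of-sequence token.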

I serve the model with TRT-LLM 1.2.0rc6 using the command above. When I send a request, the response never stops; the model generates endlessly.
Could anyone help me?
