Crash on first request on RTX Pro 6000 x8

#3
by koushd - opened

@lukealonso
Can you share your environment details? Which sglang are you using, a Docker image or a git checkout? I'm getting this crash immediately with either. CUDA 12.9.

[2026-02-21 06:36:32] INFO: 192.168.2.124:50021 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2026-02-21 06:36:32 TP0] Prefill batch, #new-seq: 1, #new-token: 64, #cached-token: 0, token usage: 0.00, #running-req: 0, #queue-req: 0, input throughput (token/s): 4.86, cuda graph: False
/pytorch/aten/src/ATen/native/cuda/TensorCompare.cu:112: _assert_async_cuda_kernel: block: [0,0,0], thread: [0,0,0] Assertion probability tensor contains either inf, nan or element < 0 failed.
/pytorch/aten/src/ATen/native/cuda/TensorCompare.cu:112: _assert_async_cuda_kernel: block: [0,0,0], thread: [0,0,0] Assertion probability tensor contains either inf, nan or element < 0 failed.
/pytorch/aten/src/ATen/native/cuda/TensorCompare.cu:112: _assert_async_cuda_kernel: block: [0,0,0], thread: [0,0,0] Assertion probability tensor contains either inf, nan or element < 0 failed.
/pytorch/aten/src/ATen/native/cuda/TensorCompare.cu:112: _assert_async_cuda_kernel: block: [0,0,0], thread: [0,0,0] Assertion probability tensor contains either inf, nan or element < 0 failed.
/pytorch/aten/src/ATen/native/cuda/TensorCompare.cu:112: _assert_async_cuda_kernel: block: [0,0,0], thread: [0,0,0] Assertion probability tensor contains either inf, nan or element < 0 failed.
/pytorch/aten/src/ATen/native/cuda/TensorCompare.cu:112: _assert_async_cuda_kernel: block: [0,0,0], thread: [0,0,0] Assertion probability tensor contains either inf, nan or element < 0 failed.
[2026-02-21 06:36:32 TP4] Scheduler hit an exception: Traceback (most recent call last):
File "/mnt/storage2/venv/sglang/python/sglang/srt/managers/scheduler.py", line 3169, in run_scheduler_process
scheduler.event_loop_overlap()
File "/mnt/storage2/venv/.venv/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 120, in decorate_context
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/mnt/storage2/venv/sglang/python/sglang/srt/managers/scheduler.py", line 1173, in event_loop_overlap
pop_and_process()
File "/mnt/storage2/venv/sglang/python/sglang/srt/managers/scheduler.py", line 1144, in pop_and_process
self.process_batch_result(tmp_batch, tmp_result)
File "/mnt/storage2/venv/sglang/python/sglang/srt/managers/scheduler.py", line 2453, in process_batch_result
self.process_batch_result_decode(batch, result)
File "/mnt/storage2/venv/sglang/python/sglang/srt/managers/scheduler_output_processor_mixin.py", line 423, in process_batch_result_decode
result.copy_done.synchronize()
File "/mnt/storage2/venv/.venv/lib/python3.12/site-packages/torch/cuda/streams.py", line 231, in synchronize
super().synchronize()
torch.AcceleratorError: CUDA error: device-side assert triggered
Search for `cudaErrorAssert` in https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__TYPES.html for more information. CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect. For debugging consider passing `CUDA_LAUNCH_BLOCKING=1`. Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

I'm seeing the same thing here as well.

I think this is an sglang bug. I have a workaround patch I use locally. You can also set temperature=0 as another (suboptimal) workaround.
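To illustrate why temperature=0 sidesteps the assert: the error fires when the probability tensor passed to multinomial sampling contains inf, NaN, or negative values, and greedy decoding never builds that tensor at all. Below is a minimal pure-Python sketch of this sampling logic; `sample_next_token` is a hypothetical helper, not sglang's actual sampler code.

```python
import math
import random

def sample_next_token(logits, temperature):
    """Hypothetical sketch of a token-sampling path, not sglang's real code."""
    if temperature == 0:
        # Greedy decoding: no probability tensor is ever built, so the
        # "probability tensor contains inf, nan or element < 0" assert
        # can never fire on this path.
        return max(range(len(logits)), key=lambda i: logits[i])
    # Temperature-scaled softmax over the logits.
    probs = [math.exp(l / temperature) for l in logits]
    total = sum(probs)
    probs = [p / total for p in probs]
    # Mirrors the device-side check in ATen's TensorCompare.cu: the
    # probability tensor must be finite and non-negative before sampling.
    if any(not math.isfinite(p) or p < 0 for p in probs):
        raise ValueError(
            "probability tensor contains either inf, nan or element < 0"
        )
    return random.choices(range(len(probs)), weights=probs)[0]
```

In the real crash, bad values (likely NaNs from upstream numerics) reach the sampler's probability tensor, so the check inside `torch.multinomial` asserts on-device; greedy decoding merely avoids that code path rather than fixing the corruption.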

Hey, thanks for looking into this @lukealonso! I'd love to know what your local patch actually touches if you're able to share it; even a rough description would help.

If it is the same problem I had with 6x RTX Pro 6000, disabling KV cache quantization helps (--kv-cache-dtype bf16).
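For reference, passing that flag at launch might look like the following. This is a sketch, not a verified fix: the model path and --tp value are placeholders, and the accepted --kv-cache-dtype values depend on your sglang version.

```shell
# Hypothetical launch command; substitute your own model path and TP size.
python -m sglang.launch_server \
  --model-path /path/to/model \
  --tp 8 \
  --kv-cache-dtype bf16
```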

I recently opened an issue about this same problem: https://github.com/sgl-project/sglang/issues/18954
