Deployment of Step-3.5-Flash (fp16) via sglang failed.
Image: lmsysorg/sglang:dev-pr-18084
GPU: NVIDIA A800(80GB) * 8
Start command:
sglang serve --model-path ./Step-3.5-Flash \
--served-model-name step3p5-flash \
--tp-size 8 \
--tool-call-parser step3p5 \
--reasoning-parser step3p5 \
--speculative-algorithm EAGLE \
--speculative-num-steps 3 \
--speculative-eagle-topk 1 \
--speculative-num-draft-tokens 4 \
--enable-multi-layer-eagle \
--host 0.0.0.0 \
--port 8000
Error log:
Capturing batches (bs=48 avail_mem=15.05 GB): 0%| | 0/23 [00:00<?, ?it/s]
[2026-02-06 06:00:07 TP0] Scheduler hit an exception: Traceback (most recent call last):
File "/sgl-workspace/sglang/python/sglang/srt/managers/scheduler.py", line 3026, in run_scheduler_process
scheduler = Scheduler(
^^^^^^^^^^
File "/sgl-workspace/sglang/python/sglang/srt/managers/scheduler.py", line 350, in __init__
self.init_model_worker()
File "/sgl-workspace/sglang/python/sglang/srt/managers/scheduler.py", line 539, in init_model_worker
self.init_tp_model_worker()
File "/sgl-workspace/sglang/python/sglang/srt/managers/scheduler.py", line 501, in init_tp_model_worker
self.tp_worker = TpModelWorker(
^^^^^^^^^^^^^^
File "/sgl-workspace/sglang/python/sglang/srt/managers/tp_worker.py", line 242, in __init__
self._init_model_runner()
File "/sgl-workspace/sglang/python/sglang/srt/managers/tp_worker.py", line 325, in _init_model_runner
self._model_runner = ModelRunner(
^^^^^^^^^^^^
File "/sgl-workspace/sglang/python/sglang/srt/model_executor/model_runner.py", line 391, in __init__
self.initialize(min_per_gpu_memory)
File "/sgl-workspace/sglang/python/sglang/srt/model_executor/model_runner.py", line 588, in initialize
self.init_device_graphs()
File "/sgl-workspace/sglang/python/sglang/srt/model_executor/model_runner.py", line 2042, in init_device_graphs
self.graph_runner = graph_runners[self.device](self)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/sgl-workspace/sglang/python/sglang/srt/model_executor/cuda_graph_runner.py", line 366, in __init__
self.capture()
File "/sgl-workspace/sglang/python/sglang/srt/model_executor/cuda_graph_runner.py", line 522, in capture
_capture_one_stream()
File "/sgl-workspace/sglang/python/sglang/srt/model_executor/cuda_graph_runner.py", line 509, in _capture_one_stream
) = self.capture_one_batch_size(bs, forward, stream_idx)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/sgl-workspace/sglang/python/sglang/srt/model_executor/cuda_graph_runner.py", line 685, in capture_one_batch_size
attn_backend.init_forward_metadata_capture_cuda_graph(
File "/sgl-workspace/sglang/python/sglang/srt/layers/attention/flashinfer_backend.py", line 602, in init_forward_metadata_capture_cuda_graph
self.indices_updater_prefill.update(
File "/sgl-workspace/sglang/python/sglang/srt/layers/attention/flashinfer_backend.py", line 1273, in update_sliding_window
torch.tensor(self.sliding_window_size) + seq_lens - prefix_lens,
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~
TypeError: unsupported operand type(s) for -: 'Tensor' and 'NoneType'
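For what it's worth, the failure is an ordinary Python type error rather than anything CUDA-specific: the capture path hands `prefix_lens=None` into tensor arithmetic. A dependency-free sketch of the same failure mode (plain ints standing in for tensors; `window_len` is just an illustrative name, not an sglang function):

```python
def window_len(sliding_window_size, seq_len, prefix_len):
    # Mirrors the failing expression in update_sliding_window:
    #   torch.tensor(self.sliding_window_size) + seq_lens - prefix_lens
    return sliding_window_size + seq_len - prefix_len

try:
    window_len(4096, 8, None)  # prefix_len=None, as in the capture path
except TypeError as e:
    # unsupported operand type(s) for -: 'int' and 'NoneType'
    print("TypeError:", e)
```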
Did I miss anything, or do you have any suggestions?
It looks like prefix_lens is None because of /sgl-workspace/sglang/python/sglang/srt/layers/attention/flashinfer_backend.py, line 604:
self.indices_updater_prefill.update(
req_pool_indices,
seq_lens,
seq_lens.cpu(), # may add a little overhead in capture stage
seq_lens_sum,
prefix_lens=None, # prefix_lens is hard-coded to None here
prefill_wrappers=prefill_wrappers,
use_ragged=False,
encoder_lens=encoder_lens,
spec_info=spec_info,
)
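A possible interim guard, assuming the capture path really does mean "no prefix" here. This is my own sketch, not a merged sglang patch, and `safe_prefix_lens` is a name I made up: normalize a `None` prefix_lens to zeros before the subtraction in `update_sliding_window`:

```python
import torch

def safe_prefix_lens(prefix_lens, seq_lens):
    # Hypothetical helper: when the CUDA graph capture path passes
    # prefix_lens=None, substitute an all-zero tensor of the same shape so
    # that `sliding_window + seq_lens - prefix_lens` stays well defined.
    if prefix_lens is None:
        return torch.zeros_like(seq_lens)
    return prefix_lens
```

With that guard, the failing expression reduces to `sliding_window + seq_lens` during capture instead of raising. Whether zero prefixes are actually the right semantics for the speculative (EAGLE) capture path is something the sglang maintainers would need to confirm.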
I have the same problem. Have you found a solution?
What GPU are you using?
A800