Deployment of Step-3.5-Flash (fp16) via sglang failed.

#17
by zqzq71 - opened

Image: lmsysorg/sglang:dev-pr-18084

GPU: NVIDIA A800(80GB) * 8

start command:

sglang serve --model-path ./Step-3.5-Flash \
  --served-model-name step3p5-flash \
  --tp-size 8 \
  --tool-call-parser step3p5 \
  --reasoning-parser step3p5 \
  --speculative-algorithm EAGLE \
  --speculative-num-steps 3 \
  --speculative-eagle-topk 1 \
  --speculative-num-draft-tokens 4 \
  --enable-multi-layer-eagle \
  --host 0.0.0.0 \
  --port 8000

Error log:
Capturing batches (bs=48 avail_mem=15.05 GB): 0%| | 0/23 [00:00<?, ?it/s]
[2026-02-06 06:00:07 TP0] Scheduler hit an exception: Traceback (most recent call last):
File "/sgl-workspace/sglang/python/sglang/srt/managers/scheduler.py", line 3026, in run_scheduler_process
scheduler = Scheduler(
^^^^^^^^^^
File "/sgl-workspace/sglang/python/sglang/srt/managers/scheduler.py", line 350, in __init__
self.init_model_worker()
File "/sgl-workspace/sglang/python/sglang/srt/managers/scheduler.py", line 539, in init_model_worker
self.init_tp_model_worker()
File "/sgl-workspace/sglang/python/sglang/srt/managers/scheduler.py", line 501, in init_tp_model_worker
self.tp_worker = TpModelWorker(
^^^^^^^^^^^^^^
File "/sgl-workspace/sglang/python/sglang/srt/managers/tp_worker.py", line 242, in __init__
self._init_model_runner()
File "/sgl-workspace/sglang/python/sglang/srt/managers/tp_worker.py", line 325, in _init_model_runner
self._model_runner = ModelRunner(
^^^^^^^^^^^^
File "/sgl-workspace/sglang/python/sglang/srt/model_executor/model_runner.py", line 391, in __init__
self.initialize(min_per_gpu_memory)
File "/sgl-workspace/sglang/python/sglang/srt/model_executor/model_runner.py", line 588, in initialize
self.init_device_graphs()
File "/sgl-workspace/sglang/python/sglang/srt/model_executor/model_runner.py", line 2042, in init_device_graphs
self.graph_runner = graph_runners[self.device](self)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/sgl-workspace/sglang/python/sglang/srt/model_executor/cuda_graph_runner.py", line 366, in __init__
self.capture()
File "/sgl-workspace/sglang/python/sglang/srt/model_executor/cuda_graph_runner.py", line 522, in capture
_capture_one_stream()
File "/sgl-workspace/sglang/python/sglang/srt/model_executor/cuda_graph_runner.py", line 509, in _capture_one_stream
) = self.capture_one_batch_size(bs, forward, stream_idx)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/sgl-workspace/sglang/python/sglang/srt/model_executor/cuda_graph_runner.py", line 685, in capture_one_batch_size
attn_backend.init_forward_metadata_capture_cuda_graph(
File "/sgl-workspace/sglang/python/sglang/srt/layers/attention/flashinfer_backend.py", line 602, in init_forward_metadata_capture_cuda_graph
self.indices_updater_prefill.update(
File "/sgl-workspace/sglang/python/sglang/srt/layers/attention/flashinfer_backend.py", line 1273, in update_sliding_window
torch.tensor(self.sliding_window_size) + seq_lens - prefix_lens,
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~
TypeError: unsupported operand type(s) for -: 'Tensor' and 'NoneType'
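The failing operation can be reproduced in isolation. This is a minimal standalone sketch (the tensor values are made up for illustration) showing that subtracting `None` from a `torch.Tensor` raises exactly the `TypeError` in the log above:

```python
import torch

# Stand-ins for the values in flashinfer_backend.py's update_sliding_window:
sliding_window_size = 128
seq_lens = torch.tensor([8, 16])
prefix_lens = None  # what the CUDA-graph capture path passes in

try:
    # Mirrors: torch.tensor(self.sliding_window_size) + seq_lens - prefix_lens
    _ = torch.tensor(sliding_window_size) + seq_lens - prefix_lens
except TypeError as e:
    print(e)  # unsupported operand type(s) for -: 'Tensor' and 'NoneType'
```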

Did I miss anything? Or do you have any suggestions?

It looks like prefix_lens is None, because of /sgl-workspace/sglang/python/sglang/srt/layers/attention/flashinfer_backend.py, line 604:

self.indices_updater_prefill.update(
                req_pool_indices,
                seq_lens,
                seq_lens.cpu(),  # may add a little overhead in capture stage
                seq_lens_sum,
                prefix_lens=None, # here set to None
                prefill_wrappers=prefill_wrappers,
                use_ragged=False,
                encoder_lens=encoder_lens,
                spec_info=spec_info,
            )

I have the same problem. Have you found a solution?

StepFun org

I have the same problem. Have you found a solution?

What GPU are you using?

I have the same problem. Have you found a solution?

What GPU are you using?

A800
