Deployment of Step-3.5-Flash (fp16) via sglang failed.
Image: lmsysorg/sglang:dev-pr-18084
GPU: NVIDIA A800(80GB) * 8
Start command:
sglang serve --model-path ./Step-3.5-Flash \
--served-model-name step3p5-flash \
--tp-size 8 \
--tool-call-parser step3p5 \
--reasoning-parser step3p5 \
--speculative-algorithm EAGLE \
--speculative-num-steps 3 \
--speculative-eagle-topk 1 \
--speculative-num-draft-tokens 4 \
--enable-multi-layer-eagle \
--host 0.0.0.0 \
--port 8000
Error log:
Capturing batches (bs=48 avail_mem=15.05 GB): 0%| | 0/23 [00:00<?, ?it/s]
[2026-02-06 06:00:07 TP0] Scheduler hit an exception: Traceback (most recent call last):
File "/sgl-workspace/sglang/python/sglang/srt/managers/scheduler.py", line 3026, in run_scheduler_process
scheduler = Scheduler(
^^^^^^^^^^
File "/sgl-workspace/sglang/python/sglang/srt/managers/scheduler.py", line 350, in __init__
self.init_model_worker()
File "/sgl-workspace/sglang/python/sglang/srt/managers/scheduler.py", line 539, in init_model_worker
self.init_tp_model_worker()
File "/sgl-workspace/sglang/python/sglang/srt/managers/scheduler.py", line 501, in init_tp_model_worker
self.tp_worker = TpModelWorker(
^^^^^^^^^^^^^^
File "/sgl-workspace/sglang/python/sglang/srt/managers/tp_worker.py", line 242, in __init__
self._init_model_runner()
File "/sgl-workspace/sglang/python/sglang/srt/managers/tp_worker.py", line 325, in _init_model_runner
self._model_runner = ModelRunner(
^^^^^^^^^^^^
File "/sgl-workspace/sglang/python/sglang/srt/model_executor/model_runner.py", line 391, in __init__
self.initialize(min_per_gpu_memory)
File "/sgl-workspace/sglang/python/sglang/srt/model_executor/model_runner.py", line 588, in initialize
self.init_device_graphs()
File "/sgl-workspace/sglang/python/sglang/srt/model_executor/model_runner.py", line 2042, in init_device_graphs
self.graph_runner = graph_runners[self.device](self)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/sgl-workspace/sglang/python/sglang/srt/model_executor/cuda_graph_runner.py", line 366, in __init__
self.capture()
File "/sgl-workspace/sglang/python/sglang/srt/model_executor/cuda_graph_runner.py", line 522, in capture
_capture_one_stream()
File "/sgl-workspace/sglang/python/sglang/srt/model_executor/cuda_graph_runner.py", line 509, in _capture_one_stream
) = self.capture_one_batch_size(bs, forward, stream_idx)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/sgl-workspace/sglang/python/sglang/srt/model_executor/cuda_graph_runner.py", line 685, in capture_one_batch_size
attn_backend.init_forward_metadata_capture_cuda_graph(
File "/sgl-workspace/sglang/python/sglang/srt/layers/attention/flashinfer_backend.py", line 602, in init_forward_metadata_capture_cuda_graph
self.indices_updater_prefill.update(
File "/sgl-workspace/sglang/python/sglang/srt/layers/attention/flashinfer_backend.py", line 1273, in update_sliding_window
torch.tensor(self.sliding_window_size) + seq_lens - prefix_lens,
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~
TypeError: unsupported operand type(s) for -: 'Tensor' and 'NoneType'
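For what it's worth, the failure is an ordinary Python type error rather than anything CUDA-specific: the capture path hands `prefix_lens=None` into tensor arithmetic. A dependency-free sketch of the same failure mode (plain ints standing in for tensors; `window_len` is just an illustrative name, not an sglang function):

```python
def window_len(sliding_window_size, seq_len, prefix_len):
    # Mirrors the failing expression in update_sliding_window:
    #   torch.tensor(self.sliding_window_size) + seq_lens - prefix_lens
    return sliding_window_size + seq_len - prefix_len

try:
    window_len(4096, 8, None)  # prefix_len=None, as in the capture path
except TypeError as e:
    # unsupported operand type(s) for -: 'int' and 'NoneType'
    print("TypeError:", e)
```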
Did I miss anything, or do you have any suggestions?
It looks like prefix_lens is None because of /sgl-workspace/sglang/python/sglang/srt/layers/attention/flashinfer_backend.py, line 604:
self.indices_updater_prefill.update(
req_pool_indices,
seq_lens,
seq_lens.cpu(), # may add a little overhead in capture stage
seq_lens_sum,
prefix_lens=None, # prefix_lens is hard-coded to None here
prefill_wrappers=prefill_wrappers,
use_ragged=False,
encoder_lens=encoder_lens,
spec_info=spec_info,
)
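A possible interim guard, assuming the capture path really does mean "no prefix" here. This is my own sketch, not a merged sglang patch, and `safe_prefix_lens` is a name I made up: normalize a `None` prefix_lens to zeros before the subtraction in `update_sliding_window`:

```python
import torch

def safe_prefix_lens(prefix_lens, seq_lens):
    # Hypothetical helper: when the CUDA graph capture path passes
    # prefix_lens=None, substitute an all-zero tensor of the same shape so
    # that `sliding_window + seq_lens - prefix_lens` stays well defined.
    if prefix_lens is None:
        return torch.zeros_like(seq_lens)
    return prefix_lens
```

With that guard, the failing expression reduces to `sliding_window + seq_lens` during capture instead of raising. Whether zero prefixes are actually the right semantics for the speculative (EAGLE) capture path is something the sglang maintainers would need to confirm.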
I have the same problem. Have you found a solution?
What GPU are you using?
A800