Context Management Reproducibility?

#27

by pandemo - opened Feb 16

Discussion

pandemo

Feb 16

Hi StepFun team, thank you so much for open-sourcing such impressive models and sharing your research!

Just a quick question on the discard-all strategy used for BrowseComp:

When the context length exceeds the threshold and the agent "discards its entire context" (quoted from the Step 3.5 Flash paper), does that mean everything accumulated (tool calls, reasoning, observations, etc.) is removed except the system prompt and the initial user message/question?

Also, is the agent framework used for BrowseComp/HLE evaluation the same as (or similar to) the one in Step-DeepResearch?

Thanks again for your amazing work! 🙏

ccchen1006

StepFun org Feb 27

To answer your questions:

Discard-all Strategy: You are correct. When the context length hits the threshold, the agent clears all accumulated tool calls, reasoning steps, and observations. It effectively resets the memory, retaining only the system prompt and the initial user message to keep the core objective in focus while freeing up space for new exploration.
Agent Framework: The framework used for BrowseComp/HLE is indeed very similar to the Step-DeepResearch architecture. Both rely on a core ReAct loop managed by a dedicated Context Manager. The primary difference lies in the toolsets: BrowseComp utilizes a specialized suite of internal optimization tools tailored specifically for complex web browsing and high-level reasoning tasks.
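
For anyone trying to reproduce this, the discard-all behavior described above could be sketched roughly as follows. This is only an illustration under stated assumptions, not StepFun's actual Context Manager: the class name `DiscardAllContext`, the `max_tokens` threshold, the whitespace-based `count_tokens` helper, and the message schema are all hypothetical.

```python
Message = dict[str, str]


def count_tokens(messages: list[Message]) -> int:
    """Hypothetical stand-in for a real tokenizer: counts whitespace-separated words."""
    return sum(len(m["content"].split()) for m in messages)


class DiscardAllContext:
    """Keeps only the system prompt and the initial question after a reset."""

    def __init__(self, system_prompt: str, question: str, max_tokens: int) -> None:
        self.system_prompt = system_prompt
        self.question = question
        self.max_tokens = max_tokens  # assumed threshold; the real value is not given in the thread
        self.resets = 0
        self.messages: list[Message] = self._pinned()

    def _pinned(self) -> list[Message]:
        # The two messages that survive every reset.
        return [
            {"role": "system", "content": self.system_prompt},
            {"role": "user", "content": self.question},
        ]

    def append(self, role: str, content: str) -> None:
        # Record a tool call, reasoning step, or observation from the ReAct loop.
        self.messages.append({"role": role, "content": content})
        if count_tokens(self.messages) > self.max_tokens:
            # Discard-all: drop everything accumulated so far, including this message.
            self.messages = self._pinned()
            self.resets += 1


ctx = DiscardAllContext("You are a browsing agent.", "Who wrote X?", max_tokens=30)
ctx.append("assistant", "thought " * 10)    # still under the threshold
ctx.append("tool", "observation " * 10)     # still under the threshold
ctx.append("assistant", "more " * 15)       # crosses the threshold and triggers a reset
print(len(ctx.messages), ctx.resets)
```

In this sketch the message that pushes the context over the threshold is discarded along with the rest. Whether the real framework drops or keeps that last step is not stated in the reply above.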

pandemo

Mar 5

Thank you for the clarification, @ccchen1006. This is really helpful for community reproducibility 🙏.

Out of curiosity, are you able to share any more details about the agent framework used, in particular concerning the "specialized suite of internal optimization tools" you mentioned? Or are there any plans to open-source the agent framework in the future?
