mh3467 committed on
Commit
34c69a7
·
verified ·
1 Parent(s): a9c0171

add readme

Files changed (1)
  1. README.md +348 -0
README.md CHANGED
@@ -1,3 +1,351 @@
1
  ---
2
  license: apache-2.0
3
  ---
4
+
5
+ ## 1. Introduction
6
+ Step 3.7 Flash is a 198B-parameter Mixture-of-Experts (MoE) vision-language model that combines a 196B-parameter language backbone with a 1.8B-parameter vision encoder for native image understanding. Engineered for high-frequency production workloads, it activates approximately 11B parameters per token and delivers a throughput of up to 400 tokens per second. Step 3.7 Flash supports a 256k context window and offers three selectable reasoning levels (low, medium, and high) so developers can easily balance speed, cost, and cognitive depth.
7
+ We built Step 3.7 Flash for developers who need to scale agentic workflows that combine perception, search, and reasoning. It is designed to handle intensive tasks such as parsing massive financial reports in one pass, running multi-step search loops with cross-source verification, or operating concurrent coding agents in high-throughput pipelines.
8
+
9
+ ## 2. Capabilities & Performance
10
+
11
+ ### Multimodal Perception and Verification
12
+
13
+ The model delivers top-tier visual intelligence, securing first place on SimpleVQA (Search) with a score of 79.2 and achieving frontier parity on V* (Python) at 95.3. These metrics reflect strong visual grounding and retrieval-augmented reasoning beyond basic image description. The model accurately processes dense visual interfaces, such as UI wireframes, application GUIs, and data charts, and maps them into structured code. When it encounters an incomplete visual asset, it can independently identify missing data and execute lookups to verify context before returning a verified conclusion.
14
+
15
+ ### Workflow Integrity and Tool Orchestration
16
+
17
+ Execution reliability is critical for autonomous agents. Step 3.7 Flash leads the ClawEval-1.1 benchmark with a score of 67.1, which significantly outperforms the next closest competitor at 59.8. This performance demonstrates high resistance to adversarial traps and strict adherence to system policies during multi-turn orchestration. Backed by scores of 49.5 on Toolathlon and 48.1 on HLE w. Tool, this profile ensures high trajectory integrity. Step 3.7 Flash reliably interacts with external APIs and executes long-horizon workflows without drifting from instructions or violating system constraints.
18
+
19
+ ### Code Engineering and Professional Baselines
20
+
21
+ Step 3.7 Flash is built for live engineering tasks and placed second on SWE-Bench PRO with a score of 56.3. It can independently trace multi-file repositories, isolate bugs from raw issue reports, and generate functional patches that pass automated unit tests. Its scores on Terminal-Bench 2.1 (59.5) and GDPval (45.8) trail the best models in the cohort and mark clear areas for future optimization, but they establish a dependable baseline for system interactions and structured professional deliverables.
22
+
23
+ ## 3. Pricing
24
+
25
+ ## 4. Availability, Deployment, and Ecosystem
26
+ - Availability: Step 3.7 Flash is available through StepFun Open Platform at platform.stepfun.ai and platform.stepfun.com, as well as partner platforms including OpenRouter and NVIDIA NIM.
27
+ - Deployment: Step 3.7 Flash supports flexible deployment across cloud, data center, and local environments. For large-scale production and enterprise use cases, it can be deployed on modern data center infrastructure. For local and workstation scenarios, it can also run on high-memory devices such as NVIDIA DGX Station, AMD Ryzen AI Max+ 395-based systems, and Mac Studio / MacBook Pro devices with at least 128GB unified memory.
28
+ - Ecosystem: Step 3.7 Flash is supported across popular open-source infrastructure for both inference and model development. For inference and serving, developers can use vLLM, SGLang, Hugging Face Transformers, and llama.cpp. For model development workflows, StepFun model support has landed in the NVIDIA Megatron ecosystem, including Megatron Core and Megatron Bridge.
29
+
30
+ ## 5. Examples
31
+
32
+ You can get started with Step 3.7 Flash in minutes using StepFun's API or via other inference providers.
33
+
34
+ ### 5.1 Chat Example
35
+
36
+ ```python
37
+ from openai import OpenAI
38
+
39
+ client = OpenAI(api_key="STEP_API_KEY", base_url="https://api.stepfun.com/v1")
40
+
41
+ completion = client.chat.completions.create(
42
+ model="step-3.7-flash",
43
+ messages=[
44
+ {
45
+ "role": "system",
46
+ "content": "You are an AI assistant provided by StepFun. You are good at Chinese, English, and many other languages, and you can see, think, and act to "
47
+ "help users get things done.",
48
+ },
49
+ {
50
+ "role": "user",
51
+ "content": "Introduce StepFun's artificial intelligence capabilities."
52
+ },
53
+ ],
54
+ )
55
+
56
+ print(completion)
57
+ ```
58
+
59
+ ### 5.2 Text and Image Input Example
60
+
61
+ ```json
62
+ {
63
+ "model": "step-3.7-flash",
64
+ "messages": [
65
+ {
66
+ "role": "user",
67
+ "content": [
68
+ {
69
+ "type": "text",
70
+ "text": "what is in this picture?"
71
+ },
72
+ {
73
+ "type": "image_url",
74
+ "image_url": {
75
+ "url": "https://example.com/photo.jpg"
76
+ }
77
+ }
78
+ ]
79
+ }
80
+ ]
81
+ }
82
+ ```
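The request body above can be sent with the same OpenAI-compatible client used in the chat example. The following is a minimal sketch, assuming the endpoint accepts the OpenAI-style `image_url` content part exactly as shown. The helper names `build_image_messages` and `describe_image` are hypothetical, and reading the key from a `STEP_API_KEY` environment variable is an assumption.

```python
import os


def build_image_messages(question: str, image_url: str) -> list:
    """Build a single-turn message list with one text part and one image part."""
    return [
        {
            "role": "user",
            "content": [
                {"type": "text", "text": question},
                {"type": "image_url", "image_url": {"url": image_url}},
            ],
        }
    ]


def describe_image(question: str, image_url: str) -> str:
    """Send the request to the StepFun endpoint and return the reply text."""
    # Imported lazily so the payload helper works without the openai package.
    from openai import OpenAI

    client = OpenAI(
        api_key=os.environ["STEP_API_KEY"],  # assumption: key stored in this variable
        base_url="https://api.stepfun.com/v1",
    )
    completion = client.chat.completions.create(
        model="step-3.7-flash",
        messages=build_image_messages(question, image_url),
    )
    return completion.choices[0].message.content


# Example call (requires network access and a valid key):
# print(describe_image("What is in this picture?", "https://example.com/photo.jpg"))
```

Keeping the payload construction separate from the network call makes it easy to inspect or log the exact request before sending it.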
83
+
84
+ ## 6. Local Deployment
85
+
86
+ Step 3.7 Flash is optimized for local inference and supports industry-standard backends including vLLM, SGLang, Hugging Face Transformers and llama.cpp.
87
+
88
+ ### 6.1 vLLM
89
+
90
+ We recommend using the latest nightly build of vLLM.
91
+
92
+ 1. Install vLLM.
93
+
94
+ ```bash
95
+ # via Docker
96
+ docker pull vllm/vllm-openai:nightly
97
+
98
+ # or via pip (nightly wheels)
99
+ pip install -U vllm --pre \
100
+ --index-url https://pypi.org/simple \
101
+ --extra-index-url https://wheels.vllm.ai/nightly
102
+ ```
103
+
104
+ 2. Launch the server.
105
+
106
+ - For fp8 model
107
+ ```bash
108
+ vllm serve <MODEL_PATH_OR_HF_ID> \
109
+ --served-model-name step3p7-flash-fp8 \
110
+ --tensor-parallel-size 8 \
111
+ --enable-expert-parallel \
112
+ --disable-cascade-attn \
113
+ --reasoning-parser step3p5 \
114
+ --enable-auto-tool-choice \
115
+ --tool-call-parser step3p5 \
116
+ --speculative_config '{"method": "mtp", "num_speculative_tokens": 3}' \
117
+ --trust-remote-code
118
+ ```
119
+ - For bf16 model
120
+ ```bash
121
+ vllm serve <MODEL_PATH_OR_HF_ID> \
122
+ --served-model-name step3p7-flash \
123
+ --tensor-parallel-size 8 \
124
+ --enable-expert-parallel \
125
+ --disable-cascade-attn \
126
+ --reasoning-parser step3p5 \
127
+ --enable-auto-tool-choice \
128
+ --tool-call-parser step3p5 \
129
+ --speculative_config '{"method": "mtp", "num_speculative_tokens": 3}' \
130
+ --trust-remote-code
131
+ ```
132
+
133
+ - For nvfp4 model
134
+ Unlike the fp8 and bf16 checkpoints, the NVFP4-quantized checkpoint must be launched with `--quantization modelopt` and an FP8 KV cache (`--kv-cache-dtype fp8`).
135
+ ```bash
136
+ python3 -m vllm.entrypoints.openai.api_server \
137
+ --host 0.0.0.0 \
138
+ --port ${PORT} \
139
+ --model stepfun-ai/Step-3.7-Flash-NVFP4 \
140
+ --served-model-name step3p7 \
141
+ --tensor-parallel-size 4 \
142
+ --gpu-memory-utilization 0.9 \
143
+ --enable-expert-parallel \
144
+ --trust-remote-code \
145
+ --quantization modelopt \
146
+ --kv-cache-dtype fp8 \
147
+ --max-model-len 8192 \
148
+ --reasoning-parser step3p5 \
149
+ --enable-auto-tool-choice \
150
+ --tool-call-parser step3p5 \
151
+ --async-scheduling
152
+ ```
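Once a server is running, a quick way to confirm it is reachable is to list the models it exposes through the OpenAI-compatible `/v1/models` route. This is a minimal sketch that assumes the server listens on `localhost` port 8000 (adjust the URL to match your `--port`). The function names `parse_model_ids` and `list_served_models` are hypothetical.

```python
import json
import urllib.request


def parse_model_ids(payload: dict) -> list:
    """Extract model ids from an OpenAI-style /v1/models response body."""
    return [entry["id"] for entry in payload.get("data", [])]


def list_served_models(base_url: str = "http://localhost:8000/v1") -> list:
    """Query a running server for the model names it exposes."""
    with urllib.request.urlopen(f"{base_url}/models", timeout=10) as response:
        return parse_model_ids(json.load(response))


# Example call (requires a running server):
# print(list_served_models())
```

The returned id should match the value passed to `--served-model-name`, and it is the string to use as `model` in chat completion requests against the local server.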
153
+
154
+ ### 6.2 SGLang
155
+
158
+
159
+ 1. Install SGLang.
160
+
161
+ ```bash
162
+ # via Docker
163
+ docker pull lmsysorg/sglang:latest
164
+
165
+ # or from source (pip)
166
+ pip install "sglang[all] @ git+https://github.com/sgl-project/sglang.git"
167
+ ```
168
+
169
+ 2. Launch the server.
170
+
171
+ > **Note:** For Blackwell GPUs, `--mm-attention-backend fa4` may be used.
172
+
173
+ - For bf16 model
174
+
175
+ ```bash
176
+ sglang serve --model-path stepfun-ai/Step-3.7-Flash \
177
+ --tp 8 \
178
+ --reasoning-parser step3p5 \
179
+ --tool-call-parser step3p5 \
180
+ --enable-multimodal \
181
+ --speculative-algorithm EAGLE \
182
+ --speculative-num-steps 3 \
183
+ --speculative-eagle-topk 1 \
184
+ --speculative-num-draft-tokens 4 \
185
+ --enable-multi-layer-eagle \
186
+ --trust-remote-code \
187
+ --host 0.0.0.0 \
188
+ --port 8000
189
+ ```
190
+
191
+ - For fp8 model
192
+
193
+ ```bash
194
+ sglang serve --model-path stepfun-ai/Step-3.7-Flash-fp8 \
195
+ --tp 8 \
196
+ --ep 4 \
197
+ --reasoning-parser step3p5 \
198
+ --tool-call-parser step3p5 \
199
+ --enable-multimodal \
200
+ --speculative-algorithm EAGLE \
201
+ --speculative-num-steps 3 \
202
+ --speculative-eagle-topk 1 \
203
+ --speculative-num-draft-tokens 4 \
204
+ --enable-multi-layer-eagle \
205
+ --trust-remote-code \
206
+ --host 0.0.0.0 \
207
+ --port 8000
208
+ ```
209
+
210
+ ### 6.3 Transformers (Debug / Verification)
211
+
212
+ Use this snippet for quick functional verification. For high-throughput serving, use vLLM or SGLang.
213
+
214
+ > **Note:** Deployment of this model requires `transformers` 5.0 or later.
215
+
216
+ ```python
217
+ from transformers import AutoProcessor, AutoModelForCausalLM
218
+
219
+ MODEL_PATH = "<MODEL_PATH_OR_HF_ID>"
220
+
221
+ # 1. Setup
222
+ processor = AutoProcessor.from_pretrained(MODEL_PATH, trust_remote_code=True)
223
+ model = AutoModelForCausalLM.from_pretrained(
224
+ MODEL_PATH,
225
+ device_map="auto",
226
+ dtype="auto",
227
+ trust_remote_code=True
228
+ )
229
+
230
+ # 2. Prepare Input
231
+ messages = [
232
+ {
233
+ "role": "user",
234
+ "content": [
235
+ {"type": "image", "url": "https://example.com/photo.jpg"},
236
+ {"type": "text", "text": "What is in this picture?"}
237
+ ]
238
+ },
239
+ ]
240
+ inputs = processor.apply_chat_template(
241
+ messages,
242
+ tokenize=True,
243
+ add_generation_prompt=True,
244
+ return_dict=True,
245
+ return_tensors="pt",
246
+ ).to(model.device)
247
+
248
+ # 3. Generate
249
+ generated_ids = model.generate(**inputs, max_new_tokens=128, do_sample=False)
250
+ output_text = processor.decode(generated_ids[0][inputs.input_ids.shape[1]:], skip_special_tokens=True)
251
+
252
+ print(output_text)
253
+ ```
254
+
255
+ ### 6.4 llama.cpp
256
+
257
+ **System Requirements**
258
+
259
+ GGUF Model Weights:
260
+
261
+ | Component | Quantization | File Size |
262
+ |---|---|---|
263
+ | Language Model | Q4_K_S | 111.5 GB |
264
+ | Language Model | IQ4_XS | 104.99 GB |
265
+ | Language Model | Q3_K_L | 102.5 GB |
266
+ | Multimodal Projector | FP16 | 3.97 GB |
267
+
268
+ - **Runtime Overhead:** ~7 GB
269
+ - **Minimum VRAM:** 120 GB (e.g., Mac Studio, NVIDIA DGX Station, AMD Ryzen AI Max+ 395)
270
+ - **Recommended:** 128 GB unified memory
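To check which quantization fits a given machine, the figures above can be added up. This is a rough sketch that assumes the total footprint is simply language model plus multimodal projector plus the stated runtime overhead. Real usage also grows with context length, so treat the result as a lower bound. The function name `estimated_footprint_gb` is hypothetical.

```python
# File sizes in GB, copied from the table above.
LANGUAGE_MODEL_GB = {"Q4_K_S": 111.5, "IQ4_XS": 104.99, "Q3_K_L": 102.5}
PROJECTOR_GB = 3.97
RUNTIME_OVERHEAD_GB = 7.0  # the "~7 GB" runtime overhead listed above


def estimated_footprint_gb(quantization: str) -> float:
    """Sum weights, projector, and runtime overhead for one quantization."""
    total = LANGUAGE_MODEL_GB[quantization] + PROJECTOR_GB + RUNTIME_OVERHEAD_GB
    return round(total, 2)


for name in LANGUAGE_MODEL_GB:
    print(f"{name}: ~{estimated_footprint_gb(name):.2f} GB")
```

Under these assumptions, Q4_K_S comes to about 122.5 GB, which is above the 120 GB minimum but within 128 GB, while IQ4_XS (about 116 GB) and Q3_K_L (about 113.5 GB) stay under 120 GB.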
271
+
272
+ **Steps**
273
+
274
+ 1. Clone the StepFun fork of llama.cpp and check out the `step3.7` branch:
275
+
276
+ ```bash
277
+ git clone https://github.com/stepfun-ai/llama.cpp.git
278
+ cd llama.cpp
279
+ git checkout -b step3.7 origin/step3.7
280
+ ```
281
+
282
+ 2. Build llama.cpp on Mac:
283
+
284
+ ```bash
285
+ cmake -B build-macos -S . \
286
+ -DCMAKE_BUILD_TYPE=Release \
287
+ -DBUILD_SHARED_LIBS=ON \
288
+ -DLLAMA_BUILD_SERVER=ON \
289
+ -DLLAMA_BUILD_TESTS=ON \
290
+ -DGGML_METAL=ON \
291
+ -DGGML_METAL_EMBED_LIBRARY=ON \
292
+ -DGGML_BLAS=ON \
293
+ -DGGML_BLAS_VENDOR=Apple \
294
+ -DGGML_ACCELERATE=ON \
295
+ -DGGML_NATIVE=ON
296
+ cmake --build build-macos -j8
297
+ ```
298
+
299
+ 3. Build llama.cpp on DGX-Spark:
300
+
301
+ ```bash
302
+ cmake -S . -B build-cuda \
303
+ -DCMAKE_BUILD_TYPE=Release \
304
+ -DGGML_CUDA=ON \
305
+ -DGGML_CUDA_GRAPHS=ON \
306
+ -DGGML_CUDA_FORCE_MMQ=ON \
307
+ -DLLAMA_OPENSSL=OFF \
308
+ -DLLAMA_BUILD_COMMON=ON \
309
+ -DLLAMA_BUILD_TOOLS=ON \
310
+ -DLLAMA_BUILD_SERVER=ON \
311
+ -DLLAMA_BUILD_EXAMPLES=OFF \
312
+ -DLLAMA_BUILD_TESTS=OFF
313
+ cmake --build build-cuda -j8
314
+ ```
315
+
316
+ 4. Build llama.cpp on AMD Windows:
317
+
318
+ ```bash
319
+ cmake -S . -B build-vulkan \
320
+ -DCMAKE_BUILD_TYPE=Release \
321
+ -DGGML_VULKAN=ON \
322
+ -DGGML_NATIVE=ON \
323
+ -DLLAMA_BUILD_SERVER=ON \
324
+ -DLLAMA_BUILD_UI=OFF \
325
+ -DLLAMA_BUILD_TOOLS=ON
326
+ cmake --build build-vulkan -j8
327
+ ```
328
+
329
+ 5. Run with `llama-cli`:
330
+
331
+ ```bash
332
+ ./llama-cli -m Step3.7_Q4_K_S.gguf -b 2048 -ub 2048 -fa on --temp 1.0 -p "What's your name?"
333
+ ```
334
+
335
+ 6. Test performance with `llama-batched-bench`:
336
+
337
+ ```bash
338
+ ./llama-batched-bench -m Step3.7_Q4_K_S.gguf -c 32768 -b 2048 -ub 2048 -npp 0,2048,8192,16384,32768 -ntg 128 -npl 1
339
+ ```
340
+
341
+ ## 7. Using Step 3.7 Flash on Agent Platforms
342
+
343
+ You can use Step 3.7 Flash on agent platforms such as Hermes Agent, Lemonade, OpenClaw, and Kilo Code.
344
+
345
+ ## 8. Getting in Touch
346
+
347
+ As we work to shape the future of AGI by expanding broad model capabilities, we want to ensure we are solving the right problems. We invite you to be part of this continuous feedback loop — your insights directly influence our priorities.
348
+
349
+ - **Join the Conversation:** Our [Discord](https://discord.gg/RcMJhNVAQc) community is the primary hub for brainstorming future architectures, proposing capabilities, and getting early access updates 🚀
350
+ - **Report Friction:** Encountering limitations? You can open an issue or start a discussion on GitHub or Hugging Face, or flag it directly in our Discord support channels.
351
+