hengm3467 committed on
Commit
47ec689
·
1 Parent(s): 480473d

clean up model card: fix numbering, typos, and code examples


- add section numbers 1-8 and remove duplicate 6.2 SGLang heading
- add Pricing section with token pricing table
- fix garbled --enable-auto-tool-choice flag in vLLM bf16 block
- swap served-model-name between fp8 and bf16 examples
- fix Python syntax error (newline inside string) in chat example
- rewrite 5.2 example in Python OpenAI client style to match 5.1
- fix undefined tokenizer reference in Transformers example
- correct JSON code fence and clarify unified memory wording
- expand YAML frontmatter with library_name, pipeline_tag, language, tags

Files changed (1)
  1. README.md +49 -36
README.md CHANGED
@@ -1,12 +1,22 @@
1
  ---
2
  license: apache-2.0
3
  ---
4
 
5
- ## Introduction
 
6
  Step 3.7 Flash is a 198B-parameter Mixture-of-Experts (MoE) vision-language model that combines a 196B-parameter language backbone with a 1.8B-parameter vision encoder for native image understanding. Engineered for high-frequency production workloads, it activates approximately 11B parameters per token and delivers a throughput of up to 400 tokens per second. Step 3.7 Flash supports a 256k context window and offers three selectable reasoning levels (low, medium, and high) so developers can easily balance speed, cost, and cognitive depth.
 
7
  We built Step 3.7 Flash for developers who need to scale agentic workflows that combine perception, search, and reasoning. It is designed to handle intensive tasks such as parsing massive financial reports in one pass, running multi-step search loops with cross-source verification, or operating concurrent coding agents in high-throughput pipelines.
8
 
9
- ## Capabilities & Performance
10
 
11
  ### Multimodal Perception and Verification
12
 
@@ -22,16 +32,22 @@ Step 3.7 Flash is built for live engineering tasks and secured a definitive seco
22
 
23
  ## 3. Pricing
24
 
25
  ## 4. Availability, Deployment, and Ecosystem
26
  - Availability: Step 3.7 Flash is available through StepFun Open Platform at platform.stepfun.ai and platform.stepfun.com, as well as partner platforms including OpenRouter and NVIDIA NIM.
27
  - Deployment: Step 3.7 Flash supports flexible deployment across cloud, data center, and local environments. For large-scale production and enterprise use cases, Step 3.7 Flash can be deployed on modern data center infrastructure. For local and workstation scenarios, it can also run on high-memory devices such as NVIDIA DGX Station, AMD Ryzen AI Max+ 395-based systems, and Mac Studio / MacBook Pro devices with at least 128GB unified memory.
28
  - Ecosystem: Step 3.7 Flash is supported across popular open-source infrastructure for both inference and model development. For inference and serving, developers can use vLLM, SGLang, Hugging Face Transformers, and llama.cpp. For model development workflows, StepFun model support has landed in the NVIDIA Megatron ecosystem, including Megatron Core and Megatron Bridge.
29
 
30
- ## 5. Examples
31
 
32
  You can get started with Step 3.7 Flash in minutes using StepFun's API or via other inference providers.
33
 
34
- 5.1 Chat Example
35
 
36
  ```python
37
  from openai import OpenAI
@@ -43,8 +59,7 @@ completion = client.chat.completions.create(
43
  messages=[
44
  {
45
  "role": "system",
46
- "content":"You are an AI assistant provided by StepFun. You are good at Chinese, English, and many other languages, and you can see, think, and act to
47
- help users get things done.",
48
  },
49
  {
50
  "role": "user",
@@ -59,33 +74,34 @@ print(completion)
59
  ### 5.2 Text and Image Input Example
60
 
61
  ```python
62
- {
63
- "model": "step-3.7-flash",
64
- "messages": [
65
- {
66
- "role": "user",
67
- "content": [
68
- {
69
- "type": "text",
70
- "text": "what is in this picture?"
71
- },
72
- {
73
- "type": "image_url",
74
- "image_url": {
75
- "url": "https://example.com/photo.jpg"
76
- }
77
- }
78
- ]
79
- }
80
- ]
81
- }
 
82
  ```
83
 
84
  ## 6. Local Deployment
85
 
86
  Step 3.7 Flash is optimized for local inference and supports industry-standard backends including vLLM, SGLang, Hugging Face Transformers and llama.cpp.
87
 
88
- ### 6.1 vLLM
89
 
90
  We recommend using the latest nightly build of vLLM.
91
 
@@ -106,7 +122,7 @@ pip install -U vllm --pre \
106
  - For fp8 model
107
  ```bash
108
  vllm serve <MODEL_PATH_OR_HF_ID> \
109
- --served-model-name step3p7-flash \
110
  --tensor-parallel-size 8 \
111
  --enable-expert-parallel \
112
  --disable-cascade-attn \
@@ -114,17 +130,17 @@ pip install -U vllm --pre \
114
  --enable-auto-tool-choice \
115
  --tool-call-parser step3p5 \
116
  --speculative_config '{"method": "mtp", "num_speculative_tokens": 3}' \
117
- --trust-remote-code
118
  ```
119
  - For bf16 model
120
  ```bash
121
  vllm serve <MODEL_PATH_OR_HF_ID> \
122
- --served-model-name step3p7-flash-fp8 \
123
  --tensor-parallel-size 8 \
124
  --enable-expert-parallel \
125
  --disable-cascade-attn \
126
  --reasoning-parser step3p5 \
127
- --enab - -auto-tool-choice \
128
  --tool-call-parser step3p5 \
129
  --speculative_config '{"method": "mtp", "num_speculative_tokens": 3}' \
130
  --trust-remote-code
@@ -153,9 +169,6 @@ pip install -U vllm --pre \
153
 
154
  ### 6.2 SGLang
155
 
156
-
157
- ### 6.2 SGLang
158
-
159
  1. Install SGLang.
160
 
161
  ```bash
@@ -247,7 +260,7 @@ inputs = processor.apply_chat_template(
247
 
248
  # 3. Generate
249
  generated_ids = model.generate(**inputs, max_new_tokens=128, do_sample=False)
250
- output_text = tokenizer.decode(generated_ids[0][inputs.input_ids.shape[1]:], skip_special_tokens=True)
251
 
252
  print(output_text)
253
  ```
@@ -266,7 +279,7 @@ GGUF Model Weights:
266
  | Multimodal Projector | FP16 | 3.97 GB |
267
 
268
  - **Runtime Overhead:** ~7 GB
269
- - **Minimum VRAM:** 120 GB (e.g., Mac Studio, NVIDIA DGX Station, AMD Ryzen AI Max+ 395)
270
  - **Recommended:** 128 GB unified memory
271
 
272
  **Steps**
 
1
  ---
2
  license: apache-2.0
3
+ library_name: transformers
4
+ pipeline_tag: image-text-to-text
5
+ language:
6
+ - en
7
+ tags:
8
+ - vision-language
9
+ - multimodal
10
+ - moe
11
  ---
12
 
13
+ ## 1. Introduction
14
+
15
  Step 3.7 Flash is a 198B-parameter Mixture-of-Experts (MoE) vision-language model that combines a 196B-parameter language backbone with a 1.8B-parameter vision encoder for native image understanding. Engineered for high-frequency production workloads, it activates approximately 11B parameters per token and delivers a throughput of up to 400 tokens per second. Step 3.7 Flash supports a 256k context window and offers three selectable reasoning levels (low, medium, and high) so developers can easily balance speed, cost, and cognitive depth.
16
+
17
  We built Step 3.7 Flash for developers who need to scale agentic workflows that combine perception, search, and reasoning. It is designed to handle intensive tasks such as parsing massive financial reports in one pass, running multi-step search loops with cross-source verification, or operating concurrent coding agents in high-throughput pipelines.
18
 
19
+ ## 2. Capabilities & Performance
20
 
21
  ### Multimodal Perception and Verification
22
 
 
32
 
33
  ## 3. Pricing
34
 
35
+ | Token Type | Price |
36
+ |---|---|
37
+ | Input (cache miss) | $0.20 / M tokens |
38
+ | Input (cache hit) | $0.04 / M tokens |
39
+ | Output | $1.15 / M tokens |
40
+
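
As a rough illustration of how the per-token prices above combine, the following sketch estimates the cost of a single request. The function name `estimate_cost` and the `cached_input_tokens` parameter are hypothetical, and treating cache-hit tokens as a subset of the input tokens is an assumption rather than something the table states; the platform's billing documentation is the authority on the actual accounting.

```python
# Prices in USD per million tokens, copied from the pricing table.
PRICE_PER_M = {
    "input_cache_miss": 0.20,
    "input_cache_hit": 0.04,
    "output": 1.15,
}


def estimate_cost(input_tokens, output_tokens, cached_input_tokens=0):
    """Estimate the cost of one request in USD.

    Assumption: cached_input_tokens is the part of input_tokens that is
    billed at the cache-hit rate; the rest is billed at the cache-miss rate.
    """
    if not 0 <= cached_input_tokens <= input_tokens:
        raise ValueError("cached_input_tokens must be between 0 and input_tokens")

    uncached = input_tokens - cached_input_tokens
    total = (
        uncached * PRICE_PER_M["input_cache_miss"]
        + cached_input_tokens * PRICE_PER_M["input_cache_hit"]
        + output_tokens * PRICE_PER_M["output"]
    )
    return total / 1_000_000


if __name__ == "__main__":
    # 100k input tokens (80k of them cached) and 2k output tokens.
    print(f"${estimate_cost(100_000, 2_000, cached_input_tokens=80_000):.4f}")
```

Under these assumptions, a request with 100k input tokens (80k cached) and 2k output tokens costs a little under one cent.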
41
  ## 4. Availability, Deployment, and Ecosystem
42
  - Availability: Step 3.7 Flash is available through StepFun Open Platform at platform.stepfun.ai and platform.stepfun.com, as well as partner platforms including OpenRouter and NVIDIA NIM.
43
  - Deployment: Step 3.7 Flash supports flexible deployment across cloud, data center, and local environments. For large-scale production and enterprise use cases, Step 3.7 Flash can be deployed on modern data center infrastructure. For local and workstation scenarios, it can also run on high-memory devices such as NVIDIA DGX Station, AMD Ryzen AI Max+ 395-based systems, and Mac Studio / MacBook Pro devices with at least 128GB unified memory.
44
  - Ecosystem: Step 3.7 Flash is supported across popular open-source infrastructure for both inference and model development. For inference and serving, developers can use vLLM, SGLang, Hugging Face Transformers, and llama.cpp. For model development workflows, StepFun model support has landed in the NVIDIA Megatron ecosystem, including Megatron Core and Megatron Bridge.
45
 
46
+ ## 5. Examples
47
 
48
  You can get started with Step 3.7 Flash in minutes using StepFun's API or via other inference providers.
49
 
50
+ ### 5.1 Chat Example
51
 
52
  ```python
53
  from openai import OpenAI
 
59
  messages=[
60
  {
61
  "role": "system",
62
+ "content": "You are an AI assistant provided by StepFun. You are good at Chinese, English, and many other languages, and you can see, think, and act to help users get things done.",
 
63
  },
64
  {
65
  "role": "user",
 
74
  ### 5.2 Text and Image Input Example
75
 
76
  ```python
77
+ from openai import OpenAI
78
+
79
+ client = OpenAI(api_key="STEP_API_KEY", base_url="https://api.stepfun.com/v1")
80
+
81
+ completion = client.chat.completions.create(
82
+ model="step-3.7-flash",
83
+ messages=[
84
+ {
85
+ "role": "user",
86
+ "content": [
87
+ {"type": "text", "text": "What is in this picture?"},
88
+ {
89
+ "type": "image_url",
90
+ "image_url": {"url": "https://example.com/photo.jpg"},
91
+ },
92
+ ],
93
+ },
94
+ ],
95
+ )
96
+
97
+ print(completion)
98
  ```
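
The `content` list in the example above can also be assembled by a small helper, which is convenient when a script sends many images. The sketch below uses hypothetical helper names (`build_image_message`, `to_data_url`). Passing a base64 `data:` URL in place of an `https://` URL follows the OpenAI-style convention and is an assumption here; the README does not confirm that the StepFun endpoint accepts it.

```python
import base64


def build_image_message(text, image_url):
    """Build one user message with a text part and an image part."""
    return {
        "role": "user",
        "content": [
            {"type": "text", "text": text},
            {"type": "image_url", "image_url": {"url": image_url}},
        ],
    }


def to_data_url(image_bytes, mime_type="image/jpeg"):
    """Encode raw image bytes as a base64 data URL.

    Assumption: the serving endpoint accepts data URLs in image_url.
    """
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return f"data:{mime_type};base64,{encoded}"


if __name__ == "__main__":
    message = build_image_message(
        "What is in this picture?", "https://example.com/photo.jpg"
    )
    print(message)
```

The resulting dictionary can be passed as an element of `messages` in `client.chat.completions.create(...)`.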
99
 
100
  ## 6. Local Deployment
101
 
102
  Step 3.7 Flash is optimized for local inference and supports industry-standard backends including vLLM, SGLang, Hugging Face Transformers and llama.cpp.
103
 
104
+ ### 6.1 vLLM
105
 
106
  We recommend using the latest nightly build of vLLM.
107
 
 
122
  - For fp8 model
123
  ```bash
124
  vllm serve <MODEL_PATH_OR_HF_ID> \
125
+ --served-model-name step3p7-flash-fp8 \
126
  --tensor-parallel-size 8 \
127
  --enable-expert-parallel \
128
  --disable-cascade-attn \
 
130
  --enable-auto-tool-choice \
131
  --tool-call-parser step3p5 \
132
  --speculative_config '{"method": "mtp", "num_speculative_tokens": 3}' \
133
+ --trust-remote-code
134
  ```
135
  - For bf16 model
136
  ```bash
137
  vllm serve <MODEL_PATH_OR_HF_ID> \
138
+ --served-model-name step3p7-flash \
139
  --tensor-parallel-size 8 \
140
  --enable-expert-parallel \
141
  --disable-cascade-attn \
142
  --reasoning-parser step3p5 \
143
+ --enable-auto-tool-choice \
144
  --tool-call-parser step3p5 \
145
  --speculative_config '{"method": "mtp", "num_speculative_tokens": 3}' \
146
  --trust-remote-code
 
169
 
170
  ### 6.2 SGLang
171
 
172
  1. Install SGLang.
173
 
174
  ```bash
 
260
 
261
  # 3. Generate
262
  generated_ids = model.generate(**inputs, max_new_tokens=128, do_sample=False)
263
+ output_text = processor.tokenizer.decode(generated_ids[0][inputs.input_ids.shape[1]:], skip_special_tokens=True)
264
 
265
  print(output_text)
266
  ```
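
The slice `generated_ids[0][inputs.input_ids.shape[1]:]` in the example above drops the prompt tokens before decoding, since `generate` on a decoder-only model typically returns the prompt followed by the newly generated tokens. The following is a minimal sketch of the same idea with plain Python lists; the token IDs are made up for illustration and `strip_prompt` is a hypothetical helper, not part of Transformers.

```python
def strip_prompt(generated_ids, prompt_length):
    """Return only the tokens that come after the prompt."""
    return generated_ids[prompt_length:]


# Made-up token IDs standing in for a tokenized prompt.
prompt_ids = [101, 7592, 2088]
# Stand-in for one row of generate() output: prompt plus new tokens.
generated = prompt_ids + [2023, 2003, 102]

new_tokens = strip_prompt(generated, len(prompt_ids))
print(new_tokens)
```

Decoding only `new_tokens` keeps the echoed prompt out of `output_text`.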
 
279
  | Multimodal Projector | FP16 | 3.97 GB |
280
 
281
  - **Runtime Overhead:** ~7 GB
282
+ - **Minimum unified memory / VRAM:** 120 GB (e.g., Mac Studio, NVIDIA DGX Station, AMD Ryzen AI Max+ 395)
283
  - **Recommended:** 128 GB unified memory
284
 
285
  **Steps**