update-bmk-numbers

#18
by mh3467 - opened
Files changed (2)
  1. README.md +24 -25
  2. step-bar-chart.png +2 -2
README.md CHANGED
@@ -52,29 +52,29 @@ Performance of Step 3.5 Flash measured across **Reasoning**, **Coding**, and **A
  ### Detailed Benchmarks
 
  | Benchmark | Step 3.5 Flash | DeepSeek V3.2 | Kimi K2 Thinking / K2.5 | GLM-4.7 | MiniMax M2.1 | MiMo-V2 Flash |
- | --- | --- | --- | --- | --- | --- | --- |
  | # Activated Params | 11B | 37B | 32B | 32B | 10B | 15B |
  | # Total Params (MoE) | 196B | 671B | 1T | 355B | 230B | 309B |
- | Est. decoding cost @ 128K context, Hopper GPU** | **1.0x**<br>100 tok/s, MTP-3, EP8 | **6.0x**<br>33 tok/s, MTP-1, EP32 | **18.9x**<br>33 tok/s, no MTP, EP32 | **18.9x**<br>100 tok/s, MTP-3, EP8 | **3.9x**<br>100 tok/s, MTP-3, EP8 | **1.2x**<br>100 tok/s, MTP-3, EP8 |
- | | | | **Agent** | | | |
- | τ²-Bench | 88.2 | 80.3 (85.2*) | 74.3*/85.4* | 87.4 | 86.6* | 80.3 (84.1*) |
- | BrowseComp | 51.6 | 51.4 | 41.5* / 60.6 | 52.0 | 47.4 | 45.4 |
- | BrowseComp (w/ Context Manager) | 69.0 | 67.6 | 60.2/74.9 | 67.5 | 62.0 | 58.3 |
- | BrowseComp-ZH | 66.9 | 65.0 | 62.3 / 62.3* | 66.6 | 47.8* | 51.2* |
- | BrowseComp-ZH (w/ Context Manager) | 73.7 | — | —/— | — | — | — |
- | GAIA (no file) | 84.5 | 75.1* | 75.6*/75.9* | 61.9* | 64.3* | 78.2* |
- | xbench-DeepSearch (2025.05) | 83.7 | 78.0* | 76.0*/76.7* | 72.0* | 68.7* | 69.3* |
- | xbench-DeepSearch (2025.10) | 56.3 | 55.7* | —/40+ | 52.3* | 43.0* | 44.0* |
- | ResearchRubrics | 65.3 | 55.8* | 56.2*/59.5* | 62.0* | 60.2* | 54.3* |
- | | | | **Reasoning** | | | |
- | AIME 2025 | 97.3 | 93.1 | 94.5/96.1 | 95.7 | 83.0 | 94.1 (95.1*) |
- | HMMT 2025 (Feb.) | 98.4 | 92.5 | 89.4/95.4 | 97.1 | 71.0* | 84.4 (95.4*) |
- | HMMT 2025 (Nov.) | 94.0 | 90.2 | 89.2*/— | 93.5 | 74.3* | 91.0* |
- | IMOAnswerBench | 85.4 | 78.3 | 78.6/81.8 | 82.0 | 60.4* | 80.9* |
- | | | | **Coding** | | | |
- | LiveCodeBench-V6 | 86.4 | 83.3 | 83.1/85.0 | 84.9 | — | 80.6 (81.6*) |
- | SWE-bench Verified | 74.4 | 73.1 | 71.3/76.8 | 73.8 | 74.0 | 73.4 |
- | Terminal-Bench 2.0 | 51.0 | 46.4 | 35.7*/50.8 | 41.0 | 47.9 | 38.5 |
 
  **Notes**:
  1. "—" indicates the score is not publicly available or not tested.
@@ -305,11 +305,10 @@ print(output_text)
  - Minimum VRAM: 120 GB (e.g., Mac studio, DGX-Spark, AMD Ryzen AI Max+ 395)
  - Recommended: 128GB unified memory
  #### Steps
- 1. Use official llama.cpp:
- > the folder `Step-3.5-Flash/tree/main/llama.cpp` is **obsolete**
  ```bash
- git clone https://github.com/ggml-org/llama.cpp
- cd llama.cpp
  ```
  2. Build llama.cpp on Mac:
  ```bash
 
  ### Detailed Benchmarks
 
  | Benchmark | Step 3.5 Flash | DeepSeek V3.2 | Kimi K2 Thinking / K2.5 | GLM-4.7 | MiniMax M2.1 | MiMo-V2 Flash |
+ |---|---|---|---|---|---|---|
  | # Activated Params | 11B | 37B | 32B | 32B | 10B | 15B |
  | # Total Params (MoE) | 196B | 671B | 1T | 355B | 230B | 309B |
+ | Est. decoding cost (@ 128K context, Hopper GPU**) | **1.0x** (100 tok/s, MTP-3, EP8) | 6.0x (33 tok/s, MTP-1, EP32) | 18.9x (33 tok/s, no MTP, EP32) | 18.9x (100 tok/s, MTP-3, EP8) | 3.9x (100 tok/s, MTP-3, EP8) | 1.2x (100 tok/s, MTP-3, EP8) |
+ | **Agency** | | | | | | |
+ | τ²-Bench | **88.2** | 80.3 | 74.3* / — | 87.4 | 80.2* | 80.3 |
+ | BrowseComp | 51.6 | 51.4 | 41.5* / **60.6** | 52.0 | 47.4 | 45.4 |
+ | BrowseComp (w/ Context Manager) | 69.0 | 67.6 | 60.2 / **74.9** | 67.5 | 62.0 | 58.3 |
+ | BrowseComp-ZH | **66.9** | 65.0 | 62.3 / 62.3* | 66.6 | 47.8* | 51.2* |
+ | BrowseComp-ZH (w/ Context Manager) | **73.7** | — | / — | — | — | — |
+ | GAIA (no file) | **84.5** | 75.1* | 75.6* / 75.9* | 61.9* | 64.3* | 78.2* |
+ | xbench-DeepSearch (2025.05) | **83.7** | 78.0* | 76.0* / 76.7* | 72.0* | 68.7* | 69.3* |
+ | xbench-DeepSearch (2025.10) | **56.3** | 55.7* | — / 40+ | 52.3* | 43.0* | 44.0* |
+ | ResearchRubrics | **65.3** | 55.8* | 56.2* / 59.5* | 62.0* | 60.2* | 54.3* |
+ | **Reasoning** | | | | | | |
+ | AIME 2025 | **97.3** | 93.1 | 94.5 / 96.1 | 95.7 | 83.0 | 94.1 (95.1*) |
+ | HMMT 2025 (Feb.) | **98.4** | 92.5 | 89.4 / 95.4 | 97.1 | 71.0* | 84.4 (95.4*) |
+ | HMMT 2025 (Nov.) | **94.0** | 90.2 | 89.2* / — | 93.5 | 74.3* | 91.0* |
+ | IMOAnswerBench | **85.4** | 78.3 | 78.6 / 81.8 | 82.0 | 60.4* | 80.9* |
+ | **Coding** | | | | | | |
+ | LiveCodeBench-V6 | **86.4** | 83.3 | 83.1 / 85.0 | 84.9 | — | 80.6 (81.6*) |
+ | SWE-bench Verified | 74.4 | 73.1 | 71.3 / **76.8** | 73.8 | 74.0 | 73.4 |
+ | Terminal-Bench 2.0 | **51.0** | 46.4 | 35.7* / 50.8 | 41.0 | 47.9 | 38.5 |
 
  **Notes**:
  1. "—" indicates the score is not publicly available or not tested.
 
  - Minimum VRAM: 120 GB (e.g., Mac studio, DGX-Spark, AMD Ryzen AI Max+ 395)
  - Recommended: 128GB unified memory
  #### Steps
+ 1. Use llama.cpp:
  ```bash
+ git clone git@github.com:stepfun-ai/Step-3.5-Flash.git
+ cd Step-3.5-Flash/llama.cpp
  ```
  2. Build llama.cpp on Mac:
  ```bash
step-bar-chart.png CHANGED

Git LFS Details

  • SHA256: 3fa283dc9c139edc3331aaafa21d69de212a241f03262f09acf96fbc0123a93d
  • Pointer size: 131 Bytes
  • Size of remote file: 647 kB

Git LFS Details

  • SHA256: b353d54e27baaac2539402d9dacdccf8230ff909c098c31dc905fbc5a442165e
  • Pointer size: 131 Bytes
  • Size of remote file: 575 kB
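
The SHA256 values above identify the two versions of `step-bar-chart.png` stored in Git LFS. As a minimal sketch, a local copy of the image could be checked against one of these digests as follows. This assumes the second digest listed belongs to the updated image, which the page does not state explicitly, and the local path and the helper names `sha256_of_file` and `matches_expected` are hypothetical.

```python
import hashlib
from pathlib import Path

# Second digest listed under "Git LFS Details"; assumed (not confirmed by the
# page) to be the updated image.
EXPECTED_SHA256 = "b353d54e27baaac2539402d9dacdccf8230ff909c098c31dc905fbc5a442165e"


def sha256_of_file(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        # iter() with a sentinel stops when read() returns an empty bytes object.
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_expected(path, expected=EXPECTED_SHA256):
    """Compare a file's digest with the expected hex digest."""
    return sha256_of_file(path) == expected.lower()


if __name__ == "__main__":
    image = Path("step-bar-chart.png")  # hypothetical local path
    if image.exists():
        print("match" if matches_expected(image) else "mismatch")
    else:
        print(f"{image} not found")
```

If the repository was cloned without fetching LFS objects, the local file is only the small pointer file (131 bytes here) and not the image, so the check reports a mismatch until the real object has been downloaded.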