README.md CHANGED
@@ -51,37 +51,39 @@ Performance of Step 3.5 Flash measured across **Reasoning**, **Coding**, and **A
 
 ### Detailed Benchmarks
 
- | Benchmark | Step 3.5 Flash | DeepSeek V3.2 | Kimi K2 Thinking / K2.5 | GLM-4.7 | MiniMax M2.1 | MiMo-V2 Flash |
- | --- | --- | --- | --- | --- | --- | --- |
- | # Activated Params | 11B | 37B | 32B | 32B | 10B | 15B |
- | # Total Params (MoE) | 196B | 671B | 1T | 355B | 230B | 309B |
- | Est. decoding cost @ 128K context, Hopper GPU** | **1.0x**<br>100 tok/s, MTP-3, EP8 | **6.0x**<br>33 tok/s, MTP-1, EP32 | **18.9x**<br>33 tok/s, no MTP, EP32 | **18.9x**<br>100 tok/s, MTP-3, EP8 | **3.9x**<br>100 tok/s, MTP-3, EP8 | **1.2x**<br>100 tok/s, MTP-3, EP8 |
- | | | | **Agent** | | | |
- | τ²-Bench | 88.2 | 80.3 (85.2*) | 74.3*/85.4* | 87.4 | 86.6* | 80.3 (84.1*) |
- | BrowseComp | 51.6 | 51.4 | 41.5* / 60.6 | 52.0 | 47.4 | 45.4 |
- | BrowseComp (w/ Context Manager) | 69.0 | 67.6 | 60.2/74.9 | 67.5 | 62.0 | 58.3 |
- | BrowseComp-ZH | 66.9 | 65.0 | 62.3 / 62.3* | 66.6 | 47.8* | 51.2* |
- | BrowseComp-ZH (w/ Context Manager) | 73.7 | | —/— | | | |
- | GAIA (no file) | 84.5 | 75.1* | 75.6*/75.9* | 61.9* | 64.3* | 78.2* |
- | xbench-DeepSearch (2025.05) | 83.7 | 78.0* | 76.0*/76.7* | 72.0* | 68.7* | 69.3* |
- | xbench-DeepSearch (2025.10) | 56.3 | 55.7* | —/40+ | 52.3* | 43.0* | 44.0* |
- | ResearchRubrics | 65.3 | 55.8* | 56.2*/59.5* | 62.0* | 60.2* | 54.3* |
- | | | | **Reasoning** | | | |
- | AIME 2025 | 97.3 | 93.1 | 94.5/96.1 | 95.7 | 83.0 | 94.1 (95.1*) |
- | HMMT 2025 (Feb.) | 98.4 | 92.5 | 89.4/95.4 | 97.1 | 71.0* | 84.4 (95.4*) |
- | HMMT 2025 (Nov.) | 94.0 | 90.2 | 89.2*/— | 93.5 | 74.3* | 91.0* |
- | IMOAnswerBench | 85.4 | 78.3 | 78.6/81.8 | 82.0 | 60.4* | 80.9* |
- | | | | **Coding** | | | |
- | LiveCodeBench-V6 | 86.4 | 83.3 | 83.1/85.0 | 84.9 | | 80.6 (81.6*) |
- | SWE-bench Verified | 74.4 | 73.1 | 71.3/76.8 | 73.8 | 74.0 | 73.4 |
- | Terminal-Bench 2.0 | 51.0 | 46.4 | 35.7*/50.8 | 41.0 | 47.9 | 38.5 |
-
- **Notes**:
- 1. "—" indicates the score is not publicly available or not tested.
- 2. "*" indicates the original score was inaccessible or lower than our reproduced, so we report the evaluation under the same test conditions as Step 3.5 Flash to ensure fair comparability.
- 3. **BrowseComp (with Context Manager)**: When the effective context length exceeds a predefined threshold, the agent resets the context and restarts the agent loop. By contrast, Kimi K2.5 and DeepSeek-V3.2 used a "discard-all" strategy.
- 4. **Decoding Cost**: Estimates are based on a methodology similar to, but more accurate than, the approach described arxiv.org/abs/2507.19427
-
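The context-manager strategy described in note 3 above (reset and restart the agent loop once the effective context exceeds a threshold, rather than discarding everything) could be sketched roughly as follows. This is an illustrative assumption, not the authors' implementation: the `agent_step` interface, the whitespace token count, and the carried-over notes are all placeholders.

```python
def count_tokens(text: str) -> int:
    # Crude whitespace tokenizer; stands in for the real tokenizer.
    return len(text.split())

def browse_with_resets(question, agent_step, threshold=8192, max_steps=100):
    """Run an agent loop; when the effective context exceeds `threshold`,
    reset the context (seeded with carried-over notes) and restart the
    loop, instead of using a "discard-all" strategy.

    `agent_step(context)` is a hypothetical interface returning
    (observation, answer), where answer is None until the task is done.
    """
    context = question
    notes = []  # compact findings carried across resets (an assumption)
    for _ in range(max_steps):
        observation, answer = agent_step(context)
        if answer is not None:
            return answer
        context = context + " " + observation
        if count_tokens(context) > threshold:
            notes.append(observation)
            # Reset: fresh context seeded with the carried notes.
            context = question + " notes: " + " ".join(notes)
    return None
```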
 
 
+ | Benchmark | # Shots | Step 3.5 Flash (Base) | MiMo‑V2 Flash (Base) | GLM‑4.5 (Base) | DeepSeek V3.1 (Base) | DeepSeek V3.2 (Exp Base) | Kimi‑K2 (Base) |
+ | --- | --- | --- | --- | --- | --- | --- | --- |
+ | # Activated Params | - | 11B | 15B | 32B | 37B | 37B | 32B |
+ | # Total Params | - | 196B | 309B | 355B | 671B | 671B | 1043B |
+ | **General** | | | | | | | |
+ | BBH | 3-shot | 88.2 | 88.5 | 86.2 | 88.2† | 88.7† | 88.7 |
+ | MMLU | 5-shot | 85.8 | 86.7 | 86.1 | 87.4 | 87.8† | 87.8 |
+ | MMLU‑Redux | 5-shot | 89.2 | 90.6 | - | 90.0 | 90.4 | 90.2 |
+ | MMLU‑Pro | 5-shot | 62.3 | 73.2 | - | 58.8† | 62.1† | 69.2 |
+ | HellaSwag | 10-shot | 90.2 | 88.5 | 87.1 | 89.2† | 89.4† | 94.6 |
+ | WinoGrande | 5-shot | 79.1 | 83.8 | - | 85.9† | 85.6† | 85.3 |
+ | GPQA | 5-shot | 41.7 | 43.5* | 33.5* | 43.1* | 37.3* | 43.1* |
+ | SuperGPQA | 5-shot | 41.0 | 41.1 | - | 42.3† | 43.6† | 44.7 |
+ | SimpleQA | 5-shot | 31.6 | 20.6 | 30.0 | 26.3 | 27.0 | 35.3 |
+ | **Mathematics** | | | | | | | |
+ | GSM8K | 8-shot | 88.2 | 92.3 | 87.6 | 91.4† | 91.1† | 92.1 |
+ | MATH | 4-shot | 66.8 | 71.0 | 62.6 | 62.6† | 62.5† | 70.2 |
+ | **Code** | | | | | | | |
+ | HumanEval | 3-shot | 81.1 | 77.4* | 79.8* | 72.5* | 67.7* | 84.8* |
+ | MBPP | 3-shot | 79.4 | 81.0* | 81.6* | 74.6* | 75.6* | 89.0* |
+ | HumanEval+ | 0-shot | 72.0 | 70.7 | - | 64.6† | 67.7† | - |
+ | MBPP+ | 0-shot | 70.6 | 71.4 | - | 72.2† | 69.8† | - |
+ | MultiPL‑E HumanEval | 0-shot | 67.7 | 59.5 | - | 45.9† | 45.7† | 60.5 |
+ | MultiPL‑E MBPP | 0-shot | 58.0 | 56.7 | - | 52.5† | 50.6† | 58.8 |
+ | **Chinese** | | | | | | | |
+ | C‑EVAL | 5-shot | 89.6 | 87.9 | 86.9 | 90.0† | 91.0† | 92.5 |
+ | CMMLU | 5-shot | 88.9 | 87.4 | - | 88.8† | 88.9† | 90.9 |
+ | C‑SimpleQA | 5-shot | 63.2 | 61.5 | 70.1 | 70.9† | 68.0† | 77.6 |
+
+ **Notes**:
+ 1. "*" indicates the original score was unavailable, so we report results evaluated under the same test conditions as Step 3.5 Flash to ensure fair comparison.
+ 2. "†" indicates DeepSeek scores quoted from the MiMo‑V2‑Flash report.
+
 ### Recommended Inference Parameters
 1. For general chat, we suggest `temperature=0.6, top_p=0.95`.
 2. For reasoning / agent scenarios, we recommend `temperature=1.0, top_p=0.95`.
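The two recommended sampling presets above could be selected per scenario as in the following sketch for an OpenAI-compatible chat request. The `PRESETS` dict, `build_request` helper, and `step-3.5-flash` model id are illustrative assumptions, not part of the official API.

```python
# Recommended sampling presets from the README (the only values taken
# from the source); everything else here is an illustrative assumption.
PRESETS = {
    "chat":  {"temperature": 0.6, "top_p": 0.95},  # general chat
    "agent": {"temperature": 1.0, "top_p": 0.95},  # reasoning / agent
}

def build_request(messages, scenario="chat"):
    """Build an OpenAI-compatible chat-completion payload with the
    recommended sampling parameters for the given scenario."""
    if scenario not in PRESETS:
        raise ValueError(f"unknown scenario: {scenario}")
    return {
        "model": "step-3.5-flash",  # placeholder model id
        "messages": messages,
        **PRESETS[scenario],
    }
```

The payload can then be posted to any OpenAI-compatible endpoint; only `temperature` and `top_p` differ between the two scenarios.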