SaylorTwift (HF Staff) committed
Commit 31cd3f0 · verified · 1 Parent(s): ef9cf3c

Add evaluation results from Step 3.5 Flash paper

- HLE (text only): 23.1
- GPQA Diamond: 83.5
- MMLU-Pro: 84.4
- SWE-Bench Verified: 74.4%
- Terminal-Bench 2.0: 51.0%

Source: https://arxiv.org/abs/2602.10604 (Table 5, Vanilla inference)

.eval_results/gpqa_diamond.yaml ADDED
@@ -0,0 +1,9 @@
+ - dataset:
+     id: Idavidrein/gpqa
+     task_id: diamond
+   value: 83.5
+   date: '2026-02-11'
+   source:
+     url: https://arxiv.org/abs/2602.10604
+     name: Step 3.5 Flash Paper
+     user: SaylorTwift
.eval_results/hle.yaml ADDED
@@ -0,0 +1,10 @@
+ - dataset:
+     id: cais/hle
+     task_id: hle
+   value: 23.1
+   date: '2026-02-11'
+   source:
+     url: https://arxiv.org/abs/2602.10604
+     name: Step 3.5 Flash Paper
+     user: SaylorTwift
+   notes: "Text Only"
.eval_results/mmlu_pro.yaml ADDED
@@ -0,0 +1,9 @@
+ - dataset:
+     id: TIGER-Lab/MMLU-Pro
+     task_id: mmlu_pro
+   value: 84.4
+   date: '2026-02-11'
+   source:
+     url: https://arxiv.org/abs/2602.10604
+     name: Step 3.5 Flash Paper
+     user: SaylorTwift
.eval_results/swe_bench_verified.yaml ADDED
@@ -0,0 +1,9 @@
+ - dataset:
+     id: SWE-bench/SWE-bench_Verified
+     task_id: swe_bench_%_resolved
+   value: 74.4
+   date: '2026-02-11'
+   source:
+     url: https://arxiv.org/abs/2602.10604
+     name: Step 3.5 Flash Paper
+     user: SaylorTwift
.eval_results/terminal_bench_2.yaml ADDED
@@ -0,0 +1,9 @@
+ - dataset:
+     id: harborframework/terminal-bench-2.0
+     task_id: terminalbench_2
+   value: 51.0
+   date: '2026-02-11'
+   source:
+     url: https://arxiv.org/abs/2602.10604
+     name: Step 3.5 Flash Paper
+     user: SaylorTwift
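
The five files added above share one shape, so a small generator can illustrate it. The following is only a sketch: `render_entry` is a hypothetical helper, not part of any Hub tooling, and the nesting it emits (`id` and `task_id` under `dataset`; `url`, `name`, and `user` under `source`) is an assumption, since the diff view does not preserve indentation.

```python
from typing import Optional


def render_entry(dataset_id: str, task_id: str, value: float, date: str,
                 url: str, name: str, user: str,
                 notes: Optional[str] = None) -> str:
    """Render one evaluation entry as YAML text.

    Hypothetical helper: the field nesting is assumed, not confirmed
    by the commit page.
    """
    lines = [
        "- dataset:",
        f"    id: {dataset_id}",
        f"    task_id: {task_id}",
        f"  value: {value}",
        f"  date: '{date}'",
        "  source:",
        f"    url: {url}",
        f"    name: {name}",
        f"    user: {user}",
    ]
    # Only the HLE entry in this commit carries a notes field.
    if notes is not None:
        lines.append(f'  notes: "{notes}"')
    return "\n".join(lines) + "\n"


# Values copied from .eval_results/hle.yaml in this commit.
hle = render_entry(
    dataset_id="cais/hle",
    task_id="hle",
    value=23.1,
    date="2026-02-11",
    url="https://arxiv.org/abs/2602.10604",
    name="Step 3.5 Flash Paper",
    user="SaylorTwift",
    notes="Text Only",
)
print(hle)
```

Calling it with the values from any of the other four files, and omitting `notes`, would yield a nine-line entry matching their hunk headers (`+1,9`).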