Text Generation
Transformers
Safetensors
step3p5
conversational
custom_code
Eval Results

Add evaluation results from Step 3.5 Flash paper

- HLE (text only): 23.1
- GPQA Diamond: 83.5
- MMLU-Pro: 84.4
- SWE-Bench Verified: 74.4%
- Terminal-Bench 2.0: 51.0%

Source: https://arxiv.org/abs/2602.10604 (Table 5, Vanilla inference)

#34
Opened by SaylorTwift (HF Staff)
No description provided.
hzwer changed pull request status to merged
