Add evaluation results from Step 3.5 Flash paper

- HLE (text only): 23.1
- GPQA Diamond: 83.5
- MMLU-Pro: 84.4
- SWE-Bench Verified: 74.4%
- Terminal-Bench 2.0: 51.0%

Source: https://arxiv.org/abs/2602.10604 (Table 5, Vanilla inference)
#34 opened by SaylorTwift (HF Staff)
No description provided.
hzwer changed pull request status to merged