Add evaluation results

#6
by SaylorTwift HF Staff - opened

Add Evaluation Results for InternScience/Agents-A1

Summary

This PR adds evaluation results extracted from the Agents-A1 model card benchmark table to the .eval_results/ directory, following the Hugging Face Hub evaluation-results specification.

Benchmarks Added

| Benchmark | Score | Hub Dataset | Task ID |
|---|---|---|---|
| HLE w/ tools | 47.6 | cais/hle | hle |
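
As a rough illustration of how the HLE row could be turned into an entry for `.eval_results/Agents-A1.yaml`, the sketch below builds one record and serializes it by hand. The field names (`dataset.id`, `dataset.task_id`, `value`, `notes`) and the helper functions are assumptions made for this example; they are not copied from the file added in this PR, so check them against the Hub specification before reuse.

```python
# Hypothetical sketch: the field names below are assumptions for illustration,
# not the confirmed contents of .eval_results/Agents-A1.yaml.

def build_entry(dataset_id, task_id, value, notes=""):
    """Build one evaluation-result record as a plain dict."""
    entry = {"dataset": {"id": dataset_id, "task_id": task_id}, "value": value}
    if notes:
        entry["notes"] = notes
    return entry


def to_yaml(entries):
    """Serialize records to YAML text by hand (avoids a PyYAML dependency)."""
    lines = []
    for e in entries:
        lines.append("- dataset:")
        lines.append("    id: " + e["dataset"]["id"])
        lines.append("    task_id: " + e["dataset"]["task_id"])
        lines.append("  value: " + str(e["value"]))
        if "notes" in e:
            lines.append('  notes: "' + e["notes"] + '"')
    return "\n".join(lines) + "\n"


# Values taken from the "Benchmarks Added" row in this PR description.
hle_entry = build_entry("cais/hle", "hle", 47.6, notes="w/ tools")
yaml_text = to_yaml([hle_entry])
print(yaml_text)
```

The hand-written serializer only covers the flat shape used here; a real script would more likely use a YAML library.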

Benchmarks Skipped (Not Registered on Hub)

| Benchmark | Score |
|---|---|
| BrowseComp | 75.51 |
| XBench-DS-2510 | 86.0 |
| Seal0 | 56.36 |
| GAIA | 96.04 |
| SciCode | 44.33 |
| MLE-Lite | 43.94 |
| HiPhO | 46.4 |
| FrontierScience-Olympiad | 79.0 |
| FrontierScience-Research | 40.0 |
| IFBench | 80.61 |
| LongBench-v2 | 60.2 |
| IFEval | 94.82 |
| τ²-Bench | 79.81 |
| VitaBench | 38.75 |
| MatTools | 47.1 |
| MolBench-bind | 56.8 |

These can be added once the benchmark authors register their eval.yaml on the Hub.
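
The split between added and skipped benchmarks can be sketched as a lookup against a table of registered benchmarks. The `REGISTERED` mapping and the `partition` helper below are hypothetical; the PR does not say how registration was actually checked, so this only mirrors the outcome described above.

```python
# Scores copied from the benchmark lists in this PR description.
SCORES = {
    "HLE w/ tools": 47.6,
    "BrowseComp": 75.51,
    "XBench-DS-2510": 86.0,
    "Seal0": 56.36,
    "GAIA": 96.04,
    "SciCode": 44.33,
    "MLE-Lite": 43.94,
    "HiPhO": 46.4,
    "FrontierScience-Olympiad": 79.0,
    "FrontierScience-Research": 40.0,
    "IFBench": 80.61,
    "LongBench-v2": 60.2,
    "IFEval": 94.82,
    "τ²-Bench": 79.81,
    "VitaBench": 38.75,
    "MatTools": 47.1,
    "MolBench-bind": 56.8,
}

# Hypothetical registry: benchmark name -> (Hub dataset, task ID).
# Only the HLE mapping appears in this PR; a real check would query the Hub.
REGISTERED = {"HLE w/ tools": ("cais/hle", "hle")}


def partition(scores, registered):
    """Split scores into those with a registered Hub benchmark and the rest."""
    added, skipped = {}, {}
    for name, score in scores.items():
        if name in registered:
            dataset_id, task_id = registered[name]
            added[name] = {"score": score, "dataset": dataset_id, "task_id": task_id}
        else:
            skipped[name] = score
    return added, skipped


added, skipped = partition(SCORES, REGISTERED)
print(len(added), "added;", len(skipped), "skipped")
```

When a benchmark is later registered, adding it to the registry moves it to the added group without touching the scores.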

Source

Files Added

  • .eval_results/Agents-A1.yaml

Verification

These results were extracted from the official benchmark table published in the Agents-A1 model card. No verified token is provided because the evaluations were not run via HF Jobs with inspect-ai.

BoZhang changed pull request status to merged
