---
title: TRAIL Leaderboard
emoji: 🥇
colorFrom: green
colorTo: indigo
sdk: gradio
sdk_version: 5.19.0
app_file: app.py
pinned: true
license: mit
short_description: Trace Reasoning and Agentic Issue Localization Leaderboard
tags:
  - leaderboard
---
# Model Performance Leaderboard

This Hugging Face Space hosts a leaderboard for comparing model performance across the metrics of the TRAIL dataset.

## Features

- **Submit Your Answers**: Run your model on the TRAIL dataset and submit your results.
- **Leaderboard**: View how your submissions are ranked.
## Instructions

* Please refer to our GitHub repository at https://github.com/patronus-ai/trail-benchmark for step-by-step instructions on how to run your model on the TRAIL dataset.
* Please upload a zip file containing your model outputs. The zip file should contain:
  - One or more directories with model outputs
  - JSON files with the model's predictions inside each directory
  - Directory names that indicate the split (GAIA_ or SWE_)
* Once the evaluation is complete, we'll upload the scores (this process will soon be automated).
## Benchmarking on TRAIL

[TRAIL (Trace Reasoning and Agentic Issue Localization)](https://arxiv.org/abs/2505.08638) is a benchmark dataset of 148 annotated AI agent execution traces containing 841 errors across reasoning, execution, and planning categories. Created from real-world software engineering and information retrieval tasks, it challenges even state-of-the-art LLMs: the best model achieves only 11% accuracy, highlighting the difficulty of trace debugging for complex agent workflows.
## License

This project is open source and available under the MIT license.