Update README.md
README.md
CHANGED
@@ -23,13 +23,13 @@ This is a Hugging Face Space that hosts a leaderboard for comparing model perfor

 ## Instructions

-1. Please refer to our GitHub repository at
+1. Please refer to our GitHub repository at https://github.com/patronus-ai/trail-benchmark for step‑by‑step instructions on how to run your model with the TRAIL dataset.
 2. Compress the resulting JSON outputs into a ZIP archive whose filename begins with SWE_/GAIA_, and submit it.
 3. Once the evaluation is complete, we’ll upload the scores (this process will soon be automated).

 ## Benchmarking on TRAIL

-TRAIL(Trace Reasoning and Agentic Issue Localization) is a benchmark dataset of 148 annotated AI agent execution traces containing 841 errors across reasoning, execution, and planning categories. Created from real-world software engineering and information retrieval tasks, it challenges even state-of-the-art LLMs, with the best model achieving only 11% accuracy, highlighting the difficulty of trace debugging for complex agent workflows.
+[TRAIL(Trace Reasoning and Agentic Issue Localization)](https://arxiv.org/abs/2505.08638) is a benchmark dataset of 148 annotated AI agent execution traces containing 841 errors across reasoning, execution, and planning categories. Created from real-world software engineering and information retrieval tasks, it challenges even state-of-the-art LLMs, with the best model achieving only 11% accuracy, highlighting the difficulty of trace debugging for complex agent workflows.

 ## License

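Step 2 of the README instructions (compressing the JSON outputs into a ZIP archive whose filename begins with SWE_ or GAIA_) could be scripted roughly as follows. This is a minimal sketch, not the repository's own tooling: the function name `package_outputs`, the `_outputs.zip` suffix, the flat archive layout, and the assumption that outputs are `*.json` files in a single directory are all hypothetical. Only the SWE_/GAIA_ filename prefix comes from the README.

```python
import zipfile
from pathlib import Path


def package_outputs(output_dir: str, split: str, archive_dir: str = ".") -> Path:
    """Zip every .json file in output_dir into <split>_outputs.zip.

    Hypothetical helper: the README only requires that the archive
    filename begin with SWE_ or GAIA_; everything else is an assumption.
    """
    if split not in ("SWE", "GAIA"):
        raise ValueError("split must be 'SWE' or 'GAIA'")

    json_files = sorted(Path(output_dir).glob("*.json"))
    if not json_files:
        raise FileNotFoundError(f"no JSON files found in {output_dir}")

    archive_path = Path(archive_dir) / f"{split}_outputs.zip"
    with zipfile.ZipFile(archive_path, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        for path in json_files:
            # Store only the file name so the archive is flat (assumed layout).
            zf.write(path, arcname=path.name)
    return archive_path
```

Calling `package_outputs("outputs/gaia", "GAIA")` would write `GAIA_outputs.zip` to the current directory. Check the GitHub repository's instructions for the directory structure the evaluation actually expects before submitting.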