src/about.py +2 -2
--- a/src/about.py
+++ b/src/about.py
@@ -56,8 +56,8 @@ These benchmarks go beyond basic reasoning and evaluate more advanced, autonomou
 | Benchmark | Description |
 |-----------------------|----------------------------------------------------------------------------|
 | **GAIA** | Evaluates autonomous reasoning, planning, problem-solving for question answering. |
-
-| **In-House-CTF** | Capture-the-flag challenge testing security skills. |
+| **InterCode-CTF** | Capture-the-flag challenge testing cyber-security skills. |
+| **In-House-CTF** | Capture-the-flag challenge testing cyber-security skills. |
 | **AgentHarm** / **AgentHarm-Benign** | Measures harmfulness of LLM agents (and benign behavior baseline). |
 | **SWE-Bench-Verified** | Tests AI agent ability to solve software engineering tasks. |
 </div>