bump: consumability improvements
Signed-off-by: Rohan R. Arora <rohan.arora@ibm.com>
- evaluation.ipynb +24 -24
evaluation.ipynb
CHANGED
@@ -5,8 +5,7 @@
 "id": "84c36161",
 "metadata": {},
 "source": [
-"#
-"## Evaluating Agents"
+"# Developing and Evaluating AI Agents for IT Automation Tasks with ITBench"
 ]
 },
 {
@@ -22,8 +21,20 @@
 {
 "cell_type": "markdown",
 "id": "ki5vh4cb61j",
-"
-"
+"metadata": {},
+"source": [
+"## ⚠️ Important: Duplicate This Space First (If You Have Not!)\n",
+"\n",
+"**You need your own copy to run evaluations:**\n",
+"\n",
+"1. Click the **⋮ menu** at the top → **Duplicate this Space**\n",
+"2. In **your Space**, go to **Settings → Repository secrets**\n",
+"3. Add secret: `OPENROUTER_API_KEY` = your key from https://openrouter.ai\n",
+"\n",
+"The API key is required for LLM-as-a-Judge evaluations.\n",
+"\n",
+"---"
+]
 },
 {
 "cell_type": "markdown",
@@ -41,24 +52,6 @@
 "## Installation and Imports"
 ]
 },
-{
-"cell_type": "code",
-"execution_count": 41,
-"id": "uhbkrofupp",
-"metadata": {},
-"outputs": [
-{
-"name": "stdout",
-"output_type": "stream",
-"text": [
-"\u001b[2mAudited \u001b[1m7 packages\u001b[0m \u001b[2min 226ms\u001b[0m\u001b[0m\n"
-]
-}
-],
-"source": [
-"!cd /data/ITBench-SRE-Agent && /home/user/.local/bin/uv pip install matplotlib seaborn numpy pandas plotly tqdm pyyaml"
-]
-},
 {
 "cell_type": "code",
 "execution_count": 42,
@@ -283,7 +276,14 @@
 "id": "63a13ce6-3d7b-478c-a4cc-1bf6d9924ab2",
 "metadata": {},
 "outputs": [],
-"source":
+"source": [
+"!cd ITBench-SRE-Agent/ && export JUDGE_BASE_URL=\"https://openrouter.ai/api/v1\" && export JUDGE_MODEL=\"google/gemini-2.5-pro\" && export JUDGE_API_KEY=\"$OPENROUTER_API_KEY\" && /home/user/.local/bin/uv run itbench-eval \\\n",
+"  --ground-truth ./ITBench-Lite/snapshots/sre/v0.2-B96DF826-4BB2-4B62-97AB-6D84254C53D7 \\\n",
+"  --outputs ./outputs/agent_outputs \\\n",
+"  --result-file ./outputs/evaluation_results.json\n",
+"\n",
+"# Note: OPENROUTER_API_KEY should be set as a Space secret (see instructions at top of notebook)"
+]
 },
 {
 "cell_type": "markdown",
@@ -3457,4 +3457,4 @@
 },
 "nbformat": 4,
 "nbformat_minor": 5
-}
+}
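
Review note: the new evaluation cell exports `JUDGE_API_KEY` from `$OPENROUTER_API_KEY` without checking that the secret exists, so a missing secret would only surface once `itbench-eval` calls the judge. A minimal sketch of a pre-flight check follows, assuming the same variable names, flags, and paths that appear in the diff. The helper names `build_judge_env` and `build_eval_command` are hypothetical and are not part of ITBench.

```python
import os

# Values copied from the evaluation cell in the diff.
JUDGE_BASE_URL = "https://openrouter.ai/api/v1"
JUDGE_MODEL = "google/gemini-2.5-pro"


def build_judge_env(environ):
    """Return a copy of `environ` with the JUDGE_* variables filled in.

    Hypothetical helper: raises early if OPENROUTER_API_KEY is missing or
    blank, instead of letting the evaluation fail later.
    """
    api_key = environ.get("OPENROUTER_API_KEY", "").strip()
    if not api_key:
        raise RuntimeError(
            "OPENROUTER_API_KEY is not set; add it as a secret in your "
            "duplicated Space before running the evaluation."
        )
    env = dict(environ)
    env.update(
        JUDGE_BASE_URL=JUDGE_BASE_URL,
        JUDGE_MODEL=JUDGE_MODEL,
        JUDGE_API_KEY=api_key,
    )
    return env


def build_eval_command(uv_path="/home/user/.local/bin/uv"):
    """Return the itbench-eval invocation from the diff as an argument list."""
    return [
        uv_path, "run", "itbench-eval",
        "--ground-truth",
        "./ITBench-Lite/snapshots/sre/v0.2-B96DF826-4BB2-4B62-97AB-6D84254C53D7",
        "--outputs", "./outputs/agent_outputs",
        "--result-file", "./outputs/evaluation_results.json",
    ]
```

In a notebook cell this could be run with `subprocess.run(build_eval_command(), env=build_judge_env(os.environ), cwd="ITBench-SRE-Agent", check=True)`, which mirrors the `cd` and `export` chain in the shell one-liner while keeping the key out of the command string.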