rohan-arora-ibm committed
Commit 47fbb47 · unverified · 1 Parent(s): 4b6fc62

bump: consumability improvements

Signed-off-by: Rohan R. Arora <rohan.arora@ibm.com>

Files changed (1)
  1. evaluation.ipynb +24 -24
evaluation.ipynb CHANGED
@@ -5,8 +5,7 @@
5
  "id": "84c36161",
6
  "metadata": {},
7
  "source": [
8
- "# AAAI Lab: Developing AI Agents for IT Automation Tasks with ITBench\n",
9
- "## Evaluating Agents"
10
  ]
11
  },
12
  {
@@ -22,8 +21,20 @@
22
  {
23
  "cell_type": "markdown",
24
  "id": "ki5vh4cb61j",
25
- "source": "## ⚠️ Important: Duplicate This Space First (If You Have Not!)\n\n**You need your own copy to run evaluations:**\n\n1. Click the **⋮ menu** at the top → **Duplicate this Space**\n2. In **your Space**, go to **Settings → Repository secrets**\n3. Add secret: `OPENROUTER_API_KEY` = your key from https://openrouter.ai\n\nThe API key is required for LLM-as-a-Judge evaluations.\n\n---",
26
- "metadata": {}
27
  },
28
  {
29
  "cell_type": "markdown",
@@ -41,24 +52,6 @@
41
  "## Installation and Imports"
42
  ]
43
  },
44
- {
45
- "cell_type": "code",
46
- "execution_count": 41,
47
- "id": "uhbkrofupp",
48
- "metadata": {},
49
- "outputs": [
50
- {
51
- "name": "stdout",
52
- "output_type": "stream",
53
- "text": [
54
- "\u001b[2mAudited \u001b[1m7 packages\u001b[0m \u001b[2min 226ms\u001b[0m\u001b[0m\n"
55
- ]
56
- }
57
- ],
58
- "source": [
59
- "!cd /data/ITBench-SRE-Agent && /home/user/.local/bin/uv pip install matplotlib seaborn numpy pandas plotly tqdm pyyaml"
60
- ]
61
- },
62
  {
63
  "cell_type": "code",
64
  "execution_count": 42,
@@ -283,7 +276,14 @@
283
  "id": "63a13ce6-3d7b-478c-a4cc-1bf6d9924ab2",
284
  "metadata": {},
285
  "outputs": [],
286
- "source": "!cd ITBench-SRE-Agent/ && export JUDGE_BASE_URL=\"https://openrouter.ai/api/v1\" && export JUDGE_MODEL=\"google/gemini-2.5-pro\" && export JUDGE_API_KEY=\"$OPENROUTER_API_KEY\" && /home/user/.local/bin/uv run itbench-eval \\\n --ground-truth ./ITBench-Lite/snapshots/sre/v0.2-B96DF826-4BB2-4B62-97AB-6D84254C53D7 \\\n --outputs ./outputs/agent_outputs \\\n --result-file ./outputs/evaluation_results.json\n\n# Note: OPENROUTER_API_KEY should be set as a Space secret (see instructions at top of notebook)"
287
  },
288
  {
289
  "cell_type": "markdown",
@@ -3457,4 +3457,4 @@
3457
  },
3458
  "nbformat": 4,
3459
  "nbformat_minor": 5
3460
- }
 
5
  "id": "84c36161",
6
  "metadata": {},
7
  "source": [
8
+ "# Developing and Evaluating AI Agents for IT Automation Tasks with ITBench"
9
  ]
10
  },
11
  {
 
21
  {
22
  "cell_type": "markdown",
23
  "id": "ki5vh4cb61j",
24
+ "metadata": {},
25
+ "source": [
26
+ "## ⚠️ Important: Duplicate This Space First (If You Have Not!)\n",
27
+ "\n",
28
+ "**You need your own copy to run evaluations:**\n",
29
+ "\n",
30
+ "1. Click the **⋮ menu** at the top → **Duplicate this Space**\n",
31
+ "2. In **your Space**, go to **Settings → Repository secrets**\n",
32
+ "3. Add secret: `OPENROUTER_API_KEY` = your key from https://openrouter.ai\n",
33
+ "\n",
34
+ "The API key is required for LLM-as-a-Judge evaluations.\n",
35
+ "\n",
36
+ "---"
37
+ ]
38
  },
39
  {
40
  "cell_type": "markdown",
 
52
  "## Installation and Imports"
53
  ]
54
  },
55
  {
56
  "cell_type": "code",
57
  "execution_count": 42,
 
276
  "id": "63a13ce6-3d7b-478c-a4cc-1bf6d9924ab2",
277
  "metadata": {},
278
  "outputs": [],
279
+ "source": [
280
+ "!cd ITBench-SRE-Agent/ && export JUDGE_BASE_URL=\"https://openrouter.ai/api/v1\" && export JUDGE_MODEL=\"google/gemini-2.5-pro\" && export JUDGE_API_KEY=\"$OPENROUTER_API_KEY\" && /home/user/.local/bin/uv run itbench-eval \\\n",
281
+ " --ground-truth ./ITBench-Lite/snapshots/sre/v0.2-B96DF826-4BB2-4B62-97AB-6D84254C53D7 \\\n",
282
+ " --outputs ./outputs/agent_outputs \\\n",
283
+ " --result-file ./outputs/evaluation_results.json\n",
284
+ "\n",
285
+ "# Note: OPENROUTER_API_KEY should be set as a Space secret (see instructions at top of notebook)"
286
+ ]
287
  },
288
  {
289
  "cell_type": "markdown",
 
3457
  },
3458
  "nbformat": 4,
3459
  "nbformat_minor": 5
3460
+ }
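
The evaluation cell in this diff exports three `JUDGE_*` variables derived from the `OPENROUTER_API_KEY` Space secret before invoking `itbench-eval`. As a minimal sketch (not part of the commit), the same mapping could be checked in Python before launching the run. The helper name `build_judge_env` is hypothetical; the variable names, base URL, and model string are copied from the cell above.

```python
import os


def build_judge_env(environ):
    """Return the JUDGE_* variables that the notebook cell exports.

    `build_judge_env` is a hypothetical helper for illustration only;
    the keys and values mirror the shell exports shown in the diff.
    """
    api_key = environ.get("OPENROUTER_API_KEY", "").strip()
    if not api_key:
        # Fail early with a clear message instead of a failed judge call later.
        raise RuntimeError(
            "OPENROUTER_API_KEY is not set; add it as a Space secret first."
        )
    return {
        "JUDGE_BASE_URL": "https://openrouter.ai/api/v1",
        "JUDGE_MODEL": "google/gemini-2.5-pro",
        "JUDGE_API_KEY": api_key,
    }


if __name__ == "__main__":
    try:
        judge_env = build_judge_env(os.environ)
        print("Judge model:", judge_env["JUDGE_MODEL"])
    except RuntimeError as err:
        print("Cannot run evaluation:", err)
```

Running this check before the evaluation cell would surface a missing secret immediately, which is the failure mode the notebook's "Duplicate This Space First" note warns about.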