ITBench-Lite

rohan-arora-ibm committed on Jan 29

Commit

47fbb47

unverified

1 Parent(s): 4b6fc62

bump: consumability improvements

Signed-off-by: Rohan R. Arora <rohan.arora@ibm.com>

Files changed (1)

evaluation.ipynb +24 -24

evaluation.ipynb

@@ -5,8 +5,7 @@
    "id": "84c36161",
    "metadata": {},
    "source": [
-    "# AAAI Lab: Developing AI Agents for IT Automation Tasks with ITBench\n",
-    "## Evaluating Agents"
    ]
   },
   {
@@ -22,8 +21,20 @@
   {
    "cell_type": "markdown",
    "id": "ki5vh4cb61j",
-   "source": "## ⚠️ Important: Duplicate This Space First (If You Have Not!)\n\n**You need your own copy to run evaluations:**\n\n1. Click the **⋮ menu** at the top → **Duplicate this Space**\n2. In **your Space**, go to **Settings → Repository secrets**\n3. Add secret: `OPENROUTER_API_KEY` = your key from https://openrouter.ai\n\nThe API key is required for LLM-as-a-Judge evaluations.\n\n---",
-   "metadata": {}
   },
   {
    "cell_type": "markdown",
@@ -41,24 +52,6 @@
     "## Installation and Imports"
    ]
   },
-  {
-   "cell_type": "code",
-   "execution_count": 41,
-   "id": "uhbkrofupp",
-   "metadata": {},
-   "outputs": [
-    {
-     "name": "stdout",
-     "output_type": "stream",
-     "text": [
-      "\u001b[2mAudited \u001b[1m7 packages\u001b[0m \u001b[2min 226ms\u001b[0m\u001b[0m\n"
-     ]
-    }
-   ],
-   "source": [
-    "!cd /data/ITBench-SRE-Agent && /home/user/.local/bin/uv pip install matplotlib seaborn numpy pandas plotly tqdm pyyaml"
-   ]
-  },
   {
    "cell_type": "code",
    "execution_count": 42,
@@ -283,7 +276,14 @@
    "id": "63a13ce6-3d7b-478c-a4cc-1bf6d9924ab2",
    "metadata": {},
    "outputs": [],
-   "source": "!cd ITBench-SRE-Agent/ && export JUDGE_BASE_URL=\"https://openrouter.ai/api/v1\" && export JUDGE_MODEL=\"google/gemini-2.5-pro\" && export JUDGE_API_KEY=\"$OPENROUTER_API_KEY\" && /home/user/.local/bin/uv run itbench-eval \\\n  --ground-truth ./ITBench-Lite/snapshots/sre/v0.2-B96DF826-4BB2-4B62-97AB-6D84254C53D7 \\\n  --outputs ./outputs/agent_outputs \\\n  --result-file ./outputs/evaluation_results.json\n\n# Note: OPENROUTER_API_KEY should be set as a Space secret (see instructions at top of notebook)"
   },
   {
    "cell_type": "markdown",
@@ -3457,4 +3457,4 @@
  },
  "nbformat": 4,
  "nbformat_minor": 5
-}

    "id": "84c36161",
    "metadata": {},
    "source": [
+    "# Developing and Evaluating AI Agents for IT Automation Tasks with ITBench"
    ]
   },
   {
   {
    "cell_type": "markdown",
    "id": "ki5vh4cb61j",
+   "metadata": {},
+   "source": [
+    "## ⚠️ Important: Duplicate This Space First (If You Have Not!)\n",
+    "\n",
+    "**You need your own copy to run evaluations:**\n",
+    "\n",
+    "1. Click the **⋮ menu** at the top → **Duplicate this Space**\n",
+    "2. In **your Space**, go to **Settings → Repository secrets**\n",
+    "3. Add secret: `OPENROUTER_API_KEY` = your key from https://openrouter.ai\n",
+    "\n",
+    "The API key is required for LLM-as-a-Judge evaluations.\n",
+    "\n",
+    "---"
+   ]
   },
   {
    "cell_type": "markdown",
     "## Installation and Imports"
    ]
   },
   {
    "cell_type": "code",
    "execution_count": 42,
    "id": "63a13ce6-3d7b-478c-a4cc-1bf6d9924ab2",
    "metadata": {},
    "outputs": [],
+   "source": [
+    "!cd ITBench-SRE-Agent/ && export JUDGE_BASE_URL=\"https://openrouter.ai/api/v1\" && export JUDGE_MODEL=\"google/gemini-2.5-pro\" && export JUDGE_API_KEY=\"$OPENROUTER_API_KEY\" && /home/user/.local/bin/uv run itbench-eval \\\n",
+    "  --ground-truth ./ITBench-Lite/snapshots/sre/v0.2-B96DF826-4BB2-4B62-97AB-6D84254C53D7 \\\n",
+    "  --outputs ./outputs/agent_outputs \\\n",
+    "  --result-file ./outputs/evaluation_results.json\n",
+    "\n",
+    "# Note: OPENROUTER_API_KEY should be set as a Space secret (see instructions at top of notebook)"
+   ]
   },
   {
    "cell_type": "markdown",
  },
  "nbformat": 4,
  "nbformat_minor": 5
+}
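
Two of the hunks above only change how a cell's `source` is stored: the old side holds one string with embedded `\n`, the new side holds a list of lines. The notebook format accepts both forms, so the rendered cell should be unchanged. A minimal sketch of that equivalence follows, using a shortened stand-in for the cell text; the helper name `normalize_source` is hypothetical and not part of the commit or of any library.

```python
# Shortened stand-in for a markdown cell's text (not the real cell content).
old_source = "## Heading\n\nBody line.\n\n---"                    # single-string form
new_source = ["## Heading\n", "\n", "Body line.\n", "\n", "---"]  # list-of-lines form


def normalize_source(source):
    """Return a cell's source as one string, whichever form it is stored in.

    Hypothetical helper for illustration only.
    """
    return source if isinstance(source, str) else "".join(source)


# Both storage forms describe the same cell text.
assert normalize_source(old_source) == normalize_source(new_source)

# Splitting on line boundaries while keeping the newlines reproduces the list form.
assert old_source.splitlines(keepends=True) == new_source
```

Applying the same comparison to the real cells (`ki5vh4cb61j` and `63a13ce6-3d7b-478c-a4cc-1bf6d9924ab2`) is one way to confirm that those hunks are formatting-only.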