ITBench-Lite

rohan-arora-ibm committed on Jan 19

Commit 05cd483 · unverified

1 Parent(s): 7f74217

bump: downloading ground truth files

Signed-off-by: Rohan R. Arora <rohan.arora@ibm.com>

Files changed (1)

evaluation.ipynb +36 -0

@@ -82,6 +82,14 @@
    "outputs": [],
    "source": "# Paths\nLEADERBOARD_DIR = PROJECT_ROOT / \"ITBench-Trajectories\" / \"ReAct-Agent-Trajectories\"\nOUTPUT_BASE_DIR = PROJECT_ROOT / \"ITBench-Trajectories\" / \"output\"\n\n# Minimum runs per scenario required for inclusion\nMIN_RUNS_PER_SCENARIO = 2\n\n# Minimum scenarios needed after filtering\nMIN_QUALIFYING_SCENARIOS = 20\n\n# Success threshold for binary classification\nSUCCESS_THRESHOLD = 0.5"
   },
+  {
+   "cell_type": "code",
+   "id": "pz42i6nppa9",
+   "source": "# Create all output directories upfront\nOUTPUT_BASE_DIR.mkdir(parents=True, exist_ok=True)\n(OUTPUT_BASE_DIR / \"consistency\").mkdir(parents=True, exist_ok=True)\n(OUTPUT_BASE_DIR / \"inferences\").mkdir(parents=True, exist_ok=True)\n(OUTPUT_BASE_DIR / \"tool_failures\").mkdir(parents=True, exist_ok=True)\n(OUTPUT_BASE_DIR / \"discovery\").mkdir(parents=True, exist_ok=True)\n\nprint(f\"✓ Created output directories at: {OUTPUT_BASE_DIR}\")",
+   "metadata": {},
+   "execution_count": null,
+   "outputs": []
+  },
   {
    "cell_type": "markdown",
    "id": "1134e25a",
@@ -129,6 +137,34 @@
    "execution_count": null,
    "outputs": []
   },
+  {
+   "cell_type": "markdown",
+   "id": "7gpq7ct50cg",
+   "source": "## Download Ground Truth Data\n\nThe ground truth files contain the root cause entity information and aliases for each scenario.",
+   "metadata": {}
+  },
+  {
+   "cell_type": "code",
+   "id": "y3ffif24x",
+   "source": "!source /data/ITBench-SRE-Agent/.venv/bin/activate && hf download \\\n  ibm-research/ITBench-Lite \\\n  --repo-type dataset \\\n  --include \"snapshots/sre/v0.2-*/Scenario-*/ground_truth.yaml\" \\\n  --local-dir /data/ITBench-SRE-Agent/ITBench-Lite",
+   "metadata": {},
+   "execution_count": null,
+   "outputs": []
+  },
+  {
+   "cell_type": "markdown",
+   "id": "nxza58xw7v",
+   "source": "### Check Downloaded Ground Truth Data",
+   "metadata": {}
+  },
+  {
+   "cell_type": "code",
+   "id": "lg601pti47f",
+   "source": "!ls -lh /data/ITBench-SRE-Agent/ITBench-Lite/snapshots/sre/ | head -5",
+   "metadata": {},
+   "execution_count": null,
+   "outputs": []
+  },
   {
    "cell_type": "markdown",
    "id": "8b47a303",