bump: downloading ground truth files
Signed-off-by: Rohan R. Arora <rohan.arora@ibm.com>
evaluation.ipynb (CHANGED, +36 -0)
```diff
@@ -82,6 +82,14 @@
    "outputs": [],
    "source": "# Paths\nLEADERBOARD_DIR = PROJECT_ROOT / \"ITBench-Trajectories\" / \"ReAct-Agent-Trajectories\"\nOUTPUT_BASE_DIR = PROJECT_ROOT / \"ITBench-Trajectories\" / \"output\"\n\n# Minimum runs per scenario required for inclusion\nMIN_RUNS_PER_SCENARIO = 2\n\n# Minimum scenarios needed after filtering\nMIN_QUALIFYING_SCENARIOS = 20\n\n# Success threshold for binary classification\nSUCCESS_THRESHOLD = 0.5"
   },
+  {
+   "cell_type": "code",
+   "id": "pz42i6nppa9",
+   "source": "# Create all output directories upfront\nOUTPUT_BASE_DIR.mkdir(parents=True, exist_ok=True)\n(OUTPUT_BASE_DIR / \"consistency\").mkdir(parents=True, exist_ok=True)\n(OUTPUT_BASE_DIR / \"inferences\").mkdir(parents=True, exist_ok=True)\n(OUTPUT_BASE_DIR / \"tool_failures\").mkdir(parents=True, exist_ok=True)\n(OUTPUT_BASE_DIR / \"discovery\").mkdir(parents=True, exist_ok=True)\n\nprint(f\"✓ Created output directories at: {OUTPUT_BASE_DIR}\")",
+   "metadata": {},
+   "execution_count": null,
+   "outputs": []
+  },
   {
    "cell_type": "markdown",
    "id": "1134e25a",
@@ -129,6 +137,34 @@
    "execution_count": null,
    "outputs": []
   },
+  {
+   "cell_type": "markdown",
+   "id": "7gpq7ct50cg",
+   "source": "## Download Ground Truth Data\n\nThe ground truth files contain the root cause entity information and aliases for each scenario.",
+   "metadata": {}
+  },
+  {
+   "cell_type": "code",
+   "id": "y3ffif24x",
+   "source": "!source /data/ITBench-SRE-Agent/.venv/bin/activate && hf download \\\n ibm-research/ITBench-Lite \\\n --repo-type dataset \\\n --include \"snapshots/sre/v0.2-*/Scenario-*/ground_truth.yaml\" \\\n --local-dir /data/ITBench-SRE-Agent/ITBench-Lite",
+   "metadata": {},
+   "execution_count": null,
+   "outputs": []
+  },
+  {
+   "cell_type": "markdown",
+   "id": "nxza58xw7v",
+   "source": "### Check Downloaded Ground Truth Data",
+   "metadata": {}
+  },
+  {
+   "cell_type": "code",
+   "id": "lg601pti47f",
+   "source": "!ls -lh /data/ITBench-SRE-Agent/ITBench-Lite/snapshots/sre/ | head -5",
+   "metadata": {},
+   "execution_count": null,
+   "outputs": []
+  },
   {
    "cell_type": "markdown",
    "id": "8b47a303",
```
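The added `pz42i6nppa9` cell calls `mkdir(parents=True, exist_ok=True)` once for `OUTPUT_BASE_DIR` and once for each of the four subdirectories. A minimal sketch of the same step written as a loop follows, assuming only the subdirectory names that appear in that cell; `create_output_dirs` and `OUTPUT_SUBDIRS` are hypothetical names, not part of the notebook.

```python
import tempfile
from pathlib import Path

# Subdirectory names copied from the added notebook cell; the tuple name is hypothetical.
OUTPUT_SUBDIRS = ("consistency", "inferences", "tool_failures", "discovery")


def create_output_dirs(base_dir: Path, subdirs=OUTPUT_SUBDIRS) -> list[Path]:
    """Create base_dir and each named subdirectory beneath it.

    exist_ok=True makes the call idempotent, so re-running the cell does not raise.
    Returns the subdirectory paths in the order given.
    """
    base_dir.mkdir(parents=True, exist_ok=True)
    created = []
    for name in subdirs:
        path = base_dir / name
        path.mkdir(parents=True, exist_ok=True)
        created.append(path)
    return created


if __name__ == "__main__":
    # Demonstrate in a throwaway directory instead of the real OUTPUT_BASE_DIR.
    with tempfile.TemporaryDirectory() as tmp:
        out = Path(tmp) / "ITBench-Trajectories" / "output"
        for p in create_output_dirs(out):
            print(p.relative_to(out))
```

Keeping the names in one tuple means a new output category only needs one edit; the explicit cell in the commit is equally correct, just more repetitive.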
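The "Check Downloaded Ground Truth Data" cell runs `ls -lh .../snapshots/sre/ | head -5`, which shows at most the first few entries of that directory, not whether every scenario's file arrived. A stricter check is to count the `ground_truth.yaml` files that match the layout requested by the `--include` pattern. The sketch below assumes that layout under the `--local-dir` path from the notebook; `find_ground_truth_files` and `group_by_snapshot` are hypothetical helpers, not part of the notebook or of the `hf` CLI.

```python
from pathlib import Path

# Same shape as the --include pattern passed to `hf download` in the notebook cell.
GROUND_TRUTH_GLOB = "snapshots/sre/v0.2-*/Scenario-*/ground_truth.yaml"


def find_ground_truth_files(local_dir: Path) -> list[Path]:
    """Return every ground_truth.yaml under local_dir that matches the download layout, sorted."""
    return sorted(local_dir.glob(GROUND_TRUTH_GLOB))


def group_by_snapshot(files: list[Path]) -> dict[str, list[str]]:
    """Map each v0.2-* snapshot directory name to the Scenario-* directory names it contains."""
    groups: dict[str, list[str]] = {}
    for f in files:
        scenario = f.parent.name        # Scenario-* directory
        version = f.parent.parent.name  # v0.2-* directory
        groups.setdefault(version, []).append(scenario)
    return groups


if __name__ == "__main__":
    # --local-dir value from the notebook cell; adjust if the dataset lives elsewhere.
    root = Path("/data/ITBench-SRE-Agent/ITBench-Lite")
    files = find_ground_truth_files(root)
    print(f"Found {len(files)} ground_truth.yaml files")
    for version, scenarios in group_by_snapshot(files).items():
        print(f"{version}: {len(scenarios)} scenarios")
```

One design note: `pathlib`'s `*` never crosses a `/`, whereas the fnmatch-style matching behind `--include` can, so this check is slightly stricter than the download filter.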