rohan-arora-ibm committed on
Commit 05cd483 · unverified · 1 parent: 7f74217

bump: downloading ground truth files


Signed-off-by: Rohan R. Arora <rohan.arora@ibm.com>

Files changed (1)
  1. evaluation.ipynb +36 -0
evaluation.ipynb CHANGED
@@ -82,6 +82,14 @@
  "outputs": [],
  "source": "# Paths\nLEADERBOARD_DIR = PROJECT_ROOT / \"ITBench-Trajectories\" / \"ReAct-Agent-Trajectories\"\nOUTPUT_BASE_DIR = PROJECT_ROOT / \"ITBench-Trajectories\" / \"output\"\n\n# Minimum runs per scenario required for inclusion\nMIN_RUNS_PER_SCENARIO = 2\n\n# Minimum scenarios needed after filtering\nMIN_QUALIFYING_SCENARIOS = 20\n\n# Success threshold for binary classification\nSUCCESS_THRESHOLD = 0.5"
  },
+ {
+ "cell_type": "code",
+ "id": "pz42i6nppa9",
+ "source": "# Create all output directories upfront\nOUTPUT_BASE_DIR.mkdir(parents=True, exist_ok=True)\n(OUTPUT_BASE_DIR / \"consistency\").mkdir(parents=True, exist_ok=True)\n(OUTPUT_BASE_DIR / \"inferences\").mkdir(parents=True, exist_ok=True)\n(OUTPUT_BASE_DIR / \"tool_failures\").mkdir(parents=True, exist_ok=True)\n(OUTPUT_BASE_DIR / \"discovery\").mkdir(parents=True, exist_ok=True)\n\nprint(f\"✓ Created output directories at: {OUTPUT_BASE_DIR}\")",
+ "metadata": {},
+ "execution_count": null,
+ "outputs": []
+ },
  {
  "cell_type": "markdown",
  "id": "1134e25a",
@@ -129,6 +137,34 @@
  "execution_count": null,
  "outputs": []
  },
+ {
+ "cell_type": "markdown",
+ "id": "7gpq7ct50cg",
+ "source": "## Download Ground Truth Data\n\nThe ground truth files contain the root cause entity information and aliases for each scenario.",
+ "metadata": {}
+ },
+ {
+ "cell_type": "code",
+ "id": "y3ffif24x",
+ "source": "!source /data/ITBench-SRE-Agent/.venv/bin/activate && hf download \\\n ibm-research/ITBench-Lite \\\n --repo-type dataset \\\n --include \"snapshots/sre/v0.2-*/Scenario-*/ground_truth.yaml\" \\\n --local-dir /data/ITBench-SRE-Agent/ITBench-Lite",
+ "metadata": {},
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "id": "nxza58xw7v",
+ "source": "### Check Downloaded Ground Truth Data",
+ "metadata": {}
+ },
+ {
+ "cell_type": "code",
+ "id": "lg601pti47f",
+ "source": "!ls -lh /data/ITBench-SRE-Agent/ITBench-Lite/snapshots/sre/ | head -5",
+ "metadata": {},
+ "execution_count": null,
+ "outputs": []
+ },
  {
  "cell_type": "markdown",
  "id": "8b47a303",
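The new cells create the output directories one `mkdir` call at a time, then shell out to the `hf` CLI and `ls` to fetch and inspect the ground truth files. A Python-only variant of those three steps is sketched below as a possible alternative, not as what the commit does. It assumes the `huggingface_hub` package (which ships the `hf` CLI) is installed in the kernel's environment, and that `snapshot_download(..., allow_patterns=...)` filters repository paths the same way `--include` does, with `fnmatch`-style globbing where `*` can also match `/`. The helper names `create_output_dirs`, `matches_include`, `download_ground_truth` and `find_ground_truth_files` are hypothetical; the repo id, glob and subdirectory names are copied from the diff.

```python
import fnmatch
from pathlib import Path

# Subdirectories created by the "Create all output directories upfront" cell.
OUTPUT_SUBDIRS = ("consistency", "inferences", "tool_failures", "discovery")

# Same glob the commit passes to `hf download --include`.
GROUND_TRUTH_PATTERN = "snapshots/sre/v0.2-*/Scenario-*/ground_truth.yaml"


def create_output_dirs(base_dir: Path) -> list[Path]:
    """Create base_dir and every analysis subdirectory; safe to re-run."""
    created = []
    for path in [base_dir, *(base_dir / name for name in OUTPUT_SUBDIRS)]:
        # exist_ok=True makes repeated notebook runs a no-op instead of an error.
        path.mkdir(parents=True, exist_ok=True)
        created.append(path)
    return created


def matches_include(repo_path: str, pattern: str = GROUND_TRUTH_PATTERN) -> bool:
    """Approximate the --include filter with fnmatch (where '*' also matches '/')."""
    return fnmatch.fnmatch(repo_path, pattern)


def download_ground_truth(local_dir: Path) -> Path:
    """Fetch only the ground_truth.yaml files from the ITBench-Lite dataset repo."""
    # Imported lazily so the other helpers work without huggingface_hub installed.
    from huggingface_hub import snapshot_download

    path = snapshot_download(
        repo_id="ibm-research/ITBench-Lite",
        repo_type="dataset",
        allow_patterns=[GROUND_TRUTH_PATTERN],
        local_dir=local_dir,
    )
    return Path(path)


def find_ground_truth_files(local_dir: Path) -> list[Path]:
    """List the downloaded ground_truth.yaml files, sorted for stable output."""
    # Path.glob's '*' stays within one path segment, matching the expected layout.
    return sorted(local_dir.glob(GROUND_TRUTH_PATTERN))
```

In the notebook this could be used as `create_output_dirs(OUTPUT_BASE_DIR)`, followed by `download_ground_truth(Path("/data/ITBench-SRE-Agent/ITBench-Lite"))`. Checking `len(find_ground_truth_files(...))` then gives a scenario count that can be compared with `MIN_QUALIFYING_SCENARIOS`, which is more informative than the first five lines of `ls -lh`. Running the download inside the kernel also avoids activating a separate virtualenv from a shell escape.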