samrat-rm commited on
Commit
bece8d8
Β·
1 Parent(s): 8e889bd

feat: updating the readme

Browse files
Files changed (1) hide show
  1. README.md +200 -51
README.md CHANGED
@@ -1,6 +1,6 @@
1
  ---
2
  title: WhyDidItFail Environment Server
3
- emoji: πŸ”
4
  colorFrom: red
5
  colorTo: indigo
6
  sdk: docker
@@ -11,27 +11,185 @@ tags:
11
  - openenv
12
  ---
13
 
14
- # WhyDidItFail β€” ML Training Failure Diagnosis Environment
15
 
16
- An OpenEnv environment where an AI agent must diagnose why a machine learning training run failed. The agent inspects logs, configs, and gradient statistics to identify the root cause and suggest a fix.
17
 
18
- ## Overview
19
 
20
- Real ML engineers spend significant time debugging failed training runs. This environment simulates that workflow: the agent receives partial observability (it must decide what to inspect) and must reason sequentially from evidence to diagnosis.
21
 
22
- **12 realistic failure modes** across 3 difficulty tiers:
23
- - **Easy**: identify failure from training logs only (loss/accuracy curves)
24
- - **Medium**: identify failure from logs + hyperparameter config
25
- - **Hard**: identify failure from logs + config + gradient norm data, and provide a concrete fix
26
 
27
- ## Failure Modes
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
28
 
29
- | Category | Failure Mode |
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
30
  |---|---|
31
- | Optimization | exploding gradients, vanishing gradients, learning rate too high/low |
32
- | Regularization | overfitting, missing regularization |
33
- | Architecture | dying relu, bad weight initialization |
34
- | Configuration | optimizer misconfiguration, batch size too small, lr scheduler misconfiguration |
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
35
 
36
  ## Action Space
37
 
@@ -52,46 +210,17 @@ Each step returns a `WhyDidItFailObservation` with:
52
  - `reward` β€” step-level reward
53
  - `done` β€” episode termination flag
54
 
55
- ## Reward Function
56
-
57
- Rewards are provided throughout the episode, not just at completion:
58
-
59
- | Component | Weight | Signal |
60
- |---|---|---|
61
- | Diagnosis score | 0.70 | Correct failure mode label (exact match = 0.40 base, fuzzy = 0.10 per category keyword) |
62
- | Evidence score | 0.15 | Inspected required sources; penalizes missing or irrelevant inspections |
63
- | Efficiency score | 0.15 | Minimal steps to diagnosis; decays for wasted actions |
64
- | Fix bonus | +0.15 | Keyword match on suggested fix (capped at 1.0 total) |
65
-
66
- Step-level rewards during inspection: +0.10 / +0.07 / +0.05 for each required source discovered (decaying). Re-inspection: βˆ’0.05. Irrelevant inspection: βˆ’0.03.
67
-
68
- ## Tasks
69
-
70
- ### Task 1 β€” Easy (`task_easy`)
71
- - **Objective**: Identify the failure mode from training logs only
72
- - **Required sources**: `logs`
73
- - **Max steps**: 10
74
- - **Failure modes**: exploding gradients, learning rate too high, overfitting, underfitting
75
-
76
- ### Task 2 β€” Medium (`task_medium`)
77
- - **Objective**: Identify the failure mode from logs + hyperparameter config
78
- - **Required sources**: `logs`, `config`
79
- - **Max steps**: 15
80
- - **Failure modes**: learning rate too low, missing regularization, batch size too small, optimizer misconfiguration
81
-
82
- ### Task 3 β€” Hard (`task_hard`)
83
- - **Objective**: Identify failure mode from logs + config + gradients, and provide a concrete fix
84
- - **Required sources**: `logs`, `config`, `gradients`
85
- - **Max steps**: 20
86
- - **Failure modes**: vanishing gradients, dying relu, bad weight initialization, lr scheduler misconfiguration
87
 
88
  ## Baseline Performance (Qwen/Qwen2.5-72B-Instruct)
89
 
90
  | Task | Avg Score | Pass Rate |
91
  |---|---|---|
92
- | Easy | ~0.85 | ~80% |
93
- | Medium | ~0.92 | ~100% |
94
- | Hard | ~0.93 | ~100% |
 
 
95
 
96
  ## Setup
97
 
@@ -99,7 +228,7 @@ Step-level rewards during inspection: +0.10 / +0.07 / +0.05 for each required so
99
 
100
  | Variable | Default | Required |
101
  |---|---|---|
102
- | `HF_TOKEN` | β€” | Yes (mandatory) |
103
  | `API_BASE_URL` | `https://router.huggingface.co/v1` | No |
104
  | `MODEL_NAME` | `Qwen/Qwen2.5-72B-Instruct` | No |
105
  | `SERVER_URL` | `http://localhost:8000` | No |
@@ -124,6 +253,24 @@ docker build -t whydiditfail-env:latest .
124
  docker run -p 8000:8000 whydiditfail-env:latest
125
  ```
126
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
127
  ## Project Structure
128
 
129
  ```
@@ -141,6 +288,8 @@ WhyDidItFail/
141
  └── llm_judge.py # LLM-based reasoning quality judge
142
  ```
143
 
 
 
144
  ## OpenEnv Spec Compliance
145
 
146
  - Typed `Action`, `Observation` Pydantic models βœ“
 
1
  ---
2
  title: WhyDidItFail Environment Server
3
+ emoji: πŸ”¬
4
  colorFrom: red
5
  colorTo: indigo
6
  sdk: docker
 
11
  - openenv
12
  ---
13
 
14
+ ![WhyDidITFail_meme](Wdif.png)
15
 
16
+ # πŸ”¬ WhyDidItFail β€” ML Training Failure Diagnosis Environment
17
 
18
+ > Every dev has been there. It's 2am. Your training run just died. The loss curve looks like a seismograph during an earthquake. You have no idea why. **WhyDidItFail** puts an AI agent in that exact seat β€” and makes it figure out what went wrong.
19
 
20
+ This is a real-world OpenEnv environment where an AI agent must **diagnose failed ML training runs** by inspecting logs, configs, and gradient statistics β€” then commit to a root cause and a fix. No handholding, no free answers. Just evidence, reasoning, and a score.
21
 
22
+ **Built as a training target for small models (7B–13B).** The action space is constrained, the reward signal is dense, and the failure modes are realistic enough to stress-test any agent that thinks it knows ML.
 
 
 
23
 
24
+ ---
25
+
26
+ ## How It Works β€” Episode Lifecycle
27
+
28
+ ```
29
+ reset()
30
+ └─► Agent receives task description + hint
31
+ β”‚
32
+ β–Ό
33
+ [inspect_logs] ──► Observation: training curves (loss, acc per epoch)
34
+ β”‚ Reward: +0.10 (first required source)
35
+ β–Ό
36
+ [inspect_config] ──► Observation: hyperparams (lr, optimizer, dropout...)
37
+ β”‚ Reward: +0.07 (second required source)
38
+ β–Ό
39
+ [inspect_gradients] β–Ί Observation: gradient norms by layer
40
+ β”‚ Reward: +0.05 (third required source)
41
+ β–Ό
42
+ [submit_diagnosis] ──► diagnosis + suggested_fix + reasoning
43
+ Reward: 0.0–1.0 (graded on correctness + evidence + efficiency)
44
+ done = True
45
+ ```
46
+
47
+ > - Each action reveals a different slice of evidence
48
+ > - Agent must decide what to inspect and when to stop
49
+ > - Submitting too early or too late both cost points
50
+ > - Wrong inspection sources penalize the score
51
+
52
+
53
+ ---
54
 
55
+ ## Scenarios
56
+
57
+ ### Easy β€” Logs only
58
+
59
+ | Scenario | Problem | Description |
60
+ |---|---|---|
61
+ | Exploding Gradients | Loss β†’ NaN | Training loss goes NaN after epoch 2. Gradient norms spike to infinity. The model diverges catastrophically β€” a classic sign of learning rate being too high or missing gradient clipping. The agent must catch NaN in the logs and label it correctly. |
62
+ | Learning Rate Too High | Oscillating loss | Loss bounces wildly every epoch β€” goes down, shoots up, never converges. No NaN, just chaos. The optimizer is taking steps so large it overshoots the minimum repeatedly. Batch size is fine; the culprit is the LR. |
63
+ | Overfitting | Val loss climbing | Train loss hits near-zero by epoch 15. Val loss is diverging upward. The config already has regularization (dropout, weight decay) present β€” so this is true overfitting, not a missing regularization bug. The agent must distinguish these two. |
64
+ | Underfitting | Both losses stuck high | Train accuracy and val accuracy hover near random baseline (~10%) throughout training. No gap between them. The model isn't learning at all β€” too simple for the task, wrong architecture, or training stopped too early. |
65
+
66
+ ### Medium β€” Logs + Config
67
+
68
+ | Scenario | Problem | Description |
69
+ |---|---|---|
70
+ | Learning Rate Too Low | Glacial convergence | Loss is decreasing, but imperceptibly β€” 0.001 per epoch. The config reveals `lr=1e-6`. The model is technically learning, but so slowly it would take thousands of epochs to converge. The agent needs both the log trend and the config LR to make this call. |
71
+ | Missing Regularization | Overfit without defense | Train loss low, val loss rising β€” looks like overfitting. But the config shows `weight_decay=0.0` and `dropout=0.0`. This isn't overfitting the model fighting the regularizer β€” it's the model memorizing because there's no regularizer at all. Label matters here. |
72
+ | Batch Size Too Small | Noisy gradient trajectory | Loss goes down on average but is extremely noisy β€” spikes and dips every epoch. Config shows `batch_size=2`. With tiny batches, gradient estimates are high-variance: each update is basically random. The agent must connect the noise pattern to the batch size config. |
73
+ | Optimizer Misconfiguration | SGD with no momentum | Loss curves look stuck or very slow. Config shows `optimizer=SGD, momentum=0.0`. SGD without momentum has no gradient averaging β€” it stalls on saddle points and flat regions. Modern SGD needs momentum to navigate loss landscapes effectively. |
74
+
75
+ ### Hard β€” Logs + Config + Gradients
76
+
77
+ | Scenario | Problem | Description |
78
+ |---|---|---|
79
+ | Vanishing Gradients | Gradient decay toward inputs | Gradient norms decay exponentially from output to input layers (e.g. 1e-1 β†’ 1e-8). Config shows sigmoid or tanh activation. These saturating activations crush gradients during backprop. The input layers learn nothing. Agent must read gradient norms by layer and connect to activation choice. |
80
+ | Dying ReLU | Zero gradients in hidden layers | Gradient norms in hidden layers are exactly 0.0 β€” not small, exactly zero. Config shows ReLU activation and a high learning rate. Neurons have permanently entered the "dead zone" where their pre-activation is always negative, so they never fire or update again. |
81
+ | Bad Weight Initialization | NaN from epoch 1 | Loss is NaN from the very first epoch β€” before training even begins meaningfully. Gradient norms are astronomically large (>10,000). Config shows an extreme weight initialization std (e.g. 100). Weights so large that the forward pass immediately overflows. |
82
+ | LR Scheduler Misconfiguration | Periodic loss spikes | Training goes fine, then suddenly loss spikes at a predictable interval β€” every N epochs. Config shows `lr_scheduler=StepLR, gamma=10.0`. Gamma > 1.0 means the scheduler is **increasing** the learning rate at each step, not decreasing it. A subtle config error with dramatic consequences. |
83
+
84
+ ---
85
+
86
+ ## Features
87
+
88
+ - **12 realistic failure modes** across 3 difficulty tiers β€” exploding gradients, overfitting, dying ReLU, bad weight initialization, and more
89
+ - **Partial observability** β€” the agent chooses what to inspect (logs, config, gradients) and must reason from incomplete evidence
90
+ - **Dense reward signal** β€” step-level rewards during inspection, not just at the end
91
+ - **Dual grading** β€” programmatic keyword scorer (85%) + LLM reasoning judge (15%)
92
+ - **Multi-component score** β€” diagnosis correctness, evidence coverage, efficiency, fix quality, and inspection order all contribute
93
+ - **WebSocket environment** β€” real-time interaction via FastAPI; supports concurrent sessions
94
+ - **Docker-ready** β€” one command to run the full environment server
95
+ - **Local agent** β€” smoke test the pipeline without any API key
96
+
97
+ ---
98
+
99
+ ## Grading β€” The Heart of the Environment ❀️
100
+
101
+ ### Scoring Flow
102
+
103
+ ```
104
+ submit_diagnosis received
105
+ β”‚
106
+ β”œβ”€β–Ί Diagnosis Score (was the label correct?)
107
+ β”‚ exact keyword match β†’ +0.40
108
+ β”‚ category/fuzzy match β†’ +0.10 per keyword
109
+ β”‚ vague answer (<3 words) β†’ βˆ’0.10
110
+ β”‚
111
+ β”œβ”€β–Ί Evidence Score (did the agent inspect the right sources?)
112
+ β”‚ +0.08 per required source inspected
113
+ β”‚ βˆ’0.10 per required source NOT inspected
114
+ β”‚ βˆ’0.02 per irrelevant source inspected
115
+ β”‚
116
+ β”œβ”€β–Ί Evidence-Diagnosis Penalty (had the clues, drew wrong conclusion?)
117
+ β”‚ all required sources + wrong diagnosis β†’ βˆ’0.10
118
+ β”‚ some required sources + wrong diagnosis β†’ βˆ’0.05
119
+ β”‚
120
+ β”œβ”€β–Ί Efficiency Score (did the agent act without waste?)
121
+ β”‚ minimum steps β†’ +0.15
122
+ β”‚ extra steps β†’ βˆ’0.02 Γ— (extra_steps^1.2)
123
+ β”‚ early submit β†’ βˆ’0.05 per missing step
124
+ β”‚
125
+ β”œβ”€β–Ί Fix Score (was the suggested fix actionable?)
126
+ β”‚ all fix keywords match β†’ +0.15
127
+ β”‚ β‰₯60% match β†’ +0.10
128
+ β”‚ β‰₯30% match β†’ +0.05
129
+ β”‚ no fix provided β†’ βˆ’0.05
130
+ β”‚
131
+ └─► Ordering Bonus (+0.05 if sources inspected in canonical order)
132
+ logs β†’ config β†’ gradients
133
+
134
+ Total = clamp(sum, 0.0, 1.0)
135
+ ```
136
+
137
+
138
+ ### Score Breakdown Table
139
+
140
+ | Score Type | Logic | Max Reward | Min Reward |
141
+ |---|---|---|---|
142
+ | Diagnosis | Keyword match on failure mode label | +0.70 | 0.00 |
143
+ | Evidence | Required sources inspected vs missing | +0.25 | βˆ’0.15 |
144
+ | Evidence-Diagnosis Penalty | Had evidence but wrong conclusion | 0.00 | βˆ’0.10 |
145
+ | Efficiency | Steps taken vs minimum needed | +0.15 | 0.00 |
146
+ | Fix | Keyword match on suggested fix | +0.15 | βˆ’0.05 |
147
+ | Ordering Bonus | Canonical inspection order | +0.05 | 0.00 |
148
+ | **Total** | Clamped to [0.0, 1.0] | **1.00** | **0.00** |
149
+
150
+ ### Step-Level Rewards (during inspection)
151
+
152
+ | Action | Reward |
153
  |---|---|
154
+ | First required source discovered | +0.10 |
155
+ | Second required source discovered | +0.07 |
156
+ | Third required source discovered | +0.05 |
157
+ | Irrelevant source inspected | βˆ’0.03 |
158
+ | Re-inspecting a source | βˆ’0.05 |
159
+
160
+ ---
161
+
162
+ ## LLM Judge
163
+
164
+ ### What it does
165
+
166
+ The programmatic grader handles keyword matching β€” fast and deterministic. The **LLM Judge** runs after the episode ends and evaluates the quality of the agent's *reasoning*: did it actually cite evidence? Was the logic coherent? Did the fix make sense given the diagnosis?
167
+
168
+ ### Judge Flow
169
+
170
+ ```
171
+ submit_diagnosis
172
+ β”‚
173
+ β”œβ”€β–Ί Programmatic grader (keyword match β†’ score) 85% weight
174
+ β”‚
175
+ └─► LLM Judge (reasoning quality β†’ score) 15% weight
176
+ β”‚
177
+ β”œβ”€β”€ diagnosis + suggested_fix + reasoning + scenario data
178
+ β”‚
179
+ └── LLM evaluates:
180
+ - Did the agent cite specific numbers from the data?
181
+ - Is the reasoning internally consistent?
182
+ - Does the fix address the actual root cause?
183
+ β”‚
184
+ └── Returns float 0.0–1.0
185
+
186
+ Final Score = 0.85 Γ— keyword_score + 0.15 Γ— judge_score
187
+ ```
188
+
189
+
190
+ The judge uses the same model running inference (configurable via `MODEL_NAME`). It's deliberately lightweight β€” a single-turn evaluation with a structured prompt β€” so it doesn't dominate runtime.
191
+
192
+ ---
193
 
194
  ## Action Space
195
 
 
210
  - `reward` β€” step-level reward
211
  - `done` β€” episode termination flag
212
 
213
+ ---
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
214
 
215
  ## Baseline Performance (Qwen/Qwen2.5-72B-Instruct)
216
 
217
  | Task | Avg Score | Pass Rate |
218
  |---|---|---|
219
+ | Easy | 0.964 | 100% |
220
+ | Medium | 0.952 | 100% |
221
+ | Hard | 0.979 | 100% |
222
+
223
+ ---
224
 
225
  ## Setup
226
 
 
228
 
229
  | Variable | Default | Required |
230
  |---|---|---|
231
+ | `HF_TOKEN` | β€” | Yes |
232
  | `API_BASE_URL` | `https://router.huggingface.co/v1` | No |
233
  | `MODEL_NAME` | `Qwen/Qwen2.5-72B-Instruct` | No |
234
  | `SERVER_URL` | `http://localhost:8000` | No |
 
253
  docker run -p 8000:8000 whydiditfail-env:latest
254
  ```
255
 
256
+ ### Local Agent (No API Key Needed)
257
+
258
+ Want to test without calling an external LLM? The local agent uses a rule-based heuristic that mimics the expected agent behavior β€” useful for smoke testing the environment and grader logic.
259
+
260
+ ```bash
261
+ # Run with local agent (no HF_TOKEN needed)
262
+ USE_LOCAL=true uv run python inference.py
263
+ ```
264
+
265
+ The local agent (`local_agent.py`) follows a fixed inspection strategy:
266
+ 1. Always inspect logs first
267
+ 2. Inspects config and gradients if the task requires them
268
+ 3. Submits a deterministic diagnosis based on simple pattern matching
269
+
270
+ It won't score like a frontier model, but it will complete episodes cleanly and let you verify the full pipeline β€” server, grader, judge, stdout format β€” without any API calls.
271
+
272
+ ---
273
+
274
  ## Project Structure
275
 
276
  ```
 
288
  └── llm_judge.py # LLM-based reasoning quality judge
289
  ```
290
 
291
+ ---
292
+
293
  ## OpenEnv Spec Compliance
294
 
295
  - Typed `Action`, `Observation` Pydantic models βœ“