---
title: WhyDidItFail Environment Server
emoji: 🔬
colorFrom: red
colorTo: indigo
sdk: docker
pinned: false
app_port: 8000
base_path: /web
tags:
  - openenv
---

<img width="1536" height="1024" alt="image" src="https://github.com/user-attachments/assets/5ea68b2e-0005-4aab-93db-6206ef29ad91" />


# 🔬 WhyDidItFail: ML Training Failure Diagnosis Environment

> Every dev has been there. It's 2am. Your training run just died. The loss curve looks like a seismograph during an earthquake. You have no idea why. **WhyDidItFail** puts an AI agent in that exact seat and makes it figure out what went wrong.

This is a real-world OpenEnv environment where an AI agent must **diagnose failed ML training runs** by inspecting logs, configs, and gradient statistics, then commit to a root cause and a fix. No handholding, no free answers. Just evidence, reasoning, and a score.

**Built as a training target for small models (7B–13B).** The action space is constrained, the reward signal is dense, and the failure modes are realistic enough to stress-test any agent that thinks it knows ML.

---

## How It Works: Episode Lifecycle

```
reset()
  └─► Agent receives task description + hint
        │
        ▼
   [inspect_logs]  ──► Observation: training curves (loss, acc per epoch)
        │                Reward: +0.10 (first required source)
        ▼
   [inspect_config] ──► Observation: hyperparams (lr, optimizer, dropout...)
        │                Reward: +0.07 (second required source)
        ▼
   [inspect_gradients] ► Observation: gradient norms by layer
        │                Reward: +0.05 (third required source)
        ▼
   [submit_diagnosis] ──► diagnosis + suggested_fix + reasoning
                          Reward: 0.0–1.0 (graded on correctness + evidence + efficiency)
                          done = True
```

> - Each action reveals a different slice of evidence
> - Agent must decide what to inspect and when to stop
> - Submitting too early or too late both cost points
> - Wrong inspection sources penalize the score
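
The lifecycle above can be sketched as a fixed plan per difficulty tier. This is a minimal illustration and not the project's client code: `CANONICAL_ORDER`, `REQUIRED_SOURCES`, and `plan_actions` are hypothetical names, and the tier-to-source mapping is assumed from the scenario tiers listed in this README.

```python
# Hypothetical helper: these names are illustrative, not part of the project API.
CANONICAL_ORDER = ["logs", "config", "gradients"]

# Sources each tier requires, assumed from the scenario tiers in this README.
REQUIRED_SOURCES = {
    "easy": {"logs"},
    "medium": {"logs", "config"},
    "hard": {"logs", "config", "gradients"},
}


def plan_actions(tier: str) -> list[str]:
    """Return the inspect actions in canonical order, then the final submit."""
    required = REQUIRED_SOURCES[tier]
    actions = [f"inspect_{src}" for src in CANONICAL_ORDER if src in required]
    actions.append("submit_diagnosis")
    return actions


if __name__ == "__main__":
    for tier in ("easy", "medium", "hard"):
        print(tier, plan_actions(tier))
```

A real agent would decide what to inspect from the observations it receives; a fixed plan like this is only useful as a smoke-test baseline.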


---

## Scenarios

### Easy: Logs only

| Scenario | Problem | Description |
|---|---|---|
| Exploding Gradients | Loss → NaN | Training loss goes NaN after epoch 2. Gradient norms spike to infinity. The model diverges catastrophically, a classic sign of a learning rate that is too high or of missing gradient clipping. The agent must catch NaN in the logs and label it correctly. |
| Learning Rate Too High | Oscillating loss | Loss bounces wildly every epoch: it goes down, shoots up, and never converges. No NaN, just chaos. The optimizer is taking steps so large it overshoots the minimum repeatedly. Batch size is fine; the culprit is the LR. |
| Overfitting | Val loss climbing | Train loss hits near-zero by epoch 15. Val loss is diverging upward. The config already has regularization (dropout, weight decay) present, so this is true overfitting, not a missing regularization bug. The agent must distinguish these two. |
| Underfitting | Both losses stuck high | Train accuracy and val accuracy hover near random baseline (~10%) throughout training. No gap between them. The model isn't learning at all: too simple for the task, wrong architecture, or training stopped too early. |

### Medium: Logs + Config

| Scenario | Problem | Description |
|---|---|---|
| Learning Rate Too Low | Glacial convergence | Loss is decreasing, but imperceptibly: 0.001 per epoch. The config reveals `lr=1e-6`. The model is technically learning, but so slowly it would take thousands of epochs to converge. The agent needs both the log trend and the config LR to make this call. |
| Missing Regularization | Overfit without defense | Train loss low, val loss rising, which looks like overfitting. But the config shows `weight_decay=0.0` and `dropout=0.0`. This isn't the model overfitting despite a regularizer; it's the model memorizing because there's no regularizer at all. The label matters here. |
| Batch Size Too Small | Noisy gradient trajectory | Loss goes down on average but is extremely noisy, with spikes and dips every epoch. Config shows `batch_size=2`. With tiny batches, gradient estimates are high-variance: each update is basically random. The agent must connect the noise pattern to the batch size config. |
| Optimizer Misconfiguration | SGD with no momentum | Loss curves look stuck or very slow. Config shows `optimizer=SGD, momentum=0.0`. SGD without momentum has no gradient averaging, so it stalls on saddle points and flat regions. Modern SGD needs momentum to navigate loss landscapes effectively. |

### Hard: Logs + Config + Gradients

| Scenario | Problem | Description |
|---|---|---|
| Vanishing Gradients | Gradient decay toward inputs | Gradient norms decay exponentially from output to input layers (e.g. 1e-1 → 1e-8). Config shows sigmoid or tanh activation. These saturating activations crush gradients during backprop. The input layers learn nothing. Agent must read gradient norms by layer and connect to activation choice. |
| Dying ReLU | Zero gradients in hidden layers | Gradient norms in hidden layers are exactly 0.0: not small, exactly zero. Config shows ReLU activation and a high learning rate. Neurons have permanently entered the "dead zone" where their pre-activation is always negative, so they never fire or update again. |
| Bad Weight Initialization | NaN from epoch 1 | Loss is NaN from the very first epoch, before training even begins meaningfully. Gradient norms are astronomically large (>10,000). Config shows an extreme weight initialization std (e.g. 100). Weights so large that the forward pass immediately overflows. |
| LR Scheduler Misconfiguration | Periodic loss spikes | Training goes fine, then suddenly loss spikes at a predictable interval, every N epochs. Config shows `lr_scheduler=StepLR, gamma=10.0`. Gamma > 1.0 means the scheduler is **increasing** the learning rate at each step, not decreasing it. A subtle config error with dramatic consequences. |
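
To make the scheduler scenario concrete, the sketch below computes a step schedule in plain Python using the closed form `lr = base_lr * gamma ** (epoch // step_size)`, which is how PyTorch documents `StepLR`. The `base_lr` and `step_size` values are assumptions chosen for illustration; only `gamma=10.0` comes from the scenario description.

```python
def step_lr(base_lr: float, gamma: float, step_size: int, epoch: int) -> float:
    """Learning rate at `epoch` for a step schedule (closed form)."""
    return base_lr * gamma ** (epoch // step_size)


if __name__ == "__main__":
    base_lr, step_size = 1e-3, 5  # assumed values, not taken from the scenario data
    for gamma in (0.1, 10.0):
        schedule = [step_lr(base_lr, gamma, step_size, e) for e in (0, 5, 10)]
        print(f"gamma={gamma}: {schedule}")
```

With `gamma=10.0` the rate grows tenfold at every boundary, which is consistent with the periodic spikes described above; a decaying schedule needs `gamma` below 1.0.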

---

## Features

- **12 realistic failure modes** across 3 difficulty tiers: exploding gradients, overfitting, dying ReLU, bad weight initialization, and more
- **Partial observability**: the agent chooses what to inspect (logs, config, gradients) and must reason from incomplete evidence
- **Dense reward signal**: step-level rewards during inspection, not just at the end
- **Dual grading**: programmatic keyword scorer (85%) + LLM reasoning judge (15%)
- **Multi-component score**: diagnosis correctness, evidence coverage, efficiency, fix quality, and inspection order all contribute
- **WebSocket environment**: real-time interaction via FastAPI; supports concurrent sessions
- **Docker-ready**: one command to run the full environment server
- **Local agent**: smoke test the pipeline without any API key

---

## Grading: The Heart of the Environment ❤️

### Scoring Flow

```
submit_diagnosis received
        │
        ├─► Diagnosis Score     (was the label correct?)
        │       exact keyword match  → +0.40
        │       category/fuzzy match → +0.10 per keyword
        │       vague answer (<3 words) → −0.10
        │
        ├─► Evidence Score      (did the agent inspect the right sources?)
        │       +0.08 per required source inspected
        │       −0.10 per required source NOT inspected
        │       −0.02 per irrelevant source inspected
        │
        ├─► Evidence-Diagnosis Penalty  (had the clues, drew wrong conclusion?)
        │       all required sources + wrong diagnosis → −0.10
        │       some required sources + wrong diagnosis → −0.05
        │
        ├─► Efficiency Score    (did the agent act without waste?)
        │       minimum steps → +0.15
        │       extra steps   → −0.02 × (extra_steps^1.2)
        │       early submit  → −0.05 per missing step
        │
        ├─► Fix Score           (was the suggested fix actionable?)
        │       all fix keywords match → +0.15
        │       ≥60% match → +0.10
        │       ≥30% match → +0.05
        │       no fix provided → −0.05
        │
        └─► Ordering Bonus      (+0.05 if sources inspected in canonical order)
                                 logs → config → gradients

        Total = clamp(sum, 0.0, 1.0)
```
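
As a rough illustration of two of the components in the flow above, the sketch below implements the evidence and fix rules plus the final clamp. It is not the code in `server/graders.py`: the function names, the substring-based keyword matching, and the 0.0 score for a fix that matches under 30% of keywords are all assumptions of this sketch.

```python
def evidence_score(required: set[str], inspected: set[str]) -> float:
    """+0.08 per required source seen, -0.10 per one missed, -0.02 per irrelevant."""
    hit = len(required & inspected)
    missed = len(required - inspected)
    irrelevant = len(inspected - required)
    return 0.08 * hit - 0.10 * missed - 0.02 * irrelevant


def fix_score(fix_keywords: list[str], suggested_fix: str) -> float:
    """Tiered keyword match; substring matching is an assumption of this sketch."""
    if not suggested_fix.strip():
        return -0.05  # no fix provided
    text = suggested_fix.lower()
    matched = sum(1 for kw in fix_keywords if kw.lower() in text)
    ratio = matched / len(fix_keywords) if fix_keywords else 0.0
    if ratio == 1.0:
        return 0.15
    if ratio >= 0.6:
        return 0.10
    if ratio >= 0.3:
        return 0.05
    return 0.0  # assumed: a fix matching under 30% of keywords earns nothing


def clamp_total(*components: float) -> float:
    """Total = clamp(sum, 0.0, 1.0)."""
    return max(0.0, min(1.0, sum(components)))


if __name__ == "__main__":
    ev = evidence_score({"logs", "config"}, {"logs", "config"})
    fx = fix_score(["reduce", "learning rate"], "Reduce the learning rate to 1e-4")
    print(clamp_total(0.40, ev, fx))
```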


### Score Breakdown Table

| Score Type | Logic | Max Reward | Min Reward |
|---|---|---|---|
| Diagnosis | Keyword match on failure mode label | +0.70 | 0.00 |
| Evidence | Required sources inspected vs missing | +0.25 | −0.15 |
| Evidence-Diagnosis Penalty | Had evidence but wrong conclusion | 0.00 | −0.10 |
| Efficiency | Steps taken vs minimum needed | +0.15 | 0.00 |
| Fix | Keyword match on suggested fix | +0.15 | −0.05 |
| Ordering Bonus | Canonical inspection order | +0.05 | 0.00 |
| **Total** | Clamped to [0.0, 1.0] | **1.00** | **0.00** |

### Step-Level Rewards (during inspection)

| Action | Reward |
|---|---|
| First required source discovered | +0.10 |
| Second required source discovered | +0.07 |
| Third required source discovered | +0.05 |
| Irrelevant source inspected | −0.03 |
| Re-inspecting a source | −0.05 |
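
The table above can be read as a small lookup. The sketch below is one possible interpretation, not the environment's implementation: `DISCOVERY_REWARDS` and `step_reward` are hypothetical names, and checking for re-inspection before relevance is an assumption about precedence.

```python
# Rewards for the first, second, and third required source discovered.
DISCOVERY_REWARDS = (0.10, 0.07, 0.05)


def step_reward(source: str, required: set[str], inspected: list[str]) -> float:
    """Reward for inspecting `source`, given the sources inspected so far."""
    if source in inspected:
        return -0.05  # re-inspecting a source (checked first: an assumption)
    if source not in required:
        return -0.03  # irrelevant source
    found = sum(1 for s in inspected if s in required)
    return DISCOVERY_REWARDS[found]


if __name__ == "__main__":
    required, inspected = {"logs", "config"}, []
    for src in ("logs", "gradients", "config", "logs"):
        print(src, step_reward(src, required, inspected))
        if src not in inspected:
            inspected.append(src)
```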

---

## LLM Judge

### What it does

The programmatic grader handles keyword matching, which is fast and deterministic. The **LLM Judge** runs after the episode ends and evaluates the quality of the agent's *reasoning*: did it actually cite evidence? Was the logic coherent? Did the fix make sense given the diagnosis?

### Judge Flow

```
submit_diagnosis
        │
        ├─► Programmatic grader  (keyword match → score)  85% weight
        │
        └─► LLM Judge           (reasoning quality → score)  15% weight
                │
                ├── diagnosis + suggested_fix + reasoning + scenario data
                │
                └── LLM evaluates:
                        - Did the agent cite specific numbers from the data?
                        - Is the reasoning internally consistent?
                        - Does the fix address the actual root cause?
                        │
                        └── Returns float 0.0–1.0

Final Score = 0.85 × keyword_score + 0.15 × judge_score
```
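
The blend at the bottom of the flow is a plain weighted sum. A minimal sketch, where `final_score` is a hypothetical name and the example inputs are made up for illustration:

```python
def final_score(keyword_score: float, judge_score: float) -> float:
    """Blend the programmatic score (85%) with the LLM judge score (15%)."""
    return 0.85 * keyword_score + 0.15 * judge_score


if __name__ == "__main__":
    # Example inputs are invented; they are not measured results.
    print(round(final_score(0.90, 0.60), 3))
```

Because the judge carries only 15% of the weight, a perfect keyword score with a judge score of 0.0 still yields 0.85.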


The judge uses the same model that runs inference (configurable via `MODEL_NAME`). It's deliberately lightweight, a single-turn evaluation with a structured prompt, so it doesn't dominate runtime.

---

## Action Space

| Action | Description |
|---|---|
| `inspect_logs` | View training/validation loss and accuracy curves by epoch |
| `inspect_config` | View hyperparameter config (lr, optimizer, batch size, dropout, etc.) |
| `inspect_gradients` | View gradient norm statistics by layer and epoch |
| `submit_diagnosis` | Submit final diagnosis with label, suggested fix, and reasoning |

## Observation Space

Each step returns a `WhyDidItFailObservation` with:
- `task_description`: the current task objective
- `visible_data`: data returned by the last inspect action (JSON)
- `feedback`: partial progress hint (e.g. which sources still need inspection)
- `steps_taken`: step counter
- `reward`: step-level reward
- `done`: episode termination flag
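
For orientation, the fields above could be modeled as follows. The project uses Pydantic models in `models.py`; this sketch uses a standard-library dataclass instead so it runs without dependencies, and the class name `ObservationSketch`, the field types, and the defaults are all assumptions.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class ObservationSketch:
    """Stand-in for WhyDidItFailObservation; field types are assumed."""

    task_description: str
    visible_data: dict[str, Any] = field(default_factory=dict)
    feedback: str = ""
    steps_taken: int = 0
    reward: float = 0.0
    done: bool = False


def summarize(obs: ObservationSketch) -> str:
    """One-line summary an agent loop might log after each step."""
    status = "done" if obs.done else "running"
    return f"step={obs.steps_taken} reward={obs.reward:+.2f} {status}"


if __name__ == "__main__":
    obs = ObservationSketch("Diagnose the failed run", steps_taken=1, reward=0.10)
    print(summarize(obs))
```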

---

## Baseline Performance (Qwen/Qwen2.5-72B-Instruct)

| Task | Avg Score | Pass Rate |
|---|---|---|
| Easy | 0.964 | 100% |
| Medium | 0.952 | 100% |
| Hard | 0.979 | 100% |

---

## Setup

### Prerequisites

- [uv](https://docs.astral.sh/uv/): Python package manager
- [Docker](https://www.docker.com/): for running the environment server
- A Hugging Face account with an API token ([get one here](https://huggingface.co/settings/tokens))

---

### 1. Install uv

```bash
# macOS / Linux
curl -LsSf https://astral.sh/uv/install.sh | sh

# Or via Homebrew
brew install uv
```

---

### 2. Install dependencies

```bash
git clone https://github.com/samrat-rm/WhyDidItFail
cd WhyDidItFail
uv sync
```

---

### 3. Configure environment variables

Create a `.env` file in the project root:

```bash
HF_TOKEN=your_huggingface_token_here

# Optional overrides
API_BASE_URL=https://router.huggingface.co/v1
MODEL_NAME=Qwen/Qwen2.5-72B-Instruct
SERVER_URL=http://localhost:8000
```

| Variable | Default | Required |
|---|---|---|
| `HF_TOKEN` | (none) | Yes |
| `API_BASE_URL` | `https://router.huggingface.co/v1` | No |
| `MODEL_NAME` | `Qwen/Qwen2.5-72B-Instruct` | No |
| `SERVER_URL` | `http://localhost:8000` | No |
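
A script could resolve these variables as sketched below. This is an illustration rather than what `inference.py` necessarily does: `load_settings` and `DEFAULTS` are hypothetical names, the function reads from a mapping such as `os.environ` and does not parse `.env` itself, and it ignores the `USE_LOCAL` path described later.

```python
import os
from collections.abc import Mapping

# Defaults copied from the table above.
DEFAULTS = {
    "API_BASE_URL": "https://router.huggingface.co/v1",
    "MODEL_NAME": "Qwen/Qwen2.5-72B-Instruct",
    "SERVER_URL": "http://localhost:8000",
}


def load_settings(env: Mapping[str, str] = os.environ) -> dict[str, str]:
    """Return settings with defaults applied; HF_TOKEN must be set."""
    token = env.get("HF_TOKEN")
    if not token:
        raise RuntimeError("HF_TOKEN is required")
    settings = {name: env.get(name, default) for name, default in DEFAULTS.items()}
    settings["HF_TOKEN"] = token
    return settings


if __name__ == "__main__":
    # Pass an explicit mapping so the example runs without a real token.
    print(load_settings({"HF_TOKEN": "example-token", "MODEL_NAME": "my-model"}))
```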

---

### 4. Start the environment server

The environment server runs in Docker. Build and start it:

```bash
# Build the image
docker build -t why_did_it_fail_env:latest .

# Run the server (exposes on port 8000)
docker run -p 8000:8000 why_did_it_fail_env:latest
```

The server is ready when you see `Uvicorn running on http://0.0.0.0:8000`.

---

### 5. Run inference

In a separate terminal:

```bash
uv run python inference.py

# Alternative if `python` is not available: invoke python3 explicitly (run one or the other)
uv run python3 inference.py
```

Stdout will stream `[START]` / `[STEP]` / `[END]` lines per episode. Internal logs go to stderr.

---

### 6. Verify environment state (optional)

To confirm the state endpoint is working correctly (episode tracking, inspection order, required sources):

```bash
uv run python test_state.py
```

Runs a single hard episode and calls `state()` after every action. Prints `OK` or `FAIL` per checkpoint.

---

### Local Agent: No API Key Required

To smoke test the full pipeline without calling an external LLM, you can run inference with a local model via [Ollama](https://ollama.com/).

**1. Install Ollama**

```bash
brew install ollama
```

**2. Pull a model**

```bash
# Recommended: a small instruction-tuned model
ollama pull qwen2.5:7b
```

**3. Start the Ollama server**

```bash
ollama serve
```

**4. Run inference against the local model**

```bash
USE_LOCAL=true uv run python inference.py
```

> The local agent follows a fixed inspection strategy and won't match frontier model scores, but it exercises the full pipeline (server, grader, judge, and stdout format) with no API calls or token costs.

---

## Project Structure

```
WhyDidItFail/
├── inference.py                    # Baseline inference script
├── test_state.py                   # State endpoint verification script
├── client.py                       # WhyDidItFailEnv client (WebSocket)
├── models.py                       # Action, Observation, and State Pydantic models
├── openenv.yaml                    # OpenEnv manifest
├── Dockerfile                      # Container image
└── server/
    ├── WhyDidItFail_environment.py # Core environment logic (step/reset/state)
    ├── app.py                      # FastAPI server (HTTP + WebSocket)
    ├── scenarios.py                # 12 scenario definitions
    ├── graders.py                  # Programmatic grader
    └── llm_judge.py                # LLM-based reasoning quality judge
```

---

## OpenEnv Spec Compliance

- Typed `Action`, `Observation` Pydantic models ✓
- `step(action)` → `(observation, reward, done, info)` ✓
- `reset()` → initial observation ✓
- `state()` → current state ✓
- `openenv.yaml` with 3 tasks and grader definitions ✓
- Passes `openenv validate` ✓


#### AI Usage Disclosure

This project was developed with the assistance of AI tools, including Claude and ChatGPT. These tools were used to support tasks such as code generation, documentation drafting, and problem-solving.

All AI-generated content has been carefully reviewed, tested, and validated before being included in this repository. I take full responsibility for the accuracy, functionality, and integrity of the code and documentation provided.

AI assistance was used as a productivity aid, but all final decisions, implementations, and reviews were performed by me to ensure quality and correctness.