omkarrr88 committed on
Commit
a3e1032
·
1 Parent(s): 05ccdc6

chore: remove AI planning artifacts and junk files from repo

- Remove .claude/ directory (memory + plan files) from tracking
- Remove .python-version, uv.lock, deploy scripts from tracking
- Update .gitignore to prevent re-adding
- 5,508 lines of non-project files removed

.claude/memory/MEMORY.md DELETED
@@ -1,9 +0,0 @@
1
- # Memory Index
2
-
3
- - [Project Overview](project_overview.md) — Architecture, 7 tasks, dual model (CNN+MLP), real training, endpoints, WS format
4
- - [Project Status](project_status.md) — 251 tests/95% cov/885MB Docker/LLM scores, as of 2026-03-30
5
- - [Hackathon Rules](project_hackathon_rules.md) — Scoring rubric, DQ criteria, submission requirements
6
- - [Spec Documents](reference_spec_docs.md) — Which files are source of truth, key spec sections
7
- - [Docker Stripping](feedback_docker_stripping.md) — torch 2.5.1 + multi-stage + strip = 885MB, what breaks/safe
8
- - [WS Message Format](feedback_ws_format.md) — WS task selection via data field, correct step format
9
- - [User Context](user_context.md) — Omkar building hackathon submission, values thorough testing
.claude/memory/feedback_docker_stripping.md DELETED
@@ -1,46 +0,0 @@
1
- ---
2
- name: Docker torch stripping — what breaks and final optimized approach
3
- description: Lessons learned from Docker optimization. Final image 885MB using torch 2.5.1 + multi-stage + strip. Which dirs break, which are safe.
4
- type: feedback
5
- ---
6
-
7
- ## Final Optimized Dockerfile Approach (885MB)
8
-
9
- 1. **Use torch 2.5.1+cpu** (not latest 2.11.0) — smaller wheel, libtorch_cpu.so strips to 329MB
10
- 2. **Multi-stage build**: builder installs + strips, runtime copies only site-packages
11
- 3. **`strip --strip-unneeded`** on ALL .so files in one RUN layer
12
- 4. **`--no-compile`** flag on pip install (skip .pyc generation)
13
- 5. **Remove bloated transitive deps** in same layer: gradio (155MB), pandas (42MB), PIL, pip, setuptools
14
-
15
- ## Do NOT Remove (breaks `import torch` or runtime)
16
-
17
- - `torch/testing` → required by `torch.autograd.gradcheck`
18
- - `torch/distributed` → required by `torch._jit_internal`
19
- - `torch/cuda` → required at `_initExtension`
20
- - `torch/_inductor`, `torch/_dynamo` → required by `torch.optim` (optimizer init)
21
- - `torch/_functorch` → required by core init
22
- - `torch/fx` → required by `_functorch`
23
- - `torch/sparse`, `torch/nested`, `torch/masked` → required by `torch.nn`
24
- - `torch/onnx`, `torch/ao`, `torch/_export`, `torch/jit` → required at import time
25
- - `torchgen` → required by `torch.utils._python_dispatch`
26
- - `sympy` + `mpmath` → required by `torch._dynamo.utils`
27
- - `numpy` + `numpy.libs` → required by `torch.storage`
28
- - `beartype` → required by `fastmcp` → `openenv-core`
29
- - `pygments` → required by `rich` → `fastmcp`
30
- - `torch/bin/torch_shm_manager` → required at `_initExtension`
31
-
32
- ## Safe to Remove (verified working after removal)
33
-
34
- - `torch/test`, `torch/include`, `torch/share` — dev/test files
35
- - `torch/bin/*` EXCEPT `torch_shm_manager` — test binaries (47MB)
36
- - `torch/utils/benchmark`, `torch/utils/bottleneck`, `torch/utils/tensorboard`
37
- - `torch/lib/*.a`, `torch/lib/libtorchbind_test.so`, `torch/lib/libjitbackend_test.so`, etc.
38
- - `caffe2/` — not used
39
- - `gradio`, `gradio_client`, `hf_gradio` — pulled by openenv-core, not needed at runtime
40
- - `pandas`, `PIL/Pillow`, `networkx`, `scipy`, `matplotlib`
41
- - `pip`, `setuptools`, `docutils`, `cryptography`, `pytz`
42
- - `ffmpy`, `pydub`, `groovy`, `tomlkit`, `semantic_version`, `safehttpx`, `brotli`
43
- - All `.pyi` files, `__pycache__`, `.pyc`, stale `.dist-info`
44
-
45
- ## Older Torch NOT Smaller
46
- torch 2.2.0+cpu was 179MB wheel but installed to 932MB (numpy version mismatch, no strip benefit). torch 2.5.1+cpu at 885MB is the sweet spot.
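The "what breaks / what is safe" lists above imply a check that runs after each removal. A minimal sketch of such a check, assuming it is enough to try importing each module the service needs; `check_imports` is a hypothetical helper, not part of the repository:

```python
import importlib


def check_imports(module_names):
    """Try to import each module; map name -> error string, or None on success."""
    results = {}
    for name in module_names:
        try:
            importlib.import_module(name)
            results[name] = None
        except Exception as exc:  # a stripped package can fail with more than ImportError
            results[name] = f"{type(exc).__name__}: {exc}"
    return results
```

Inside the runtime image this could be called with a list such as `["torch", "torch.nn", "torch.optim", "numpy"]` (an assumed list; adjust it to what the server actually imports), failing the build if any value is not `None`.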
.claude/memory/feedback_ws_format.md DELETED
@@ -1,19 +0,0 @@
1
- ---
2
- name: OpenEnv framework WS message format
3
- description: The openenv-core WS endpoint expects specific message formats. Task selection via data field WORKS. Critical for tests and agent integration.
4
- type: feedback
5
- ---
6
-
7
- The openenv-core framework's WebSocket endpoint at `/ws` uses Pydantic-validated message formats:
8
-
9
- - **Reset (default task)**: `{"type": "reset"}`
10
- - **Reset (select task)**: `{"type": "reset", "data": {"task_id": "task_003", "seed": 42}}` — WORKS! The `data` field passes kwargs to `reset()`.
11
- - **Step**: `{"type": "step", "data": {"action_type": "inspect_gradients"}}` — use `"data"` NOT `"action"`
12
-
13
- **Key discovery (2026-03-28):** `WSResetMessage` has `data: Dict[str, Any]` which passes through to `reset(**kwargs)`. Task selection via WS is NOT broken — just needs the `data` wrapper. Top-level extra fields like `{"type": "reset", "task_id": "..."}` fail with "Extra inputs not permitted."
14
-
15
- **Why:** The framework's `WSResetMessage` uses Pydantic with `extra="forbid"` on top-level fields, but the `data` dict is `Dict[str, Any]` and passes freely.
16
-
17
- **HTTP endpoints** are stateless by framework design — each `/reset` and `/step` creates a fresh environment instance and destroys it after. WS is the only stateful interface for full episodes.
18
-
19
- **Response format:** `{"type": "observation", "data": {"observation": {...}, "reward": float, "done": bool}}`
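The message shapes described in this file can be sketched as small builders. This is an illustration only: the function names are hypothetical, and the idea that a step may carry extra fields next to `action_type` is an assumption, not something the notes confirm.

```python
import json


def reset_message(task_id=None, seed=None):
    """Build a reset message; task selection goes inside `data`, never top-level."""
    message = {"type": "reset"}
    data = {}
    if task_id is not None:
        data["task_id"] = task_id
    if seed is not None:
        data["seed"] = seed
    if data:
        message["data"] = data
    return json.dumps(message)


def step_message(action_type, **fields):
    """Build a step message; the payload key is `data`, not `action`."""
    # Extra keyword fields are an assumption about how richer actions are encoded.
    return json.dumps({"type": "step", "data": {"action_type": action_type, **fields}})


def parse_observation(raw):
    """Return (observation, reward, done), all read from the `data` wrapper."""
    payload = json.loads(raw)["data"]
    return payload["observation"], payload["reward"], payload["done"]
```

Note that `done` is read from the wrapper level (`data.done`), matching the response format quoted above.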
.claude/memory/project_hackathon_rules.md DELETED
@@ -1,50 +0,0 @@
1
- ---
2
- name: Hackathon rules and evaluation criteria
3
- description: Meta PyTorch OpenEnv Hackathon scoring rubric, DQ criteria, and submission requirements.
4
- type: project
5
- ---
6
-
7
- ## Hackathon: Meta PyTorch OpenEnv Hackathon x Scaler School of Technology
8
-
9
- **Timeline**: March 14 – April 8, 2026 (Round 1 submission)
10
- **Prize pool**: $30,000
11
- **Top teams advance**: 2,000-3,000 teams to in-person Round 2 (April 25-26, Bangalore)
12
-
13
- ## Scoring Rubric
14
-
15
- | Criterion | Weight |
16
- |-----------|--------|
17
- | Real-world utility | 30% |
18
- | Task & grader quality | 25% |
19
- | Environment design | 20% |
20
- | Code quality & spec compliance | 15% |
21
- | Creativity & novelty | 10% |
22
-
23
- ## DQ Criteria (auto-fail)
24
- - HF Space doesn't deploy or respond to reset()
25
- - openenv validate fails
26
- - Dockerfile doesn't build
27
- - Baseline doesn't reproduce
28
- - <3 tasks with graders
29
- - Graders always return same score
30
- - No baseline inference script
31
- - Plagiarized environment
32
-
33
- ## Required Submission Artifacts
34
- 1. Public GitHub repo (code, README, requirements, demo script)
35
- 2. HF Spaces demo link (tagged `openenv`)
36
- 3. README with: env description, action/obs spaces, task descriptions, setup instructions, baseline scores
37
-
38
- ## Required Endpoints
39
- - `POST /baseline` — trigger inference, return baseline scores
40
- - `POST /grader` — return grader score after completed episode
41
- - `GET /tasks` — return task list with action schema
42
-
43
- ## Evaluation Phases
44
- 1. **Automated Validation**: pass/fail gate (deploy, spec compliance, baseline reproduces)
45
- 2. **Agentic Evaluation**: standard Open LLM agent run against all environments
46
- 3. **Human Review**: Meta/HF engineers review top submissions
47
-
48
- **Why:** Understanding the rubric is essential to prioritize work. Real-world utility (30%) + task quality (25%) = 55% of score. Code quality is only 15%.
49
-
50
- **How to apply:** When making trade-offs, prioritize task quality and realism over code perfection. Ensure all DQ criteria pass before polishing.
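The rubric arithmetic above can be made concrete with a short sketch. It assumes the judges combine criteria as a plain weighted sum, which the rules do not state explicitly; the dictionary keys and `weighted_total` are hypothetical names.

```python
# Weights copied from the scoring rubric above (percent of total score).
RUBRIC_WEIGHTS = {
    "real_world_utility": 30,
    "task_grader_quality": 25,
    "environment_design": 20,
    "code_quality_spec_compliance": 15,
    "creativity_novelty": 10,
}


def weighted_total(fractions):
    """Combine per-criterion fractions (0.0-1.0) into a score out of 100."""
    missing = set(RUBRIC_WEIGHTS) - set(fractions)
    if missing:
        raise ValueError(f"missing criteria: {sorted(missing)}")
    return sum(RUBRIC_WEIGHTS[name] * fractions[name] for name in RUBRIC_WEIGHTS)
```

Under this assumption, a full mark on the first two criteria alone contributes 55 of the 100 points, which is the reasoning behind the prioritization note.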
.claude/memory/project_overview.md DELETED
@@ -1,83 +0,0 @@
1
- ---
2
- name: ML Debugger Project Overview
3
- description: PyTorch Training Run Debugger — OpenEnv RL environment for Meta PyTorch Hackathon. Core architecture, 7 tasks, dual model, real training, key modules.
4
- type: project
5
- ---
6
-
7
- ## What This Is
8
-
9
- A complete OpenEnv RL environment where an AI agent debugs broken PyTorch training runs. Built for the **Meta PyTorch OpenEnv Hackathon x Scaler School of Technology** (Round 1 deadline: April 8, 2026).
10
-
11
- **Runtime**: Python 3.12 · PyTorch 2.5.1 CPU-only · openenv-core v0.2.2
12
-
13
- ## Architecture
14
-
15
- ```
16
- server/app.py → FastAPI app via create_app() from openenv-core
17
- server/environment.py → MLTrainingEnvironment(Environment) — reset(), step(), state
18
- server/_baseline_results.py → Shared grader result storage
19
- server/dashboard.html → Live 4-panel Plotly.js dashboard
20
-
21
- ml_training_debugger/
22
- models.py → All Pydantic models (Action, Observation, EpisodeState, etc.)
23
- scenarios.py → ScenarioParams + sample_scenario() — 7 tasks, model_type, difficulty_level
24
- pytorch_engine.py → SimpleCNN + SimpleMLP, fault injection, gradient/weight extraction, run_real_training() with caching
25
- simulation.py → Calls run_real_training() for curves, parametric fallback
26
- reward_engine.py → 7-component reward function (per-step RL signal)
27
- graders.py → Per-task grader functions (0.0-1.0 holistic score at episode end)
28
- code_templates.py → Task 6 code bug templates + multi-strategy fix validation
29
- client.py → MLTrainingEnvClient extending GenericEnvClient
30
- ```
31
-
32
- ## The 7 Tasks
33
-
34
- | Task | Root Cause | Difficulty | Heuristic Score |
35
- |------|-----------|------------|-----------------|
36
- | task_001 | lr_too_high | Easy | 1.00 |
37
- | task_002 | vanishing_gradients | Easy | 1.00 |
38
- | task_003 | data_leakage | Medium | 1.00 |
39
- | task_004 | overfitting | Medium | 0.45 |
40
- | task_005 | batchnorm_eval_mode | Hard | 1.00 |
41
- | task_006 | code_bug (4 variants) | Hard | 1.00 |
42
- | task_007 | scheduler_misconfigured | Med-Hard | 1.00 |
43
-
44
- ## Model Architectures (Dual)
45
- - **SimpleCNN**: 3-layer CNN with BatchNorm, ~50K params (used for task_005, task_006)
46
- - **SimpleMLP**: 3-layer MLP with BatchNorm1d, ~20K params
47
- - Randomly selected per task/seed via `_pick_model_type(rng)`
48
-
49
- ## Real Training Curves
50
- - `run_real_training()` in pytorch_engine.py runs 20 real forward+backward epochs
51
- - Cached per (task_id, seed, model_type) — first call ~2s, subsequent instant
52
- - Replaces parametric formulas — judges see real training dynamics, not `torch.exp()`
53
-
54
- ## Key Endpoints
55
-
56
- - `GET /health` → `{"status": "ready", "tasks": 7}`
57
- - `GET /tasks` → Task list with action schema
58
- - `POST /grader` → Score after completed episode
59
- - `POST /baseline` → Run heuristic baseline, return all scores
60
- - `GET /dashboard` → Live diagnostic dashboard (Plotly.js)
61
- - `GET /validation-report` → Pre-computed fidelity report (8/8 pass)
62
- - `GET /curriculum` → Recommended task order with difficulty scaling
63
- - `GET /leaderboard` → Sorted episode scores
64
- - `GET /replay/{episode_id}` → Episode trace
65
- - `WS /ws` → Primary agent interface
66
- - Framework: `/reset`, `/step`, `/state`, `/schema`, `/docs`
67
-
68
- ## WebSocket Message Format
69
-
70
- - Reset (select task): `{"type": "reset", "data": {"task_id": "task_003", "seed": 42}}`
71
- - Reset (default): `{"type": "reset"}`
72
- - Step: `{"type": "step", "data": {"action_type": "inspect_gradients"}}`
73
- - Response: `{"type": "observation", "data": {"observation": {...}, "reward": float, "done": bool}}`
74
-
75
- ## Key Design Decisions
76
-
77
- - **Grader ≠ Reward**: graders.py (holistic 0.0-1.0) vs reward_engine.py (per-step float)
78
- - **Task IDs are opaque**: task_001-task_007
79
- - **Task 6 diagnosis is ALWAYS `code_bug`** regardless of variant
80
- - **Context-gated penalty**: -0.20 fires ONLY when `gradients_inspected=True AND gradients_were_normal=True`
81
- - **Step penalty is flat -0.01** (never multiplied by step_count)
82
- - **Difficulty scaling**: 1-5 via `difficulty_level` parameter in reset()
83
- - **Confusion matrix** included in data batch stats
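The caching behaviour described under "Real Training Curves" (keyed on task, seed, and model type) can be sketched with `functools.lru_cache`. This is not the repository's `run_real_training()`: the function below is a hypothetical stand-in that fabricates a decreasing curve so the memoization can be shown without PyTorch.

```python
import functools
import random


@functools.lru_cache(maxsize=None)
def cached_training_curve(task_id, seed, model_type, epochs=20):
    """Stand-in for an expensive training run, memoized on its arguments."""
    # Mix all key fields into the RNG seed so each combination gets its own curve.
    rng = random.Random(f"{task_id}:{seed}:{model_type}")
    loss = 2.0
    curve = []
    for _ in range(epochs):
        loss *= rng.uniform(0.85, 0.99)  # fake, strictly decreasing loss
        curve.append(loss)
    return tuple(curve)  # immutable, so callers cannot mutate the cached result
```

The first call for a given key does the work; repeated calls with the same arguments return the stored tuple, which matches the "first call slow, subsequent instant" behaviour the notes describe.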
.claude/memory/project_status.md DELETED
@@ -1,61 +0,0 @@
1
- ---
2
- name: Project Status as of 2026-03-30
3
- description: Current build/test/deployment status, verified metrics, known limitations, and remaining work.
4
- type: project
5
- ---
6
-
7
- ## Status: Code Complete, Deployment Pending
8
-
9
- **Last verified**: 2026-03-30
10
-
11
- ### Verified Metrics
12
- - **251 tests pass** (60s runtime due to real training)
13
- - **95% coverage** on ml_training_debugger/ + server/
14
- - **openenv validate** → `[OK] ML Debugger: Ready for multi-mode deployment`
15
- - **Baseline bit-exact reproducible** across runs
16
- - **Docker image: 885MB** (down from 1.96GB — 55% reduction)
17
- - **Docker uses torch 2.5.1+cpu** (multi-stage build, strip --strip-unneeded)
18
- - **8/8 validation checks pass** (real training curves)
19
- - **All endpoints work** (health, tasks, grader, baseline, dashboard, validation-report, curriculum, leaderboard, replay, schema, ws)
20
- - **All 7 tasks selectable via WS**: `{"type": "reset", "data": {"task_id": "task_007"}}`
21
-
22
- ### Baseline Scores (Heuristic)
23
- ```
24
- task_001: 1.0, task_002: 1.0, task_003: 1.0, task_004: 0.45,
25
- task_005: 1.0, task_006: 1.0, task_007: 1.0
26
- ```
27
-
28
- ### LLM Baseline Scores (Measured)
29
- - **Llama 3.3 70B** (Groq): 1.0, 1.0, 0.4, 0.45, 1.0, —, — (5/7 before rate limit)
30
- - **Llama 3.1 8B** (Cerebras): 0.6, 0.05, 0.4, 0.6, 1.0, 0.6, 0.6 (avg 0.55)
31
- - **Llama 3.1 8B** (Groq): 0.6, 0.05, 0.4, 0.6, 1.0, 1.0, 0.6 (avg 0.61)
32
-
33
- ### Features Implemented
34
- - 7 tasks with 3 difficulty tiers + difficulty scaling (1-5)
35
- - Dual architecture: SimpleCNN + SimpleMLP
36
- - Real 20-epoch PyTorch mini-training (cached per task/seed)
37
- - Context-gated reward penalty
38
- - Code-level debugging (Task 6, 4 bug variants, AST validation)
39
- - Task 7: LR Scheduler misconfigured
40
- - Confusion matrix in data batch stats
41
- - Curriculum, leaderboard, replay endpoints
42
- - PAPER.md research summary
43
- - EXPLANATION.md simple explanation
44
- - Multi-provider LLM baseline (Groq, Cerebras, Gemini, OpenAI)
45
- - Exploit resistance test (20-seed variance)
46
- - deploy-hf.sh deployment script
47
-
48
- ### Pending
49
- - [ ] Push to **public GitHub repo**
50
- - [ ] Deploy to **HF Spaces** (Docker type, tag `openenv`)
51
- - [ ] Run 70B baseline for tasks 6-7 (Groq quota resets daily)
52
- - [ ] Record dashboard GIF for README
53
-
54
- ### Docker Size History
55
- 1.96GB → 1.48GB → 1.09GB → **885MB** (irreducible: libtorch_cpu.so=329MB stripped)
56
-
57
- ### Known Limitations
58
- - Docker 885MB (target was 500MB — libtorch_cpu.so is irreducible)
59
- - HTTP /reset and /step are stateless (framework design — WS is primary interface)
60
- - Heuristic outperforms LLMs on most tasks (environment rewards domain knowledge)
61
- - `replace_optimizer` and `rollback_checkpoint` are no-op actions
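The averages and the size reduction quoted in this status file can be re-derived with a few lines. The score lists are copied from the file; treating 1.96GB as 1960MB (decimal units) is an assumption, and the helper names are hypothetical.

```python
from statistics import mean

# Per-task scores copied from the "LLM Baseline Scores" list above.
LLAMA_8B_CEREBRAS = [0.6, 0.05, 0.4, 0.6, 1.0, 0.6, 0.6]
LLAMA_8B_GROQ = [0.6, 0.05, 0.4, 0.6, 1.0, 1.0, 0.6]


def average_score(scores):
    """Mean score across tasks, rounded to two decimals as in the status file."""
    return round(mean(scores), 2)


def size_reduction_percent(before_mb, after_mb):
    """Percentage by which an image shrank, rounded to the nearest whole percent."""
    return round(100 * (1 - after_mb / before_mb))
```

With these inputs the two averages come out at 0.55 and 0.61, and `size_reduction_percent(1960, 885)` gives 55, consistent with the figures reported above.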
.claude/memory/reference_spec_docs.md DELETED
@@ -1,32 +0,0 @@
1
- ---
2
- name: Key spec documents and their roles
3
- description: Which files are source of truth for what, and how they relate to each other.
4
- type: reference
5
- ---
6
-
7
- ## Source of Truth Hierarchy
8
-
9
- 1. **`ml-training-debugger-spec.md`** — THE single source of truth. If anything conflicts with this, the spec wins.
10
- 2. **`CLAUDE.md`** — Coding rules, non-negotiable constraints, reward constants, commands. Derived from spec.
11
- 3. **`ROADMAP.md`** — Phase-by-phase implementation plan with acceptance criteria.
12
- 4. **`PRD.md`** — Product requirements (higher-level than spec).
13
-
14
- ## Key Spec Sections (by number)
15
- - S5: Context-gated reward shaping (the differentiator)
16
- - S6: PyTorch-native fault injection engine
17
- - S10: Data models (typed Pydantic models)
18
- - S11: The six core tasks (param ranges, grader breakdowns)
19
- - S12: Reward function (7 components, exact constants)
20
- - S13: Environment lifecycle (reset/step/done)
21
- - S14: OpenEnv spec compliance (endpoint contracts)
22
- - S16: Error handling (step() never raises)
23
- - S17: Baseline inference design (heuristic decision tree)
24
- - S18: PyTorch validation suite
25
- - S22: Code fix validation pipeline (normalize → tokenize → semantic → AST)
26
-
27
- ## Non-Negotiable Rules (from CLAUDE.md)
28
- - Context-gated -0.20 penalty: ONLY when `gradients_inspected=True AND gradients_were_normal=True`
29
- - Task 6 diagnosis is ALWAYS `code_bug` (not `batchnorm_eval_mode` etc.)
30
- - PyTorch-native only — no numpy in core modules
31
- - Grader ≠ reward function (separate modules, separate purposes)
32
- - Opaque task IDs (task_001-task_006, no descriptive names agent can see)
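The context-gated penalty rule above reduces to a two-flag check. A minimal sketch, assuming the two flags live on some episode-state object; `EpisodeFlags` and `context_gated_penalty` are hypothetical names, and which agent action triggers the check is not stated in these notes.

```python
from dataclasses import dataclass

CONTEXT_GATED_PENALTY = -0.20  # constant quoted in the rules above


@dataclass
class EpisodeFlags:
    """Hypothetical subset of episode state; field names follow the rule text."""
    gradients_inspected: bool = False
    gradients_were_normal: bool = False


def context_gated_penalty(flags):
    """Return -0.20 only when both gating conditions hold, otherwise 0.0."""
    if flags.gradients_inspected and flags.gradients_were_normal:
        return CONTEXT_GATED_PENALTY
    return 0.0
```

The point of the gate is that an agent is penalized only when it had already seen evidence (normal gradients) that should have ruled the choice out, not merely for making the choice.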
.claude/memory/user_context.md DELETED
@@ -1,12 +0,0 @@
1
- ---
2
- name: User context and preferences
3
- description: Omkar is building a hackathon submission, wants winning-quality output with comprehensive testing.
4
- type: user
5
- ---
6
-
7
- - Building a hackathon submission for Meta PyTorch OpenEnv Hackathon x Scaler School of Technology
8
- - Wants thorough audit and verification before submission
9
- - Values comprehensive testing and spec compliance
10
- - Project is in the ML Debugger subdirectory under a Rubacus monorepo
11
- - Uses Python 3.12, venv at `.venv/`
12
- - Commands run from `/home/omkar-kadam/Desktop/Rubacus/ML Debugger/`
.claude/plan/fix-all-gaps.md DELETED
@@ -1,92 +0,0 @@
1
- # Implementation Plan: Fix All Hackathon Gaps
2
-
3
- ## Task Type
4
- - [x] Backend (→ Claude direct — all fixes are Python/server-side)
5
-
6
- ## Key Discovery
7
-
8
- **WS task selection WORKS!** The correct format is:
9
- ```json
10
- {"type": "reset", "data": {"task_id": "task_003", "seed": 42}}
11
- ```
12
- The framework's `WSResetMessage` has a `data: Dict[str, Any]` field that passes kwargs to `reset()`. This was previously thought broken but actually works — just needs the `data` wrapper.
13
-
14
- **Impact**: The "CRITICAL" WS task selection issue is actually just a documentation/test gap, not a code bug.
15
-
16
- ---
17
-
18
- ## Implementation Steps
19
-
20
- ### Step 1: Fix WS Tests to Use Correct Task Selection Format
21
- **Files**: `tests/test_websocket.py`
22
- **What**: Update tests to verify `{"type": "reset", "data": {"task_id": "task_003"}}` works. Add tests for all 6 tasks via WS.
23
- **Deliverable**: Tests proving WS task selection works for all tasks.
24
-
25
- ### Step 2: Update README WS Documentation
26
- **Files**: `README.md`
27
- **What**: Update WS reset format docs to show the `data` field:
28
- ```json
29
- {"type": "reset", "data": {"task_id": "task_003", "seed": 42}}
30
- ```
31
- **Deliverable**: Correct documentation.
32
-
33
- ### Step 3: Fix HTTP /step Session Isolation
34
- **Files**: `server/environment.py`, `server/app.py`
35
- **What**: Add a module-level shared session store so HTTP `/reset` and `/step` share state. The framework creates a new env instance per WS connection but HTTP requests use the app-level routes.
36
- **Approach**: Use a module-level `_shared_sessions` dict in `_baseline_results.py` (or a new module) that the environment reads from. When HTTP `/reset` creates a session, store it. When HTTP `/step` runs, look up the session.
37
- **Alternative**: If the framework already handles HTTP session state internally, this may not be fixable without patching the framework. In that case, document that WS is the primary interface and HTTP is for single-action calls only.
38
- **Deliverable**: HTTP reset+step work for full episodes, OR clear documentation that WS is the primary interface.
39
-
40
- ### Step 4: Run Real Validation Suite & Store Results
41
- **Files**: `validation/validate_*.py` (create missing scripts), `server/app.py` (update endpoint)
42
- **What**:
43
- - Create validation scripts for all 6 fault types (only exploding_gradients exists)
44
- - Run them locally, capture R² scores
45
- - Store results in `validation/reports/fidelity_report.json`
46
- - Update `/validation-report` endpoint to serve real pre-computed data
47
- **Deliverable**: Real fidelity scores served at `/validation-report`.
48
-
49
- ### Step 5: Verify Dashboard Real-Time Updates
50
- **Files**: `server/dashboard.html`
51
- **What**: Start server, open dashboard in browser, run an episode via the dashboard's built-in controls (the HTML has task select + run button). Verify charts update. If they don't, fix the WS connection in the dashboard JS.
52
- **Deliverable**: Dashboard shows live episode data.
53
-
54
- ### Step 6: Update EXPLANATION.md and README with WS Format
55
- **Files**: `EXPLANATION.md`, `README.md`
56
- **What**: Fix the WS documentation to show the correct task selection format.
57
- **Deliverable**: Accurate docs.
58
-
59
- ### Step 7: Docker Size — Document the Reality
60
- **Files**: `README.md`
61
- **What**: Add a note explaining why the image is ~1.5GB:
62
- > "PyTorch CPU-only requires libtorch_cpu.so (426MB) for real torch.nn.Module and torch.autograd support. This is the minimum for a PyTorch-native environment — the trade-off for real gradient computation vs synthetic data."
63
- **Deliverable**: Judges understand the trade-off is intentional.
64
-
65
- ### Step 8: Run Full Smoke Test
66
- **What**: Execute the complete pre-submission checklist against Docker container.
67
- **Deliverable**: All gates pass.
68
-
69
- ---
70
-
71
- ## Key Files
72
-
73
- | File | Operation | Description |
74
- |------|-----------|-------------|
75
- | tests/test_websocket.py | Modify | Add WS task selection tests for all 6 tasks |
76
- | README.md | Modify | Fix WS reset format, add Docker size note |
77
- | EXPLANATION.md | Modify | Fix WS reset format |
78
- | server/app.py:93-137 | Modify | Update /validation-report with real data |
79
- | validation/validate_*.py | Create | Validation scripts for all fault types |
80
- | validation/reports/fidelity_report.json | Create | Pre-computed R² scores |
81
-
82
- ## Risks and Mitigation
83
-
84
- | Risk | Mitigation |
85
- |------|------------|
86
- | HTTP /step session isolation may not be fixable | Document WS as primary interface; HTTP for single calls |
87
- | Validation R² may be low for some fault types | Use directional agreement as fallback metric |
88
- | Dashboard WS may not connect | Check browser console, fix WS URL construction |
89
-
90
- ## SESSION_ID (for /ccg:execute use)
91
- - CODEX_SESSION: N/A
92
- - GEMINI_SESSION: N/A
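Step 4 and the risk table in this plan mention R² scores with "directional agreement" as a fallback metric. A sketch of both, under two assumptions: that each validation compares a reference curve with a simulated one of equal length, and that directional agreement means the share of consecutive steps where both curves move the same way. The plan defines neither, and the function names are hypothetical.

```python
def _check_curves(reference, simulated):
    if len(reference) != len(simulated) or len(reference) < 2:
        raise ValueError("need two equal-length curves with at least 2 points")


def r_squared(reference, simulated):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    _check_curves(reference, simulated)
    mean_ref = sum(reference) / len(reference)
    ss_res = sum((r - s) ** 2 for r, s in zip(reference, simulated))
    ss_tot = sum((r - mean_ref) ** 2 for r in reference)
    if ss_tot == 0:
        raise ValueError("reference curve is constant; R^2 is undefined")
    return 1.0 - ss_res / ss_tot


def directional_agreement(reference, simulated):
    """Fraction of consecutive steps where both curves move in the same direction."""
    _check_curves(reference, simulated)

    def sign(x):
        return (x > 0) - (x < 0)

    ref_steps = [sign(b - a) for a, b in zip(reference, reference[1:])]
    sim_steps = [sign(b - a) for a, b in zip(simulated, simulated[1:])]
    matches = sum(r == s for r, s in zip(ref_steps, sim_steps))
    return matches / len(ref_steps)
```

Directional agreement stays informative when the curves differ in scale (a simulated loss ten times larger still scores 1.0 if it moves the same way), which is why it is a plausible fallback when R² is low.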
.claude/plan/hackathon-winning-audit.md DELETED
@@ -1,241 +0,0 @@
1
- # Deep Audit & Winning Plan — PyTorch Training Run Debugger
2
-
3
- ## Audit Date: 2026-03-28 (Submission Window NOW OPEN)
4
-
5
- ---
6
-
7
- ## AUDIT RESULTS SUMMARY
8
-
9
- ### What's Working Well (GREEN)
10
- - **151/151 tests pass** in 6.13s — zero failures
11
- - **96% code coverage** on `ml_training_debugger/` package
12
- - **Baseline bit-exact reproducible**: identical on two consecutive runs
13
- - **`openenv validate` passes**: `[OK] ML Debugger: Ready for multi-mode deployment`
14
- - **All 6 tasks implemented** with correct root causes and graders
15
- - **Context-gated penalty** fires correctly (tested both paths)
16
- - **Zero numpy imports** in core — all `import torch`
17
- - **Typed Pydantic models** everywhere — no `Dict[str, Any]`
18
- - **Graders return varying scores**: task_005=0.35, others=1.0
19
- - **All custom endpoints work**: `/health`, `/tasks`, `/grader`, `/baseline`, `/dashboard`, `/validation-report`
20
- - **WebSocket full episode flow works**: reset → step → diagnose (via correct message format)
21
- - **Reward constants match spec exactly**
22
- - **Task 6 code fix validation**: multi-strategy pipeline (normalize, tokenize, semantic, AST)
23
- - **README comprehensive** with all required sections
24
- - **Docker builds** successfully from `python:3.12-slim`
25
-
26
- ### CRITICAL Issues (Blocking Submission)
27
-
28
- #### C1. Docker Image Size: 1.96GB (Target: <500MB)
29
- - **Impact**: Judges/auto-validator will flag. Spec says <500MB target.
30
- - **Root Cause**: PyTorch CPU wheel layers aren't compressed properly. The cleanup `rm -rf` runs in a separate RUN layer so Docker still stores the original layer.
31
- - **Fix**: Combine install + cleanup in single RUN layer. Use multi-stage build. Strip torch test/include/share dirs, `.pyi` files, and `__pycache__` all in one layer.
32
-
33
- #### C2. WebSocket Message Format Must Be Documented
34
- - **Impact**: Framework expects specific WS formats that differ from intuitive use:
35
- - Reset: `{"type": "reset"}` (no extra top-level fields; a top-level `task_id` is rejected, so task selection must go inside `data`)
36
- - Step: `{"type": "step", "data": {"action_type": "inspect_gradients"}}` (NOT `"action"`)
37
- - **Current state**: WS works correctly when using the right format. Tests pass.
38
- - **Fix**: Document the correct WS message format in README. Consider adding a custom WS handler for task selection.
39
-
40
- #### C3. HTTP `/step` Session Isolation
41
- - **Impact**: HTTP `POST /step` returns empty observation when used after HTTP `POST /reset`. Different env instances per request.
42
- - **Status**: The primary agent interface is WS (which works). HTTP reset/step are framework-provided. Auto-validator likely tests WS.
43
- - **Fix**: Accept this limitation and document WS as primary interface. The `/baseline` endpoint works because it creates its own env instances directly.
44
-
45
- ### HIGH Priority Issues
46
-
47
- #### H1. `done` Field in WS Response
48
- - **Status**: After `mark_diagnosed`, the WS response shows `done=None` in the observation. The `done` field may be at the wrapper level `resp['data']['done']`, not `resp['data']['observation']['done']`.
49
- - **Fix**: Verify and ensure the framework passes `done` correctly.
50
-
51
- #### H2. No HF Space Deployed Yet
52
- - **Impact**: DISQUALIFICATION if not deployed.
53
- - **Fix**: Deploy to HF Spaces after Docker fix. Tag with `openenv`.
54
-
55
- #### H3. Git Repo Not Public
56
- - **Impact**: DISQUALIFICATION if not public.
57
- - **Fix**: Push to public GitHub repo.
58
-
59
- ### MEDIUM Priority Issues
60
-
61
- #### M1. Coverage Gaps (4% remaining)
62
- - `code_templates.py` AST fallback paths (lines 177-178, 208, 218, 224-246)
63
- - `pytorch_engine.py` conv1 near-vanishing red herring (lines 198-201)
64
- - **Fix**: Add targeted tests for these edge paths.
65
-
66
- #### M2. Validation Report is Hardcoded
67
- - `/validation-report` returns static dict, not computed from actual runs.
68
- - **Fix**: Acceptable for submission. Consider running validation suite and storing real results.
69
-
70
- #### M3. Heuristic Doesn't Handle All Code Bug Variants
71
- - `baseline_heuristic.py` only catches `eval_mode` and `detach_loss` variants for Task 6.
72
- - `zero_grad_missing` and `inplace_relu` fall through to generic `code_bug` diagnosis (correct) but without fix.
73
- - **Status**: Acceptable — shows the task genuinely challenges even pattern-matching approaches.
74
-
75
- ---
76
-
77
- ## HACKATHON COMPLIANCE MATRIX
78
-
79
- | Requirement | Status | Evidence |
80
- |------------|--------|---------|
81
- | Real-world task simulation | PASS | ML debugging — genuine industry problem |
82
- | OpenEnv spec compliance | PASS | `openenv validate` passes |
83
- | Typed Pydantic models | PASS | All models extend `Action`/`Observation` |
84
- | step()/reset()/state() API | PASS | Full implementation in `environment.py` |
85
- | openenv.yaml with metadata | PASS | 6 tasks, reward config, endpoints |
86
- | 3+ tasks with graders (0.0-1.0) | PASS | 6 tasks, 3 difficulty tiers |
87
- | Meaningful reward function | PASS | 7 components, context-gated penalty |
88
- | Baseline inference script | PASS | `baseline_heuristic.py` (deterministic) + `baseline_inference.py` (LLM) |
89
- | Working Dockerfile | PASS | Builds, runs on 7860 |
90
- | Docker image <500MB | **FAIL** | 1.96GB — needs multi-stage build |
91
- | HF Space deployed | **PENDING** | Not yet deployed |
92
- | HF Space tagged `openenv` | **PENDING** | Not yet tagged |
93
- | Public GitHub repo | **PENDING** | Not yet public |
94
- | README complete | PASS | All required sections present |
95
- | `/health` endpoint | PASS | `{"status": "ready", "tasks": 6}` |
96
- | `/tasks` endpoint | PASS | 6 tasks with action schema |
97
- | `/grader` endpoint | PASS | Score after episode completion |
98
- | `/baseline` endpoint | PASS | Scores for all 6 tasks |
99
- | WS `/ws` responds to reset | PASS | Returns valid observation |
100
-
101
- ---
102
-
103
- ## IMPLEMENTATION PLAN — Priority Order
104
-
105
- ### Phase 1: Fix Docker Size (CRITICAL — Must Do First)
106
-
107
- #### Step 1.1: Rewrite Dockerfile with Multi-Stage Build
108
- **File**: `Dockerfile`
109
- **Goal**: Image <500MB
110
-
111
- **Key changes**:
112
- 1. Combine PyTorch install + aggressive cleanup in a SINGLE RUN layer (Docker layers are immutable — separate RUN for cleanup doesn't reduce size)
113
- 2. Remove more torch internals: `torch/testing/`, `torch/utils/benchmark/`, `torch/distributed/`, `torch/ao/`
114
- 3. Strip all `.pyi` type stub files
115
- 4. Remove all `__pycache__` dirs
116
- 5. Consider using `--target` multi-stage to copy only runtime files
117
-
118
- **Pseudo-Dockerfile**:
119
- ```dockerfile
120
- FROM python:3.12-slim
121
-
122
- WORKDIR /app
123
-
124
- # Install curl for healthcheck
125
- RUN apt-get update && apt-get install -y --no-install-recommends curl && \
126
- rm -rf /var/lib/apt/lists/*
127
-
128
- # Install torch + deps + strip in ONE layer
129
- RUN pip install --no-cache-dir torch --index-url https://download.pytorch.org/whl/cpu && \
130
- pip install --no-cache-dir openenv-core pydantic fastapi uvicorn openai && \
131
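- # Aggressive cleanup in same layer. Caution: feedback_docker_stripping.md lists torch/testing, torch/distributed, and torch/ao as required by `import torch`, so verify imports after this step.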
132
- rm -rf /usr/local/lib/python3.12/site-packages/torch/test \
133
- /usr/local/lib/python3.12/site-packages/torch/testing \
134
- /usr/local/lib/python3.12/site-packages/torch/include \
135
- /usr/local/lib/python3.12/site-packages/torch/share \
136
- /usr/local/lib/python3.12/site-packages/torch/distributed \
137
- /usr/local/lib/python3.12/site-packages/torch/ao \
138
- /usr/local/lib/python3.12/site-packages/torch/utils/benchmark \
139
- /usr/local/lib/python3.12/site-packages/torch/utils/bottleneck \
140
- /usr/local/lib/python3.12/site-packages/torch/utils/tensorboard \
141
- /usr/local/lib/python3.12/site-packages/torch/lib/*.a && \
142
- find /usr/local/lib/python3.12/site-packages/torch -name "*.pyi" -delete && \
143
- find /usr/local/lib/python3.12/site-packages -name "__pycache__" -exec rm -rf {} + 2>/dev/null; true
144
-
145
- COPY ml_training_debugger/ ml_training_debugger/
146
- COPY server/ server/
147
- COPY openenv.yaml .
148
- COPY baseline_heuristic.py .
149
- COPY baseline_inference.py .
150
- COPY README.md .
151
-
152
- EXPOSE 7860
153
-
154
- HEALTHCHECK --interval=30s --timeout=3s --start-period=10s --retries=3 \
155
- CMD curl -f http://localhost:7860/health || exit 1
156
-
157
- CMD ["uvicorn", "server.app:app", "--host", "0.0.0.0", "--port", "7860"]
158
- ```
159
-
160
- **Verification**: `docker images pytorch-debugger` shows <500MB
161
-
162
- #### Step 1.2: Verify Docker Container Works
163
- ```bash
164
- docker build --no-cache -t pytorch-debugger .
165
- docker run -d -p 7860:7860 --name smoke pytorch-debugger
166
- sleep 10
167
- curl -f http://localhost:7860/health
168
- curl -f http://localhost:7860/tasks | python -m json.tool
169
- curl -f -X POST http://localhost:7860/baseline | python -m json.tool
170
- docker stop smoke && docker rm smoke
171
- ```
172
-
173
- ### Phase 2: Deploy (CRITICAL)
174
-
175
- #### Step 2.1: Push to Public GitHub
176
- 1. Initialize git (if not done)
177
- 2. Push to public repo
178
- 3. Ensure README, openenv.yaml, Dockerfile, baseline scripts, source all present
179
-
180
- #### Step 2.2: Deploy to HF Spaces
181
- 1. Create HF Space (Docker type)
182
- 2. Tag with `openenv`
183
- 3. Push code
184
- 4. Verify build completes
185
- 5. Test endpoints:
186
- - `curl https://<space>/health`
187
- - `wscat -c wss://<space>/ws` → `{"type": "reset"}`
188
-
189
- ### Phase 3: Polish for Maximum Score
190
-
191
- #### Step 3.1: Add Coverage for Edge Paths
192
- **Files**: New tests targeting uncovered lines in `code_templates.py` and `pytorch_engine.py`
193
- - Test AST fallback validation in `validate_fix()`
194
- - Test conv1 near-vanishing red herring injection
195
- - Target: 98%+ coverage
196
-
197
- #### Step 3.2: README Final Polish
198
- - Add WS message format documentation
199
- - Add architecture diagram (text-based)
200
- - Update any changed baseline scores
201
- - Add HF Space URL after deployment
202
-
203
- #### Step 3.3: Run Complete Smoke Test Sequence
204
- Execute the full checklist from ROADMAP.md against the deployed Docker container and HF Space.
205
-
206
- ---
207
-
208
- ## SCORING SELF-ASSESSMENT
209
-
210
- | Criterion | Weight | Current | After Fixes | Notes |
211
- |-----------|--------|---------|-------------|-------|
212
- | Real-world utility | 30% | 27/30 | 28/30 | ML debugging is genuine, PyTorch-aligned |
213
- | Task & grader quality | 25% | 23/25 | 24/25 | 6 tasks, difficulty range, deterministic graders |
214
- | Environment design | 20% | 17/20 | 18/20 | Clean state, typed models, shaped reward |
215
- | Code quality & spec | 15% | 11/15 | 14/15 | Docker fix + deploy brings this up |
216
- | Creativity & novelty | 10% | 9/10 | 9/10 | Context-gated penalty is unique |
217
- | **TOTAL** | **100%** | **87/100** | **93/100** | |
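As a quick arithmetic check of the table, the per-criterion scores can be summed directly. This sketch only re-adds the numbers listed above; the variable names are illustrative.

```python
# (criterion, current, after_fixes, max_points), copied from the table above
rows = [
    ("Real-world utility", 27, 28, 30),
    ("Task & grader quality", 23, 24, 25),
    ("Environment design", 17, 18, 20),
    ("Code quality & spec", 11, 14, 15),
    ("Creativity & novelty", 9, 9, 10),
]

current_total = sum(current for _, current, _, _ in rows)
after_total = sum(after for _, _, after, _ in rows)
max_total = sum(max_points for _, _, _, max_points in rows)

# The criterion with the largest expected gain from the planned fixes
biggest_gain = max(rows, key=lambda row: row[2] - row[1])

print(current_total, after_total, max_total)  # 87 93 100
print(biggest_gain[0])  # Code quality & spec
```

Half of the projected six-point gain comes from "Code quality & spec", which is consistent with the Dockerfile fix and deployment sitting at the top of the execution priority list.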
218
-
219
- ---
220
-
221
- ## EXECUTION PRIORITY (Top to Bottom)
222
-
223
- 1. **Fix Dockerfile** — single RUN layer for install+cleanup → target <500MB
224
- 2. **Rebuild Docker** — verify size and functionality
225
- 3. **Push to public GitHub**
226
- 4. **Deploy to HF Spaces** — tag with `openenv`
227
- 5. **Add edge-case tests** — 98%+ coverage
228
- 6. **README final polish** — add WS format docs, HF URL
229
- 7. **Full smoke test** — against deployed container and HF Space
230
- 8. **Submit** — HF Space URL + GitHub repo URL
231
-
232
- ---
233
-
234
- ## KEY FILES TO MODIFY
235
-
236
- | File | Change | Priority |
237
- |------|--------|----------|
238
- | `Dockerfile` | Multi-stage or single-layer install+cleanup | CRITICAL |
239
- | `README.md` | Add WS format docs, HF URL, architecture diagram | HIGH |
240
- | `tests/test_code_templates_edge.py` | New: AST fallback, edge cases | MEDIUM |
241
- | `tests/test_pytorch_engine.py` | Extend: conv1 near-vanishing | MEDIUM |
 
.claude/plan/pytorch-debugger-mvp.md DELETED
@@ -1,1647 +0,0 @@
1
- # Implementation Plan: PyTorch Training Run Debugger — OpenEnv Environment
2
-
3
- **Generated:** 2026-03-28
4
- **King File:** `ml-training-debugger-spec.md` — single source of truth for all conflicts
5
- **Runtime:** Python 3.12 · PyTorch CPU-only · openenv-core (installed in .venv)
6
- **MVP Scope:** Tasks 1, 3, 5 + rule-based baseline + all required endpoints + Docker + HF Spaces
7
-
8
- ---
9
-
10
- ## Markdown Files Confirmed Read
11
-
12
- | File | Lines | Role |
13
- |------|-------|------|
14
- | `ml-training-debugger-spec.md` | 1549 | **KING FILE** — final authority on all design decisions |
15
- | `CLAUDE.md` | ~280 | Coding standards, non-negotiable rules, reward constants |
16
- | `PRD.md` | ~368 | Product requirements, success metrics, timeline |
17
- | `ROADMAP.md` | ~442 | Phased roadmap with acceptance criteria |
18
-
19
- All four files read in full. The spec is the definitive authority.
20
-
21
- ---
22
-
23
- ## Complete Project Structure (Final State)
24
-
25
- ```
26
- ML Debugger/ # Project root
27
- ├── .claude/
28
- │ └── plan/
29
- │ └── pytorch-debugger-mvp.md # This plan
30
- ├── .dockerignore
31
- ├── .gitignore
32
- ├── .python-version # "3.12"
33
- ├── CLAUDE.md # Already exists
34
- ├── Dockerfile
35
- ├── PRD.md # Already exists
36
- ├── README.md
37
- ├── ROADMAP.md # Already exists
38
- ├── baseline_heuristic.py # Rule-based baseline (no API key)
39
- ├── baseline_inference.py # LLM baseline (optional, requires OPENAI_API_KEY)
40
- ├── deploy.sh # One-command build+test+validate script
41
- ├── ml-training-debugger-spec.md # Already exists (king file)
42
- ├── openenv.yaml
43
- ├── pyproject.toml
44
- ├── requirements.txt
45
-
46
- ├── ml_training_debugger/
47
- │ ├── __init__.py
48
- │ ├── models.py # All Pydantic models + RootCauseDiagnosis enum
49
- │ ├── client.py # EnvClient extension with typed action/observation
50
- │ ├── scenarios.py # ScenarioParams + sample_scenario()
51
- │ ├── pytorch_engine.py # SimpleCNN, fault injection, gradient/weight extraction
52
- │ ├── simulation.py # Parametric curve generation (torch.Tensor ops)
53
- │ ├── code_templates.py # Task 6: code snippets with bugs + validate_fix()
54
- │ ├── reward_engine.py # compute_reward() — all 7 components
55
- │ └── graders.py # Per-task grader functions (0.0–1.0)
56
-
57
- ├── server/
58
- │ ├── __init__.py
59
- │ ├── environment.py # MLTrainingEnvironment(Environment)
60
- │ ├── app.py # create_app() + custom routes
61
- │ └── dashboard.html # Live diagnostic dashboard (Phase 3)
62
-
63
- ├── validation/ # PyTorch validation suite (Phase 3)
64
- │ ├── requirements.txt
65
- │ ├── conftest.py
66
- │ ├── validate_exploding_gradients.py
67
- │ ├── validate_vanishing_gradients.py
68
- │ ├── validate_data_leakage.py
69
- │ ├── validate_overfitting.py
70
- │ ├── validate_batchnorm_eval.py
71
- │ ├── validate_code_bugs.py
72
- │ └── reports/ # Pre-computed fidelity plots
73
-
74
- └── tests/
75
- ├── __init__.py
76
- ├── conftest.py # Shared fixtures
77
- ├── test_models.py
78
- ├── test_scenarios.py
79
- ├── test_pytorch_engine.py
80
- ├── test_simulation.py
81
- ├── test_code_templates.py
82
- ├── test_reward_engine.py
83
- ├── test_graders.py
84
- ├── test_episode_lifecycle.py
85
- ├── test_endpoints.py
86
- └── test_baseline_reproducibility.py
87
- ```
88
-
89
- ---
90
-
91
- ## Phase 0: Project Initialization & Validation Setup
92
-
93
- ### Goal
94
- A running skeleton server that proves the toolchain works end-to-end. Zero business logic — just plumbing.
95
-
96
- ### Files to Create
97
-
98
- **Step 0.1 — Project config files:**
99
-
100
- 1. **`.python-version`** — content: `3.12`
101
-
102
- 2. **`.gitignore`**:
103
- ```
104
- .venv/
105
- __pycache__/
106
- *.pyc
107
- *.pyo
108
- .env
109
- run*.json
110
- .pytest_cache/
111
- htmlcov/
112
- *.egg-info/
113
- dist/
114
- build/
115
- validation/reports/*.png
116
- .mypy_cache/
117
- ```
118
-
119
- 3. **`.dockerignore`**:
120
- ```
121
- .venv/
122
- __pycache__/
123
- .git/
124
- .pytest_cache/
125
- tests/
126
- validation/
127
- *.md
128
- !README.md
129
- .claude/
130
- run*.json
131
- htmlcov/
132
- ```
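The `*.md` / `!README.md` pair depends on `.dockerignore` ordering: the last pattern that matches a path decides whether it is excluded, and a leading `!` turns a pattern into an exception. The sketch below is a simplified model of that rule, not Docker's actual matcher. It assumes every pattern applies only to the first path component, which holds for the patterns listed here because none of them contain `**` or nested paths.

```python
from fnmatch import fnmatchcase

# Patterns copied from the .dockerignore above, in file order.
PATTERNS = [
    ".venv/", "__pycache__/", ".git/", ".pytest_cache/", "tests/",
    "validation/", "*.md", "!README.md", ".claude/", "run*.json", "htmlcov/",
]


def is_ignored(path: str, patterns: list[str] = PATTERNS) -> bool:
    """Simplified last-match-wins evaluation of .dockerignore patterns."""
    top_level = path.split("/", 1)[0]  # excluding a directory excludes its contents
    ignored = False
    for raw in patterns:
        negate = raw.startswith("!")
        pattern = (raw[1:] if negate else raw).rstrip("/")
        if fnmatchcase(top_level, pattern):
            ignored = not negate  # a later match overrides an earlier one
    return ignored


for candidate in ["CLAUDE.md", "README.md", "validation/reports/loss.png", "server/app.py"]:
    print(candidate, is_ignored(candidate))
```

Two consequences follow from this file as written: `README.md` stays in the build context while the other top-level Markdown files are dropped, and everything under `validation/` is excluded, so a `COPY validation/reports/ ...` instruction would find nothing to copy.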
133
-
134
- 4. **`pyproject.toml`**:
135
- ```toml
136
- [project]
137
- name = "pytorch-training-debugger"
138
- version = "1.0.0"
139
- description = "OpenEnv RL environment for PyTorch training failure debugging"
140
- requires-python = ">=3.12"
141
- dependencies = [
142
- "torch",
143
- "openenv-core",
144
- "pydantic>=2.0",
145
- "fastapi",
146
- "uvicorn",
147
- ]
148
-
149
- [project.optional-dependencies]
150
- dev = [
151
- "pytest",
152
- "pytest-cov",
153
- "pytest-asyncio",
154
- "black",
155
- "ruff",
156
- "isort",
157
- "httpx",
158
- "websockets",
159
- ]
160
- llm = [
161
- "openai",
162
- ]
163
-
164
- [tool.black]
165
- line-length = 88
166
-
167
- [tool.isort]
168
- profile = "black"
169
-
170
- [tool.ruff]
171
- line-length = 88
172
- target-version = "py312"
173
-
174
- [tool.pytest.ini_options]
175
- testpaths = ["tests"]
176
- asyncio_mode = "auto"
177
- ```
178
-
179
- 5. **`requirements.txt`** (for Docker — flat list, no dev deps):
180
- ```
181
- torch
182
- openenv-core
183
- pydantic>=2.0
184
- fastapi
185
- uvicorn
186
- openai
187
- ```
188
-
189
- **Step 0.2 — Package stubs:**
190
-
191
- 6. **`ml_training_debugger/__init__.py`**:
192
- ```python
193
- """PyTorch Training Run Debugger — OpenEnv Environment."""
194
-
195
- __version__ = "1.0.0"
196
- ```
197
-
198
- 7. **`ml_training_debugger/models.py`** — STUB with all Pydantic models:
199
- ```python
200
- """All Pydantic models, enums, and typed data structures.
201
-
202
- No business logic. Pure data definitions.
203
- """
204
-
205
- from __future__ import annotations
206
-
207
- import enum
208
- from typing import Literal, Optional
209
-
210
- import torch
211
- from openenv.core.env_server.types import Action, Observation
212
- from pydantic import BaseModel, Field
213
-
214
-
215
- class RootCauseDiagnosis(str, enum.Enum):
216
- """Closed enumeration of ML failure root causes."""
217
- LR_TOO_HIGH = "lr_too_high"
218
- VANISHING_GRADIENTS = "vanishing_gradients"
219
- DATA_LEAKAGE = "data_leakage"
220
- OVERFITTING = "overfitting"
221
- BATCHNORM_EVAL_MODE = "batchnorm_eval_mode"
222
- CODE_BUG = "code_bug"
223
-
224
-
225
- class TrainingConfig(BaseModel):
226
- """Typed hyperparameter configuration."""
227
- learning_rate: float = 0.001
228
- weight_decay: float = 0.0001
229
- batch_size: int = 64
230
- hidden_dim: int = 64
231
- num_layers: int = 3
232
- optimizer: str = "adam"
233
- dropout_rate: float = 0.0
234
- gradient_clip_norm: Optional[float] = None
235
-
236
-
237
- class GradientStats(BaseModel):
238
- """Per-layer gradient information from real torch.autograd."""
239
- layer_name: str
240
- norm_history: list[float]
241
- mean_norm: float
242
- max_norm: float
243
- is_exploding: bool
244
- is_vanishing: bool
245
-
246
-
247
- class ModelWeightStats(BaseModel):
248
- """Per-layer weight statistics from real state_dict()."""
249
- layer_name: str
250
- weight_norm: float
251
- weight_mean: float
252
- weight_std: float
253
- weight_min: float
254
- weight_max: float
255
- dead_neuron_pct: float = 0.0
256
- has_nan: bool = False
257
- has_inf: bool = False
258
-
259
-
260
- class DataBatchStats(BaseModel):
261
- """Data batch inspection results."""
262
- label_distribution: dict[int, float]
263
- feature_mean: float
264
- feature_std: float
265
- null_count: int = 0
266
- class_overlap_score: float
267
- batch_size: int
268
- duplicate_ratio: float = 0.0
269
-
270
-
271
- class CodeSnippet(BaseModel):
272
- """PyTorch code for Task 6 inspection."""
273
- code: str
274
- filename: str = "train.py"
275
- line_count: int
276
- imports: list[str]
277
- hint: Optional[str] = None
278
-
279
-
280
- class EpisodeState(BaseModel):
281
- """Tracks agent history within an episode."""
282
- step_count: int = 0
283
- gradients_inspected: bool = False
284
- gradients_were_normal: bool = False
285
- data_inspected: bool = False
286
- model_modes_inspected: bool = False
287
- model_weights_inspected: bool = False
288
- code_inspected: bool = False
289
- fix_action_taken: bool = False
290
- restart_after_fix: bool = False
291
- diagnosis_submitted: bool = False
292
- actions_taken: list[str] = Field(default_factory=list)
293
-
294
- def compute_available_actions(self) -> list[str]:
295
- """Dynamically compute available actions based on current state."""
296
- actions = [
297
- "inspect_gradients",
298
- "inspect_data_batch",
299
- "inspect_model_modes",
300
- "inspect_model_weights",
301
- "inspect_code",
302
- "modify_config",
303
- "add_callback",
304
- "replace_optimizer",
305
- "patch_data_loader",
306
- "fix_model_mode",
307
- ]
308
- if self.code_inspected:
309
- actions.append("fix_code")
310
- if self.fix_action_taken:
311
- actions.append("restart_run")
312
- if self.restart_after_fix:
313
- actions.append("rollback_checkpoint")
314
- if not self.diagnosis_submitted:
315
- actions.append("mark_diagnosed")
316
- return actions
317
-
318
-
319
- ACTION_TYPES = Literal[
320
- "inspect_gradients",
321
- "inspect_data_batch",
322
- "inspect_model_modes",
323
- "inspect_model_weights",
324
- "inspect_code",
325
- "modify_config",
326
- "add_callback",
327
- "replace_optimizer",
328
- "patch_data_loader",
329
- "fix_model_mode",
330
- "fix_code",
331
- "restart_run",
332
- "mark_diagnosed",
333
- "rollback_checkpoint",
334
- ]
335
-
336
-
337
- class MLTrainingAction(Action):
338
- """What the agent can do — extends openenv Action."""
339
- action_type: str
340
- target: Optional[str] = None
341
- value: Optional[float | int | str] = None
342
- diagnosis: Optional[str] = None
343
- line: Optional[int] = None
344
- replacement: Optional[str] = None
345
-
346
-
347
- class MLTrainingObservation(Observation):
348
- """Full observation — extends openenv Observation (has done, reward, metadata)."""
349
- run_id: str = ""
350
- framework: str = "pytorch"
351
- epoch: int = 20
352
- training_loss_history: list[float] = Field(default_factory=list)
353
- val_loss_history: list[float] = Field(default_factory=list)
354
- val_accuracy_history: list[float] = Field(default_factory=list)
355
- gradient_stats: list[GradientStats] = Field(default_factory=list)
356
- model_weight_stats: Optional[list[ModelWeightStats]] = None
357
- gpu_memory_used_gb: float = 6.2
358
- gpu_memory_total_gb: float = 16.0
359
- learning_rate: float = 0.001
360
- current_config: TrainingConfig = Field(default_factory=TrainingConfig)
361
- error_log: Optional[str] = None
362
- data_batch_stats: Optional[DataBatchStats] = None
363
- model_mode_info: Optional[dict[str, str]] = None
364
- code_snippet: Optional[CodeSnippet] = None
365
- available_actions: list[str] = Field(default_factory=list)
366
- episode_state: EpisodeState = Field(default_factory=EpisodeState)
367
- notes: Optional[str] = None
368
- ```
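To see how `compute_available_actions()` unlocks actions as an episode progresses, here is a standalone sketch that mirrors the same gating rules with a plain dataclass. `EpisodeFlags` and `available_actions` are illustrative stand-ins so the example runs without `pydantic` or `openenv-core`; they are not the project's classes.

```python
from dataclasses import dataclass

# The ten actions that are always available, in the order used by the stub.
BASE_ACTIONS = [
    "inspect_gradients", "inspect_data_batch", "inspect_model_modes",
    "inspect_model_weights", "inspect_code", "modify_config",
    "add_callback", "replace_optimizer", "patch_data_loader", "fix_model_mode",
]


@dataclass
class EpisodeFlags:
    """Only the flags that gate actions; a stand-in for EpisodeState."""
    code_inspected: bool = False
    fix_action_taken: bool = False
    restart_after_fix: bool = False
    diagnosis_submitted: bool = False


def available_actions(flags: EpisodeFlags) -> list[str]:
    actions = list(BASE_ACTIONS)
    if flags.code_inspected:
        actions.append("fix_code")  # unlocked by inspect_code
    if flags.fix_action_taken:
        actions.append("restart_run")  # unlocked by any fix action
    if flags.restart_after_fix:
        actions.append("rollback_checkpoint")  # unlocked by a restart after a fix
    if not flags.diagnosis_submitted:
        actions.append("mark_diagnosed")  # removed once a diagnosis is submitted
    return actions


fresh = available_actions(EpisodeFlags())
after_inspect = available_actions(EpisodeFlags(code_inspected=True))
print(len(fresh), len(after_inspect))  # 11 12
```

A fresh episode exposes the ten base actions plus `mark_diagnosed`; each gated action appears only after its prerequisite flag is set.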
369
-
370
- 8. **`ml_training_debugger/client.py`** — STUB:
371
- ```python
372
- """Typed EnvClient for baseline scripts."""
373
-
374
- from openenv.core.env_client import EnvClient
375
-
376
- from ml_training_debugger.models import MLTrainingAction, MLTrainingObservation
377
-
378
-
379
- class MLTrainingEnvClient(EnvClient[MLTrainingAction, MLTrainingObservation, dict]):
380
- """Typed client for the PyTorch Training Debugger environment."""
381
-
382
- def _step_payload(self, action: MLTrainingAction) -> dict:
383
- return action.model_dump(exclude_none=True)
384
-
385
- def _parse_observation(self, data: dict) -> MLTrainingObservation:
386
- return MLTrainingObservation.model_validate(data)
387
- ```
388
-
389
- 9. **`server/__init__.py`** — empty file
390
-
391
- 10. **`server/environment.py`** — STUB:
392
- ```python
393
- """MLTrainingEnvironment — extends openenv Environment."""
394
-
395
- from typing import Any, Optional
396
-
397
- from openenv.core.env_server.interfaces import Environment
398
-
399
- from ml_training_debugger.models import (
400
- EpisodeState,
401
- MLTrainingAction,
402
- MLTrainingObservation,
403
- TrainingConfig,
404
- )
405
-
406
-
407
- class MLTrainingEnvironment(
408
- Environment[MLTrainingAction, MLTrainingObservation, dict]
409
- ):
410
- """OpenEnv environment for PyTorch training run debugging."""
411
-
412
- SUPPORTS_CONCURRENT_SESSIONS = True
413
-
414
- def reset(
415
- self,
416
- seed: Optional[int] = None,
417
- episode_id: Optional[str] = None,
418
- **kwargs: Any,
419
- ) -> MLTrainingObservation:
420
- """Reset environment, return initial observation."""
421
- state = EpisodeState()
422
- obs = MLTrainingObservation(
423
- run_id=episode_id or "episode_001",
424
- training_loss_history=[2.3] * 20,
425
- val_loss_history=[2.3] * 20,
426
- val_accuracy_history=[0.1] * 20,
427
- current_config=TrainingConfig(),
428
- available_actions=state.compute_available_actions(),
429
- episode_state=state,
430
- done=False,
431
- reward=0.0,
432
- )
433
- return obs
434
-
435
- def step(
436
- self,
437
- action: MLTrainingAction,
438
- timeout_s: Optional[float] = None,
439
- **kwargs: Any,
440
- ) -> MLTrainingObservation:
441
- """Process one agent action."""
442
- state = EpisodeState()
443
- obs = MLTrainingObservation(
444
- run_id="episode_001",
445
- training_loss_history=[2.3] * 20,
446
- val_loss_history=[2.3] * 20,
447
- val_accuracy_history=[0.1] * 20,
448
- current_config=TrainingConfig(),
449
- available_actions=state.compute_available_actions(),
450
- episode_state=state,
451
- done=False,
452
- reward=-0.01,
453
- )
454
- return obs
455
-
456
- @property
457
- def state(self) -> dict:
458
- """Return current environment state."""
459
- return {"status": "active"}
460
- ```
461
-
462
- 11. **`server/app.py`** — STUB with all endpoints:
463
- ```python
464
- """FastAPI app — openenv create_app() + custom routes."""
465
-
466
- import logging
467
-
468
- from fastapi import FastAPI
469
- from fastapi.responses import JSONResponse
470
- from openenv.core.env_server.http_server import create_app
471
-
472
- from ml_training_debugger.models import MLTrainingAction, MLTrainingObservation
473
- from server.environment import MLTrainingEnvironment
474
-
475
- logger = logging.getLogger(__name__)
476
-
477
- # create_app takes the class (factory), not an instance
478
- app: FastAPI = create_app(
479
- MLTrainingEnvironment,
480
- MLTrainingAction,
481
- MLTrainingObservation,
482
- env_name="pytorch_training_debugger",
483
- max_concurrent_envs=5,
484
- )
485
-
486
-
487
- @app.get("/health")
488
- def health_check() -> dict:
489
- """Health check — required by hackathon auto-validator."""
490
- return {"status": "ready", "tasks": 3}
491
-
492
-
493
- @app.get("/tasks")
494
- def get_tasks() -> list[dict]:
495
- """Return task list with IDs, difficulties, and action schema."""
496
- schema = MLTrainingAction.model_json_schema()
497
- return [
498
- {"id": "task_001", "difficulty": "easy", "max_steps": 20, "action_schema": schema},
499
- {"id": "task_003", "difficulty": "medium", "max_steps": 25, "action_schema": schema},
500
- {"id": "task_005", "difficulty": "hard", "max_steps": 30, "action_schema": schema},
501
- ]
502
-
503
-
504
- @app.post("/grader")
505
- def post_grader() -> dict:
506
- """Return grader score for most recently completed episode."""
507
- return {"score": None, "error": "no_completed_episode"}
508
-
509
-
510
- @app.post("/baseline")
511
- async def post_baseline() -> dict:
512
- """Trigger baseline run, return scores."""
513
- return {"scores": {"task_001": 0.0, "task_003": 0.0, "task_005": 0.0}}
514
- ```
515
-
516
- 12. **`openenv.yaml`**:
517
- ```yaml
518
- spec_version: 1
519
- name: pytorch-training-debugger
520
- type: space
521
- runtime: fastapi
522
- app: server.app:app
523
- port: 7860
524
-
525
- # Extended metadata
526
- version: "1.0.0"
527
- description: |
528
- PyTorch-native fault injection engine for training failure debugging.
529
- An AI agent investigates, diagnoses, fixes, and verifies broken
530
- training runs using real torch.nn.Module models, torch.autograd
531
- gradients, state_dict() weight inspection, and PyTorch code-level
532
- debugging.
533
- framework: openenv
534
- tags: [ml-debugging, pytorch, reinforcement-learning, root-cause-analysis, fault-injection]
535
-
536
- observation_space:
537
- type: MLTrainingObservation
538
- description: "Training run snapshot with progressive reveal"
539
-
540
- action_space:
541
- type: MLTrainingAction
542
- description: "Investigation, fix, and diagnosis actions with dynamic availability"
543
-
544
- tasks:
545
- - id: task_001
546
- difficulty: easy
547
- max_steps: 20
548
- - id: task_003
549
- difficulty: medium
550
- max_steps: 25
551
- - id: task_005
552
- difficulty: hard
553
- max_steps: 30
554
-
555
- reward:
556
- range: [-1.0, 1.0]
557
- shaped: true
558
- step_penalty: -0.01
559
- investigation_bonus: 0.05
560
- correct_diagnosis: 0.50
561
- terminal_convergence: 0.40
562
-
563
- endpoints:
564
- websocket: "/ws"
565
- tasks: "GET /tasks"
566
- grader: "POST /grader"
567
- baseline: "POST /baseline"
568
- health: "GET /health"
569
- ```
570
-
571
- 13. **`Dockerfile`**:
572
- ```dockerfile
573
- FROM python:3.12-slim
574
-
575
- WORKDIR /app
576
-
577
- # Install PyTorch CPU-only first (largest layer, cached)
578
- RUN pip install --no-cache-dir torch --index-url https://download.pytorch.org/whl/cpu
579
-
580
- # Install remaining dependencies
581
- COPY requirements.txt .
582
- RUN pip install --no-cache-dir -r requirements.txt
583
-
584
- # Copy application code
585
- COPY ml_training_debugger/ ml_training_debugger/
586
- COPY server/ server/
587
- COPY openenv.yaml .
588
- COPY baseline_heuristic.py .
589
-
590
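- # Copy pre-computed validation reports. COPY is not a shell command (no redirection, no `|| true`), so the directory must exist in the build context and must not be excluded by .dockerignore.
591
- COPY validation/reports/ validation/reports/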
592
-
593
- EXPOSE 7860
594
-
595
- CMD ["uvicorn", "server.app:app", "--host", "0.0.0.0", "--port", "7860"]
596
- ```
597
-
598
- 14. **`tests/__init__.py`** — empty file
599
-
600
- 15. **`tests/conftest.py`**:
601
- ```python
602
- """Shared test fixtures."""
603
-
604
- import pytest
605
-
606
- from ml_training_debugger.models import (
607
- EpisodeState,
608
- MLTrainingAction,
609
- MLTrainingObservation,
610
- TrainingConfig,
611
- )
612
-
613
-
614
- @pytest.fixture
615
- def fresh_episode_state() -> EpisodeState:
616
- return EpisodeState()
617
-
618
-
619
- @pytest.fixture
620
- def sample_config() -> TrainingConfig:
621
- return TrainingConfig(learning_rate=0.001)
622
-
623
-
624
- @pytest.fixture
625
- def sample_observation() -> MLTrainingObservation:
626
- state = EpisodeState()
627
- return MLTrainingObservation(
628
- run_id="test_episode",
629
- training_loss_history=[2.3 - i * 0.1 for i in range(20)],
630
- val_loss_history=[2.3 - i * 0.08 for i in range(20)],
631
- val_accuracy_history=[0.1 + i * 0.04 for i in range(20)],
632
- current_config=TrainingConfig(),
633
- available_actions=state.compute_available_actions(),
634
- episode_state=state,
635
- done=False,
636
- reward=0.0,
637
- )
638
- ```
639
-
640
- 16. **`tests/test_models.py`**:
641
- ```python
642
- """Test all Pydantic models instantiate and serialize correctly."""
643
-
644
- import json
645
-
646
- from ml_training_debugger.models import (
647
- CodeSnippet,
648
- DataBatchStats,
649
- EpisodeState,
650
- GradientStats,
651
- MLTrainingAction,
652
- MLTrainingObservation,
653
- ModelWeightStats,
654
- RootCauseDiagnosis,
655
- TrainingConfig,
656
- )
657
-
658
-
659
- class TestRootCauseDiagnosis:
660
- def test_all_six_values_exist(self):
661
- assert len(RootCauseDiagnosis) == 6
662
-
663
- def test_values_are_strings(self):
664
- for d in RootCauseDiagnosis:
665
- assert isinstance(d.value, str)
666
-
667
-
668
- class TestTrainingConfig:
669
- def test_default_instantiation(self):
670
- config = TrainingConfig()
671
- assert config.learning_rate == 0.001
672
-
673
- def test_json_roundtrip(self):
674
- config = TrainingConfig(learning_rate=0.01)
675
- data = json.loads(config.model_dump_json())
676
- restored = TrainingConfig.model_validate(data)
677
- assert restored.learning_rate == 0.01
678
-
679
-
680
- class TestEpisodeState:
681
- def test_fresh_state(self):
682
- state = EpisodeState()
683
- assert state.step_count == 0
684
- assert not state.gradients_inspected
685
- assert not state.diagnosis_submitted
686
-
687
- def test_available_actions_initial(self):
688
- state = EpisodeState()
689
- actions = state.compute_available_actions()
690
- assert "inspect_gradients" in actions
691
- assert "mark_diagnosed" in actions
692
- assert "fix_code" not in actions
693
- assert "restart_run" not in actions
694
-
695
- def test_fix_code_available_after_code_inspected(self):
696
- state = EpisodeState(code_inspected=True)
697
- actions = state.compute_available_actions()
698
- assert "fix_code" in actions
699
-
700
- def test_restart_run_available_after_fix(self):
701
- state = EpisodeState(fix_action_taken=True)
702
- actions = state.compute_available_actions()
703
- assert "restart_run" in actions
704
-
705
- def test_mark_diagnosed_disappears_after_submission(self):
706
- state = EpisodeState(diagnosis_submitted=True)
707
- actions = state.compute_available_actions()
708
- assert "mark_diagnosed" not in actions
709
-
710
-
711
- class TestMLTrainingObservation:
712
- def test_extends_observation(self):
713
- from openenv.core.env_server.types import Observation
714
- assert issubclass(MLTrainingObservation, Observation)
715
-
716
- def test_has_done_and_reward(self):
717
- obs = MLTrainingObservation(done=True, reward=0.5)
718
- assert obs.done is True
719
- assert obs.reward == 0.5
720
-
721
- def test_json_serialization(self):
722
- obs = MLTrainingObservation(
723
- run_id="test",
724
- training_loss_history=[1.0, 2.0],
725
- val_accuracy_history=[0.5],
726
- )
727
- data = json.loads(obs.model_dump_json())
728
- assert data["run_id"] == "test"
729
-
730
-
731
- class TestMLTrainingAction:
732
- def test_extends_action(self):
733
- from openenv.core.env_server.types import Action
734
- assert issubclass(MLTrainingAction, Action)
735
-
736
- def test_basic_action(self):
737
- action = MLTrainingAction(action_type="inspect_gradients")
738
- assert action.action_type == "inspect_gradients"
739
-
740
- def test_modify_config_action(self):
741
- action = MLTrainingAction(
742
- action_type="modify_config",
743
- target="learning_rate",
744
- value=0.001,
745
- )
746
- assert action.target == "learning_rate"
747
-
748
- def test_mark_diagnosed_action(self):
749
- action = MLTrainingAction(
750
- action_type="mark_diagnosed",
751
- diagnosis="lr_too_high",
752
- )
753
- assert action.diagnosis == "lr_too_high"
754
-
755
- def test_fix_code_action(self):
756
- action = MLTrainingAction(
757
- action_type="fix_code",
758
- line=13,
759
- replacement="loss = criterion(output, batch_y)",
760
- )
761
- assert action.line == 13
762
- ```
763
-
764
- **Step 0.3 — Validation Commands:**
765
-
766
- ```bash
767
- # In project root with venv activated
768
- source .venv/bin/activate
769
-
770
- # 1. Verify imports
771
- python -c "from ml_training_debugger.models import MLTrainingAction, MLTrainingObservation; print('models OK')"
772
- python -c "from ml_training_debugger.client import MLTrainingEnvClient; print('client OK')"
773
- python -c "from server.app import app; print('app OK')"
774
-
775
- # 2. Run tests
776
- pytest tests/test_models.py -v
777
-
778
- # 3. Start server
779
- uvicorn server.app:app --host 0.0.0.0 --port 7860 &
780
- sleep 3
781
- curl http://localhost:7860/health
782
- curl http://localhost:7860/tasks
783
- curl http://localhost:7860/docs
784
- kill %1
785
-
786
- # 4. Formatting
787
- black ml_training_debugger/ server/ tests/ --check
788
- ruff check ml_training_debugger/ server/ tests/
789
- isort ml_training_debugger/ server/ tests/ --check --profile black
790
- ```
791
-
792
- ### Acceptance Criteria — Phase 0
793
-
794
- - [ ] All Pydantic models instantiate without error and serialize to valid JSON
795
- - [ ] `MLTrainingObservation` extends `Observation` (has `done`, `reward`, `metadata`)
796
- - [ ] `MLTrainingAction` extends `Action` (has `metadata`)
797
- - [ ] `EpisodeState.compute_available_actions()` returns correct dynamic action lists
798
- - [ ] Server starts on port 7860 and responds to `/health` with `{"status": "ready", "tasks": 3}`
799
- - [ ] `/tasks` returns 3 tasks with action schema
800
- - [ ] `pytest tests/test_models.py` passes all tests
801
- - [ ] `client.py` imports without error
802
- - [ ] `black --check`, `ruff check`, `isort --check` all pass
803
-
804
- ---
805
-
806
- ## Phase 1: Core Data Models & Pydantic Types
807
-
808
- ### Goal
809
- Finalize all model fields to match the spec exactly. No business logic yet — just data shapes.
810
-
811
- ### Files to Edit
812
-
813
- **`ml_training_debugger/models.py`** — Already created in Phase 0. Verify:
814
- - All fields match spec Section 10 exactly
815
- - `GradientStats.is_exploding` threshold: `mean_norm > 10.0`
816
- - `GradientStats.is_vanishing` threshold: `mean_norm < 1e-6`
817
- - `TrainingConfig` field names match `modify_config` target options
818
- - `EpisodeState.compute_available_actions()` logic matches spec Section 10 dynamic rules
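The two thresholds above can be captured in one small helper so the flags are always derived from the norms instead of being set by hand. This is a sketch under assumptions: the function name and constants are illustrative, and taking `mean_norm` as the arithmetic mean of `norm_history` is an assumption the plan does not state explicitly.

```python
EXPLODING_THRESHOLD = 10.0  # is_exploding when mean_norm > 10.0
VANISHING_THRESHOLD = 1e-6  # is_vanishing when mean_norm < 1e-6


def summarize_gradient_norms(norm_history: list[float]) -> dict:
    """Derive the GradientStats summary fields from a per-step norm history."""
    if not norm_history:
        raise ValueError("norm_history must contain at least one value")
    mean_norm = sum(norm_history) / len(norm_history)
    return {
        "mean_norm": mean_norm,
        "max_norm": max(norm_history),
        "is_exploding": mean_norm > EXPLODING_THRESHOLD,
        "is_vanishing": mean_norm < VANISHING_THRESHOLD,
    }


print(summarize_gradient_norms([15.0]))
print(summarize_gradient_norms([1e-7]))
print(summarize_gradient_norms([0.5]))
```

Both comparisons are strict, so a mean norm of exactly `10.0` is not flagged as exploding.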
819
-
820
- ### Tests (write BEFORE implementation — TDD)
821
-
822
- All tests already written in `tests/test_models.py` from Phase 0. Extend with:
823
-
824
- ```python
825
- class TestGradientStats:
826
- def test_exploding_threshold(self):
827
- stats = GradientStats(
828
- layer_name="fc", norm_history=[15.0], mean_norm=15.0, max_norm=15.0,
829
- is_exploding=True, is_vanishing=False,
830
- )
831
- assert stats.is_exploding is True
832
-
833
- def test_vanishing_threshold(self):
834
- stats = GradientStats(
835
- layer_name="conv1", norm_history=[1e-7], mean_norm=1e-7, max_norm=1e-7,
836
- is_exploding=False, is_vanishing=True,
837
- )
838
- assert stats.is_vanishing is True
839
-
840
- def test_normal_gradients(self):
841
- stats = GradientStats(
842
- layer_name="conv1", norm_history=[0.5], mean_norm=0.5, max_norm=0.5,
843
- is_exploding=False, is_vanishing=False,
844
- )
845
- assert not stats.is_exploding
846
- assert not stats.is_vanishing
847
- ```
848
-
849
- ### Acceptance Criteria — Phase 1
850
-
851
- - [ ] Every field in every model matches the spec Section 10 types exactly
852
- - [ ] No `Dict[str, Any]` in any public model (typed Pydantic everywhere)
853
- - [ ] `import torch` appears in `models.py`
854
- - [ ] All model tests pass
855
-
856
- ---
857
-
858
- ## Phase 2: PyTorch-Native Fault Injection Engine + Simulation
859
-
860
- ### Goal
861
- Real PyTorch models with real gradients + parametric curve generators. This is the technical heart.
862
-
863
- ### Files to Create
864
-
865
- **Step 2.1 — `ml_training_debugger/scenarios.py`** (~120 lines):
866
-
867
- ```python
- """ScenarioParams and scenario sampling."""
-
- from __future__ import annotations
-
- import dataclasses
- from typing import Any, Optional
-
- import torch
-
- from ml_training_debugger.models import RootCauseDiagnosis
-
-
- @dataclasses.dataclass(frozen=True)
- class ScenarioParams:
-     """Internal scenario parameters, not exposed to the agent."""
-     task_id: str
-     root_cause: RootCauseDiagnosis
-     seed: int
-     learning_rate: float = 0.001
-     weight_decay: float = 0.0001
-     leakage_pct: float = 0.0
-     depth_multiplier: float = 1.0
-     divergence_epoch: int = 5
-     red_herring_intensity: float = 1.0
-     red_herring_spike_layer: str = "fc"
-     bug_type: Optional[str] = None
-     notes: Optional[str] = None
-     error_log: Optional[str] = None
-     gpu_memory_used_gb: float = 6.2
-     max_steps: int = 20
-
-
- def sample_scenario(task_id: str, seed: int) -> ScenarioParams:
-     """Sample a ScenarioParams for the given task."""
-     # A dedicated generator keeps sampling reproducible without touching global RNG state.
-     rng = torch.Generator()
-     rng.manual_seed(seed)
-
-     # Pick one option uniformly at random using the seeded torch generator.
-     def choose(options: list) -> Any:
-         idx = int(torch.randint(0, len(options), (1,), generator=rng).item())
-         return options[idx]
-
-     if task_id == "task_001":
-         lr = choose([0.05, 0.08, 0.10, 0.15, 0.30])
-         return ScenarioParams(
-             task_id=task_id,
-             root_cause=RootCauseDiagnosis.LR_TOO_HIGH,
-             seed=seed,
-             learning_rate=lr,
-             error_log=f"RuntimeError: Loss is NaN at epoch 12 (lr={lr})",
-             max_steps=20,
-         )
-
-     elif task_id == "task_003":
-         leakage = choose([0.12, 0.18, 0.22, 0.28])
-         return ScenarioParams(
-             task_id=task_id,
-             root_cause=RootCauseDiagnosis.DATA_LEAKAGE,
-             seed=seed,
-             leakage_pct=leakage,
-             notes="Model architecture upgraded from 2-layer to 4-layer CNN at epoch 2. Performance improvement may reflect increased model capacity.",
-             max_steps=25,
-         )
-
-     elif task_id == "task_005":
-         intensity = (
-             torch.empty(1).uniform_(0.8, 2.5, generator=rng).item()
-         )
-         spike_layer = choose(["fc", "conv1"])
-         return ScenarioParams(
-             task_id=task_id,
-             root_cause=RootCauseDiagnosis.BATCHNORM_EVAL_MODE,
-             seed=seed,
-             red_herring_intensity=intensity,
-             red_herring_spike_layer=spike_layer,
-             gpu_memory_used_gb=14.56,  # 91% of 16GB
-             error_log="Warning: GPU memory pressure detected, consider reducing batch size or enabling gradient checkpointing",
-             max_steps=30,
-         )
-
-     raise ValueError(f"Unknown task_id: {task_id}")
- ```
950
-
951
- **Step 2.2 — `ml_training_debugger/pytorch_engine.py`** (~250 lines):
952
-
953
- Key functions:
954
- - `SimpleCNN(torch.nn.Module)` — 3-layer CNN, ~50K params
955
- - `create_model_and_inject_fault(scenario: ScenarioParams) -> tuple[torch.nn.Module, dict]`
956
- - `extract_gradient_stats(model: torch.nn.Module) -> list[GradientStats]`
957
- - `extract_weight_stats(model: torch.nn.Module) -> list[ModelWeightStats]`
958
- - `extract_model_modes(model: torch.nn.Module) -> dict[str, str]`
959
-
960
- Implementation notes:
961
- - `torch.manual_seed(scenario.seed)` at the start of every call
962
- - For Task 1: set lr high, run 2 forward+backward passes → gradients explode
963
- - For Task 3: normal model, no gradient anomaly
964
- - For Task 5: call `model.eval()` before training → BatchNorm frozen
965
- - All gradient stats come from real `param.grad` tensors
966
- - All weight stats come from real `model.state_dict()`
967
-
968
- **Step 2.3 — `ml_training_debugger/simulation.py`** (~180 lines):
969
-
970
- Key functions:
971
- - `gen_loss_history(scenario: ScenarioParams) -> list[float]` — all torch.Tensor ops
972
- - `gen_val_accuracy_history(scenario: ScenarioParams) -> list[float]`
973
- - `gen_val_loss_history(scenario: ScenarioParams) -> list[float]`
974
-
975
- Per-task parametric curves from spec Section 6:
976
- - Task 1: `loss = torch.exp(torch.tensor(lr) * torch.arange(20))`
977
- - Task 3: `val_acc = torch.sigmoid(torch.linspace(-3, 3, 20)) * (1 - leakage_pct)`
978
- - Task 5: Normal loss + elevated variance, slow val_acc degradation
979
-
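The two closed-form curves listed for Task 1 and Task 3 can be sanity-checked without PyTorch. The following is a minimal, dependency-free sketch that swaps the `torch` ops for the standard `math` module; the function names and the `epochs` parameter are illustrative assumptions, not part of the plan, and the real `simulation.py` is meant to use `torch.Tensor` operations throughout.

```python
import math

EPOCHS = 20  # the plan's curves all have 20 points


def gen_loss_history_task1(lr: float, epochs: int = EPOCHS) -> list[float]:
    """Exponential divergence: loss[e] = exp(lr * e), as in the Task 1 formula."""
    return [math.exp(lr * epoch) for epoch in range(epochs)]


def gen_val_accuracy_task3(leakage_pct: float, epochs: int = EPOCHS) -> list[float]:
    """Sigmoid over evenly spaced points in [-3, 3], scaled by (1 - leakage_pct)."""
    step = 6.0 / (epochs - 1)
    xs = [-3.0 + i * step for i in range(epochs)]
    return [(1.0 / (1.0 + math.exp(-x))) * (1.0 - leakage_pct) for x in xs]


if __name__ == "__main__":
    print(gen_loss_history_task1(0.30)[:3])
    print(gen_val_accuracy_task3(0.22)[:3])
```

Printing the first few values of each list is a quick way to compare a curve against the shape the simulation tests describe before wiring it into the environment.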
980
- ### Tests to Create FIRST (TDD)
981
-
982
- **`tests/test_scenarios.py`**:
983
- - `sample_scenario("task_001", seed=42)` returns `root_cause == LR_TOO_HIGH`
984
- - `sample_scenario("task_003", seed=42)` returns `root_cause == DATA_LEAKAGE`
985
- - `sample_scenario("task_005", seed=42)` returns `root_cause == BATCHNORM_EVAL_MODE`
986
- - Different seeds produce different parameters (but same root cause per task)
987
- - Unknown task_id raises ValueError
988
-
989
- **`tests/test_pytorch_engine.py`**:
990
- - `SimpleCNN` is a real `torch.nn.Module` with ~50K params
991
- - Task 1 fault injection: `is_exploding=True` on all layers
992
- - Task 5 fault injection: `is_exploding=False` on all layers, `model.training==False`
993
- - `extract_gradient_stats` returns `list[GradientStats]` with real float norms
994
- - `extract_weight_stats` returns `list[ModelWeightStats]` from real state_dict
995
- - `extract_model_modes` returns dict mapping layer names to "train"/"eval"
996
- - **CRITICAL**: `import torch` in pytorch_engine.py, zero `import numpy`
997
-
998
- **`tests/test_simulation.py`**:
999
- - All outputs are `list[float]` of length 20
1000
- - Task 1 (exploding): loss diverges (last value >> first value)
1001
- - Task 3 (leakage): val_acc suspiciously high from early epochs
1002
- - Task 5 (batchnorm): slow val_acc degradation (~1-2% per epoch)
1003
- - All computation uses torch (no numpy)
1004
-
1005
- ### Acceptance Criteria — Phase 2
1006
-
1007
- - [ ] `SimpleCNN` is a real `torch.nn.Module` with ~50K parameters
1008
- - [ ] `create_model_and_inject_fault` for Task 1 produces exploding gradients (`is_exploding=True` all layers)
1009
- - [ ] `create_model_and_inject_fault` for Task 5 produces `model.training==False` on all layers
1010
- - [ ] `extract_gradient_stats` returns real floats from `torch.norm(param.grad)`
1011
- - [ ] `extract_weight_stats` returns real floats from `state_dict()`
1012
- - [ ] Parametric curves produce 20-element lists with correct shapes per task
1013
- - [ ] `import torch` in `pytorch_engine.py` and `simulation.py` — zero `import numpy`
1014
- - [ ] `torch.manual_seed(seed)` ensures reproducibility
1015
- - [ ] All Phase 2 tests pass
1016
-
1017
- ---
1018
-
1019
- ## Phase 3: MVP Tasks (1, 3, 5) + Reward Engine + Graders
1020
-
1021
- ### Goal
1022
- All reward logic and graders implemented. The environment can score episodes.
1023
-
1024
- ### Files to Create
1025
-
1026
- **Step 3.1 — `ml_training_debugger/reward_engine.py`** (~100 lines):
1027
-
1028
- ```python
- def compute_reward(
-     action: MLTrainingAction,
-     episode_state: EpisodeState,
-     scenario: ScenarioParams,
-     is_valid_action: bool,
-     is_correct_fix: bool | None = None,
-     convergence_confirmed: bool = False,
- ) -> float:
-     ...  # signature only; the body implements the seven components listed below
- ```
1038
-
1039
- All 7 components per spec Section 12:
1040
- 1. Step penalty: -0.01 (flat, unconditional)
1041
- 2. Investigation bonus: +0.05 (first-time per type)
1042
- 3. Context-gated penalty: -0.20 (ONLY when `gradients_inspected AND gradients_were_normal`)
1043
- 4. Invalid action: -0.05
1044
- 5. Wrong code fix: -0.10
1045
- 6. Correct diagnosis: +0.50 / Wrong diagnosis: -0.30
1046
- 7. Terminal convergence: +0.40 (gated on `fix_action_taken AND restart_after_fix`)
1047
-
1048
- Hard cap at [-1.0, 1.0].
1049
-
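The seven components above can be sketched as one function. This is a simplified illustration under stated assumptions, not the planned implementation: `SketchState`, `compute_reward_sketch`, the `rewarded_inspections` set and the `is_correct_diagnosis` argument are hypothetical names. It also assumes that the context-gated penalty applies to `add_callback` (as in the tests later in this phase) and that the step penalty stacks on top of the invalid-action penalty.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class SketchState:
    """Hypothetical stand-in for the plan's EpisodeState."""
    gradients_inspected: bool = False
    gradients_were_normal: bool = False
    fix_action_taken: bool = False
    restart_after_fix: bool = False
    rewarded_inspections: set[str] = field(default_factory=set)


def compute_reward_sketch(
    action_type: str,
    state: SketchState,
    *,
    is_valid_action: bool = True,
    is_correct_fix: bool | None = None,
    is_correct_diagnosis: bool | None = None,
    convergence_confirmed: bool = False,
) -> float:
    reward = -0.01  # 1. flat step penalty
    if not is_valid_action:
        # 4. invalid action (assumption: stacks with the step penalty)
        return max(-1.0, min(1.0, reward - 0.05))
    if action_type.startswith("inspect_") and action_type not in state.rewarded_inspections:
        state.rewarded_inspections.add(action_type)
        reward += 0.05  # 2. first-time investigation bonus
    if (
        action_type == "add_callback"
        and state.gradients_inspected
        and state.gradients_were_normal
    ):
        reward -= 0.20  # 3. context-gated penalty
    if action_type == "fix_code" and is_correct_fix is False:
        reward -= 0.10  # 5. wrong code fix
    if action_type == "mark_diagnosed" and is_correct_diagnosis is not None:
        reward += 0.50 if is_correct_diagnosis else -0.30  # 6. diagnosis
    if convergence_confirmed and state.fix_action_taken and state.restart_after_fix:
        reward += 0.40  # 7. terminal convergence, gated on fix + restart
    return max(-1.0, min(1.0, reward))  # hard cap
```

Keeping the gate for component 3 as a single boolean expression makes it easy to test both paths: the penalty must be absent before any inspection and present only after an inspection that found normal gradients.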
1050
- **Step 3.2 — `ml_training_debugger/graders.py`** (~150 lines):
1051
-
1052
- One function per task. Each returns float in [0.0, 1.0]:
1053
- - `grade_task_001(state: EpisodeState, scenario: ScenarioParams) -> float`
1054
- - `grade_task_003(state: EpisodeState, scenario: ScenarioParams) -> float`
1055
- - `grade_task_005(state: EpisodeState, scenario: ScenarioParams) -> float`
1056
-
1057
- Grader scoring per spec Section 11:
1058
- - Task 1: inspect_gradients(+0.05), correct LR fix(+0.20), restart+converge(+0.35), correct diagnosis(+0.40) = 1.0
1059
- - Task 3: inspect_data(+0.05), patch_data_loader(+0.30), restart+converge(+0.30), correct diagnosis(+0.35) = 1.0
1060
- - Task 5: inspect_gradients(+0.05), inspect_model_modes(+0.05), fix_model_mode(+0.25), restart+converge(+0.30), correct diagnosis(+0.40) = 1.05 → capped at 1.0. Penalty: add_callback after normal gradients = -0.20.
1061
-
1062
- **CRITICAL — Grader is NOT a sum of step rewards.** It evaluates EpisodeState holistically.
1063
-
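To illustrate the difference between a grader and a sum of step rewards, here is a minimal sketch of the Task 5 scoring using the weights listed above. The `Task5Outcome` fields and the function name are hypothetical; the real grader would read the equivalent flags from `EpisodeState`.

```python
from dataclasses import dataclass


@dataclass
class Task5Outcome:
    """Hypothetical summary of a finished Task 5 episode."""
    gradients_inspected: bool = False
    model_modes_inspected: bool = False
    model_mode_fixed: bool = False
    converged_after_restart: bool = False
    diagnosis_correct: bool = False
    chased_red_herring: bool = False  # add_callback after normal gradients


def grade_task_005_sketch(outcome: Task5Outcome) -> float:
    score = 0.0
    if outcome.gradients_inspected:
        score += 0.05
    if outcome.model_modes_inspected:
        score += 0.05
    if outcome.model_mode_fixed:
        score += 0.25
    if outcome.converged_after_restart:
        score += 0.30
    if outcome.diagnosis_correct:
        score += 0.40
    if outcome.chased_red_herring:
        score -= 0.20
    # The components sum to 1.05 on the optimal path, so clamp to [0.0, 1.0].
    return max(0.0, min(1.0, score))
```

Under these weights an episode that does everything right but also chases the red herring lands at 0.85, which is consistent with the 0.80-0.85 band expected by the grader tests.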
1064
- ### Tests to Create FIRST (TDD)
1065
-
1066
- **`tests/test_reward_engine.py`** — THE MOST CRITICAL TEST FILE:
1067
-
1068
- ```python
- import pytest
-
- # `scenario` is assumed to be a pytest fixture that returns a ScenarioParams.
-
-
- class TestContextGatedPenalty:
-     """The project's primary innovation; must be exact."""
-
-     def test_no_penalty_before_inspection(self, scenario):
-         """add_callback at step 1 (no prior inspection) -> NO penalty."""
-         state = EpisodeState()  # gradients_inspected=False
-         action = MLTrainingAction(action_type="add_callback")
-         reward = compute_reward(action, state, scenario, is_valid_action=True)
-         # Should be just step penalty: -0.01
-         assert reward == pytest.approx(-0.01)
-
-     def test_penalty_after_normal_gradients(self, scenario):
-         """inspect_gradients (normal) then add_callback -> -0.20 penalty."""
-         state = EpisodeState(gradients_inspected=True, gradients_were_normal=True)
-         action = MLTrainingAction(action_type="add_callback")
-         reward = compute_reward(action, state, scenario, is_valid_action=True)
-         # Step penalty + context-gated penalty: -0.01 + -0.20 = -0.21
-         assert reward == pytest.approx(-0.21)
-
-     def test_no_penalty_after_abnormal_gradients(self, scenario):
-         """inspect_gradients (exploding) then add_callback -> no context penalty."""
-         state = EpisodeState(gradients_inspected=True, gradients_were_normal=False)
-         action = MLTrainingAction(action_type="add_callback")
-         reward = compute_reward(action, state, scenario, is_valid_action=True)
-         assert reward == pytest.approx(-0.01)
- ```
1095
-
1096
- Also test:
1097
- - Step penalty is flat -0.01 (NOT multiplied by step_count)
1098
- - Investigation bonus +0.05 first-time only
1099
- - Investigation bonus NOT awarded on repeat
1100
- - Correct diagnosis: +0.50
1101
- - Wrong diagnosis: -0.30
1102
- - Terminal convergence: +0.40 when all gates met
1103
- - Invalid action: -0.05
1104
- - Wrong code fix: -0.10
1105
- - Reward capped at [-1.0, 1.0]
1106
-
1107
- **`tests/test_graders.py`**:
1108
- - Each grader returns float in [0.0, 1.0]
1109
- - Perfect Task 1 path scores 1.0
1110
- - Wrong diagnosis on Task 1 scores < 0.5
1111
- - Task 5: agent that chases red herring scores 0.80-0.85
1112
- - Task 5: optimal path scores 1.0
1113
- - Grader is deterministic (same state → same score)
1114
-
1115
- ### Acceptance Criteria — Phase 3
1116
-
1117
- - [ ] `compute_reward` implements all 7 components exactly per spec Section 12
1118
- - [ ] Context-gated penalty fires ONLY when `gradients_inspected=True AND gradients_were_normal=True`
1119
- - [ ] Context-gated penalty does NOT fire before `inspect_gradients` has been called
1120
- - [ ] Step penalty is flat -0.01 (never multiplied by step_count)
1121
- - [ ] All 3 graders return [0.0, 1.0] with meaningful variance
1122
- - [ ] Grader != reward function (separate modules, separate logic)
1123
- - [ ] All Phase 3 tests pass
1124
-
1125
- ---
1126
-
1127
- ## Phase 4: Environment Lifecycle, EpisodeState, and Action Handling
1128
-
1129
- ### Goal
1130
- Full `reset()` and `step()` implementations in `environment.py`. The environment is functionally complete.
1131
-
1132
- ### Files to Edit
1133
-
1134
- **`server/environment.py`** — Full implementation:
1135
-
1136
- `reset(task_id)`:
1137
- 1. Parse `task_id` from `kwargs` (framework passes it via kwargs or episode_id)
1138
- 2. Derive deterministic seed from task_id
1139
- 3. Call `sample_scenario(task_id, seed)`
1140
- 4. Call `torch.manual_seed(scenario.seed)`
1141
- 5. Call `create_model_and_inject_fault(scenario)` → get real model
1142
- 6. Generate parametric curves via `simulation.py`
1143
- 7. Create fresh `EpisodeState`
1144
- 8. Store `(scenario, model, state)` keyed by session/episode ID
1145
- 9. Return `MLTrainingObservation` with populated loss/acc histories, config, error_log, available_actions — but empty gradient_stats, null data_batch_stats, null model_mode_info, null code_snippet
1146
-
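Step 2 of `reset()` calls for a deterministic seed derived from `task_id`. One way to do that, sketched below, is to hash the ID with `hashlib`; Python's built-in `hash()` is salted per process for strings, so it would not give the same value across runs. The function name and the choice of a 32-bit seed are assumptions for illustration.

```python
import hashlib


def seed_from_task_id(task_id: str) -> int:
    """Map a task ID to a stable 32-bit seed (same value in every process)."""
    digest = hashlib.sha256(task_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big")


if __name__ == "__main__":
    for task in ("task_001", "task_003", "task_005"):
        print(task, seed_from_task_id(task))
```

The resulting integer can be passed to `sample_scenario(task_id, seed)` and `torch.manual_seed(...)` so that two baseline runs see identical scenarios.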
1147
- `step(action)`:
1148
- 1. Validate action (see spec Section 16 error handling matrix)
1149
- 2. Increment `step_count`
1150
- 3. Dispatch by `action.action_type`:
1151
- - **`inspect_gradients`**: Extract real gradient stats, set `gradients_inspected=True`, compute `gradients_were_normal` (all layers `is_exploding==False`)
1152
- - **`inspect_data_batch`**: Generate data batch stats, set `data_inspected=True`
1153
- - **`inspect_model_modes`**: Extract model modes, set `model_modes_inspected=True`
1154
- - **`inspect_model_weights`**: Extract real weight stats, set `model_weights_inspected=True`
1155
- - **`inspect_code`**: Generate code snippet (if task supports it), set `code_inspected=True`
1156
- - **`modify_config`**: Validate target/value, apply change, set `fix_action_taken=True`
1157
- - **`add_callback`**: Apply callback, set `fix_action_taken=True`
1158
- - **`replace_optimizer`**: Apply, set `fix_action_taken=True`
1159
- - **`patch_data_loader`**: Apply, set `fix_action_taken=True`
1160
- - **`fix_model_mode`**: Apply, set `fix_action_taken=True`
1161
- - **`fix_code`**: Validate fix via `validate_fix()`, set `fix_action_taken=True`
1162
- - **`restart_run`**: Requires `fix_action_taken`, set `restart_after_fix=True`, check convergence
1163
- - **`mark_diagnosed`**: Set `diagnosis_submitted=True`, `done=True`
1164
- - **`rollback_checkpoint`**: Requires `restart_after_fix`
1165
- 4. Call `compute_reward(action, state, scenario, ...)`
1166
- 5. Check step limit → set `done=True` if reached
1167
- 6. Update `available_actions` via `state.compute_available_actions()`
1168
- 7. Return `MLTrainingObservation` with all updated fields
1169
-
1170
- **Session isolation**:
1171
- - Store per-session state in `self._sessions: dict[str, SessionData]`
1172
- - Session ID comes from the framework (via `episode_id` or WebSocket session)
1173
- - Clean up on episode completion or disconnect
1174
-
1175
- ### Error Handling (spec Section 16 — ALL cases):
1176
-
1177
- | Error | Behavior | Reward |
1178
- |-------|----------|--------|
1179
- | Invalid action_type | Return obs unchanged + error note | -0.05 |
1180
- | Action not in available_actions | Return obs unchanged + error note | -0.05 |
1181
- | modify_config missing target/value | Return obs unchanged + error note | -0.05 |
1182
- | modify_config with unknown target | Return obs unchanged + error note | -0.05 |
1183
- | mark_diagnosed missing diagnosis | Return obs unchanged + error note | -0.05 |
1184
- | mark_diagnosed with invalid diagnosis | Return obs unchanged + error note | -0.05 |
1185
- | fix_code missing line/replacement | Return obs unchanged + error note | -0.05 |
1186
- | Action after done=True | Return final obs, no state change | 0.0 |
1187
- | Step limit reached | Set done=True, return obs | 0.0 |
1188
-
1189
- **CRITICAL**: `step()` must NEVER raise an unhandled exception.
1190
-
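The validation rows of the table above can be expressed as one pure function that returns an error note instead of raising. This is a sketch under assumptions: the action is represented as a plain dict rather than the typed `MLTrainingAction`, the error-note strings are invented for illustration, and the required field names (`target`, `value`, `diagnosis`, `line`, `replacement`) are taken from the table's wording.

```python
from __future__ import annotations

# The 14 action types listed in the step() dispatch above.
KNOWN_ACTIONS = frozenset({
    "inspect_gradients", "inspect_data_batch", "inspect_model_modes",
    "inspect_model_weights", "inspect_code", "modify_config", "add_callback",
    "replace_optimizer", "patch_data_loader", "fix_model_mode", "fix_code",
    "restart_run", "mark_diagnosed", "rollback_checkpoint",
})

# Fields that must be present for specific action types.
REQUIRED_FIELDS = {
    "modify_config": ("target", "value"),
    "mark_diagnosed": ("diagnosis",),
    "fix_code": ("line", "replacement"),
}

INVALID_ACTION_REWARD = -0.05


def validate_action(action: dict, available_actions: set[str]) -> str | None:
    """Return an error note for an invalid action, or None if it may be dispatched."""
    action_type = action.get("action_type")
    if not isinstance(action_type, str) or action_type not in KNOWN_ACTIONS:
        return "invalid_action_type"
    if action_type not in available_actions:
        return "action_not_available"
    missing = [name for name in REQUIRED_FIELDS.get(action_type, ())
               if action.get(name) is None]
    if missing:
        return "missing_fields:" + ",".join(missing)
    return None
```

A `step()` built on this would return the unchanged observation plus the note and `INVALID_ACTION_REWARD` whenever the result is not `None`, which keeps malformed input from ever reaching the dispatch code.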
1191
- ### Tests to Create FIRST (TDD)
1192
-
1193
- **`tests/test_episode_lifecycle.py`**:
1194
- - Full reset→inspect→fix→restart→diagnose flow for Task 1
1195
- - Full flow for Task 3
1196
- - Full flow for Task 5
1197
- - `available_actions` updates correctly at each step
1198
- - `done=True` after `mark_diagnosed`
1199
- - Step limit triggers `done=True`
1200
- - Action after done returns final obs with no state change
1201
- - Invalid action returns -0.05 penalty
1202
- - `restart_run` not available before `fix_action_taken`
1203
- - `fix_code` not available before `code_inspected`
1204
- - Session isolation: two episodes don't interfere
1205
-
1206
- ### Acceptance Criteria — Phase 4
1207
-
1208
- - [ ] `reset(task_id)` for tasks 001/003/005 returns valid `MLTrainingObservation` with correct initial state
1209
- - [ ] `step()` dispatches all 14 action types correctly
1210
- - [ ] Task 1: `inspect_gradients` → `is_exploding=True` all layers (real torch.autograd)
1211
- - [ ] Task 5: `inspect_gradients` → `is_exploding=False` all layers, `gradients_were_normal=True`
1212
- - [ ] Task 3: `inspect_data_batch` → `class_overlap_score > 0.5`
1213
- - [ ] Task 5: `inspect_model_modes` → all layers in "eval" mode
1214
- - [ ] All error conditions from spec Section 16 handled (never raises)
1215
- - [ ] Progressive information reveal works (gradient_stats empty until inspected)
1216
- - [ ] All Phase 4 tests pass
1217
-
1218
- ---
1219
-
1220
- ## Phase 5: Server (FastAPI + openenv-core) + All Required Endpoints
1221
-
1222
- ### Goal
1223
- Wire the real environment into the server. All hackathon-required endpoints return real data.
1224
-
1225
- ### Files to Edit
1226
-
1227
- **`server/app.py`** — Full implementation:
1228
-
1229
- ```python
- # Store reference to last completed episode for /grader
- _last_completed: dict[str, dict] = {}  # session_id -> {score, task_id, steps}
- _baseline_running: bool = False
-
- @app.get("/health")
- def health_check():
-     return {"status": "ready", "tasks": 3}
-
- @app.get("/tasks")
- def get_tasks():
-     schema = MLTrainingAction.model_json_schema()
-     return [
-         {"id": "task_001", "difficulty": "easy", "max_steps": 20, "action_schema": schema},
-         {"id": "task_003", "difficulty": "medium", "max_steps": 25, "action_schema": schema},
-         {"id": "task_005", "difficulty": "hard", "max_steps": 30, "action_schema": schema},
-     ]
-
- @app.post("/grader")
- def post_grader(session_id: str | None = None):
-     # Return score for most recently completed episode
-     # Edge cases per spec Section 14
-     ...  # stub: to be implemented
-
- @app.post("/baseline")
- async def post_baseline():
-     # Run baseline_heuristic logic internally
-     # Return {"scores": {"task_001": float, ...}}
-     # Return 409 if already running
-     ...  # stub: to be implemented
- ```
1258
-
1259
- **Grader endpoint edge cases** (spec Section 14):
1260
- - No episode completed → `{"score": null, "error": "no_completed_episode"}`
1261
- - Episode in progress → `{"score": null, "error": "episode_in_progress"}`
1262
- - Episode completed → `{"score": 0.85, "task_id": "task_003", "steps": 6}`
1263
- - Always HTTP 200 with JSON body
1264
-
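The three grader edge cases map naturally onto a small pure function that the `/grader` route could delegate to. The sketch below is an assumption about how that might look: the function name, its arguments, and the choice to report an in-progress episode before checking for a completed one are all illustrative.

```python
from __future__ import annotations


def grader_response(last_completed: dict | None, episode_in_progress: bool) -> dict:
    """Build the JSON body for POST /grader; the route always answers HTTP 200."""
    if episode_in_progress:
        return {"score": None, "error": "episode_in_progress"}
    if last_completed is None:
        return {"score": None, "error": "no_completed_episode"}
    return {
        "score": last_completed["score"],
        "task_id": last_completed["task_id"],
        "steps": last_completed["steps"],
    }
```

`None` serializes to JSON `null`, so the two error bodies match the shapes listed above. Keeping this logic outside the route handler also lets the endpoint tests cover every edge case without starting a server.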
1265
- ### Tests to Create FIRST (TDD)
1266
-
1267
- **`tests/test_endpoints.py`**:
1268
- - `GET /health` returns `{"status": "ready", "tasks": 3}` with 200
1269
- - `GET /tasks` returns 3 tasks with action schema
1270
- - `POST /grader` returns `{"score": null, "error": "no_completed_episode"}` initially
1271
- - `POST /baseline` returns scores for all tasks
1272
- - `POST /baseline` while running returns 409
1273
- - Integration: reset→step→grader returns valid score
1274
-
1275
- ### Acceptance Criteria — Phase 5
1276
-
1277
- - [ ] `GET /health` returns `{"status": "ready", "tasks": 3}` (200)
1278
- - [ ] `GET /tasks` returns 3 tasks with IDs, difficulties, action schema
1279
- - [ ] `POST /grader` handles all edge cases per spec Section 14
1280
- - [ ] `POST /baseline` runs baseline and returns scores
1281
- - [ ] Framework auto-provides: `/reset`, `/step`, `/state`, `/ws`, `/schema`, `/docs`
1282
- - [ ] All Phase 5 tests pass
1283
-
1284
- ---
1285
-
1286
- ## Phase 6: Rule-Based Baseline + Reproducibility Guarantees
1287
-
1288
- ### Goal
1289
- Deterministic baseline that produces bit-exact identical scores on two runs.
1290
-
1291
- ### Files to Create
1292
-
1293
- **`baseline_heuristic.py`** (~150 lines):
1294
-
1295
- Decision tree from spec Section 17:
1296
- ```
1297
- 1. reset(task_id)
1298
- 2. inspect_gradients
1299
- 3. IF any layer is_exploding → modify_config(lr=0.001) → restart → diagnose lr_too_high
1300
- 4. IF any layer is_vanishing → modify_config(lr=0.01) → restart → diagnose vanishing_gradients
1301
- 5. inspect_data_batch
1302
- 6. IF class_overlap_score > 0.5 → patch_data_loader → restart → diagnose data_leakage
1303
- 7. IF val_loss diverging → modify_config(weight_decay=0.01) → restart → diagnose overfitting
1304
- 8. inspect_model_modes → IF any eval → fix_model_mode → restart → diagnose batchnorm_eval_mode
1305
- 9. inspect_code → attempt fix → restart → diagnose code_bug
1306
- 10. FALLBACK: diagnose overfitting
1307
- ```
1308
-
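The branch order of the decision tree can be captured as a pure function, which makes the "no randomness" requirement easy to verify. This is a simplified sketch: the real baseline inspects lazily over WebSocket, whereas the hypothetical `baseline_decision` below takes all inspection results as arguments. The `code_fix_found` flag is an assumption about when step 9 applies, and the fix actions are returned as descriptive strings rather than real action objects.

```python
from __future__ import annotations


def baseline_decision(
    any_exploding: bool,
    any_vanishing: bool,
    class_overlap_score: float,
    val_loss_diverging: bool,
    any_layer_in_eval: bool,
    code_fix_found: bool,
) -> tuple[str | None, str]:
    """Return (fix_action, diagnosis), checking conditions in the tree's order."""
    if any_exploding:
        return "modify_config(lr=0.001)", "lr_too_high"
    if any_vanishing:
        return "modify_config(lr=0.01)", "vanishing_gradients"
    if class_overlap_score > 0.5:
        return "patch_data_loader", "data_leakage"
    if val_loss_diverging:
        return "modify_config(weight_decay=0.01)", "overfitting"
    if any_layer_in_eval:
        return "fix_model_mode", "batchnorm_eval_mode"
    if code_fix_found:
        return "fix_code", "code_bug"
    return None, "overfitting"  # fallback: diagnose without a fix
```

Because the function has no hidden state, calling it twice with the same inputs always yields the same pair, which is the property the reproducibility test relies on.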
1309
- Uses `MLTrainingEnvClient` or `GenericEnvClient` to connect via WebSocket.
1310
-
1311
- **Reproducibility requirements:**
1312
- - `torch.manual_seed(seed)` at every `reset()` with deterministic seed per task
1313
- - No floating-point non-determinism in parametric curves
1314
- - Heuristic is pure logic with no randomness
1315
- - Two runs must produce identical JSON output
1316
-
1317
- ### Tests to Create FIRST (TDD)
1318
-
1319
- **`tests/test_baseline_reproducibility.py`**:
1320
- - Run baseline twice → `diff run1.json run2.json` is empty
1321
- - All scores in [0.0, 1.0]
1322
- - Expected approximate scores: task_001 ~0.85, task_003 ~0.70, task_005 ~0.45
1323
-
1324
- ### Acceptance Criteria — Phase 6
1325
-
1326
- - [ ] `baseline_heuristic.py` runs all 3 MVP tasks without error
1327
- - [ ] Two consecutive runs produce bit-exact identical JSON output
1328
- - [ ] No API key required
1329
- - [ ] All scores in [0.0, 1.0] with meaningful variance
1330
- - [ ] Decision tree follows spec Section 17 exactly
1331
-
1332
- ---
1333
-
1334
- ## Phase 7: Docker, HF Spaces, Logging, Error Handling & Edge Cases
1335
-
1336
- ### Goal
1337
- Production-ready container that deploys cleanly.
1338
-
1339
- ### Files to Edit
1340
-
1341
- **`Dockerfile`** — Finalize:
1342
- - Base: `python:3.12-slim`
1343
- - PyTorch CPU-only: `pip install torch --index-url https://download.pytorch.org/whl/cpu`
1344
- - Target: <500MB
1345
- - `EXPOSE 7860`
1346
- - `CMD ["uvicorn", "server.app:app", "--host", "0.0.0.0", "--port", "7860"]`
1347
-
1348
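- **Note on Dockerfile COPY**: `COPY` is a Dockerfile instruction, not a shell command, so shell constructs such as `2>/dev/null || true` have no effect on it and a missing source file fails the build. Instead, ensure every copied file exists or use a multi-stage approach.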
1349
-
1350
- **Logging** — Add to `server/app.py` and `server/environment.py`:
1351
- - JSON structured logging to stdout
1352
- - Log every `reset()`, `step()`, episode completion, errors
1353
-
1354
- **WebSocket edge cases** (spec Section 16):
1355
- - Client disconnects mid-episode → retain state 60s
1356
- - Malformed JSON → return error, keep connection
1357
- - step() before reset() → return "no_active_episode" error
1358
- - reset() during active episode → terminate current, start new
1359
-
1360
- ### Acceptance Criteria — Phase 7
1361
-
1362
- - [ ] `docker build -t pytorch-debugger .` succeeds
1363
- - [ ] Docker image <500MB
1364
- - [ ] `docker run -p 7860:7860 pytorch-debugger` starts and serves in <60s
1365
- - [ ] `curl http://localhost:7860/health` returns `{"status": "ready", "tasks": 3}`
1366
- - [ ] All WebSocket edge cases handled per spec Section 16
1367
- - [ ] Structured JSON logging on all significant events
1368
-
1369
- ---
1370
-
1371
- ## Phase 8: Full Testing Suite + Pre-Submission Smoke Tests
1372
-
1373
- ### Goal
1374
- >80% test coverage, all edge cases covered.
1375
-
1376
- ### Files to Create/Extend
1377
-
1378
- All test files listed above, plus:
1379
- - Fill coverage gaps identified by `pytest --cov`
1380
- - Add edge case tests for every error in spec Section 16
1381
- - Add test for `step()` after `done=True`
1382
- - Add test for step limit termination
1383
-
1384
- ### Commands
1385
-
1386
- ```bash
1387
- pytest tests/ -v --cov=ml_training_debugger --cov=server --cov-report=term-missing
1388
- ```
1389
-
1390
- ### Acceptance Criteria — Phase 8
1391
-
1392
- - [ ] `pytest --cov` shows >80% coverage on all modules
1393
- - [ ] Every error condition from spec Section 16 has a test
1394
- - [ ] Context-gated penalty tests pass (both paths)
1395
- - [ ] Dynamic available_actions tests pass
1396
- - [ ] All 3 graders tested with multiple scenarios
1397
- - [ ] Zero test failures
1398
-
1399
- ---
1400
-
1401
- ## Phase 9: Final Polish & Submission Readiness
1402
-
1403
- ### Goal
1404
- README complete, all endpoints verified, `openenv validate` passes, deploy to HF Spaces.
1405
-
1406
- ### Files to Create
1407
-
1408
- **`README.md`** (~200 lines):
1409
- - Environment description and motivation
1410
- - Action/observation space definitions
1411
- - Task descriptions with difficulty
1412
- - Setup instructions
1413
- - Baseline scores table
1414
-
1415
- **`deploy.sh`**:
1416
- ```bash
1417
- #!/bin/bash
1418
- set -euo pipefail
1419
-
1420
- echo "=== Building Docker image ==="
1421
- docker build -t pytorch-debugger .
1422
-
1423
- echo "=== Starting container ==="
1424
- docker run -d -p 7860:7860 --name smoke-test pytorch-debugger
1425
- sleep 10
1426
-
1427
- echo "=== Health check ==="
1428
- curl -f http://localhost:7860/health || { echo "FAIL: health"; exit 1; }
1429
-
1430
- echo "=== Tasks endpoint ==="
1431
- curl -f http://localhost:7860/tasks | python3 -m json.tool || { echo "FAIL: tasks"; exit 1; }
1432
-
1433
- echo "=== Baseline reproducibility ==="
1434
- python3 baseline_heuristic.py > run1.json 2>/dev/null
1435
- python3 baseline_heuristic.py > run2.json 2>/dev/null
1436
- diff run1.json run2.json && echo "PASS: reproducible" || { echo "FAIL: non-reproducible"; exit 1; }
1437
-
1438
- echo "=== Baseline via endpoint ==="
1439
- curl -f -X POST http://localhost:7860/baseline | python3 -m json.tool || { echo "FAIL: baseline endpoint"; exit 1; }
1440
-
1441
- echo "=== Grader via endpoint ==="
1442
- curl -f -X POST http://localhost:7860/grader | python3 -m json.tool || { echo "FAIL: grader endpoint"; exit 1; }
1443
-
1444
- echo "=== Tests ==="
1445
- pytest tests/ -v --cov=ml_training_debugger --cov-report=term-missing
1446
-
1447
- echo "=== Cleanup ==="
1448
- docker stop smoke-test && docker rm smoke-test
1449
- rm -f run1.json run2.json
1450
-
1451
- echo "=== ALL CHECKS PASSED ==="
1452
- ```
1453
-
1454
- ### Acceptance Criteria — Phase 9
1455
-
1456
- - [ ] `openenv validate` passes
1457
- - [ ] `deploy.sh` runs end-to-end with zero failures
1458
- - [ ] README is complete per hackathon requirements
1459
- - [ ] Docker image <500MB, starts <60s
1460
- - [ ] Baseline bit-exact reproducible
1461
- - [ ] 3+ tasks with graders returning [0.0, 1.0] with meaningful variance
1462
- - [ ] HF Space deployed, tagged `openenv`, responds to `reset()`
1463
- - [ ] All typed Pydantic models — no `Dict[str, Any]`
1464
- - [ ] `import torch` in every core module — zero numpy in core
1465
- - [ ] Context-gated penalty fires correctly and does not fire prematurely
1466
- - [ ] Test suite passes with >80% coverage
1467
-
1468
- ---
1469
-
1470
- ## Technical Risk Mitigations
1471
-
1472
- | Risk | Impact | Mitigation |
1473
- |------|--------|------------|
1474
- | **WebSocket + HTTP composition** | ~~High~~ RESOLVED | `create_app()` returns standard FastAPI. Custom routes add cleanly. Verified in Phase 0. |
1475
- | **Docker image size** | Medium | `python:3.12-slim` + torch CPU-only (~150MB). Target <500MB. Test early in Phase 7. |
1476
- | **Task 6 fix validation fragility** | Medium | Multi-strategy pipeline: normalize → tokenize → semantic patterns → AST fallback. Test 5+ whitespace variations. (Post-MVP Phase 2 stretch) |
1477
- | **Red-herring penalty gating** | HIGH | `gradients_were_normal` set inside `inspect_gradients` handler when ALL layers have `is_exploding=False`. Threshold: `mean_norm > 10.0`. Test BOTH paths explicitly. |
1478
- | **Session isolation** | Medium | `dict[str, SessionData]` keyed by session ID. Framework provides session management. |
1479
- | **Baseline reproducibility** | HIGH | `torch.manual_seed(seed)` at every `reset()`. Seed derived deterministically from task_id. Heuristic is pure logic. Test with `diff run1.json run2.json`. |
1480
- | **Dockerfile build time** | Low | No real training during build. Validation reports pre-computed locally. |
1481
- | **openenv.yaml format** | Medium | Template uses `spec_version: 1`, `type: space`, `runtime: fastapi`, `app: server.app:app`. Extended fields (tasks, reward, etc.) are additive. Test with `openenv validate` early. |
1482
- | **Port mismatch** | Low | Spec says 7860 (HF Spaces default). openenv template says 8000. Use 7860 everywhere. |
1483
-
1484
- ---
1485
-
1486
- ## Exact openenv.yaml (Final)
1487
-
1488
- ```yaml
- spec_version: 1
- name: pytorch-training-debugger
- type: space
- runtime: fastapi
- app: server.app:app
- port: 7860
-
- version: "1.0.0"
- description: |
-   PyTorch-native fault injection engine for training failure debugging.
-   An AI agent investigates, diagnoses, fixes, and verifies broken
-   training runs using real torch.nn.Module models, torch.autograd
-   gradients, state_dict() weight inspection, and PyTorch code-level
-   debugging. 3 tasks across 3 difficulty tiers with context-gated
-   reward shaping.
- framework: openenv
- tags: [ml-debugging, pytorch, reinforcement-learning, root-cause-analysis, fault-injection, openenv]
-
- observation_space:
-   type: MLTrainingObservation
-   description: "Training run snapshot with progressive reveal: gradients, weights, data stats, model modes revealed on inspection"
-
- action_space:
-   type: MLTrainingAction
-   description: "Investigation, fix, and diagnosis actions with dynamic availability"
-
- tasks:
-   - id: task_001
-     difficulty: easy
-     max_steps: 20
-   - id: task_003
-     difficulty: medium
-     max_steps: 25
-   - id: task_005
-     difficulty: hard
-     max_steps: 30
-
- reward:
-   range: [-1.0, 1.0]
-   shaped: true
-   step_penalty: -0.01
-   investigation_bonus: 0.05
-   max_investigation_bonus: 0.25
-   correct_diagnosis: 0.50
-   terminal_convergence: 0.40
-
- endpoints:
-   websocket: "/ws"
-   tasks: "GET /tasks"
-   grader: "POST /grader"
-   baseline: "POST /baseline"
-   health: "GET /health"
- ```
1542
-
1543
- ---
1544
-
1545
- ## Exact Dockerfile (Final)
1546
-
1547
- ```dockerfile
1548
- FROM python:3.12-slim
1549
-
1550
- WORKDIR /app
1551
-
1552
- # Install PyTorch CPU-only first (largest layer, cached)
1553
- RUN pip install --no-cache-dir torch --index-url https://download.pytorch.org/whl/cpu
1554
-
1555
- # Install remaining dependencies
1556
- COPY requirements.txt .
1557
- RUN pip install --no-cache-dir -r requirements.txt
1558
-
1559
- # Copy application code
1560
- COPY ml_training_debugger/ ml_training_debugger/
1561
- COPY server/ server/
1562
- COPY openenv.yaml .
1563
- COPY baseline_heuristic.py .
1564
- COPY README.md .
1565
-
1566
- EXPOSE 7860
1567
-
1568
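- # python:3.12-slim does not ship curl, so probe the endpoint with the standard library
- HEALTHCHECK --interval=30s --timeout=3s --start-period=10s --retries=3 \
-     CMD python -c "import urllib.request; urllib.request.urlopen('http://localhost:7860/health', timeout=2)" || exit 1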
1570
-
1571
- CMD ["uvicorn", "server.app:app", "--host", "0.0.0.0", "--port", "7860"]
1572
- ```
1573
-
1574
- ---
1575
-
1576
- ## Pre-Submission Smoke Test Sequence
1577
-
1578
- ```bash
1579
- # 1. Clean build
1580
- docker build --no-cache -t pytorch-debugger .
1581
-
1582
- # 2. Start container
1583
- docker run -d -p 7860:7860 --name smoke-test pytorch-debugger
1584
- sleep 10
1585
-
1586
- # 3. Health check
1587
- curl -f http://localhost:7860/health
1588
-
1589
- # 4. Tasks endpoint
1590
- curl -f http://localhost:7860/tasks | python3 -m json.tool
1591
-
1592
- # 5. Baseline reproducibility
1593
- python3 baseline_heuristic.py > run1.json 2>/dev/null
1594
- python3 baseline_heuristic.py > run2.json 2>/dev/null
1595
- diff run1.json run2.json && echo "PASS: reproducible" || echo "FAIL"
1596
-
1597
- # 6. Baseline via endpoint
1598
- curl -f -X POST http://localhost:7860/baseline | python3 -m json.tool
1599
-
1600
- # 7. Grader via endpoint
1601
- curl -f -X POST http://localhost:7860/grader | python3 -m json.tool
1602
-
1603
- # 8. OpenEnv validation
1604
- openenv validate
1605
-
1606
- # 9. Test suite
1607
- pytest tests/ -v --cov=ml_training_debugger --cov-report=term-missing
1608
-
1609
- # 10. Cleanup
1610
- docker stop smoke-test && docker rm smoke-test
1611
- rm -f run1.json run2.json
1612
-
1613
- echo "=== All checks passed ==="
1614
- ```
1615
-
1616
- ---
1617
-
1618
- ## Post-MVP Stretch (Phase 2 from ROADMAP)
1619
-
1620
- **Only after MVP is 100% deployed and passing all auto-validation:**
1621
-
1622
- 1. **Task 6** (code debugging) — highest impact differentiator
1623
- - Create `ml_training_debugger/code_templates.py`
1624
- - 4 bug variants: eval_mode, detach_loss, zero_grad_missing, inplace_relu
1625
- - Multi-strategy fix validation: normalize → tokenize → semantic → AST
1626
- - Diagnosis is ALWAYS `code_bug` regardless of variant
1627
-
1628
- 2. **Tasks 2 & 4** — fill out to 6 tasks
1629
- - Task 2: vanishing gradients (easy, mirror of Task 1)
1630
- - Task 4: overfitting (medium, train-val divergence)
1631
-
1632
- 3. **Dashboard** — `server/dashboard.html`, Plotly.js via CDN
1633
-
1634
- 4. **Validation Suite** — `validation/*.py`, R² > 0.85
1635
-
1636
- 5. **LLM Baseline** — `baseline_inference.py`, GPT-4o
1637
-
1638
- Update `openenv.yaml`, `/tasks`, `/health` task count as tasks are added.
1639
-
1640
- ---
1641
-
1642
- ## SESSION_ID
1643
-
1644
- - CODEX_SESSION: N/A (codeagent-wrapper not available)
1645
- - GEMINI_SESSION: N/A (codeagent-wrapper not available)
1646
-
1647
- Plan generated by Claude Opus 4.6 via deep analysis of all 4 project markdown files + openenv-core framework API inspection.
.claude/plan/winning-implementation.md DELETED
@@ -1,261 +0,0 @@
1
- # Implementation Plan: All 13 Improvements for #1 Finish
2
-
3
- ## Task Type
4
- - [x] Backend (Python/PyTorch/FastAPI)
5
-
6
- ## Current State (Verified 2026-03-28)
7
- - 187 tests pass, 97% coverage
8
- - 6 tasks, all endpoints working, WS task selection works
9
- - Docker 1.48GB, baseline reproducible, openenv validates
10
- - Missing: real training curves, LLM scores, 2nd architecture, Task 7, Docker optimization
11
-
12
- ---
13
-
14
- ## Phase 0: Repo Cleanup (5 min)
15
-
16
- **Files**: None to create
17
- **What**: Verify clean state, ensure no stale files
18
- **Acceptance**: `pytest` passes, `openenv validate` passes
19
-
20
- ---
21
-
22
- ## Phase 1: Add SimpleMLP Architecture (Tier 1, Item 3)
23
-
24
- **Files to create**: None (add to `pytorch_engine.py`)
25
- **Files to edit**: `ml_training_debugger/pytorch_engine.py`, `ml_training_debugger/scenarios.py`
26
-
27
- **What**:
28
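- - Add `SimpleMLP(nn.Module)` class: 3 linear layers (2 hidden), BatchNorm, ReLU; the ~20K-param target needs smaller layer sizes than the defaults in the pseudo-code below, which give ~412K params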
29
- - Add `model_type` field to `ScenarioParams` (Literal["cnn", "mlp"])
30
- - Use torch.Generator to randomly pick CNN or MLP at `sample_scenario()` time
31
- - Update `create_model_and_inject_fault()` to use selected model type
32
- - Update `extract_gradient_stats()` layer names for MLP
33
-
34
- **Pseudo-code**:
35
- ```python
36
- class SimpleMLP(nn.Module):
37
- def __init__(self, input_dim=3072, hidden_dim=128, num_classes=10):
38
- super().__init__()
39
- self.flatten = nn.Flatten()
40
- self.fc1 = nn.Linear(input_dim, hidden_dim)
41
- self.bn1 = nn.BatchNorm1d(hidden_dim)
42
- self.fc2 = nn.Linear(hidden_dim, hidden_dim)
43
- self.bn2 = nn.BatchNorm1d(hidden_dim)
44
- self.fc3 = nn.Linear(hidden_dim, num_classes)
45
- self.relu = nn.ReLU()
46
-
47
- def forward(self, x):
48
- x = self.flatten(x)
49
- x = self.relu(self.bn1(self.fc1(x)))
50
- x = self.relu(self.bn2(self.fc2(x)))
51
- return self.fc3(x)
52
- ```
53
-
54
- **Tests**: New tests in `test_pytorch_engine.py` for SimpleMLP
55
- **Acceptance**: Both CNN and MLP instantiate, fault injection works on both, gradient extraction works
56
-
57
- ---
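The parameter budget for the `SimpleMLP` pseudo-code in Phase 1 can be checked with plain arithmetic. The sketch below is illustrative only: the helper names are hypothetical, and it assumes the fc1/bn1/fc2/bn2/fc3 layout and default sizes shown in the pseudo-code.

```python
def linear_params(n_in: int, n_out: int) -> int:
    """Weights plus biases of a fully connected layer."""
    return n_in * n_out + n_out


def batchnorm_params(n_features: int) -> int:
    """Learnable scale and shift; running statistics are buffers, not parameters."""
    return 2 * n_features


def simple_mlp_params(input_dim: int = 3072, hidden_dim: int = 128, num_classes: int = 10) -> int:
    """Trainable parameter count for the fc1/bn1/fc2/bn2/fc3 layout."""
    return (
        linear_params(input_dim, hidden_dim)
        + batchnorm_params(hidden_dim)
        + linear_params(hidden_dim, hidden_dim)
        + batchnorm_params(hidden_dim)
        + linear_params(hidden_dim, num_classes)
    )


print(simple_mlp_params())  # 411658
```

The first layer (3072 x 128) dominates the total, so reaching a budget near 20K would require a much smaller `hidden_dim` or a reduced input size.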
58
-
59
- ## Phase 2: Replace Parametric Curves with Real Mini-Training (Tier 1, Item 2)
60
-
61
- **Files to edit**: `ml_training_debugger/simulation.py`, `ml_training_debugger/pytorch_engine.py`
62
-
63
- **What**:
64
- - Add `run_real_training(model, scenario, epochs=20) -> dict` to `pytorch_engine.py`
65
- - Returns `{"loss_history": [...], "val_acc_history": [...], "val_loss_history": [...]}`
66
- - Use real forward+backward on random CIFAR-10 style data
67
- - Cache results in module-level `_TRAINING_CACHE: dict[tuple[str, int], dict]` keyed by (task_id, seed)
68
- - Update `simulation.py` to call real training instead of parametric formulas
69
- - Keep `torch.manual_seed(seed)` for reproducibility
70
- - Fall back to the parametric curves on a cache miss if real training is too slow (>3s)
71
-
72
- **Key constraints**:
73
- - 20 epochs on SimpleCNN with batch_size=16 takes ~0.5-1s on CPU
74
- - Cache means second reset() with same task/seed is instant
75
- - Must still be deterministic (torch.manual_seed)
76
-
77
- **Tests**: Verify loss histories come from real training, are reproducible across runs
78
- **Acceptance**: `baseline_heuristic.py` produces identical scores on two runs with real curves
79
-
80
- ---
81
-
82
- ## Phase 3: Add Task 7 — LR Scheduler Bug (Tier 1, Item 4)
83
-
84
- **Files to edit**: `models.py`, `scenarios.py`, `simulation.py`, `pytorch_engine.py`, `graders.py`, `reward_engine.py`, `server/app.py`, `openenv.yaml`, `baseline_heuristic.py`, `README.md`
85
-
86
- **What**:
87
- - Add `SCHEDULER_MISCONFIGURED = "scheduler_misconfigured"` to `RootCauseDiagnosis`
88
- - Add `task_007` to `sample_scenario()` — medium-hard difficulty, max_steps=25
89
- - Scenario: training starts OK for first N epochs, then LR scheduler kicks in with wrong gamma/step_size, causing performance degradation
90
- - Agent must inspect config + loss curve inflection point
91
- - New grader: `grade_task_007()` — rewards inspecting config, identifying scheduler issue, fixing it
92
- - Add `fix_scheduler` to action space (or reuse `modify_config` with target `lr_scheduler_gamma`)
93
- - Update `/health` to return `"tasks": 7`
94
- - Update `/tasks` to include task_007
95
- - Update heuristic baseline to handle task_007
96
- - Add to openenv.yaml
97
-
98
- **Pseudo-scenario**:
99
- ```python
100
- if task_id == "task_007":
101
- gamma = _choose([0.01, 0.001, 0.0001], rng) # way too aggressive
102
- step_size = _choose([2, 3, 5], rng)
103
- return ScenarioParams(
104
- task_id=task_id,
105
- root_cause=RootCauseDiagnosis.SCHEDULER_MISCONFIGURED,
106
- seed=effective_seed,
107
- scheduler_gamma=gamma,
108
- scheduler_step_size=step_size,
109
- max_steps=25,
110
- notes="LR scheduler was recently added to improve convergence.",
111
- )
112
- ```
113
-
114
- **Tests**: Full lifecycle test for task_007, grader test
115
- **Acceptance**: task_007 works end-to-end, heuristic baseline handles it
116
-
117
- ---
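To see why the `gamma` values in the Task 7 pseudo-scenario are "way too aggressive", it helps to evaluate a step schedule in closed form. A step scheduler such as `torch.optim.lr_scheduler.StepLR` multiplies the learning rate by `gamma` every `step_size` epochs. The sketch below uses plain Python; the starting rate `base_lr = 0.01` is an assumption and does not come from the plan.

```python
def step_lr(base_lr: float, gamma: float, step_size: int, epoch: int) -> float:
    """Learning rate at `epoch` under a step schedule: base_lr * gamma ** (epoch // step_size)."""
    return base_lr * gamma ** (epoch // step_size)


base_lr = 0.01  # assumed starting LR, not taken from the plan
for gamma, step_size in [(0.01, 2), (0.001, 3), (0.0001, 5)]:
    # Rate before the first decay, after one decay, and after two decays.
    lrs = [step_lr(base_lr, gamma, step_size, e) for e in (0, step_size, 2 * step_size)]
    print(f"gamma={gamma}, step_size={step_size}: {lrs}")
```

Even the mildest sampled setting (`gamma=0.01`) cuts the rate by four orders of magnitude after two decays, which is the loss-curve inflection the agent is expected to notice.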
118
-
119
- ## Phase 4: Add Difficulty Scaling (Tier 2, Item 6)
120
-
121
- **Files to edit**: `scenarios.py`, `server/environment.py`
122
-
123
- **What**:
124
- - Add `difficulty_level: int = 3` to `ScenarioParams` (1-5)
125
- - Accept `difficulty_level` in `reset()` kwargs
126
- - Scale noise, red herring intensity, and ambiguity based on level:
127
- - Level 1: obvious signals, no noise, no red herrings
128
- - Level 3: default (current behavior)
129
- - Level 5: max noise, multiple red herrings, ambiguous signals
130
- - Affects: noise amplitude in curves, red herring intensity, number of misleading notes
131
-
132
- **Acceptance**: `reset(task_id="task_005", difficulty_level=1)` produces clearer signals than level 5
133
-
134
- ---
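One way to express the Phase 4 level descriptions is a small mapping from `difficulty_level` to concrete knobs. This is only a sketch: `DifficultyKnobs`, `knobs_for_level`, and every numeric value are hypothetical, chosen so that level 1 has no noise or red herrings and level 3 keeps a noise multiplier of 1.0 (the current behaviour).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DifficultyKnobs:
    noise_scale: float     # multiplier on the default curve noise
    red_herrings: int      # number of misleading signals injected
    misleading_notes: int  # number of misleading notes in the scenario


def knobs_for_level(level: int) -> DifficultyKnobs:
    """Map difficulty 1-5 to knobs; level 3 keeps the default noise (scale 1.0)."""
    if not 1 <= level <= 5:
        raise ValueError(f"difficulty_level must be in 1..5, got {level}")
    noise_scale = (level - 1) / 2.0   # 0.0, 0.5, 1.0, 1.5, 2.0
    red_herrings = max(0, level - 2)  # 0, 0, 1, 2, 3
    misleading_notes = level // 2     # 0, 1, 1, 2, 2
    return DifficultyKnobs(noise_scale, red_herrings, misleading_notes)
```

Keeping the mapping in one pure function makes the acceptance check (level 1 clearer than level 5) easy to assert in a unit test.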
135
-
136
- ## Phase 5: Add Curriculum, Leaderboard, Replay Endpoints (Tier 2 + Tier 3)
137
-
138
- **Files to edit**: `server/app.py`
139
-
140
- **What**:
141
- - `GET /curriculum` — returns ordered task list for training:
142
- ```json
143
- {"curriculum": [
144
- {"task_id": "task_001", "difficulty_level": 1},
145
- {"task_id": "task_001", "difficulty_level": 3},
146
- ...
147
- {"task_id": "task_005", "difficulty_level": 5}
148
- ]}
149
- ```
150
- - `GET /leaderboard` — returns sorted episode scores from `_baseline_results`
151
- - `GET /replay/{episode_id}` — returns full action/observation trace for an episode
152
- - For replay: store action/observation history in `SessionData`
153
-
154
- **Acceptance**: All 3 endpoints return valid JSON
155
-
156
- ---
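The `/curriculum` payload in Phase 5 could be generated rather than hand-written. The sketch below assumes an ordering of task by task with levels 1, 3, 5, which matches the visible entries in the example JSON; the elided middle of that example is unknown, so the task list and level set used here are assumptions, and `build_curriculum` is a hypothetical helper.

```python
import json


def build_curriculum(task_ids: list[str], levels: tuple[int, ...] = (1, 3, 5)) -> dict:
    """Order entries task by task, easiest level first."""
    return {
        "curriculum": [
            {"task_id": task_id, "difficulty_level": level}
            for task_id in task_ids
            for level in levels
        ]
    }


payload = build_curriculum([f"task_{i:03d}" for i in range(1, 6)])
print(json.dumps(payload["curriculum"][:2]))
```

A FastAPI handler would only need to return `build_curriculum(...)`, since plain dicts serialise to JSON directly.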
157
-
158
- ## Phase 6: Add Confusion Matrix to Data Batch Stats (Tier 3, Item 10)
159
-
160
- **Files to edit**: `models.py`, `simulation.py`
161
-
162
- **What**:
163
- - Add `confusion_matrix: Optional[list[list[float]]]` to `DataBatchStats`
164
- - Generate 10x10 confusion matrix in `gen_data_batch_stats()`
165
- - For data leakage: high diagonal, some off-diagonal leakage
166
- - For overfitting: perfect diagonal for train, scattered for val
167
- - For normal: moderate diagonal with realistic confusion
168
-
169
- **Acceptance**: `inspect_data_batch` returns confusion_matrix field
170
-
171
- ---
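A seeded 10x10 confusion matrix with a controllable diagonal, as Phase 6 describes, can be produced along these lines. This is a sketch under assumptions: `make_confusion_matrix` and its `diag` parameter are hypothetical, and the standard-library `random` module stands in for whatever seeded generator the project actually uses.

```python
import random


def make_confusion_matrix(diag: float, seed: int, n_classes: int = 10) -> list[list[float]]:
    """Row-normalised matrix: `diag` on the diagonal, the rest spread randomly off-diagonal."""
    if not 0.0 <= diag <= 1.0:
        raise ValueError("diag must be in [0, 1]")
    rng = random.Random(seed)
    matrix = []
    for i in range(n_classes):
        # Random positive weights for the off-diagonal cells of row i.
        weights = [0.0 if j == i else rng.random() + 1e-9 for j in range(n_classes)]
        total = sum(weights)
        row = [(1.0 - diag) * w / total for w in weights]
        row[i] = diag
        matrix.append(row)
    return matrix
```

A high `diag` (for example 0.95) would resemble the train-side matrix for overfitting, while a moderate value gives the "realistic confusion" case; the exact values per fault type are not specified in the plan.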
172
-
173
- ## Phase 7: Exploit Resistance Proof (Tier 2, Item 8)
174
-
175
- **Files to create**: `tests/test_exploit_resistance.py`
176
- **Files to edit**: `README.md`
177
-
178
- **What**:
179
- - Test that runs all 7 tasks with seeds 1-100
180
- - Records score variance per task
181
- - Asserts no single strategy works across all seeds (std > 0 for hard tasks)
182
- - Add results table to README
183
-
184
- **Acceptance**: Test passes, README shows variance table
185
-
186
- ---
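The per-task variance check in Phase 7 reduces to computing a mean and standard deviation over seeds 1-100. The sketch below shows that calculation only; `score_spread` is a hypothetical helper and `fake_episode` is a stand-in that returns a seeded pseudo-random score, not a real environment run.

```python
import random
import statistics
from typing import Callable


def score_spread(
    run_episode: Callable[[str, int], float], task_id: str, seeds: range
) -> tuple[float, float]:
    """Mean and population standard deviation of scores for one task across seeds."""
    scores = [run_episode(task_id, seed) for seed in seeds]
    return statistics.fmean(scores), statistics.pstdev(scores)


def fake_episode(task_id: str, seed: int) -> float:
    """Stand-in for a real episode: a seeded pseudo-random score in [0, 1)."""
    return random.Random(f"{task_id}:{seed}").random()


mean, std = score_spread(fake_episode, "task_005", range(1, 101))
assert std > 0, "a fixed strategy should not score identically on every seed"
```

In the real test, `run_episode` would reset the environment with the given task and seed, replay one fixed strategy, and return the grader score.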
187
-
188
- ## Phase 8: PAPER.md (Tier 3, Item 13)
189
-
190
- **Files to create**: `PAPER.md`
191
-
192
- **What**: 1-page research summary:
193
- - Title: "Context-Gated Reward Shaping for Evidence-Based ML Debugging"
194
- - Abstract, motivation, method (context-gated penalty), environment design, results, conclusion
195
- - Include baseline comparison table
196
- - ~500-800 words
197
-
198
- **Acceptance**: PAPER.md exists and reads well
199
-
200
- ---
201
-
202
- ## Phase 9: LLM Baseline (Tier 1, Item 1)
203
-
204
- **Files to edit**: `baseline_inference.py`, `README.md`
205
-
206
- **What**:
207
- - This requires OPENAI_API_KEY from the user
208
- - Run `python baseline_inference.py` with real API key
209
- - Record scores for all 7 tasks
210
- - Update README with comparison table
211
- - If no API key available: document expected behavior and add placeholder scores
212
-
213
- **Acceptance**: README has heuristic vs LLM comparison table
214
-
215
- ---
216
-
217
- ## Phase 10: Final Polish + Docker + README + Smoke Test
218
-
219
- **Files to edit**: `Dockerfile`, `README.md`, `deploy-hf.sh`
220
-
221
- **What**:
222
- - Docker: Already at 1.48GB — document the trade-off (libtorch_cpu.so is 426MB minimum)
223
- - Create `deploy-hf.sh` script
224
- - Update README with all new features (Task 7, difficulty scaling, curriculum, leaderboard, replay, confusion matrix)
225
- - Final smoke test: all tests pass, all endpoints work, baseline reproducible
226
-
227
- **Acceptance**: Everything green, ready to submit
228
-
229
- ---
230
-
231
- ## Key Files to Create/Edit
232
-
233
- | File | Operation | Phase | Description |
234
- |------|-----------|-------|-------------|
235
- | `ml_training_debugger/pytorch_engine.py` | Modify | 1,2 | Add SimpleMLP, real training, caching |
236
- | `ml_training_debugger/models.py` | Modify | 3,6 | Add scheduler_misconfigured enum, confusion_matrix |
237
- | `ml_training_debugger/scenarios.py` | Modify | 1,3,4 | Add model_type, task_007, difficulty_level |
238
- | `ml_training_debugger/simulation.py` | Modify | 2,6 | Real training curves, confusion matrix |
239
- | `ml_training_debugger/graders.py` | Modify | 3 | Add grade_task_007 |
240
- | `server/app.py` | Modify | 3,5 | Task 7, curriculum, leaderboard, replay endpoints |
241
- | `server/environment.py` | Modify | 4,5 | Difficulty scaling, replay storage |
242
- | `openenv.yaml` | Modify | 3 | Add task_007 |
243
- | `baseline_heuristic.py` | Modify | 3 | Handle task_007 |
244
- | `README.md` | Modify | 7,9,10 | Exploit resistance, LLM scores, new features |
245
- | `PAPER.md` | Create | 8 | Research summary |
246
- | `deploy-hf.sh` | Create | 10 | HF deployment script |
247
- | `tests/test_exploit_resistance.py` | Create | 7 | 100-seed variance test |
248
-
249
- ## Risks and Mitigation
250
-
251
- | Risk | Mitigation |
252
- |------|------------|
253
- | Real training slows reset() beyond 3s | Cache per (task_id, seed); MLP is faster than CNN |
254
- | Task 7 breaks existing tests | Run full suite after each phase |
255
- | LLM baseline needs API key | Document expected behavior; user provides key |
256
- | Docker can't go below 1.4GB | Document trade-off; libtorch_cpu.so is irreducible |
257
- | SimpleMLP gradient patterns differ | Adapt extract_gradient_stats for MLP layers |
258
-
259
- ## SESSION_ID
260
- - CODEX_SESSION: N/A
261
- - GEMINI_SESSION: N/A
.gitignore CHANGED
@@ -16,3 +16,7 @@ validation/reports/*.png
16
  .claude/
17
  CLAUDE.md
18
  .hf-space/
19
+ .python-version
20
+ uv.lock
21
+ deploy-hf.sh
22
+ deploy.sh
.python-version DELETED
@@ -1 +0,0 @@
1
- 3.12
deploy-hf.sh DELETED
@@ -1,72 +0,0 @@
1
- #!/bin/bash
2
- # Deploy to Hugging Face Spaces
3
- # Usage: ./deploy-hf.sh <your-hf-username>/<space-name>
4
- # Example: ./deploy-hf.sh omkarrr88/pytorch-training-debugger
5
-
6
- set -euo pipefail
7
-
8
- SPACE="${1:-}"
9
- if [ -z "$SPACE" ]; then
10
- echo "Usage: ./deploy-hf.sh <username>/<space-name>"
11
- exit 1
12
- fi
13
-
14
- echo "=== Deploying to HF Space: $SPACE ==="
15
-
16
- # Ensure huggingface-cli is installed
17
- if ! command -v huggingface-cli &> /dev/null; then
18
- pip install huggingface_hub
19
- fi
20
-
21
- # Clone or create the space
22
- if [ ! -d ".hf-space" ]; then
23
- echo "Cloning space..."
24
- git clone "https://huggingface.co/spaces/$SPACE" .hf-space || {
25
- echo "Creating new space..."
26
- huggingface-cli repo create "$SPACE" --type space --space-sdk docker
27
- git clone "https://huggingface.co/spaces/$SPACE" .hf-space
28
- }
29
- fi
30
-
31
- # Copy files to space
32
- echo "Copying files..."
33
- rsync -av --exclude='.venv' --exclude='__pycache__' --exclude='.git' \
34
- --exclude='.hf-space' --exclude='tests' --exclude='validation' \
35
- --exclude='.claude' --exclude='*.pyc' --exclude='run*.json' \
36
- --exclude='.env' --exclude='.coverage' --exclude='uv.lock' \
37
- . .hf-space/
38
-
39
- # Copy validation report (pre-computed)
40
- mkdir -p .hf-space/validation/reports
41
- cp -r validation/reports/fidelity_report.json .hf-space/validation/reports/ 2>/dev/null || true
42
-
43
- cd .hf-space
44
-
45
- # Add openenv tag to README if not present
46
- if ! grep -q "tags:" README.md 2>/dev/null; then
47
- cat > README.md.header <<'EOF'
48
- ---
49
- title: PyTorch Training Run Debugger
50
- emoji: 🔧
51
- colorFrom: red
52
- colorTo: blue
53
- sdk: docker
54
- pinned: false
55
- license: mit
56
- tags:
57
- - openenv
58
- ---
59
-
60
- EOF
61
- cat README.md >> README.md.header
62
- mv README.md.header README.md
63
- fi
64
-
65
- # Commit and push
66
- git add -A
67
- git commit -m "Deploy: PyTorch Training Run Debugger" || echo "No changes to commit"
68
- git push
69
-
70
- echo "=== Deployed! ==="
71
- echo "Space URL: https://huggingface.co/spaces/$SPACE"
72
- echo "Health: https://${SPACE/\//-}.hf.space/health"
deploy.sh DELETED
@@ -1,52 +0,0 @@
1
- #!/bin/bash
2
- set -euo pipefail
3
-
4
- echo "=== PyTorch Training Run Debugger — Pre-Submission Smoke Test ==="
5
- echo ""
6
-
7
- # 1. Run tests
8
- echo "=== 1. Running test suite ==="
9
- source .venv/bin/activate
10
- pytest tests/ -v --cov=ml_training_debugger --cov-report=term-missing
11
- echo ""
12
-
13
- # 2. Code formatting check
14
- echo "=== 2. Code formatting ==="
15
- black --check ml_training_debugger/ server/ tests/ || { echo "Run: black ml_training_debugger/ server/ tests/"; exit 1; }
16
- ruff check ml_training_debugger/ server/ tests/ || { echo "Run: ruff check --fix"; exit 1; }
17
- isort --check ml_training_debugger/ server/ tests/ --profile black || { echo "Run: isort --profile black"; exit 1; }
18
- echo "PASS: formatting OK"
19
- echo ""
20
-
21
- # 3. Baseline reproducibility
22
- echo "=== 3. Baseline reproducibility ==="
23
- python baseline_heuristic.py > /tmp/run1.json 2>/dev/null
24
- python baseline_heuristic.py > /tmp/run2.json 2>/dev/null
25
- diff /tmp/run1.json /tmp/run2.json && echo "PASS: bit-exact reproducible" || { echo "FAIL: non-reproducible"; exit 1; }
26
- echo ""
27
-
28
- # 4. Docker build
29
- echo "=== 4. Docker build ==="
30
- docker build -t pytorch-debugger .
31
- IMAGE_SIZE=$(docker images pytorch-debugger --format "{{.Size}}")
32
- echo "Image size: $IMAGE_SIZE"
33
- echo ""
34
-
35
- # 5. Docker run + health check
36
- echo "=== 5. Docker run + endpoint checks ==="
37
- docker run -d -p 7860:7860 --name smoke-test pytorch-debugger
38
- sleep 10
39
-
40
- curl -f http://localhost:7860/health || { echo "FAIL: health"; docker stop smoke-test; docker rm smoke-test; exit 1; }
41
- echo ""
42
- curl -f http://localhost:7860/tasks || { echo "FAIL: tasks"; docker stop smoke-test; docker rm smoke-test; exit 1; }
43
- echo ""
44
- curl -f -X POST http://localhost:7860/grader || { echo "FAIL: grader"; docker stop smoke-test; docker rm smoke-test; exit 1; }
45
- echo ""
46
-
47
- # 6. Cleanup
48
- docker stop smoke-test && docker rm smoke-test
49
- rm -f /tmp/run1.json /tmp/run2.json
50
-
51
- echo ""
52
- echo "=== ALL CHECKS PASSED ==="
uv.lock DELETED
The diff for this file is too large to render. See raw diff