omkarrr88 committed on
Commit · f4c428c
1 Parent(s): 45eee48
updates docs
- .claude/memory/MEMORY.md +4 -4
- .claude/memory/feedback_docker_stripping.md +37 -14
- .claude/memory/project_overview.md +41 -23
- .claude/memory/project_status.md +50 -28
- CLAUDE.md +3 -3
- EXPLANATION.md +2 -2
- PAPER.md +14 -13
- README.md +23 -9
.claude/memory/MEMORY.md
CHANGED
|
@@ -1,9 +1,9 @@
|
|
| 1 |
# Memory Index
|
| 2 |
|
| 3 |
-
- [Project Overview](project_overview.md) - Architecture,
|
| 4 |
-
- [Project Status](project_status.md) -
|
| 5 |
- [Hackathon Rules](project_hackathon_rules.md) - Scoring rubric, DQ criteria, submission requirements
|
| 6 |
- [Spec Documents](reference_spec_docs.md) - Which files are source of truth, key spec sections
|
| 7 |
-
- [Docker Stripping](feedback_docker_stripping.md) -
|
| 8 |
-
- [WS Message Format](feedback_ws_format.md) -
|
| 9 |
- [User Context](user_context.md) - Omkar building hackathon submission, values thorough testing
|
|
|
|
| 1 |
# Memory Index
|
| 2 |
|
| 3 |
+
- [Project Overview](project_overview.md) - Architecture, 7 tasks, dual model (CNN+MLP), real training, endpoints, WS format
|
| 4 |
+
- [Project Status](project_status.md) - 251 tests/95% cov/885MB Docker/LLM scores, as of 2026-03-30
|
| 5 |
- [Hackathon Rules](project_hackathon_rules.md) - Scoring rubric, DQ criteria, submission requirements
|
| 6 |
- [Spec Documents](reference_spec_docs.md) - Which files are source of truth, key spec sections
|
| 7 |
+
- [Docker Stripping](feedback_docker_stripping.md) - torch 2.5.1 + multi-stage + strip = 885MB, what breaks/safe
|
| 8 |
+
- [WS Message Format](feedback_ws_format.md) - WS task selection via data field, correct step format
|
| 9 |
- [User Context](user_context.md) - Omkar building hackathon submission, values thorough testing
|
.claude/memory/feedback_docker_stripping.md
CHANGED
|
@@ -1,23 +1,46 @@
|
|
| 1 |
---
|
| 2 |
-
name: Docker torch stripping - what breaks
|
| 3 |
-
description: Lessons learned from
|
| 4 |
type: feedback
|
| 5 |
---
|
| 6 |
|
| 7 |
-
|
| 8 |
|
| 9 |
-
|
| 10 |
-
-
|
| 11 |
-
|
| 12 |
-
|
| 13 |
-
|
| 14 |
-
- `torch/_functorch` - Required by core init
|
| 15 |
-
- `torch/sparse`, `torch/nested`, `torch/masked` - Required by `torch.nn`
|
| 16 |
|
| 17 |
-
|
| 18 |
|
| 19 |
-
|
| 20 |
|
| 21 |
-
|
| 22 |
|
| 23 |
-
|
| 1 |
---
|
| 2 |
+
name: Docker torch stripping - what breaks and final optimized approach
|
| 3 |
+
description: Lessons learned from Docker optimization. Final image 885MB using torch 2.5.1 + multi-stage + strip. Which dirs break, which are safe.
|
| 4 |
type: feedback
|
| 5 |
---
|
| 6 |
|
| 7 |
+
## Final Optimized Dockerfile Approach (885MB)
|
| 8 |
|
| 9 |
+
1. **Use torch 2.5.1+cpu** (not latest 2.11.0) - smaller wheel, libtorch_cpu.so strips to 329MB
|
| 10 |
+
2. **Multi-stage build**: builder installs + strips, runtime copies only site-packages
|
| 11 |
+
3. **`strip --strip-unneeded`** on ALL .so files in one RUN layer
|
| 12 |
+
4. **`--no-compile`** flag on pip install (skip .pyc generation)
|
| 13 |
+
5. **Remove bloated transitive deps** in same layer: gradio (155MB), pandas (42MB), PIL, pip, setuptools
|
|
|
|
|
|
|
| 14 |
|
| 15 |
+
## Do NOT Remove (breaks `import torch` or runtime)
|
| 16 |
|
| 17 |
+
- `torch/testing` - required by `torch.autograd.gradcheck`
|
| 18 |
+
- `torch/distributed` - required by `torch._jit_internal`
|
| 19 |
+
- `torch/cuda` - required at `_initExtension`
|
| 20 |
+
- `torch/_inductor`, `torch/_dynamo` - required by `torch.optim` (optimizer init)
|
| 21 |
+
- `torch/_functorch` - required by core init
|
| 22 |
+
- `torch/fx` - required by `_functorch`
|
| 23 |
+
- `torch/sparse`, `torch/nested`, `torch/masked` - required by `torch.nn`
|
| 24 |
+
- `torch/onnx`, `torch/ao`, `torch/_export`, `torch/jit` - required at import time
|
| 25 |
+
- `torchgen` - required by `torch.utils._python_dispatch`
|
| 26 |
+
- `sympy` + `mpmath` - required by `torch._dynamo.utils`
|
| 27 |
+
- `numpy` + `numpy.libs` - required by `torch.storage`
|
| 28 |
+
- `beartype` - required by `fastmcp` → `openenv-core`
|
| 29 |
+
- `pygments` - required by `rich` → `fastmcp`
|
| 30 |
+
- `torch/bin/torch_shm_manager` - required at `_initExtension`
|
| 31 |
|
| 32 |
+
## Safe to Remove (verified working after removal)
|
| 33 |
|
| 34 |
+
- `torch/test`, `torch/include`, `torch/share` - dev/test files
|
| 35 |
+
- `torch/bin/*` EXCEPT `torch_shm_manager` - test binaries (47MB)
|
| 36 |
+
- `torch/utils/benchmark`, `torch/utils/bottleneck`, `torch/utils/tensorboard`
|
| 37 |
+
- `torch/lib/*.a`, `torch/lib/libtorchbind_test.so`, `torch/lib/libjitbackend_test.so`, etc.
|
| 38 |
+
- `caffe2/` - not used
|
| 39 |
+
- `gradio`, `gradio_client`, `hf_gradio` - pulled by openenv-core, not needed at runtime
|
| 40 |
+
- `pandas`, `PIL/Pillow`, `networkx`, `scipy`, `matplotlib`
|
| 41 |
+
- `pip`, `setuptools`, `docutils`, `cryptography`, `pytz`
|
| 42 |
+
- `ffmpy`, `pydub`, `groovy`, `tomlkit`, `semantic_version`, `safehttpx`, `brotli`
|
| 43 |
+
- All `.pyi` files, `__pycache__`, `.pyc`, stale `.dist-info`
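The two lists above amount to an allow/deny policy for pruning `site-packages`. A minimal sketch of that policy in Python follows, assuming the pruning is scripted rather than written as shell `rm -rf` lines in the Dockerfile; the function name `prune_site_packages` and the two constants are hypothetical, and only a subset of the listed paths is included.

```python
import shutil
import tempfile
from pathlib import Path

# Subset of the "Safe to Remove" list (paths relative to site-packages).
SAFE_TO_REMOVE = [
    "torch/test",
    "torch/include",
    "torch/share",
    "torch/utils/benchmark",
    "caffe2",
    "gradio",
    "pandas",
]

# Subset of the "Do NOT Remove" list, used as a guard (hypothetical helper).
MUST_KEEP = ["torch/testing", "torch/distributed", "torch/cuda", "torchgen"]


def prune_site_packages(site_packages: Path, targets=SAFE_TO_REMOVE) -> list:
    """Delete the listed directories if present and return what was removed."""
    removed = []
    for rel in targets:
        # Refuse to touch a protected path or anything inside one.
        if any(rel == keep or rel.startswith(keep + "/") for keep in MUST_KEEP):
            raise ValueError(f"refusing to remove protected path: {rel}")
        path = site_packages / rel
        if path.is_dir():
            shutil.rmtree(path)
            removed.append(rel)
    # Bytecode caches are also listed as safe to remove.
    for cache in list(site_packages.rglob("__pycache__")):
        shutil.rmtree(cache, ignore_errors=True)
    return removed


if __name__ == "__main__":
    # Demo on a throwaway directory, not a real site-packages.
    demo_root = Path(tempfile.mkdtemp())
    (demo_root / "torch/test").mkdir(parents=True)
    (demo_root / "torch/testing").mkdir(parents=True)
    print(prune_site_packages(demo_root))
```

Running `python -c "import torch"` after pruning remains the real check; the guard only prevents removing paths already known to break the import.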
|
| 44 |
+
|
| 45 |
+
## Older Torch NOT Smaller
|
| 46 |
+
torch 2.2.0+cpu had a 179MB wheel but installed to 932MB (numpy version mismatch, no strip benefit). torch 2.5.1+cpu, at 885MB, is the sweet spot.
|
.claude/memory/project_overview.md
CHANGED
|
@@ -1,6 +1,6 @@
|
|
| 1 |
---
|
| 2 |
name: ML Debugger Project Overview
|
| 3 |
-
description: PyTorch Training Run Debugger - OpenEnv RL environment for Meta PyTorch Hackathon. Core architecture,
|
| 4 |
type: project
|
| 5 |
---
|
| 6 |
|
|
@@ -8,58 +8,76 @@ type: project
|
|
| 8 |
|
| 9 |
A complete OpenEnv RL environment where an AI agent debugs broken PyTorch training runs. Built for the **Meta PyTorch OpenEnv Hackathon x Scaler School of Technology** (Round 1 deadline: April 8, 2026).
|
| 10 |
|
| 11 |
-
**Runtime**: Python 3.12 · PyTorch CPU-only · openenv-core v0.2.2
|
| 12 |
|
| 13 |
## Architecture
|
| 14 |
|
| 15 |
```
|
| 16 |
server/app.py - FastAPI app via create_app() from openenv-core
|
| 17 |
server/environment.py - MLTrainingEnvironment(Environment): reset(), step(), state
|
| 18 |
-
server/_baseline_results.py - Shared grader result storage
|
|
|
|
| 19 |
|
| 20 |
ml_training_debugger/
|
| 21 |
models.py - All Pydantic models (Action, Observation, EpisodeState, etc.)
|
| 22 |
-
scenarios.py - ScenarioParams
|
| 23 |
-
pytorch_engine.py - SimpleCNN
|
| 24 |
-
simulation.py -
|
| 25 |
reward_engine.py - 7-component reward function (per-step RL signal)
|
| 26 |
graders.py - Per-task grader functions (0.0-1.0 holistic score at episode end)
|
| 27 |
code_templates.py - Task 6 code bug templates + multi-strategy fix validation
|
| 28 |
client.py - MLTrainingEnvClient extending GenericEnvClient
|
| 29 |
```
|
| 30 |
|
| 31 |
-
## The
|
| 32 |
|
| 33 |
| Task | Root Cause | Difficulty | Heuristic Score |
|
| 34 |
|------|-----------|------------|-----------------|
|
| 35 |
-
| task_001 | lr_too_high
|
| 36 |
| task_002 | vanishing_gradients | Easy | 1.00 |
|
| 37 |
-
| task_003 | data_leakage
|
| 38 |
-
| task_004 | overfitting
|
| 39 |
-
| task_005 | batchnorm_eval_mode
|
| 40 |
| task_006 | code_bug (4 variants) | Hard | 1.00 |
|
| 41 |
|
| 42 |
## Key Endpoints
|
| 43 |
|
| 44 |
-
- `GET /health` → `{"status": "ready", "tasks":
|
| 45 |
- `GET /tasks` → Task list with action schema
|
| 46 |
- `POST /grader` → Score after completed episode
|
| 47 |
- `POST /baseline` → Run heuristic baseline, return all scores
|
| 48 |
- `GET /dashboard` → Live diagnostic dashboard (Plotly.js)
|
| 49 |
-
- `GET /validation-report` → Pre-computed fidelity report
|
| 50 |
-
- `
|
| 51 |
-
-
|
|
|
|
|
|
|
|
|
|
| 52 |
|
| 53 |
-
## WebSocket Message Format
|
| 54 |
|
| 55 |
-
- Reset: `{"type": "reset"
|
| 56 |
-
-
|
| 57 |
-
-
|
|
|
|
| 58 |
|
| 59 |
## Key Design Decisions
|
| 60 |
|
| 61 |
-
- **Grader ≠ Reward**:
|
| 62 |
-
- **Task IDs are opaque**:
|
| 63 |
-
- **Task 6 diagnosis is ALWAYS `code_bug`** regardless of
|
| 64 |
-
- **Context-gated penalty**: -0.20 fires ONLY when `gradients_inspected=True AND gradients_were_normal=True`
|
| 65 |
- **Step penalty is flat -0.01** (never multiplied by step_count)
|
|
|
|
|
|
|
|
|
| 1 |
---
|
| 2 |
name: ML Debugger Project Overview
|
| 3 |
+
description: PyTorch Training Run Debugger - OpenEnv RL environment for Meta PyTorch Hackathon. Core architecture, 7 tasks, dual model, real training, key modules.
|
| 4 |
type: project
|
| 5 |
---
|
| 6 |
|
|
|
|
| 8 |
|
| 9 |
A complete OpenEnv RL environment where an AI agent debugs broken PyTorch training runs. Built for the **Meta PyTorch OpenEnv Hackathon x Scaler School of Technology** (Round 1 deadline: April 8, 2026).
|
| 10 |
|
| 11 |
+
**Runtime**: Python 3.12 · PyTorch 2.5.1 CPU-only · openenv-core v0.2.2
|
| 12 |
|
| 13 |
## Architecture
|
| 14 |
|
| 15 |
```
|
| 16 |
server/app.py - FastAPI app via create_app() from openenv-core
|
| 17 |
server/environment.py - MLTrainingEnvironment(Environment): reset(), step(), state
|
| 18 |
+
server/_baseline_results.py - Shared grader result storage
|
| 19 |
+
server/dashboard.html - Live 4-panel Plotly.js dashboard
|
| 20 |
|
| 21 |
ml_training_debugger/
|
| 22 |
models.py - All Pydantic models (Action, Observation, EpisodeState, etc.)
|
| 23 |
+
scenarios.py - ScenarioParams + sample_scenario(): 7 tasks, model_type, difficulty_level
|
| 24 |
+
pytorch_engine.py - SimpleCNN + SimpleMLP, fault injection, gradient/weight extraction, run_real_training() with caching
|
| 25 |
+
simulation.py - Calls run_real_training() for curves, parametric fallback
|
| 26 |
reward_engine.py - 7-component reward function (per-step RL signal)
|
| 27 |
graders.py - Per-task grader functions (0.0-1.0 holistic score at episode end)
|
| 28 |
code_templates.py - Task 6 code bug templates + multi-strategy fix validation
|
| 29 |
client.py - MLTrainingEnvClient extending GenericEnvClient
|
| 30 |
```
|
| 31 |
|
| 32 |
+
## The 7 Tasks
|
| 33 |
|
| 34 |
| Task | Root Cause | Difficulty | Heuristic Score |
|
| 35 |
|------|-----------|------------|-----------------|
|
| 36 |
+
| task_001 | lr_too_high | Easy | 1.00 |
|
| 37 |
| task_002 | vanishing_gradients | Easy | 1.00 |
|
| 38 |
+
| task_003 | data_leakage | Medium | 1.00 |
|
| 39 |
+
| task_004 | overfitting | Medium | 0.45 |
|
| 40 |
+
| task_005 | batchnorm_eval_mode | Hard | 1.00 |
|
| 41 |
| task_006 | code_bug (4 variants) | Hard | 1.00 |
|
| 42 |
+
| task_007 | scheduler_misconfigured | Med-Hard | 1.00 |
|
| 43 |
+
|
| 44 |
+
## Model Architectures (Dual)
|
| 45 |
+
- **SimpleCNN**: 3-layer CNN with BatchNorm, ~50K params (used for task_005, task_006)
|
| 46 |
+
- **SimpleMLP**: 3-layer MLP with BatchNorm1d, ~20K params
|
| 47 |
+
- Randomly selected per task/seed via `_pick_model_type(rng)`
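One way to read "randomly selected per task/seed" is a deterministic draw from a seeded generator. The sketch below uses the standard-library `random.Random` purely for illustration; the real `_pick_model_type(rng)` is not shown in this diff and may well use a `torch.Generator`. The labels `"cnn"`/`"mlp"` and the helper `model_type_for` are assumptions, as is pinning `task_005` and `task_006` to the CNN (inferred from the "used for" note above).

```python
import random

MODEL_TYPES = ("cnn", "mlp")  # hypothetical labels for SimpleCNN / SimpleMLP


def pick_model_type(rng: random.Random) -> str:
    """Draw one architecture label from the supplied generator."""
    return rng.choice(MODEL_TYPES)


def model_type_for(task_id: str, seed: int) -> str:
    """Same (task_id, seed) always yields the same architecture."""
    # Assumption: tasks 5 and 6 are pinned to the CNN, per the note above.
    if task_id in {"task_005", "task_006"}:
        return "cnn"
    # String seeds are hashed deterministically by random.Random.
    rng = random.Random(f"{task_id}:{seed}")
    return pick_model_type(rng)
```

Seeding from both the task ID and the seed keeps episodes reproducible while still varying the architecture across seeds.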
|
| 48 |
+
|
| 49 |
+
## Real Training Curves
|
| 50 |
+
- `run_real_training()` in pytorch_engine.py runs 20 real forward+backward epochs
|
| 51 |
+
- Cached per (task_id, seed, model_type) - first call ~2s, subsequent instant
|
| 52 |
+
- Replaces parametric formulas - judges see real training dynamics, not `torch.exp()`
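The caching behaviour described above (keyed on `(task_id, seed, model_type)`, slow first call, instant repeats) can be sketched as a small memoizing wrapper. This is an illustration only: `TrainingCurveCache` and `fake_training` are hypothetical names, and `fake_training` is a stand-in that returns a synthetic curve rather than running PyTorch.

```python
from typing import Callable, Dict, List, Tuple

CacheKey = Tuple[str, int, str]  # (task_id, seed, model_type)


class TrainingCurveCache:
    """Memoize an expensive training function on (task_id, seed, model_type)."""

    def __init__(self, train_fn: Callable[[str, int, str], List[float]]) -> None:
        self._train_fn = train_fn
        self._curves: Dict[CacheKey, List[float]] = {}
        self.misses = 0  # number of times the training function actually ran

    def get(self, task_id: str, seed: int, model_type: str) -> List[float]:
        key = (task_id, seed, model_type)
        if key not in self._curves:
            self.misses += 1
            self._curves[key] = self._train_fn(task_id, seed, model_type)
        # Return a copy so callers cannot mutate the cached curve.
        return list(self._curves[key])


def fake_training(task_id: str, seed: int, model_type: str) -> List[float]:
    """Stand-in for the real 20-epoch run: a deterministic decaying curve."""
    return [1.0 / (epoch + 1 + seed % 3) for epoch in range(20)]
```

Including `model_type` in the key matters because the same task and seed could otherwise return a CNN curve for an MLP episode.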
|
| 53 |
|
| 54 |
## Key Endpoints
|
| 55 |
|
| 56 |
+
- `GET /health` → `{"status": "ready", "tasks": 7}`
|
| 57 |
- `GET /tasks` → Task list with action schema
|
| 58 |
- `POST /grader` → Score after completed episode
|
| 59 |
- `POST /baseline` → Run heuristic baseline, return all scores
|
| 60 |
- `GET /dashboard` → Live diagnostic dashboard (Plotly.js)
|
| 61 |
+
- `GET /validation-report` → Pre-computed fidelity report (8/8 pass)
|
| 62 |
+
- `GET /curriculum` → Recommended task order with difficulty scaling
|
| 63 |
+
- `GET /leaderboard` → Sorted episode scores
|
| 64 |
+
- `GET /replay/{episode_id}` → Episode trace
|
| 65 |
+
- `WS /ws` → Primary agent interface
|
| 66 |
+
- Framework: `/reset`, `/step`, `/state`, `/schema`, `/docs`
|
| 67 |
|
| 68 |
+
## WebSocket Message Format
|
| 69 |
|
| 70 |
+
- Reset (select task): `{"type": "reset", "data": {"task_id": "task_003", "seed": 42}}`
|
| 71 |
+
- Reset (default): `{"type": "reset"}`
|
| 72 |
+
- Step: `{"type": "step", "data": {"action_type": "inspect_gradients"}}`
|
| 73 |
+
- Response: `{"type": "observation", "data": {"observation": {...}, "reward": float, "done": bool}}`
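The step and response shapes listed above can be exercised without a live server. The following is a minimal sketch using only `json`; `step_message`, `StepResult`, and `parse_observation` are hypothetical helper names, not part of the project's `client.py`.

```python
import json
from dataclasses import dataclass
from typing import Any, Dict


def step_message(action_type: str, **params: Any) -> str:
    """Serialize a step request in the documented {"type", "data"} envelope."""
    data = {"action_type": action_type, **params}
    return json.dumps({"type": "step", "data": data})


@dataclass
class StepResult:
    observation: Dict[str, Any]
    reward: float
    done: bool


def parse_observation(raw: str) -> StepResult:
    """Parse a server response of type "observation" into typed fields."""
    msg = json.loads(raw)
    if msg.get("type") != "observation":
        raise ValueError(f"unexpected message type: {msg.get('type')!r}")
    data = msg["data"]
    return StepResult(data["observation"], float(data["reward"]), bool(data["done"]))
```

Any extra action parameters (for example a fix payload) would go inside `data` next to `action_type`; the exact field names for those are not shown in this section.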
|
| 74 |
|
| 75 |
## Key Design Decisions
|
| 76 |
|
| 77 |
+
- **Grader ≠ Reward**: graders.py (holistic 0.0-1.0) vs reward_engine.py (per-step float)
|
| 78 |
+
- **Task IDs are opaque**: task_001-task_007
|
| 79 |
+
- **Task 6 diagnosis is ALWAYS `code_bug`** regardless of variant
|
| 80 |
+
- **Context-gated penalty**: -0.20 fires ONLY when `gradients_inspected=True AND gradients_were_normal=True`
|
| 81 |
- **Step penalty is flat -0.01** (never multiplied by step_count)
|
| 82 |
+
- **Difficulty scaling**: 1-5 via `difficulty_level` parameter in reset()
|
| 83 |
+
- **Confusion matrix** included in data batch stats
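Two of the reward rules above, the flat step penalty and the context-gated penalty, can be sketched in a few lines. This covers only those two of the seven reward components; `EpisodeFlags`, `penalty_terms`, and the `action_contradicts_evidence` argument are hypothetical names standing in for whatever `reward_engine.py` actually uses to decide that an action ignores the gradient evidence.

```python
from dataclasses import dataclass

STEP_PENALTY = -0.01            # flat, never scaled by step_count
CONTEXT_GATED_PENALTY = -0.20   # fires only behind the evidence gate


@dataclass
class EpisodeFlags:
    gradients_inspected: bool = False
    gradients_were_normal: bool = False
    step_count: int = 0


def penalty_terms(flags: EpisodeFlags, action_contradicts_evidence: bool) -> float:
    """Sum of the two penalty components for a single step."""
    total = STEP_PENALTY  # deliberately independent of flags.step_count
    gate_open = flags.gradients_inspected and flags.gradients_were_normal
    if action_contradicts_evidence and gate_open:
        total += CONTEXT_GATED_PENALTY
    return total
```

An agent that has not yet inspected gradients pays only the step penalty for the same action, which is the "no penalty for reasonable priors" behaviour.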
|
.claude/memory/project_status.md
CHANGED
|
@@ -1,39 +1,61 @@
|
|
| 1 |
---
|
| 2 |
-
name: Project Status as of 2026-03-
|
| 3 |
-
description: Current build/test/deployment status,
|
| 4 |
type: project
|
| 5 |
---
|
| 6 |
|
| 7 |
## Status: Code Complete, Deployment Pending
|
| 8 |
|
| 9 |
-
**Last verified**: 2026-03-
|
| 10 |
-
|
| 11 |
-
###
|
| 12 |
-
-
|
| 13 |
-
-
|
| 14 |
-
-
|
| 15 |
-
- Baseline bit-exact reproducible across runs
|
| 16 |
-
-
|
| 17 |
-
- Docker
|
| 18 |
-
-
|
| 19 |
-
-
|
| 20 |
-
-
|
| 21 |
-
|
| 22 |
-
###
|
| 23 |
-
|
| 24 |
-
|
| 25 |
-
|
| 26 |
-
|
| 27 |
-
|
| 28 |
|
| 29 |
### Pending
|
| 30 |
- [ ] Push to **public GitHub repo**
|
| 31 |
-
- [ ] Deploy to **HF Spaces** (Docker type, tag
|
| 32 |
-
- [ ]
|
|
|
|
|
|
|
|
|
|
|
|
|
| 33 |
|
| 34 |
### Known Limitations
|
| 35 |
-
-
|
| 36 |
-
- HTTP
|
| 37 |
-
-
|
| 38 |
-
-
|
| 39 |
-
- Validation report at `/validation-report` is hardcoded, not computed from real runs
|
|
|
|
| 1 |
---
|
| 2 |
+
name: Project Status as of 2026-03-30
|
| 3 |
+
description: Current build/test/deployment status, verified metrics, known limitations, and remaining work.
|
| 4 |
type: project
|
| 5 |
---
|
| 6 |
|
| 7 |
## Status: Code Complete, Deployment Pending
|
| 8 |
|
| 9 |
+
**Last verified**: 2026-03-30
|
| 10 |
+
|
| 11 |
+
### Verified Metrics
|
| 12 |
+
- **251 tests pass** (60s runtime due to real training)
|
| 13 |
+
- **95% coverage** on ml_training_debugger/ + server/
|
| 14 |
+
- **openenv validate** → `[OK] ML Debugger: Ready for multi-mode deployment`
|
| 15 |
+
- **Baseline bit-exact reproducible** across runs
|
| 16 |
+
- **Docker image: 885MB** (down from 1.96GB, a 55% reduction)
|
| 17 |
+
- **Docker uses torch 2.5.1+cpu** (multi-stage build, strip --strip-unneeded)
|
| 18 |
+
- **8/8 validation checks pass** (real training curves)
|
| 19 |
+
- **All endpoints work** (health, tasks, grader, baseline, dashboard, validation-report, curriculum, leaderboard, replay, schema, ws)
|
| 20 |
+
- **All 7 tasks selectable via WS**: `{"type": "reset", "data": {"task_id": "task_007"}}`
|
| 21 |
+
|
| 22 |
+
### Baseline Scores (Heuristic)
|
| 23 |
+
```
|
| 24 |
+
task_001: 1.0, task_002: 1.0, task_003: 1.0, task_004: 0.45,
|
| 25 |
+
task_005: 1.0, task_006: 1.0, task_007: 1.0
|
| 26 |
+
```
|
| 27 |
+
|
| 28 |
+
### LLM Baseline Scores (Measured)
|
| 29 |
+
- **Llama 3.3 70B** (Groq): 1.0, 1.0, 0.4, 0.45, 1.0, -, - (5/7 before rate limit)
|
| 30 |
+
- **Llama 3.1 8B** (Cerebras): 0.6, 0.05, 0.4, 0.6, 1.0, 0.6, 0.6 (avg 0.55)
|
| 31 |
+
- **Llama 3.1 8B** (Groq): 0.6, 0.05, 0.4, 0.6, 1.0, 1.0, 0.6 (avg 0.61)
|
| 32 |
+
|
| 33 |
+
### Features Implemented
|
| 34 |
+
- 7 tasks with 3 difficulty tiers + difficulty scaling (1-5)
|
| 35 |
+
- Dual architecture: SimpleCNN + SimpleMLP
|
| 36 |
+
- Real 20-epoch PyTorch mini-training (cached per task/seed)
|
| 37 |
+
- Context-gated reward penalty
|
| 38 |
+
- Code-level debugging (Task 6, 4 bug variants, AST validation)
|
| 39 |
+
- Task 7: LR Scheduler misconfigured
|
| 40 |
+
- Confusion matrix in data batch stats
|
| 41 |
+
- Curriculum, leaderboard, replay endpoints
|
| 42 |
+
- PAPER.md research summary
|
| 43 |
+
- EXPLANATION.md simple explanation
|
| 44 |
+
- Multi-provider LLM baseline (Groq, Cerebras, Gemini, OpenAI)
|
| 45 |
+
- Exploit resistance test (20-seed variance)
|
| 46 |
+
- deploy-hf.sh deployment script
|
| 47 |
|
| 48 |
### Pending
|
| 49 |
- [ ] Push to **public GitHub repo**
|
| 50 |
+
- [ ] Deploy to **HF Spaces** (Docker type, tag `openenv`)
|
| 51 |
+
- [ ] Run 70B baseline for tasks 6-7 (Groq quota resets daily)
|
| 52 |
+
- [ ] Record dashboard GIF for README
|
| 53 |
+
|
| 54 |
+
### Docker Size History
|
| 55 |
+
1.96GB → 1.48GB → 1.09GB → **885MB** (irreducible: libtorch_cpu.so=329MB stripped)
|
| 56 |
|
| 57 |
### Known Limitations
|
| 58 |
+
- Docker 885MB (target was 500MB; libtorch_cpu.so is irreducible)
|
| 59 |
+
- HTTP /reset and /step are stateless (framework design; WS is primary interface)
|
| 60 |
+
- Heuristic outperforms LLMs on most tasks (environment rewards domain knowledge)
|
| 61 |
+
- `replace_optimizer` and `rollback_checkpoint` are no-op actions
|
|
|
CLAUDE.md
CHANGED
|
@@ -29,7 +29,7 @@ Every computation in core modules uses `torch.Tensor`, not `numpy.ndarray`. `imp
|
|
| 29 |
These are separate modules with separate purposes. The **reward function** (`reward_engine.py`) returns a float per step for RL training signal. The **grader** (`graders.py`) returns a normalized 0.0-1.0 score at episode end for the `/grader` endpoint and auto-validation. The grader evaluates `EpisodeState` holistically; it is **not** a sum of step rewards. Never conflate them.
|
| 30 |
|
| 31 |
### Opaque Task IDs
|
| 32 |
-
Task IDs are `task_001` through `
|
| 33 |
|
| 34 |
---
|
| 35 |
|
|
@@ -102,7 +102,7 @@ Test with intentionally messy fixes: `" loss = criterion(output, batch_y) # fi
|
|
| 102 |
The `gradients_were_normal` flag is set **inside** the `inspect_gradients` handler, based on whether `is_exploding` is False on **all** layers. The threshold for `is_exploding` is `mean_norm > 10.0`. The threshold for `is_vanishing` is `mean_norm < 1e-6`. In Task 5, the FC spike has `is_exploding: False` (it spiked but the mean norm stays below 10.0), so `gradients_were_normal` is set to True. This is the gate that makes the penalty fire when the agent then calls `add_callback`.
|
| 103 |
|
| 104 |
### Docker Image Size
|
| 105 |
-
|
| 106 |
|
| 107 |
### Baseline Reproducibility
|
| 108 |
The rule-based baseline must produce **bit-exact identical** scores on two consecutive runs. This requires:
|
|
@@ -112,7 +112,7 @@ The rule-based baseline must produce **bit-exact identical** scores on two conse
|
|
| 112 |
|
| 113 |
### Auto-Validator Endpoints
|
| 114 |
These endpoints are checked programmatically. They must respond correctly or you are disqualified:
|
| 115 |
-
- `GET /health` -> `{"status": "ready", "tasks": N}` (200) - N is the number of active tasks (
|
| 116 |
- `GET /tasks` -> list of tasks with IDs and action schema (200)
|
| 117 |
- `POST /grader` -> `{"score": float}` after a completed episode (200)
|
| 118 |
- `POST /baseline` -> scores for all tasks (200)
|
|
|
|
| 29 |
These are separate modules with separate purposes. The **reward function** (`reward_engine.py`) returns a float per step for RL training signal. The **grader** (`graders.py`) returns a normalized 0.0-1.0 score at episode end for the `/grader` endpoint and auto-validation. The grader evaluates `EpisodeState` holistically; it is **not** a sum of step rewards. Never conflate them.
|
| 30 |
|
| 31 |
### Opaque Task IDs
|
| 32 |
+
Task IDs are `task_001` through `task_007`. The agent must never be able to infer the diagnosis from the task ID. Do not use descriptive names anywhere the agent can observe them.
|
| 33 |
|
| 34 |
---
|
| 35 |
|
|
|
|
| 102 |
The `gradients_were_normal` flag is set **inside** the `inspect_gradients` handler, based on whether `is_exploding` is False on **all** layers. The threshold for `is_exploding` is `mean_norm > 10.0`. The threshold for `is_vanishing` is `mean_norm < 1e-6`. In Task 5, the FC spike has `is_exploding: False` (it spiked but the mean norm stays below 10.0), so `gradients_were_normal` is set to True. This is the gate that makes the penalty fire when the agent then calls `add_callback`.
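The thresholds in the paragraph above translate directly into code. A minimal sketch, assuming per-layer mean gradient norms are available as plain floats; the function names, layer names, and example values are hypothetical, while the two thresholds and the "exploding on no layer" rule come from the text.

```python
from typing import Dict

EXPLODING_THRESHOLD = 10.0   # is_exploding: mean_norm > 10.0
VANISHING_THRESHOLD = 1e-6   # is_vanishing: mean_norm < 1e-6


def classify_layer(mean_norm: float) -> Dict[str, bool]:
    """Apply both thresholds to one layer's mean gradient norm."""
    return {
        "is_exploding": mean_norm > EXPLODING_THRESHOLD,
        "is_vanishing": mean_norm < VANISHING_THRESHOLD,
    }


def gradients_were_normal(mean_norms: Dict[str, float]) -> bool:
    """True when no layer is exploding; vanishing layers do not affect the flag."""
    return all(not classify_layer(n)["is_exploding"] for n in mean_norms.values())
```

With a Task 5 style reading such as `{"conv1": 1e-5, "fc": 8.5}`, the FC spike stays below 10.0, so the flag is set and the gate for the later penalty is open.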
|
| 103 |
|
| 104 |
### Docker Image Size
|
| 105 |
+
Current: 885MB. Uses torch 2.5.1+cpu with multi-stage build and `strip --strip-unneeded`. The irreducible minimum is `libtorch_cpu.so` (329MB stripped). Use `python:3.12-slim` base. Do NOT install CUDA.
|
| 106 |
|
| 107 |
### Baseline Reproducibility
|
| 108 |
The rule-based baseline must produce **bit-exact identical** scores on two consecutive runs. This requires:
|
|
|
|
| 112 |
|
| 113 |
### Auto-Validator Endpoints
|
| 114 |
These endpoints are checked programmatically. They must respond correctly or you are disqualified:
|
| 115 |
+
- `GET /health` -> `{"status": "ready", "tasks": N}` (200) - N is the number of active tasks (7 for full)
|
| 116 |
- `GET /tasks` -> list of tasks with IDs and action schema (200)
|
| 117 |
- `POST /grader` -> `{"score": float}` after a completed episode (200)
|
| 118 |
- `POST /baseline` -> scores for all tasks (200)
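Because these endpoints are checked programmatically, it can help to assert the response shapes locally before submitting. The sketch below validates already-decoded JSON payloads, so it needs no running server; `check_health` and `check_grader` are hypothetical helpers, and the 0.0-1.0 range check on the score comes from the grader description earlier in this file, not from the validator itself.

```python
from typing import Any, Dict


def check_health(payload: Dict[str, Any], expected_tasks: int = 7) -> bool:
    """GET /health should report status "ready" and the active task count."""
    return payload.get("status") == "ready" and payload.get("tasks") == expected_tasks


def check_grader(payload: Dict[str, Any]) -> bool:
    """POST /grader should return a numeric score within [0.0, 1.0]."""
    score = payload.get("score")
    # bool is a subclass of int in Python, so exclude it explicitly.
    if isinstance(score, bool) or not isinstance(score, (int, float)):
        return False
    return 0.0 <= score <= 1.0
```

In a real smoke test the payloads would come from an HTTP client pointed at the deployed Space, with a check for the 200 status code alongside these shape checks.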
|
EXPLANATION.md
CHANGED
|
@@ -333,8 +333,8 @@ Then they fix the right part and test-drive it to confirm.
|
|
| 333 |
|----------|--------|
|
| 334 |
| **What?** | A practice environment where AI agents learn to debug broken PyTorch training runs |
|
| 335 |
| **Why?** | Real ML debugging costs companies millions. Training AI to do it has huge value. |
|
| 336 |
-
| **How?** |
|
| 337 |
-
| **What's special?** | Real
|
| 338 |
| **Who's it for?** | AI researchers building smarter debugging agents |
|
| 339 |
| **Built with?** | Python, PyTorch, FastAPI, OpenEnv, Pydantic, Docker |
|
| 340 |
| **For what event?** | Meta PyTorch OpenEnv Hackathon x Scaler School of Technology |
|
|
|
|
| 333 |
|----------|--------|
|
| 334 |
| **What?** | A practice environment where AI agents learn to debug broken PyTorch training runs |
|
| 335 |
| **Why?** | Real ML debugging costs companies millions. Training AI to do it has huge value. |
|
| 336 |
+
| **How?** | 7 mystery cases with real PyTorch training (CNN + MLP), progressive clue reveal, and smart scoring |
|
| 337 |
+
| **What's special?** | Real 20-epoch training, dual architectures, context-gated rewards, code-level debugging, red herrings, difficulty scaling |
|
| 338 |
| **Who's it for?** | AI researchers building smarter debugging agents |
|
| 339 |
| **Built with?** | Python, PyTorch, FastAPI, OpenEnv, Pydantic, Docker |
|
| 340 |
| **For what event?** | Meta PyTorch OpenEnv Hackathon x Scaler School of Technology |
|
PAPER.md
CHANGED
|
@@ -30,19 +30,20 @@ This teaches agents a transferable skill: *don't ignore what you've already lear
|
|
| 30 |
|
| 31 |
## Results
|
| 32 |
|
| 33 |
-
|
| 34 |
-
|
| 35 |
-
| Task | Heuristic |
|
| 36 |
-
|------|-----------|-------------|
|
| 37 |
-
| task_001 | 1.00 |
|
| 38 |
-
| task_002 | 1.00 |
|
| 39 |
-
| task_003 | 1.00 |
|
| 40 |
-
| task_004 |
|
| 41 |
-
| task_005 |
|
| 42 |
-
| task_006 | 1.00 |
|
| 43 |
-
| task_007 |
|
| 44 |
-
|
| 45 |
-
|
|
|
|
| 46 |
|
| 47 |
## Conclusion
|
| 48 |
|
|
|
|
| 30 |
|
| 31 |
## Results
|
| 32 |
|
| 33 |
+
A three-agent comparison shows that the environment differentiates between agent types:
|
| 34 |
+
|
| 35 |
+
| Task | Heuristic | Llama 3.3 70B | Llama 3.1 8B |
|
| 36 |
+
|------|-----------|---------------|--------------|
|
| 37 |
+
| task_001 | **1.00** | 1.00 | 0.60 |
|
| 38 |
+
| task_002 | **1.00** | 1.00 | 0.05 |
|
| 39 |
+
| task_003 | **1.00** | 0.40 | 0.40 |
|
| 40 |
+
| task_004 | 0.45 | 0.45 | **0.60** |
|
| 41 |
+
| task_005 | **1.00** | 1.00 | 1.00 |
|
| 42 |
+
| task_006 | **1.00** | - | 0.60 |
|
| 43 |
+
| task_007 | **1.00** | - | 0.60 |
|
| 44 |
+
| **Average** | **0.92** | 0.69* | 0.55 |
|
| 45 |
+
|
| 46 |
+
Key findings: (1) Model size matters: 70B scores 25% higher than 8B. (2) The domain-specific heuristic (0.92) outperforms general LLMs (0.55-0.69), suggesting the environment rewards systematic debugging. (3) Task 4 is the exception: the 8B model (0.60) beats both the rigid heuristic and the 70B model (0.45 each) on subtle real training curves.
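The column averages can be recomputed from the table with a short script. This sketch skips unmeasured tasks, which is an assumption about how missing entries should be treated. Under that convention the heuristic and 8B columns reproduce the table (0.92 and 0.55), while the 70B column comes out at 0.77 over its five measured tasks, so the 0.69* in the table presumably follows a different convention described by the asterisk's footnote, which is not visible here.

```python
from typing import Dict, List, Optional

# Scores copied from the results table; None marks tasks that were not run.
SCORES: Dict[str, List[Optional[float]]] = {
    "Heuristic": [1.00, 1.00, 1.00, 0.45, 1.00, 1.00, 1.00],
    "Llama 3.3 70B": [1.00, 1.00, 0.40, 0.45, 1.00, None, None],
    "Llama 3.1 8B": [0.60, 0.05, 0.40, 0.60, 1.00, 0.60, 0.60],
}


def mean_measured(scores: List[Optional[float]]) -> float:
    """Average over measured tasks only, rounded to two decimals."""
    measured = [s for s in scores if s is not None]
    return round(sum(measured) / len(measured), 2)


if __name__ == "__main__":
    for agent, scores in SCORES.items():
        print(f"{agent}: {mean_measured(scores)}")
```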
|
| 47 |
|
| 48 |
## Conclusion
|
| 49 |
|
README.md
CHANGED
|
@@ -15,9 +15,11 @@ This environment recreates the experience of an ML engineer facing a broken PyTo
|
|
| 15 |
|
| 16 |
### Key Differentiators
|
| 17 |
|
| 18 |
-
- **PyTorch
|
|
|
|
| 19 |
- **Context-gated reward shaping** - Penalty fires only when agent ignores evidence it already gathered; no penalty for reasonable priors
|
| 20 |
-
- **Progressive information reveal** - Gradient stats, weight stats, data batch stats only populated after corresponding inspection actions
|
|
|
|
| 21 |
|
| 22 |
## Environment Design
|
| 23 |
|
|
@@ -81,6 +83,9 @@ Dynamic availability: `restart_run` requires a fix first; `fix_code` requires co
|
|
| 81 |
| `task_004` | Medium | `overfitting` | Train-val divergence: loss approaches 0 while val loss climbs |
|
| 82 |
| `task_005` | Hard | `batchnorm_eval_mode` | Model in eval mode with compound red herrings (FC gradient spike, GPU 91%, near-vanishing conv1) |
|
| 83 |
| `task_006` | Hard | `code_bug` | PyTorch code bug: agent must read and fix actual Python code (4 bug variants) |
|
|
|
|
|
|
|
|
|
|
| 84 |
|
| 85 |
## Baseline Scores
|
| 86 |
|
|
@@ -206,22 +211,31 @@ Returns: `{"type": "observation", "data": {"observation": {...}, "reward": float
|
|
| 206 |
|
| 207 |
## Validation Suite
|
| 208 |
|
| 209 |
-
|
| 210 |
|
| 211 |
-
**Methodology:** Real
|
| 212 |
|
| 213 |
-
**Coverage:** Exploding gradients, vanishing gradients, data leakage, overfitting, BatchNorm eval mode,
|
| 214 |
|
| 215 |
## Architecture
|
| 216 |
|
| 217 |
-
- **Python 3.12** · PyTorch CPU-only · openenv-core
|
| 218 |
-
-
|
| 219 |
-
-
|
| 220 |
- Typed Pydantic models everywhere - no `Dict[str, Any]`
|
| 221 |
- `import torch` in every core module - zero numpy in core
|
| 222 |
- Session isolation via per-session `EpisodeState`
|
| 223 |
- Deterministic reproducibility via `torch.manual_seed()`
|
|
|
|
| 224 |
|
| 225 |
### Docker Image Size
|
| 226 |
|
| 227 |
-
The Docker image is
|
| 15 |
|
| 16 |
### Key Differentiators
|
| 17 |
|
| 18 |
+
- **Real PyTorch mini-training** - 20 real forward+backward epochs per reset, cached for instant replay. Loss/accuracy curves come from real training, not parametric formulas.
|
| 19 |
+
- **Dual model architectures** - SimpleCNN (~50K params) and SimpleMLP (~20K params) randomly selected per episode
|
| 20 |
- **Context-gated reward shaping** - Penalty fires only when agent ignores evidence it already gathered; no penalty for reasonable priors
|
| 21 |
+
- **Progressive information reveal** - Gradient stats, weight stats, data batch stats, confusion matrices only populated after corresponding inspection actions
|
| 22 |
+
- **7 tasks with difficulty scaling** - Easy to hard, with configurable difficulty level (1-5) per task
|
| 23 |
|
| 24 |
## Environment Design
|
| 25 |
|
|
|
|
| 83 |
| `task_004` | Medium | `overfitting` | Train-val divergence: loss approaches 0 while val loss climbs |
|
| 84 |
| `task_005` | Hard | `batchnorm_eval_mode` | Model in eval mode with compound red herrings (FC gradient spike, GPU 91%, near-vanishing conv1) |
|
| 85 |
| `task_006` | Hard | `code_bug` | PyTorch code bug: agent must read and fix actual Python code (4 bug variants) |
|
| 86 |
+
| `task_007` | Med-Hard | `scheduler_misconfigured` | LR scheduler with wrong gamma/step_size: training stagnates after initial progress |
|
| 87 |
+
|
| 88 |
+
All tasks support `difficulty_level` (1-5) via reset: `{"type": "reset", "data": {"task_id": "task_005", "difficulty_level": 4}}`
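A small builder makes the reset payload harder to get wrong. This is a sketch under stated assumptions: `reset_message` is a hypothetical helper, and the client-side range check on `difficulty_level` mirrors the documented 1-5 range without claiming the server enforces it the same way.

```python
import json
from typing import Any, Dict, Optional


def reset_message(
    task_id: Optional[str] = None,
    seed: Optional[int] = None,
    difficulty_level: Optional[int] = None,
) -> str:
    """Build a WebSocket reset message; omitted fields are left out of "data"."""
    data: Dict[str, Any] = {}
    if task_id is not None:
        data["task_id"] = task_id
    if seed is not None:
        data["seed"] = seed
    if difficulty_level is not None:
        if not 1 <= difficulty_level <= 5:
            raise ValueError("difficulty_level must be between 1 and 5")
        data["difficulty_level"] = difficulty_level
    msg: Dict[str, Any] = {"type": "reset"}
    if data:
        msg["data"] = data  # a bare {"type": "reset"} selects the default task
    return json.dumps(msg)
```

Calling `reset_message("task_005", difficulty_level=4)` yields the same payload as the example above.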
|
| 89 |
|
| 90 |
## Baseline Scores
|
| 91 |
|
|
|
|
| 211 |
|
| 212 |
## Validation Suite
|
| 213 |
|
| 214 |
+
8/8 validation checks pass, served live at `GET /validation-report`:
|
| 215 |
|
| 216 |
+
**Methodology:** Real PyTorch 20-epoch mini-training with fault injection. Each fault type is validated with behavioral checks (gradient detection, loss patterns, model mode, code fix acceptance). Both SimpleCNN and SimpleMLP architectures verified.
|
| 217 |
|
| 218 |
+
**Coverage:** Exploding gradients, vanishing gradients, data leakage, overfitting, BatchNorm eval mode, code bugs (4 variants), scheduler misconfigured, dual architecture.
|
| 219 |
|
| 220 |
## Architecture
|
| 221 |
|
| 222 |
+
- **Python 3.12** · PyTorch 2.5.1 CPU-only · openenv-core v0.2.2
|
| 223 |
+
- **Dual model architectures**: SimpleCNN (~50K params) + SimpleMLP (~20K params)
|
| 224 |
+
- **Real 20-epoch mini-training** per reset (cached per task/seed for instant replay)
|
| 225 |
- Typed Pydantic models everywhere - no `Dict[str, Any]`
|
| 226 |
- `import torch` in every core module - zero numpy in core
|
| 227 |
- Session isolation via per-session `EpisodeState`
|
| 228 |
- Deterministic reproducibility via `torch.manual_seed()`
|
| 229 |
+
- **251 tests, 95% coverage**
|
| 230 |
|
| 231 |
### Docker Image Size
|
| 232 |
|
| 233 |
+
The Docker image is **885MB** (optimized from 1.96GB via multi-stage build, torch 2.5.1, `strip --strip-unneeded`, and removal of unused transitive dependencies). The core `libtorch_cpu.so` (329MB stripped) is the irreducible minimum for real `torch.nn.Module`, `torch.autograd`, and `model.state_dict()` support - the intentional trade-off for authentic PyTorch computation vs synthetic data.
|
| 234 |
+
|
| 235 |
+
### Research Paper
|
| 236 |
+
|
| 237 |
+
See [PAPER.md](PAPER.md): "Context-Gated Reward Shaping for Evidence-Based ML Debugging"
|
| 238 |
+
|
| 239 |
+
### Project Explanation
|
| 240 |
+
|
| 241 |
+
See [EXPLANATION.md](EXPLANATION.md): full project explanation in simple language
|