Space: ujjwalpardeshi/pytorch-training-debugger

omkarrr88 committed on Mar 30

Commit

f4c428c

1 Parent(s): 45eee48

updates docs


Files changed (8)

.claude/memory/MEMORY.md +4 -4
.claude/memory/feedback_docker_stripping.md +37 -14
.claude/memory/project_overview.md +41 -23
.claude/memory/project_status.md +50 -28
CLAUDE.md +3 -3
EXPLANATION.md +2 -2
PAPER.md +14 -13
README.md +23 -9

.claude/memory/MEMORY.md CHANGED

@@ -1,9 +1,9 @@
 # Memory Index
-- [Project Overview](project_overview.md) — Architecture, 6 tasks, endpoints, WS format, key design decisions
-- [Project Status](project_status.md) — Build/test/deploy status as of 2026-03-28, known limitations
 - [Hackathon Rules](project_hackathon_rules.md) — Scoring rubric, DQ criteria, submission requirements
 - [Spec Documents](reference_spec_docs.md) — Which files are source of truth, key spec sections
-- [Docker Stripping](feedback_docker_stripping.md) — Which torch dirs are safe/unsafe to remove in Docker
-- [WS Message Format](feedback_ws_format.md) — openenv-core WS expects "data" not "action", no extra fields on reset
 - [User Context](user_context.md) — Omkar building hackathon submission, values thorough testing

 # Memory Index
+- [Project Overview](project_overview.md) — Architecture, 7 tasks, dual model (CNN+MLP), real training, endpoints, WS format
+- [Project Status](project_status.md) — 251 tests/95% cov/885MB Docker/LLM scores, as of 2026-03-30
 - [Hackathon Rules](project_hackathon_rules.md) — Scoring rubric, DQ criteria, submission requirements
 - [Spec Documents](reference_spec_docs.md) — Which files are source of truth, key spec sections
+- [Docker Stripping](feedback_docker_stripping.md) — torch 2.5.1 + multi-stage + strip = 885MB, what breaks/safe
+- [WS Message Format](feedback_ws_format.md) — WS task selection via data field, correct step format
 - [User Context](user_context.md) — Omkar building hackathon submission, values thorough testing

.claude/memory/feedback_docker_stripping.md CHANGED

@@ -1,23 +1,46 @@
 ---
-name: Docker torch stripping — what breaks
-description: Lessons learned from aggressive PyTorch stripping in Docker. Which dirs are safe to remove and which break imports.
 type: feedback
 ---
-Do NOT remove these torch directories in Docker — they break `import torch`:
-- `torch/cuda` → `ModuleNotFoundError: No module named 'torch.cuda'` (imported at `_initExtension`)
-- `torch/distributed` → `ModuleNotFoundError` (imported via `torch._jit_internal`)
-- `torch/testing` → `ModuleNotFoundError` (imported via `torch.autograd.gradcheck`)
-- `torch/jit` → Required by core torch init
-- `torch/fx` → Required by `torch._functorch`
-- `torch/_functorch` → Required by core init
-- `torch/sparse`, `torch/nested`, `torch/masked` → Required by `torch.nn`
-**Why:** PyTorch's `__init__.py` eagerly imports these modules during initialization. Even CPU-only builds reference them.
-**Safe to remove** (verified working): `torch/test`, `torch/include`, `torch/share`, `torch/utils/benchmark`, `torch/utils/bottleneck`, `torch/utils/tensorboard`, `torch/lib/*.a`, `torch/lib/libtorchbind_test.so`, `torch/lib/libjitbackend_test.so`, `torch/lib/libbackend_with_compiler.so`, `caffe2/`, `torch/_inductor`, `torch/_dynamo`, `torch/onnx`, `torch/_export`, `torch/compiler`, `torch/package`, `torch/profiler`, `torch/export`, `.pyi` files
-**How to apply:** Always combine pip install + cleanup in ONE Docker RUN layer. Separate layers don't reduce size.
-**`strip --strip-debug` on .so files**: Did NOT reduce `libtorch_cpu.so` size (426MB → 426MB). The pre-built CPU wheel has no debug symbols.

 ---
+name: Docker torch stripping — what breaks and final optimized approach
+description: Lessons learned from Docker optimization. Final image 885MB using torch 2.5.1 + multi-stage + strip. Which dirs break, which are safe.
 type: feedback
 ---
+## Final Optimized Dockerfile Approach (885MB)
+1. **Use torch 2.5.1+cpu** (not latest 2.11.0) — smaller wheel, libtorch_cpu.so strips to 329MB
+2. **Multi-stage build**: builder installs + strips, runtime copies only site-packages
+3. **`strip --strip-unneeded`** on ALL .so files in one RUN layer
+4. **`--no-compile`** flag on pip install (skip .pyc generation)
+5. **Remove bloated transitive deps** in same layer: gradio (155MB), pandas (42MB), PIL, pip, setuptools
+## Do NOT Remove (breaks `import torch` or runtime)
+- `torch/testing` → required by `torch.autograd.gradcheck`
+- `torch/distributed` → required by `torch._jit_internal`
+- `torch/cuda` → required at `_initExtension`
+- `torch/_inductor`, `torch/_dynamo` → required by `torch.optim` (optimizer init)
+- `torch/_functorch` → required by core init
+- `torch/fx` → required by `_functorch`
+- `torch/sparse`, `torch/nested`, `torch/masked` → required by `torch.nn`
+- `torch/onnx`, `torch/ao`, `torch/_export`, `torch/jit` → required at import time
+- `torchgen` → required by `torch.utils._python_dispatch`
+- `sympy` + `mpmath` → required by `torch._dynamo.utils`
+- `numpy` + `numpy.libs` → required by `torch.storage`
+- `beartype` → required by `fastmcp` → `openenv-core`
+- `pygments` → required by `rich` → `fastmcp`
+- `torch/bin/torch_shm_manager` → required at `_initExtension`
+## Safe to Remove (verified working after removal)
+- `torch/test`, `torch/include`, `torch/share` — dev/test files
+- `torch/bin/*` EXCEPT `torch_shm_manager` — test binaries (47MB)
+- `torch/utils/benchmark`, `torch/utils/bottleneck`, `torch/utils/tensorboard`
+- `torch/lib/*.a`, `torch/lib/libtorchbind_test.so`, `torch/lib/libjitbackend_test.so`, etc.
+- `caffe2/` — not used
+- `gradio`, `gradio_client`, `hf_gradio` — pulled by openenv-core, not needed at runtime
+- `pandas`, `PIL/Pillow`, `networkx`, `scipy`, `matplotlib`
+- `pip`, `setuptools`, `docutils`, `cryptography`, `pytz`
+- `ffmpy`, `pydub`, `groovy`, `tomlkit`, `semantic_version`, `safehttpx`, `brotli`
+- All `.pyi` files, `__pycache__`, `.pyc`, stale `.dist-info`
+## Older Torch NOT Smaller
+torch 2.2.0+cpu has a smaller wheel (179MB) but installed to 932MB (numpy version mismatch, no strip benefit). torch 2.5.1+cpu at 885MB is the sweet spot.
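
The "Safe to Remove" list above could be applied by a small pruning script run in the builder stage. The following is a minimal sketch under stated assumptions, not the project's Dockerfile logic: `prune_site_packages`, `SAFE_TO_REMOVE`, and `KEEP_IN_TORCH_BIN` are hypothetical names, and only a subset of the listed paths is included.

```python
import shutil
from pathlib import Path

# Subset of the "Safe to Remove" list in the notes above (relative to site-packages).
SAFE_TO_REMOVE = [
    "torch/test",
    "torch/include",
    "torch/share",
    "torch/utils/benchmark",
    "torch/utils/bottleneck",
    "torch/utils/tensorboard",
    "caffe2",
]

# The notes say torch/bin can be emptied except for this one required binary.
KEEP_IN_TORCH_BIN = {"torch_shm_manager"}


def prune_site_packages(site_packages: Path) -> list[str]:
    """Delete the safe-to-remove paths and return what was actually removed.

    Hypothetical helper: paths that do not exist are skipped silently.
    """
    removed = []
    for rel in SAFE_TO_REMOVE:
        target = site_packages / rel
        if target.is_dir():
            shutil.rmtree(target)
            removed.append(rel)

    # Remove test binaries from torch/bin but keep the required shm manager.
    bin_dir = site_packages / "torch" / "bin"
    if bin_dir.is_dir():
        for entry in sorted(bin_dir.iterdir()):
            if entry.is_file() and entry.name not in KEEP_IN_TORCH_BIN:
                entry.unlink()
                removed.append(f"torch/bin/{entry.name}")
    return removed
```

Nothing in the "Do NOT Remove" list (for example `torch/cuda` or `torch/distributed`) appears in `SAFE_TO_REMOVE`, so those directories are left untouched. Running `python -c "import torch"` after pruning would be a reasonable smoke test.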

.claude/memory/project_overview.md CHANGED

@@ -1,6 +1,6 @@
 ---
 name: ML Debugger Project Overview
-description: PyTorch Training Run Debugger — OpenEnv RL environment for Meta PyTorch Hackathon. Core architecture, 6 tasks, key modules, and how they connect.
 type: project
 ---
@@ -8,58 +8,76 @@ type: project
 A complete OpenEnv RL environment where an AI agent debugs broken PyTorch training runs. Built for the **Meta PyTorch OpenEnv Hackathon x Scaler School of Technology** (Round 1 deadline: April 8, 2026).
-**Runtime**: Python 3.12 · PyTorch CPU-only · openenv-core v0.2.2
 ## Architecture
 ```
 server/app.py          → FastAPI app via create_app() from openenv-core
 server/environment.py  → MLTrainingEnvironment(Environment) — reset(), step(), state
-server/_baseline_results.py → Shared grader result storage across endpoints
 ml_training_debugger/
   models.py            → All Pydantic models (Action, Observation, EpisodeState, etc.)
-  scenarios.py         → ScenarioParams dataclass + sample_scenario(task_id, seed)
-  pytorch_engine.py    → SimpleCNN model, fault injection, gradient/weight extraction
-  simulation.py        → Parametric curve generation (loss/accuracy histories) — all torch ops
   reward_engine.py     → 7-component reward function (per-step RL signal)
   graders.py           → Per-task grader functions (0.0-1.0 holistic score at episode end)
   code_templates.py    → Task 6 code bug templates + multi-strategy fix validation
   client.py            → MLTrainingEnvClient extending GenericEnvClient
 ```
-## The 6 Tasks
 | Task | Root Cause | Difficulty | Heuristic Score |
 |------|-----------|------------|-----------------|
-| task_001 | lr_too_high (exploding gradients) | Easy | 1.00 |
 | task_002 | vanishing_gradients | Easy | 1.00 |
-| task_003 | data_leakage (class_overlap_score) | Medium | 1.00 |
-| task_004 | overfitting (train-val divergence) | Medium | 1.00 |
-| task_005 | batchnorm_eval_mode (red herrings) | Hard | 0.35 |
 | task_006 | code_bug (4 variants) | Hard | 1.00 |
 ## Key Endpoints
-- `GET /health` → `{"status": "ready", "tasks": 6}`
 - `GET /tasks` → Task list with action schema
 - `POST /grader` → Score after completed episode
 - `POST /baseline` → Run heuristic baseline, return all scores
 - `GET /dashboard` → Live diagnostic dashboard (Plotly.js)
-- `GET /validation-report` → Pre-computed fidelity report
-- `WS /ws` → Primary agent interface (framework-provided)
-- Framework also provides: `/reset`, `/step`, `/state`, `/schema`, `/docs`
-## WebSocket Message Format (Critical!)
-- Reset: `{"type": "reset"}` — NO extra fields (task_id NOT accepted via WS, defaults to task_001)
-- Step: `{"type": "step", "data": {"action_type": "inspect_gradients"}}` — use `"data"` NOT `"action"`
-- HTTP step wraps differently: `POST /step {"action": {"action_type": "..."}}`
 ## Key Design Decisions
-- **Grader ≠ Reward**: `graders.py` (holistic 0.0-1.0 at episode end) vs `reward_engine.py` (per-step float)
-- **Task IDs are opaque**: `task_001`-`task_006` — agent can't infer diagnosis from ID
-- **Task 6 diagnosis is ALWAYS `code_bug`** regardless of bug variant (eval_mode, detach_loss, etc.)
-- **Context-gated penalty**: -0.20 fires ONLY when `gradients_inspected=True AND gradients_were_normal=True` then `add_callback`
 - **Step penalty is flat -0.01** (never multiplied by step_count)

 ---
 name: ML Debugger Project Overview
+description: PyTorch Training Run Debugger — OpenEnv RL environment for Meta PyTorch Hackathon. Core architecture, 7 tasks, dual model, real training, key modules.
 type: project
 ---
 A complete OpenEnv RL environment where an AI agent debugs broken PyTorch training runs. Built for the **Meta PyTorch OpenEnv Hackathon x Scaler School of Technology** (Round 1 deadline: April 8, 2026).
+**Runtime**: Python 3.12 · PyTorch 2.5.1 CPU-only · openenv-core v0.2.2
 ## Architecture
 ```
 server/app.py          → FastAPI app via create_app() from openenv-core
 server/environment.py  → MLTrainingEnvironment(Environment) — reset(), step(), state
+server/_baseline_results.py → Shared grader result storage
+server/dashboard.html  → Live 4-panel Plotly.js dashboard
 ml_training_debugger/
   models.py            → All Pydantic models (Action, Observation, EpisodeState, etc.)
+  scenarios.py         → ScenarioParams + sample_scenario() — 7 tasks, model_type, difficulty_level
+  pytorch_engine.py    → SimpleCNN + SimpleMLP, fault injection, gradient/weight extraction, run_real_training() with caching
+  simulation.py        → Calls run_real_training() for curves, parametric fallback
   reward_engine.py     → 7-component reward function (per-step RL signal)
   graders.py           → Per-task grader functions (0.0-1.0 holistic score at episode end)
   code_templates.py    → Task 6 code bug templates + multi-strategy fix validation
   client.py            → MLTrainingEnvClient extending GenericEnvClient
 ```
+## The 7 Tasks
 | Task | Root Cause | Difficulty | Heuristic Score |
 |------|-----------|------------|-----------------|
+| task_001 | lr_too_high | Easy | 1.00 |
 | task_002 | vanishing_gradients | Easy | 1.00 |
+| task_003 | data_leakage | Medium | 1.00 |
+| task_004 | overfitting | Medium | 0.45 |
+| task_005 | batchnorm_eval_mode | Hard | 1.00 |
 | task_006 | code_bug (4 variants) | Hard | 1.00 |
+| task_007 | scheduler_misconfigured | Med-Hard | 1.00 |
+## Model Architectures (Dual)
+- **SimpleCNN**: 3-layer CNN with BatchNorm, ~50K params (used for task_005, task_006)
+- **SimpleMLP**: 3-layer MLP with BatchNorm1d, ~20K params
+- Randomly selected per task/seed via `_pick_model_type(rng)`
+## Real Training Curves
+- `run_real_training()` in pytorch_engine.py runs 20 real forward+backward epochs
+- Cached per (task_id, seed, model_type) — first call ~2s, subsequent instant
+- Replaces parametric formulas — judges see real training dynamics, not `torch.exp()`
 ## Key Endpoints
+- `GET /health` → `{"status": "ready", "tasks": 7}`
 - `GET /tasks` → Task list with action schema
 - `POST /grader` → Score after completed episode
 - `POST /baseline` → Run heuristic baseline, return all scores
 - `GET /dashboard` → Live diagnostic dashboard (Plotly.js)
+- `GET /validation-report` → Pre-computed fidelity report (8/8 pass)
+- `GET /curriculum` → Recommended task order with difficulty scaling
+- `GET /leaderboard` → Sorted episode scores
+- `GET /replay/{episode_id}` → Episode trace
+- `WS /ws` → Primary agent interface
+- Framework: `/reset`, `/step`, `/state`, `/schema`, `/docs`
+## WebSocket Message Format
+- Reset (select task): `{"type": "reset", "data": {"task_id": "task_003", "seed": 42}}`
+- Reset (default): `{"type": "reset"}`
+- Step: `{"type": "step", "data": {"action_type": "inspect_gradients"}}`
+- Response: `{"type": "observation", "data": {"observation": {...}, "reward": float, "done": bool}}`
 ## Key Design Decisions
+- **Grader ≠ Reward**: graders.py (holistic 0.0-1.0) vs reward_engine.py (per-step float)
+- **Task IDs are opaque**: task_001-task_007
+- **Task 6 diagnosis is ALWAYS `code_bug`** regardless of variant
+- **Context-gated penalty**: -0.20 fires ONLY when `gradients_inspected=True AND gradients_were_normal=True`
 - **Step penalty is flat -0.01** (never multiplied by step_count)
+- **Difficulty scaling**: 1-5 via `difficulty_level` parameter in reset()
+- **Confusion matrix** included in data batch stats
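
The WebSocket message shapes listed in the updated overview can be built and parsed with plain `json`. This is a minimal sketch, assuming only the field names shown in the notes; the helper names `reset_message`, `step_message`, and `parse_observation` are hypothetical and are not part of openenv-core or the project's client.

```python
import json


def reset_message(task_id=None, seed=None):
    """Build a reset frame; with no arguments it is the default reset."""
    message = {"type": "reset"}
    data = {}
    if task_id is not None:
        data["task_id"] = task_id
    if seed is not None:
        data["seed"] = seed
    if data:
        # Task selection travels in the "data" field, per the notes above.
        message["data"] = data
    return json.dumps(message)


def step_message(action_type):
    """Build a step frame; the action goes under "data", not "action"."""
    return json.dumps({"type": "step", "data": {"action_type": action_type}})


def parse_observation(raw):
    """Split an observation frame into (observation, reward, done)."""
    message = json.loads(raw)
    if message.get("type") != "observation":
        raise ValueError(f"unexpected message type: {message.get('type')!r}")
    data = message["data"]
    return data["observation"], float(data["reward"]), bool(data["done"])
```

For example, `reset_message("task_003", 42)` serializes the same structure as the "Reset (select task)" line above, and `reset_message()` produces the default reset.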

.claude/memory/project_status.md CHANGED

@@ -1,39 +1,61 @@
 ---
-name: Project Status as of 2026-03-28
-description: Current build/test/deployment status, what's working, what's pending, and known issues.
 type: project
 ---
 ## Status: Code Complete, Deployment Pending
-**Last verified**: 2026-03-28
-### Passing
-- 183/183 tests pass (5.84s)
-- 97% coverage on `ml_training_debugger/` package
-- `openenv validate` → `[OK] ML Debugger: Ready for multi-mode deployment`
-- Baseline bit-exact reproducible across runs
-- All 10 endpoints verified (health, tasks, grader, baseline, dashboard, validation-report, schema, state, docs, ws)
-- Docker builds and serves correctly on port 7860
-- Zero numpy in core, `import torch` in every core module
-- Typed Pydantic models everywhere
-- Context-gated penalty fires correctly (both paths tested)
-### Docker Image
-- Size: **1.48GB** (down from 1.96GB via single-layer cleanup)
-- `libtorch_cpu.so` is 426MB — the irreducible PyTorch CPU minimum
-- Spec target was <500MB (aspirational for PyTorch-native env)
-- **Cannot remove**: torch/testing, torch/distributed, torch/cuda (all required at import time)
-- **Safe to remove**: torch/test, torch/include, torch/share, torch/utils/benchmark, torch/utils/bottleneck, torch/utils/tensorboard, torch/lib/*.a, test .so files, caffe2, .pyi files
 ### Pending
 - [ ] Push to **public GitHub repo**
-- [ ] Deploy to **HF Spaces** (Docker type, tag with `openenv`)
-- [ ] Submit HF Space URL + GitHub repo URL
 ### Known Limitations
-- WS reset defaults to task_001 (framework limitation — no extra fields accepted)
-- HTTP `/step` has session isolation issues (framework creates new env instances per request)
-- `replace_optimizer` and `rollback_checkpoint` are no-op actions (acceptable)
-- Heuristic only handles 2/4 code bug variants (eval_mode, detach_loss)
-- Validation report at `/validation-report` is hardcoded, not computed from real runs

 ---
+name: Project Status as of 2026-03-30
+description: Current build/test/deployment status, verified metrics, known limitations, and remaining work.
 type: project
 ---
 ## Status: Code Complete, Deployment Pending
+**Last verified**: 2026-03-30
+### Verified Metrics
+- **251 tests pass** (60s runtime due to real training)
+- **95% coverage** on ml_training_debugger/ + server/
+- **openenv validate** → `[OK] ML Debugger: Ready for multi-mode deployment`
+- **Baseline bit-exact reproducible** across runs
+- **Docker image: 885MB** (down from 1.96GB — 55% reduction)
+- **Docker uses torch 2.5.1+cpu** (multi-stage build, strip --strip-unneeded)
+- **8/8 validation checks pass** (real training curves)
+- **All endpoints work** (health, tasks, grader, baseline, dashboard, validation-report, curriculum, leaderboard, replay, schema, ws)
+- **All 7 tasks selectable via WS**: `{"type": "reset", "data": {"task_id": "task_007"}}`
+### Baseline Scores (Heuristic)
+```
+task_001: 1.0, task_002: 1.0, task_003: 1.0, task_004: 0.45,
+task_005: 1.0, task_006: 1.0, task_007: 1.0
+```
+### LLM Baseline Scores (Measured)
+- **Llama 3.3 70B** (Groq): 1.0, 1.0, 0.4, 0.45, 1.0, —, — (5/7 before rate limit)
+- **Llama 3.1 8B** (Cerebras): 0.6, 0.05, 0.4, 0.6, 1.0, 0.6, 0.6 (avg 0.55)
+- **Llama 3.1 8B** (Groq): 0.6, 0.05, 0.4, 0.6, 1.0, 1.0, 0.6 (avg 0.61)
+### Features Implemented
+- 7 tasks with 3 difficulty tiers + difficulty scaling (1-5)
+- Dual architecture: SimpleCNN + SimpleMLP
+- Real 20-epoch PyTorch mini-training (cached per task/seed)
+- Context-gated reward penalty
+- Code-level debugging (Task 6, 4 bug variants, AST validation)
+- Task 7: LR Scheduler misconfigured
+- Confusion matrix in data batch stats
+- Curriculum, leaderboard, replay endpoints
+- PAPER.md research summary
+- EXPLANATION.md simple explanation
+- Multi-provider LLM baseline (Groq, Cerebras, Gemini, OpenAI)
+- Exploit resistance test (20-seed variance)
+- deploy-hf.sh deployment script
 ### Pending
 - [ ] Push to **public GitHub repo**
+- [ ] Deploy to **HF Spaces** (Docker type, tag `openenv`)
+- [ ] Run 70B baseline for tasks 6-7 (Groq quota resets daily)
+- [ ] Record dashboard GIF for README
+### Docker Size History
+1.96GB → 1.48GB → 1.09GB → **885MB** (irreducible: libtorch_cpu.so=329MB stripped)
 ### Known Limitations
+- Docker 885MB (target was 500MB — libtorch_cpu.so is irreducible)
+- HTTP /reset and /step are stateless (framework design — WS is primary interface)
+- Heuristic outperforms LLMs on most tasks (environment rewards domain knowledge)
+- `replace_optimizer` and `rollback_checkpoint` are no-op actions

CLAUDE.md CHANGED

@@ -29,7 +29,7 @@ Every computation in core modules uses `torch.Tensor`, not `numpy.ndarray`. `imp
 These are separate modules with separate purposes. The **reward function** (`reward_engine.py`) returns a float per step for RL training signal. The **grader** (`graders.py`) returns a normalized 0.0-1.0 score at episode end for the `/grader` endpoint and auto-validation. The grader evaluates `EpisodeState` holistically — it is **not** a sum of step rewards. Never conflate them.
 ### Opaque Task IDs
-Task IDs are `task_001` through `task_006`. The agent must never be able to infer the diagnosis from the task ID. Do not use descriptive names anywhere the agent can observe them.
 ---
@@ -102,7 +102,7 @@ Test with intentionally messy fixes: `"  loss = criterion(output, batch_y)  # fi
 The `gradients_were_normal` flag is set **inside** the `inspect_gradients` handler, based on whether `is_exploding` is False on **all** layers. The threshold for `is_exploding` is `mean_norm > 10.0`. The threshold for `is_vanishing` is `mean_norm < 1e-6`. In Task 5, the FC spike has `is_exploding: False` (it spiked but the mean norm stays below 10.0), so `gradients_were_normal` is set to True. This is the gate that makes the penalty fire when the agent then calls `add_callback`.
 ### Docker Image Size
-Target: <500MB. PyTorch CPU-only wheel is ~150MB. Use `python:3.12-slim` base. Install torch with `--index-url https://download.pytorch.org/whl/cpu`. Do NOT install CUDA. Pre-compute validation reports locally — do not run real training in Docker build.
 ### Baseline Reproducibility
 The rule-based baseline must produce **bit-exact identical** scores on two consecutive runs. This requires:
@@ -112,7 +112,7 @@ The rule-based baseline must produce **bit-exact identical** scores on two conse
 ### Auto-Validator Endpoints
 These endpoints are checked programmatically. They must respond correctly or you are disqualified:
-- `GET /health` -> `{"status": "ready", "tasks": N}` (200) — N is the number of active tasks (3 for MVP, 6 for full)
 - `GET /tasks` -> list of tasks with IDs and action schema (200)
 - `POST /grader` -> `{"score": float}` after a completed episode (200)
 - `POST /baseline` -> scores for all tasks (200)

 These are separate modules with separate purposes. The **reward function** (`reward_engine.py`) returns a float per step for RL training signal. The **grader** (`graders.py`) returns a normalized 0.0-1.0 score at episode end for the `/grader` endpoint and auto-validation. The grader evaluates `EpisodeState` holistically — it is **not** a sum of step rewards. Never conflate them.
 ### Opaque Task IDs
+Task IDs are `task_001` through `task_007`. The agent must never be able to infer the diagnosis from the task ID. Do not use descriptive names anywhere the agent can observe them.
 ---
 The `gradients_were_normal` flag is set **inside** the `inspect_gradients` handler, based on whether `is_exploding` is False on **all** layers. The threshold for `is_exploding` is `mean_norm > 10.0`. The threshold for `is_vanishing` is `mean_norm < 1e-6`. In Task 5, the FC spike has `is_exploding: False` (it spiked but the mean norm stays below 10.0), so `gradients_were_normal` is set to True. This is the gate that makes the penalty fire when the agent then calls `add_callback`.
 ### Docker Image Size
+Current: 885MB. Uses torch 2.5.1+cpu with multi-stage build and `strip --strip-unneeded`. The irreducible minimum is `libtorch_cpu.so` (329MB stripped). Use `python:3.12-slim` base. Do NOT install CUDA.
 ### Baseline Reproducibility
 The rule-based baseline must produce **bit-exact identical** scores on two consecutive runs. This requires:
 ### Auto-Validator Endpoints
 These endpoints are checked programmatically. They must respond correctly or you are disqualified:
+- `GET /health` -> `{"status": "ready", "tasks": N}` (200) — N is the number of active tasks (7 for full)
 - `GET /tasks` -> list of tasks with IDs and action schema (200)
 - `POST /grader` -> `{"score": float}` after a completed episode (200)
 - `POST /baseline` -> scores for all tasks (200)
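
The `gradients_were_normal` gate described in the CLAUDE.md hunk above can be sketched as follows. This is an illustration under stated assumptions, not the code in `server/environment.py` or `reward_engine.py`: the thresholds (`mean_norm > 10.0`, `mean_norm < 1e-6`) come from the text, the -0.20 penalty comes from the project overview, and `LayerGradStats`, `gradients_were_normal`, and `context_gated_penalty` are hypothetical names.

```python
from dataclasses import dataclass

EXPLODING_THRESHOLD = 10.0    # is_exploding when mean_norm > 10.0
VANISHING_THRESHOLD = 1e-6    # is_vanishing when mean_norm < 1e-6
CONTEXT_GATED_PENALTY = -0.20  # value quoted in the project overview


@dataclass
class LayerGradStats:
    """Per-layer gradient summary (hypothetical container)."""
    name: str
    mean_norm: float

    @property
    def is_exploding(self) -> bool:
        return self.mean_norm > EXPLODING_THRESHOLD

    @property
    def is_vanishing(self) -> bool:
        return self.mean_norm < VANISHING_THRESHOLD


def gradients_were_normal(layers) -> bool:
    """True when is_exploding is False on all layers, as the text specifies."""
    return all(not layer.is_exploding for layer in layers)


def context_gated_penalty(action_type, gradients_inspected, were_normal) -> float:
    """Penalize add_callback only after the agent saw normal gradients."""
    if action_type == "add_callback" and gradients_inspected and were_normal:
        return CONTEXT_GATED_PENALTY
    return 0.0
```

In a Task 5 style scenario, a spiked FC layer whose mean norm stays below 10.0 still leaves `gradients_were_normal` true, so a later `add_callback` draws the penalty. An agent that never inspected gradients is not penalized for the same action.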

EXPLANATION.md CHANGED

@@ -333,8 +333,8 @@ Then they fix the right part and test-drive it to confirm.
 |----------|--------|
 | **What?** | A practice environment where AI agents learn to debug broken PyTorch training runs |
 | **Why?** | Real ML debugging costs companies millions. Training AI to do it has huge value. |
-| **How?** | 6 mystery cases with real PyTorch models, progressive clue reveal, and smart scoring |
-| **What's special?** | Real PyTorch internals, context-gated rewards, code-level debugging, red herrings |
 | **Who's it for?** | AI researchers building smarter debugging agents |
 | **Built with?** | Python, PyTorch, FastAPI, OpenEnv, Pydantic, Docker |
 | **For what event?** | Meta PyTorch OpenEnv Hackathon x Scaler School of Technology |

 |----------|--------|
 | **What?** | A practice environment where AI agents learn to debug broken PyTorch training runs |
 | **Why?** | Real ML debugging costs companies millions. Training AI to do it has huge value. |
+| **How?** | 7 mystery cases with real PyTorch training (CNN + MLP), progressive clue reveal, and smart scoring |
+| **What's special?** | Real 20-epoch training, dual architectures, context-gated rewards, code-level debugging, red herrings, difficulty scaling |
 | **Who's it for?** | AI researchers building smarter debugging agents |
 | **Built with?** | Python, PyTorch, FastAPI, OpenEnv, Pydantic, Docker |
 | **For what event?** | Meta PyTorch OpenEnv Hackathon x Scaler School of Technology |

PAPER.md CHANGED

@@ -30,19 +30,20 @@ This teaches agents a transferable skill: *don't ignore what you've already lear
 ## Results
-Baseline scores demonstrate meaningful difficulty progression:
-| Task | Heuristic | Description |
-|------|-----------|-------------|
-| task_001 | 1.00 | Exploding gradients — direct signal |
-| task_002 | 1.00 | Vanishing gradients — direct signal |
-| task_003 | 1.00 | Data leakage — class overlap detection |
-| task_004 | 1.00 | Overfitting — train-val divergence |
-| task_005 | 0.35 | BatchNorm eval mode — red herrings trap heuristic |
-| task_006 | 1.00 | Code bug — pattern matching catches 2/4 variants |
-| task_007 | 0.60 | Scheduler misconfigured — stagnation detection |
-The rule-based heuristic scores 0.35 on Task 5 because its fixed investigation order causes it to chase the gradient spike red herring before checking model modes. A reasoning agent that inspects model modes would avoid this trap.
 ## Conclusion

 ## Results
+A three-agent comparison shows that the environment differentiates between agent types:
+| Task | Heuristic | Llama 3.3 70B | Llama 3.1 8B |
+|------|-----------|---------------|--------------|
+| task_001 | **1.00** | 1.00 | 0.60 |
+| task_002 | **1.00** | 1.00 | 0.05 |
+| task_003 | **1.00** | 0.40 | 0.40 |
+| task_004 | 0.45 | 0.45 | **0.60** |
+| task_005 | **1.00** | 1.00 | 1.00 |
+| task_006 | **1.00** | — | 0.60 |
+| task_007 | **1.00** | — | 0.60 |
+| **Average** | **0.92** | 0.69* | 0.55 |
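+Key findings: (1) Model size matters: the 70B average (0.69) is about 25% higher than the 8B average (0.55). (2) The domain-specific heuristic (0.92) outperforms both general LLMs (0.55-0.69), which suggests the environment rewards systematic debugging. (3) Task 4 is the exception: the 8B model (0.60) beats the rigid heuristic (0.45) on subtle real training curves, while the 70B model ties it at 0.45. *The 70B average is provisional, because tasks 6-7 were not run before the provider rate limit was reached.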
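
The averages in the table can be rechecked from the per-task scores. The sketch below is a worked calculation only; `mean_measured` is a hypothetical helper, and unmeasured tasks are represented as `None` and skipped.

```python
# Per-task scores copied from the results table (task_001 .. task_007).
SCORES = {
    "heuristic": [1.00, 1.00, 1.00, 0.45, 1.00, 1.00, 1.00],
    "llama_3.3_70b": [1.00, 1.00, 0.40, 0.45, 1.00, None, None],
    "llama_3.1_8b": [0.60, 0.05, 0.40, 0.60, 1.00, 0.60, 0.60],
}


def mean_measured(scores):
    """Average over the tasks that were actually run (None = not measured)."""
    measured = [s for s in scores if s is not None]
    return sum(measured) / len(measured)


for agent, scores in SCORES.items():
    measured_count = sum(s is not None for s in scores)
    print(f"{agent}: {mean_measured(scores):.2f} over {measured_count} tasks")
```

The heuristic and 8B rows reproduce the table's 0.92 and 0.55. For the 70B model, the mean over its five measured tasks is 0.77, which differs from the starred 0.69; the table does not state how that figure treats the two unmeasured tasks.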
 ## Conclusion

README.md CHANGED

@@ -15,9 +15,11 @@ This environment recreates the experience of an ML engineer facing a broken PyTo
 ### Key Differentiators
-- **PyTorch-native internals** — Real `torch.nn.Module` models (~50K params), real `torch.autograd` gradients, real `state_dict()` weight snapshots
 - **Context-gated reward shaping** — Penalty fires only when agent ignores evidence it already gathered; no penalty for reasonable priors
-- **Progressive information reveal** — Gradient stats, weight stats, data batch stats only populated after corresponding inspection actions
 ## Environment Design
@@ -81,6 +83,9 @@ Dynamic availability: `restart_run` requires a fix first; `fix_code` requires co
 | `task_004` | Medium | `overfitting` | Train-val divergence — loss approaches 0 while val loss climbs |
 | `task_005` | Hard | `batchnorm_eval_mode` | Model in eval mode with compound red herrings (FC gradient spike, GPU 91%, near-vanishing conv1) |
 | `task_006` | Hard | `code_bug` | PyTorch code bug — agent must read and fix actual Python code (4 bug variants) |
 ## Baseline Scores
@@ -206,22 +211,31 @@ Returns: `{"type": "observation", "data": {"observation": {...}, "reward": float
 ## Validation Suite
-A PyTorch validation suite proves simulation fidelity by comparing parametric curve generation against real training runs. Pre-computed fidelity reports are served at `GET /validation-report`.
-**Methodology:** Real `torch.nn.Module` models are trained with each fault type, and the resulting loss/accuracy curves are compared against the parametric generators. All fault injection uses real `torch.autograd` gradients and `model.state_dict()` weights — not synthetic formulas.
-**Coverage:** Exploding gradients, vanishing gradients, data leakage, overfitting, BatchNorm eval mode, and all 4 code bug variants.
 ## Architecture
-- **Python 3.12** · PyTorch CPU-only · openenv-core
-- Real `torch.nn.Module` models with real `torch.autograd` gradients
-- Parametric curve generation for loss/accuracy histories (sub-ms latency)
 - Typed Pydantic models everywhere — no `Dict[str, Any]`
 - `import torch` in every core module — zero numpy in core
 - Session isolation via per-session `EpisodeState`
 - Deterministic reproducibility via `torch.manual_seed()`
 ### Docker Image Size
-The Docker image is ~1.5GB. This is driven by `libtorch_cpu.so` (426MB) — the core PyTorch CPU binary required for real `torch.nn.Module`, `torch.autograd`, and `model.state_dict()` support. This is the intentional trade-off: real PyTorch gradient computation and weight inspection (not synthetic data) requires the full CPU runtime. Non-essential torch components (test suites, benchmark tools, CUDA stubs, type stubs) are stripped in the Dockerfile.

 ### Key Differentiators
+- **Real PyTorch mini-training** — 20 real forward+backward epochs per reset, cached for instant replay. Loss/accuracy curves come from real training, not parametric formulas.
+- **Dual model architectures** — SimpleCNN (~50K params) and SimpleMLP (~20K params) randomly selected per episode
 - **Context-gated reward shaping** — Penalty fires only when agent ignores evidence it already gathered; no penalty for reasonable priors
+- **Progressive information reveal** — Gradient stats, weight stats, data batch stats, confusion matrices only populated after corresponding inspection actions
+- **7 tasks with difficulty scaling** — Easy to hard, with configurable difficulty level (1-5) per task
 ## Environment Design
 | `task_004` | Medium | `overfitting` | Train-val divergence — loss approaches 0 while val loss climbs |
 | `task_005` | Hard | `batchnorm_eval_mode` | Model in eval mode with compound red herrings (FC gradient spike, GPU 91%, near-vanishing conv1) |
 | `task_006` | Hard | `code_bug` | PyTorch code bug — agent must read and fix actual Python code (4 bug variants) |
+| `task_007` | Med-Hard | `scheduler_misconfigured` | LR scheduler with wrong gamma/step_size — training stagnates after initial progress |
+All tasks support `difficulty_level` (1-5) via reset: `{"type": "reset", "data": {"task_id": "task_005", "difficulty_level": 4}}`
 ## Baseline Scores
 ## Validation Suite
+8/8 validation checks pass — served live at `GET /validation-report`:
+**Methodology:** Real PyTorch 20-epoch mini-training with fault injection. Each fault type is validated with behavioral checks (gradient detection, loss patterns, model mode, code fix acceptance). Both SimpleCNN and SimpleMLP architectures verified.
+**Coverage:** Exploding gradients, vanishing gradients, data leakage, overfitting, BatchNorm eval mode, code bugs (4 variants), scheduler misconfigured, dual architecture.
 ## Architecture
+- **Python 3.12** · PyTorch 2.5.1 CPU-only · openenv-core v0.2.2
+- **Dual model architectures**: SimpleCNN (~50K params) + SimpleMLP (~20K params)
+- **Real 20-epoch mini-training** per reset (cached per task/seed for instant replay)
 - Typed Pydantic models everywhere — no `Dict[str, Any]`
 - `import torch` in every core module — zero numpy in core
 - Session isolation via per-session `EpisodeState`
 - Deterministic reproducibility via `torch.manual_seed()`
+- **251 tests, 95% coverage**
 ### Docker Image Size
+The Docker image is **885MB** (optimized from 1.96GB via multi-stage build, torch 2.5.1, `strip --strip-unneeded`, and removal of unused transitive dependencies). The core `libtorch_cpu.so` (329MB stripped) is the irreducible minimum for real `torch.nn.Module`, `torch.autograd`, and `model.state_dict()` support — the intentional trade-off for authentic PyTorch computation vs synthetic data.
+### Research Paper
+See [PAPER.md](PAPER.md) — "Context-Gated Reward Shaping for Evidence-Based ML Debugging"
+### Project Explanation
+See [EXPLANATION.md](EXPLANATION.md) — full project explanation in simple language