omkarrr88 committed
Commit 4414fa9 · 1 parent: ec9ad2a
fix: clean repo for hackathon submission
- Move 6 planning docs (PRD, PAPER, ROADMAP, etc.) to docs/
- Remove CLAUDE.md from git tracking (AI context file)
- Remove .hf-space from git tracking (deployment staging)
- Pin all dependency versions in requirements.txt
- Change task_007 difficulty from "medium-hard" to "hard"
- Add HF Space live demo links to README header
- .gitignore +2 -0
- .hf-space +0 -1
- CLAUDE.md +0 -186
- README.md +2 -0
- EXPLANATION.md → docs/EXPLANATION.md +0 -0
- PAPER.md → docs/PAPER.md +0 -0
- PRD.md → docs/PRD.md +0 -0
- PROJECT_GUIDE.md → docs/PROJECT_GUIDE.md +0 -0
- ROADMAP.md → docs/ROADMAP.md +0 -0
- ml-training-debugger-spec.md → docs/ml-training-debugger-spec.md +0 -0
- openenv.yaml +1 -1
- requirements.txt +6 -6
- server/app.py +1 -1
.gitignore
CHANGED
@@ -14,3 +14,5 @@ validation/reports/*.png
 .ruff_cache/
 .coverage
 .claude/
+CLAUDE.md
+.hf-space/
.hf-space
DELETED
@@ -1 +0,0 @@
-Subproject commit 76adf683962c647563fb1410fbba821bf1a59972
CLAUDE.md
DELETED
|
@@ -1,186 +0,0 @@
|
|
| 1 |
-
# CLAUDE.md — PyTorch Training Run Debugger
|
| 2 |
-
|
| 3 |
-
OpenEnv RL environment for the Meta PyTorch OpenEnv Hackathon x Scaler School of Technology.
|
| 4 |
-
An AI agent debugs broken PyTorch training runs by investigating gradients, weights, data, model modes, and source code to diagnose and fix real ML failure patterns.
|
| 5 |
-
|
| 6 |
-
**Spec:** `ml-training-debugger-spec.md` is the single source of truth. If this file and the spec conflict, the spec wins.
|
| 7 |
-
|
| 8 |
-
**Runtime:** Python 3.12 · PyTorch CPU-only · openenv-core v0.2.2
|
| 9 |
-
|
| 10 |
-
---
|
| 11 |
-
|
| 12 |
-
## Non-Negotiable Rules
|
| 13 |
-
|
| 14 |
-
### MVP-First Execution
|
| 15 |
-
Ship Tasks 1, 3, 5 (easy/medium/hard) + rule-based baseline + Docker + HF deploy **before** touching anything else. A deployed MVP that passes auto-validation beats a half-finished 6-task environment. Priority order after MVP: Task 6 > Tasks 2 & 4 > dashboard > validation suite > LLM baseline.
|
| 16 |
-
|
| 17 |
-
### Context-Gated Penalty Must Be Exact
|
| 18 |
-
The -0.20 penalty for `add_callback` fires **only when both** `gradients_inspected == True` AND `gradients_were_normal == True`. It must **never** fire before `inspect_gradients` has been called. This is the project's primary innovation. Get the gate conditions wrong and the differentiator is broken. Test both paths:
|
| 19 |
-
- `add_callback` at step 1 (no prior inspection) -> **no penalty**
|
| 20 |
-
- `inspect_gradients` (normal) then `add_callback` -> **-0.20 penalty**
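The two paths above can be sketched as a small gate function. This is a minimal illustration, not the project's `reward_engine.py`: the `EpisodeFlags` class and the `add_callback_penalty` name are hypothetical, and only the two flag names and the -0.20 value come from the rule text.

```python
from dataclasses import dataclass

CONTEXT_GATED_PENALTY = -0.20  # value quoted in the rule above


@dataclass
class EpisodeFlags:
    """Hypothetical subset of episode state; field names follow the rule text."""

    gradients_inspected: bool = False
    gradients_were_normal: bool = False


def add_callback_penalty(flags: EpisodeFlags) -> float:
    """Return the penalty for `add_callback`, or 0.0 while the gate is closed."""
    if flags.gradients_inspected and flags.gradients_were_normal:
        return CONTEXT_GATED_PENALTY
    return 0.0
```

Calling `add_callback_penalty(EpisodeFlags())` models the "step 1, no prior inspection" path, and setting both flags to `True` models the penalized path.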
|
| 21 |
-
|
| 22 |
-
### Task 6 Diagnosis Is Always `code_bug`
|
| 23 |
-
Regardless of the specific bug variant (`eval_mode`, `detach_loss`, `zero_grad_missing`, `inplace_relu`), Task 6's correct diagnosis is **always** `code_bug`. Submitting `batchnorm_eval_mode` on Task 6's `eval_mode` variant is a wrong diagnosis (-0.30). The grader enforces this with a strict equality check.
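A minimal sketch of the strict equality check, assuming the diagnosis and the true root cause are plain strings. The function name `diagnosis_reward` is hypothetical; the +0.50 and -0.30 values are the ones listed under Reward Constants.

```python
# Bug variants named in the rule above; every one maps to the same root cause.
TASK_6_VARIANTS = ("eval_mode", "detach_loss", "zero_grad_missing", "inplace_relu")
TASK_6_ROOT_CAUSE = "code_bug"


def diagnosis_reward(diagnosis: str, true_root_cause: str) -> float:
    """Strict string equality: a near-miss label still counts as wrong."""
    return 0.50 if diagnosis == true_root_cause else -0.30
```

Under this sketch, `diagnosis_reward("batchnorm_eval_mode", TASK_6_ROOT_CAUSE)` is scored as a wrong diagnosis even when the underlying variant is `eval_mode`.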
|
| 24 |
-
|
| 25 |
-
### PyTorch-Native Only — No NumPy
|
| 26 |
-
Every computation in core modules uses `torch.Tensor`, not `numpy.ndarray`. `import torch` must appear in `models.py`, `simulation.py`, `pytorch_engine.py`, `reward_engine.py`, and `graders.py`. This is a Meta PyTorch hackathon — judges will notice. The only exception is test utilities and the validation suite where `scipy`/`matplotlib` are acceptable.
|
| 27 |
-
|
| 28 |
-
### Grader != Reward Function
|
| 29 |
-
These are separate modules with separate purposes. The **reward function** (`reward_engine.py`) returns a float per step for RL training signal. The **grader** (`graders.py`) returns a normalized 0.0-1.0 score at episode end for the `/grader` endpoint and auto-validation. The grader evaluates `EpisodeState` holistically — it is **not** a sum of step rewards. Never conflate them.
|
| 30 |
-
|
| 31 |
-
### Opaque Task IDs
|
| 32 |
-
Task IDs are `task_001` through `task_007`. The agent must never be able to infer the diagnosis from the task ID. Do not use descriptive names anywhere the agent can observe them.
|
| 33 |
-
|
| 34 |
-
---
|
| 35 |
-
|
| 36 |
-
## Architecture Constraints
|
| 37 |
-
|
| 38 |
-
### Framework Integration (Verified)
|
| 39 |
-
```
|
| 40 |
-
openenv-core v0.2.2 → create_app() → returns standard FastAPI instance
|
| 41 |
-
```
|
| 42 |
-
|
| 43 |
-
- `MLTrainingAction` extends `Action` from `openenv.core.env_server.types`
|
| 44 |
-
- `MLTrainingObservation` extends `Observation` from `openenv.core.env_server.types` (has built-in `done`, `reward`, `metadata`)
|
| 45 |
-
- `MLTrainingEnvironment` extends `Environment` from `openenv.core.env_server.interfaces` (must implement `reset()`, `step()`, `state` property)
|
| 46 |
-
- `MLTrainingEnvClient` in `client.py` extends `EnvClient` with typed `action_type` and `observation_type` — used by baseline scripts
|
| 47 |
-
- `create_app()` takes the **class** (factory), not an instance
|
| 48 |
-
- Custom routes (`/tasks`, `/grader`, `/baseline`, `/health`) are added directly to the returned FastAPI app via `@app.get()`/`@app.post()` decorators
|
| 49 |
-
- Framework auto-provides: `POST /reset`, `POST /step`, `GET /state`, `WS /ws`, `GET /schema`, `GET /docs`, `/mcp`
|
| 50 |
-
|
| 51 |
-
### Key Constraints (see spec for full detail)
|
| 52 |
-
- **Real PyTorch models:** `pytorch_engine.py` instantiates `SimpleCNN` (~50K params) at every `reset()`, runs 1-2 real forward+backward passes. Gradient and weight stats come from real `torch.autograd` and `model.state_dict()`.
|
| 53 |
-
- **Typed Pydantic models everywhere:** No `Dict[str, Any]`. `available_actions` is dynamically computed from `EpisodeState`, never hardcoded.
|
| 54 |
-
- **Session isolation:** Each WebSocket client gets its own `EpisodeState` keyed by session ID. `SUPPORTS_CONCURRENT_SESSIONS = True`.
|
| 55 |
-
|
| 56 |
-
---
|
| 57 |
-
|
| 58 |
-
## Coding Standards
|
| 59 |
-
|
| 60 |
-
### Formatting & Linting
|
| 61 |
-
- **black** for formatting (line length 88)
|
| 62 |
-
- **ruff** for linting
|
| 63 |
-
- **isort** for import ordering (profile=black)
|
| 64 |
-
- Run all three before every commit
|
| 65 |
-
|
| 66 |
-
### Type Hints
|
| 67 |
-
Type annotations on **every** function signature and return type. No `Any` in public APIs. Use `Optional[X]` for nullable fields, `Literal[...]` for closed string unions, `list[X]` (lowercase) for Python 3.12+.
|
| 68 |
-
|
| 69 |
-
### Testing
|
| 70 |
-
- **pytest** for all tests
|
| 71 |
-
- Every module in `ml_training_debugger/` has a corresponding `tests/test_*.py`
|
| 72 |
-
- Minimum test coverage: 80%
|
| 73 |
-
- Critical tests that must exist:
|
| 74 |
-
- `test_reward_engine.py`: context-gated penalty fires/doesn't fire under correct conditions
|
| 75 |
-
- `test_graders.py`: each grader returns 0.0-1.0, correct diagnosis scores high, wrong diagnosis scores low
|
| 76 |
-
- `test_pytorch_engine.py`: model instantiation, fault injection, gradient/weight extraction produces real tensors
|
| 77 |
-
- `test_code_templates.py`: all 4 bug variants generate valid code, fix validation accepts correct fixes and rejects wrong ones (including whitespace/comment variations)
|
| 78 |
-
- `test_episode_lifecycle.py`: full episode flow reset->inspect->fix->restart->diagnose produces expected state transitions
|
| 79 |
-
|
| 80 |
-
### File Size Limits
|
| 81 |
-
- 400 lines typical, 800 max per file
|
| 82 |
-
- `models.py` may exceed 400 lines due to many Pydantic models — this is acceptable
|
| 83 |
-
- `pytorch_engine.py` must stay under 300 lines (isolate model definitions if needed)
|
| 84 |
-
|
| 85 |
-
### Error Handling
|
| 86 |
-
`step()` must **never** raise an unhandled exception. All invalid actions return a valid observation with `-0.05` penalty and an error note. All edge cases (step after done, step before reset, malformed JSON) return structured error responses.
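One way to honor this rule is a thin guard around the real action handler. The sketch below is an assumption about structure, not the project's code: `StepResult`, `safe_step`, and the `handler` callable are hypothetical names, and the 0.0 reward returned for an internal error is a placeholder because the text only specifies the -0.05 value for invalid actions.

```python
from dataclasses import dataclass
from typing import Callable, Optional

INVALID_ACTION_PENALTY = -0.05  # value quoted in the rule above


@dataclass
class StepResult:
    """Hypothetical stand-in for the typed observation returned by step()."""

    reward: float
    error: Optional[str] = None


def safe_step(
    action_type: str,
    available_actions: list[str],
    handler: Callable[[str], StepResult],
) -> StepResult:
    """Run `handler`, converting every failure into a structured result."""
    if action_type not in available_actions:
        return StepResult(
            reward=INVALID_ACTION_PENALTY,
            error=f"invalid action: {action_type}",
        )
    try:
        return handler(action_type)
    except Exception as exc:  # last-resort guard so nothing propagates to the caller
        # Assumption: internal errors carry no reward; the source does not say.
        return StepResult(reward=0.0, error=f"internal error: {exc}")
```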
|
| 87 |
-
|
| 88 |
-
---
|
| 89 |
-
|
| 90 |
-
## Key Risks to Watch
|
| 91 |
-
|
| 92 |
-
### Task 6 Code Fix Validation
|
| 93 |
-
LLM agents will submit fixes with trailing spaces, inline comments, or minor reformatting. Use the multi-strategy validation pipeline:
|
| 94 |
-
1. Normalize whitespace + strip comments
|
| 95 |
-
2. Token-stream comparison via `tokenize` module
|
| 96 |
-
3. 2-3 semantic equivalence patterns per bug variant
|
| 97 |
-
4. `ast.parse()` fallback to verify buggy pattern is absent
|
| 98 |
-
|
| 99 |
-
Test with intentionally messy fixes: `" loss = criterion(output, batch_y) # fixed "` must pass.
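Steps 1 and 2 of the pipeline can be sketched with the standard-library `tokenize` module. This is an illustrative sketch only: `significant_tokens` and `same_code` are hypothetical helper names, and the semantic-pattern and `ast.parse()` stages are not covered here.

```python
import io
import tokenize

# Token types that carry no meaning when comparing a single line of code.
_IGNORED = {
    tokenize.COMMENT,
    tokenize.NL,
    tokenize.NEWLINE,
    tokenize.INDENT,
    tokenize.DEDENT,
    tokenize.ENDMARKER,
}


def significant_tokens(line: str) -> list[str]:
    """Tokenize one line of Python, dropping comments and layout tokens."""
    source = line.strip() + "\n"
    tokens = tokenize.generate_tokens(io.StringIO(source).readline)
    return [tok.string for tok in tokens if tok.type not in _IGNORED]


def same_code(submitted: str, expected: str) -> bool:
    """Compare two lines by token stream; unparseable input never matches."""
    try:
        return significant_tokens(submitted) == significant_tokens(expected)
    except (tokenize.TokenError, SyntaxError):
        return False
```

With this sketch, a submission such as `"    loss = criterion(output, batch_y)  # fixed  "` compares equal to the clean line, while a line that still contains `.detach()` does not.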
|
| 100 |
-
|
| 101 |
-
### Red-Herring Penalty Gating
|
| 102 |
-
The `gradients_were_normal` flag is set **inside** the `inspect_gradients` handler, based on whether `is_exploding` is False on **all** layers. The threshold for `is_exploding` is `mean_norm > 10.0`. The threshold for `is_vanishing` is `mean_norm < 1e-6`. In Task 5, the FC spike has `is_exploding: False` (it spiked but the mean norm stays below 10.0), so `gradients_were_normal` is set to True. This is the gate that makes the penalty fire when the agent then calls `add_callback`.
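The flag logic can be sketched as follows. The thresholds are the ones stated above; the function names and the layer names in the usage note are hypothetical, and plain floats stand in for the per-layer `mean_norm` values that the real handler would derive from tensors.

```python
EXPLODING_THRESHOLD = 10.0  # mean_norm above this counts as exploding
VANISHING_THRESHOLD = 1e-6  # mean_norm below this counts as vanishing


def classify_layer(mean_norm: float) -> dict[str, bool]:
    """Per-layer flags as described in the paragraph above."""
    return {
        "is_exploding": mean_norm > EXPLODING_THRESHOLD,
        "is_vanishing": mean_norm < VANISHING_THRESHOLD,
    }


def gradients_were_normal(mean_norms: dict[str, float]) -> bool:
    """True only when no layer is flagged as exploding."""
    return all(
        not classify_layer(norm)["is_exploding"] for norm in mean_norms.values()
    )
```

For example, `gradients_were_normal({"conv1": 0.8, "fc": 9.5})` models the Task 5 case: the FC layer has spiked but stays below 10.0, so the flag is set and a later `add_callback` is penalized.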
|
| 103 |
-
|
| 104 |
-
### Docker Image Size
|
| 105 |
-
Current: 885MB. Uses torch 2.5.1+cpu with multi-stage build and `strip --strip-unneeded`. The irreducible minimum is `libtorch_cpu.so` (329MB stripped). Use `python:3.12-slim` base. Do NOT install CUDA.
|
| 106 |
-
|
| 107 |
-
### Baseline Reproducibility
|
| 108 |
-
The rule-based baseline must produce **bit-exact identical** scores on two consecutive runs. This requires:
|
| 109 |
-
- `torch.manual_seed(seed)` at every `reset()` with a deterministic seed per task
|
| 110 |
-
- No floating-point non-determinism in the parametric curve generators
|
| 111 |
-
- The heuristic decision tree is pure logic with no randomness
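The source does not show how the per-task seed is chosen. One possible approach, sketched below under that caveat, derives it from the task ID with `hashlib`, because Python's built-in `hash()` for strings varies between interpreter runs. `seed_for_task` is a hypothetical helper; its result would be passed to `torch.manual_seed(seed)` at `reset()`.

```python
import hashlib


def seed_for_task(task_id: str) -> int:
    """Derive a stable 32-bit seed from a task ID (hypothetical scheme)."""
    digest = hashlib.sha256(task_id.encode("utf-8")).digest()
    # The first four bytes give an integer in [0, 2**32).
    return int.from_bytes(digest[:4], "big")
```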
|
| 112 |
-
|
| 113 |
-
### Auto-Validator Endpoints
|
| 114 |
-
These endpoints are checked programmatically. They must respond correctly or you are disqualified:
|
| 115 |
-
- `GET /health` -> `{"status": "ready", "tasks": N}` (200) — N is the number of active tasks (7 for full)
|
| 116 |
-
- `GET /tasks` -> list of tasks with IDs and action schema (200)
|
| 117 |
-
- `POST /grader` -> `{"score": float}` after a completed episode (200)
|
| 118 |
-
- `POST /baseline` -> scores for all tasks (200)
|
| 119 |
-
- `WS /ws` -> responds to `reset` message
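A small check for the `/health` payload shape listed above, written as a pure function so it can run without a live server. `is_valid_health` is a hypothetical helper; how the payload is fetched (for example with `curl` or an HTTP client) is left out.

```python
import json


def is_valid_health(payload: object, expected_tasks: int) -> bool:
    """Check that a decoded /health body matches {"status": "ready", "tasks": N}."""
    if not isinstance(payload, dict):
        return False
    return payload.get("status") == "ready" and payload.get("tasks") == expected_tasks


# Example body in the documented shape, with N = 7 for the full task set.
sample = json.loads('{"status": "ready", "tasks": 7}')
health_ok = is_valid_health(sample, expected_tasks=7)
```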
|
| 120 |
-
|
| 121 |
-
---
|
| 122 |
-
|
| 123 |
-
## Reward Constants (Do Not Change)
|
| 124 |
-
|
| 125 |
-
See spec Section 12 for full rationale. Summary:
|
| 126 |
-
|
| 127 |
-
| Event | Value | Gate |
|
| 128 |
-
|---|---|---|
|
| 129 |
-
| Step penalty | -0.01 | Unconditional, flat (never multiply by step_count) |
|
| 130 |
-
| Investigation bonus | +0.05 | First-time only per inspection type |
|
| 131 |
-
| Context-gated penalty | -0.20 | `gradients_inspected AND gradients_were_normal` |
|
| 132 |
-
| Invalid action | -0.05 | Action not in `available_actions` |
|
| 133 |
-
| Wrong code fix | -0.10 | `fix_code` with wrong line/replacement |
|
| 134 |
-
| Correct diagnosis | +0.50 | `diagnosis == true_root_cause` |
|
| 135 |
-
| Wrong diagnosis | -0.30 | `diagnosis != true_root_cause` |
|
| 136 |
-
| Terminal convergence | +0.40 | `fix_action_taken AND restart_after_fix AND convergence` |
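The first two rows of the table can be combined as in the sketch below. It assumes the flat step penalty and the first-time investigation bonus are summed on the same step, which the table implies but does not state; `inspection_reward` and the `already_seen` set are hypothetical names.

```python
STEP_PENALTY = -0.01  # unconditional and flat, never scaled by step count
INVESTIGATION_BONUS = 0.05  # first time only, per inspection type


def inspection_reward(inspection_type: str, already_seen: set[str]) -> float:
    """Flat step penalty plus a one-time bonus for a new inspection type."""
    reward = STEP_PENALTY
    if inspection_type not in already_seen:
        already_seen.add(inspection_type)
        reward += INVESTIGATION_BONUS
    return reward
```

Repeating the same inspection type with the same `already_seen` set yields only the step penalty on the second call.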
|
| 137 |
-
|
| 138 |
-
---
|
| 139 |
-
|
| 140 |
-
## Success Criteria — "Perfect" Submission
|
| 141 |
-
|
| 142 |
-
All of these must be true:
|
| 143 |
-
- [ ] `openenv validate` passes
|
| 144 |
-
- [ ] `docker build && docker run` starts server on port 7860 in <60s
|
| 145 |
-
- [ ] HF Space deploys, responds to `reset()`, tagged with `openenv`
|
| 146 |
-
- [ ] `baseline_heuristic.py` produces identical scores on two runs
|
| 147 |
-
- [ ] 3+ tasks with graders returning scores in [0.0, 1.0] with meaningful variance
|
| 148 |
-
- [ ] Hard task (Task 5) genuinely challenges frontier models (heuristic 0.75, requires thorough investigation for full credit)
|
| 149 |
-
- [ ] Context-gated penalty fires correctly and does not fire prematurely
|
| 150 |
-
- [ ] All typed Pydantic models, no `Dict[str, Any]`
|
| 151 |
-
- [ ] `import torch` in every core module, zero numpy imports in core
|
| 152 |
-
- [ ] README documents: environment description, action/observation spaces, task descriptions with difficulty, setup instructions, baseline scores
|
| 153 |
-
- [ ] POST `/baseline`, POST `/grader`, GET `/tasks` all respond correctly
|
| 154 |
-
- [ ] Test suite passes with >80% coverage
|
| 155 |
-
|
| 156 |
-
---
|
| 157 |
-
|
| 158 |
-
## Commands
|
| 159 |
-
|
| 160 |
-
```bash
|
| 161 |
-
# Development (from project root: ML Debugger/)
|
| 162 |
-
source .venv/bin/activate
|
| 163 |
-
uvicorn server.app:app --reload --host 0.0.0.0 --port 7860
|
| 164 |
-
|
| 165 |
-
# Tests
|
| 166 |
-
pytest tests/ -v --cov=ml_training_debugger --cov-report=term-missing
|
| 167 |
-
|
| 168 |
-
# Formatting
|
| 169 |
-
black ml_training_debugger/ server/ tests/
|
| 170 |
-
ruff check ml_training_debugger/ server/ tests/ --fix
|
| 171 |
-
isort ml_training_debugger/ server/ tests/ --profile black
|
| 172 |
-
|
| 173 |
-
# Docker
|
| 174 |
-
docker build -t pytorch-debugger .
|
| 175 |
-
docker run -p 7860:7860 pytorch-debugger
|
| 176 |
-
|
| 177 |
-
# Smoke test
|
| 178 |
-
curl http://localhost:7860/health
|
| 179 |
-
curl http://localhost:7860/tasks
|
| 180 |
-
python baseline_heuristic.py > run1.json
|
| 181 |
-
python baseline_heuristic.py > run2.json
|
| 182 |
-
diff run1.json run2.json # Must be empty
|
| 183 |
-
|
| 184 |
-
# OpenEnv validation
|
| 185 |
-
openenv validate
|
| 186 |
-
```
|
README.md
CHANGED
@@ -2,6 +2,8 @@
 
 **OpenEnv RL Environment** | Meta PyTorch OpenEnv Hackathon x Scaler School of Technology
 
+**Live Demo:** [HF Space](https://ujjwalpardeshi-pytorch-training-debugger.hf.space/dashboard) | **API Health:** [/health](https://ujjwalpardeshi-pytorch-training-debugger.hf.space/health) | **API Docs:** [/docs](https://ujjwalpardeshi-pytorch-training-debugger.hf.space/docs)
+
 An AI agent debugs broken PyTorch training runs by investigating gradients, model weights, data pipelines, and source code to diagnose and fix real ML failure patterns.
 
 ---
EXPLANATION.md → docs/EXPLANATION.md
RENAMED
|
File without changes
|
PAPER.md → docs/PAPER.md
RENAMED
|
File without changes
|
PRD.md → docs/PRD.md
RENAMED
|
File without changes
|
PROJECT_GUIDE.md → docs/PROJECT_GUIDE.md
RENAMED
|
File without changes
|
ROADMAP.md → docs/ROADMAP.md
RENAMED
|
File without changes
|
ml-training-debugger-spec.md → docs/ml-training-debugger-spec.md
RENAMED
|
File without changes
|
openenv.yaml
CHANGED
@@ -72,7 +72,7 @@ tasks:
       bug_type: [eval_mode, detach_loss, zero_grad_missing, inplace_relu]
 
   - id: task_007
-    difficulty: medium-hard
+    difficulty: hard
     max_steps: 25
     param_ranges:
       scheduler_gamma: [0.01, 0.001, 0.0001]
requirements.txt
CHANGED
@@ -1,8 +1,8 @@
-openenv-core
-pydantic>=2.0
-fastapi
-uvicorn
-openai
-websockets
+openenv-core==0.2.2
+pydantic>=2.0,<3.0
+fastapi>=0.115.0,<1.0
+uvicorn>=0.30.0,<1.0
+openai>=1.0.0,<3.0
+websockets>=13.0,<17.0
 # torch is installed separately with CPU-only index:
 # pip install torch --index-url https://download.pytorch.org/whl/cpu
server/app.py
CHANGED
@@ -55,7 +55,7 @@ ALL_TASKS = [
     {"id": "task_004", "difficulty": "medium", "max_steps": 25},
     {"id": "task_005", "difficulty": "hard", "max_steps": 30},
     {"id": "task_006", "difficulty": "hard", "max_steps": 30},
-    {"id": "task_007", "difficulty": "medium-hard", "max_steps": 25},
+    {"id": "task_007", "difficulty": "hard", "max_steps": 25},
 ]
 
 # create_app takes the class (factory), not an instance