omkarrr88 committed
Commit 4414fa9 · 1 Parent(s): ec9ad2a

fix: clean repo for hackathon submission

- Move 6 planning docs (PRD, PAPER, ROADMAP, etc.) to docs/
- Remove CLAUDE.md from git tracking (AI context file)
- Remove .hf-space from git tracking (deployment staging)
- Pin openenv-core and bound all other dependency versions in requirements.txt
- Change task_007 difficulty from "medium-hard" to "hard"
- Add HF Space live demo links to README header

.gitignore CHANGED
@@ -14,3 +14,5 @@ validation/reports/*.png
  .ruff_cache/
  .coverage
  .claude/
+ CLAUDE.md
+ .hf-space/
.hf-space DELETED
@@ -1 +0,0 @@
- Subproject commit 76adf683962c647563fb1410fbba821bf1a59972
CLAUDE.md DELETED
@@ -1,186 +0,0 @@
- # CLAUDE.md — PyTorch Training Run Debugger
-
- OpenEnv RL environment for the Meta PyTorch OpenEnv Hackathon x Scaler School of Technology.
- An AI agent debugs broken PyTorch training runs by investigating gradients, weights, data, model modes, and source code to diagnose and fix real ML failure patterns.
-
- **Spec:** `ml-training-debugger-spec.md` is the single source of truth. If this file and the spec conflict, the spec wins.
-
- **Runtime:** Python 3.12 · PyTorch CPU-only · openenv-core v0.2.2
-
- ---
-
- ## Non-Negotiable Rules
-
- ### MVP-First Execution
- Ship Tasks 1, 3, 5 (easy/medium/hard) + rule-based baseline + Docker + HF deploy **before** touching anything else. A deployed MVP that passes auto-validation beats a half-finished 6-task environment. Priority order after MVP: Task 6 > Tasks 2 & 4 > dashboard > validation suite > LLM baseline.
-
- ### Context-Gated Penalty Must Be Exact
- The -0.20 penalty for `add_callback` fires **only when both** `gradients_inspected == True` AND `gradients_were_normal == True`. It must **never** fire before `inspect_gradients` has been called. This is the project's primary innovation. Get the gate conditions wrong and the differentiator is broken. Test both paths:
- - `add_callback` at step 1 (no prior inspection) -> **no penalty**
- - `inspect_gradients` (normal) then `add_callback` -> **-0.20 penalty**
-
- ### Task 6 Diagnosis Is Always `code_bug`
- Regardless of the specific bug variant (`eval_mode`, `detach_loss`, `zero_grad_missing`, `inplace_relu`), Task 6's correct diagnosis is **always** `code_bug`. Submitting `batchnorm_eval_mode` on Task 6's `eval_mode` variant is a wrong diagnosis (-0.30). The grader enforces this with a strict equality check.
-
- ### PyTorch-Native Only — No NumPy
- Every computation in core modules uses `torch.Tensor`, not `numpy.ndarray`. `import torch` must appear in `models.py`, `simulation.py`, `pytorch_engine.py`, `reward_engine.py`, and `graders.py`. This is a Meta PyTorch hackathon — judges will notice. The only exception is test utilities and the validation suite where `scipy`/`matplotlib` are acceptable.
-
- ### Grader != Reward Function
- These are separate modules with separate purposes. The **reward function** (`reward_engine.py`) returns a float per step for RL training signal. The **grader** (`graders.py`) returns a normalized 0.0-1.0 score at episode end for the `/grader` endpoint and auto-validation. The grader evaluates `EpisodeState` holistically — it is **not** a sum of step rewards. Never conflate them.
-
- ### Opaque Task IDs
- Task IDs are `task_001` through `task_007`. The agent must never be able to infer the diagnosis from the task ID. Do not use descriptive names anywhere the agent can observe them.
-
- ---
-
- ## Architecture Constraints
-
- ### Framework Integration (Verified)
- ```
- openenv-core v0.2.2 → create_app() → returns standard FastAPI instance
- ```
-
- - `MLTrainingAction` extends `Action` from `openenv.core.env_server.types`
- - `MLTrainingObservation` extends `Observation` from `openenv.core.env_server.types` (has built-in `done`, `reward`, `metadata`)
- - `MLTrainingEnvironment` extends `Environment` from `openenv.core.env_server.interfaces` (must implement `reset()`, `step()`, `state` property)
- - `MLTrainingEnvClient` in `client.py` extends `EnvClient` with typed `action_type` and `observation_type` — used by baseline scripts
- - `create_app()` takes the **class** (factory), not an instance
- - Custom routes (`/tasks`, `/grader`, `/baseline`, `/health`) are added directly to the returned FastAPI app via `@app.get()`/`@app.post()` decorators
- - Framework auto-provides: `POST /reset`, `POST /step`, `GET /state`, `WS /ws`, `GET /schema`, `GET /docs`, `/mcp`
-
- ### Key Constraints (see spec for full detail)
- - **Real PyTorch models:** `pytorch_engine.py` instantiates `SimpleCNN` (~50K params) at every `reset()`, runs 1-2 real forward+backward passes. Gradient and weight stats come from real `torch.autograd` and `model.state_dict()`.
- - **Typed Pydantic models everywhere:** No `Dict[str, Any]`. `available_actions` is dynamically computed from `EpisodeState`, never hardcoded.
- - **Session isolation:** Each WebSocket client gets its own `EpisodeState` keyed by session ID. `SUPPORTS_CONCURRENT_SESSIONS = True`.
-
- ---
-
- ## Coding Standards
-
- ### Formatting & Linting
- - **black** for formatting (line length 88)
- - **ruff** for linting
- - **isort** for import ordering (profile=black)
- - Run all three before every commit
-
- ### Type Hints
- Type annotations on **every** function signature and return type. No `Any` in public APIs. Use `Optional[X]` for nullable fields, `Literal[...]` for closed string unions, `list[X]` (lowercase) for Python 3.12+.
-
- ### Testing
- - **pytest** for all tests
- - Every module in `ml_training_debugger/` has a corresponding `tests/test_*.py`
- - Minimum test coverage: 80%
- - Critical tests that must exist:
- - `test_reward_engine.py`: context-gated penalty fires/doesn't fire under correct conditions
- - `test_graders.py`: each grader returns 0.0-1.0, correct diagnosis scores high, wrong diagnosis scores low
- - `test_pytorch_engine.py`: model instantiation, fault injection, gradient/weight extraction produces real tensors
- - `test_code_templates.py`: all 4 bug variants generate valid code, fix validation accepts correct fixes and rejects wrong ones (including whitespace/comment variations)
- - `test_episode_lifecycle.py`: full episode flow reset->inspect->fix->restart->diagnose produces expected state transitions
-
- ### File Size Limits
- - 400 lines typical, 800 max per file
- - `models.py` may exceed 400 lines due to many Pydantic models — this is acceptable
- - `pytorch_engine.py` must stay under 300 lines (isolate model definitions if needed)
-
- ### Error Handling
- `step()` must **never** raise an unhandled exception. All invalid actions return a valid observation with `-0.05` penalty and an error note. All edge cases (step after done, step before reset, malformed JSON) return structured error responses.
-
- ---
-
- ## Key Risks to Watch
-
- ### Task 6 Code Fix Validation
- LLM agents will submit fixes with trailing spaces, inline comments, or minor reformatting. Use the multi-strategy validation pipeline:
- 1. Normalize whitespace + strip comments
- 2. Token-stream comparison via `tokenize` module
- 3. 2-3 semantic equivalence patterns per bug variant
- 4. `ast.parse()` fallback to verify buggy pattern is absent
-
- Test with intentionally messy fixes: `" loss = criterion(output, batch_y) # fixed "` must pass.
-
- ### Red-Herring Penalty Gating
- The `gradients_were_normal` flag is set **inside** the `inspect_gradients` handler, based on whether `is_exploding` is False on **all** layers. The threshold for `is_exploding` is `mean_norm > 10.0`. The threshold for `is_vanishing` is `mean_norm < 1e-6`. In Task 5, the FC spike has `is_exploding: False` (it spiked but the mean norm stays below 10.0), so `gradients_were_normal` is set to True. This is the gate that makes the penalty fire when the agent then calls `add_callback`.
-
- ### Docker Image Size
- Current: 885MB. Uses torch 2.5.1+cpu with multi-stage build and `strip --strip-unneeded`. The irreducible minimum is `libtorch_cpu.so` (329MB stripped). Use `python:3.12-slim` base. Do NOT install CUDA.
-
- ### Baseline Reproducibility
- The rule-based baseline must produce **bit-exact identical** scores on two consecutive runs. This requires:
- - `torch.manual_seed(seed)` at every `reset()` with a deterministic seed per task
- - No floating-point non-determinism in the parametric curve generators
- - The heuristic decision tree is pure logic with no randomness
-
- ### Auto-Validator Endpoints
- These endpoints are checked programmatically. They must respond correctly or you are disqualified:
- - `GET /health` -> `{"status": "ready", "tasks": N}` (200) — N is the number of active tasks (7 for full)
- - `GET /tasks` -> list of tasks with IDs and action schema (200)
- - `POST /grader` -> `{"score": float}` after a completed episode (200)
- - `POST /baseline` -> scores for all tasks (200)
- - `WS /ws` -> responds to `reset` message
-
- ---
-
- ## Reward Constants (Do Not Change)
-
- See spec Section 12 for full rationale. Summary:
-
- | Event | Value | Gate |
- |---|---|---|
- | Step penalty | -0.01 | Unconditional, flat (never multiply by step_count) |
- | Investigation bonus | +0.05 | First-time only per inspection type |
- | Context-gated penalty | -0.20 | `gradients_inspected AND gradients_were_normal` |
- | Invalid action | -0.05 | Action not in `available_actions` |
- | Wrong code fix | -0.10 | `fix_code` with wrong line/replacement |
- | Correct diagnosis | +0.50 | `diagnosis == true_root_cause` |
- | Wrong diagnosis | -0.30 | `diagnosis != true_root_cause` |
- | Terminal convergence | +0.40 | `fix_action_taken AND restart_after_fix AND convergence` |
-
- ---
-
- ## Success Criteria — "Perfect" Submission
-
- All of these must be true:
- - [ ] `openenv validate` passes
- - [ ] `docker build && docker run` starts server on port 7860 in <60s
- - [ ] HF Space deploys, responds to `reset()`, tagged with `openenv`
- - [ ] `baseline_heuristic.py` produces identical scores on two runs
- - [ ] 3+ tasks with graders returning scores in [0.0, 1.0] with meaningful variance
- - [ ] Hard task (Task 5) genuinely challenges frontier models (heuristic 0.75, requires thorough investigation for full credit)
- - [ ] Context-gated penalty fires correctly and does not fire prematurely
- - [ ] All typed Pydantic models, no `Dict[str, Any]`
- - [ ] `import torch` in every core module, zero numpy imports in core
- - [ ] README documents: environment description, action/observation spaces, task descriptions with difficulty, setup instructions, baseline scores
- - [ ] POST `/baseline`, POST `/grader`, GET `/tasks` all respond correctly
- - [ ] Test suite passes with >80% coverage
-
- ---
-
- ## Commands
-
- ```bash
- # Development (from project root: ML Debugger/)
- source .venv/bin/activate
- uvicorn server.app:app --reload --host 0.0.0.0 --port 7860
-
- # Tests
- pytest tests/ -v --cov=ml_training_debugger --cov-report=term-missing
-
- # Formatting
- black ml_training_debugger/ server/ tests/
- ruff check ml_training_debugger/ server/ tests/ --fix
- isort ml_training_debugger/ server/ tests/ --profile black
-
- # Docker
- docker build -t pytorch-debugger .
- docker run -p 7860:7860 pytorch-debugger
-
- # Smoke test
- curl http://localhost:7860/health
- curl http://localhost:7860/tasks
- python baseline_heuristic.py > run1.json
- python baseline_heuristic.py > run2.json
- diff run1.json run2.json # Must be empty
-
- # OpenEnv validation
- openenv validate
- ```
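The context-gated penalty and the `gradients_were_normal` flag described in the removed CLAUDE.md can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the project's `reward_engine.py`: `LayerGradStats`, `EpisodeState` (as a plain dataclass), `record_gradient_inspection`, and `add_callback_penalty` are hypothetical names. Only the two thresholds (`mean_norm > 10.0`, `mean_norm < 1e-6`), the -0.20 value, and the two gate flags are taken from the text.

```python
from dataclasses import dataclass

# Constants quoted in the removed CLAUDE.md.
EXPLODING_THRESHOLD = 10.0
VANISHING_THRESHOLD = 1e-6
CONTEXT_GATED_PENALTY = -0.20


@dataclass
class LayerGradStats:
    """Hypothetical per-layer gradient summary; only mean_norm is modelled."""

    name: str
    mean_norm: float

    @property
    def is_exploding(self) -> bool:
        return self.mean_norm > EXPLODING_THRESHOLD

    @property
    def is_vanishing(self) -> bool:
        return self.mean_norm < VANISHING_THRESHOLD


@dataclass
class EpisodeState:
    """Hypothetical stand-in holding just the two gate flags."""

    gradients_inspected: bool = False
    gradients_were_normal: bool = False


def record_gradient_inspection(state: EpisodeState, layers: list[LayerGradStats]) -> None:
    """Set both flags the way the text describes the inspect_gradients handler."""
    state.gradients_inspected = True
    # "Normal" means is_exploding is False on all layers.
    state.gradients_were_normal = not any(layer.is_exploding for layer in layers)


def add_callback_penalty(state: EpisodeState) -> float:
    """Return -0.20 only when gradients were inspected and found normal."""
    if state.gradients_inspected and state.gradients_were_normal:
        return CONTEXT_GATED_PENALTY
    return 0.0


state = EpisodeState()
print(add_callback_penalty(state))  # no prior inspection -> 0.0

# A spike that stays below the 10.0 threshold still counts as normal.
record_gradient_inspection(
    state, [LayerGradStats("conv1", 0.8), LayerGradStats("fc", 7.5)]
)
print(add_callback_penalty(state))  # inspected and normal -> -0.2
```

The two printed cases correspond to the two paths the file asks tests to cover: `add_callback` with no prior inspection, and `add_callback` after a normal inspection.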
README.md CHANGED
@@ -2,6 +2,8 @@

  **OpenEnv RL Environment** | Meta PyTorch OpenEnv Hackathon x Scaler School of Technology

+ **Live Demo:** [HF Space](https://ujjwalpardeshi-pytorch-training-debugger.hf.space/dashboard) | **API Health:** [/health](https://ujjwalpardeshi-pytorch-training-debugger.hf.space/health) | **API Docs:** [/docs](https://ujjwalpardeshi-pytorch-training-debugger.hf.space/docs)
+
  An AI agent debugs broken PyTorch training runs by investigating gradients, model weights, data pipelines, and source code to diagnose and fix real ML failure patterns.

  ---
EXPLANATION.md → docs/EXPLANATION.md RENAMED
File without changes
PAPER.md → docs/PAPER.md RENAMED
File without changes
PRD.md → docs/PRD.md RENAMED
File without changes
PROJECT_GUIDE.md → docs/PROJECT_GUIDE.md RENAMED
File without changes
ROADMAP.md → docs/ROADMAP.md RENAMED
File without changes
ml-training-debugger-spec.md → docs/ml-training-debugger-spec.md RENAMED
File without changes
openenv.yaml CHANGED
@@ -72,7 +72,7 @@ tasks:
  bug_type: [eval_mode, detach_loss, zero_grad_missing, inplace_relu]

  - id: task_007
- difficulty: medium-hard
+ difficulty: hard
  max_steps: 25
  param_ranges:
  scheduler_gamma: [0.01, 0.001, 0.0001]
requirements.txt CHANGED
@@ -1,8 +1,8 @@
- openenv-core
- pydantic>=2.0
- fastapi
- uvicorn
- openai
- websockets
+ openenv-core==0.2.2
+ pydantic>=2.0,<3.0
+ fastapi>=0.115.0,<1.0
+ uvicorn>=0.30.0,<1.0
+ openai>=1.0.0,<3.0
+ websockets>=13.0,<17.0
  # torch is installed separately with CPU-only index:
  # pip install torch --index-url https://download.pytorch.org/whl/cpu
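Only `openenv-core` receives an exact `==` pin in this hunk; the other five entries are bounded ranges. A small standard-library sketch of how one might verify that distinction is shown here. The helper `classify_requirement` and its three labels are illustrative assumptions, not part of the repository, and the regex handles only the simple `name<specifiers>` form seen in this file, not the full requirement syntax.

```python
import re
from typing import Optional

# New contents of requirements.txt, copied from the hunk above.
REQUIREMENTS = """\
openenv-core==0.2.2
pydantic>=2.0,<3.0
fastapi>=0.115.0,<1.0
uvicorn>=0.30.0,<1.0
openai>=1.0.0,<3.0
websockets>=13.0,<17.0
# torch is installed separately with CPU-only index:
# pip install torch --index-url https://download.pytorch.org/whl/cpu
"""

# Package name followed by whatever specifier text remains on the line.
_LINE = re.compile(r"^([A-Za-z0-9._-]+)\s*(.*)$")


def classify_requirement(line: str) -> Optional[tuple[str, str]]:
    """Return (name, kind) where kind is 'exact', 'range' or 'unconstrained'.

    Comment-only and blank lines yield None.
    """
    line = line.split("#", 1)[0].strip()
    if not line:
        return None
    match = _LINE.match(line)
    if match is None:
        return None
    name, spec = match.group(1), match.group(2).strip()
    if not spec:
        return name, "unconstrained"
    clauses = [clause.strip() for clause in spec.split(",")]
    # A single '==' clause without a wildcard is an exact pin.
    if len(clauses) == 1 and clauses[0].startswith("==") and "*" not in clauses[0]:
        return name, "exact"
    return name, "range"


kinds = dict(
    result
    for result in map(classify_requirement, REQUIREMENTS.splitlines())
    if result is not None
)
for name, kind in kinds.items():
    print(f"{name}: {kind}")
```

Running the same helper on the old file contents would label five of the six entries `unconstrained`, which is the situation this hunk fixes.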
server/app.py CHANGED
@@ -55,7 +55,7 @@ ALL_TASKS = [
  {"id": "task_004", "difficulty": "medium", "max_steps": 25},
  {"id": "task_005", "difficulty": "hard", "max_steps": 30},
  {"id": "task_006", "difficulty": "hard", "max_steps": 30},
- {"id": "task_007", "difficulty": "medium-hard", "max_steps": 25},
+ {"id": "task_007", "difficulty": "hard", "max_steps": 25},
  ]

  # create_app takes the class (factory), not an instance
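The same label is edited in both `openenv.yaml` and `server/app.py`, so the two task lists are kept in sync by hand. A hedged sketch of a check that would flag a stray label such as `medium-hard` follows. The allowed set `{"easy", "medium", "hard"}` is an assumption inferred from the easy/medium/hard wording in the removed CLAUDE.md, `invalid_difficulties` is a hypothetical helper, and only the four entries visible in this hunk are reproduced.

```python
# Assumed closed set of labels; not confirmed by the repository.
ALLOWED_DIFFICULTIES = frozenset({"easy", "medium", "hard"})

# The four ALL_TASKS entries visible in the hunk, before and after the commit.
TASKS_BEFORE = [
    {"id": "task_004", "difficulty": "medium", "max_steps": 25},
    {"id": "task_005", "difficulty": "hard", "max_steps": 30},
    {"id": "task_006", "difficulty": "hard", "max_steps": 30},
    {"id": "task_007", "difficulty": "medium-hard", "max_steps": 25},
]
TASKS_AFTER = [
    {"id": "task_004", "difficulty": "medium", "max_steps": 25},
    {"id": "task_005", "difficulty": "hard", "max_steps": 30},
    {"id": "task_006", "difficulty": "hard", "max_steps": 30},
    {"id": "task_007", "difficulty": "hard", "max_steps": 25},
]


def invalid_difficulties(tasks: list[dict[str, object]]) -> list[str]:
    """Return the IDs of tasks whose difficulty is outside the allowed set."""
    return [
        str(task["id"])
        for task in tasks
        if task["difficulty"] not in ALLOWED_DIFFICULTIES
    ]


print(invalid_difficulties(TASKS_BEFORE))  # ['task_007']
print(invalid_difficulties(TASKS_AFTER))   # []
```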