Spaces: ujjwalpardeshi/pytorch-training-debugger

omkarrr88 committed on Apr 2

Commit 02e58fe (1 parent: a3e1032)

task 6 and 7 made hard

Files changed (5)

README.md +9 -7
ml_training_debugger/graders.py +45 -10
pyproject.toml +1 -1
server/app.py +3 -4
tests/test_graders.py +146 -2

README.md CHANGED

@@ -107,6 +107,7 @@ Fields like `gradient_stats`, `data_batch_stats`, `model_mode_info`, and `code_s
 **Terminal** — end the episode:
 - `restart_run` — restart training (only available after a fix)
 - `mark_diagnosed` — submit diagnosis from 7 possible root causes
 Actions are dynamically available based on episode state: `fix_code` requires prior code inspection, `restart_run` requires a fix, `mark_diagnosed` disappears after submission.
@@ -156,13 +157,14 @@ An agent that chases the gradient spike red herring loses 0.20 points. An agent
 | `task_003` | Medium | **1.00** | 0.40 |
 | `task_004` | Medium | **1.00** | 0.60 |
 | `task_005` | Hard | **0.80** | 0.38-0.55 |
-| `task_006` | Hard | **1.00** | 0.60-1.00 |
-| `task_007` | Hard | **1.00** | 0.60 |
-| **Average** | | **0.97** | 0.52 |
 **What this tells you:**
-- **Hard tasks are genuinely hard:** Task 5 requires thorough investigation (weight AND data inspection) for full credit. The heuristic scores 0.80 because it skips weight inspection. An LLM that falls for the gradient red herring scores 0.48 or lower.
 - **Red herring traps work:** Task 5 penalizes agents that call `add_callback` after seeing normal gradients (-0.20) or `modify_config` when LR isn't the issue (-0.10). LLMs routinely fall for both traps.
 - **8B struggles on multi-step tasks:** Task 2 score of 0.05 shows small models can't maintain investigation strategy across many steps.
 - **The heuristic baseline is strong** because it was designed with knowledge of the task structure. An agent that doesn't know the structure has to figure it out from observations alone.
@@ -247,7 +249,7 @@ pip install pytest pytest-cov pytest-asyncio httpx websockets
 # Start server
 uvicorn server.app:app --host 0.0.0.0 --port 7860
-# Run tests (255 tests, 97% coverage)
 pytest tests/ -v --cov=ml_training_debugger
 # Run heuristic baseline
@@ -284,7 +286,7 @@ ml_training_debugger/
     models.py            — Pydantic data models (Action, Observation, EpisodeState)
     scenarios.py         — Task parameter sampling (7 tasks, deterministic per seed)
     pytorch_engine.py    — Real PyTorch models, fault injection, gradient/weight extraction
-    simulation.py        — 20-epoch real training with parametric fallback
     reward_engine.py     — 7-component per-step reward with context gating
     graders.py           — Per-task holistic 0.0-1.0 scoring
     code_templates.py    — Task 6 bug variants + 4-strategy fix validation
@@ -295,7 +297,7 @@ server/
     app.py               — FastAPI + custom endpoints
     dashboard.html       — Live Plotly.js diagnostic dashboard
-tests/                   — 255 tests, 97% coverage
 baseline_heuristic.py    — Rule-based agent (deterministic, no API key)
 baseline_inference.py    — LLM agent (Groq/Cerebras/Gemini/OpenAI)
 ```

 **Terminal** — end the episode:
 - `restart_run` — restart training (only available after a fix)
+- `rollback_checkpoint` — rollback to pre-fix state (only available after restart)
 - `mark_diagnosed` — submit diagnosis from 7 possible root causes
 Actions are dynamically available based on episode state: `fix_code` requires prior code inspection, `restart_run` requires a fix, `mark_diagnosed` disappears after submission.
 | `task_003` | Medium | **1.00** | 0.40 |
 | `task_004` | Medium | **1.00** | 0.60 |
 | `task_005` | Hard | **0.80** | 0.38-0.55 |
+| `task_006` | Hard | **0.81** | 0.60-1.00 |
+| `task_007` | Hard | **0.79** | 0.60 |
+| **Average** | | **0.91** | 0.52 |
 **What this tells you:**
+- **Hard tasks are genuinely hard:** All three hard tasks (5, 6, 7) require thorough investigation including weight inspection for full credit. The heuristic scores 0.79-0.81 on hard tasks because it skips weight inspection. An LLM that falls for red herrings or skips investigation scores even lower.
 - **Red herring traps work:** Task 5 penalizes agents that call `add_callback` after seeing normal gradients (-0.20) or `modify_config` when LR isn't the issue (-0.10). LLMs routinely fall for both traps.
+- **Investigation thoroughness matters:** Tasks 6 and 7 scale fix/restart credit based on how thoroughly the agent investigated before acting. Quick fixes without ruling out alternatives score ~60-65% of full credit.
 - **8B struggles on multi-step tasks:** Task 2 score of 0.05 shows small models can't maintain investigation strategy across many steps.
 - **The heuristic baseline is strong** because it was designed with knowledge of the task structure. An agent that doesn't know the structure has to figure it out from observations alone.
 # Start server
 uvicorn server.app:app --host 0.0.0.0 --port 7860
+# Run tests (246 tests, 96% coverage)
 pytest tests/ -v --cov=ml_training_debugger
 # Run heuristic baseline
     models.py            — Pydantic data models (Action, Observation, EpisodeState)
     scenarios.py         — Task parameter sampling (7 tasks, deterministic per seed)
     pytorch_engine.py    — Real PyTorch models, fault injection, gradient/weight extraction
+    simulation.py        — 20-epoch real training with fault injection
     reward_engine.py     — 7-component per-step reward with context gating
     graders.py           — Per-task holistic 0.0-1.0 scoring
     code_templates.py    — Task 6 bug variants + 4-strategy fix validation
     app.py               — FastAPI + custom endpoints
     dashboard.html       — Live Plotly.js diagnostic dashboard
+tests/                   — 246 tests, 96% coverage
 baseline_heuristic.py    — Rule-based agent (deterministic, no API key)
 baseline_inference.py    — LLM agent (Groq/Cerebras/Gemini/OpenAI)
 ```
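
Note on the README hunk: the availability rules it states (`fix_code` requires prior code inspection, `restart_run` requires a fix, `rollback_checkpoint` only appears after a restart, `mark_diagnosed` disappears after submission) can be read as a small gating function. The sketch below is illustrative only. `SketchState` and `gated_actions` are hypothetical names, and using `restart_after_fix` as the "a restart has happened" flag is an assumption; the repository's actual gating code is not shown in this commit.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class SketchState:
    """Hypothetical stand-in for EpisodeState; only the flags used for gating."""

    code_inspected: bool = False
    fix_action_taken: bool = False
    restart_after_fix: bool = False  # assumed to mean "a restart has happened"
    diagnosis_submitted: bool = False


def gated_actions(state: SketchState) -> list[str]:
    """Return the gated actions currently available under the README rules."""
    actions: list[str] = []
    if state.code_inspected:
        actions.append("fix_code")  # requires prior code inspection
    if state.fix_action_taken:
        actions.append("restart_run")  # requires a fix
    if state.restart_after_fix:
        actions.append("rollback_checkpoint")  # only after a restart
    if not state.diagnosis_submitted:
        actions.append("mark_diagnosed")  # disappears after submission
    return actions
```

On a fresh episode only `mark_diagnosed` is offered; the other three actions unlock as the corresponding flags are set.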

ml_training_debugger/graders.py CHANGED

@@ -183,26 +183,35 @@ def grade_task_006(state: EpisodeState, scenario: ScenarioParams) -> float:
     Diagnosis must ALWAYS be 'code_bug' regardless of bug variant.
     Hard task rewards thorough investigation before fixing.
     """
     score = 0.0
-    # +0.05 for inspect_code
     if state.code_inspected:
         score += 0.05
-    # Thoroughness bonus: inspecting other systems first rules out non-code causes
     if state.gradients_inspected:
         score += 0.05
     if state.model_modes_inspected:
         score += 0.05
-    # Code fix credit (reduced base, bonus for thorough investigation)
     if _has_action(state, "fix_code") and state.fix_action_taken:
-        score += 0.20
-    # Restart credit
     if state.restart_after_fix:
-        score += 0.20
     # +0.45 for correct diagnosis (must be code_bug)
     if _correct_diagnosis(state, scenario):
@@ -212,20 +221,46 @@ def grade_task_006(state: EpisodeState, scenario: ScenarioParams) -> float:
 def grade_task_007(state: EpisodeState, scenario: ScenarioParams) -> float:
-    """Grade Task 7 — LR Scheduler Misconfigured (medium-hard). Spec extension."""
     score = 0.0
     if state.gradients_inspected:
         score += 0.05
     if state.data_inspected:
         score += 0.05
     if _has_action(state, "modify_config"):
-        score += 0.25
     if state.restart_after_fix:
-        score += 0.25
     if _correct_diagnosis(state, scenario):
         score += 0.40
     return min(1.0, max(0.0, score))

     Diagnosis must ALWAYS be 'code_bug' regardless of bug variant.
     Hard task rewards thorough investigation before fixing.
+    Full credit requires ruling out non-code causes via weight inspection.
     """
     score = 0.0
+    # Investigation credits (+0.05 each, up to +0.25 for all 5 types)
     if state.code_inspected:
         score += 0.05
     if state.gradients_inspected:
         score += 0.05
     if state.model_modes_inspected:
         score += 0.05
+    if state.model_weights_inspected:
+        score += 0.05
+    if state.data_inspected:
+        score += 0.05
+    # Code fix credit scaled by investigation thoroughness
     if _has_action(state, "fix_code") and state.fix_action_taken:
+        if state.model_weights_inspected:
+            score += 0.15  # Thorough: ruled out weight-related causes
+        else:
+            score += 0.08  # Quick fix without full investigation
+    # Restart credit scaled by thoroughness
     if state.restart_after_fix:
+        if state.model_weights_inspected:
+            score += 0.15  # Full restart credit
+        else:
+            score += 0.08  # Partial credit
     # +0.45 for correct diagnosis (must be code_bug)
     if _correct_diagnosis(state, scenario):
 def grade_task_007(state: EpisodeState, scenario: ScenarioParams) -> float:
+    """Grade Task 7 — LR Scheduler Misconfigured (hard). Spec extension.
+    Requires thorough investigation: agents must inspect weights to rule out
+    weight-related issues before concluding scheduler is the root cause.
+    Penalizes wrong fixes (e.g. patch_data_loader when data is fine).
+    """
     score = 0.0
+    # Investigation credits (+0.05 each, up to +0.20 for all 4 types)
     if state.gradients_inspected:
         score += 0.05
     if state.data_inspected:
         score += 0.05
+    if state.model_weights_inspected:
+        score += 0.05
+    if state.model_modes_inspected:
+        score += 0.05
+    # Fix credit scaled by investigation thoroughness
     if _has_action(state, "modify_config"):
+        if state.model_weights_inspected:
+            score += 0.20  # Thorough: ruled out weight issues
+        else:
+            score += 0.12  # Partial: didn't check weights
+    # Restart credit scaled by thoroughness
     if state.restart_after_fix:
+        if state.model_weights_inspected:
+            score += 0.20  # Full restart credit
+        else:
+            score += 0.12  # Partial credit
+    # Diagnosis
     if _correct_diagnosis(state, scenario):
         score += 0.40
+    # Wrong-fix penalty: patch_data_loader when data is clean
+    if _has_action(state, "patch_data_loader"):
+        score -= 0.10
     return min(1.0, max(0.0, score))
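
Note on the graders.py hunks: the arithmetic of the new rubrics is easier to check in a standalone form. The sketch below restates both rubrics under stated assumptions and is not the repository's code. `SketchState` is a hypothetical stand-in for `EpisodeState`, the `correct_diagnosis` flag replaces the `_correct_diagnosis(state, scenario)` helper, `_has_action` is assumed to be a plain membership test on `actions_taken`, and the final clamp for Task 6 is assumed to match the one shown for Task 7.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class SketchState:
    """Hypothetical stand-in for EpisodeState (only the fields the rubrics read)."""

    code_inspected: bool = False
    gradients_inspected: bool = False
    data_inspected: bool = False
    model_modes_inspected: bool = False
    model_weights_inspected: bool = False
    fix_action_taken: bool = False
    restart_after_fix: bool = False
    correct_diagnosis: bool = False  # assumption: replaces _correct_diagnosis()
    actions_taken: list[str] = field(default_factory=list)


def sketch_task_006(s: SketchState) -> float:
    """Task 6: five inspections at 0.05, fix/restart at 0.15 or 0.08, diagnosis 0.45."""
    inspections = [s.code_inspected, s.gradients_inspected, s.model_modes_inspected,
                   s.model_weights_inspected, s.data_inspected]
    score = 0.05 * sum(inspections)
    step = 0.15 if s.model_weights_inspected else 0.08  # scaled by weight inspection
    if "fix_code" in s.actions_taken and s.fix_action_taken:
        score += step
    if s.restart_after_fix:
        score += step
    if s.correct_diagnosis:
        score += 0.45
    return min(1.0, max(0.0, score))  # clamp assumed to match Task 7


def sketch_task_007(s: SketchState) -> float:
    """Task 7: four inspections at 0.05, fix/restart at 0.20 or 0.12, diagnosis 0.40."""
    inspections = [s.gradients_inspected, s.data_inspected,
                   s.model_weights_inspected, s.model_modes_inspected]
    score = 0.05 * sum(inspections)
    step = 0.20 if s.model_weights_inspected else 0.12
    if "modify_config" in s.actions_taken:
        score += step
    if s.restart_after_fix:
        score += step
    if s.correct_diagnosis:
        score += 0.40
    if "patch_data_loader" in s.actions_taken:
        score -= 0.10  # wrong-fix penalty
    return min(1.0, max(0.0, score))
```

Under these assumptions the paths exercised in tests/test_graders.py work out as follows. For Task 6: thorough is 0.25 + 0.15 + 0.15 + 0.45 = 1.00, no weight inspection is 0.20 + 0.08 + 0.08 + 0.45 = 0.81, and code-only is 0.05 + 0.08 + 0.08 + 0.45 = 0.66. For Task 7: thorough is 0.20 + 0.20 + 0.20 + 0.40 = 1.00, no weight inspection is 0.15 + 0.12 + 0.12 + 0.40 = 0.79, and the wrong-fix case (gradients and data inspected, `patch_data_loader` taken) is 0.10 + 0.12 + 0.12 + 0.40 - 0.10 = 0.64. Compare results with a tolerance, as the tests do with `pytest.approx`, since the sums are floating-point.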

pyproject.toml CHANGED

@@ -1,6 +1,6 @@
 [project]
 name = "pytorch-training-debugger"
-version = "1.0.0"
 description = "OpenEnv RL environment for PyTorch training failure debugging"
 requires-python = ">=3.12"
 dependencies = [

 [project]
 name = "pytorch-training-debugger"
+version = "1.1.0"
 description = "OpenEnv RL environment for PyTorch training failure debugging"
 requires-python = ">=3.12"
 dependencies = [

server/app.py CHANGED

@@ -12,7 +12,7 @@ import sys
 from typing import Optional
 from fastapi import FastAPI
-from fastapi.responses import HTMLResponse, JSONResponse
 from openenv.core.env_server.http_server import create_app
 from ml_training_debugger.models import MLTrainingAction, MLTrainingObservation
@@ -77,9 +77,8 @@ _baseline_lock = asyncio.Lock()
 @app.get("/")
-def root():
     """Redirect root to dashboard."""
-    from fastapi.responses import RedirectResponse
     return RedirectResponse(url="/dashboard")
@@ -174,7 +173,7 @@ def post_grader(session_id: Optional[str] = None) -> dict:
 @app.post("/baseline", response_model=None)
-async def post_baseline():
     """Trigger baseline run, return scores for all tasks.
     Returns 409 if already running. Uses asyncio.Lock for thread safety.

 from typing import Optional
 from fastapi import FastAPI
+from fastapi.responses import HTMLResponse, JSONResponse, RedirectResponse
 from openenv.core.env_server.http_server import create_app
 from ml_training_debugger.models import MLTrainingAction, MLTrainingObservation
 @app.get("/")
+def root() -> RedirectResponse:
     """Redirect root to dashboard."""
     return RedirectResponse(url="/dashboard")
 @app.post("/baseline", response_model=None)
+async def post_baseline() -> JSONResponse | dict:
     """Trigger baseline run, return scores for all tasks.
     Returns 409 if already running. Uses asyncio.Lock for thread safety.
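
Note on the `/baseline` docstring: the "409 if already running" behaviour it describes can be sketched with plain asyncio, without FastAPI. `BaselineRunner` and `demo` are hypothetical names and the sleep is a placeholder for the real baseline run; the endpoint body itself is not part of this diff. One caveat worth stating: per the Python documentation, `asyncio.Lock` is a mutex for asyncio tasks and is not thread-safe, so what it guarantees here is mutual exclusion between coroutines on one event loop.

```python
from __future__ import annotations

import asyncio


class BaselineRunner:
    """Hypothetical helper: allow at most one baseline run at a time."""

    def __init__(self) -> None:
        self._lock = asyncio.Lock()

    async def run(self, work_seconds: float = 0.01) -> tuple[int, dict]:
        # Reject immediately instead of queueing behind the active run.
        if self._lock.locked():
            return 409, {"detail": "baseline already running"}
        # No await sits between the check and the acquire, so no other
        # coroutine on this loop can take the lock in between.
        async with self._lock:
            await asyncio.sleep(work_seconds)  # placeholder for the real baseline
            return 200, {"scores": {}}


async def demo() -> list[int]:
    """Fire two overlapping requests and return their status codes."""
    runner = BaselineRunner()  # created inside the running event loop
    results = await asyncio.gather(runner.run(), runner.run())
    return [status for status, _ in results]
```

Running `asyncio.run(demo())` yields one 200 and one 409: the first call holds the lock while it sleeps, and the overlapping call sees `locked()` and returns early.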

tests/test_graders.py CHANGED

@@ -10,6 +10,7 @@ from ml_training_debugger.graders import (
     grade_task_001,
     grade_task_003,
     grade_task_005,
     grade_task_007,
 )
 from ml_training_debugger.models import EpisodeState
@@ -241,25 +242,168 @@ class TestGradeEpisode:
         assert score == 0.0
 class TestGradeTask007:
-    def test_perfect_score(self):
         scenario = sample_scenario("task_007", seed=42)
         state = EpisodeState(
             gradients_inspected=True,
             data_inspected=True,
             fix_action_taken=True,
             restart_after_fix=True,
             diagnosis_submitted=True,
             actions_taken=[
                 "inspect_gradients",
                 "inspect_data_batch",
                 "modify_config",
                 "restart_run",
                 "mark_diagnosed:scheduler_misconfigured",
             ],
         )
         score = grade_task_007(state, scenario)
-        assert score == 1.0
     def test_wrong_diagnosis(self):
         scenario = sample_scenario("task_007", seed=42)

     grade_task_001,
     grade_task_003,
     grade_task_005,
+    grade_task_006,
     grade_task_007,
 )
 from ml_training_debugger.models import EpisodeState
         assert score == 0.0
+class TestGradeTask006:
+    @pytest.fixture
+    def scenario_006(self):
+        return sample_scenario("task_006", seed=42)
+    def test_perfect_score_thorough(self, scenario_006):
+        """Thorough agent inspects ALL systems including weights — gets perfect score."""
+        state = EpisodeState(
+            code_inspected=True,
+            gradients_inspected=True,
+            model_modes_inspected=True,
+            model_weights_inspected=True,
+            data_inspected=True,
+            fix_action_taken=True,
+            restart_after_fix=True,
+            diagnosis_submitted=True,
+            actions_taken=[
+                "inspect_gradients",
+                "inspect_data_batch",
+                "inspect_model_weights",
+                "inspect_model_modes",
+                "inspect_code",
+                "fix_code",
+                "restart_run",
+                "mark_diagnosed:code_bug",
+            ],
+        )
+        score = grade_task_006(state, scenario_006)
+        assert score == pytest.approx(1.0)
+    def test_no_weights_inspection_partial(self, scenario_006):
+        """Agent that skips weight inspection gets reduced fix/restart credit."""
+        state = EpisodeState(
+            code_inspected=True,
+            gradients_inspected=True,
+            model_modes_inspected=True,
+            data_inspected=True,
+            fix_action_taken=True,
+            restart_after_fix=True,
+            diagnosis_submitted=True,
+            actions_taken=[
+                "inspect_gradients",
+                "inspect_data_batch",
+                "inspect_model_modes",
+                "inspect_code",
+                "fix_code",
+                "restart_run",
+                "mark_diagnosed:code_bug",
+            ],
+        )
+        score = grade_task_006(state, scenario_006)
+        # 0.05*4 + 0.08 + 0.08 + 0.45 = 0.81
+        assert score == pytest.approx(0.81)
+        assert score < 1.0  # Must not be perfect without weights
+    def test_minimal_investigation(self, scenario_006):
+        """Agent that only inspects code, fixes, and diagnoses."""
+        state = EpisodeState(
+            code_inspected=True,
+            fix_action_taken=True,
+            restart_after_fix=True,
+            diagnosis_submitted=True,
+            actions_taken=[
+                "inspect_code",
+                "fix_code",
+                "restart_run",
+                "mark_diagnosed:code_bug",
+            ],
+        )
+        score = grade_task_006(state, scenario_006)
+        # 0.05 + 0.08 + 0.08 + 0.45 = 0.66
+        assert score == pytest.approx(0.66)
+    def test_wrong_diagnosis(self, scenario_006):
+        """Submitting batchnorm_eval_mode on a code_bug task fails."""
+        state = EpisodeState(
+            code_inspected=True,
+            diagnosis_submitted=True,
+            actions_taken=[
+                "inspect_code",
+                "mark_diagnosed:batchnorm_eval_mode",
+            ],
+        )
+        score = grade_task_006(state, scenario_006)
+        assert score < 0.2  # Only gets code_inspected bonus
+    def test_score_in_range(self, scenario_006):
+        state = EpisodeState()
+        score = grade_task_006(state, scenario_006)
+        assert 0.0 <= score <= 1.0
 class TestGradeTask007:
+    def test_perfect_score_thorough(self):
+        """Thorough agent inspects weights — gets perfect score."""
         scenario = sample_scenario("task_007", seed=42)
         state = EpisodeState(
             gradients_inspected=True,
             data_inspected=True,
+            model_weights_inspected=True,
+            model_modes_inspected=True,
             fix_action_taken=True,
             restart_after_fix=True,
             diagnosis_submitted=True,
             actions_taken=[
                 "inspect_gradients",
                 "inspect_data_batch",
+                "inspect_model_weights",
+                "inspect_model_modes",
                 "modify_config",
                 "restart_run",
                 "mark_diagnosed:scheduler_misconfigured",
             ],
         )
         score = grade_task_007(state, scenario)
+        assert score == pytest.approx(1.0)
+    def test_no_weights_partial(self):
+        """Agent without weight inspection gets reduced fix/restart credit."""
+        scenario = sample_scenario("task_007", seed=42)
+        state = EpisodeState(
+            gradients_inspected=True,
+            data_inspected=True,
+            model_modes_inspected=True,
+            fix_action_taken=True,
+            restart_after_fix=True,
+            diagnosis_submitted=True,
+            actions_taken=[
+                "inspect_gradients",
+                "inspect_data_batch",
+                "inspect_model_modes",
+                "modify_config",
+                "restart_run",
+                "mark_diagnosed:scheduler_misconfigured",
+            ],
+        )
+        score = grade_task_007(state, scenario)
+        # 0.05*3 + 0.12 + 0.12 + 0.40 = 0.79
+        assert score == pytest.approx(0.79)
+        assert score < 1.0
+    def test_wrong_fix_penalty(self):
+        """Agent that patches data loader (wrong fix) gets penalized."""
+        scenario = sample_scenario("task_007", seed=42)
+        state = EpisodeState(
+            gradients_inspected=True,
+            data_inspected=True,
+            fix_action_taken=True,
+            restart_after_fix=True,
+            diagnosis_submitted=True,
+            actions_taken=[
+                "inspect_gradients",
+                "inspect_data_batch",
+                "patch_data_loader",
+                "modify_config",
+                "restart_run",
+                "mark_diagnosed:scheduler_misconfigured",
+            ],
+        )
+        score = grade_task_007(state, scenario)
+        # Normal partial score minus 0.10 penalty
+        assert score < 0.75
     def test_wrong_diagnosis(self):
         scenario = sample_scenario("task_007", seed=42)