Spaces: ujjwalpardeshi/pytorch-training-debugger

Commit 7336adb (1 parent: deea97b)

omkarrr88 committed on Apr 6

Minor fixes

Files changed (10)

.gitignore +0 -9
README.md +5 -12
baseline_inference.py +241 -0
docs/EXPLANATION.md +0 -340
docs/ml-training-debugger-spec.md +0 -0
ml_training_debugger/models.py +0 -4
run_all_baselines.py +130 -0
server/_baseline_results.py +11 -6
server/environment.py +11 -4
tests/test_models.py +0 -7

.gitignore CHANGED

@@ -13,16 +13,7 @@ validation/reports/*.png
 .mypy_cache/
 .ruff_cache/
 .coverage
-.claude/
-CLAUDE.md
 .hf-space/
 .python-version
 deploy-hf.sh
 deploy.sh
-AUDIT_REPORT.md
-baseline_inference.py
-run_all_baselines.py
-docs/PAPER.md
-docs/PRD.md
-docs/ROADMAP.md
-docs/PROJECT_GUIDE.md


README.md CHANGED

@@ -29,7 +29,7 @@ The agent starts with limited information (loss curves, config, error log) and m
 ### Real PyTorch Model Internals
-Every gradient comes from real `torch.autograd`. Every weight stat comes from real `model.state_dict()`. The environment instantiates actual `torch.nn.Module` models (SimpleCNN ~50K params, SimpleMLP ~20K params), runs 20 real forward+backward epochs per reset, and extracts real tensor statistics. Not synthetic formulas — real PyTorch computation, cached for instant replay.
+Every gradient comes from real `torch.autograd`. Every weight stat comes from real `model.state_dict()`. The environment instantiates actual `torch.nn.Module` models (SimpleCNN ~67K params, SimpleMLP ~412K params), runs 20 real forward+backward epochs per reset, and extracts real tensor statistics. Not synthetic formulas — real PyTorch computation, cached for instant replay.
 ### Context-Gated Reward Shaping
@@ -88,7 +88,7 @@ Fields like `gradient_stats`, `data_batch_stats`, `model_mode_info`, and `code_s
 ## Action Space
-14 action types in 3 categories:
+13 action types in 3 categories:
 **Investigation** — reveal hidden observation fields:
 - `inspect_gradients` — per-layer gradient norms, is_exploding/is_vanishing flags
@@ -107,7 +107,6 @@ Fields like `gradient_stats`, `data_batch_stats`, `model_mode_info`, and `code_s
 **Terminal** — end the episode:
 - `restart_run` — restart training (only available after a fix)
-- `rollback_checkpoint` — rollback to pre-fix state (only available after restart)
 - `mark_diagnosed` — submit diagnosis from 7 possible root causes
 Actions are dynamically available based on episode state: `fix_code` requires prior code inspection, `restart_run` requires a fix, `mark_diagnosed` disappears after submission.
@@ -174,14 +173,8 @@ An agent that chases the gradient spike red herring loses 0.20 points. An agent
 # Heuristic (deterministic, no API key, bit-exact reproducible)
 python3 baseline_heuristic.py
-# LLM (multi-provider support)
-python3 baseline_inference.py                       # Groq — Llama 3.3 70B (free)
-python3 baseline_inference.py --provider cerebras    # Cerebras — Llama 3.1 8B (free)
-python3 baseline_inference.py --provider gemini      # Google Gemini 2.0 Flash
-python3 baseline_inference.py --provider openai      # OpenAI GPT-4o
-# Run all baselines with comparison table
-python3 run_all_baselines.py
+# LLM (hackathon evaluator format — uses OpenAI client)
+API_BASE_URL=https://api.openai.com/v1 MODEL_NAME=gpt-4o OPENAI_API_KEY=sk-... python3 inference.py
 ```
 ## API
@@ -299,7 +292,7 @@ server/
 tests/                   — 246 tests, 96% coverage
 baseline_heuristic.py    — Rule-based agent (deterministic, no API key)
-baseline_inference.py    — LLM agent (Groq/Cerebras/Gemini/OpenAI)
+inference.py             — LLM agent (OpenAI client, hackathon format)
 ```
 **Key design decisions:**

baseline_inference.py ADDED

@@ -0,0 +1,241 @@

+#!/usr/bin/env python3
+"""LLM baseline agent using an OpenAI-compatible SDK (Groq, Cerebras, Gemini, or OpenAI).
+Requires the selected provider's API key environment variable, e.g. GROQ_API_KEY
+for the default provider (or pass the key via --api-key).
+Uses temperature=0.0 for near-deterministic behavior.
+Spec reference: Section 17.
+Usage:
+    GROQ_API_KEY=... python baseline_inference.py
+    python baseline_inference.py --provider gemini --api-key YOUR_KEY
+"""
+from __future__ import annotations
+import argparse
+import json
+import os
+import sys
+from pathlib import Path
+# Load .env file if present
+_env_path = Path(__file__).parent / ".env"
+if _env_path.exists():
+    for line in _env_path.read_text().splitlines():
+        line = line.strip()
+        if line and not line.startswith("#") and "=" in line:
+            key, _, value = line.partition("=")
+            os.environ.setdefault(key.strip(), value.strip())
+try:
+    from openai import OpenAI
+except ImportError:
+    print("Error: openai package not installed. Run: pip install openai")
+    sys.exit(1)
+from ml_training_debugger.models import MLTrainingAction
+from server.environment import MLTrainingEnvironment
+ALL_TASKS = [
+    "task_001",
+    "task_002",
+    "task_003",
+    "task_004",
+    "task_005",
+    "task_006",
+    "task_007",
+]
+SYSTEM_PROMPT = """You are an expert ML engineer debugging a PyTorch training run.
+You are interacting with an environment that simulates a broken training job.
+Available actions (respond with JSON only, no explanation):
+- {"action_type": "inspect_gradients"} - View gradient statistics per layer
+- {"action_type": "inspect_data_batch"} - View data batch statistics and confusion matrix
+- {"action_type": "inspect_model_modes"} - View model layer modes (train/eval)
+- {"action_type": "inspect_model_weights"} - View model weight statistics
+- {"action_type": "inspect_code"} - View PyTorch training code
+- {"action_type": "modify_config", "target": "<field>", "value": <val>} - Change a hyperparameter
+- {"action_type": "add_callback"} - Add gradient clipping/scheduler
+- {"action_type": "patch_data_loader"} - Fix data pipeline issues
+- {"action_type": "fix_model_mode"} - Call model.train()
+- {"action_type": "fix_code", "line": <int>, "replacement": "<code>"} - Fix a code line
+- {"action_type": "restart_run"} - Restart training (requires a fix first)
+- {"action_type": "mark_diagnosed", "diagnosis": "<cause>"} - Submit diagnosis
+Valid diagnoses: lr_too_high, vanishing_gradients, data_leakage, overfitting, batchnorm_eval_mode, code_bug, scheduler_misconfigured
+Strategy:
+1. First investigate by inspecting gradients, data, model modes, and code
+2. Form a hypothesis based on the evidence gathered
+3. Apply the correct fix for the identified root cause
+4. Restart training to verify the fix works
+5. Submit your diagnosis
+IMPORTANT: Respond with ONLY a valid JSON action object. No explanation, no markdown, no code blocks."""
+def run_llm_episode(task_id: str, client: OpenAI, model_name: str) -> float:
+    """Run one LLM agent episode."""
+    env = MLTrainingEnvironment()
+    obs = env.reset(seed=42, episode_id=f"llm_{task_id}", task_id=task_id)
+    initial_obs = {
+        "training_loss_history": obs.training_loss_history[:5],
+        "val_accuracy_history": obs.val_accuracy_history[:5],
+        "current_config": obs.current_config.model_dump(),
+        "error_log": obs.error_log,
+        "available_actions": obs.available_actions,
+        "notes": obs.notes,
+        "gpu_memory_used_gb": obs.gpu_memory_used_gb,
+    }
+    messages = [
+        {"role": "system", "content": SYSTEM_PROMPT},
+        {
+            "role": "user",
+            "content": f"New episode started for a broken PyTorch training run.\n\nInitial observation:\n{json.dumps(initial_obs, indent=2, default=str)}",
+        },
+    ]
+    for step in range(25):
+        if obs.done:
+            break
+        try:
+            response = client.chat.completions.create(
+                model=model_name,
+                messages=messages,
+                temperature=0.0,
+                max_tokens=300,
+            )
+            action_text = response.choices[0].message.content.strip()
+        except Exception as e:
+            print(f"    Step {step}: API error — {e}", file=sys.stderr)
+            break
+        # Clean up common LLM formatting issues
+        action_text = action_text.strip("`").strip()
+        if action_text.startswith("json"):
+            action_text = action_text[4:].strip()
+        messages.append({"role": "assistant", "content": action_text})
+        try:
+            action_data = json.loads(action_text)
+            action = MLTrainingAction(**action_data)
+        except Exception as e:  # covers JSON decode errors and model validation errors
+            messages.append(
+                {
+                    "role": "user",
+                    "content": f"Invalid action format: {e}. Respond with ONLY valid JSON.",
+                }
+            )
+            continue
+        obs = env.step(action)
+        obs_summary: dict = {
+            "reward": obs.reward,
+            "done": obs.done,
+            "step": obs.episode_state.step_count,
+            "available_actions": obs.available_actions,
+        }
+        if obs.error_log:
+            obs_summary["error_log"] = obs.error_log
+        if obs.gradient_stats:
+            obs_summary["gradient_stats"] = [
+                {
+                    "layer": g.layer_name,
+                    "mean_norm": round(g.mean_norm, 4),
+                    "exploding": g.is_exploding,
+                    "vanishing": g.is_vanishing,
+                }
+                for g in obs.gradient_stats
+            ]
+        if obs.data_batch_stats:
+            obs_summary["data_overlap"] = obs.data_batch_stats.class_overlap_score
+            obs_summary["duplicate_ratio"] = obs.data_batch_stats.duplicate_ratio
+        if obs.model_mode_info:
+            obs_summary["model_modes"] = obs.model_mode_info
+        if obs.code_snippet:
+            obs_summary["code"] = obs.code_snippet.code[:600]
+            obs_summary["hint"] = obs.code_snippet.hint
+        messages.append(
+            {
+                "role": "user",
+                "content": f"Observation after your action:\n{json.dumps(obs_summary, indent=2, default=str)}",
+            }
+        )
+    session = env._get_session()
+    return session.last_score if session and session.last_score is not None else 0.0
+PROVIDERS = {
+    "groq": {
+        "env_key": "GROQ_API_KEY",
+        "base_url": "https://api.groq.com/openai/v1",
+        "default_model": "llama-3.3-70b-versatile",
+    },
+    "cerebras": {
+        "env_key": "CEREBRAS_API_KEY",
+        "base_url": "https://api.cerebras.ai/v1",
+        "default_model": "llama3.1-8b",
+    },
+    "gemini": {
+        "env_key": "GEMINI_API_KEY",
+        "base_url": "https://generativelanguage.googleapis.com/v1beta/openai/",
+        "default_model": "gemini-2.0-flash",
+    },
+    "openai": {
+        "env_key": "OPENAI_API_KEY",
+        "base_url": None,
+        "default_model": "gpt-4o",
+    },
+}
+def main() -> None:
+    parser = argparse.ArgumentParser(description="LLM baseline agent")
+    parser.add_argument("--url", default="http://localhost:7860")
+    parser.add_argument("--api-key", default=None, help="API key")
+    parser.add_argument(
+        "--provider",
+        default="groq",
+        choices=list(PROVIDERS.keys()),
+        help="LLM provider (default: groq)",
+    )
+    parser.add_argument("--model", default=None, help="Model name (auto-detected from provider)")
+    args = parser.parse_args()
+    prov = PROVIDERS[args.provider]
+    api_key = args.api_key or os.environ.get(prov["env_key"])
+    if not api_key:
+        print(f"Error: Set {prov['env_key']} env var or pass --api-key")
+        sys.exit(1)
+    model_name = args.model or prov["default_model"]
+    client_kwargs: dict = {"api_key": api_key}
+    if prov["base_url"]:
+        client_kwargs["base_url"] = prov["base_url"]
+    client = OpenAI(**client_kwargs)
+    scores: dict[str, float] = {}
+    print(f"Running LLM baseline with {args.provider}/{model_name}...", file=sys.stderr)
+    for task_id in ALL_TASKS:
+        try:
+            score = run_llm_episode(task_id, client, model_name)
+            scores[task_id] = round(score, 4)
+            print(f"  {task_id}: {score:.4f}", file=sys.stderr)
+        except Exception as e:
+            print(f"  {task_id}: ERROR — {e}", file=sys.stderr)
+            scores[task_id] = 0.0
+    print(json.dumps(scores, indent=2))
+if __name__ == "__main__":
+    main()
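
Note on the reply clean-up step in `run_llm_episode`: the logic that strips backticks and a leading `json` tag before calling `json.loads` can be isolated and exercised on its own. The following is a minimal sketch, not the committed code. The function name `parse_action_text` and the `action_type` check are assumptions added for illustration; the committed script validates through `MLTrainingAction` instead.

```python
import json


def parse_action_text(raw: str) -> dict:
    """Strip Markdown code-fence debris from an LLM reply and parse it as JSON.

    Hypothetical helper: mirrors the clean-up in run_llm_episode, but the
    'action_type' check below is an illustrative stand-in for model validation.
    """
    text = raw.strip()
    # Remove surrounding backticks, e.g. from a fenced ```json ... ``` reply.
    text = text.strip("`").strip()
    # Drop a leading "json" language tag left over from the fence.
    if text.startswith("json"):
        text = text[4:].strip()
    action = json.loads(text)  # raises json.JSONDecodeError on malformed input
    if not isinstance(action, dict) or "action_type" not in action:
        raise ValueError("action must be a JSON object with an 'action_type' key")
    return action


if __name__ == "__main__":
    fenced = '```json\n{"action_type": "inspect_code"}\n```'
    print(parse_action_text(fenced))
```

Keeping the clean-up in a pure function like this would make it possible to unit-test the formatting cases without calling any provider API.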

docs/EXPLANATION.md DELETED

@@ -1,340 +0,0 @@
-# PyTorch Training Run Debugger — Explained Simply
-> This file explains the entire project as if you're 10 years old. No jargon. Just simple language.
----
-## What Is This Project?
-Imagine you're a doctor, but instead of fixing sick people, you fix **sick computers that are trying to learn**.
-When computers learn (this is called "Machine Learning" or ML), they look at thousands of examples — like pictures of cats and dogs — and slowly get better at telling them apart. This learning process is called **training**.
-But sometimes, training goes wrong. The computer makes mistakes, gets confused, or learns the wrong things. When that happens, a human engineer has to figure out what went wrong and fix it — just like a doctor diagnosing a patient.
-**This project builds a practice hospital for AI doctors.** It creates fake "sick training runs" with known problems, and then an AI agent (the doctor) has to:
-1. **Investigate** — Look at clues (like checking temperature or blood pressure)
-2. **Diagnose** — Figure out what's wrong
-3. **Fix** — Apply the right treatment
-4. **Verify** — Check if the patient recovered
----
-## Why Does This Matter?
-Real companies like Meta, Google, and OpenAI spend millions of dollars training AI models. When training breaks, engineers waste hours (sometimes days!) figuring out what went wrong. Each hour of broken training can cost **$2-$8 per GPU** — and some companies use thousands of GPUs at once.
-If we could train an AI to automatically find and fix these problems, it would save enormous amounts of time and money.
-This project is a **training ground** where AI agents can practice debugging — like a flight simulator for pilots, but for ML engineers.
----
-## How Does It Work? (The Big Picture)
-Think of it like a detective game with 6 mystery cases:
-### The Game Rules
-1. **The computer shows you a broken training run** — You see charts showing how the training is going (spoiler: it's going badly!)
-2. **You can investigate** — You have 5 different "magnifying glasses" to look at different parts of the problem
-3. **You figure out what's wrong** — You pick from a list of 6 possible problems
-4. **You fix it** — You apply the right fix
-5. **You restart and check** — You restart the training and see if it works now
-6. **You submit your answer** — "I think the problem was X"
-If you're right, you get points. If you're wrong, you lose points. If you investigate smartly, you get bonus points. If you ignore evidence and do something silly, you get penalty points.
----
-## The 6 Mystery Cases (Tasks)
-### Easy Cases (Like finding a broken window)
-**Case 1: Learning Rate Too High (task_001)**
-> Imagine you're learning to ride a bike, but someone set the speed to 100 mph. You'd crash immediately!
-That's what happens here. The computer is learning too fast and everything explodes. The numbers go crazy and become "NaN" (Not a Number — like dividing by zero).
-**Clues:** Every part of the computer shows "EXPLODING!" when you check the gradients (the direction signals that guide learning).
-**Fix:** Turn down the speed (reduce the learning rate from 0.1 to 0.001).
----
-**Case 2: Vanishing Gradients (task_002)**
-> Now imagine you're whispering instructions to someone 100 rooms away. By the time the message reaches them, it's too quiet to hear.
-The learning signals get weaker and weaker as they travel through the computer's brain layers. The deeper layers get almost zero signal — so they can't learn anything.
-**Clues:** Deeper layers show "VANISHING!" gradients. The loss curve is flat — nothing is being learned.
-**Fix:** Increase the learning rate so the signals are louder.
----
-### Medium Cases (Like finding a hidden leak)
-**Case 3: Data Leakage (task_003)**
-> Imagine taking a math test, but the answer key is mixed into your practice problems. You'd score 100% — but you didn't actually learn anything!
-The training data and test data got mixed together. The computer looks amazing on tests, but it's just memorizing answers — it hasn't actually learned.
-**Clues:** Suspiciously high test scores from the very start. When you check the data, you find a "class overlap score" above 0.5 — meaning lots of test answers leaked into the training set.
-**Trick:** There's a misleading note saying "we upgraded the model architecture" — making you think the high scores are from a better model, not leaked data.
-**Fix:** Clean the data pipeline to remove the overlap.
----
-**Case 4: Overfitting (task_004)**
-> Imagine memorizing every single answer to last year's exam, but then failing this year's exam because the questions are slightly different.
-The computer has memorized the training data perfectly (train loss near zero!) but fails on new data it hasn't seen before (validation loss keeps rising).
-**Clues:** Training loss drops to almost zero while validation loss goes up — the classic "train-val divergence."
-**Fix:** Add regularization (weight decay) — this is like telling the computer "don't memorize, understand the patterns instead."
----
-### Hard Cases (Like solving a mystery with fake clues)
-**Case 5: BatchNorm Eval Mode (task_005)**
-> Imagine a student who studies perfectly at home but freezes during the actual exam because they switched into "test mode" too early.
-The computer's model has a special feature called BatchNorm that behaves differently during training vs testing. Someone accidentally left it in "test mode" during training. This causes subtle, slow degradation — not an obvious crash.
-**The Trap:** This case has **red herrings** — fake clues designed to mislead you:
-- One layer's gradient suddenly spikes (but it's not actually exploding)
-- GPU memory is at 91% (looks scary, but it's not the problem)
-- One layer has near-vanishing gradients (but that's normal for this layer)
-- An error log warns about GPU memory (irrelevant to the real problem)
-**Clues:** When you check the model modes, you find all layers are in "eval" (test) mode instead of "train" mode. That's the real problem.
-**Why it's hard:** Most agents see the gradient spike and immediately try to fix gradients — falling for the trap. The smart agent checks model modes and finds the real issue.
----
-**Case 6: Code Bug (task_006)**
-> Imagine a recipe that says "bake for 30 minutes" but someone accidentally changed it to "bake for 0 minutes." The oven runs, but nothing gets cooked.
-There's an actual bug in the Python code. The agent sees the source code and has to find the buggy line and fix it. There are 4 possible bugs:
-1. **eval_mode** — `model.eval()` instead of `model.train()` (wrong mode)
-2. **detach_loss** — `loss.detach()` before `.backward()` (disconnects the learning signal)
-3. **zero_grad_missing** — Forgot to clear old gradients (gradients pile up incorrectly)
-4. **inplace_relu** — `inplace=True` on ReLU (corrupts the computation graph)
-**Why it's hard:** The agent must actually READ code and understand what each line does — not just look at numbers and charts.
----
-## The Scoring System
-### Rewards (Points You Earn)
-Think of it like a video game:
-| What You Do | Points | Why |
-|-------------|--------|-----|
-| Take any action | **-0.01** | Every move costs a tiny bit (encourages efficiency) |
-| Investigate something for the first time | **+0.05** | Looking at clues is good! |
-| Correct diagnosis | **+0.50** | You found the answer! |
-| Fix works and training recovers | **+0.40** | Your fix actually helped! |
-### Penalties (Points You Lose)
-| What You Do | Points | Why |
-|-------------|--------|-----|
-| Do something invalid | **-0.05** | You tried something that's not allowed |
-| Wrong code fix | **-0.10** | Your code fix didn't work |
-| Wrong diagnosis | **-0.30** | You guessed wrong |
-### The Special Penalty: Context-Gated Penalty
-This is the **coolest part** of the project. Here's how it works:
-> You check the gradients and see they're all normal. Then you add gradient clipping anyway (a fix for gradient problems). But wait — YOU ALREADY KNOW the gradients are fine! You're ignoring your own evidence!
-**Penalty: -0.20 points**
-But if you add gradient clipping BEFORE checking gradients? No penalty — you haven't seen any evidence yet, so it's a reasonable guess.
-This teaches the AI: **"Don't ignore what you've already learned."**
----
-### The Grader (Final Score)
-At the end of each case, a grader gives you a score from **0.0 to 1.0**:
-- **1.0** = Perfect — investigated, fixed, restarted, and diagnosed correctly
-- **0.5-0.8** = Partial — got some things right, missed others
-- **0.0** = Failed — wrong diagnosis, no fix, or ran out of steps
-The grader looks at the WHOLE story of what you did, not just the final answer.
----
-## How the Code Is Organized
-```
-ML Debugger/
-│
-├── ml_training_debugger/          ← The brain of the project
-│   ├── models.py                  ← Data shapes (what observations and actions look like)
-│   ├── scenarios.py               ← Creates the 6 mystery cases with random parameters
-│   ├── pytorch_engine.py          ← Real PyTorch model that gets "sick" (fault injection)
-│   ├── simulation.py              ← Generates fake training charts (loss curves, accuracy)
-│   ├── reward_engine.py           ← Calculates points for each action
-│   ├── graders.py                 ← Final scoring (0.0 to 1.0) at episode end
-│   ├── code_templates.py          ← The buggy code snippets for Task 6
-│   └── client.py                  ← Helper for connecting to the environment
-│
-├── server/                        ← The web server
-│   ├── app.py                     ← Main server with all API endpoints
-│   ├── environment.py             ← The game logic (reset, step, state)
-│   └── _baseline_results.py       ← Stores grader results
-│
-├── tests/                         ← 183 tests making sure everything works
-│
-├── baseline_heuristic.py          ← A simple robot that plays the game using rules
-├── baseline_inference.py          ← A smart AI (GPT-4) that plays the game
-├── Dockerfile                     ← Instructions to package everything in a container
-├── openenv.yaml                   ← Configuration file for the OpenEnv framework
-└── README.md                      ← Technical documentation
-```
----
-## How a Game Session Works (Step by Step)
-Let's walk through a complete game:
-### Step 1: Start a New Game
-```
-Agent: "Start task_001 please"
-Environment: "Here's your broken training run:"
-  - Loss history: [2.3, 3.5, 8.2, 45.0, inf, inf, inf, ...]  ← Yikes, numbers exploding!
-  - Error log: "Loss is NaN at epoch 12"
-  - Available actions: [inspect_gradients, inspect_data_batch, ...]
-```
-### Step 2: Investigate
-```
-Agent: "Let me inspect the gradients"
-Environment: "Here's what I found:"
-  - conv1: mean_norm=51.1, is_exploding=True
-  - conv2: mean_norm=91.3, is_exploding=True
-  - conv3: mean_norm=111.8, is_exploding=True
-  - fc: mean_norm=37.7, is_exploding=True
-  Reward: +0.04 (step penalty + investigation bonus)
-```
-### Step 3: Fix
-```
-Agent: "Reduce learning rate to 0.001"
-Environment: "Config updated. learning_rate = 0.001"
-  Reward: -0.01 (step penalty only)
-```
-### Step 4: Restart
-```
-Agent: "Restart the training run"
-Environment: "Training restarted. Convergence detected!"
-  Reward: +0.39 (step penalty + convergence bonus)
-```
-### Step 5: Diagnose
-```
-Agent: "The problem was lr_too_high"
-Environment: "CORRECT! Episode complete."
-  Reward: +0.49 (step penalty + correct diagnosis)
-  Final grader score: 1.0 ← Perfect!
-```
----
-## What Makes This Project Special?
-### 1. It Uses REAL PyTorch
-This isn't fake data. When you inspect gradients, you're looking at real numbers computed by a real neural network using `torch.autograd`. The model has ~50,000 parameters and runs real forward/backward passes. This matters because the hackathon is organized by **Meta (the company that makes PyTorch)**.
-### 2. Context-Gated Rewards
-No other OpenEnv environment does this. The reward system tracks what the agent has learned and penalizes it for ignoring evidence. This teaches AI to reason like a real engineer — gather evidence first, then act.
-### 3. Code-Level Debugging (Task 6)
-The agent reads actual Python code and submits line-by-line fixes. This tests code understanding — not just number crunching. Meta cares about this because they want AI that can debug PyTorch code.
-### 4. Red Herrings in Hard Tasks
-Task 5 deliberately plants misleading clues. This separates agents that follow rigid patterns from agents that can reason through ambiguity — exactly like real debugging.
-### 5. Progressive Information Reveal
-The agent starts with limited information and must actively choose what to investigate. Each inspection reveals new data. This makes it a genuine investigation — not just a classification task.
----
-## The Two Baselines (Robot Players)
-### Baseline 1: The Rule-Following Robot (`baseline_heuristic.py`)
-This robot follows a fixed checklist:
-1. Check gradients → if exploding, fix learning rate
-2. Check data → if leaking, patch data
-3. Check model modes → if eval, fix mode
-4. Check code → if bug found, fix it
-5. If nothing works, guess "overfitting"
-**Scores:** Perfect on easy/medium tasks, but only 0.35 on Task 5 because its fixed order means it tries to fix gradients before checking model modes — falling for the red herring.
-### Baseline 2: The Smart AI (`baseline_inference.py`)
-This uses GPT-4 to reason about the evidence. It reads the observations, thinks about what to do, and makes decisions. It should score higher on hard tasks because it can reason, not just follow rules.
----
-## The Technology Stack
-| Component | What It Is | Why We Use It |
-|-----------|-----------|---------------|
-| **Python 3.12** | Programming language | Modern, fast, supports type hints |
-| **PyTorch (CPU)** | Machine learning framework | Real neural networks, real gradients (Meta's framework!) |
-| **FastAPI** | Web framework | Fast, modern, auto-generates docs |
-| **OpenEnv** | RL environment framework | Standard interface for AI agents (step/reset/state) |
-| **Pydantic** | Data validation | Ensures all data is properly typed |
-| **Plotly.js** | Charting library | Live dashboard with interactive charts |
-| **Docker** | Containerization | Package everything so it runs anywhere |
----
-## How to Think About This Project
-**Analogy 1: Medical Training Simulator**
-Medical students practice on mannequins before treating real patients. This project is a mannequin for AI debugging — the "patients" have known problems, and the "doctor" (AI agent) learns to diagnose them.
-**Analogy 2: Escape Room**
-Each task is like an escape room. You're locked in with clues scattered around. Some clues are helpful, some are red herrings. You need to investigate systematically, not randomly try everything.
-**Analogy 3: Car Mechanic School**
-A car comes in making weird noises. The mechanic can:
-- Check the engine (inspect_gradients)
-- Check the fuel (inspect_data_batch)
-- Check the gearbox (inspect_model_modes)
-- Read the error codes (inspect_code)
-Then they fix the right part and test-drive it to confirm.
----
-## Summary
-| Question | Answer |
-|----------|--------|
-| **What?** | A practice environment where AI agents learn to debug broken PyTorch training runs |
-| **Why?** | Real ML debugging costs companies millions. Training AI to do it has huge value. |
-| **How?** | 7 mystery cases with real PyTorch training (CNN + MLP), progressive clue reveal, and smart scoring |
-| **What's special?** | Real 20-epoch training, dual architectures, context-gated rewards, code-level debugging, red herrings, difficulty scaling |
-| **Who's it for?** | AI researchers building smarter debugging agents |
-| **Built with?** | Python, PyTorch, FastAPI, OpenEnv, Pydantic, Docker |
-| **For what event?** | Meta PyTorch OpenEnv Hackathon x Scaler School of Technology |
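
Note on the reward tables in the deleted docs/EXPLANATION.md: the per-step arithmetic in its walkthrough (+0.04, -0.01, +0.39, +0.49) follows from a flat step cost plus bonuses, and the context-gated penalty applies only once the agent has already seen normal gradients. The sketch below reproduces that arithmetic under stated assumptions. It is not the project's `reward_engine.py`; the function name, the state dictionary keys, and the choice to stack the step cost on top of the -0.20 penalty are all hypothetical.

```python
# Values taken from the reward and penalty tables in the deleted document.
STEP_PENALTY = -0.01
FIRST_INVESTIGATION_BONUS = 0.05
CONVERGENCE_BONUS = 0.40
CORRECT_DIAGNOSIS_BONUS = 0.50
CONTEXT_GATED_PENALTY = -0.20


def step_reward(action: str, state: dict) -> float:
    """Return the reward for one action and update `state` in place.

    `state` is a hypothetical dict with the keys 'inspected' (set of str),
    'gradients_normal', 'fix_is_correct', and 'diagnosis_is_correct' (bools).
    """
    reward = STEP_PENALTY  # every action costs a little
    if action.startswith("inspect_"):
        if action not in state["inspected"]:
            reward += FIRST_INVESTIGATION_BONUS  # first look only
            state["inspected"].add(action)
    elif action == "add_callback":
        # Context gate: penalize only if gradients were already inspected
        # and looked normal, i.e. the agent is ignoring its own evidence.
        if "inspect_gradients" in state["inspected"] and state["gradients_normal"]:
            reward += CONTEXT_GATED_PENALTY
    elif action == "restart_run" and state["fix_is_correct"]:
        reward += CONVERGENCE_BONUS
    elif action == "mark_diagnosed" and state["diagnosis_is_correct"]:
        reward += CORRECT_DIAGNOSIS_BONUS
    return round(reward, 2)


if __name__ == "__main__":
    state = {
        "inspected": set(),
        "gradients_normal": False,
        "fix_is_correct": True,
        "diagnosis_is_correct": True,
    }
    episode = ["inspect_gradients", "modify_config", "restart_run", "mark_diagnosed"]
    for action in episode:
        print(action, step_reward(action, state))
```

The gate depends on what the agent has observed, not on the true root cause, which is why the same `add_callback` action costs only the step penalty when taken before any gradient inspection.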

docs/ml-training-debugger-spec.md DELETED

The diff for this file is too large to render. See raw diff

ml_training_debugger/models.py CHANGED

@@ -113,7 +113,6 @@ class EpisodeState(BaseModel):
         Rules from spec Section 10 — Dynamic available_actions:
         - restart_run: only after fix_action_taken
-        - rollback_checkpoint: only after restart_after_fix
         - fix_code: only after code_inspected
         - mark_diagnosed: disappears after diagnosis_submitted
         """
@@ -133,8 +132,6 @@ class EpisodeState(BaseModel):
             actions.append("fix_code")
         if self.fix_action_taken:
             actions.append("restart_run")
-        if self.restart_after_fix:
-            actions.append("rollback_checkpoint")
         if not self.diagnosis_submitted:
             actions.append("mark_diagnosed")
         return actions
@@ -154,7 +151,6 @@ ALL_ACTION_TYPES: set[str] = {
     "fix_code",
     "restart_run",
     "mark_diagnosed",
-    "rollback_checkpoint",
 }
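
Note on the gating rules left in `EpisodeState` after `rollback_checkpoint` was removed: three flags now decide which conditional actions are offered. The sketch below shows the remaining rules in isolation. It is an illustration rather than the project's model: it uses a plain dataclass instead of the Pydantic `BaseModel`, and both the class name `EpisodeFlags` and the contents of `ALWAYS_AVAILABLE` are assumptions, since the hunk does not show the unconditional part of the list.

```python
from dataclasses import dataclass

# Assumed for illustration: a few actions that are offered unconditionally.
ALWAYS_AVAILABLE = ["inspect_gradients", "inspect_code", "modify_config"]


@dataclass
class EpisodeFlags:
    """Hypothetical stand-in for the flags on EpisodeState."""

    code_inspected: bool = False
    fix_action_taken: bool = False
    diagnosis_submitted: bool = False

    def available_actions(self) -> list[str]:
        actions = list(ALWAYS_AVAILABLE)
        if self.code_inspected:
            actions.append("fix_code")  # only after the code was inspected
        if self.fix_action_taken:
            actions.append("restart_run")  # only after some fix was applied
        if not self.diagnosis_submitted:
            actions.append("mark_diagnosed")  # disappears after submission
        return actions


if __name__ == "__main__":
    print(EpisodeFlags().available_actions())
    print(EpisodeFlags(code_inspected=True, fix_action_taken=True).available_actions())
```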


run_all_baselines.py ADDED

@@ -0,0 +1,130 @@

+#!/usr/bin/env python3
+"""Run heuristic + multiple LLM baselines and show comparison table.
+Usage:
+    python3 run_all_baselines.py
+"""
+from __future__ import annotations
+import json
+import os
+import sys
+import time
+from concurrent.futures import ThreadPoolExecutor, as_completed
+from pathlib import Path
+# Load .env
+_env_path = Path(__file__).parent / ".env"
+if _env_path.exists():
+    for line in _env_path.read_text().splitlines():
+        line = line.strip()
+        if line and not line.startswith("#") and "=" in line:
+            key, _, value = line.partition("=")
+            os.environ.setdefault(key.strip(), value.strip())
+from baseline_heuristic import ALL_TASKS
+from baseline_heuristic import run_heuristic_episode
+from baseline_inference import PROVIDERS, run_llm_episode
+try:
+    from openai import OpenAI
+except ImportError:
+    print("Error: pip install openai")
+    sys.exit(1)
+def run_heuristic() -> dict[str, float]:
+    scores = {}
+    for task_id in ALL_TASKS:
+        scores[task_id] = round(run_heuristic_episode(task_id), 4)
+    return scores
+def run_llm_provider(provider_name: str, model: str | None = None) -> dict[str, float]:
+    prov = PROVIDERS[provider_name]
+    api_key = os.environ.get(prov["env_key"])
+    if not api_key:
+        return {t: -1.0 for t in ALL_TASKS}  # -1 = no key
+    model_name = model or prov["default_model"]
+    client_kwargs: dict = {"api_key": api_key}
+    if prov["base_url"]:
+        client_kwargs["base_url"] = prov["base_url"]
+    client = OpenAI(**client_kwargs)
+    scores: dict[str, float] = {}
+    for task_id in ALL_TASKS:
+        try:
+            score = run_llm_episode(task_id, client, model_name)
+            scores[task_id] = round(score, 4)
+            print(f"  [{provider_name}/{model_name}] {task_id}: {score:.4f}", file=sys.stderr)
+        except Exception as e:
+            err_str = str(e)[:80]
+            print(f"  [{provider_name}/{model_name}] {task_id}: ERROR — {err_str}", file=sys.stderr)
+            scores[task_id] = 0.0
+    return scores
+def main() -> None:
+    print("Running all baselines...\n", file=sys.stderr)
+    results: dict[str, dict[str, float]] = {}
+    # Run heuristic first (fast, deterministic)
+    print("--- Heuristic baseline ---", file=sys.stderr)
+    results["Heuristic"] = run_heuristic()
+    print(f"  Done: {json.dumps(results['Heuristic'])}", file=sys.stderr)
+    # Run LLM providers sequentially (avoids thread hang issues)
+    llm_runs = [
+        ("Cerebras/Llama-3.1-8B", "cerebras", "llama3.1-8b"),
+        ("Groq/Llama-3.1-8B", "groq", "llama-3.1-8b-instant"),
+    ]
+    for label, provider, model in llm_runs:
+        print(f"\n--- {label} ---", file=sys.stderr)
+        try:
+            results[label] = run_llm_provider(provider, model)
+        except Exception as e:
+            print(f"  {label}: FAILED — {e}", file=sys.stderr)
+            results[label] = {t: 0.0 for t in ALL_TASKS}
+    # Print comparison table
+    print("\n" + "=" * 80)
+    print("BASELINE COMPARISON TABLE")
+    print("=" * 80)
+    headers = list(results.keys())
+    print(f"\n{'Task':<12}", end="")
+    for h in headers:
+        print(f"{h:>25}", end="")
+    print()
+    print("-" * (12 + 25 * len(headers)))
+    for task_id in ALL_TASKS:
+        print(f"{task_id:<12}", end="")
+        for h in headers:
+            score = results[h].get(task_id, 0.0)
+            if score < 0:
+                print(f"{'no key':>25}", end="")
+            else:
+                print(f"{score:>25.4f}", end="")
+        print()
+    print("-" * (12 + 25 * len(headers)))
+    # Averages
+    print(f"{'AVERAGE':<12}", end="")
+    for h in headers:
+        valid = [v for v in results[h].values() if v >= 0]
+        avg = sum(valid) / len(valid) if valid else 0
+        print(f"{avg:>25.4f}", end="")
+    print()
+    # Emit the full results as JSON on stdout
+    print(json.dumps(results, indent=2))
+if __name__ == "__main__":
+    main()
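The script uses two different placeholder scores: `-1.0` when a provider has no API key, and `0.0` when an episode raises an exception. Only the first is excluded from the AVERAGE row. The sketch below isolates that averaging rule; `average_valid` is a hypothetical helper name and does not exist in the script, which inlines the same expression.

```python
def average_valid(scores: dict[str, float]) -> float:
    """Average the scores, skipping negative 'no key' sentinels.

    Mirrors the inline expression in main(): entries below zero are
    dropped, while 0.0 entries (episodes that errored) still count.
    """
    valid = [v for v in scores.values() if v >= 0]
    return sum(valid) / len(valid) if valid else 0.0
```

One consequence worth noting: a provider whose every task errored and a provider with no key both show an average of `0.0000`, so the per-task rows (which print `no key`) are the place to tell them apart.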

server/_baseline_results.py CHANGED

@@ -2,9 +2,11 @@
 from __future__ import annotations
 from typing import Optional
-# Store last completed episode results
 _last_results: dict[str, dict] = {}
@@ -12,16 +14,19 @@ def store_grader_result(
     session_id: str, score: float, task_id: str, steps: int
 ) -> None:
     """Store a grader result for retrieval."""
-    _last_results[session_id] = {
         "score": round(score, 4),
         "task_id": task_id,
         "steps": steps,
     }
-    _last_results["_latest"] = _last_results[session_id]
 def get_last_grader_result(session_id: Optional[str] = None) -> dict | None:
     """Get grader result for a session, or the most recent one."""
-    if session_id:
-        return _last_results.get(session_id)
-    return _last_results.get("_latest")

 from __future__ import annotations
+import threading
 from typing import Optional
+# Thread-safe store for completed episode results
+_lock = threading.Lock()
 _last_results: dict[str, dict] = {}
     session_id: str, score: float, task_id: str, steps: int
 ) -> None:
     """Store a grader result for retrieval."""
+    entry = {
         "score": round(score, 4),
         "task_id": task_id,
         "steps": steps,
     }
+    with _lock:
+        _last_results[session_id] = entry
+        _last_results["_latest"] = entry
 def get_last_grader_result(session_id: Optional[str] = None) -> dict | None:
     """Get grader result for a session, or the most recent one."""
+    with _lock:
+        if session_id:
+            return _last_results.get(session_id)
+        return _last_results.get("_latest")
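The change above guards both writes (`session_id` and `_latest`) with a single lock, so a concurrent reader should never observe one updated without the other. A self-contained sketch of the same pattern, exercised from several threads, could look like the following; `store`, `get`, and `demo` are hypothetical names for illustration, not the module's actual functions.

```python
import threading
from typing import Optional

_lock = threading.Lock()
_results: dict[str, dict] = {}


def store(session_id: str, score: float, task_id: str, steps: int) -> None:
    # Build the entry outside the lock; hold the lock only for the writes.
    entry = {"score": round(score, 4), "task_id": task_id, "steps": steps}
    with _lock:
        _results[session_id] = entry
        _results["_latest"] = entry


def get(session_id: Optional[str] = None) -> Optional[dict]:
    with _lock:
        if session_id:
            return _results.get(session_id)
        return _results.get("_latest")


def demo(n: int = 8) -> int:
    """Store n results from n threads and return the number of keys."""
    threads = [
        threading.Thread(target=store, args=(f"s{i}", i / 10, "task", i))
        for i in range(n)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return len(_results)
```

Which session ends up under `_latest` depends on thread scheduling, so the sketch makes no claim about it beyond it being one of the stored entries.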

server/environment.py CHANGED

@@ -467,9 +467,6 @@ class MLTrainingEnvironment(Environment[MLTrainingAction, MLTrainingObservation,
             state.diagnosis_submitted = True
             session.done = True
-        elif at == "rollback_checkpoint":
-            pass  # No-op for now
         return is_correct_fix, convergence
     def _check_convergence(self, session: SessionData) -> bool:
@@ -516,11 +513,21 @@ class MLTrainingEnvironment(Environment[MLTrainingAction, MLTrainingObservation,
         session = self._get_session()
         if session is None:
             return {"status": "no_active_episode"}
         return {
             "status": "active",
             "task_id": session.scenario.task_id,
-            "step_count": session.state.step_count,
             "done": session.done,
         }
     def get_last_completed(self, session_id: str | None = None) -> dict | None:

             state.diagnosis_submitted = True
             session.done = True
         return is_correct_fix, convergence
     def _check_convergence(self, session: SessionData) -> bool:
         session = self._get_session()
         if session is None:
             return {"status": "no_active_episode"}
+        st = session.state
         return {
             "status": "active",
             "task_id": session.scenario.task_id,
+            "step_count": st.step_count,
             "done": session.done,
+            "gradients_inspected": st.gradients_inspected,
+            "data_inspected": st.data_inspected,
+            "model_modes_inspected": st.model_modes_inspected,
+            "model_weights_inspected": st.model_weights_inspected,
+            "code_inspected": st.code_inspected,
+            "fix_action_taken": st.fix_action_taken,
+            "restart_after_fix": st.restart_after_fix,
+            "diagnosis_submitted": st.diagnosis_submitted,
+            "available_actions": st.compute_available_actions(),
         }
     def get_last_completed(self, session_id: str | None = None) -> dict | None:

tests/test_models.py CHANGED

@@ -93,8 +93,6 @@ class TestEpisodeState:
         assert "mark_diagnosed" in actions
         assert "fix_code" not in actions
         assert "restart_run" not in actions
-        assert "rollback_checkpoint" not in actions
     def test_fix_code_available_after_code_inspected(self):
         state = EpisodeState(code_inspected=True)
         actions = state.compute_available_actions()
@@ -105,11 +103,6 @@ class TestEpisodeState:
         actions = state.compute_available_actions()
         assert "restart_run" in actions
-    def test_rollback_available_after_restart(self):
-        state = EpisodeState(restart_after_fix=True)
-        actions = state.compute_available_actions()
-        assert "rollback_checkpoint" in actions
     def test_mark_diagnosed_disappears_after_submission(self):
         state = EpisodeState(diagnosis_submitted=True)
         actions = state.compute_available_actions()

         assert "mark_diagnosed" in actions
         assert "fix_code" not in actions
         assert "restart_run" not in actions
     def test_fix_code_available_after_code_inspected(self):
         state = EpisodeState(code_inspected=True)
         actions = state.compute_available_actions()
         actions = state.compute_available_actions()
         assert "restart_run" in actions
     def test_mark_diagnosed_disappears_after_submission(self):
         state = EpisodeState(diagnosis_submitted=True)
         actions = state.compute_available_actions()