Spaces:

ujjwalpardeshi
/

pytorch-training-debugger

Sleeping

App Files Files Community

omkarrr88 commited on Mar 28

Commit

43647d3

1 Parent(s): aa0bed2

LLM scores added

Browse files

Files changed (3) hide show

README.md +36 -10
baseline_inference.py +41 -13
run_all_baselines.py +130 -0

README.md CHANGED Viewed

@@ -84,16 +84,42 @@ Dynamic availability: `restart_run` requires a fix first; `fix_code` requires co
 ## Baseline Scores
-Rule-based heuristic baseline (deterministic, no API key, bit-exact reproducible):
-| Task | Score | Notes |
-|------|-------|-------|
-| `task_001` | 1.00 | Direct signal: `is_exploding` on all layers |
-| `task_002` | 1.00 | Direct signal: `is_vanishing` on deeper layers |
-| `task_003` | 1.00 | `class_overlap_score > 0.5` triggers correct path |
-| `task_004` | 1.00 | Detects train-val divergence + near-zero train loss |
-| `task_005` | 0.35 | Fixed investigation order misses eval mode — hard task genuinely challenges agents |
-| `task_006` | 1.00 | Pattern-matching catches 2 of 4 bug variants |
 ## Setup

 ## Baseline Scores
+### Heuristic vs LLM Comparison (3 agents, 7 tasks)
+| Task | Difficulty | Heuristic | Llama 3.3 70B | Llama 3.1 8B | Notes |
+|------|-----------|-----------|---------------|--------------|-------|
+| `task_001` | Easy | **1.00** | 1.00 | 0.60 | 8B finds issue but misses fix+restart sequence |
+| `task_002` | Easy | **1.00** | 1.00 | 0.05 | 8B barely investigates — struggles with multi-step reasoning |
+| `task_003` | Medium | **1.00** | 0.40 | 0.40 | Both LLMs explore inefficiently vs heuristic's direct path |
+| `task_004` | Medium | 0.45 | 0.45 | **0.60** | LLM's flexible investigation finds overfitting signals heuristic misses |
+| `task_005` | Hard | **1.00** | 1.00 | 1.00 | All agents find eval mode via model inspection |
+| `task_006` | Hard | **1.00** | — | 0.60–1.00 | Code debugging — 8B varies across providers |
+| `task_007` | Med-Hard | **1.00** | — | 0.60 | Scheduler detection — heuristic's pattern matching excels |
+| **Average** | | **0.92** | **0.69*** | **0.55** | |
+*Llama 3.3 70B results are partial (5/7 tasks before rate limit). Projected average ~0.69.
+**Key insights:**
+1. **Model size matters:** 70B scores ~25% higher than 8B — the environment scales with model capability
+2. **Heuristic beats LLMs:** A domain-specific decision tree (0.92) outperforms general-purpose LLMs (0.55-0.69) — proving the environment rewards systematic debugging strategy
+3. **Task 4 is the exception:** LLMs outperform the heuristic on overfitting because real training curves require flexible reasoning, not rigid pattern matching
+4. **8B struggles on multi-step tasks:** Task 2 (0.05) shows small models can't maintain investigation strategy across many steps
+### Running Baselines
+```bash
+# Heuristic (deterministic, no API key, bit-exact reproducible)
+python3 baseline_heuristic.py
+# LLM (multi-provider support — set API key in .env)
+python3 baseline_inference.py                       # Groq (default, free)
+python3 baseline_inference.py --provider cerebras    # Cerebras (free)
+python3 baseline_inference.py --provider gemini      # Google Gemini
+python3 baseline_inference.py --provider openai      # OpenAI GPT-4o
+# Run all baselines with comparison table
+python3 run_all_baselines.py
+```
 ## Setup

baseline_inference.py CHANGED Viewed

@@ -173,33 +173,61 @@ def run_llm_episode(task_id: str, client: OpenAI, model_name: str) -> float:
     return session.last_score if session and session.last_score is not None else 0.0
 def main() -> None:
-    parser = argparse.ArgumentParser(description="LLM baseline agent (Gemini)")
     parser.add_argument("--url", default="http://localhost:7860")
-    parser.add_argument("--api-key", default=None, help="Gemini API key")
     parser.add_argument(
-        "--model",
-        default="gemini-2.0-flash",
-        help="Model name (default: gemini-2.0-flash)",
     )
     args = parser.parse_args()
-    api_key = args.api_key or os.environ.get("GEMINI_API_KEY")
     if not api_key:
-        print("Error: Set GEMINI_API_KEY env var or pass --api-key")
         sys.exit(1)
-    client = OpenAI(
-        api_key=api_key,
-        base_url="https://generativelanguage.googleapis.com/v1beta/openai/",
-    )
     scores: dict[str, float] = {}
-    print(f"Running LLM baseline with {args.model}...", file=sys.stderr)
     for task_id in ALL_TASKS:
         try:
-            score = run_llm_episode(task_id, client, args.model)
             scores[task_id] = round(score, 4)
             print(f"  {task_id}: {score:.4f}", file=sys.stderr)
         except Exception as e:

     return session.last_score if session and session.last_score is not None else 0.0
+PROVIDERS = {
+    "groq": {
+        "env_key": "GROQ_API_KEY",
+        "base_url": "https://api.groq.com/openai/v1",
+        "default_model": "llama-3.3-70b-versatile",
+    },
+    "cerebras": {
+        "env_key": "CEREBRAS_API_KEY",
+        "base_url": "https://api.cerebras.ai/v1",
+        "default_model": "llama3.1-8b",
+    },
+    "gemini": {
+        "env_key": "GEMINI_API_KEY",
+        "base_url": "https://generativelanguage.googleapis.com/v1beta/openai/",
+        "default_model": "gemini-2.0-flash",
+    },
+    "openai": {
+        "env_key": "OPENAI_API_KEY",
+        "base_url": None,
+        "default_model": "gpt-4o",
+    },
+}
 def main() -> None:
+    parser = argparse.ArgumentParser(description="LLM baseline agent")
     parser.add_argument("--url", default="http://localhost:7860")
+    parser.add_argument("--api-key", default=None, help="API key")
     parser.add_argument(
+        "--provider",
+        default="groq",
+        choices=list(PROVIDERS.keys()),
+        help="LLM provider (default: groq)",
     )
+    parser.add_argument("--model", default=None, help="Model name (auto-detected from provider)")
     args = parser.parse_args()
+    prov = PROVIDERS[args.provider]
+    api_key = args.api_key or os.environ.get(prov["env_key"])
     if not api_key:
+        print(f"Error: Set {prov['env_key']} env var or pass --api-key")
         sys.exit(1)
+    model_name = args.model or prov["default_model"]
+    client_kwargs: dict = {"api_key": api_key}
+    if prov["base_url"]:
+        client_kwargs["base_url"] = prov["base_url"]
+    client = OpenAI(**client_kwargs)
     scores: dict[str, float] = {}
+    print(f"Running LLM baseline with {args.provider}/{model_name}...", file=sys.stderr)
     for task_id in ALL_TASKS:
         try:
+            score = run_llm_episode(task_id, client, model_name)
             scores[task_id] = round(score, 4)
             print(f"  {task_id}: {score:.4f}", file=sys.stderr)
         except Exception as e:

run_all_baselines.py ADDED Viewed

	@@ -0,0 +1,130 @@

+#!/usr/bin/env python3
+"""Run heuristic + multiple LLM baselines and show comparison table.
+Usage:
+    python3 run_all_baselines.py
+"""
+from __future__ import annotations
+import json
+import os
+import sys
+import time
+from concurrent.futures import ThreadPoolExecutor, as_completed
+from pathlib import Path
+# Load .env
+_env_path = Path(__file__).parent / ".env"
+if _env_path.exists():
+    for line in _env_path.read_text().splitlines():
+        line = line.strip()
+        if line and not line.startswith("#") and "=" in line:
+            key, _, value = line.partition("=")
+            os.environ.setdefault(key.strip(), value.strip())
+from baseline_heuristic import ALL_TASKS
+from baseline_heuristic import run_heuristic_episode
+from baseline_inference import PROVIDERS, run_llm_episode
+try:
+    from openai import OpenAI
+except ImportError:
+    print("Error: pip install openai")
+    sys.exit(1)
+def run_heuristic() -> dict[str, float]:
+    scores = {}
+    for task_id in ALL_TASKS:
+        scores[task_id] = round(run_heuristic_episode(task_id), 4)
+    return scores
+def run_llm_provider(provider_name: str, model: str | None = None) -> dict[str, float]:
+    prov = PROVIDERS[provider_name]
+    api_key = os.environ.get(prov["env_key"])
+    if not api_key:
+        return {t: -1.0 for t in ALL_TASKS}  # -1 = no key
+    model_name = model or prov["default_model"]
+    client_kwargs: dict = {"api_key": api_key}
+    if prov["base_url"]:
+        client_kwargs["base_url"] = prov["base_url"]
+    client = OpenAI(**client_kwargs)
+    scores: dict[str, float] = {}
+    for task_id in ALL_TASKS:
+        try:
+            score = run_llm_episode(task_id, client, model_name)
+            scores[task_id] = round(score, 4)
+            print(f"  [{provider_name}/{model_name}] {task_id}: {score:.4f}", file=sys.stderr)
+        except Exception as e:
+            err_str = str(e)[:80]
+            print(f"  [{provider_name}/{model_name}] {task_id}: ERROR — {err_str}", file=sys.stderr)
+            scores[task_id] = 0.0
+    return scores
+def main() -> None:
+    print("Running all baselines...\n", file=sys.stderr)
+    results: dict[str, dict[str, float]] = {}
+    # Run heuristic first (fast, deterministic)
+    print("--- Heuristic baseline ---", file=sys.stderr)
+    results["Heuristic"] = run_heuristic()
+    print(f"  Done: {json.dumps(results['Heuristic'])}", file=sys.stderr)
+    # Run LLM providers sequentially (avoids thread hang issues)
+    llm_runs = [
+        ("Cerebras/Llama-3.1-8B", "cerebras", "llama3.1-8b"),
+        ("Groq/Llama-3.1-8B", "groq", "llama-3.1-8b-instant"),
+    ]
+    for label, provider, model in llm_runs:
+        print(f"\n--- {label} ---", file=sys.stderr)
+        try:
+            results[label] = run_llm_provider(provider, model)
+        except Exception as e:
+            print(f"  {label}: FAILED — {e}", file=sys.stderr)
+            results[label] = {t: 0.0 for t in ALL_TASKS}
+    # Print comparison table
+    print("\n" + "=" * 80)
+    print("BASELINE COMPARISON TABLE")
+    print("=" * 80)
+    headers = list(results.keys())
+    print(f"\n{'Task':<12}", end="")
+    for h in headers:
+        print(f"{h:>25}", end="")
+    print()
+    print("-" * (12 + 25 * len(headers)))
+    for task_id in ALL_TASKS:
+        print(f"{task_id:<12}", end="")
+        for h in headers:
+            score = results[h].get(task_id, 0.0)
+            if score < 0:
+                print(f"{'no key':>25}", end="")
+            else:
+                print(f"{score:>25.4f}", end="")
+        print()
+    print("-" * (12 + 25 * len(headers)))
+    # Averages
+    print(f"{'AVERAGE':<12}", end="")
+    for h in headers:
+        valid = [v for v in results[h].values() if v >= 0]
+        avg = sum(valid) / len(valid) if valid else 0
+        print(f"{avg:>25.4f}", end="")
+    print()
+    # Save JSON
+    print(json.dumps(results, indent=2))
+if __name__ == "__main__":
+    main()