Spaces:
Sleeping
Sleeping
Upload folder using huggingface_hub
Browse files
- Dockerfile +43 -0
- README.md +115 -5
- ROOT_CAUSE_VISIBLE_PLAN.md +332 -0
- __init__.py +10 -0
- client.py +56 -0
- models.py +41 -0
- openenv.yaml +6 -0
- openenv_stack_doctor.egg-info/PKG-INFO +9 -0
- openenv_stack_doctor.egg-info/SOURCES.txt +16 -0
- openenv_stack_doctor.egg-info/dependency_links.txt +1 -0
- openenv_stack_doctor.egg-info/entry_points.txt +2 -0
- openenv_stack_doctor.egg-info/requires.txt +5 -0
- openenv_stack_doctor.egg-info/top_level.txt +1 -0
- pyproject.toml +26 -0
- server/__init__.py +6 -0
- server/app.py +41 -0
- server/baselines.py +203 -0
- server/requirements.txt +6 -0
- server/scenarios.py +1893 -0
- server/stack_doctor_environment.py +269 -0
- server/stack_doctor_mcp.py +393 -0
- training/Dockerfile +39 -0
- training/__init__.py +0 -0
- training/eval_stack_doctor.py +143 -0
- training/train_stack_doctor.py +311 -0
- uv.lock +0 -0
Dockerfile
ADDED
|
@@ -0,0 +1,43 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
# Stack Doctor — OpenEnv Environment
# Standard pattern from OpenEnv docs (slide 11)

ARG BASE_IMAGE=ghcr.io/meta-pytorch/openenv-base:latest
FROM ${BASE_IMAGE} AS builder

WORKDIR /app

# Ensure git is available for VCS dependencies
RUN apt-get update && \
    apt-get install -y --no-install-recommends git && \
    rm -rf /var/lib/apt/lists/*

COPY . /app/env
WORKDIR /app/env

# Ensure uv is available
RUN if ! command -v uv >/dev/null 2>&1; then \
        curl -LsSf https://astral.sh/uv/install.sh | sh && \
        mv /root/.local/bin/uv /usr/local/bin/uv && \
        mv /root/.local/bin/uvx /usr/local/bin/uvx; \
    fi

# Install dependencies — try frozen first, fall back to fresh resolve
RUN --mount=type=cache,target=/root/.cache/uv \
    uv sync --frozen --no-editable 2>/dev/null || uv sync --no-editable

# Final runtime stage
FROM ${BASE_IMAGE}

WORKDIR /app

# FIX: keep the venv at the SAME absolute path it was created at
# (/app/env/.venv). uv writes absolute-path shebangs into the venv's
# console scripts (e.g. bin/uvicorn); the previous relocation to
# /app/.venv left those shebangs pointing at a path that no longer
# exists ("bad interpreter"). Copying the whole project tree brings
# the venv along at its original location.
COPY --from=builder /app/env /app/env

ENV PATH="/app/env/.venv/bin:$PATH"
ENV PYTHONPATH="/app/env:$PYTHONPATH"

# NOTE(review): assumes curl is present in the runtime image — confirm
# for ghcr.io/meta-pytorch/openenv-base before relying on this check.
HEALTHCHECK --interval=30s --timeout=3s --start-period=5s --retries=3 \
    CMD curl -f http://localhost:8000/health || exit 1

ENV ENABLE_WEB_INTERFACE=true
CMD ["sh", "-c", "cd /app/env && uvicorn server.app:app --host 0.0.0.0 --port 8000"]
|
README.md
CHANGED
|
@@ -1,10 +1,120 @@
|
|
| 1 |
---
|
| 2 |
-
title: Stack Doctor
|
| 3 |
-
emoji:
|
| 4 |
-
colorFrom:
|
| 5 |
-
colorTo:
|
| 6 |
sdk: docker
|
| 7 |
pinned: false
|
|
|
|
|
|
|
|
|
|
|
|
|
| 8 |
---
|
| 9 |
|
| 10 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
---
|
| 2 |
+
title: Stack Doctor Environment Server
|
| 3 |
+
emoji: 🩺
|
| 4 |
+
colorFrom: red
|
| 5 |
+
colorTo: blue
|
| 6 |
sdk: docker
|
| 7 |
pinned: false
|
| 8 |
+
app_port: 8000
|
| 9 |
+
base_path: /web
|
| 10 |
+
tags:
|
| 11 |
+
- openenv
|
| 12 |
---
|
| 13 |
|
| 14 |
+
# Stack Doctor
|
| 15 |
+
|
| 16 |
+
An OpenEnv RL environment where an overseer LLM diagnoses sick inference stacks. The agent probes subsystems, reconciles conflicting specialist-agent reports (some of which are wrong), and selects the minimal correct fix — all within a 6-step budget.
|
| 17 |
+
|
| 18 |
+
Inspired by real SM12x enablement bugs across vLLM, FlashInfer, SGLang, CUTLASS, and Flash-Attention.
|
| 19 |
+
|
| 20 |
+
**Track**: Statement 3.1 — World Modeling / Professional Tasks
|
| 21 |
+
**Sub-theme**: Fleet AI — Scalable Oversight Agents ($10K)
|
| 22 |
+
|
| 23 |
+
## Quick Start
|
| 24 |
+
|
| 25 |
+
```python
|
| 26 |
+
from stack_doctor import StackDoctorEnv, StackDoctorAction
|
| 27 |
+
import json
|
| 28 |
+
|
| 29 |
+
env = StackDoctorEnv(base_url="https://bledden-stack-doctor.hf.space")
|
| 30 |
+
env.connect()
|
| 31 |
+
|
| 32 |
+
# Start a new incident
|
| 33 |
+
result = env.reset()
|
| 34 |
+
print(result.observation.incident_ticket)
|
| 35 |
+
print(result.observation.specialist_opinions)
|
| 36 |
+
|
| 37 |
+
# Investigate
|
| 38 |
+
result = env.step(StackDoctorAction(message=json.dumps(
|
| 39 |
+
{"type": "inspect", "target": "logs"}
|
| 40 |
+
)))
|
| 41 |
+
print(result.observation.output)
|
| 42 |
+
|
| 43 |
+
# Submit diagnosis
|
| 44 |
+
result = env.step(StackDoctorAction(message=json.dumps(
|
| 45 |
+
{"type": "submit", "root_cause": "arch_guard", "fix": "relax_arch_check"}
|
| 46 |
+
)))
|
| 47 |
+
print(f"Reward: {result.reward}, Done: {result.done}")
|
| 48 |
+
|
| 49 |
+
env.close()
|
| 50 |
+
```
|
| 51 |
+
|
| 52 |
+
## Environment Design
|
| 53 |
+
|
| 54 |
+
### Root Causes (6) and Fixes (6)
|
| 55 |
+
|
| 56 |
+
| Root Cause | Fix | Real-World Motif |
|
| 57 |
+
|-----------|-----|-----------------|
|
| 58 |
+
| `arch_guard` | `relax_arch_check` | FlashInfer SM121 capability checks |
|
| 59 |
+
| `backend_whitelist` | `add_whitelist_entry` | vLLM Marlin SM121+ whitelist gaps |
|
| 60 |
+
| `runtime_loader` | `fix_runtime_path` | SGLang CUDA 13 runtime issues |
|
| 61 |
+
| `backend_selector` | `switch_backend` | CUTLASS dispatch mistakes |
|
| 62 |
+
| `model_config` | `update_model_config` | Model config mismatches on new hardware |
|
| 63 |
+
| `weight_layout` | `fix_weight_mapping` | Weight layout problems across backends |
|
| 64 |
+
|
| 65 |
+
### Specialists (4)
|
| 66 |
+
|
| 67 |
+
`runtime`, `dispatch`, `kernel`, `loader` — at least one gives wrong advice per scenario.
|
| 68 |
+
|
| 69 |
+
### Action Space (JSON)
|
| 70 |
+
|
| 71 |
+
```json
|
| 72 |
+
{"type":"inspect","target":"logs|config|snippet|metrics"}
|
| 73 |
+
{"type":"ask_specialist","specialist":"runtime|dispatch|kernel|loader"}
|
| 74 |
+
{"type":"apply_fix","fix":"<one of 6 fixes>"}
|
| 75 |
+
{"type":"submit","root_cause":"<one of 6>","fix":"<one of 6>"}
|
| 76 |
+
```
|
| 77 |
+
|
| 78 |
+
### Reward Function
|
| 79 |
+
|
| 80 |
+
| Event | Reward |
|
| 81 |
+
|-------|--------|
|
| 82 |
+
| `inspect` or `ask_specialist` | -0.25 |
|
| 83 |
+
| Correct `apply_fix` | +3 |
|
| 84 |
+
| Wrong `apply_fix` | -2 |
|
| 85 |
+
| Correct `submit` (per field) | +8 |
|
| 86 |
+
| Wrong `submit` (per field) | -4 |
|
| 87 |
+
| Solved in ≤4 steps | +2 bonus |
|
| 88 |
+
| Invalid action | -2 |
|
| 89 |
+
|
| 90 |
+
### Baselines
|
| 91 |
+
|
| 92 |
+
| Policy | RC Accuracy | Fix Accuracy | Avg Steps | Avg Reward |
|
| 93 |
+
|--------|:-:|:-:|:-:|:-:|
|
| 94 |
+
| Oracle | 100% | 100% | 1.0 | 18.0 |
|
| 95 |
+
| Heuristic | 100% | 100% | 4.0 | 20.5 |
|
| 96 |
+
| Random | 18% | 18% | 3.2 | -4.1 |
|
| 97 |
+
|
| 98 |
+
## Fleet AI: Specialist Oversight
|
| 99 |
+
|
| 100 |
+
The core mechanic that targets Fleet AI's $10K sub-theme: the agent must act as a **scalable oversight agent** that reconciles conflicting specialist reports. Specialists have per-scenario reliability — the agent cannot learn "always trust specialist X" and must evaluate evidence on each case.
|
| 101 |
+
|
| 102 |
+
## Training
|
| 103 |
+
|
| 104 |
+
Uses Unsloth + TRL GRPO with 3 reward signals:
|
| 105 |
+
1. **Valid JSON** — can the output be parsed as an action plan?
|
| 106 |
+
2. **Environment reward** — cumulative reward from executing the plan
|
| 107 |
+
3. **Efficiency** — bonus for shorter plans that still submit correctly
|
| 108 |
+
|
| 109 |
+
## Development
|
| 110 |
+
|
| 111 |
+
```bash
|
| 112 |
+
# Local server
|
| 113 |
+
cd stack_doctor && PYTHONPATH=. uvicorn server.app:app --port 8000
|
| 114 |
+
|
| 115 |
+
# Run baselines
|
| 116 |
+
PYTHONPATH=. python3 -c "from server.baselines import *; ..."
|
| 117 |
+
|
| 118 |
+
# Deploy to HF Spaces
|
| 119 |
+
openenv push --repo-id bledden/stack-doctor
|
| 120 |
+
```
|
ROOT_CAUSE_VISIBLE_PLAN.md
ADDED
|
@@ -0,0 +1,332 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
# Root-Cause-Visible Stack Doctor Plan
|
| 2 |
+
|
| 3 |
+
## Summary
|
| 4 |
+
|
| 5 |
+
This proposal adds a second mode to Stack Doctor where the agent is told the true root cause at the start of the episode.
|
| 6 |
+
|
| 7 |
+
Instead of diagnosing from noisy evidence and conflicting specialists, the agent's job becomes:
|
| 8 |
+
|
| 9 |
+
1. Validate the known root cause with the minimum useful evidence.
|
| 10 |
+
2. Choose the correct and safest fix.
|
| 11 |
+
3. Apply or recommend the fix.
|
| 12 |
+
4. Submit a short operational justification.
|
| 13 |
+
|
| 14 |
+
This makes the environment easier to explain in a hackathon setting while keeping it meaningfully interactive.
|
| 15 |
+
|
| 16 |
+
## Recommendation
|
| 17 |
+
|
| 18 |
+
Do **not** replace the current Stack Doctor environment entirely.
|
| 19 |
+
|
| 20 |
+
Instead, support two modes:
|
| 21 |
+
|
| 22 |
+
- `blind_diagnosis`: current mode, where the agent must infer the root cause from imperfect evidence.
|
| 23 |
+
- `root_cause_visible`: new mode, where the root cause is given and the task becomes evidence-based remediation.
|
| 24 |
+
|
| 25 |
+
Reason:
|
| 26 |
+
|
| 27 |
+
- The current mode is stronger as an oversight benchmark.
|
| 28 |
+
- The new mode is cleaner and easier for judges to understand quickly.
|
| 29 |
+
- Having both lets us tell a better story: "same incident world, two difficulty levels."
|
| 30 |
+
|
| 31 |
+
## Why Change It
|
| 32 |
+
|
| 33 |
+
The current environment is a valid RL environment, but it can look messy to people seeing it for the first time because:
|
| 34 |
+
|
| 35 |
+
- specialist opinions can be wrong
|
| 36 |
+
- the agent has to infer latent state
|
| 37 |
+
- the reward mixes diagnosis quality with investigation efficiency
|
| 38 |
+
|
| 39 |
+
Giving the root cause up front removes the hardest-to-explain part of the setup and shifts the task toward operational decision-making:
|
| 40 |
+
|
| 41 |
+
- What evidence should I verify before acting?
|
| 42 |
+
- Which fix is safest and most minimal?
|
| 43 |
+
- How much investigation is enough?
|
| 44 |
+
- Can I justify the rollout clearly?
|
| 45 |
+
|
| 46 |
+
That is still a good agent task. It is just a different one.
|
| 47 |
+
|
| 48 |
+
## New Product Framing
|
| 49 |
+
|
| 50 |
+
Position the new mode as:
|
| 51 |
+
|
| 52 |
+
**"An incident commander agent that receives a probable root cause from upstream monitoring and must validate, remediate, and explain the fix."**
|
| 53 |
+
|
| 54 |
+
This framing is cleaner than "the model magically knows everything," because it implies:
|
| 55 |
+
|
| 56 |
+
- another system or monitor identified the likely root cause
|
| 57 |
+
- Stack Doctor is responsible for safe execution, not initial detection
|
| 58 |
+
|
| 59 |
+
## Environment Changes
|
| 60 |
+
|
| 61 |
+
### 1. Observation Schema
|
| 62 |
+
|
| 63 |
+
Add a field to the initial observation:
|
| 64 |
+
|
| 65 |
+
```json
|
| 66 |
+
{
|
| 67 |
+
"known_root_cause": "runtime_loader"
|
| 68 |
+
}
|
| 69 |
+
```
|
| 70 |
+
|
| 71 |
+
Recommended additions:
|
| 72 |
+
|
| 73 |
+
- `known_root_cause`
|
| 74 |
+
- `mode`
|
| 75 |
+
- optional `recommended_fix_family` if we want a very easy demo mode later
|
| 76 |
+
|
| 77 |
+
In `root_cause_visible` mode, the reset observation should explicitly say:
|
| 78 |
+
|
| 79 |
+
> Root cause has been pre-identified. Validate it, choose the minimal safe fix, and submit.
|
| 80 |
+
|
| 81 |
+
### 2. Action Space
|
| 82 |
+
|
| 83 |
+
Keep the action space mostly the same to minimize changes:
|
| 84 |
+
|
| 85 |
+
- `inspect`
|
| 86 |
+
- `ask_specialist`
|
| 87 |
+
- `apply_fix`
|
| 88 |
+
- `submit`
|
| 89 |
+
|
| 90 |
+
But change the meaning of `submit`.
|
| 91 |
+
|
| 92 |
+
### Current `submit`
|
| 93 |
+
|
| 94 |
+
The agent submits:
|
| 95 |
+
|
| 96 |
+
- `root_cause`
|
| 97 |
+
- `fix`
|
| 98 |
+
|
| 99 |
+
### Proposed `submit`
|
| 100 |
+
|
| 101 |
+
The agent submits:
|
| 102 |
+
|
| 103 |
+
- `fix`
|
| 104 |
+
- `evidence`
|
| 105 |
+
- `justification`
|
| 106 |
+
|
| 107 |
+
Suggested JSON:
|
| 108 |
+
|
| 109 |
+
```json
|
| 110 |
+
{
|
| 111 |
+
"type": "submit",
|
| 112 |
+
"fix": "fix_runtime_path",
|
| 113 |
+
"evidence": ["logs", "config"],
|
| 114 |
+
"justification": "CUDA 13 is installed, but LD_LIBRARY_PATH still points to cuda-12."
|
| 115 |
+
}
|
| 116 |
+
```
|
| 117 |
+
|
| 118 |
+
If backward compatibility matters, keep `root_cause` in the schema but ignore scoring for it in `root_cause_visible` mode.
|
| 119 |
+
|
| 120 |
+
### 3. Specialists
|
| 121 |
+
|
| 122 |
+
In the new mode, specialists should no longer be the center of the task.
|
| 123 |
+
|
| 124 |
+
Recommended options:
|
| 125 |
+
|
| 126 |
+
- keep specialists, but make them supportive rather than adversarial
|
| 127 |
+
- reduce emphasis on conflicting specialist opinions
|
| 128 |
+
- use specialists mainly for implementation details and risk checks
|
| 129 |
+
|
| 130 |
+
Example:
|
| 131 |
+
|
| 132 |
+
- `runtime`: confirms the path mismatch
|
| 133 |
+
- `dispatch`: says whether dispatch will recover after the fix
|
| 134 |
+
- `loader`: clarifies whether a restart is needed
|
| 135 |
+
|
| 136 |
+
This makes the environment feel less noisy without removing interactivity.
|
| 137 |
+
|
| 138 |
+
### 4. Reward Redesign
|
| 139 |
+
|
| 140 |
+
If the root cause is visible, the current reward design should change. The agent should no longer get major reward for naming the diagnosis correctly.
|
| 141 |
+
|
| 142 |
+
### Proposed reward priorities
|
| 143 |
+
|
| 144 |
+
1. Correct fix selection
|
| 145 |
+
2. Minimal useful investigation
|
| 146 |
+
3. Safe behavior
|
| 147 |
+
4. Clear justification
|
| 148 |
+
|
| 149 |
+
### Example reward table
|
| 150 |
+
|
| 151 |
+
| Event | Reward |
|
| 152 |
+
|---|---:|
|
| 153 |
+
| `inspect` or `ask_specialist` | -0.25 |
|
| 154 |
+
| relevant evidence inspected | +0.5 |
|
| 155 |
+
| irrelevant or redundant evidence | 0 |
|
| 156 |
+
| correct `apply_fix` | +4 |
|
| 157 |
+
| wrong `apply_fix` | -4 |
|
| 158 |
+
| correct `submit.fix` | +10 |
|
| 159 |
+
| wrong `submit.fix` | -6 |
|
| 160 |
+
| concise valid justification | +1 |
|
| 161 |
+
| solved in `<= 4` steps | +2 |
|
| 162 |
+
| unsafe sequence or invalid action | -2 to -4 |
|
| 163 |
+
|
| 164 |
+
Key point: in this mode, the skill is not "guess the cause." The skill is "verify enough, then act correctly."
|
| 165 |
+
|
| 166 |
+
### 5. Success Criteria
|
| 167 |
+
|
| 168 |
+
The policy should be judged on:
|
| 169 |
+
|
| 170 |
+
- fix accuracy
|
| 171 |
+
- average steps
|
| 172 |
+
- evidence efficiency
|
| 173 |
+
- justification quality
|
| 174 |
+
- avoidable bad interventions
|
| 175 |
+
|
| 176 |
+
Optional additional metric:
|
| 177 |
+
|
| 178 |
+
- `evidence_precision`: fraction of inspected items that were actually relevant
|
| 179 |
+
|
| 180 |
+
This gives a more legible evaluation story than pure diagnosis accuracy.
|
| 181 |
+
|
| 182 |
+
## Repo Changes
|
| 183 |
+
|
| 184 |
+
### 1. `models.py`
|
| 185 |
+
|
| 186 |
+
Add new observation fields:
|
| 187 |
+
|
| 188 |
+
- `known_root_cause: str = ""`
|
| 189 |
+
- `mode: str = "blind_diagnosis"`
|
| 190 |
+
|
| 191 |
+
Potentially add:
|
| 192 |
+
|
| 193 |
+
- `recommended_fix_family: str = ""`
|
| 194 |
+
|
| 195 |
+
### 2. `server/stack_doctor_environment.py`
|
| 196 |
+
|
| 197 |
+
Add a reset kwarg:
|
| 198 |
+
|
| 199 |
+
```python
|
| 200 |
+
mode = kwargs.get("mode", "blind_diagnosis")
|
| 201 |
+
```
|
| 202 |
+
|
| 203 |
+
Implementation steps:
|
| 204 |
+
|
| 205 |
+
- store the mode on episode state
|
| 206 |
+
- include `known_root_cause` in reset observation when mode is `root_cause_visible`
|
| 207 |
+
- branch reward logic inside `_handle_submit`
|
| 208 |
+
- optionally branch specialist behavior to be less misleading
|
| 209 |
+
- keep existing default behavior unchanged
|
| 210 |
+
|
| 211 |
+
### 3. `server/scenarios.py`
|
| 212 |
+
|
| 213 |
+
No structural rewrite is required.
|
| 214 |
+
|
| 215 |
+
Small recommended additions:
|
| 216 |
+
|
| 217 |
+
- tag which inspect targets are most probative for each scenario
|
| 218 |
+
- tag which specialist follow-ups are useful vs distracting
|
| 219 |
+
- optionally define a `minimal_evidence` set per scenario
|
| 220 |
+
|
| 221 |
+
This will help score validation quality in the new mode.
|
| 222 |
+
|
| 223 |
+
### 4. `training/train_stack_doctor.py`
|
| 224 |
+
|
| 225 |
+
Add a second training prompt for `root_cause_visible` mode.
|
| 226 |
+
|
| 227 |
+
The prompt should tell the model:
|
| 228 |
+
|
| 229 |
+
- the root cause is already known
|
| 230 |
+
- do not waste steps proving obvious facts
|
| 231 |
+
- verify the highest-value evidence
|
| 232 |
+
- choose the safest correct fix
|
| 233 |
+
- submit a short justification
|
| 234 |
+
|
| 235 |
+
Also update reward functions to score:
|
| 236 |
+
|
| 237 |
+
- correct fix choice
|
| 238 |
+
- evidence use
|
| 239 |
+
- step efficiency
|
| 240 |
+
- valid justification text
|
| 241 |
+
|
| 242 |
+
### 5. `training/eval_stack_doctor.py`
|
| 243 |
+
|
| 244 |
+
Add mode-aware evaluation metrics:
|
| 245 |
+
|
| 246 |
+
- `fix_accuracy`
|
| 247 |
+
- `avg_steps`
|
| 248 |
+
- `avg_reward`
|
| 249 |
+
- `evidence_precision`
|
| 250 |
+
- `justification_pass_rate`
|
| 251 |
+
|
| 252 |
+
### 6. `README.md`
|
| 253 |
+
|
| 254 |
+
Update the README to explain both modes:
|
| 255 |
+
|
| 256 |
+
- what each mode is testing
|
| 257 |
+
- why both matter
|
| 258 |
+
- which one is easiest to demo to judges
|
| 259 |
+
|
| 260 |
+
## Demo Story
|
| 261 |
+
|
| 262 |
+
Recommended demo sequence:
|
| 263 |
+
|
| 264 |
+
1. Show one `root_cause_visible` episode first.
|
| 265 |
+
2. Explain that upstream monitoring identified the likely cause.
|
| 266 |
+
3. Let Stack Doctor inspect 1-2 evidence sources, choose the fix, and justify it.
|
| 267 |
+
4. Then mention that the same environment also supports the harder `blind_diagnosis` mode.
|
| 268 |
+
|
| 269 |
+
This makes the system understandable in under a minute.
|
| 270 |
+
|
| 271 |
+
## Risks
|
| 272 |
+
|
| 273 |
+
### Risk 1: Too easy
|
| 274 |
+
|
| 275 |
+
If the root cause is visible and the only remaining task is mapping root cause to fix, the environment becomes trivial.
|
| 276 |
+
|
| 277 |
+
Mitigation:
|
| 278 |
+
|
| 279 |
+
- make evidence validation matter
|
| 280 |
+
- score fix safety and justification
|
| 281 |
+
- include cases where multiple fixes are plausible but only one is minimal
|
| 282 |
+
|
| 283 |
+
### Risk 2: Loses the best part of the current project
|
| 284 |
+
|
| 285 |
+
The current environment's most differentiated feature is conflicting specialist oversight.
|
| 286 |
+
|
| 287 |
+
Mitigation:
|
| 288 |
+
|
| 289 |
+
- keep current mode
|
| 290 |
+
- present `root_cause_visible` as a simpler companion mode, not a replacement
|
| 291 |
+
|
| 292 |
+
### Risk 3: Becomes a static classification problem again
|
| 293 |
+
|
| 294 |
+
If the model can submit immediately with no downside, the interaction disappears.
|
| 295 |
+
|
| 296 |
+
Mitigation:
|
| 297 |
+
|
| 298 |
+
- require evidence references in `submit`
|
| 299 |
+
- reward minimal but real validation
|
| 300 |
+
- penalize unsupported submissions
|
| 301 |
+
|
| 302 |
+
## MVP Scope
|
| 303 |
+
|
| 304 |
+
For a hackathon-friendly implementation, do only this:
|
| 305 |
+
|
| 306 |
+
1. Add `mode` and `known_root_cause` to the observation.
|
| 307 |
+
2. Branch scoring so `submit` is mostly about the fix in `root_cause_visible` mode.
|
| 308 |
+
3. Require a short justification string in submit.
|
| 309 |
+
4. Update the training prompt and evaluation script.
|
| 310 |
+
5. Update the README and demo flow.
|
| 311 |
+
|
| 312 |
+
This is enough to tell the story cleanly without rewriting the whole project.
|
| 313 |
+
|
| 314 |
+
## Stretch Scope
|
| 315 |
+
|
| 316 |
+
If there is extra time:
|
| 317 |
+
|
| 318 |
+
- add `minimal_evidence` scoring per scenario
|
| 319 |
+
- add safe-vs-risky fix tradeoffs
|
| 320 |
+
- generate a postmortem note at the end of the episode
|
| 321 |
+
- support multi-incident scheduling where root cause is known but resources are limited
|
| 322 |
+
|
| 323 |
+
## Final Recommendation
|
| 324 |
+
|
| 325 |
+
Proceed with a **dual-mode** design.
|
| 326 |
+
|
| 327 |
+
That gives the team two benefits:
|
| 328 |
+
|
| 329 |
+
- a cleaner, easier-to-pitch hackathon demo with `root_cause_visible`
|
| 330 |
+
- a stronger long-term benchmark with `blind_diagnosis`
|
| 331 |
+
|
| 332 |
+
If we collapse entirely to "the agent sees the true root cause," the project becomes easier to explain but materially less differentiated. The best version is to keep both and present them as two levels of the same environment.
|
__init__.py
ADDED
|
@@ -0,0 +1,10 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
"""Stack Doctor Environment."""
|
| 2 |
+
|
| 3 |
+
from .client import StackDoctorEnv
|
| 4 |
+
from .models import StackDoctorAction, StackDoctorObservation
|
| 5 |
+
|
| 6 |
+
__all__ = [
|
| 7 |
+
"StackDoctorAction",
|
| 8 |
+
"StackDoctorObservation",
|
| 9 |
+
"StackDoctorEnv",
|
| 10 |
+
]
|
client.py
ADDED
|
@@ -0,0 +1,56 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
"""Stack Doctor Client."""
|
| 2 |
+
|
| 3 |
+
from typing import Dict
|
| 4 |
+
|
| 5 |
+
from openenv.core.client_types import StepResult
|
| 6 |
+
from openenv.core.env_server.types import State
|
| 7 |
+
from openenv.core import EnvClient
|
| 8 |
+
|
| 9 |
+
from .models import StackDoctorAction, StackDoctorObservation
|
| 10 |
+
|
| 11 |
+
|
| 12 |
+
class StackDoctorEnv(EnvClient[StackDoctorAction, StackDoctorObservation, State]):
    """
    HTTP client for the Stack Doctor Environment server.

    Example:
        >>> env = StackDoctorEnv(base_url="http://localhost:8000")
        >>> env.connect()
        >>> result = env.reset()
        >>> print(result.observation.incident_ticket)
        >>> result = env.step(StackDoctorAction(message='{"type":"inspect","target":"logs"}'))
        >>> print(result.observation.output)
        >>> env.close()
    """

    def _step_payload(self, action: StackDoctorAction) -> Dict:
        # The server expects the raw JSON action string under "message".
        return {"message": action.message}

    def _parse_result(self, payload: Dict) -> StepResult[StackDoctorObservation]:
        # Deserialize one reset/step response into a typed StepResult.
        data = payload.get("observation", {})
        done = payload.get("done", False)
        reward = payload.get("reward")
        obs = StackDoctorObservation(
            output=data.get("output", ""),
            incident_ticket=data.get("incident_ticket", ""),
            hardware=data.get("hardware", ""),
            model_name=data.get("model_name", ""),
            backend=data.get("backend", ""),
            log_excerpt=data.get("log_excerpt", ""),
            code_snippet=data.get("code_snippet", ""),
            specialist_opinions=data.get("specialist_opinions", {}),
            steps_remaining=data.get("steps_remaining", 0),
            fix_used=data.get("fix_used", False),
            # done/reward are mirrored onto the observation as well as the
            # StepResult, matching the server payload layout.
            done=done,
            reward=reward,
            metadata=data.get("metadata", {}),
        )
        return StepResult(observation=obs, reward=reward, done=done)

    def _parse_state(self, payload: Dict) -> State:
        # Server-side episode bookkeeping; defaults mirror a fresh episode.
        return State(
            episode_id=payload.get("episode_id"),
            step_count=payload.get("step_count", 0),
        )
|
models.py
ADDED
|
@@ -0,0 +1,41 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
"""
|
| 2 |
+
Data models for the Stack Doctor Environment.
|
| 3 |
+
|
| 4 |
+
An overseer LLM diagnoses sick inference stacks by probing subsystems,
|
| 5 |
+
reconciling conflicting specialist-agent reports, and selecting the
|
| 6 |
+
minimal correct fix.
|
| 7 |
+
"""
|
| 8 |
+
|
| 9 |
+
from pydantic import Field
|
| 10 |
+
|
| 11 |
+
from openenv.core.env_server.types import Action, Observation
|
| 12 |
+
|
| 13 |
+
|
| 14 |
+
class StackDoctorAction(Action):
    """Agent action — a JSON message selecting one of 4 action types."""

    # Single free-form string the server parses as JSON.
    # NOTE(review): malformed JSON / unknown "type" is presumably rejected
    # and penalized server-side — confirm in server/stack_doctor_environment.py.
    message: str = Field(
        ...,
        description=(
            'JSON action. One of:\n'
            ' {"type":"inspect","target":"logs|config|snippet|metrics"}\n'
            ' {"type":"ask_specialist","specialist":"runtime|dispatch|kernel|loader"}\n'
            ' {"type":"apply_fix","fix":"relax_arch_check|add_whitelist_entry|fix_runtime_path|switch_backend|update_model_config|fix_weight_mapping"}\n'
            ' {"type":"submit","root_cause":"...","fix":"...","justification":"..."}'
        ),
    )
|
| 27 |
+
|
| 28 |
+
|
| 29 |
+
class StackDoctorObservation(Observation):
    """What the agent sees after each action."""

    # NOTE(review): client.py also passes done/reward/metadata when building
    # this model — those fields are assumed to be inherited from the
    # Observation base class; confirm against openenv.core.env_server.types.

    # Natural-language feedback for the most recent action.
    output: str = Field(default="", description="Natural-language feedback")
    # Static incident context, populated at reset.
    incident_ticket: str = Field(default="", description="The incident description")
    hardware: str = Field(default="", description="Hardware identifier")
    model_name: str = Field(default="", description="Model being served")
    backend: str = Field(default="", description="Inference backend in use")
    # Evidence surfaces the agent can request via "inspect" actions.
    log_excerpt: str = Field(default="", description="Log snippet")
    code_snippet: str = Field(default="", description="Config or code snippet")
    # Mapping of specialist name -> report; at least one specialist is
    # wrong per scenario (per README), so these may conflict.
    specialist_opinions: dict = Field(default_factory=dict, description="Specialist name -> {opinion, confidence}")
    # Budget tracking: episodes start with a 6-step budget.
    steps_remaining: int = Field(default=6, description="Steps left in episode")
    # True once the single apply_fix action has been spent.
    fix_used: bool = Field(default=False, description="Whether apply_fix has been used")
|
openenv.yaml
ADDED
|
@@ -0,0 +1,6 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
spec_version: 1
|
| 2 |
+
name: stack_doctor
|
| 3 |
+
type: space
|
| 4 |
+
runtime: fastapi
|
| 5 |
+
app: server.app:app
|
| 6 |
+
port: 8000
|
openenv_stack_doctor.egg-info/PKG-INFO
ADDED
|
@@ -0,0 +1,9 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
Metadata-Version: 2.4
|
| 2 |
+
Name: openenv-stack-doctor
|
| 3 |
+
Version: 0.1.0
|
| 4 |
+
Summary: Stack Doctor: an RL environment for diagnosing inference-stack incidents
|
| 5 |
+
Requires-Python: >=3.10
|
| 6 |
+
Requires-Dist: openenv-core[core]>=0.2.0
|
| 7 |
+
Provides-Extra: dev
|
| 8 |
+
Requires-Dist: pytest>=8.0.0; extra == "dev"
|
| 9 |
+
Requires-Dist: pytest-cov>=4.0.0; extra == "dev"
|
openenv_stack_doctor.egg-info/SOURCES.txt
ADDED
|
@@ -0,0 +1,16 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
README.md
|
| 2 |
+
pyproject.toml
|
| 3 |
+
./__init__.py
|
| 4 |
+
./client.py
|
| 5 |
+
./models.py
|
| 6 |
+
openenv_stack_doctor.egg-info/PKG-INFO
|
| 7 |
+
openenv_stack_doctor.egg-info/SOURCES.txt
|
| 8 |
+
openenv_stack_doctor.egg-info/dependency_links.txt
|
| 9 |
+
openenv_stack_doctor.egg-info/entry_points.txt
|
| 10 |
+
openenv_stack_doctor.egg-info/requires.txt
|
| 11 |
+
openenv_stack_doctor.egg-info/top_level.txt
|
| 12 |
+
server/__init__.py
|
| 13 |
+
server/app.py
|
| 14 |
+
server/baselines.py
|
| 15 |
+
server/scenarios.py
|
| 16 |
+
server/stack_doctor_environment.py
|
openenv_stack_doctor.egg-info/dependency_links.txt
ADDED
|
@@ -0,0 +1 @@
|
|
|
|
|
|
|
| 1 |
+
|
openenv_stack_doctor.egg-info/entry_points.txt
ADDED
|
@@ -0,0 +1,2 @@
|
|
|
|
|
|
|
|
|
|
| 1 |
+
[console_scripts]
|
| 2 |
+
server = stack_doctor.server.app:main
|
openenv_stack_doctor.egg-info/requires.txt
ADDED
|
@@ -0,0 +1,5 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
openenv-core[core]>=0.2.0
|
| 2 |
+
|
| 3 |
+
[dev]
|
| 4 |
+
pytest>=8.0.0
|
| 5 |
+
pytest-cov>=4.0.0
|
openenv_stack_doctor.egg-info/top_level.txt
ADDED
|
@@ -0,0 +1 @@
|
|
|
|
|
|
|
| 1 |
+
stack_doctor
|
pyproject.toml
ADDED
|
@@ -0,0 +1,26 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
[build-system]
|
| 2 |
+
requires = ["setuptools>=45", "wheel"]
|
| 3 |
+
build-backend = "setuptools.build_meta"
|
| 4 |
+
|
| 5 |
+
[project]
|
| 6 |
+
name = "openenv-stack-doctor"
|
| 7 |
+
version = "0.1.0"
|
| 8 |
+
description = "Stack Doctor: an RL environment for diagnosing inference-stack incidents"
|
| 9 |
+
requires-python = ">=3.10"
|
| 10 |
+
dependencies = [
|
| 11 |
+
"openenv-core[core]>=0.2.0",
|
| 12 |
+
]
|
| 13 |
+
|
| 14 |
+
[project.optional-dependencies]
|
| 15 |
+
dev = [
|
| 16 |
+
"pytest>=8.0.0",
|
| 17 |
+
"pytest-cov>=4.0.0",
|
| 18 |
+
]
|
| 19 |
+
|
| 20 |
+
[project.scripts]
|
| 21 |
+
server = "stack_doctor.server.app:main"
|
| 22 |
+
|
| 23 |
+
[tool.setuptools]
|
| 24 |
+
include-package-data = true
|
| 25 |
+
packages = ["stack_doctor", "stack_doctor.server"]
|
| 26 |
+
package-dir = { "stack_doctor" = ".", "stack_doctor.server" = "server" }
|
server/__init__.py
ADDED
|
@@ -0,0 +1,6 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
"""Stack Doctor environment server components."""
|
| 2 |
+
|
| 3 |
+
from .stack_doctor_environment import StackDoctorEnvironment
|
| 4 |
+
from .stack_doctor_mcp import StackDoctorMCPEnvironment
|
| 5 |
+
|
| 6 |
+
__all__ = ["StackDoctorEnvironment", "StackDoctorMCPEnvironment"]
|
server/app.py
ADDED
|
@@ -0,0 +1,41 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
"""
|
| 2 |
+
FastAPI application for the Stack Doctor Environment.
|
| 3 |
+
|
| 4 |
+
Exposes both:
|
| 5 |
+
- WebSocket API (reset/step/state) for RL training
|
| 6 |
+
- MCP API (tools/list, tools/call) for agent interaction
|
| 7 |
+
|
| 8 |
+
Usage:
|
| 9 |
+
uvicorn server.app:app --reload --host 0.0.0.0 --port 8000
|
| 10 |
+
"""
|
| 11 |
+
|
| 12 |
+
try:
|
| 13 |
+
from openenv.core.env_server.http_server import create_app
|
| 14 |
+
except Exception as e:
|
| 15 |
+
raise ImportError(
|
| 16 |
+
"openenv is required. Install with: uv sync"
|
| 17 |
+
) from e
|
| 18 |
+
|
| 19 |
+
from models import StackDoctorAction, StackDoctorObservation
|
| 20 |
+
from .stack_doctor_mcp import StackDoctorMCPEnvironment
|
| 21 |
+
|
| 22 |
+
app = create_app(
|
| 23 |
+
StackDoctorMCPEnvironment,
|
| 24 |
+
StackDoctorAction,
|
| 25 |
+
StackDoctorObservation,
|
| 26 |
+
env_name="stack_doctor",
|
| 27 |
+
max_concurrent_envs=4,
|
| 28 |
+
)
|
| 29 |
+
|
| 30 |
+
|
| 31 |
+
def main(host: str = "0.0.0.0", port: int = 8000):
    """Serve the module-level ``app`` with uvicorn; blocks until shutdown."""
    # Imported here so merely importing this module does not require uvicorn.
    import uvicorn
    uvicorn.run(app, host=host, port=port)
|
| 34 |
+
|
| 35 |
+
|
| 36 |
+
if __name__ == "__main__":
    # Script entry point: `python server/app.py --port 8001`.
    import argparse
    parser = argparse.ArgumentParser()
    parser.add_argument("--port", type=int, default=8000)
    args = parser.parse_args()
    main(port=args.port)
|
server/baselines.py
ADDED
|
@@ -0,0 +1,203 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
"""
|
| 2 |
+
Oracle, heuristic, and random baselines for Stack Doctor.
|
| 3 |
+
|
| 4 |
+
Used to validate the reward function: random < heuristic < oracle must hold.
|
| 5 |
+
"""
|
| 6 |
+
|
| 7 |
+
from __future__ import annotations
|
| 8 |
+
|
| 9 |
+
import json
|
| 10 |
+
import random
|
| 11 |
+
|
| 12 |
+
from .scenarios import (
|
| 13 |
+
ROOT_CAUSE_TO_FIX,
|
| 14 |
+
ROOT_CAUSES,
|
| 15 |
+
FIXES,
|
| 16 |
+
SPECIALISTS,
|
| 17 |
+
Scenario,
|
| 18 |
+
SCENARIOS,
|
| 19 |
+
TRAIN_SCENARIOS,
|
| 20 |
+
EVAL_SCENARIOS,
|
| 21 |
+
)
|
| 22 |
+
|
| 23 |
+
|
| 24 |
+
def oracle_policy(scenario: Scenario) -> list[dict]:
    """Cheating reference policy that resolves the incident in one step.

    Reads the hidden ground truth straight off the scenario and emits a
    single submit action — the upper bound used to validate the reward.
    """
    answer = {
        "type": "submit",
        "root_cause": scenario.root_cause,
        "fix": scenario.correct_fix,
        "justification": f"Root cause is {scenario.root_cause}, applying the correct fix.",
    }
    return [answer]
|
| 34 |
+
|
| 35 |
+
|
| 36 |
+
def heuristic_policy(scenario: Scenario) -> list[dict]:
    """Keyword-matching baseline: gather evidence, then commit to a guess.

    Inspects the logs, consults the specialist with the highest stated
    confidence, guesses the root cause from keywords found in the ticket,
    initial log, and that specialist's opinion, then applies and submits
    the fix mapped to the guess.
    """
    # Specialist with the highest self-reported confidence.
    name, top_opinion = max(
        scenario.specialist_opinions.items(),
        key=lambda item: item[1].confidence,
    )

    # Pool all visible evidence and case-fold it for keyword matching.
    evidence = " ".join(
        [scenario.incident_ticket, scenario.initial_log, top_opinion.opinion]
    ).lower()
    guess = _keyword_guess(evidence)
    chosen_fix = ROOT_CAUSE_TO_FIX[guess]

    return [
        {"type": "inspect", "target": "logs"},
        {"type": "ask_specialist", "specialist": name},
        {"type": "apply_fix", "fix": chosen_fix},
        {"type": "submit", "root_cause": guess, "fix": chosen_fix},
    ]
|
| 75 |
+
|
| 76 |
+
|
| 77 |
+
def random_policy(scenario: Scenario) -> list[dict]:
    """Lower-bound baseline: a few arbitrary actions, then a blind submit.

    Draws 0-4 random inspect/ask_specialist actions and always finishes
    with a uniformly random root-cause submission. The sequence of
    ``random`` calls matches the original implementation, so seeded runs
    reproduce identical episodes.
    """
    plan: list[dict] = []
    n_steps = random.randint(1, 5)

    for _ in range(n_steps - 1):
        kind = random.choice(["inspect", "ask_specialist"])
        if kind == "inspect":
            target = random.choice(["logs", "config", "snippet", "metrics"])
            plan.append({"type": "inspect", "target": target})
        else:
            who = random.choice(SPECIALISTS)
            plan.append({"type": "ask_specialist", "specialist": who})

    # Final step is always a (random) submission.
    guessed = random.choice(ROOT_CAUSES)
    plan.append({
        "type": "submit",
        "root_cause": guessed,
        "fix": ROOT_CAUSE_TO_FIX[guessed],
    })
    return plan
|
| 104 |
+
|
| 105 |
+
|
| 106 |
+
def _keyword_guess(text: str) -> str:
|
| 107 |
+
"""Guess root cause from keyword presence in text."""
|
| 108 |
+
scores = {
|
| 109 |
+
"arch_guard": 0,
|
| 110 |
+
"backend_whitelist": 0,
|
| 111 |
+
"runtime_loader": 0,
|
| 112 |
+
"backend_selector": 0,
|
| 113 |
+
"model_config": 0,
|
| 114 |
+
"weight_layout": 0,
|
| 115 |
+
}
|
| 116 |
+
|
| 117 |
+
# arch_guard keywords
|
| 118 |
+
for kw in ["arch", "architecture", "sm_12", "sm_120", "sm_121", "supported_arch", "capability", "is_supported"]:
|
| 119 |
+
if kw in text:
|
| 120 |
+
scores["arch_guard"] += 1
|
| 121 |
+
|
| 122 |
+
# backend_whitelist keywords
|
| 123 |
+
for kw in ["whitelist", "supported_gpu", "not in", "marlin", "awq", "gpu name"]:
|
| 124 |
+
if kw in text:
|
| 125 |
+
scores["backend_whitelist"] += 1
|
| 126 |
+
|
| 127 |
+
# runtime_loader keywords
|
| 128 |
+
for kw in ["runtime", "libcuda", "ld_library", "cuda_home", "symlink", "shared object", "rocm_path", "hipError"]:
|
| 129 |
+
if kw in text:
|
| 130 |
+
scores["runtime_loader"] += 1
|
| 131 |
+
|
| 132 |
+
# backend_selector keywords
|
| 133 |
+
for kw in ["backend", "selector", "xformers", "flash_attn", "latency", "slow", "e4m3fn", "fp8 format"]:
|
| 134 |
+
if kw in text:
|
| 135 |
+
scores["backend_selector"] += 1
|
| 136 |
+
|
| 137 |
+
# model_config keywords
|
| 138 |
+
for kw in ["config", "num_expert", "shape mismatch", "rope", "checkpoint", "config.json"]:
|
| 139 |
+
if kw in text:
|
| 140 |
+
scores["model_config"] += 1
|
| 141 |
+
|
| 142 |
+
# weight_layout keywords
|
| 143 |
+
for kw in ["weight", "mapping", "swap", "gate_proj", "up_proj", "convert", "layout", "qkv"]:
|
| 144 |
+
if kw in text:
|
| 145 |
+
scores["weight_layout"] += 1
|
| 146 |
+
|
| 147 |
+
return max(scores, key=scores.get)
|
| 148 |
+
|
| 149 |
+
|
| 150 |
+
def evaluate_policy(policy_fn, scenarios: list[Scenario], n_runs: int = 1) -> dict:
    """
    Run a policy across scenarios and compute metrics.

    Returns dict with:
    - rc_accuracy: fraction of correct root cause submissions
    - fix_accuracy: fraction of correct fix submissions
    - avg_steps: average steps to resolution
    - avg_reward: average cumulative reward
    """
    # Imported inside the function — presumably to avoid an import cycle
    # with the environment module; confirm before moving to module level.
    from .stack_doctor_environment import StackDoctorEnvironment
    from models import StackDoctorAction

    total_rc_correct = 0
    total_fix_correct = 0
    total_steps = 0
    total_reward = 0.0
    total_episodes = 0

    for _ in range(n_runs):
        for scenario in scenarios:
            # Fresh environment per episode, pinned to this scenario.
            env = StackDoctorEnvironment()
            env.reset(scenario_id=scenario.id)

            actions = policy_fn(scenario)
            cumulative = 0.0
            steps = 0

            # Actions are JSON-serialized because the environment speaks a
            # message-based action protocol.
            for action_dict in actions:
                obs = env.step(StackDoctorAction(message=json.dumps(action_dict)))
                cumulative += obs.reward
                steps += 1
                if obs.done:
                    break

            # Check if submit happened
            # NOTE(review): this scores the *planned* final action; if the
            # episode ended early (obs.done before the plan finished) the
            # submit may never have executed — confirm this is intended.
            last_action = actions[-1] if actions else {}
            if last_action.get("type") == "submit":
                if last_action["root_cause"] == scenario.root_cause:
                    total_rc_correct += 1
                if last_action["fix"] == scenario.correct_fix:
                    total_fix_correct += 1

            total_steps += steps
            total_reward += cumulative
            total_episodes += 1

    # Guard all ratios against a zero-episode run (empty scenario list).
    return {
        "rc_accuracy": total_rc_correct / total_episodes if total_episodes else 0,
        "fix_accuracy": total_fix_correct / total_episodes if total_episodes else 0,
        "avg_steps": total_steps / total_episodes if total_episodes else 0,
        "avg_reward": total_reward / total_episodes if total_episodes else 0,
        "n_episodes": total_episodes,
    }
|
server/requirements.txt
ADDED
|
@@ -0,0 +1,6 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
openenv[core]>=0.2.0
|
| 2 |
+
fastapi>=0.115.0
|
| 3 |
+
uvicorn>=0.24.0
|
| 4 |
+
|
| 5 |
+
|
| 6 |
+
|
server/scenarios.py
ADDED
|
@@ -0,0 +1,1893 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
"""
|
| 2 |
+
Scenario data for the Launch-Day War Room.
|
| 3 |
+
|
| 4 |
+
Each scenario encodes a hidden root cause, the correct fix, an incident ticket,
|
| 5 |
+
hardware/model/backend context, log and code snippets, and specialist opinions
|
| 6 |
+
(some of which may be wrong).
|
| 7 |
+
"""
|
| 8 |
+
|
| 9 |
+
from __future__ import annotations
|
| 10 |
+
|
| 11 |
+
import random
|
| 12 |
+
from dataclasses import dataclass, field
|
| 13 |
+
|
| 14 |
+
|
| 15 |
+
# The six hidden root causes an incident can have.
ROOT_CAUSES = [
    "arch_guard",
    "backend_whitelist",
    "runtime_loader",
    "backend_selector",
    "model_config",
    "weight_layout",
]

# Remediation actions, listed in the same order as ROOT_CAUSES so the two
# lists can be zipped into a 1:1 mapping below.
FIXES = [
    "relax_arch_check",
    "add_whitelist_entry",
    "fix_runtime_path",
    "switch_backend",
    "update_model_config",
    "fix_weight_mapping",
]

# 1:1 mapping
ROOT_CAUSE_TO_FIX = dict(zip(ROOT_CAUSES, FIXES))
FIX_TO_ROOT_CAUSE = {v: k for k, v in ROOT_CAUSE_TO_FIX.items()}

# Specialist personas the agent can consult via the ask_specialist action.
SPECIALISTS = ["runtime", "dispatch", "kernel", "loader"]

# Context options used when composing scenarios.
HARDWARE_OPTIONS = [
    "NVIDIA SM121 (DGX Spark)",
    "NVIDIA SM120 (GeForce RTX 5090)",
    "AMD MI300X",
    "AMD MI355X",
    "NVIDIA H100",
    "NVIDIA B200",
]

MODEL_OPTIONS = [
    "DeepSeek-V3-671B",
    "Llama-4-Maverick-17Bx128E",
    "Qwen3-235B-A22B",
    "Mistral-Large-2",
    "DeepSeek-R1-Distill-70B",
    "Llama-3.3-70B-Instruct",
]

BACKEND_OPTIONS = [
    "vLLM 0.8.x",
    "SGLang 0.5.x",
    "TensorRT-LLM 0.18",
    "FlashInfer 0.4",
    "Triton Inference Server",
]
|
| 64 |
+
|
| 65 |
+
|
| 66 |
+
@dataclass
class SpecialistOpinion:
    """One specialist's diagnosis of an incident; may be deliberately wrong."""

    opinion: str  # free-text diagnosis shown to the agent
    confidence: float  # self-reported confidence (values seen are 0-1 — TODO confirm range)
    is_correct: bool  # ground truth: does this opinion point at the real root cause?
|
| 71 |
+
|
| 72 |
+
|
| 73 |
+
@dataclass
class InspectResult:
    """Evidence payloads returned for each target of the inspect action."""

    logs: str  # extended log output
    config: str  # deployment/model configuration text
    snippet: str  # relevant source-code excerpt
    metrics: str  # runtime metrics dump
|
| 79 |
+
|
| 80 |
+
|
| 81 |
+
@dataclass
class Scenario:
    """A single war-room incident with hidden ground truth.

    The agent sees the ticket, context, and initial evidence; ``root_cause``
    and ``correct_fix`` are the hidden answers the episode is scored against.
    """

    id: str  # unique scenario identifier, e.g. "arch_guard_01"
    root_cause: str  # hidden answer; one of ROOT_CAUSES
    correct_fix: str  # hidden answer; the fix mapped to root_cause
    incident_ticket: str  # incident description visible to the agent
    hardware: str  # GPU context, e.g. an entry from HARDWARE_OPTIONS
    model_name: str  # model context, e.g. an entry from MODEL_OPTIONS
    backend: str  # serving-stack context, e.g. an entry from BACKEND_OPTIONS
    initial_log: str  # log excerpt shown before any inspect action
    initial_snippet: str  # code excerpt shown before any inspect action
    specialist_opinions: dict[str, SpecialistOpinion]  # keys appear to match SPECIALISTS — confirm
    inspect_results: InspectResult  # payloads served by the inspect action
    # For ask_specialist follow-ups
    specialist_followups: dict[str, str] = field(default_factory=dict)
|
| 96 |
+
|
| 97 |
+
|
| 98 |
+
# ---------------------------------------------------------------------------
|
| 99 |
+
# Seed scenarios
|
| 100 |
+
# ---------------------------------------------------------------------------
|
| 101 |
+
|
| 102 |
+
def _make_scenarios() -> list[Scenario]:
|
| 103 |
+
scenarios = []
|
| 104 |
+
|
| 105 |
+
# --- arch_guard scenarios ---
|
| 106 |
+
scenarios.append(Scenario(
|
| 107 |
+
id="arch_guard_01",
|
| 108 |
+
root_cause="arch_guard",
|
| 109 |
+
correct_fix="relax_arch_check",
|
| 110 |
+
incident_ticket=(
|
| 111 |
+
"INCIDENT: FlashInfer attention kernel fails to launch on newly provisioned "
|
| 112 |
+
"DGX Spark nodes. Error: 'Unsupported GPU architecture sm_121'. "
|
| 113 |
+
"Identical model config works on H100 nodes."
|
| 114 |
+
),
|
| 115 |
+
hardware="NVIDIA SM121 (DGX Spark)",
|
| 116 |
+
model_name="DeepSeek-V3-671B",
|
| 117 |
+
backend="FlashInfer 0.4",
|
| 118 |
+
initial_log=(
|
| 119 |
+
"[FlashInfer] Checking GPU capability... sm_121 detected\n"
|
| 120 |
+
"[FlashInfer] ERROR: is_supported_arch() returned False for sm_121\n"
|
| 121 |
+
"[FlashInfer] Falling back to... no fallback available\n"
|
| 122 |
+
"RuntimeError: No compatible attention kernel for architecture sm_121"
|
| 123 |
+
),
|
| 124 |
+
initial_snippet=(
|
| 125 |
+
"# flashinfer/arch_check.py\n"
|
| 126 |
+
"SUPPORTED_ARCHS = {70, 75, 80, 86, 89, 90}\n"
|
| 127 |
+
"\n"
|
| 128 |
+
"def is_supported_arch(cc: int) -> bool:\n"
|
| 129 |
+
" return cc in SUPPORTED_ARCHS"
|
| 130 |
+
),
|
| 131 |
+
specialist_opinions={
|
| 132 |
+
"runtime": SpecialistOpinion(
|
| 133 |
+
"CUDA runtime loaded successfully. No runtime issues detected.", 0.85, False
|
| 134 |
+
),
|
| 135 |
+
"dispatch": SpecialistOpinion(
|
| 136 |
+
"Architecture check is blocking kernel dispatch. The SM121 architecture "
|
| 137 |
+
"is not in the supported set despite being SM90-compatible at the instruction level.", 0.92, True
|
| 138 |
+
),
|
| 139 |
+
"kernel": SpecialistOpinion(
|
| 140 |
+
"The HMMA m16n8k16 instructions used by the attention kernel are available on SM121. "
|
| 141 |
+
"This looks like a capability check issue, not a kernel issue.", 0.88, True
|
| 142 |
+
),
|
| 143 |
+
"loader": SpecialistOpinion(
|
| 144 |
+
"Model weights loaded correctly. Weight layout is standard.", 0.80, False
|
| 145 |
+
),
|
| 146 |
+
},
|
| 147 |
+
inspect_results=InspectResult(
|
| 148 |
+
logs=(
|
| 149 |
+
"[FlashInfer] GPU: NVIDIA GH200 (sm_121)\n"
|
| 150 |
+
"[FlashInfer] CUDA version: 13.0\n"
|
| 151 |
+
"[FlashInfer] is_supported_arch(121) = False\n"
|
| 152 |
+
"[FlashInfer] Architecture check FAILED\n"
|
| 153 |
+
"[CUDA] All CUDA operations nominal\n"
|
| 154 |
+
"[System] GPU memory: 96GB available"
|
| 155 |
+
),
|
| 156 |
+
config=(
|
| 157 |
+
"gpu_architecture: sm_121\n"
|
| 158 |
+
"cuda_version: 13.0\n"
|
| 159 |
+
"flashinfer_version: 0.4.1\n"
|
| 160 |
+
"attention_backend: flashinfer\n"
|
| 161 |
+
"supported_archs: [70, 75, 80, 86, 89, 90]"
|
| 162 |
+
),
|
| 163 |
+
snippet=(
|
| 164 |
+
"# The arch check function uses an exact match:\n"
|
| 165 |
+
"def is_supported_arch(cc):\n"
|
| 166 |
+
" return cc in SUPPORTED_ARCHS # misses sm_12x family\n\n"
|
| 167 |
+
"# SM121 supports HMMA m16n8k16 (same as SM90)\n"
|
| 168 |
+
"# but is not in the allowlist"
|
| 169 |
+
),
|
| 170 |
+
metrics=(
|
| 171 |
+
"kernel_launch_attempts: 47\n"
|
| 172 |
+
"kernel_launch_failures: 47\n"
|
| 173 |
+
"fallback_attempts: 47\n"
|
| 174 |
+
"fallback_failures: 47\n"
|
| 175 |
+
"gpu_utilization: 0%"
|
| 176 |
+
),
|
| 177 |
+
),
|
| 178 |
+
specialist_followups={
|
| 179 |
+
"runtime": "I confirmed CUDA 13.0 runtime is functional. All driver calls succeed. This isn't a runtime issue.",
|
| 180 |
+
"dispatch": "The dispatch table maps arch -> kernel. SM121 has no entry. Adding sm_12x family to the arch check should fix it.",
|
| 181 |
+
"kernel": "I inspected the PTX. The kernel only needs HMMA m16n8k16 which SM121 supports. The kernel itself is fine.",
|
| 182 |
+
"loader": "Weights are in the expected layout. No loader issues.",
|
| 183 |
+
},
|
| 184 |
+
))
|
| 185 |
+
|
| 186 |
+
scenarios.append(Scenario(
|
| 187 |
+
id="arch_guard_02",
|
| 188 |
+
root_cause="arch_guard",
|
| 189 |
+
correct_fix="relax_arch_check",
|
| 190 |
+
incident_ticket=(
|
| 191 |
+
"INCIDENT: MLA attention fails on GeForce RTX 5090. Error: "
|
| 192 |
+
"'compute capability 120 not supported'. Customer reports RTX 4090 works fine."
|
| 193 |
+
),
|
| 194 |
+
hardware="NVIDIA SM120 (GeForce RTX 5090)",
|
| 195 |
+
model_name="DeepSeek-R1-Distill-70B",
|
| 196 |
+
backend="vLLM 0.8.x",
|
| 197 |
+
initial_log=(
|
| 198 |
+
"[vLLM] Detecting GPU... GeForce RTX 5090 (sm_120)\n"
|
| 199 |
+
"[vLLM] FlashAttention: compute capability 120 not in supported list\n"
|
| 200 |
+
"[vLLM] ERROR: Cannot initialize attention backend"
|
| 201 |
+
),
|
| 202 |
+
initial_snippet=(
|
| 203 |
+
"# vllm/attention/backends/flash_attn.py\n"
|
| 204 |
+
"MIN_CC = 80\n"
|
| 205 |
+
"MAX_CC = 90\n"
|
| 206 |
+
"\n"
|
| 207 |
+
"def is_supported(cc: int) -> bool:\n"
|
| 208 |
+
" return MIN_CC <= cc <= MAX_CC"
|
| 209 |
+
),
|
| 210 |
+
specialist_opinions={
|
| 211 |
+
"runtime": SpecialistOpinion("Runtime is fine. CUDA 13 loaded.", 0.75, False),
|
| 212 |
+
"dispatch": SpecialistOpinion(
|
| 213 |
+
"The capability range check excludes SM120. Needs to include SM12x family.", 0.90, True
|
| 214 |
+
),
|
| 215 |
+
"kernel": SpecialistOpinion(
|
| 216 |
+
"Possible kernel incompatibility — SM120 lacks tcgen05 MMA.", 0.60, False
|
| 217 |
+
),
|
| 218 |
+
"loader": SpecialistOpinion("Weights look fine.", 0.70, False),
|
| 219 |
+
},
|
| 220 |
+
inspect_results=InspectResult(
|
| 221 |
+
logs="[vLLM] GPU cc=120 rejected by range [80,90]\n[vLLM] No fallback attention backend",
|
| 222 |
+
config="compute_capability: 120\nmax_supported_cc: 90\nattention_backend: flash_attn",
|
| 223 |
+
snippet="# Range check: MIN_CC(80) <= cc <= MAX_CC(90)\n# SM120 = 120 > 90, so rejected\n# Fix: add sm_12x family check",
|
| 224 |
+
metrics="attention_init_failures: 1\nmodel_load_time: 0s (blocked at init)",
|
| 225 |
+
),
|
| 226 |
+
specialist_followups={
|
| 227 |
+
"runtime": "CUDA 13.0 runtime is healthy. Driver version matches.",
|
| 228 |
+
"dispatch": "SM120 uses HMMA path (no warp specialization), same code path as SM86. Just need to update the arch range.",
|
| 229 |
+
"kernel": "On closer inspection, SM120 does support the needed HMMA instructions. My earlier concern about tcgen05 was wrong — that's only needed for Hopper-style warp specialization.",
|
| 230 |
+
"loader": "No weight issues detected.",
|
| 231 |
+
},
|
| 232 |
+
))
|
| 233 |
+
|
| 234 |
+
# --- backend_whitelist scenarios ---
|
| 235 |
+
scenarios.append(Scenario(
|
| 236 |
+
id="backend_whitelist_01",
|
| 237 |
+
root_cause="backend_whitelist",
|
| 238 |
+
correct_fix="add_whitelist_entry",
|
| 239 |
+
incident_ticket=(
|
| 240 |
+
"INCIDENT: Marlin quantized inference crashes on SM121 nodes. "
|
| 241 |
+
"Error: 'Marlin kernel not available for current GPU'. "
|
| 242 |
+
"FP16 inference works, only quantized (GPTQ/AWQ) path fails."
|
| 243 |
+
),
|
| 244 |
+
hardware="NVIDIA SM121 (DGX Spark)",
|
| 245 |
+
model_name="Llama-3.3-70B-Instruct",
|
| 246 |
+
backend="vLLM 0.8.x",
|
| 247 |
+
initial_log=(
|
| 248 |
+
"[vLLM] Loading GPTQ-quantized model...\n"
|
| 249 |
+
"[vLLM] Checking Marlin kernel availability for sm_121\n"
|
| 250 |
+
"[vLLM] WARNING: GPU sm_121 not in Marlin whitelist\n"
|
| 251 |
+
"[vLLM] ERROR: No quantization kernel available"
|
| 252 |
+
),
|
| 253 |
+
initial_snippet=(
|
| 254 |
+
"# vllm/model_executor/layers/quantization/marlin.py\n"
|
| 255 |
+
"MARLIN_SUPPORTED_GPUS = [\n"
|
| 256 |
+
" 'A100', 'A10', 'H100', 'L40', 'RTX 4090',\n"
|
| 257 |
+
"]\n"
|
| 258 |
+
),
|
| 259 |
+
specialist_opinions={
|
| 260 |
+
"runtime": SpecialistOpinion("CUDA runtime OK. Libraries loaded.", 0.80, False),
|
| 261 |
+
"dispatch": SpecialistOpinion(
|
| 262 |
+
"Marlin whitelist doesn't include SM121 GPU names. Need to add the entry.", 0.91, True
|
| 263 |
+
),
|
| 264 |
+
"kernel": SpecialistOpinion(
|
| 265 |
+
"Marlin kernels use standard HMMA ops that SM121 supports. It's just not whitelisted.", 0.85, True
|
| 266 |
+
),
|
| 267 |
+
"loader": SpecialistOpinion(
|
| 268 |
+
"Quantized weights loaded but kernel never launches. Might be a weight format issue.", 0.55, False
|
| 269 |
+
),
|
| 270 |
+
},
|
| 271 |
+
inspect_results=InspectResult(
|
| 272 |
+
logs="[Marlin] GPU name 'NVIDIA GH200' not in whitelist\n[Marlin] Whitelist: ['A100','A10','H100','L40','RTX 4090']",
|
| 273 |
+
config="quantization: gptq\nmarlin_whitelist: [A100, A10, H100, L40, RTX 4090]\ngpu_name: NVIDIA GH200",
|
| 274 |
+
snippet="# Whitelist check uses GPU product name string matching\n# GH200 / DGX Spark not in the list\n# Should use arch family check instead of name matching",
|
| 275 |
+
metrics="quantized_kernel_attempts: 1\nquantized_kernel_failures: 1\nfp16_fallback: not_attempted",
|
| 276 |
+
),
|
| 277 |
+
specialist_followups={
|
| 278 |
+
"runtime": "All good on the runtime side.",
|
| 279 |
+
"dispatch": "The whitelist is name-based, not arch-based. Adding 'GH200' or switching to family-level arch checks fixes this.",
|
| 280 |
+
"kernel": "The Marlin FP8 GEMM dispatch works with SM121's MMA units. It's purely a whitelist gap.",
|
| 281 |
+
"loader": "Actually, the weights loaded fine. I retract my earlier concern.",
|
| 282 |
+
},
|
| 283 |
+
))
|
| 284 |
+
|
| 285 |
+
scenarios.append(Scenario(
|
| 286 |
+
id="backend_whitelist_02",
|
| 287 |
+
root_cause="backend_whitelist",
|
| 288 |
+
correct_fix="add_whitelist_entry",
|
| 289 |
+
incident_ticket=(
|
| 290 |
+
"INCIDENT: AWQ quantization backend refuses to initialize on MI300X. "
|
| 291 |
+
"Error: 'GPU not supported for AWQ acceleration'. "
|
| 292 |
+
"Other backends work fine on the same hardware."
|
| 293 |
+
),
|
| 294 |
+
hardware="AMD MI300X",
|
| 295 |
+
model_name="Qwen3-235B-A22B",
|
| 296 |
+
backend="vLLM 0.8.x",
|
| 297 |
+
initial_log=(
|
| 298 |
+
"[vLLM] Initializing AWQ backend...\n"
|
| 299 |
+
"[vLLM] GPU: AMD Instinct MI300X\n"
|
| 300 |
+
"[vLLM] AWQ: GPU not in supported devices list\n"
|
| 301 |
+
"[vLLM] ERROR: AWQ acceleration unavailable"
|
| 302 |
+
),
|
| 303 |
+
initial_snippet=(
|
| 304 |
+
"# vllm/model_executor/layers/quantization/awq.py\n"
|
| 305 |
+
"AWQ_SUPPORTED = {'A100', 'H100', 'RTX 4090', 'L40S'}\n"
|
| 306 |
+
),
|
| 307 |
+
specialist_opinions={
|
| 308 |
+
"runtime": SpecialistOpinion("ROCm runtime healthy. HIP version matches.", 0.82, False),
|
| 309 |
+
"dispatch": SpecialistOpinion(
|
| 310 |
+
"AWQ whitelist is NVIDIA-only. MI300X needs to be added.", 0.93, True
|
| 311 |
+
),
|
| 312 |
+
"kernel": SpecialistOpinion(
|
| 313 |
+
"MI300X has MFMA instructions that can handle the AWQ GEMM. Not a kernel issue.", 0.87, True
|
| 314 |
+
),
|
| 315 |
+
"loader": SpecialistOpinion("Weight format might not match AMD layout expectations.", 0.50, False),
|
| 316 |
+
},
|
| 317 |
+
inspect_results=InspectResult(
|
| 318 |
+
logs="[AWQ] Device 'AMD Instinct MI300X' not in AWQ_SUPPORTED\n[AWQ] Supported: A100, H100, RTX 4090, L40S",
|
| 319 |
+
config="quantization: awq\nawq_supported: [A100, H100, RTX 4090, L40S]\ngpu: AMD Instinct MI300X",
|
| 320 |
+
snippet="# AWQ_SUPPORTED only lists NVIDIA GPUs\n# MI300X MFMA f32_32x32x8_f16 can handle AWQ ops\n# Need to add MI300X to whitelist",
|
| 321 |
+
metrics="awq_init_failures: 1\nfallback_to_fp16: pending",
|
| 322 |
+
),
|
| 323 |
+
specialist_followups={
|
| 324 |
+
"runtime": "ROCm 6.3 loaded successfully. No runtime concerns.",
|
| 325 |
+
"dispatch": "Simple whitelist gap. Adding MI300X resolves the issue.",
|
| 326 |
+
"kernel": "Confirmed: MFMA ops on MI300X handle the AWQ GEMM pattern.",
|
| 327 |
+
"loader": "I was wrong earlier — weights are fine. It's the whitelist.",
|
| 328 |
+
},
|
| 329 |
+
))
|
| 330 |
+
|
| 331 |
+
# --- runtime_loader scenarios ---
|
| 332 |
+
scenarios.append(Scenario(
|
| 333 |
+
id="runtime_loader_01",
|
| 334 |
+
root_cause="runtime_loader",
|
| 335 |
+
correct_fix="fix_runtime_path",
|
| 336 |
+
incident_ticket=(
|
| 337 |
+
"INCIDENT: SGLang server crashes on startup with CUDA 13 on DGX Spark. "
|
| 338 |
+
"Error: 'libcudart.so.13: cannot open shared object file'. "
|
| 339 |
+
"System has CUDA 13 installed but SGLang can't find it."
|
| 340 |
+
),
|
| 341 |
+
hardware="NVIDIA SM121 (DGX Spark)",
|
| 342 |
+
model_name="Llama-4-Maverick-17Bx128E",
|
| 343 |
+
backend="SGLang 0.5.x",
|
| 344 |
+
initial_log=(
|
| 345 |
+
"[SGLang] Starting server...\n"
|
| 346 |
+
"[SGLang] Loading CUDA runtime...\n"
|
| 347 |
+
"[SGLang] ERROR: libcudart.so.13: cannot open shared object file\n"
|
| 348 |
+
"[SGLang] LD_LIBRARY_PATH=/usr/local/cuda-12/lib64\n"
|
| 349 |
+
"ImportError: CUDA runtime not found"
|
| 350 |
+
),
|
| 351 |
+
initial_snippet=(
|
| 352 |
+
"# sglang/startup.py\n"
|
| 353 |
+
"CUDA_LIB_PATH = os.environ.get(\n"
|
| 354 |
+
" 'CUDA_HOME', '/usr/local/cuda'\n"
|
| 355 |
+
") + '/lib64'\n"
|
| 356 |
+
"# Hardcoded to cuda, not cuda-13\n"
|
| 357 |
+
),
|
| 358 |
+
specialist_opinions={
|
| 359 |
+
"runtime": SpecialistOpinion(
|
| 360 |
+
"CUDA 13 is installed at /usr/local/cuda-13 but LD_LIBRARY_PATH points to cuda-12. "
|
| 361 |
+
"The runtime path needs to be updated.", 0.95, True
|
| 362 |
+
),
|
| 363 |
+
"dispatch": SpecialistOpinion("Can't tell — server never gets to dispatch phase.", 0.40, False),
|
| 364 |
+
"kernel": SpecialistOpinion("No kernel issue — server crashes before kernel init.", 0.60, False),
|
| 365 |
+
"loader": SpecialistOpinion(
|
| 366 |
+
"The CUDA shared library loader can't find libcudart.so.13. Path issue.", 0.88, True
|
| 367 |
+
),
|
| 368 |
+
},
|
| 369 |
+
inspect_results=InspectResult(
|
| 370 |
+
logs=(
|
| 371 |
+
"[System] CUDA installations:\n"
|
| 372 |
+
" /usr/local/cuda-12 -> CUDA 12.4\n"
|
| 373 |
+
" /usr/local/cuda-13 -> CUDA 13.0\n"
|
| 374 |
+
" /usr/local/cuda -> symlink to cuda-12\n"
|
| 375 |
+
"[SGLang] Trying to load libcudart.so.13 from /usr/local/cuda/lib64 -> NOT FOUND"
|
| 376 |
+
),
|
| 377 |
+
config="CUDA_HOME=/usr/local/cuda\nLD_LIBRARY_PATH=/usr/local/cuda-12/lib64\ncuda_13_path=/usr/local/cuda-13",
|
| 378 |
+
snippet="# /usr/local/cuda symlinks to cuda-12\n# Need: export CUDA_HOME=/usr/local/cuda-13\n# Or: update symlink",
|
| 379 |
+
metrics="server_start_attempts: 3\nserver_start_failures: 3\nuptime: 0s",
|
| 380 |
+
),
|
| 381 |
+
specialist_followups={
|
| 382 |
+
"runtime": "Confirmed: /usr/local/cuda symlink targets cuda-12. CUDA 13 is at /usr/local/cuda-13. Fix the path.",
|
| 383 |
+
"dispatch": "Server never started, so I can't diagnose dispatch.",
|
| 384 |
+
"kernel": "Same — no kernel loaded.",
|
| 385 |
+
"loader": "The dynamic linker searches LD_LIBRARY_PATH first. It needs /usr/local/cuda-13/lib64.",
|
| 386 |
+
},
|
| 387 |
+
))
|
| 388 |
+
|
| 389 |
+
scenarios.append(Scenario(
|
| 390 |
+
id="runtime_loader_02",
|
| 391 |
+
root_cause="runtime_loader",
|
| 392 |
+
correct_fix="fix_runtime_path",
|
| 393 |
+
incident_ticket=(
|
| 394 |
+
"INCIDENT: ROCm HIP runtime fails to initialize on MI300X cluster. "
|
| 395 |
+
"Error: 'hipErrorNoDevice' despite GPUs being visible in lspci. "
|
| 396 |
+
"Worked yesterday before system update."
|
| 397 |
+
),
|
| 398 |
+
hardware="AMD MI300X",
|
| 399 |
+
model_name="DeepSeek-V3-671B",
|
| 400 |
+
backend="vLLM 0.8.x",
|
| 401 |
+
initial_log=(
|
| 402 |
+
"[HIP] Initializing runtime...\n"
|
| 403 |
+
"[HIP] ERROR: hipErrorNoDevice (code 100)\n"
|
| 404 |
+
"[System] lspci shows 8x AMD Instinct MI300X\n"
|
| 405 |
+
"[System] /opt/rocm -> /opt/rocm-6.2 (outdated symlink)"
|
| 406 |
+
),
|
| 407 |
+
initial_snippet=(
|
| 408 |
+
"# environment setup\n"
|
| 409 |
+
"ROCM_PATH=/opt/rocm # symlinks to rocm-6.2\n"
|
| 410 |
+
"# But rocm-6.3 installed at /opt/rocm-6.3\n"
|
| 411 |
+
"# Driver expects rocm-6.3 runtime\n"
|
| 412 |
+
),
|
| 413 |
+
specialist_opinions={
|
| 414 |
+
"runtime": SpecialistOpinion(
|
| 415 |
+
"ROCm path mismatch. /opt/rocm points to 6.2 but driver needs 6.3 runtime.", 0.94, True
|
| 416 |
+
),
|
| 417 |
+
"dispatch": SpecialistOpinion("Not a dispatch issue — runtime doesn't initialize.", 0.70, False),
|
| 418 |
+
"kernel": SpecialistOpinion("Might be a kernel module issue with the GPU driver.", 0.45, False),
|
| 419 |
+
"loader": SpecialistOpinion("ROCm shared libraries at wrong version.", 0.80, True),
|
| 420 |
+
},
|
| 421 |
+
inspect_results=InspectResult(
|
| 422 |
+
logs="[System] /opt/rocm -> /opt/rocm-6.2\n[System] Driver version: 6.3.0\n[HIP] Runtime version mismatch: expected 6.3, found 6.2",
|
| 423 |
+
config="ROCM_PATH=/opt/rocm\nrocm_symlink_target=/opt/rocm-6.2\ninstalled_versions: [6.2, 6.3]\ndriver_version: 6.3.0",
|
| 424 |
+
snippet="# The system was updated and ROCm 6.3 driver installed\n# But /opt/rocm symlink still points to 6.2\n# Fix: ln -sf /opt/rocm-6.3 /opt/rocm",
|
| 425 |
+
metrics="gpu_init_failures: 8\ndriver_version: 6.3.0\nruntime_version: 6.2.0",
|
| 426 |
+
),
|
| 427 |
+
specialist_followups={
|
| 428 |
+
"runtime": "Classic version mismatch after system update. Fix the symlink to point to rocm-6.3.",
|
| 429 |
+
"dispatch": "Can't assess dispatch without a working runtime.",
|
| 430 |
+
"kernel": "I was wrong — it's not a kernel module issue. The GPU driver is fine, it's the userspace runtime path.",
|
| 431 |
+
"loader": "The shared library loader finds rocm-6.2 libs but driver expects 6.3. Path fix needed.",
|
| 432 |
+
},
|
| 433 |
+
))
|
| 434 |
+
|
| 435 |
+
# --- backend_selector scenarios ---
|
| 436 |
+
scenarios.append(Scenario(
|
| 437 |
+
id="backend_selector_01",
|
| 438 |
+
root_cause="backend_selector",
|
| 439 |
+
correct_fix="switch_backend",
|
| 440 |
+
incident_ticket=(
|
| 441 |
+
"INCIDENT: Extreme latency (10x expected) on H100 serving Llama-3.3-70B. "
|
| 442 |
+
"No errors, just very slow. GPU utilization looks low. "
|
| 443 |
+
"Other models on the same node are fast."
|
| 444 |
+
),
|
| 445 |
+
hardware="NVIDIA H100",
|
| 446 |
+
model_name="Llama-3.3-70B-Instruct",
|
| 447 |
+
backend="vLLM 0.8.x",
|
| 448 |
+
initial_log=(
|
| 449 |
+
"[vLLM] Selected attention backend: xformers\n"
|
| 450 |
+
"[vLLM] WARNING: FlashAttention v2 not selected (override with VLLM_ATTENTION_BACKEND)\n"
|
| 451 |
+
"[vLLM] Serving Llama-3.3-70B-Instruct...\n"
|
| 452 |
+
"[vLLM] p99 latency: 4200ms (expected: ~400ms)"
|
| 453 |
+
),
|
| 454 |
+
initial_snippet=(
|
| 455 |
+
"# vllm/attention/selector.py\n"
|
| 456 |
+
"def get_attention_backend(model_config):\n"
|
| 457 |
+
" if model_config.head_dim not in [64, 128]:\n"
|
| 458 |
+
" return 'xformers' # fallback\n"
|
| 459 |
+
" return 'flash_attn'\n"
|
| 460 |
+
),
|
| 461 |
+
specialist_opinions={
|
| 462 |
+
"runtime": SpecialistOpinion("CUDA runtime is fine. No errors.", 0.75, False),
|
| 463 |
+
"dispatch": SpecialistOpinion(
|
| 464 |
+
"Wrong attention backend selected. xformers is much slower than FlashAttention on H100. "
|
| 465 |
+
"The backend selector has a bug in head_dim detection.", 0.94, True
|
| 466 |
+
),
|
| 467 |
+
"kernel": SpecialistOpinion(
|
| 468 |
+
"The xformers kernel is correct but suboptimal for H100. Should use flash_attn.", 0.82, True
|
| 469 |
+
),
|
| 470 |
+
"loader": SpecialistOpinion("Model loaded correctly. Not a weight issue.", 0.80, False),
|
| 471 |
+
},
|
| 472 |
+
inspect_results=InspectResult(
|
| 473 |
+
logs="[vLLM] head_dim=128, num_heads=64\n[vLLM] Backend selection: model reports head_dim=None (config missing) -> fallback to xformers",
|
| 474 |
+
config="attention_backend: xformers (auto-selected)\nmodel_head_dim: null\nactual_head_dim: 128\ngpu: H100",
|
| 475 |
+
snippet="# The model config doesn't explicitly set head_dim\n# Selector falls back to xformers when head_dim is None\n# Should infer head_dim from hidden_size / num_heads",
|
| 476 |
+
metrics="p50_latency_ms: 3100\np99_latency_ms: 4200\ngpu_utilization: 12%\nexpected_gpu_util: 85%",
|
| 477 |
+
),
|
| 478 |
+
specialist_followups={
|
| 479 |
+
"runtime": "No runtime issues. The server is running, just slow.",
|
| 480 |
+
"dispatch": "Backend selector bug: head_dim is None in model config, causing xformers fallback. Switch to flash_attn.",
|
| 481 |
+
"kernel": "xformers works but doesn't use H100 TMA/warp specialization. flash_attn v2 would be 8-10x faster.",
|
| 482 |
+
"loader": "Weights loaded correctly.",
|
| 483 |
+
},
|
| 484 |
+
))
|
| 485 |
+
|
| 486 |
+
scenarios.append(Scenario(
|
| 487 |
+
id="backend_selector_02",
|
| 488 |
+
root_cause="backend_selector",
|
| 489 |
+
correct_fix="switch_backend",
|
| 490 |
+
incident_ticket=(
|
| 491 |
+
"INCIDENT: FP8 inference on MI300X producing garbage output. "
|
| 492 |
+
"Model loads, tokens generate, but output is nonsensical. "
|
| 493 |
+
"BF16 inference on same hardware works perfectly."
|
| 494 |
+
),
|
| 495 |
+
hardware="AMD MI300X",
|
| 496 |
+
model_name="Mistral-Large-2",
|
| 497 |
+
backend="vLLM 0.8.x",
|
| 498 |
+
initial_log=(
|
| 499 |
+
"[vLLM] FP8 quantization: e4m3fn format selected\n"
|
| 500 |
+
"[vLLM] WARNING: MI300X uses e4m3fnuz format, not e4m3fn\n"
|
| 501 |
+
"[vLLM] Serving with FP8...\n"
|
| 502 |
+
"[vLLM] Output quality check: FAIL (perplexity 847.3, expected <15)"
|
| 503 |
+
),
|
| 504 |
+
initial_snippet=(
|
| 505 |
+
"# vllm/quantization/fp8.py\n"
|
| 506 |
+
"FP8_FORMAT = 'e4m3fn' # NVIDIA default\n"
|
| 507 |
+
"# AMD MI300X needs e4m3fnuz (no NaN, unsigned zero)\n"
|
| 508 |
+
),
|
| 509 |
+
specialist_opinions={
|
| 510 |
+
"runtime": SpecialistOpinion("ROCm runtime is healthy.", 0.80, False),
|
| 511 |
+
"dispatch": SpecialistOpinion(
|
| 512 |
+
"Wrong FP8 format selected. MI300X uses e4m3fnuz, not e4m3fn. "
|
| 513 |
+
"The backend selector should detect AMD and switch format.", 0.93, True
|
| 514 |
+
),
|
| 515 |
+
"kernel": SpecialistOpinion(
|
| 516 |
+
"The GEMM kernel runs but produces wrong results due to format mismatch.", 0.85, True
|
| 517 |
+
),
|
| 518 |
+
"loader": SpecialistOpinion(
|
| 519 |
+
"Weight dequantization might be wrong for AMD FP8 format.", 0.65, False
|
| 520 |
+
),
|
| 521 |
+
},
|
| 522 |
+
inspect_results=InspectResult(
|
| 523 |
+
logs="[FP8] Using e4m3fn format\n[FP8] AMD GPU detected but format not switched\n[FP8] Numerical errors in first GEMM",
|
| 524 |
+
config="fp8_format: e4m3fn\ngpu_vendor: AMD\nexpected_format: e4m3fnuz\nformat_mismatch: true",
|
| 525 |
+
snippet="# e4m3fn: 1 sign, 4 exp, 3 mantissa, has NaN encoding\n# e4m3fnuz: 1 sign, 4 exp, 3 mantissa, NO NaN, unsigned zero\n# Bit patterns interpreted differently -> garbage output",
|
| 526 |
+
metrics="output_perplexity: 847.3\nexpected_perplexity: 12.5\ngemm_numerical_errors: 100%",
|
| 527 |
+
),
|
| 528 |
+
specialist_followups={
|
| 529 |
+
"runtime": "ROCm fine. This is a numerical issue, not runtime.",
|
| 530 |
+
"dispatch": "Switch the FP8 format selector to use e4m3fnuz for AMD GPUs. Clear fix.",
|
| 531 |
+
"kernel": "The kernel math is correct for the format it's given — the problem is the format itself.",
|
| 532 |
+
"loader": "Actually, weights are fine. The issue is at the GEMM dispatch level.",
|
| 533 |
+
},
|
| 534 |
+
))
|
| 535 |
+
|
| 536 |
+
# --- model_config scenarios ---
|
| 537 |
+
scenarios.append(Scenario(
|
| 538 |
+
id="model_config_01",
|
| 539 |
+
root_cause="model_config",
|
| 540 |
+
correct_fix="update_model_config",
|
| 541 |
+
incident_ticket=(
|
| 542 |
+
"INCIDENT: DeepSeek-V3 MoE routing crashes with shape mismatch. "
|
| 543 |
+
"Error: 'Expected expert count 256, got 160'. "
|
| 544 |
+
"Model just updated to new checkpoint, was working before."
|
| 545 |
+
),
|
| 546 |
+
hardware="NVIDIA H100",
|
| 547 |
+
model_name="DeepSeek-V3-671B",
|
| 548 |
+
backend="SGLang 0.5.x",
|
| 549 |
+
initial_log=(
|
| 550 |
+
"[SGLang] Loading DeepSeek-V3-671B...\n"
|
| 551 |
+
"[SGLang] MoE config: num_experts=256 (from config.json)\n"
|
| 552 |
+
"[SGLang] Actual weight shape: experts.0-159\n"
|
| 553 |
+
"[SGLang] ERROR: Shape mismatch in MoE layer: expected 256 experts, found 160"
|
| 554 |
+
),
|
| 555 |
+
initial_snippet=(
|
| 556 |
+
"# config.json (model repo)\n"
|
| 557 |
+
'{\n'
|
| 558 |
+
' "num_local_experts": 256,\n'
|
| 559 |
+
' "num_experts_per_tok": 8,\n'
|
| 560 |
+
' "intermediate_size": 2048\n'
|
| 561 |
+
'}\n'
|
| 562 |
+
"# But actual checkpoint has 160 experts\n"
|
| 563 |
+
),
|
| 564 |
+
specialist_opinions={
|
| 565 |
+
"runtime": SpecialistOpinion("Runtime is fine. Model loading proceeds until shape error.", 0.75, False),
|
| 566 |
+
"dispatch": SpecialistOpinion("Not a dispatch bug — the model config is wrong.", 0.70, False),
|
| 567 |
+
"kernel": SpecialistOpinion(
|
| 568 |
+
"MoE kernel expects expert count from config. Config says 256 but weights have 160. "
|
| 569 |
+
"Config needs to be updated to match the new checkpoint.", 0.90, True
|
| 570 |
+
),
|
| 571 |
+
"loader": SpecialistOpinion(
|
| 572 |
+
"The model config doesn't match the checkpoint. num_local_experts should be 160.", 0.92, True
|
| 573 |
+
),
|
| 574 |
+
},
|
| 575 |
+
inspect_results=InspectResult(
|
| 576 |
+
logs="[SGLang] config.json: num_local_experts=256\n[SGLang] checkpoint expert layers: 160\n[SGLang] Mismatch detected at layer 0",
|
| 577 |
+
config="num_local_experts: 256 (config)\nactual_experts: 160 (checkpoint)\nnum_experts_per_tok: 8\ncheckpoint_version: v3.1",
|
| 578 |
+
snippet="# New checkpoint v3.1 reduced experts from 256 to 160\n# But config.json wasn't updated\n# Fix: set num_local_experts=160 in config.json",
|
| 579 |
+
metrics="model_load_progress: 15%\nlayers_loaded: 0/60\nerror_at: moe_layer_0",
|
| 580 |
+
),
|
| 581 |
+
specialist_followups={
|
| 582 |
+
"runtime": "No runtime issue. Pure config mismatch.",
|
| 583 |
+
"dispatch": "Dispatch looks fine. The error is before dispatch even runs.",
|
| 584 |
+
"kernel": "The grouped GEMM kernel allocates buffers based on config expert count. Fix the config.",
|
| 585 |
+
"loader": "Config.json says 256 experts but the v3.1 checkpoint only has 160. Update the config.",
|
| 586 |
+
},
|
| 587 |
+
))
|
| 588 |
+
|
| 589 |
+
scenarios.append(Scenario(
|
| 590 |
+
id="model_config_02",
|
| 591 |
+
root_cause="model_config",
|
| 592 |
+
correct_fix="update_model_config",
|
| 593 |
+
incident_ticket=(
|
| 594 |
+
"INCIDENT: Qwen3 MoE model gives wrong results after hardware migration. "
|
| 595 |
+
"Output is coherent but factually wrong. "
|
| 596 |
+
"Same model on old cluster was correct."
|
| 597 |
+
),
|
| 598 |
+
hardware="NVIDIA B200",
|
| 599 |
+
model_name="Qwen3-235B-A22B",
|
| 600 |
+
backend="vLLM 0.8.x",
|
| 601 |
+
initial_log=(
|
| 602 |
+
"[vLLM] Loading Qwen3-235B-A22B...\n"
|
| 603 |
+
"[vLLM] Config: rope_theta=1000000.0\n"
|
| 604 |
+
"[vLLM] WARNING: RoPE scaling config missing for extended context\n"
|
| 605 |
+
"[vLLM] Serving... output quality degraded at positions > 4096"
|
| 606 |
+
),
|
| 607 |
+
initial_snippet=(
|
| 608 |
+
"# config.json\n"
|
| 609 |
+
'{\n'
|
| 610 |
+
' "rope_theta": 1000000.0,\n'
|
| 611 |
+
' "max_position_embeddings": 32768\n'
|
| 612 |
+
' // Missing: rope_scaling config for YaRN\n'
|
| 613 |
+
'}\n'
|
| 614 |
+
),
|
| 615 |
+
specialist_opinions={
|
| 616 |
+
"runtime": SpecialistOpinion("Runtime fine. No crashes.", 0.80, False),
|
| 617 |
+
"dispatch": SpecialistOpinion("Backend selected correctly.", 0.65, False),
|
| 618 |
+
"kernel": SpecialistOpinion(
|
| 619 |
+
"RoPE computation looks standard. Config might be missing the scaling parameters.", 0.78, True
|
| 620 |
+
),
|
| 621 |
+
"loader": SpecialistOpinion(
|
| 622 |
+
"Model config is incomplete — missing rope_scaling section for YaRN. "
|
| 623 |
+
"Old cluster had a patched config.", 0.91, True
|
| 624 |
+
),
|
| 625 |
+
},
|
| 626 |
+
inspect_results=InspectResult(
|
| 627 |
+
logs="[vLLM] RoPE: theta=1e6, no scaling applied\n[vLLM] Quality degrades > 4096 tokens\n[vLLM] Old cluster config had rope_scaling: {type: yarn, factor: 4}",
|
| 628 |
+
config="rope_theta: 1000000.0\nrope_scaling: null\nmax_position_embeddings: 32768\nold_config_had: {rope_scaling: {type: yarn, factor: 4}}",
|
| 629 |
+
snippet="# Missing rope_scaling config:\n# rope_scaling: {type: 'yarn', factor: 4, ...}\n# Without it, positions > 4096 are garbage",
|
| 630 |
+
metrics="quality_0_4k: 95%\nquality_4k_8k: 43%\nquality_8k_plus: 12%",
|
| 631 |
+
),
|
| 632 |
+
specialist_followups={
|
| 633 |
+
"runtime": "No runtime issues.",
|
| 634 |
+
"dispatch": "Backend is correct. Not a dispatch issue.",
|
| 635 |
+
"kernel": "The RoPE kernel is fine — it just doesn't have the scaling config to apply YaRN.",
|
| 636 |
+
"loader": "The config.json from the model repo is missing rope_scaling. Add it back.",
|
| 637 |
+
},
|
| 638 |
+
))
|
| 639 |
+
|
| 640 |
+
# --- weight_layout scenarios ---
|
| 641 |
+
scenarios.append(Scenario(
|
| 642 |
+
id="weight_layout_01",
|
| 643 |
+
root_cause="weight_layout",
|
| 644 |
+
correct_fix="fix_weight_mapping",
|
| 645 |
+
incident_ticket=(
|
| 646 |
+
"INCIDENT: Model produces random output after converting weights from "
|
| 647 |
+
"HuggingFace format to TensorRT-LLM format. Conversion reported success "
|
| 648 |
+
"but inference output is gibberish."
|
| 649 |
+
),
|
| 650 |
+
hardware="NVIDIA H100",
|
| 651 |
+
model_name="Llama-3.3-70B-Instruct",
|
| 652 |
+
backend="TensorRT-LLM 0.18",
|
| 653 |
+
initial_log=(
|
| 654 |
+
"[TRT-LLM] Loading converted weights...\n"
|
| 655 |
+
"[TRT-LLM] Weight shapes match expected layout\n"
|
| 656 |
+
"[TRT-LLM] Running inference...\n"
|
| 657 |
+
"[TRT-LLM] Output: 'asdfjkl; the the the purple 2847...'\n"
|
| 658 |
+
"[TRT-LLM] Perplexity: 2341.7 (expected < 10)"
|
| 659 |
+
),
|
| 660 |
+
initial_snippet=(
|
| 661 |
+
"# convert_weights.py\n"
|
| 662 |
+
"# gate_proj and up_proj were swapped during conversion\n"
|
| 663 |
+
"mapping = {\n"
|
| 664 |
+
" 'gate_proj': 'linear_fc1_gate',\n"
|
| 665 |
+
" 'up_proj': 'linear_fc1_up',\n"
|
| 666 |
+
"}\n"
|
| 667 |
+
"# TRT-LLM expects opposite order\n"
|
| 668 |
+
),
|
| 669 |
+
specialist_opinions={
|
| 670 |
+
"runtime": SpecialistOpinion("Runtime and engine init successful. No errors.", 0.80, False),
|
| 671 |
+
"dispatch": SpecialistOpinion("Backend dispatch is correct. TRT engine built fine.", 0.70, False),
|
| 672 |
+
"kernel": SpecialistOpinion(
|
| 673 |
+
"Kernels execute without error. This is a data issue, not compute.", 0.75, False
|
| 674 |
+
),
|
| 675 |
+
"loader": SpecialistOpinion(
|
| 676 |
+
"Weight mapping is wrong. gate_proj and up_proj are swapped in the conversion script. "
|
| 677 |
+
"TRT-LLM expects the opposite order.", 0.94, True
|
| 678 |
+
),
|
| 679 |
+
},
|
| 680 |
+
inspect_results=InspectResult(
|
| 681 |
+
logs="[TRT-LLM] Weight conversion: gate_proj -> linear_fc1_gate, up_proj -> linear_fc1_up\n[TRT-LLM] Expected: gate_proj -> linear_fc1_up, up_proj -> linear_fc1_gate",
|
| 682 |
+
config="weight_mapping:\n gate_proj: linear_fc1_gate # WRONG\n up_proj: linear_fc1_up # WRONG\n # Should be swapped",
|
| 683 |
+
snippet="# TRT-LLM MLP layout: [up_proj; gate_proj] concatenated\n# But converter wrote [gate_proj; up_proj]\n# Result: SiLU applied to wrong half",
|
| 684 |
+
metrics="output_perplexity: 2341.7\nexpected_perplexity: 8.2\nweight_shapes: correct\nweight_values: misaligned",
|
| 685 |
+
),
|
| 686 |
+
specialist_followups={
|
| 687 |
+
"runtime": "Engine runs fine. Not a runtime issue.",
|
| 688 |
+
"dispatch": "TRT engine dispatch is correct.",
|
| 689 |
+
"kernel": "Compute is correct for the data it gets. Fix the data (weights).",
|
| 690 |
+
"loader": "Classic weight mapping bug. Swap gate_proj and up_proj in the conversion mapping.",
|
| 691 |
+
},
|
| 692 |
+
))
|
| 693 |
+
|
| 694 |
+
# Scenario: QKV weight-mapping bug — the converter loads k_proj from the
# 'q_proj' key (copy-paste error), so K == Q and attention degenerates
# into softmax(Q @ Q^T), producing repetitive output. Ground truth:
# root_cause="weight_layout", correct_fix="fix_weight_mapping"; only the
# kernel and loader specialists point at the true cause (confident=True).
scenarios.append(Scenario(
    id="weight_layout_02",
    root_cause="weight_layout",
    correct_fix="fix_weight_mapping",
    incident_ticket=(
        "INCIDENT: QKV attention weights transposed incorrectly for GQA model. "
        "Attention scores are wrong — model generates repetitive text. "
        "Happened after switching from MHA to GQA config."
    ),
    hardware="AMD MI300X",
    model_name="Llama-4-Maverick-17Bx128E",
    backend="FlashInfer 0.4",
    initial_log=(
        "[FlashInfer] GQA mode: 64 query heads, 8 KV heads\n"
        "[FlashInfer] WARNING: QKV projection weight shape unexpected\n"
        "[FlashInfer] Expected Q:[8192,8192] K:[8192,1024] V:[8192,1024]\n"
        "[FlashInfer] Got Q:[8192,8192] K:[8192,8192] V:[8192,1024]\n"
        "[FlashInfer] Repetitive output detected"
    ),
    initial_snippet=(
        "# weight_converter.py\n"
        "# GQA: Q has num_heads, K/V have num_kv_heads\n"
        "q_proj = weights['q_proj'] # [8192, 8192] correct\n"
        "k_proj = weights['q_proj'] # BUG: should be 'k_proj'\n"
        "v_proj = weights['v_proj'] # [8192, 1024] correct\n"
    ),
    # Opinion values are (text, confidence, points_at_true_root_cause).
    specialist_opinions={
        "runtime": SpecialistOpinion("ROCm runtime fine.", 0.75, False),
        "dispatch": SpecialistOpinion("FlashInfer dispatch selected GQA path correctly.", 0.70, False),
        "kernel": SpecialistOpinion(
            "GQA attention kernel is correct but K weights are wrong shape. "
            "Looks like Q weights loaded twice instead of K.", 0.88, True
        ),
        "loader": SpecialistOpinion(
            "Weight mapping bug: k_proj loaded from q_proj key. Copy-paste error in converter.", 0.95, True
        ),
    },
    # Evidence revealed when the agent inspects the incident.
    inspect_results=InspectResult(
        logs="[FlashInfer] K weight shape [8192,8192] != expected [8192,1024]\n[FlashInfer] K weights appear identical to Q weights\n[FlashInfer] This causes attention to compute Q*Q^T instead of Q*K^T",
        config="num_query_heads: 64\nnum_kv_heads: 8\nhead_dim: 128\nq_shape: [8192,8192]\nk_shape: [8192,8192] # WRONG\nv_shape: [8192,1024]",
        snippet="# Bug in weight_converter.py line 47:\n# k_proj = weights['q_proj'] # should be weights['k_proj']\n# Result: K = Q, so attention = softmax(Q @ Q^T) -> repetitive",
        metrics="attention_entropy: 0.03 (expected > 2.0)\nrepetition_rate: 94%\nperplexity: 567.8",
    ),
    # What each specialist says after seeing the inspection evidence.
    specialist_followups={
        "runtime": "No runtime problems.",
        "dispatch": "GQA dispatch path is correct for this model.",
        "kernel": "Attention kernel computes correctly for the data given. K weights are just wrong.",
        "loader": "Line 47 has `weights['q_proj']` instead of `weights['k_proj']`. Classic copy-paste bug.",
    },
))
|
| 744 |
+
|
| 745 |
+
# --- arch_guard additional scenarios ---
# Scenario: hard-coded compute-capability allowlist in the TRT-LLM builder
# rejects the new B200 (sm_120) before any weights load. Ground truth:
# root_cause="arch_guard", correct_fix="relax_arch_check"; the dispatch
# specialist identifies the allowlist, the kernel specialist is a red herring.
scenarios.append(Scenario(
    id="arch_guard_03",
    root_cause="arch_guard",
    correct_fix="relax_arch_check",
    incident_ticket=(
        "INCIDENT: TensorRT-LLM refuses to build engine for B200 GPU. "
        "Error: 'Unsupported compute capability 120'. "
        "Same model builds fine targeting H100."
    ),
    hardware="NVIDIA B200",
    model_name="Qwen3-235B-A22B",
    backend="TensorRT-LLM 0.18",
    initial_log=(
        "[TRT-LLM] Building engine for gpu_arch=sm_120...\n"
        "[TRT-LLM] ERROR: Compute capability 120 not in supported set\n"
        "[TRT-LLM] Supported: {70, 75, 80, 86, 89, 90}"
    ),
    initial_snippet=(
        "# tensorrt_llm/builder.py\n"
        "SUPPORTED_SM = {70, 75, 80, 86, 89, 90}\n"
        "if sm not in SUPPORTED_SM:\n"
        "    raise UnsupportedGPU(f'sm_{sm}')"
    ),
    # Opinion values are (text, confidence, points_at_true_root_cause).
    specialist_opinions={
        "runtime": SpecialistOpinion("CUDA 13 runtime loaded fine.", 0.78, False),
        "dispatch": SpecialistOpinion(
            "Architecture guard rejects sm_120. B200 uses Blackwell arch not in the allowlist.", 0.91, True
        ),
        "kernel": SpecialistOpinion(
            "Try switching to a different quantization scheme for B200.", 0.45, False
        ),
        "loader": SpecialistOpinion("No weight loading attempted yet — blocked at engine build.", 0.72, False),
    },
    # Evidence revealed when the agent inspects the incident.
    inspect_results=InspectResult(
        logs="[TRT-LLM] sm_120 not in {70,75,80,86,89,90}\n[TRT-LLM] Engine build aborted before weight conversion",
        config="target_gpu: sm_120\nsupported_sm: [70,75,80,86,89,90]\nbuilder_version: 0.18.0",
        snippet="# B200 (sm_120) supports FP8 MMA, BF16 HMMA\n# Same instruction set as H100 for inference\n# Just not in the allowlist",
        metrics="engine_build_attempts: 1\nengine_build_failures: 1\nmodel_loaded: false",
    ),
    # What each specialist says after seeing the inspection evidence;
    # the kernel specialist retracts its earlier red-herring suggestion.
    specialist_followups={
        "runtime": "Runtime is fine. Engine builder is the blocker.",
        "dispatch": "Add sm_120 (and sm_12x family) to SUPPORTED_SM. The instructions are compatible.",
        "kernel": "On reflection, quantization scheme isn't the issue. It's the arch check.",
        "loader": "Can't load weights until engine builds.",
    },
))
|
| 792 |
+
|
| 793 |
+
# Scenario: Flash-Attention's AMD arch allowlist lacks gfx950 (MI355X/CDNA4),
# so the check fails even though the kernels would run. Ground truth:
# root_cause="arch_guard", correct_fix="relax_arch_check"; the kernel
# specialist initially doubts compatibility and retracts in the followup.
scenarios.append(Scenario(
    id="arch_guard_04",
    root_cause="arch_guard",
    correct_fix="relax_arch_check",
    incident_ticket=(
        "INCIDENT: Flash-Attention fwd pass returns CUDA error on MI355X. "
        "Error: 'Unsupported AMD GPU architecture'. "
        "MI300X works fine with same code."
    ),
    hardware="AMD MI355X",
    model_name="Llama-3.3-70B-Instruct",
    backend="vLLM 0.8.x",
    initial_log=(
        "[Flash-Attn] Checking GPU: AMD Instinct MI355X (gfx950)\n"
        "[Flash-Attn] Supported AMD archs: [gfx90a, gfx942]\n"
        "[Flash-Attn] ERROR: gfx950 not supported"
    ),
    initial_snippet=(
        "# flash_attn/amd_check.py\n"
        "AMD_SUPPORTED = ['gfx90a', 'gfx942']\n"
        "if gpu_arch not in AMD_SUPPORTED:\n"
        "    raise RuntimeError(f'{gpu_arch} not supported')"
    ),
    # Opinion values are (text, confidence, points_at_true_root_cause).
    specialist_opinions={
        "runtime": SpecialistOpinion("ROCm 6.4 runtime operational.", 0.80, False),
        "dispatch": SpecialistOpinion(
            "gfx950 (MI355X/CDNA4) isn't in the AMD arch allowlist. Needs to be added.", 0.92, True
        ),
        "kernel": SpecialistOpinion(
            "MI355X has different MFMA tile sizes — kernel might actually be incompatible.", 0.55, False
        ),
        "loader": SpecialistOpinion("Can't assess — kernel never launched.", 0.60, False),
    },
    # Evidence revealed when the agent inspects the incident.
    inspect_results=InspectResult(
        logs="[Flash-Attn] gfx950 not in [gfx90a, gfx942]\n[Flash-Attn] MI355X CDNA4 arch check failed",
        config="gpu_arch: gfx950\namd_supported: [gfx90a, gfx942]\nrocm_version: 6.4",
        snippet="# MI355X (gfx950/CDNA4) extends gfx942 instruction set\n# MFMA f32_32x32x16_fp8 available\n# Just missing from allowlist",
        metrics="kernel_launch_failures: 1\ngpu_utilization: 0%",
    ),
    # Post-inspection statements; the kernel specialist corrects itself.
    specialist_followups={
        "runtime": "ROCm works. Not a runtime issue.",
        "dispatch": "Add gfx950 to AMD_SUPPORTED. CDNA4 is backwards-compatible with gfx942 kernels.",
        "kernel": "I was wrong — gfx950 does support the needed MFMA instructions. It's just the allowlist.",
        "loader": "No weight issues.",
    },
))
|
| 839 |
+
|
| 840 |
+
# Scenario: Triton's JIT target registry does not know sm_120 (RTX 5090),
# so MoE kernel compilation aborts before PTX generation. Ground truth:
# root_cause="arch_guard", correct_fix="relax_arch_check"; both dispatch
# and kernel specialists point at the compiler target check.
scenarios.append(Scenario(
    id="arch_guard_05",
    root_cause="arch_guard",
    correct_fix="relax_arch_check",
    incident_ticket=(
        "INCIDENT: Triton kernel compilation fails on RTX 5090 for custom MoE layer. "
        "Error: 'target sm_120 not recognized'. Compiled fine for sm_90."
    ),
    hardware="NVIDIA SM120 (GeForce RTX 5090)",
    model_name="DeepSeek-V3-671B",
    backend="SGLang 0.5.x",
    initial_log=(
        "[Triton] Compiling MoE routing kernel for sm_120...\n"
        "[Triton] ERROR: Unknown target 'sm_120'\n"
        "[Triton] Known targets: sm_70, sm_75, sm_80, sm_86, sm_89, sm_90"
    ),
    initial_snippet=(
        "# triton/compiler/target.py\n"
        "KNOWN_TARGETS = ['sm_70','sm_75','sm_80','sm_86','sm_89','sm_90']\n"
    ),
    # Opinion values are (text, confidence, points_at_true_root_cause).
    specialist_opinions={
        "runtime": SpecialistOpinion("CUDA and Triton installed correctly.", 0.78, False),
        "dispatch": SpecialistOpinion(
            "Triton's target list doesn't include sm_120. Need to add Blackwell family.", 0.90, True
        ),
        "kernel": SpecialistOpinion(
            "The MoE kernel uses standard tl.dot which works on any SM >= 70.", 0.82, True
        ),
        "loader": SpecialistOpinion(
            "Weights load fine. Error is at JIT compilation stage.", 0.70, False
        ),
    },
    # Evidence revealed when the agent inspects the incident.
    inspect_results=InspectResult(
        logs="[Triton] JIT target 'sm_120' not recognized\n[Triton] Compilation aborted before PTX generation",
        config="triton_target: sm_120\nknown_targets: [sm_70..sm_90]\ntriton_version: 3.2",
        snippet="# Triton target registry doesn't know sm_120\n# sm_120 can use sm_90 codegen path\n# Add sm_120 to target list or use family mapping",
        metrics="jit_compile_failures: 1\nkernel_cache_hits: 0",
    ),
    # Post-inspection statements from each specialist.
    specialist_followups={
        "runtime": "No runtime issue. Triton JIT compiler is the blocker.",
        "dispatch": "Triton target registry needs sm_120. Can map to sm_90 codegen path since instruction set overlaps.",
        "kernel": "The kernel code is fine — it's the compiler target check, not the kernel logic.",
        "loader": "No weight involvement at this stage.",
    },
))
|
| 885 |
+
|
| 886 |
+
# --- backend_whitelist additional scenarios ---
# Scenario: vLLM's Marlin quantization path uses a GPU-name string whitelist
# that omits 'B200', blocking GPTQ serving on capable hardware. Ground
# truth: root_cause="backend_whitelist", correct_fix="add_whitelist_entry";
# the kernel specialist is an initially-plausible red herring.
scenarios.append(Scenario(
    id="backend_whitelist_03",
    root_cause="backend_whitelist",
    correct_fix="add_whitelist_entry",
    incident_ticket=(
        "INCIDENT: GPTQ quantization fails on B200 with 'GPU not whitelisted for Marlin'. "
        "Same quantized model serves fine on H100. B200 has FP16 working."
    ),
    hardware="NVIDIA B200",
    model_name="Mistral-Large-2",
    backend="vLLM 0.8.x",
    initial_log=(
        "[vLLM] Loading GPTQ model on B200...\n"
        "[vLLM] Marlin check: GPU 'NVIDIA B200' not whitelisted\n"
        "[vLLM] Available kernels for non-whitelisted: none\n"
        "[vLLM] ERROR: Cannot serve quantized model"
    ),
    initial_snippet=(
        "# vllm/quantization/marlin.py\n"
        "WHITELIST = {'A100','H100','A10G','L40S','RTX 4090'}\n"
        "if gpu_name not in WHITELIST:\n"
        "    raise RuntimeError('GPU not whitelisted')\n"
    ),
    # Opinion values are (text, confidence, points_at_true_root_cause).
    specialist_opinions={
        "runtime": SpecialistOpinion("CUDA runtime healthy on B200.", 0.80, False),
        "dispatch": SpecialistOpinion(
            "Whitelist check is string-based. 'B200' not in the set. Add it.", 0.93, True
        ),
        "kernel": SpecialistOpinion(
            "B200 FP8 is different from H100. Might need a different quantization kernel.", 0.50, False
        ),
        "loader": SpecialistOpinion("Quantized weights loaded correctly.", 0.75, False),
    },
    # Evidence revealed when the agent inspects the incident.
    inspect_results=InspectResult(
        logs="[Marlin] GPU 'NVIDIA B200' not in whitelist\n[Marlin] Whitelist: {A100,H100,A10G,L40S,RTX 4090}",
        config="gpu_name: NVIDIA B200\nmarlin_whitelist: [A100,H100,A10G,L40S,RTX 4090]\nquant_method: gptq",
        snippet="# B200 supports all Marlin GEMM ops (INT4 deq + FP16 MMA)\n# Name-based whitelist just doesn't include it\n# Fix: add 'B200' or switch to arch-based check",
        metrics="quant_init_failures: 1\nfp16_serving: available\nquant_serving: blocked",
    ),
    # Post-inspection statements; the kernel specialist retracts.
    specialist_followups={
        "runtime": "Runtime fine.",
        "dispatch": "Simple whitelist gap. Add 'B200' to WHITELIST set.",
        "kernel": "I was wrong — B200 Marlin kernels use same INT4 deq + MMA path as H100. Whitelist issue only.",
        "loader": "Weights are fine.",
    },
))
|
| 933 |
+
|
| 934 |
+
# Scenario: FlashInfer's FP8 dispatch gate only lists sm_89/sm_90 (Ada,
# Hopper) and so blocks sm_121 (DGX Spark) despite native FP8 hardware.
# Ground truth: root_cause="backend_whitelist",
# correct_fix="add_whitelist_entry".
scenarios.append(Scenario(
    id="backend_whitelist_04",
    root_cause="backend_whitelist",
    correct_fix="add_whitelist_entry",
    incident_ticket=(
        "INCIDENT: FlashInfer FP8 GEMM blocked on DGX Spark. "
        "Error: 'FP8 dispatch not available for this GPU'. "
        "SM121 should support FP8 natively."
    ),
    hardware="NVIDIA SM121 (DGX Spark)",
    model_name="DeepSeek-R1-Distill-70B",
    backend="FlashInfer 0.4",
    initial_log=(
        "[FlashInfer] FP8 GEMM dispatch...\n"
        "[FlashInfer] GPU family check: sm_121\n"
        "[FlashInfer] FP8 whitelist: [sm_89, sm_90]\n"
        "[FlashInfer] ERROR: FP8 not available for sm_121"
    ),
    initial_snippet=(
        "# flashinfer/gemm/fp8_dispatch.py\n"
        "FP8_ENABLED_SM = {89, 90} # Ada, Hopper\n"
        "# Missing SM12x which has FP8 MMA\n"
    ),
    # Opinion values are (text, confidence, points_at_true_root_cause).
    specialist_opinions={
        "runtime": SpecialistOpinion("CUDA 13 runtime fine.", 0.78, False),
        "dispatch": SpecialistOpinion(
            "FP8 dispatch whitelist only has Ada/Hopper. SM121 supports FP8 MMA natively but isn't listed.", 0.94, True
        ),
        "kernel": SpecialistOpinion(
            "SM121 FP8 might use different MMA instruction encoding.", 0.48, False
        ),
        "loader": SpecialistOpinion("FP8 weights loaded. Dispatch is the blocker.", 0.82, True),
    },
    # Evidence revealed when the agent inspects the incident.
    inspect_results=InspectResult(
        logs="[FlashInfer] sm_121 not in FP8_ENABLED_SM {89, 90}\n[FlashInfer] FP8 GEMM dispatch blocked",
        config="gpu_sm: 121\nfp8_whitelist: [89, 90]\nfp8_hw_support: true",
        snippet="# SM121 uses m16n8k32 FP8 MMA (same encoding as SM90)\n# Just not in FP8_ENABLED_SM set\n# Add 120, 121 to enable FP8 dispatch",
        metrics="fp8_dispatch_blocked: true\nfp8_hw_capable: true\nfallback_to_bf16: not_attempted",
    ),
    # Post-inspection statements; the kernel specialist retracts.
    specialist_followups={
        "runtime": "Runtime is fine.",
        "dispatch": "Add SM12x to FP8_ENABLED_SM. SM121 uses identical FP8 MMA to SM90.",
        "kernel": "I checked — SM121 uses the same m16n8k32 encoding as SM90. My concern was unfounded.",
        "loader": "FP8 weights are ready. Just need dispatch to be unblocked.",
    },
))
|
| 980 |
+
|
| 981 |
+
# Scenario: SGLang's speculative-decoding whitelist is datacenter-only and
# rejects the RTX 5090 even though its VRAM is sufficient. Ground truth:
# root_cause="backend_whitelist", correct_fix="add_whitelist_entry".
# NOTE(fix): the runtime opinion previously said "24GB VRAM", contradicting
# this scenario's own config ("vram: 32GB"), metrics ("vram_available: 32GB"),
# snippet, and followups — corrected to 32GB so the evidence is consistent.
scenarios.append(Scenario(
    id="backend_whitelist_05",
    root_cause="backend_whitelist",
    correct_fix="add_whitelist_entry",
    incident_ticket=(
        "INCIDENT: SGLang refuses to enable speculative decoding on RTX 5090. "
        "Error: 'Speculative decoding not supported for consumer GPUs'. "
        "Feature works on A100."
    ),
    hardware="NVIDIA SM120 (GeForce RTX 5090)",
    model_name="Llama-3.3-70B-Instruct",
    backend="SGLang 0.5.x",
    initial_log=(
        "[SGLang] Speculative decoding requested...\n"
        "[SGLang] GPU: GeForce RTX 5090\n"
        "[SGLang] Spec decode whitelist: [A100, H100, A10G]\n"
        "[SGLang] ERROR: Consumer GPU not in spec-decode whitelist"
    ),
    initial_snippet=(
        "# sglang/server/spec_decode.py\n"
        "SPEC_DECODE_GPUS = ['A100', 'H100', 'A10G']\n"
        "# Only data center GPUs whitelisted\n"
    ),
    # Opinion values are (text, confidence, points_at_true_root_cause).
    specialist_opinions={
        "runtime": SpecialistOpinion("Runtime fine. GPU has 32GB VRAM.", 0.78, False),
        "dispatch": SpecialistOpinion(
            "RTX 5090 not in spec-decode whitelist. Datacenter-only check is too restrictive.", 0.91, True
        ),
        "kernel": SpecialistOpinion(
            "RTX 5090 might not have enough VRAM for speculative decoding with 70B.", 0.60, False
        ),
        "loader": SpecialistOpinion("Model weights fine.", 0.72, False),
    },
    # Evidence revealed when the agent inspects the incident.
    inspect_results=InspectResult(
        logs="[SGLang] GPU 'GeForce RTX 5090' not in SPEC_DECODE_GPUS\n[SGLang] Whitelist is datacenter-only",
        config="gpu_name: GeForce RTX 5090\nspec_decode_whitelist: [A100,H100,A10G]\nvram: 32GB",
        snippet="# RTX 5090 has 32GB VRAM, sufficient for spec decode\n# Whitelist artificially restricts to datacenter GPUs\n# Add RTX 5090 or use VRAM-based check",
        metrics="spec_decode_attempts: 1\nspec_decode_blocked: true\nvram_available: 32GB",
    ),
    # Post-inspection statements; the kernel specialist's VRAM concern is
    # resolved by the 32GB figure in the evidence.
    specialist_followups={
        "runtime": "No runtime issue.",
        "dispatch": "Add RTX 5090 to whitelist. 32GB VRAM is plenty for spec decode.",
        "kernel": "32GB is sufficient for speculative decoding with 70B quantized. VRAM isn't the issue.",
        "loader": "Weights loaded. Dispatch blocker only.",
    },
))
|
| 1027 |
+
|
| 1028 |
+
# --- runtime_loader additional scenarios ---
# Scenario: libcublas.so.13 exists on disk but LD_LIBRARY_PATH omits
# /usr/local/cuda-13/lib64, so vLLM fails at CUDA init. Ground truth:
# root_cause="runtime_loader", correct_fix="fix_runtime_path".
scenarios.append(Scenario(
    id="runtime_loader_03",
    root_cause="runtime_loader",
    correct_fix="fix_runtime_path",
    incident_ticket=(
        "INCIDENT: vLLM fails with 'libcublas.so.13 not found' on freshly provisioned node. "
        "nvidia-smi shows GPU. CUDA toolkit installed. Other CUDA apps work."
    ),
    hardware="NVIDIA H100",
    model_name="Llama-4-Maverick-17Bx128E",
    backend="vLLM 0.8.x",
    initial_log=(
        "[vLLM] Initializing CUDA...\n"
        "[vLLM] ERROR: libcublas.so.13: cannot open shared object file\n"
        "[vLLM] LD_LIBRARY_PATH=/usr/lib/x86_64-linux-gnu\n"
        "[vLLM] Note: /usr/local/cuda-13/lib64 not in path"
    ),
    initial_snippet=(
        "# /etc/environment\n"
        "LD_LIBRARY_PATH=/usr/lib/x86_64-linux-gnu\n"
        "# Missing: /usr/local/cuda-13/lib64\n"
    ),
    # Opinion values are (text, confidence, points_at_true_root_cause).
    specialist_opinions={
        "runtime": SpecialistOpinion(
            "CUDA 13 is installed but its lib64 directory isn't in LD_LIBRARY_PATH. Path fix needed.", 0.95, True
        ),
        "dispatch": SpecialistOpinion("Server crashes before any dispatch.", 0.65, False),
        "kernel": SpecialistOpinion("Not a kernel issue — can't load CUDA libraries.", 0.70, False),
        "loader": SpecialistOpinion(
            "Dynamic linker can't find libcublas.so.13. Add CUDA 13 lib path.", 0.90, True
        ),
    },
    # Evidence revealed when the agent inspects the incident.
    inspect_results=InspectResult(
        logs="[ldconfig] libcublas.so.13 not in cache\n[System] /usr/local/cuda-13/lib64/libcublas.so.13 EXISTS but not in path",
        config="LD_LIBRARY_PATH=/usr/lib/x86_64-linux-gnu\ncuda_13_libs=/usr/local/cuda-13/lib64\nldconfig_cache: stale",
        snippet="# libcublas.so.13 exists at /usr/local/cuda-13/lib64/\n# But LD_LIBRARY_PATH doesn't include it\n# Fix: add /usr/local/cuda-13/lib64 to LD_LIBRARY_PATH",
        metrics="import_failures: 1\ncuda_available: false (library missing)",
    ),
    # Post-inspection statements from each specialist.
    specialist_followups={
        "runtime": "Classic provisioning issue. CUDA installed but path not configured. Add to LD_LIBRARY_PATH.",
        "dispatch": "Nothing to dispatch — server won't start.",
        "kernel": "No kernel involvement.",
        "loader": "Add /usr/local/cuda-13/lib64 to LD_LIBRARY_PATH or run ldconfig.",
    },
))
|
| 1074 |
+
|
| 1075 |
+
# Scenario: nvcc exists at /usr/local/cuda-13/bin but neither PATH nor
# CUDA_HOME exposes it, so FlashInfer's JIT compilation fails on sm_121.
# Ground truth: root_cause="runtime_loader", correct_fix="fix_runtime_path".
scenarios.append(Scenario(
    id="runtime_loader_04",
    root_cause="runtime_loader",
    correct_fix="fix_runtime_path",
    incident_ticket=(
        "INCIDENT: FlashInfer JIT compilation fails with 'nvcc not found'. "
        "GPU inference should work but JIT kernels can't compile. "
        "nvidia-smi works fine."
    ),
    hardware="NVIDIA SM121 (DGX Spark)",
    model_name="Qwen3-235B-A22B",
    backend="FlashInfer 0.4",
    initial_log=(
        "[FlashInfer] JIT compiling attention kernel for sm_121...\n"
        "[FlashInfer] Searching for nvcc...\n"
        "[FlashInfer] ERROR: nvcc not found in PATH\n"
        "[FlashInfer] CUDA_HOME not set"
    ),
    initial_snippet=(
        "# Container environment\n"
        "PATH=/usr/local/bin:/usr/bin:/bin\n"
        "# Missing: /usr/local/cuda-13/bin (where nvcc lives)\n"
        "CUDA_HOME= # not set\n"
    ),
    # Opinion values are (text, confidence, points_at_true_root_cause).
    specialist_opinions={
        "runtime": SpecialistOpinion(
            "CUDA toolkit is installed but nvcc isn't in PATH and CUDA_HOME isn't set.", 0.93, True
        ),
        "dispatch": SpecialistOpinion("Dispatch can't run without JIT-compiled kernels.", 0.60, False),
        "kernel": SpecialistOpinion(
            "SM121 needs JIT compilation for attention kernels. Without nvcc, it can't compile.", 0.80, True
        ),
        "loader": SpecialistOpinion("Try using pre-compiled AOT kernels instead.", 0.45, False),
    },
    # Evidence revealed when the agent inspects the incident.
    inspect_results=InspectResult(
        logs="[System] which nvcc -> not found\n[System] ls /usr/local/cuda-13/bin/nvcc -> EXISTS\n[System] CUDA_HOME unset",
        config="PATH=/usr/local/bin:/usr/bin:/bin\nCUDA_HOME=(unset)\nnvcc_location=/usr/local/cuda-13/bin/nvcc",
        snippet="# nvcc exists at /usr/local/cuda-13/bin/ but not in PATH\n# Fix: export CUDA_HOME=/usr/local/cuda-13\n# Fix: export PATH=$CUDA_HOME/bin:$PATH",
        metrics="jit_compile_attempts: 3\njit_compile_failures: 3\naot_kernels_available: false",
    ),
    # Post-inspection statements; the loader's AOT suggestion is ruled out.
    specialist_followups={
        "runtime": "Set CUDA_HOME=/usr/local/cuda-13 and add its bin/ to PATH.",
        "dispatch": "Once nvcc is found, JIT compilation will work and dispatch proceeds normally.",
        "kernel": "The kernel code is ready to compile. Just need the compiler to be findable.",
        "loader": "AOT kernels aren't available for SM121 yet. JIT path is needed.",
    },
))
|
| 1122 |
+
|
| 1123 |
+
# Scenario: PyTorch ROCm import fails because /opt/rocm-6.3/lib is absent
# from LD_LIBRARY_PATH, so libtorch_hip.so cannot be resolved. Ground
# truth: root_cause="runtime_loader", correct_fix="fix_runtime_path".
scenarios.append(Scenario(
    id="runtime_loader_05",
    root_cause="runtime_loader",
    correct_fix="fix_runtime_path",
    incident_ticket=(
        "INCIDENT: Python can't import torch on MI300X node. "
        "Error: 'libtorch_hip.so: cannot open shared object'. "
        "PyTorch ROCm wheel installed but missing HIP libs."
    ),
    hardware="AMD MI300X",
    model_name="Mistral-Large-2",
    backend="vLLM 0.8.x",
    initial_log=(
        "[Python] import torch\n"
        "[Python] ERROR: libtorch_hip.so: cannot open shared object file\n"
        "[System] ROCm installed at /opt/rocm-6.3\n"
        "[System] LD_LIBRARY_PATH does not include /opt/rocm-6.3/lib"
    ),
    initial_snippet=(
        "# Container env\n"
        "LD_LIBRARY_PATH=/usr/local/lib\n"
        "# Needs: /opt/rocm-6.3/lib:/opt/rocm-6.3/hip/lib\n"
    ),
    # Opinion values are (text, confidence, points_at_true_root_cause).
    specialist_opinions={
        "runtime": SpecialistOpinion(
            "ROCm 6.3 installed but libs not in LD_LIBRARY_PATH. Classic path issue.", 0.94, True
        ),
        "dispatch": SpecialistOpinion("Can't assess — Python crashes on import.", 0.50, False),
        "kernel": SpecialistOpinion("Maybe PyTorch ROCm wheel is for wrong ROCm version.", 0.55, False),
        "loader": SpecialistOpinion(
            "Dynamic linker needs /opt/rocm-6.3/lib in LD_LIBRARY_PATH.", 0.90, True
        ),
    },
    # Evidence revealed when the agent inspects the incident.
    inspect_results=InspectResult(
        logs="[System] /opt/rocm-6.3/lib/libtorch_hip.so EXISTS\n[System] ldd: libtorch_hip.so => not found\n[System] LD_LIBRARY_PATH=/usr/local/lib only",
        config="LD_LIBRARY_PATH=/usr/local/lib\nrocm_path=/opt/rocm-6.3\nrocm_lib=/opt/rocm-6.3/lib",
        snippet="# ROCm libs at /opt/rocm-6.3/lib/ and /opt/rocm-6.3/hip/lib/\n# Not in LD_LIBRARY_PATH\n# Fix: export LD_LIBRARY_PATH=/opt/rocm-6.3/lib:/opt/rocm-6.3/hip/lib:$LD_LIBRARY_PATH",
        metrics="import_failures: 1\ntorch_available: false",
    ),
    # Post-inspection statements; the kernel's version-mismatch guess is ruled out.
    specialist_followups={
        "runtime": "Add ROCm lib paths to LD_LIBRARY_PATH. Standard post-install issue.",
        "dispatch": "Can't run without PyTorch importing.",
        "kernel": "The ROCm version matches the wheel. It's just a path issue.",
        "loader": "Add /opt/rocm-6.3/lib to LD_LIBRARY_PATH.",
    },
))
|
| 1169 |
+
|
| 1170 |
+
# --- backend_selector additional scenarios ---
# Scenario: the MoE GEMM backend selector falls back to generic cuBLAS for
# expert counts above 64, causing 256 kernel launches per layer and a 5x
# throughput loss. Ground truth: root_cause="backend_selector",
# correct_fix="switch_backend".
scenarios.append(Scenario(
    id="backend_selector_03",
    root_cause="backend_selector",
    correct_fix="switch_backend",
    incident_ticket=(
        "INCIDENT: SGLang MoE expert parallelism selecting wrong GEMM backend. "
        "Using generic GEMM instead of grouped GEMM for MoE layers. "
        "Throughput is 5x lower than expected."
    ),
    hardware="NVIDIA H100",
    model_name="DeepSeek-V3-671B",
    backend="SGLang 0.5.x",
    initial_log=(
        "[SGLang] MoE layer: 256 experts, top-8 routing\n"
        "[SGLang] GEMM backend: generic (cublas)\n"
        "[SGLang] WARNING: Grouped GEMM backend not selected\n"
        "[SGLang] Throughput: 15 tok/s (expected: 80 tok/s)"
    ),
    initial_snippet=(
        "# sglang/moe/dispatch.py\n"
        "def select_moe_backend(num_experts, gpu):\n"
        "    if num_experts <= 64:\n"
        "        return 'grouped_gemm'\n"
        "    return 'generic' # Wrong fallback for large expert count\n"
    ),
    # Opinion values are (text, confidence, points_at_true_root_cause).
    specialist_opinions={
        "runtime": SpecialistOpinion("CUDA runtime fine. No errors.", 0.75, False),
        "dispatch": SpecialistOpinion(
            "MoE backend selector falls back to generic GEMM when experts > 64. "
            "Should use grouped GEMM for any expert count on H100.", 0.95, True
        ),
        "kernel": SpecialistOpinion(
            "Generic cuBLAS GEMM launches one kernel per expert. Grouped GEMM batches them. "
            "Switch to grouped GEMM backend.", 0.88, True
        ),
        "loader": SpecialistOpinion("Weights loaded. Not a loading issue.", 0.72, False),
    },
    # Evidence revealed when the agent inspects the incident.
    inspect_results=InspectResult(
        logs="[SGLang] 256 experts > 64 threshold -> generic backend\n[SGLang] Each expert: separate cuBLAS call\n[SGLang] Kernel launch overhead: 256 launches/layer",
        config="num_experts: 256\nmoe_backend: generic\nthreshold: 64\ngpu: H100",
        snippet="# Backend selector has wrong threshold logic\n# Should use grouped_gemm for ALL expert counts on H100\n# Current: only grouped_gemm when experts <= 64",
        metrics="throughput_tok_s: 15\nexpected_throughput: 80\nkernel_launches_per_step: 256\ngpu_utilization: 18%",
    ),
    # Post-inspection statements from each specialist.
    specialist_followups={
        "runtime": "No runtime issues.",
        "dispatch": "Switch to grouped_gemm backend. The 64-expert threshold is a bug.",
        "kernel": "Grouped GEMM would batch all 256 experts into one kernel launch. 10-15x fewer launches.",
        "loader": "Not a weight issue.",
    },
))
|
| 1221 |
+
|
| 1222 |
+
# Scenario: the FlashAttention version selector's SM list predates sm_120,
# so B200 gets the v1 path with quadratic attention memory and OOMs at
# batch 32. Ground truth: root_cause="backend_selector",
# correct_fix="switch_backend".
scenarios.append(Scenario(
    id="backend_selector_04",
    root_cause="backend_selector",
    correct_fix="switch_backend",
    incident_ticket=(
        "INCIDENT: Attention on B200 using FlashAttention v1 path instead of v2. "
        "Memory usage 3x higher than expected. OOM on large batch sizes. "
        "Same model fits in memory on H100."
    ),
    hardware="NVIDIA B200",
    model_name="Llama-4-Maverick-17Bx128E",
    backend="vLLM 0.8.x",
    initial_log=(
        "[vLLM] Attention backend: flash_attn_v1\n"
        "[vLLM] WARNING: v2 backend not selected (GPU not in v2 list)\n"
        "[vLLM] Memory: attention uses O(n^2) instead of O(n)\n"
        "[vLLM] OOM at batch_size=32 (expected to fit at batch_size=128)"
    ),
    initial_snippet=(
        "# vllm/attention/selector.py\n"
        "def select_flash_version(gpu_sm):\n"
        "    if gpu_sm in {80, 86, 89, 90}:\n"
        "        return 'v2'\n"
        "    return 'v1' # B200 (sm_120) falls here\n"
    ),
    # Opinion values are (text, confidence, points_at_true_root_cause).
    specialist_opinions={
        "runtime": SpecialistOpinion("CUDA runtime OK. Memory allocation works.", 0.75, False),
        "dispatch": SpecialistOpinion(
            "Backend selector picks FA v1 for sm_120. B200 supports v2 — selector needs updating.", 0.93, True
        ),
        "kernel": SpecialistOpinion(
            "FA v1 uses O(n^2) memory. v2 uses O(n). That explains the OOM.", 0.85, True
        ),
        "loader": SpecialistOpinion(
            "Maybe model weights are larger than expected for this architecture.", 0.45, False
        ),
    },
    # Evidence revealed when the agent inspects the incident.
    inspect_results=InspectResult(
        logs="[vLLM] sm_120 not in {80,86,89,90} -> flash_attn_v1\n[vLLM] FA v1 attention memory: O(seq_len^2)\n[vLLM] OOM threshold hit at 32 batch",
        config="gpu_sm: 120\nflash_attn_version: v1\nv2_supported_sm: [80,86,89,90]\nmemory_profile: quadratic",
        snippet="# B200 (sm_120) supports FlashAttention v2\n# Selector only checks old SM list\n# Fix: add sm_120 to v2 supported set or switch to v2 backend",
        metrics="attention_memory_gb: 24.5\nexpected_attention_memory_gb: 2.1\nbatch_size_limit: 32\nexpected_batch_limit: 128",
    ),
    # Post-inspection statements; the loader's weight-size guess is ruled out.
    specialist_followups={
        "runtime": "Memory system works. Problem is FA v1's quadratic memory.",
        "dispatch": "Add sm_120 to v2 supported set. B200 has full v2 support.",
        "kernel": "FA v1 materializes full attention matrix. v2 uses tiling. Fix the selector.",
        "loader": "Weight size is correct. It's the attention memory that's excessive.",
    },
))
))
|
| 1272 |
+
|
| 1273 |
+
scenarios.append(Scenario(
|
| 1274 |
+
id="backend_selector_05",
|
| 1275 |
+
root_cause="backend_selector",
|
| 1276 |
+
correct_fix="switch_backend",
|
| 1277 |
+
incident_ticket=(
|
| 1278 |
+
"INCIDENT: MI300X inference using CK (Composable Kernel) attention but should use Triton. "
|
| 1279 |
+
"CK path has a known bug with GQA + variable-length sequences. "
|
| 1280 |
+
"Random crashes during batched inference."
|
| 1281 |
+
),
|
| 1282 |
+
hardware="AMD MI300X",
|
| 1283 |
+
model_name="Qwen3-235B-A22B",
|
| 1284 |
+
backend="vLLM 0.8.x",
|
| 1285 |
+
initial_log=(
|
| 1286 |
+
"[vLLM] AMD GPU detected -> Composable Kernel attention\n"
|
| 1287 |
+
"[vLLM] GQA + varlen: CK backend selected\n"
|
| 1288 |
+
"[vLLM] CRASH: segfault in ck_attention_varlen_gqa\n"
|
| 1289 |
+
"[vLLM] This is a known CK bug. Use Triton backend instead."
|
| 1290 |
+
),
|
| 1291 |
+
initial_snippet=(
|
| 1292 |
+
"# vllm/attention/backends/rocm.py\n"
|
| 1293 |
+
"def get_rocm_backend(config):\n"
|
| 1294 |
+
" return 'composable_kernel' # Always uses CK\n"
|
| 1295 |
+
" # Should check for known CK bugs and use Triton\n"
|
| 1296 |
+
),
|
| 1297 |
+
specialist_opinions={
|
| 1298 |
+
"runtime": SpecialistOpinion("ROCm runtime fine before the segfault.", 0.72, False),
|
| 1299 |
+
"dispatch": SpecialistOpinion(
|
| 1300 |
+
"Backend selector always picks CK on AMD. Should use Triton for GQA+varlen due to known CK bug.", 0.94, True
|
| 1301 |
+
),
|
| 1302 |
+
"kernel": SpecialistOpinion(
|
| 1303 |
+
"Known CK bug with GQA + varlen sequences. Triton attention works correctly.", 0.90, True
|
| 1304 |
+
),
|
| 1305 |
+
"loader": SpecialistOpinion("Might be a weight alignment issue for AMD.", 0.40, False),
|
| 1306 |
+
},
|
| 1307 |
+
inspect_results=InspectResult(
|
| 1308 |
+
logs="[CK] ck_attention_varlen_gqa: SIGSEGV\n[CK] Known issue: GQA + variable-length triggers OOB access\n[Triton] Triton attention works for this config",
|
| 1309 |
+
config="rocm_attention: composable_kernel\ngqa_enabled: true\nvarlen: true\nknown_ck_bugs: [gqa_varlen]",
|
| 1310 |
+
snippet="# CK has a bug in GQA + varlen attention (OOB memory access)\n# Triton backend handles this correctly\n# Fix: route GQA+varlen to Triton on AMD",
|
| 1311 |
+
metrics="crashes: 3/10 requests\nsegfaults: 3\ntriton_fallback: not_configured",
|
| 1312 |
+
),
|
| 1313 |
+
specialist_followups={
|
| 1314 |
+
"runtime": "The segfault is in CK library code, not a runtime issue.",
|
| 1315 |
+
"dispatch": "Switch to Triton attention for GQA+varlen on AMD. CK bug is known and not yet fixed upstream.",
|
| 1316 |
+
"kernel": "CK varlen GQA kernel has off-by-one in tile boundary. Triton implementation doesn't have this bug.",
|
| 1317 |
+
"loader": "Not a weight issue. The crash is in the attention computation.",
|
| 1318 |
+
},
|
| 1319 |
+
))
|
| 1320 |
+
|
| 1321 |
+
# --- model_config additional scenarios ---
|
| 1322 |
+
scenarios.append(Scenario(
|
| 1323 |
+
id="model_config_03",
|
| 1324 |
+
root_cause="model_config",
|
| 1325 |
+
correct_fix="update_model_config",
|
| 1326 |
+
incident_ticket=(
|
| 1327 |
+
"INCIDENT: DeepSeek MLA attention produces wrong KV cache size. "
|
| 1328 |
+
"OOM on sequences that should fit. Config shows standard MHA dimensions "
|
| 1329 |
+
"but model uses MLA with compressed KV."
|
| 1330 |
+
),
|
| 1331 |
+
hardware="NVIDIA SM121 (DGX Spark)",
|
| 1332 |
+
model_name="DeepSeek-V3-671B",
|
| 1333 |
+
backend="FlashInfer 0.4",
|
| 1334 |
+
initial_log=(
|
| 1335 |
+
"[FlashInfer] KV cache: allocating for 64 KV heads x 128 dim = 8192 per token\n"
|
| 1336 |
+
"[FlashInfer] Expected MLA: kv_lora_rank=512, much smaller KV cache\n"
|
| 1337 |
+
"[FlashInfer] OOM: KV cache exceeds 80GB at seq_len=4096"
|
| 1338 |
+
),
|
| 1339 |
+
initial_snippet=(
|
| 1340 |
+
"# config.json\n"
|
| 1341 |
+
'{\n'
|
| 1342 |
+
' "num_key_value_heads": 64,\n'
|
| 1343 |
+
' "head_dim": 128\n'
|
| 1344 |
+
' // Missing: kv_lora_rank, qk_rope_head_dim for MLA\n'
|
| 1345 |
+
'}\n'
|
| 1346 |
+
),
|
| 1347 |
+
specialist_opinions={
|
| 1348 |
+
"runtime": SpecialistOpinion("Memory allocation works. Just allocating too much.", 0.72, False),
|
| 1349 |
+
"dispatch": SpecialistOpinion("FlashInfer correctly reading config. Config is the problem.", 0.68, False),
|
| 1350 |
+
"kernel": SpecialistOpinion(
|
| 1351 |
+
"MLA attention needs kv_lora_rank in config to use compressed KV. "
|
| 1352 |
+
"Without it, falls back to full MHA KV cache sizing.", 0.92, True
|
| 1353 |
+
),
|
| 1354 |
+
"loader": SpecialistOpinion(
|
| 1355 |
+
"Config.json doesn't have MLA parameters. Need kv_lora_rank=512 and qk_rope_head_dim=64.", 0.93, True
|
| 1356 |
+
),
|
| 1357 |
+
},
|
| 1358 |
+
inspect_results=InspectResult(
|
| 1359 |
+
logs="[FlashInfer] No kv_lora_rank in config -> full MHA KV\n[FlashInfer] KV per token: 64*128*2=16384 (should be 512*2=1024 with MLA)\n[FlashInfer] 16x memory overhead",
|
| 1360 |
+
config="num_kv_heads: 64\nhead_dim: 128\nkv_lora_rank: (missing)\nqk_rope_head_dim: (missing)\nattention_type: inferred as MHA",
|
| 1361 |
+
snippet="# DeepSeek MLA config needs:\n# kv_lora_rank: 512\n# qk_rope_head_dim: 64\n# Without these, system allocates full MHA KV cache",
|
| 1362 |
+
metrics="kv_cache_per_token_bytes: 16384\nexpected_bytes: 1024\nmemory_overhead: 16x\noom_at_seq_len: 4096",
|
| 1363 |
+
),
|
| 1364 |
+
specialist_followups={
|
| 1365 |
+
"runtime": "No runtime issue. Memory allocation succeeds until OOM.",
|
| 1366 |
+
"dispatch": "Config drives the dispatch. Fix the config.",
|
| 1367 |
+
"kernel": "MLA kernel exists but won't activate without kv_lora_rank in config.",
|
| 1368 |
+
"loader": "Add kv_lora_rank=512 and qk_rope_head_dim=64 to config.json.",
|
| 1369 |
+
},
|
| 1370 |
+
))
|
| 1371 |
+
|
| 1372 |
+
scenarios.append(Scenario(
|
| 1373 |
+
id="model_config_04",
|
| 1374 |
+
root_cause="model_config",
|
| 1375 |
+
correct_fix="update_model_config",
|
| 1376 |
+
incident_ticket=(
|
| 1377 |
+
"INCIDENT: Llama-4 Maverick MoE model failing with 'Expected 128 experts'. "
|
| 1378 |
+
"Config lists num_local_experts=128 but actual checkpoint uses sparse layout "
|
| 1379 |
+
"with 16 active experts per token from 128 total, stored differently."
|
| 1380 |
+
),
|
| 1381 |
+
hardware="NVIDIA H100",
|
| 1382 |
+
model_name="Llama-4-Maverick-17Bx128E",
|
| 1383 |
+
backend="vLLM 0.8.x",
|
| 1384 |
+
initial_log=(
|
| 1385 |
+
"[vLLM] MoE init: 128 experts, 2 active per token\n"
|
| 1386 |
+
"[vLLM] Loading expert weights...\n"
|
| 1387 |
+
"[vLLM] WARNING: Expert weight tensor shape doesn't match config\n"
|
| 1388 |
+
"[vLLM] Expected: [128, hidden, ffn] Got: [128, ffn//4, hidden]"
|
| 1389 |
+
),
|
| 1390 |
+
initial_snippet=(
|
| 1391 |
+
"# config.json\n"
|
| 1392 |
+
'{\n'
|
| 1393 |
+
' "num_local_experts": 128,\n'
|
| 1394 |
+
' "num_experts_per_tok": 2,\n'
|
| 1395 |
+
' "expert_layout": "dense"\n'
|
| 1396 |
+
' // Should be "interleaved" for Maverick architecture\n'
|
| 1397 |
+
'}\n'
|
| 1398 |
+
),
|
| 1399 |
+
specialist_opinions={
|
| 1400 |
+
"runtime": SpecialistOpinion("Runtime OK.", 0.75, False),
|
| 1401 |
+
"dispatch": SpecialistOpinion("MoE dispatch looks correct for the config.", 0.60, False),
|
| 1402 |
+
"kernel": SpecialistOpinion(
|
| 1403 |
+
"Expert weight tensor shape is transposed vs config expectation. "
|
| 1404 |
+
"Config says dense layout but weights are in interleaved format.", 0.85, True
|
| 1405 |
+
),
|
| 1406 |
+
"loader": SpecialistOpinion(
|
| 1407 |
+
"Config expert_layout should be 'interleaved' not 'dense'. "
|
| 1408 |
+
"Maverick uses interleaved expert storage.", 0.93, True
|
| 1409 |
+
),
|
| 1410 |
+
},
|
| 1411 |
+
inspect_results=InspectResult(
|
| 1412 |
+
logs="[vLLM] Config: expert_layout=dense\n[vLLM] Actual weights: interleaved layout\n[vLLM] Shape mismatch in MoE layer 0",
|
| 1413 |
+
config="expert_layout: dense (wrong)\nactual_layout: interleaved\nnum_experts: 128\nexperts_per_token: 2",
|
| 1414 |
+
snippet="# Maverick checkpoint uses interleaved expert layout:\n# experts stored as [expert_idx, ffn_chunk, hidden]\n# Config says 'dense' which expects [expert_idx, hidden, ffn]\n# Fix: set expert_layout='interleaved'",
|
| 1415 |
+
metrics="model_load_progress: 5%\nshape_mismatches: 128\nerror_at: expert_layer_0",
|
| 1416 |
+
),
|
| 1417 |
+
specialist_followups={
|
| 1418 |
+
"runtime": "Not a runtime issue.",
|
| 1419 |
+
"dispatch": "Dispatch follows config. Fix the config first.",
|
| 1420 |
+
"kernel": "Weight shapes don't match the layout assumption. Config needs updating.",
|
| 1421 |
+
"loader": "Set expert_layout to 'interleaved' in config.json. Maverick stores experts interleaved.",
|
| 1422 |
+
},
|
| 1423 |
+
))
|
| 1424 |
+
|
| 1425 |
+
scenarios.append(Scenario(
|
| 1426 |
+
id="model_config_05",
|
| 1427 |
+
root_cause="model_config",
|
| 1428 |
+
correct_fix="update_model_config",
|
| 1429 |
+
incident_ticket=(
|
| 1430 |
+
"INCIDENT: Sliding window attention not activating for Mistral model. "
|
| 1431 |
+
"Memory usage growing linearly with sequence length. "
|
| 1432 |
+
"Should plateau after window size."
|
| 1433 |
+
),
|
| 1434 |
+
hardware="NVIDIA B200",
|
| 1435 |
+
model_name="Mistral-Large-2",
|
| 1436 |
+
backend="SGLang 0.5.x",
|
| 1437 |
+
initial_log=(
|
| 1438 |
+
"[SGLang] Attention config: full attention (no sliding window)\n"
|
| 1439 |
+
"[SGLang] KV cache growing linearly with seq_len\n"
|
| 1440 |
+
"[SGLang] Memory at 32k tokens: 40GB (expected: 12GB with sliding window)\n"
|
| 1441 |
+
"[SGLang] sliding_window not found in config.json"
|
| 1442 |
+
),
|
| 1443 |
+
initial_snippet=(
|
| 1444 |
+
"# config.json\n"
|
| 1445 |
+
'{\n'
|
| 1446 |
+
' "max_position_embeddings": 32768,\n'
|
| 1447 |
+
' "num_attention_heads": 96\n'
|
| 1448 |
+
' // Missing: "sliding_window": 4096\n'
|
| 1449 |
+
'}\n'
|
| 1450 |
+
),
|
| 1451 |
+
specialist_opinions={
|
| 1452 |
+
"runtime": SpecialistOpinion("Runtime fine. Memory growing as expected for full attention.", 0.78, False),
|
| 1453 |
+
"dispatch": SpecialistOpinion(
|
| 1454 |
+
"Backend correctly doing full attention because config doesn't specify sliding window.", 0.70, True
|
| 1455 |
+
),
|
| 1456 |
+
"kernel": SpecialistOpinion(
|
| 1457 |
+
"Kernel supports sliding window. Config just needs the parameter.", 0.82, True
|
| 1458 |
+
),
|
| 1459 |
+
"loader": SpecialistOpinion(
|
| 1460 |
+
"Config.json missing sliding_window=4096. Mistral models use 4096-token sliding window.", 0.92, True
|
| 1461 |
+
),
|
| 1462 |
+
},
|
| 1463 |
+
inspect_results=InspectResult(
|
| 1464 |
+
logs="[SGLang] No sliding_window in config -> full attention\n[SGLang] KV cache: 32k * 96 heads * 128 dim * 2 = 40GB",
|
| 1465 |
+
config="sliding_window: null\nmax_position_embeddings: 32768\nexpected_sliding_window: 4096",
|
| 1466 |
+
snippet="# Mistral-Large-2 uses 4096-token sliding window\n# Config missing: sliding_window: 4096\n# Without it, full O(n) KV cache used",
|
| 1467 |
+
metrics="kv_cache_32k_gb: 40\nexpected_kv_cache_gb: 12\nmemory_overhead: 3.3x",
|
| 1468 |
+
),
|
| 1469 |
+
specialist_followups={
|
| 1470 |
+
"runtime": "Memory growth is correct for the config given. Fix the config.",
|
| 1471 |
+
"dispatch": "Backend reads config. Add sliding_window=4096.",
|
| 1472 |
+
"kernel": "Sliding window attention kernel exists. Just needs the config parameter to activate.",
|
| 1473 |
+
"loader": "Add sliding_window: 4096 to config.json.",
|
| 1474 |
+
},
|
| 1475 |
+
))
|
| 1476 |
+
|
| 1477 |
+
# --- weight_layout additional scenarios ---
|
| 1478 |
+
scenarios.append(Scenario(
|
| 1479 |
+
id="weight_layout_03",
|
| 1480 |
+
root_cause="weight_layout",
|
| 1481 |
+
correct_fix="fix_weight_mapping",
|
| 1482 |
+
incident_ticket=(
|
| 1483 |
+
"INCIDENT: Model outputs garbage after quantization with GPTQ. "
|
| 1484 |
+
"Original FP16 model is fine. GPTQ quantization reports success "
|
| 1485 |
+
"but group indices are misaligned."
|
| 1486 |
+
),
|
| 1487 |
+
hardware="NVIDIA H100",
|
| 1488 |
+
model_name="Qwen3-235B-A22B",
|
| 1489 |
+
backend="vLLM 0.8.x",
|
| 1490 |
+
initial_log=(
|
| 1491 |
+
"[vLLM] Loading GPTQ-quantized Qwen3...\n"
|
| 1492 |
+
"[vLLM] Quantization: 4-bit, group_size=128\n"
|
| 1493 |
+
"[vLLM] WARNING: g_idx tensor shape mismatch in layer 0\n"
|
| 1494 |
+
"[vLLM] Output: incoherent (perplexity 1247)"
|
| 1495 |
+
),
|
| 1496 |
+
initial_snippet=(
|
| 1497 |
+
"# GPTQ packing\n"
|
| 1498 |
+
"# g_idx maps each weight column to its quantization group\n"
|
| 1499 |
+
"# Expected shape: [in_features]\n"
|
| 1500 |
+
"# Got shape: [in_features // group_size] (wrong!)\n"
|
| 1501 |
+
),
|
| 1502 |
+
specialist_opinions={
|
| 1503 |
+
"runtime": SpecialistOpinion("CUDA fine. Kernels launch.", 0.78, False),
|
| 1504 |
+
"dispatch": SpecialistOpinion("GPTQ backend selected correctly.", 0.65, False),
|
| 1505 |
+
"kernel": SpecialistOpinion(
|
| 1506 |
+
"Dequantization kernel gets wrong group assignments because g_idx is wrong shape.", 0.82, True
|
| 1507 |
+
),
|
| 1508 |
+
"loader": SpecialistOpinion(
|
| 1509 |
+
"GPTQ group index (g_idx) tensor has wrong shape. The quantization script packed it incorrectly. "
|
| 1510 |
+
"Needs regeneration with correct per-column group mapping.", 0.94, True
|
| 1511 |
+
),
|
| 1512 |
+
},
|
| 1513 |
+
inspect_results=InspectResult(
|
| 1514 |
+
logs="[GPTQ] g_idx shape: [128] (wrong) vs expected [16384]\n[GPTQ] Each column needs its own group index\n[GPTQ] Wrong g_idx causes random dequant scale selection",
|
| 1515 |
+
config="group_size: 128\nin_features: 16384\ng_idx_shape: [128]\nexpected_g_idx_shape: [16384]",
|
| 1516 |
+
snippet="# g_idx should be per-column: shape [in_features]\n# But quantizer produced per-group: shape [in_features//group_size]\n# This assigns wrong scales during dequantization",
|
| 1517 |
+
metrics="perplexity: 1247\nexpected_perplexity: 10.2\nlayers_affected: all\ng_idx_misaligned: true",
|
| 1518 |
+
),
|
| 1519 |
+
specialist_followups={
|
| 1520 |
+
"runtime": "No runtime issues.",
|
| 1521 |
+
"dispatch": "Backend selection is fine.",
|
| 1522 |
+
"kernel": "Kernel dequantizes correctly when given right g_idx. Fix the mapping.",
|
| 1523 |
+
"loader": "Regenerate g_idx with per-column mapping (shape [in_features], not [in_features//group_size]).",
|
| 1524 |
+
},
|
| 1525 |
+
))
|
| 1526 |
+
|
| 1527 |
+
scenarios.append(Scenario(
|
| 1528 |
+
id="weight_layout_04",
|
| 1529 |
+
root_cause="weight_layout",
|
| 1530 |
+
correct_fix="fix_weight_mapping",
|
| 1531 |
+
incident_ticket=(
|
| 1532 |
+
"INCIDENT: FP8 model on MI300X gives NaN after first layer. "
|
| 1533 |
+
"Dequantization scales appear transposed. "
|
| 1534 |
+
"Same checkpoint works on NVIDIA with e4m3fn format."
|
| 1535 |
+
),
|
| 1536 |
+
hardware="AMD MI300X",
|
| 1537 |
+
model_name="DeepSeek-R1-Distill-70B",
|
| 1538 |
+
backend="vLLM 0.8.x",
|
| 1539 |
+
initial_log=(
|
| 1540 |
+
"[vLLM] FP8 dequant: loading scales...\n"
|
| 1541 |
+
"[vLLM] Scale tensor shape: [out_features, 1] — expected [1, out_features] for AMD\n"
|
| 1542 |
+
"[vLLM] Layer 0 output: NaN (scale applied to wrong dimension)\n"
|
| 1543 |
+
"[vLLM] All subsequent layers: NaN"
|
| 1544 |
+
),
|
| 1545 |
+
initial_snippet=(
|
| 1546 |
+
"# fp8_weights.py\n"
|
| 1547 |
+
"# NVIDIA: scales are per-output-channel [out, 1]\n"
|
| 1548 |
+
"# AMD: scales are per-input-channel [1, in]\n"
|
| 1549 |
+
"# Converter didn't transpose for AMD\n"
|
| 1550 |
+
),
|
| 1551 |
+
specialist_opinions={
|
| 1552 |
+
"runtime": SpecialistOpinion("ROCm runtime fine.", 0.78, False),
|
| 1553 |
+
"dispatch": SpecialistOpinion("FP8 backend selected. Format mismatch possible.", 0.65, False),
|
| 1554 |
+
"kernel": SpecialistOpinion(
|
| 1555 |
+
"FP8 GEMM applies scale in wrong dimension due to transposed scale tensor.", 0.85, True
|
| 1556 |
+
),
|
| 1557 |
+
"loader": SpecialistOpinion(
|
| 1558 |
+
"FP8 scale tensors need transposing for AMD. NVIDIA uses [out,1], AMD uses [1,in]. "
|
| 1559 |
+
"Weight converter didn't handle this.", 0.95, True
|
| 1560 |
+
),
|
| 1561 |
+
},
|
| 1562 |
+
inspect_results=InspectResult(
|
| 1563 |
+
logs="[FP8] Scale shape [4096,1] but AMD MFMA expects [1,4096]\n[FP8] Dequant: scale broadcast on wrong axis -> NaN\n[FP8] First non-NaN result never produced",
|
| 1564 |
+
config="fp8_scale_shape: [out_features, 1]\namd_expected: [1, in_features]\nscale_transpose_needed: true",
|
| 1565 |
+
snippet="# NVIDIA layout: W_fp8 * scale[out,1] -> per-output-channel\n# AMD layout: W_fp8 * scale[1,in] -> per-input-channel\n# Converter assumed NVIDIA layout\n# Fix: transpose scales for AMD",
|
| 1566 |
+
metrics="nan_outputs: 100%\nlayers_producing_nan: all\nfirst_nan_at: layer_0",
|
| 1567 |
+
),
|
| 1568 |
+
specialist_followups={
|
| 1569 |
+
"runtime": "Not a runtime issue.",
|
| 1570 |
+
"dispatch": "FP8 selected correctly. Scale orientation is the issue.",
|
| 1571 |
+
"kernel": "GEMM kernel applies scale along wrong dimension. Transpose the scales.",
|
| 1572 |
+
"loader": "Transpose FP8 scale tensors from [out,1] to [1,in] for AMD.",
|
| 1573 |
+
},
|
| 1574 |
+
))
|
| 1575 |
+
|
| 1576 |
+
scenarios.append(Scenario(
|
| 1577 |
+
id="weight_layout_05",
|
| 1578 |
+
root_cause="weight_layout",
|
| 1579 |
+
correct_fix="fix_weight_mapping",
|
| 1580 |
+
incident_ticket=(
|
| 1581 |
+
"INCIDENT: Embedding layer produces identical vectors for all tokens. "
|
| 1582 |
+
"After checkpoint conversion, embedding weights appear row-shuffled. "
|
| 1583 |
+
"Tokenizer maps to wrong rows."
|
| 1584 |
+
),
|
| 1585 |
+
hardware="NVIDIA SM121 (DGX Spark)",
|
| 1586 |
+
model_name="Llama-4-Maverick-17Bx128E",
|
| 1587 |
+
backend="SGLang 0.5.x",
|
| 1588 |
+
initial_log=(
|
| 1589 |
+
"[SGLang] Embedding layer: 128256 tokens x 4096 dim\n"
|
| 1590 |
+
"[SGLang] Token 'Hello' -> embedding row 85432 (expected: row 9906)\n"
|
| 1591 |
+
"[SGLang] All outputs identical — embeddings mapped to wrong rows\n"
|
| 1592 |
+
"[SGLang] Suspect: tokenizer vocab offset not applied during conversion"
|
| 1593 |
+
),
|
| 1594 |
+
initial_snippet=(
|
| 1595 |
+
"# convert_checkpoint.py\n"
|
| 1596 |
+
"embed = original_weights['embed_tokens.weight'] # [128256, 4096]\n"
|
| 1597 |
+
"# BUG: added_tokens offset not applied\n"
|
| 1598 |
+
"# Tokenizer expects base_vocab at rows 0-127999\n"
|
| 1599 |
+
"# Converter put added_tokens at rows 0-255\n"
|
| 1600 |
+
),
|
| 1601 |
+
specialist_opinions={
|
| 1602 |
+
"runtime": SpecialistOpinion("Runtime fine. Model loads.", 0.75, False),
|
| 1603 |
+
"dispatch": SpecialistOpinion("Backend dispatch correct.", 0.68, False),
|
| 1604 |
+
"kernel": SpecialistOpinion(
|
| 1605 |
+
"Embedding lookup works mechanically but returns wrong vectors. Data issue.", 0.78, True
|
| 1606 |
+
),
|
| 1607 |
+
"loader": SpecialistOpinion(
|
| 1608 |
+
"Embedding weight rows are misaligned after conversion. Tokenizer indices map to wrong rows. "
|
| 1609 |
+
"Converter needs to preserve original row ordering.", 0.94, True
|
| 1610 |
+
),
|
| 1611 |
+
},
|
| 1612 |
+
inspect_results=InspectResult(
|
| 1613 |
+
logs="[SGLang] Token 'Hello' (id=9906) -> embedding from original row 85432\n[SGLang] Row mapping offset: 75526\n[SGLang] Converter applied wrong row permutation",
|
| 1614 |
+
config="vocab_size: 128256\nembed_dim: 4096\nrow_offset_error: 75526",
|
| 1615 |
+
snippet="# Converter reordered rows: put added_tokens (256) first, then base vocab\n# Tokenizer expects base vocab at row 0\n# Fix: preserve original row order in embedding conversion",
|
| 1616 |
+
metrics="embedding_cosine_sim_to_expected: 0.02\nall_outputs_identical: true\nperplexity: infinity",
|
| 1617 |
+
),
|
| 1618 |
+
specialist_followups={
|
| 1619 |
+
"runtime": "No runtime issue.",
|
| 1620 |
+
"dispatch": "Dispatch is correct.",
|
| 1621 |
+
"kernel": "Embedding lookup returns whatever is at the indexed row. The rows are just wrong.",
|
| 1622 |
+
"loader": "Converter put added_tokens at index 0. Fix: keep original row order.",
|
| 1623 |
+
},
|
| 1624 |
+
))
|
| 1625 |
+
|
| 1626 |
+
# --- Additional eval scenarios (_06 suffix) ---
|
| 1627 |
+
scenarios.append(Scenario(
|
| 1628 |
+
id="arch_guard_06",
|
| 1629 |
+
root_cause="arch_guard",
|
| 1630 |
+
correct_fix="relax_arch_check",
|
| 1631 |
+
incident_ticket=(
|
| 1632 |
+
"INCIDENT: CUTLASS GEMM kernel rejects SM121 with 'unsupported architecture'. "
|
| 1633 |
+
"is_family_of() check fails because SM121 not in family table. "
|
| 1634 |
+
"FP8 inference completely blocked."
|
| 1635 |
+
),
|
| 1636 |
+
hardware="NVIDIA SM121 (DGX Spark)",
|
| 1637 |
+
model_name="Mistral-Large-2",
|
| 1638 |
+
backend="TensorRT-LLM 0.18",
|
| 1639 |
+
initial_log=(
|
| 1640 |
+
"[CUTLASS] is_family_of(sm_121, sm_90) = false\n"
|
| 1641 |
+
"[CUTLASS] SM121 not registered in family hierarchy\n"
|
| 1642 |
+
"[CUTLASS] FP8 GEMM dispatch: BLOCKED"
|
| 1643 |
+
),
|
| 1644 |
+
initial_snippet=(
|
| 1645 |
+
"# cutlass/arch/family.py\n"
|
| 1646 |
+
"FAMILY_MAP = {90: [90], 89: [89], 86: [86], 80: [80]}\n"
|
| 1647 |
+
"# SM121 not in any family\n"
|
| 1648 |
+
),
|
| 1649 |
+
specialist_opinions={
|
| 1650 |
+
"runtime": SpecialistOpinion("CUDA 13 fine.", 0.78, False),
|
| 1651 |
+
"dispatch": SpecialistOpinion(
|
| 1652 |
+
"CUTLASS family map doesn't include SM12x. Need to register SM120/121 family.", 0.93, True
|
| 1653 |
+
),
|
| 1654 |
+
"kernel": SpecialistOpinion(
|
| 1655 |
+
"The kernel weight format might be wrong for SM121.", 0.40, False
|
| 1656 |
+
),
|
| 1657 |
+
"loader": SpecialistOpinion("Engine built. Weights loaded. GEMM dispatch blocked.", 0.70, False),
|
| 1658 |
+
},
|
| 1659 |
+
inspect_results=InspectResult(
|
| 1660 |
+
logs="[CUTLASS] FAMILY_MAP has no entry for 121\n[CUTLASS] is_family_of(121, 90) -> False\n[CUTLASS] FP8 GEMM requires family >= 90",
|
| 1661 |
+
config="gpu_sm: 121\nfamily_map: {90:[90],89:[89],...}\nsm121_family: undefined",
|
| 1662 |
+
snippet="# SM12x is its own family but shares FP8 MMA with SM90\n# Fix: add 120: [120, 121] and 121: [120, 121] to FAMILY_MAP\n# Or: register SM12x as SM90-compatible for GEMM",
|
| 1663 |
+
metrics="fp8_gemm_blocked: true\nbf16_gemm: functional",
|
| 1664 |
+
),
|
| 1665 |
+
specialist_followups={
|
| 1666 |
+
"runtime": "Runtime fine.",
|
| 1667 |
+
"dispatch": "Register SM12x family in CUTLASS. SM121 FP8 MMA is SM90-compatible.",
|
| 1668 |
+
"kernel": "Weight format is fine. It's the arch family check blocking dispatch.",
|
| 1669 |
+
"loader": "Weights loaded correctly. GEMM dispatch is the issue.",
|
| 1670 |
+
},
|
| 1671 |
+
))
|
| 1672 |
+
|
| 1673 |
+
scenarios.append(Scenario(
|
| 1674 |
+
id="backend_selector_06",
|
| 1675 |
+
root_cause="backend_selector",
|
| 1676 |
+
correct_fix="switch_backend",
|
| 1677 |
+
incident_ticket=(
|
| 1678 |
+
"INCIDENT: DGX Spark running PagedAttention v1 instead of v2. "
|
| 1679 |
+
"Prefix caching not working. Cache hit rate near 0%. "
|
| 1680 |
+
"Same prompts re-computed every request."
|
| 1681 |
+
),
|
| 1682 |
+
hardware="NVIDIA SM121 (DGX Spark)",
|
| 1683 |
+
model_name="DeepSeek-V3-671B",
|
| 1684 |
+
backend="vLLM 0.8.x",
|
| 1685 |
+
initial_log=(
|
| 1686 |
+
"[vLLM] PagedAttention version: v1\n"
|
| 1687 |
+
"[vLLM] Prefix caching: disabled (requires PA v2)\n"
|
| 1688 |
+
"[vLLM] Cache hit rate: 0.1% (expected: 60%+ with repeated prefixes)\n"
|
| 1689 |
+
"[vLLM] TTFT p99: 2100ms (expected: 400ms with caching)"
|
| 1690 |
+
),
|
| 1691 |
+
initial_snippet=(
|
| 1692 |
+
"# vllm/core/scheduler.py\n"
|
| 1693 |
+
"def select_paged_attention(gpu_sm):\n"
|
| 1694 |
+
" if gpu_sm >= 80 and gpu_sm <= 90:\n"
|
| 1695 |
+
" return 'v2' # with prefix caching\n"
|
| 1696 |
+
" return 'v1' # SM121 > 90, falls here\n"
|
| 1697 |
+
),
|
| 1698 |
+
specialist_opinions={
|
| 1699 |
+
"runtime": SpecialistOpinion("CUDA runtime fine. Server runs.", 0.75, False),
|
| 1700 |
+
"dispatch": SpecialistOpinion(
|
| 1701 |
+
"PagedAttention version selector has range bug. SM121 > 90 so gets v1 without prefix caching.", 0.94, True
|
| 1702 |
+
),
|
| 1703 |
+
"kernel": SpecialistOpinion(
|
| 1704 |
+
"PA v2 kernel works on SM121. It's the selector that's wrong.", 0.85, True
|
| 1705 |
+
),
|
| 1706 |
+
"loader": SpecialistOpinion("Model loaded fine. Not a weight issue.", 0.72, False),
|
| 1707 |
+
},
|
| 1708 |
+
inspect_results=InspectResult(
|
| 1709 |
+
logs="[vLLM] sm_121 not in range [80,90] -> PA v1\n[vLLM] PA v1 doesn't support prefix caching\n[vLLM] Every prefix re-computed from scratch",
|
| 1710 |
+
config="paged_attention: v1\nprefix_caching: disabled\ngpu_sm: 121\nv2_range: [80, 90]",
|
| 1711 |
+
snippet="# PA v2 supports prefix caching, reducing TTFT 3-5x\n# Selector range [80,90] excludes SM121\n# Fix: include SM12x in v2-eligible set",
|
| 1712 |
+
metrics="cache_hit_rate: 0.1%\nexpected_cache_hit_rate: 62%\nttft_p99_ms: 2100\nexpected_ttft_ms: 400",
|
| 1713 |
+
),
|
| 1714 |
+
specialist_followups={
|
| 1715 |
+
"runtime": "Server runs fine. Performance issue only.",
|
| 1716 |
+
"dispatch": "Fix the range check to include SM12x. PA v2 works on SM121.",
|
| 1717 |
+
"kernel": "PA v2 kernel is compatible. Just need the selector to pick it.",
|
| 1718 |
+
"loader": "Not a loading issue.",
|
| 1719 |
+
},
|
| 1720 |
+
))
|
| 1721 |
+
|
| 1722 |
+
scenarios.append(Scenario(
|
| 1723 |
+
id="runtime_loader_06",
|
| 1724 |
+
root_cause="runtime_loader",
|
| 1725 |
+
correct_fix="fix_runtime_path",
|
| 1726 |
+
incident_ticket=(
|
| 1727 |
+
"INCIDENT: Container on B200 node fails with 'CUDA driver version insufficient'. "
|
| 1728 |
+
"Host has driver 565 but container sees driver 535. "
|
| 1729 |
+
"nvidia-smi inside container shows old driver."
|
| 1730 |
+
),
|
| 1731 |
+
hardware="NVIDIA B200",
|
| 1732 |
+
model_name="Llama-3.3-70B-Instruct",
|
| 1733 |
+
backend="vLLM 0.8.x",
|
| 1734 |
+
initial_log=(
|
| 1735 |
+
"[Container] nvidia-smi: Driver Version: 535.183.01\n"
|
| 1736 |
+
"[Host] nvidia-smi: Driver Version: 565.57.01\n"
|
| 1737 |
+
"[vLLM] CUDA 13 requires driver >= 560\n"
|
| 1738 |
+
"[vLLM] ERROR: CUDA driver version insufficient for CUDA runtime"
|
| 1739 |
+
),
|
| 1740 |
+
initial_snippet=(
|
| 1741 |
+
"# Docker run command\n"
|
| 1742 |
+
"docker run --gpus all \\\n"
|
| 1743 |
+
" -e NVIDIA_DRIVER_CAPABILITIES=compute,utility \\\n"
|
| 1744 |
+
" -e NVIDIA_VISIBLE_DEVICES=all \\\n"
|
| 1745 |
+
" # Missing: --runtime=nvidia or proper CDI config\n"
|
| 1746 |
+
),
|
| 1747 |
+
specialist_opinions={
|
| 1748 |
+
"runtime": SpecialistOpinion(
|
| 1749 |
+
"Container seeing old driver. Docker GPU passthrough not configured correctly. "
|
| 1750 |
+
"Need proper nvidia-container-runtime setup.", 0.94, True
|
| 1751 |
+
),
|
| 1752 |
+
"dispatch": SpecialistOpinion("Server never starts. Can't assess dispatch.", 0.50, False),
|
| 1753 |
+
"kernel": SpecialistOpinion(
|
| 1754 |
+
"Maybe the B200 needs a newer CUDA toolkit version.", 0.45, False
|
| 1755 |
+
),
|
| 1756 |
+
"loader": SpecialistOpinion(
|
| 1757 |
+
"Container's nvidia driver libs are stale. Bind mount is pointing to wrong driver version.", 0.88, True
|
| 1758 |
+
),
|
| 1759 |
+
},
|
| 1760 |
+
inspect_results=InspectResult(
|
| 1761 |
+
logs="[Container] /usr/lib/x86_64-linux-gnu/libnvidia-ml.so -> driver 535\n[Host] /usr/lib/x86_64-linux-gnu/libnvidia-ml.so -> driver 565\n[Docker] nvidia-container-runtime not in daemon.json",
|
| 1762 |
+
config="host_driver: 565.57.01\ncontainer_driver: 535.183.01\nnvidia_runtime: not_configured",
|
| 1763 |
+
snippet="# Docker daemon.json missing nvidia runtime\n# Container bundles old driver libs instead of using host driver\n# Fix: configure nvidia-container-runtime or CDI",
|
| 1764 |
+
metrics="container_start_failures: 1\ndriver_mismatch: true\ncuda_init: failed",
|
| 1765 |
+
),
|
| 1766 |
+
specialist_followups={
|
| 1767 |
+
"runtime": "nvidia-container-toolkit needs to be configured to pass host driver into container.",
|
| 1768 |
+
"dispatch": "Can't run without CUDA init.",
|
| 1769 |
+
"kernel": "The toolkit version is fine. It's the driver passthrough that's broken.",
|
| 1770 |
+
"loader": "Container needs host's driver libs mounted. Fix Docker runtime config.",
|
| 1771 |
+
},
|
| 1772 |
+
))
|
| 1773 |
+
|
| 1774 |
+
scenarios.append(Scenario(
|
| 1775 |
+
id="model_config_06",
|
| 1776 |
+
root_cause="model_config",
|
| 1777 |
+
correct_fix="update_model_config",
|
| 1778 |
+
incident_ticket=(
|
| 1779 |
+
"INCIDENT: BF16 model serving on MI300X has 2x expected memory usage. "
|
| 1780 |
+
"Config says float16 dtype but model should use bfloat16. "
|
| 1781 |
+
"Unnecessary fp16->bf16 conversion happening at runtime."
|
| 1782 |
+
),
|
| 1783 |
+
hardware="AMD MI300X",
|
| 1784 |
+
model_name="DeepSeek-R1-Distill-70B",
|
| 1785 |
+
backend="vLLM 0.8.x",
|
| 1786 |
+
initial_log=(
|
| 1787 |
+
"[vLLM] Config dtype: float16\n"
|
| 1788 |
+
"[vLLM] Actual weights: bfloat16\n"
|
| 1789 |
+
"[vLLM] Runtime conversion float16 config -> bfloat16 weights\n"
|
| 1790 |
+
"[vLLM] Extra memory for conversion buffers: 35GB"
|
| 1791 |
+
),
|
| 1792 |
+
initial_snippet=(
|
| 1793 |
+
"# config.json\n"
|
| 1794 |
+
'{\n'
|
| 1795 |
+
' "torch_dtype": "float16"\n'
|
| 1796 |
+
' // Actual checkpoint is bfloat16\n'
|
| 1797 |
+
' // Mismatch causes runtime conversion overhead\n'
|
| 1798 |
+
'}\n'
|
| 1799 |
+
),
|
| 1800 |
+
specialist_opinions={
|
| 1801 |
+
"runtime": SpecialistOpinion("ROCm runtime healthy. Memory available.", 0.78, False),
|
| 1802 |
+
"dispatch": SpecialistOpinion("Backend dispatch fine.", 0.65, False),
|
| 1803 |
+
"kernel": SpecialistOpinion(
|
| 1804 |
+
"Kernels running with dtype conversion overhead. "
|
| 1805 |
+
"Config says fp16 but weights are bf16, so vLLM converts at load time.", 0.82, True
|
| 1806 |
+
),
|
| 1807 |
+
"loader": SpecialistOpinion(
|
| 1808 |
+
"Config torch_dtype=float16 doesn't match checkpoint dtype=bfloat16. "
|
| 1809 |
+
"Fix config to say bfloat16 to avoid conversion overhead.", 0.93, True
|
| 1810 |
+
),
|
| 1811 |
+
},
|
| 1812 |
+
inspect_results=InspectResult(
|
| 1813 |
+
logs="[vLLM] Config: float16, Checkpoint: bfloat16\n[vLLM] Allocating conversion buffers: 35GB\n[vLLM] Total memory: model(35GB) + conversion(35GB) = 70GB",
|
| 1814 |
+
config="torch_dtype: float16\ncheckpoint_dtype: bfloat16\nmismatch: true",
|
| 1815 |
+
snippet="# Config says float16 but checkpoint is bfloat16\n# vLLM allocates both versions during conversion\n# Fix: set torch_dtype='bfloat16' in config.json",
|
| 1816 |
+
metrics="memory_used_gb: 70\nexpected_memory_gb: 35\nconversion_overhead_gb: 35",
|
| 1817 |
+
),
|
| 1818 |
+
specialist_followups={
|
| 1819 |
+
"runtime": "Memory subsystem fine. Just using too much.",
|
| 1820 |
+
"dispatch": "Dispatch fine after conversion.",
|
| 1821 |
+
"kernel": "Conversion overhead is the issue. Fix config to match checkpoint dtype.",
|
| 1822 |
+
"loader": "Set torch_dtype to bfloat16 in config.json.",
|
| 1823 |
+
},
|
| 1824 |
+
))
|
| 1825 |
+
|
| 1826 |
+
scenarios.append(Scenario(
|
| 1827 |
+
id="weight_layout_06",
|
| 1828 |
+
root_cause="weight_layout",
|
| 1829 |
+
correct_fix="fix_weight_mapping",
|
| 1830 |
+
incident_ticket=(
|
| 1831 |
+
"INCIDENT: Rotary position encoding giving wrong angles after checkpoint merge. "
|
| 1832 |
+
"Two LoRA adapters merged into base model, but RoPE inv_freq tensor "
|
| 1833 |
+
"accidentally overwritten with adapter values. Outputs degrade past position 128."
|
| 1834 |
+
),
|
| 1835 |
+
hardware="NVIDIA H100",
|
| 1836 |
+
model_name="Mistral-Large-2",
|
| 1837 |
+
backend="vLLM 0.8.x",
|
| 1838 |
+
initial_log=(
|
| 1839 |
+
"[vLLM] Loading merged checkpoint...\n"
|
| 1840 |
+
"[vLLM] RoPE inv_freq shape: [64] (correct)\n"
|
| 1841 |
+
"[vLLM] RoPE inv_freq values: [0.001, 0.001, ...] (all same — WRONG)\n"
|
| 1842 |
+
"[vLLM] Expected: geometric sequence 1/10000^(2i/d)"
|
| 1843 |
+
),
|
| 1844 |
+
initial_snippet=(
|
| 1845 |
+
"# merge_lora.py\n"
|
| 1846 |
+
"# BUG: LoRA merge accidentally overwrote inv_freq\n"
|
| 1847 |
+
"merged['inv_freq'] = adapter_state['inv_freq'] # adapter had dummy values\n"
|
| 1848 |
+
"# Should have kept base model's inv_freq\n"
|
| 1849 |
+
),
|
| 1850 |
+
specialist_opinions={
|
| 1851 |
+
"runtime": SpecialistOpinion("Runtime fine.", 0.78, False),
|
| 1852 |
+
"dispatch": SpecialistOpinion("Backend dispatch correct.", 0.65, False),
|
| 1853 |
+
"kernel": SpecialistOpinion(
|
| 1854 |
+
"RoPE kernel computes correct rotations for the freq values given. But freq values are wrong.", 0.80, True
|
| 1855 |
+
),
|
| 1856 |
+
"loader": SpecialistOpinion(
|
| 1857 |
+
"LoRA merge script overwrote inv_freq with adapter's dummy values. "
|
| 1858 |
+
"Need to restore base model's inv_freq or regenerate from formula.", 0.95, True
|
| 1859 |
+
),
|
| 1860 |
+
},
|
| 1861 |
+
inspect_results=InspectResult(
|
| 1862 |
+
logs="[RoPE] inv_freq: all values = 0.001 (constant)\n[RoPE] Expected: geometric decay from 1.0 to 1e-4\n[RoPE] Position encoding essentially constant -> no position info after ~128 tokens",
|
| 1863 |
+
config="inv_freq_values: [0.001]*64\nexpected: geometric_series(1/10000, dim=128)\nrope_theta: 10000",
|
| 1864 |
+
snippet="# inv_freq should be: 1 / (theta ** (torch.arange(0, dim, 2) / dim))\n# Instead: all 0.001 from LoRA adapter dummy init\n# Fix: regenerate inv_freq from formula or restore from base model",
|
| 1865 |
+
metrics="quality_0_128: 90%\nquality_128_1k: 25%\nquality_1k_plus: 5%",
|
| 1866 |
+
),
|
| 1867 |
+
specialist_followups={
|
| 1868 |
+
"runtime": "No runtime issue.",
|
| 1869 |
+
"dispatch": "Dispatch correct.",
|
| 1870 |
+
"kernel": "RoPE kernel works. Just getting wrong frequencies.",
|
| 1871 |
+
"loader": "Restore inv_freq from base model. LoRA merge script has a bug that overwrites non-LoRA tensors.",
|
| 1872 |
+
},
|
| 1873 |
+
))
|
| 1874 |
+
|
| 1875 |
+
return scenarios
|
| 1876 |
+
|
| 1877 |
+
|
| 1878 |
+
# Build the full scenario pool once at import time.
SCENARIOS = _make_scenarios()
# Deterministic train/eval split keyed on each scenario's ID suffix:
# _01, _03, _04, _05 = train; _02, _06 = eval
TRAIN_SCENARIOS = [s for s in SCENARIOS if s.id.endswith(("_01", "_03", "_04", "_05"))]
EVAL_SCENARIOS = [s for s in SCENARIOS if s.id.endswith(("_02", "_06"))]
|
| 1883 |
+
|
| 1884 |
+
|
| 1885 |
+
def get_scenario(scenario_id: str | None = None, split: str = "train") -> Scenario:
    """Look up a scenario by ID, or sample one at random from a split.

    When *scenario_id* is given, it must match an existing scenario or
    ValueError is raised.  Otherwise a scenario is drawn uniformly from
    the train pool when split == "train", else from the eval pool.
    """
    if scenario_id:
        found = next((sc for sc in SCENARIOS if sc.id == scenario_id), None)
        if found is None:
            raise ValueError(f"Unknown scenario: {scenario_id}")
        return found
    if split == "train":
        return random.choice(TRAIN_SCENARIOS)
    return random.choice(EVAL_SCENARIOS)
|
server/stack_doctor_environment.py
ADDED
|
@@ -0,0 +1,269 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
"""
|
| 2 |
+
Stack Doctor Environment.
|
| 3 |
+
|
| 4 |
+
An overseer LLM diagnoses sick inference stacks by probing subsystems,
|
| 5 |
+
reconciling conflicting specialist-agent reports, and selecting the
|
| 6 |
+
minimal correct fix.
|
| 7 |
+
|
| 8 |
+
Inspired by real SM12x enablement bugs across vLLM, FlashInfer, SGLang,
|
| 9 |
+
CUTLASS, and Flash-Attention.
|
| 10 |
+
"""
|
| 11 |
+
|
| 12 |
+
from __future__ import annotations
|
| 13 |
+
|
| 14 |
+
import json
|
| 15 |
+
from uuid import uuid4
|
| 16 |
+
|
| 17 |
+
from openenv.core.env_server.interfaces import Environment
|
| 18 |
+
from openenv.core.env_server.types import State
|
| 19 |
+
|
| 20 |
+
from models import StackDoctorAction, StackDoctorObservation
|
| 21 |
+
from .scenarios import (
|
| 22 |
+
ROOT_CAUSE_TO_FIX,
|
| 23 |
+
FIX_TO_ROOT_CAUSE,
|
| 24 |
+
ROOT_CAUSES,
|
| 25 |
+
FIXES,
|
| 26 |
+
SPECIALISTS,
|
| 27 |
+
Scenario,
|
| 28 |
+
get_scenario,
|
| 29 |
+
)
|
| 30 |
+
|
| 31 |
+
# Hard cap on agent actions per episode; every inspect/ask/fix/submit counts.
MAX_STEPS = 6

# Valid arguments for the "inspect" action.
INSPECT_TARGETS = {"logs", "config", "snippet", "metrics"}
# Set copies of the canonical vocabularies for O(1) membership checks.
VALID_FIXES = set(FIXES)
VALID_ROOT_CAUSES = set(ROOT_CAUSES)
|
| 36 |
+
|
| 37 |
+
|
| 38 |
+
class EpisodeState:
    """Server-side mutable bookkeeping for one episode.

    Holds the active scenario plus everything needed for reward
    accounting and auditing; none of this is sent to the agent directly.
    """

    def __init__(self, scenario: Scenario):
        # Incident being diagnosed this episode.
        self.scenario = scenario
        # Actions taken so far.
        self.step_count = 0
        # True once the episode has terminated.
        self.done = False
        # apply_fix may be used at most once per episode.
        self.fix_applied = False
        # None until a fix is applied, then whether it was the right one.
        self.fix_was_correct: bool | None = None
        # Running total of all per-step rewards.
        self.cumulative_reward = 0.0
        # Audit trail of every parsed action.
        self.actions_taken: list[dict] = []
|
| 49 |
+
|
| 50 |
+
|
| 51 |
+
class StackDoctorEnvironment(Environment):
    """
    Stack Doctor: incident-response RL environment for
    inference-stack diagnosis.

    Protocol: the agent sends JSON action strings ("inspect",
    "ask_specialist", "apply_fix", "submit").  Information-gathering
    actions cost -0.25 each; invalid actions cost -2.0; apply_fix is
    +3.0/-2.0; submit scores root cause and fix at +8.0/-4.0 each, with
    +2.0 for solving in <= 4 steps and +1.0 for a justification.
    Episodes end on submit or after MAX_STEPS actions.
    """

    SUPPORTS_CONCURRENT_SESSIONS: bool = True

    def __init__(self):
        # Public OpenEnv State (episode id + step counter).
        self._state = State(episode_id=str(uuid4()), step_count=0)
        # Per-episode bookkeeping; None until reset() is called.
        self._episode: EpisodeState | None = None

    def reset(self, seed=None, episode_id=None, **kwargs) -> StackDoctorObservation:
        """Start a new episode and return the initial incident observation.

        kwargs may carry "scenario_id" (exact scenario) and "split"
        ("train"/"eval") which are forwarded to get_scenario().
        """
        scenario_id = kwargs.get("scenario_id")
        split = kwargs.get("split", "train")
        scenario = get_scenario(scenario_id, split=split)

        self._state = State(
            episode_id=episode_id or str(uuid4()),
            step_count=0,
        )
        self._episode = EpisodeState(scenario)

        # Expose each specialist's initial opinion + confidence (but not
        # the hidden is-wrong flag) to the agent.
        specialist_obs = {}
        for name, op in scenario.specialist_opinions.items():
            specialist_obs[name] = {
                "opinion": op.opinion,
                "confidence": op.confidence,
            }

        return StackDoctorObservation(
            output=(
                "STACK DOCTOR — New incident assigned.\n"
                "Diagnose the root cause, optionally apply a fix, then submit your diagnosis.\n"
                "You have 6 steps. Use them wisely.\n\n"
                "Available actions (send as JSON):\n"
                '  {"type":"inspect","target":"logs|config|snippet|metrics"}\n'
                '  {"type":"ask_specialist","specialist":"runtime|dispatch|kernel|loader"}\n'
                '  {"type":"apply_fix","fix":"relax_arch_check|add_whitelist_entry|fix_runtime_path|switch_backend|update_model_config|fix_weight_mapping"}\n'
                '  {"type":"submit","root_cause":"...","fix":"...","justification":"reason for diagnosis"}\n'
            ),
            incident_ticket=scenario.incident_ticket,
            hardware=scenario.hardware,
            model_name=scenario.model_name,
            backend=scenario.backend,
            log_excerpt=scenario.initial_log,
            code_snippet=scenario.initial_snippet,
            specialist_opinions=specialist_obs,
            steps_remaining=MAX_STEPS,
            fix_used=False,
            done=False,
            reward=0.0,
        )

    def step(self, action: StackDoctorAction, **kwargs) -> StackDoctorObservation:
        """Parse the JSON action string and dispatch to a handler.

        The step counter is advanced BEFORE parsing, so malformed JSON
        still consumes a step (by design: every attempt costs budget).
        """
        ep = self._episode
        if ep is None or ep.done:
            return self._terminal_obs("Episode is over. Call reset() to start a new incident.", 0.0)

        self._state.step_count += 1
        ep.step_count += 1

        try:
            parsed = json.loads(action.message)
        except (json.JSONDecodeError, TypeError):
            # TypeError covers action.message being None / non-str.
            return self._handle_invalid(ep, f"Invalid JSON: {action.message[:200]}")

        action_type = parsed.get("type")

        if action_type == "inspect":
            return self._handle_inspect(ep, parsed)
        elif action_type == "ask_specialist":
            return self._handle_ask_specialist(ep, parsed)
        elif action_type == "apply_fix":
            return self._handle_apply_fix(ep, parsed)
        elif action_type == "submit":
            return self._handle_submit(ep, parsed)
        else:
            return self._handle_invalid(ep, f"Unknown action type: {action_type}")

    @property
    def state(self) -> State:
        """Current public OpenEnv state (episode id + step count)."""
        return self._state

    def _handle_inspect(self, ep: EpisodeState, parsed: dict) -> StackDoctorObservation:
        """Return one inspection artifact (logs/config/snippet/metrics); costs -0.25."""
        target = parsed.get("target")
        if target not in INSPECT_TARGETS:
            return self._handle_invalid(ep, f"Invalid inspect target: {target}. Use: {INSPECT_TARGETS}")

        reward = -0.25
        ep.cumulative_reward += reward
        ep.actions_taken.append({"type": "inspect", "target": target})

        ir = ep.scenario.inspect_results
        result_map = {"logs": ir.logs, "config": ir.config, "snippet": ir.snippet, "metrics": ir.metrics}

        return self._step_obs(ep, output=f"[INSPECT {target.upper()}]\n{result_map[target]}", reward=reward)

    def _handle_ask_specialist(self, ep: EpisodeState, parsed: dict) -> StackDoctorObservation:
        """Return one specialist's follow-up analysis; costs -0.25."""
        specialist = parsed.get("specialist")
        if specialist not in SPECIALISTS:
            return self._handle_invalid(ep, f"Invalid specialist: {specialist}. Use: {SPECIALISTS}")

        reward = -0.25
        ep.cumulative_reward += reward
        ep.actions_taken.append({"type": "ask_specialist", "specialist": specialist})

        followup = ep.scenario.specialist_followups.get(specialist, "No additional information.")
        return self._step_obs(ep, output=f"[SPECIALIST: {specialist.upper()}]\n{followup}", reward=reward)

    def _handle_apply_fix(self, ep: EpisodeState, parsed: dict) -> StackDoctorObservation:
        """Apply a fix (once per episode). Correct fix: +3.0; wrong fix: -2.0."""
        if ep.fix_applied:
            return self._handle_invalid(ep, "apply_fix already used this episode. You can only apply one fix.")

        fix = parsed.get("fix")
        if fix not in VALID_FIXES:
            return self._handle_invalid(ep, f"Invalid fix: {fix}. Use one of: {sorted(VALID_FIXES)}")

        ep.fix_applied = True
        is_correct = fix == ep.scenario.correct_fix
        ep.fix_was_correct = is_correct

        reward = 3.0 if is_correct else -2.0
        ep.cumulative_reward += reward
        ep.actions_taken.append({"type": "apply_fix", "fix": fix, "correct": is_correct})

        # The outcome message leaks whether the fix worked — intentional
        # feedback signal before the agent submits its diagnosis.
        if is_correct:
            output = f"[FIX APPLIED: {fix}] Fix applied successfully. Systems recovering. Now submit your diagnosis."
        else:
            output = f"[FIX APPLIED: {fix}] Fix applied but the issue persists. Consider your diagnosis carefully."

        return self._step_obs(ep, output=output, reward=reward)

    def _handle_submit(self, ep: EpisodeState, parsed: dict) -> StackDoctorObservation:
        """Score the final diagnosis and terminate the episode.

        Scoring: +8/-4 for root cause, +8/-4 for fix, +2 efficiency bonus
        when both are right within 4 steps (the submit itself counts as a
        step), +1 for a justification of at least 10 non-space characters.
        Invalid root_cause/fix values do NOT terminate — they go through
        the normal invalid-action path so the agent can retry.
        """
        root_cause = parsed.get("root_cause")
        fix = parsed.get("fix")
        justification = parsed.get("justification", "")

        if root_cause not in VALID_ROOT_CAUSES:
            return self._handle_invalid(ep, f"Invalid root_cause: {root_cause}. Use one of: {sorted(VALID_ROOT_CAUSES)}")
        if fix not in VALID_FIXES:
            return self._handle_invalid(ep, f"Invalid fix: {fix}. Use one of: {sorted(VALID_FIXES)}")

        ep.done = True
        correct_rc = ep.scenario.root_cause
        correct_fix = ep.scenario.correct_fix
        rc_correct = root_cause == correct_rc
        fix_correct = fix == correct_fix
        has_justification = len(justification.strip()) >= 10

        reward = 0.0
        reward += 8.0 if rc_correct else -4.0
        reward += 8.0 if fix_correct else -4.0
        if (rc_correct and fix_correct) and ep.step_count <= 4:
            reward += 2.0
        if has_justification:
            reward += 1.0

        ep.cumulative_reward += reward
        ep.actions_taken.append({
            "type": "submit", "root_cause": root_cause, "fix": fix,
            "justification": justification,
            "rc_correct": rc_correct, "fix_correct": fix_correct,
            "has_justification": has_justification,
        })

        output_lines = ["[DIAGNOSIS SUBMITTED]"]
        output_lines.append(f"  Root cause: {root_cause} — {'CORRECT' if rc_correct else 'WRONG (was: ' + correct_rc + ')'}")
        output_lines.append(f"  Fix: {fix} — {'CORRECT' if fix_correct else 'WRONG (was: ' + correct_fix + ')'}")
        if has_justification:
            output_lines.append(f"  Justification: {justification.strip()}")
            output_lines.append("  JUSTIFICATION BONUS: +1")
        else:
            output_lines.append("  No justification provided (missed +1 bonus)")
        output_lines.append(f"  Steps used: {ep.step_count}/{MAX_STEPS}")
        if rc_correct and fix_correct and ep.step_count <= 4:
            output_lines.append("  EFFICIENCY BONUS: +2 (solved in <= 4 steps)")
        output_lines.append(f"  Episode reward: {ep.cumulative_reward:.2f}")

        return self._terminal_obs("\n".join(output_lines), reward)

    def _handle_invalid(self, ep: EpisodeState, msg: str) -> StackDoctorObservation:
        """Charge the -2.0 invalid-action penalty; terminate if budget is spent."""
        reward = -2.0
        ep.cumulative_reward += reward
        ep.actions_taken.append({"type": "invalid", "message": msg})

        if ep.step_count >= MAX_STEPS:
            ep.done = True
            return self._terminal_obs(f"[INVALID ACTION] {msg}\n[EPISODE OVER] Max steps reached. Auto-fail.", reward)

        return self._step_obs(ep, output=f"[INVALID ACTION] {msg}", reward=reward)

    def _step_obs(self, ep: EpisodeState, output: str, reward: float) -> StackDoctorObservation:
        """Build a mid-episode observation; auto-fails (-4.0) when the step
        budget runs out without a submission."""
        remaining = MAX_STEPS - ep.step_count
        if remaining <= 0 and not ep.done:
            ep.done = True
            reward -= 4.0
            ep.cumulative_reward += -4.0
            output += "\n\n[EPISODE OVER] Max steps reached without submission. Auto-fail. Reward: -4"

        return StackDoctorObservation(
            output=output, incident_ticket=ep.scenario.incident_ticket,
            hardware=ep.scenario.hardware, model_name=ep.scenario.model_name,
            backend=ep.scenario.backend, log_excerpt="", code_snippet="",
            specialist_opinions={}, steps_remaining=remaining, fix_used=ep.fix_applied,
            done=ep.done, reward=reward,
            metadata={"cumulative_reward": ep.cumulative_reward, "step": ep.step_count, "scenario_id": ep.scenario.id},
        )

    def _terminal_obs(self, output: str, reward: float) -> StackDoctorObservation:
        """Build a terminal observation; tolerates a missing episode (pre-reset)."""
        ep = self._episode
        return StackDoctorObservation(
            output=output, incident_ticket=ep.scenario.incident_ticket if ep else "",
            hardware=ep.scenario.hardware if ep else "", model_name=ep.scenario.model_name if ep else "",
            backend=ep.scenario.backend if ep else "", log_excerpt="", code_snippet="",
            specialist_opinions={}, steps_remaining=0, fix_used=ep.fix_applied if ep else False,
            done=True, reward=reward,
            metadata={"cumulative_reward": ep.cumulative_reward if ep else 0.0, "step": ep.step_count if ep else 0, "scenario_id": ep.scenario.id if ep else ""},
        )
|
server/stack_doctor_mcp.py
ADDED
|
@@ -0,0 +1,393 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
"""
|
| 2 |
+
Stack Doctor MCP Environment.
|
| 3 |
+
|
| 4 |
+
Wraps the core Stack Doctor environment with MCP tools that agents
|
| 5 |
+
can discover and invoke. This is the agent-facing interface —
|
| 6 |
+
agents call tools like read_log(), query_specialist(), submit_diagnosis()
|
| 7 |
+
instead of constructing JSON action strings.
|
| 8 |
+
|
| 9 |
+
The training (WebSocket) API still works through _step_impl().
|
| 10 |
+
"""
|
| 11 |
+
|
| 12 |
+
from __future__ import annotations
|
| 13 |
+
|
| 14 |
+
import json
|
| 15 |
+
from typing import Any, Optional
|
| 16 |
+
from uuid import uuid4
|
| 17 |
+
|
| 18 |
+
from mcp.server.fastmcp import FastMCP
|
| 19 |
+
from openenv.core.env_server.mcp_environment import MCPEnvironment
|
| 20 |
+
from openenv.core.env_server.types import Action, Observation, State
|
| 21 |
+
|
| 22 |
+
from models import StackDoctorAction, StackDoctorObservation
|
| 23 |
+
from .scenarios import (
|
| 24 |
+
ROOT_CAUSE_TO_FIX,
|
| 25 |
+
FIX_TO_ROOT_CAUSE,
|
| 26 |
+
ROOT_CAUSES,
|
| 27 |
+
FIXES,
|
| 28 |
+
SPECIALISTS,
|
| 29 |
+
Scenario,
|
| 30 |
+
get_scenario,
|
| 31 |
+
)
|
| 32 |
+
|
| 33 |
+
# Hard cap on agent actions per episode; every tool call counts.
MAX_STEPS = 6
# Set copies of the canonical vocabularies for O(1) membership checks.
VALID_FIXES = set(FIXES)
VALID_ROOT_CAUSES = set(ROOT_CAUSES)
|
| 36 |
+
|
| 37 |
+
|
| 38 |
+
class StackDoctorMCPEnvironment(MCPEnvironment):
|
| 39 |
+
"""
|
| 40 |
+
Stack Doctor with MCP tool interface for agent interaction.
|
| 41 |
+
|
| 42 |
+
Agents discover available tools (read_log, check_config, view_code,
|
| 43 |
+
run_diagnostic, query_specialist, apply_fix, submit_diagnosis) and
|
| 44 |
+
call them to investigate incidents and submit diagnoses.
|
| 45 |
+
"""
|
| 46 |
+
|
| 47 |
+
SUPPORTS_CONCURRENT_SESSIONS: bool = True
|
| 48 |
+
|
| 49 |
+
    def __init__(self):
        """Create the FastMCP server, register the seven tools, and
        initialise per-episode bookkeeping.

        NOTE: tool docstrings below are agent-facing — FastMCP publishes
        them as tool descriptions — so they are part of runtime behavior
        and must not be edited casually.
        """
        mcp = FastMCP("stack_doctor")
        # Public OpenEnv State object (episode id + step counter).
        self._state_obj = State(episode_id=str(uuid4()), step_count=0)
        # Per-episode bookkeeping; populated by reset().
        self._scenario: Scenario | None = None
        self._step_count = 0
        self._fix_applied = False
        self._fix_was_correct: bool | None = None
        self._done = False
        self._cumulative_reward = 0.0
        self._actions_taken: list[dict] = []

        env = self  # capture for closures

        @mcp.tool()
        def read_log() -> str:
            """Read system and application logs for the current incident.
            Returns log output from the affected inference stack including
            error messages, warnings, and system state information.
            Costs 1 step (-0.25 reward)."""
            return env._do_inspect("logs")

        @mcp.tool()
        def check_config() -> str:
            """Check configuration files for the current incident.
            Returns relevant configuration parameters including GPU settings,
            backend configuration, model parameters, and environment variables.
            Costs 1 step (-0.25 reward)."""
            return env._do_inspect("config")

        @mcp.tool()
        def view_code() -> str:
            """View relevant source code snippets for the current incident.
            Returns code from the affected component showing the likely
            location of the bug or misconfiguration.
            Costs 1 step (-0.25 reward)."""
            return env._do_inspect("snippet")

        @mcp.tool()
        def run_diagnostic() -> str:
            """Run performance diagnostics and metrics collection.
            Returns metrics like latency, throughput, GPU utilization,
            error rates, and memory usage for the affected system.
            Costs 1 step (-0.25 reward)."""
            return env._do_inspect("metrics")

        @mcp.tool()
        def query_specialist(specialist: str) -> str:
            """Ask a specialist for their analysis of the incident.
            Specialists: 'runtime', 'dispatch', 'kernel', 'loader'.
            WARNING: At least one specialist gives wrong advice per incident.
            Cross-verify specialist opinions before trusting them.
            Costs 1 step (-0.25 reward)."""
            return env._do_ask_specialist(specialist)

        @mcp.tool()
        def apply_fix(fix: str) -> str:
            """Apply a fix to the system. Can only be used ONCE per incident.
            Available fixes: 'relax_arch_check', 'add_whitelist_entry',
            'fix_runtime_path', 'switch_backend', 'update_model_config',
            'fix_weight_mapping'.
            Correct fix: +3 reward. Wrong fix: -2 reward."""
            return env._do_apply_fix(fix)

        @mcp.tool()
        def submit_diagnosis(root_cause: str, fix: str, justification: str = "") -> str:
            """Submit your final diagnosis. This ends the episode.
            Root causes: 'arch_guard', 'backend_whitelist', 'runtime_loader',
            'backend_selector', 'model_config', 'weight_layout'.
            Fixes: 'relax_arch_check', 'add_whitelist_entry', 'fix_runtime_path',
            'switch_backend', 'update_model_config', 'fix_weight_mapping'.
            justification: A short sentence explaining WHY you chose this root cause
            and fix based on the evidence you gathered. Bonus +1 if provided.
            Correct root_cause: +8. Wrong: -4. Correct fix: +8. Wrong: -4.
            Bonus +2 if solved in 4 or fewer steps. Bonus +1 for justification."""
            return env._do_submit(root_cause, fix, justification)

        # Hand the configured FastMCP server to the MCPEnvironment base.
        super().__init__(mcp)
|
| 126 |
+
|
| 127 |
+
# ------------------------------------------------------------------
|
| 128 |
+
# MCP tool implementations
|
| 129 |
+
# ------------------------------------------------------------------
|
| 130 |
+
|
| 131 |
+
def _check_episode(self) -> str | None:
|
| 132 |
+
"""Return error message if episode is not active."""
|
| 133 |
+
if self._scenario is None:
|
| 134 |
+
return "No active incident. Call reset() first."
|
| 135 |
+
if self._done:
|
| 136 |
+
return "Episode is over. Call reset() to start a new incident."
|
| 137 |
+
if self._step_count >= MAX_STEPS:
|
| 138 |
+
self._done = True
|
| 139 |
+
return "Max steps reached. Episode over."
|
| 140 |
+
return None
|
| 141 |
+
|
| 142 |
+
def _record_step(self, reward: float, action: dict) -> None:
|
| 143 |
+
self._step_count += 1
|
| 144 |
+
self._state_obj.step_count = self._step_count
|
| 145 |
+
self._cumulative_reward += reward
|
| 146 |
+
self._actions_taken.append(action)
|
| 147 |
+
|
| 148 |
+
def _do_inspect(self, target: str) -> str:
|
| 149 |
+
err = self._check_episode()
|
| 150 |
+
if err:
|
| 151 |
+
return err
|
| 152 |
+
|
| 153 |
+
ir = self._scenario.inspect_results
|
| 154 |
+
result_map = {
|
| 155 |
+
"logs": ir.logs,
|
| 156 |
+
"config": ir.config,
|
| 157 |
+
"snippet": ir.snippet,
|
| 158 |
+
"metrics": ir.metrics,
|
| 159 |
+
}
|
| 160 |
+
|
| 161 |
+
self._record_step(-0.25, {"type": "inspect", "target": target})
|
| 162 |
+
|
| 163 |
+
remaining = MAX_STEPS - self._step_count
|
| 164 |
+
return (
|
| 165 |
+
f"[INSPECT {target.upper()}]\n"
|
| 166 |
+
f"{result_map[target]}\n\n"
|
| 167 |
+
f"[Steps remaining: {remaining} | Reward: -0.25 | Cumulative: {self._cumulative_reward:.2f}]"
|
| 168 |
+
)
|
| 169 |
+
|
| 170 |
+
def _do_ask_specialist(self, specialist: str) -> str:
|
| 171 |
+
err = self._check_episode()
|
| 172 |
+
if err:
|
| 173 |
+
return err
|
| 174 |
+
|
| 175 |
+
if specialist not in SPECIALISTS:
|
| 176 |
+
self._record_step(-2.0, {"type": "invalid", "message": f"Unknown specialist: {specialist}"})
|
| 177 |
+
return f"Invalid specialist '{specialist}'. Available: {SPECIALISTS}. Penalty: -2.0"
|
| 178 |
+
|
| 179 |
+
followup = self._scenario.specialist_followups.get(specialist, "No additional information.")
|
| 180 |
+
self._record_step(-0.25, {"type": "ask_specialist", "specialist": specialist})
|
| 181 |
+
|
| 182 |
+
remaining = MAX_STEPS - self._step_count
|
| 183 |
+
return (
|
| 184 |
+
f"[SPECIALIST: {specialist.upper()}]\n"
|
| 185 |
+
f"{followup}\n\n"
|
| 186 |
+
f"[Steps remaining: {remaining} | Reward: -0.25 | Cumulative: {self._cumulative_reward:.2f}]"
|
| 187 |
+
)
|
| 188 |
+
|
| 189 |
+
def _do_apply_fix(self, fix: str) -> str:
|
| 190 |
+
err = self._check_episode()
|
| 191 |
+
if err:
|
| 192 |
+
return err
|
| 193 |
+
|
| 194 |
+
if self._fix_applied:
|
| 195 |
+
self._record_step(-2.0, {"type": "invalid", "message": "Fix already applied"})
|
| 196 |
+
return "You already applied a fix this episode. Only one fix allowed. Penalty: -2.0"
|
| 197 |
+
|
| 198 |
+
if fix not in VALID_FIXES:
|
| 199 |
+
self._record_step(-2.0, {"type": "invalid", "message": f"Invalid fix: {fix}"})
|
| 200 |
+
return f"Invalid fix '{fix}'. Available: {sorted(VALID_FIXES)}. Penalty: -2.0"
|
| 201 |
+
|
| 202 |
+
self._fix_applied = True
|
| 203 |
+
is_correct = fix == self._scenario.correct_fix
|
| 204 |
+
self._fix_was_correct = is_correct
|
| 205 |
+
reward = 3.0 if is_correct else -2.0
|
| 206 |
+
self._record_step(reward, {"type": "apply_fix", "fix": fix, "correct": is_correct})
|
| 207 |
+
|
| 208 |
+
remaining = MAX_STEPS - self._step_count
|
| 209 |
+
if is_correct:
|
| 210 |
+
return (
|
| 211 |
+
f"[FIX APPLIED: {fix}] Fix applied successfully. Systems recovering.\n"
|
| 212 |
+
f"Now submit your diagnosis with submit_diagnosis().\n\n"
|
| 213 |
+
f"[Steps remaining: {remaining} | Reward: +3.0 | Cumulative: {self._cumulative_reward:.2f}]"
|
| 214 |
+
)
|
| 215 |
+
else:
|
| 216 |
+
return (
|
| 217 |
+
f"[FIX APPLIED: {fix}] Fix applied but the issue persists.\n"
|
| 218 |
+
f"Consider your diagnosis carefully.\n\n"
|
| 219 |
+
f"[Steps remaining: {remaining} | Reward: -2.0 | Cumulative: {self._cumulative_reward:.2f}]"
|
| 220 |
+
)
|
| 221 |
+
|
| 222 |
+
    def _do_submit(self, root_cause: str, fix: str, justification: str = "") -> str:
        """Score the final diagnosis and terminate the episode.

        Scoring: +8/-4 for root cause, +8/-4 for fix, +2 efficiency bonus
        when both are correct within 4 steps (the submit itself counts),
        +1 when the justification has >= 10 non-space characters.
        Invalid root_cause/fix values do NOT terminate — they cost -2.0
        and the agent may retry.
        """
        err = self._check_episode()
        if err:
            return err

        if root_cause not in VALID_ROOT_CAUSES:
            self._record_step(-2.0, {"type": "invalid", "message": f"Invalid root_cause: {root_cause}"})
            return f"Invalid root_cause '{root_cause}'. Available: {sorted(VALID_ROOT_CAUSES)}. Penalty: -2.0"

        if fix not in VALID_FIXES:
            self._record_step(-2.0, {"type": "invalid", "message": f"Invalid fix: {fix}"})
            return f"Invalid fix '{fix}'. Available: {sorted(VALID_FIXES)}. Penalty: -2.0"

        self._done = True
        rc_correct = root_cause == self._scenario.root_cause
        fix_correct = fix == self._scenario.correct_fix
        has_justification = len(justification.strip()) >= 10

        reward = 0.0
        reward += 8.0 if rc_correct else -4.0
        reward += 8.0 if fix_correct else -4.0
        # _record_step below will increment _step_count, so "+ 1" here is
        # the post-submit step total — equivalent to the "<= 4" check on
        # the already-incremented counter used for the message further down.
        if rc_correct and fix_correct and self._step_count + 1 <= 4:
            reward += 2.0
        if has_justification:
            reward += 1.0

        self._record_step(reward, {
            "type": "submit", "root_cause": root_cause, "fix": fix,
            "justification": justification,
            "rc_correct": rc_correct, "fix_correct": fix_correct,
            "has_justification": has_justification,
        })

        lines = ["[DIAGNOSIS SUBMITTED]"]
        lines.append(f"  Root cause: {root_cause} — {'CORRECT' if rc_correct else 'WRONG (was: ' + self._scenario.root_cause + ')'}")
        lines.append(f"  Fix: {fix} — {'CORRECT' if fix_correct else 'WRONG (was: ' + self._scenario.correct_fix + ')'}")
        if has_justification:
            lines.append(f"  Justification: {justification.strip()}")
            lines.append("  JUSTIFICATION BONUS: +1")
        else:
            lines.append("  No justification provided (missed +1 bonus)")
        lines.append(f"  Steps used: {self._step_count}/{MAX_STEPS}")
        if rc_correct and fix_correct and self._step_count <= 4:
            lines.append("  EFFICIENCY BONUS: +2 (solved in <= 4 steps)")
        lines.append(f"  Episode reward: {self._cumulative_reward:.2f}")

        return "\n".join(lines)
|
| 269 |
+
|
| 270 |
+
# ------------------------------------------------------------------
|
| 271 |
+
# OpenEnv Environment interface (for training / WebSocket API)
|
| 272 |
+
# ------------------------------------------------------------------
|
| 273 |
+
|
| 274 |
+
    def reset(self, seed=None, episode_id=None, **kwargs) -> StackDoctorObservation:
        """Start a new episode and return the initial incident observation.

        kwargs may carry "scenario_id" (exact scenario) and "split"
        ("train"/"eval"), forwarded to get_scenario().  All per-episode
        bookkeeping is cleared here.
        """
        scenario_id = kwargs.get("scenario_id")
        split = kwargs.get("split", "train")
        self._scenario = get_scenario(scenario_id, split=split)

        self._state_obj = State(
            episode_id=episode_id or str(uuid4()),
            step_count=0,
        )
        self._step_count = 0
        self._fix_applied = False
        self._fix_was_correct = None
        self._done = False
        self._cumulative_reward = 0.0
        self._actions_taken = []

        # Expose each specialist's initial opinion + confidence (but not
        # the hidden is-wrong flag) to the agent.
        specialist_obs = {}
        for name, op in self._scenario.specialist_opinions.items():
            specialist_obs[name] = {
                "opinion": op.opinion,
                "confidence": op.confidence,
            }

        return StackDoctorObservation(
            output=(
                "STACK DOCTOR — New incident assigned.\n"
                "Investigate using the available tools: read_log(), check_config(), "
                "view_code(), run_diagnostic(), query_specialist(name).\n"
                "When ready, apply_fix(fix) and/or submit_diagnosis(root_cause, fix).\n"
                "You have 6 steps. At least one specialist is WRONG — cross-verify.\n"
            ),
            incident_ticket=self._scenario.incident_ticket,
            hardware=self._scenario.hardware,
            model_name=self._scenario.model_name,
            backend=self._scenario.backend,
            log_excerpt=self._scenario.initial_log,
            code_snippet=self._scenario.initial_snippet,
            specialist_opinions=specialist_obs,
            steps_remaining=MAX_STEPS,
            fix_used=False,
            done=False,
            reward=0.0,
        )
|
| 317 |
+
|
| 318 |
+
def _step_impl(
|
| 319 |
+
self,
|
| 320 |
+
action: Action,
|
| 321 |
+
timeout_s: Optional[float] = None,
|
| 322 |
+
**kwargs: Any,
|
| 323 |
+
) -> Observation:
|
| 324 |
+
"""Handle non-MCP actions (JSON action strings for training)."""
|
| 325 |
+
if not isinstance(action, StackDoctorAction):
|
| 326 |
+
return self._make_obs("Invalid action type.", -2.0)
|
| 327 |
+
|
| 328 |
+
try:
|
| 329 |
+
parsed = json.loads(action.message)
|
| 330 |
+
except (json.JSONDecodeError, TypeError):
|
| 331 |
+
return self._make_obs(f"Invalid JSON: {action.message[:200]}", -2.0)
|
| 332 |
+
|
| 333 |
+
action_type = parsed.get("type")
|
| 334 |
+
|
| 335 |
+
if action_type == "inspect":
|
| 336 |
+
result = self._do_inspect(parsed.get("target", "logs"))
|
| 337 |
+
elif action_type == "ask_specialist":
|
| 338 |
+
result = self._do_ask_specialist(parsed.get("specialist", ""))
|
| 339 |
+
elif action_type == "apply_fix":
|
| 340 |
+
result = self._do_apply_fix(parsed.get("fix", ""))
|
| 341 |
+
elif action_type == "submit":
|
| 342 |
+
result = self._do_submit(parsed.get("root_cause", ""), parsed.get("fix", ""), parsed.get("justification", ""))
|
| 343 |
+
else:
|
| 344 |
+
self._record_step(-2.0, {"type": "invalid", "message": f"Unknown: {action_type}"})
|
| 345 |
+
result = f"Unknown action type: {action_type}. Penalty: -2.0"
|
| 346 |
+
|
| 347 |
+
# Extract last reward from actions
|
| 348 |
+
last_reward = 0.0
|
| 349 |
+
if self._actions_taken:
|
| 350 |
+
last = self._actions_taken[-1]
|
| 351 |
+
if last.get("type") == "submit":
|
| 352 |
+
# Calculate submit reward
|
| 353 |
+
rc_c = last.get("rc_correct", False)
|
| 354 |
+
fx_c = last.get("fix_correct", False)
|
| 355 |
+
last_reward = (8.0 if rc_c else -4.0) + (8.0 if fx_c else -4.0)
|
| 356 |
+
if rc_c and fx_c and self._step_count <= 4:
|
| 357 |
+
last_reward += 2.0
|
| 358 |
+
if last.get("has_justification", False):
|
| 359 |
+
last_reward += 1.0
|
| 360 |
+
elif last.get("type") == "apply_fix":
|
| 361 |
+
last_reward = 3.0 if last.get("correct") else -2.0
|
| 362 |
+
elif last.get("type") == "invalid":
|
| 363 |
+
last_reward = -2.0
|
| 364 |
+
else:
|
| 365 |
+
last_reward = -0.25
|
| 366 |
+
|
| 367 |
+
return self._make_obs(result, last_reward)
|
| 368 |
+
|
| 369 |
+
def _make_obs(self, output: str, reward: float) -> StackDoctorObservation:
|
| 370 |
+
remaining = MAX_STEPS - self._step_count
|
| 371 |
+
return StackDoctorObservation(
|
| 372 |
+
output=output,
|
| 373 |
+
incident_ticket=self._scenario.incident_ticket if self._scenario else "",
|
| 374 |
+
hardware=self._scenario.hardware if self._scenario else "",
|
| 375 |
+
model_name=self._scenario.model_name if self._scenario else "",
|
| 376 |
+
backend=self._scenario.backend if self._scenario else "",
|
| 377 |
+
log_excerpt="",
|
| 378 |
+
code_snippet="",
|
| 379 |
+
specialist_opinions={},
|
| 380 |
+
steps_remaining=remaining,
|
| 381 |
+
fix_used=self._fix_applied,
|
| 382 |
+
done=self._done,
|
| 383 |
+
reward=reward,
|
| 384 |
+
metadata={
|
| 385 |
+
"cumulative_reward": self._cumulative_reward,
|
| 386 |
+
"step": self._step_count,
|
| 387 |
+
"scenario_id": self._scenario.id if self._scenario else "",
|
| 388 |
+
},
|
| 389 |
+
)
|
| 390 |
+
|
| 391 |
+
    @property
    def state(self) -> State:
        """Current episode State object (episode_id and step bookkeeping)."""
        return self._state_obj
|
training/Dockerfile
ADDED
|
@@ -0,0 +1,39 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
# Training image for Stack Doctor GRPO runs (CUDA 12.4 / PyTorch 2.5.1 devel base).
FROM pytorch/pytorch:2.5.1-cuda12.4-cudnn9-devel

WORKDIR /app

# git is required for the VCS pip install of unsloth below.
RUN apt-get update && apt-get install -y --no-install-recommends git && rm -rf /var/lib/apt/lists/*

RUN pip install --no-cache-dir --upgrade pip

# Step 1: Install unsloth first (it pulls the torch version it wants)
RUN pip install --no-cache-dir \
    "unsloth @ git+https://github.com/unslothai/unsloth.git" \
    unsloth_zoo \
    xformers

# Step 2: Now install torchvision to match whatever torch unsloth installed
RUN pip install --no-cache-dir --upgrade torchvision

# Step 3: Install TRL + training deps
RUN pip install --no-cache-dir \
    "trl>=0.18.2,<=0.24.0" \
    "peft>=0.18.0" \
    "accelerate>=0.34.1" \
    "bitsandbytes>=0.45.5" \
    "datasets>=3.4.1" \
    "transformers>=4.51.3" \
    "huggingface_hub>=0.34.0" \
    sentencepiece \
    hf_transfer \
    torchao \
    triton

# Step 4: Install openenv for environment stepping
RUN pip install --no-cache-dir openenv-core

# Copy project code
COPY . /app/stack_doctor/
ENV PYTHONPATH="/app/stack_doctor:$PYTHONPATH"

CMD ["python", "/app/stack_doctor/training/train_stack_doctor.py"]
|
training/__init__.py
ADDED
|
File without changes
|
training/eval_stack_doctor.py
ADDED
|
@@ -0,0 +1,143 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
"""
|
| 2 |
+
Stack Doctor — Evaluation Script
|
| 3 |
+
|
| 4 |
+
Produces the 4 metrics for judges:
|
| 5 |
+
1. Root-cause accuracy
|
| 6 |
+
2. Fix-family accuracy
|
| 7 |
+
3. Average steps to resolution
|
| 8 |
+
4. Mean reward before vs after RL
|
| 9 |
+
|
| 10 |
+
Can evaluate any model (base or fine-tuned) against held-out eval scenarios.
|
| 11 |
+
"""
|
| 12 |
+
|
| 13 |
+
import json
|
| 14 |
+
import os
|
| 15 |
+
import sys
|
| 16 |
+
|
| 17 |
+
SCRIPT_DIR = os.path.dirname(os.path.abspath(__file__))
|
| 18 |
+
PROJECT_DIR = os.path.dirname(SCRIPT_DIR)
|
| 19 |
+
sys.path.insert(0, PROJECT_DIR)
|
| 20 |
+
|
| 21 |
+
from server.stack_doctor_environment import StackDoctorEnvironment
|
| 22 |
+
from server.scenarios import EVAL_SCENARIOS
|
| 23 |
+
from models import StackDoctorAction
|
| 24 |
+
from training.train_stack_doctor import (
|
| 25 |
+
SYSTEM_PROMPT,
|
| 26 |
+
format_scenario_prompt,
|
| 27 |
+
extract_actions,
|
| 28 |
+
)
|
| 29 |
+
|
| 30 |
+
|
| 31 |
+
def evaluate_model(model, tokenizer, scenarios, label="Model"):
    """Run *model* against *scenarios* and report the four judge metrics.

    Metrics: root-cause accuracy, fix accuracy, average steps to resolution,
    and average episode reward.

    Args:
        model, tokenizer: an Unsloth-loaded causal LM pair.
        scenarios: iterable of scenario objects exposing .id, .root_cause
            and .correct_fix.
        label: name printed in the summary header.

    Returns:
        dict with keys rc_accuracy, fix_accuracy, avg_steps, avg_reward.

    Raises:
        ValueError: if *scenarios* is empty (the original code raised an
            opaque ZeroDivisionError in the summary).
    """
    from unsloth import FastLanguageModel
    FastLanguageModel.for_inference(model)

    total_rc_correct = 0
    total_fix_correct = 0
    total_steps = 0
    total_reward = 0.0
    n = 0

    for sc in scenarios:
        # Build the same prompt format used during training.
        messages = [
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": format_scenario_prompt(sc)},
        ]
        prompt = tokenizer.apply_chat_template(
            messages, add_generation_prompt=True, tokenize=False,
        )
        inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
        outputs = model.generate(
            **inputs,
            max_new_tokens=512,
            temperature=0.3,
            do_sample=True,
        )
        response = tokenizer.decode(outputs[0][inputs.input_ids.shape[1]:], skip_special_tokens=True)

        actions = extract_actions(response)
        if actions is None:
            # Unparseable plan: flat -5 penalty, episode counted with 0 steps.
            total_reward -= 5.0
            n += 1
            continue

        env = StackDoctorEnvironment()
        env.reset(scenario_id=sc.id)

        cum_reward = 0.0
        steps = 0
        last_submit = None

        # Replay the model's action plan against the environment.
        for action_dict in actions:
            if not isinstance(action_dict, dict):
                continue
            try:
                obs = env.step(StackDoctorAction(message=json.dumps(action_dict)))
                cum_reward += obs.reward
                steps += 1
                if action_dict.get("type") == "submit":
                    last_submit = action_dict
                if obs.done:
                    break
            except Exception:
                break

        if last_submit:
            if last_submit.get("root_cause") == sc.root_cause:
                total_rc_correct += 1
            if last_submit.get("fix") == sc.correct_fix:
                total_fix_correct += 1

        total_steps += steps
        total_reward += cum_reward
        n += 1

        print(f"  {sc.id}: rc={'OK' if last_submit and last_submit.get('root_cause')==sc.root_cause else 'FAIL'} "
              f"fix={'OK' if last_submit and last_submit.get('fix')==sc.correct_fix else 'FAIL'} "
              f"steps={steps} reward={cum_reward:.1f}")

    # BUG FIX: guard the empty-scenario case, which previously raised
    # ZeroDivisionError in the percentage lines below.
    if n == 0:
        raise ValueError("evaluate_model: received no scenarios to evaluate")

    print(f"\n{'='*50}")
    print(f"{label} Results ({n} episodes):")
    print(f"  Root-cause accuracy: {total_rc_correct/n:.1%}")
    print(f"  Fix accuracy: {total_fix_correct/n:.1%}")
    print(f"  Avg steps: {total_steps/n:.1f}")
    print(f"  Avg reward: {total_reward/n:.1f}")
    print(f"{'='*50}")

    return {
        "rc_accuracy": total_rc_correct / n,
        "fix_accuracy": total_fix_correct / n,
        "avg_steps": total_steps / n,
        "avg_reward": total_reward / n,
    }
|
| 114 |
+
|
| 115 |
+
|
| 116 |
+
def main():
    """CLI entry point: load a base model (optionally with a LoRA adapter) and evaluate it."""
    from unsloth import FastLanguageModel
    import argparse

    cli = argparse.ArgumentParser()
    cli.add_argument("--model", default="unsloth/Qwen3-1.7B", help="Model name or path")
    cli.add_argument("--lora", default=None, help="Path to LoRA adapter")
    args = cli.parse_args()

    model, tokenizer = FastLanguageModel.from_pretrained(
        model_name=args.model,
        load_in_4bit=True,
        max_seq_length=2048,
    )

    if args.lora:
        # Attach the fine-tuned adapter on top of the quantized base model.
        from peft import PeftModel
        model = PeftModel.from_pretrained(model, args.lora)

    lora_suffix = f" + {args.lora}" if args.lora else ""
    print(f"Evaluating {args.model}{lora_suffix}")
    print(f"Eval scenarios: {len(EVAL_SCENARIOS)}")
    print()

    evaluate_model(model, tokenizer, EVAL_SCENARIOS, label=args.model)


if __name__ == "__main__":
    main()
|
training/train_stack_doctor.py
ADDED
|
@@ -0,0 +1,311 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
"""
|
| 2 |
+
Stack Doctor — GRPO Training Script
|
| 3 |
+
|
| 4 |
+
Train an LLM to diagnose inference-stack incidents using Group Relative
|
| 5 |
+
Policy Optimization (GRPO) with Unsloth + TRL.
|
| 6 |
+
|
| 7 |
+
The model generates a JSON action plan, which gets executed against the
|
| 8 |
+
Stack Doctor environment. Reward = cumulative episode reward.
|
| 9 |
+
|
| 10 |
+
Fleet AI sub-theme: the agent must reconcile conflicting specialist reports
|
| 11 |
+
(some specialists lie) to identify the correct root cause and fix.
|
| 12 |
+
|
| 13 |
+
Usage (Colab with GPU):
|
| 14 |
+
!pip install unsloth trl openenv-core
|
| 15 |
+
!python train_stack_doctor.py
|
| 16 |
+
"""
|
| 17 |
+
|
| 18 |
+
import json
|
| 19 |
+
import os
|
| 20 |
+
import sys
|
| 21 |
+
import random
|
| 22 |
+
|
| 23 |
+
# ---------------------------------------------------------------------------
|
| 24 |
+
# 1. Environment setup — add server to path for direct import
|
| 25 |
+
# ---------------------------------------------------------------------------
|
| 26 |
+
SCRIPT_DIR = os.path.dirname(os.path.abspath(__file__))
|
| 27 |
+
PROJECT_DIR = os.path.dirname(SCRIPT_DIR)
|
| 28 |
+
sys.path.insert(0, PROJECT_DIR)
|
| 29 |
+
|
| 30 |
+
from server.stack_doctor_environment import StackDoctorEnvironment
|
| 31 |
+
from server.scenarios import SCENARIOS, TRAIN_SCENARIOS, EVAL_SCENARIOS
|
| 32 |
+
from models import StackDoctorAction, StackDoctorObservation
|
| 33 |
+
|
| 34 |
+
# ---------------------------------------------------------------------------
|
| 35 |
+
# 2. Build the system prompt and dataset
|
| 36 |
+
# ---------------------------------------------------------------------------
|
| 37 |
+
|
| 38 |
+
SYSTEM_PROMPT = """You are Stack Doctor, an expert AI agent that diagnoses inference-stack incidents.
|
| 39 |
+
|
| 40 |
+
You receive an incident ticket with hardware/model/backend context, log excerpts, code snippets, and specialist opinions. Some specialists may be wrong — you must reconcile conflicting reports.
|
| 41 |
+
|
| 42 |
+
You must output a JSON array of actions to investigate and then submit your diagnosis. Available actions:
|
| 43 |
+
{"type":"inspect","target":"logs|config|snippet|metrics"}
|
| 44 |
+
{"type":"ask_specialist","specialist":"runtime|dispatch|kernel|loader"}
|
| 45 |
+
{"type":"apply_fix","fix":"relax_arch_check|add_whitelist_entry|fix_runtime_path|switch_backend|update_model_config|fix_weight_mapping"}
|
| 46 |
+
{"type":"submit","root_cause":"arch_guard|backend_whitelist|runtime_loader|backend_selector|model_config|weight_layout","fix":"relax_arch_check|add_whitelist_entry|fix_runtime_path|switch_backend|update_model_config|fix_weight_mapping","justification":"short explanation of your reasoning"}
|
| 47 |
+
|
| 48 |
+
Rules:
|
| 49 |
+
- You have 6 steps max. Each inspect/ask costs -0.25. Wrong fix costs -2. Wrong submit costs -4 per field.
|
| 50 |
+
- Correct submit: +8 per correct field. Efficiency bonus +2 if solved in ≤4 steps.
|
| 51 |
+
- Justification bonus: +1 if you include a justification (≥10 chars) explaining your reasoning.
|
| 52 |
+
- apply_fix can only be used once per episode.
|
| 53 |
+
- submit MUST be your final action.
|
| 54 |
+
- Minimize investigation steps — be decisive.
|
| 55 |
+
- Always include a justification explaining what evidence led to your diagnosis.
|
| 56 |
+
|
| 57 |
+
Output ONLY a JSON array, e.g.:
|
| 58 |
+
[{"type":"inspect","target":"logs"},{"type":"submit","root_cause":"arch_guard","fix":"relax_arch_check","justification":"Logs show sm_121 rejected by arch check despite being SM90-compatible"}]"""
|
| 59 |
+
|
| 60 |
+
|
| 61 |
+
def format_scenario_prompt(scenario):
    """Render a scenario's initial observation as the user prompt text."""
    opinion_lines = [
        f"\n {name} (confidence {op.confidence:.2f}): {op.opinion}"
        for name, op in scenario.specialist_opinions.items()
    ]
    specialist_text = "".join(opinion_lines)

    return f"""INCIDENT TICKET:
{scenario.incident_ticket}

HARDWARE: {scenario.hardware}
MODEL: {scenario.model_name}
BACKEND: {scenario.backend}

LOG EXCERPT:
{scenario.initial_log}

CODE SNIPPET:
{scenario.initial_snippet}

SPECIALIST OPINIONS:{specialist_text}

Provide your action plan as a JSON array. End with a submit action."""
|
| 83 |
+
|
| 84 |
+
|
| 85 |
+
def build_dataset(scenarios, n_repeats=50):
    """Build the GRPO training set: each scenario repeated *n_repeats* times, shuffled."""
    data = [
        {
            "prompt": [
                {"role": "system", "content": SYSTEM_PROMPT},
                {"role": "user", "content": format_scenario_prompt(sc)},
            ],
            "scenario_id": sc.id,
        }
        for _ in range(n_repeats)
        for sc in scenarios
    ]
    random.shuffle(data)
    return data
|
| 99 |
+
|
| 100 |
+
|
| 101 |
+
# ---------------------------------------------------------------------------
|
| 102 |
+
# 3. Reward functions
|
| 103 |
+
# ---------------------------------------------------------------------------
|
| 104 |
+
|
| 105 |
+
def extract_actions(text):
    """Pull a JSON action array out of model output.

    Returns the parsed list, wraps a lone JSON object in a one-item list,
    or returns None when nothing parseable is found.
    """
    stripped = text.strip()

    # First attempt: the outermost [...] span, in case the model wrapped
    # the array in prose or markdown.
    lo, hi = stripped.find("["), stripped.rfind("]")
    if -1 < lo < hi:
        try:
            candidate = json.loads(stripped[lo:hi + 1])
        except json.JSONDecodeError:
            candidate = None
        if isinstance(candidate, list):
            return candidate

    # Second attempt: the whole text; a single object becomes a 1-action plan.
    try:
        parsed = json.loads(stripped)
    except json.JSONDecodeError:
        return None
    return parsed if isinstance(parsed, list) else [parsed]
|
| 126 |
+
|
| 127 |
+
|
| 128 |
+
def valid_json_reward(completions, **kwargs):
    """Shape reward: did the model emit a parseable action array with a submit?

    Scores per completion:
        -3.0  unparseable output
        -1.0  valid JSON but no submit action (episode can never score)
        +1.0  valid plan containing a submit
    """
    scores = []
    for completion in completions:
        response = completion[0]["content"] if isinstance(completion, list) else completion
        actions = extract_actions(response)
        if actions is None:
            scores.append(-3.0)
        # BUG FIX: a valid JSON array may contain non-dict items (e.g. [1,2]);
        # calling .get() on those crashed the reward callback mid-training.
        elif not any(isinstance(a, dict) and a.get("type") == "submit" for a in actions):
            scores.append(-1.0)  # no submit = useless
        else:
            scores.append(1.0)
    return scores
|
| 141 |
+
|
| 142 |
+
|
| 143 |
+
def environment_reward(completions, scenario_id=None, **kwargs):
    """Execute each action plan against Stack Doctor; score = cumulative episode reward.

    TRL passes dataset columns to reward functions as keyword arguments, so
    the per-sample ids from the "scenario_id" column arrive in the explicit
    ``scenario_id`` parameter — NOT in ``kwargs``.
    """
    scores = []
    # BUG FIX: the original read kwargs.get("scenario_id"), which is always
    # empty because the explicit parameter captures the column; every rollout
    # was therefore scored against a randomly drawn scenario instead of the
    # one in its prompt, corrupting the GRPO reward signal.
    if scenario_id is None:
        scenario_ids = [None] * len(completions)
    else:
        scenario_ids = scenario_id

    for i, completion in enumerate(completions):
        response = completion[0]["content"] if isinstance(completion, list) else completion
        actions = extract_actions(response)

        if actions is None:
            scores.append(-5.0)  # unparseable plan: hard penalty
            continue

        sid = scenario_ids[i] if i < len(scenario_ids) else None
        env = StackDoctorEnvironment()
        env.reset(scenario_id=sid)

        cumulative = 0.0
        for action_dict in actions:
            if not isinstance(action_dict, dict):
                cumulative -= 2.0
                continue
            try:
                obs = env.step(StackDoctorAction(message=json.dumps(action_dict)))
                cumulative += obs.reward
                if obs.done:
                    break
            except Exception:
                # Environment refused the action; stop the episode.
                cumulative -= 2.0
                break

        scores.append(cumulative)

    return scores
|
| 177 |
+
|
| 178 |
+
|
| 179 |
+
def efficiency_reward(completions, **kwargs):
    """Bonus for short action plans that still end in a submit.

    +2.0 for plans of <=2 actions with a submit, +1.0 for <=4, else 0.
    """
    scores = []
    for completion in completions:
        response = completion[0]["content"] if isinstance(completion, list) else completion
        actions = extract_actions(response)
        if actions is None:
            scores.append(0.0)
            continue
        # BUG FIX: isinstance guard — a valid JSON array may contain non-dict
        # items, and .get() on those crashed the reward callback.
        has_submit = any(isinstance(a, dict) and a.get("type") == "submit" for a in actions)
        if len(actions) <= 2 and has_submit:
            scores.append(2.0)  # very efficient
        elif len(actions) <= 4 and has_submit:
            scores.append(1.0)
        else:
            scores.append(0.0)
    return scores
|
| 194 |
+
|
| 195 |
+
|
| 196 |
+
def justification_reward(completions, **kwargs):
    """+1 when the final submit carries a non-trivial justification, -0.5 otherwise.

    Plans with no submit (or unparseable output) score 0 — the missing-submit
    penalty is handled by valid_json_reward.
    """
    scores = []
    for completion in completions:
        response = completion[0]["content"] if isinstance(completion, list) else completion
        actions = extract_actions(response)
        if actions is None:
            scores.append(0.0)
            continue
        # BUG FIX: isinstance guard — non-dict array items crashed .get().
        submit_actions = [a for a in actions if isinstance(a, dict) and a.get("type") == "submit"]
        if not submit_actions:
            scores.append(0.0)
            continue
        justification = submit_actions[-1].get("justification", "")
        # BUG FIX: the JSON value may not be a string; .strip() on e.g. an
        # int raised AttributeError. Non-strings count as no justification.
        if isinstance(justification, str) and len(justification.strip()) >= 10:
            scores.append(1.0)
        else:
            scores.append(-0.5)
    return scores
|
| 215 |
+
|
| 216 |
+
|
| 217 |
+
# ---------------------------------------------------------------------------
|
| 218 |
+
# 4. Training (Unsloth + TRL GRPO)
|
| 219 |
+
# ---------------------------------------------------------------------------
|
| 220 |
+
|
| 221 |
+
def main():
    """Run GRPO fine-tuning of Qwen3-1.7B on the Stack Doctor environment."""
    from unsloth import FastLanguageModel
    from datasets import Dataset
    from trl import GRPOConfig, GRPOTrainer
    import torch

    max_seq_length = 2048
    lora_rank = 8

    # Load model
    model, tokenizer = FastLanguageModel.from_pretrained(
        model_name="unsloth/Qwen3-1.7B",
        load_in_4bit=True,
        max_seq_length=max_seq_length,
    )

    # Attach LoRA adapters to attention + MLP projections.
    model = FastLanguageModel.get_peft_model(
        model,
        r=lora_rank,
        target_modules=[
            "q_proj", "k_proj", "v_proj", "o_proj",
            "gate_proj", "up_proj", "down_proj",
        ],
        lora_alpha=lora_rank * 2,  # conventional 2x-rank alpha
        use_gradient_checkpointing="unsloth",
        random_state=42,
    )

    # Build dataset
    train_data = build_dataset(TRAIN_SCENARIOS, n_repeats=80)
    dataset = Dataset.from_list(train_data)

    # Compute prompt length for config
    sample_prompt = tokenizer.apply_chat_template(
        train_data[0]["prompt"],
        add_generation_prompt=True,
        tokenize=False,
    )
    # NOTE(review): assumes every scenario prompt tokenizes to roughly the
    # same length as sample 0 (+10 slack) — confirm if scenarios vary widely.
    max_prompt_length = len(tokenizer.encode(sample_prompt)) + 10
    max_completion_length = max_seq_length - max_prompt_length

    print(f"Prompt length: ~{max_prompt_length} tokens")
    print(f"Completion budget: ~{max_completion_length} tokens")
    print(f"Dataset size: {len(dataset)} episodes")
    print(f"Train scenarios: {len(TRAIN_SCENARIOS)}, Eval scenarios: {len(EVAL_SCENARIOS)}")

    # GRPO config
    training_args = GRPOConfig(
        temperature=1.0,
        learning_rate=2e-4,
        weight_decay=0.001,
        warmup_ratio=0.1,
        lr_scheduler_type="linear",
        optim="adamw_8bit",
        logging_steps=1,
        per_device_train_batch_size=1,
        gradient_accumulation_steps=1,
        num_generations=2,  # group size for the relative advantage estimate
        max_prompt_length=max_prompt_length,
        max_completion_length=max_completion_length,
        max_steps=300,
        save_steps=50,
        report_to="none",
        output_dir="outputs",
    )

    # Trainer — per-sample reward is the sum of the four shaping functions.
    trainer = GRPOTrainer(
        model=model,
        processing_class=tokenizer,
        reward_funcs=[
            valid_json_reward,
            environment_reward,
            efficiency_reward,
            justification_reward,
        ],
        args=training_args,
        train_dataset=dataset,
    )

    print("Starting GRPO training...")
    trainer.train()

    # Save
    model.save_pretrained("stack_doctor_lora")
    tokenizer.save_pretrained("stack_doctor_lora")
    print("Training complete. LoRA saved to stack_doctor_lora/")


if __name__ == "__main__":
    main()
|
uv.lock
ADDED
|
The diff for this file is too large to render.
See raw diff
|
|
|