Spaces: v4xsh/nervousystem-env

vx7sh committed 28 days ago

Commit ff665de (parent: 0f99e53)

Ship Round 2 manifest/docs, dashboard, and GRPO training pipeline

Files changed (7)

README.md +102 -184
dashboard/README.md +14 -0
dashboard/war_room.py +369 -0
inference.py +4 -1
openenv.yaml +55 -18
requirements.txt +7 -9
training/grpo_train.py +368 -0

README.md CHANGED

@@ -23,232 +23,150 @@ models like the one evaluating this environment.
 ## Why This Matters
 Large-scale AI training runs on clusters of hundreds of
-GPUs across many nodes. These clusters fail constantly:
-- **GPU OOM errors** stall entire training jobs
-- **Network congestion** cuts throughput by 40%+
-- **Code desynchronization** across ranks hangs jobs silently
-These failures require expert human SREs to debug and fix.
-There is no standardized benchmark for evaluating whether
-AI agents can handle these failures autonomously.
-NervousSystem-Env fills that gap.
----
-## Environment Design
-### Observation Space
-The agent receives a `ClusterObservation` at each step:
-| Field | Type | Description |
-|---|---|---|
-| `nodes` | `list[NodeState]` | Per-node GPU memory, utilization, health, XID errors |
-| `training` | `TrainingMetrics` | Throughput, target, job status, stalled steps |
-| `visible_logs` | `list[str]` | Surface telemetry — throughput and status logs |
-| `step_count` | `int` | Current step number in this episode |
-| `episode_id` | `str` | Deterministic episode identifier |
-**Key design:** Deep diagnostic data (Flight Recorder
-buffers, NCCL logs) is hidden by default. The agent must
-actively query for it using investigation actions.
-### Action Space
-| Action | Parameters | Destructive | Description |
-|---|---|---|---|
-| `inspect_flight_recorder` | `rank_id: int` | No | Get PyTorch Flight Recorder data for a rank |
-| `query_nccl_logs` | `time_window: int` | No | Get NCCL communication log entries |
-| `topo_reorder` | `affinity: str` | No | Reorder ring topology (use "rack" for fix) |
-| `patch_divergent_code` | `file: str, fix_type: str` | No | Patch desynchronized code |
-| `restart_rank` | `rank_id: int` | **Yes** | Restart a specific rank (-0.2 penalty) |
-| `reset_ib_interface` | `node_id: int` | **Yes** | Reset IB interface (-0.2 penalty) |
-| `adjust_sharding_strategy` | `strategy: str` | No | Change sharding strategy |
-| `noop` | none | No | Take no action |
----
 ## Tasks
-### Easy — Culprit Rank Identification
-**Difficulty:** Easy
-Training is stalled. A NCCL watchdog timeout has fired
-across all 8 nodes. One rank failed to join a collective
-operation due to an OOM error (XID 79).
-The agent must use `inspect_flight_recorder(rank_id)` to
-examine each rank's Flight Recorder buffer and identify
-which rank has a stalled collective sequence.
-**Grader:** 1.0 for correct rank identified, 0.0 otherwise.
-Efficiency bonus up to +0.2 for early diagnosis.
-Penalty -0.1 per destructive action taken.
-**Anti-cheat:** The failing rank is randomly seeded on
-every `reset()` call. Hardcoding a rank ID scores 0.0.
----
-### Medium — Spine Switch Congestion Resolution
-**Difficulty:** Medium
-Training is running but at 55-65% of target throughput.
-The ring topology stretches across oversubscribed spine
-switches. The agent must call `topo_reorder(affinity="rack")`
-to enforce rack-local communication.
-**Grader:** Continuous score based on
-`throughput / target_throughput` (0.0 to 1.0).
-Bonus +0.15 for sustaining recovery for 5+ steps.
-Penalty -0.2 per destructive action.
----
-### Hard — Asymmetric Compilation Desync Fix
-**Difficulty:** Hard
-Training is completely hung. Different ranks compiled
-different NCCL collectives due to data-dependent branching
-in the model code. The job will never recover on its own.
-The agent must:
-1. Investigate using `query_nccl_logs` or
-	`inspect_flight_recorder`
-2. Identify the divergent source file using
-	`patch_divergent_code(file=..., fix_type=...)`
-3. Verify training resumes for 5+ steps
-**Grader:** 3-stage scoring:
-- 0.3 for identifying the correct file
-- +0.4 for applying the correct patch
-- +0.3 for sustained training recovery (5+ steps)
-= 1.0 maximum
----
-## Reward Function
-Rewards are continuous — the agent receives signal at
-every step, not just at episode end.
-| Situation | Reward |
-|---|---|
-| Correct rank identified (easy) | +0.5 |
-| Investigation action taken | +0.05 |
-| Throughput improvement (medium) | proportional to ratio |
-| Correct file identified (hard) | +0.3 |
-| Correct patch applied (hard) | +0.7 cumulative |
-| Training recovered 5+ steps | +0.3 |
-| Destructive action taken | -0.2 |
-| Noop | 0.0 |
----
-## Setup and Usage
-### Run with Docker
-```bash
-# Build
-docker build -t nervousystem-env .
-# Run
-docker run -p 7860:7860 nervousystem-env
-# Verify
-curl http://localhost:7860/health
-```
-### Run locally
 ```bash
-# Install dependencies
 pip install -r requirements.txt
-# Start server
 uvicorn app.main:app --host 0.0.0.0 --port 7860
-# In another terminal, run inference
-export API_BASE_URL=https://router.huggingface.co/v1
-export MODEL_NAME=meta-llama/Llama-3.1-8B-Instruct
-export HF_TOKEN=your_token_here
-export ENV_BASE_URL=http://localhost:7860
-python inference.py
-```
-### Environment Variables
-| Variable | Required | Description |
-|---|---|---|
-| `API_BASE_URL` | Yes | LLM API endpoint |
-| `MODEL_NAME` | Yes | Model identifier |
-| `HF_TOKEN` | Yes | HuggingFace API token |
-| `ENV_BASE_URL` | No | Env server URL (default: http://localhost:7860) |
----
 ## API Endpoints
 | Endpoint | Method | Description |
 |---|---|---|
 | `/health` | GET | Health check |
-| `/reset` | POST | Start new episode |
-| `/step` | POST | Take an action |
-| `/state` | GET | Get current observation |
-| `/grade` | POST | Get episode score |
-| `/tasks` | GET | List available tasks |
----
-## Baseline Scores
-Scores produced by running inference.py with
-`meta-llama/Llama-3.1-8B-Instruct` (seed=42):
-| Task | Score | Passed |
-|---|---|---|
-| easy | TBD | TBD |
-| medium | TBD | TBD |
-| hard | TBD | TBD |
-*Run `python inference.py` to reproduce scores.
-Scores will be updated after HF Space deployment.*
----
 ## Project Structure
-```
 nervousystem-env/
 ├── app/
-│   ├── main.py        # FastAPI endpoints
-│   ├── env.py         # Environment core logic
-│   ├── models.py      # Pydantic typed models
-│   └── config.py      # Scenarios and constants
-├── simulation/
-│   ├── cluster.py     # GPU cluster state machine
-│   ├── failures.py    # Failure injection
-│   └── telemetry.py   # Log generation
-├── tasks/
-│   ├── easy.py        # Culprit rank identification
-│   ├── medium.py      # Congestion resolution
-│   └── hard.py        # Desync fix
 ├── graders/
 │   ├── easy_grader.py
-│   ├── medium_grader.py
-│   └── hard_grader.py
-├── inference.py       # Baseline agent script
-├── Dockerfile
 ├── openenv.yaml
-└── pyproject.toml
 ```
----
 ## OpenEnv Compliance
-- `openenv validate` passes ✅
-- Typed Pydantic v2 models ✅
-- Deterministic graders ✅
-- Docker deployment ✅
-- 3 tasks with difficulty progression ✅

 ## Why This Matters
 Large-scale AI training runs on clusters of hundreds of
+# 🧠 NervousSystem-Env
+> An AI agent fixing the infrastructure that trains AI.
+> Every minute of cluster downtime wastes $5,000 in compute.
+## The Problem
+Large-scale AI training across 1000+ GPU clusters fails constantly due to hardware faults, network bottlenecks, distributed synchronization bugs, and runtime version drift. Human SREs are forced to diagnose these incidents at 3am under extreme time pressure. NervousSystem-Env turns that operational pain into a training environment where autonomous agents learn to detect failures, route work to specialist workers, and recover jobs before expensive downtime compounds.
+## Why This Matters
+- GPU OOM (XID 79): stalls the entire training job.
+- Spine switch congestion: cuts throughput by 40%+.
+- Compilation desync: hangs the job permanently.
+- LD_LIBRARY_PATH cascade: Severity-1 fleet-wide incident.
+## Architecture
+NervousSystem-Env uses a Fleet AI Supervisor-Worker design. A supervisor agent receives global cluster state and delegates targeted sub-tasks to specialist workers via `/delegate`. Workers return structured results with confidence and coordination reward signals, enabling multi-agent training for routing, diagnosis, and remediation.
+```text
+Supervisor Agent
+│
+├── LogInspectorWorker  (flight recorder, NCCL logs)
+├── PatchAgentWorker    (code patching, verification)
+├── TopoAgentWorker     (topology, bandwidth)
+└── VersionCheckerWorker (NCCL version, LD_LIBRARY_PATH)
+```
 ## Tasks
+| Task | Difficulty | Max Steps | Failure Type | Key Actions |
+|---|---:|---:|---|---|
+| easy | easy | 50 | OOM rank failure | `inspect_flight_recorder` |
+| medium | medium | 50 | network congestion | `topo_reorder(affinity="rack")` |
+| hard | hard | 50 | collective desync | `query_nccl_logs`, `patch_divergent_code` |
+| cascade | cascade | 120 | version cascade (OOM→congestion→desync) | ordered multi-phase recovery |
+## Reward Model
+```text
+Reward = 0.60 * R_success + 0.30 * R_subgoal - 0.10 * log(total_tokens)
+```
+- `R_success`: binary completion signal (recovered/running within step limit).
+- `R_subgoal`: continuous task-progress score.
+- `log(total_tokens)`: efficiency penalty to discourage verbose reasoning.
+## Quick Start
 ```bash
+# Install
 pip install -r requirements.txt
+# Start environment server
 uvicorn app.main:app --host 0.0.0.0 --port 7860
+# Start war room dashboard
+python dashboard/war_room.py
+# Run baseline agent
+python inference.py
+# Train with GRPO
+python training/grpo_train.py
+```
 ## API Endpoints
 | Endpoint | Method | Description |
 |---|---|---|
 | `/health` | GET | Health check |
+| `/reset` | POST | Reset episode by task and seed |
+| `/step` | POST | Apply one SRE action |
+| `/state` | GET | Fetch current observation |
+| `/grade` | POST | Grade current episode |
+| `/tasks` | GET | List task metadata |
+| `/delegate` | POST | Supervisor delegates to worker agent |
+## Hackathon Themes
+- Theme 1 (Fleet AI): Supervisor-Worker with `/delegate` endpoint.
+- Theme 2 (Long-Horizon): Cascade task (120 steps), Mercor reward shaping.
+- Theme 3.1 (Professional Tasks): NCCL diagnostics + Flight Recorder v2.5 workflow.
+- Theme 4 (Self-Improvement): Adversarial curriculum via seeded failure permutations.
+## Training Results
+| Task | Baseline | Trained | Improvement |
+|---|---:|---:|---:|
+| easy | TBD | TBD | TBD |
+| medium | TBD | TBD | TBD |
+| hard | TBD | TBD | TBD |
+| cascade | TBD | TBD | TBD |
+Run `python training/grpo_train.py` to reproduce.
 ## Project Structure
+```text
 nervousystem-env/
 ├── app/
+│   ├── config.py
+│   ├── env.py
+│   ├── main.py
+│   └── models.py
 ├── graders/
+│   ├── base.py
+│   ├── cascade_grader.py
 │   ├── easy_grader.py
+│   ├── hard_grader.py
+│   └── medium_grader.py
+├── simulation/
+│   ├── cluster.py
+│   ├── failures.py
+│   ├── fleet.py
+│   └── telemetry.py
+├── tasks/
+│   ├── base.py
+│   ├── cascade.py
+│   ├── easy.py
+│   ├── hard.py
+│   └── medium.py
+├── dashboard/
+│   ├── README.md
+│   └── war_room.py
+├── training/
+│   └── grpo_train.py
+├── tests/
+│   ├── test_fleet.py
+│   └── test_graders.py
+├── inference.py
 ├── openenv.yaml
+├── requirements.txt
+└── server/
+    └── app.py
 ```
 ## OpenEnv Compliance
+- `openenv validate` passes.
+- Typed Pydantic v2 models.
+- Deterministic graders.
+- Docker deployment.
+- 4 tasks with difficulty progression.
+- Multi-agent `/delegate` endpoint.
+# In another terminal, run inference
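
The reward formula in the new README's "Reward Model" section can be sketched as a small function. This is a minimal sketch, not the repository's grader: the name `mercor_reward`, the choice of the natural logarithm, and the guard that treats token counts below 1 as 1 are all assumptions. Only the three weights come from the README.

```python
import math


def mercor_reward(r_success: float, r_subgoal: float, total_tokens: int) -> float:
    """Raw reward: 0.60 * R_success + 0.30 * R_subgoal - 0.10 * log(total_tokens)."""
    # Assumption: natural log, and token counts below 1 are treated as 1
    # so the penalty term is always defined.
    token_penalty = 0.10 * math.log(max(1, total_tokens))
    return 0.60 * r_success + 0.30 * r_subgoal - token_penalty


if __name__ == "__main__":
    # Same outcome, different verbosity.
    print(round(mercor_reward(1.0, 1.0, 1), 4))
    print(round(mercor_reward(1.0, 1.0, 300), 4))
```

Under these assumptions a 300-token episode loses about 0.57 to the token term, so the raw value can fall well below the success and subgoal weights suggest, and can even go negative. The manifest declares `reward_range: [0.0, 1.0]`, so the actual grader presumably clamps or normalizes somewhere; this diff does not show how.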

dashboard/README.md ADDED

@@ -0,0 +1,14 @@

+# SRE War Room Dashboard
+Run the environment server first on port `7860`, then launch the Gradio dashboard on port `7861`.
+```bash
+uvicorn app.main:app --host 0.0.0.0 --port 7860
+python dashboard/war_room.py
+```
+Optional custom server URL:
+```bash
+ENV_BASE_URL=http://localhost:7860 python dashboard/war_room.py
+```
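
Because the dashboard depends on the environment server being up first, a launcher script could poll `/health` before starting it. The sketch below is one hedged way to do that using only the standard library; `wait_for_server`, `_http_ok`, and the retry parameters are hypothetical names and values, not part of this commit.

```python
import time
import urllib.request
from typing import Callable


def _http_ok(url: str) -> bool:
    """Return True if a GET to `url` answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=2) as response:
            return response.status == 200
    except OSError:
        # Covers connection refused, timeouts, and HTTP error statuses.
        return False


def wait_for_server(
    base_url: str = "http://localhost:7860",
    attempts: int = 10,
    delay_s: float = 1.0,
    probe: Callable[[str], bool] = _http_ok,
) -> bool:
    """Poll `<base_url>/health` until it responds or attempts run out."""
    for attempt in range(attempts):
        if probe(f"{base_url}/health"):
            return True
        if attempt < attempts - 1:
            time.sleep(delay_s)
    return False
```

The `probe` argument exists so the retry loop can be exercised without a live server; in normal use the default probe is enough.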

dashboard/war_room.py ADDED

@@ -0,0 +1,369 @@

+from __future__ import annotations
+import json
+import os
+from datetime import datetime
+import gradio as gr
+import requests
+ENV_BASE_URL = os.getenv("ENV_BASE_URL", "http://localhost:7860")
+def render_ring(nodes: list[dict]) -> str:
+    """Return an HTML string with 8 colored divs arranged in a circle."""
+    health_to_color = {
+        "healthy": "#22c55e",
+        "degraded": "#eab308",
+        "failed": "#ef4444",
+    }
+    health_to_emoji = {
+        "healthy": "🟢",
+        "degraded": "🟡",
+        "failed": "🔴",
+    }
+    padded_nodes = list(nodes[:8])
+    while len(padded_nodes) < 8:
+        padded_nodes.append(
+            {
+                "node_id": len(padded_nodes),
+                "health_status": "failed",
+                "gpu_memory_used_mb": 0,
+                "xid_errors": [],
+            }
+        )
+    cards: list[str] = []
+    for index, node in enumerate(padded_nodes):
+        health = str(node.get("health_status", "failed"))
+        color = health_to_color.get(health, "#ef4444")
+        emoji = health_to_emoji.get(health, "🔴")
+        node_id = node.get("node_id", index)
+        gpu_mem = node.get("gpu_memory_used_mb", 0)
+        xid_errors = node.get("xid_errors", [])
+        xid_text = ",".join(str(code) for code in xid_errors) if xid_errors else "none"
+        angle = index * 45
+        cards.append(
+            f"""
+            <div class='node-card' style='background:{color};
+                transform: rotate({angle}deg) translate(155px) rotate(-{angle}deg);'>
+                <div><strong>{emoji} node {node_id}</strong></div>
+                <div>health: {health}</div>
+                <div>gpu_mem: {float(gpu_mem):.0f} MB</div>
+                <div>xid: {xid_text}</div>
+            </div>
+            """
+        )
+    return f"""
+    <style>
+      .ring-wrap {{
+        position: relative;
+        width: 420px;
+        height: 420px;
+        margin: 0 auto;
+        border-radius: 50%;
+        background: radial-gradient(circle, #0b1220 0%, #111827 65%, #1f2937 100%);
+        border: 1px solid #374151;
+      }}
+      .ring-center {{
+        position: absolute;
+        left: 50%; top: 50%;
+        transform: translate(-50%, -50%);
+        color: #d1d5db;
+        font-weight: 700;
+        font-size: 14px;
+      }}
+      .node-card {{
+        position: absolute;
+        left: 50%;
+        top: 50%;
+        width: 132px;
+        min-height: 72px;
+        margin-left: -66px;
+        margin-top: -36px;
+        border-radius: 10px;
+        padding: 8px;
+        color: #111827;
+        box-shadow: 0 6px 20px rgba(0,0,0,0.25);
+        font-size: 11px;
+        line-height: 1.2;
+      }}
+    </style>
+    <div class='ring-wrap'>
+      <div class='ring-center'>Cluster Ring</div>
+      {''.join(cards)}
+    </div>
+    """
+def _safe_get(path: str) -> dict | None:
+    try:
+        response = requests.get(f"{ENV_BASE_URL}{path}", timeout=5)
+        response.raise_for_status()
+        return response.json()
+    except Exception:
+        return None
+def _safe_post(path: str, payload: dict) -> dict | None:
+    try:
+        response = requests.post(f"{ENV_BASE_URL}{path}", json=payload, timeout=8)
+        response.raise_for_status()
+        return response.json()
+    except Exception:
+        return None
+def _offline_panel(action_log: list[list]) -> tuple:
+    offline_row = [["-", "offline", 0.0, 0.0, "⚠️ Server offline"]]
+    return (
+        "<h3>⚠️ Server offline</h3>",
+        "⚠️ Server offline",
+        0.0,
+        0.0,
+        0.0,
+        0.0,
+        0.0,
+        "## ⚠️ Server offline",
+        action_log[-20:] if action_log else offline_row,
+        action_log,
+    )
+def _panel_from_state(state: dict, action_log: list[list]) -> tuple:
+    nodes = state.get("nodes", [])
+    training = state.get("training", {})
+    throughput = float(training.get("throughput_tokens_per_sec", 0.0))
+    target = float(training.get("target_throughput", 1.0))
+    stalled_steps = float(training.get("stalled_steps", 0.0))
+    status = str(training.get("job_status", "unknown"))
+    cumulative_tokens = float(state.get("cumulative_tokens", 0))
+    throughput_pct = (throughput / max(1.0, target)) * 100.0
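+    # $5,000 per minute is about $83.33 per second; this presumably treats
+    # each stalled step as roughly one second of downtime.
+    simulated_loss_prevented = stalled_steps * 83.33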
+    loss_text = f"## 💰 ${simulated_loss_prevented:,.2f} saved"
+    return (
+        render_ring(nodes),
+        status,
+        throughput,
+        throughput_pct,
+        stalled_steps,
+        cumulative_tokens,
+        simulated_loss_prevented,
+        loss_text,
+        action_log[-20:],
+        action_log,
+    )
+def refresh_panels(task_id: str, action_log: list[list]) -> tuple:
+    """Refresh dashboard panels from live server state."""
+    _ = task_id
+    state = _safe_get("/state")
+    if state is None:
+        return _offline_panel(action_log)
+    return _panel_from_state(state, action_log)
+def reset_episode(task_id: str) -> tuple:
+    """Reset episode for selected task and clear action log."""
+    result = _safe_post("/reset", {"task_id": task_id})
+    if result is None:
+        offline = _offline_panel([])
+        return (*offline, gr.update(active=True))
+    panel = _panel_from_state(result, [])
+    return (*panel, gr.update(active=True))
+def _demo_actions() -> list[dict]:
+    return [
+        {"action_type": "inspect_flight_recorder", "parameters": {"rank_id": 0}},
+        {"action_type": "inspect_flight_recorder", "parameters": {"rank_id": 1}},
+        {"action_type": "inspect_flight_recorder", "parameters": {"rank_id": 2}},
+        {"action_type": "query_nccl_logs", "parameters": {"time_window": 5}},
+        {"action_type": "query_nccl_logs", "parameters": {"time_window": 5}},
+        {"action_type": "topo_reorder", "parameters": {"affinity": "rack"}},
+        {"action_type": "topo_reorder", "parameters": {"affinity": "rack"}},
+        {"action_type": "noop", "parameters": {}},
+        {"action_type": "noop", "parameters": {}},
+        {"action_type": "noop", "parameters": {}},
+    ]
+def run_demo_agent(task_id: str, action_log: list[list]) -> tuple:
+    """Run exactly 10 hardcoded demo steps for visualization."""
+    _ = _safe_post("/reset", {"task_id": task_id})
+    rows = list(action_log)
+    for action in _demo_actions():
+        step_result = _safe_post("/step", action)
+        if step_result is None:
+            return (*_offline_panel(rows), gr.update(active=True))
+        reward = step_result.get("reward", {})
+        observation = step_result.get("observation", {})
+        step_num = observation.get("step_count", len(rows) + 1)
+        row = [
+            step_num,
+            action["action_type"],
+            float(reward.get("value", 0.0)),
+            float(reward.get("token_efficiency_score", 0.0)),
+            f"{reward.get('info', '')} @ {datetime.utcnow().isoformat(timespec='seconds')}",
+        ]
+        rows.append(row)
+    state = _safe_get("/state")
+    if state is None:
+        return (*_offline_panel(rows), gr.update(active=True))
+    panel = _panel_from_state(state, rows)
+    return (*panel, gr.update(active=True))
+def stop_refresh():
+    """Stop the auto-refresh timer."""
+    return gr.update(active=False)
+def delegate_task(worker: str, action: str) -> dict:
+    """Submit delegation request to /delegate endpoint."""
+    payload = {
+        "worker": worker,
+        "action": action,
+        "parameters": {},
+        "supervisor_reasoning": f"War Room delegation at {datetime.utcnow().isoformat()}",
+        "token_count": 0,
+    }
+    result = _safe_post("/delegate", payload)
+    if result is None:
+        return {
+            "worker": worker,
+            "action": action,
+            "success": False,
+            "output": {"error": "⚠️ Server offline"},
+            "confidence": 0.0,
+            "coordination_reward": 0.0,
+            "explanation": "⚠️ Server offline",
+            "cumulative_coordination_reward": 0.0,
+            "raw": json.dumps(payload),
+        }
+    return result
+with gr.Blocks(title="SRE War Room") as demo:
+    gr.Markdown("# 🛠️ SRE War Room")
+    gr.Markdown(f"Connected env: `{ENV_BASE_URL}`")
+    with gr.Row():
+        task_dropdown = gr.Dropdown(
+            choices=["easy", "medium", "hard", "cascade"],
+            value="easy",
+            label="Task",
+        )
+        reset_btn = gr.Button("Reset Episode", variant="primary")
+        demo_btn = gr.Button("Run Demo Agent")
+        stop_btn = gr.Button("Stop")
+    with gr.Row():
+        with gr.Column(scale=2):
+            gr.Markdown("## Panel A: Cluster Ring Topology")
+            ring_html = gr.HTML(render_ring([]))
+        with gr.Column(scale=1):
+            gr.Markdown("## Panel B: Training Metrics")
+            job_status_label = gr.Label(label="job_status", value="unknown")
+            throughput_num = gr.Number(label="throughput_tokens_per_sec", value=0.0)
+            throughput_pct_num = gr.Number(label="throughput_%_of_target", value=0.0)
+            stalled_steps_num = gr.Number(label="stalled_steps", value=0.0)
+            cumulative_tokens_num = gr.Number(label="cumulative_tokens", value=0.0)
+            loss_num = gr.Number(label="Simulated Loss Prevented $", value=0.0)
+            loss_text = gr.Markdown("## 💰 $0.00 saved")
+    with gr.Row():
+        with gr.Column(scale=3):
+            gr.Markdown("## Panel C: Agent Action Log")
+            action_df = gr.Dataframe(
+                headers=["step", "action_type", "reward", "mer_score", "info"],
+                value=[],
+                row_count=20,
+                column_count=(5, "fixed"),
+                datatype=["number", "str", "number", "number", "str"],
+                wrap=True,
+            )
+        with gr.Column(scale=2):
+            gr.Markdown("## Fleet Delegation")
+            worker_dropdown = gr.Dropdown(
+                choices=["log_inspector", "patch_agent", "topo_agent", "version_checker"],
+                value="log_inspector",
+                label="worker",
+            )
+            delegation_action = gr.Textbox(value="check_nccl_version", label="action")
+            delegate_btn = gr.Button("Delegate")
+            delegate_json = gr.JSON(label="Last delegation result")
+    action_log_state = gr.State([])
+    refresh_timer = gr.Timer(value=2.0, active=True)
+    refresh_timer.tick(
+        fn=refresh_panels,
+        inputs=[task_dropdown, action_log_state],
+        outputs=[
+            ring_html,
+            job_status_label,
+            throughput_num,
+            throughput_pct_num,
+            stalled_steps_num,
+            cumulative_tokens_num,
+            loss_num,
+            loss_text,
+            action_df,
+            action_log_state,
+        ],
+    )
+    reset_btn.click(
+        fn=reset_episode,
+        inputs=[task_dropdown],
+        outputs=[
+            ring_html,
+            job_status_label,
+            throughput_num,
+            throughput_pct_num,
+            stalled_steps_num,
+            cumulative_tokens_num,
+            loss_num,
+            loss_text,
+            action_df,
+            action_log_state,
+            refresh_timer,
+        ],
+    )
+    demo_btn.click(
+        fn=run_demo_agent,
+        inputs=[task_dropdown, action_log_state],
+        outputs=[
+            ring_html,
+            job_status_label,
+            throughput_num,
+            throughput_pct_num,
+            stalled_steps_num,
+            cumulative_tokens_num,
+            loss_num,
+            loss_text,
+            action_df,
+            action_log_state,
+            refresh_timer,
+        ],
+    )
+    stop_btn.click(fn=stop_refresh, outputs=[refresh_timer])
+    delegate_btn.click(
+        fn=delegate_task,
+        inputs=[worker_dropdown, delegation_action],
+        outputs=[delegate_json],
+    )
+if __name__ == "__main__":
+    demo.launch(server_port=7861, share=False)
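
`render_ring` places each card with the CSS transform `rotate(a) translate(155px) rotate(-a)`, where `a = index * 45` degrees. The sketch below works out the offsets that transform produces relative to the ring centre. It is illustrative only: `ring_offsets` is a hypothetical helper and does not appear in the dashboard code.

```python
import math


def ring_offsets(count: int = 8, radius_px: float = 155.0) -> list[tuple[float, float]]:
    """Return (x, y) offsets from the ring centre for each card.

    `rotate(a) translate(r) rotate(-a)` moves a card to
    (r * cos(a), r * sin(a)) in screen coordinates (y grows downward)
    while keeping the card itself upright.
    """
    step_deg = 360.0 / count  # 45 degrees for 8 nodes, as in render_ring
    offsets = []
    for index in range(count):
        angle = math.radians(index * step_deg)
        offsets.append((radius_px * math.cos(angle), radius_px * math.sin(angle)))
    return offsets


if __name__ == "__main__":
    for node_id, (x, y) in enumerate(ring_offsets()):
        print(f"node {node_id}: x={x:+.1f}px y={y:+.1f}px")
```

With the defaults, node 0 sits 155px to the right of the centre and node 2 sits 155px directly below it, so the nodes run clockwise around the ring.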

inference.py CHANGED

@@ -69,7 +69,7 @@ MODEL_NAME = os.getenv(
 )
 API_KEY = os.getenv("HF_TOKEN") or os.getenv("API_KEY", "")
 ENV_BASE_URL = os.getenv("ENV_BASE_URL", "http://localhost:7860")
-MAX_STEPS = 15
 TEMPERATURE = 0.1
 MAX_TOKENS = 300
 SEED = 42
@@ -100,6 +100,9 @@ Rules:
 - Use query_nccl_logs to see communication errors.
 - Avoid restart_rank unless absolutely necessary — it is destructive.
 - If you already know the failing rank, fix it directly.
 Example response:
 {"action_type": "inspect_flight_recorder", "parameters": {"rank_id": 3}}

 )
 API_KEY = os.getenv("HF_TOKEN") or os.getenv("API_KEY", "")
 ENV_BASE_URL = os.getenv("ENV_BASE_URL", "http://localhost:7860")
+MAX_STEPS = 20
 TEMPERATURE = 0.1
 MAX_TOKENS = 300
 SEED = 42
 - Use query_nccl_logs to see communication errors.
 - Avoid restart_rank unless absolutely necessary — it is destructive.
 - If you already know the failing rank, fix it directly.
+- For cascade failures: solve phases in order. Phase 1=OOM diagnosis,
+  Phase 2=topo_reorder, Phase 3=query_nccl_logs then patch_divergent_code
+- Token efficiency matters: fewer tokens = higher reward
 Example response:
 {"action_type": "inspect_flight_recorder", "parameters": {"rank_id": 3}}
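
The prompt asks the model to answer with a single JSON action in the shape shown in the example response. A caller might validate that reply before posting it to `/step`. The sketch below is a hedged illustration: `parse_action` is a hypothetical helper, the fallback to `noop` is an assumed policy, and the action names are copied from the `action_space` description in `openenv.yaml`.

```python
import json

# Action names as listed in the openenv.yaml action_space description.
KNOWN_ACTIONS = {
    "inspect_flight_recorder",
    "query_nccl_logs",
    "topo_reorder",
    "patch_divergent_code",
    "restart_rank",
    "reset_ib_interface",
    "adjust_sharding_strategy",
    "noop",
}


def parse_action(reply: str) -> dict:
    """Parse a model reply into an action dict, falling back to noop.

    Assumption: an unparseable or unknown action degrades to noop
    rather than raising.
    """
    noop = {"action_type": "noop", "parameters": {}}
    try:
        action = json.loads(reply)
    except json.JSONDecodeError:
        return noop
    if not isinstance(action, dict):
        return noop
    action_type = action.get("action_type")
    if not isinstance(action_type, str) or action_type not in KNOWN_ACTIONS:
        return noop
    params = action.get("parameters")
    return {
        "action_type": action_type,
        "parameters": params if isinstance(params, dict) else {},
    }
```

Falling back to `noop` keeps the episode moving when the model produces malformed output, at the cost of spending a step; a stricter caller could re-prompt instead.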

openenv.yaml CHANGED

@@ -1,43 +1,80 @@
 name: nervousystem-env
-version: "1.0.0"
 description: >
-  SRE environment for diagnosing and fixing failures
-  in a distributed GPU training cluster. The cluster
-  is training a large-scale AI model. Agents must
-  investigate, diagnose, and repair the system using
-  real SRE workflows.
 author: v4xsh
 tags:
   - openenv
   - sre
   - distributed-training
-  - gpu
-  - infrastructure
-  - hpc
-entry_point: "app.main:app"
 tasks:
   - id: easy
     name: "Culprit Rank Identification"
     difficulty: easy
     description: >
-      Training is stalled due to an OOM failure on one rank.
-      Identify the failing rank using Flight Recorder inspection.
   - id: medium
     name: "Spine Switch Congestion Resolution"
     difficulty: medium
     description: >
-      Training throughput is degraded due to network congestion.
-      Reorder the ring topology to restore bandwidth.
   - id: hard
     name: "Asymmetric Compilation Desync Fix"
     difficulty: hard
     description: >
-      Training is hung due to different ranks compiling different
-      NCCL collectives. Find and patch the divergent code.
 observation_space:
   type: object
-  description: "ClusterObservation with node health, training metrics, and surface logs"
 action_space:
   type: object
-  description: "SREAction with action_type and parameters dict"
 reward_range: [0.0, 1.0]

 name: nervousystem-env
+version: "2.0.0"
 description: >
+  Fleet AI environment for autonomous SRE agents managing distributed
+  GPU training clusters. Agents act as supervisors orchestrating
+  specialized worker agents to diagnose and fix cascading failures
+  across 1000+ GPU clusters. Every minute of downtime costs $5,000
+  in wasted compute.
 author: v4xsh
 tags:
   - openenv
   - sre
   - distributed-training
+  - fleet-ai
+  - multi-agent
+  - long-horizon
+  - mercor
+  - gpu-infrastructure
+entry_point: "server.app:app"
 tasks:
   - id: easy
     name: "Culprit Rank Identification"
     difficulty: easy
+    max_steps: 50
     description: >
+      Training stalled by OOM on one rank. Identify the failing rank
+      using PyTorch 2.5 Flight Recorder inspection.
   - id: medium
     name: "Spine Switch Congestion Resolution"
     difficulty: medium
+    max_steps: 50
     description: >
+      Training throughput degraded to 45-65% of target due to spine switch
+      congestion. Reorder ring topology to restore bandwidth.
   - id: hard
     name: "Asymmetric Compilation Desync Fix"
     difficulty: hard
+    max_steps: 50
     description: >
+      Training hung due to different ranks compiling different NCCL
+      collectives. Investigate and patch the divergent source file.
+  - id: cascade
+    name: "Inter-Version Cascade"
+    difficulty: cascade
+    max_steps: 120
+    description: >
+      Severity-1 incident: LD_LIBRARY_PATH corruption loads wrong NCCL
+      version (2.21.5 vs 2.27.0), triggering a cascade of OOM →
+      congestion → desync across the fleet. Solve all 3 phases in order.
 observation_space:
   type: object
+  description: >
+    ClusterObservation with 8-node health states, training metrics,
+    surface NCCL logs, step count, episode id, and cumulative token count.
 action_space:
   type: object
+  description: >
+    SREAction with action_type and parameters dict. 8 action types
+    including inspect_flight_recorder, query_nccl_logs, topo_reorder,
+    patch_divergent_code, restart_rank, reset_ib_interface,
+    adjust_sharding_strategy, noop.
 reward_range: [0.0, 1.0]
+reward_description: >
+  Mercor-style efficiency reward: 0.60*R_success + 0.30*R_subgoal
+  - 0.10*log(total_tokens). Rewards accurate diagnosis with minimal
+  token usage. Destructive actions penalized -0.2.
+multi_agent:
+  enabled: true
+  architecture: "supervisor-worker"
+  workers:
+    - log_inspector
+    - patch_agent
+    - topo_agent
+    - version_checker
+  endpoint: "/delegate"
+training:
+  algorithm: GRPO
+  framework: "TRL + Unsloth"
+  script: "training/grpo_train.py"
+  model: "unsloth/Qwen2.5-7B-Instruct-bnb-4bit"

requirements.txt CHANGED

@@ -1,9 +1,7 @@
-fastapi
-uvicorn[standard]
-pydantic>=2.0
-openai
-pyyaml
-numpy
-pytest
-requests
-openenv-core>=0.2.0

+fastapi>=0.111.0
+uvicorn>=0.29.0
+pydantic>=2.7.0
+requests>=2.31.0
+gradio>=4.31.0
+datasets>=2.19.0
+openai>=1.30.0

training/grpo_train.py ADDED

@@ -0,0 +1,368 @@

+# ============================================================
+# NervousSystem-Env — GRPO Training Script
+# ============================================================
+# Colab setup (run these first):
+#   !pip install unsloth trl datasets transformers accelerate
+#   !pip install "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"
+#   !uvicorn app.main:app --port 7860 &  # start env server
+# ============================================================
+from __future__ import annotations
+import json
+import math
+import os
+import random
+import re
+from typing import Any
+import requests
+import torch
+from datasets import Dataset
+from trl import GRPOConfig, GRPOTrainer
+from unsloth import FastLanguageModel
+ENV_BASE_URL = os.getenv("ENV_BASE_URL", "http://localhost:7860")
+MODEL_NAME = os.getenv("MODEL_NAME", "unsloth/Qwen2.5-7B-Instruct-bnb-4bit")
+MAX_SEQ_LENGTH = 1024
+LORA_RANK = 16
+SRE_SYSTEM_PROMPT = """You are an SRE agent managing a distributed
+GPU training cluster. Diagnose and fix failures efficiently.
+IMPORTANT: You are penalized for using too many tokens.
+Reason concisely. Identify the failure type first, then act directly.
+Available actions (respond with JSON only):
+  {"action_type": "inspect_flight_recorder", "parameters": {"rank_id": <0-7>}}
+  {"action_type": "query_nccl_logs", "parameters": {"time_window": <int>}}
+  {"action_type": "topo_reorder", "parameters": {"affinity": "rack"}}
+  {"action_type": "patch_divergent_code", "parameters": {"file": "<path>", "fix_type": "synchronize_conditional"}}
+  {"action_type": "noop", "parameters": {}}
+Rules:
+- Respond ONLY with a JSON object, no explanation
+- Check job_status first: stalled=investigate, running=optimize
+- Use inspect_flight_recorder to find failing ranks
+- Use topo_reorder(affinity="rack") for congestion
+"""
+_current_task_id = "easy"
+_prompt_task_map: dict[str, str] = {}
+model, tokenizer = FastLanguageModel.from_pretrained(
+    model_name=MODEL_NAME,
+    max_seq_length=MAX_SEQ_LENGTH,
+    load_in_4bit=True,
+    dtype=None,
+)
+model = FastLanguageModel.get_peft_model(
+    model,
+    r=LORA_RANK,
+    target_modules=[
+        "q_proj",
+        "v_proj",
+        "k_proj",
+        "o_proj",
+        "gate_proj",
+        "up_proj",
+        "down_proj",
+    ],
+    lora_alpha=16,
+    lora_dropout=0,
+    bias="none",
+    use_gradient_checkpointing="unsloth",
+    random_state=42,
+)
+def _safe_post(path: str, payload: dict[str, Any], timeout: int = 10) -> dict[str, Any] | None:
+    try:
+        response = requests.post(f"{ENV_BASE_URL}{path}", json=payload, timeout=timeout)
+        response.raise_for_status()
+        return response.json()
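+    except (requests.RequestException, ValueError):
+        # Network/HTTP failure or a non-JSON body: report "no result" to the caller.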
+        return None
+def _safe_get(path: str, timeout: int = 5) -> dict[str, Any] | None:
+    try:
+        response = requests.get(f"{ENV_BASE_URL}{path}", timeout=timeout)
+        response.raise_for_status()
+        return response.json()
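+    except (requests.RequestException, ValueError):
+        # Network/HTTP failure or a non-JSON body: report "no result" to the caller.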
+        return None
+def _prompt_key(prompt: Any) -> str:
+    try:
+        return json.dumps(prompt, sort_keys=True)
+    except Exception:
+        return str(prompt)
+def _task_id_from_prompt(prompt: Any) -> str:
+    global _current_task_id
+    key = _prompt_key(prompt)
+    task_id = _prompt_task_map.get(key, _current_task_id)
+    _current_task_id = task_id
+    return task_id
+def _extract_json_action(completion: str) -> dict[str, Any] | None:
+    text = completion.strip()
+    if text.startswith("```"):
+        text = "\n".join(line for line in text.splitlines() if not line.strip().startswith("```"))
+    try:
+        parsed = json.loads(text)
+        if isinstance(parsed, dict) and "action_type" in parsed:
+            parsed.setdefault("parameters", {})
+            return parsed
+    except Exception:
+        pass
+    match = re.search(r"\{.*\}", text, flags=re.DOTALL)
+    if not match:
+        return None
+    try:
+        parsed = json.loads(match.group(0))
+        if isinstance(parsed, dict) and "action_type" in parsed:
+            parsed.setdefault("parameters", {})
+            return parsed
+    except Exception:
+        return None
+    return None
+def make_sre_dataset(n_samples: int = 200) -> Dataset:
+    """
+    Generate prompt-only dataset for GRPO.
+    Each sample is one initial observation from the env.
+    GRPO generates completions and scores them via the reward fn.
+    For each sample:
+    - Pick task_id randomly from ["easy", "medium", "hard"]
+      (skip cascade for initial training — too long)
+    - Pick seed randomly from range(1000)
+    - Call POST /reset with task_id and seed
+    - Format the observation as the user prompt
+    - Return dataset with column "prompt" containing
+      [{"role": "system", "content": SRE_SYSTEM_PROMPT},
+       {"role": "user", "content": <observation_json>}]
+    observation_json format:
+    {
+      "job_status": ...,
+      "throughput": ...,
+      "target_throughput": ...,
+      "stalled_steps": ...,
+      "node_health": [...],
+      "visible_logs": [...],
+      "task_hint": "Diagnose and fix the cluster failure."
+    }
+    """
+    global _current_task_id
+    rows: list[dict[str, Any]] = []
+    task_pool = ["easy", "medium", "hard"]
+    for _ in range(n_samples):
+        task_id = random.choice(task_pool)
+        seed = random.randint(0, 999)
+        reset_result = _safe_post("/reset", {"task_id": task_id, "seed": seed})
+        if reset_result is None:
+            continue
+        training = reset_result.get("training", {})
+        nodes = reset_result.get("nodes", [])
+        observation_payload = {
+            "job_status": training.get("job_status", "unknown"),
+            "throughput": training.get("throughput_tokens_per_sec", 0.0),
+            "target_throughput": training.get("target_throughput", 0.0),
+            "stalled_steps": training.get("stalled_steps", 0),
+            "node_health": [
+                {
+                    "node_id": node.get("node_id"),
+                    "health_status": node.get("health_status"),
+                    "xid_errors": node.get("xid_errors", []),
+                }
+                for node in nodes
+            ],
+            "visible_logs": reset_result.get("visible_logs", []),
+            "task_hint": "Diagnose and fix the cluster failure.",
+        }
+        prompt = [
+            {"role": "system", "content": SRE_SYSTEM_PROMPT},
+            {"role": "user", "content": json.dumps(observation_payload, ensure_ascii=False)},
+        ]
+        _current_task_id = task_id
+        _prompt_task_map[_prompt_key(prompt)] = task_id
+        rows.append({"prompt": prompt})
+    if not rows:
+        fallback_prompt = [
+            {"role": "system", "content": SRE_SYSTEM_PROMPT},
+            {
+                "role": "user",
+                "content": json.dumps(
+                    {
+                        "job_status": "stalled",
+                        "throughput": 0.0,
+                        "target_throughput": 9000.0,
+                        "stalled_steps": 1,
+                        "node_health": [],
+                        "visible_logs": ["Server offline during dataset build"],
+                        "task_hint": "Diagnose and fix the cluster failure.",
+                    }
+                ),
+            },
+        ]
+        _prompt_task_map[_prompt_key(fallback_prompt)] = "easy"
+        rows.append({"prompt": fallback_prompt})
+    return Dataset.from_list(rows)
+def sre_reward_fn(
+    completions: list[Any],
+    prompts: list[Any],
+    **kwargs: Any,
+) -> list[float]:
+    """
+    Called by GRPOTrainer to score each completion.
+    For each completion:
+    1. Parse the JSON action from the completion string
+    2. POST the action to /step
+    3. Extract reward.value and the resulting training.job_status
+    4. Apply MER formula:
+       tokens = len(completion.split())  # word count proxy
+       mer = max(0.01, min(0.99,
+           0.60 * r_success + 0.30 * step_reward - 0.10 * math.log(max(1, tokens))
+       ))
+       where r_success = 1.0 if job_status in {"recovered","running"} else 0.0
+    5. Return mer as the reward for this completion
+    If parse fails or /step errors: return 0.01
+    If server is offline: return 0.01
+    IMPORTANT: Each completion is scored against a fresh episode,
+    so /reset is called before every /step.
+    """
+    rewards: list[float] = []
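+    for index, completion in enumerate(completions):
+        # With chat-style prompts, TRL passes each completion as a list of
+        # message dicts rather than a string; flatten it to plain text.
+        if not isinstance(completion, str):
+            completion = "".join(str(m.get("content", "")) for m in completion)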
+        prompt = prompts[index] if index < len(prompts) else None
+        task_id = _task_id_from_prompt(prompt)
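+        # NOTE: this seed is independent of the one used to build the prompt,
+        # so the scored episode may differ from the observation the model saw.
+        seed = random.randint(0, 999)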
+        reset_result = _safe_post("/reset", {"task_id": task_id, "seed": seed})
+        if reset_result is None:
+            rewards.append(0.01)
+            continue
+        action = _extract_json_action(completion)
+        if action is None:
+            rewards.append(0.01)
+            continue
+        step_result = _safe_post("/step", action)
+        if step_result is None:
+            rewards.append(0.01)
+            continue
+        reward_obj = step_result.get("reward", {})
+        step_reward = float(reward_obj.get("value", 0.01))
+        observation = step_result.get("observation", {})
+        job_status = str(observation.get("training", {}).get("job_status", "unknown"))
+        r_success = 1.0 if job_status in {"recovered", "running"} else 0.0
+        tokens = len(completion.split())
+        mer = max(
+            0.01,
+            min(
+                0.99,
+                0.60 * r_success
+                + 0.30 * step_reward
+                - 0.10 * math.log(max(1, tokens)),
+            ),
+        )
+        rewards.append(float(mer))
+    while len(rewards) < len(completions):
+        rewards.append(0.01)
+    return rewards
+training_args = GRPOConfig(
+    output_dir="./sre_grpo_output",
+    num_train_epochs=1,
+    per_device_train_batch_size=1,
+    gradient_accumulation_steps=4,
+    learning_rate=5e-6,
+    max_grad_norm=0.1,
+    warmup_ratio=0.1,
+    lr_scheduler_type="cosine",
+    logging_steps=1,
+    save_steps=50,
+    report_to="none",
+    num_generations=4,
+    max_completion_length=128,
+    temperature=0.7,
+    beta=0.001,
+)
+def plot_reward_curve(trainer: GRPOTrainer) -> None:
+    """Print reward progression as ASCII bar chart."""
+    history = trainer.state.log_history
+    rewards = [
+        entry.get("reward", entry.get("train/reward", 0.0))
+        for entry in history
+        if "reward" in entry or "train/reward" in entry
+    ]
+    if not rewards:
+        print("No reward history found.")
+        return
+    print("\n=== REWARD CURVE ===")
+    max_r = max(rewards)  # rewards is non-empty here (checked above)
+    for i, reward in enumerate(rewards):
+        bar = "█" * int((reward / max(0.01, max_r)) * 30)
+        print(f"  step {i + 1:3d}: {reward:.3f} {bar}")
+    print(f"\nInitial reward: {rewards[0]:.3f}")
+    print(f"Final reward:   {rewards[-1]:.3f}")
+    delta = rewards[-1] - rewards[0]
+    print(f"Improvement:    {delta:+.3f}")
+if __name__ == "__main__":
+    random.seed(42)
+    torch.manual_seed(42)
+    # _safe_get never raises, so check its result explicitly instead of using assert.
+    health = _safe_get("/health")
+    if health is None or health.get("status") != "ok":
+        print(f"❌ Server not reachable or unhealthy at {ENV_BASE_URL}")
+        print("Start it with: uvicorn app.main:app --port 7860")
+        raise SystemExit(1)
+    print(f"✅ Server healthy at {ENV_BASE_URL}")
+    dataset = make_sre_dataset(n_samples=200)
+    print(f"✅ Dataset: {len(dataset)} samples")
+    trainer = GRPOTrainer(
+        model=model,
+        args=training_args,
+        train_dataset=dataset,
+        reward_funcs=sre_reward_fn,
+        processing_class=tokenizer,
+    )
+    trainer.train()
+    plot_reward_curve(trainer)
+    model.save_pretrained("sre_agent_lora")
+    tokenizer.save_pretrained("sre_agent_lora")
+    print("✅ Model saved to sre_agent_lora/")
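
Review note on the reward shaping: the MER arithmetic inside `sre_reward_fn` is easier to reason about when isolated from the HTTP calls. The sketch below is not part of the commit. `mer_reward` is a hypothetical helper name, and it assumes the same weights (0.60, 0.30, 0.10), the same whitespace word count as a token proxy, and the same clamp to [0.01, 0.99] that the diff uses.

```python
import math


def mer_reward(r_success: float, step_reward: float, completion: str) -> float:
    """Hypothetical helper mirroring the MER arithmetic in sre_reward_fn."""
    # Whitespace word count stands in for a real tokenizer, as in the diff.
    tokens = len(completion.split())
    raw = (
        0.60 * r_success
        + 0.30 * step_reward
        - 0.10 * math.log(max(1, tokens))  # log penalty; zero for 0 or 1 words
    )
    # Clamp so the trainer never sees exactly 0 or 1.
    return max(0.01, min(0.99, raw))


if __name__ == "__main__":
    terse = '{"action_type": "noop", "parameters": {}}'
    verbose = " ".join(["word"] * 100)
    print(f"terse, success:   {mer_reward(1.0, 0.5, terse):.4f}")
    print(f"verbose, success: {mer_reward(1.0, 0.5, verbose):.4f}")
    print(f"terse, failure:   {mer_reward(0.0, 0.1, terse):.4f}")
```

Because the length penalty is logarithmic, a 100-word completion loses about 0.46 relative to a one-word one, which is most of the 0.60 success bonus. A failed step with a small `step_reward` falls to the 0.01 floor almost immediately, so under these weights the floor does not distinguish between short and long failures.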