Space: XcodeAddy/incident-triage-env

Xcode_Addy committed on Apr 8

Commit 25a72c2 (unverified) · 2 parents: c43e397 9347ce5

Merge pull request #1 from ADITYAGABA1322/development


Files changed (14)

.gitignore +89 -1
Dockerfile +12 -0
Readme.md +375 -0
__init__.py +16 -0
app.py +77 -0
client.py +99 -0
environment.py +62 -0
graders.py +33 -0
incidents.py +461 -0
inference.py +194 -0
models.py +65 -0
openenv.yaml +74 -0
pyproject.toml +45 -0
requirements.txt +5 -0

.gitignore CHANGED

	@@ -1 +1,89 @@
-.DS_Store

+.DS_Store
+# =========================
+# ENV & SECRETS 🔐
+# =========================
+.env
+.env.*
+*.env
+# =========================
+# PYTHON 🐍
+# =========================
+__pycache__/
+*.pyc
+*.pyo
+*.pyd
+*.pyc.*
+*.egg-info/
+dist/
+build/
+.eggs/
+*.egg
+venv/
+env/
+.venv/
+# =========================
+# LOG FILES 📄
+# =========================
+*.log
+logs.jsonl
+# =========================
+# OS FILES 💻
+# =========================
+.DS_Store
+Thumbs.db
+# =========================
+# IDE / EDITOR ⚙️
+# =========================
+.vscode/
+.idea/
+*.swp
+*.swo
+# =========================
+# MODEL / DATA FILES 🤖
+# =========================
+*.onnx
+*.pt
+*.pth
+*.ckpt
+*.h5
+# Large datasets (customize if needed)
+data/
+datasets/
+# =========================
+# BUILD / OUTPUT 🚀
+# =========================
+dist/
+build/
+out/
+# =========================
+# TEMP FILES 🗑️
+# =========================
+*.tmp
+*.temp
+.cache/
+# =========================
+# TEST / COVERAGE 🧪
+# =========================
+coverage/
+.nyc_output/
+# =========================
+# DOCKER 🐳 (optional)
+# =========================
+*.pid
+*.seed
+# =========================
+# MISC
+# =========================
+*.bak
+*.old

Dockerfile ADDED

	@@ -0,0 +1,12 @@

+FROM python:3.10-slim
+WORKDIR /app
+COPY requirements.txt .
+RUN pip install --no-cache-dir -r requirements.txt
+COPY . .
+EXPOSE 7860
+CMD ["uvicorn", "app:app", "--host", "0.0.0.0", "--port", "7860"]

Readme.md ADDED

	@@ -0,0 +1,375 @@

+# 🚨 Production Incident Triage Environment
+An OpenEnv-compatible backend evaluation system where an AI agent triages production incidents like a real SRE (Site Reliability Engineer). Built for deterministic, RL-style evaluation — no UI, no chatbot, pure backend.
+---
+## 📌 What This Is
+This is **not** a chatbot. It is a structured evaluation environment where:
+1. Environment returns a production incident (alert + context)
+2. AI agent reads the incident
+3. Agent returns a structured JSON action
+4. Environment sends action to a deterministic grader
+5. Grader compares against ground truth
+6. Returns a score between `0.0` and `1.0`
+---
+## 🗂️ Project Structure
+```
+Incident_Triage/
+│
+├── models.py               # Pydantic schemas — source of truth for all types
+├── incidents.py            # Dataset of 15 production incidents
+├── inference.py            # LLM agent (Mistral via NVIDIA API)
+├── openenv.yaml            # OpenEnv submission config
+├── pyproject.toml          # Project metadata
+├── requirements.txt        # Dependencies
+├── README.md
+│
+└── server/
+    ├── __init__.py         # Empty — do not add imports here
+    ├── app.py              # FastAPI server
+    ├── environment.py      # Core RL-style logic (reset / step)
+    ├── graders.py          # Deterministic scoring functions
+    ├── Dockerfile
+    └── requirements.txt
+```
+---
+## ⚙️ Setup
+### 1. Clone and install dependencies
+```bash
+git clone <your-repo-url>
+cd Incident_Triage
+pip install -r requirements.txt
+```
+### 2. Set your NVIDIA / Mistral API key
+```bash
+# Windows
+set NVIDIA_API_KEY=your_nvidia_api_key_here
+# Mac / Linux
+export NVIDIA_API_KEY=your_nvidia_api_key_here
+```
+### 3. Start the server
+```bash
+uvicorn server.app:app --reload
+```
+Server runs at: `http://localhost:8000`
+### 4. Run the agent
+```bash
+python inference.py
+```
+---
+## 🔗 API Endpoints
+### `GET /tasks`
+Returns available task types and their descriptions.
+**Response:**
+```json
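+{
+  "tasks": {
+    "task1": "Severity Classification  → SEV1, SEV2, SEV3",
+    "task2": "Root Cause Category     → DATABASE, NETWORK, APPLICATION, INFRASTRUCTURE, THIRD_PARTY, UNKNOWN",
+    "task3": "Recommended Action      → ROLLBACK, SCALE_UP, RESTART_SERVICE, FAILOVER, NOTIFY_VENDOR, INVESTIGATE, NO_ACTION"
+  }
+}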
+```
+---
+### `POST /reset`
+Resets the environment and returns a new incident for the agent to triage.
+**Query Params:**
+| Param | Type | Required | Description |
+|---|---|---|---|
+| `task_type` | string | No | Filter by `task1`, `task2`, or `task3`. If omitted, picks any incident randomly. |
+**Example:**
+```bash
+curl -X POST "http://localhost:8000/reset?task_type=task1"
+```
+**Response:**
+```json
+{
+  "session_id": "<uuid4 string>",
+  "incident_id": "INC-001",
+  "task_type": "task1",
+  "alert_text": "[CRITICAL] Payment service returning HTTP 503. Error rate: 94%.",
+  "context": {
+    "service": "payment-service",
+    "error_rate_pct": 94,
+    "affected_users": 120000,
+    "region": "us-east-1"
+  }
+}
+```
+---
+### `POST /step`
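+Submits the agent's action and returns a graded result. Pass the `session_id` returned by `/reset` as a query parameter; the session is discarded after one step.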
+**Request Body:**
+```json
+{
+  "incident_id": "INC-001",
+  "task_type": "task1",
+  "severity": "SEV1",
+  "root_cause": null,
+  "action": null
+}
+```
+> Only populate the field relevant to the `task_type`. Set others to `null`.
+**Response:**
+```json
+{
+  "incident_id": "INC-001",
+  "task_type": "task1",
+  "reward": 1.0,
+  "correct": true,
+  "ground_truth": "SEV1",
+  "agent_answer": "SEV1"
+}
+```
+| Field | Type | Description |
+|---|---|---|
+| `reward` | float | `1.0` = correct, `0.5` = adjacent severity (task1 only), `0.0` = wrong |
+| `correct` | bool | True only if reward == 1.0 |
+| `ground_truth` | string | Expected answer |
+| `agent_answer` | string | What agent returned |
+---
+### `GET /grader`
+Returns grader configuration for transparency.
+**Response:**
+```json
+{
+  "grading": "deterministic",
+  "scoring": "task1: partial (1.0/0.5/0.0), task2/task3: binary (1.0/0.0)",
+  "tasks": {
+    "task1": "exact=1.0, adjacent=0.5, far=0.0",
+    "task2": "action.root_cause == ground_truth.root_cause",
+    "task3": "action.action     == ground_truth.action"
+  }
+}
+---
+## 📋 Enum Reference
+All agent outputs must use **exactly** these enum values (case-sensitive):
+### Task 1 — Severity Classification (`severity` field)
+| Value | Meaning |
+|---|---|
+| `SEV1` | Total outage / confirmed revenue impact |
+| `SEV2` | Partial outage / degraded performance |
+| `SEV3` | Minor / cosmetic / internal only |
+### Task 2 — Root Cause Category (`root_cause` field)
+| Value | Meaning |
+|---|---|
+| `DATABASE` | DB lag, connection pool, replica issues |
+| `NETWORK` | Packet loss, BGP flap, cross-region failures |
+| `APPLICATION` | Code bug, exception, bad deploy |
+| `INFRASTRUCTURE` | Kubernetes, EC2, spot interruption |
+| `THIRD_PARTY` | Stripe, SendGrid, external vendor |
+| `UNKNOWN` | Cannot determine root cause |
+### Task 3 — Recommended Action (`action` field)
+| Value | Meaning |
+|---|---|
+| `ROLLBACK` | Revert to last stable deploy |
+| `SCALE_UP` | Increase replicas / resources |
+| `RESTART_SERVICE` | Restart stuck / deadlocked process |
+| `FAILOVER` | Switch to replica / standby |
+| `NOTIFY_VENDOR` | Escalate to third-party vendor |
+| `INVESTIGATE` | Need more info before acting |
+| `NO_ACTION` | Monitor only, no action needed |
+---
+## 🤖 Agent JSON Format
+The agent must return **strict JSON only** — no markdown, no explanation, no extra text.
+```json
+{
+  "incident_id": "INC-006",
+  "task_type": "task2",
+  "severity": null,
+  "root_cause": "DATABASE",
+  "action": null
+}
+```
+Rules:
+- `incident_id` must match the one returned by `/reset`
+- `task_type` must match the one returned by `/reset`
+- Only one field (`severity`, `root_cause`, or `action`) should be non-null
+- The non-null field must use a valid enum value
+---
+## 🧠 How Grading Works
+Grading is **fully deterministic** — no LLM is used inside the grader.
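+```
+agent_answer == ground_truth        →  reward: 1.0  (correct)
+task1 only: off by one severity     →  reward: 0.5  (adjacent)
+agent_answer != ground_truth        →  reward: 0.0  (wrong)
+missing field (null)                →  reward: 0.0  (wrong)
+```
+Root cause and recommended action are scored as binary (`1.0` or `0.0`) because they are classification tasks where a wrong answer leads to a wrong on-call response. Severity is the exception: `graders.py` awards `0.5` when the predicted level is adjacent to the ground truth (for example `SEV2` when `SEV1` is expected).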
+---
+## 🧪 Quick Test (curl)
+```bash
+# 1. Check available tasks
+curl http://localhost:8000/tasks
+# 2. Get a task1 incident
+curl -X POST "http://localhost:8000/reset?task_type=task1"
+# 3. Submit agent action (replace session_id and incident_id with the values from step 2)
+curl -X POST "http://localhost:8000/step?session_id=<session_id>" \
+  -H "Content-Type: application/json" \
+  -d '{"incident_id": "INC-001", "task_type": "task1", "severity": "SEV1", "root_cause": null, "action": null}'
+# 4. Check grader config
+curl http://localhost:8000/grader
+```
+---
+## 📊 Dataset Overview
+15 production incidents across 3 task types (5 per task):
+| Task | Incidents | What agent classifies |
+|---|---|---|
+| `task1` | INC-001 to INC-005 | Severity level |
+| `task2` | INC-006 to INC-010 | Root cause category |
+| `task3` | INC-011 to INC-015 | Recommended action |
+Incident types include: payment outages, DB replica lag, Kubernetes node failures, BGP flapping, bad deploys, vendor degradations, memory deadlocks, and more.
+---
+## 🔧 Inference Script (Mistral via NVIDIA API)
+`inference.py` uses the Mistral model via NVIDIA's OpenAI-compatible API endpoint.
+Update the client in `inference.py`:
+```python
+import os
+from openai import OpenAI
+client = OpenAI(
+    base_url="https://integrate.api.nvidia.com/v1",
+    api_key=os.environ["NVIDIA_API_KEY"]
+)
+response = client.chat.completions.create(
+    model="mistralai/mistral-7b-instruct-v0.3",
+    messages=[
+        {"role": "system", "content": SYSTEM_PROMPT},
+        {"role": "user", "content": build_user_prompt(observation)}
+    ],
+    max_tokens=256,
+    temperature=0.0
+)
+raw = response.choices[0].message.content.strip()
+```
+> `temperature=0.0` is critical — keeps outputs deterministic across runs.
+---
+## 📦 Requirements
+```
+fastapi
+uvicorn
+pydantic
+openai
+requests
+```
+Install:
+```bash
+pip install fastapi uvicorn pydantic openai requests
+```
+---
+## 🚀 Run Full Evaluation
+```bash
+# Terminal 1
+uvicorn server.app:app --reload
+# Terminal 2
+python inference.py
+```
+Expected output:
+```
+==================================================
+Incident : INC-003
+Task     : task1
+Alert    : [INFO] Admin dashboard CSS assets returning 404...
+LLM Raw  : {"incident_id": "INC-003", "task_type": "task1", "severity": "SEV3", "root_cause": null, "action": null}
+Answer   : SEV3
+Expected : SEV3
+Correct  : True  |  Reward: 1.0
+==================================================
+Total Episodes : 15
+Total Correct  : 13
+Accuracy       : 86.7%
+```
+---
+## 📝 Important Rules
+- Never modify enum values in `models.py` — graders depend on exact string matching
+- Never add LLM calls inside `graders.py` — grading must be deterministic
+- Always call `/reset` before `/step` — environment maintains current incident state
+- `server/__init__.py` must stay empty — do not add imports there
+- Always run uvicorn from the project root: `uvicorn server.app:app --reload`
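
Note on Readme.md: the "Agent JSON Format" rules (matching `incident_id` and `task_type`, exactly one non-null answer field, valid enum value) can be checked before anything is sent to `/step`. The following is a minimal sketch of such a check, not part of this commit. The helper name `validate_agent_json` and the `TASK_FIELD` mapping are hypothetical; the enum values are copied from the README's Enum Reference tables.

```python
import json

# Allowed values, copied from the Enum Reference tables in the README.
VALID_VALUES = {
    "severity": {"SEV1", "SEV2", "SEV3"},
    "root_cause": {
        "DATABASE", "NETWORK", "APPLICATION",
        "INFRASTRUCTURE", "THIRD_PARTY", "UNKNOWN",
    },
    "action": {
        "ROLLBACK", "SCALE_UP", "RESTART_SERVICE", "FAILOVER",
        "NOTIFY_VENDOR", "INVESTIGATE", "NO_ACTION",
    },
}
# The single answer field each task type is expected to populate (assumed mapping).
TASK_FIELD = {"task1": "severity", "task2": "root_cause", "task3": "action"}


def validate_agent_json(raw: str, observation: dict) -> dict:
    """Hypothetical helper: parse raw LLM output and enforce the README rules.

    Raises ValueError with a short reason when a rule is violated.
    """
    try:
        payload = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"not strict JSON: {exc}") from exc
    if not isinstance(payload, dict):
        raise ValueError("top-level JSON value must be an object")

    # incident_id and task_type must echo what /reset returned.
    for key in ("incident_id", "task_type"):
        if payload.get(key) != observation[key]:
            raise ValueError(f"{key} does not match the observation")

    expected_field = TASK_FIELD.get(observation["task_type"])
    if expected_field is None:
        raise ValueError(f"unknown task_type: {observation['task_type']!r}")

    # Exactly one answer field may be non-null, and it must be the task's own.
    non_null = [name for name in VALID_VALUES if payload.get(name) is not None]
    if non_null != [expected_field]:
        raise ValueError(f"expected only {expected_field!r} to be set, got {non_null}")

    value = payload[expected_field]
    if not isinstance(value, str) or value not in VALID_VALUES[expected_field]:
        raise ValueError(f"{value!r} is not a valid {expected_field} value")
    return payload
```

Rejecting malformed output on the agent side keeps formatting failures (for example JSON wrapped in a markdown fence) separate from genuinely wrong answers in the evaluation log.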

__init__.py ADDED

	@@ -0,0 +1,16 @@

+# Copyright (c) Meta Platforms, Inc. and affiliates.
+# All rights reserved.
+#
+# This source code is licensed under the BSD-style license found in the
+# LICENSE file in the root directory of this source tree.
+"""Incident Triage Environment."""
+from .client import IncidentTriageEnv
+from .models import IncidentTriageAction, IncidentTriageObservation
+__all__ = [
+    "IncidentTriageAction",
+    "IncidentTriageObservation",
+    "IncidentTriageEnv",
+]

app.py ADDED

	@@ -0,0 +1,77 @@

+#----- Edited file--------------
+# app.py
+import uuid
+from fastapi import FastAPI, HTTPException
+from models import IncidentAction, StepResult
+from environment import IncidentEnv
+app = FastAPI(title="Incident Triage Environment")
+# Session store: session_id -> IncidentEnv instance
+sessions: dict[str, IncidentEnv] = {}
+@app.get("/tasks")
+def get_tasks():
+    return {
+        "tasks": {
+            "task1": "Severity Classification  → SEV1, SEV2, SEV3",
+            "task2": "Root Cause Category     → DATABASE, NETWORK, APPLICATION, INFRASTRUCTURE, THIRD_PARTY, UNKNOWN",
+            "task3": "Recommended Action      → ROLLBACK, SCALE_UP, RESTART_SERVICE, FAILOVER, NOTIFY_VENDOR, INVESTIGATE, NO_ACTION",
+        }
+    }
+@app.post("/reset")
+def reset(task_type: str | None = None):
+    session_id = str(uuid.uuid4())
+    env = IncidentEnv()
+    try:
+        observation = env.reset(task_type=task_type)
+    except ValueError as e:
+        raise HTTPException(status_code=400, detail=str(e))
+    sessions[session_id] = env
+    return {"session_id": session_id, **observation.model_dump()}
+@app.post("/step", response_model=StepResult)
+def step(action: IncidentAction, session_id: str):
+    env = sessions.get(session_id)
+    if not env:
+        raise HTTPException(status_code=404, detail="Session not found. Call /reset first.")
+    try:
+        result = env.step(action)
+    except (RuntimeError, ValueError) as e:
+        raise HTTPException(status_code=400, detail=str(e))
+    # Clean up session after step — one action per episode
+    sessions.pop(session_id, None)
+    return result
+@app.get("/state")
+def state(session_id: str):
+    env = sessions.get(session_id)
+    if not env or env.current_ticket is None:
+        raise HTTPException(status_code=404, detail="No active session.")
+    t = env.current_ticket
+    return {
+        "session_id": session_id,
+        "incident_id": t["incident_id"],
+        "task_type":   t["task_type"],
+        "status":      "awaiting_action",
+    }
+@app.get("/grader")
+def get_grader_info():
+    return {
+        "grading": "deterministic",
+        "scoring": "task1: partial (1.0/0.5/0.0), task2/task3: binary (1.0/0.0)",
+        "tasks": {
+            "task1": "exact=1.0, adjacent=0.5, far=0.0",
+            "task2": "action.root_cause == ground_truth.root_cause",
+            "task3": "action.action     == ground_truth.action",
+        }
+    }
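
Note on app.py: `/reset` returns a `session_id`, and `/step` takes it as a query parameter alongside the JSON action body, then drops the session. A client sketch of that flow, using only the standard library, might look like the following. It is an illustration under assumptions, not code from this commit: the function names are hypothetical, `BASE_URL` assumes the Dockerfile's port 7860, and the body fields (`severity`, `root_cause`, `action`) are taken from the README because `models.py` is not shown here.

```python
import json
from typing import Optional
from urllib import parse, request

# Assumption: the Dockerfile exposes 7860; a bare `uvicorn app:app` defaults to 8000.
BASE_URL = "http://localhost:7860"


def build_reset_request(base_url: str, task_type: Optional[str] = None) -> request.Request:
    """POST /reset, optionally filtered by task_type (query parameter)."""
    url = f"{base_url}/reset"
    if task_type:
        url += "?" + parse.urlencode({"task_type": task_type})
    return request.Request(url, data=b"", method="POST")


def build_step_request(
    base_url: str, reset_response: dict, answer_field: str, answer: str
) -> request.Request:
    """POST /step with session_id in the query string and the action as JSON."""
    body = {
        "incident_id": reset_response["incident_id"],
        "task_type": reset_response["task_type"],
        "severity": None,
        "root_cause": None,
        "action": None,
    }
    body[answer_field] = answer  # only the task's own field is populated
    query = parse.urlencode({"session_id": reset_response["session_id"]})
    return request.Request(
        f"{base_url}/step?{query}",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def run_episode(
    answer_field: str,
    answer: str,
    task_type: Optional[str] = None,
    base_url: str = BASE_URL,
) -> dict:
    """One reset/step round trip. Requires a running server."""
    with request.urlopen(build_reset_request(base_url, task_type)) as resp:
        reset_response = json.load(resp)
    step_request = build_step_request(base_url, reset_response, answer_field, answer)
    with request.urlopen(step_request) as resp:
        return json.load(resp)
```

Because the server pops the session after `/step`, each call to `run_episode` is a complete episode; a second `/step` with the same `session_id` would get a 404.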

client.py ADDED

	@@ -0,0 +1,99 @@

+# Copyright (c) Meta Platforms, Inc. and affiliates.
+# All rights reserved.
+#
+# This source code is licensed under the BSD-style license found in the
+# LICENSE file in the root directory of this source tree.
+"""Incident Triage Environment Client."""
+from typing import Dict
+from openenv.core import EnvClient
+from openenv.core.client_types import StepResult
+from openenv.core.env_server.types import State
+from .models import IncidentTriageAction, IncidentTriageObservation
+class IncidentTriageEnv(
+    EnvClient[IncidentTriageAction, IncidentTriageObservation, State]
+):
+    """
+    Client for the Incident Triage Environment.
+    This client maintains a persistent WebSocket connection to the environment server,
+    enabling efficient multi-step interactions with lower latency.
+    Each client instance has its own dedicated environment session on the server.
+    Example:
+        >>> # Connect to a running server
+        >>> with IncidentTriageEnv(base_url="http://localhost:8000") as client:
+        ...     result = client.reset()
+        ...     print(result.observation.echoed_message)
+        ...
+        ...     result = client.step(IncidentTriageAction(message="Hello!"))
+        ...     print(result.observation.echoed_message)
+    Example with Docker:
+        >>> # Automatically start container and connect
+        >>> client = IncidentTriageEnv.from_docker_image("Incident_Triage-env:latest")
+        >>> try:
+        ...     result = client.reset()
+        ...     result = client.step(IncidentTriageAction(message="Test"))
+        ... finally:
+        ...     client.close()
+    """
+    def _step_payload(self, action: IncidentTriageAction) -> Dict:
+        """
+        Convert IncidentTriageAction to JSON payload for step message.
+        Args:
+            action: IncidentTriageAction instance
+        Returns:
+            Dictionary representation suitable for JSON encoding
+        """
+        return {
+            "message": action.message,
+        }
+    def _parse_result(self, payload: Dict) -> StepResult[IncidentTriageObservation]:
+        """
+        Parse server response into StepResult[IncidentTriageObservation].
+        Args:
+            payload: JSON response data from server
+        Returns:
+            StepResult with IncidentTriageObservation
+        """
+        obs_data = payload.get("observation", {})
+        observation = IncidentTriageObservation(
+            echoed_message=obs_data.get("echoed_message", ""),
+            message_length=obs_data.get("message_length", 0),
+            done=payload.get("done", False),
+            reward=payload.get("reward"),
+            metadata=obs_data.get("metadata", {}),
+        )
+        return StepResult(
+            observation=observation,
+            reward=payload.get("reward"),
+            done=payload.get("done", False),
+        )
+    def _parse_state(self, payload: Dict) -> State:
+        """
+        Parse server response into State object.
+        Args:
+            payload: JSON response from state request
+        Returns:
+            State object with episode_id and step_count
+        """
+        return State(
+            episode_id=payload.get("episode_id"),
+            step_count=payload.get("step_count", 0),
+        )

environment.py ADDED

	@@ -0,0 +1,62 @@

+#----- Edited file--------------
+# environment.py
+import random
+from models import IncidentAction, IncidentObservation, StepResult
+from incidents import TICKETS
+from graders import GRADERS
+class IncidentEnv:
+    def __init__(self):
+        self.current_ticket = None
+    def reset(self, task_type: str | None = None) -> IncidentObservation:
+        pool = TICKETS
+        if task_type:
+            pool = [t for t in TICKETS if t["task_type"] == task_type]
+        if not pool:
+            raise ValueError(f"No tickets found for task_type: {task_type}")
+        self.current_ticket = random.choice(pool)
+        return IncidentObservation(
+            incident_id=self.current_ticket["incident_id"],
+            task_type=self.current_ticket["task_type"],
+            alert_text=self.current_ticket["alert_text"],
+            context=self.current_ticket["context"],
+        )
+    def step(self, action: IncidentAction) -> StepResult:
+        if self.current_ticket is None:
+            raise RuntimeError("Call reset() before step()")
+        if action.incident_id != self.current_ticket["incident_id"]:
+            raise ValueError(
+                f"Action incident_id '{action.incident_id}' does not match "
+                f"current ticket '{self.current_ticket['incident_id']}'"
+            )
+        task_type = self.current_ticket["task_type"]
+        ground_truth = self.current_ticket["ground_truth"]
+        grader_fn = GRADERS[task_type]
+        reward = grader_fn(action, ground_truth)
+        agent_answer = (
+            action.severity.value    if task_type == "task1" and action.severity   else
+            action.root_cause.value  if task_type == "task2" and action.root_cause else
+            action.action.value      if task_type == "task3" and action.action      else
+            "NONE"
+        )
+        gt_field = list(ground_truth.values())[0]
+        return StepResult(
+            incident_id=self.current_ticket["incident_id"],
+            task_type=task_type,
+            reward=reward,
+            correct=reward == 1.0,
+            ground_truth=gt_field,
+            agent_answer=agent_answer,
+        )
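
Note on environment.py: `step()` looks up a grader by `task_type` and reads the first value of `ground_truth`, so it silently relies on every ticket having exactly one ground-truth key that matches its task. A small consistency check over the dataset could catch a mislabeled ticket before it reaches the grader. The sketch below is hypothetical (the function name and the `TASK_FIELD` mapping are not in this commit) and runs on an inline sample rather than the real `TICKETS` list.

```python
# Assumed mapping from task type to the ground-truth key its grader reads.
TASK_FIELD = {"task1": "severity", "task2": "root_cause", "task3": "action"}


def find_inconsistent_tickets(tickets: list) -> list:
    """Return human-readable problems; an empty list means the data looks consistent."""
    problems = []
    seen_ids = set()
    for ticket in tickets:
        incident_id = ticket.get("incident_id", "<missing id>")
        if incident_id in seen_ids:
            problems.append(f"{incident_id}: duplicate incident_id")
        seen_ids.add(incident_id)

        field = TASK_FIELD.get(ticket.get("task_type"))
        if field is None:
            problems.append(f"{incident_id}: unknown task_type {ticket.get('task_type')!r}")
            continue

        # The ground truth must hold exactly the key the task's grader indexes.
        keys = list(ticket.get("ground_truth", {}))
        if keys != [field]:
            problems.append(f"{incident_id}: ground_truth keys {keys}, expected [{field!r}]")
    return problems


# Tiny illustrative sample; the second ticket is deliberately mislabeled.
sample = [
    {"incident_id": "INC-001", "task_type": "task1", "ground_truth": {"severity": "SEV1"}},
    {"incident_id": "INC-X", "task_type": "task2", "ground_truth": {"action": "ROLLBACK"}},
]
print(find_inconsistent_tickets(sample))
```

Running the same function on `TICKETS` from `incidents.py` at import time or in a test would be one way to use it.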

graders.py ADDED

	@@ -0,0 +1,33 @@

+#----- Edited file--------------
+# graders.py
+from models import IncidentAction
+_SEV_ORDER = {"SEV1": 0, "SEV2": 1, "SEV3": 2}
+def grade_task1(action: IncidentAction, ground_truth: dict) -> float:
+    if action.severity is None:
+        return 0.0
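+    predicted = _SEV_ORDER.get(action.severity.value)
+    expected  = _SEV_ORDER.get(ground_truth["severity"])
+    # An unrecognized severity on either side earns no credit instead of raising KeyError.
+    if predicted is None or expected is None:
+        return 0.0
+    distance  = abs(predicted - expected)
+    return {0: 1.0, 1: 0.5}.get(distance, 0.0)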
+def grade_task2(action: IncidentAction, ground_truth: dict) -> float:
+    if action.root_cause is None:
+        return 0.0
+    return 1.0 if action.root_cause.value == ground_truth["root_cause"] else 0.0
+def grade_task3(action: IncidentAction, ground_truth: dict) -> float:
+    if action.action is None:
+        return 0.0
+    return 1.0 if action.action.value == ground_truth["action"] else 0.0
+GRADERS = {
+    "task1": grade_task1,
+    "task2": grade_task2,
+    "task3": grade_task3,
+}
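
Note on graders.py: `grade_task1` scores by distance between severity levels rather than by exact match. A standalone sketch of that rule, using plain strings in place of the `IncidentAction` model so it runs without the rest of the project (the name `severity_reward` is hypothetical), makes the full reward table easy to inspect:

```python
# Same ordering as _SEV_ORDER in graders.py.
SEVERITY_RANK = {"SEV1": 0, "SEV2": 1, "SEV3": 2}
# Credit by rank distance; any larger distance earns nothing.
CREDIT_BY_DISTANCE = {0: 1.0, 1: 0.5}


def severity_reward(predicted: str, expected: str) -> float:
    """Exact match = 1.0, adjacent level = 0.5, otherwise 0.0."""
    if predicted not in SEVERITY_RANK or expected not in SEVERITY_RANK:
        return 0.0  # unknown labels never earn credit
    distance = abs(SEVERITY_RANK[predicted] - SEVERITY_RANK[expected])
    return CREDIT_BY_DISTANCE.get(distance, 0.0)


# Print the full reward table: rows are predictions, columns are ground truth.
levels = list(SEVERITY_RANK)
print("pred\\true", *levels)
for predicted in levels:
    print(predicted.ljust(9), *(severity_reward(predicted, expected) for expected in levels))
```

Note that `StepResult.correct` in `environment.py` is `reward == 1.0`, so an adjacent guess earns half credit but is still reported as incorrect.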

incidents.py ADDED

	@@ -0,0 +1,461 @@

+#----- Edited file--------------
+# incidents.py
+TICKETS = [
+    # ── TASK 1: Severity Classification ───────────────────────────────────────
+    {
+        "incident_id": "INC-001",
+        "task_type": "task1",
+        "alert_text": "[CRITICAL] Payment service returning HTTP 503. Error rate: 94%. Affected users: ~120,000. Revenue impact confirmed.",
+        "context": {
+            "service": "payment-service",
+            "error_rate_pct": 94,
+            "affected_users": 120000,
+            "region": "us-east-1",
+            "last_deploy": "2h ago",
+            "on_call_notified": True
+        },
+        "ground_truth": {"severity": "SEV1"}
+    },
+    {
+        "incident_id": "INC-002",
+        "task_type": "task1",
+        "alert_text": "[WARNING] Checkout latency elevated. p99 response time: 4800ms (threshold: 2000ms). 18% of requests timing out.",
+        "context": {
+            "service": "checkout-service",
+            "p99_latency_ms": 4800,
+            "timeout_rate_pct": 18,
+            "db_connections": "82/100",
+            "region": "eu-west-1"
+        },
+        "ground_truth": {"severity": "SEV2"}
+    },
+    {
+        "incident_id": "INC-003",
+        "task_type": "task1",
+        "alert_text": "[INFO] Admin dashboard CSS assets returning 404. Static file path misconfigured after deploy.",
+        "context": {
+            "service": "admin-ui",
+            "affected_users": "internal only",
+            "error_type": "404 on /static/main.css",
+            "last_deploy": "30m ago",
+            "user_impact": "cosmetic"
+        },
+        "ground_truth": {"severity": "SEV3"}
+    },
+    {
+        "incident_id": "INC-004",
+        "task_type": "task1",
+        "alert_text": "[CRITICAL] Auth service down. All login attempts failing with 500. SSO token validation endpoint unreachable.",
+        "context": {
+            "service": "auth-service",
+            "http_500_rate": "100%",
+            "affected_flows": ["login", "token_refresh", "SSO"],
+            "pod_status": "CrashLoopBackOff",
+            "region": "global"
+        },
+        "ground_truth": {"severity": "SEV1"}
+    },
+    {
+        "incident_id": "INC-005",
+        "task_type": "task1",
+        "alert_text": "[WARNING] Notification service email queue backlog growing. 14,000 emails pending. Delivery delay: ~22 minutes.",
+        "context": {
+            "service": "notification-service",
+            "queue_backlog": 14000,
+            "avg_delay_min": 22,
+            "consumer_lag": "high",
+            "revenue_impact": False
+        },
+        "ground_truth": {"severity": "SEV2"}
+    },
+    # ── TASK 2: Root Cause Classification ─────────────────────────────────────
+    {
+        "incident_id": "INC-006",
+        "task_type": "task2",
+        "alert_text": "[CRITICAL] PostgreSQL replica lag: 94 seconds. Write queries spilling to disk. Connection pool exhausted on primary.",
+        "context": {
+            "db": "postgres-primary",
+            "replica_lag_sec": 94,
+            "connection_pool": "500/500",
+            "disk_spill": True,
+            "slow_query_count": 312
+        },
+        "ground_truth": {"root_cause": "DATABASE"}
+    },
+    {
+        "incident_id": "INC-007",
+        "task_type": "task2",
+        "alert_text": "[CRITICAL] Packet loss 38% between us-east-1 and eu-west-1. Cross-region API calls failing. BGP route flapping detected.",
+        "context": {
+            "packet_loss_pct": 38,
+            "affected_regions": ["us-east-1", "eu-west-1"],
+            "bgp_flap": True,
+            "provider": "AWS",
+            "traceroute": "drops at transit hop 7"
+        },
+        "ground_truth": {"root_cause": "NETWORK"}
+    },
+    {
+        "incident_id": "INC-008",
+        "task_type": "task2",
+        "alert_text": "[ERROR] NullPointerException in order-processing-service. Stack trace points to discount_calculator.py line 84. Deploy happened 40min ago.",
+        "context": {
+            "service": "order-processing",
+            "exception": "NullPointerException",
+            "file": "discount_calculator.py",
+            "line": 84,
+            "last_deploy": "40min ago",
+            "git_commit": "a3f9c21"
+        },
+        "ground_truth": {"root_cause": "APPLICATION"}
+    },
+    {
+        "incident_id": "INC-009",
+        "task_type": "task2",
+        "alert_text": "[WARNING] Stripe webhook delivery failures spiking. 503s from Stripe API. Stripe status page shows degraded payment processing.",
+        "context": {
+            "vendor": "Stripe",
+            "webhook_failures": 840,
+            "stripe_status": "degraded",
+            "our_service_health": "healthy",
+            "stripe_status_url": "https://status.stripe.com"
+        },
+        "ground_truth": {"root_cause": "THIRD_PARTY"}
+    },
+    {
+        "incident_id": "INC-010",
+        "task_type": "task2",
+        "alert_text": "[CRITICAL] Node group in Kubernetes cluster terminated. 6/10 worker nodes NotReady. Pods evicted across analytics namespace.",
+        "context": {
+            "cluster": "prod-k8s-us-east",
+            "nodes_not_ready": 6,
+            "total_nodes": 10,
+            "evicted_pods": 47,
+            "namespace": "analytics",
+            "cause": "EC2 spot interruption"
+        },
+        "ground_truth": {"root_cause": "INFRASTRUCTURE"}
+    },
+    # ── TASK 3: Recommended Action ────────────────────────────────────────────
+    {
+        "incident_id": "INC-011",
+        "task_type": "task3",
+        "alert_text": "[CRITICAL] API error rate jumped from 0.2% to 67% immediately after deploy v2.4.1. Rollback candidate identified.",
+        "context": {
+            "service": "api-gateway",
+            "error_rate_before": "0.2%",
+            "error_rate_after": "67%",
+            "deploy_version": "v2.4.1",
+            "previous_stable": "v2.4.0",
+            "rollback_tested": True
+        },
+        "ground_truth": {"action": "ROLLBACK"}
+    },
+    {
+        "incident_id": "INC-012",
+        "task_type": "task3",
+        "alert_text": "[WARNING] Search service CPU at 98%. Request queue growing. Pod autoscaler at max replicas. Flash sale traffic spike ongoing.",
+        "context": {
+            "service": "search-service",
+            "cpu_pct": 98,
+            "current_replicas": 20,
+            "max_replicas_configured": 20,
+            "queue_depth": 9400,
+            "event": "flash sale"
+        },
+        "ground_truth": {"action": "SCALE_UP"}
+    },
+    {
+        "incident_id": "INC-013",
+        "task_type": "task3",
+        "alert_text": "[ERROR] Worker service stuck in deadlock. Memory usage flat at 99%. Process not responding to health checks. No deploy in 6 days.",
+        "context": {
+            "service": "background-worker",
+            "memory_pct": 99,
+            "health_check": "failing",
+            "last_deploy_days_ago": 6,
+            "deadlock_detected": True
+        },
+        "ground_truth": {"action": "RESTART_SERVICE"}
+    },
+    {
+        "incident_id": "INC-014",
+        "task_type": "task3",
+        "alert_text": "[CRITICAL] Primary RDS instance unresponsive. Failover to read replica not yet triggered. Data writes failing across all services.",
+        "context": {
+            "db": "rds-postgres-primary",
+            "status": "unresponsive",
+            "read_replica": "healthy",
+            "auto_failover": "disabled",
+            "write_failure_rate": "100%"
+        },
+        "ground_truth": {"action": "FAILOVER"}
+    },
+    {
+        "incident_id": "INC-015",
+        "task_type": "task3",
+        "alert_text": "[WARNING] SendGrid bounce rate at 34% for transactional emails. Delivery failures concentrated on @yahoo.com domains. No infra changes.",
+        "context": {
+            "vendor": "SendGrid",
+            "bounce_rate_pct": 34,
+            "affected_domains": ["yahoo.com"],
+            "our_infra_changes": False,
+            "sendgrid_status": "investigating"
+        },
+        "ground_truth": {"action": "NOTIFY_VENDOR"}
+    },
+    # ── TASK 1: Severity (Ambiguous + Edge) ─────────────────────────────
+    {
+        "incident_id": "INC-016",
+        "task_type": "task1",
+        "alert_text": "[INFO] Cart service intermittently failing for premium users only. Error rate: 12%.",
+        "context": {
+            "service": "cart-service",
+            "error_rate_pct": 12,
+            "affected_segment": "premium users",
+            "revenue_dependency": "high",
+            "region": "global"
+        },
+        "ground_truth": {"severity": "SEV1"}
+    },
+    {
+        "incident_id": "INC-017",
+        "task_type": "task1",
+        "alert_text": "[WARNING] API latency increased to 3.2s. Error rate low (2%) but affecting checkout flow.",
+        "context": {
+            "service": "api-service",
+            "latency_ms": 3200,
+            "error_rate_pct": 2,
+            "business_impact": "checkout delay"
+        },
+        "ground_truth": {"severity": "SEV2"}
+    },
+    {
+        "incident_id": "INC-018",
+        "task_type": "task1",
+        "alert_text": "[CRITICAL] Cart service failing for 40% users. Premium users impacted more. Revenue drop observed.",
+        "context": {
+            "error_rate_pct": 40,
+            "affected_segment": "premium",
+            "revenue_impact": True
+        },
+        "ground_truth": {"severity": "SEV1"}
+    },
+    {
+        "incident_id": "INC-019",
+        "task_type": "task1",
+        "alert_text": "[INFO] Logging service delay in ingestion pipeline. No user-facing impact.",
+        "context": {
+            "service": "logging",
+            "delay_sec": 120,
+            "user_impact": False
+        },
+        "ground_truth": {"severity": "SEV3"}
+    },
+    # ── TASK 2: Root Cause (Confusing Signals) ───────────────────────────
+    {
+        "incident_id": "INC-020",
+        "task_type": "task2",
+        "alert_text": "[CRITICAL] API failures with DB latency high and packet loss observed.",
+        "context": {
+            "db_latency_ms": 2800,
+            "packet_loss_pct": 15,
+            "recent_deploy": False
+        },
+        "ground_truth": {"root_cause": "NETWORK"}
+    },
+    {
+        "incident_id": "INC-021",
+        "task_type": "task2",
+        "alert_text": "[ERROR] Service throwing timeout exceptions. No infra alerts. Code deployed 10 mins ago.",
+        "context": {
+            "exception": "TimeoutException",
+            "deploy_time": "10m ago",
+            "infra_health": "normal"
+        },
+        "ground_truth": {"root_cause": "APPLICATION"}
+    },
+    {
+        "incident_id": "INC-022",
+        "task_type": "task2",
+        "alert_text": "[WARNING] DB CPU high and slow queries increasing gradually.",
+        "context": {
+            "db_cpu_pct": 92,
+            "slow_queries": 210,
+            "replica_lag": 5
+        },
+        "ground_truth": {"root_cause": "DATABASE"}
+    },
+    {
+        "incident_id": "INC-023",
+        "task_type": "task2",
+        "alert_text": "[CRITICAL] Multiple pods evicted. Node memory pressure warnings.",
+        "context": {
+            "pods_evicted": 30,
+            "node_memory_pressure": True,
+            "cluster_health": "degraded"
+        },
+        "ground_truth": {"root_cause": "INFRASTRUCTURE"}
+    },
+    # ── TASK 3: Action (Ambiguous Decisions) ─────────────────────────────
+    {
+        "incident_id": "INC-024",
+        "task_type": "task3",
+        "alert_text": "[WARNING] CPU high but traffic spike detected. Autoscaling already active.",
+        "context": {
+            "cpu_pct": 90,
+            "traffic_spike": True,
+            "autoscaling": "active"
+        },
+        "ground_truth": {"action": "SCALE_UP"}
+    },
+    {
+        "incident_id": "INC-025",
+        "task_type": "task3",
+        "alert_text": "[ERROR] New deploy caused minor errors (5%). System stable otherwise.",
+        "context": {
+            "error_rate": 5,
+            "deploy": "recent",
+            "system_stability": "mostly stable"
+        },
+        "ground_truth": {"action": "INVESTIGATE"}
+    },
+    {
+        "incident_id": "INC-026",
+        "task_type": "task3",
+        "alert_text": "[CRITICAL] Service stuck. No response. Health checks failing continuously.",
+        "context": {
+            "health_check": "failing",
+            "response": "none",
+            "deploy": "old"
+        },
+        "ground_truth": {"action": "RESTART_SERVICE"}
+    },
+    {
+        "incident_id": "INC-027",
+        "task_type": "task3",
+        "alert_text": "[WARNING] Vendor API returning intermittent failures.",
+        "context": {
+            "vendor": "Twilio",
+            "failure_rate": 18,
+            "our_system": "healthy"
+        },
+        "ground_truth": {"action": "NOTIFY_VENDOR"}
+    },
+    {
+        "incident_id": "INC-028",
+        "task_type": "task3",
+        "alert_text": "[CRITICAL] DB primary down, replica healthy.",
+        "context": {
+            "primary_status": "down",
+            "replica": "healthy",
+            "writes": "failing"
+        },
+        "ground_truth": {"action": "FAILOVER"}
+    },
+    # ── HARD CASES (REAL THINKING) ──────────────────────────────────────
+    {
+        "incident_id": "INC-029",
+        "task_type": "task3",
+        "alert_text": "[WARNING] Latency increased after deploy but no errors observed.",
+        "context": {
+            "latency": 2500,
+            "error_rate": 0,
+            "deploy": "recent"
+        },
+        "ground_truth": {"action": "INVESTIGATE"}
+    },
+    {
+        "incident_id": "INC-030",
+        "task_type": "task2",
+        "alert_text": "[CRITICAL] Failures observed. External API slow and DB connections also high.",
+        "context": {
+            "external_api_latency": 3000,
+            "db_connections": "95%",
+            "recent_deploy": False
+        },
+        "ground_truth": {"root_cause": "THIRD_PARTY"}
+    },
+    {
+        "incident_id": "INC-031",
+        "task_type": "task1",
+        "alert_text": "[WARNING] Partial outage in recommendation engine. Affects 10% users.",
+        "context": {
+            "affected_users_pct": 10,
+            "service": "recommendation",
+            "revenue_impact": "low"
+        },
+        "ground_truth": {"severity": "SEV2"}
+    },
+    {
+        "incident_id": "INC-032",
+        "task_type": "task2",
+        "alert_text": "[ERROR] Random crashes in service. No infra issues. No recent deploy.",
+        "context": {
+            "crash_logs": True,
+            "infra_health": "good",
+            "deploy": "none"
+        },
+        "ground_truth": {"root_cause": "APPLICATION"}
+    },
+    {
+        "incident_id": "INC-033",
+        "task_type": "task3",
+        "alert_text": "[INFO] Minor UI glitch reported by users.",
+        "context": {
+            "impact": "cosmetic",
+            "users_affected": 50
+        },
+        "ground_truth": {"action": "NO_ACTION"}
+    },
+    {
+        "incident_id": "INC-034",
+        "task_type": "task1",
+        "alert_text": "[CRITICAL] Login failures spike to 70% but only in one region.",
+        "context": {
+            "failure_rate": 70,
+            "region": "ap-south-1",
+            "global_impact": False
+        },
+        "ground_truth": {"severity": "SEV1"}
+    },
+    {
+        "incident_id": "INC-035",
+        "task_type": "task2",
+        "alert_text": "[WARNING] Increased retries and timeouts. Network stable. DB stable.",
+        "context": {
+            "timeouts": True,
+            "network": "stable",
+            "db": "stable"
+        },
+        "ground_truth": {"root_cause": "APPLICATION"}
+    },
+    {
+        "incident_id": "INC-036",
+        "task_type": "task3",
+        "alert_text": "[WARNING] Memory leak suspected. Service degrading slowly.",
+        "context": {
+            "memory_growth": True,
+            "crash": False,
+            "impact": "gradual"
+        },
+        "ground_truth": {"action": "INVESTIGATE"}
+    }
+]
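The ticket list is hand-written, so a label typo or a duplicate ID would only surface at grading time. The following is a minimal sketch of a consistency check, not part of the commit: `check_tickets` and the `ALLOWED` table are hypothetical names, the label sets are copied from the enums in models.py, and the three sample entries (including the deliberately broken `INC-999`) stand in for the real `TICKETS` list.

```python
from collections import Counter

# Allowed labels per task, copied from the enums in models.py.
ALLOWED = {
    "task1": ("severity", {"SEV1", "SEV2", "SEV3"}),
    "task2": ("root_cause", {"DATABASE", "NETWORK", "APPLICATION",
                             "INFRASTRUCTURE", "THIRD_PARTY", "UNKNOWN"}),
    "task3": ("action", {"ROLLBACK", "SCALE_UP", "RESTART_SERVICE", "FAILOVER",
                         "NOTIFY_VENDOR", "INVESTIGATE", "NO_ACTION"}),
}


def check_tickets(tickets):
    """Return (per-task counts, list of problem descriptions). Hypothetical helper."""
    counts = Counter()
    problems = []
    seen_ids = set()
    for ticket in tickets:
        incident_id = ticket.get("incident_id", "<missing id>")
        if incident_id in seen_ids:
            problems.append(f"{incident_id}: duplicate incident_id")
        seen_ids.add(incident_id)

        task = ticket.get("task_type")
        if task not in ALLOWED:
            problems.append(f"{incident_id}: unknown task_type {task!r}")
            continue
        counts[task] += 1

        # Each task grades exactly one ground-truth field.
        field, labels = ALLOWED[task]
        value = ticket.get("ground_truth", {}).get(field)
        if value not in labels:
            problems.append(
                f"{incident_id}: ground_truth[{field!r}] = {value!r} is not an allowed label"
            )
    return dict(counts), problems


# Two tickets from the list above plus one deliberately broken entry.
sample = [
    {"incident_id": "INC-019", "task_type": "task1", "ground_truth": {"severity": "SEV3"}},
    {"incident_id": "INC-028", "task_type": "task3", "ground_truth": {"action": "FAILOVER"}},
    {"incident_id": "INC-999", "task_type": "task2", "ground_truth": {"root_cause": "DNS"}},
]
counts, problems = check_tickets(sample)
print(counts)
print(problems)
```

Run against the full `TICKETS` list, the returned counts could also be compared with the per-task split declared in openenv.yaml.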

inference.py ADDED

	@@ -0,0 +1,194 @@

+# inference.py
+import os
+import json
+import re
+import requests
+from openai import OpenAI
+from incidents import TICKETS
+from dotenv import load_dotenv
+load_dotenv()
+BASE_URL = "http://localhost:8000"
+client = OpenAI(
+    base_url=os.getenv("API_BASE_URL"),
+    api_key=os.getenv("HF_TOKEN")
+)
+SYSTEM_PROMPT = """You are an expert SRE (Site Reliability Engineer) triaging production incidents.
+You will receive an incident alert and context.
+You must respond with ONLY a valid JSON object. No explanation. No markdown. No extra text. No code blocks.
+Rules:
+- For task1: classify severity. Choose ONLY from: SEV1, SEV2, SEV3
+- For task2: classify root cause. Choose ONLY from: DATABASE, NETWORK, APPLICATION, INFRASTRUCTURE, THIRD_PARTY, UNKNOWN
+- For task3: recommend action. Choose ONLY from: ROLLBACK, SCALE_UP, RESTART_SERVICE, FAILOVER, NOTIFY_VENDOR, INVESTIGATE, NO_ACTION
+Response format (return this exact structure):
+{"incident_id": "<incident_id>", "task_type": "<task_type>", "severity": "<value or null>", "root_cause": "<value or null>", "action": "<value or null>"}
+Only populate the field relevant to the task_type. Set others to null.
+"""
+def build_user_prompt(observation: dict) -> str:
+    return f"""Incident ID: {observation['incident_id']}
+Task Type: {observation['task_type']}
+Alert:
+{observation['alert_text']}
+Context:
+{json.dumps(observation['context'], indent=2)}
+Respond with JSON only. No markdown. No explanation."""
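+# Extract the JSON object from the raw LLM output, tolerating surrounding text or code fences.
+def extract_json(raw: str) -> dict:
+    # Greedy match from the first "{" to the last "}"; DOTALL lets the match span newlines.
+    match = re.search(r"\{.*\}", raw, re.DOTALL)
+    if not match:
+        raise ValueError("No JSON found in response")
+    return json.loads(match.group(0))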
+def normalize_action(action: dict, task_type: str) -> dict:
+    return {
+        "incident_id": action.get("incident_id"),
+        "task_type": task_type,
+        "severity": action.get("severity") if task_type == "task1" else None,
+        "root_cause": action.get("root_cause") if task_type == "task2" else None,
+        "action": action.get("action") if task_type == "task3" else None,
+    }
+def call_llm(observation: dict) -> str:
+    full_response = ""
+    try:
+        completion = client.chat.completions.create(
+            model=os.getenv("MODEL_NAME"),
+            messages=[
+                {"role": "system", "content": SYSTEM_PROMPT},
+                {"role": "user", "content": build_user_prompt(observation)}
+            ],
+            temperature=0.1,
+            top_p=0.9,
+            max_tokens=200,
+            seed=42,
+            stream=True
+        )
+        for chunk in completion:
+            if chunk.choices and chunk.choices[0].delta.content is not None:
+                full_response += chunk.choices[0].delta.content
+    except Exception as e:
+        print(f"Error calling LLM: {e}")
+        return ""
+    return full_response.strip()
+def run_episode(task_type: str = None) -> dict:
+    # Step 1 — Reset environment
+    params = {"task_type": task_type} if task_type else {}
+    reset_response = requests.post(f"{BASE_URL}/reset", params=params, timeout=30)
+    reset_response.raise_for_status()
+    reset_data = reset_response.json()
+    session_id = reset_data["session_id"]
+    observation = reset_data
+    print(f"\n{'='*60}")
+    print(f"Incident : {observation['incident_id']}")
+    print(f"Task     : {observation['task_type']}")
+    print(f"Alert    : {observation['alert_text'][:80]}...")
+    # Step 2 — LLM with retry
+    action = None
+    raw = ""
+    for attempt in range(3):
+        raw = call_llm(observation)
+        print(f"LLM Raw (attempt {attempt+1}): {raw}")
+        try:
+            parsed = extract_json(raw)
+            action = normalize_action(parsed, observation["task_type"])
+            break
+        except Exception as e:
+            print(f"Parse failed: {e}")
+    if not action:
+        return {"error": "invalid_json", "raw": raw}
+    # Step 3 — Validate schema
+    required_keys = {"incident_id", "task_type", "severity", "root_cause", "action"}
+    if not required_keys.issubset(action.keys()):
+        print("Invalid schema from LLM")
+        return {"error": "invalid_schema", "raw": raw}
+    # Step 4 — Submit to /step
+    step_response = requests.post(f"{BASE_URL}/step", json=action, params={"session_id": session_id}, timeout=30)
+    step_response.raise_for_status()
+    result = step_response.json()
+    # This line must be kept for submission grading, so it is printed in a fixed, structured format.
+    print(f"[STEP] task_id={result['task_type']} action={result['agent_answer']} reward={result['reward']}")
+    print(f"Answer   : {result['agent_answer']}")
+    print(f"Expected : {result['ground_truth']}")
+    print(f"Correct  : {result['correct']}  |  Reward: {result['reward']}")
+    # 🔥 Logging
+    with open("logs.jsonl", "a") as f:
+        f.write(json.dumps({
+            "observation": observation,
+            "response": action,
+            "result": result
+        }) + "\n")
+    return result
+def run_full_eval():
+    print("[START]")
+    task_types = ["task1", "task2", "task3"]
+    rounds = len(TICKETS)  # run as many episodes as there are tickets in the dataset
+    scores = []
+    errors = 0
+    task_scores = {
+        "task1": [],
+        "task2": [],
+        "task3": []
+    }
+    for i in range(rounds):
+        task = task_types[i % 3]
+        result = run_episode(task_type=task)
+        if "reward" in result:
+            scores.append(result["reward"])
+            task_scores[task].append(result["reward"])
+        else:
+            errors += 1
+    print(f"\n{'='*60}")
+    print(f"Total Episodes : {rounds}")
+    print(f"Graded         : {len(scores)}")
+    print(f"JSON Errors    : {errors}")
+    if scores:
+        print(f"Total Reward : {sum(scores)}")
+        print(f"Average Reward : {sum(scores)/len(scores):.2f}")
+        print(f"Overall Accuracy : {sum(scores)/len(scores)*100:.2f}%")
+        for task in task_scores:
+            if task_scores[task]:
+                acc = sum(task_scores[task]) / len(task_scores[task]) * 100
+                print(f"{task} Accuracy : {acc:.2f}%")
+    print("[END]")
+if __name__ == "__main__":
+    run_full_eval()
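The greedy pattern in `extract_json` spans from the first `{` to the last `}`, so it fails when the model appends text that itself contains a brace. One possible alternative, sketched here under the assumption that the first decodable object is the intended answer, uses the standard library's `json.JSONDecoder.raw_decode`; the function name `extract_first_json_object` is hypothetical and does not appear in the commit.

```python
import json


def extract_first_json_object(raw: str) -> dict:
    """Return the first decodable JSON object embedded in `raw`."""
    decoder = json.JSONDecoder()
    pos = raw.find("{")
    while pos != -1:
        try:
            # raw_decode parses one JSON value starting at `pos` and ignores trailing text.
            obj, _end = decoder.raw_decode(raw, pos)
        except json.JSONDecodeError:
            obj = None
        if isinstance(obj, dict):
            return obj
        # This "{" did not start a valid object; try the next one.
        pos = raw.find("{", pos + 1)
    raise ValueError("No JSON object found in response")


# A reply with chatter on both sides, including a stray brace pair after the answer.
reply = (
    'Sure! {"incident_id": "INC-021", "task_type": "task2", "severity": null, '
    '"root_cause": "APPLICATION", "action": null} Let me know if {anything} is unclear.'
)
parsed = extract_first_json_object(reply)
print(parsed["root_cause"])
```

On this reply the greedy regex would capture everything up to the brace in `{anything}` and `json.loads` would reject it, whereas the decoder stops at the end of the first object.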

models.py ADDED

	@@ -0,0 +1,65 @@

+# models.py
+from pydantic import BaseModel, Field
+from enum import Enum
+from typing import Optional, Dict
+# ── Enums ─────────────────────────────────────────────
+class SeverityLevel(str, Enum):
+    SEV1 = "SEV1"
+    SEV2 = "SEV2"
+    SEV3 = "SEV3"
+class RootCauseCategory(str, Enum):
+    DATABASE       = "DATABASE"
+    NETWORK        = "NETWORK"
+    APPLICATION    = "APPLICATION"
+    INFRASTRUCTURE = "INFRASTRUCTURE"
+    THIRD_PARTY    = "THIRD_PARTY"
+    UNKNOWN        = "UNKNOWN"
+class RecommendedAction(str, Enum):
+    ROLLBACK        = "ROLLBACK"
+    SCALE_UP        = "SCALE_UP"
+    RESTART_SERVICE = "RESTART_SERVICE"
+    FAILOVER        = "FAILOVER"
+    NOTIFY_VENDOR   = "NOTIFY_VENDOR"
+    INVESTIGATE     = "INVESTIGATE"
+    NO_ACTION       = "NO_ACTION"
+# ── Observation (Input to Agent) ──────────────────────
+class IncidentObservation(BaseModel):
+    incident_id: str
+    task_type: str   # "task1" | "task2" | "task3"
+    alert_text: str
+    context: Dict
+# ── Action (Output from Agent) ────────────────────────
+class IncidentAction(BaseModel):
+    incident_id: str
+    task_type: str
+    severity:   Optional[SeverityLevel]     = Field(None)
+    root_cause: Optional[RootCauseCategory] = Field(None)
+    action:     Optional[RecommendedAction] = Field(None)
+# ── Step Result ───────────────────────────────────────
+class StepResult(BaseModel):
+    incident_id: str
+    task_type: str
+    reward: float
+    correct: bool
+    ground_truth: str
+    agent_answer: str
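`IncidentAction` declares all three answer fields as optional, so the model itself does not enforce the convention from the system prompt that only the field matching `task_type` is populated. A minimal sketch of that cross-field check follows, written with plain dictionaries so it runs without pydantic; `TASK_FIELD` and `payload_problems` are hypothetical names, not part of the commit.

```python
# Which answer field each task is expected to fill in.
TASK_FIELD = {"task1": "severity", "task2": "root_cause", "task3": "action"}


def payload_problems(payload: dict) -> list:
    """List the ways `payload` violates the one-field-per-task convention."""
    task = payload.get("task_type")
    if task not in TASK_FIELD:
        return [f"unknown task_type {task!r}"]
    problems = []
    for owner, field in TASK_FIELD.items():
        value = payload.get(field)
        if owner == task and value is None:
            problems.append(f"{field} must be set for {task}")
        elif owner != task and value is not None:
            problems.append(f"{field} must be null for {task}")
    return problems


good = {"incident_id": "INC-028", "task_type": "task3",
        "severity": None, "root_cause": None, "action": "FAILOVER"}
bad = {"incident_id": "INC-028", "task_type": "task3",
       "severity": "SEV1", "root_cause": None, "action": None}
print(payload_problems(good))
print(payload_problems(bad))
```

The same rule could be attached to the pydantic model as a validator; the dictionary form is used here only to keep the sketch dependency-free.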

openenv.yaml ADDED

	@@ -0,0 +1,74 @@

+spec_version: 1
+name: Incident_Triage
+type: space
+runtime: fastapi
+app: app:app
+port: 7860
+version: "1.0.0"
+description: >
+  RL-style environment for SRE incident triage.
+  An LLM agent receives production alerts and must classify severity,
+  identify root cause, or recommend remediation actions.
+api:
+  base_url: http://0.0.0.0:7860
+  endpoints:
+    reset:
+      method: POST
+      path: /reset
+      params:
+        task_type:
+          type: string
+          required: false
+          enum: [task1, task2, task3]
+      returns: IncidentObservation + session_id
+    step:
+      method: POST
+      path: /step
+      params:
+        session_id:
+          type: string
+          required: true
+      body: IncidentAction
+      returns: StepResult
+    state:
+      method: GET
+      path: /state
+      params:
+        session_id:
+          type: string
+          required: true
+      returns: current episode state
+tasks:
+  task1:
+    name: Severity Classification
+    output_field: severity
+    labels: [SEV1, SEV2, SEV3]
+    reward: partial  # 1.0 exact | 0.5 adjacent | 0.0 far
+  task2:
+    name: Root Cause Classification
+    output_field: root_cause
+    labels: [DATABASE, NETWORK, APPLICATION, INFRASTRUCTURE, THIRD_PARTY, UNKNOWN]
+    reward: binary  # 1.0 correct | 0.0 incorrect
+  task3:
+    name: Recommended Action
+    output_field: action
+    labels: [ROLLBACK, SCALE_UP, RESTART_SERVICE, FAILOVER, NOTIFY_VENDOR, INVESTIGATE, NO_ACTION]
+    reward: binary  # 1.0 correct | 0.0 incorrect
+dataset:
+  total_tickets: 36
+  split:
+    task1: 13
+    task2: 12
+    task3: 11
+reproducibility:
+  llm_seed: 42
+  llm_temperature: 0.1
+  selection: random per task_type pool
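The `tasks` section describes two reward schemes: partial credit for severity (1.0 exact, 0.5 adjacent, 0.0 far) and binary credit for the other two tasks. graders.py is not shown in this part of the diff, so the following is only a sketch of one reading of that spec, assuming "adjacent" means one step apart in the order SEV1, SEV2, SEV3; the function names are hypothetical.

```python
# Severity levels in order, so that list distance measures how far off a guess is.
SEVERITY_ORDER = ["SEV1", "SEV2", "SEV3"]


def severity_reward(predicted, truth) -> float:
    """Partial credit: 1.0 exact, 0.5 one level off, 0.0 otherwise."""
    if predicted not in SEVERITY_ORDER or truth not in SEVERITY_ORDER:
        return 0.0
    distance = abs(SEVERITY_ORDER.index(predicted) - SEVERITY_ORDER.index(truth))
    return {0: 1.0, 1: 0.5}.get(distance, 0.0)


def binary_reward(predicted, truth) -> float:
    """Binary credit used for root cause and recommended action."""
    return 1.0 if predicted == truth else 0.0


print(severity_reward("SEV2", "SEV1"))
print(severity_reward("SEV3", "SEV1"))
print(binary_reward("NETWORK", "DATABASE"))
```

With partial credit in play, the average reward over task1 episodes is not the same thing as exact-match accuracy, which matters when reading the per-task percentages printed by inference.py.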

pyproject.toml ADDED

	@@ -0,0 +1,45 @@

+# Copyright (c) Meta Platforms, Inc. and affiliates.
+# All rights reserved.
+#
+# This source code is licensed under the BSD-style license found in the
+# LICENSE file in the root directory of this source tree.
+[build-system]
+requires = ["setuptools>=45", "wheel"]
+build-backend = "setuptools.build_meta"
+[project]
+name = "openenv-Incident_Triage"
+version = "0.1.0"
+description = "Incident Triage environment for OpenEnv"
+requires-python = ">=3.10"
+dependencies = [
+    # Core OpenEnv runtime (provides FastAPI server + HTTP client types)
+    # install from github
+    # "openenv-core[core] @ git+https://github.com/meta-pytorch/OpenEnv.git",
+    "openenv-core[core]>=0.2.2",
+    # Environment-specific dependencies
+    # Add all dependencies needed for your environment here
+    # Examples:
+    # "numpy>=1.19.0",
+    # "torch>=2.0.0",
+    # "gymnasium>=0.29.0",
+    # "openspiel>=1.0.0",
+    # "smolagents>=1.22.0,<2",
+]
+[project.optional-dependencies]
+dev = [
+    "pytest>=8.0.0",
+    "pytest-cov>=4.0.0",
+]
+[project.scripts]
+# Server entry point - enables running via: uv run --project . server
+# or: python -m Incident_Triage.server.app
+server = "Incident_Triage.server.app:main"
+[tool.setuptools]
+include-package-data = true
+packages = ["Incident_Triage", "Incident_Triage.server"]
+package-dir = { "Incident_Triage" = ".", "Incident_Triage.server" = "server" }

requirements.txt ADDED

	@@ -0,0 +1,5 @@

+fastapi
+uvicorn
+pydantic
+openai
+requests
+python-dotenv