Spaces: Keshav051/AntiAtropos


Keshav051 committed on Apr 26

Commit a2abb70 (verified) · 1 Parent(s): 2699848

Upload folder using huggingface_hub

Files changed (2)

README.md +13 -13
training/openenv_loop.py +26 -12

README.md CHANGED

@@ -20,15 +20,15 @@ AntiAtropos is a production-grade Autonomous SRE (Site Reliability Engineering)
 ---
-## 🚀 The Vision: Beyond Runbooks
 Traditional DevOps relies on static thresholds and "If-This-Then-That" runbooks. This doesn't scale with the complexity of modern microservice DAGs. AntiAtropos moves from reactive scripts to **Dynamical System Control**.
-Agents in AntiAtropos are trained to minimize the **Lyapunov Energy** of the cluster—balancing the potential energy of backlogs to maintain equilibrium under extreme pressure.
 ---
-## 🧪 The Physics Engine
 AntiAtropos simulates a 5-node cluster with high-fidelity operational dynamics:
@@ -39,7 +39,7 @@ AntiAtropos simulates a 5-node cluster with high-fidelity operational dynamics:
 ---
-## 🛠️ OpenEnv Specification Compliance
 AntiAtropos implements typed OpenEnv interfaces using Pydantic models and an OpenEnv-compatible FastAPI server:
@@ -61,16 +61,16 @@ AntiAtropos implements typed OpenEnv interfaces using Pydantic models and an Ope
 ---
-## 🏗️ Cluster Architecture & Control Plane
 AntiAtropos models a 5-node production DAG with a centralized control plane.
 ### Topology (The Directed Graph)
 Traffic flows through a hierarchical structure, enabling realistic cascading failure simulations:
 ```
-node-0 (VIP Ingress) ──┬──► node-1 (Checkout)
-                       └──► node-2 (Catalog) ──► node-3 (Database)
-node-4 (Auth Ingress) ──┘
 ```
 - **node-0**: The VIP Payment Gateway. Business-critical; load shedding is forbidden.
 - **node-4**: Independent ingress for Auth services.
@@ -84,9 +84,9 @@ The environment includes a `KubernetesExecutor` that allows the same agent logic
 ---
-## 🏆 Reward Engineering: The Differentiable SRE
-Our reward function is grounded in Neely’s **Drift-Plus-Penalty** framework, providing a dense, informative signal:
 1.  **Lyapunov Drift ($\Delta V$)**: Measures the one-tick change in system energy. Negative drift means the cluster is stabilizing.
 2.  **Smooth Sigmoid SLA**: Dual sigmoids (Latency and Error Rate) provide gradient **before** a violation.
@@ -95,7 +95,7 @@ Our reward function is grounded in Neely’s **Drift-Plus-Penalty** framework, p
 ---
-## 📊 Task Curriculum & Results
 | Task | Category | Weight | Mean Score (Baseline) | Mean Score (Trained) |
 |---|---|---|:---:|:---:|
@@ -105,7 +105,7 @@ Our reward function is grounded in Neely’s **Drift-Plus-Penalty** framework, p
 ---
-## 🏁 Quick Start
 ### Local Installation
 ```bash
@@ -122,4 +122,4 @@ python inference.py --task all --mode trained
 ---
-*Built with ❤️ for the 2026 AntiAtropos Hackathon.*

 ---
+## The Vision: Beyond Runbooks
 Traditional DevOps relies on static thresholds and "If-This-Then-That" runbooks. This doesn't scale with the complexity of modern microservice DAGs. AntiAtropos moves from reactive scripts to **Dynamical System Control**.
+Agents in AntiAtropos are trained to minimize the **Lyapunov Energy** of the cluster-balancing the potential energy of backlogs to maintain equilibrium under extreme pressure.
 ---
+## The Physics Engine
 AntiAtropos simulates a 5-node cluster with high-fidelity operational dynamics:
 ---
+## OpenEnv Specification Compliance
 AntiAtropos implements typed OpenEnv interfaces using Pydantic models and an OpenEnv-compatible FastAPI server:
 ---
+## Cluster Architecture & Control Plane
 AntiAtropos models a 5-node production DAG with a centralized control plane.
 ### Topology (The Directed Graph)
 Traffic flows through a hierarchical structure, enabling realistic cascading failure simulations:
 ```
+node-0 (VIP Ingress) --+--> node-1 (Checkout)
+                       +--> node-2 (Catalog) --> node-3 (Database)
+node-4 (Auth Ingress) --+
 ```
 - **node-0**: The VIP Payment Gateway. Business-critical; load shedding is forbidden.
 - **node-4**: Independent ingress for Auth services.
 ---
+## Reward Engineering: The Differentiable SRE
+Our reward function is grounded in Neely's **Drift-Plus-Penalty** framework, providing a dense, informative signal:
 1.  **Lyapunov Drift ($\Delta V$)**: Measures the one-tick change in system energy. Negative drift means the cluster is stabilizing.
 2.  **Smooth Sigmoid SLA**: Dual sigmoids (Latency and Error Rate) provide gradient **before** a violation.
 ---
+## Task Curriculum & Results
 | Task | Category | Weight | Mean Score (Baseline) | Mean Score (Trained) |
 |---|---|---|:---:|:---:|
 ---
+## Quick Start
 ### Local Installation
 ```bash
 ---
+*Built with passion for the 2026 AntiAtropos Hackathon.*
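
The two reward terms named in the README hunk above (one-tick Lyapunov drift and the dual-sigmoid SLA term) can be illustrated with a minimal sketch. This is not the repository's reward function: the quadratic energy, the function names, the SLO thresholds, the sigmoid scales, and `sla_weight` are all assumptions chosen for illustration.

```python
import math


def lyapunov_energy(queues):
    """Assumed quadratic energy: half the sum of squared queue backlogs."""
    return 0.5 * sum(q * q for q in queues)


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def sla_penalty(latency_ms, error_rate,
                lat_slo_ms=200.0, err_slo=0.05,
                lat_scale=20.0, err_scale=0.01):
    """Two smooth sigmoids (latency and error rate).

    Each term is already non-zero below its threshold, so the penalty
    grows before a violation occurs. Thresholds and scales are
    hypothetical values, not taken from the project.
    """
    lat_term = sigmoid((latency_ms - lat_slo_ms) / lat_scale)
    err_term = sigmoid((error_rate - err_slo) / err_scale)
    return lat_term + err_term


def reward(prev_queues, next_queues, latency_ms, error_rate, sla_weight=1.0):
    """Drift-plus-penalty style reward: negative of (drift + weighted penalty)."""
    drift = lyapunov_energy(next_queues) - lyapunov_energy(prev_queues)
    return -(drift + sla_weight * sla_penalty(latency_ms, error_rate))


if __name__ == "__main__":
    draining = reward([0.4, 0.4], [0.2, 0.2], latency_ms=100.0, error_rate=0.01)
    growing = reward([0.2, 0.2], [0.4, 0.4], latency_ms=100.0, error_rate=0.01)
    print(f"draining={draining:.4f} growing={growing:.4f}")
```

With identical latency and error rate, a tick that drains the queues has negative drift and therefore scores higher than a tick that lets them grow, which matches the README's statement that negative drift means the cluster is stabilizing.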

training/openenv_loop.py CHANGED

@@ -66,20 +66,34 @@ TASK_BRIEFS = {
     ),
 }
-SYSTEM_PROMPT = """SRE controller for a 5-node cluster. Output ONE JSON object. No tags. No text.
 Topology: node-0(VIP)→node-1,node-2 | node-2→node-3 | node-4(Auth)
-Boot delay: 5 ticks for new capacity. FAILED → outflow=0, children starved.
-Actions — choose based on node state:
-  SCALE_UP   node param 0.3-0.8   when queue rising OR status=DEGRADED
-  SCALE_DOWN node param 0.2-0.5   when queue low AND (capacity>0.6 OR pending>0)
-  SHED_LOAD  node param 0.3-0.5   queue spike on node-3 or node-4; NEVER node-0/1/2
-  REROUTE    node param 0.5-1.0   ONLY when status=FAILED
-  NO_OP      node-0 param 0.0     ALL queues <0.1 AND all Healthy
-Key rules:
-  q>0.3 rising → SCALE_UP.  q<0.1 with spare cap → SCALE_DOWN.  FAILED → REROUTE then SCALE_UP children.
 {"action_type":"SCALE_UP","target_node_id":"node-1","parameter":0.5}"""

     ),
 }
+SYSTEM_PROMPT = """SRE controller for a 5-node cluster. Output ONE JSON. No tags. No text.
 Topology: node-0(VIP)→node-1,node-2 | node-2→node-3 | node-4(Auth)
+Boot: 5 ticks. FAILED→outflow=0, children starved.
+Your observation shows each node as: {"node":"node-0","status":"H","queue":0.35,"lat_ms":12.0,"inflow":5.0,"capacity":0.5,"pending":0.0}
+- status: H=Healthy D=Degraded F=Failed
+- queue: request backlog (higher = more pressure)
+- capacity: compute currently allocated (0.0-1.0)
+- pending: new capacity being booted (will activate after 5 ticks)
+- inflow: incoming requests per tick
+DECIDE based on these observation values:
+queue > 0.3 → SCALE_UP the node (param 0.3-0.8). Waiting increases latency.
+queue < 0.1 AND capacity > 0.6 → SCALE_DOWN (param 0.2-0.5). Saves cost, reward increases.
+queue < 0.1 AND pending > 0 → SCALE_DOWN (param 0.2-0.3). Cancel unnecessary boots.
+status = D → SCALE_UP immediately (param 0.5-0.8). Node is degrading.
+status = F → REROUTE (param 0.5-1.0). Then SCALE_UP the failed node's children.
+queue spike on node-3 or node-4 ONLY → SHED_LOAD (param 0.3-0.5). Never on node-0/1/2.
+NO_OP only when ALL nodes have queue<0.1 AND capacity<0.6 AND status=H.
+CRITICAL: Do NOT default to NO_OP. Each step should have an active action unless the cluster is perfectly stable. Overusing NO_OP will cost SLA violations.
+Examples:
+  {"node":"node-1","status":"H","queue":0.35,"capacity":0.4,"pending":0.0} → SCALE_UP (queue rising)
+  {"node":"node-2","status":"H","queue":0.05,"capacity":0.7,"pending":0.0} → SCALE_DOWN (empty, over-provisioned)
+  {"node":"node-1","status":"H","queue":0.05,"capacity":0.3,"pending":0.0} → NO_OP (all good)
 {"action_type":"SCALE_UP","target_node_id":"node-1","parameter":0.5}"""
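
The per-node decision rules in the new `SYSTEM_PROMPT` can be written out as a small rule-based baseline, which is handy for sanity-checking model outputs against the prompt's own examples. The sketch below is hypothetical and not part of `training/openenv_loop.py`: the function name `choose_action`, the priority order of the rules, and the specific parameter values (picked from inside the stated ranges) are assumptions. `SHED_LOAD` is omitted because the prompt does not define what counts as a "queue spike".

```python
import json


def _action(action_type, node, parameter):
    return {"action_type": action_type, "target_node_id": node, "parameter": parameter}


def choose_action(obs):
    """Map one node observation to an action dict.

    Thresholds come from the prompt text; the ordering of the checks
    (failed, then degraded, then queue pressure) is an assumption.
    """
    node = obs["node"]
    status = obs["status"]
    queue = obs["queue"]
    capacity = obs["capacity"]
    pending = obs.get("pending", 0.0)

    if status == "F":
        return _action("REROUTE", node, 0.5)      # prompt range 0.5-1.0
    if status == "D":
        return _action("SCALE_UP", node, 0.5)     # prompt range 0.5-0.8
    if queue > 0.3:
        return _action("SCALE_UP", node, 0.5)     # prompt range 0.3-0.8
    if queue < 0.1 and capacity > 0.6:
        return _action("SCALE_DOWN", node, 0.3)   # prompt range 0.2-0.5
    if queue < 0.1 and pending > 0:
        return _action("SCALE_DOWN", node, 0.2)   # prompt range 0.2-0.3
    # Fallback. The removed prompt version targeted node-0 with parameter 0.0.
    return _action("NO_OP", "node-0", 0.0)


if __name__ == "__main__":
    example = {"node": "node-2", "status": "H", "queue": 0.05,
               "capacity": 0.7, "pending": 0.0}
    print(json.dumps(choose_action(example), separators=(",", ":")))
```

Writing the rules this way also shows that the prompt leaves some states unassigned: a healthy node with queue between 0.1 and 0.3, or with capacity exactly 0.6, matches no rule, so this sketch falls through to `NO_OP` there even though the prompt's `NO_OP` condition is stricter.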