Upload folder using huggingface_hub
- README.md +13 -13
- training/openenv_loop.py +26 -12
README.md
CHANGED
@@ -20,15 +20,15 @@ AntiAtropos is a production-grade Autonomous SRE (Site Reliability Engineering)
 
 ---
 
-##
+## The Vision: Beyond Runbooks
 
 Traditional DevOps relies on static thresholds and "If-This-Then-That" runbooks. This doesn't scale with the complexity of modern microservice DAGs. AntiAtropos moves from reactive scripts to **Dynamical System Control**.
 
-Agents in AntiAtropos are trained to minimize the **Lyapunov Energy** of the cluster
+Agents in AntiAtropos are trained to minimize the **Lyapunov Energy** of the cluster-balancing the potential energy of backlogs to maintain equilibrium under extreme pressure.
 
 ---
 
-##
+## The Physics Engine
 
 AntiAtropos simulates a 5-node cluster with high-fidelity operational dynamics:
 
@@ -39,7 +39,7 @@ AntiAtropos simulates a 5-node cluster with high-fidelity operational dynamics:
 
 ---
 
-##
+## OpenEnv Specification Compliance
 
 AntiAtropos implements typed OpenEnv interfaces using Pydantic models and an OpenEnv-compatible FastAPI server:
 
@@ -61,16 +61,16 @@ AntiAtropos implements typed OpenEnv interfaces using Pydantic models and an Ope
 
 ---
 
-##
+## Cluster Architecture & Control Plane
 
 AntiAtropos models a 5-node production DAG with a centralized control plane.
 
 ### Topology (The Directed Graph)
 Traffic flows through a hierarchical structure, enabling realistic cascading failure simulations:
 ```
-node-0 (VIP Ingress)
-
-node-4 (Auth Ingress)
+node-0 (VIP Ingress) --+--> node-1 (Checkout)
+                       +--> node-2 (Catalog) --> node-3 (Database)
+node-4 (Auth Ingress) --+
 ```
 - **node-0**: The VIP Payment Gateway. Business-critical; load shedding is forbidden.
 - **node-4**: Independent ingress for Auth services.
@@ -84,9 +84,9 @@ The environment includes a `KubernetesExecutor` that allows the same agent logic
 
 ---
 
-##
+## Reward Engineering: The Differentiable SRE
 
-Our reward function is grounded in Neely
+Our reward function is grounded in Neely's **Drift-Plus-Penalty** framework, providing a dense, informative signal:
 
 1. **Lyapunov Drift ($\Delta V$)**: Measures the one-tick change in system energy. Negative drift means the cluster is stabilizing.
 2. **Smooth Sigmoid SLA**: Dual sigmoids (Latency and Error Rate) provide gradient **before** a violation.
@@ -95,7 +95,7 @@ Our reward function is grounded in Neely’s **Drift-Plus-Penalty** framework, p
 
 ---
 
-##
+## Task Curriculum & Results
 
 | Task | Category | Weight | Mean Score (Baseline) | Mean Score (Trained) |
 |---|---|---|:---:|:---:|
@@ -105,7 +105,7 @@ Our reward function is grounded in Neely’s **Drift-Plus-Penalty** framework, p
 
 ---
 
-##
+## Quick Start
 
 ### Local Installation
 ```bash
@@ -122,4 +122,4 @@ python inference.py --task all --mode trained
 
 ---
 
-*Built with
+*Built with passion for the 2026 AntiAtropos Hackathon.*
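The topology diagram in the README hunk can be read as a small adjacency map, which makes the "cascading failure" behaviour easy to reason about. The sketch below is illustrative only: `TOPOLOGY` and `starved_by` are hypothetical names, not identifiers from the repository. The edges node-0 → node-1, node-0 → node-2 and node-2 → node-3 are taken from the diagram; where node-4 attaches is not recoverable from it, so node-4 is given no children here. Treating starvation as transitive (a failed node starves everything downstream, not only its direct children) is also an assumption.

```python
from collections import deque

# Edges as read from the README diagram.
# Assumption: node-4 has no downstream link (the diagram does not make it clear).
TOPOLOGY = {
    "node-0": ["node-1", "node-2"],
    "node-1": [],
    "node-2": ["node-3"],
    "node-3": [],
    "node-4": [],
}


def starved_by(failed, topology=TOPOLOGY):
    """Return every node downstream of `failed`, in breadth-first order."""
    seen = []
    queue = deque(topology[failed])
    while queue:
        node = queue.popleft()
        if node not in seen:
            seen.append(node)
            queue.extend(topology[node])
    return seen


print(starved_by("node-0"))  # ['node-1', 'node-2', 'node-3']
print(starved_by("node-2"))  # ['node-3']
```

Under these assumptions a failure at node-0 cuts off three of the five nodes, which is consistent with the README singling it out as business-critical.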
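The reward description in the README hunk (Lyapunov drift plus smooth sigmoid SLA terms) can be sketched in a few lines. The README does not spell out the energy function, thresholds or weights, so everything below is an assumption: the quadratic energy V(Q) = ½ Σ qᵢ² is the textbook choice in drift-plus-penalty analyses, and `lat_limit_ms`, `err_limit`, the two scale parameters and `penalty_weight` are placeholder values, not the project's.

```python
import math


def lyapunov_energy(queues):
    """Quadratic Lyapunov function V(Q) = 1/2 * sum(q_i^2) (assumed form)."""
    return 0.5 * sum(q * q for q in queues)


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def sla_penalty(latency_ms, error_rate, lat_limit_ms=200.0, err_limit=0.05,
                lat_scale_ms=20.0, err_scale=0.01):
    """Two sigmoids that rise smoothly as each metric approaches its limit."""
    lat_term = sigmoid((latency_ms - lat_limit_ms) / lat_scale_ms)
    err_term = sigmoid((error_rate - err_limit) / err_scale)
    return lat_term + err_term


def reward(prev_queues, next_queues, latency_ms, error_rate, penalty_weight=1.0):
    """Negative of (one-tick drift + weighted SLA penalty)."""
    drift = lyapunov_energy(next_queues) - lyapunov_energy(prev_queues)
    return -(drift + penalty_weight * sla_penalty(latency_ms, error_rate))


draining = reward([0.4, 0.3], [0.3, 0.2], latency_ms=12.0, error_rate=0.0)
growing = reward([0.3, 0.2], [0.4, 0.3], latency_ms=12.0, error_rate=0.0)
print(draining > growing)  # True: negative drift scores higher
```

Because the sigmoid is already non-zero below the limit, the penalty starts to grow before the SLA is actually breached, which is the "gradient before a violation" property the README describes.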
training/openenv_loop.py
CHANGED
@@ -66,20 +66,34 @@ TASK_BRIEFS = {
 ),
 }
 
-SYSTEM_PROMPT = """SRE controller for a 5-node cluster. Output ONE JSON
+SYSTEM_PROMPT = """SRE controller for a 5-node cluster. Output ONE JSON. No tags. No text.
 
 Topology: node-0(VIP)→node-1,node-2 | node-2→node-3 | node-4(Auth)
-Boot
-
-
-
-
-
-
-
-
-
-
+Boot: 5 ticks. FAILED→outflow=0, children starved.
+
+Your observation shows each node as: {"node":"node-0","status":"H","queue":0.35,"lat_ms":12.0,"inflow":5.0,"capacity":0.5,"pending":0.0}
+- status: H=Healthy D=Degraded F=Failed
+- queue: request backlog (higher = more pressure)
+- capacity: compute currently allocated (0.0-1.0)
+- pending: new capacity being booted (will activate after 5 ticks)
+- inflow: incoming requests per tick
+
+DECIDE based on these observation values:
+
+queue > 0.3 → SCALE_UP the node (param 0.3-0.8). Waiting increases latency.
+queue < 0.1 AND capacity > 0.6 → SCALE_DOWN (param 0.2-0.5). Saves cost, reward increases.
+queue < 0.1 AND pending > 0 → SCALE_DOWN (param 0.2-0.3). Cancel unnecessary boots.
+status = D → SCALE_UP immediately (param 0.5-0.8). Node is degrading.
+status = F → REROUTE (param 0.5-1.0). Then SCALE_UP the failed node's children.
+queue spike on node-3 or node-4 ONLY → SHED_LOAD (param 0.3-0.5). Never on node-0/1/2.
+NO_OP only when ALL nodes have queue<0.1 AND capacity<0.6 AND status=H.
+
+CRITICAL: Do NOT default to NO_OP. Each step should have an active action unless the cluster is perfectly stable. Overusing NO_OP will cost SLA violations.
+
+Examples:
+{"node":"node-1","status":"H","queue":0.35,"capacity":0.4,"pending":0.0} → SCALE_UP (queue rising)
+{"node":"node-2","status":"H","queue":0.05,"capacity":0.7,"pending":0.0} → SCALE_DOWN (empty, over-provisioned)
+{"node":"node-1","status":"H","queue":0.05,"capacity":0.3,"pending":0.0} → NO_OP (all good)
 {"action_type":"SCALE_UP","target_node_id":"node-1","parameter":0.5}"""
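The decision rules in the new `SYSTEM_PROMPT` are concrete enough to write down as a deterministic baseline, which can be handy for sanity-checking model outputs. This is a sketch under stated assumptions, not code from the repository: `decide` is a hypothetical helper, the precedence (failed, then degraded, then backlog, then cost savings) is not specified by the prompt, the parameter values are simply picked from inside the prompt's ranges, and the `SHED_LOAD` rule is omitted because the prompt does not define what counts as a "queue spike". The target node used for `NO_OP` is likewise a placeholder.

```python
def decide(nodes):
    """Pick one action for a non-empty list of per-node observation dicts."""

    def action(kind, node, param):
        return {"action_type": kind, "target_node_id": node["node"], "parameter": param}

    # Assumed precedence: failures first, then degradation.
    for node in nodes:
        if node["status"] == "F":
            return action("REROUTE", node, 0.75)
    for node in nodes:
        if node["status"] == "D":
            return action("SCALE_UP", node, 0.65)

    # Backlog rule: scale up the node with the deepest queue if it exceeds 0.3.
    busiest = max(nodes, key=lambda n: n["queue"])
    if busiest["queue"] > 0.3:
        return action("SCALE_UP", busiest, 0.5)

    # Cost rules: idle and over-provisioned, or idle with capacity still booting.
    for node in nodes:
        if node["queue"] < 0.1 and node["capacity"] > 0.6:
            return action("SCALE_DOWN", node, 0.35)
        if node["queue"] < 0.1 and node["pending"] > 0:
            return action("SCALE_DOWN", node, 0.25)

    # Nothing matched. Assumption: NO_OP still names a node; the first is used.
    return {"action_type": "NO_OP", "target_node_id": nodes[0]["node"], "parameter": 0.0}


obs = [{"node": "node-1", "status": "H", "queue": 0.35, "capacity": 0.4, "pending": 0.0}]
print(decide(obs)["action_type"])  # SCALE_UP
```

Note that this fallback also returns `NO_OP` for queues between 0.1 and 0.3, a band for which the prompt gives no rule, so it is slightly more permissive than the prompt's "NO_OP only when ALL nodes have queue<0.1" wording.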
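Since the prompt asks the model for exactly one JSON object with no surrounding text, the training loop presumably has to validate that output before acting on it. The following is a minimal standard-library sketch of such a check; `parse_action` and the constants are hypothetical names, and the accepted parameter range of 0.0 to 1.0 is an assumption inferred from the ranges quoted in the prompt. The README mentions Pydantic models for the typed interfaces, so the project's own validation likely looks different.

```python
import json

ALLOWED_ACTIONS = {"SCALE_UP", "SCALE_DOWN", "REROUTE", "SHED_LOAD", "NO_OP"}
NODE_IDS = {f"node-{i}" for i in range(5)}
NO_SHED_NODES = {"node-0", "node-1", "node-2"}  # prompt: never shed load here


def parse_action(text):
    """Parse one JSON action and reject anything the prompt forbids."""
    # json.loads raises JSONDecodeError (a ValueError subclass) if the model
    # wrapped the object in extra text or tags.
    action = json.loads(text)
    if not isinstance(action, dict):
        raise ValueError("expected a JSON object")

    kind = action.get("action_type")
    node = action.get("target_node_id")
    param = action.get("parameter")

    if not isinstance(kind, str) or kind not in ALLOWED_ACTIONS:
        raise ValueError(f"unknown action_type: {kind!r}")
    if not isinstance(node, str) or node not in NODE_IDS:
        raise ValueError(f"unknown target_node_id: {node!r}")
    if isinstance(param, bool) or not isinstance(param, (int, float)):
        raise ValueError(f"parameter must be a number, got {param!r}")
    if not 0.0 <= param <= 1.0:  # assumed valid range
        raise ValueError(f"parameter out of range: {param!r}")
    if kind == "SHED_LOAD" and node in NO_SHED_NODES:
        raise ValueError(f"SHED_LOAD is not allowed on {node}")

    return {"action_type": kind, "target_node_id": node, "parameter": float(param)}


print(parse_action('{"action_type":"SCALE_UP","target_node_id":"node-1","parameter":0.5}'))
```

Raising on invalid output, rather than silently substituting `NO_OP`, keeps malformed generations visible during training; which behaviour the actual loop uses is not shown in this diff.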