Keshav051 committed on
Commit
a2abb70
·
verified ·
1 Parent(s): 2699848

Upload folder using huggingface_hub

Files changed (2)
  1. README.md +13 -13
  2. training/openenv_loop.py +26 -12
README.md CHANGED
@@ -20,15 +20,15 @@ AntiAtropos is a production-grade Autonomous SRE (Site Reliability Engineering)

  ---

- ## 🚀 The Vision: Beyond Runbooks
+ ## The Vision: Beyond Runbooks

  Traditional DevOps relies on static thresholds and "If-This-Then-That" runbooks. This doesn't scale with the complexity of modern microservice DAGs. AntiAtropos moves from reactive scripts to **Dynamical System Control**.

- Agents in AntiAtropos are trained to minimize the **Lyapunov Energy** of the clusterbalancing the potential energy of backlogs to maintain equilibrium under extreme pressure.
+ Agents in AntiAtropos are trained to minimize the **Lyapunov Energy** of the cluster-balancing the potential energy of backlogs to maintain equilibrium under extreme pressure.

  ---

- ## 🧪 The Physics Engine
+ ## The Physics Engine

  AntiAtropos simulates a 5-node cluster with high-fidelity operational dynamics:

@@ -39,7 +39,7 @@ AntiAtropos simulates a 5-node cluster with high-fidelity operational dynamics:

  ---

- ## 🛠️ OpenEnv Specification Compliance
+ ## OpenEnv Specification Compliance

  AntiAtropos implements typed OpenEnv interfaces using Pydantic models and an OpenEnv-compatible FastAPI server:

@@ -61,16 +61,16 @@ AntiAtropos implements typed OpenEnv interfaces using Pydantic models and an Ope

  ---

- ## 🏗️ Cluster Architecture & Control Plane
+ ## Cluster Architecture & Control Plane

  AntiAtropos models a 5-node production DAG with a centralized control plane.

  ### Topology (The Directed Graph)
  Traffic flows through a hierarchical structure, enabling realistic cascading failure simulations:
  ```
- node-0 (VIP Ingress) ──┬──► node-1 (Checkout)
- └──► node-2 (Catalog) ──► node-3 (Database)
- node-4 (Auth Ingress) ──┘
+ node-0 (VIP Ingress) --+--> node-1 (Checkout)
+ +--> node-2 (Catalog) --> node-3 (Database)
+ node-4 (Auth Ingress) --+
  ```
  - **node-0**: The VIP Payment Gateway. Business-critical; load shedding is forbidden.
  - **node-4**: Independent ingress for Auth services.
@@ -84,9 +84,9 @@ The environment includes a `KubernetesExecutor` that allows the same agent logic

  ---

- ## 🏆 Reward Engineering: The Differentiable SRE
+ ## Reward Engineering: The Differentiable SRE

- Our reward function is grounded in Neely’s **Drift-Plus-Penalty** framework, providing a dense, informative signal:
+ Our reward function is grounded in Neely's **Drift-Plus-Penalty** framework, providing a dense, informative signal:

  1. **Lyapunov Drift ($\Delta V$)**: Measures the one-tick change in system energy. Negative drift means the cluster is stabilizing.
  2. **Smooth Sigmoid SLA**: Dual sigmoids (Latency and Error Rate) provide gradient **before** a violation.
@@ -95,7 +95,7 @@ Our reward function is grounded in Neely’s **Drift-Plus-Penalty** framework, p

  ---

- ## 📊 Task Curriculum & Results
+ ## Task Curriculum & Results

  | Task | Category | Weight | Mean Score (Baseline) | Mean Score (Trained) |
  |---|---|---|:---:|:---:|
@@ -105,7 +105,7 @@ Our reward function is grounded in Neely’s **Drift-Plus-Penalty** framework, p

  ---

- ## 🏁 Quick Start
+ ## Quick Start

  ### Local Installation
  ```bash
@@ -122,4 +122,4 @@ python inference.py --task all --mode trained

  ---

- *Built with ❤️ for the 2026 AntiAtropos Hackathon.*
+ *Built with passion for the 2026 AntiAtropos Hackathon.*
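
Reviewer note: the first two reward terms named in the README hunk above (one-tick Lyapunov drift and the dual-sigmoid SLA signal) could be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the repository's reward code. The quadratic energy `V = 0.5 * sum(q_i^2)`, the SLO thresholds, the `sharpness` constant, the `v_weight` trade-off, and every function name below are hypothetical.

```python
import math


def lyapunov_energy(queues):
    """Assumed quadratic Lyapunov function: V = 0.5 * sum of squared backlogs."""
    return 0.5 * sum(q * q for q in queues)


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def sla_penalty(latency_ms, error_rate,
                latency_slo_ms=200.0, error_slo=0.05, sharpness=10.0):
    """Two smooth sigmoids that start rising before the SLO is crossed.

    Each term equals 0.5 exactly at its (assumed) SLO threshold.
    """
    lat_term = sigmoid(sharpness * (latency_ms / latency_slo_ms - 1.0))
    err_term = sigmoid(sharpness * (error_rate / error_slo - 1.0))
    return lat_term + err_term


def reward(prev_queues, next_queues, latency_ms, error_rate, v_weight=1.0):
    """Drift-plus-penalty style reward: the negative of (drift + weighted penalty)."""
    drift = lyapunov_energy(next_queues) - lyapunov_energy(prev_queues)
    return -(drift + v_weight * sla_penalty(latency_ms, error_rate))
```

With these assumptions, a tick in which the queues shrink and latency stays well under the SLO scores higher than a tick in which the queues grow and the SLO is breached, which is the qualitative behavior the README describes.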
training/openenv_loop.py CHANGED
@@ -66,20 +66,34 @@ TASK_BRIEFS = {
  ),
  }

- SYSTEM_PROMPT = """SRE controller for a 5-node cluster. Output ONE JSON object. No tags. No text.
+ SYSTEM_PROMPT = """SRE controller for a 5-node cluster. Output ONE JSON. No tags. No text.

  Topology: node-0(VIP)→node-1,node-2 | node-2→node-3 | node-4(Auth)
- Boot delay: 5 ticks for new capacity. FAILED outflow=0, children starved.
-
- Actions choose based on node state:
- SCALE_UP node param 0.3-0.8 when queue rising OR status=DEGRADED
- SCALE_DOWN node param 0.2-0.5 when queue low AND (capacity>0.6 OR pending>0)
- SHED_LOAD node param 0.3-0.5 queue spike on node-3 or node-4; NEVER node-0/1/2
- REROUTE node param 0.5-1.0 ONLY when status=FAILED
- NO_OP node-0 param 0.0 ALL queues <0.1 AND all Healthy
-
- Key rules:
- q>0.3 rising → SCALE_UP. q<0.1 with spare cap → SCALE_DOWN. FAILED → REROUTE then SCALE_UP children.
+ Boot: 5 ticks. FAILED→outflow=0, children starved.
+
+ Your observation shows each node as: {"node":"node-0","status":"H","queue":0.35,"lat_ms":12.0,"inflow":5.0,"capacity":0.5,"pending":0.0}
+ - status: H=Healthy D=Degraded F=Failed
+ - queue: request backlog (higher = more pressure)
+ - capacity: compute currently allocated (0.0-1.0)
+ - pending: new capacity being booted (will activate after 5 ticks)
+ - inflow: incoming requests per tick
+
+ DECIDE based on these observation values:
+
+ queue > 0.3 → SCALE_UP the node (param 0.3-0.8). Waiting increases latency.
+ queue < 0.1 AND capacity > 0.6 → SCALE_DOWN (param 0.2-0.5). Saves cost, reward increases.
+ queue < 0.1 AND pending > 0 → SCALE_DOWN (param 0.2-0.3). Cancel unnecessary boots.
+ status = D → SCALE_UP immediately (param 0.5-0.8). Node is degrading.
+ status = F → REROUTE (param 0.5-1.0). Then SCALE_UP the failed node's children.
+ queue spike on node-3 or node-4 ONLY → SHED_LOAD (param 0.3-0.5). Never on node-0/1/2.
+ NO_OP only when ALL nodes have queue<0.1 AND capacity<0.6 AND status=H.
+
+ CRITICAL: Do NOT default to NO_OP. Each step should have an active action unless the cluster is perfectly stable. Overusing NO_OP will cost SLA violations.
+
+ Examples:
+ {"node":"node-1","status":"H","queue":0.35,"capacity":0.4,"pending":0.0} → SCALE_UP (queue rising)
+ {"node":"node-2","status":"H","queue":0.05,"capacity":0.7,"pending":0.0} → SCALE_DOWN (empty, over-provisioned)
+ {"node":"node-1","status":"H","queue":0.05,"capacity":0.3,"pending":0.0} → NO_OP (all good)
  {"action_type":"SCALE_UP","target_node_id":"node-1","parameter":0.5}"""
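
Reviewer note: read as a decision table, the rules in the new `SYSTEM_PROMPT` could be mirrored by a small rule-based baseline, which may be handy for sanity-checking model outputs. The sketch below is hypothetical and not part of `training/openenv_loop.py`. The priority order between rules, the `spike` threshold of 0.8, the specific parameter values picked from each allowed range, and the NO_OP fallback for states that no rule covers (for example a queue between 0.1 and 0.3) are all assumptions.

```python
import json

# Nodes on which the prompt forbids SHED_LOAD.
NO_SHED = {"node-0", "node-1", "node-2"}


def make_action(action_type, node_id, parameter):
    """Build an action dict with the key order used in the prompt's JSON example."""
    return {"action_type": action_type, "target_node_id": node_id, "parameter": parameter}


def choose_action(nodes, spike=0.8):
    """Pick one action for a list of per-node observation dicts.

    Assumed priority: Failed > Degraded > queue spike > high queue > scale down.
    """
    for n in nodes:
        if n["status"] == "F":
            return make_action("REROUTE", n["node"], 0.5)
    for n in nodes:
        if n["status"] == "D":
            return make_action("SCALE_UP", n["node"], 0.5)
    for n in nodes:
        # SHED_LOAD is only allowed on node-3 and node-4.
        if n["queue"] >= spike and n["node"] not in NO_SHED:
            return make_action("SHED_LOAD", n["node"], 0.3)
    for n in nodes:
        if n["queue"] > 0.3:
            return make_action("SCALE_UP", n["node"], 0.5)
    for n in nodes:
        # 0.2 lies inside both SCALE_DOWN ranges given in the prompt.
        if n["queue"] < 0.1 and (n["capacity"] > 0.6 or n["pending"] > 0):
            return make_action("SCALE_DOWN", n["node"], 0.2)
    # Fallback (assumption): the old prompt addressed NO_OP to node-0 with parameter 0.0.
    return make_action("NO_OP", "node-0", 0.0)


def to_json(action):
    """Serialize compactly, matching the single-object output the prompt asks for."""
    return json.dumps(action, separators=(",", ":"))
```

Applied to the three examples in the prompt, this sketch returns SCALE_UP for the first node-1 observation, SCALE_DOWN for node-2, and NO_OP for the second node-1 observation.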