Pranavkk / AntiAtropos

Pranavkk committed on Apr 25

Commit 6e2b3ef (verified) · 1 parent: 5039886

Upload folder using huggingface_hub

This view is limited to 50 files because it contains too many changes.

Files changed (50)

AGENTS.md +56 -0
CLAUDE.md +1 -0
Dockerfile +13 -1
README.md +1 -1
__init__.py +25 -25
agent_smoke.py +80 -0
client.py +143 -140
control/__init__.py +2 -2
control/kubernetes_executor.py +396 -230
control/validation.py +69 -38
curriculum.py +131 -0
deploy-local.ps1 +91 -0
deploy/LOCAL_LAPTOP_FASTAPI_GUIDE.md +74 -0
deploy/aws/ARCHITECTURE.md +361 -0
deploy/aws/FASTAPI_AWS_MODE_GUIDE.md +72 -0
deploy/aws/OPERATIONS.md +465 -0
deploy/aws/README.md +361 -0
deploy/aws/cluster-autoscaler-values.yaml +57 -0
deploy/aws/deploy-all.ps1 +493 -0
deploy/aws/deploy.ps1 +369 -0
deploy/aws/deploy.sh +204 -0
deploy/aws/eksctl-cluster.yaml +58 -0
deploy/aws/generate-kubeconfig.ps1 +131 -0
deploy/aws/generate-kubeconfig.sh +138 -0
deploy/aws/grafana-trust-policy.json +12 -0
deploy/aws/grafana-values.yaml +68 -0
deploy/aws/k8s-workloads.yaml +296 -0
deploy/aws/kubeconfig-antiatropos.yaml +34 -0
deploy/aws/prometheus-agent-values.yaml +95 -0
deploy/aws/teardown-all.ps1 +242 -0
deploy/do/README.md +92 -0
deploy/do/antiatropos-control.service +16 -0
deploy/do/deploy-droplet-one-shot.sh +183 -0
deploy/do/uninstall-legacy-openenv.sh +25 -0
deploy/entrypoint.sh +71 -62
deploy/grafana-datasource-local.yaml +11 -0
deploy/grafana-helm-values.yaml +46 -0
deploy/grafana/grafana.ini +21 -21
deploy/grafana/provisioning/dashboards/dashboard.yaml +12 -12
deploy/grafana/provisioning/dashboards/json/antiatropos-live.json +334 -334
deploy/grafana/provisioning/dashboards/json/antiatropos-overview.json +21 -16
deploy/grafana/provisioning/dashboards/json/antiatropos-workloads.json +436 -0
deploy/grafana/provisioning/datasources/prometheus.yaml +2 -2
deploy/index.html +473 -473
deploy/kind-maxpods-250.yaml +11 -0
deploy/local-laptop.yaml +365 -0
deploy/local/datasource-local.yaml +10 -0
deploy/local/grafana-local-values.yaml +34 -0
deploy/local/prometheus-local-values.yaml +49 -0
deploy/nginx.conf +89 -89

AGENTS.md ADDED

	@@ -0,0 +1,56 @@

+# AntiAtropos: The Physics of Autonomous SRE
+> **"Infrastructure is not a static set of configurations; it is a dynamic system of energy, flow, and stability."**
+## The Vision
+AntiAtropos is a next-generation **Autonomous SRE (Site Reliability Engineering) Control Environment**. While traditional DevOps relies on static thresholds (e.g., "if CPU > 80%"), AntiAtropos treats a microservice cluster as a **Physics Engine**.
+Our vision is to move from reactive scripts to **Dynamical System Control**. We are building an environment where AI agents don't just "fix things"—they balance the "Potential Energy" of a cluster to maintain equilibrium under extreme pressure.
+---
+## 1. The Physics Engine Concept
+Traditional observability measures metrics; we measure **Stability**. We have modeled our 5-node cluster using **Fluid Queue Dynamics**, treating request flow like water and nodes like reservoirs.
+### The Lyapunov Potential ($V$)
+The "North Star" of our environment is the **Lyapunov Energy Function**:
+$$V(s) = \sum_{i=1}^{N} w_i \cdot Q_i^2$$
+*   **$Q_i$ (Queue Depth):** The "Potential Energy" or mass accumulated in a service.
+*   **$w_i$ (Weight):** The "Gravity" or business importance (node-0 is the VIP Payment Gateway).
+*   **Cascading Failures:** Our physics engine models "Backlog Pressure," where one failing node can trigger a chain reaction across its neighbors.
+### Advanced Latency Dynamics (M/M/1)
+We move beyond linear latency models. AntiAtropos implements a **"Hockey-Stick" Latency Curve**. As utilization $\rho$ approaches 100%, M/M/1 latency grows without bound in proportion to $1/(1-\rho)$, modeling the "Point of No Return" that real-world on-call engineers fear.
+---
+## 2. Training Strategy: The Professional Loop
+To build a hackathon-winning agent, we use a complex training pipeline coordinated between **Google Colab** and **Hugging Face**:
+### Progressive Curriculum Learning
+Agents are not trained at random. They follow a **Curriculum** (`curriculum.py`) that graduates them through increasingly difficult stages:
+1.  **Stage 1-3:** Capacity Ramping (Learning to scale).
+2.  **Stage 4-5:** Fault Tolerance (Learning to reroute).
+3.  **Stage 6-8:** Surge Stability (Learning to balance competing pressures).
+4.  **Finals:** Sustained protection under cascading failure conditions.
+### Episodic Replay Buffer
+Using `replay.py`, our agents maintain a "Long-term Memory" of **Key Transitions**. Instead of relearning from scratch, the model uses **Few-Shot Demonstrations** to see how successful previous strategies were executed.
+---
+## 3. Upcoming & Unconfirmed Roadmap
+> [!IMPORTANT]
+> **DISCLAIMER:** The following features are in the research phase and are NOT yet finalized or confirmed. Please consult with the core team before assuming implementation details.
+*   **Multi-Token Attention for SRE:** Investigating the use of frequency-selective transformation to capture "cluster breathiness" (p99 jitter) rather than just global averages.
+*   **Graph Neural Network (GNN) Control:** Potential pivot toward modeling the cluster as a dynamic graph to directly manage the "topology of stress."
+*   **Cross-Cluster Generalization:** Testing models trained on 5 nodes against 10-node and 20-node environments.
+---
+## Why This Wins
+AntiAtropos doesn't follow runbooks. It understands the **laws of motion** within a cluster. By training agents to minimize "System Energy," we create infrastructure that is inherently self-healing, cost-efficient, and mathematically stable.
+---
+*Created for the 2026 AntiAtropos Hackathon.*
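
The two formulas that AGENTS.md describes can be checked numerically. The following is a minimal sketch, not the project's implementation: the function names, the example weights, and the service rate are assumptions chosen for illustration. It evaluates the Lyapunov potential $V(s) = \sum_i w_i Q_i^2$ and the standard M/M/1 mean time in system $W = 1/(\mu - \lambda)$.

```python
from typing import Sequence


def lyapunov_energy(queue_depths: Sequence[float], weights: Sequence[float]) -> float:
    """V(s) = sum_i w_i * Q_i^2, the potential defined in AGENTS.md."""
    if len(queue_depths) != len(weights):
        raise ValueError("queue_depths and weights must have the same length")
    return sum(w * q * q for q, w in zip(queue_depths, weights))


def mm1_latency(service_rate: float, arrival_rate: float) -> float:
    """Mean time in system for an M/M/1 queue: W = 1 / (mu - lambda)."""
    if arrival_rate >= service_rate:
        # Utilization >= 100%: the queue is unstable and latency is unbounded.
        return float("inf")
    return 1.0 / (service_rate - arrival_rate)


# Hypothetical 5-node cluster; node-0 (the VIP payment gateway) gets a
# larger weight. The weight values here are illustrative assumptions.
queues = [10.0, 4.0, 0.0, 2.0, 1.0]
weights = [3.0, 1.0, 1.0, 1.0, 1.0]
print("V(s) =", lyapunov_energy(queues, weights))

# Latency at rising utilization with an assumed service rate of 100 req/s.
for rho in (0.5, 0.9, 0.99):
    print(f"rho={rho:.2f} latency_s={mm1_latency(100.0, 100.0 * rho):.3f}")
```

Because the queue depth is squared, a backlog on a heavily weighted node dominates the potential, and the latency printout shows the hockey-stick shape: going from 90% to 99% utilization costs far more than going from 50% to 90%.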

CLAUDE.md ADDED

	@@ -0,0 +1 @@

+Refer to AGENTS.md for instructions

Dockerfile CHANGED

@@ -6,7 +6,19 @@ ENV DEBIAN_FRONTEND=noninteractive \
     PROMETHEUS_VERSION=3.5.1 \
     GRAFANA_VERSION=12.3.1 \
     PROMETHEUS_ARCH=linux-amd64 \
-    GRAFANA_ARCH=linux-amd64
+    GRAFANA_ARCH=linux-amd64 \
+    ANTIATROPOS_ENV_MODE=live \
+    ANTIATROPOS_REWARD_OUTPUT_MODE=normalized \
+    ANTIATROPOS_CONTROL_TIMEOUT_S=8.0 \
+    ANTIATROPOS_PROM_TIMEOUT_S=5.0 \
+    ANTIATROPOS_STRICT_REAL=false \
+    ANTIATROPOS_METRIC_AGGREGATION=sum \
+    ANTIATROPOS_K8S_NAMESPACE=prod-sre \
+    ANTIATROPOS_MIN_REPLICAS=1 \
+    ANTIATROPOS_SCALE_STEP=3 \
+    ANTIATROPOS_CONTROL_PLANE_URL=http://206.189.136.21:8010 \
+    PROMETHEUS_URL=http://206.189.136.21:30090 \
+    ANTIATROPOS_WORKLOAD_MAP='{"node-0":{"deployment":"payments","namespace":"prod-sre"},"node-1":{"deployment":"checkout","namespace":"prod-sre"},"node-2":{"deployment":"catalog","namespace":"prod-sre"},"node-3":{"deployment":"cart","namespace":"prod-sre"},"node-4":{"deployment":"auth","namespace":"prod-sre"}}'
 RUN apt-get update && apt-get install -y --no-install-recommends \
     bash \
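
The `ANTIATROPOS_WORKLOAD_MAP` value is JSON that the executor parses at startup. The sketch below is a standalone approximation modeled on `_parse_json_mapping` in `control/kubernetes_executor.py`; the function name `parse_workload_map` and the sample strings are assumptions for illustration. It shows the expected shape of the map and what happens when the value reaches the process with its double quotes stripped, which can occur if a Dockerfile `ENV` value is not protected by single quotes.

```python
import json
from typing import Dict, Optional

WorkloadMap = Dict[str, Dict[str, str]]


def parse_workload_map(raw: str, default_namespace: str = "default") -> Optional[WorkloadMap]:
    """Return node -> {deployment, namespace}, or None if the value is unusable."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict):
        return None
    out: WorkloadMap = {}
    for node_id, workload in data.items():
        # Every entry must be an object that names a deployment.
        if not isinstance(workload, dict) or not workload.get("deployment"):
            return None
        out[str(node_id)] = {
            "deployment": str(workload["deployment"]),
            # Fall back to the default namespace when none is given.
            "namespace": str(workload.get("namespace", default_namespace)),
        }
    return out


good = '{"node-0": {"deployment": "payments", "namespace": "prod-sre"}, "node-1": {"deployment": "checkout"}}'
# The same kind of map after its double quotes have been lost in transit.
stripped = "{node-0:{deployment:payments,namespace:prod-sre}}"

print(parse_workload_map(good))
print(parse_workload_map(stripped))
```

A `None` result means the executor ends up with an empty mapping, so scale actions would later fail with a missing-workload error. Printing the parsed map once at container start is a cheap way to catch this.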

README.md CHANGED

@@ -274,4 +274,4 @@ For fixed-seed studies, use controlled simulator seeding in evaluation harnesses
 | Grader quality | Deterministic, interpretable composite score in `[0, 1]` |
 | Environment design | Dense Lyapunov-grounded reward, clean reset/step loop, explicit episode boundaries |
 | Code quality | Typed Pydantic models, modular components, OpenEnv manifest, containerized runtime |
-| Novelty | Lyapunov reward shaping + live K8s control plane + Prometheus telemetry + observability-first design |
+| Novelty | Lyapunov reward shaping + live K8s control plane + Prometheus telemetry + observability-first design |

__init__.py CHANGED

@@ -1,25 +1,25 @@
-# Copyright (c) Meta Platforms, Inc. and affiliates.
-# All rights reserved.
-#
-# This source code is licensed under the BSD-style license found in the
-# LICENSE file in the root directory of this source tree.
-"""AntiAtropos Environment."""
-from .client import AntiAtroposEnv
-from .models import (
-    SREAction,
-    ActionType,
-    ClusterObservation,
-    NodeObservation,
-    NodeStatus,
-)
-__all__ = [
-    "AntiAtroposEnv",
-    "SREAction",
-    "ActionType",
-    "ClusterObservation",
-    "NodeObservation",
-    "NodeStatus",
-]

+# Copyright (c) Meta Platforms, Inc. and affiliates.
+# All rights reserved.
+#
+# This source code is licensed under the BSD-style license found in the
+# LICENSE file in the root directory of this source tree.
+"""AntiAtropos Environment."""
+from .client import AntiAtroposEnv
+from .models import (
+    SREAction,
+    ActionType,
+    ClusterObservation,
+    NodeObservation,
+    NodeStatus,
+)
+__all__ = [
+    "AntiAtroposEnv",
+    "SREAction",
+    "ActionType",
+    "ClusterObservation",
+    "NodeObservation",
+    "NodeStatus",
+]

agent_smoke.py ADDED

	@@ -0,0 +1,80 @@

+#!/usr/bin/env python3
+"""
+Quick autonomous agent smoke test against the running AntiAtropos FastAPI server.
+This does NOT require an LLM API key.
+It uses a simple heuristic policy to validate end-to-end control-plane + telemetry wiring.
+"""
+import asyncio
+import os
+from dataclasses import dataclass
+try:
+    from AntiAtropos.client import AntiAtroposEnv
+    from AntiAtropos.models import SREAction, ActionType
+except ImportError:
+    from client import AntiAtroposEnv  # type: ignore
+    from models import SREAction, ActionType  # type: ignore
+@dataclass
+class Config:
+    env_url: str = os.getenv("ENV_URL", "http://localhost:8000")
+    task_id: str = os.getenv("ANTIATROPOS_TASK", "task-1")
+    mode: str = os.getenv("ANTIATROPOS_MODE", os.getenv("ANTIATROPOS_ENV_MODE", "aws"))
+    max_steps: int = int(os.getenv("ANTIATROPOS_SMOKE_STEPS", "20"))
+def pick_action(obs) -> SREAction:
+    # Pick node with highest queue depth as target
+    target = max(obs.nodes, key=lambda n: float(getattr(n, "queue_depth", 0.0)))
+    avg_latency = float(getattr(obs, "average_latency_ms", 0.0))
+    backlog = float(getattr(obs, "total_queue_backlog", 0.0))
+    # Heuristic policy:
+    # - If stressed, scale up busiest node
+    # - If very calm, scale down non-VIP node
+    # - Otherwise no-op
+    if avg_latency > 0.20 or backlog > 0.45:
+        return SREAction(action_type=ActionType.SCALE_UP, target_node_id=target.node_id, parameter=0.6)
+    non_vips = [n for n in obs.nodes if not bool(getattr(n, "is_vip", False))]
+    if avg_latency < 0.08 and backlog < 0.15 and non_vips:
+        down_target = max(non_vips, key=lambda n: float(getattr(n, "capacity", 0.0)))
+        return SREAction(action_type=ActionType.SCALE_DOWN, target_node_id=down_target.node_id, parameter=0.4)
+    return SREAction(action_type=ActionType.NO_OP, target_node_id=target.node_id, parameter=0.0)
+async def main() -> None:
+    cfg = Config()
+    print(f"[agent-smoke] env={cfg.env_url} task={cfg.task_id} mode={cfg.mode} steps={cfg.max_steps}")
+    async with AntiAtroposEnv(cfg.env_url, message_timeout_s=120) as env:
+        result = await env.reset(task_id=cfg.task_id, mode=cfg.mode)
+        print(f"[reset] step={result.observation.step} latency={result.observation.average_latency_ms:.3f} backlog={result.observation.total_queue_backlog:.3f}")
+        rewards = []
+        for i in range(1, cfg.max_steps + 1):
+            action = pick_action(result.observation)
+            result = await env.step(action)
+            rewards.append(float(result.reward or 0.0))
+            ack = getattr(result.observation, "action_ack_status", "")
+            print(
+                f"[step {i:02d}] {action.action_type.value} {action.target_node_id} p={action.parameter:.2f} "
+                f"reward={float(result.reward or 0.0):.3f} done={bool(result.done)} ack={ack}"
+            )
+            if result.done:
+                break
+        if rewards:
+            avg_reward = sum(rewards) / len(rewards)
+            print(f"[done] steps={len(rewards)} avg_reward={avg_reward:.3f} final_latency={result.observation.average_latency_ms:.3f} final_backlog={result.observation.total_queue_backlog:.3f}")
+        else:
+            print("[done] no steps executed")
+if __name__ == "__main__":
+    asyncio.run(main())

client.py CHANGED

@@ -1,140 +1,143 @@
-# Copyright (c) Meta Platforms, Inc. and affiliates.
-# All rights reserved.
-#
-# This source code is licensed under the BSD-style license found in the
-# LICENSE file in the root directory of this source tree.
-"""AntiAtropos Environment Client."""
-from typing import Dict
-from openenv.core import EnvClient
-from openenv.core.client_types import StepResult
-from openenv.core.env_server.types import State
-from .models import SREAction, ClusterObservation, NodeObservation, NodeStatus
-class AntiAtroposEnv(
-    EnvClient[SREAction, ClusterObservation, State]
-):
-    """
-    Client for the AntiAtropos Environment.
-    This client maintains a persistent WebSocket connection to the environment server,
-    enabling efficient multi-step interactions with lower latency.
-    Each client instance has its own dedicated environment session on the server.
-    Example:
-        >>> # Connect to a running server
-        >>> with AntiAtroposEnv(base_url="http://localhost:8000") as client:
-        ...     result = client.reset()
-        ...     print(result.observation.average_latency_ms)
-        ...
-        ...     action = SREAction(action_type="SCALE_UP", target_node_id="node-0", parameter=2.0)
-        ...     result = client.step(action)
-        ...     print(result.observation.lyapunov_energy)
-    Example with Docker:
-        >>> # Automatically start container and connect
-        >>> client = AntiAtroposEnv.from_docker_image("AntiAtropos-env:latest")
-        >>> try:
-        ...     result = client.reset()
-        ...     result = client.step(SREAction(action_type="NO_OP"))
-        ... finally:
-        ...     client.close()
-    """
-    def _step_payload(self, action: SREAction) -> Dict:
-        """
-        Convert SREAction to JSON payload for step message.
-        Args:
-            action: SREAction instance
-        Returns:
-            Dictionary representation suitable for JSON encoding
-        """
-        return {
-            "action_type": action.action_type.value,
-            "target_node_id": action.target_node_id,
-            "parameter": action.parameter,
-        }
-    def _parse_result(self, payload: Dict) -> StepResult[ClusterObservation]:
-        """
-        Parse server response into StepResult[ClusterObservation].
-        Args:
-            payload: JSON response data from server
-        Returns:
-            StepResult with ClusterObservation
-        """
-        obs_data = payload.get("observation", {})
-        # Parse per-node list into NodeObservation objects
-        raw_nodes = obs_data.get("nodes", [])
-        node_obs = [
-            NodeObservation(
-                node_id=n.get("node_id", ""),
-                status=NodeStatus(n.get("status", NodeStatus.HEALTHY)),
-                is_vip=n.get("is_vip", False),
-                queue_depth=n.get("queue_depth", 0),
-                latency_ms=n.get("latency_ms", 0.0),
-                incoming_request_rate=n.get("incoming_request_rate", 0.0),
-                cpu_utilization=n.get("cpu_utilization", 0.0),
-                importance_weight=n.get("importance_weight", 1.0),
-                done=n.get("done", False),
-                reward=n.get("reward", 0.0),
-            )
-            for n in raw_nodes
-        ]
-        observation = ClusterObservation(
-            cluster_id=obs_data.get("cluster_id", ""),
-            task_id=obs_data.get("task_id", "task-1"),
-            mode=obs_data.get("mode", "simulated"),
-            active_nodes=obs_data.get("active_nodes", 0),
-            average_latency_ms=obs_data.get("average_latency_ms", 0.0),
-            error_rate=obs_data.get("error_rate", 0.0),
-            total_queue_backlog=obs_data.get("total_queue_backlog", 0),
-            current_cost_per_hour=obs_data.get("current_cost_per_hour", 0.0),
-            lyapunov_energy=obs_data.get("lyapunov_energy", 0.0),
-            nodes=node_obs,
-            step=obs_data.get("step", 0),
-            max_steps=obs_data.get("max_steps", 100),
-            sla_violations=obs_data.get("sla_violations", 0),
-            invalid_action_count=obs_data.get("invalid_action_count", 0),
-            vip_failure_count=obs_data.get("vip_failure_count", 0),
-            metric_timestamp=obs_data.get("metric_timestamp", 0.0),
-            data_freshness_ms=obs_data.get("data_freshness_ms", 0),
-            action_ack_status=obs_data.get("action_ack_status", "success"),
-            choke_level=obs_data.get("choke_level", 0.0),
-            raw_reward=obs_data.get("raw_reward", 0.0),
-            normalized_reward=obs_data.get("normalized_reward", 0.0),
-            reward_scale_version=obs_data.get("reward_scale_version", "sigmoid-v1"),
-            done=payload.get("done", False),
-            reward=payload.get("reward", 0.0),
-        )
-        return StepResult(
-            observation=observation,
-            reward=payload.get("reward", 0.0),
-            done=payload.get("done", False),
-        )
-    def _parse_state(self, payload: Dict) -> State:
-        """
-        Parse server response into State object.
-        Args:
-            payload: JSON response from state request
-        Returns:
-            State object with episode_id and step_count
-        """
-        return State(
-            episode_id=payload.get("episode_id"),
-            step_count=payload.get("step_count", 0),
-        )

+# Copyright (c) Meta Platforms, Inc. and affiliates.
+# All rights reserved.
+#
+# This source code is licensed under the BSD-style license found in the
+# LICENSE file in the root directory of this source tree.
+"""AntiAtropos Environment Client."""
+from typing import Dict
+from openenv.core import EnvClient
+from openenv.core.client_types import StepResult
+from openenv.core.env_server.types import State
+try:
+    from .models import SREAction, ClusterObservation, NodeObservation, NodeStatus
+except ImportError:
+    from models import SREAction, ClusterObservation, NodeObservation, NodeStatus  # type: ignore
+class AntiAtroposEnv(
+    EnvClient[SREAction, ClusterObservation, State]
+):
+    """
+    Client for the AntiAtropos Environment.
+    This client maintains a persistent WebSocket connection to the environment server,
+    enabling efficient multi-step interactions with lower latency.
+    Each client instance has its own dedicated environment session on the server.
+    Example:
+        >>> # Connect to a running server
+        >>> with AntiAtroposEnv(base_url="http://localhost:8000") as client:
+        ...     result = client.reset()
+        ...     print(result.observation.average_latency_ms)
+        ...
+        ...     action = SREAction(action_type="SCALE_UP", target_node_id="node-0", parameter=2.0)
+        ...     result = client.step(action)
+        ...     print(result.observation.lyapunov_energy)
+    Example with Docker:
+        >>> # Automatically start container and connect
+        >>> client = AntiAtroposEnv.from_docker_image("AntiAtropos-env:latest")
+        >>> try:
+        ...     result = client.reset()
+        ...     result = client.step(SREAction(action_type="NO_OP"))
+        ... finally:
+        ...     client.close()
+    """
+    def _step_payload(self, action: SREAction) -> Dict:
+        """
+        Convert SREAction to JSON payload for step message.
+        Args:
+            action: SREAction instance
+        Returns:
+            Dictionary representation suitable for JSON encoding
+        """
+        return {
+            "action_type": action.action_type.value,
+            "target_node_id": action.target_node_id,
+            "parameter": action.parameter,
+        }
+    def _parse_result(self, payload: Dict) -> StepResult[ClusterObservation]:
+        """
+        Parse server response into StepResult[ClusterObservation].
+        Args:
+            payload: JSON response data from server
+        Returns:
+            StepResult with ClusterObservation
+        """
+        obs_data = payload.get("observation", {})
+        # Parse per-node list into NodeObservation objects
+        raw_nodes = obs_data.get("nodes", [])
+        node_obs = [
+            NodeObservation(
+                node_id=n.get("node_id", ""),
+                status=NodeStatus(n.get("status", NodeStatus.HEALTHY)),
+                is_vip=n.get("is_vip", False),
+                queue_depth=n.get("queue_depth", 0),
+                latency_ms=n.get("latency_ms", 0.0),
+                incoming_request_rate=n.get("incoming_request_rate", 0.0),
+                cpu_utilization=n.get("cpu_utilization", 0.0),
+                importance_weight=n.get("importance_weight", 1.0),
+                done=n.get("done", False),
+                reward=n.get("reward", 0.0),
+            )
+            for n in raw_nodes
+        ]
+        observation = ClusterObservation(
+            cluster_id=obs_data.get("cluster_id", ""),
+            task_id=obs_data.get("task_id", "task-1"),
+            mode=obs_data.get("mode", "simulated"),
+            active_nodes=obs_data.get("active_nodes", 0),
+            average_latency_ms=obs_data.get("average_latency_ms", 0.0),
+            error_rate=obs_data.get("error_rate", 0.0),
+            total_queue_backlog=obs_data.get("total_queue_backlog", 0),
+            current_cost_per_hour=obs_data.get("current_cost_per_hour", 0.0),
+            lyapunov_energy=obs_data.get("lyapunov_energy", 0.0),
+            nodes=node_obs,
+            step=obs_data.get("step", 0),
+            max_steps=obs_data.get("max_steps", 100),
+            sla_violations=obs_data.get("sla_violations", 0),
+            invalid_action_count=obs_data.get("invalid_action_count", 0),
+            vip_failure_count=obs_data.get("vip_failure_count", 0),
+            metric_timestamp=obs_data.get("metric_timestamp", 0.0),
+            data_freshness_ms=obs_data.get("data_freshness_ms", 0),
+            action_ack_status=obs_data.get("action_ack_status", "success"),
+            choke_level=obs_data.get("choke_level", 0.0),
+            raw_reward=obs_data.get("raw_reward", 0.0),
+            normalized_reward=obs_data.get("normalized_reward", 0.0),
+            reward_scale_version=obs_data.get("reward_scale_version", "sigmoid-v1"),
+            done=payload.get("done", False),
+            reward=payload.get("reward", 0.0),
+        )
+        return StepResult(
+            observation=observation,
+            reward=payload.get("reward", 0.0),
+            done=payload.get("done", False),
+        )
+    def _parse_state(self, payload: Dict) -> State:
+        """
+        Parse server response into State object.
+        Args:
+            payload: JSON response from state request
+        Returns:
+            State object with episode_id and step_count
+        """
+        return State(
+            episode_id=payload.get("episode_id"),
+            step_count=payload.get("step_count", 0),
+        )

control/__init__.py CHANGED

	@@ -1,2 +1,2 @@
-from .kubernetes_executor import KubernetesExecutor
-from .validation import ActionValidator
+from .kubernetes_executor import KubernetesExecutor
+from .validation import ActionValidator

control/kubernetes_executor.py CHANGED

@@ -1,230 +1,396 @@
-import os
-import json
-import time
-from uuid import uuid4
-from typing import Optional
-class KubernetesExecutor:
-    """
-    Executes high-level SRE actions on a Kubernetes cluster.
-    Provides a safe layer between SREAgent and actual infrastructure.
-    """
-    def __init__(self, kubeconfig: Optional[str] = None):
-        # Use provided path or env var, defaulting to mock if neither is found
-        self.kubeconfig = kubeconfig or os.getenv("KUBECONFIG")
-        self.is_mock = not self.kubeconfig or self.kubeconfig.lower() == "mock"
-        self.namespace = os.getenv("ANTIATROPOS_K8S_NAMESPACE", "default")
-        self.min_replicas = int(os.getenv("ANTIATROPOS_MIN_REPLICAS", "1"))
-        self.max_replicas = int(os.getenv("ANTIATROPOS_MAX_REPLICAS", "20"))
-        self.scale_step = int(os.getenv("ANTIATROPOS_SCALE_STEP", "3"))
-        self._apps_v1_api = None
-        self._node_workload_map = self._load_node_workload_map()
-        self._live_supported_actions = {"NO_OP", "SCALE_UP", "SCALE_DOWN"}
-    @staticmethod
-    def _normalize_action_type(action_type) -> str:
-        if hasattr(action_type, "value"):
-            return str(action_type.value)
-        return str(action_type)
-    def execute(self, action_type: str, target: str, parameter: float) -> str:
-        """
-        Translates SRE actions to Kube requests (ScaleDeployment, PatchIngress, etc.)
-        """
-        return self.execute_with_metadata(action_type, target, parameter)["ack_status"]
-    def execute_with_metadata(self, action_type: str, target: str, parameter: float) -> dict:
-        """
-        Execute action and return acknowledgement plus executor metadata.
-        """
-        action_id = str(uuid4())
-        started = time.perf_counter()
-        ack_status = ""
-        error_code = ""
-        if self.is_mock:
-            ack_status = self._mock_execution(action_type, target, parameter)
-        else:
-            try:
-                ack_status = self._real_execution(action_type, target, parameter)
-            except Exception as e:
-                ack_status = f"Error: Failed to execute {action_type} on {target}: {str(e)}"
-                error_code = "EXECUTION_ERROR"
-        if ack_status.startswith("Rejected:") and not error_code:
-            error_code = "REJECTED_ACTION"
-        elif ack_status.startswith("Error:") and not error_code:
-            error_code = "EXECUTION_ERROR"
-        latency_ms = (time.perf_counter() - started) * 1000.0
-        return {
-            "action_id": action_id,
-            "ack_status": ack_status,
-            "executor_latency_ms": latency_ms,
-            "executor_error_code": error_code,
-        }
-    def live_enabled_actions(self) -> set[str]:
-        """Action types that are actually executable in real live mode."""
-        if self.is_mock:
-            return {"NO_OP"}
-        return set(self._live_supported_actions)
-    def live_capability_error(self, action_type: str) -> Optional[str]:
-        """Returns reason when action is not runnable in live mode, else None."""
-        action = self._normalize_action_type(action_type)
-        if action not in self.live_enabled_actions():
-            if self.is_mock:
-                return (
-                    f"Live mode rejected {action}: no real Kubernetes executor is configured "
-                    "(set KUBECONFIG and ANTIATROPOS_WORKLOAD_MAP)."
-                )
-            return f"Live mode rejected {action}: no executor is enabled for this action."
-        return None
-    def _real_execution(self, action_type: str, target: str, parameter: float) -> str:
-        """Execute bounded actions on a Kubernetes cluster."""
-        action = self._normalize_action_type(action_type)
-        if action == "NO_OP":
-            return "Ack: NO_OP - no cluster mutation"
-        if action in ("SCALE_UP", "SCALE_DOWN"):
-            return self._scale_deployment(action, target, parameter)
-        return f"Rejected: {action} is not enabled for live Kubernetes execution"
-    def _mock_execution(self, action_type: str, target: str, parameter: float) -> str:
-        """Returns mock acknowledgement for actions."""
-        # TODO: Add realistic latency simulation for K8s control plane
-        action = self._normalize_action_type(action_type)
-        return f"Ack: {action} for {target} with value {parameter} - Status: Applied"
-    def _scale_deployment(self, action_type: str, target: str, parameter: float) -> str:
-        namespace, deployment_name = self._resolve_workload_target(target)
-        apps_v1 = self._get_apps_v1_api()
-        scale_obj = apps_v1.read_namespaced_deployment_scale(
-            name=deployment_name,
-            namespace=namespace,
-        )
-        current = int(scale_obj.spec.replicas or self.min_replicas)
-        delta = max(1, int(float(parameter) * self.scale_step))
-        if action_type == "SCALE_UP":
-            desired = min(self.max_replicas, current + delta)
-        else:
-            desired = max(self.min_replicas, current - delta)
-        if desired == current:
-            return (
-                f"Ack: {action_type} for {target} - replicas unchanged at {current} "
-                f"(bounds {self.min_replicas}-{self.max_replicas})"
-            )
-        apps_v1.patch_namespaced_deployment_scale(
-            name=deployment_name,
-            namespace=namespace,
-            body={"spec": {"replicas": desired}},
-        )
-        return (
-            f"Ack: {action_type} for {target} - deployment {deployment_name} "
-            f"in namespace {namespace} scaled {current}->{desired}"
-        )
-    def _get_apps_v1_api(self):
-        if self._apps_v1_api is not None:
-            return self._apps_v1_api
-        from kubernetes import client, config
-        if self.kubeconfig and self.kubeconfig.lower() not in ("mock", ""):
-            config.load_kube_config(config_file=self.kubeconfig)
-        else:
-            config.load_incluster_config()
-        self._apps_v1_api = client.AppsV1Api()
-        return self._apps_v1_api
-    def _load_node_workload_map(self) -> dict[str, dict[str, str]]:
-        """
-        Load node->workload mapping.
-        Preferred format (ANTIATROPOS_WORKLOAD_MAP):
-        {
-          "node-0": {"deployment": "payments", "namespace": "prod-sre"},
-          "node-1": {"deployment": "checkout"}
-        }
-        Legacy fallback (ANTIATROPOS_NODE_DEPLOYMENT_MAP):
-        {
-          "node-0": "payments",
-          "node-1": "checkout"
-        }
-        """
-        raw = os.getenv("ANTIATROPOS_WORKLOAD_MAP", "")
-        if raw:
-            parsed = self._parse_json_mapping(raw)
-            if parsed is not None:
-                return parsed
-        legacy_raw = os.getenv("ANTIATROPOS_NODE_DEPLOYMENT_MAP", "")
-        if legacy_raw:
-            legacy = self._parse_legacy_mapping(legacy_raw)
-            if legacy is not None:
-                return legacy
-        return {}
-    def _parse_json_mapping(self, raw: str) -> Optional[dict[str, dict[str, str]]]:
-        try:
-            data = json.loads(raw)
-        except json.JSONDecodeError:
-            return None
-        if not isinstance(data, dict):
-            return None
-        out: dict[str, dict[str, str]] = {}
-        for node_id, workload in data.items():
-            if not isinstance(workload, dict):
-                return None
-            deployment = workload.get("deployment")
-            if not deployment:
-                return None
-            namespace = workload.get("namespace", self.namespace)
-            out[str(node_id)] = {
-                "deployment": str(deployment),
-                "namespace": str(namespace),
-            }
-        return out
-    def _parse_legacy_mapping(self, raw: str) -> Optional[dict[str, dict[str, str]]]:
-        try:
-            data = json.loads(raw)
-        except json.JSONDecodeError:
-            return None
-        if not isinstance(data, dict):
-            return None
-        out: dict[str, dict[str, str]] = {}
-        for node_id, deployment in data.items():
-            if not deployment:
-                return None
-            out[str(node_id)] = {
-                "deployment": str(deployment),
-                "namespace": self.namespace,
-            }
-        return out
-    def _resolve_workload_target(self, target: str) -> tuple[str, str]:
-        if target not in self._node_workload_map:
-            raise ValueError(
-                f"Missing workload mapping for target '{target}'. "
-                "Set ANTIATROPOS_WORKLOAD_MAP with node->deployment bindings."
-            )
-        workload = self._node_workload_map[target]
-        return workload["namespace"], workload["deployment"]

+import os
+import json
+import time
+import logging
+import requests
+from uuid import uuid4
+from typing import Optional
+logger = logging.getLogger("kubernetes_executor")
+class KubernetesExecutor:
+    """
+    Executes high-level SRE actions on a Kubernetes cluster.
+    Provides a safe layer between SREAgent and actual infrastructure.
+    """
+    def __init__(self, kubeconfig: Optional[str] = None):
+        # Use provided path or env var, defaulting to mock if neither is found
+        self.kubeconfig = kubeconfig or os.getenv("KUBECONFIG")
+        self.remote_control_url = os.getenv("ANTIATROPOS_CONTROL_PLANE_URL", "").strip().rstrip("/")
+        self.remote_timeout_s = float(os.getenv("ANTIATROPOS_CONTROL_TIMEOUT_S", "5.0"))
+        self.remote_retry_count = int(os.getenv("ANTIATROPOS_CONTROL_RETRY_COUNT", "2"))
+        self.remote_retry_backoff_s = float(os.getenv("ANTIATROPOS_CONTROL_RETRY_BACKOFF_S", "0.25"))
+        self.is_mock = (
+            not self.remote_control_url
+            and (not self.kubeconfig or self.kubeconfig.lower() == "mock")
+        )
+        self.namespace = os.getenv("ANTIATROPOS_K8S_NAMESPACE", "default")
+        self.min_replicas = int(os.getenv("ANTIATROPOS_MIN_REPLICAS", "1"))
+        self.max_replicas = self._parse_max_replicas(os.getenv("ANTIATROPOS_MAX_REPLICAS"))
+        self.scale_step = int(os.getenv("ANTIATROPOS_SCALE_STEP", "3"))
+        self._apps_v1_api = None
+        self._node_workload_map = self._load_node_workload_map()
+        self._live_supported_actions = {"NO_OP", "SCALE_UP", "SCALE_DOWN"}
+        self.k8s_retry_count = int(os.getenv("ANTIATROPOS_K8S_RETRY_COUNT", "2"))
+        self.k8s_retry_backoff_s = float(os.getenv("ANTIATROPOS_K8S_RETRY_BACKOFF_S", "0.2"))
+    @staticmethod
+    def _parse_max_replicas(raw: Optional[str]) -> Optional[int]:
+        """
+        Parse optional max replicas.
+        Returns:
+          - int when a positive explicit cap is provided
+          - None when scale-up should be unbounded
+        """
+        if raw is None:
+            return None
+        value = str(raw).strip().lower()
+        if value in ("", "none", "unbounded", "inf", "infinite"):
+            return None
+        try:
+            parsed = int(value)
+        except ValueError:
+            return None
+        if parsed <= 0:
+            return None
+        return parsed
+    @staticmethod
+    def _normalize_action_type(action_type) -> str:
+        if hasattr(action_type, "value"):
+            return str(action_type.value)
+        return str(action_type)
+    def execute(self, action_type: str, target: str, parameter: float) -> str:
+        """
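+        Translates SRE actions to Kubernetes requests and returns only the acknowledgement string.
+        In live mode only NO_OP, SCALE_UP and SCALE_DOWN are executed; other actions are rejected.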
+        """
+        return self.execute_with_metadata(action_type, target, parameter)["ack_status"]
+    def execute_with_metadata(self, action_type: str, target: str, parameter: float) -> dict:
+        """
+        Execute action and return acknowledgement plus executor metadata.
+        """
+        action_id = str(uuid4())
+        started = time.perf_counter()
+        ack_status = ""
+        error_code = ""
+        if self.is_mock:
+            ack_status = self._mock_execution(action_type, target, parameter)
+        else:
+            try:
+                ack_status = self._real_execution(action_type, target, parameter)
+            except Exception as e:
+                logger.error(f"Execution failed for {action_type} on {target}: {str(e)}")
+                ack_status = f"Error: Failed to execute {action_type} on {target}: {str(e)}"
+                error_code = "EXECUTION_ERROR"
+        if ack_status.startswith("Rejected:") and not error_code:
+            error_code = "REJECTED_ACTION"
+        elif ack_status.startswith("Error:") and not error_code:
+            error_code = "EXECUTION_ERROR"
+        latency_ms = (time.perf_counter() - started) * 1000.0
+        return {
+            "action_id": action_id,
+            "ack_status": ack_status,
+            "executor_latency_ms": latency_ms,
+            "executor_error_code": error_code,
+        }
+    def live_enabled_actions(self) -> set[str]:
+        """Action types that are actually executable in real live mode."""
+        if self.is_mock:
+            return {"NO_OP"}
+        return set(self._live_supported_actions)
+    def live_capability_error(self, action_type: str) -> Optional[str]:
+        """Returns reason when action is not runnable in live mode, else None."""
+        action = self._normalize_action_type(action_type)
+        if action not in self.live_enabled_actions():
+            if self.is_mock:
+                return (
+                    f"Live mode rejected {action}: no real Kubernetes executor is configured "
+                    "(set KUBECONFIG and ANTIATROPOS_WORKLOAD_MAP)."
+                )
+            return f"Live mode rejected {action}: no executor is enabled for this action."
+        return None
+    def _real_execution(self, action_type: str, target: str, parameter: float) -> str:
+        """Execute bounded actions on a Kubernetes cluster."""
+        action = self._normalize_action_type(action_type)
+        if self.remote_control_url:
+            return self._remote_execution(action, target, parameter)
+        if action == "NO_OP":
+            return "Ack: NO_OP - no cluster mutation"
+        if action in ("SCALE_UP", "SCALE_DOWN"):
+            return self._scale_deployment(action, target, parameter)
+        return f"Rejected: {action} is not enabled for live Kubernetes execution"
+    def _mock_execution(self, action_type: str, target: str, parameter: float) -> str:
+        """Returns mock acknowledgement for actions."""
+        # TODO: Add realistic latency simulation for K8s control plane
+        action = self._normalize_action_type(action_type)
+        return f"Ack: {action} for {target} with value {parameter} - Status: Applied"
+    def _scale_deployment(self, action_type: str, target: str, parameter: float) -> str:
+        namespace, deployment_name = self._resolve_workload_target(target)
+        apps_v1 = self._get_apps_v1_api()
+        scale_obj = apps_v1.read_namespaced_deployment_scale(
+            name=deployment_name,
+            namespace=namespace,
+        )
+        current = int(scale_obj.spec.replicas or self.min_replicas)
+        delta = max(1, int(float(parameter) * self.scale_step))
+        if action_type == "SCALE_UP":
+            if self.max_replicas is None:
+                desired = current + delta
+            else:
+                desired = min(self.max_replicas, current + delta)
+        else:
+            desired = max(self.min_replicas, current - delta)
+        if desired == current:
+            upper = "unbounded" if self.max_replicas is None else str(self.max_replicas)
+            return (
+                f"Ack: {action_type} for {target} - replicas unchanged at {current} "
+                f"(bounds {self.min_replicas}-{upper})"
+            )
+        self._patch_deployment_scale_with_retry(
+            apps_v1=apps_v1,
+            deployment_name=deployment_name,
+            namespace=namespace,
+            desired=desired,
+        )
+        return (
+            f"Ack: {action_type} for {target} - deployment {deployment_name} "
+            f"in namespace {namespace} scaled {current}->{desired}"
+        )
+    def _patch_deployment_scale_with_retry(self, apps_v1, deployment_name: str, namespace: str, desired: int) -> None:
+        """
+        Patch deployment replicas with retries for transient API server errors.
+        """
+        from kubernetes.client.rest import ApiException
+        max_attempts = max(1, self.k8s_retry_count + 1)
+        for attempt in range(1, max_attempts + 1):
+            try:
+                apps_v1.patch_namespaced_deployment_scale(
+                    name=deployment_name,
+                    namespace=namespace,
+                    body={"spec": {"replicas": desired}},
+                )
+                return
+            except ApiException as exc:
+                retryable = exc.status in (409, 429, 500, 502, 503, 504)
+                if (not retryable) or attempt >= max_attempts:
+                    raise
+                sleep_s = self.k8s_retry_backoff_s * (2 ** (attempt - 1))
+                logger.warning(
+                    "Retrying deployment scale patch after ApiException status=%s attempt=%s/%s",
+                    exc.status,
+                    attempt,
+                    max_attempts,
+                )
+                time.sleep(sleep_s)
+    def _remote_execution(self, action: str, target: str, parameter: float) -> str:
+        """
+        Delegate action execution to a remote FastAPI control plane.
+        Expected remote endpoint contract:
+          - POST /step
+          - Request: {action_type, target_node_id, parameter}
+          - Success response includes ack_status and starts with "Ack:"
+        This contract matches server.local_laptop_control and is the only
+        supported remote control-plane format.
+        """
+        if not self.remote_control_url:
+            raise ValueError("ANTIATROPOS_CONTROL_PLANE_URL is not configured")
+        endpoint = f"{self.remote_control_url}/step"
+        action_payload = {
+            "action_type": action,
+            "target_node_id": target,
+            "parameter": float(parameter),
+        }
+        payload = action_payload
+        response = self._post_with_retry(endpoint=endpoint, payload=payload)
+        if response.status_code >= 400:
+            detail = ""
+            try:
+                body = response.json()
+                detail = str(body.get("detail", body))
+            except Exception:
+                detail = response.text.strip()
+            if response.status_code == 422 and "action" in detail:
+                detail = (
+                    f"{detail}. Expected lightweight control-plane contract at "
+                    f"{endpoint}: "
+                    '{"action_type":"SCALE_UP","target_node_id":"node-0","parameter":1.0}'
+                )
+            raise RuntimeError(
+                f"Remote control-plane rejected action ({response.status_code}): {detail}"
+            )
+        try:
+            data = response.json()
+        except Exception as exc:
+            raise RuntimeError("Remote control-plane returned non-JSON response") from exc
+        ack = str(data.get("ack_status", "")).strip()
+        if not ack:
+            action_id = str(data.get("action_id", "")).strip() or "remote"
+            return f"Ack: {action} for {target} via remote control-plane ({action_id})"
+        return ack
+    def _post_with_retry(self, endpoint: str, payload: dict) -> requests.Response:
+        """
+        POST helper with retries for transient HTTP/network failures.
+        """
+        max_attempts = max(1, self.remote_retry_count + 1)
+        last_exc: Optional[Exception] = None
+        for attempt in range(1, max_attempts + 1):
+            try:
+                response = requests.post(endpoint, json=payload, timeout=self.remote_timeout_s)
+            except requests.RequestException as exc:
+                last_exc = exc
+                if attempt >= max_attempts:
+                    break
+                sleep_s = self.remote_retry_backoff_s * (2 ** (attempt - 1))
+                logger.warning(
+                    "Retrying remote control-plane POST after network error attempt=%s/%s: %s",
+                    attempt,
+                    max_attempts,
+                    exc,
+                )
+                time.sleep(sleep_s)
+                continue
+            if response.status_code >= 500 and attempt < max_attempts:
+                sleep_s = self.remote_retry_backoff_s * (2 ** (attempt - 1))
+                logger.warning(
+                    "Retrying remote control-plane POST after HTTP %s attempt=%s/%s",
+                    response.status_code,
+                    attempt,
+                    max_attempts,
+                )
+                time.sleep(sleep_s)
+                continue
+            return response
+        if last_exc is not None:
+            raise RuntimeError(f"Remote control-plane request failed: {last_exc}") from last_exc
+        raise RuntimeError("Remote control-plane request failed after retries")
+    def _get_apps_v1_api(self):
+        if self._apps_v1_api is not None:
+            return self._apps_v1_api
+        from kubernetes import client, config
+        if self.kubeconfig and self.kubeconfig.lower() not in ("mock", ""):
+            config.load_kube_config(config_file=self.kubeconfig)
+        else:
+            config.load_incluster_config()
+        self._apps_v1_api = client.AppsV1Api()
+        return self._apps_v1_api
+    def _load_node_workload_map(self) -> dict[str, dict[str, str]]:
+        """
+        Load node->workload mapping.
+        Preferred format (ANTIATROPOS_WORKLOAD_MAP):
+        {
+          "node-0": {"deployment": "payments", "namespace": "prod-sre"},
+          "node-1": {"deployment": "checkout"}
+        }
+        Legacy fallback (ANTIATROPOS_NODE_DEPLOYMENT_MAP):
+        {
+          "node-0": "payments",
+          "node-1": "checkout"
+        }
+        """
+        raw = os.getenv("ANTIATROPOS_WORKLOAD_MAP", "")
+        if raw:
+            parsed = self._parse_json_mapping(raw)
+            if parsed is not None:
+                return parsed
+        legacy_raw = os.getenv("ANTIATROPOS_NODE_DEPLOYMENT_MAP", "")
+        if legacy_raw:
+            legacy = self._parse_legacy_mapping(legacy_raw)
+            if legacy is not None:
+                return legacy
+        return {}
+    def _parse_json_mapping(self, raw: str) -> Optional[dict[str, dict[str, str]]]:
+        try:
+            data = json.loads(raw)
+        except json.JSONDecodeError:
+            return None
+        if not isinstance(data, dict):
+            return None
+        out: dict[str, dict[str, str]] = {}
+        for node_id, workload in data.items():
+            if not isinstance(workload, dict):
+                return None
+            deployment = workload.get("deployment")
+            if not deployment:
+                return None
+            namespace = workload.get("namespace", self.namespace)
+            out[str(node_id)] = {
+                "deployment": str(deployment),
+                "namespace": str(namespace),
+            }
+        return out
+    def _parse_legacy_mapping(self, raw: str) -> Optional[dict[str, dict[str, str]]]:
+        try:
+            data = json.loads(raw)
+        except json.JSONDecodeError:
+            return None
+        if not isinstance(data, dict):
+            return None
+        out: dict[str, dict[str, str]] = {}
+        for node_id, deployment in data.items():
+            if not deployment:
+                return None
+            out[str(node_id)] = {
+                "deployment": str(deployment),
+                "namespace": self.namespace,
+            }
+        return out
+    def _resolve_workload_target(self, target: str) -> tuple[str, str]:
+        if target not in self._node_workload_map:
+            raise ValueError(
+                f"Missing workload mapping for target '{target}'. "
+                "Set ANTIATROPOS_WORKLOAD_MAP with node->deployment bindings."
+            )
+        workload = self._node_workload_map[target]
+        return workload["namespace"], workload["deployment"]
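
Reviewer sketch (not part of the commit): the replica arithmetic in `_scale_deployment` and the sleep schedule in `_patch_deployment_scale_with_retry` can be checked without a cluster. The helper names `desired_replicas` and `backoff_schedule` are hypothetical and exist only for illustration; the defaults mirror the environment-variable defaults shown in the diff above.

```python
from typing import List, Optional


def desired_replicas(
    action: str,
    current: int,
    parameter: float,
    scale_step: int = 3,            # default of ANTIATROPOS_SCALE_STEP
    min_replicas: int = 1,          # default of ANTIATROPOS_MIN_REPLICAS
    max_replicas: Optional[int] = None,  # None means unbounded scale-up
) -> int:
    """Hypothetical helper mirroring the bounds logic in _scale_deployment."""
    # The step is never smaller than one replica, even for parameter=0.0.
    delta = max(1, int(float(parameter) * scale_step))
    if action == "SCALE_UP":
        target = current + delta
        return target if max_replicas is None else min(max_replicas, target)
    # Any other action is treated as SCALE_DOWN, clamped at the floor.
    return max(min_replicas, current - delta)


def backoff_schedule(retry_count: int, base_s: float) -> List[float]:
    """Sleeps between attempts: base_s * 2**(attempt - 1), one per retry."""
    return [base_s * (2 ** i) for i in range(max(0, retry_count))]


if __name__ == "__main__":
    print(desired_replicas("SCALE_UP", current=2, parameter=0.6))
    print(desired_replicas("SCALE_DOWN", current=2, parameter=1.0))
    print(backoff_schedule(retry_count=2, base_s=0.2))
```

With the defaults, a `parameter` below roughly 0.67 always moves the deployment by exactly one replica, because `int()` truncates the product before `max(1, ...)` is applied.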

control/validation.py CHANGED

@@ -1,38 +1,69 @@
-from typing import List, Optional
-class ActionValidator:
-    """
-    Validates SRE actions to ensure they stay within safety boundaries.
-    Prevents destructive operations like 100% shedding on critical nodes.
-    """
-    def __init__(self, critical_nodes: Optional[List[str]] = None):
-        self.critical_nodes = critical_nodes or ["node-0", "node-1", "node-2"]
-    def validate(self, action_type: str, target: str, parameter: float, valid_targets: Optional[List[str]] = None) -> (bool, str):
-        """
-        Returns (is_valid, error_message).
-        """
-        if hasattr(action_type, "value"):
-            action = str(action_type.value)
-        else:
-            action = str(action_type)
-        if valid_targets is not None and target not in valid_targets:
-            return False, f"Unknown target node: {target}"
-        if action == "SHED_LOAD" and target in self.critical_nodes:
-            return False, f"Forbidden: Load shedding on critical node {target}."
-        if action in ["SCALE_UP", "SCALE_DOWN"]:
-            if parameter < 0.0:
-                return False, "Negative scaling parameters are not allowed."
-            if parameter > 10.0:
-                return False, "Scaling parameter must be <= 10.0."
-        if action in ["REROUTE_TRAFFIC", "SHED_LOAD"] and not (0.0 <= parameter <= 1.0):
-            return False, f"{action} parameter must be in [0.0, 1.0]."
-        if action == "NO_OP" and parameter != 0.0:
-            return False, "NO_OP requires parameter=0.0."
-        return True, "Success"

+from typing import List, Optional, Tuple
+class ActionValidator:
+    """
+    Validates SRE actions to ensure they stay within safety boundaries.
+    Prevents destructive operations like 100% shedding on critical nodes.
+    Implements soft cooldown for scaling actions: instead of hard-rejecting
+    a rapid re-scale, the action passes with a penalty signal. The environment
+    can use this penalty to reduce the reward, teaching the agent to wait
+    without blocking emergency scaling.
+    """
+    def __init__(self, critical_nodes: Optional[List[str]] = None, cooldown_ticks: int = 3):
+        self.critical_nodes = critical_nodes or ["node-0", "node-1", "node-2"]
+        self.cooldown_ticks = cooldown_ticks
+        # Track last scale action per node: {node_id: (tick, action_type)}
+        self._last_scale: dict[str, Tuple[int, str]] = {}
+        self._current_tick: int = 0
+    def set_tick(self, tick: int) -> None:
+        """Update the current tick counter for cooldown tracking."""
+        self._current_tick = tick
+    def validate(self, action_type: str, target: str, parameter: float, valid_targets: Optional[List[str]] = None) -> Tuple[bool, str, float]:
+        """
+        Returns (is_valid, error_message, cooldown_penalty).
+        cooldown_penalty is in [0, 1]:
+          0.0 = no penalty (action is fine)
+          >0  = soft penalty for rapid re-scaling (action still executes)
+        Hard violations (critical shed, out-of-range) still reject with penalty=0.
+        """
+        if hasattr(action_type, "value"):
+            action = str(action_type.value)
+        else:
+            action = str(action_type)
+        cooldown_penalty = 0.0
+        if valid_targets is not None and target not in valid_targets:
+            return False, f"Unknown target node: {target}", 0.0
+        if action == "SHED_LOAD" and target in self.critical_nodes:
+            return False, f"Forbidden: Load shedding on critical node {target}.", 0.0
+        if action in ["SCALE_UP", "SCALE_DOWN"]:
+            if parameter < 0.0:
+                return False, "Negative scaling parameters are not allowed.", 0.0
+            if parameter > 10.0:
+                return False, "Scaling parameter must be <= 10.0.", 0.0
+            # Soft cooldown: penalize but don't block rapid re-scaling.
+            # Only repeating the same action on the same node inside the window is penalized;
+            # the window is fixed at cooldown_ticks (no node-state-based shortening is implemented here).
+            last_tick, last_action = self._last_scale.get(target, (0, ""))
+            ticks_since = self._current_tick - last_tick
+            if ticks_since < self.cooldown_ticks and last_action == action:
+                # Penalty decays linearly: full penalty at 0 ticks, 0 at cooldown_ticks
+                cooldown_penalty = (self.cooldown_ticks - ticks_since) / self.cooldown_ticks
+                # Don't reject — just flag the penalty
+            self._last_scale[target] = (self._current_tick, action)
+        if action in ["REROUTE_TRAFFIC", "SHED_LOAD"] and not (0.0 <= parameter <= 1.0):
+            return False, f"{action} parameter must be in [0.0, 1.0].", 0.0
+        if action == "NO_OP" and parameter != 0.0:
+            return False, "NO_OP requires parameter=0.0.", 0.0
+        return True, "Success", cooldown_penalty
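
Reviewer sketch (not part of the commit): the linear decay of the soft cooldown penalty is easier to see in isolation. The function `soft_cooldown_penalty` below is a hypothetical standalone copy of the scaling branch of `validate`, assuming the same `(tick, action)` bookkeeping per node and the default `cooldown_ticks=3`.

```python
from typing import Dict, Tuple


def soft_cooldown_penalty(
    last_scale: Dict[str, Tuple[int, str]],
    target: str,
    action: str,
    current_tick: int,
    cooldown_ticks: int = 3,
) -> float:
    """Return the penalty in [0, 1] and record this action for the node."""
    last_tick, last_action = last_scale.get(target, (0, ""))
    ticks_since = current_tick - last_tick
    penalty = 0.0
    # Only a repeat of the same action inside the window is penalized.
    if ticks_since < cooldown_ticks and last_action == action:
        penalty = (cooldown_ticks - ticks_since) / cooldown_ticks
    last_scale[target] = (current_tick, action)
    return penalty


if __name__ == "__main__":
    history: Dict[str, Tuple[int, str]] = {}
    for tick, action in [(10, "SCALE_UP"), (11, "SCALE_UP"), (12, "SCALE_DOWN"), (16, "SCALE_DOWN")]:
        print(tick, action, soft_cooldown_penalty(history, "node-0", action, tick))
```

Switching direction (for example `SCALE_UP` followed immediately by `SCALE_DOWN`) carries no penalty under this rule, so the penalty discourages repeated scaling in one direction rather than oscillation.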

curriculum.py ADDED

	@@ -0,0 +1,131 @@

+"""
+AntiAtropos Curriculum Training.
+Defines progressive difficulty stages that the agent must pass before advancing.
+Failed stages are retried with higher temperature for exploration.
+Each stage specifies:
+- task: Which task to run
+- max_steps: Episode length (shorter = easier)
+- pass_threshold: Minimum composite score to advance
+- temperature: Suggested LLM temperature for this stage
+- description: Human-readable label
+"""
+from dataclasses import dataclass
+from typing import List, Optional
+@dataclass
+class CurriculumStage:
+    """A single stage in the training curriculum."""
+    task: str
+    max_steps: int
+    pass_threshold: float
+    temperature: float = 0.0
+    description: str = ""
+    retries: int = 0  # Number of failed attempts so far
+    max_retries: int = 3  # Max retries before advancing anyway
+    @property
+    def retry_temperature(self) -> float:
+        """Temperature increases with retries to encourage exploration."""
+        if self.retries == 0:
+            return self.temperature
+        # Adds 0.3 per failed attempt, capped at 1.0
+        # (0.3, 0.6, 0.9 when the base temperature is 0.0)
+        return min(1.0, self.temperature + self.retries * 0.3)
+    @property
+    def should_skip(self) -> bool:
+        """Skip this stage if too many retries."""
+        return self.retries >= self.max_retries
+# Progressive curriculum: start easy, add complexity
+CURRICULUM: List[CurriculumStage] = [
+    CurriculumStage(
+        task="task-1", max_steps=40, pass_threshold=0.40,
+        temperature=0.0, description="Short ramp — learn basic scaling",
+    ),
+    CurriculumStage(
+        task="task-1", max_steps=60, pass_threshold=0.50,
+        temperature=0.0, description="Standard ramp — scale proactively",
+    ),
+    CurriculumStage(
+        task="task-1", max_steps=100, pass_threshold=0.55,
+        temperature=0.0, description="Full ramp — cost-aware scaling",
+    ),
+    CurriculumStage(
+        task="task-2", max_steps=40, pass_threshold=0.35,
+        temperature=0.0, description="Short fault — learn reroute/scale on failure",
+    ),
+    CurriculumStage(
+        task="task-2", max_steps=60, pass_threshold=0.45,
+        temperature=0.3, description="Standard fault — fast recovery",
+    ),
+    CurriculumStage(
+        task="task-3", max_steps=40, pass_threshold=0.35,
+        temperature=0.0, description="Short surge — protect VIP during spike",
+    ),
+    CurriculumStage(
+        task="task-3", max_steps=60, pass_threshold=0.45,
+        temperature=0.3, description="Standard surge — sustained VIP protection",
+    ),
+    # Final combined test
+    CurriculumStage(
+        task="task-1", max_steps=100, pass_threshold=0.55,
+        temperature=0.0, description="Final: full ramp at low temp",
+    ),
+    CurriculumStage(
+        task="task-2", max_steps=60, pass_threshold=0.50,
+        temperature=0.0, description="Final: fault recovery at low temp",
+    ),
+    CurriculumStage(
+        task="task-3", max_steps=60, pass_threshold=0.50,
+        temperature=0.0, description="Final: surge protection at low temp",
+    ),
+]
+class CurriculumTracker:
+    """Tracks progress through the curriculum stages."""
+    def __init__(self, stages: Optional[List[CurriculumStage]] = None):
+        self._stages = stages or CURRICULUM
+        self._current_idx: int = 0
+    @property
+    def current(self) -> CurriculumStage:
+        return self._stages[self._current_idx]
+    @property
+    def current_index(self) -> int:
+        return self._current_idx
+    @property
+    def total_stages(self) -> int:
+        return len(self._stages)
+    @property
+    def is_complete(self) -> bool:
+        return self._current_idx >= len(self._stages)
+    def report_score(self, score: float) -> bool:
+        """Report a score for the current stage. Returns True if passed."""
+        if score >= self.current.pass_threshold:
+            self._current_idx += 1
+            return True
+        else:
+            self.current.retries += 1
+            if self.current.should_skip:
+                self._current_idx += 1
+            return False
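+    def progress_summary(self) -> str:
+        # Guard: `current` would raise IndexError once every stage is done.
+        if self.is_complete:
+            return f"Curriculum complete ({self.total_stages}/{self.total_stages} stages)"
+        stage = self.current
+        return (
+            f"Stage {self._current_idx + 1}/{self.total_stages}: "
+            f"{stage.description} "
+            f"(task={stage.task}, max_steps={stage.max_steps}, "
+            f"threshold={stage.pass_threshold}, retries={stage.retries})"
+        )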
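
Reviewer sketch (not part of the commit): the interaction between `retry_temperature`, `should_skip` and `report_score` determines which temperatures a stage is ever run at. The functions below are hypothetical standalone copies of that logic, assuming the 0.3 increment and the default `max_retries=3` from the diff above.

```python
from typing import List


def retry_temperature(base: float, retries: int) -> float:
    """Base temperature on the first attempt, then +0.3 per retry, capped at 1.0."""
    if retries == 0:
        return base
    return min(1.0, base + retries * 0.3)


def temperatures_until_skip(base: float, max_retries: int = 3) -> List[float]:
    """Temperatures used if every attempt fails, until the stage is skipped."""
    temps: List[float] = []
    retries = 0
    # report_score increments retries after each failure and advances the
    # tracker as soon as retries >= max_retries, so the loop stops there.
    while retries < max_retries:
        temps.append(retry_temperature(base, retries))
        retries += 1
    return temps


if __name__ == "__main__":
    print(temperatures_until_skip(0.0))
    print(temperatures_until_skip(0.3))
```

Under these assumptions a stage is attempted at most `max_retries` times, so the temperature for `retries == max_retries` is computed by the property but never used for a run.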

deploy-local.ps1 ADDED

	@@ -0,0 +1,91 @@

+# AntiAtropos Local Cluster Deploy
+# Deploys workloads, Prometheus, and Grafana on the Kind cluster.
+# Grafana port-forward starts automatically at the end.
+param(
+    [switch]$SkipPortForward,
+    [int]$GrafanaPort = 3000
+)
+Write-Host "=== AntiAtropos Local Deploy ===" -ForegroundColor Cyan
+Write-Host ""
+# --- 1. Check cluster ---
+Write-Host "[1/5] Checking Kind cluster..." -ForegroundColor Yellow
+$cluster = kubectl config current-context 2>$null
+if ($cluster -notmatch "antiatropos") {
+    Write-Host "WARNING: Current context is '$cluster', expected 'kind-antiatropos-local'. Proceed anyway? [Y/n]"
+    $r = Read-Host
+    if ($r -eq 'n') { exit 1 }
+}
+# --- 2. Deploy workload pods ---
+Write-Host "[2/5] Deploying workload pods..." -ForegroundColor Yellow
+kubectl create ns prod-sre 2>&1 | Out-Null
+kubectl create ns monitoring 2>&1 | Out-Null
+kubectl apply -f "$PSScriptRoot\deploy\local-laptop.yaml"
+Write-Host "  Waiting for workloads to be ready..."
+kubectl wait --for=condition=ready pod -l app -n prod-sre --timeout=120s 2>$null
+Write-Host "  Workloads ready."
+# --- 3. Deploy Prometheus ---
+Write-Host "[3/5] Deploying Prometheus..." -ForegroundColor Yellow
+$promRelease = helm list -n monitoring -q 2>$null | Select-String "prometheus"
+if ($promRelease) {
+    helm upgrade prometheus prometheus-community/prometheus -n monitoring -f "$PSScriptRoot\deploy\prometheus-helm-values.yaml"
+} else {
+    helm install prometheus prometheus-community/prometheus -n monitoring -f "$PSScriptRoot\deploy\prometheus-helm-values.yaml"
+}
+Write-Host "  Waiting for Prometheus server..."
+kubectl wait --for=condition=ready pod -l "app.kubernetes.io/name=prometheus" -n monitoring --timeout=120s 2>$null
+Write-Host "  Prometheus ready."
+# --- 4. Deploy Grafana ---
+Write-Host "[4/5] Deploying Grafana..." -ForegroundColor Yellow
+# Update dashboard ConfigMap
+kubectl delete configmap grafana-dashboards -n monitoring 2>$null
+kubectl create configmap grafana-dashboards -n monitoring --from-file="$PSScriptRoot\deploy\grafana\provisioning\dashboards\json\"
+$grafRelease = helm list -n monitoring -q 2>$null | Select-String "grafana"
+if ($grafRelease) {
+    helm upgrade grafana grafana/grafana -n monitoring -f "$PSScriptRoot\deploy\grafana-helm-values.yaml"
+} else {
+    helm install grafana grafana/grafana -n monitoring -f "$PSScriptRoot\deploy\grafana-helm-values.yaml"
+}
+Write-Host "  Waiting for Grafana..."
+kubectl wait --for=condition=ready pod -l "app.kubernetes.io/name=grafana" -n monitoring --timeout=120s 2>$null
+Write-Host "  Grafana ready."
+# --- 5. Start Grafana port-forward ---
+Write-Host "[5/5] Grafana port-forward..." -ForegroundColor Yellow
+if (-not $SkipPortForward) {
+    # Kill any existing port-forward on the same port
+    $existing = Get-NetTCPConnection -LocalPort $GrafanaPort -ErrorAction SilentlyContinue 2>$null
+    if ($existing) {
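+        # $PID is a read-only automatic variable in PowerShell, so use a different name.
+        $owningPids = $existing.OwningProcess | Select-Object -Unique
+        Stop-Process -Id $owningPids -Force -ErrorAction SilentlyContinue 2>$null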
+        Start-Sleep -Seconds 1
+    }
+    Write-Host "  Starting port-forward on localhost:$GrafanaPort..."
+    $proc = Start-Process -PassThru -NoNewWindow kubectl -ArgumentList "port-forward","-n","monitoring","svc/grafana","${GrafanaPort}:80"
+    Start-Sleep -Seconds 2
+    # Verify the port-forward is alive
+    try {
+        $null = Invoke-WebRequest -Uri "http://localhost:$GrafanaPort/api/health" -UseBasicParsing -TimeoutSec 5
+        Write-Host ""
+        Write-Host "=== Deploy Complete ===" -ForegroundColor Green
+        Write-Host "  Grafana:  http://localhost:$GrafanaPort  (admin / antiatropos)"
+        Write-Host "  Dashboards: AntiAtropos Overview, AntiAtropos Live Control Plane"
+        Write-Host "  Port-forward PID: $($proc.Id)"
+        Write-Host ""
+        Write-Host "To stop port-forward: Stop-Process -Id $($proc.Id)"
+    } catch {
+        Write-Host "WARNING: Port-forward started but Grafana not reachable yet. Try: kubectl port-forward -n monitoring svc/grafana ${GrafanaPort}:80"
+    }
+} else {
+    Write-Host ""
+    Write-Host "=== Deploy Complete ===" -ForegroundColor Green
+    Write-Host "  To access Grafana: kubectl port-forward -n monitoring svc/grafana ${GrafanaPort}:80"
+}

deploy/LOCAL_LAPTOP_FASTAPI_GUIDE.md ADDED

	@@ -0,0 +1,74 @@

+# Local Laptop Kubernetes Control with FastAPI
+This guide uses your local manifest [deploy/local-laptop.yaml](deploy/local-laptop.yaml) and a lightweight server [server/local_laptop_control.py](server/local_laptop_control.py).
+## 1) Deploy local workloads
+```powershell
+kubectl apply -f deploy/local-laptop.yaml
+kubectl get deploy -n prod-sre
+```
+Expected deployments:
+- `auth`
+- `cart`
+- `catalog`
+- `checkout`
+- `payments`
+## 2) Set required environment variables
+The controller requires `KUBECONFIG` and `ANTIATROPOS_WORKLOAD_MAP`.
+```powershell
+$env:KUBECONFIG = "$HOME/.kube/config"
+$env:ANTIATROPOS_K8S_NAMESPACE = "prod-sre"
+$env:ANTIATROPOS_MIN_REPLICAS = "1"
+$env:ANTIATROPOS_MAX_REPLICAS = ""   # empty => unbounded scale-up
+$env:ANTIATROPOS_SCALE_STEP = "3"
+$env:ANTIATROPOS_WORKLOAD_MAP = '{"node-0":{"deployment":"payments","namespace":"prod-sre"},"node-1":{"deployment":"checkout","namespace":"prod-sre"},"node-2":{"deployment":"catalog","namespace":"prod-sre"},"node-3":{"deployment":"cart","namespace":"prod-sre"},"node-4":{"deployment":"auth","namespace":"prod-sre"}}'
+```
+If you already have these in [.env](.env), load them first.
+## 3) Start lightweight FastAPI server
+```powershell
+uvicorn server.local_laptop_control:app --host 0.0.0.0 --port 8010
+```
+## 4) Validate server health
+```powershell
+Invoke-RestMethod http://localhost:8010/health
+```
+Check:
+- `is_mock` should be `False`
+- `mapped_targets` should include `node-0`..`node-4`
+## 5) Let your agent execute actions
+The server accepts `POST /step` with:
+- `action_type`: `NO_OP` | `SCALE_UP` | `SCALE_DOWN`
+- `target_node_id`: `node-*`
+- `parameter`: float
+Example:
+```powershell
+Invoke-RestMethod -Method Post -Uri http://localhost:8010/step -ContentType "application/json" -Body '{"action_type":"SCALE_UP","target_node_id":"node-3","parameter":0.6}'
+```
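The same request can be built from Python. This is a minimal sketch using only the standard library; the helper name `build_step_request` and its validation rules are illustrative assumptions, not part of `server/local_laptop_control.py`. Only the endpoint and the three payload fields come from the guide above.

```python
import json
import urllib.request

# Action names taken from the guide; the server may accept others.
ALLOWED_ACTIONS = {"NO_OP", "SCALE_UP", "SCALE_DOWN"}


def build_step_request(base_url, action_type, target_node_id, parameter):
    """Build (but do not send) a POST /step request. Hypothetical helper."""
    if action_type not in ALLOWED_ACTIONS:
        raise ValueError(f"unsupported action_type: {action_type}")
    if not target_node_id.startswith("node-"):
        raise ValueError(f"target_node_id must look like node-*: {target_node_id}")
    body = json.dumps(
        {
            "action_type": action_type,
            "target_node_id": target_node_id,
            "parameter": float(parameter),
        }
    ).encode("utf-8")
    return urllib.request.Request(
        f"{base_url.rstrip('/')}/step",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_step_request("http://localhost:8010", "SCALE_UP", "node-3", 0.6)

# Sending requires the server from step 3 to be running:
# with urllib.request.urlopen(request, timeout=5) as response:
#     print(json.load(response))
```

Validating the payload before sending keeps malformed actions from reaching the cluster at all.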
+## 6) Verify Kubernetes effect
+```powershell
+kubectl get deploy cart -n prod-sre
+kubectl get deploy -n prod-sre
+```
+## Notes
+- This controller is intentionally minimal and does not provide simulator rewards.
+- It is suitable for direct action execution tests from your agent.
+- If you need OpenEnv-compatible `/reset` + `/step` + reward loop, use [server/app.py](server/app.py) in `aws` mode.

deploy/aws/ARCHITECTURE.md ADDED

	@@ -0,0 +1,361 @@

+# AntiAtropos Architecture Guide
+A complete explanation of how AntiAtropos works across Hugging Face Spaces and AWS, written for someone who is technically strong but new to Kubernetes.
+---
+## The Big Picture
+AntiAtropos trains AI agents to be Site Reliability Engineers (SREs). An SRE agent watches a simulated microservice cluster and decides when to scale services, reroute traffic, or shed load to keep things running smoothly.
+The system is split across two platforms:
+```
+Hugging Face Spaces                      AWS
+=====================                    ======================
+The "brain"                              The "muscle"
+AntiAtropos FastAPI server               EKS (Kubernetes cluster)
+  - Runs the simulator                     - Runs the actual microservice pods
+  - Runs the SRE agent logic               - The agent scales these pods
+  - Queries Prometheus for metrics         - Prometheus Agent scrapes metrics
+  - Sends scale commands to K8s            - Metrics flow to AMP
+                                           - Grafana (AMG) visualizes it all
+```
+Why split? HF Spaces is free or cheap for running the Python server, while AWS EKS hosts the real infrastructure that the agent practices on.
+---
+## Kubernetes Concepts You Need
+### Pod
+The smallest unit in Kubernetes. A pod is one or more containers that run together. In our case, each pod runs a single nginx container that simulates a microservice (like "payments" or "checkout").
+Think of it as: one running instance of a service.
+### Deployment
+A Deployment is a recipe that tells Kubernetes "keep N copies of this pod running at all times." If a pod dies, the Deployment automatically replaces it.
+The key field is `spec.replicas` — this is the number the SRE agent changes when it scales a service up or down.
+```
+Deployment: payments
+  replicas: 3         <-- the agent changes this number
+    |
+    +-- Pod: payments-abc123   (running)
+    +-- Pod: payments-def456   (running)
+    +-- Pod: payments-ghi789   (running)
+```
+**The agent scales replicas, not pods.** When it sets `replicas: 5`, Kubernetes adds pods until 5 are running. When it then sets `replicas: 2`, Kubernetes terminates 3 of them.
+### Service
+A Service gives pods a stable network name. Instead of connecting to `payments-abc123` directly (which changes when the pod is recreated), you connect to `payments` (the Service), which routes to whichever pods are healthy.
+### Namespace
+A namespace is a folder for organizing resources. We use:
+- `prod-sre` — where the 5 microservice Deployments live
+- `monitoring` — where the Prometheus Agent pod lives
+- `kube-system` — where AWS/EKS system pods live
+### Node
+A node is one EC2 virtual machine in the EKS cluster. Our cluster has 2-4 nodes. Each node runs multiple pods. When all nodes are full and the agent wants to scale up, the Cluster Autoscaler adds more nodes (up to `maxSize: 4` in our config).
+```
+EKS Cluster
+  Node 1 (t3.medium - 2 vCPU, 4GB RAM)
+    Pod: payments-abc123
+    Pod: checkout-def456
+    Pod: catalog-ghi789
+    Pod: prometheus-agent-xyz
+  Node 2 (t3.medium - 2 vCPU, 4GB RAM)
+    Pod: payments-jkl012    <-- agent scaled payments from 1 to 2
+    Pod: cart-mno345
+    Pod: auth-pqr678
+```
+### ResourceQuota
+A hard limit on how many resources a namespace can use. We set one on `prod-sre` that caps total pods at 30. This is a safety net — even if the Python code cap fails, Kubernetes itself will refuse to create more than 30 pods.
+---
+## How the SRE Agent Works
+### The Loop
+Every "tick" (one step of the simulation), the agent goes through this cycle:
+```
+1. OBSERVE  -- Read telemetry (CPU, latency, queue depth) from Prometheus
+2. DECIDE   -- Choose an action (SCALE_UP, SCALE_DOWN, REROUTE_TRAFFIC, SHED_LOAD, NO_OP)
+3. ACT      -- Send the action to KubernetesExecutor
+4. REWARD   -- Compute Lyapunov stability reward (was the cluster more or less stable?)
+5. REPEAT
+```
+### How Each Action Works
+| Action | What the Agent Decides | What Happens on EKS |
+|---|---|---|
+| `SCALE_UP` | "node-0 needs more capacity" | `KubernetesExecutor` patches `payments` Deployment: `replicas: 2 -> 5` |
+| `SCALE_DOWN` | "node-3 is over-provisioned" | `KubernetesExecutor` patches `cart` Deployment: `replicas: 4 -> 1` |
+| `REROUTE_TRAFFIC` | "Move traffic away from node-2" | Currently simulation-only (no live K8s ingress patching) |
+| `SHED_LOAD` | "Drop 50% of traffic to node-3" | Currently simulation-only (no live K8s traffic shaping) |
+| `NO_OP` | "Do nothing this tick" | Nothing changes on EKS |
+### The SCALE_UP Flow in Detail
+Here is exactly what happens when the agent decides to scale up `node-0` (the payments service):
+```
+HF Spaces                                    AWS EKS
+----------                                   --------
+Agent: "SCALE_UP, node-0, parameter=0.5"
+  |
+  v
+AntiAtroposEnvironment.step()
+  |
+  v
+KubernetesExecutor.execute_with_metadata()
+  |
+  v
+_load_node_workload_map()
+  reads: node-0 -> {"deployment": "payments", "namespace": "prod-sre"}
+  |
+  v
+_scale_deployment("SCALE_UP", "node-0", 0.5)
+  |
+  +-- 1. Read current replicas: apps_v1.read_namespaced_deployment_scale("payments", "prod-sre")
+  |      Current replicas = 2
+  |
+  +-- 2. Calculate delta: max(1, int(0.5 * 3)) = 1
+  |      Desired = min(6, 2 + 1) = 3        <-- max_replicas cap from env var
+  |
+  +-- 3. Patch: apps_v1.patch_namespaced_deployment_scale("payments", "prod-sre",
+  |         body={"spec": {"replicas": 3}})
+  |
+  v                                                     +---------------------------+
+Returns: "Ack: SCALE_UP for node-0 -                    | K8s creates 1 new pod:    |
+  deployment payments in namespace                       |   payments-newpod-xyz     |
+  prod-sre scaled 2->3"                                  +---------------------------+
+```
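The replica arithmetic in step 2 of the flow can be sketched in a few lines. The scale-up formula is the one shown above. The scale-down branch and the function name `desired_replicas` are assumptions that mirror it; the real `_scale_deployment()` may differ.

```python
def desired_replicas(action_type, current, parameter,
                     scale_step=3, min_replicas=1, max_replicas=6):
    """Return the replica count to patch, clamped to [min_replicas, max_replicas]."""
    # At least one replica of change, even for very small parameters.
    delta = max(1, int(parameter * scale_step))
    if action_type == "SCALE_UP":
        return min(max_replicas, current + delta)
    if action_type == "SCALE_DOWN":
        # Assumed mirror of the scale-up rule; not confirmed by the source.
        return max(min_replicas, current - delta)
    return current  # NO_OP and simulation-only actions leave replicas alone


# The worked example from the flow: payments at 2 replicas, parameter 0.5.
print(desired_replicas("SCALE_UP", current=2, parameter=0.5))
```

Because `int()` truncates, a parameter of 0.5 with a step of 3 yields a delta of 1 rather than 2.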
+### The Telemetry Flow in Detail
+How the agent reads metrics from the real cluster:
+```
+EKS Cluster                              AMP                          HF Spaces
+-----------                              ---                          ----------
+Workload pods                            AMP Workspace                AntiAtropos
+(payments, checkout...)                  stores all metrics           PrometheusClient
+  |                                           ^                        |
+  | /metrics (scraped every 15s)              |                        |
+  v                                           |                        |
+Prometheus Agent                             |                        |
+  |                                           |                        |
+  | remote-write (SigV4 auth)                 |                        |
+  +------------------------------------------->                        |
+                                              |                        |
+                                              |  HTTPS query           |
+                                              +------------------------>
+                                              (PROMETHEUS_URL env var)
+                                                                       |
+                                                                       v
+                                                                 _fetch_real_metrics()
+                                                                 runs PromQL like:
+                                                                   sum(rate(http_requests_total[1m])) by (pod)
+                                                                 returns: TelemetryRecord for each node
+```
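For reference, the query in the diagram maps onto the Prometheus HTTP API roughly as follows. This sketch only builds the URL; the helper name `build_query_url` and the placeholder workspace URL are assumptions. A real request to AMP must also be SigV4-signed, which is not shown here.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def build_query_url(prometheus_url, promql):
    """Build an instant-query URL for the Prometheus HTTP API (/api/v1/query)."""
    return f"{prometheus_url.rstrip('/')}/api/v1/query?{urlencode({'query': promql})}"


# Placeholder base URL; in the project this would come from PROMETHEUS_URL.
base = "https://aps-workspaces.example.invalid/workspaces/ws-placeholder"
promql = "sum(rate(http_requests_total[1m])) by (pod)"
url = build_query_url(base, promql)

# Round-trip check: decoding the query string returns the original PromQL.
decoded = parse_qs(urlsplit(url).query)["query"][0]
print(url)
print(decoded == promql)
```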
+---
+## The Three Layers of Scaling Caps
+This is the most important thing to understand for cost control. There are **three** independent limits:
+### Layer 1: Python Code Cap (Soft)
+**Where:** `ANTIATROPOS_MAX_REPLICAS` env var on HF Spaces, read by `kubernetes_executor.py` line 18.
+**How it works:** The `_scale_deployment()` method calculates `desired = min(self.max_replicas, current + delta)`. If the agent tries to scale above 6, it gets:
+```
+Ack: SCALE_UP for node-0 - replicas unchanged at 6 (bounds 1-6)
+```
+**Can it be bypassed?** Yes: by a bug in the code, or by someone running `kubectl scale deployment payments --replicas=50` directly.
+**Set to:** `6` on HF Spaces.
+### Layer 2: Kubernetes ResourceQuota (Hard)
+**Where:** `k8s-workloads.yaml` — ResourceQuota on the `prod-sre` namespace.
+**How it works:** The Kubernetes API server rejects the creation of any pod that would exceed the quota. If the namespace already has 30 pods and something tries to create a 31st:
+```
+Error from server (Forbidden): pods "payments-new" is forbidden:
+exceeded quota: prod-sre-quota, requested: pods=1, used: pods=30, limited: pods=30
+```
+**Can it be bypassed?** Only by someone with cluster-admin access who deletes or edits the ResourceQuota.
+**Set to:** 30 pods total, 8 CPU, 8GB RAM.
+### Layer 3: EKS Node Group Max Size (Hard)
+**Where:** `eksctl-cluster.yaml` — `managedNodeGroups[0].maxSize: 4`.
+**How it works:** The Cluster Autoscaler will never add more than 4 nodes. Even if there are 100 pending pods, it stops at 4 nodes. Pending pods just wait.
+**Can it be bypassed?** Only by someone editing the node group in the AWS console.
+**Set to:** 4 nodes (4 x t3.medium = 8 vCPU, 16GB RAM max).
+### How the Three Layers Work Together
+```
+Agent wants to scale all 5 deployments to 20 replicas each:
+Layer 1 (Python cap):      6 replicas max per deployment  -> agent gets "unchanged at 6"
+                           5 x 6 = 30 pods maximum
+Layer 2 (ResourceQuota):   30 pods max in namespace       -> 31st pod is Forbidden
+Layer 3 (Node group):      4 nodes max                     -> if 30 pods don't fit on 4 nodes,
+                                                            some stay Pending (no cost)
+Worst case with all caps:  30 pods on 4 nodes = ~$160/month
+Without any caps:          100 pods on 25 nodes = ~$1,800/month
+```
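The pod arithmetic for the first two layers can be checked with a short sketch. The function name `pods_after_caps` is hypothetical. The numbers (5 deployments, cap of 6, quota of 30) are the ones used above.

```python
def pods_after_caps(deployments, requested_per_deployment, max_replicas, quota_pods):
    """Show how many pods survive the Python cap and the ResourceQuota."""
    requested = deployments * requested_per_deployment
    # Layer 1: each deployment is clamped to max_replicas by the executor.
    after_code_cap = deployments * min(requested_per_deployment, max_replicas)
    # Layer 2: the namespace quota limits the total regardless of layer 1.
    after_quota = min(after_code_cap, quota_pods)
    return {
        "requested": requested,
        "after_code_cap": after_code_cap,
        "after_quota": after_quota,
    }


# Agent asks for 20 replicas on each of the 5 deployments.
print(pods_after_caps(5, 20, max_replicas=6, quota_pods=30))

# If the Python cap is bypassed (for example set to 50), the quota still holds.
print(pods_after_caps(5, 20, max_replicas=50, quota_pods=30))
```

Layer 3 is not modeled here because how many pods fit on a node depends on the resource requests in `k8s-workloads.yaml`.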
+---
+## The Mapping: Simulator Nodes to Real Deployments
+The simulator has 5 abstract nodes (node-0 through node-4). The `ANTIATROPOS_WORKLOAD_MAP` env var tells the system which K8s Deployment each simulator node maps to:
+```
+Simulator Node    K8s Deployment    Namespace    Notes
+-------------     ---------------   ---------    -----
+node-0            payments          prod-sre     VIP (4x importance weight)
+node-1            checkout          prod-sre     Critical (no SHED_LOAD)
+node-2            catalog           prod-sre     Critical (no SHED_LOAD)
+node-3            cart              prod-sre     Non-critical (sheddable)
+node-4            auth              prod-sre     Non-critical (sheddable)
+```
+When the simulator says "SCALE_UP node-0 by 0.5", the system:
+1. Looks up node-0 in the workload map -> `payments` in `prod-sre`
+2. Calls `patch_namespaced_deployment_scale("payments", "prod-sre", ...)`
+3. Kubernetes creates/destroys pods to match the new replica count
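Step 1 of the lookup can be sketched as follows. The function name `parse_workload_map` and its error handling are assumptions made for illustration. The JSON shape is the one used for `ANTIATROPOS_WORKLOAD_MAP` in this repository.

```python
import json

REQUIRED_KEYS = ("deployment", "namespace")


def parse_workload_map(raw):
    """Parse the workload map JSON into {node_id: (deployment, namespace)}."""
    data = json.loads(raw)
    if not isinstance(data, dict):
        raise ValueError("workload map must be a JSON object")
    resolved = {}
    for node_id, target in data.items():
        if not isinstance(target, dict):
            raise ValueError(f"{node_id} must map to an object")
        missing = [key for key in REQUIRED_KEYS if key not in target]
        if missing:
            raise ValueError(f"{node_id} is missing {missing}")
        resolved[node_id] = (target["deployment"], target["namespace"])
    return resolved


raw = (
    '{"node-0": {"deployment": "payments", "namespace": "prod-sre"},'
    ' "node-3": {"deployment": "cart", "namespace": "prod-sre"}}'
)
workload_map = parse_workload_map(raw)
print(workload_map["node-0"])
```

Failing fast on a malformed map is a design choice in this sketch; it avoids patching the wrong Deployment because of a typo in the env var.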
+---
+## What Runs Where (Complete List)
+### On Hugging Face Spaces
+| Component | What It Does | Port |
+|---|---|---|
+| FastAPI server (`server/app.py`) | HTTP API for the agent | 7860 (via NGINX) |
+| Simulator (`simulator.py`) | 5-node microservice cluster simulation | Internal |
+| PrometheusClient (`telemetry/prometheus_client.py`) | Queries AMP for real metrics | Outbound HTTPS |
+| KubernetesExecutor (`control/kubernetes_executor.py`) | Sends scale commands to EKS | Outbound HTTPS |
+| Prometheus metrics exporter | Serves `/metrics` for HF's monitoring | 8000 |
+| Grafana + local Prometheus | Local dashboards (from the Dockerfile) | 3000, 9090 |
+### On AWS EKS
+| Component | Namespace | What It Does |
+|---|---|---|
+| payments Deployment | prod-sre | 2 nginx pods (scales with agent) |
+| checkout Deployment | prod-sre | 1 nginx pod (scales with agent) |
+| catalog Deployment | prod-sre | 1 nginx pod (scales with agent) |
+| cart Deployment | prod-sre | 1 nginx pod (scales with agent) |
+| auth Deployment | prod-sre | 1 nginx pod (scales with agent) |
+| Prometheus Agent | monitoring | Scrapes workload pods, remote-writes to AMP |
+| Cluster Autoscaler | kube-system | Adds/removes EC2 nodes based on demand |
+### On AWS Managed Services
+| Service | What It Does |
+|---|---|
+| AMP (Amazon Managed Prometheus) | Stores all metrics. Queried by HF Spaces. |
+| AMG (Amazon Managed Grafana) | Visualizes metrics in dashboards. Accessed via browser. |
+---
+## The Simulator vs Real Cluster
+AntiAtropos has three modes controlled by `ANTIATROPOS_ENV_MODE`:
+### Simulated Mode (`simulated`)
+Everything is fake. The simulator generates synthetic metrics (random CPU, latency, etc.). No K8s, no Prometheus. The agent practices in a safe sandbox.
+This is the default on HF Spaces without AWS configured.
+### Hybrid Mode (`hybrid`)
+The simulator runs, but it pulls real metrics from AMP to calibrate itself. If AMP says `payments` pods have 80% CPU, the simulator adjusts its internal model to match. The agent can read real data but actions only affect the simulator, not real pods.
+### Live Mode (`live`)
+The real deal. The agent reads real metrics from AMP and sends real scale commands to EKS. When it says `SCALE_UP`, actual pods get created on actual EC2 instances that cost actual money.
+**Set `ANTIATROPOS_ENV_MODE=live` on HF Spaces to enable this.**
+---
+## Cost Flow
+Every pod on EKS costs money. Here is how costs flow based on the agent's actions:
+```
+Agent action: SCALE_UP node-0
+  -> payments Deployment: replicas 2 -> 5
+  -> 3 new pods created
+  -> If existing nodes are full, Cluster Autoscaler adds a node
+  -> New node = another t3.medium EC2 instance = ~$0.04/hr
+  -> 3 pods running = 3 x (0.1 CPU + 64MB RAM) from the quota
+Agent action: SCALE_DOWN node-3
+  -> cart Deployment: replicas 4 -> 1
+  -> 3 pods terminated
+  -> If nodes are now underutilized, Cluster Autoscaler removes a node (after 10 min)
+  -> One fewer EC2 instance = saves ~$0.04/hr
+```
+The Lyapunov reward function penalizes the agent for both instability AND cost, so a well-trained agent should learn to scale efficiently:
+```
+R_t = -(alpha * delta_V  +  beta * cost  +  gamma * SLA_violation)
+                                  ^^^^
+                           beta=0.01 penalizes over-provisioning
+```
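A direct transcription of the formula, as a sketch. Only `beta=0.01` comes from the text above; the defaults for `alpha` and `gamma` are placeholders, and `stability.py` may define the terms differently.

```python
def lyapunov_reward(delta_v, cost, sla_violation, alpha=1.0, beta=0.01, gamma=1.0):
    """R_t = -(alpha * delta_V + beta * cost + gamma * SLA_violation)."""
    return -(alpha * delta_v + beta * cost + gamma * sla_violation)


# With stability and SLA held equal, more running replicas means lower reward.
lean = lyapunov_reward(delta_v=0.0, cost=10.0, sla_violation=0.0)
bloated = lyapunov_reward(delta_v=0.0, cost=30.0, sla_violation=0.0)
print(lean > bloated)
```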
+---
+## Quick Reference: Key Files
+| File | Purpose |
+|---|---|
+| `kubernetes_executor.py` | Translates agent actions to K8s API calls |
+| `prometheus_client.py` | Queries AMP for real metrics |
+| `simulator.py` | 5-node fluid-queue simulation |
+| `stability.py` | Lyapunov reward computation |
+| `deploy/aws/k8s-workloads.yaml` | The 5 Deployments + ResourceQuota on EKS |
+| `deploy/aws/eksctl-cluster.yaml` | EKS cluster definition (nodes, caps) |
+| `deploy/aws/prometheus-agent-values.yaml` | Helm config for Prometheus Agent |
+| `deploy/aws/generate-kubeconfig.sh` | Creates kubeconfig for HF Spaces |

deploy/aws/FASTAPI_AWS_MODE_GUIDE.md ADDED

	@@ -0,0 +1,72 @@

+# FastAPI AWS Mode + Local Grafana Guide
+This setup keeps Kubernetes + AMP in AWS, while Grafana runs on your laptop.
+## 1) Environment file
+Use [../../.env.example](../../.env.example) as template. A starter [../../.env](../../.env) is already created.
+Important keys:
+- `ANTIATROPOS_ENV_MODE=aws`
+- `KUBECONFIG=.../deploy/aws/kubeconfig-antiatropos.yaml`
+- `PROMETHEUS_URL=https://aps-workspaces.<region>.amazonaws.com/workspaces/<workspace-id>`
+- `ANTIATROPOS_WORKLOAD_MAP=...`
+- `ANTIATROPOS_GRAFANA_MODE=external`
+## 2) Load .env in PowerShell
+From workspace root:
+```powershell
+Get-Content .env | ForEach-Object {
+  if ($_ -match '^\s*#' -or $_ -match '^\s*$') { return }
+  $name, $value = $_ -split '=', 2
+  [System.Environment]::SetEnvironmentVariable($name, $value, 'Process')
+}
+```
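For shells other than PowerShell, the same loading rules can be sketched in Python: skip comments and blank lines, then split on the first `=` only. The helper name `parse_env_lines` is hypothetical. Unlike the PowerShell loop, this version also trims surrounding whitespace.

```python
def parse_env_lines(lines):
    """Parse KEY=VALUE lines, ignoring blank lines and # comments."""
    values = {}
    for line in lines:
        stripped = line.strip()
        if not stripped or stripped.startswith("#"):
            continue
        # partition splits on the first '=' so values may contain '='.
        name, separator, value = stripped.partition("=")
        if not separator:
            continue  # not a KEY=VALUE line
        values[name.strip()] = value.strip()
    return values


sample = [
    "# AntiAtropos settings",
    "",
    "ANTIATROPOS_ENV_MODE=aws",
    "PROMETHEUS_URL=https://example.invalid/workspaces/ws?x=1",
]
env = parse_env_lines(sample)
print(env)

# To apply the values to the current process:
# import os
# with open(".env") as handle:
#     os.environ.update(parse_env_lines(handle))
```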
+## 3) Start FastAPI server
+```powershell
+uvicorn server.app:app --host 0.0.0.0 --port 8000
+```
+## 4) Verify runtime wiring
+Check runtime endpoint:
+- [server/app.py](../../server/app.py) exposes `GET /config/runtime`
+- Example URL: `http://localhost:8000/config/runtime`
+You should see:
+- `env_mode: "aws"`
+- `prometheus_url_configured: true`
+- `kubeconfig_configured: true`
+- `workload_map_configured: true`
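The four expected values can be checked programmatically. In this sketch the helper name `runtime_mismatches` is an assumption, and the response is assumed to be a flat JSON object with the keys listed above.

```python
EXPECTED_RUNTIME = {
    "env_mode": "aws",
    "prometheus_url_configured": True,
    "kubeconfig_configured": True,
    "workload_map_configured": True,
}


def runtime_mismatches(config):
    """Return the keys whose values differ from the expected AWS-mode wiring."""
    return {
        key: config.get(key)
        for key, expected in EXPECTED_RUNTIME.items()
        if config.get(key) != expected
    }


# Example: a response where the kubeconfig was not picked up.
response = dict(EXPECTED_RUNTIME, kubeconfig_configured=False)
print(runtime_mismatches(response))

# Against a running server (assumes the port used in step 3):
# import json, urllib.request
# with urllib.request.urlopen("http://localhost:8000/config/runtime") as r:
#     print(runtime_mismatches(json.load(r)))
```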
+## 5) Reset environment in AWS mode
+Use reset with `mode="aws"`, or omit mode and rely on `ANTIATROPOS_ENV_MODE=aws`.
+## 6) Run Grafana locally (not in EKS)
+```powershell
+docker run -d --name antiatropos-grafana -p 3000:3000 -e GF_AUTH_SIGV4_AUTH_ENABLED=true grafana/grafana:latest
+```
+Open `http://localhost:3000` and add AMP as Prometheus datasource:
+- URL: `https://aps-workspaces.<region>.amazonaws.com/workspaces/<workspace-id>`
+- Auth: SigV4 enabled
+- Region: your AWS region (for example `ap-south-1`)
+Import dashboards:
+- [../grafana/provisioning/dashboards/json/antiatropos-overview.json](../grafana/provisioning/dashboards/json/antiatropos-overview.json)
+- [../grafana/provisioning/dashboards/json/antiatropos-live.json](../grafana/provisioning/dashboards/json/antiatropos-live.json)
+## Notes
+Grafana is observability-only. Agent control runs via FastAPI + Kubernetes executor.

deploy/aws/OPERATIONS.md ADDED

	@@ -0,0 +1,465 @@

+# AntiAtropos AWS Operations Guide
+Everything you need to run the AWS infrastructure for AntiAtropos without blowing up your bill.
+**Architecture: FastAPI on Hugging Face Spaces, EKS + AMP + AMG on AWS.**
+---
+## Table of Contents
+1. [Replica Strategy & Caps](#1-replica-strategy--caps)
+2. [Autoscaling Configuration](#2-autoscaling-configuration)
+3. [Cost Guardrails](#3-cost-guardrails)
+4. [Step-by-Step Deployment Walkthrough](#4-step-by-step-deployment-walkthrough)
+5. [Configuring HF Spaces to Connect to AWS](#5-configuring-hf-spaces-to-connect-to-aws)
+6. [Day-2 Operations](#6-day-2-operations)
+7. [Teardown & Cost Recovery](#7-teardown--cost-recovery)
+---
+## 1. Replica Strategy & Caps
+### What Runs Where
+| Component | Where | Scaled By | Cost Impact |
+|---|---|---|---|
+| **AntiAtropos FastAPI server** | HF Spaces | HF auto-scales | $0-5/month (HF billing) |
+| **Workload pods** (payments, checkout, etc.) | EKS | SRE agent via `KubernetesExecutor` | **HIGH** — this is where costs spiral |
+| **Prometheus Agent** | EKS (monitoring ns) | Static (1 pod) | Low |
+| **AMP** | AWS managed | Serverless | Pay per GB ingested |
+| **AMG** | AWS managed | Serverless | Pay per editor |
+### Workload Pod Replicas — Where Costs Spiral
+The SRE agent's `SCALE_UP` action calls `KubernetesExecutor._scale_deployment()`, which patches `replicas` on real K8s Deployments. A bad agent can scale every deployment to the cap.
+The `ANTIATROPOS_MAX_REPLICAS` env var (set on HF Spaces) is the **global** ceiling applied to all deployments. The default in `kubernetes_executor.py` is 20 — with 5 deployments, that's **100 pods** worst case. **Set it to 6.**
+**Recommended caps by deployment:**
+| Deployment | Min | Max Replicas | Reasoning |
+|---|---|---|---|
+| `payments` (node-0, VIP) | 2 | 6 | VIP node — needs redundancy, 6 is plenty for the traffic model |
+| `checkout` (node-1) | 1 | 5 | Can burst but shouldn't stay high |
+| `catalog` (node-2) | 1 | 5 | Same |
+| `cart` (node-3) | 1 | 4 | Non-critical, sheddable |
+| `auth` (node-4) | 1 | 4 | Non-critical, sheddable |
+**Total worst case: 24 workload pods.**
+At ~0.25 vCPU / 256MB per workload pod (nginx containers), that's ~6 vCPU and ~6GB RAM. A t3.medium provides 2 vCPU and 4GB RAM, so 2 nodes (4 vCPU) cannot hold the worst case: plan for at least 3 nodes, or the full 4 to leave room for system pods.
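The sizing arithmetic can be reproduced with a short sketch. The per-deployment caps and per-pod figures come from the table and paragraph above. The t3.medium size (2 vCPU, 4 GiB) is the published instance spec. The helper `nodes_needed` is hypothetical and ignores system pods and the capacity Kubernetes reserves on each node.

```python
import math

RECOMMENDED_MAX = {"payments": 6, "checkout": 5, "catalog": 5, "cart": 4, "auth": 4}


def nodes_needed(total_pods, cpu_per_pod, mem_gb_per_pod,
                 node_cpu=2.0, node_mem_gb=4.0):
    """Lower bound on node count: the tighter of the CPU and memory limits."""
    by_cpu = math.ceil(total_pods * cpu_per_pod / node_cpu)
    by_mem = math.ceil(total_pods * mem_gb_per_pod / node_mem_gb)
    return max(by_cpu, by_mem)


total_pods = sum(RECOMMENDED_MAX.values())
total_cpu = total_pods * 0.25
nodes = nodes_needed(total_pods, cpu_per_pod=0.25, mem_gb_per_pod=0.25)
print(total_pods, total_cpu, nodes)
```

CPU is the binding constraint here, so the node count is driven by vCPU rather than memory.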
+### How the Cap Works
+The `KubernetesExecutor._scale_deployment()` method reads `ANTIATROPOS_MAX_REPLICAS` from the environment and refuses to scale above it:
+```
+Ack: SCALE_UP for node-0 - replicas unchanged at 6 (bounds 1-6)
+```
+This is enforced in code (`kubernetes_executor.py` line 115):
+```python
+desired = min(self.max_replicas, current + delta)
+```
+**Set `ANTIATROPOS_MAX_REPLICAS=6` on your HF Space.**
+---
+## 2. Autoscaling Configuration
+### EKS Node Autoscaling
+The cluster needs to grow nodes when the agent scales workloads. Install the Cluster Autoscaler:
+```bash
+helm repo add autoscaler https://kubernetes.github.io/autoscaler
+helm repo update
+helm install cluster-autoscaler autoscaler/cluster-autoscaler \
+  --namespace kube-system \
+  -f deploy/aws/cluster-autoscaler-values.yaml
+```
+**The node group `maxSize` in `eksctl-cluster.yaml` (4) is your ultimate cost ceiling.**
+```
+4 nodes x $0.0416/hr (t3.medium on-demand) = $0.1664/hr = ~$120/month max
+```
+With spot instances, this drops to ~$36/month max.
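The ceiling arithmetic above, as a sketch. The hourly rate and node count are the ones quoted. The 730 hours per month and the 70% spot discount (the upper end of the range in the cost-saving checklist) are assumptions. The EKS control plane fee is not included.

```python
HOURS_PER_MONTH = 730  # assumption: average hours in a month


def monthly_node_cost(nodes, hourly_rate, discount=0.0):
    """Monthly EC2 cost for a node group; discount is a fraction such as 0.70."""
    return nodes * hourly_rate * HOURS_PER_MONTH * (1.0 - discount)


on_demand = monthly_node_cost(4, 0.0416)
spot = monthly_node_cost(4, 0.0416, discount=0.70)
print(f"on-demand: ${on_demand:.2f}/month, spot: ${spot:.2f}/month")
```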
+### What Happens When the Agent Scales Workloads
+1. Agent on HF Spaces sends `SCALE_UP` action
+2. `KubernetesExecutor._scale_deployment()` patches the Deployment's `spec.replicas` via EKS API server
+3. Kubernetes scheduler tries to place the new pod
+4. If no node has capacity -> pod is `Pending`
+5. Cluster Autoscaler sees `Pending` pods -> adds a node (within `maxSize`)
+6. If `maxSize` is hit -> pod stays `Pending` (agent action succeeded but pod won't schedule)
+**This is why `maxSize` in the node group is your ultimate cost ceiling.**
+---
+## 3. Cost Guardrails
+### Monthly Cost Caps by Tier
+| Tier | Max Nodes | Max Workload Pods | Estimated Monthly Cost |
+|---|---|---|---|
+| **Dev/Testing** | 2 | 10 (2/deployment) | ~$80 |
+| **Training** | 3 | 15 (3/deployment) | ~$130 |
+| **Benchmark Suite** | 4 | 24 (~5/deployment) | ~$160 |
+| **Unlimited (danger)** | inf | 100 (20/deployment) | $500+ |
+### AWS Budgets — Get Alerts Before You Overspend
+```bash
+aws budgets create-budget \
+  --account-id $(aws sts get-caller-identity --query Account --output text) \
+  --budget '{
+    "BudgetName": "AntiAtropos-Monthly",
+    "BudgetLimit": {"Amount": "150", "Unit": "USD"},
+    "TimeUnit": "MONTHLY",
+    "CostFilters": {
+      "TagKeyValue": ["user:Project$AntiAtropos"]
+    },
+    "CostTypes": {
+      "IncludeTax": true,
+      "IncludeSubscription": true,
+      "UseBlended": false
+    }
+  }'
+# Alert at 50%
+aws budgets create-notification \
+  --account-id $(aws sts get-caller-identity --query Account --output text) \
+  --budget-name "AntiAtropos-Monthly" \
+  --notification '{"NotificationType":"ACTUAL","ComparisonOperator":"GREATER_THAN","Threshold":50}' \
+  --subscribers '[{"SubscriptionType":"EMAIL","Address":"your-email@example.com"}]'
+# Alert at 80%
+aws budgets create-notification \
+  --account-id $(aws sts get-caller-identity --query Account --output text) \
+  --budget-name "AntiAtropos-Monthly" \
+  --notification '{"NotificationType":"ACTUAL","ComparisonOperator":"GREATER_THAN","Threshold":80}' \
+  --subscribers '[{"SubscriptionType":"EMAIL","Address":"your-email@example.com"}]'
+```
+### Cost-Saving Checklist
+- [ ] Use **spot instances** for node groups (60-70% cheaper, OK for training)
+- [ ] Set `ANTIATROPOS_MAX_REPLICAS=6` on HF Spaces (not 20) to prevent agent runaway
+- [ ] Cap node group `maxSize` at 4 (in `eksctl-cluster.yaml`)
+- [ ] Set AWS Budget alert at $150/month
+- [ ] Scale workloads to zero between runs: `kubectl scale deployment -n prod-sre --replicas=0 --all`
+- [ ] Delete the cluster for multi-day breaks: `eksctl delete cluster --name antiatropos`
+- [ ] AMP free tier covers first 10GB ingest/month
+- [ ] AMG free tier is 1 editor for 30 days — cancel if not needed
+---
+## 4. Step-by-Step Deployment Walkthrough
+### Before You Start
+You need:
+- AWS account with billing alerts enabled
+- AWS CLI v2 installed and configured (`aws configure`)
+- eksctl, kubectl, helm installed
+- About 20-30 minutes
+### Step 1: Create the EKS Cluster (15 min)
+```bash
+eksctl create cluster -f deploy/aws/eksctl-cluster.yaml
+# Verify
+aws eks update-kubeconfig --name antiatropos --region ap-south-1
+kubectl get nodes
+```
+### Step 2: Deploy Sample Workloads (1 min)
+```bash
+kubectl apply -f deploy/aws/k8s-workloads.yaml
+kubectl get pods -n prod-sre
+```
+### Step 3: Create AMP Workspace (1 min)
+```bash
+aws amp create-workspace --alias antiatropos-metrics --region ap-south-1
+# Note the workspace ID
+aws amp list-workspaces --alias antiatropos-metrics --region ap-south-1 --query 'workspaces[0].workspaceId' --output text
+```
+### Step 4: Set Up IRSA (2 min)
+```bash
+# Prometheus agent needs to write to AMP
+eksctl create iamserviceaccount \
+  --cluster antiatropos \
+  --namespace monitoring \
+  --name prometheus-sa \
+  --attach-policy-arn arn:aws:iam::aws:policy/AmazonPrometheusRemoteWriteAccess \
+  --approve
+```
+### Step 5: Install Prometheus Agent (2 min)
+```bash
+helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
+helm repo update
+# Replace WORKSPACE_ID
+helm install prometheus-agent prometheus-community/prometheus \
+  --namespace monitoring --create-namespace \
+  -f deploy/aws/prometheus-agent-values.yaml \
+  --set "server.remoteWrite[0].url=https://aps-workspaces.ap-south-1.amazonaws.com/workspaces/WORKSPACE_ID/api/v1/remote_write"
+```
+### Step 6: Set Up AMG (5 min)
+```bash
+# Create IAM role for AMG
+aws iam create-role \
+  --role-name AntiAtroposGrafanaRole \
+  --assume-role-policy-document file://deploy/aws/grafana-trust-policy.json
+aws iam attach-role-policy \
+  --role-name AntiAtroposGrafanaRole \
+  --policy-arn arn:aws:iam::aws:policy/AmazonPrometheusQueryAccess
+# Create workspace
+aws grafana create-workspace \
+  --workspace-name antiatropos-dashboards \
+  --account-access-type CURRENT_ACCOUNT \
+  --authentication-providers AWS_SSO \
+  --permission-type SERVICE_MANAGED \
+  --workspace-data-sources PROMETHEUS \
+  --region ap-south-1
+```
+Then in the AMG web UI:
+1. Sign in with AWS SSO
+2. Configuration -> Data Sources -> Add AMP workspace
+3. Dashboards -> Import -> Upload JSON from `deploy/grafana/provisioning/dashboards/json/`
+4. Select AMP data source when importing
+### Step 7: Install Cluster Autoscaler (2 min)
+```bash
+helm repo add autoscaler https://kubernetes.github.io/autoscaler
+helm repo update
+helm install cluster-autoscaler autoscaler/cluster-autoscaler \
+  --namespace kube-system \
+  -f deploy/aws/cluster-autoscaler-values.yaml
+```
+### Step 8: Generate Kubeconfig for HF Spaces (1 min)
+```bash
+./deploy/aws/generate-kubeconfig.sh
+# Outputs: deploy/aws/kubeconfig-antiatropos.yaml
+```
+### Step 9: Configure HF Spaces
+See [Section 5](#5-configuring-hf-spaces-to-connect-to-aws) below.
+---
+## 5. Configuring HF Spaces to Connect to AWS
+### Secrets (HF Space Settings -> Repository secrets)
+| Secret | Value |
+|---|---|
+| `OPENAI_API_KEY` | Your OpenAI API key |
+| `KUBECONFIG_CONTENT` | Base64-encoded content of `kubeconfig-antiatropos.yaml` |
+To encode the kubeconfig:
+```bash
+cat deploy/aws/kubeconfig-antiatropos.yaml | base64 -w 0
+```
+### Environment Variables (HF Space Settings -> Variables)
+| Variable | Value |
+|---|---|
+| `ANTIATROPOS_ENV_MODE` | `live` |
+| `ANTIATROPOS_STRICT_REAL` | `false` |
+| `PROMETHEUS_URL` | `https://aps-workspaces.ap-south-1.amazonaws.com/workspaces/WORKSPACE_ID` |
+| `KUBECONFIG` | `/app/kubeconfig.yaml` |
+| `ANTIATROPOS_K8S_NAMESPACE` | `prod-sre` |
+| `ANTIATROPOS_DEPLOYMENT_PREFIX` | `` (empty) |
+| `ANTIATROPOS_MIN_REPLICAS` | `1` |
+| `ANTIATROPOS_MAX_REPLICAS` | `6` |
+| `ANTIATROPOS_SCALE_STEP` | `3` |
+| `ANTIATROPOS_PROM_TIMEOUT_S` | `5.0` |
+| `ANTIATROPOS_METRIC_AGGREGATION` | `sum` |
+| `ANTIATROPOS_WORKLOAD_MAP` | See below |
+### Workload Map Value
+```json
+{
+  "node-0": {"deployment": "payments", "namespace": "prod-sre"},
+  "node-1": {"deployment": "checkout", "namespace": "prod-sre"},
+  "node-2": {"deployment": "catalog", "namespace": "prod-sre"},
+  "node-3": {"deployment": "cart", "namespace": "prod-sre"},
+  "node-4": {"deployment": "auth", "namespace": "prod-sre"}
+}
+```
+### Entrypoint Modification
+Add this to `deploy/entrypoint.sh` before the uvicorn line, so the kubeconfig is decoded from the HF secret:
+```bash
+# Decode kubeconfig from HF Spaces secret
+if [ -n "${KUBECONFIG_CONTENT:-}" ]; then
+    echo "${KUBECONFIG_CONTENT}" | base64 -d > /app/kubeconfig.yaml
+    export KUBECONFIG=/app/kubeconfig.yaml
+fi
+```
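The encode and decode steps are a plain base64 round trip, which can be verified locally before uploading the secret. A minimal sketch with hypothetical helper names; the sample kubeconfig text is a placeholder, not a real credential.

```python
import base64


def encode_kubeconfig(text):
    """Equivalent of `base64 -w 0`: a single line with no wrapping."""
    return base64.b64encode(text.encode("utf-8")).decode("ascii")


def decode_kubeconfig(encoded):
    """Equivalent of `base64 -d`; tolerates a trailing newline from copy-paste."""
    return base64.b64decode(encoded.strip(), validate=True).decode("utf-8")


sample = "apiVersion: v1\nkind: Config\nclusters: []\n"
secret = encode_kubeconfig(sample)
print("\n" not in secret)
print(decode_kubeconfig(secret + "\n") == sample)
```

With `validate=True`, a secret that was truncated or corrupted during copy-paste raises an error instead of decoding to garbage.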
+### Verifying the Connection
+After deploying, check from HF Spaces that the server can reach AWS:
+1. Check the HF Space logs for `antiatropos_step` events
+2. Look for `Ack: SCALE_UP` messages (agent is reaching EKS)
+3. Look for non-zero `request_rate` / `cpu_utilization` (PrometheusClient is reaching AMP)
+4. If `ANTIATROPOS_STRICT_REAL=false` (recommended), failures fall back to mock silently, so also confirm replica changes on the cluster with `kubectl get deployments -n prod-sre`
+---
+## 6. Day-2 Operations
+### Scaling Workloads Manually
+```bash
+# Scale a specific deployment
+kubectl scale deployment/payments -n prod-sre --replicas=4
+# Scale all workloads down
+kubectl scale deployment -n prod-sre --replicas=0 --all
+# Scale all workloads back up
+kubectl scale deployment payments -n prod-sre --replicas=2
+kubectl scale deployment checkout -n prod-sre --replicas=1
+kubectl scale deployment catalog -n prod-sre --replicas=1
+kubectl scale deployment cart -n prod-sre --replicas=1
+kubectl scale deployment auth -n prod-sre --replicas=1
+```
+### Pausing Everything (Without Deleting)
+```bash
+# Scale all workloads to 0
+kubectl scale deployment -n prod-sre --replicas=0 --all
+# Note: EKS nodes still run and cost money.
+# For real savings, delete the cluster (Section 7).
+```
+### Monitoring Agent Behavior
+Watch what the SRE agent is doing in real-time:
+```bash
+# Check current replica counts (desired vs. ready) per deployment
+kubectl get deployments -n prod-sre
+# List HorizontalPodAutoscalers, if any are defined
+kubectl get hpa -A
+# Check node pressure
+kubectl top nodes
+```
+### Checking Current Spend
+```bash
+# Current month cost by service
+aws ce get-cost-and-usage \
+  --time-period Start=$(date +%Y-%m-01),End=$(date +%Y-%m-%d) \
+  --granularity MONTHLY \
+  --metrics BlendedCost \
+  --group-by Type=DIMENSION,Key=SERVICE
+```
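The `--time-period` value is just two ISO dates. A sketch of the same computation in Python, with a hypothetical helper name. Cost Explorer treats the end date as exclusive and expects the start to be earlier than the end, so on the first day of a month this period is empty and the call would need a different range.

```python
from datetime import date


def month_to_date_period(today):
    """Return (start, end) ISO dates covering the current month up to today."""
    start = today.replace(day=1)
    return start.isoformat(), today.isoformat()


start, end = month_to_date_period(date(2024, 5, 17))
print(f"Start={start},End={end}")

# For the current date:
# start, end = month_to_date_period(date.today())
```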
+### Regenerating Kubeconfig
+If the EKS cluster is recreated or credentials expire:
+```bash
+./deploy/aws/generate-kubeconfig.sh
+# Re-upload the base64-encoded content to HF Spaces secret KUBECONFIG_CONTENT
+```
+---
+## 7. Teardown & Cost Recovery
+### Partial Teardown (Keep Cluster, Stop Workloads)
+```bash
+kubectl scale deployment -n prod-sre --replicas=0 --all
+# Still paying for EKS control plane ($73/month) and idle nodes
+```
+### Full Teardown (Stop All Charges)
+```bash
+# Delete workloads
+kubectl delete -f deploy/aws/k8s-workloads.yaml
+# Delete Prometheus agent
+helm uninstall prometheus-agent -n monitoring
+kubectl delete namespace monitoring
+# Delete AMP workspace
+AMP_WS_ID=$(aws amp list-workspaces --alias antiatropos-metrics --region ap-south-1 --query 'workspaces[0].workspaceId' --output text)
+aws amp delete-workspace --workspace-id $AMP_WS_ID --region ap-south-1
+# Delete AMG workspace
+AMG_WS_ID=$(aws grafana list-workspaces --region ap-south-1 --query "workspaces[?name=='antiatropos-dashboards'].id | [0]" --output text)
+aws grafana delete-workspace --workspace-id $AMG_WS_ID --region ap-south-1
+# Delete IAM role for Grafana
+aws iam detach-role-policy --role-name AntiAtroposGrafanaRole --policy-arn arn:aws:iam::aws:policy/AmazonPrometheusQueryAccess
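+# Step 6 attaches only AmazonPrometheusQueryAccess; run the next detach only if RemoteWriteAccess was also attached to this role
+aws iam detach-role-policy --role-name AntiAtroposGrafanaRole --policy-arn arn:aws:iam::aws:policy/AmazonPrometheusRemoteWriteAccess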
+aws iam delete-role --role-name AntiAtroposGrafanaRole
+# Delete the EKS cluster (10-15 min)
+eksctl delete cluster --name antiatropos --region ap-south-1
+# Verify nothing is left
+aws eks list-clusters --region ap-south-1
+aws amp list-workspaces --region ap-south-1
+```
+Also remove the `KUBECONFIG_CONTENT` secret and reset `PROMETHEUS_URL` to `mock` in your HF Space.
+---
+## Quick Reference Card
+| Task | Command |
+|---|---|
+| Deploy AWS infra | `./deploy/aws/deploy.sh` |
+| Check workloads | `kubectl get pods -n prod-sre` |
+| Check monitoring | `kubectl get pods -n monitoring` |
+| Scale a workload | `kubectl scale deployment/payments -n prod-sre --replicas=N` |
+| Pause all workloads | `kubectl scale deployment -n prod-sre --replicas=0 --all` |
+| Check AMP data | `awscurl --service aps "https://aps-workspaces.ap-south-1.amazonaws.com/workspaces/WS_ID/api/v1/query?query=up" --region ap-south-1` |
+| Generate kubeconfig | `./deploy/aws/generate-kubeconfig.sh` |
+| Delete the EKS cluster only (AMP, AMG, and IAM need the Full Teardown steps) | `eksctl delete cluster --name antiatropos --region ap-south-1` |

deploy/aws/README.md ADDED

	@@ -0,0 +1,361 @@

+# AntiAtropos AWS Deployment Guide
+Deploy the AWS infrastructure (EKS + AMP) that AntiAtropos on Hugging Face Spaces connects to.
+For FastAPI wiring with `aws` mode and laptop Grafana, see [deploy/aws/FASTAPI_AWS_MODE_GUIDE.md](FASTAPI_AWS_MODE_GUIDE.md).
+## Architecture
+```
+Hugging Face Spaces                    AWS Region (ap-south-1)
+=====================                  ======================
+                                       ┌───────────────────────────┐
+                                       │ EKS Cluster               │
+┌─────────────────┐                    │  ├── Workload pods        │
+│ AntiAtropos     │  PROMETHEUS_URL    │  │   (payments, checkout, │
+│ FastAPI Server  │───────────────────>│  │    catalog, cart, auth)│
+│ (port 7860)     │  (HTTPS + SigV4)   │  ├── Prometheus Agent     │
+│                 │                    │  │   (scrapes workloads,  │
+│                 │  KUBECONFIG        │  │    remote-writes AMP)  │
+│                 │───────────────────>│  ├── Grafana              │
+│                 │  (EKS API server)  │  │   (self-hosted,        │
+│                 │                    │  │    dashboards)         │
+│                 │                    │  └── Monitoring ns        │
+│                 │                    └───────────────────────────┘
+│                 │                    ┌───────────────────────────┐
+│                 │                    │ Amazon Managed            │
+│                 │                    │ Prometheus (AMP)          │
+│                 │                    │  Workspace: antiatropos   │
+│                 │                    └───────────────────────────┘
+└─────────────────┘
+```
+**Key principle: FastAPI runs on HF Spaces. AWS runs K8s workloads + AMP + self-hosted Grafana.**
+---
+## Phase 0: Prerequisites
+```bash
+# AWS CLI v2 (these commands target Windows with Chocolatey; use your platform's installer elsewhere)
+curl "https://awscli.amazonaws.com/AWSCLIV2.msi" -o "AWSCLIV2.msi"
+msiexec /i AWSCLIV2.msi
+# eksctl
+choco install eksctl
+# kubectl
+choco install kubernetes-cli
+# Helm
+choco install kubernetes-helm
+# Authenticate
+aws configure
+```
+---
+## Phase 1: Create the EKS Cluster (15 min)
+```bash
+eksctl create cluster -f deploy/aws/eksctl-cluster.yaml
+# Verify
+aws eks update-kubeconfig --name antiatropos --region ap-south-1
+kubectl get nodes
+```
+---
+## Phase 2: Deploy Sample Workloads on EKS
+These are the microservice deployments the SRE agent will scale up/down:
+```bash
+kubectl apply -f deploy/aws/k8s-workloads.yaml
+```
+This creates 5 deployments in the `prod-sre` namespace:
+- `payments` (node-0, VIP) — 2 replicas
+- `checkout` (node-1) — 1 replica
+- `catalog` (node-2) — 1 replica
+- `cart` (node-3) — 1 replica
+- `auth` (node-4) — 1 replica
+Verify:
+```bash
+kubectl get pods -n prod-sre
+```
+---
+## Phase 3: Set Up Amazon Managed Prometheus (AMP)
+### Create AMP Workspace
+```bash
+aws amp create-workspace \
+  --alias antiatropos-metrics \
+  --region ap-south-1
+# Note the workspace ID
+aws amp list-workspaces --alias antiatropos-metrics --region ap-south-1
+```
+### Set Up IRSA for Prometheus Agent
+```bash
+eksctl create iamserviceaccount \
+  --cluster antiatropos \
+  --namespace monitoring \
+  --name prometheus-sa \
+  --attach-policy-arn arn:aws:iam::aws:policy/AmazonPrometheusRemoteWriteAccess \
+  --approve \
+  --override-existing-serviceaccounts
+```
+### Install Prometheus Agent on EKS
+The agent scrapes workload pods and remote-writes metrics to AMP:
+```bash
+helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
+helm repo update
+# Replace WORKSPACE_ID with your AMP workspace ID
+helm install prometheus-agent prometheus-community/prometheus \
+  --namespace monitoring --create-namespace \
+  -f deploy/aws/prometheus-agent-values.yaml \
+  --set "server.remoteWrite[0].url=https://aps-workspaces.ap-south-1.amazonaws.com/workspaces/WORKSPACE_ID/api/v1/remote_write"
+```
+### Verify AMP is Receiving Data
+```bash
+pip install awscurl
+awscurl --service aps "https://aps-workspaces.ap-south-1.amazonaws.com/workspaces/WORKSPACE_ID/api/v1/query?query=up" --region ap-south-1
+```
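The remote-write and query endpoints share one base URL that differs only by region and workspace ID. A small sketch that composes both, so `WORKSPACE_ID` does not have to be hand-edited in two places; the function names are hypothetical and `ws-EXAMPLE` is a placeholder, not a real workspace:

```bash
#!/usr/bin/env bash
# Compose AMP endpoint URLs from a region and a workspace ID.
amp_base_url() {
  printf 'https://aps-workspaces.%s.amazonaws.com/workspaces/%s' "$1" "$2"
}

# Endpoint the Prometheus agent remote-writes to
amp_remote_write_url() {
  printf '%s/api/v1/remote_write' "$(amp_base_url "$1" "$2")"
}

# Instant-query endpoint; $3 is the PromQL expression (simple names only, not URL-encoded)
amp_query_url() {
  printf '%s/api/v1/query?query=%s' "$(amp_base_url "$1" "$2")" "$3"
}

amp_remote_write_url ap-south-1 ws-EXAMPLE; echo
amp_query_url ap-south-1 ws-EXAMPLE up; echo
```

The base URL on its own is also the value the HF Space expects in `PROMETHEUS_URL`.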
+---
+## Phase 4 (Optional): Set Up Self-Hosted Grafana on EKS
+If you are on free-tier nodes, skip this section and run Grafana locally on your laptop.
+### Install Grafana
+```bash
+helm repo add grafana https://grafana.github.io/helm-charts
+helm repo update
+helm install grafana grafana/grafana \
+  --namespace monitoring \
+  -f deploy/aws/grafana-values.yaml
+```
+### Create Dashboard Secret
+```bash
+kubectl create secret generic antiatropos-grafana-dashboards \
+  --from-file=antiatropos-overview.json=deploy/grafana/provisioning/dashboards/json/antiatropos-overview.json \
+  --from-file=antiatropos-live.json=deploy/grafana/provisioning/dashboards/json/antiatropos-live.json \
+  --namespace monitoring \
+  --dry-run=client -o yaml | kubectl apply -f -
+```
+### Access Grafana
+```bash
+kubectl port-forward svc/grafana 3000 -n monitoring
+```
+Open `http://localhost:3000` in your browser:
+- Username: `admin`
+- Password: `antiatropos`
+The data source `AMP-Local` is pre-configured to use the local Prometheus agent, and dashboards are auto-imported from the secret.
+---
+## Phase 5: Generate Kubeconfig for HF Spaces
+The AntiAtropos server on HF Spaces needs a kubeconfig to talk to EKS:
+```bash
+./deploy/aws/generate-kubeconfig.sh
+```
+This outputs `deploy/aws/kubeconfig-antiatropos.yaml`. You'll set this as a secret on HF Spaces.
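The secret has to be a single-line base64 string that decodes back to the exact file. A minimal sketch of the encoding step with a round-trip check, assuming a `base64` that accepts `-d` for decoding (GNU coreutils does); both function names are hypothetical:

```bash
#!/usr/bin/env bash
# Encode a file as single-line base64, suitable for pasting into a secret.
encode_kubeconfig() {
  # tr strips the line wrapping that some base64 implementations add
  base64 < "$1" | tr -d '\n'
}

# Succeed only if decoding the encoded text reproduces the file byte for byte.
verify_roundtrip() {
  encode_kubeconfig "$1" | base64 -d | cmp -s - "$1"
}

# Demo on a throwaway file so nothing here touches a real kubeconfig
demo=$(mktemp)
printf 'apiVersion: v1\nkind: Config\n' > "$demo"
verify_roundtrip "$demo" && echo "round trip OK"
rm -f "$demo"
```

For the real file, `encode_kubeconfig deploy/aws/kubeconfig-antiatropos.yaml` prints the value to store in `KUBECONFIG_CONTENT`.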
+---
+## Phase 6: Configure HF Spaces Environment Variables
+Set these in your HF Space (Settings → Repository secrets and Variables):
+### Secrets
+| Secret | Value |
+|---|---|
+| `OPENAI_API_KEY` | Your OpenAI API key |
+| `KUBECONFIG_CONTENT` | Full content of `kubeconfig-antiatropos.yaml`, base64-encoded |
+### Environment Variables
+| Variable | Value |
+|---|---|
+| `ANTIATROPOS_ENV_MODE` | `aws` |
+| `ANTIATROPOS_STRICT_REAL` | `false` |
+| `PROMETHEUS_URL` | `https://aps-workspaces.ap-south-1.amazonaws.com/workspaces/WORKSPACE_ID` |
+| `KUBECONFIG` | `/app/kubeconfig.yaml` |
+| `ANTIATROPOS_K8S_NAMESPACE` | `prod-sre` |
+| `ANTIATROPOS_MAX_REPLICAS` | `6` |
+| `ANTIATROPOS_MIN_REPLICAS` | `1` |
+| `ANTIATROPOS_SCALE_STEP` | `3` |
+| `ANTIATROPOS_PROM_TIMEOUT_S` | `5.0` |
+| `ANTIATROPOS_METRIC_AGGREGATION` | `sum` |
+| `ANTIATROPOS_WORKLOAD_MAP` | See below |
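The three replica variables interact: a scale action moves by a step and the result is bounded. The sketch below shows one plausible reading, assuming the executor adds or subtracts `ANTIATROPOS_SCALE_STEP` and then clamps to the min/max range. That behaviour is an assumption for illustration, not confirmed from the executor code, and `next_replicas` is a hypothetical name:

```bash
#!/usr/bin/env bash
# Hypothetical model of the replica limits; defaults mirror the table values.
MIN=${ANTIATROPOS_MIN_REPLICAS:-1}
MAX=${ANTIATROPOS_MAX_REPLICAS:-6}
STEP=${ANTIATROPOS_SCALE_STEP:-3}

# next_replicas CURRENT up|down : print the clamped target replica count
next_replicas() {
  local current=$1 direction=$2 target
  case "$direction" in
    up)   target=$((current + STEP)) ;;
    down) target=$((current - STEP)) ;;
    *)    echo "direction must be up or down" >&2; return 1 ;;
  esac
  [ "$target" -gt "$MAX" ] && target=$MAX   # cap at the maximum
  [ "$target" -lt "$MIN" ] && target=$MIN   # floor at the minimum
  echo "$target"
}

next_replicas 2 up     # prints 5
next_replicas 5 up     # prints 6 (capped at MAX)
next_replicas 2 down   # prints 1 (floored at MIN)
```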
+### Workload Map
+```json
+{
+  "node-0": {"deployment": "payments", "namespace": "prod-sre"},
+  "node-1": {"deployment": "checkout", "namespace": "prod-sre"},
+  "node-2": {"deployment": "catalog", "namespace": "prod-sre"},
+  "node-3": {"deployment": "cart", "namespace": "prod-sre"},
+  "node-4": {"deployment": "auth", "namespace": "prod-sre"}
+}
+```
+### Entrypoint Addition
+Add this to `deploy/entrypoint.sh` before starting uvicorn, so the kubeconfig is decoded from the HF secret:
+```bash
+# Decode kubeconfig from HF Spaces secret
+if [ -n "${KUBECONFIG_CONTENT:-}" ]; then
+    echo "${KUBECONFIG_CONTENT}" | base64 -d > /app/kubeconfig.yaml
+    export KUBECONFIG=/app/kubeconfig.yaml
+fi
+```
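A malformed secret would otherwise leave a truncated or empty kubeconfig in place. A more defensive variant is sketched here, assuming GNU `base64 -d` semantics (non-zero exit on invalid input); `decode_kubeconfig` is a hypothetical function name:

```bash
#!/usr/bin/env bash
# Decode base64 content ($1) into a kubeconfig at path $2.
# Writes to a temp file first so a bad secret never replaces a working file.
decode_kubeconfig() {
  local content=$1 dest=$2 tmp
  tmp=$(mktemp) || return 1
  if ! printf '%s' "$content" | base64 -d > "$tmp" 2>/dev/null || [ ! -s "$tmp" ]; then
    echo "KUBECONFIG_CONTENT is not valid base64 or is empty" >&2
    rm -f "$tmp"
    return 1
  fi
  chmod 600 "$tmp"     # the file holds cluster credentials
  mv "$tmp" "$dest"
}

# Demo with a throwaway payload and destination
sample=$(printf 'apiVersion: v1\nkind: Config\n' | base64 | tr -d '\n')
out=$(mktemp)
decode_kubeconfig "$sample" "$out" && echo "decoded to $out"
```

In the entrypoint this would be called as `decode_kubeconfig "${KUBECONFIG_CONTENT}" /app/kubeconfig.yaml`, exporting `KUBECONFIG` only when it succeeds.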
+### FastAPI Reset Mode
+Use `mode="aws"` on environment reset for AWS-backed execution. If omitted, the server will use `ANTIATROPOS_ENV_MODE`.
+---
+## Local Grafana (Recommended on Free Tier)
+Grafana is only for observability dashboards. Agent action execution stays in FastAPI + Kubernetes executor.
+Start Grafana locally:
+```bash
+docker run -d --name antiatropos-grafana -p 3000:3000 grafana/grafana:latest
+```
+Then in Grafana:
+1. Add a Prometheus data source using the AMP workspace URL:
+   - `https://aps-workspaces.<region>.amazonaws.com/workspaces/<WORKSPACE_ID>`
+2. Enable SigV4 auth and set the same AWS region.
+3. Import dashboards:
+   - [deploy/grafana/provisioning/dashboards/json/antiatropos-overview.json](../grafana/provisioning/dashboards/json/antiatropos-overview.json)
+   - [deploy/grafana/provisioning/dashboards/json/antiatropos-live.json](../grafana/provisioning/dashboards/json/antiatropos-live.json)
+---
+## Phase 7: Install Cluster Autoscaler
+So EKS can add nodes when the agent scales workloads:
+```bash
+helm repo add autoscaler https://kubernetes.github.io/autoscaler
+helm repo update
+helm install cluster-autoscaler autoscaler/cluster-autoscaler \
+  --namespace kube-system \
+  -f deploy/aws/cluster-autoscaler-values.yaml
+```
+The node group `maxSize: 4` in `eksctl-cluster.yaml` caps your compute cost.
+---
+## Cost Estimates
+| Resource | Config | Monthly Cost (approx) |
+|---|---|---|
+| EKS Control Plane | 1 cluster | $73 |
+| EKS Nodes | 2x t3.medium | $60 |
+| AMP | <10GB ingest | ~$3-5 |
+| EBS Volume (Grafana) | 5Gi | ~$0.50 |
+| **Total** | | **~$135-145/month** |
+| HF Spaces | Free tier or $5/mo | (separate billing) |
+No ECR, no ALB, no server pods on AWS — cheaper than running everything on AWS.
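As a quick check on the table, the line items can be summed directly. A small sketch using the table's own figures, with AMP as a low/high range; `cost_total` is a hypothetical name:

```bash
#!/usr/bin/env bash
# Sum the monthly line items from the cost table (USD).
cost_total() {
  awk 'BEGIN {
    control_plane = 73; nodes = 60; ebs = 0.50
    amp_low = 3; amp_high = 5
    base = control_plane + nodes + ebs
    printf "low=%.2f high=%.2f\n", base + amp_low, base + amp_high
  }'
}

cost_total   # prints low=136.50 high=138.50
```

The itemized sum falls in the lower half of the quoted range, which leaves headroom for data transfer and autoscaled nodes.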
+### Cost-Saving Tips
+- Use spot instances for node groups (60-70% cheaper)
+- Scale workloads to zero between runs: `kubectl scale deployment -n prod-sre --replicas=0 --all`
+- Delete the cluster between training runs: `eksctl delete cluster --name antiatropos --region ap-south-1`
+- The AMP free tier covers the first 40 million ingested samples and 10 GB of storage per month
+- Grafana is self-hosted (free, runs on EKS)
+---
+## Teardown
+```bash
+# Delete workloads
+kubectl delete -f deploy/aws/k8s-workloads.yaml
+# Delete Grafana
+helm uninstall grafana -n monitoring
+# Delete Prometheus agent
+helm uninstall prometheus-agent -n monitoring
+# Delete dashboard secret, then the namespace that held it
+kubectl delete secret antiatropos-grafana-dashboards -n monitoring 2>/dev/null || true
+kubectl delete namespace monitoring
+# Delete AMP workspace
+AMP_WS_ID=$(aws amp list-workspaces --alias antiatropos-metrics --region ap-south-1 --query 'workspaces[0].workspaceId' --output text)
+aws amp delete-workspace --workspace-id $AMP_WS_ID --region ap-south-1
+# Delete the EKS cluster (10-15 min)
+eksctl delete cluster --name antiatropos --region ap-south-1
+```
+---
+## Troubleshooting
+### HF Spaces can't reach AMP
+- Verify `PROMETHEUS_URL` includes the full workspace path
+- AMP requires SigV4 auth — ensure `requests-aws4auth` is in your dependencies
+- Set `ANTIATROPOS_PROM_TIMEOUT_S=5.0` (cross-network latency)
+### HF Spaces can't reach EKS
+- Verify `KUBECONFIG` path and the file is decoded properly
+- Check the EKS API server endpoint is public (default)
+- Verify the IAM user in the kubeconfig has EKS access
+- Test locally: `kubectl --kubeconfig=kubeconfig-antiatropos.yaml get nodes`
+### AMP not receiving metrics
+```bash
+kubectl logs -n monitoring -l app.kubernetes.io/name=prometheus
+```
+### Grafana shows no data
+1. Verify the `AMP-Local` data source is configured: `http://prometheus-agent-server.monitoring.svc.cluster.local:80`
+2. Check time range (AMP default retention is 150 days)
+3. Verify PromQL queries match your metric names
+4. Check Grafana logs: `kubectl logs -n monitoring -l app.kubernetes.io/name=grafana`
+5. Verify dashboards secret exists: `kubectl get secret antiatropos-grafana-dashboards -n monitoring`

deploy/aws/cluster-autoscaler-values.yaml ADDED

	@@ -0,0 +1,57 @@

+# Cluster Autoscaler Helm values
+#
+# This ensures EKS adds/removes nodes based on pod scheduling pressure.
+# The node group maxSize in eksctl-cluster.yaml (4) is the ultimate cap.
+#
+# Install:
+#   helm repo add autoscaler https://kubernetes.github.io/autoscaler
+#   helm repo update
+#   helm install cluster-autoscaler autoscaler/cluster-autoscaler \
+#     --namespace kube-system \
+#     -f cluster-autoscaler-values.yaml
+autoDiscovery:
+  clusterName: antiatropos
+  enabled: true
+awsRegion: ap-south-1
+# Only scale nodes that have the specific tag
+# This prevents autoscaling unrelated node groups if you add them later
+nodeGroupAutoDiscovery:
+  - tags: cluster-autoscaler/cluster-name=antiatropos
+# Conservative scaling — don't overreact
+scaleDown:
+  enabled: true
+  # Wait 10 minutes before removing a node
+  # This prevents flapping when agents create/destroy pods frequently
+  delayAfterAdd: 600s
+  delayAfterDelete: 60s
+  delayAfterScaleDown: 600s
+  # Only remove nodes that are below 50% utilization
+  utilizationThreshold: "0.5"
+  # Don't remove nodes that run kube-system pods (other than DaemonSets),
+  # so core cluster services are not evicted by a scale-down
+  skipNodesWithSystemPods: true
+# Give up on a new node that has not become ready within this time
+# The node-count cap itself comes from the eksctl node group maxSize
+maxNodeProvisionTime: 15m
+rbac:
+  create: true
+  serviceAccount:
+    create: true
+    name: cluster-autoscaler
+replicaCount: 1
+resources:
+  requests:
+    cpu: 100m
+    memory: 256Mi
+  limits:
+    cpu: 500m
+    memory: 512Mi

deploy/aws/deploy-all.ps1 ADDED

	@@ -0,0 +1,493 @@

+# AntiAtropos - One-Run Deploy Script
+# Deploys the entire AWS infrastructure: EKS cluster, nodegroup, workloads, AMP, Prometheus, Grafana, Cluster Autoscaler, kubeconfig
+$ErrorActionPreference = "Stop"
+# In PowerShell 7.3+, keep non-zero native exit codes from becoming terminating errors;
+# exit codes are checked explicitly via $LASTEXITCODE instead.
+if (Get-Variable -Name PSNativeCommandUseErrorActionPreference -ErrorAction SilentlyContinue) {
+    $PSNativeCommandUseErrorActionPreference = $false
+}
+$Region = "ap-south-1"
+$ClusterName = "antiatropos"
+$AwsDir = Split-Path -Parent $MyInvocation.MyCommand.Path
+$GrafanaMode = if ([string]::IsNullOrWhiteSpace($env:ANTIATROPOS_GRAFANA_MODE)) { "auto" } else { $env:ANTIATROPOS_GRAFANA_MODE.Trim().ToLowerInvariant() }
+$GrafanaModeResolved = "cluster"
+function Invoke-CheckedCommand {
+    param(
+        [ScriptBlock]$Command,
+        [string]$ErrorMessage
+    )
+    $previousErrorActionPreference = $ErrorActionPreference
+    $ErrorActionPreference = "Continue"
+    try {
+        & $Command
+    } finally {
+        $ErrorActionPreference = $previousErrorActionPreference
+    }
+    if ($LASTEXITCODE -ne 0) {
+        throw $ErrorMessage
+    }
+}
+function Get-EksClusterStatus {
+    param(
+        [string]$Name,
+        [string]$AwsRegion
+    )
+    try {
+        $status = aws eks describe-cluster --name $Name --region $AwsRegion --query 'cluster.status' --output text 2>$null
+    } catch {
+        return $null
+    }
+    if ($LASTEXITCODE -ne 0 -or [string]::IsNullOrWhiteSpace($status) -or $status -eq "None") {
+        return $null
+    }
+    return $status.Trim()
+}
+function Test-EksNodegroupExists {
+    param(
+        [string]$Cluster,
+        [string]$Nodegroup,
+        [string]$AwsRegion
+    )
+    try {
+        aws eks describe-nodegroup --cluster-name $Cluster --nodegroup-name $Nodegroup --region $AwsRegion --query 'nodegroup.nodegroupName' --output text 2>$null | Out-Null
+        return ($LASTEXITCODE -eq 0)
+    } catch {
+        return $false
+    }
+}
+function Get-EksNodegroupInstanceType {
+    param(
+        [string]$Cluster,
+        [string]$Nodegroup,
+        [string]$AwsRegion
+    )
+    try {
+        $instanceType = aws eks describe-nodegroup --cluster-name $Cluster --nodegroup-name $Nodegroup --region $AwsRegion --query 'nodegroup.instanceTypes[0]' --output text 2>$null
+    } catch {
+        return $null
+    }
+    if ($LASTEXITCODE -ne 0 -or [string]::IsNullOrWhiteSpace($instanceType) -or $instanceType -eq "None") {
+        return $null
+    }
+    return $instanceType.Trim()
+}
+function Get-NodegroupSubnetSelection {
+    param(
+        [string]$Cluster,
+        [string]$AwsRegion
+    )
+    try {
+        $allSubnetIds = aws eks describe-cluster --name $Cluster --region $AwsRegion --query 'cluster.resourcesVpcConfig.subnetIds' --output text 2>$null
+    } catch {
+        throw "Failed to read cluster subnet IDs"
+    }
+    if ($LASTEXITCODE -ne 0 -or [string]::IsNullOrWhiteSpace($allSubnetIds)) {
+        throw "Failed to read cluster subnet IDs"
+    }
+    $subnetArray = @($allSubnetIds -split '\s+' | Where-Object { -not [string]::IsNullOrWhiteSpace($_) })
+    if ($subnetArray.Count -eq 0) {
+        throw "No subnets found for cluster '$Cluster' in region '$AwsRegion'"
+    }
+    $describeSubnetArgs = @(
+        'ec2', 'describe-subnets',
+        '--region', $AwsRegion,
+        '--subnet-ids'
+    ) + $subnetArray + @(
+        '--query', 'Subnets[?MapPublicIpOnLaunch==true].SubnetId',
+        '--output', 'text'
+    )
+    try {
+        $publicSubnetIdsText = & aws @describeSubnetArgs 2>$null
+    } catch {
+        throw "Failed to classify cluster subnets"
+    }
+    if ($LASTEXITCODE -ne 0) {
+        throw "Failed to classify cluster subnets"
+    }
+    $publicSubnetIds = @($publicSubnetIdsText -split '\s+' | Where-Object { -not [string]::IsNullOrWhiteSpace($_) -and $_ -ne "None" })
+    $privateSubnetIds = @($subnetArray | Where-Object { $publicSubnetIds -notcontains $_ })
+    if ($publicSubnetIds.Count -gt 0) {
+        return [PSCustomObject]@{
+            SubnetCsv = ($publicSubnetIds -join ',')
+            UsePrivateNetworking = $false
+            SubnetType = "public"
+        }
+    }
+    if ($privateSubnetIds.Count -gt 0) {
+        return [PSCustomObject]@{
+            SubnetCsv = ($privateSubnetIds -join ',')
+            UsePrivateNetworking = $true
+            SubnetType = "private"
+        }
+    }
+    throw "Could not determine valid subnets for nodegroup creation"
+}
+function Get-ReadyNodeCount {
+    $nodeLines = kubectl get nodes --no-headers 2>$null
+    if (-not $nodeLines) {
+        return 0
+    }
+    return (@($nodeLines | Select-String -Pattern '\sReady\s').Count)
+}
+function Wait-ForReadyNodes {
+    param(
+        [int]$MinimumReadyNodes,
+        [int]$TimeoutSeconds = 600
+    )
+    $attempts = [Math]::Ceiling($TimeoutSeconds / 10)
+    for ($i = 0; $i -lt $attempts; $i++) {
+        $readyCount = Get-ReadyNodeCount
+        Write-Host "Nodes ready: $readyCount (target: $MinimumReadyNodes)"
+        if ($readyCount -ge $MinimumReadyNodes) {
+            return
+        }
+        Start-Sleep -Seconds 10
+    }
+    throw "Timed out waiting for $MinimumReadyNodes Ready nodes"
+}
+Write-Host ""
+Write-Host "==========================================" -ForegroundColor Cyan
+Write-Host "   AntiAtropos AWS Infrastructure Deploy" -ForegroundColor Cyan
+Write-Host "==========================================" -ForegroundColor Cyan
+Write-Host "Region:      $Region"
+Write-Host "Cluster:     $ClusterName"
+Write-Host ""
+# Check prerequisites
+$missing = @()
+foreach ($cmd in @("aws", "eksctl", "kubectl", "helm")) {
+    if (-not (Get-Command $cmd -ErrorAction SilentlyContinue)) {
+        $missing += $cmd
+    }
+}
+if ($missing.Count -gt 0) {
+    Write-Host "ERROR: Missing: $($missing -join ', ')" -ForegroundColor Red
+    exit 1
+}
+# Phase 1: Create EKS Cluster
+Write-Host ">>> Phase 1: Creating EKS cluster..." -ForegroundColor Yellow
+$clusterStatus = Get-EksClusterStatus -Name $ClusterName -AwsRegion $Region
+if ($clusterStatus -eq "DELETING") {
+    Write-Host "Cluster is currently deleting. Waiting for deletion to complete..." -ForegroundColor Yellow
+    Invoke-CheckedCommand -Command { aws eks wait cluster-deleted --name $ClusterName --region $Region } -ErrorMessage "Failed while waiting for cluster deletion"
+    $clusterStatus = $null
+}
+if (-not $clusterStatus) {
+    $TempConfig = Join-Path $AwsDir "eksctl-cluster-only.yaml"
+    $ClusterYaml = Get-Content (Join-Path $AwsDir "eksctl-cluster.yaml") -Raw
+    $ClusterOnlyYaml = $ClusterYaml -replace '(?s)(managedNodeGroups:.*)', ''
+    $ClusterOnlyYaml | Out-File -FilePath $TempConfig -Encoding utf8
+    Invoke-CheckedCommand -Command { eksctl create cluster -f $TempConfig } -ErrorMessage "Failed to create EKS cluster"
+    Remove-Item $TempConfig -Force
+    Write-Host "Cluster created" -ForegroundColor Green
+} else {
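+    if ($clusterStatus -eq "CREATING") {
+        Write-Host "Cluster creation in progress. Waiting until ACTIVE..." -ForegroundColor Yellow
+        Invoke-CheckedCommand -Command { aws eks wait cluster-active --name $ClusterName --region $Region } -ErrorMessage "Cluster did not become active"
+        # Refresh the status so the message below does not report the stale CREATING value
+        $clusterStatus = Get-EksClusterStatus -Name $ClusterName -AwsRegion $Region
+    }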
+    Write-Host "Cluster already exists (status: $clusterStatus)" -ForegroundColor Green
+}
+Invoke-CheckedCommand -Command { aws eks wait cluster-active --name $ClusterName --region $Region } -ErrorMessage "Cluster is not active"
+Invoke-CheckedCommand -Command { aws eks update-kubeconfig --name $ClusterName --region $Region | Out-Null } -ErrorMessage "Failed to update kubeconfig"
+# Phase 2: Create Nodegroup
+Write-Host ""
+Write-Host ">>> Phase 2: Ensuring compute nodegroup..." -ForegroundColor Yellow
+$NodegroupName = "linux-nodes"
+$PreferredInstanceType = "t3.micro"
+$ngExists = Test-EksNodegroupExists -Cluster $ClusterName -Nodegroup $NodegroupName -AwsRegion $Region
+if (-not $ngExists) {
+    $SubnetSelection = Get-NodegroupSubnetSelection -Cluster $ClusterName -AwsRegion $Region
+    $SubnetCsv = $SubnetSelection.SubnetCsv
+    $UsePrivateNetworking = [bool]$SubnetSelection.UsePrivateNetworking
+    Write-Host "Using $($SubnetSelection.SubnetType) subnets: $SubnetCsv"
+    Invoke-CheckedCommand -Command {
+        # Use a dedicated name: $args is a PowerShell automatic variable
+        $eksctlArgs = @(
+            'create', 'nodegroup',
+            '--cluster', $ClusterName,
+            '--region', $Region,
+            '--name', $NodegroupName,
+            '--node-type', $PreferredInstanceType,
+            '--nodes', '4',
+            '--nodes-min', '2',
+            '--nodes-max', '8',
+            '--node-volume-size', '20',
+            '--subnet-ids', $SubnetCsv
+        )
+        if ($UsePrivateNetworking) {
+            $eksctlArgs += '--node-private-networking'
+        }
+        eksctl @eksctlArgs
+    } -ErrorMessage "Failed to create nodegroup '$NodegroupName'"
+    Write-Host "Nodegroup created" -ForegroundColor Green
+} else {
+    $existingInstanceType = Get-EksNodegroupInstanceType -Cluster $ClusterName -Nodegroup $NodegroupName -AwsRegion $Region
+    Write-Host "Nodegroup already exists ($existingInstanceType)" -ForegroundColor Green
+}
+Invoke-CheckedCommand -Command { aws eks wait nodegroup-active --cluster-name $ClusterName --nodegroup-name $NodegroupName --region $Region } -ErrorMessage "Nodegroup did not become active"
+if ($GrafanaMode -in @("auto", "")) {
+    $effectiveNodeType = Get-EksNodegroupInstanceType -Cluster $ClusterName -Nodegroup $NodegroupName -AwsRegion $Region
+    if ($effectiveNodeType -eq "t3.micro") {
+        $GrafanaModeResolved = "external"
+    } else {
+        $GrafanaModeResolved = "cluster"
+    }
+} elseif ($GrafanaMode -in @("external", "local", "hf")) {
+    $GrafanaModeResolved = "external"
+} else {
+    $GrafanaModeResolved = "cluster"
+}
+Write-Host "Grafana mode: $GrafanaModeResolved" -ForegroundColor Cyan
+Write-Host "Waiting for nodes..."
+for ($i = 0; $i -lt 60; $i++) {
+    $nodes = $null
+    try {
+        $nodes = kubectl get nodes --no-headers --request-timeout=10s 2>$null
+    } catch {
+        Start-Sleep -Seconds 10
+        continue
+    }
+    if ($nodes) {
+        $readyCount = @($nodes | Select-String -Pattern '\sReady\s').Count
+        Write-Host "Nodes ready: $readyCount" -ForegroundColor Green
+        # Keep polling until at least one node actually reports Ready
+        if ($readyCount -gt 0) { break }
+    }
+    Start-Sleep -Seconds 10
+}
+# Phase 3: Deploy Workloads
+Write-Host ""
+Write-Host ">>> Phase 3: Deploying workloads..." -ForegroundColor Yellow
+kubectl create namespace prod-sre --dry-run=client -o yaml | kubectl apply -f - | Out-Null
+kubectl apply -f (Join-Path $AwsDir "k8s-workloads.yaml") | Out-Null
+Write-Host "Workloads deployed" -ForegroundColor Green
+# Phase 4: Create AMP Workspace
+Write-Host ""
+Write-Host ">>> Phase 4: Creating AMP workspace..." -ForegroundColor Yellow
+$AmpWsId = $null
+try {
+    $AmpWsId = aws amp list-workspaces --alias antiatropos-metrics --region $Region --query 'workspaces[0].workspaceId' --output text 2>$null
+    if ($AmpWsId -eq "None") { $AmpWsId = $null }
+} catch {}
+if ([string]::IsNullOrWhiteSpace($AmpWsId)) {
+    $AmpWsId = aws amp create-workspace --alias antiatropos-metrics --region $Region --query 'workspaceId' --output text
+}
+$AmpUrl = "https://aps-workspaces.$Region.amazonaws.com/workspaces/$AmpWsId"
+Write-Host "AMP: $AmpWsId" -ForegroundColor Green
+# Phase 5: Install Prometheus
+Write-Host ""
+Write-Host ">>> Phase 5: Installing Prometheus..." -ForegroundColor Yellow
+kubectl create namespace monitoring --dry-run=client -o yaml | kubectl apply -f - | Out-Null
+Invoke-CheckedCommand -Command { helm repo add prometheus-community https://prometheus-community.github.io/helm-charts 2>$null | Out-Null } -ErrorMessage "Failed to add prometheus helm repo"
+Invoke-CheckedCommand -Command { helm repo update 2>$null | Out-Null } -ErrorMessage "Failed to update helm repos"
+$promValuesYaml = Join-Path $AwsDir "prometheus-agent-values.yaml"
+$remoteWriteUrl = "$AmpUrl/api/v1/remote_write"
+Invoke-CheckedCommand -Command {
+    helm upgrade --install prometheus-agent prometheus-community/prometheus --namespace monitoring --reset-values -f $promValuesYaml `
+        --set "alertmanager.enabled=false" `
+        --set "kube-state-metrics.enabled=false" `
+        --set "prometheus-node-exporter.enabled=false" `
+        --set "pushgateway.enabled=false" `
+        --set "server.enabled=true" `
+        --set "server.persistentVolume.enabled=false" `
+        --set "server.resources.requests.cpu=50m" `
+        --set "server.resources.requests.memory=128Mi" `
+        --set "server.resources.limits.cpu=300m" `
+        --set "server.resources.limits.memory=384Mi" `
+        --set "server.global.scrape_interval=15s" `
+        --set "server.remoteWrite[0].url=$remoteWriteUrl" `
+        2>&1 | Out-Null
+} -ErrorMessage "Failed to install/upgrade Prometheus"
+Write-Host "Prometheus installed" -ForegroundColor Green
+# Phase 6: Install Grafana
+Write-Host ""
+if ($GrafanaModeResolved -eq "cluster") {
+    Write-Host ">>> Phase 6: Installing Grafana in-cluster..." -ForegroundColor Yellow
+    Invoke-CheckedCommand -Command { helm repo add grafana https://grafana.github.io/helm-charts 2>$null | Out-Null } -ErrorMessage "Failed to add grafana helm repo"
+    Invoke-CheckedCommand -Command { helm repo update 2>$null | Out-Null } -ErrorMessage "Failed to update helm repos"
+    $GrafanaValuesYaml = Join-Path $AwsDir "grafana-values.yaml"
+    Invoke-CheckedCommand -Command { helm upgrade --install grafana grafana/grafana --namespace monitoring -f $GrafanaValuesYaml 2>&1 | Out-Null } -ErrorMessage "Failed to install/upgrade Grafana"
+    Write-Host "Waiting for Grafana..."
+    try {
+        Invoke-CheckedCommand -Command { kubectl rollout status deployment/grafana --namespace monitoring --timeout=120s 2>$null | Out-Null } -ErrorMessage "Grafana rollout timed out"
+    } catch {
+        $pendingGrafanaPod = kubectl get pods -n monitoring -l app.kubernetes.io/name=grafana --field-selector=status.phase=Pending --no-headers 2>$null | Select-Object -First 1
+        $pendingReason = ""
+        if ($pendingGrafanaPod) {
+            $pendingGrafanaPodName = ($pendingGrafanaPod -split '\s+')[0]
+            $pendingReason = kubectl describe pod $pendingGrafanaPodName -n monitoring 2>$null | Select-String -Pattern "FailedScheduling|Insufficient memory|Too many pods|unbound" -Context 0,2 | Out-String
+            if (-not [string]::IsNullOrWhiteSpace($pendingReason)) {
+                Write-Host "Grafana is pending due to scheduler constraints:" -ForegroundColor Yellow
+                Write-Host $pendingReason -ForegroundColor Yellow
+            }
+        }
+        $shouldScale = $pendingReason -match "Too many pods|Insufficient memory"
+        if ($shouldScale) {
+            Write-Host "Scaling nodegroup to 8 nodes and retrying Grafana rollout..." -ForegroundColor Yellow
+            Invoke-CheckedCommand -Command { eksctl scale nodegroup --cluster $ClusterName --region $Region --name $NodegroupName --nodes 8 } -ErrorMessage "Failed to scale nodegroup"
+            Invoke-CheckedCommand -Command { aws eks wait nodegroup-active --cluster-name $ClusterName --nodegroup-name $NodegroupName --region $Region } -ErrorMessage "Nodegroup did not become active after scaling"
+            Write-Host "Waiting for newly scaled nodes to become Ready..." -ForegroundColor Yellow
+            Wait-ForReadyNodes -MinimumReadyNodes 8 -TimeoutSeconds 900
+            $pendingGrafanaPodAfterScale = kubectl get pods -n monitoring -l app.kubernetes.io/name=grafana --field-selector=status.phase=Pending --no-headers 2>$null | Select-Object -First 1
+            if ($pendingGrafanaPodAfterScale) {
+                $pendingGrafanaPodNameAfterScale = ($pendingGrafanaPodAfterScale -split '\s+')[0]
+                kubectl delete pod $pendingGrafanaPodNameAfterScale -n monitoring 2>$null | Out-Null
+            }
+            Invoke-CheckedCommand -Command { kubectl rollout status deployment/grafana --namespace monitoring --timeout=600s 2>$null | Out-Null } -ErrorMessage "Grafana rollout timed out after scaling"
+        } else {
+            throw "Grafana rollout failed. Check: kubectl -n monitoring get pods ; kubectl -n monitoring describe pod -l app.kubernetes.io/name=grafana"
+        }
+    }
+    Write-Host "Grafana installed (admin/antiatropos)" -ForegroundColor Green
+} else {
+    Write-Host ">>> Phase 6: Skipping in-cluster Grafana (external mode)..." -ForegroundColor Yellow
+    $grafanaRelease = ""
+    try {
+        $grafanaRelease = helm list -n monitoring --filter '^grafana$' --short 2>$null
+    } catch {
+        $grafanaRelease = ""
+    }
+    if (-not [string]::IsNullOrWhiteSpace($grafanaRelease)) {
+        helm uninstall grafana -n monitoring 2>$null | Out-Null
+        kubectl delete pvc grafana -n monitoring 2>$null | Out-Null
+        Write-Host "Removed existing in-cluster Grafana release to save resources" -ForegroundColor Green
+    }
+}
+# Phase 7: Install Cluster Autoscaler
+Write-Host ""
+Write-Host ">>> Phase 7: Installing Cluster Autoscaler..." -ForegroundColor Yellow
+Invoke-CheckedCommand -Command { helm repo add autoscaler https://kubernetes.github.io/autoscaler 2>$null | Out-Null } -ErrorMessage "Failed to add autoscaler helm repo"
+Invoke-CheckedCommand -Command { helm repo update 2>$null | Out-Null } -ErrorMessage "Failed to update helm repos"
+$autoscalerValues = Join-Path $AwsDir "cluster-autoscaler-values.yaml"
+Invoke-CheckedCommand -Command { helm upgrade --install cluster-autoscaler autoscaler/cluster-autoscaler --namespace kube-system -f $autoscalerValues 2>&1 | Out-Null } -ErrorMessage "Failed to install/upgrade Cluster Autoscaler"
+Write-Host "Cluster Autoscaler installed" -ForegroundColor Green
+# Phase 8: Generate Kubeconfig
+Write-Host ""
+Write-Host ">>> Phase 8: Generating kubeconfig..." -ForegroundColor Yellow
+$ClusterEndpoint = aws eks describe-cluster --name $ClusterName --region $Region --query 'cluster.endpoint' --output text
+$ClusterCa = aws eks describe-cluster --name $ClusterName --region $Region --query 'cluster.certificateAuthority.data' --output text
+$Timestamp = (Get-Date).ToUniversalTime().ToString("yyyy-MM-ddTHH:mm:ssZ")
+$output = Join-Path $AwsDir "kubeconfig-antiatropos.yaml"
+$kubeconfig = "apiVersion: v1`n" +
+"kind: Config`n" +
+"clusters:`n" +
+"  - cluster:`n" +
+"      certificate-authority-data: $ClusterCa`n" +
+"      server: $ClusterEndpoint`n" +
+"    name: $ClusterName`n" +
+"contexts:`n" +
+"  - context:`n" +
+"      cluster: $ClusterName`n" +
+"      user: antiatropos-hf-user`n" +
+"    name: $ClusterName`n" +
+"current-context: $ClusterName`n" +
+"preferences: {}`n" +
+"users:`n" +
+"  - name: antiatropos-hf-user`n" +
+"    user:`n" +
+"      exec:`n" +
+"        apiVersion: client.authentication.k8s.io/v1beta1`n" +
+"        command: aws`n" +
+"        args:`n" +
+"          - eks`n" +
+"          - get-token`n" +
+"          - --region`n" +
+"          - $Region`n" +
+"          - --cluster-name`n" +
+"          - $ClusterName`n" +
+"        env:`n" +
+"          - name: AWS_STS_REGIONAL_ENDPOINTS`n" +
+"            value: regional`n" +
+"          - name: AWS_DEFAULT_REGION`n" +
+"            value: $Region`n" +
+"        interactiveMode: IfAvailable`n"
+$kubeconfig | Out-File -FilePath $output -Encoding utf8 -Force
+Write-Host "Kubeconfig: $output" -ForegroundColor Green
+# Done
+Write-Host ""
+Write-Host "==========================================" -ForegroundColor Cyan
+Write-Host "   Deployment Complete!" -ForegroundColor Cyan
+Write-Host "==========================================" -ForegroundColor Cyan
+Write-Host ""
+Write-Host "AMP: $AmpWsId" -ForegroundColor Yellow
+if ($GrafanaModeResolved -eq "cluster") {
+    Write-Host "Grafana: kubectl port-forward svc/grafana 3000 -n monitoring" -ForegroundColor Yellow
+    Write-Host "Login: admin / antiatropos" -ForegroundColor Yellow
+} else {
+    Write-Host "Grafana: external/local mode enabled (recommended for free-tier nodes)" -ForegroundColor Yellow
+    Write-Host "Use AMP endpoint as Prometheus datasource with SigV4 auth" -ForegroundColor Yellow
+}
+Write-Host "Kubeconfig: $output" -ForegroundColor Yellow
+Write-Host ""

deploy/aws/deploy.ps1 ADDED

	@@ -0,0 +1,369 @@

+# AntiAtropos AWS Infrastructure Deploy Script (PowerShell)
+#
+# Deploys: EKS cluster, sample workloads, AMP workspace, Prometheus Agent,
+#          self-hosted Grafana, Cluster Autoscaler, and generates kubeconfig for HF Spaces.
+#
+# The AntiAtropos FastAPI server runs on Hugging Face Spaces, NOT on AWS.
+# This script only sets up the infrastructure that HF Spaces connects to.
+#
+# Prerequisites: aws cli, eksctl, kubectl, helm
+#
+# Usage:
+#   .\deploy\aws\deploy.ps1
+#
+# Environment variables:
+#   $env:AWS_REGION     - AWS region (default: ap-south-1)
+#   $env:CLUSTER_NAME   - EKS cluster name (default: antiatropos)
+$ErrorActionPreference = "Stop"
+$Region = if ($env:AWS_REGION) { $env:AWS_REGION } else { "ap-south-1" }
+$ClusterName = if ($env:CLUSTER_NAME) { $env:CLUSTER_NAME } else { "antiatropos" }
+$AwsDir = Split-Path -Parent $MyInvocation.MyCommand.Path
+Write-Host ""
+Write-Host "=== AntiAtropos AWS Infrastructure Deployment ===" -ForegroundColor Cyan
+Write-Host "Region:      $Region"
+Write-Host "Cluster:     $ClusterName"
+Write-Host "FastAPI:     Runs on HF Spaces (not deployed here)"
+Write-Host ""
+# --- Check prerequisites ---
+$missing = @()
+foreach ($cmd in @("aws", "eksctl", "kubectl", "helm")) {
+    if (-not (Get-Command $cmd -ErrorAction SilentlyContinue)) {
+        $missing += $cmd
+    }
+}
+if ($missing.Count -gt 0) {
+    Write-Host "ERROR: Missing prerequisites: $($missing -join ', ')" -ForegroundColor Red
+    Write-Host "Install them first:" -ForegroundColor Yellow
+    Write-Host "  choco install awscli eksctl kubernetes-cli kubernetes-helm -y" -ForegroundColor Yellow
+    exit 1
+}
+# --- Phase 1: Create EKS Cluster ---
+Write-Host ""
+Write-Host ">>> Phase 1: Creating EKS cluster (without nodegroup)..." -ForegroundColor Yellow
+$clusterExists = $false
+try {
+    eksctl get cluster --name $ClusterName --region $Region 2>$null | Out-Null
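+    # Native commands do not reliably throw on failure, so check the exit code explicitly.
+    $clusterExists = ($LASTEXITCODE -eq 0)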
+} catch {}
+if ($clusterExists) {
+    Write-Host "Cluster $ClusterName already exists, skipping creation."
+} else {
+    # Create cluster without nodegroup first (faster, avoids timeout)
+    $TempClusterConfig = Join-Path $AwsDir "eksctl-cluster-only.yaml"
+    $ClusterYaml = Get-Content (Join-Path $AwsDir "eksctl-cluster.yaml") -Raw
+    # Remove only the managedNodeGroups section; keep later top-level keys such as cloudWatch
+    $ClusterOnlyYaml = $ClusterYaml -replace '(?ms)^managedNodeGroups:.*?(?=^\S|\z)', ''
+    $ClusterOnlyYaml | Out-File -FilePath $TempClusterConfig -Encoding utf8
+    eksctl create cluster -f $TempClusterConfig
+    Remove-Item $TempClusterConfig -Force
+    Write-Host "Cluster created." -ForegroundColor Green
+}
+aws eks update-kubeconfig --name $ClusterName --region $Region
+Write-Host "kubeconfig updated."
+# --- Phase 1b: Create Nodegroup Separately ---
+Write-Host ""
+Write-Host ">>> Phase 1b: Creating nodegroup (separate step to avoid timeout)..." -ForegroundColor Yellow
+$nodegroupExists = $false
+try {
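+    # Only treat the nodegroup as existing when its name actually appears in the listing.
+    $nodegroupMatch = eksctl get nodegroup --cluster $ClusterName --region $Region 2>$null | Select-String "linux-nodes"
+    $nodegroupExists = [bool]$nodegroupMatch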
+} catch {}
+if ($nodegroupExists) {
+    Write-Host "Nodegroup already exists, skipping creation."
+} else {
+    # Create nodegroup separately (better error handling, can retry)
+    eksctl create nodegroup --config-file (Join-Path $AwsDir "eksctl-cluster.yaml")
+    Write-Host "Nodegroup created." -ForegroundColor Green
+}
+# Verify nodes are ready
+Write-Host "Waiting for nodes to be ready..."
+$nodesReady = $false
+for ($i = 0; $i -lt 30; $i++) {
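+    # Match only nodes whose STATUS column is exactly Ready (excludes NotReady).
+    $nodes = kubectl get nodes --no-headers 2>$null | Select-String -Pattern '\sReady\s'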
+    if ($nodes) {
+        Write-Host "Nodes ready:" -ForegroundColor Green
+        kubectl get nodes
+        $nodesReady = $true
+        break
+    }
+    Start-Sleep -Seconds 10
+}
+if (-not $nodesReady) {
+    Write-Host "WARNING: Nodes not ready yet. Check with: kubectl get nodes" -ForegroundColor Yellow
+}
+Write-Host "Enabling Prefix Delegation on VPC CNI..."
+kubectl set env daemonset aws-node -n kube-system ENABLE_PREFIX_DELEGATION=true
+Write-Host "Prefix Delegation enabled."
+# --- Phase 2: Deploy Sample Workloads ---
+Write-Host ""
+Write-Host ">>> Phase 2: Deploying sample workloads (payments, checkout, catalog, cart, auth)..." -ForegroundColor Yellow
+kubectl apply -f (Join-Path $AwsDir "k8s-workloads.yaml")
+Write-Host "Workloads deployed." -ForegroundColor Green
+kubectl get pods -n prod-sre
+# --- Phase 3: Create AMP Workspace ---
+Write-Host ""
+Write-Host ">>> Phase 3: Creating Amazon Managed Prometheus workspace..." -ForegroundColor Yellow
+$AmpWsId = $null
+try {
+    $AmpWsId = aws amp list-workspaces --alias antiatropos-metrics --region $Region --query 'workspaces[0].workspaceId' --output text 2>$null
+    if ($AmpWsId -eq "None") { $AmpWsId = $null }
+} catch {}
+if ([string]::IsNullOrWhiteSpace($AmpWsId)) {
+    $AmpWsId = aws amp create-workspace `
+        --alias antiatropos-metrics `
+        --region $Region `
+        --query 'workspaceId' `
+        --output text
+    Write-Host "AMP workspace created: $AmpWsId" -ForegroundColor Green
+} else {
+    Write-Host "AMP workspace already exists: $AmpWsId"
+}
+$AmpUrl = "https://aps-workspaces.$Region.amazonaws.com/workspaces/$AmpWsId"
+Write-Host "AMP URL: $AmpUrl"
+# --- Phase 4: Set up IRSA for Prometheus Agent ---
+Write-Host ""
+Write-Host ">>> Phase 4: Setting up IRSA for Prometheus Agent..." -ForegroundColor Yellow
+$saExists = $false
+try {
+    kubectl get serviceaccount prometheus-sa -n monitoring 2>$null | Out-Null
+    $saExists = ($LASTEXITCODE -eq 0)
+} catch {}
+if ($saExists) {
+    Write-Host "prometheus-sa already exists."
+} else {
+    eksctl create iamserviceaccount `
+        --cluster $ClusterName `
+        --namespace monitoring `
+        --name prometheus-sa `
+        --attach-policy-arn "arn:aws:iam::aws:policy/AmazonPrometheusRemoteWriteAccess" `
+        --approve `
+        --override-existing-serviceaccounts
+    Write-Host "prometheus-sa created." -ForegroundColor Green
+}
+# --- Phase 5: Install Prometheus Agent ---
+Write-Host ""
+Write-Host ">>> Phase 5: Installing Prometheus Agent (remote-writes to AMP)..." -ForegroundColor Yellow
+helm repo add prometheus-community https://prometheus-community.github.io/helm-charts 2>$null
+helm repo update
+$agentInstalled = $false
+try {
+    helm status prometheus-agent -n monitoring 2>$null | Out-Null
+    $agentInstalled = ($LASTEXITCODE -eq 0)
+} catch {}
+$promValuesYaml = Join-Path $AwsDir "prometheus-agent-values.yaml"
+$remoteWriteUrl = "$AmpUrl/api/v1/remote_write"
+if ($agentInstalled) {
+    Write-Host "prometheus-agent already installed, upgrading..."
+    helm upgrade prometheus-agent prometheus-community/prometheus `
+        --namespace monitoring `
+        -f $promValuesYaml `
+        --set "prometheus.prometheusSpec.remoteWrite[0].url=$remoteWriteUrl"
+} else {
+    helm install prometheus-agent prometheus-community/prometheus `
+        --namespace monitoring --create-namespace `
+        -f $promValuesYaml `
+        --set "prometheus.prometheusSpec.remoteWrite[0].url=$remoteWriteUrl"
+    Write-Host "prometheus-agent installed." -ForegroundColor Green
+}
+# --- Phase 6: Install Self-Hosted Grafana on EKS ---
+Write-Host ""
+Write-Host ">>> Phase 6: Installing self-hosted Grafana on EKS..." -ForegroundColor Yellow
+# Add Grafana Helm repo
+helm repo add grafana https://grafana.github.io/helm-charts 2>$null
+helm repo update
+# Create a secret with the dashboard JSON files for Grafana to import
+$DashboardsDir = Join-Path $AwsDir "..\grafana\provisioning\dashboards\json"
+if (Test-Path $DashboardsDir) {
+    Write-Host "Creating dashboard secret from $DashboardsDir..."
+    kubectl create secret generic antiatropos-grafana-dashboards `
+        --from-file=antiatropos-overview.json=$(Join-Path $DashboardsDir "antiatropos-overview.json") `
+        --from-file=antiatropos-live.json=$(Join-Path $DashboardsDir "antiatropos-live.json") `
+        --namespace monitoring `
+        --dry-run=client -o yaml | kubectl apply -f -
+    Write-Host "Dashboard secret created." -ForegroundColor Green
+} else {
+    Write-Host "Dashboard JSON directory not found at $DashboardsDir, skipping."
+}
+# Install Grafana
+$GrafanaValuesYaml = Join-Path $AwsDir "grafana-values.yaml"
+if (helm status grafana -n monitoring 2>$null) {
+    Write-Host "Grafana already installed, upgrading..."
+    helm upgrade grafana grafana/grafana --namespace monitoring -f $GrafanaValuesYaml
+} else {
+    helm install grafana grafana/grafana --namespace monitoring -f $GrafanaValuesYaml
+    Write-Host "Grafana installed." -ForegroundColor Green
+}
+# Wait for Grafana pod to be ready
+Write-Host "Waiting for Grafana pod to be ready..."
+kubectl rollout status deployment/grafana --namespace monitoring --timeout=120s 2>$null | Out-Null
+$GrafanaPod = kubectl get pods -n monitoring -l app.kubernetes.io/name=grafana -o jsonpath='{.items[0].metadata.name}' 2>$null
+Write-Host "Grafana pod: $GrafanaPod"
+Write-Host "To access Grafana: kubectl port-forward svc/grafana 3000 -n monitoring" -ForegroundColor Yellow
+Write-Host "Login: admin / antiatropos"
+# --- Phase 7: Install Cluster Autoscaler ---
+Write-Host ""
+Write-Host ">>> Phase 7: Installing Cluster Autoscaler..." -ForegroundColor Yellow
+helm repo add autoscaler https://kubernetes.github.io/autoscaler 2>$null
+helm repo update
+$autoscalerInstalled = $false
+try {
+    helm status cluster-autoscaler -n kube-system 2>$null | Out-Null
+    $autoscalerInstalled = ($LASTEXITCODE -eq 0)
+} catch {}
+$autoscalerValues = Join-Path $AwsDir "cluster-autoscaler-values.yaml"
+if ($autoscalerInstalled) {
+    Write-Host "cluster-autoscaler already installed, upgrading..."
+    helm upgrade cluster-autoscaler autoscaler/cluster-autoscaler `
+        --namespace kube-system `
+        -f $autoscalerValues
+} else {
+    helm install cluster-autoscaler autoscaler/cluster-autoscaler `
+        --namespace kube-system `
+        -f $autoscalerValues
+    Write-Host "cluster-autoscaler installed." -ForegroundColor Green
+}
+# --- Phase 8: Generate Kubeconfig for HF Spaces ---
+Write-Host ""
+Write-Host ">>> Phase 8: Generating kubeconfig for HF Spaces..." -ForegroundColor Yellow
+$generateScript = Join-Path $AwsDir "generate-kubeconfig.ps1"
+if (Test-Path $generateScript) {
+    & $generateScript
+} else {
+    # Inline kubeconfig generation if the .ps1 version doesn't exist yet
+    $output = Join-Path $AwsDir "kubeconfig-antiatropos.yaml"
+    # Verify cluster exists
+    $clusterCheck = $false
+    try {
+        eksctl get cluster --name $ClusterName --region $Region 2>$null | Out-Null
+        $clusterCheck = ($LASTEXITCODE -eq 0)
+    } catch {}
+    if (-not $clusterCheck) {
+        Write-Host "ERROR: Cluster $ClusterName not found." -ForegroundColor Red
+        exit 1
+    }
+    $ClusterEndpoint = aws eks describe-cluster --name $ClusterName --region $Region --query 'cluster.endpoint' --output text
+    $ClusterCa = aws eks describe-cluster --name $ClusterName --region $Region --query 'cluster.certificateAuthority.data' --output text
+    $Timestamp = (Get-Date).ToUniversalTime().ToString("yyyy-MM-ddTHH:mm:ssZ")
+    $kubeconfig = @"
+# Kubeconfig for AntiAtropos on Hugging Face Spaces
+# Generated: $Timestamp
+# Cluster:   $ClusterName
+# Region:    $Region
+#
+# This kubeconfig uses AWS IAM authenticator.
+# The HF Space container must have aws-cli available,
+# OR the kubernetes Python client must be configured with AWS credentials.
+apiVersion: v1
+kind: Config
+clusters:
+  - cluster:
+      certificate-authority-data: $ClusterCa
+      server: $ClusterEndpoint
+    name: $ClusterName
+contexts:
+  - context:
+      cluster: $ClusterName
+      user: antiatropos-hf-user
+    name: $ClusterName
+current-context: $ClusterName
+preferences: {}
+users:
+  - name: antiatropos-hf-user
+    user:
+      exec:
+        apiVersion: client.authentication.k8s.io/v1beta1
+        command: aws
+        args:
+          - eks
+          - get-token
+          - --region
+          - $Region
+          - --cluster-name
+          - $ClusterName
+        env:
+          - name: AWS_STS_REGIONAL_ENDPOINTS
+            value: regional
+          - name: AWS_DEFAULT_REGION
+            value: $Region
+        interactiveMode: IfAvailable
+"@
+    $kubeconfig | Out-File -FilePath $output -Encoding utf8 -Force
+    Write-Host "Kubeconfig written to: $output" -ForegroundColor Green
+    Write-Host ""
+    Write-Host "To encode for HF Spaces secret:" -ForegroundColor Yellow
+    Write-Host "  [Convert]::ToBase64String([System.IO.File]::ReadAllBytes('$output'))"
+}
+# --- Done ---
+Write-Host ""
+Write-Host "==========================================" -ForegroundColor Cyan
+Write-Host "   AntiAtropos AWS Infrastructure Ready!" -ForegroundColor Cyan
+Write-Host "==========================================" -ForegroundColor Cyan
+Write-Host ""
+Write-Host "AMP Workspace ID:  $AmpWsId"
+Write-Host "AMP URL:           $AmpUrl"
+Write-Host ""
+Write-Host "Grafana: Self-hosted on EKS (monitoring namespace)"
+Write-Host "  Access: kubectl port-forward svc/grafana 3000 -n monitoring"
+Write-Host "  Login: admin / antiatropos"
+Write-Host "  URL: http://localhost:3000"
+Write-Host ""
+Write-Host "Kubeconfig saved:  $(Join-Path $AwsDir 'kubeconfig-antiatropos.yaml')"
+Write-Host ""
+Write-Host "Next steps - configure your HF Space:" -ForegroundColor Yellow
+Write-Host "  1. Set secret KUBECONFIG_CONTENT = base64 of kubeconfig-antiatropos.yaml"
+Write-Host "  2. Set env var PROMETHEUS_URL = $AmpUrl"
+Write-Host "  3. Set env var KUBECONFIG = /app/kubeconfig.yaml"
+Write-Host "  4. Set env var ANTIATROPOS_ENV_MODE = live"
+Write-Host "  5. Set env var ANTIATROPOS_MAX_REPLICAS = 6"
+Write-Host "  6. Set env var ANTIATROPOS_WORKLOAD_MAP = (see OPERATIONS.md)"
+Write-Host "  7. Add kubeconfig decode to deploy/entrypoint.sh (see OPERATIONS.md)"

deploy/aws/deploy.sh ADDED

	@@ -0,0 +1,204 @@

+#!/usr/bin/env bash
+# AntiAtropos AWS Infrastructure Deploy Script
+#
+# Deploys: EKS cluster, sample workloads, AMP workspace, Prometheus Agent,
+#          self-hosted Grafana, Cluster Autoscaler, and generates kubeconfig for HF Spaces.
+#
+# The AntiAtropos FastAPI server runs on Hugging Face Spaces, NOT on AWS.
+# This script only sets up the infrastructure that HF Spaces connects to.
+#
+# Prerequisites: aws cli, eksctl, kubectl, helm
+#
+# Usage:
+#   chmod +x deploy/aws/deploy.sh
+#   ./deploy/aws/deploy.sh
+#
+# Environment variables:
+#   AWS_REGION     - AWS region (default: ap-south-1)
+#   CLUSTER_NAME   - EKS cluster name (default: antiatropos)
+set -euo pipefail
+REGION="${AWS_REGION:-ap-south-1}"
+CLUSTER_NAME="${CLUSTER_NAME:-antiatropos}"
+AWS_DIR="$(cd "$(dirname "$0")" && pwd)"
+echo "=== AntiAtropos AWS Infrastructure Deployment ==="
+echo "Region:      $REGION"
+echo "Cluster:     $CLUSTER_NAME"
+echo "FastAPI:     Runs on HF Spaces (not deployed here)"
+echo ""
+# --- Check prerequisites ---
+for cmd in aws eksctl kubectl helm; do
+    if ! command -v "$cmd" &>/dev/null; then
+        echo "ERROR: $cmd is not installed. Please install it first."
+        exit 1
+    fi
+done
+# --- Phase 1: Create EKS Cluster ---
+echo ""
+echo ">>> Phase 1: Creating EKS cluster..."
+if eksctl get cluster --name "$CLUSTER_NAME" --region "$REGION" &>/dev/null; then
+    echo "Cluster $CLUSTER_NAME already exists, skipping creation."
+else
+    eksctl create cluster -f "$AWS_DIR/eksctl-cluster.yaml"
+    echo "Cluster created."
+fi
+aws eks update-kubeconfig --name "$CLUSTER_NAME" --region "$REGION"
+echo "kubeconfig updated."
+# --- Phase 2: Deploy Sample Workloads ---
+echo ""
+echo ">>> Phase 2: Deploying sample workloads (payments, checkout, catalog, cart, auth)..."
+kubectl apply -f "$AWS_DIR/k8s-workloads.yaml"
+echo "Workloads deployed."
+kubectl get pods -n prod-sre
+# --- Phase 3: Create AMP Workspace ---
+echo ""
+echo ">>> Phase 3: Creating Amazon Managed Prometheus workspace..."
+AMP_WS_ID=$(aws amp list-workspaces --alias antiatropos-metrics --region "$REGION" --query 'workspaces[0].workspaceId' --output text 2>/dev/null || echo "")
+if [ -z "$AMP_WS_ID" ] || [ "$AMP_WS_ID" = "None" ]; then
+    AMP_WS_ID=$(aws amp create-workspace \
+        --alias antiatropos-metrics \
+        --region "$REGION" \
+        --query 'workspaceId' \
+        --output text)
+    echo "AMP workspace created: $AMP_WS_ID"
+else
+    echo "AMP workspace already exists: $AMP_WS_ID"
+fi
+AMP_URL="https://aps-workspaces.$REGION.amazonaws.com/workspaces/$AMP_WS_ID"
+echo "AMP URL: $AMP_URL"
+# --- Phase 4: Set up IRSA for Prometheus Agent ---
+echo ""
+echo ">>> Phase 4: Setting up IRSA for Prometheus Agent..."
+if kubectl get serviceaccount prometheus-sa -n monitoring &>/dev/null; then
+    echo "prometheus-sa already exists."
+else
+    eksctl create iamserviceaccount \
+        --cluster "$CLUSTER_NAME" \
+        --namespace monitoring \
+        --name prometheus-sa \
+        --attach-policy-arn arn:aws:iam::aws:policy/AmazonPrometheusRemoteWriteAccess \
+        --approve \
+        --override-existing-serviceaccounts
+    echo "prometheus-sa created."
+fi
+# --- Phase 5: Install Prometheus Agent ---
+echo ""
+echo ">>> Phase 5: Installing Prometheus Agent (remote-writes to AMP)..."
+helm repo add prometheus-community https://prometheus-community.github.io/helm-charts 2>/dev/null || true
+helm repo update
+if helm status prometheus-agent -n monitoring &>/dev/null; then
+    echo "prometheus-agent already installed, upgrading..."
+    helm upgrade prometheus-agent prometheus-community/prometheus \
+        --namespace monitoring \
+        -f "$AWS_DIR/prometheus-agent-values.yaml" \
+        --set "prometheus.prometheusSpec.remoteWrite[0].url=$AMP_URL/api/v1/remote_write"
+else
+    helm install prometheus-agent prometheus-community/prometheus \
+        --namespace monitoring --create-namespace \
+        -f "$AWS_DIR/prometheus-agent-values.yaml" \
+        --set "prometheus.prometheusSpec.remoteWrite[0].url=$AMP_URL/api/v1/remote_write"
+    echo "prometheus-agent installed."
+fi
+# --- Phase 6: Install Self-Hosted Grafana on EKS ---
+echo ""
+echo ">>> Phase 6: Installing self-hosted Grafana on EKS..."
+# Add Grafana Helm repo
+helm repo add grafana https://grafana.github.io/helm-charts 2>/dev/null || true
+helm repo update
+# Create a secret with the dashboard JSON files for Grafana to import
+DASHBOARDS_DIR="$AWS_DIR/../grafana/provisioning/dashboards/json"
+if [ -d "$DASHBOARDS_DIR" ]; then
+    echo "Creating dashboard secret from $DASHBOARDS_DIR..."
+    kubectl create secret generic antiatropos-grafana-dashboards \
+        --from-file=antiatropos-overview.json="$DASHBOARDS_DIR/antiatropos-overview.json" \
+        --from-file=antiatropos-live.json="$DASHBOARDS_DIR/antiatropos-live.json" \
+        --namespace monitoring \
+        --dry-run=client -o yaml | kubectl apply -f -
+    echo "Dashboard secret created."
+else
+    echo "Dashboard JSON directory not found at $DASHBOARDS_DIR, skipping."
+fi
+# Install Grafana
+GRAFANA_VALUES="$AWS_DIR/grafana-values.yaml"
+if helm status grafana -n monitoring &>/dev/null; then
+    echo "Grafana already installed, upgrading..."
+    helm upgrade grafana grafana/grafana --namespace monitoring -f "$GRAFANA_VALUES"
+else
+    helm install grafana grafana/grafana --namespace monitoring -f "$GRAFANA_VALUES"
+    echo "Grafana installed."
+fi
+# Wait for Grafana pod to be ready
+echo "Waiting for Grafana pod to be ready..."
+kubectl rollout status deployment/grafana --namespace monitoring --timeout=120s 2>/dev/null || true
+GRAFANA_POD=$(kubectl get pods -n monitoring -l app.kubernetes.io/name=grafana -o jsonpath='{.items[0].metadata.name}' 2>/dev/null || true)
+echo "Grafana pod: $GRAFANA_POD"
+echo "To access Grafana: kubectl port-forward svc/grafana 3000 -n monitoring"
+echo "Login: admin / antiatropos"
+# --- Phase 7: Install Cluster Autoscaler ---
+echo ""
+echo ">>> Phase 7: Installing Cluster Autoscaler..."
+helm repo add autoscaler https://kubernetes.github.io/autoscaler 2>/dev/null || true
+helm repo update
+if helm status cluster-autoscaler -n kube-system &>/dev/null; then
+    echo "cluster-autoscaler already installed, upgrading..."
+    helm upgrade cluster-autoscaler autoscaler/cluster-autoscaler \
+        --namespace kube-system \
+        -f "$AWS_DIR/cluster-autoscaler-values.yaml"
+else
+    helm install cluster-autoscaler autoscaler/cluster-autoscaler \
+        --namespace kube-system \
+        -f "$AWS_DIR/cluster-autoscaler-values.yaml"
+    echo "cluster-autoscaler installed."
+fi
+# --- Phase 8: Generate Kubeconfig for HF Spaces ---
+echo ""
+echo ">>> Phase 8: Generating kubeconfig for HF Spaces..."
+"$AWS_DIR/generate-kubeconfig.sh"
+# --- Done ---
+echo ""
+echo "=========================================="
+echo "   AntiAtropos AWS Infrastructure Ready!"
+echo "=========================================="
+echo ""
+echo "AMP Workspace ID:  $AMP_WS_ID"
+echo "AMP URL:           $AMP_URL"
+echo ""
+echo "Grafana: Self-hosted on EKS (monitoring namespace)"
+echo "  Access: kubectl port-forward svc/grafana 3000 -n monitoring"
+echo "  Login: admin / antiatropos"
+echo "  URL: http://localhost:3000"
+echo ""
+echo "Kubeconfig saved:  $AWS_DIR/kubeconfig-antiatropos.yaml"
+echo ""
+echo "Next steps — configure your HF Space:"
+echo "  1. Set secret KUBECONFIG_CONTENT = base64 of kubeconfig-antiatropos.yaml"
+echo "  2. Set env var PROMETHEUS_URL = $AMP_URL"
+echo "  3. Set env var KUBECONFIG = /app/kubeconfig.yaml"
+echo "  4. Set env var ANTIATROPOS_ENV_MODE = live"
+echo "  5. Set env var ANTIATROPOS_MAX_REPLICAS = 6"
+echo "  6. Set env var ANTIATROPOS_WORKLOAD_MAP = (see OPERATIONS.md)"
+echo "  7. Add kubeconfig decode to deploy/entrypoint.sh (see OPERATIONS.md)"
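
Step 7 above refers to decoding the kubeconfig secret in deploy/entrypoint.sh. A minimal sketch of that step follows, assuming the secret arrives base64-encoded in KUBECONFIG_CONTENT as described in the next-steps list. The function name `write_kubeconfig` and the owner-only file mode are illustrative choices, not taken from the repository's entrypoint.

```shell
# Hypothetical helper: decode the base64 kubeconfig secret to a file.
# Usage: write_kubeconfig [destination]   (default: /app/kubeconfig.yaml)
write_kubeconfig() {
    local dest="${1:-/app/kubeconfig.yaml}"
    if [ -z "${KUBECONFIG_CONTENT:-}" ]; then
        # Nothing to decode; leave any existing KUBECONFIG untouched.
        echo "KUBECONFIG_CONTENT is not set; skipping kubeconfig decode" >&2
        return 0
    fi
    # Write inside a subshell with a restrictive umask so a newly created
    # file is readable by the owner only.
    ( umask 077; printf '%s' "$KUBECONFIG_CONTENT" | base64 -d > "$dest" )
    export KUBECONFIG="$dest"
}
```

Calling `write_kubeconfig` before the server starts would leave KUBECONFIG pointing at the decoded file; when the secret is absent the function returns without touching the filesystem.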

deploy/aws/eksctl-cluster.yaml ADDED

	@@ -0,0 +1,58 @@

+apiVersion: eksctl.io/v1alpha5
+kind: ClusterConfig
+metadata:
+  name: antiatropos
+  region: ap-south-1
+  version: "1.30"
+  tags:
+    Project: AntiAtropos
+    Environment: production
+autoModeConfig:
+  enabled: false
+iam:
+  withOIDC: true
+addons:
+  - name: vpc-cni
+    version: latest
+  - name: coredns
+    version: latest
+  - name: kube-proxy
+    version: latest
+  - name: aws-ebs-csi-driver
+    version: latest
+    wellKnownPolicies:
+      ebsCSIController: true
+managedNodeGroups:
+  - name: linux-nodes
+    instanceType: t3.micro
+    maxPodsPerNode: 110
+    desiredCapacity: 2
+    minSize: 1
+    maxSize: 4
+    volumeSize: 50
+    volumeType: gp3
+    availabilityZones:
+      - ap-south-1a
+      - ap-south-1b
+    labels:
+      role: worker
+    tags:
+      Project: AntiAtropos
+      NodeGroup: linux-nodes
+    iam:
+      withAddonPolicies:
+        ebs: true
+        cloudWatch: true
+        autoScaler: true
+cloudWatch:
+  clusterLogging:
+    enableTypes:
+      - api
+      - audit
+      - authenticator
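
The node group above pairs `t3.micro` with `maxPodsPerNode: 110`, and deploy.ps1 enables prefix delegation on the VPC CNI. A rough sketch of the usual VPC CNI pod-capacity arithmetic is shown below. The function names are hypothetical, and the t3.micro limits used (2 network interfaces, 2 IPv4 addresses per interface) are assumptions that should be checked against the current EC2 instance-type limits.

```shell
# Hypothetical helpers illustrating VPC CNI pod-capacity arithmetic.

# Secondary-IP mode: each ENI contributes (ips_per_eni - 1) pod addresses,
# plus 2 for pods that use host networking.
max_pods_secondary_ip() {
    local enis="$1" ips_per_eni="$2"
    echo $(( enis * (ips_per_eni - 1) + 2 ))
}

# Prefix-delegation mode: each secondary slot holds a /28 prefix (16 addresses).
max_pods_prefix_delegation() {
    local enis="$1" ips_per_eni="$2"
    echo $(( enis * (ips_per_eni - 1) * 16 + 2 ))
}

# Assumed t3.micro limits: 2 ENIs, 2 IPv4 addresses per ENI.
max_pods_secondary_ip 2 2        # prints 4
max_pods_prefix_delegation 2 2   # prints 34
```

Under these assumptions the address capacity stays below 110 even with prefix delegation, so the IP budget rather than the kubelet limit may be the binding constraint on this instance type.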

deploy/aws/generate-kubeconfig.ps1 ADDED

	@@ -0,0 +1,131 @@

+# Generate a kubeconfig for HF Spaces to connect to the EKS cluster.
+#
+# This creates a kubeconfig that uses AWS IAM authenticator,
+# which works from outside the cluster (like from HF Spaces).
+#
+# Prerequisites: aws cli, kubectl, eksctl
+#
+# Usage:
+#   .\deploy\aws\generate-kubeconfig.ps1
+#
+# Output:
+#   deploy/aws/kubeconfig-antiatropos.yaml
+#
+# Then on HF Spaces:
+#   1. base64 encode: $b64 = [Convert]::ToBase64String([IO.File]::ReadAllBytes('deploy\aws\kubeconfig-antiatropos.yaml'))
+#   2. Set as HF Space secret: KUBECONFIG_CONTENT = <base64 output>
+#   3. Set env var: KUBECONFIG = /app/kubeconfig.yaml
+#   4. Add to deploy/entrypoint.sh:
+#        if [ -n "${KUBECONFIG_CONTENT:-}" ]; then
+#            echo "${KUBECONFIG_CONTENT}" | base64 -d > /app/kubeconfig.yaml
+#            export KUBECONFIG=/app/kubeconfig.yaml
+#        fi
+$ErrorActionPreference = "Stop"
+$Region = if ($env:AWS_REGION) { $env:AWS_REGION } else { "ap-south-1" }
+$ClusterName = if ($env:CLUSTER_NAME) { $env:CLUSTER_NAME } else { "antiatropos" }
+$AwsDir = Split-Path -Parent $MyInvocation.MyCommand.Path
+$Output = Join-Path $AwsDir "kubeconfig-antiatropos.yaml"
+Write-Host ""
+Write-Host "=== Generating kubeconfig for HF Spaces ===" -ForegroundColor Cyan
+Write-Host "Cluster: $ClusterName"
+Write-Host "Region:  $Region"
+Write-Host ""
+# Verify cluster exists
+$clusterExists = $false
+try {
+    eksctl get cluster --name $ClusterName --region $Region 2>$null | Out-Null
+    $clusterExists = ($LASTEXITCODE -eq 0)
+} catch {}
+if (-not $clusterExists) {
+    Write-Host "ERROR: Cluster $ClusterName not found. Create it first with eksctl." -ForegroundColor Red
+    exit 1
+}
+# Get cluster details
+$ClusterEndpoint = aws eks describe-cluster --name $ClusterName --region $Region --query 'cluster.endpoint' --output text
+$ClusterCa = aws eks describe-cluster --name $ClusterName --region $Region --query 'cluster.certificateAuthority.data' --output text
+$AwsArn = aws sts get-caller-identity --query Arn --output text
+$Timestamp = (Get-Date).ToUniversalTime().ToString("yyyy-MM-ddTHH:mm:ssZ")
+Write-Host "Cluster endpoint: $ClusterEndpoint"
+Write-Host "AWS identity:     $AwsArn"
+Write-Host ""
+# Generate the kubeconfig
+$kubeconfig = @"
+# Kubeconfig for AntiAtropos on Hugging Face Spaces
+# Generated: $Timestamp
+# Cluster:   $ClusterName
+# Region:    $Region
+#
+# This kubeconfig authenticates through the AWS CLI (aws eks get-token).
+# The HF Space container must have aws-cli and AWS credentials available,
+# OR the kubernetes Python client must be configured with AWS credentials.
+#
+# To use this on HF Spaces:
+#   1. base64 encode this file
+#   2. Set as HF secret: KUBECONFIG_CONTENT = <base64>
+#   3. Set env var: KUBECONFIG = /app/kubeconfig.yaml
+#   4. Decode in entrypoint.sh before uvicorn starts
+apiVersion: v1
+kind: Config
+clusters:
+  - cluster:
+      certificate-authority-data: $ClusterCa
+      server: $ClusterEndpoint
+    name: $ClusterName
+contexts:
+  - context:
+      cluster: $ClusterName
+      user: antiatropos-hf-user
+    name: $ClusterName
+current-context: $ClusterName
+preferences: {}
+users:
+  - name: antiatropos-hf-user
+    user:
+      exec:
+        apiVersion: client.authentication.k8s.io/v1beta1
+        command: aws
+        args:
+          - eks
+          - get-token
+          - --region
+          - $Region
+          - --cluster-name
+          - $ClusterName
+        env:
+          - name: AWS_STS_REGIONAL_ENDPOINTS
+            value: regional
+          - name: AWS_DEFAULT_REGION
+            value: $Region
+        interactiveMode: IfAvailable
+"@
+$kubeconfig | Out-File -FilePath $Output -Encoding utf8 -Force
+Write-Host "Kubeconfig written to: $Output" -ForegroundColor Green
+Write-Host ""
+Write-Host "IMPORTANT: The HF Space container needs the AWS CLI and credentials" -ForegroundColor Yellow
+Write-Host "to authenticate with EKS. You have two options:"
+Write-Host ""
+Write-Host "Option A: Include aws-cli in your Docker image and set AWS_ACCESS_KEY_ID /"
+Write-Host "          AWS_SECRET_ACCESS_KEY as HF Space secrets."
+Write-Host ""
+Write-Host "Option B: Use the kubernetes Python client with AWS SDK (boto3)."
+Write-Host "          The kubernetes_executor.py already supports this via"
+Write-Host "          load_kube_config() which uses the Python client's auth plugins."
+Write-Host ""
+Write-Host "To encode for HF Spaces secret:" -ForegroundColor Yellow
+Write-Host "  [Convert]::ToBase64String([IO.File]::ReadAllBytes('$Output'))"

deploy/aws/generate-kubeconfig.sh ADDED

	@@ -0,0 +1,138 @@

+#!/usr/bin/env bash
+# Generate a kubeconfig for HF Spaces to connect to the EKS cluster.
+#
+# This creates a kubeconfig that uses AWS IAM authenticator,
+# which works from outside the cluster (like from HF Spaces).
+#
+# Prerequisites:
+#   - aws cli
+#   - kubectl
+#   - eksctl
+#   - The EKS cluster must already exist
+#
+# Usage:
+#   ./generate-kubeconfig.sh
+#
+# Output:
+#   deploy/aws/kubeconfig-antiatropos.yaml
+#
+# Then on HF Spaces:
+#   1. base64 encode: cat kubeconfig-antiatropos.yaml | base64 -w 0
+#   2. Set as HF Space secret: KUBECONFIG_CONTENT = <base64 output>
+#   3. Set env var: KUBECONFIG = /app/kubeconfig.yaml
+#   4. Add to deploy/entrypoint.sh:
+#        if [ -n "${KUBECONFIG_CONTENT:-}" ]; then
+#            echo "${KUBECONFIG_CONTENT}" | base64 -d > /app/kubeconfig.yaml
+#            export KUBECONFIG=/app/kubeconfig.yaml
+#        fi
+set -euo pipefail
+REGION="${AWS_REGION:-ap-south-1}"
+CLUSTER_NAME="${CLUSTER_NAME:-antiatropos}"
+AWS_DIR="$(cd "$(dirname "$0")" && pwd)"
+OUTPUT="$AWS_DIR/kubeconfig-antiatropos.yaml"
+echo "=== Generating kubeconfig for HF Spaces ==="
+echo "Cluster: $CLUSTER_NAME"
+echo "Region:  $REGION"
+echo ""
+# Verify cluster exists
+if ! eksctl get cluster --name "$CLUSTER_NAME" --region "$REGION" &>/dev/null; then
+    echo "ERROR: Cluster $CLUSTER_NAME not found. Create it first with eksctl."
+    exit 1
+fi
+# Get cluster details
+CLUSTER_ENDPOINT=$(aws eks describe-cluster \
+    --name "$CLUSTER_NAME" \
+    --region "$REGION" \
+    --query 'cluster.endpoint' \
+    --output text)
+CLUSTER_CA=$(aws eks describe-cluster \
+    --name "$CLUSTER_NAME" \
+    --region "$REGION" \
+    --query 'cluster.certificateAuthority.data' \
+    --output text)
+# Get the current AWS identity for the kubeconfig
+AWS_ACCOUNT_ID=$(aws sts get-caller-identity --query Account --output text)
+AWS_ARN=$(aws sts get-caller-identity --query Arn --output text)
+echo "Cluster endpoint: $CLUSTER_ENDPOINT"
+echo "AWS identity:     $AWS_ARN"
+echo ""
+# Generate the kubeconfig
+cat > "$OUTPUT" <<EOF
+# Kubeconfig for AntiAtropos on Hugging Face Spaces
+# Generated: $(date -u +"%Y-%m-%dT%H:%M:%SZ")
+# Cluster:   $CLUSTER_NAME
+# Region:    $REGION
+#
+# This kubeconfig authenticates through the AWS CLI (aws eks get-token).
+# The HF Space container must have aws-cli and AWS credentials available,
+# OR the kubernetes Python client must be configured with AWS credentials.
+#
+# To use this on HF Spaces:
+#   1. base64 encode this file: cat kubeconfig-antiatropos.yaml | base64 -w 0
+#   2. Set as HF secret: KUBECONFIG_CONTENT = <base64>
+#   3. Set env var: KUBECONFIG = /app/kubeconfig.yaml
+#   4. Decode in entrypoint.sh before uvicorn starts
+apiVersion: v1
+kind: Config
+clusters:
+  - cluster:
+      certificate-authority-data: $CLUSTER_CA
+      server: $CLUSTER_ENDPOINT
+    name: $CLUSTER_NAME
+contexts:
+  - context:
+      cluster: $CLUSTER_NAME
+      user: antiatropos-hf-user
+    name: $CLUSTER_NAME
+current-context: $CLUSTER_NAME
+preferences: {}
+users:
+  - name: antiatropos-hf-user
+    user:
+      exec:
+        apiVersion: client.authentication.k8s.io/v1beta1
+        command: aws
+        args:
+          - eks
+          - get-token
+          - --region
+          - $REGION
+          - --cluster-name
+          - $CLUSTER_NAME
+        env:
+          - name: AWS_STS_REGIONAL_ENDPOINTS
+            value: regional
+          - name: AWS_DEFAULT_REGION
+            value: $REGION
+        interactiveMode: IfAvailable
+EOF
+echo "Kubeconfig written to: $OUTPUT"
+echo ""
+echo "IMPORTANT: The HF Space container needs the AWS CLI and credentials"
+echo "to authenticate with EKS. You have two options:"
+echo ""
+echo "Option A: Include aws-cli in your Docker image and set AWS_ACCESS_KEY_ID /"
+echo "          AWS_SECRET_ACCESS_KEY as HF Space secrets."
+echo ""
+echo "Option B: Use the kubernetes Python client with AWS SDK (boto3)."
+echo "          The kubernetes_executor.py already supports this via"
+echo "          load_kube_config() which uses the Python client's auth plugins."
+echo ""
+echo "To encode for HF Spaces secret:"
+echo "  cat $OUTPUT | base64 -w 0"
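Reviewer sketch: step 4 in the header comment of generate-kubeconfig.sh asks for a decode step in deploy/entrypoint.sh. A minimal version with basic validation might look like the following. The function name `decode_kubeconfig`, its arguments, and the `kind: Config` sanity check are illustrative assumptions, not code from this commit.

```shell
#!/usr/bin/env bash
# Sketch only: decode a base64-encoded kubeconfig passed in as a string.
# decode_kubeconfig and its arguments are hypothetical names, not part of the commit.
decode_kubeconfig() {
    local content="$1" target="$2"
    if [ -z "$content" ]; then
        echo "kubeconfig content is empty; skipping" >&2
        return 1
    fi
    # Write with a restrictive umask, since the file grants cluster access.
    ( umask 077 && printf '%s' "$content" | base64 -d > "$target" ) || return 1
    # Reject an obviously wrong payload (assumption: a kubeconfig has a "kind: Config" line).
    grep -q '^kind: Config' "$target"
}
```

In an entrypoint this could be called as `decode_kubeconfig "${KUBECONFIG_CONTENT:-}" /app/kubeconfig.yaml && export KUBECONFIG=/app/kubeconfig.yaml`, so that a missing or corrupt secret leaves `KUBECONFIG` unset instead of pointing at a broken file.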

deploy/aws/grafana-trust-policy.json ADDED

	@@ -0,0 +1,12 @@

+{
+  "Version": "2012-10-17",
+  "Statement": [
+    {
+      "Effect": "Allow",
+      "Principal": {
+        "Service": "grafana.amazonaws.com"
+      },
+      "Action": "sts:AssumeRole"
+    }
+  ]
+}

deploy/aws/grafana-values.yaml ADDED

	@@ -0,0 +1,68 @@

+# Grafana self-hosted on EKS
+# Connects to the local Prometheus agent and imports AntiAtropos dashboards
+replicaCount: 1
+adminUser: admin
+adminPassword: antiatropos
+service:
+  type: ClusterIP
+  port: 80
+persistence:
+  enabled: true
+  size: 5Gi
+  storageClassName: gp2
+# Use the local Prometheus agent as data source
+additionalDataSources:
+  - name: AMP-Local
+    type: prometheus
+    access: proxy
+    url: http://prometheus-agent-server.monitoring.svc.cluster.local:80
+    isDefault: true
+    editable: true
+# Import AntiAtropos dashboards
+dashboardProviders:
+  dashboardproviders.yaml:
+    apiVersion: 1
+    providers:
+      - name: 'default'
+        orgId: 1
+        folder: 'AntiAtropos'
+        type: file
+        disableDeletion: false
+        editable: true
+        options:
+          path: /var/lib/grafana/dashboards
+dashboards:
+  default:
+    antiatropos-overview:
+      gnetId: null
+      datasource: AMP-Local
+    antiatropos-live:
+      gnetId: null
+      datasource: AMP-Local
+# Allow dashboard JSON files to be mounted
+extraSecretMounts:
+  - name: dashboards
+    mountPath: /var/lib/grafana/dashboards
+    subPath: ""
+    secretName: antiatropos-grafana-dashboards
+    readOnly: true
+resources:
+  limits:
+    memory: 512Mi
+    cpu: 250m
+  requests:
+    memory: 256Mi
+    cpu: 100m
+nodeSelector: {}
+tolerations: []
+affinity: {}

deploy/aws/k8s-workloads.yaml ADDED

	@@ -0,0 +1,296 @@

+# Sample microservice deployments for AntiAtropos SRE training.
+#
+# These are the workloads the SRE agent will SCALE_UP / SCALE_DOWN / REROUTE_TRAFFIC / SHED_LOAD.
+# Each maps to a simulator node via ANTIATROPOS_WORKLOAD_MAP on HF Spaces.
+#
+# Apply: kubectl apply -f k8s-workloads.yaml
+#
+# The Prometheus Agent (in monitoring namespace) scrapes these pods
+# because they have the prometheus.io/scrape annotation.
+# Metrics are remote-written to AMP where the AntiAtropos server on HF Spaces queries them.
+---
+apiVersion: v1
+kind: Namespace
+metadata:
+  name: prod-sre
+  labels:
+    app.kubernetes.io/part-of: antiatropos
+---
+# ResourceQuota: Hard cap on pods in prod-sre namespace.
+# This is a Kubernetes-level safety net. Even if the agent's Python cap fails,
+# Kubernetes will refuse to create pods beyond this limit.
+#
+# Max 30 pods = 6 replicas x 5 deployments (our worst-case budget)
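+# Requests are capped at 8 CPU / 8Gi and limits at 15 CPU / 15Gi, which is ample for
+# 30 small nginx pods (30 x 100m / 64Mi requests, 30 x 250m / 128Mi limits).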
+apiVersion: v1
+kind: ResourceQuota
+metadata:
+  name: prod-sre-quota
+  namespace: prod-sre
+spec:
+  hard:
+    pods: "30"
+    requests.cpu: "8"
+    requests.memory: 8Gi
+    limits.cpu: "15"
+    limits.memory: 15Gi
+---
+# payments — node-0 (VIP)
+# Business-critical payment service. Deployed with 2 replicas for redundancy.
+# The SRE agent should never SHED_LOAD on this (CRITICAL_NODES in simulator.py).
+apiVersion: apps/v1
+kind: Deployment
+metadata:
+  name: payments
+  namespace: prod-sre
+  labels:
+    app: payments
+    node-id: node-0
+    critical: "true"
+spec:
+  replicas: 2
+  selector:
+    matchLabels:
+      app: payments
+  template:
+    metadata:
+      labels:
+        app: payments
+        node-id: node-0
+      annotations:
+        prometheus.io/scrape: "true"
+        prometheus.io/port: "8080"
+        prometheus.io/path: "/metrics"
+    spec:
+      containers:
+        - name: payments
+          image: nginx:alpine
+          ports:
+            - containerPort: 80
+          resources:
+            requests:
+              cpu: 100m
+              memory: 64Mi
+            limits:
+              cpu: 250m
+              memory: 128Mi
+---
+apiVersion: v1
+kind: Service
+metadata:
+  name: payments
+  namespace: prod-sre
+spec:
+  selector:
+    app: payments
+  ports:
+    - port: 80
+      targetPort: 80
+---
+# checkout — node-1
+apiVersion: apps/v1
+kind: Deployment
+metadata:
+  name: checkout
+  namespace: prod-sre
+  labels:
+    app: checkout
+    node-id: node-1
+spec:
+  replicas: 1
+  selector:
+    matchLabels:
+      app: checkout
+  template:
+    metadata:
+      labels:
+        app: checkout
+        node-id: node-1
+      annotations:
+        prometheus.io/scrape: "true"
+        prometheus.io/port: "8080"
+        prometheus.io/path: "/metrics"
+    spec:
+      containers:
+        - name: checkout
+          image: nginx:alpine
+          ports:
+            - containerPort: 80
+          resources:
+            requests:
+              cpu: 100m
+              memory: 64Mi
+            limits:
+              cpu: 250m
+              memory: 128Mi
+---
+apiVersion: v1
+kind: Service
+metadata:
+  name: checkout
+  namespace: prod-sre
+spec:
+  selector:
+    app: checkout
+  ports:
+    - port: 80
+      targetPort: 80
+---
+# catalog — node-2
+apiVersion: apps/v1
+kind: Deployment
+metadata:
+  name: catalog
+  namespace: prod-sre
+  labels:
+    app: catalog
+    node-id: node-2
+spec:
+  replicas: 1
+  selector:
+    matchLabels:
+      app: catalog
+  template:
+    metadata:
+      labels:
+        app: catalog
+        node-id: node-2
+      annotations:
+        prometheus.io/scrape: "true"
+        prometheus.io/port: "8080"
+        prometheus.io/path: "/metrics"
+    spec:
+      containers:
+        - name: catalog
+          image: nginx:alpine
+          ports:
+            - containerPort: 80
+          resources:
+            requests:
+              cpu: 100m
+              memory: 64Mi
+            limits:
+              cpu: 250m
+              memory: 128Mi
+---
+apiVersion: v1
+kind: Service
+metadata:
+  name: catalog
+  namespace: prod-sre
+spec:
+  selector:
+    app: catalog
+  ports:
+    - port: 80
+      targetPort: 80
+---
+# cart — node-3
+apiVersion: apps/v1
+kind: Deployment
+metadata:
+  name: cart
+  namespace: prod-sre
+  labels:
+    app: cart
+    node-id: node-3
+spec:
+  replicas: 1
+  selector:
+    matchLabels:
+      app: cart
+  template:
+    metadata:
+      labels:
+        app: cart
+        node-id: node-3
+      annotations:
+        prometheus.io/scrape: "true"
+        prometheus.io/port: "8080"
+        prometheus.io/path: "/metrics"
+    spec:
+      containers:
+        - name: cart
+          image: nginx:alpine
+          ports:
+            - containerPort: 80
+          resources:
+            requests:
+              cpu: 100m
+              memory: 64Mi
+            limits:
+              cpu: 250m
+              memory: 128Mi
+---
+apiVersion: v1
+kind: Service
+metadata:
+  name: cart
+  namespace: prod-sre
+spec:
+  selector:
+    app: cart
+  ports:
+    - port: 80
+      targetPort: 80
+---
+# auth — node-4
+apiVersion: apps/v1
+kind: Deployment
+metadata:
+  name: auth
+  namespace: prod-sre
+  labels:
+    app: auth
+    node-id: node-4
+spec:
+  replicas: 1
+  selector:
+    matchLabels:
+      app: auth
+  template:
+    metadata:
+      labels:
+        app: auth
+        node-id: node-4
+      annotations:
+        prometheus.io/scrape: "true"
+        prometheus.io/port: "8080"
+        prometheus.io/path: "/metrics"
+    spec:
+      containers:
+        - name: auth
+          image: nginx:alpine
+          ports:
+            - containerPort: 80
+          resources:
+            requests:
+              cpu: 100m
+              memory: 64Mi
+            limits:
+              cpu: 250m
+              memory: 128Mi
+---
+apiVersion: v1
+kind: Service
+metadata:
+  name: auth
+  namespace: prod-sre
+spec:
+  selector:
+    app: auth
+  ports:
+    - port: 80
+      targetPort: 80

deploy/aws/kubeconfig-antiatropos.yaml ADDED

	@@ -0,0 +1,34 @@

+apiVersion: v1
+kind: Config
+clusters:
+  - cluster:
+      certificate-authority-data: LS0tLS1CRUdJTiBDRVJUSUZJQ0FURS0tLS0tCk1JSURCVENDQWUyZ0F3SUJBZ0lJQk5ZY1JYcVZ2dm93RFFZSktvWklodmNOQVFFTEJRQXdGVEVUTUJFR0ExVUUKQXhNS2EzVmlaWEp1WlhSbGN6QWVGdzB5TmpBME1qTXhPREEwTURGYUZ3MHpOakEwTWpBeE9EQTVNREZhTUJVeApFekFSQmdOVkJBTVRDbXQxWW1WeWJtVjBaWE13Z2dFaU1BMEdDU3FHU0liM0RRRUJBUVVBQTRJQkR3QXdnZ0VLCkFvSUJBUUN2cHYwRVRIREIxeVRjVVFxa21Xd2Z2YnE0Z3d1bm9HK0w0MkIvaUV0N3h1NVhTMjZQWVlwNURGckYKUTJoUTRndDlENDUwNXlHNkN0eCtWVXBncExpeUxEU3pMdEM2VHUrUm5uSEY0NHRHZ1NJQm9GaG9TaXhzWFV3SQoxU3E1NVBIeHhPQmo3OGJxRFVxL2R3eE1xOVk1TzBINmkwV1ZaZHMvTmhaMk9rd1dJeUJnYy9Rckhpb2ZJZm1qCkVhZ0psRm9Sb1c2L2RjajBiOThOMi9zaWt1blRhQldJSGpPay9ESkNiWldzU0JtOTBBY0V3dEdnN1Bhc1hOcUsKaWwydWxlMG9PYk9zTyszbDhpeU9nYktROHFDbFgwSU03UVN2Y1J4YnYwK2FCYXpxVS9BRkhMY1VmTW1VMXVKRwpLdGVuTUxzNnBLdlpyRU9EOFlacklkYmkyZDBaQWdNQkFBR2pXVEJYTUE0R0ExVWREd0VCL3dRRUF3SUNwREFQCkJnTlZIUk1CQWY4RUJUQURBUUgvTUIwR0ExVWREZ1FXQkJTVm43TWdjYkhCNE9wNFc0WEhLYlNPeWdBdDREQVYKQmdOVkhSRUVEakFNZ2dwcmRXSmxjbTVsZEdWek1BMEdDU3FHU0liM0RRRUJDd1VBQTRJQkFRQSsxc0Rjc1RJcQp0T3V4Nk5OMkUrTFlYOFEvTk9qWlhSQVhSeDlOdXhoL0RCRmJwTjUrTzg2VWROL3BJamI0WGUyTVRGaytCTXZnCmUyWk9NNGJFQTlLR3JPc1RhK3VBL3pKZFhjUXZ0MG00Kzd5T3VqcklHOGhuOTlZSjRlTmxYYk9nV3NOTmVDMnEKT01DVFFPdGtJNVlMNFNET2ZDRUlsOEpBU0QvZTNRd0p6Mk15bnNIR2F4azZYZ3VnVkgzekVQcVNRL3FZa2pQTgpDY0ZMNXF1WWVUODUzM3g0SENKb1dmblZReHlaOVJ2V1Y0eThpT3JqbTV3Z2xvN2U3NkRmaTBwTnczRS80MysxCisrdXdWYmhZZTE0OUhyK3FzWU1YbGFiTFJmeHhXT2RxdzMxbXdJeitSSHF5V2U4V3prZnhUVGlmQjZNVVJyQXgKRWVKQWkwdWMxSkRMCi0tLS0tRU5EIENFUlRJRklDQVRFLS0tLS0K
+      server: https://D3CBAF956940D075AE61BB6193A93256.gr7.ap-south-1.eks.amazonaws.com
+    name: antiatropos
+contexts:
+  - context:
+      cluster: antiatropos
+      user: antiatropos-hf-user
+    name: antiatropos
+current-context: antiatropos
+preferences: {}
+users:
+  - name: antiatropos-hf-user
+    user:
+      exec:
+        apiVersion: client.authentication.k8s.io/v1beta1
+        command: aws
+        args:
+          - eks
+          - get-token
+          - --region
+          - ap-south-1
+          - --cluster-name
+          - antiatropos
+        env:
+          - name: AWS_STS_REGIONAL_ENDPOINTS
+            value: regional
+          - name: AWS_DEFAULT_REGION
+            value: ap-south-1
+        interactiveMode: IfAvailable

deploy/aws/prometheus-agent-values.yaml ADDED

	@@ -0,0 +1,95 @@

+# Helm values for Prometheus Agent that remote-writes to Amazon Managed Prometheus
+#
+# Usage:
+#   helm install prometheus-agent prometheus-community/prometheus \
+#     --namespace monitoring --create-namespace \
+#     -f prometheus-agent-values.yaml \
+#     --set prometheus.prometheusSpec.remoteWrite[0].url="https://aps-workspaces.ap-south-1.amazonaws.com/workspaces/WORKSPACE_ID/api/v1/remote_write"
+#
+# Prerequisite: Create an IAM service account for the prometheus pod
+#   eksctl create iamserviceaccount \
+#     --cluster antiatropos \
+#     --namespace monitoring \
+#     --name prometheus-sa \
+#     --attach-policy-arn arn:aws:iam::aws:policy/AmazonPrometheusRemoteWriteAccess \
+#     --approve
+prometheus:
+  prometheusSpec:
+    # Run as agent mode (remote-write only, no local query API)
+    agentMode: true
+    # Remote write — override via --set on the command line
+    remoteWrite:
+      - url: "https://aps-workspaces.ap-south-1.amazonaws.com/workspaces/REPLACE_WORKSPACE_ID/api/v1/remote_write"
+        sigv4:
+          region: ap-south-1
+    # Scrape the workload pods in prod-sre namespace (the microservices
+    # the SRE agent manages: payments, checkout, catalog, cart, auth)
+    additionalScrapeConfigs:
+      - job_name: antiatropos-workloads
+        metrics_path: /metrics
+        scrape_interval: 15s
+        kubernetes_sd_configs:
+          - role: pod
+            namespaces:
+              names:
+                - prod-sre
+        relabel_configs:
+          - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
+            action: keep
+            regex: true
+          - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
+            action: replace
+            target_label: __metrics_path__
+            regex: (.+)
+          - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
+            action: replace
+            regex: ([^:]+)(?::\d+)?;(\d+)
+            replacement: $1:$2
+            target_label: __address__
+          - action: labelmap
+            regex: __meta_kubernetes_pod_label_(.+)
+          - source_labels: [__meta_kubernetes_namespace]
+            action: replace
+            target_label: namespace
+          - source_labels: [__meta_kubernetes_pod_name]
+            action: replace
+            target_label: pod
+      # Also scrape the Prometheus Agent's own metrics for monitoring
+      - job_name: prometheus-agent-self
+        scrape_interval: 15s
+        static_configs:
+          - targets:
+              - localhost:9090
+    resources:
+      requests:
+        cpu: 100m
+        memory: 256Mi
+      limits:
+        cpu: 500m
+        memory: 512Mi
+    # Short retention since we're remote-writing everything to AMP
+    retention: 2h
+  # Use the IAM service account for AMP authentication
+  serviceAccount:
+    name: prometheus-sa
+    create: false
+# Disable alertmanager (AMP handles alerting if needed)
+alertmanager:
+  enabled: false
+# Disable pushgateway
+pushgateway:
+  enabled: false
+# Disable server (we only need the agent)
+server:
+  enabled: false

deploy/aws/teardown-all.ps1 ADDED

	@@ -0,0 +1,242 @@

+# AntiAtropos - One-Run Teardown Script
+# Deletes entire AWS infrastructure: EKS cluster, AMP workspace
+#
+# Usage: .\deploy\aws\teardown-all.ps1
+$ErrorActionPreference = "Stop"
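+# In PowerShell 7.3+, prevent non-zero native command exit codes from being raised as errors
+# under $ErrorActionPreference; exit codes are checked explicitly via $LASTEXITCODE instead.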
+if (Get-Variable -Name PSNativeCommandUseErrorActionPreference -ErrorAction SilentlyContinue) {
+    $PSNativeCommandUseErrorActionPreference = $false
+}
+$Region = "ap-south-1"
+$ClusterName = "antiatropos"
+$AmpAlias = "antiatropos-metrics"
+$GeneratedKubeconfig = Join-Path $PSScriptRoot "kubeconfig-antiatropos.yaml"
+function Invoke-CheckedCommand {
+    param(
+        [ScriptBlock]$Command,
+        [string]$ErrorMessage
+    )
+    $previousErrorActionPreference = $ErrorActionPreference
+    $ErrorActionPreference = "Continue"
+    try {
+        & $Command
+    } finally {
+        $ErrorActionPreference = $previousErrorActionPreference
+    }
+    if ($LASTEXITCODE -ne 0) {
+        throw $ErrorMessage
+    }
+}
+function Get-EksClusterStatus {
+    param(
+        [string]$Name,
+        [string]$AwsRegion
+    )
+    try {
+        $status = aws eks describe-cluster --name $Name --region $AwsRegion --query 'cluster.status' --output text 2>$null
+    } catch {
+        return $null
+    }
+    if ($LASTEXITCODE -ne 0 -or [string]::IsNullOrWhiteSpace($status) -or $status -eq "None") {
+        return $null
+    }
+    return $status.Trim()
+}
+function Get-EksNodegroups {
+    param(
+        [string]$Name,
+        [string]$AwsRegion
+    )
+    try {
+        $raw = aws eks list-nodegroups --cluster-name $Name --region $AwsRegion --query 'nodegroups' --output text 2>$null
+    } catch {
+        return @()
+    }
+    if ($LASTEXITCODE -ne 0 -or [string]::IsNullOrWhiteSpace($raw) -or $raw -eq "None") {
+        return @()
+    }
+    return @($raw -split '\s+' | Where-Object { -not [string]::IsNullOrWhiteSpace($_) })
+}
+function Remove-ResidualEksStacks {
+    param(
+        [string]$Cluster,
+        [string]$AwsRegion
+    )
+    $stackPrefix = "eksctl-$Cluster"
+    $stackQuery = "StackSummaries[?starts_with(StackName, '$stackPrefix') && (StackStatus!='DELETE_COMPLETE' && StackStatus!='DELETE_IN_PROGRESS')].StackName"
+    $stacksText = aws cloudformation list-stacks --region $AwsRegion --query $stackQuery --output text 2>$null
+    if ($LASTEXITCODE -ne 0 -or [string]::IsNullOrWhiteSpace($stacksText) -or $stacksText -eq "None") {
+        return
+    }
+    $stacks = @($stacksText -split '\s+' | Where-Object { -not [string]::IsNullOrWhiteSpace($_) })
+    foreach ($stack in $stacks) {
+        Write-Host "Deleting residual stack: $stack" -ForegroundColor Yellow
+        Invoke-CheckedCommand -Command { aws cloudformation delete-stack --stack-name $stack --region $AwsRegion 2>$null | Out-Null } -ErrorMessage "Failed to delete stack '$stack'"
+        Invoke-CheckedCommand -Command { aws cloudformation wait stack-delete-complete --stack-name $stack --region $AwsRegion } -ErrorMessage "Timed out deleting stack '$stack'"
+    }
+}
+function Get-AmpWorkspaceIdByAlias {
+    param(
+        [string]$Alias,
+        [string]$AwsRegion
+    )
+    try {
+        $id = aws amp list-workspaces --alias $Alias --region $AwsRegion --query 'workspaces[0].workspaceId' --output text 2>$null
+    } catch {
+        return $null
+    }
+    if ($LASTEXITCODE -ne 0 -or [string]::IsNullOrWhiteSpace($id) -or $id -eq "None") {
+        return $null
+    }
+    return $id.Trim()
+}
+function Wait-AmpWorkspaceDeleted {
+    param(
+        [string]$WorkspaceId,
+        [string]$AwsRegion
+    )
+    for ($i = 0; $i -lt 30; $i++) {
+        try {
+            $status = aws amp describe-workspace --workspace-id $WorkspaceId --region $AwsRegion --query 'workspace.status.statusCode' --output text 2>$null
+        } catch {
+            return
+        }
+        if ($LASTEXITCODE -ne 0 -or [string]::IsNullOrWhiteSpace($status) -or $status -eq "None") {
+            return
+        }
+        Start-Sleep -Seconds 10
+    }
+    throw "AMP workspace '$WorkspaceId' deletion timed out"
+}
+Write-Host ""
+Write-Host "==========================================" -ForegroundColor Red
+Write-Host "   AntiAtropos AWS Infrastructure Teardown" -ForegroundColor Red
+Write-Host "==========================================" -ForegroundColor Red
+Write-Host "Region:      $Region"
+Write-Host "Cluster:     $ClusterName"
+Write-Host ""
+# --- Step 1: Delete EKS Cluster ---
+Write-Host ">>> Step 1: Deleting EKS cluster..." -ForegroundColor Yellow
+$clusterStatus = Get-EksClusterStatus -Name $ClusterName -AwsRegion $Region
+if ($clusterStatus) {
+    Write-Host "Cluster status: $clusterStatus" -ForegroundColor Yellow
+    if ($clusterStatus -ne "DELETING") {
+        $nodegroups = Get-EksNodegroups -Name $ClusterName -AwsRegion $Region
+        foreach ($ng in $nodegroups) {
+            Write-Host "Deleting nodegroup: $ng" -ForegroundColor Yellow
+            $ngStatus = aws eks describe-nodegroup --cluster-name $ClusterName --nodegroup-name $ng --region $Region --query 'nodegroup.status' --output text 2>$null
+            if ($LASTEXITCODE -eq 0 -and $ngStatus -ne "DELETING") {
+                Invoke-CheckedCommand -Command { aws eks delete-nodegroup --cluster-name $ClusterName --nodegroup-name $ng --region $Region --output text 2>$null | Out-Null } -ErrorMessage "Failed to start deletion for nodegroup '$ng'"
+            } else {
+                Write-Host "Nodegroup '$ng' already deleting" -ForegroundColor Yellow
+            }
+            Write-Host "Waiting for nodegroup deletion: $ng" -ForegroundColor Yellow
+            Invoke-CheckedCommand -Command { aws eks wait nodegroup-deleted --cluster-name $ClusterName --nodegroup-name $ng --region $Region } -ErrorMessage "Timed out waiting for nodegroup '$ng' deletion"
+            Write-Host "OK: Nodegroup deleted: $ng" -ForegroundColor Green
+        }
+        Write-Host "Deleting cluster control plane..." -ForegroundColor Yellow
+        Invoke-CheckedCommand -Command { eksctl delete cluster --name $ClusterName --region $Region 2>$null | Out-Null } -ErrorMessage "Failed to delete EKS cluster"
+    } else {
+        Write-Host "Cluster is already deleting" -ForegroundColor Yellow
+    }
+    Write-Host "Waiting for cluster deletion..." -ForegroundColor Yellow
+    Invoke-CheckedCommand -Command { aws eks wait cluster-deleted --name $ClusterName --region $Region } -ErrorMessage "Timed out waiting for EKS cluster deletion"
+    Write-Host "OK: Cluster deleted" -ForegroundColor Green
+} else {
+    Write-Host "OK: Cluster not found, skipping" -ForegroundColor Green
+}
+Write-Host "Checking for residual eksctl stacks..." -ForegroundColor Yellow
+Remove-ResidualEksStacks -Cluster $ClusterName -AwsRegion $Region
+Write-Host "OK: Residual EKS stacks cleaned" -ForegroundColor Green
+# --- Step 2: Delete AMP Workspace ---
+Write-Host ""
+Write-Host ">>> Step 2: Deleting AMP workspace..." -ForegroundColor Yellow
+$AmpWsId = Get-AmpWorkspaceIdByAlias -Alias $AmpAlias -AwsRegion $Region
+if (-not [string]::IsNullOrWhiteSpace($AmpWsId)) {
+    Invoke-CheckedCommand -Command { aws amp delete-workspace --workspace-id $AmpWsId --region $Region | Out-Null } -ErrorMessage "Failed to delete AMP workspace '$AmpWsId'"
+    Wait-AmpWorkspaceDeleted -WorkspaceId $AmpWsId -AwsRegion $Region
+    Write-Host "OK: AMP workspace deleted: $AmpWsId" -ForegroundColor Green
+} else {
+    Write-Host "OK: AMP workspace not found, skipping" -ForegroundColor Green
+}
+# --- Step 3: Local kubeconfig cleanup ---
+Write-Host ""
+Write-Host ">>> Step 3: Cleaning local kubeconfig entries..." -ForegroundColor Yellow
+try { kubectl config delete-context $ClusterName 2>$null | Out-Null } catch {}
+try { kubectl config delete-cluster $ClusterName 2>$null | Out-Null } catch {}
+try { kubectl config delete-user antiatropos-hf-user 2>$null | Out-Null } catch {}
+if (Test-Path $GeneratedKubeconfig) {
+    Remove-Item $GeneratedKubeconfig -Force
+    Write-Host "OK: Removed generated kubeconfig file" -ForegroundColor Green
+} else {
+    Write-Host "OK: Generated kubeconfig file not found, skipping" -ForegroundColor Green
+}
+# --- Step 4: Verify Cleanup ---
+Write-Host ""
+Write-Host ">>> Step 4: Verifying cleanup..." -ForegroundColor Yellow
+$clusterStillExists = [bool](Get-EksClusterStatus -Name $ClusterName -AwsRegion $Region)
+if ($clusterStillExists) {
+    Write-Host "WARN: Cluster still exists (deletion in progress)" -ForegroundColor Yellow
+} else {
+    Write-Host "OK: Cluster deleted" -ForegroundColor Green
+}
+$ampStillExists = -not [string]::IsNullOrWhiteSpace((Get-AmpWorkspaceIdByAlias -Alias $AmpAlias -AwsRegion $Region))
+if ($ampStillExists) {
+    Write-Host "WARN: AMP workspace alias '$AmpAlias' still exists" -ForegroundColor Yellow
+} else {
+    Write-Host "OK: AMP workspace deleted" -ForegroundColor Green
+}
+# --- Done ---
+Write-Host ""
+Write-Host "==========================================" -ForegroundColor Green
+Write-Host "   Teardown Complete!" -ForegroundColor Green
+Write-Host "==========================================" -ForegroundColor Green
+Write-Host ""
+Write-Host "All AWS infrastructure has been removed." -ForegroundColor Yellow
+Write-Host ""

deploy/do/README.md ADDED

	@@ -0,0 +1,92 @@

+# DigitalOcean Droplet one-shot deploy
+This deploy flow is for a single Ubuntu Droplet running:
+- k3s (single-node Kubernetes)
+- AntiAtropos sample workloads (`prod-sre`)
+- Prometheus + Grafana (`monitoring`)
+- lightweight control-plane API (`antiatropos-control` on port `8010`)
+The OpenEnv runtime (`server.app`) is intentionally **not** run on the droplet.
+The only supported split is:
+- local machine: OpenEnv server + inference loop
+- droplet: Kubernetes executor API + observability stack
+## Run
+From repository root on the Droplet:
+```bash
+sudo bash deploy/do/deploy-droplet-one-shot.sh
+```
+Optional overrides:
+```bash
+sudo REPO_DIR=/opt/AntiAtropos CONTROL_PORT=8010 MAX_REPLICAS=200 bash deploy/do/deploy-droplet-one-shot.sh
+```
+## What the script configures
+- k3s kubelet with `max-pods=250`
+- Prometheus service exposed on NodePort `30090`
+- Prometheus scrape job for annotated pods in namespace `prod-sre`
+- Env file at `.env.droplet` with:
+  - `KUBECONFIG=/etc/rancher/k3s/k3s.yaml`
+  - `ANTIATROPOS_WORKLOAD_MAP` for `node-0`..`node-4`
+- Systemd service:
+  - Name: `antiatropos-control`
+  - Exec: `uvicorn server.local_laptop_control:app --host 0.0.0.0 --port 8010`
+- Legacy cleanup:
+  - `antiatropos-fastapi` (the legacy VM OpenEnv service) is disabled and removed by the default deploy path
+## Verify
+```bash
+systemctl status antiatropos-control --no-pager
+curl http://127.0.0.1:8010/health
+kubectl get deploy -n prod-sre
+kubectl get pods -n monitoring
+curl http://127.0.0.1:30090/api/v1/targets
+kubectl -n monitoring port-forward svc/grafana 3000:80
+```
+Set local `.env` to use this consolidated path:
+```env
+ENV_URL=http://localhost:8000
+ANTIATROPOS_CONTROL_PLANE_URL=http://<droplet-ip>:8010
+PROMETHEUS_URL=http://<droplet-ip>:30090
+```
+## Deterministic remote-scaling proof
+On the droplet, watch the desired replica counts:
+```bash
+watch -n 1 'kubectl -n prod-sre get deploy -o custom-columns=NAME:.metadata.name,DESIRED:.spec.replicas,READY:.status.readyReplicas,AVAILABLE:.status.availableReplicas'
+```
+From local machine, send one control action:
+```bash
+curl -X POST http://<droplet-ip>:8010/step \
+  -H "Content-Type: application/json" \
+  -d '{"action_type":"SCALE_UP","target_node_id":"node-0","parameter":1.0}'
+```
+If the desired replica count for `payments` increases, scaling is happening on the droplet.
+## Troubleshooting
+- **Pods do not move during inference**
+  - Verify local env points to droplet control API:
+    - `ANTIATROPOS_CONTROL_PLANE_URL=http://<droplet-ip>:8010`
+  - Check droplet control health:
+    - `curl http://127.0.0.1:8010/health`
+  - Check service status:
+    - `systemctl status antiatropos-control --no-pager`
+- **Connection refused from local to droplet:8010**
+  - The service is not running or the firewall blocks port `8010`.
+  - Start the service and open the port in the firewall if needed.
+- **Need to remove legacy VM OpenEnv service**
+  - `sudo bash deploy/do/uninstall-legacy-openenv.sh`
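Reviewer sketch: the scaling proof in the README posts a hand-written JSON body to `/step`. A small helper can build that body and reject unknown action names before anything is sent. `build_step_payload` is a hypothetical name, and the accepted action list is taken from the comment in k8s-workloads.yaml; whether the control API accepts all four actions is an assumption.

```shell
#!/usr/bin/env bash
# Sketch: build the JSON body for POST /step. build_step_payload is a hypothetical helper.
build_step_payload() {
    local action="$1" node="$2" parameter="$3"
    case "$action" in
        SCALE_UP|SCALE_DOWN|REROUTE_TRAFFIC|SHED_LOAD) ;;
        *) echo "unknown action: $action" >&2; return 1 ;;
    esac
    printf '{"action_type":"%s","target_node_id":"%s","parameter":%s}\n' \
        "$action" "$node" "$parameter"
}

build_step_payload SCALE_UP node-0 1.0
```

The output can be passed to the README's curl command with `-d "$(build_step_payload SCALE_UP node-0 1.0)"`.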

deploy/do/antiatropos-control.service ADDED

	@@ -0,0 +1,16 @@

+[Unit]
+Description=AntiAtropos Droplet Control API
+After=network-online.target k3s.service
+Wants=network-online.target
+[Service]
+Type=simple
+User=root
+WorkingDirectory=/root/Anti-Atropos
+EnvironmentFile=/root/Anti-Atropos/.env.droplet
+ExecStart=/root/Anti-Atropos/.venv-droplet/bin/uvicorn server.local_laptop_control:app --host 0.0.0.0 --port 8010
+Restart=always
+RestartSec=3
+[Install]
+WantedBy=multi-user.target

deploy/do/deploy-droplet-one-shot.sh ADDED

	@@ -0,0 +1,183 @@

+#!/usr/bin/env bash
+set -euo pipefail
+# One-shot deploy for a single DigitalOcean Droplet:
+# - Installs k3s with kubelet max-pods=250
+# - Deploys workloads + Prometheus + Grafana
+# - Creates env file for live Kubernetes scaling
+# - Starts lightweight control-plane API via systemd (antiatropos-control)
+if [[ "${EUID}" -ne 0 ]]; then
+  echo "Run as root: sudo bash deploy/do/deploy-droplet-one-shot.sh"
+  exit 1
+fi
+REPO_DIR="${REPO_DIR:-$(cd "$(dirname "${BASH_SOURCE[0]}")/../.." && pwd)}"
+KUBECONFIG_PATH="${KUBECONFIG_PATH:-/etc/rancher/k3s/k3s.yaml}"
+CONTROL_PORT="${CONTROL_PORT:-8010}"
+CONTROL_HOST="${CONTROL_HOST:-0.0.0.0}"
+K8S_NAMESPACE="${K8S_NAMESPACE:-prod-sre}"
+MONITORING_NAMESPACE="${MONITORING_NAMESPACE:-monitoring}"
+PY_VENV_DIR="${PY_VENV_DIR:-${REPO_DIR}/.venv-droplet}"
+ENV_FILE="${ENV_FILE:-${REPO_DIR}/.env.droplet}"
+MIN_REPLICAS="${MIN_REPLICAS:-1}"
+MAX_REPLICAS="${MAX_REPLICAS:-250}"
+SCALE_STEP="${SCALE_STEP:-3}"
+# Keep the JSON default in its own variable: inside a :- expansion the first unescaped } would end the default early.
+DEFAULT_WORKLOAD_MAP='{"node-0":{"deployment":"payments","namespace":"prod-sre"},"node-1":{"deployment":"checkout","namespace":"prod-sre"},"node-2":{"deployment":"catalog","namespace":"prod-sre"},"node-3":{"deployment":"cart","namespace":"prod-sre"},"node-4":{"deployment":"auth","namespace":"prod-sre"}}'
+WORKLOAD_MAP="${WORKLOAD_MAP:-$DEFAULT_WORKLOAD_MAP}"
+echo "=== AntiAtropos Droplet One-Shot Deploy ==="
+echo "Repo:        ${REPO_DIR}"
+echo "Kubeconfig:  ${KUBECONFIG_PATH}"
+echo "Control API: ${CONTROL_HOST}:${CONTROL_PORT}"
+echo ""
+if [[ ! -f "${REPO_DIR}/deploy/local-laptop.yaml" ]]; then
+  echo "ERROR: deploy/local-laptop.yaml not found. Run from AntiAtropos checkout."
+  exit 1
+fi
+export DEBIAN_FRONTEND=noninteractive
+apt-get update
+apt-get install -y curl ca-certificates gnupg lsb-release python3 python3-venv python3-pip
+if ! command -v k3s >/dev/null 2>&1; then
+  echo "Installing k3s (includes kubectl)..."
+  curl -sfL https://get.k3s.io | sh -s - --write-kubeconfig-mode 644 --kubelet-arg=max-pods=250
+else
+  echo "k3s already present; skipping k3s install."
+fi
+if ! command -v helm >/dev/null 2>&1; then
+  echo "Installing Helm..."
+  curl -fsSL https://raw.githubusercontent.com/helm/helm/main/scripts/get-helm-3 | bash
+fi
+export KUBECONFIG="${KUBECONFIG_PATH}"
+echo "Waiting for Kubernetes node to be Ready..."
+kubectl wait --for=condition=Ready node --all --timeout=180s
+kubectl create ns "${K8S_NAMESPACE}" >/dev/null 2>&1 || true
+kubectl create ns "${MONITORING_NAMESPACE}" >/dev/null 2>&1 || true
+echo "Deploying AntiAtropos workloads..."
+kubectl apply -f "${REPO_DIR}/deploy/local-laptop.yaml"
+echo "Installing/upgrading Prometheus + Grafana..."
+helm repo add prometheus-community https://prometheus-community.github.io/helm-charts >/dev/null 2>&1 || true
+helm repo add grafana https://grafana.github.io/helm-charts >/dev/null 2>&1 || true
+helm repo update
+helm upgrade --install prometheus prometheus-community/prometheus \
+  -n "${MONITORING_NAMESPACE}" \
+  -f "${REPO_DIR}/deploy/prometheus-helm-values.yaml"
+if [[ -d "${REPO_DIR}/deploy/grafana/provisioning/dashboards/json" ]]; then
+  kubectl delete configmap grafana-dashboards -n "${MONITORING_NAMESPACE}" >/dev/null 2>&1 || true
+  kubectl create configmap grafana-dashboards \
+    -n "${MONITORING_NAMESPACE}" \
+    --from-file="${REPO_DIR}/deploy/grafana/provisioning/dashboards/json/"
+fi
+helm upgrade --install grafana grafana/grafana \
+  -n "${MONITORING_NAMESPACE}" \
+  -f "${REPO_DIR}/deploy/grafana-helm-values.yaml"
+echo "Exposing Grafana on NodePort 30000..."
+kubectl patch svc grafana -n "${MONITORING_NAMESPACE}" --type='merge' -p '{
+  "spec": {
+    "type": "NodePort",
+    "ports": [
+      {"port": 80, "nodePort": 30000, "targetPort": 3000, "name": "service"}
+    ]
+  }
+}' || true
+echo "Waiting for Grafana pods to be ready..."
+kubectl wait --for=condition=ready pod -l app.kubernetes.io/name=grafana -n "${MONITORING_NAMESPACE}" --timeout=180s || true
+if [[ ! -f "${ENV_FILE}" ]]; then
+  cat > "${ENV_FILE}" <<EOF
+KUBECONFIG=/etc/rancher/k3s/k3s.yaml
+ANTIATROPOS_K8S_NAMESPACE=prod-sre
+ANTIATROPOS_MIN_REPLICAS=${MIN_REPLICAS}
+ANTIATROPOS_MAX_REPLICAS=${MAX_REPLICAS}
+ANTIATROPOS_SCALE_STEP=${SCALE_STEP}
+ANTIATROPOS_WORKLOAD_MAP=${WORKLOAD_MAP}
+EOF
+  echo "Created ${ENV_FILE}"
+else
+  echo "Using existing ${ENV_FILE}"
+fi
+echo "Preparing Python environment..."
+python3 -m venv "${PY_VENV_DIR}"
+"${PY_VENV_DIR}/bin/python" -m pip install --upgrade pip
+if [[ -f "${REPO_DIR}/pyproject.toml" ]]; then
+  # Prefer project metadata (uses openenv-core, not legacy openenv package name).
+  "${PY_VENV_DIR}/bin/pip" install -e "${REPO_DIR}"
+else
+  "${PY_VENV_DIR}/bin/pip" install -r "${REPO_DIR}/server/requirements.txt"
+fi
+# Hard cleanup: remove legacy VM OpenEnv service if it exists.
+if systemctl list-unit-files | grep -q '^antiatropos-fastapi\.service'; then
+  echo "Disabling legacy service antiatropos-fastapi..."
+  systemctl disable --now antiatropos-fastapi >/dev/null 2>&1 || true
+  rm -f /etc/systemd/system/antiatropos-fastapi.service
+fi
+cat > /etc/systemd/system/antiatropos-control.service <<EOF
+[Unit]
+Description=AntiAtropos Droplet Control API
+After=network-online.target k3s.service
+Wants=network-online.target
+[Service]
+Type=simple
+User=root
+WorkingDirectory=${REPO_DIR}
+EnvironmentFile=${ENV_FILE}
+ExecStart=${PY_VENV_DIR}/bin/uvicorn server.local_laptop_control:app --host ${CONTROL_HOST} --port ${CONTROL_PORT}
+Restart=always
+RestartSec=3
+[Install]
+WantedBy=multi-user.target
+EOF
+systemctl daemon-reload
+systemctl enable --now antiatropos-control
+echo ""
+echo "Waiting for control API readiness..."
+for _ in {1..30}; do
+  if curl -fsS "http://127.0.0.1:${CONTROL_PORT}/health" >/dev/null 2>&1; then
+    break
+  fi
+  sleep 2
+done
+PUBLIC_IP="$(curl -fsS https://api.ipify.org 2>/dev/null || true)"
+if [[ -z "${PUBLIC_IP}" ]]; then
+  PUBLIC_IP="$(hostname -I 2>/dev/null | awk '{print $1}')"
+fi
+PROM_URL_DISPLAY="http://${PUBLIC_IP:-<droplet-ip>}:30090"
+echo ""
+echo "=== Deploy Complete ==="
+echo "Control health:  http://127.0.0.1:${CONTROL_PORT}/health"
+echo "Control step:    http://127.0.0.1:${CONTROL_PORT}/step"
+echo "Prometheus svc:  kubectl -n ${MONITORING_NAMESPACE} get svc prometheus-server"
+echo "Prometheus URL:  ${PROM_URL_DISPLAY}"
+echo "Grafana URL:     http://${PUBLIC_IP:-<droplet-ip>}:30000  (admin / antiatropos)"
+echo ""
+echo "Service status command:"
+echo "  systemctl status antiatropos-control --no-pager"
+echo ""
+echo "If needed, edit env and restart control service:"
+echo "  ${ENV_FILE}"
+echo "  systemctl restart antiatropos-control"
+echo ""
+echo "Verify remote scaling path:"
+echo "  watch -n 1 'kubectl -n prod-sre get deploy -o custom-columns=NAME:.metadata.name,DESIRED:.spec.replicas,READY:.status.readyReplicas'"
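Note on the readiness loop above: it polls `/health` up to 30 times, 2 seconds apart, but the script prints the "Deploy Complete" banner whether or not the API ever answered. Below is a minimal sketch of a retry helper that reports the timeout through its exit status instead. `retry_until` is a hypothetical name and is not part of this commit.

```shell
# Hypothetical helper (not in the commit): run a probe command until it
# succeeds or the attempt budget is spent.
# Usage: retry_until <attempts> <delay_seconds> <command> [args...]
retry_until() {
  _ru_attempts="$1"
  _ru_delay="$2"
  shift 2
  _ru_i=1
  while [ "${_ru_i}" -le "${_ru_attempts}" ]; do
    # Probe output is discarded; only the exit status matters.
    if "$@" >/dev/null 2>&1; then
      return 0
    fi
    sleep "${_ru_delay}"
    _ru_i=$((_ru_i + 1))
  done
  # Budget exhausted without a successful probe.
  return 1
}
```

In the deploy script this could stand in for the `for _ in {1..30}` loop, for example `retry_until 30 2 curl -fsS "http://127.0.0.1:${CONTROL_PORT}/health" || echo "control API not ready" >&2`.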

deploy/do/uninstall-legacy-openenv.sh ADDED

@@ -0,0 +1,25 @@

+#!/usr/bin/env bash
+set -euo pipefail
+# Removes legacy VM OpenEnv service path.
+# This keeps droplet runtime focused on control API + observability only.
+if [[ "${EUID}" -ne 0 ]]; then
+  echo "Run as root: sudo bash deploy/do/uninstall-legacy-openenv.sh"
+  exit 1
+fi
+if systemctl list-unit-files | grep -q '^antiatropos-fastapi\.service'; then
+  echo "Stopping and disabling antiatropos-fastapi..."
+  systemctl disable --now antiatropos-fastapi >/dev/null 2>&1 || true
+else
+  echo "antiatropos-fastapi service not registered."
+fi
+if [[ -f /etc/systemd/system/antiatropos-fastapi.service ]]; then
+  rm -f /etc/systemd/system/antiatropos-fastapi.service
+  echo "Removed /etc/systemd/system/antiatropos-fastapi.service"
+fi
+systemctl daemon-reload
+echo "Legacy VM OpenEnv service cleanup complete."
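One caveat with the `systemctl list-unit-files | grep -q ...` check under `set -o pipefail`: `grep -q` exits at the first match, and if the producer is still writing at that point it can be terminated by SIGPIPE, which makes the whole pipeline report failure even though the unit was found. With a short unit list this rarely shows up in practice. Below is a sketch of a variant that avoids it by counting matches, so all input is consumed. `unit_listed` is a hypothetical helper name, not something this commit defines.

```shell
# Hypothetical helper (not in the commit): succeed if stdin contains at least
# one line starting with "<name>.service".
# grep -c has to read every line to produce a count, so the writer on the
# left side of the pipe is not cut off early.
unit_listed() {
  # "|| true" keeps a zero-match exit status from tripping "set -e".
  _ul_count="$(grep -c "^$1\.service" || true)"
  [ "${_ul_count}" -gt 0 ]
}
```

Possible use in the script: `if systemctl list-unit-files | unit_listed antiatropos-fastapi; then ...`.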

deploy/entrypoint.sh CHANGED

@@ -1,62 +1,71 @@
-#!/usr/bin/env bash
-set -euo pipefail
-FASTAPI_PID=""
-PROMETHEUS_PID=""
-GRAFANA_PID=""
-NGINX_PID=""
-MONITOR_PID=""
-cleanup() {
-    for pid in "${MONITOR_PID}" "${NGINX_PID}" "${GRAFANA_PID}" "${PROMETHEUS_PID}" "${FASTAPI_PID}"; do
-        if [[ -n "${pid}" ]]; then
-            kill "${pid}" 2>/dev/null || true
-        fi
-    done
-}
-trap cleanup INT TERM EXIT
-cd /app
-uvicorn server.app:app --host 127.0.0.1 --port 8000 &
-FASTAPI_PID=$!
-/opt/prometheus/prometheus \
-    --config.file=/etc/prometheus/prometheus.yml \
-    --storage.tsdb.path=/tmp/prometheus-data \
-    --web.listen-address=127.0.0.1:9090 \
-    --web.route-prefix=/prometheus \
-    &
-PROMETHEUS_PID=$!
-/opt/grafana/bin/grafana-server \
-    --homepath /opt/grafana \
-    --config /etc/grafana/grafana.ini \
-    cfg:default.paths.data=/var/lib/grafana \
-    cfg:default.paths.logs=/var/log/grafana \
-    cfg:default.paths.plugins=/var/lib/grafana/plugins \
-    cfg:default.paths.provisioning=/etc/grafana/provisioning \
-    &
-GRAFANA_PID=$!
-nginx -g "daemon off;" &
-NGINX_PID=$!
-monitor_children() {
-    while true; do
-        for pid in "${FASTAPI_PID}" "${PROMETHEUS_PID}" "${GRAFANA_PID}"; do
-            if ! kill -0 "${pid}" 2>/dev/null; then
-                echo "A backend service exited unexpectedly." >&2
-                kill "${NGINX_PID}" 2>/dev/null || true
-                exit 1
-            fi
-        done
-        sleep 2
-    done
-}
-monitor_children &
-MONITOR_PID=$!
-wait "${NGINX_PID}"

+#!/usr/bin/env bash
+set -euo pipefail
+FASTAPI_PID=""
+PROMETHEUS_PID=""
+GRAFANA_PID=""
+NGINX_PID=""
+MONITOR_PID=""
+cleanup() {
+    for pid in "${MONITOR_PID}" "${NGINX_PID}" "${GRAFANA_PID}" "${PROMETHEUS_PID}" "${FASTAPI_PID}"; do
+        if [[ -n "${pid}" ]]; then
+            kill "${pid}" 2>/dev/null || true
+        fi
+    done
+}
+trap cleanup INT TERM EXIT
+cd /app
+# Source HF Spaces live-mode config if present (overrides Dockerfile defaults)
+if [[ -f /app/.env.hf ]]; then
+  echo "Loading .env.hf..."
+  set -a
+  # shellcheck source=/dev/null
+  source /app/.env.hf
+  set +a
+fi
+uvicorn server.app:app --host 127.0.0.1 --port 8000 &
+FASTAPI_PID=$!
+/opt/prometheus/prometheus \
+    --config.file=/etc/prometheus/prometheus.yml \
+    --storage.tsdb.path=/tmp/prometheus-data \
+    --web.listen-address=127.0.0.1:9090 \
+    --web.route-prefix=/prometheus \
+    &
+PROMETHEUS_PID=$!
+/opt/grafana/bin/grafana-server \
+    --homepath /opt/grafana \
+    --config /etc/grafana/grafana.ini \
+    cfg:default.paths.data=/var/lib/grafana \
+    cfg:default.paths.logs=/var/log/grafana \
+    cfg:default.paths.plugins=/var/lib/grafana/plugins \
+    cfg:default.paths.provisioning=/etc/grafana/provisioning \
+    &
+GRAFANA_PID=$!
+nginx -g "daemon off;" &
+NGINX_PID=$!
+monitor_children() {
+    while true; do
+        for pid in "${FASTAPI_PID}" "${PROMETHEUS_PID}" "${GRAFANA_PID}"; do
+            if ! kill -0 "${pid}" 2>/dev/null; then
+                echo "A backend service exited unexpectedly." >&2
+                kill "${NGINX_PID}" 2>/dev/null || true
+                exit 1
+            fi
+        done
+        sleep 2
+    done
+}
+monitor_children &
+MONITOR_PID=$!
+wait "${NGINX_PID}"
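The `.env.hf` block added in this hunk relies on `set -a`, which marks every variable assigned while it is active for export. Plain `KEY=value` lines in the file therefore reach the child processes started afterwards (uvicorn, Prometheus, Grafana, nginx). Below is a small sketch of the same pattern as a reusable function. `load_env_file` and `DEMO_MODE` are hypothetical names used only for illustration.

```shell
# Hypothetical helper (not in the commit): export every assignment found in
# an env file. A missing file is silently skipped, as in the entrypoint.
load_env_file() {
  if [ -f "$1" ]; then
    set -a        # auto-export all variables assigned from here on
    . "$1"        # run the KEY=value lines in the current shell
    set +a        # back to normal, non-exporting assignments
  fi
}
```

For example, after `load_env_file /app/.env.hf`, a line such as `DEMO_MODE=live` in that file would be visible to any process the script launches later.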

deploy/grafana-datasource-local.yaml ADDED

@@ -0,0 +1,11 @@

+# Grafana datasource provisioning - points to in-cluster Prometheus
+apiVersion: 1
+datasources:
+  - name: Prometheus
+    uid: PBFA97CFB590B2093
+    type: prometheus
+    access: proxy
+    url: http://prometheus-server.monitoring.svc.cluster.local
+    isDefault: true
+    editable: true

deploy/grafana-helm-values.yaml ADDED

@@ -0,0 +1,46 @@

+# Grafana self-hosted on Kind - Simplified dashboard + datasource setup
+adminUser: admin
+adminPassword: antiatropos
+service:
+  type: ClusterIP
+  port: 80
+persistence:
+  enabled: false
+# Datasource provisioning - mount as separate file
+datasources:
+  datasources.yaml:
+    apiVersion: 1
+    datasources:
+      - name: Prometheus
+        uid: PBFA97CFB590B2093
+        type: prometheus
+        access: proxy
+        url: http://prometheus-server.monitoring.svc.cluster.local
+        isDefault: true
+        editable: true
+# Dashboard provider config
+dashboardProviders:
+  dashboardproviders.yaml:
+    apiVersion: 1
+    providers:
+      - name: AntiAtropos
+        orgId: 1
+        folder: AntiAtropos
+        type: file
+        disableDeletion: false
+        editable: true
+        updateIntervalSeconds: 30
+        options:
+          path: /var/lib/grafana/dashboards/antiatropos
+# Mount dashboard JSONs from ConfigMap
+extraConfigmapMounts:
+  - name: grafana-dashboards
+    configMap: grafana-dashboards
+    mountPath: /var/lib/grafana/dashboards/antiatropos
+    readOnly: true
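The datasource `uid` set in these values (`PBFA97CFB590B2093`) is the same one the dashboard JSON panels reference. If the two drift apart, Grafana typically reports the datasource as not found in the affected panels. Below is a rough sketch of a pre-deploy check, assuming the dashboards sit in one directory as `*.json`. `dashboards_missing_uid` is a hypothetical helper, and the grep is a plain text match rather than a real JSON parse.

```shell
# Hypothetical helper (not in the commit): print every dashboard file that
# never mentions the expected datasource UID.
dashboards_missing_uid() {
  # $1 = directory containing dashboard *.json files
  # $2 = expected Prometheus datasource UID
  for _dm_file in "$1"/*.json; do
    # Skip the unexpanded glob when the directory has no JSON files.
    [ -e "${_dm_file}" ] || continue
    if ! grep -q "\"uid\": *\"$2\"" "${_dm_file}"; then
      printf '%s\n' "${_dm_file}"
    fi
  done
}
```

Run against this repo's layout, the call might look like `dashboards_missing_uid "${REPO_DIR}/deploy/grafana/provisioning/dashboards/json" PBFA97CFB590B2093`. Empty output means every file references the UID at least once.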

deploy/grafana/grafana.ini CHANGED

@@ -1,21 +1,21 @@
-[server]
-http_addr = 127.0.0.1
-http_port = 3000
-domain = localhost
-root_url = /grafana/
-serve_from_sub_path = true
-router_logging = false
-enable_gzip = true
-[auth]
-disable_login_form = false
-[auth.anonymous]
-enabled = true
-org_role = Viewer
-[dashboards]
-default_home_dashboard_path = /etc/grafana/provisioning/dashboards/json/antiatropos-overview.json
-[security]
-allow_embedding = true

+[server]
+http_addr = 127.0.0.1
+http_port = 3000
+domain = localhost
+root_url = /grafana/
+serve_from_sub_path = true
+router_logging = false
+enable_gzip = true
+[auth]
+disable_login_form = false
+[auth.anonymous]
+enabled = true
+org_role = Viewer
+[dashboards]
+default_home_dashboard_path = /etc/grafana/provisioning/dashboards/json/antiatropos-overview.json
+[security]
+allow_embedding = true

deploy/grafana/provisioning/dashboards/dashboard.yaml CHANGED

@@ -1,12 +1,12 @@
-apiVersion: 1
-providers:
-  - name: AntiAtropos Dashboards
-    orgId: 1
-    folder: AntiAtropos
-    type: file
-    disableDeletion: false
-    editable: true
-    updateIntervalSeconds: 30
-    options:
-      path: /etc/grafana/provisioning/dashboards/json

+apiVersion: 1
+providers:
+  - name: AntiAtropos Dashboards
+    orgId: 1
+    folder: AntiAtropos
+    type: file
+    disableDeletion: false
+    editable: true
+    updateIntervalSeconds: 30
+    options:
+      path: /etc/grafana/provisioning/dashboards/json

deploy/grafana/provisioning/dashboards/json/antiatropos-live.json CHANGED

@@ -1,334 +1,334 @@
-{
-  "annotations": {
-    "list": [
-      {
-        "builtIn": 1,
-        "datasource": {
-          "type": "grafana",
-          "uid": "-- Grafana --"
-        },
-        "enable": true,
-        "hide": true,
-        "iconColor": "rgba(0, 211, 255, 1)",
-        "name": "Annotations & Alerts",
-        "type": "dashboard"
-      }
-    ]
-  },
-  "editable": true,
-  "fiscalYearStartMonth": 0,
-  "graphTooltip": 0,
-  "id": null,
-  "links": [],
-  "liveNow": false,
-  "panels": [
-    {
-      "datasource": {
-        "type": "prometheus",
-        "uid": "PBFA97CFB590B2093"
-      },
-      "fieldConfig": {
-        "defaults": {
-          "color": {
-            "mode": "palette-classic"
-          }
-        },
-        "overrides": []
-      },
-      "gridPos": {
-        "h": 7,
-        "w": 12,
-        "x": 0,
-        "y": 0
-      },
-      "id": 1,
-      "options": {
-        "legend": {
-          "calcs": [],
-          "displayMode": "list",
-          "placement": "bottom"
-        },
-        "tooltip": {
-          "mode": "single"
-        }
-      },
-      "targets": [
-        {
-          "expr": "sum by (action_type, ack_class) (rate(antiatropos_actions_total{task_id=~\"$task\",mode=~\"$mode\"}[1m]))",
-          "legendFormat": "{{action_type}} {{ack_class}}",
-          "refId": "A"
-        }
-      ],
-      "title": "Actions Per Second",
-      "type": "timeseries"
-    },
-    {
-      "datasource": {
-        "type": "prometheus",
-        "uid": "PBFA97CFB590B2093"
-      },
-      "fieldConfig": {
-        "defaults": {
-          "color": {
-            "mode": "palette-classic"
-          },
-          "min": 0,
-          "max": 1
-        },
-        "overrides": []
-      },
-      "gridPos": {
-        "h": 7,
-        "w": 12,
-        "x": 12,
-        "y": 0
-      },
-      "id": 2,
-      "options": {
-        "legend": {
-          "calcs": [],
-          "displayMode": "table",
-          "placement": "bottom"
-        },
-        "tooltip": {
-          "mode": "single"
-        }
-      },
-      "targets": [
-        {
-          "expr": "antiatropos_reward_normalized{task_id=~\"$task\",mode=~\"$mode\"}",
-          "legendFormat": "{{task_id}}/{{mode}} normalized",
-          "refId": "A"
-        }
-      ],
-      "title": "Normalized Reward [0,1]",
-      "type": "timeseries"
-    },
-    {
-      "datasource": {
-        "type": "prometheus",
-        "uid": "PBFA97CFB590B2093"
-      },
-      "fieldConfig": {
-        "defaults": {
-          "color": {
-            "mode": "palette-classic"
-          }
-        },
-        "overrides": []
-      },
-      "gridPos": {
-        "h": 7,
-        "w": 12,
-        "x": 0,
-        "y": 7
-      },
-      "id": 3,
-      "options": {
-        "legend": {
-          "calcs": [],
-          "displayMode": "table",
-          "placement": "bottom"
-        },
-        "tooltip": {
-          "mode": "single"
-        }
-      },
-      "targets": [
-        {
-          "expr": "antiatropos_reward_raw{task_id=~\"$task\",mode=~\"$mode\"}",
-          "legendFormat": "{{task_id}}/{{mode}} raw",
-          "refId": "A"
-        }
-      ],
-      "title": "Raw Reward",
-      "type": "timeseries"
-    },
-    {
-      "datasource": {
-        "type": "prometheus",
-        "uid": "PBFA97CFB590B2093"
-      },
-      "fieldConfig": {
-        "defaults": {
-          "color": {
-            "mode": "palette-classic"
-          }
-        },
-        "overrides": []
-      },
-      "gridPos": {
-        "h": 7,
-        "w": 12,
-        "x": 12,
-        "y": 7
-      },
-      "id": 4,
-      "options": {
-        "legend": {
-          "calcs": [],
-          "displayMode": "table",
-          "placement": "bottom"
-        },
-        "tooltip": {
-          "mode": "single"
-        }
-      },
-      "targets": [
-        {
-          "expr": "antiatropos_total_queue_backlog{task_id=~\"$task\",mode=~\"$mode\"}",
-          "legendFormat": "{{task_id}}/{{mode}} queue",
-          "refId": "A"
-        },
-        {
-          "expr": "antiatropos_average_latency_norm{task_id=~\"$task\",mode=~\"$mode\"}",
-          "legendFormat": "{{task_id}}/{{mode}} latency",
-          "refId": "B"
-        }
-      ],
-      "title": "Queue Backlog and Latency (Norm)",
-      "type": "timeseries"
-    },
-    {
-      "datasource": {
-        "type": "prometheus",
-        "uid": "PBFA97CFB590B2093"
-      },
-      "fieldConfig": {
-        "defaults": {
-          "color": {
-            "mode": "palette-classic"
-          }
-        },
-        "overrides": []
-      },
-      "gridPos": {
-        "h": 7,
-        "w": 12,
-        "x": 0,
-        "y": 14
-      },
-      "id": 5,
-      "options": {
-        "legend": {
-          "calcs": [],
-          "displayMode": "table",
-          "placement": "bottom"
-        },
-        "tooltip": {
-          "mode": "single"
-        }
-      },
-      "targets": [
-        {
-          "expr": "antiatropos_lyapunov_energy{task_id=~\"$task\",mode=~\"$mode\"}",
-          "legendFormat": "{{task_id}}/{{mode}}",
-          "refId": "A"
-        }
-      ],
-      "title": "Lyapunov Energy",
-      "type": "timeseries"
-    },
-    {
-      "datasource": {
-        "type": "prometheus",
-        "uid": "PBFA97CFB590B2093"
-      },
-      "fieldConfig": {
-        "defaults": {
-          "color": {
-            "mode": "palette-classic"
-          }
-        },
-        "overrides": []
-      },
-      "gridPos": {
-        "h": 7,
-        "w": 12,
-        "x": 12,
-        "y": 14
-      },
-      "id": 6,
-      "options": {
-        "legend": {
-          "calcs": [],
-          "displayMode": "table",
-          "placement": "bottom"
-        },
-        "tooltip": {
-          "mode": "single"
-        }
-      },
-      "targets": [
-        {
-          "expr": "histogram_quantile(0.95, sum(rate(antiatropos_executor_latency_ms_bucket{mode=~\"$mode\"}[2m])) by (le, mode))",
-          "legendFormat": "p95 {{mode}}",
-          "refId": "A"
-        },
-        {
-          "expr": "sum by (mode, error_code) (rate(antiatropos_executor_errors_total{mode=~\"$mode\"}[5m]))",
-          "legendFormat": "{{mode}} {{error_code}}",
-          "refId": "B"
-        }
-      ],
-      "title": "Executor Latency p95 and Errors/s",
-      "type": "timeseries"
-    }
-  ],
-  "refresh": "5s",
-  "schemaVersion": 39,
-  "style": "dark",
-  "tags": [
-    "antiatropos",
-    "sre",
-    "rl"
-  ],
-  "templating": {
-    "list": [
-      {
-        "datasource": {
-          "type": "prometheus",
-          "uid": "PBFA97CFB590B2093"
-        },
-        "definition": "label_values(antiatropos_steps_total, task_id)",
-        "includeAll": true,
-        "multi": true,
-        "name": "task",
-        "query": {
-          "qryType": 1,
-          "query": "label_values(antiatropos_steps_total, task_id)",
-          "refId": "TaskVar"
-        },
-        "refresh": 2,
-        "type": "query"
-      },
-      {
-        "datasource": {
-          "type": "prometheus",
-          "uid": "PBFA97CFB590B2093"
-        },
-        "definition": "label_values(antiatropos_steps_total, mode)",
-        "includeAll": true,
-        "multi": true,
-        "name": "mode",
-        "query": {
-          "qryType": 1,
-          "query": "label_values(antiatropos_steps_total, mode)",
-          "refId": "ModeVar"
-        },
-        "refresh": 2,
-        "type": "query"
-      }
-    ]
-  },
-  "time": {
-    "from": "now-15m",
-    "to": "now"
-  },
-  "timepicker": {},
-  "timezone": "",
-  "title": "AntiAtropos Live Control Plane",
-  "uid": "antiatropos-live",
-  "version": 2,
-  "weekStart": ""
-}

+{
+  "annotations": {
+    "list": [
+      {
+        "builtIn": 1,
+        "datasource": {
+          "type": "grafana",
+          "uid": "-- Grafana --"
+        },
+        "enable": true,
+        "hide": true,
+        "iconColor": "rgba(0, 211, 255, 1)",
+        "name": "Annotations & Alerts",
+        "type": "dashboard"
+      }
+    ]
+  },
+  "editable": true,
+  "fiscalYearStartMonth": 0,
+  "graphTooltip": 0,
+  "id": null,
+  "links": [],
+  "liveNow": false,
+  "panels": [
+    {
+      "datasource": {
+        "type": "prometheus",
+        "uid": "PBFA97CFB590B2093"
+      },
+      "fieldConfig": {
+        "defaults": {
+          "color": {
+            "mode": "palette-classic"
+          }
+        },
+        "overrides": []
+      },
+      "gridPos": {
+        "h": 7,
+        "w": 12,
+        "x": 0,
+        "y": 0
+      },
+      "id": 1,
+      "options": {
+        "legend": {
+          "calcs": [],
+          "displayMode": "list",
+          "placement": "bottom"
+        },
+        "tooltip": {
+          "mode": "single"
+        }
+      },
+      "targets": [
+        {
+          "expr": "sum by (action_type, ack_class) (rate(antiatropos_actions_total{task_id=~\"$task\",mode=~\"$mode\"}[5m]))",
+          "legendFormat": "{{action_type}} {{ack_class}}",
+          "refId": "A"
+        }
+      ],
+      "title": "Actions Per Second",
+      "type": "timeseries"
+    },
+    {
+      "datasource": {
+        "type": "prometheus",
+        "uid": "PBFA97CFB590B2093"
+      },
+      "fieldConfig": {
+        "defaults": {
+          "color": {
+            "mode": "palette-classic"
+          },
+          "min": 0,
+          "max": 1
+        },
+        "overrides": []
+      },
+      "gridPos": {
+        "h": 7,
+        "w": 12,
+        "x": 12,
+        "y": 0
+      },
+      "id": 2,
+      "options": {
+        "legend": {
+          "calcs": [],
+          "displayMode": "table",
+          "placement": "bottom"
+        },
+        "tooltip": {
+          "mode": "single"
+        }
+      },
+      "targets": [
+        {
+          "expr": "antiatropos_reward_normalized{task_id=~\"$task\",mode=~\"$mode\"}",
+          "legendFormat": "{{task_id}}/{{mode}} normalized",
+          "refId": "A"
+        }
+      ],
+      "title": "Normalized Reward [0,1]",
+      "type": "timeseries"
+    },
+    {
+      "datasource": {
+        "type": "prometheus",
+        "uid": "PBFA97CFB590B2093"
+      },
+      "fieldConfig": {
+        "defaults": {
+          "color": {
+            "mode": "palette-classic"
+          }
+        },
+        "overrides": []
+      },
+      "gridPos": {
+        "h": 7,
+        "w": 12,
+        "x": 0,
+        "y": 7
+      },
+      "id": 3,
+      "options": {
+        "legend": {
+          "calcs": [],
+          "displayMode": "table",
+          "placement": "bottom"
+        },
+        "tooltip": {
+          "mode": "single"
+        }
+      },
+      "targets": [
+        {
+          "expr": "antiatropos_reward_raw{task_id=~\"$task\",mode=~\"$mode\"}",
+          "legendFormat": "{{task_id}}/{{mode}} raw",
+          "refId": "A"
+        }
+      ],
+      "title": "Raw Reward",
+      "type": "timeseries"
+    },
+    {
+      "datasource": {
+        "type": "prometheus",
+        "uid": "PBFA97CFB590B2093"
+      },
+      "fieldConfig": {
+        "defaults": {
+          "color": {
+            "mode": "palette-classic"
+          }
+        },
+        "overrides": []
+      },
+      "gridPos": {
+        "h": 7,
+        "w": 12,
+        "x": 12,
+        "y": 7
+      },
+      "id": 4,
+      "options": {
+        "legend": {
+          "calcs": [],
+          "displayMode": "table",
+          "placement": "bottom"
+        },
+        "tooltip": {
+          "mode": "single"
+        }
+      },
+      "targets": [
+        {
+          "expr": "antiatropos_total_queue_backlog{task_id=~\"$task\",mode=~\"$mode\"}",
+          "legendFormat": "{{task_id}}/{{mode}} queue",
+          "refId": "A"
+        },
+        {
+          "expr": "antiatropos_average_latency_norm{task_id=~\"$task\",mode=~\"$mode\"}",
+          "legendFormat": "{{task_id}}/{{mode}} latency",
+          "refId": "B"
+        }
+      ],
+      "title": "Queue Backlog and Latency (Norm)",
+      "type": "timeseries"
+    },
+    {
+      "datasource": {
+        "type": "prometheus",
+        "uid": "PBFA97CFB590B2093"
+      },
+      "fieldConfig": {
+        "defaults": {
+          "color": {
+            "mode": "palette-classic"
+          }
+        },
+        "overrides": []
+      },
+      "gridPos": {
+        "h": 7,
+        "w": 12,
+        "x": 0,
+        "y": 14
+      },
+      "id": 5,
+      "options": {
+        "legend": {
+          "calcs": [],
+          "displayMode": "table",
+          "placement": "bottom"
+        },
+        "tooltip": {
+          "mode": "single"
+        }
+      },
+      "targets": [
+        {
+          "expr": "antiatropos_lyapunov_energy{task_id=~\"$task\",mode=~\"$mode\"}",
+          "legendFormat": "{{task_id}}/{{mode}}",
+          "refId": "A"
+        }
+      ],
+      "title": "Lyapunov Energy",
+      "type": "timeseries"
+    },
+    {
+      "datasource": {
+        "type": "prometheus",
+        "uid": "PBFA97CFB590B2093"
+      },
+      "fieldConfig": {
+        "defaults": {
+          "color": {
+            "mode": "palette-classic"
+          }
+        },
+        "overrides": []
+      },
+      "gridPos": {
+        "h": 7,
+        "w": 12,
+        "x": 12,
+        "y": 14
+      },
+      "id": 6,
+      "options": {
+        "legend": {
+          "calcs": [],
+          "displayMode": "table",
+          "placement": "bottom"
+        },
+        "tooltip": {
+          "mode": "single"
+        }
+      },
+      "targets": [
+        {
+          "expr": "histogram_quantile(0.95, sum(rate(antiatropos_executor_latency_ms_bucket{mode=~\"$mode\"}[5m])) by (le, mode))",
+          "legendFormat": "p95 {{mode}}",
+          "refId": "A"
+        },
+        {
+          "expr": "sum by (mode, error_code) (rate(antiatropos_executor_errors_total{mode=~\"$mode\"}[5m]))",
+          "legendFormat": "{{mode}} {{error_code}}",
+          "refId": "B"
+        }
+      ],
+      "title": "Executor Latency p95 and Errors/s",
+      "type": "timeseries"
+    }
+  ],
+  "refresh": "5s",
+  "schemaVersion": 39,
+  "style": "dark",
+  "tags": [
+    "antiatropos",
+    "sre",
+    "rl"
+  ],
+  "templating": {
+    "list": [
+      {
+        "datasource": {
+          "type": "prometheus",
+          "uid": "PBFA97CFB590B2093"
+        },
+        "definition": "label_values(antiatropos_steps_total, task_id)",
+        "includeAll": true,
+        "multi": true,
+        "name": "task",
+        "query": {
+          "qryType": 1,
+          "query": "label_values(antiatropos_steps_total, task_id)",
+          "refId": "TaskVar"
+        },
+        "refresh": 2,
+        "type": "query"
+      },
+      {
+        "datasource": {
+          "type": "prometheus",
+          "uid": "PBFA97CFB590B2093"
+        },
+        "definition": "label_values(antiatropos_steps_total, mode)",
+        "includeAll": true,
+        "multi": true,
+        "name": "mode",
+        "query": {
+          "qryType": 1,
+          "query": "label_values(antiatropos_steps_total, mode)",
+          "refId": "ModeVar"
+        },
+        "refresh": 2,
+        "type": "query"
+      }
+    ]
+  },
+  "time": {
+    "from": "now-15m",
+    "to": "now"
+  },
+  "timepicker": {},
+  "timezone": "",
+  "title": "AntiAtropos Live Control Plane",
+  "uid": "antiatropos-live",
+  "version": 2,
+  "weekStart": ""
+}

deploy/grafana/provisioning/dashboards/json/antiatropos-overview.json CHANGED

@@ -76,8 +76,8 @@
       "targets": [
         {
           "editorMode": "code",
-          "expr": "scalar(avg(last_over_time(antiatropos_reward{mode=\"simulated\"}[1m])))",
-          "legendFormat": "reward (simulated)",
           "range": true,
           "refId": "A"
         }
@@ -143,8 +143,8 @@
       "targets": [
         {
           "editorMode": "code",
-          "expr": "scalar(avg(last_over_time(antiatropos_total_queue_backlog{mode=\"simulated\"}[1m])))",
-          "legendFormat": "queue backlog (simulated)",
           "range": true,
           "refId": "A"
         }
@@ -210,8 +210,8 @@
       "targets": [
         {
           "editorMode": "code",
-          "expr": "scalar(avg(last_over_time(antiatropos_average_latency_norm{mode=\"simulated\"}[1m])))",
-          "legendFormat": "latency (simulated)",
           "range": true,
           "refId": "A"
         }
@@ -277,8 +277,8 @@
       "targets": [
         {
           "editorMode": "code",
-          "expr": "scalar(avg(last_over_time(antiatropos_lyapunov_energy{mode=\"simulated\"}[1m])))",
-          "legendFormat": "lyapunov energy (simulated)",
           "range": true,
           "refId": "A"
         }
@@ -369,14 +369,14 @@
       "targets": [
         {
           "editorMode": "code",
-          "expr": "antiatropos_reward{mode=\"simulated\"}",
           "legendFormat": "reward {{task_id}} ({{mode}})",
           "range": true,
           "refId": "A"
         },
         {
           "editorMode": "code",
-          "expr": "antiatropos_lyapunov_energy{mode=\"simulated\"}",
           "legendFormat": "lyapunov {{task_id}} ({{mode}})",
           "range": true,
           "refId": "B"
@@ -468,14 +468,14 @@
       "targets": [
         {
           "editorMode": "code",
-          "expr": "antiatropos_total_queue_backlog{mode=\"simulated\"}",
           "legendFormat": "queue {{task_id}} ({{mode}})",
           "range": true,
           "refId": "A"
         },
         {
           "editorMode": "code",
-          "expr": "antiatropos_average_latency_norm{mode=\"simulated\"}",
           "legendFormat": "latency {{task_id}} ({{mode}})",
           "range": true,
           "refId": "B"
@@ -535,14 +535,14 @@
       "targets": [
         {
           "editorMode": "code",
-          "expr": "sum by (task_id, mode) (rate(antiatropos_steps_total{mode=\"simulated\"}[1m]))",
           "legendFormat": "steps/sec {{task_id}} ({{mode}})",
           "range": true,
           "refId": "A"
         },
         {
           "editorMode": "code",
-          "expr": "sum by (task_id, mode, action_type) (rate(antiatropos_actions_total{mode=\"simulated\"}[1m]))",
           "legendFormat": "actions/sec {{action_type}} ({{task_id}}, {{mode}})",
           "range": true,
           "refId": "B"
@@ -602,14 +602,14 @@
       "targets": [
         {
           "editorMode": "code",
-          "expr": "sum by (mode, error_code) (rate(antiatropos_executor_errors_total{mode=\"simulated\"}[5m]))",
           "legendFormat": "executor errors {{error_code}} ({{mode}})",
           "range": true,
           "refId": "A"
         },
         {
           "editorMode": "code",
-          "expr": "histogram_quantile(0.95, sum(rate(antiatropos_executor_latency_ms_bucket{mode=\"simulated\"}[5m])) by (le, mode))",
           "legendFormat": "p95 executor latency {{mode}}",
           "range": true,
           "refId": "B"
@@ -640,3 +640,8 @@
   "version": 2,
   "weekStart": ""
 }

       "targets": [
         {
           "editorMode": "code",
+          "expr": "scalar(avg(last_over_time(antiatropos_reward{mode=~\"live|simulated|hybrid|aws\"}[1m])))",
+          "legendFormat": "reward (all modes)",
           "range": true,
           "refId": "A"
         }
       "targets": [
         {
           "editorMode": "code",
+          "expr": "scalar(avg(last_over_time(antiatropos_total_queue_backlog{mode=~\"live|simulated|hybrid|aws\"}[1m])))",
+          "legendFormat": "queue backlog (all modes)",
           "range": true,
           "refId": "A"
         }
       "targets": [
         {
           "editorMode": "code",
+          "expr": "scalar(avg(last_over_time(antiatropos_average_latency_norm{mode=~\"live|simulated|hybrid|aws\"}[1m])))",
+          "legendFormat": "latency (all modes)",
           "range": true,
           "refId": "A"
         }
       "targets": [
         {
           "editorMode": "code",
+          "expr": "scalar(avg(last_over_time(antiatropos_lyapunov_energy{mode=~\"live|simulated|hybrid|aws\"}[1m])))",
+          "legendFormat": "lyapunov energy (all modes)",
           "range": true,
           "refId": "A"
         }
       "targets": [
         {
           "editorMode": "code",
+          "expr": "antiatropos_reward{mode=~\"live|simulated|hybrid|aws\"}",
           "legendFormat": "reward {{task_id}} ({{mode}})",
           "range": true,
           "refId": "A"
         },
         {
           "editorMode": "code",
+          "expr": "antiatropos_lyapunov_energy{mode=~\"live|simulated|hybrid|aws\"}",
           "legendFormat": "lyapunov {{task_id}} ({{mode}})",
           "range": true,
           "refId": "B"
       "targets": [
         {
           "editorMode": "code",
+          "expr": "antiatropos_total_queue_backlog{mode=~\"live|simulated|hybrid|aws\"}",
           "legendFormat": "queue {{task_id}} ({{mode}})",
           "range": true,
           "refId": "A"
         },
         {
           "editorMode": "code",
+          "expr": "antiatropos_average_latency_norm{mode=~\"live|simulated|hybrid|aws\"}",
           "legendFormat": "latency {{task_id}} ({{mode}})",
           "range": true,
           "refId": "B"
       "targets": [
         {
           "editorMode": "code",
+          "expr": "sum by (task_id, mode) (rate(antiatropos_steps_total{mode=~\"live|simulated|hybrid|aws\"}[1m]))",
           "legendFormat": "steps/sec {{task_id}} ({{mode}})",
           "range": true,
           "refId": "A"
         },
         {
           "editorMode": "code",
+          "expr": "sum by (task_id, mode, action_type) (rate(antiatropos_actions_total{mode=~\"live|simulated|hybrid|aws\"}[1m]))",
           "legendFormat": "actions/sec {{action_type}} ({{task_id}}, {{mode}})",
           "range": true,
           "refId": "B"
       "targets": [
         {
           "editorMode": "code",
+          "expr": "sum by (mode, error_code) (rate(antiatropos_executor_errors_total{mode=~\"live|simulated|hybrid|aws\"}[5m]))",
           "legendFormat": "executor errors {{error_code}} ({{mode}})",
           "range": true,
           "refId": "A"
         },
         {
           "editorMode": "code",
+          "expr": "histogram_quantile(0.95, sum(rate(antiatropos_executor_latency_ms_bucket{mode=~\"live|simulated|hybrid|aws\"}[5m])) by (le, mode))",
           "legendFormat": "p95 executor latency {{mode}}",
           "range": true,
           "refId": "B"
   "version": 2,
   "weekStart": ""
 }
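
Note on the p95 executor latency target above: `histogram_quantile` estimates a quantile from cumulative `le` buckets by linear interpolation inside the bucket that contains the target rank. The sketch below is a simplified Python re-implementation for illustration only, not Prometheus's own code. The bucket bounds and counts are made-up values, and the real query runs on per-second rates rather than raw counts.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate the q-quantile from cumulative histogram buckets.

    `buckets` is a list of (upper_bound, cumulative_count) pairs sorted by
    upper bound, ending with the +Inf bucket. Hypothetical helper that
    follows the classic Prometheus interpolation rule in simplified form.
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    # Find the first bucket whose cumulative count reaches the rank.
    for i, (upper, cumulative) in enumerate(buckets):
        if cumulative >= rank:
            break
    # A rank that lands in the +Inf bucket is reported as the highest
    # finite bound, because there is nothing to interpolate against.
    if math.isinf(upper):
        return buckets[-2][0]
    lower = buckets[i - 1][0] if i > 0 else 0.0
    below = buckets[i - 1][1] if i > 0 else 0.0
    # Assume observations are spread evenly inside the bucket.
    return lower + (upper - lower) * (rank - below) / (cumulative - below)


# Made-up bucket bounds in milliseconds, with cumulative counts.
example = [(50.0, 60.0), (100.0, 90.0), (250.0, 98.0), (math.inf, 100.0)]
print(f"p95 = {histogram_quantile(0.95, example):.2f} ms")
```

Because of the interpolation, the reported p95 is only as precise as the bucket layout: every value between 100 ms and 250 ms in this example is assumed to be evenly distributed.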

deploy/grafana/provisioning/dashboards/json/antiatropos-workloads.json ADDED

@@ -0,0 +1,436 @@

+{
+  "annotations": {
+    "list": [
+      {
+        "builtIn": 1,
+        "datasource": {"type": "grafana", "uid": "-- Grafana --"},
+        "enable": true,
+        "hide": true,
+        "iconColor": "rgba(0, 211, 255, 1)",
+        "name": "Annotations & Alerts",
+        "type": "dashboard"
+      }
+    ]
+  },
+  "editable": true,
+  "fiscalYearStartMonth": 0,
+  "graphTooltip": 1,
+  "id": null,
+  "links": [],
+  "liveNow": false,
+  "panels": [
+    {
+      "datasource": {"type": "prometheus", "uid": "PBFA97CFB590B2093"},
+      "fieldConfig": {
+        "defaults": {
+          "color": {"mode": "thresholds"},
+          "decimals": 1,
+          "mappings": [],
+          "thresholds": {"mode": "absolute", "steps": [{"color": "green", "value": null}, {"color": "yellow", "value": 100}, {"color": "red", "value": 500}]},
+          "unit": "reqps"
+        },
+        "overrides": []
+      },
+      "gridPos": {"h": 4, "w": 6, "x": 0, "y": 0},
+      "id": 1,
+      "options": {
+        "colorMode": "value",
+        "graphMode": "area",
+        "justifyMode": "auto",
+        "orientation": "auto",
+        "reduceOptions": {"calcs": ["lastNotNull"], "fields": "", "values": false},
+        "textMode": "auto"
+      },
+      "targets": [{"expr": "sum(rate(http_requests_total[1m]))", "refId": "A"}],
+      "title": "Total Request Rate",
+      "type": "stat"
+    },
+    {
+      "datasource": {"type": "prometheus", "uid": "PBFA97CFB590B2093"},
+      "fieldConfig": {
+        "defaults": {
+          "color": {"mode": "thresholds"},
+          "decimals": 3,
+          "mappings": [],
+          "thresholds": {"mode": "absolute", "steps": [{"color": "green", "value": null}, {"color": "yellow", "value": 0.01}, {"color": "red", "value": 0.05}]},
+          "unit": "percentunit"
+        },
+        "overrides": []
+      },
+      "gridPos": {"h": 4, "w": 6, "x": 6, "y": 0},
+      "id": 2,
+      "options": {
+        "colorMode": "value",
+        "graphMode": "area",
+        "justifyMode": "auto",
+        "orientation": "auto",
+        "reduceOptions": {"calcs": ["lastNotNull"], "fields": "", "values": false},
+        "textMode": "auto"
+      },
+      "targets": [{"expr": "sum(rate(http_requests_total{status=~\"5..\"}[1m])) / clamp_min(sum(rate(http_requests_total[1m])), 1)", "refId": "A"}],
+      "title": "Global Error Rate",
+      "type": "stat"
+    },
+    {
+      "datasource": {"type": "prometheus", "uid": "PBFA97CFB590B2093"},
+      "fieldConfig": {
+        "defaults": {
+          "color": {"mode": "thresholds"},
+          "decimals": 1,
+          "mappings": [],
+          "thresholds": {"mode": "absolute", "steps": [{"color": "green", "value": null}, {"color": "orange", "value": 50}, {"color": "red", "value": 100}]},
+          "unit": "none"
+        },
+        "overrides": []
+      },
+      "gridPos": {"h": 4, "w": 6, "x": 12, "y": 0},
+      "id": 3,
+      "options": {
+        "colorMode": "value",
+        "graphMode": "area",
+        "justifyMode": "auto",
+        "orientation": "auto",
+        "reduceOptions": {"calcs": ["lastNotNull"], "fields": "", "values": false},
+        "textMode": "auto"
+      },
+      "targets": [{"expr": "sum(queue_depth)", "refId": "A"}],
+      "title": "Total Queue Backlog",
+      "type": "stat"
+    },
+    {
+      "datasource": {"type": "prometheus", "uid": "PBFA97CFB590B2093"},
+      "fieldConfig": {
+        "defaults": {
+          "color": {"mode": "thresholds"},
+          "decimals": 1,
+          "mappings": [],
+          "thresholds": {"mode": "absolute", "steps": [{"color": "green", "value": null}, {"color": "orange", "value": 100}, {"color": "red", "value": 200}]},
+          "unit": "ms"
+        },
+        "overrides": []
+      },
+      "gridPos": {"h": 4, "w": 6, "x": 18, "y": 0},
+      "id": 4,
+      "options": {
+        "colorMode": "value",
+        "graphMode": "area",
+        "justifyMode": "auto",
+        "orientation": "auto",
+        "reduceOptions": {"calcs": ["lastNotNull"], "fields": "", "values": false},
+        "textMode": "auto"
+      },
+      "targets": [{"expr": "histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[1m])) by (le)) * 1000", "refId": "A"}],
+      "title": "Cluster p95 Latency",
+      "type": "stat"
+    },
+    {
+      "datasource": {"type": "prometheus", "uid": "PBFA97CFB590B2093"},
+      "fieldConfig": {
+        "defaults": {
+          "color": {"mode": "palette-classic"},
+          "custom": {
+            "axisBorderShow": false,
+            "axisCenteredZero": false,
+            "axisColorMode": "text",
+            "axisLabel": "",
+            "axisPlacement": "auto",
+            "barAlignment": 0,
+            "drawStyle": "line",
+            "fillOpacity": 10,
+            "gradientMode": "none",
+            "hideFrom": {"legend": false, "tooltip": false, "viz": false},
+            "insertNulls": false,
+            "lineInterpolation": "linear",
+            "lineWidth": 2,
+            "pointSize": 3,
+            "scaleDistribution": {"type": "linear"},
+            "showPoints": "auto",
+            "spanNulls": false,
+            "stacking": {"group": "A", "mode": "none"},
+            "thresholdsStyle": {"mode": "off"}
+          },
+          "mappings": [],
+          "thresholds": {"mode": "absolute", "steps": [{"color": "green", "value": null}]},
+          "unit": "reqps"
+        },
+        "overrides": []
+      },
+      "gridPos": {"h": 8, "w": 12, "x": 0, "y": 4},
+      "id": 10,
+      "options": {
+        "legend": {"calcs": [], "displayMode": "list", "placement": "bottom", "showLegend": true},
+        "tooltip": {"mode": "multi", "sort": "none"}
+      },
+      "targets": [
+        {
+          "expr": "sum(rate(http_requests_total[1m])) by (node_id)",
+          "legendFormat": "{{node_id}}",
+          "refId": "A"
+        }
+      ],
+      "title": "Request Rate by Node",
+      "type": "timeseries"
+    },
+    {
+      "datasource": {"type": "prometheus", "uid": "PBFA97CFB590B2093"},
+      "fieldConfig": {
+        "defaults": {
+          "color": {"mode": "palette-classic"},
+          "custom": {
+            "axisBorderShow": false,
+            "axisCenteredZero": false,
+            "axisColorMode": "text",
+            "axisLabel": "",
+            "axisPlacement": "auto",
+            "barAlignment": 0,
+            "drawStyle": "line",
+            "fillOpacity": 10,
+            "gradientMode": "none",
+            "hideFrom": {"legend": false, "tooltip": false, "viz": false},
+            "insertNulls": false,
+            "lineInterpolation": "linear",
+            "lineWidth": 2,
+            "pointSize": 3,
+            "scaleDistribution": {"type": "linear"},
+            "showPoints": "auto",
+            "spanNulls": false,
+            "stacking": {"group": "A", "mode": "none"},
+            "thresholdsStyle": {"mode": "off"}
+          },
+          "mappings": [],
+          "min": 0,
+          "thresholds": {"mode": "absolute", "steps": [{"color": "green", "value": null}]},
+          "unit": "percentunit"
+        },
+        "overrides": []
+      },
+      "gridPos": {"h": 8, "w": 12, "x": 12, "y": 4},
+      "id": 11,
+      "options": {
+        "legend": {"calcs": [], "displayMode": "list", "placement": "bottom", "showLegend": true},
+        "tooltip": {"mode": "multi", "sort": "none"}
+      },
+      "targets": [
+        {
+          "expr": "sum(rate(http_requests_total{status=~\"5..\"}[1m])) by (node_id) / clamp_min(sum(rate(http_requests_total[1m])) by (node_id), 1)",
+          "legendFormat": "{{node_id}}",
+          "refId": "A"
+        }
+      ],
+      "title": "Error Rate by Node",
+      "type": "timeseries"
+    },
+    {
+      "datasource": {"type": "prometheus", "uid": "PBFA97CFB590B2093"},
+      "fieldConfig": {
+        "defaults": {
+          "color": {"mode": "palette-classic"},
+          "custom": {
+            "axisBorderShow": false,
+            "axisCenteredZero": false,
+            "axisColorMode": "text",
+            "axisLabel": "",
+            "axisPlacement": "auto",
+            "barAlignment": 0,
+            "drawStyle": "line",
+            "fillOpacity": 10,
+            "gradientMode": "none",
+            "hideFrom": {"legend": false, "tooltip": false, "viz": false},
+            "insertNulls": false,
+            "lineInterpolation": "linear",
+            "lineWidth": 2,
+            "pointSize": 3,
+            "scaleDistribution": {"type": "linear"},
+            "showPoints": "auto",
+            "spanNulls": false,
+            "stacking": {"group": "A", "mode": "none"},
+            "thresholdsStyle": {"mode": "off"}
+          },
+          "mappings": [],
+          "min": 0,
+          "thresholds": {"mode": "absolute", "steps": [{"color": "green", "value": null}]},
+          "unit": "none"
+        },
+        "overrides": []
+      },
+      "gridPos": {"h": 8, "w": 12, "x": 0, "y": 12},
+      "id": 12,
+      "options": {
+        "legend": {"calcs": [], "displayMode": "list", "placement": "bottom", "showLegend": true},
+        "tooltip": {"mode": "multi", "sort": "none"}
+      },
+      "targets": [
+        {
+          "expr": "avg(queue_depth) by (node_id)",
+          "legendFormat": "{{node_id}}",
+          "refId": "A"
+        }
+      ],
+      "title": "Queue Depth by Node",
+      "type": "timeseries"
+    },
+    {
+      "datasource": {"type": "prometheus", "uid": "PBFA97CFB590B2093"},
+      "fieldConfig": {
+        "defaults": {
+          "color": {"mode": "palette-classic"},
+          "custom": {
+            "axisBorderShow": false,
+            "axisCenteredZero": false,
+            "axisColorMode": "text",
+            "axisLabel": "",
+            "axisPlacement": "auto",
+            "barAlignment": 0,
+            "drawStyle": "line",
+            "fillOpacity": 10,
+            "gradientMode": "none",
+            "hideFrom": {"legend": false, "tooltip": false, "viz": false},
+            "insertNulls": false,
+            "lineInterpolation": "linear",
+            "lineWidth": 2,
+            "pointSize": 3,
+            "scaleDistribution": {"type": "linear"},
+            "showPoints": "auto",
+            "spanNulls": false,
+            "stacking": {"group": "A", "mode": "none"},
+            "thresholdsStyle": {"mode": "off"}
+          },
+          "mappings": [],
+          "min": 0,
+          "thresholds": {"mode": "absolute", "steps": [{"color": "green", "value": null}]},
+          "unit": "ms"
+        },
+        "overrides": []
+      },
+      "gridPos": {"h": 8, "w": 12, "x": 12, "y": 12},
+      "id": 13,
+      "options": {
+        "legend": {"calcs": [], "displayMode": "list", "placement": "bottom", "showLegend": true},
+        "tooltip": {"mode": "multi", "sort": "none"}
+      },
+      "targets": [
+        {
+          "expr": "histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (node_id, le)) * 1000",
+          "legendFormat": "{{node_id}}",
+          "refId": "A"
+        }
+      ],
+      "title": "Latency p95 by Node",
+      "type": "timeseries"
+    },
+    {
+      "datasource": {"type": "prometheus", "uid": "PBFA97CFB590B2093"},
+      "fieldConfig": {
+        "defaults": {
+          "color": {"mode": "palette-classic"},
+          "custom": {
+            "axisBorderShow": false,
+            "axisCenteredZero": false,
+            "axisColorMode": "text",
+            "axisLabel": "",
+            "axisPlacement": "auto",
+            "barAlignment": 0,
+            "drawStyle": "line",
+            "fillOpacity": 10,
+            "gradientMode": "none",
+            "hideFrom": {"legend": false, "tooltip": false, "viz": false},
+            "insertNulls": false,
+            "lineInterpolation": "linear",
+            "lineWidth": 2,
+            "pointSize": 3,
+            "scaleDistribution": {"type": "linear"},
+            "showPoints": "auto",
+            "spanNulls": false,
+            "stacking": {"group": "A", "mode": "none"},
+            "thresholdsStyle": {"mode": "off"}
+          },
+          "mappings": [],
+          "min": 0,
+          "thresholds": {"mode": "absolute", "steps": [{"color": "green", "value": null}]},
+          "unit": "percentunit"
+        },
+        "overrides": []
+      },
+      "gridPos": {"h": 8, "w": 12, "x": 0, "y": 20},
+      "id": 14,
+      "options": {
+        "legend": {"calcs": [], "displayMode": "list", "placement": "bottom", "showLegend": true},
+        "tooltip": {"mode": "multi", "sort": "none"}
+      },
+      "targets": [
+        {
+          "expr": "avg(rate(container_cpu_usage_seconds_total[1m])) by (node_id)",
+          "legendFormat": "{{node_id}}",
+          "refId": "A"
+        }
+      ],
+      "title": "CPU by Node",
+      "type": "timeseries"
+    },
+    {
+      "datasource": {"type": "prometheus", "uid": "PBFA97CFB590B2093"},
+      "fieldConfig": {
+        "defaults": {
+          "color": {"mode": "palette-classic"},
+          "custom": {
+            "axisBorderShow": false,
+            "axisCenteredZero": false,
+            "axisColorMode": "text",
+            "axisLabel": "",
+            "axisPlacement": "auto",
+            "barAlignment": 0,
+            "drawStyle": "line",
+            "fillOpacity": 10,
+            "gradientMode": "none",
+            "hideFrom": {"legend": false, "tooltip": false, "viz": false},
+            "insertNulls": false,
+            "lineInterpolation": "linear",
+            "lineWidth": 2,
+            "pointSize": 3,
+            "scaleDistribution": {"type": "linear"},
+            "showPoints": "auto",
+            "spanNulls": false,
+            "stacking": {"group": "A", "mode": "none"},
+            "thresholdsStyle": {"mode": "off"}
+          },
+          "mappings": [],
+          "thresholds": {"mode": "absolute", "steps": [{"color": "green", "value": null}]},
+          "unit": "reqps"
+        },
+        "overrides": []
+      },
+      "gridPos": {"h": 8, "w": 12, "x": 12, "y": 20},
+      "id": 15,
+      "options": {
+        "legend": {"calcs": [], "displayMode": "list", "placement": "bottom", "showLegend": true},
+        "tooltip": {"mode": "multi", "sort": "none"}
+      },
+      "targets": [
+        {
+          "expr": "sum(rate(http_requests_total{status=\"200\"}[1m])) by (node_id)",
+          "legendFormat": "200 {{node_id}}",
+          "refId": "A"
+        },
+        {
+          "expr": "sum(rate(http_requests_total{status=\"500\"}[1m])) by (node_id)",
+          "legendFormat": "500 {{node_id}}",
+          "refId": "B"
+        }
+      ],
+      "title": "Requests by Status Code",
+      "type": "timeseries"
+    }
+  ],
+  "refresh": "5s",
+  "schemaVersion": 41,
+  "style": "dark",
+  "tags": ["antiatropos", "sre", "workload"],
+  "templating": {"list": []},
+  "time": {"from": "now-15m", "to": "now"},
+  "timepicker": {},
+  "timezone": "browser",
+  "title": "AntiAtropos Workloads",
+  "uid": "antiatropos-workloads",
+  "version": 1,
+  "weekStart": ""
+}
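
Note on the "Global Error Rate" and "Error Rate by Node" targets in this file: both divide the 5xx rate by `clamp_min(<total rate>, 1)`. The clamp avoids division by zero, but it also means the denominator never drops below 1 request per second, so the displayed ratio is lower than the true ratio whenever traffic is below that floor. The Python sketch below mirrors the arithmetic only. The function names and sample rates are illustrative assumptions, not values taken from the project.

```python
def displayed_error_rate(error_rps, total_rps, floor=1.0):
    """Mirror of: sum(rate(5xx)) / clamp_min(sum(rate(total)), floor)."""
    return error_rps / max(total_rps, floor)


def true_error_rate(error_rps, total_rps):
    """Plain ratio of failed to total requests; 0.0 when there is no traffic."""
    return error_rps / total_rps if total_rps > 0 else 0.0


# (total requests/sec, 5xx requests/sec): one busy sample, one quiet sample.
for total_rps, error_rps in [(200.0, 4.0), (0.2, 0.1)]:
    shown = displayed_error_rate(error_rps, total_rps)
    actual = true_error_rate(error_rps, total_rps)
    print(f"total={total_rps} rps  displayed={shown:.3f}  true={actual:.3f}")
```

Above the floor the two values agree. Below it, the quiet sample shows 10% on the panel while half of its requests are failing, which may matter for low-traffic nodes.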

deploy/grafana/provisioning/datasources/prometheus.yaml CHANGED

@@ -5,6 +5,6 @@ datasources:
     uid: PBFA97CFB590B2093
     type: prometheus
     access: proxy
-    url: http://127.0.0.1:9090/prometheus
+    url: http://127.0.0.1:9090
     isDefault: true
-    editable: false
+    editable: true
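
Note on the datasource URL change: Grafana appends the Prometheus HTTP API path (for example `/api/v1/query`) to whatever base URL the datasource declares, so dropping the `/prometheus` suffix changes every request path. The sketch below shows the resulting URLs for both variants. It assumes, without confirmation from this diff, that Prometheus is now served at the root path rather than under a `/prometheus` route prefix. The helper name is hypothetical.

```python
from urllib.parse import urlencode


def instant_query_url(base_url, promql):
    """Build the instant-query URL for a given datasource base URL.

    `/api/v1/query` is the standard Prometheus HTTP API endpoint; the
    base URL is whatever the datasource YAML declares.
    """
    return f"{base_url.rstrip('/')}/api/v1/query?{urlencode({'query': promql})}"


old_url = instant_query_url("http://127.0.0.1:9090/prometheus", "up")
new_url = instant_query_url("http://127.0.0.1:9090", "up")
print(old_url)
print(new_url)
```

If Prometheus were still started with a `/prometheus` route prefix, requests to the new URL would miss the API, so the datasource URL and the Prometheus startup flags need to agree.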

deploy/index.html CHANGED

@@ -1,473 +1,473 @@
-<!DOCTYPE html>
-<html lang="en">
-<head>
-    <meta charset="UTF-8">
-    <meta name="viewport" content="width=device-width, initial-scale=1.0">
-    <title>AntiAtropos Control Console</title>
-    <style>
-        :root {
-            --bg: #0b1220;
-            --bg-soft: #101a2d;
-            --panel: #111d33;
-            --line: #2b3d5d;
-            --text: #e6edf8;
-            --muted: #9bb0cf;
-            --accent: #ff5a3d;
-            --accent-strong: #e14830;
-            --ok: #3dcf8e;
-            --bad: #ff6f7f;
-        }
-        * {
-            box-sizing: border-box;
-        }
-        body {
-            margin: 0;
-            padding: 24px;
-            background:
-                radial-gradient(circle at top right, rgba(255, 90, 61, 0.18), transparent 40%),
-                radial-gradient(circle at top left, rgba(74, 140, 255, 0.18), transparent 35%),
-                var(--bg);
-            color: var(--text);
-            font-family: "Segoe UI", "Helvetica Neue", Arial, sans-serif;
-        }
-        .shell {
-            max-width: 1440px;
-            margin: 0 auto;
-            display: grid;
-            gap: 18px;
-        }
-        .card {
-            background: linear-gradient(180deg, rgba(17, 29, 51, 0.88), rgba(15, 25, 44, 0.92));
-            border: 1px solid var(--line);
-            border-radius: 16px;
-        }
-        .header {
-            padding: 20px 22px;
-            display: flex;
-            justify-content: space-between;
-            align-items: center;
-            gap: 16px;
-            flex-wrap: wrap;
-        }
-        .title h1 {
-            margin: 0;
-            font-size: 1.5rem;
-            letter-spacing: 0.01em;
-        }
-        .title p {
-            margin: 4px 0 0;
-            color: var(--muted);
-            font-size: 0.95rem;
-        }
-        .links {
-            display: flex;
-            gap: 10px;
-            flex-wrap: wrap;
-        }
-        .link-btn {
-            display: inline-flex;
-            align-items: center;
-            justify-content: center;
-            height: 38px;
-            padding: 0 14px;
-            border-radius: 10px;
-            border: 1px solid var(--line);
-            color: var(--text);
-            text-decoration: none;
-            background: var(--bg-soft);
-            font-size: 0.9rem;
-        }
-        .layout {
-            display: grid;
-            grid-template-columns: 1fr;
-            gap: 18px;
-        }
-        .controls {
-            padding: 16px;
-            display: grid;
-            grid-template-columns: 1fr;
-            gap: 14px;
-        }
-        .controls-grid {
-            display: grid;
-            grid-template-columns: repeat(4, minmax(0, 1fr));
-            gap: 12px;
-            align-items: end;
-        }
-        .field label {
-            display: block;
-            color: var(--muted);
-            font-size: 0.78rem;
-            font-weight: 600;
-            letter-spacing: 0.04em;
-            margin-bottom: 6px;
-            text-transform: uppercase;
-        }
-        .field select,
-        .field input {
-            width: 100%;
-            height: 44px;
-            border-radius: 10px;
-            border: 1px solid var(--line);
-            background: #0c162a;
-            color: var(--text);
-            padding: 0 12px;
-            font-size: 0.95rem;
-        }
-        .actions {
-            display: grid;
-            grid-template-columns: 180px 1fr;
-            gap: 10px;
-        }
-        .btn {
-            border: 1px solid var(--line);
-            border-radius: 10px;
-            height: 44px;
-            cursor: pointer;
-            font-weight: 600;
-            font-size: 0.95rem;
-            color: var(--text);
-            background: var(--bg-soft);
-        }
-        .btn-primary {
-            background: linear-gradient(135deg, var(--accent), var(--accent-strong));
-            border-color: transparent;
-            color: #fff;
-        }
-        .metrics {
-            padding: 16px;
-            display: grid;
-            grid-template-columns: repeat(5, minmax(0, 1fr));
-            gap: 10px;
-        }
-        .metric {
-            background: #0d172a;
-            border: 1px solid var(--line);
-            border-radius: 12px;
-            padding: 12px;
-            min-height: 86px;
-        }
-        .metric .name {
-            color: var(--muted);
-            font-size: 0.78rem;
-            text-transform: uppercase;
-            letter-spacing: 0.05em;
-            margin-bottom: 8px;
-        }
-        .metric .value {
-            font-family: Consolas, "SFMono-Regular", Menlo, monospace;
-            font-size: 1.18rem;
-            font-weight: 700;
-            color: var(--text);
-        }
-        .metric .value.good {
-            color: var(--ok);
-        }
-        .metric .value.bad {
-            color: var(--bad);
-        }
-        .monitor {
-            padding: 16px;
-            display: grid;
-            gap: 10px;
-        }
-        .monitor-head {
-            display: flex;
-            justify-content: space-between;
-            align-items: center;
-            gap: 12px;
-            flex-wrap: wrap;
-        }
-        .monitor-head h2 {
-            margin: 0;
-            font-size: 1.05rem;
-            font-weight: 700;
-        }
-        .monitor-head p {
-            margin: 0;
-            color: var(--muted);
-            font-size: 0.85rem;
-        }
-        .graph-wrap {
-            height: 920px;
-            border: 1px solid var(--line);
-            border-radius: 12px;
-            overflow: hidden;
-            background: #0a1324;
-        }
-        iframe {
-            width: 100%;
-            height: 100%;
-            border: 0;
-        }
-        .logs {
-            padding: 16px;
-        }
-        .logs h3 {
-            margin: 0 0 10px;
-            font-size: 0.9rem;
-            color: var(--muted);
-            text-transform: uppercase;
-            letter-spacing: 0.05em;
-        }
-        #terminal {
-            background: #091121;
-            border: 1px solid var(--line);
-            border-radius: 10px;
-            height: 160px;
-            overflow-y: auto;
-            padding: 10px;
-            font-family: Consolas, "SFMono-Regular", Menlo, monospace;
-            font-size: 0.83rem;
-            color: #c9d6ed;
-        }
-        .log-line {
-            padding: 2px 0;
-            border-bottom: 1px solid rgba(155, 176, 207, 0.08);
-        }
-        .log-time {
-            color: #7084a8;
-            margin-right: 8px;
-            font-size: 0.72rem;
-        }
-        @media (max-width: 1120px) {
-            .controls-grid {
-                grid-template-columns: 1fr 1fr;
-            }
-            .actions {
-                grid-template-columns: 1fr;
-            }
-            .metrics {
-                grid-template-columns: 1fr 1fr;
-            }
-        }
-        @media (max-width: 680px) {
-            body {
-                padding: 12px;
-            }
-            .controls-grid,
-            .metrics {
-                grid-template-columns: 1fr;
-            }
-            .graph-wrap {
-                height: 760px;
-            }
-        }
-    </style>
-</head>
-<body>
-    <div class="shell">
-        <header class="card header">
-            <div class="title">
-                <h1>AntiAtropos SRE Control Console</h1>
-                <p>Simulated environment with direct observability through Prometheus and Grafana</p>
-            </div>
-            <div class="links">
-                <a class="link-btn" href="/docs" target="_blank">API Docs</a>
-                <a class="link-btn" href="/prometheus/" target="_blank">Open Prometheus</a>
-                <a class="link-btn" href="/grafana/" target="_blank">Open Grafana</a>
-            </div>
-        </header>
-        <main class="layout">
-            <section class="card controls">
-                <div class="controls-grid">
-                    <div class="field">
-                        <label for="action-type">Action Type</label>
-                        <select id="action-type">
-                            <option value="NO_OP">NO_OP</option>
-                            <option value="SCALE_UP">SCALE_UP</option>
-                            <option value="SCALE_DOWN">SCALE_DOWN</option>
-                            <option value="REROUTE_TRAFFIC">REROUTE_TRAFFIC</option>
-                            <option value="SHED_LOAD">SHED_LOAD</option>
-                        </select>
-                    </div>
-                    <div class="field">
-                        <label for="node-id">Target Node</label>
-                        <select id="node-id">
-                            <option value="node-0">node-0 (VIP)</option>
-                            <option value="node-1">node-1</option>
-                            <option value="node-2">node-2</option>
-                            <option value="node-3">node-3</option>
-                            <option value="node-4">node-4</option>
-                        </select>
-                    </div>
-                    <div class="field">
-                        <label for="parameter">Parameter</label>
-                        <input id="parameter" type="number" step="0.1" value="0.0">
-                    </div>
-                    <div class="actions">
-                        <button class="btn btn-primary" onclick="resetEnv()">Reset Episode</button>
-                        <button class="btn" onclick="stepEnv()">Execute Step</button>
-                    </div>
-                </div>
-            </section>
-            <section class="card metrics">
-                <div class="metric">
-                    <div class="name">Cluster ID</div>
-                    <div id="cluster-id" class="value">---</div>
-                </div>
-                <div class="metric">
-                    <div class="name">Reward</div>
-                    <div id="last-reward" class="value">0.0000</div>
-                </div>
-                <div class="metric">
-                    <div class="name">Lyapunov Energy</div>
-                    <div id="lyapunov-val" class="value">0.0000</div>
-                </div>
-                <div class="metric">
-                    <div class="name">Mode</div>
-                    <div id="mode-val" class="value">simulated</div>
-                </div>
-                <div class="metric">
-                    <div class="name">Step</div>
-                    <div id="step-val" class="value">0</div>
-                </div>
-            </section>
-            <section class="card monitor">
-                <div class="monitor-head">
-                    <h2>Required Graphs</h2>
-                    <p>Raw metrics source: Prometheus. Curated dashboard: Grafana.</p>
-                </div>
-                <div class="graph-wrap">
-                    <iframe
-                        id="grafana-iframe"
-                        src="/grafana/d/antiatropos-overview/antiatropos-overview?kiosk&theme=dark&refresh=5s&from=now-30m&to=now">
-                    </iframe>
-                </div>
-            </section>
-            <section class="card logs">
-                <h3>System Logs</h3>
-                <div id="terminal">
-                    <div class="log-line"><span class="log-time">[init]</span>Waiting for interaction.</div>
-                </div>
-            </section>
-        </main>
-    </div>
-    <script>
-        const terminal = document.getElementById("terminal");
-        function log(message, type = "info") {
-            const time = new Date().toLocaleTimeString([], {
-                hour12: false,
-                hour: "2-digit",
-                minute: "2-digit",
-                second: "2-digit"
-            });
-            const row = document.createElement("div");
-            row.className = "log-line";
-            const color = type === "error" ? "#ff6f7f" : type === "success" ? "#3dcf8e" : "#c9d6ed";
-            row.innerHTML = '<span class="log-time">[' + time + "]</span><span style=\"color:" + color + "\">" + message + "</span>";
-            terminal.appendChild(row);
-            terminal.scrollTop = terminal.scrollHeight;
-        }
-        function updateUI(data) {
-            const observation = data.observation || {};
-            const rewardNode = document.getElementById("last-reward");
-            const reward = typeof data.reward === "number" ? data.reward : 0;
-            document.getElementById("cluster-id").innerText = (observation.cluster_id || "---").toString().slice(0, 12);
-            document.getElementById("lyapunov-val").innerText = Number(observation.lyapunov_energy || 0).toFixed(4);
-            document.getElementById("mode-val").innerText = (observation.mode || "simulated").toString();
-            document.getElementById("step-val").innerText = String(observation.step || 0);
-            rewardNode.innerText = reward.toFixed(4);
-            rewardNode.className = reward < 0 ? "value bad" : "value good";
-        }
-        async function resetEnv() {
-            log("Resetting environment...");
-            try {
-                const response = await fetch("/reset", {
-                    method: "POST",
-                    headers: { "Content-Type": "application/json" },
-                    body: JSON.stringify({})
-                });
-                const data = await response.json();
-                updateUI(data);
-                log("Environment reset complete.", "success");
-            } catch (err) {
-                log("Reset failed: " + err.message, "error");
-            }
-        }
-        async function stepEnv() {
-            const action = {
-                action_type: document.getElementById("action-type").value,
-                target_node_id: document.getElementById("node-id").value,
-                parameter: parseFloat(document.getElementById("parameter").value)
-            };
-            log("Dispatching " + action.action_type + " to " + action.target_node_id + " (" + action.parameter + ")");
-            try {
-                const response = await fetch("/step", {
-                    method: "POST",
-                    headers: { "Content-Type": "application/json" },
-                    body: JSON.stringify({ action: action })
-                });
-                const data = await response.json();
-                if (data.detail) {
-                    log("Invalid payload: " + JSON.stringify(data.detail), "error");
-                    return;
-                }
-                updateUI(data);
-                log(
-                    "Step complete. Reward=" + Number(data.reward || 0).toFixed(3) +
-                    " Lyapunov=" + Number((data.observation || {}).lyapunov_energy || 0).toFixed(3),
-                    "success"
-                );
-            } catch (err) {
-                log("Execution failed: " + err.message, "error");
-            }
-        }
-    </script>
-</body>
-</html>

+<!DOCTYPE html>
+<html lang="en">
+<head>
+    <meta charset="UTF-8">
+    <meta name="viewport" content="width=device-width, initial-scale=1.0">
+    <title>AntiAtropos Control Console</title>
+    <style>
+        :root {
+            --bg: #0b1220;
+            --bg-soft: #101a2d;
+            --panel: #111d33;
+            --line: #2b3d5d;
+            --text: #e6edf8;
+            --muted: #9bb0cf;
+            --accent: #ff5a3d;
+            --accent-strong: #e14830;
+            --ok: #3dcf8e;
+            --bad: #ff6f7f;
+        }
+        * {
+            box-sizing: border-box;
+        }
+        body {
+            margin: 0;
+            padding: 24px;
+            background:
+                radial-gradient(circle at top right, rgba(255, 90, 61, 0.18), transparent 40%),
+                radial-gradient(circle at top left, rgba(74, 140, 255, 0.18), transparent 35%),
+                var(--bg);
+            color: var(--text);
+            font-family: "Segoe UI", "Helvetica Neue", Arial, sans-serif;
+        }
+        .shell {
+            max-width: 1440px;
+            margin: 0 auto;
+            display: grid;
+            gap: 18px;
+        }
+        .card {
+            background: linear-gradient(180deg, rgba(17, 29, 51, 0.88), rgba(15, 25, 44, 0.92));
+            border: 1px solid var(--line);
+            border-radius: 16px;
+        }
+        .header {
+            padding: 20px 22px;
+            display: flex;
+            justify-content: space-between;
+            align-items: center;
+            gap: 16px;
+            flex-wrap: wrap;
+        }
+        .title h1 {
+            margin: 0;
+            font-size: 1.5rem;
+            letter-spacing: 0.01em;
+        }
+        .title p {
+            margin: 4px 0 0;
+            color: var(--muted);
+            font-size: 0.95rem;
+        }
+        .links {
+            display: flex;
+            gap: 10px;
+            flex-wrap: wrap;
+        }
+        .link-btn {
+            display: inline-flex;
+            align-items: center;
+            justify-content: center;
+            height: 38px;
+            padding: 0 14px;
+            border-radius: 10px;
+            border: 1px solid var(--line);
+            color: var(--text);
+            text-decoration: none;
+            background: var(--bg-soft);
+            font-size: 0.9rem;
+        }
+        .layout {
+            display: grid;
+            grid-template-columns: 1fr;
+            gap: 18px;
+        }
+        .controls {
+            padding: 16px;
+            display: grid;
+            grid-template-columns: 1fr;
+            gap: 14px;
+        }
+        .controls-grid {
+            display: grid;
+            grid-template-columns: repeat(4, minmax(0, 1fr));
+            gap: 12px;
+            align-items: end;
+        }
+        .field label {
+            display: block;
+            color: var(--muted);
+            font-size: 0.78rem;
+            font-weight: 600;
+            letter-spacing: 0.04em;
+            margin-bottom: 6px;
+            text-transform: uppercase;
+        }
+        .field select,
+        .field input {
+            width: 100%;
+            height: 44px;
+            border-radius: 10px;
+            border: 1px solid var(--line);
+            background: #0c162a;
+            color: var(--text);
+            padding: 0 12px;
+            font-size: 0.95rem;
+        }
+        .actions {
+            display: grid;
+            grid-template-columns: 180px 1fr;
+            gap: 10px;
+        }
+        .btn {
+            border: 1px solid var(--line);
+            border-radius: 10px;
+            height: 44px;
+            cursor: pointer;
+            font-weight: 600;
+            font-size: 0.95rem;
+            color: var(--text);
+            background: var(--bg-soft);
+        }
+        .btn-primary {
+            background: linear-gradient(135deg, var(--accent), var(--accent-strong));
+            border-color: transparent;
+            color: #fff;
+        }
+        .metrics {
+            padding: 16px;
+            display: grid;
+            grid-template-columns: repeat(5, minmax(0, 1fr));
+            gap: 10px;
+        }
+        .metric {
+            background: #0d172a;
+            border: 1px solid var(--line);
+            border-radius: 12px;
+            padding: 12px;
+            min-height: 86px;
+        }
+        .metric .name {
+            color: var(--muted);
+            font-size: 0.78rem;
+            text-transform: uppercase;
+            letter-spacing: 0.05em;
+            margin-bottom: 8px;
+        }
+        .metric .value {
+            font-family: Consolas, "SFMono-Regular", Menlo, monospace;
+            font-size: 1.18rem;
+            font-weight: 700;
+            color: var(--text);
+        }
+        .metric .value.good {
+            color: var(--ok);
+        }
+        .metric .value.bad {
+            color: var(--bad);
+        }
+        .monitor {
+            padding: 16px;
+            display: grid;
+            gap: 10px;
+        }
+        .monitor-head {
+            display: flex;
+            justify-content: space-between;
+            align-items: center;
+            gap: 12px;
+            flex-wrap: wrap;
+        }
+        .monitor-head h2 {
+            margin: 0;
+            font-size: 1.05rem;
+            font-weight: 700;
+        }
+        .monitor-head p {
+            margin: 0;
+            color: var(--muted);
+            font-size: 0.85rem;
+        }
+        .graph-wrap {
+            height: 920px;
+            border: 1px solid var(--line);
+            border-radius: 12px;
+            overflow: hidden;
+            background: #0a1324;
+        }
+        iframe {
+            width: 100%;
+            height: 100%;
+            border: 0;
+        }
+        .logs {
+            padding: 16px;
+        }
+        .logs h3 {
+            margin: 0 0 10px;
+            font-size: 0.9rem;
+            color: var(--muted);
+            text-transform: uppercase;
+            letter-spacing: 0.05em;
+        }
+        #terminal {
+            background: #091121;
+            border: 1px solid var(--line);
+            border-radius: 10px;
+            height: 160px;
+            overflow-y: auto;
+            padding: 10px;
+            font-family: Consolas, "SFMono-Regular", Menlo, monospace;
+            font-size: 0.83rem;
+            color: #c9d6ed;
+        }
+        .log-line {
+            padding: 2px 0;
+            border-bottom: 1px solid rgba(155, 176, 207, 0.08);
+        }
+        .log-time {
+            color: #7084a8;
+            margin-right: 8px;
+            font-size: 0.72rem;
+        }
+        @media (max-width: 1120px) {
+            .controls-grid {
+                grid-template-columns: 1fr 1fr;
+            }
+            .actions {
+                grid-template-columns: 1fr;
+            }
+            .metrics {
+                grid-template-columns: 1fr 1fr;
+            }
+        }
+        @media (max-width: 680px) {
+            body {
+                padding: 12px;
+            }
+            .controls-grid,
+            .metrics {
+                grid-template-columns: 1fr;
+            }
+            .graph-wrap {
+                height: 760px;
+            }
+        }
+    </style>
+</head>
+<body>
+    <div class="shell">
+        <header class="card header">
+            <div class="title">
+                <h1>AntiAtropos SRE Control Console</h1>
+                <p>Simulated environment with direct observability through Prometheus and Grafana</p>
+            </div>
+            <div class="links">
+                <a class="link-btn" href="/docs" target="_blank">API Docs</a>
+                <a class="link-btn" href="/prometheus/" target="_blank">Open Prometheus</a>
+                <a class="link-btn" href="/grafana/" target="_blank">Open Grafana</a>
+            </div>
+        </header>
+        <main class="layout">
+            <section class="card controls">
+                <div class="controls-grid">
+                    <div class="field">
+                        <label for="action-type">Action Type</label>
+                        <select id="action-type">
+                            <option value="NO_OP">NO_OP</option>
+                            <option value="SCALE_UP">SCALE_UP</option>
+                            <option value="SCALE_DOWN">SCALE_DOWN</option>
+                            <option value="REROUTE_TRAFFIC">REROUTE_TRAFFIC</option>
+                            <option value="SHED_LOAD">SHED_LOAD</option>
+                        </select>
+                    </div>
+                    <div class="field">
+                        <label for="node-id">Target Node</label>
+                        <select id="node-id">
+                            <option value="node-0">node-0 (VIP)</option>
+                            <option value="node-1">node-1</option>
+                            <option value="node-2">node-2</option>
+                            <option value="node-3">node-3</option>
+                            <option value="node-4">node-4</option>
+                        </select>
+                    </div>
+                    <div class="field">
+                        <label for="parameter">Parameter</label>
+                        <input id="parameter" type="number" step="0.1" value="0.0">
+                    </div>
+                    <div class="actions">
+                        <button class="btn btn-primary" onclick="resetEnv()">Reset Episode</button>
+                        <button class="btn" onclick="stepEnv()">Execute Step</button>
+                    </div>
+                </div>
+            </section>
+            <section class="card metrics">
+                <div class="metric">
+                    <div class="name">Cluster ID</div>
+                    <div id="cluster-id" class="value">---</div>
+                </div>
+                <div class="metric">
+                    <div class="name">Reward</div>
+                    <div id="last-reward" class="value">0.0000</div>
+                </div>
+                <div class="metric">
+                    <div class="name">Lyapunov Energy</div>
+                    <div id="lyapunov-val" class="value">0.0000</div>
+                </div>
+                <div class="metric">
+                    <div class="name">Mode</div>
+                    <div id="mode-val" class="value">simulated</div>
+                </div>
+                <div class="metric">
+                    <div class="name">Step</div>
+                    <div id="step-val" class="value">0</div>
+                </div>
+            </section>
+            <section class="card monitor">
+                <div class="monitor-head">
+                    <h2>Required Graphs</h2>
+                    <p>Raw metrics source: Prometheus. Curated dashboard: Grafana.</p>
+                </div>
+                <div class="graph-wrap">
+                    <iframe
+                        id="grafana-iframe"
+                        src="/grafana/d/antiatropos-overview/antiatropos-overview?kiosk&theme=dark&refresh=5s&from=now-30m&to=now">
+                    </iframe>
+                </div>
+            </section>
+            <section class="card logs">
+                <h3>System Logs</h3>
+                <div id="terminal">
+                    <div class="log-line"><span class="log-time">[init]</span>Waiting for interaction.</div>
+                </div>
+            </section>
+        </main>
+    </div>
+    <script>
+        const terminal = document.getElementById("terminal");
+        function log(message, type = "info") {
+            const time = new Date().toLocaleTimeString([], {
+                hour12: false,
+                hour: "2-digit",
+                minute: "2-digit",
+                second: "2-digit"
+            });
+            const row = document.createElement("div");
+            row.className = "log-line";
+            const color = type === "error" ? "#ff6f7f" : type === "success" ? "#3dcf8e" : "#c9d6ed";
+            row.innerHTML = '<span class="log-time">[' + time + "]</span><span style=\"color:" + color + "\">" + message + "</span>";
+            terminal.appendChild(row);
+            terminal.scrollTop = terminal.scrollHeight;
+        }
+        function updateUI(data) {
+            const observation = data.observation || {};
+            const rewardNode = document.getElementById("last-reward");
+            const reward = typeof data.reward === "number" ? data.reward : 0;
+            document.getElementById("cluster-id").innerText = (observation.cluster_id || "---").toString().slice(0, 12);
+            document.getElementById("lyapunov-val").innerText = Number(observation.lyapunov_energy || 0).toFixed(4);
+            document.getElementById("mode-val").innerText = (observation.mode || "simulated").toString();
+            document.getElementById("step-val").innerText = String(observation.step || 0);
+            rewardNode.innerText = reward.toFixed(4);
+            rewardNode.className = reward < 0 ? "value bad" : "value good";
+        }
+        async function resetEnv() {
+            log("Resetting environment...");
+            try {
+                const response = await fetch("/reset", {
+                    method: "POST",
+                    headers: { "Content-Type": "application/json" },
+                    body: JSON.stringify({})
+                });
+                const data = await response.json();
+                updateUI(data);
+                log("Environment reset complete.", "success");
+            } catch (err) {
+                log("Reset failed: " + err.message, "error");
+            }
+        }
+        async function stepEnv() {
+            const action = {
+                action_type: document.getElementById("action-type").value,
+                target_node_id: document.getElementById("node-id").value,
+                parameter: parseFloat(document.getElementById("parameter").value)
+            };
+            log("Dispatching " + action.action_type + " to " + action.target_node_id + " (" + action.parameter + ")");
+            try {
+                const response = await fetch("/step", {
+                    method: "POST",
+                    headers: { "Content-Type": "application/json" },
+                    body: JSON.stringify({ action: action })
+                });
+                const data = await response.json();
+                if (data.detail) {
+                    log("Invalid payload: " + JSON.stringify(data.detail), "error");
+                    return;
+                }
+                updateUI(data);
+                log(
+                    "Step complete. Reward=" + Number(data.reward || 0).toFixed(3) +
+                    " Lyapunov=" + Number((data.observation || {}).lyapunov_energy || 0).toFixed(3),
+                    "success"
+                );
+            } catch (err) {
+                log("Execution failed: " + err.message, "error");
+            }
+        }
+    </script>
+</body>
+</html>
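
The console's `stepEnv()` wraps the three form fields in an `action` object and POSTs it to `/step`, after `/reset` has been called with an empty body. A minimal Python sketch of the same request shape follows. The base URL is an assumption (port 8000 is borrowed from the `antiatropos-fastapi` scrape target elsewhere in this commit), and `build_step_payload`, `post_json` and `reset_then_step` are hypothetical helper names, not part of the repository's `client.py`.

```python
import json
import urllib.request

# Action types offered by the console's "Action Type" dropdown.
ACTION_TYPES = ("NO_OP", "SCALE_UP", "SCALE_DOWN", "REROUTE_TRAFFIC", "SHED_LOAD")


def build_step_payload(action_type: str, target_node_id: str, parameter: float = 0.0) -> dict:
    """Build the same JSON body that the console's stepEnv() sends to /step."""
    if action_type not in ACTION_TYPES:
        raise ValueError(f"unknown action_type: {action_type!r}")
    return {
        "action": {
            "action_type": action_type,
            "target_node_id": target_node_id,
            "parameter": float(parameter),
        }
    }


def post_json(base_url: str, path: str, body: dict, timeout: float = 10.0) -> dict:
    """POST a JSON body and decode the JSON reply."""
    request = urllib.request.Request(
        base_url.rstrip("/") + path,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)


def reset_then_step(base_url: str = "http://localhost:8000") -> dict:
    """Reset the episode, then send one SCALE_UP step (base_url is assumed)."""
    post_json(base_url, "/reset", {})
    return post_json(base_url, "/step", build_step_payload("SCALE_UP", "node-0", 1.0))
```

Unlike the browser's `fetch`, `urllib.request.urlopen` raises `HTTPError` on a 4xx or 5xx status, so a validation failure that the console reports through `data.detail` would surface here as an exception instead.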

deploy/kind-maxpods-250.yaml ADDED

	@@ -0,0 +1,11 @@

+kind: Cluster
+apiVersion: kind.x-k8s.io/v1alpha4
+name: antiatropos-local
+nodes:
+  - role: control-plane
+    kubeadmConfigPatches:
+      - |
+        kind: InitConfiguration
+        nodeRegistration:
+          kubeletExtraArgs:
+            max-pods: "250"
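
One way to confirm that the kubelet picked up `max-pods: "250"` is to read the pod capacity that the node reports in its status. The sketch below parses a single Node object as printed by `kubectl get node <name> -o json`. `pod_capacity` is a hypothetical helper, and the node name in the sample follows kind's usual `<cluster-name>-control-plane` convention, which is an assumption here.

```python
import json


def pod_capacity(node_json: str) -> int:
    """Return status.capacity.pods from one Node object serialized as JSON."""
    node = json.loads(node_json)
    # Kubernetes reports the pod capacity as a string quantity, e.g. "250".
    return int(node["status"]["capacity"]["pods"])


# Trimmed sample of the JSON shape; a real Node object carries many more fields.
sample = json.dumps({
    "kind": "Node",
    "metadata": {"name": "antiatropos-local-control-plane"},
    "status": {"capacity": {"cpu": "8", "pods": "250"}},
})
print(pod_capacity(sample))
```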

deploy/local-laptop.yaml ADDED

	@@ -0,0 +1,365 @@

+apiVersion: v1
+kind: Namespace
+metadata:
+  name: prod-sre
+---
+apiVersion: apps/v1
+kind: Deployment
+metadata:
+  name: auth
+  namespace: prod-sre
+spec:
+  replicas: 1
+  selector:
+    matchLabels:
+      app: auth
+  template:
+    metadata:
+      labels:
+        app: auth
+      annotations:
+        prometheus.io/scrape: "true"
+        prometheus.io/port: "8080"
+        prometheus.io/path: "/metrics.txt"
+    spec:
+      containers:
+      - name: auth
+        image: python:3.12-alpine
+        env:
+        - name: NODE_ID
+          value: node-4
+        - name: BASE_QUEUE
+          value: "6"
+        command: ["/bin/sh", "-lc"]
+        args:
+        - |
+          mkdir -p /www
+          echo ok > /www/index.html
+          python -m http.server 8080 --directory /www >/tmp/http.log 2>&1 &
+          req=0; err=0; cpu_total=0
+          while true; do
+            t=$(date +%s)
+            noise=$((t % 11))
+            req=$((req + 30 + noise))
+            q=$((BASE_QUEUE + (t % 20) - 10))
+            if [ "$q" -lt 0 ]; then q=0; fi
+            err=$((err + q / 20))
+            cpu_inc=$((10 + q / 10))
+            cpu_total=$((cpu_total + cpu_inc))
+            lat_ms=$((35 + q * 3))
+            b005=$((req / 5)); b01=$((req / 3)); b025=$((req / 2)); b05=$((req * 3 / 4)); b1=$req; b2=$req
+            lat_sum=$(awk "BEGIN {printf \"%.3f\", $req * $lat_ms / 1000.0}")
+            {
+              echo "# HELP http_requests_total Synthetic request counter"
+              echo "# TYPE http_requests_total counter"
+              echo "http_requests_total{node_id=\"${NODE_ID}\",status=\"200\"} ${req}"
+              echo "http_requests_total{node_id=\"${NODE_ID}\",status=\"500\"} ${err}"
+              echo "# HELP queue_depth Synthetic queue depth"
+              echo "# TYPE queue_depth gauge"
+              echo "queue_depth{node_id=\"${NODE_ID}\"} ${q}"
+              echo "# HELP container_cpu_usage_seconds_total Synthetic CPU counter"
+              echo "# TYPE container_cpu_usage_seconds_total counter"
+              echo "container_cpu_usage_seconds_total{node_id=\"${NODE_ID}\"} ${cpu_total}"
+              echo "# HELP http_request_duration_seconds Synthetic request duration histogram"
+              echo "# TYPE http_request_duration_seconds histogram"
+              echo "http_request_duration_seconds_bucket{node_id=\"${NODE_ID}\",le=\"0.05\"} ${b005}"
+              echo "http_request_duration_seconds_bucket{node_id=\"${NODE_ID}\",le=\"0.1\"} ${b01}"
+              echo "http_request_duration_seconds_bucket{node_id=\"${NODE_ID}\",le=\"0.25\"} ${b025}"
+              echo "http_request_duration_seconds_bucket{node_id=\"${NODE_ID}\",le=\"0.5\"} ${b05}"
+              echo "http_request_duration_seconds_bucket{node_id=\"${NODE_ID}\",le=\"1\"} ${b1}"
+              echo "http_request_duration_seconds_bucket{node_id=\"${NODE_ID}\",le=\"2\"} ${b2}"
+              echo "http_request_duration_seconds_bucket{node_id=\"${NODE_ID}\",le=\"+Inf\"} ${req}"
+              echo "http_request_duration_seconds_count{node_id=\"${NODE_ID}\"} ${req}"
+              echo "http_request_duration_seconds_sum{node_id=\"${NODE_ID}\"} ${lat_sum}"
+            } > /www/metrics.txt
+            sleep 2
+          done
+---
+apiVersion: apps/v1
+kind: Deployment
+metadata:
+  name: cart
+  namespace: prod-sre
+spec:
+  replicas: 1
+  selector:
+    matchLabels:
+      app: cart
+  template:
+    metadata:
+      labels:
+        app: cart
+      annotations:
+        prometheus.io/scrape: "true"
+        prometheus.io/port: "8080"
+        prometheus.io/path: "/metrics.txt"
+    spec:
+      containers:
+      - name: cart
+        image: python:3.12-alpine
+        env:
+        - name: NODE_ID
+          value: node-3
+        - name: BASE_QUEUE
+          value: "14"
+        command: ["/bin/sh", "-lc"]
+        args:
+        - |
+          mkdir -p /www
+          echo ok > /www/index.html
+          python -m http.server 8080 --directory /www >/tmp/http.log 2>&1 &
+          req=0; err=0; cpu_total=0
+          while true; do
+            t=$(date +%s)
+            noise=$((t % 11))
+            req=$((req + 30 + noise))
+            q=$((BASE_QUEUE + (t % 20) - 10))
+            if [ "$q" -lt 0 ]; then q=0; fi
+            err=$((err + q / 20))
+            cpu_inc=$((10 + q / 10))
+            cpu_total=$((cpu_total + cpu_inc))
+            lat_ms=$((35 + q * 3))
+            b005=$((req / 5)); b01=$((req / 3)); b025=$((req / 2)); b05=$((req * 3 / 4)); b1=$req; b2=$req
+            lat_sum=$(awk "BEGIN {printf \"%.3f\", $req * $lat_ms / 1000.0}")
+            {
+              echo "# HELP http_requests_total Synthetic request counter"
+              echo "# TYPE http_requests_total counter"
+              echo "http_requests_total{node_id=\"${NODE_ID}\",status=\"200\"} ${req}"
+              echo "http_requests_total{node_id=\"${NODE_ID}\",status=\"500\"} ${err}"
+              echo "# HELP queue_depth Synthetic queue depth"
+              echo "# TYPE queue_depth gauge"
+              echo "queue_depth{node_id=\"${NODE_ID}\"} ${q}"
+              echo "# HELP container_cpu_usage_seconds_total Synthetic CPU counter"
+              echo "# TYPE container_cpu_usage_seconds_total counter"
+              echo "container_cpu_usage_seconds_total{node_id=\"${NODE_ID}\"} ${cpu_total}"
+              echo "# HELP http_request_duration_seconds Synthetic request duration histogram"
+              echo "# TYPE http_request_duration_seconds histogram"
+              echo "http_request_duration_seconds_bucket{node_id=\"${NODE_ID}\",le=\"0.05\"} ${b005}"
+              echo "http_request_duration_seconds_bucket{node_id=\"${NODE_ID}\",le=\"0.1\"} ${b01}"
+              echo "http_request_duration_seconds_bucket{node_id=\"${NODE_ID}\",le=\"0.25\"} ${b025}"
+              echo "http_request_duration_seconds_bucket{node_id=\"${NODE_ID}\",le=\"0.5\"} ${b05}"
+              echo "http_request_duration_seconds_bucket{node_id=\"${NODE_ID}\",le=\"1\"} ${b1}"
+              echo "http_request_duration_seconds_bucket{node_id=\"${NODE_ID}\",le=\"2\"} ${b2}"
+              echo "http_request_duration_seconds_bucket{node_id=\"${NODE_ID}\",le=\"+Inf\"} ${req}"
+              echo "http_request_duration_seconds_count{node_id=\"${NODE_ID}\"} ${req}"
+              echo "http_request_duration_seconds_sum{node_id=\"${NODE_ID}\"} ${lat_sum}"
+            } > /www/metrics.txt
+            sleep 2
+          done
+---
+apiVersion: apps/v1
+kind: Deployment
+metadata:
+  name: catalog
+  namespace: prod-sre
+spec:
+  replicas: 1
+  selector:
+    matchLabels:
+      app: catalog
+  template:
+    metadata:
+      labels:
+        app: catalog
+      annotations:
+        prometheus.io/scrape: "true"
+        prometheus.io/port: "8080"
+        prometheus.io/path: "/metrics.txt"
+    spec:
+      containers:
+      - name: catalog
+        image: python:3.12-alpine
+        env:
+        - name: NODE_ID
+          value: node-2
+        - name: BASE_QUEUE
+          value: "20"
+        command: ["/bin/sh", "-lc"]
+        args:
+        - |
+          mkdir -p /www
+          echo ok > /www/index.html
+          python -m http.server 8080 --directory /www >/tmp/http.log 2>&1 &
+          req=0; err=0; cpu_total=0
+          while true; do
+            t=$(date +%s)
+            noise=$((t % 11))
+            req=$((req + 30 + noise))
+            q=$((BASE_QUEUE + (t % 20) - 10))
+            if [ "$q" -lt 0 ]; then q=0; fi
+            err=$((err + q / 20))
+            cpu_inc=$((10 + q / 10))
+            cpu_total=$((cpu_total + cpu_inc))
+            lat_ms=$((35 + q * 3))
+            b005=$((req / 5)); b01=$((req / 3)); b025=$((req / 2)); b05=$((req * 3 / 4)); b1=$req; b2=$req
+            lat_sum=$(awk "BEGIN {printf \"%.3f\", $req * $lat_ms / 1000.0}")
+            {
+              echo "# HELP http_requests_total Synthetic request counter"
+              echo "# TYPE http_requests_total counter"
+              echo "http_requests_total{node_id=\"${NODE_ID}\",status=\"200\"} ${req}"
+              echo "http_requests_total{node_id=\"${NODE_ID}\",status=\"500\"} ${err}"
+              echo "# HELP queue_depth Synthetic queue depth"
+              echo "# TYPE queue_depth gauge"
+              echo "queue_depth{node_id=\"${NODE_ID}\"} ${q}"
+              echo "# HELP container_cpu_usage_seconds_total Synthetic CPU counter"
+              echo "# TYPE container_cpu_usage_seconds_total counter"
+              echo "container_cpu_usage_seconds_total{node_id=\"${NODE_ID}\"} ${cpu_total}"
+              echo "# HELP http_request_duration_seconds Synthetic request duration histogram"
+              echo "# TYPE http_request_duration_seconds histogram"
+              echo "http_request_duration_seconds_bucket{node_id=\"${NODE_ID}\",le=\"0.05\"} ${b005}"
+              echo "http_request_duration_seconds_bucket{node_id=\"${NODE_ID}\",le=\"0.1\"} ${b01}"
+              echo "http_request_duration_seconds_bucket{node_id=\"${NODE_ID}\",le=\"0.25\"} ${b025}"
+              echo "http_request_duration_seconds_bucket{node_id=\"${NODE_ID}\",le=\"0.5\"} ${b05}"
+              echo "http_request_duration_seconds_bucket{node_id=\"${NODE_ID}\",le=\"1\"} ${b1}"
+              echo "http_request_duration_seconds_bucket{node_id=\"${NODE_ID}\",le=\"2\"} ${b2}"
+              echo "http_request_duration_seconds_bucket{node_id=\"${NODE_ID}\",le=\"+Inf\"} ${req}"
+              echo "http_request_duration_seconds_count{node_id=\"${NODE_ID}\"} ${req}"
+              echo "http_request_duration_seconds_sum{node_id=\"${NODE_ID}\"} ${lat_sum}"
+            } > /www/metrics.txt
+            sleep 2
+          done
+---
+apiVersion: apps/v1
+kind: Deployment
+metadata:
+  name: checkout
+  namespace: prod-sre
+spec:
+  replicas: 1
+  selector:
+    matchLabels:
+      app: checkout
+  template:
+    metadata:
+      labels:
+        app: checkout
+      annotations:
+        prometheus.io/scrape: "true"
+        prometheus.io/port: "8080"
+        prometheus.io/path: "/metrics.txt"
+    spec:
+      containers:
+      - name: checkout
+        image: python:3.12-alpine
+        env:
+        - name: NODE_ID
+          value: node-1
+        - name: BASE_QUEUE
+          value: "24"
+        command: ["/bin/sh", "-lc"]
+        args:
+        - |
+          mkdir -p /www
+          echo ok > /www/index.html
+          python -m http.server 8080 --directory /www >/tmp/http.log 2>&1 &
+          req=0; err=0; cpu_total=0
+          while true; do
+            t=$(date +%s)
+            noise=$((t % 11))
+            req=$((req + 30 + noise))
+            q=$((BASE_QUEUE + (t % 20) - 10))
+            if [ "$q" -lt 0 ]; then q=0; fi
+            err=$((err + q / 20))
+            cpu_inc=$((10 + q / 10))
+            cpu_total=$((cpu_total + cpu_inc))
+            lat_ms=$((35 + q * 3))
+            b005=$((req / 5)); b01=$((req / 3)); b025=$((req / 2)); b05=$((req * 3 / 4)); b1=$req; b2=$req
+            lat_sum=$(awk "BEGIN {printf \"%.3f\", $req * $lat_ms / 1000.0}")
+            {
+              echo "# HELP http_requests_total Synthetic request counter"
+              echo "# TYPE http_requests_total counter"
+              echo "http_requests_total{node_id=\"${NODE_ID}\",status=\"200\"} ${req}"
+              echo "http_requests_total{node_id=\"${NODE_ID}\",status=\"500\"} ${err}"
+              echo "# HELP queue_depth Synthetic queue depth"
+              echo "# TYPE queue_depth gauge"
+              echo "queue_depth{node_id=\"${NODE_ID}\"} ${q}"
+              echo "# HELP container_cpu_usage_seconds_total Synthetic CPU counter"
+              echo "# TYPE container_cpu_usage_seconds_total counter"
+              echo "container_cpu_usage_seconds_total{node_id=\"${NODE_ID}\"} ${cpu_total}"
+              echo "# HELP http_request_duration_seconds Synthetic request duration histogram"
+              echo "# TYPE http_request_duration_seconds histogram"
+              echo "http_request_duration_seconds_bucket{node_id=\"${NODE_ID}\",le=\"0.05\"} ${b005}"
+              echo "http_request_duration_seconds_bucket{node_id=\"${NODE_ID}\",le=\"0.1\"} ${b01}"
+              echo "http_request_duration_seconds_bucket{node_id=\"${NODE_ID}\",le=\"0.25\"} ${b025}"
+              echo "http_request_duration_seconds_bucket{node_id=\"${NODE_ID}\",le=\"0.5\"} ${b05}"
+              echo "http_request_duration_seconds_bucket{node_id=\"${NODE_ID}\",le=\"1\"} ${b1}"
+              echo "http_request_duration_seconds_bucket{node_id=\"${NODE_ID}\",le=\"2\"} ${b2}"
+              echo "http_request_duration_seconds_bucket{node_id=\"${NODE_ID}\",le=\"+Inf\"} ${req}"
+              echo "http_request_duration_seconds_count{node_id=\"${NODE_ID}\"} ${req}"
+              echo "http_request_duration_seconds_sum{node_id=\"${NODE_ID}\"} ${lat_sum}"
+            } > /www/metrics.txt
+            sleep 2
+          done
+---
+apiVersion: apps/v1
+kind: Deployment
+metadata:
+  name: payments
+  namespace: prod-sre
+spec:
+  replicas: 2
+  selector:
+    matchLabels:
+      app: payments
+  template:
+    metadata:
+      labels:
+        app: payments
+      annotations:
+        prometheus.io/scrape: "true"
+        prometheus.io/port: "8080"
+        prometheus.io/path: "/metrics.txt"
+    spec:
+      containers:
+      - name: payments
+        image: python:3.12-alpine
+        env:
+        - name: NODE_ID
+          value: node-0
+        - name: BASE_QUEUE
+          value: "30"
+        command: ["/bin/sh", "-lc"]
+        args:
+        - |
+          mkdir -p /www
+          echo ok > /www/index.html
+          python -m http.server 8080 --directory /www >/tmp/http.log 2>&1 &
+          req=0; err=0; cpu_total=0
+          while true; do
+            t=$(date +%s)
+            noise=$((t % 11))
+            req=$((req + 30 + noise))
+            q=$((BASE_QUEUE + (t % 20) - 10))
+            if [ "$q" -lt 0 ]; then q=0; fi
+            err=$((err + q / 20))
+            cpu_inc=$((10 + q / 10))
+            cpu_total=$((cpu_total + cpu_inc))
+            lat_ms=$((35 + q * 3))
+            b005=$((req / 5)); b01=$((req / 3)); b025=$((req / 2)); b05=$((req * 3 / 4)); b1=$req; b2=$req
+            lat_sum=$(awk "BEGIN {printf \"%.3f\", $req * $lat_ms / 1000.0}")
+            {
+              echo "# HELP http_requests_total Synthetic request counter"
+              echo "# TYPE http_requests_total counter"
+              echo "http_requests_total{node_id=\"${NODE_ID}\",status=\"200\"} ${req}"
+              echo "http_requests_total{node_id=\"${NODE_ID}\",status=\"500\"} ${err}"
+              echo "# HELP queue_depth Synthetic queue depth"
+              echo "# TYPE queue_depth gauge"
+              echo "queue_depth{node_id=\"${NODE_ID}\"} ${q}"
+              echo "# HELP container_cpu_usage_seconds_total Synthetic CPU counter"
+              echo "# TYPE container_cpu_usage_seconds_total counter"
+              echo "container_cpu_usage_seconds_total{node_id=\"${NODE_ID}\"} ${cpu_total}"
+              echo "# HELP http_request_duration_seconds Synthetic request duration histogram"
+              echo "# TYPE http_request_duration_seconds histogram"
+              echo "http_request_duration_seconds_bucket{node_id=\"${NODE_ID}\",le=\"0.05\"} ${b005}"
+              echo "http_request_duration_seconds_bucket{node_id=\"${NODE_ID}\",le=\"0.1\"} ${b01}"
+              echo "http_request_duration_seconds_bucket{node_id=\"${NODE_ID}\",le=\"0.25\"} ${b025}"
+              echo "http_request_duration_seconds_bucket{node_id=\"${NODE_ID}\",le=\"0.5\"} ${b05}"
+              echo "http_request_duration_seconds_bucket{node_id=\"${NODE_ID}\",le=\"1\"} ${b1}"
+              echo "http_request_duration_seconds_bucket{node_id=\"${NODE_ID}\",le=\"2\"} ${b2}"
+              echo "http_request_duration_seconds_bucket{node_id=\"${NODE_ID}\",le=\"+Inf\"} ${req}"
+              echo "http_request_duration_seconds_count{node_id=\"${NODE_ID}\"} ${req}"
+              echo "http_request_duration_seconds_sum{node_id=\"${NODE_ID}\"} ${lat_sum}"
+            } > /www/metrics.txt
+            sleep 2
+          done
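
All five Deployments run the same shell loop and differ only in `NODE_ID` and `BASE_QUEUE`. The arithmetic of one loop iteration can be sketched in Python to make the generated values easier to reason about. `Counters` and `tick` are hypothetical names used only for illustration. Integer division with `//` matches the shell's `$(( ))` here because every operand is non-negative once the queue depth has been clamped.

```python
from dataclasses import dataclass


@dataclass
class Counters:
    """Monotonic counters the shell loop keeps across iterations."""
    req: int = 0
    err: int = 0
    cpu_total: int = 0


def tick(state: Counters, t: int, base_queue: int) -> dict:
    """Advance the counters by one iteration for Unix time t."""
    state.req += 30 + t % 11                 # req=$((req + 30 + noise))
    q = max(0, base_queue + t % 20 - 10)     # queue depth, clamped at zero
    state.err += q // 20                     # err=$((err + q / 20))
    state.cpu_total += 10 + q // 10          # cpu_total grows by cpu_inc
    lat_ms = 35 + q * 3
    req = state.req
    return {
        "queue_depth": q,
        "lat_ms": lat_ms,
        # Cumulative histogram buckets, keyed by their "le" label.
        "buckets": {
            "0.05": req // 5,
            "0.1": req // 3,
            "0.25": req // 2,
            "0.5": req * 3 // 4,
            "1": req,
            "2": req,
            "+Inf": req,
        },
        "lat_sum": f"{req * lat_ms / 1000.0:.3f}",  # same format as the awk printf
    }


auth = Counters()
print(tick(auth, t=25, base_queue=6))   # auth pod: BASE_QUEUE=6
```

With `BASE_QUEUE=6` the queue depth spends part of every 20-second cycle clamped at zero, while the `payments` pod (`BASE_QUEUE=30`) stays between 20 and 39.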

deploy/local/datasource-local.yaml ADDED

	@@ -0,0 +1,10 @@

+apiVersion: 1
+datasources:
+  - name: Prometheus
+    uid: PBFA97CFB590B2093
+    type: prometheus
+    access: proxy
+    url: http://prometheus-local:9090
+    isDefault: true
+    editable: true

deploy/local/grafana-local-values.yaml ADDED

	@@ -0,0 +1,34 @@

+adminUser: admin
+adminPassword: antiatropos
+service:
+  type: ClusterIP
+persistence:
+  enabled: false
+resources:
+  requests:
+    cpu: 100m
+    memory: 192Mi
+  limits:
+    cpu: 400m
+    memory: 384Mi
+datasources:
+  datasources.yaml:
+    apiVersion: 1
+    datasources:
+      - name: Prometheus
+        type: prometheus
+        access: proxy
+        url: http://prometheus-server.monitoring.svc.cluster.local
+        isDefault: true
+        editable: true
+sidecar:
+  dashboards:
+    enabled: true
+    label: grafana_dashboard
+    labelValue: "1"
+    searchNamespace: ALL

deploy/local/prometheus-local-values.yaml ADDED

	@@ -0,0 +1,49 @@

+alertmanager:
+  enabled: false
+kube-state-metrics:
+  enabled: false
+prometheus-node-exporter:
+  enabled: false
+prometheus-pushgateway:
+  enabled: false
+extraScrapeConfigs: |
+  - job_name: 'antiatropos-fastapi'
+    metrics_path: /metrics
+    static_configs:
+      - targets: ['host.docker.internal:8000']
+  - job_name: 'prod-sre-annotated-pods'
+    kubernetes_sd_configs:
+      - role: pod
+        namespaces:
+          names: ['prod-sre']
+    relabel_configs:
+      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
+        action: keep
+        regex: true
+      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
+        action: replace
+        target_label: __metrics_path__
+        regex: (.+)
+      - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
+        action: replace
+        regex: ([^:]+)(?::\d+)?;(\d+)
+        replacement: $1:$2
+        target_label: __address__
+server:
+  persistentVolume:
+    enabled: false
+  resources:
+    requests:
+      cpu: 100m
+      memory: 256Mi
+    limits:
+      cpu: 500m
+      memory: 512Mi
+  service:
+    type: ClusterIP
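
Reviewer note: the last relabel rule in `prod-sre-annotated-pods` rewrites `__address__` so that the scrape port comes from the pod's `prometheus.io/port` annotation. Prometheus joins the `source_labels` values with `;` and anchors the regex at both ends. The sketch below imitates that behaviour in Python as an illustration only: `rewrite_address` is a hypothetical helper, not Prometheus code, and it uses Python's `re` module rather than RE2.

```python
import re

# Same pattern as the relabel rule; Prometheus anchors it at both ends,
# which fullmatch() reproduces here.
ADDRESS_RE = re.compile(r"([^:]+)(?::\d+)?;(\d+)")


def rewrite_address(address: str, port_annotation: str) -> str:
    """Return the scrape address after the port-annotation relabel step.

    `address` stands in for __address__ and `port_annotation` for the
    value of the prometheus.io/port annotation (empty when unset).
    """
    joined = f"{address};{port_annotation}"  # default separator is ';'
    match = ADDRESS_RE.fullmatch(joined)
    if match is None:
        # For action 'replace', a non-matching regex leaves the label as is.
        return address
    return f"{match.group(1)}:{match.group(2)}"


if __name__ == "__main__":
    print(rewrite_address("10.0.0.5:8080", "9100"))
    print(rewrite_address("10.0.0.5", "9100"))
    print(rewrite_address("10.0.0.5:8080", ""))
```

Pods without a port annotation keep their discovered address, since the second capture group requires at least one digit. Note that the `[^:]+` host group does not match IPv6 literals.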

deploy/nginx.conf CHANGED

@@ -1,89 +1,89 @@
-worker_processes auto;
-pid /tmp/nginx.pid;
-error_log /dev/stderr info;
-events {
-    worker_connections 1024;
-}
-http {
-    include /etc/nginx/mime.types;
-    default_type application/octet-stream;
-    sendfile on;
-    keepalive_timeout 65;
-    access_log /dev/stdout;
-    map $http_upgrade $connection_upgrade {
-        default upgrade;
-        '' close;
-    }
-    server {
-        listen 7860;
-        server_name _;
-        client_max_body_size 50m;
-        proxy_read_timeout 3600s;
-        proxy_send_timeout 3600s;
-        location = /prometheus {
-            return 301 /prometheus/;
-        }
-        location = /grafana {
-            return 301 /grafana/;
-        }
-        location /prometheus/ {
-            proxy_pass http://127.0.0.1:9090;
-            proxy_http_version 1.1;
-            proxy_set_header Host $host;
-            proxy_set_header X-Real-IP $remote_addr;
-            proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
-            proxy_set_header X-Forwarded-Host $host;
-            proxy_set_header X-Forwarded-Proto $scheme;
-            proxy_set_header X-Forwarded-Prefix /prometheus;
-        }
-        location /grafana/ {
-            proxy_pass http://127.0.0.1:3000;
-            proxy_http_version 1.1;
-            proxy_set_header Host $host;
-            proxy_set_header X-Real-IP $remote_addr;
-            proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
-            proxy_set_header X-Forwarded-Host $host;
-            proxy_set_header X-Forwarded-Proto $scheme;
-            proxy_set_header X-Forwarded-Prefix /grafana;
-        }
-        location /grafana/api/live/ {
-            proxy_pass http://127.0.0.1:3000;
-            proxy_http_version 1.1;
-            proxy_set_header Upgrade $http_upgrade;
-            proxy_set_header Connection $connection_upgrade;
-            proxy_set_header Host $host;
-            proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
-            proxy_set_header X-Forwarded-Host $host;
-            proxy_set_header X-Forwarded-Proto $scheme;
-            proxy_set_header X-Forwarded-Prefix /grafana;
-        }
-        location / {
-            root /var/www/html;
-            index index.html;
-            try_files $uri $uri/ @fastapi;
-        }
-        location @fastapi {
-            proxy_pass http://127.0.0.1:8000;
-            proxy_http_version 1.1;
-            proxy_set_header Host $host;
-            proxy_set_header X-Real-IP $remote_addr;
-            proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
-            proxy_set_header X-Forwarded-Host $host;
-            proxy_set_header X-Forwarded-Proto $scheme;
-            proxy_set_header Upgrade $http_upgrade;
-            proxy_set_header Connection $connection_upgrade;
-        }
-    }
-}

+worker_processes auto;
+pid /tmp/nginx.pid;
+error_log /dev/stderr info;
+events {
+    worker_connections 1024;
+}
+http {
+    include /etc/nginx/mime.types;
+    default_type application/octet-stream;
+    sendfile on;
+    keepalive_timeout 65;
+    access_log /dev/stdout;
+    map $http_upgrade $connection_upgrade {
+        default upgrade;
+        '' close;
+    }
+    server {
+        listen 7860;
+        server_name _;
+        client_max_body_size 50m;
+        proxy_read_timeout 3600s;
+        proxy_send_timeout 3600s;
+        location = /prometheus {
+            return 301 /prometheus/;
+        }
+        location = /grafana {
+            return 301 /grafana/;
+        }
+        location /prometheus/ {
+            proxy_pass http://127.0.0.1:9090;
+            proxy_http_version 1.1;
+            proxy_set_header Host $host;
+            proxy_set_header X-Real-IP $remote_addr;
+            proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
+            proxy_set_header X-Forwarded-Host $host;
+            proxy_set_header X-Forwarded-Proto $scheme;
+            proxy_set_header X-Forwarded-Prefix /prometheus;
+        }
+        location /grafana/ {
+            proxy_pass http://127.0.0.1:3000;
+            proxy_http_version 1.1;
+            proxy_set_header Host $host;
+            proxy_set_header X-Real-IP $remote_addr;
+            proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
+            proxy_set_header X-Forwarded-Host $host;
+            proxy_set_header X-Forwarded-Proto $scheme;
+            proxy_set_header X-Forwarded-Prefix /grafana;
+        }
+        location /grafana/api/live/ {
+            proxy_pass http://127.0.0.1:3000;
+            proxy_http_version 1.1;
+            proxy_set_header Upgrade $http_upgrade;
+            proxy_set_header Connection $connection_upgrade;
+            proxy_set_header Host $host;
+            proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
+            proxy_set_header X-Forwarded-Host $host;
+            proxy_set_header X-Forwarded-Proto $scheme;
+            proxy_set_header X-Forwarded-Prefix /grafana;
+        }
+        location / {
+            root /var/www/html;
+            index index.html;
+            try_files $uri $uri/ @fastapi;
+        }
+        location @fastapi {
+            proxy_pass http://127.0.0.1:8000;
+            proxy_http_version 1.1;
+            proxy_set_header Host $host;
+            proxy_set_header X-Real-IP $remote_addr;
+            proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
+            proxy_set_header X-Forwarded-Host $host;
+            proxy_set_header X-Forwarded-Proto $scheme;
+            proxy_set_header Upgrade $http_upgrade;
+            proxy_set_header Connection $connection_upgrade;
+        }
+    }
+}
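
Reviewer note: this `server` block has no regex locations, so nginx picks an exact (`=`) match first and otherwise the longest matching prefix. The sketch below models that selection for the paths defined in this file. It is a simplified illustration under that assumption: `route` and the two tables are hypothetical names, and the `/` entry only describes the `try_files` fallback rather than performing it.

```python
# Exact-match locations, which only issue redirects in this config.
EXACT_REDIRECTS = {
    "/prometheus": "/prometheus/",
    "/grafana": "/grafana/",
}

# Prefix locations and a description of what serves them.
PREFIX_UPSTREAMS = {
    "/prometheus/": "http://127.0.0.1:9090",
    "/grafana/": "http://127.0.0.1:3000",
    "/grafana/api/live/": "http://127.0.0.1:3000 (websocket upgrade)",
    "/": "static files, then http://127.0.0.1:8000",
}


def route(path: str) -> str:
    """Describe how a request path would be handled by the server block."""
    if path in EXACT_REDIRECTS:
        return f"301 {EXACT_REDIRECTS[path]}"
    # Longest matching prefix wins; "/" always matches an absolute path.
    best = max((p for p in PREFIX_UPSTREAMS if path.startswith(p)), key=len)
    return PREFIX_UPSTREAMS[best]


if __name__ == "__main__":
    for p in ("/grafana", "/grafana/d/abc", "/grafana/api/live/ws", "/metrics"):
        print(p, "->", route(p))
```

The separate `/grafana/api/live/` location matters because it is the only Grafana location that forwards the `Upgrade` and `Connection` headers needed for websockets.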