# AtlasOps — Architecture

## System Overview

```mermaid
graph TB
    subgraph GCP["Google Cloud Platform — us-central1"]
        subgraph GKE["GKE Standard Cluster (atlasops)"]
            OB["Online Boutique<br/>11 microservices<br/>gRPC + protobuf"]
            CM["Chaos Mesh<br/>PodChaos · NetworkChaos<br/>StressChaos · DNSChaos<br/>IOChaos · TimeChaos"]
            PROM["Prometheus + Grafana<br/>kube-prometheus-stack"]
            JAEGER["Jaeger + OTel Collector<br/>Distributed tracing"]
            ARGO["Argo CD<br/>GitOps rollbacks"]
            AM["Alertmanager<br/>Webhook → coordinator"]
        end
        CSQL["Cloud SQL<br/>Postgres 15"]
        PS["Cloud PubSub<br/>cloudsre-checkout-events"]
        CMON["Cloud Monitoring API<br/>GCP-native metrics"]
        CLOG["Cloud Logging<br/>Structured logs"]
    end

    subgraph MI300X["AMD MI300X — 192 GB HBM3"]
        COORD["Coordinator<br/>FastAPI :9099"]
        subgraph AGENTS["4 Specialized Agents (Qwen2.5-7B + QLoRA)"]
            T["🔴 Triage"]
            D["🔍 Diagnosis"]
            R["🔧 Remediation"]
            C["📣 Comms"]
        end
        JUDGE["72B Judge<br/>Qwen2.5-72B<br/>Adversarial designer<br/>+ Evaluator"]
        DASH["Gradio Dashboard<br/>:7860"]
    end

    CM -->|injects faults| OB
    OB -->|metrics| PROM
    OB -->|traces| JAEGER
    PROM -->|alert fires| AM
    AM -->|webhook| COORD
    COORD --> T --> D --> R --> C
    T & D & R & C <-->|"20 real SRE tools<br/>(kubectl · promql · jaeger<br/>argocd · gcloud · alertmanager)"| GKE
    T & D & R & C <-->|Cloud APIs| CSQL & PS & CMON & CLOG
    JUDGE -->|generates scenarios| CM
    JUDGE -->|scores actions| COORD
    DASH -->|live thoughts SSE| COORD
```

---

## Agent Chain

```mermaid
sequenceDiagram
    participant AM as Alertmanager
    participant CO as Coordinator
    participant TR as Triage Agent
    participant DG as Diagnosis Agent
    participant RM as Remediation Agent
    participant CM as Comms Agent
    participant GKE as Real GKE Cluster

    AM->>CO: POST /webhook (alert fired)
    CO->>TR: {incident_id, alert}
    TR->>GKE: kubectl_top_pods()
    TR->>GKE: promql_query("rate(5xx[1m])")
    TR-->>CO: {severity: P1, blast_radius: [...]}

    CO->>DG: {triage_output}
    DG->>GKE: jaeger_search(service, min_duration=500ms)
    DG->>GKE: promql_query_range(query, last_15m)
    DG->>GKE: kubectl_logs(bottleneck_pod)
    DG->>GKE: argocd_app_history(app)
    DG-->>CO: {root_cause: {...}, recommended_actions: [...]}

    CO->>RM: {triage, diagnosis}
    RM->>GKE: argocd_rollback(app, revision)
    RM->>GKE: promql_query("rate(5xx[1m])")
    Note over RM: verify error_rate < 1%
    RM-->>CO: {outcome: resolved, ttr: 187s}

    CO->>CM: {full incident chain}
    CM->>CM: slack_post_update(...)
    CM->>CM: postmortem_draft(incident)
    CM-->>CO: {postmortem_path: docs/postmortems/...}
```

---

## AMD MI300X Co-hosting

```
AMD MI300X (192 GB HBM3)
┌─────────────────────────────────────────────────────────┐
│                                                         │
│  Qwen2.5-7B base (4-bit NF4)             ~4 GB          │
│  ├─ triage_adapter      (LoRA r=16)      ~40 MB         │
│  ├─ diagnosis_adapter   (LoRA r=16)      ~40 MB         │
│  ├─ remediation_adapter (LoRA r=16)      ~40 MB         │
│  └─ comms_adapter       (LoRA r=16)      ~40 MB         │
│                                                         │
│  Qwen2.5-72B (4-bit NF4)                 ~37 GB         │
│  └─ Adversarial designer + judge                        │
│                                                         │
│  Total used: ~41 GB    Available: ~151 GB               │
│                                                         │
│  ❌ A100 (80 GB): can fit judge OR agents, not both     │
│  ❌ T4 (16 GB): can't fit even the 7B base              │
│  ✅ MI300X: all 5 models + room to spare                │
└─────────────────────────────────────────────────────────┘
```

---

## Training Pipeline

```mermaid
flowchart LR
    A["Real GKE Cluster<br/>11 live microservices"] -->|Chaos Mesh injects faults| B
    B["Alertmanager fires<br/>real webhook"] --> C
    C["4 agents run<br/>against live cluster"] --> D
    D["72B judge scores<br/>tool calls + outcomes"] --> E
    E["5k trajectory corpus<br/>data/sft_corpus.jsonl<br/>reward_contract included"]

    E --> F["Stage 1: QLoRA SFT<br/>Qwen2.5-7B + LoRA r=16<br/>4-bit NF4 on MI300X"]
    F --> G["Stage 2: GRPO<br/>Optuna HP search (6 trials)<br/>num_gen=8, cosine LR<br/>tier-aware reward contract"]
    G --> H["4 role adapters<br/>~40 MB each<br/>checkpoints/grpo_v3/"]
    H -->|"hot-swap adapters<br/>per agent call"| I["Production serving<br/>vLLM on MI300X"]
```

---

## 20 Real SRE Tools

```
kubectl (7)              promql (2)             jaeger (2)
─────────────            ──────────             ──────────
kubectl_get              promql_query           jaeger_search
kubectl_describe         promql_query_range     jaeger_get_trace
kubectl_logs
kubectl_top_pods         argocd (3)             gcloud (2)
kubectl_top_nodes        ─────────              ──────────
kubectl_rollout          argocd_list_apps       gcloud_logs_read
kubectl_scale            argocd_app_history     cloud_monitoring_query
kubectl_exec             argocd_rollback

alertmanager (2)         comms (2)
────────────────         ──────────
alertmanager_silence     slack_post_update
alertmanager_list        postmortem_draft
```

AtlasOps exposes 20 production-relevant SRE tools across Kubernetes, tracing, metrics, GitOps, and comms workflows.

---

## Reward Contract (Anti-Gaming)

```
reward = 0.35 × resolve      # did the incident get fixed?
       + 0.20 × evidence     # was the root cause proven with data?
       + 0.20 × safety       # was the action minimum-blast-radius?
       + 0.15 × speed        # logistic decay — fast is good, race-to-zero is penalised
       + 0.10 × comms        # was a postmortem generated?

       - 0.10  if turns > 40                            # command spam
       - 0.25  if claimed resolved but wasn't           # false resolution
       - 0.20  if efficiency < 0.3                      # unsafe shortcut
       - 0.20  if reasoning AND correctness both low    # hallucinated evidence
       - 0.10  if silenced an alert without resolving   # over-silencing

tier adjustments:
  cascade:     r_evidence weight ↑  (tracing the chain matters more)
  multi_fault: r_safety weight ↑    (conservative action matters more)
  adversarial: all penalties × 1.25 (harder tier, stricter scoring)
```
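
The contract above can be read as a weighted sum minus a set of flat penalties. The following is a minimal Python sketch of that reading, not the project's actual scoring code. The `Episode` container, its field names, the `"single"` default tier, the `tier_boost` size of the weight increase, and the `low` threshold for "both low" are all assumptions made for illustration; the contract states only the base weights, the penalty amounts, the `turns > 40` and `efficiency < 0.3` conditions, and the 1.25 multiplier. The `speed` score is taken as an input because the logistic decay curve is not specified here.

```python
from dataclasses import dataclass

# Base weights copied from the contract; they sum to 1.0.
WEIGHTS = {"resolve": 0.35, "evidence": 0.20, "safety": 0.20,
           "speed": 0.15, "comms": 0.10}


@dataclass
class Episode:
    """Hypothetical container for judge outputs (field names are illustrative)."""
    resolve: float                  # component scores, assumed to lie in [0, 1]
    evidence: float
    safety: float
    speed: float                    # assumed to be already logistic-decayed
    comms: float
    turns: int = 0
    false_resolution: bool = False  # claimed resolved but wasn't
    efficiency: float = 1.0
    reasoning: float = 1.0
    correctness: float = 1.0
    over_silenced: bool = False     # silenced an alert without resolving
    tier: str = "single"            # assumed name for the default tier


def reward(ep: Episode, tier_boost: float = 0.05, low: float = 0.3) -> float:
    """Weighted component sum minus penalties, with tier adjustments.

    tier_boost and low are assumed values; the contract does not state them.
    """
    weights = dict(WEIGHTS)
    if ep.tier == "cascade":
        weights["evidence"] += tier_boost   # evidence weight goes up
    elif ep.tier == "multi_fault":
        weights["safety"] += tier_boost     # safety weight goes up

    base = sum(w * getattr(ep, name) for name, w in weights.items())

    penalty = 0.0
    if ep.turns > 40:
        penalty += 0.10                     # command spam
    if ep.false_resolution:
        penalty += 0.25                     # false resolution
    if ep.efficiency < 0.3:
        penalty += 0.20                     # unsafe shortcut
    if ep.reasoning < low and ep.correctness < low:
        penalty += 0.20                     # hallucinated evidence
    if ep.over_silenced:
        penalty += 0.10                     # over-silencing

    if ep.tier == "adversarial":
        penalty *= 1.25                     # stricter scoring on the hard tier
    return base - penalty


if __name__ == "__main__":
    # A fully resolved incident that took too many turns.
    print(reward(Episode(1.0, 1.0, 1.0, 1.0, 1.0, turns=41)))
```

Keeping the penalties separate from the weighted sum makes the adversarial multiplier a single line, and it means a false resolution scores below zero even when every other component is zero.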