# AtlasOps: Architecture

## System Overview

```mermaid
graph TB
    subgraph GCP["Google Cloud Platform (us-central1)"]
        subgraph GKE["GKE Standard Cluster (atlasops)"]
            OB["Online Boutique<br/>11 microservices<br/>gRPC + protobuf"]
            CM["Chaos Mesh<br/>PodChaos · NetworkChaos<br/>StressChaos · DNSChaos<br/>IOChaos · TimeChaos"]
            PROM["Prometheus + Grafana<br/>kube-prometheus-stack"]
            JAEGER["Jaeger + OTel Collector<br/>Distributed tracing"]
            ARGO["Argo CD<br/>GitOps rollbacks"]
            AM["Alertmanager<br/>Webhook → coordinator"]
        end
        CSQL["Cloud SQL<br/>Postgres 15"]
        PS["Cloud PubSub<br/>cloudsre-checkout-events"]
        CMON["Cloud Monitoring API<br/>GCP-native metrics"]
        CLOG["Cloud Logging<br/>Structured logs"]
    end

    subgraph MI300X["AMD MI300X (192 GB HBM3)"]
        COORD["Coordinator<br/>FastAPI :9099"]
        subgraph AGENTS["4 Specialized Agents (Qwen2.5-7B + QLoRA)"]
            T["🔴 Triage"]
            D["🔍 Diagnosis"]
            R["🔧 Remediation"]
            C["📣 Comms"]
        end
        JUDGE["72B Judge<br/>Qwen2.5-72B<br/>Adversarial designer<br/>+ Evaluator"]
        DASH["Gradio Dashboard<br/>:7860"]
    end

    CM -->|injects faults| OB
    OB -->|metrics| PROM
    OB -->|traces| JAEGER
    PROM -->|alert fires| AM
    AM -->|webhook| COORD
    COORD --> T --> D --> R --> C
    T & D & R & C <-->|"20 real SRE tools<br/>(kubectl · promql · jaeger<br/>argocd · gcloud · alertmanager)"| GKE
    T & D & R & C <-->|Cloud APIs| CSQL & PS & CMON & CLOG
    JUDGE -->|generates scenarios| CM
    JUDGE -->|scores actions| COORD
    DASH -->|live thoughts SSE| COORD
```

---

## Agent Chain

```mermaid
sequenceDiagram
    participant AM as Alertmanager
    participant CO as Coordinator
    participant TR as Triage Agent
    participant DG as Diagnosis Agent
    participant RM as Remediation Agent
    participant CM as Comms Agent
    participant GKE as Real GKE Cluster

    AM->>CO: POST /webhook (alert fired)
    CO->>TR: {incident_id, alert}
    TR->>GKE: kubectl_top_pods()
    TR->>GKE: promql_query("rate(5xx[1m])")
    TR-->>CO: {severity: P1, blast_radius: [...]}

    CO->>DG: {triage_output}
    DG->>GKE: jaeger_search(service, min_duration=500ms)
    DG->>GKE: promql_query_range(query, last_15m)
    DG->>GKE: kubectl_logs(bottleneck_pod)
    DG->>GKE: argocd_app_history(app)
    DG-->>CO: {root_cause: {...}, recommended_actions: [...]}

    CO->>RM: {triage, diagnosis}
    RM->>GKE: argocd_rollback(app, revision)
    RM->>GKE: promql_query("rate(5xx[1m])")
    Note over RM: verify error_rate < 1%
    RM-->>CO: {outcome: resolved, ttr: 187s}

    CO->>CM: {full incident chain}
    CM->>CM: slack_post_update(...)
    CM->>CM: postmortem_draft(incident)
    CM-->>CO: {postmortem_path: docs/postmortems/...}
```
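
The hand-off pattern in the sequence diagram can be sketched as a plain sequential pipeline in which each agent receives the alert plus every earlier agent's output. This is only an illustration under assumptions: `Incident`, `run_chain`, `STAGES`, and the stub agent bodies are hypothetical names, and the stubs return canned values where the real agents would call the LLM and the SRE tools.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List, Tuple

# Hypothetical type: an agent maps the accumulated context to its own output.
Agent = Callable[[Dict[str, Any]], Dict[str, Any]]


@dataclass
class Incident:
    incident_id: str
    alert: Dict[str, Any]
    # Outputs keyed by stage name, filled in as the chain progresses.
    outputs: Dict[str, Dict[str, Any]] = field(default_factory=dict)


def run_chain(incident: Incident, stages: List[Tuple[str, Agent]]) -> Incident:
    """Run the stages in order; each sees the alert and all earlier outputs."""
    for name, agent in stages:
        context = {
            "incident_id": incident.incident_id,
            "alert": incident.alert,
            **incident.outputs,
        }
        incident.outputs[name] = agent(context)
    return incident


# Stub agents with canned values (placeholders, not real agent behaviour).
def triage(ctx: Dict[str, Any]) -> Dict[str, Any]:
    return {"severity": "P1", "blast_radius": ["checkoutservice"]}


def diagnosis(ctx: Dict[str, Any]) -> Dict[str, Any]:
    suspect = ctx["triage"]["blast_radius"][0]
    return {"root_cause": {"service": suspect}, "recommended_actions": ["rollback"]}


def remediation(ctx: Dict[str, Any]) -> Dict[str, Any]:
    action = ctx["diagnosis"]["recommended_actions"][0]
    return {"outcome": "resolved", "action": action}


def comms(ctx: Dict[str, Any]) -> Dict[str, Any]:
    severity = ctx["triage"]["severity"]
    outcome = ctx["remediation"]["outcome"]
    return {"summary": f"{ctx['incident_id']}: {severity} {outcome}"}


STAGES: List[Tuple[str, Agent]] = [
    ("triage", triage),
    ("diagnosis", diagnosis),
    ("remediation", remediation),
    ("comms", comms),
]

if __name__ == "__main__":
    done = run_chain(Incident("inc-001", {"alertname": "High5xxRate"}), STAGES)
    print(done.outputs["comms"]["summary"])
```

Passing the whole accumulated context forward mirrors the diagram, where Remediation receives both triage and diagnosis and Comms receives the full incident chain.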

---

## AMD MI300X Co-hosting

```
AMD MI300X (192 GB HBM3)
┌─────────────────────────────────────────────────────────┐
│                                                         │
│  Qwen2.5-7B base (4-bit NF4)           ~4 GB            │
│  ├─ triage_adapter    (LoRA r=16)      ~40 MB           │
│  ├─ diagnosis_adapter (LoRA r=16)      ~40 MB           │
│  ├─ remediation_adapter (LoRA r=16)    ~40 MB           │
│  └─ comms_adapter     (LoRA r=16)      ~40 MB           │
│                                                         │
│  Qwen2.5-72B (4-bit NF4)               ~37 GB           │
│  └─ Adversarial designer + judge                        │
│                                                         │
│  Total used: ~41 GB    Available: ~151 GB               │
│                                                         │
│  ❌ A100 (80 GB): can fit judge OR agents, not both     │
│  ❌ T4   (16 GB): can't fit even the 7B base            │
│  ✅ MI300X: all 5 models + room to spare                │
└─────────────────────────────────────────────────────────┘
```
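
The totals in the box follow from simple addition over the listed weight sizes. The sketch below reproduces that arithmetic; the constant names are illustrative, the MB-to-GB conversion assumes decimal units, and only the weight figures from the diagram are counted (KV cache and activation memory are not modelled here).

```python
# Figures taken from the diagram above; names are illustrative.
HBM_GB = 192.0        # MI300X HBM3 capacity
BASE_7B_GB = 4.0      # Qwen2.5-7B base, 4-bit NF4
ADAPTER_MB = 40.0     # one LoRA r=16 adapter
NUM_ADAPTERS = 4      # triage, diagnosis, remediation, comms
JUDGE_72B_GB = 37.0   # Qwen2.5-72B, 4-bit NF4


def weights_footprint_gb() -> float:
    """Sum the weight sizes listed in the diagram, in GB."""
    adapters_gb = NUM_ADAPTERS * ADAPTER_MB / 1000.0
    return BASE_7B_GB + adapters_gb + JUDGE_72B_GB


used_gb = weights_footprint_gb()
free_gb = HBM_GB - used_gb

if __name__ == "__main__":
    print(f"used ~{used_gb:.1f} GB, free ~{free_gb:.1f} GB")
```

The four adapters add only about 0.16 GB in total, so the footprint is dominated by the two base models and rounds to the ~41 GB used and ~151 GB available shown in the box.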

---

## Training Pipeline

```mermaid
flowchart LR
    A["Real GKE Cluster<br/>11 live microservices"] -->|Chaos Mesh injects faults| B
    B["Alertmanager fires<br/>real webhook"] --> C
    C["4 agents run<br/>against live cluster"] --> D
    D["72B judge scores<br/>tool calls + outcomes"] --> E
    E["5k trajectory corpus<br/>data/sft_corpus.jsonl<br/>reward_contract included"]

    E --> F["Stage 1: QLoRA SFT<br/>Qwen2.5-7B + LoRA r=16<br/>4-bit NF4 on MI300X"]
    F --> G["Stage 2: GRPO<br/>Optuna HP search (6 trials)<br/>num_gen=8, cosine LR<br/>tier-aware reward contract"]
    G --> H["4 role adapters<br/>~40 MB each<br/>checkpoints/grpo_v3/"]

    H -->|"hot-swap adapters<br/>per agent call"| I["Production serving<br/>vLLM on MI300X"]
```

---

## 20 Real SRE Tools

```
kubectl (7)          promql (2)          jaeger (2)
─────────────        ──────────          ──────────
kubectl_get          promql_query        jaeger_search
kubectl_describe     promql_query_range  jaeger_get_trace
kubectl_logs
kubectl_top_pods     argocd (3)          gcloud (2)
kubectl_top_nodes    ─────────           ──────────
kubectl_rollout      argocd_list_apps    gcloud_logs_read
kubectl_scale        argocd_app_history  cloud_monitoring_query
kubectl_exec         argocd_rollback

alertmanager (2)     comms (2)
────────────────     ──────────
alertmanager_silence slack_post_update
alertmanager_list    postmortem_draft
```

AtlasOps exposes 20 production-relevant SRE tools across Kubernetes, tracing, metrics, GitOps, and comms workflows.

---

## Reward Contract (Anti-Gaming)

```
reward = 0.35 × resolve                    # did the incident get fixed?
       + 0.20 × evidence                   # was the root cause proven with data?
       + 0.20 × safety                     # was the action minimum-blast-radius?
       + 0.15 × speed                      # logistic decay: fast is good, race-to-zero is penalised
       + 0.10 × comms                      # was a postmortem generated?

       - 0.10  if turns > 40              # command spam
       - 0.25  if claimed resolved but wasn't  # false resolution
       - 0.20  if efficiency < 0.3        # unsafe shortcut
       - 0.20  if reasoning AND correctness both low  # hallucinated evidence
       - 0.10  if silenced an alert without resolving # over-silencing

tier adjustments:
  cascade:      r_evidence weight ↑ (tracing the chain matters more)
  multi_fault:  r_safety weight ↑   (conservative action matters more)
  adversarial:  all penalties × 1.25 (harder tier, stricter scoring)
```
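
A minimal sketch of how the contract above could be computed, assuming each component score has already been normalised to [0, 1]. The `Episode` fields, the boolean penalty flags, and the `reward` signature are hypothetical. The `cascade` and `multi_fault` weight shifts are left out because their magnitudes are not stated above, and the logistic speed decay is likewise taken as a precomputed input.

```python
from dataclasses import dataclass

# Component weights from the contract above.
WEIGHTS = {
    "resolve": 0.35,
    "evidence": 0.20,
    "safety": 0.20,
    "speed": 0.15,
    "comms": 0.10,
}


@dataclass
class Episode:
    # Component scores, assumed to be in [0, 1].
    resolve: float
    evidence: float
    safety: float
    speed: float
    comms: float
    # Signals used by the penalty rules (hypothetical field names).
    turns: int = 0
    efficiency: float = 1.0
    false_resolution: bool = False
    hallucinated_evidence: bool = False
    silenced_without_resolving: bool = False


def penalties(ep: Episode) -> float:
    """Sum of the anti-gaming penalties that apply to this episode."""
    total = 0.0
    if ep.turns > 40:                  # command spam
        total += 0.10
    if ep.false_resolution:            # claimed resolved but wasn't
        total += 0.25
    if ep.efficiency < 0.3:            # unsafe shortcut
        total += 0.20
    if ep.hallucinated_evidence:       # reasoning and correctness both low
        total += 0.20
    if ep.silenced_without_resolving:  # over-silencing
        total += 0.10
    return total


def reward(ep: Episode, tier: str = "standard") -> float:
    """Weighted component sum minus penalties; adversarial tier scales penalties."""
    base = sum(w * getattr(ep, name) for name, w in WEIGHTS.items())
    scale = 1.25 if tier == "adversarial" else 1.0
    return base - scale * penalties(ep)
```

Since the five weights sum to 1.0, an episode with perfect component scores and no penalties scores 1.0, and each penalty subtracts directly from that ceiling.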