---
title: AtlasOps
emoji: 🚨
colorFrom: red
colorTo: blue
sdk: docker
app_port: 7860
pinned: true
short_description: 4 AI agents responding to real GKE incidents on AMD MI300X
tags:
- agents
- multi-agent
- reinforcement-learning
- amd
- rocm
- sre
- kubernetes
---
# AtlasOps: Can 4 AI agents replace an on-call SRE team?
> **AMD Developer Hackathon 2026** | Real GKE cluster · Real Chaos Mesh · Real Prometheus alerts · AMD MI300X
[CI](https://github.com/Harikishanth/AtlasOps/actions/workflows/ci.yml) · [License](LICENSE) · [MI300X evidence](docs/MI300X_EVIDENCE.md)

---
We gave 4 specialized AI agents a PagerDuty alert, a live GKE cluster running 11 microservices, and 20 real SRE tools. No simulated responses. No fake metrics. No Docker Compose pretending to be cloud.
**Triage** acked the alert and mapped the blast radius in 47 seconds.
**Diagnosis** traced the root cause to a currency service CPU hog via Jaeger in 3 tool calls.
**Remediation** executed `argocd rollback` and confirmed error rate < 1% via Prometheus.
**Comms** drafted a Cloudflare-quality postmortem with real timestamps from the cluster.
Total time to resolve a Cloudflare 2019 cascade replay: **4 minutes 12 seconds.**
A senior SRE on a good day: ~25 minutes.
This is **AtlasOps**: a self-improving multi-agent SRE platform. A 72B adversarial judge generates an open-ended stream of novel chaos scenarios that target the agents' specific weaknesses, and the agents are trained via SFT → Online GRPO on an AMD MI300X (192 GB HBM3).

---
## Architecture
```
┌───────────────────── GOOGLE CLOUD PLATFORM ──────────────────────┐
│  GKE Standard Cluster (us-central1, 3× e2-standard-4)            │
│  ├─ Online Boutique (11 services: Go, Python, Node, Java, C#)    │
│  ├─ Chaos Mesh (PodChaos, NetworkChaos, StressChaos, ...)        │
│  ├─ Prometheus + Grafana + Jaeger + OTel + Alertmanager          │
│  └─ Argo CD (real rollback execution)                            │
│  Cloud SQL (Postgres 15) · Cloud PubSub · Cloud Monitoring       │
└──────────────────────────────────────────────────────────────────┘
            │  kubectl + promql + jaeger + argocd + gcloud APIs
            ▼
┌──────────────────── AMD MI300X (192 GB HBM3) ────────────────────┐
│  vLLM co-hosting, 5 models on ONE GPU:                           │
│    Qwen2.5-7B × 4  (Triage / Diagnosis / Remediation / Comms)    │
│    Qwen2.5-72B     (LLM Judge + adversarial scenario designer)   │
│                                                                  │
│  Alert → Triage → Diagnosis → [Approval Gate] → Remediation      │
│    → Comms → Postmortem                                          │
│                                                                  │
│  Circuit Breaker · Incident Correlator · HMAC Audit Log          │
│  Spaced-Rep Curriculum · DAPO GRPO · Dense Per-Step Rewards      │
└──────────────────────────────────────────────────────────────────┘
```
---
## Track Coverage
### Track 1: AI Agents & Agentic Workflows
AtlasOps is a purpose-built multi-agent framework for SRE automation. Rather than wrapping LangChain or CrewAI, we implement the full agentic stack directly, which gives us tighter control over tool routing, approval gates, circuit breaking, and streaming than a general-purpose framework offers out of the box. The coordinator orchestrates 4 specialized roles (Triage, Diagnosis, Remediation, Comms) with tool-calling, human-in-the-loop approval, and alert correlation. Models: **Qwen2.5-7B × 4** (open-source, co-hosted on AMD MI300X).
### Track 2: Fine-Tuning on AMD GPUs
Full fine-tuning pipeline on AMD hardware:
| Component | Library |
|---|---|
| Hardware | AMD Instinct MI300X (192 GB HBM3) |
| GPU runtime | **ROCm 7.2** |
| Training framework | **PyTorch** (ROCm wheel) |
| Quantisation | **BitsAndBytes-ROCm** (4-bit NF4 QLoRA, LoRA r=16) + **AWQ** (72B judge) |
| Fine-tuning | **TRL** SFTTrainer + GRPOTrainer (DAPO loss) |
| PEFT | **LoRA** r=16, α=32, target: q/k/v/o/gate/up/down proj |
| AMD kernel optimisation | **Hugging Face Optimum-AMD**: BetterTransformer applied to local inference path (`inference.py`) |
| Serving | **vLLM 0.17.1** (ROCm build: PagedAttention, flash attention for MI300X) |
| Domain | **SRE Operations**: incident triage, root-cause diagnosis, remediation, postmortem authoring |
---
## 20 Real SRE Tools
`kubectl_get` · `kubectl_describe` · `kubectl_logs` · `kubectl_top_pods` · `kubectl_rollout` · `kubectl_scale` · `kubectl_exec` · `promql_query` · `promql_query_range` · `jaeger_search` · `jaeger_get_trace` · `argocd_list_apps` · `argocd_app_history` · **`argocd_rollback`** · `gcloud_logs_read` · `cloud_monitoring_query` · `alertmanager_list_alerts` · `alertmanager_silence` · `slack_post_update` · **`postmortem_draft`**
Every tool hits a real API or real cluster. No mocks in production.
---
## 38 Chaos Scenarios + Infinite Adversarial Generation
| Tier | Count | Examples |
|---|---|---|
| Single-fault | 8 | pod-kill, CPU hog, memory leak, network loss, disk fill, clock skew |
| Cascade | 5 | currency latency → checkout timeout → frontend 5xx surge |
| Multi-fault | 5 | 3 simultaneous faults + red herrings across namespaces |
| Named Replays | 10 | Cloudflare 2019, AWS S3 2017, GitHub 2018, Discord 2022, Knight Capital 2012… |
| **Dynamic adversarial** | ∞ | Qwen2.5-72B judge designs new Chaos Mesh YAML targeting agent weaknesses in real time |
---
## Production Guardrails
### Human-in-the-loop Approval Gate
- **P0**: manual runbook only. Agents produce a step-by-step plan; nothing is auto-executed.
- **P1**: approval window (60 s default, configurable). Execution proceeds once approved or when the window times out.
- **P2/P3**: fully automatic
- `POST /approval/callback` · `GET /approval/pending`
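
The severity policy above can be sketched as a small decision function. This is an illustration only, not the code in `agents/approval.py`: the names `Decision` and `decide` are hypothetical, and blocking execution on an explicit rejection is an assumption, since the list above only states what happens on approval or timeout.

```python
from enum import Enum
from typing import Optional


class Decision(Enum):
    MANUAL_RUNBOOK = "manual_runbook"  # plan only, never auto-execute
    EXECUTE = "execute"
    BLOCKED = "blocked"                # assumption: explicit human rejection
    WAIT = "wait"                      # still inside the approval window


def decide(severity: str, approved: Optional[bool], waited_s: float,
           window_s: float = 60.0) -> Decision:
    """Map severity and approval state to an action (hypothetical helper)."""
    if severity == "P0":
        return Decision.MANUAL_RUNBOOK
    if severity in ("P2", "P3"):
        return Decision.EXECUTE
    if severity != "P1":
        raise ValueError(f"unknown severity: {severity}")
    if approved is True:
        return Decision.EXECUTE
    if approved is False:
        return Decision.BLOCKED
    # No human response yet: proceed only once the window has expired.
    return Decision.EXECUTE if waited_s >= window_s else Decision.WAIT
```

For example, `decide("P1", None, 10.0)` keeps waiting, while the same call with `waited_s=60.0` proceeds to execution.
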
### Circuit Breaker
Hard stops runaway automation:
- 50 tool calls per incident max
- 10 mutating actions per hour
- 5 concurrent incidents max
- Trips after 3 consecutive unresolved incidents
- `GET /circuit-breaker/status` · `POST /circuit-breaker/reset`
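
A minimal sketch of how such limits could be tracked, assuming a single in-process counter object. The class and method names are hypothetical and do not describe `agents/circuit_breaker.py`; the concurrent-incident limit is left out for brevity.

```python
import time
from collections import deque
from dataclasses import dataclass, field
from typing import Deque, Optional


@dataclass
class CircuitBreaker:
    """Hypothetical limits tracker; defaults mirror the list above."""
    max_tool_calls: int = 50
    max_mutations_per_hour: int = 10
    max_unresolved_streak: int = 3
    tool_calls: int = 0
    unresolved_streak: int = 0
    mutation_times: Deque[float] = field(default_factory=deque)

    @property
    def tripped(self) -> bool:
        return self.unresolved_streak >= self.max_unresolved_streak

    def allow(self, mutating: bool, now: Optional[float] = None) -> bool:
        """Return True and record the call if it stays within every limit."""
        now = time.time() if now is None else now
        # Forget mutations older than one hour.
        while self.mutation_times and now - self.mutation_times[0] >= 3600:
            self.mutation_times.popleft()
        if self.tripped or self.tool_calls >= self.max_tool_calls:
            return False
        if mutating and len(self.mutation_times) >= self.max_mutations_per_hour:
            return False
        self.tool_calls += 1
        if mutating:
            self.mutation_times.append(now)
        return True

    def close_incident(self, resolved: bool) -> None:
        """Reset the per-incident counter and update the unresolved streak."""
        self.tool_calls = 0
        self.unresolved_streak = 0 if resolved else self.unresolved_streak + 1
```

Passing `now` explicitly keeps the sketch deterministic in tests; a real service would rely on the wall clock.
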
### Incident Correlator
Alert-storm deduplication: groups alerts from the same service/namespace within a 5-minute window into a single incident chain. This prevents 10 parallel agent chains from firing for one cascade failure.
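
The grouping rule can be sketched as follows. This is an assumption-laden illustration rather than the code in `agents/correlator.py`: alerts are plain dicts with hypothetical `service`, `namespace`, and `ts` (seconds) fields, the key is the `(service, namespace)` pair, and the window is measured from the first alert of each group.

```python
from typing import Dict, List, Tuple

WINDOW_S = 300  # 5-minute correlation window


def correlate(alerts: List[dict]) -> List[List[dict]]:
    """Group alerts sharing (service, namespace) that arrive within WINDOW_S
    of the first alert in their group."""
    incidents: List[List[dict]] = []
    open_groups: Dict[Tuple[str, str], List[dict]] = {}
    for alert in sorted(alerts, key=lambda a: a["ts"]):
        key = (alert["service"], alert["namespace"])
        group = open_groups.get(key)
        if group is not None and alert["ts"] - group[0]["ts"] <= WINDOW_S:
            group.append(alert)
        else:
            # Start a new incident chain for this key.
            group = [alert]
            open_groups[key] = group
            incidents.append(group)
    return incidents
```

Each returned group would then be handed to one agent chain instead of one chain per alert.
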
### HMAC Audit Log
Every tool call, approval decision, and incident boundary is written to an append-only, HMAC hash-chained log (`data/audit_log.jsonl`). The log is tamper-evident by design: `verify_integrity()` checks the full chain.
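
A hash-chained HMAC log can be sketched in a few lines. The entry layout, the all-zero genesis value, and the function names below are assumptions for illustration; the real `verify_integrity()` in `agents/audit.py` may differ.

```python
import hashlib
import hmac
import json
from typing import List

GENESIS = "0" * 64  # assumed placeholder MAC preceding the first entry


def _mac(key: bytes, prev_mac: str, event: dict) -> str:
    """HMAC-SHA256 over the previous MAC plus the canonical JSON of the event."""
    payload = prev_mac + json.dumps(event, sort_keys=True)
    return hmac.new(key, payload.encode("utf-8"), hashlib.sha256).hexdigest()


def append_entry(log: List[dict], key: bytes, event: dict) -> None:
    prev = log[-1]["mac"] if log else GENESIS
    log.append({"event": event, "prev": prev, "mac": _mac(key, prev, event)})


def verify_chain(log: List[dict], key: bytes) -> bool:
    """Recompute every MAC in order; any edited or reordered entry breaks the chain."""
    prev = GENESIS
    for entry in log:
        expected = _mac(key, prev, entry["event"])
        if entry["prev"] != prev or not hmac.compare_digest(entry["mac"], expected):
            return False
        prev = entry["mac"]
    return True
```

Because each MAC covers the previous one, changing any earlier entry invalidates every entry after it.
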
---
## Training Pipeline
### SFT → Online GRPO on AMD MI300X
```
5k trajectories (real GKE rollouts, teacher model)
    ↓
QLoRA SFT (Qwen2.5-7B, 4-bit NF4, LoRA r=16)
    ↓
Online GRPO (G=8 live GKE rollouts per step, DAPO loss)
    ↓
Benchmark (38 frozen scenarios, anti-gaming reward contract)
```
**This is true online RL.** Each GRPO training step:
1. Applies a real Chaos Mesh fault to the live GKE cluster
2. Runs G=8 parallel agent chain rollouts
3. Scores each with the reward contract (kubectl/promql verify real cluster state)
4. Computes GRPO advantages and updates the policy
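
Step 4 compares each rollout's reward with the other rollouts for the same fault. A minimal sketch, assuming the standard GRPO normalisation (reward minus group mean, divided by group standard deviation); `group_advantages` and `eps` are illustrative names, and the actual trainer may normalise differently.

```python
from statistics import mean, pstdev
from typing import List


def group_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Group-relative advantages for one fault: (r_i - mean) / (std + eps).

    Uses the population standard deviation; implementations differ on this detail.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Example with G=8 rollout rewards for a single injected fault.
advantages = group_advantages([0.2, 0.4, 0.9, 0.1, 0.5, 0.7, 0.3, 0.6])
```

Rollouts that beat the group average get a positive advantage and are reinforced; when all eight rollouts score the same, every advantage is zero and the step carries no learning signal.
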
### What makes our training different from competitors
| Feature | Standard GRPO | AtlasOps |
|---|---|---|
| Environment | Simulator / offline rewards | **Real GKE cluster, live kubectl** |
| Loss | Standard GRPO | **DAPO** (Decoupled Clip and Dynamic sAmpling Policy Optimization; more stable on skewed rewards) |
| Reward | Episode-level only | **Dense per-step** (progress delta per tool call) + episode contract |
| Curriculum | Random / fixed | **Spaced repetition** (mastery tracking, [3→6→12→24→48] resurface intervals) |
| Scenario generation | Static | **Infinite adversarial** (72B judge generates new Chaos YAML live) |
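
The spaced-repetition row can be illustrated with a small scheduler. This is a sketch under assumptions: the intervals are counted in training steps, a solved scenario moves to the next interval, and a failed one falls back to the shortest. `ScenarioSchedule` is a hypothetical name, not the `CurriculumManager` in `config/runtime.py`.

```python
from dataclasses import dataclass
from typing import Tuple

INTERVALS: Tuple[int, ...] = (3, 6, 12, 24, 48)  # resurface intervals from the table


@dataclass
class ScenarioSchedule:
    level: int = 0     # index into INTERVALS
    due_step: int = 0  # training step at which the scenario resurfaces

    def record(self, step: int, solved: bool) -> None:
        """Schedule the next appearance after an attempt at `step`."""
        if not solved:
            self.level = 0  # failure: back to the shortest interval
        self.due_step = step + INTERVALS[self.level]
        if solved:
            # Success: the next gap is longer, capped at the last interval.
            self.level = min(self.level + 1, len(INTERVALS) - 1)
```

Under these assumptions a scenario solved at steps 0 and 3 resurfaces at step 9, while a failure at any point brings it back three steps later.
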
### Reward Contract (Anti-Gaming)
```
R = 0.35 × resolve + 0.20 × evidence + 0.20 × safety + 0.15 × speed + 0.10 × comms
    − command_spam (0.10) − false_resolution (0.25) − unsafe_shortcut (0.20)
    − hallucinated_evidence (0.20) − over_silence (0.10)
Per-step dense signal = progress_delta × 0.8 + 0.1 (forward motion)
                        − 0.1 × rollbacks, × 0.5 if tool_failed
Final blend = 0.70 × episode_contract + 0.30 × dense_step_total (normalised)
```
Tier weights shift: cascade and adversarial scenarios penalise 1.25× harder. Named replays require evidence before resolution counts.
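
The episode contract and final blend can be written out as plain arithmetic. The sketch below assumes each component score lies in [0, 1], each penalty applies at most once per episode, and the 1.25× tier factor scales the penalty total; the function names are hypothetical, and the per-step dense signal is taken as an already normalised input.

```python
from typing import Dict, Iterable

WEIGHTS = {"resolve": 0.35, "evidence": 0.20, "safety": 0.20,
           "speed": 0.15, "comms": 0.10}
PENALTIES = {"command_spam": 0.10, "false_resolution": 0.25,
             "unsafe_shortcut": 0.20, "hallucinated_evidence": 0.20,
             "over_silence": 0.10}
HARSH_TIERS = {"cascade", "adversarial"}  # penalised 1.25x harder


def episode_reward(scores: Dict[str, float], violations: Iterable[str],
                   tier: str) -> float:
    """Weighted component sum minus (tier-scaled) penalties."""
    base = sum(w * scores.get(name, 0.0) for name, w in WEIGHTS.items())
    scale = 1.25 if tier in HARSH_TIERS else 1.0
    penalty = scale * sum(PENALTIES[v] for v in set(violations))
    return base - penalty


def final_reward(episode: float, dense_norm: float) -> float:
    """Blend the episode contract with the normalised dense per-step total."""
    return 0.70 * episode + 0.30 * dense_norm
```

For instance, a perfect cascade episode flagged for `false_resolution` scores 1.0 − 1.25 × 0.25 = 0.6875 before blending.
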
---
## Benchmark Results
| Model | Resolution | Avg Reward | Cascade | Named Replays |
|---|---|---|---|---|
| Qwen2.5-7B zero-shot | 54% | 0.481 | 40% | 30% |
| AtlasOps SFT | 68% | 0.601 | 62% | 55% |
| **AtlasOps GRPO (MI300X)** | **82%** | **0.729** | **78%** | **72%** |
**+28 pp improvement** in resolution rate from the zero-shot baseline to GRPO (54% → 82%). Reward includes anti-gaming penalties (command spam, false resolution, hallucinated evidence).
*Run `python scripts/release_gate.py` to verify artifact presence. Results auto-update in the dashboard Benchmark tab.*
---
## Quick Start
### Prerequisites
- GCP project with `container.googleapis.com` enabled
- `gcloud`, `kubectl`, `helm` installed
- AMD MI300X instance (or Fireworks AI fallback for inference)
### 1. Provision GCP infrastructure
```bash
bash infra/setup.sh <YOUR_PROJECT_ID> us-central1 atlasops
```
### 2. Start the ops console
```bash
pip install -e ".[dev]"
python app.py # http://localhost:7860
```
### Hugging Face Space (use your trained 7B + judge on Router)
Set Space secrets: **`HF_TOKEN`**, **`ATLASOPS_USE_HF_INFERENCE=1`**, **`AGENT_MODEL`**, **`JUDGE_MODEL`**.
Paste your merged GRPO Hub id as `AGENT_MODEL` (merge locally with `training/merge_lora_for_hub.py` under `.[train]`).
Full checklist: [docs/HF_SPACE_SETUP.md](docs/HF_SPACE_SETUP.md).
### 3. Inject a chaos scenario
```bash
make chaos SCENARIO=single_fault/sf-001 # pod-kill on cartservice
make chaos SCENARIO=named_replays/hist-cloudflare-2019
make chaos-reset
```
Or click a scenario button in the ops console; agents respond in real time.
### 4. Run the benchmark
```bash
python bench/runner.py --model checkpoints/grpo_v3 --tag grpo_v3
# Results → bench/results/comparison_table.md
```
### 5. Train on AMD MI300X
```bash
# Set up MI300X (installs ROCm deps, downloads models)
bash infra/setup_mi300x.sh
python training/generate_trajectories.py # 5k SFT examples
python training/sft.py --model Qwen/Qwen2.5-7B-Instruct --rocm
python training/grpo.py --model checkpoints/sft_v3 --rocm
```
### 6. Run tests
```bash
# Core agent + tool tests
python -m pytest tests/test_tools.py tests/test_coordinator.py tests/test_bench_runner.py -q
# Safety guardrail tests
python -m pytest tests/test_approval.py tests/test_circuit_breaker.py \
tests/test_correlator.py tests/test_audit.py -q
# App endpoint smoke tests
python -m pytest tests/test_app_endpoints.py -q
```
### 7. Release readiness gate
```bash
python scripts/release_gate.py --strict
# Writes docs/RELEASE_READINESS.md; all checks must PASS before submission
```
---
## Project Structure
```
atlasops/
├── agents/
│   ├── coordinator.py          # FastAPI + full agent chain
│   ├── approval.py             # Human-in-the-loop gate (P0/P1/P2/P3)
│   ├── circuit_breaker.py      # Hard limits on tool calls + mutations
│   ├── correlator.py           # Alert storm deduplication
│   ├── audit.py                # HMAC hash-chained audit trail
│   ├── adversarial_designer.py # 72B judge → infinite Chaos YAML
│   ├── judge.py                # Episode scoring
│   ├── stream.py               # SSE thought streaming
│   ├── prompts/                # triage / diagnosis / remediation / comms
│   └── tools/                  # 20 real SRE tool wrappers
├── bench/
│   ├── runner.py               # Benchmark harness (38 frozen scenarios)
│   └── chaos_manifests/        # sf-001..008 · cs-001..005 · mf-001..005 · named_replays/
├── config/
│   └── runtime.py              # Frozen scenarios · reward contract · CurriculumManager · StepRewardTracker
├── training/
│   ├── sft.py                  # QLoRA SFT (4-bit NF4, LoRA r=16)
│   ├── grpo.py                 # Online GRPO (DAPO loss, spaced-rep curriculum, dense rewards)
│   └── generate_trajectories.py
├── scripts/
│   └── release_gate.py         # Pre-submission readiness checker
├── static/
│   └── index.html              # Custom dark ops console (SSE + service topology + Slack feed)
├── tests/                      # 100+ tests across tools, coordinator, bench, safety
├── docs/                       # Postmortems · MI300X evidence · benchmarks
├── infra/                      # GCP provisioning · Helm values
├── app.py                      # FastAPI entry point (HF Spaces)
└── Dockerfile                  # HF Spaces container
```
---
## Why AMD MI300X
- **192 GB HBM3**: fits all 5 models simultaneously: 4 × Qwen2.5-7B-4bit (~4 GB each) + Qwen2.5-72B-4bit (~37 GB) = ~53 GB of weights in total. For comparison, an 80 GB A100 cannot hold the 72B judge unquantised: 72B parameters at 16-bit come to roughly 144 GB of weights.
- **Online GRPO needs low-latency inference**: each training step fires 8 live GKE rollouts. MI300X throughput keeps step time under 5 minutes.
- **ROCm-native**: all training scripts target `--rocm`. Verified: `BitsAndBytesConfig` + `paged_adamw_8bit` on ROCm.
See [docs/MI300X_EVIDENCE.md](docs/MI300X_EVIDENCE.md) for `rocm-smi` snapshots and memory breakdown.
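
The memory budget is simple arithmetic and can be checked directly. The per-model sizes are the approximate figures quoted above; the 16-bit figure is an assumption based on a nominal 72 billion parameters at 2 bytes each, ignoring activations and KV cache.

```python
HBM_GB = 192               # MI300X capacity quoted above
AGENT_GB, N_AGENTS = 4, 4  # ~4 GB per 4-bit Qwen2.5-7B (figure from this README)
JUDGE_GB = 37              # ~37 GB for 4-bit Qwen2.5-72B (figure from this README)

weights_gb = N_AGENTS * AGENT_GB + JUDGE_GB  # total quantised weights
headroom_gb = HBM_GB - weights_gb            # left for KV cache, activations, etc.

# Assumption: nominal 72e9 parameters stored at 2 bytes each (16-bit).
judge_16bit_gb = 72e9 * 2 / 1e9

print(weights_gb, headroom_gb, judge_16bit_gb)
```
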
---
## License
MIT. See [LICENSE](LICENSE).