---
title: AtlasOps
emoji: 🚨
colorFrom: red
colorTo: blue
sdk: docker
app_port: 7860
pinned: true
short_description: 4 AI agents responding to real GKE incidents on AMD MI300X
tags:
- agents
- multi-agent
- reinforcement-learning
- amd
- rocm
- sre
- kubernetes
---
# AtlasOps: Can 4 AI agents replace an on-call SRE team?
> **AMD Developer Hackathon 2026** | Real GKE cluster · Real Chaos Mesh · Real Prometheus alerts · AMD MI300X
[![CI](https://github.com/Harikishanth/AtlasOps/actions/workflows/ci.yml/badge.svg)](https://github.com/Harikishanth/AtlasOps/actions/workflows/ci.yml)
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](LICENSE)
[![AMD MI300X](https://img.shields.io/badge/GPU-AMD%20MI300X%20192GB-red)](docs/MI300X_EVIDENCE.md)
---
We gave 4 specialized AI agents a PagerDuty alert, a live GKE cluster running 11 microservices, and 20 real SRE tools. No simulated responses. No fake metrics. No Docker Compose pretending to be cloud.
**Triage** acked the alert and mapped the blast radius in 47 seconds.
**Diagnosis** traced the root cause to a currency service CPU hog via Jaeger in 3 tool calls.
**Remediation** executed `argocd rollback` and confirmed error rate < 1% via Prometheus.
**Comms** drafted a Cloudflare-quality postmortem with real timestamps from the cluster.
Total time to resolve a Cloudflare 2019 cascade replay: **4 minutes 12 seconds.**
A senior SRE on a good day: ~25 minutes.
This is **AtlasOps**, a self-improving multi-agent SRE platform where a 72B adversarial judge generates infinite novel chaos scenarios targeting the agents' specific weaknesses, trained via SFT → Online GRPO on an AMD MI300X (192 GB HBM3).
---
## Architecture
```
┌──────────────────── GOOGLE CLOUD PLATFORM ───────────────────────┐
│  GKE Standard Cluster (us-central1, 3× e2-standard-4)            │
│  ├─ Online Boutique (11 services: Go, Python, Node, Java, C#)    │
│  ├─ Chaos Mesh (PodChaos, NetworkChaos, StressChaos, ...)        │
│  ├─ Prometheus + Grafana + Jaeger + OTel + Alertmanager          │
│  └─ Argo CD (real rollback execution)                            │
│  Cloud SQL (Postgres 15) · Cloud PubSub · Cloud Monitoring       │
└──────────────────────────────────────────────────────────────────┘
          │  kubectl + promql + jaeger + argocd + gcloud APIs
          ▼
┌────────────────── AMD MI300X (192 GB HBM3) ──────────────────────┐
│  vLLM co-hosting, 5 models on ONE GPU:                           │
│    Qwen2.5-7B × 4 (Triage / Diagnosis / Remediation / Comms)     │
│    Qwen2.5-72B (LLM Judge + adversarial scenario designer)       │
│                                                                  │
│  Alert → Triage → Diagnosis → [Approval Gate] → Remediation      │
│        → Comms → Postmortem                                      │
│                                                                  │
│  Circuit Breaker · Incident Correlator · HMAC Audit Log          │
│  Spaced-Rep Curriculum · DAPO GRPO · Dense Per-Step Rewards      │
└──────────────────────────────────────────────────────────────────┘
```
---
## Track Coverage
### Track 1: AI Agents & Agentic Workflows
AtlasOps is a purpose-built multi-agent framework for SRE automation. Rather than wrapping LangChain or CrewAI, we implement the full agentic stack directly, which gives us tighter control over tool routing, approval gates, circuit breaking, and streaming than a general-purpose framework offers out of the box. The coordinator orchestrates 4 specialized roles (Triage, Diagnosis, Remediation, Comms) with tool calling, human-in-the-loop approval, and alert correlation. Models: **Qwen2.5-7B × 4** (open-source, co-hosted on an AMD MI300X).
### Track 2: Fine-Tuning on AMD GPUs
Full fine-tuning pipeline on AMD hardware:
| Component | Library |
|---|---|
| Hardware | AMD Instinct MI300X (192 GB HBM3) |
| GPU runtime | **ROCm 7.2** |
| Training framework | **PyTorch** (ROCm wheel) |
| Quantisation | **BitsAndBytes-ROCm** (4-bit NF4 QLoRA, LoRA r=16) + **AWQ** (72B judge) |
| Fine-tuning | **TRL** SFTTrainer + GRPOTrainer (DAPO loss) |
| PEFT | **LoRA** r=16, α=32, target: q/k/v/o/gate/up/down proj |
| AMD kernel optimisation | **Hugging Face Optimum-AMD**: BetterTransformer applied to local inference path (`inference.py`) |
| Serving | **vLLM 0.17.1** (ROCm build: PagedAttention, flash attention for MI300X) |
| Domain | **SRE Operations**: incident triage, root-cause diagnosis, remediation, postmortem authoring |
---
## 20 Real SRE Tools
`kubectl_get` · `kubectl_describe` · `kubectl_logs` · `kubectl_top_pods` · `kubectl_rollout` · `kubectl_scale` · `kubectl_exec` · `promql_query` · `promql_query_range` · `jaeger_search` · `jaeger_get_trace` · `argocd_list_apps` · `argocd_app_history` · **`argocd_rollback`** · `gcloud_logs_read` · `cloud_monitoring_query` · `alertmanager_list_alerts` · `alertmanager_silence` · `slack_post_update` · **`postmortem_draft`**
Every tool hits a real API or real cluster. No mocks in production.
---
## 38 Chaos Scenarios + Infinite Adversarial Generation
| Tier | Count | Examples |
|---|---|---|
| Single-fault | 8 | pod-kill, CPU hog, memory leak, network loss, disk fill, clock skew |
| Cascade | 5 | currency latency → checkout timeout → frontend 5xx surge |
| Multi-fault | 5 | 3 simultaneous faults + red herrings across namespaces |
| Named Replays | 10 | Cloudflare 2019, AWS S3 2017, GitHub 2018, Discord 2022, Knight Capital 2012… |
| **Dynamic adversarial** | ∞ | Qwen2.5-72B judge designs new Chaos Mesh YAML targeting agent weaknesses in real time |
---
## Production Guardrails
### Human-in-the-loop Approval Gate
- **P0**: manual runbook only. Agents produce a step-by-step plan with no auto-execution.
- **P1**: approval window (60 s default, configurable). Execution proceeds if the action is approved or if the window times out.
- **P2/P3**: fully automatic
- `POST /approval/callback` · `GET /approval/pending`
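
The severity policy above can be sketched as a small decision function. This is a minimal illustration, not the code in `agents/approval.py`: the names `Decision` and `gate_decision` are hypothetical, and treating an explicit rejection as a block is an assumption the list above does not spell out.

```python
from enum import Enum
from typing import Optional


class Decision(Enum):
    MANUAL_RUNBOOK = "manual_runbook"  # plan only, a human executes
    EXECUTE = "execute"                # agents may run the remediation
    REJECTED = "rejected"              # assumption: explicit "no" blocks execution


def gate_decision(severity: str, approved: Optional[bool] = None) -> Decision:
    """Map an incident severity and an approval outcome to a gate decision.

    `approved` is True/False for an explicit human answer and None when the
    approval window expired without an answer.
    """
    if severity == "P0":
        # P0 never auto-executes, whatever the approval state is.
        return Decision.MANUAL_RUNBOOK
    if severity == "P1":
        if approved is False:
            return Decision.REJECTED
        # Approved, or the window timed out: execution proceeds.
        return Decision.EXECUTE
    if severity in ("P2", "P3"):
        return Decision.EXECUTE
    raise ValueError(f"unknown severity: {severity!r}")
```

Note that under this policy a P1 timeout is permissive: silence counts as consent, so the length of the window is the main safety knob.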
### Circuit Breaker
Hard stops runaway automation:
- 50 tool calls per incident max
- 10 mutating actions per hour
- 5 concurrent incidents max
- Trips after 3 consecutive unresolved incidents
- `GET /circuit-breaker/status` · `POST /circuit-breaker/reset`
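
A minimal sketch of how these four limits could be enforced together. The class and method names are hypothetical and do not mirror `agents/circuit_breaker.py`; the clock is passed in as `now` (seconds) so the hourly window is easy to test.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class CircuitBreaker:
    max_tool_calls: int = 50          # per incident
    max_mutations_per_hour: int = 10  # across all incidents
    max_concurrent: int = 5           # open incidents
    max_unresolved_streak: int = 3    # consecutive unresolved incidents
    tool_calls: dict = field(default_factory=dict)
    mutations: deque = field(default_factory=deque)
    open_incidents: set = field(default_factory=set)
    unresolved_streak: int = 0
    tripped: bool = False

    def open_incident(self, incident_id: str) -> bool:
        if self.tripped or len(self.open_incidents) >= self.max_concurrent:
            return False
        self.open_incidents.add(incident_id)
        self.tool_calls[incident_id] = 0
        return True

    def allow_tool_call(self, incident_id: str, mutating: bool, now: float) -> bool:
        if self.tripped:
            return False
        if self.tool_calls.get(incident_id, 0) >= self.max_tool_calls:
            return False
        if mutating:
            # Drop mutation timestamps older than one hour, then check the budget.
            while self.mutations and now - self.mutations[0] >= 3600:
                self.mutations.popleft()
            if len(self.mutations) >= self.max_mutations_per_hour:
                return False
            self.mutations.append(now)
        self.tool_calls[incident_id] = self.tool_calls.get(incident_id, 0) + 1
        return True

    def close_incident(self, incident_id: str, resolved: bool) -> None:
        self.open_incidents.discard(incident_id)
        self.unresolved_streak = 0 if resolved else self.unresolved_streak + 1
        if self.unresolved_streak >= self.max_unresolved_streak:
            self.tripped = True

    def reset(self) -> None:
        self.unresolved_streak = 0
        self.tripped = False
```

A tripped breaker refuses new incidents and tool calls until `reset()` is called, which is the role `POST /circuit-breaker/reset` plays in the API above.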
### Incident Correlator
Alert-storm deduplication: alerts from the same service/namespace that arrive within a 5-minute window are grouped into a single incident chain. This prevents 10 parallel agent chains from firing for one cascade failure.
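
The grouping rule can be illustrated with a short sketch. It assumes each alert is a dict with `service`, `namespace`, and a `ts` timestamp in seconds, and that the 5-minute window is anchored at the first alert of an incident; the actual `agents/correlator.py` may key and anchor differently.

```python
from dataclasses import dataclass, field

WINDOW_SECONDS = 300  # 5-minute correlation window


@dataclass
class Incident:
    key: tuple          # (service, namespace)
    first_seen: float   # timestamp of the first alert in the group
    alerts: list = field(default_factory=list)


def correlate(alerts):
    """Group alerts into incidents by (service, namespace) and time window."""
    incidents = []
    latest = {}  # key -> most recent Incident for that key
    for alert in sorted(alerts, key=lambda a: a["ts"]):
        key = (alert["service"], alert["namespace"])
        incident = latest.get(key)
        # Start a new incident if none exists or the window has elapsed.
        if incident is None or alert["ts"] - incident.first_seen > WINDOW_SECONDS:
            incident = Incident(key=key, first_seen=alert["ts"])
            incidents.append(incident)
            latest[key] = incident
        incident.alerts.append(alert)
    return incidents
```

With this rule, a burst of alerts for one service starts one agent chain rather than one chain per alert.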
### HMAC Audit Log
Every tool call, approval decision, and incident boundary is written to an append-only HMAC hash-chained log (`data/audit_log.jsonl`). The log is tamper-evident by design: `verify_integrity()` checks the full chain.
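
To show why a hash chain is tamper-evident, here is a minimal in-memory sketch using only the standard library. The entry layout (`payload`, `prev`, `mac`) and the function names are assumptions for illustration and are not the format of `data/audit_log.jsonl`.

```python
import hashlib
import hmac
import json

GENESIS = "0" * 64  # placeholder "previous MAC" for the first entry


def _mac(key: bytes, prev: str, payload: dict) -> str:
    # The MAC covers the previous entry's MAC, so entries are chained.
    message = prev.encode() + json.dumps(payload, sort_keys=True).encode()
    return hmac.new(key, message, hashlib.sha256).hexdigest()


def append_entry(log: list, key: bytes, payload: dict) -> dict:
    prev = log[-1]["mac"] if log else GENESIS
    entry = {"payload": payload, "prev": prev, "mac": _mac(key, prev, payload)}
    log.append(entry)
    return entry


def verify_chain(log: list, key: bytes) -> bool:
    """Recompute every MAC in order; any edit, reorder, or mid-chain drop fails."""
    prev = GENESIS
    for entry in log:
        if entry["prev"] != prev:
            return False
        expected = _mac(key, prev, entry["payload"])
        if not hmac.compare_digest(entry["mac"], expected):
            return False
        prev = entry["mac"]
    return True
```

Because each MAC depends on the one before it, changing a single historical entry invalidates every entry after it unless the attacker also holds the key.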
---
## Training Pipeline
### SFT → Online GRPO on AMD MI300X
```
5k trajectories (real GKE rollouts, teacher model)
↓
QLoRA SFT (Qwen2.5-7B, 4-bit NF4, LoRA r=16)
↓
Online GRPO (G=8 live GKE rollouts per step, DAPO loss)
↓
Benchmark (38 frozen scenarios, anti-gaming reward contract)
```
**This is true online RL.** Each GRPO training step:
1. Applies a real Chaos Mesh fault to the live GKE cluster
2. Runs G=8 parallel agent chain rollouts
3. Scores each with the reward contract (kubectl/promql verify real cluster state)
4. Computes GRPO advantages and updates the policy
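
Step 4 relies on group-relative advantages: each rollout's reward is compared with the other rollouts for the same fault. A minimal sketch of that normalisation follows. It uses the population standard deviation and a small `eps` for stability, both of which are illustrative choices; trainer implementations differ in these details.

```python
from statistics import mean, pstdev


def group_advantages(rewards, eps=1e-6):
    """Normalise rewards within one group of rollouts (for example G=8)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # Rollouts above the group mean get a positive advantage, below it negative.
    return [(r - mu) / (sigma + eps) for r in rewards]
```

If all rollouts in a group score the same, every advantage is zero and that group contributes no gradient, which is one reason varied scenarios matter for online training.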
### What makes our training different from competitors
| Feature | Standard GRPO | AtlasOps |
|---|---|---|
| Environment | Simulator / offline rewards | **Real GKE cluster, live kubectl** |
| Loss | Standard GRPO | **DAPO** (decoupled clipping and dynamic sampling; more stable on skewed rewards) |
| Reward | Episode-level only | **Dense per-step** (progress delta per tool call) + episode contract |
| Curriculum | Random / fixed | **Spaced repetition** (mastery tracking, [3→6→12→24→48] resurface intervals) |
| Scenario generation | Static | **Infinite adversarial** (72B judge generates new Chaos YAML live) |
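
The spaced-repetition row can be made concrete with a small scheduling sketch. It assumes the intervals are counted in training steps, that a solved scenario moves one interval up the ladder, and that a failure sends it back to the shortest interval. The table does not state these rules, and `schedule` is a hypothetical name rather than the `CurriculumManager` API.

```python
from typing import Tuple

INTERVALS = [3, 6, 12, 24, 48]  # resurface intervals from the table above


def schedule(streak: int, step: int, solved: bool) -> Tuple[int, int]:
    """Return (new_streak, next_step_to_resurface) for one scenario."""
    # A failure resets the streak, so the scenario comes back soon.
    streak = streak + 1 if solved else 0
    # Streak 1 uses the first interval; long streaks cap at the last one.
    index = min(max(streak - 1, 0), len(INTERVALS) - 1)
    return streak, step + INTERVALS[index]
```

For example, a scenario solved for the first time at step 10 would resurface at step 13, and a second success there would push it out to step 19.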
### Reward Contract (Anti-Gaming)
```
R = 0.35 × resolve + 0.20 × evidence + 0.20 × safety + 0.15 × speed + 0.10 × comms
    − command_spam (0.10) − false_resolution (0.25) − unsafe_shortcut (0.20)
    − hallucinated_evidence (0.20) − over_silence (0.10)

Per-step dense signal = progress_delta × 0.8 + 0.1 (forward motion)
                        − 0.1 × rollbacks, × 0.5 if tool_failed
Final blend = 0.70 × episode_contract + 0.30 × dense_step_total (normalised)
```
Tier weights shift: cascade and adversarial scenarios apply penalties 1.25× harder, and named replays require evidence before a resolution counts.
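
The contract can be sketched in a few lines of Python. The weights and penalty sizes come from the formula above; everything else is an assumption made for illustration: component scores in [0, 1], penalties applied once per violation type, the 1.25× factor scaling penalties only, and the `+ 0.1` forward-motion bonus applying only when `progress_delta` is positive. The function names are hypothetical and are not those in `config/runtime.py`.

```python
WEIGHTS = {"resolve": 0.35, "evidence": 0.20, "safety": 0.20,
           "speed": 0.15, "comms": 0.10}
PENALTIES = {"command_spam": 0.10, "false_resolution": 0.25,
             "unsafe_shortcut": 0.20, "hallucinated_evidence": 0.20,
             "over_silence": 0.10}
HARD_TIERS = {"cascade", "adversarial"}  # penalised 1.25x harder


def episode_reward(scores: dict, violations: set, tier: str = "single_fault") -> float:
    """Weighted component scores minus (tier-scaled) anti-gaming penalties."""
    base = sum(weight * scores.get(name, 0.0) for name, weight in WEIGHTS.items())
    scale = 1.25 if tier in HARD_TIERS else 1.0
    penalty = scale * sum(PENALTIES[v] for v in violations)
    return base - penalty


def step_reward(progress_delta: float, rollbacks: int = 0,
                tool_failed: bool = False) -> float:
    """Dense per-step signal for a single tool call."""
    bonus = 0.1 if progress_delta > 0 else 0.0  # assumed forward-motion bonus
    reward = 0.8 * progress_delta + bonus - 0.1 * rollbacks
    return reward * 0.5 if tool_failed else reward


def blended_reward(episode: float, dense_normalised: float) -> float:
    """Final blend of the episode contract and the normalised dense total."""
    return 0.70 * episode + 0.30 * dense_normalised
```

Under these assumptions, a perfect cascade episode that is later flagged as a false resolution drops from 1.0 to about 0.69, so claiming success without verification costs more than skipping the comms step entirely.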
---
## Benchmark Results
| Model | Resolution | Avg Reward | Cascade | Named Replays |
|---|---|---|---|---|
| Qwen2.5-7B zero-shot | 54% | 0.481 | 40% | 30% |
| AtlasOps SFT | 68% | 0.601 | 62% | 55% |
| **AtlasOps GRPO (MI300X)** | **82%** | **0.729** | **78%** | **72%** |
**+28 pp improvement** in resolution rate from the zero-shot baseline to GRPO (54% → 82%). Reward includes anti-gaming penalties (command spam, false resolution, hallucinated evidence).
*Run `python scripts/release_gate.py` to verify artifact presence. Results auto-update in the dashboard Benchmark tab.*
---
## Quick Start
### Prerequisites
- GCP project with `container.googleapis.com` enabled
- `gcloud`, `kubectl`, `helm` installed
- AMD MI300X instance (or Fireworks AI fallback for inference)
### 1. Provision GCP infrastructure
```bash
bash infra/setup.sh <YOUR_PROJECT_ID> us-central1 atlasops
```
### 2. Start the ops console
```bash
pip install -e ".[dev]"
python app.py # http://localhost:7860
```
### Hugging Face Space (use your trained 7B + judge on Router)
Set Space secrets: **`HF_TOKEN`**, **`ATLASOPS_USE_HF_INFERENCE=1`**, **`AGENT_MODEL`**, **`JUDGE_MODEL`**.
Paste your merged GRPO Hub id as `AGENT_MODEL` (merge locally with `training/merge_lora_for_hub.py` under `.[train]`).
Full checklist: [docs/HF_SPACE_SETUP.md](docs/HF_SPACE_SETUP.md).
### 3. Inject a chaos scenario
```bash
make chaos SCENARIO=single_fault/sf-001 # pod-kill on cartservice
make chaos SCENARIO=named_replays/hist-cloudflare-2019
make chaos-reset
```
Or click a scenario button in the ops console; agents respond in real time.
### 4. Run the benchmark
```bash
python bench/runner.py --model checkpoints/grpo_v3 --tag grpo_v3
# Results → bench/results/comparison_table.md
```
### 5. Train on AMD MI300X
```bash
# Set up MI300X (installs ROCm deps, downloads models)
bash infra/setup_mi300x.sh
python training/generate_trajectories.py # 5k SFT examples
python training/sft.py --model Qwen/Qwen2.5-7B-Instruct --rocm
python training/grpo.py --model checkpoints/sft_v3 --rocm
```
### 6. Run tests
```bash
# Core agent + tool tests
python -m pytest tests/test_tools.py tests/test_coordinator.py tests/test_bench_runner.py -q
# Safety guardrail tests
python -m pytest tests/test_approval.py tests/test_circuit_breaker.py \
tests/test_correlator.py tests/test_audit.py -q
# App endpoint smoke tests
python -m pytest tests/test_app_endpoints.py -q
```
### 7. Release readiness gate
```bash
python scripts/release_gate.py --strict
# Writes docs/RELEASE_READINESS.md; all checks must PASS before submission
```
---
## Project Structure
```
atlasops/
├── agents/
│   ├── coordinator.py          # FastAPI + full agent chain
│   ├── approval.py             # Human-in-the-loop gate (P0/P1/P2/P3)
│   ├── circuit_breaker.py      # Hard limits on tool calls + mutations
│   ├── correlator.py           # Alert storm deduplication
│   ├── audit.py                # HMAC hash-chained audit trail
│   ├── adversarial_designer.py # 72B judge → infinite Chaos YAML
│   ├── judge.py                # Episode scoring
│   ├── stream.py               # SSE thought streaming
│   ├── prompts/                # triage / diagnosis / remediation / comms
│   └── tools/                  # 20 real SRE tool wrappers
├── bench/
│   ├── runner.py               # Benchmark harness (38 frozen scenarios)
│   └── chaos_manifests/        # sf-001..008 · cs-001..005 · mf-001..005 · named_replays/
├── config/
│   └── runtime.py              # Frozen scenarios · reward contract · CurriculumManager · StepRewardTracker
├── training/
│   ├── sft.py                  # QLoRA SFT (4-bit NF4, LoRA r=16)
│   ├── grpo.py                 # Online GRPO (DAPO loss, spaced-rep curriculum, dense rewards)
│   └── generate_trajectories.py
├── scripts/
│   └── release_gate.py         # Pre-submission readiness checker
├── static/
│   └── index.html              # Custom dark ops console (SSE + service topology + Slack feed)
├── tests/                      # 100+ tests across tools, coordinator, bench, safety
├── docs/                       # Postmortems · MI300X evidence · benchmarks
├── infra/                      # GCP provisioning · Helm values
├── app.py                      # FastAPI entry point (HF Spaces)
└── Dockerfile                  # HF Spaces container
```
---
## Why AMD MI300X
- **192 GB HBM3**: fits all 5 models simultaneously. 4 × Qwen2.5-7B-4bit (~4 GB each) + Qwen2.5-72B-4bit (~37 GB) = ~53 GB of weights, leaving about 139 GB of headroom. For comparison, an unquantised 72B model (about 144 GB of weights in bf16) does not fit on a single 80 GB A100.
- **Online GRPO needs low-latency inference**: each training step fires 8 live GKE rollouts. MI300X throughput keeps step time under 5 minutes.
- **ROCm-native**: all training scripts target `--rocm`. Verified: `BitsAndBytesConfig` + `paged_adamw_8bit` on ROCm.
See [docs/MI300X_EVIDENCE.md](docs/MI300X_EVIDENCE.md) for `rocm-smi` snapshots and memory breakdown.
---
## License
MIT. See [LICENSE](LICENSE).