---
title: AtlasOps
emoji: 🚨
colorFrom: red
colorTo: blue
sdk: docker
app_port: 7860
pinned: true
short_description: 4 AI agents responding to real GKE incidents on AMD MI300X
tags:
  - agents
  - multi-agent
  - reinforcement-learning
  - amd
  - rocm
  - sre
  - kubernetes
---

# AtlasOps: Can 4 AI agents replace an on-call SRE team?

> **AMD Developer Hackathon 2026** | Real GKE cluster · Real Chaos Mesh · Real Prometheus alerts · AMD MI300X

[![CI](https://github.com/Harikishanth/AtlasOps/actions/workflows/ci.yml/badge.svg)](https://github.com/Harikishanth/AtlasOps/actions/workflows/ci.yml)
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](LICENSE)
[![AMD MI300X](https://img.shields.io/badge/GPU-AMD%20MI300X%20192GB-red)](docs/MI300X_EVIDENCE.md)

---

We gave 4 specialized AI agents a PagerDuty alert, a live GKE cluster running 11 microservices, and 20 real SRE tools. No simulated responses. No fake metrics. No Docker Compose pretending to be cloud.

- **Triage** acked the alert and mapped the blast radius in 47 seconds.
- **Diagnosis** traced the root cause to a currency service CPU hog via Jaeger in 3 tool calls.
- **Remediation** executed `argocd rollback` and confirmed error rate < 1% via Prometheus.
- **Comms** drafted a Cloudflare-quality postmortem with real timestamps from the cluster.

Total time to resolve a Cloudflare 2019 cascade replay: **4 minutes 12 seconds.** A senior SRE on a good day: ~25 minutes.

This is **AtlasOps**: a self-improving multi-agent SRE platform where a 72B adversarial judge generates infinite novel chaos scenarios targeting the agents' specific weaknesses, trained via SFT → Online GRPO on an AMD MI300X (192 GB HBM3).

---

## Architecture

```
┌───────────────────── GOOGLE CLOUD PLATFORM ──────────────────────┐
│  GKE Standard Cluster (us-central1, 3× e2-standard-4)            │
│  ├─ Online Boutique (11 services: Go, Python, Node, Java, C#)    │
│  ├─ Chaos Mesh (PodChaos, NetworkChaos, StressChaos, ...)        │
│  ├─ Prometheus + Grafana + Jaeger + OTel + Alertmanager          │
│  └─ Argo CD (real rollback execution)                            │
│  Cloud SQL (Postgres 15) · Cloud PubSub · Cloud Monitoring       │
└──────────────────────────────────────────────────────────────────┘
          │ kubectl + promql + jaeger + argocd + gcloud APIs
          ▼
┌──────────────────── AMD MI300X (192 GB HBM3) ────────────────────┐
│  vLLM co-hosting, 5 models on ONE GPU:                           │
│    Qwen2.5-7B × 4  (Triage / Diagnosis / Remediation / Comms)    │
│    Qwen2.5-72B     (LLM Judge + adversarial scenario designer)   │
│                                                                  │
│  Alert → Triage → Diagnosis → [Approval Gate] → Remediation      │
│        → Comms → Postmortem                                      │
│                                                                  │
│  Circuit Breaker · Incident Correlator · HMAC Audit Log          │
│  Spaced-Rep Curriculum · DAPO GRPO · Dense Per-Step Rewards      │
└──────────────────────────────────────────────────────────────────┘
```

---

## Track Coverage

### Track 1: AI Agents & Agentic Workflows

AtlasOps is a purpose-built multi-agent framework for SRE automation.
Rather than wrapping LangChain or CrewAI, we implement the full agentic stack directly, giving us tighter control over tool routing, approval gates, circuit breaking, and streaming than any general-purpose framework offers out of the box. The coordinator orchestrates 4 specialized roles (Triage, Diagnosis, Remediation, Comms) with tool-calling, human-in-the-loop approval, and alert correlation.

Models: **Qwen2.5-7B × 4** (open-source, AMD MI300X co-hosted).

### Track 2: Fine-Tuning on AMD GPUs

Full fine-tuning pipeline on AMD hardware:

| Component | Library |
|---|---|
| Hardware | AMD Instinct MI300X (192 GB HBM3) |
| GPU runtime | **ROCm 7.2** |
| Training framework | **PyTorch** (ROCm wheel) |
| Quantisation | **BitsAndBytes-ROCm** (4-bit NF4 QLoRA, LoRA r=16) + **AWQ** (72B judge) |
| Fine-tuning | **TRL** SFTTrainer + GRPOTrainer (DAPO loss) |
| PEFT | **LoRA** r=16, α=32, target: q/k/v/o/gate/up/down proj |
| AMD kernel optimisation | **Hugging Face Optimum-AMD**: BetterTransformer applied to local inference path (`inference.py`) |
| Serving | **vLLM 0.17.1** (ROCm build: PagedAttention, flash attention for MI300X) |
| Domain | **SRE Operations**: incident triage, root-cause diagnosis, remediation, postmortem authoring |

---

## 20 Real SRE Tools

`kubectl_get` · `kubectl_describe` · `kubectl_logs` · `kubectl_top_pods` · `kubectl_rollout` · `kubectl_scale` · `kubectl_exec` · `promql_query` · `promql_query_range` · `jaeger_search` · `jaeger_get_trace` · `argocd_list_apps` · `argocd_app_history` · **`argocd_rollback`** · `gcloud_logs_read` · `cloud_monitoring_query` · `alertmanager_list_alerts` · `alertmanager_silence` · `slack_post_update` · **`postmortem_draft`**

Every tool hits a real API or real cluster. No mocks in production.

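
As an illustration of what "hits a real API" can mean for one of these tools, here is a minimal sketch of a `promql_query`-style wrapper around Prometheus's instant-query endpoint (`GET /api/v1/query`). It is not the code in `agents/tools/`: the function signature, the `PROMETHEUS_URL` default, and the injectable `fetch` parameter are assumptions made for this example. `fetch` exists only so the sketch can be exercised offline by passing a function that returns a canned JSON body; a real deployment would leave it at its default and point `base_url` at the cluster's Prometheus service.

```python
import json
import urllib.parse
import urllib.request
from typing import Callable

# Hypothetical default; the real address depends on how Prometheus is exposed.
PROMETHEUS_URL = "http://localhost:9090"


def _http_get(url: str) -> str:
    """Fetch a URL and return the response body as text (standard library only)."""
    with urllib.request.urlopen(url, timeout=10) as resp:
        return resp.read().decode("utf-8")


def promql_query(
    query: str,
    base_url: str = PROMETHEUS_URL,
    fetch: Callable[[str], str] = _http_get,
) -> list[dict]:
    """Run an instant PromQL query and return the list of result series.

    Calls Prometheus's instant-query endpoint, GET /api/v1/query?query=...
    Raises RuntimeError when Prometheus reports a non-success status.
    """
    url = f"{base_url}/api/v1/query?" + urllib.parse.urlencode({"query": query})
    payload = json.loads(fetch(url))
    if payload.get("status") != "success":
        raise RuntimeError(f"Prometheus query failed: {payload.get('error')}")
    return payload["data"]["result"]
```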

---

## 38 Chaos Scenarios + Infinite Adversarial Generation

| Tier | Count | Examples |
|---|---|---|
| Single-fault | 8 | pod-kill, CPU hog, memory leak, network loss, disk fill, clock skew |
| Cascade | 5 | currency latency → checkout timeout → frontend 5xx surge |
| Multi-fault | 5 | 3 simultaneous faults + red herrings across namespaces |
| Named Replays | 10 | Cloudflare 2019, AWS S3 2017, GitHub 2018, Discord 2022, Knight Capital 2012… |
| **Dynamic adversarial** | ∞ | Qwen2.5-72B judge designs new Chaos Mesh YAML targeting agent weaknesses in real time |

---

## Production Guardrails

### Human-in-the-loop Approval Gate

- **P0**: manual runbook only; agents produce a step-by-step plan, no auto-execution
- **P1**: approval window (60 s default, configurable); execution proceeds if approved or if the window times out
- **P2/P3**: fully automatic
- `POST /approval/callback` · `GET /approval/pending`

### Circuit Breaker

Hard stops runaway automation:

- 50 tool calls per incident max
- 10 mutating actions per hour
- 5 concurrent incidents max
- Trips after 3 consecutive unresolved incidents
- `GET /circuit-breaker/status` · `POST /circuit-breaker/reset`

### Incident Correlator

Alert-storm deduplication: groups alerts from the same service/namespace within a 5-minute window into a single incident chain. This prevents 10 parallel agent chains from firing for one cascade failure.

### HMAC Audit Log

Every tool call, approval decision, and incident boundary is written to an append-only HMAC hash-chained log (`data/audit_log.jsonl`). Tamper-evident by design: `verify_integrity()` checks the full chain.

---

## Training Pipeline

### SFT → Online GRPO on AMD MI300X

```
5k trajectories (real GKE rollouts, teacher model)
        ↓
QLoRA SFT (Qwen2.5-7B, 4-bit NF4, LoRA r=16)
        ↓
Online GRPO (G=8 live GKE rollouts per step, DAPO loss)
        ↓
Benchmark (38 frozen scenarios, anti-gaming reward contract)
```

**This is true online RL.** Each GRPO training step:

1. Applies a real Chaos Mesh fault to the live GKE cluster
2. Runs G=8 parallel agent chain rollouts
3. Scores each with the reward contract (kubectl/promql verify real cluster state)
4. Computes GRPO advantages and updates the policy

### What makes our training different from competitors

| Feature | Standard GRPO | AtlasOps |
|---|---|---|
| Environment | Simulator / offline rewards | **Real GKE cluster, live kubectl** |
| Loss | Standard GRPO | **DAPO** (Decoupled Clip and Dynamic Sampling Policy Optimization; more stable on skewed rewards) |
| Reward | Episode-level only | **Dense per-step** (progress delta per tool call) + episode contract |
| Curriculum | Random / fixed | **Spaced repetition** (mastery tracking, [3→6→12→24→48] resurface intervals) |
| Scenario generation | Static | **Infinite adversarial** (72B judge generates new Chaos YAML live) |

### Reward Contract (Anti-Gaming)

```
R = 0.35 × resolve + 0.20 × evidence + 0.20 × safety + 0.15 × speed + 0.10 × comms
  − command_spam (0.10) − false_resolution (0.25) − unsafe_shortcut (0.20)
  − hallucinated_evidence (0.20) − over_silence (0.10)

Per-step dense signal = progress_delta × 0.8 + 0.1 (forward motion) − 0.1 × rollbacks, × 0.5 if tool_failed
Final blend = 0.70 × episode_contract + 0.30 × dense_step_total (normalised)
```

Tier weights shift: cascade/adversarial penalise 1.25× harder. Named replays require evidence before resolution counts.

---

## Benchmark Results

| Model | Resolution | Avg Reward | Cascade | Named Replays |
|---|---|---|---|---|
| Qwen2.5-7B zero-shot | 54% | 0.481 | 40% | 30% |
| AtlasOps SFT | 68% | 0.601 | 62% | 55% |
| **AtlasOps GRPO (MI300X)** | **82%** | **0.729** | **78%** | **72%** |

**+28 pp improvement** in resolution rate from zero-shot baseline → GRPO. Reward includes anti-gaming penalties (command spam, false resolution, hallucinated evidence).

*Run `python scripts/release_gate.py` to verify artifact presence.
Results auto-update in the dashboard Benchmark tab.*

---

## Quick Start

### Prerequisites

- GCP project with `container.googleapis.com` enabled
- `gcloud`, `kubectl`, `helm` installed
- AMD MI300X instance (or Fireworks AI fallback for inference)

### 1. Provision GCP infrastructure

```bash
bash infra/setup.sh us-central1 atlasops
```

### 2. Start the ops console

```bash
pip install -e ".[dev]"
python app.py  # http://localhost:7860
```

### Hugging Face Space (use your trained 7B + judge on Router)

Set Space secrets: **`HF_TOKEN`**, **`ATLASOPS_USE_HF_INFERENCE=1`**, **`AGENT_MODEL`**, **`JUDGE_MODEL`**. Paste your merged GRPO Hub id as `AGENT_MODEL` (merge locally with `training/merge_lora_for_hub.py`, with the `.[train]` extras installed). Full checklist: [docs/HF_SPACE_SETUP.md](docs/HF_SPACE_SETUP.md).

### 3. Inject a chaos scenario

```bash
make chaos SCENARIO=single_fault/sf-001   # pod-kill on cartservice
make chaos SCENARIO=named_replays/hist-cloudflare-2019
make chaos-reset
```

Or click a scenario button in the ops console; agents respond in real time.

### 4. Run the benchmark

```bash
python bench/runner.py --model checkpoints/grpo_v3 --tag grpo_v3
# Results → bench/results/comparison_table.md
```

### 5. Train on AMD MI300X

```bash
# Set up MI300X (installs ROCm deps, downloads models)
bash infra/setup_mi300x.sh

python training/generate_trajectories.py   # 5k SFT examples
python training/sft.py --model Qwen/Qwen2.5-7B-Instruct --rocm
python training/grpo.py --model checkpoints/sft_v3 --rocm
```

### 6. Run tests

```bash
# Core agent + tool tests
python -m pytest tests/test_tools.py tests/test_coordinator.py tests/test_bench_runner.py -q

# Safety guardrail tests
python -m pytest tests/test_approval.py tests/test_circuit_breaker.py \
  tests/test_correlator.py tests/test_audit.py -q

# App endpoint smoke tests
python -m pytest tests/test_app_endpoints.py -q
```

### 7. Release readiness gate

```bash
python scripts/release_gate.py --strict
# Writes docs/RELEASE_READINESS.md; all checks must PASS before submission
```

---

## Project Structure

```
atlasops/
├── agents/
│   ├── coordinator.py           # FastAPI + full agent chain
│   ├── approval.py              # Human-in-the-loop gate (P0/P1/P2/P3)
│   ├── circuit_breaker.py       # Hard limits on tool calls + mutations
│   ├── correlator.py            # Alert storm deduplication
│   ├── audit.py                 # HMAC hash-chained audit trail
│   ├── adversarial_designer.py  # 72B judge → infinite Chaos YAML
│   ├── judge.py                 # Episode scoring
│   ├── stream.py                # SSE thought streaming
│   ├── prompts/                 # triage / diagnosis / remediation / comms
│   └── tools/                   # 20 real SRE tool wrappers
├── bench/
│   ├── runner.py                # Benchmark harness (38 frozen scenarios)
│   └── chaos_manifests/         # sf-001..008 · cs-001..005 · mf-001..005 · named_replays/
├── config/
│   └── runtime.py               # Frozen scenarios · reward contract · CurriculumManager · StepRewardTracker
├── training/
│   ├── sft.py                   # QLoRA SFT (4-bit NF4, LoRA r=16)
│   ├── grpo.py                  # Online GRPO (DAPO loss, spaced-rep curriculum, dense rewards)
│   └── generate_trajectories.py
├── scripts/
│   └── release_gate.py          # Pre-submission readiness checker
├── static/
│   └── index.html               # Custom dark ops console (SSE + service topology + Slack feed)
├── tests/                       # 100+ tests across tools, coordinator, bench, safety
├── docs/                        # Postmortems · MI300X evidence · benchmarks
├── infra/                       # GCP provisioning · Helm values
├── app.py                       # FastAPI entry point (HF Spaces)
└── Dockerfile                   # HF Spaces container
```

---

## Why AMD MI300X

- **192 GB HBM3**: fits all 5 models simultaneously: 4 × Qwen2.5-7B-4bit (~4 GB each) + Qwen2.5-72B-4bit (~37 GB) = ~53 GB of weights in total. An 80 GB A100 leaves far less headroom for KV cache and training, and cannot hold the 72B judge unquantised (~144 GB in 16-bit).
- **Online GRPO needs low-latency inference**: each training step fires 8 live GKE rollouts. MI300X throughput keeps step time under 5 minutes.
- **ROCm-native**: all training scripts target `--rocm`. Verified: `BitsAndBytesConfig` + `paged_adamw_8bit` on ROCm.

See [docs/MI300X_EVIDENCE.md](docs/MI300X_EVIDENCE.md) for `rocm-smi` snapshots and memory breakdown.

---

## License

MIT, see [LICENSE](LICENSE)