Harikishanth R
metadata
title: AtlasOps
emoji: 🚨
colorFrom: red
colorTo: blue
sdk: docker
app_port: 7860
pinned: true
short_description: 4 AI agents responding to real GKE incidents on AMD MI300X
tags:
  - agents
  - multi-agent
  - reinforcement-learning
  - amd
  - rocm
  - sre
  - kubernetes

AtlasOps: Can 4 AI agents replace an on-call SRE team?

AMD Developer Hackathon 2026 | Real GKE cluster · Real Chaos Mesh · Real Prometheus alerts · AMD MI300X



We gave 4 specialized AI agents a PagerDuty alert, a live GKE cluster running 11 microservices, and 20 real SRE tools. No simulated responses. No fake metrics. No Docker Compose pretending to be cloud.

Triage acked the alert and mapped the blast radius in 47 seconds.
Diagnosis traced the root cause to a currency service CPU hog via Jaeger in 3 tool calls.
Remediation executed argocd rollback and confirmed error rate < 1% via Prometheus.
Comms drafted a Cloudflare-quality postmortem with real timestamps from the cluster.

Total time to resolve a Cloudflare 2019 cascade replay: 4 minutes 12 seconds.
A senior SRE on a good day: ~25 minutes.

This is AtlasOps, a self-improving multi-agent SRE platform where a 72B adversarial judge generates infinite novel chaos scenarios targeting the agents' specific weaknesses, trained via SFT → Online GRPO on an AMD MI300X (192 GB HBM3).


Architecture

┌──────────────────── GOOGLE CLOUD PLATFORM ─────────────────────┐
│  GKE Standard Cluster (us-central1, 3× e2-standard-4)          │
│  ├─ Online Boutique (11 services: Go, Python, Node, Java, C#)  │
│  ├─ Chaos Mesh (PodChaos, NetworkChaos, StressChaos, ...)      │
│  ├─ Prometheus + Grafana + Jaeger + OTel + Alertmanager        │
│  └─ Argo CD (real rollback execution)                          │
│  Cloud SQL (Postgres 15) · Cloud PubSub · Cloud Monitoring     │
└────────────────────────────────────────────────────────────────┘
          │ kubectl + promql + jaeger + argocd + gcloud APIs
          ▼
┌──────────────── AMD MI300X (192 GB HBM3) ──────────────────────┐
│  vLLM co-hosting, 5 models on ONE GPU:                         │
│  Qwen2.5-7B × 4 (Triage / Diagnosis / Remediation / Comms)     │
│  Qwen2.5-72B  (LLM Judge + adversarial scenario designer)      │
│                                                                │
│  Alert → Triage → Diagnosis → [Approval Gate] → Remediation    │
│       → Comms → Postmortem                                     │
│                                                                │
│  Circuit Breaker · Incident Correlator · HMAC Audit Log        │
│  Spaced-Rep Curriculum · DAPO GRPO · Dense Per-Step Rewards    │
└────────────────────────────────────────────────────────────────┘

Track Coverage

Track 1: AI Agents & Agentic Workflows

AtlasOps is a purpose-built multi-agent framework for SRE automation. Rather than wrapping LangChain or CrewAI, we implement the full agentic stack directly, which gives us direct control over tool routing, approval gates, circuit breaking, and streaming. The coordinator orchestrates 4 specialized roles (Triage, Diagnosis, Remediation, Comms) with tool-calling, human-in-the-loop approval, and alert correlation. Models: Qwen2.5-7B × 4 (open-source, AMD MI300X co-hosted).

Track 2: Fine-Tuning on AMD GPUs

Full fine-tuning pipeline on AMD hardware:

| Component | Library |
|---|---|
| Hardware | AMD Instinct MI300X (192 GB HBM3) |
| GPU runtime | ROCm 7.2 |
| Training framework | PyTorch (ROCm wheel) |
| Quantisation | BitsAndBytes-ROCm (4-bit NF4 QLoRA, LoRA r=16) + AWQ (72B judge) |
| Fine-tuning | TRL SFTTrainer + GRPOTrainer (DAPO loss) |
| PEFT | LoRA r=16, α=32, target: q/k/v/o/gate/up/down proj |
| AMD kernel optimisation | Hugging Face Optimum-AMD: BetterTransformer applied to local inference path (inference.py) |
| Serving | vLLM 0.17.1 (ROCm build: PagedAttention, flash attention for MI300X) |
| Domain | SRE Operations: incident triage, root-cause diagnosis, remediation, postmortem authoring |

20 Real SRE Tools

kubectl_get · kubectl_describe · kubectl_logs · kubectl_top_pods · kubectl_rollout · kubectl_scale · kubectl_exec · promql_query · promql_query_range · jaeger_search · jaeger_get_trace · argocd_list_apps · argocd_app_history · argocd_rollback · gcloud_logs_read · cloud_monitoring_query · alertmanager_list_alerts · alertmanager_silence · slack_post_update · postmortem_draft

Every tool hits a real API or real cluster. No mocks in production.


38 Chaos Scenarios + Infinite Adversarial Generation

| Tier | Count | Examples |
|---|---|---|
| Single-fault | 8 | pod-kill, CPU hog, memory leak, network loss, disk fill, clock skew |
| Cascade | 5 | currency latency → checkout timeout → frontend 5xx surge |
| Multi-fault | 5 | 3 simultaneous faults + red herrings across namespaces |
| Named Replays | 10 | Cloudflare 2019, AWS S3 2017, GitHub 2018, Discord 2022, Knight Capital 2012… |
| Dynamic adversarial | ∞ | Qwen2.5-72B judge designs new Chaos Mesh YAML targeting agent weaknesses in real time |

Production Guardrails

Human-in-the-loop Approval Gate

  • P0: manual runbook only. Agents produce a step-by-step plan, with no auto-execution
  • P1: approval window (60 s default, configurable). Execution proceeds once approved or when the window times out
  • P2/P3: fully automatic
  • POST /approval/callback · GET /approval/pending
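
The severity policy above can be sketched as a small decision function. This is an illustrative sketch, not the code in agents/approval.py: the names `Decision`, `gate`, and `wait_for_approval` are hypothetical, and treating an explicit rejection as a hard stop is an assumption (the policy above only says what happens on approval or timeout).

```python
from enum import Enum
from typing import Callable, Optional


class Decision(Enum):
    RUNBOOK_ONLY = "runbook_only"  # P0: produce a plan, never auto-execute
    EXECUTE = "execute"            # remediation may run
    REJECTED = "rejected"          # assumption: a human "no" blocks execution


def gate(severity: str,
         wait_for_approval: Callable[[float], Optional[bool]],
         window_s: float = 60.0) -> Decision:
    """Map an incident severity to an execution decision.

    wait_for_approval(timeout) is a hypothetical callback that returns
    True (approved), False (rejected) or None (window timed out).
    """
    if severity == "P0":
        return Decision.RUNBOOK_ONLY
    if severity == "P1":
        answer = wait_for_approval(window_s)
        if answer is False:
            return Decision.REJECTED
        # Approved, or the window elapsed without an answer.
        return Decision.EXECUTE
    if severity in ("P2", "P3"):
        return Decision.EXECUTE
    raise ValueError(f"unknown severity: {severity!r}")
```

For example, `gate("P1", lambda timeout: None)` models a timed-out approval window and returns `Decision.EXECUTE`.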

Circuit Breaker

Hard stops runaway automation:

  • 50 tool calls per incident max
  • 10 mutating actions per hour
  • 5 concurrent incidents max
  • Trips after 3 consecutive unresolved incidents
  • GET /circuit-breaker/status · POST /circuit-breaker/reset
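
A minimal in-memory sketch of how the four limits above could be enforced. The class and method names are hypothetical and do not reflect agents/circuit_breaker.py; the one-hour sliding window for mutating actions is an assumption about how "per hour" is measured.

```python
import time
from collections import deque


class CircuitBreaker:
    MAX_TOOL_CALLS_PER_INCIDENT = 50
    MAX_MUTATIONS_PER_HOUR = 10
    MAX_CONCURRENT_INCIDENTS = 5
    MAX_CONSECUTIVE_UNRESOLVED = 3

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._tool_calls = {}      # open incident id -> tool calls so far
        self._mutations = deque()  # timestamps of recent mutating actions
        self._unresolved_streak = 0
        self.tripped = False

    def open_incident(self, incident_id):
        """Return False if the breaker is tripped or too many incidents are open."""
        if self.tripped or len(self._tool_calls) >= self.MAX_CONCURRENT_INCIDENTS:
            return False
        self._tool_calls[incident_id] = 0
        return True

    def allow_tool_call(self, incident_id, mutating=False):
        """Check and record one tool call; return False if any limit is hit."""
        if self.tripped or incident_id not in self._tool_calls:
            return False
        if self._tool_calls[incident_id] >= self.MAX_TOOL_CALLS_PER_INCIDENT:
            return False
        now = self._clock()
        # Drop mutation timestamps older than one hour (sliding window).
        while self._mutations and now - self._mutations[0] >= 3600:
            self._mutations.popleft()
        if mutating:
            if len(self._mutations) >= self.MAX_MUTATIONS_PER_HOUR:
                return False
            self._mutations.append(now)
        self._tool_calls[incident_id] += 1
        return True

    def close_incident(self, incident_id, resolved):
        """Close an incident and trip after too many unresolved ones in a row."""
        self._tool_calls.pop(incident_id, None)
        self._unresolved_streak = 0 if resolved else self._unresolved_streak + 1
        if self._unresolved_streak >= self.MAX_CONSECUTIVE_UNRESOLVED:
            self.tripped = True

    def reset(self):
        self._unresolved_streak = 0
        self.tripped = False
```

Passing the clock in as a parameter keeps the hourly window testable without sleeping.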

Incident Correlator

Alert-storm deduplication: groups alerts from the same service/namespace within a 5-minute window into a single incident chain. This prevents 10 parallel agent chains from firing for one cascade failure.
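
The grouping rule can be sketched as follows. This is a hedged illustration rather than the logic in agents/correlator.py: the alert dictionary keys (`namespace`, `service`, `ts`) and the `Incident` type are hypothetical, and anchoring the 5-minute window at the first alert of each incident is an assumption.

```python
from dataclasses import dataclass, field

WINDOW_S = 300  # 5-minute correlation window


@dataclass
class Incident:
    key: tuple          # (namespace, service)
    first_seen: float   # timestamp of the first alert in this incident
    alerts: list = field(default_factory=list)


def correlate(alerts, window_s=WINDOW_S):
    """Group alerts by (namespace, service) within a fixed window.

    alerts: iterable of dicts with 'namespace', 'service' and 'ts' (seconds).
    Returns incidents in the order they were opened.
    """
    latest = {}     # key -> most recent incident for that key
    incidents = []
    for alert in sorted(alerts, key=lambda a: a["ts"]):
        key = (alert["namespace"], alert["service"])
        current = latest.get(key)
        # Open a new incident if none exists or the window has elapsed.
        if current is None or alert["ts"] - current.first_seen > window_s:
            current = Incident(key=key, first_seen=alert["ts"])
            latest[key] = current
            incidents.append(current)
        current.alerts.append(alert)
    return incidents
```

With this rule, two checkout alerts two minutes apart share one incident, while a checkout alert arriving seven minutes after the first opens a new one.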

HMAC Audit Log

Every tool call, approval decision, and incident boundary is written to an append-only HMAC hash-chained log (data/audit_log.jsonl). Tamper-evident by design: verify_integrity() checks the full chain.
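
To illustrate the hash-chaining idea, here is a minimal in-memory sketch. It is an assumption-laden stand-in for agents/audit.py: the entry layout (`event`, `prev`, `mac`), the genesis value, and the function signatures are hypothetical, and the real log is a JSONL file rather than a Python list.

```python
import hashlib
import hmac
import json

GENESIS = "0" * 64  # hypothetical "previous MAC" for the first entry


def _mac(key: bytes, prev_mac: str, event: dict) -> str:
    # Each MAC covers the previous MAC plus a canonical encoding of the event,
    # so changing any earlier entry invalidates every later one.
    payload = prev_mac + json.dumps(event, sort_keys=True, separators=(",", ":"))
    return hmac.new(key, payload.encode("utf-8"), hashlib.sha256).hexdigest()


def append_entry(log: list, key: bytes, event: dict) -> dict:
    """Append an event, chaining it to the MAC of the previous entry."""
    prev = log[-1]["mac"] if log else GENESIS
    entry = {"event": event, "prev": prev, "mac": _mac(key, prev, event)}
    log.append(entry)
    return entry


def verify_integrity(log: list, key: bytes) -> bool:
    """Recompute the whole chain and compare MACs in constant time."""
    prev = GENESIS
    for entry in log:
        expected = _mac(key, prev, entry["event"])
        if entry["prev"] != prev or not hmac.compare_digest(entry["mac"], expected):
            return False
        prev = entry["mac"]
    return True
```

Editing, reordering, or deleting an entry in the middle of the log makes `verify_integrity` return False; removing entries from the end is not detectable without an externally stored head MAC.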


Training Pipeline

SFT → Online GRPO on AMD MI300X

5k trajectories (real GKE rollouts, teacher model)
        ↓
  QLoRA SFT  (Qwen2.5-7B, 4-bit NF4, LoRA r=16)
        ↓
  Online GRPO  (G=8 live GKE rollouts per step, DAPO loss)
        ↓
  Benchmark  (38 frozen scenarios, anti-gaming reward contract)

This is true online RL. Each GRPO training step:

  1. Applies a real Chaos Mesh fault to the live GKE cluster
  2. Runs G=8 parallel agent chain rollouts
  3. Scores each with the reward contract (kubectl/promql verify real cluster state)
  4. Computes GRPO advantages and updates the policy
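
Step 4 can be illustrated with the commonly used group-relative advantage: each rollout's reward is standardised against the other rollouts for the same fault. This is a sketch of the standard GRPO formulation, not necessarily what training/grpo.py computes; the function name and the `eps` stabiliser are assumptions.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Standardise rewards within one group of G rollouts.

    A_i = (r_i - mean(r)) / (std(r) + eps)
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std over the group
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for G=8 rollouts against the same injected fault.
rewards = [0.2, 0.4, 0.6, 0.8, 0.2, 0.4, 0.6, 0.8]
advantages = group_relative_advantages(rewards)
```

Rollouts that beat the group mean receive a positive advantage and the rest a negative one; if all 8 rollouts score the same, every advantage is zero and the step carries no learning signal.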

What makes our training different from competitors

| Feature | Standard GRPO | AtlasOps |
|---|---|---|
| Environment | Simulator / offline rewards | Real GKE cluster, live kubectl |
| Loss | Standard GRPO | DAPO (Decoupled Clip and Dynamic Sampling Policy Optimization; more stable on skewed rewards) |
| Reward | Episode-level only | Dense per-step (progress delta per tool call) + episode contract |
| Curriculum | Random / fixed | Spaced repetition (mastery tracking, [3→6→12→24→48] resurface intervals) |
| Scenario generation | Static | Infinite adversarial (72B judge generates new Chaos YAML live) |
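
The spaced-repetition row can be sketched as a small scheduler. Everything beyond the interval list is an assumption: the class name `Curriculum`, measuring intervals in training steps, and bringing a failed scenario back on the very next step are illustrative choices, not the behaviour of the CurriculumManager in config/runtime.py.

```python
INTERVALS = [3, 6, 12, 24, 48]  # resurface intervals; units assumed to be training steps


class Curriculum:
    def __init__(self, scenario_ids):
        self.streak = {s: 0 for s in scenario_ids}  # consecutive mastered runs
        self.due = {s: 0 for s in scenario_ids}     # step at which scenario resurfaces

    def record(self, scenario_id, step, mastered):
        """Update mastery after a rollout and schedule the next appearance."""
        if mastered:
            self.streak[scenario_id] += 1
            idx = min(self.streak[scenario_id] - 1, len(INTERVALS) - 1)
            self.due[scenario_id] = step + INTERVALS[idx]
        else:
            # Assumption: failures reset mastery and come back immediately.
            self.streak[scenario_id] = 0
            self.due[scenario_id] = step + 1

    def next_scenario(self, step):
        """Pick the most overdue scenario, or the soonest-due one if none is due."""
        due_now = [s for s, d in self.due.items() if d <= step]
        pool = due_now or list(self.due)
        return min(pool, key=lambda s: self.due[s])
```

Under these assumptions a scenario solved five times in a row is revisited after 3, 6, 12, 24, and then 48 steps, so training time shifts toward scenarios the agents still fail.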

Reward Contract (Anti-Gaming)

R = 0.35 × resolve + 0.20 × evidence + 0.20 × safety + 0.15 × speed + 0.10 × comms
  − command_spam (0.10) − false_resolution (0.25) − unsafe_shortcut (0.20)
  − hallucinated_evidence (0.20) − over_silence (0.10)

Per-step dense signal = progress_delta × 0.8 + 0.1 (forward motion)
                       − 0.1 × rollbacks, × 0.5 if tool_failed
Final blend = 0.70 × episode_contract + 0.30 × dense_step_total (normalised)

Tier weights shift: cascade/adversarial penalise 1.25× harder. Named replays require evidence before resolution counts.
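
The contract can be written out as a few small functions. The weights and penalty values come from the formula above; the function names, the component scores lying in [0, 1], applying the 0.1 forward-motion bonus only when `progress_delta > 0`, halving the whole step signal on tool failure, and modelling the 1.25× tier rule as a penalty multiplier are all assumptions made for illustration.

```python
WEIGHTS = {"resolve": 0.35, "evidence": 0.20, "safety": 0.20,
           "speed": 0.15, "comms": 0.10}
PENALTIES = {"command_spam": 0.10, "false_resolution": 0.25,
             "unsafe_shortcut": 0.20, "hallucinated_evidence": 0.20,
             "over_silence": 0.10}


def episode_contract(scores, violations=(), penalty_scale=1.0):
    """Weighted component scores minus penalties for each flagged violation.

    penalty_scale=1.25 models the harsher cascade/adversarial tiers.
    """
    base = sum(WEIGHTS[name] * scores.get(name, 0.0) for name in WEIGHTS)
    penalty = sum(PENALTIES[v] for v in violations)
    return base - penalty_scale * penalty


def dense_step(progress_delta, rollbacks=0, tool_failed=False):
    """Per-tool-call signal; bonus and failure handling are assumptions."""
    bonus = 0.1 if progress_delta > 0 else 0.0
    signal = progress_delta * 0.8 + bonus - 0.1 * rollbacks
    return signal * 0.5 if tool_failed else signal


def final_reward(contract, dense_total_normalised):
    """Blend the episode contract with the normalised dense-step total."""
    return 0.70 * contract + 0.30 * dense_total_normalised
```

A perfect episode with one false resolution on a cascade scenario scores `episode_contract(all_ones, ["false_resolution"], penalty_scale=1.25)`, which is 1.0 − 1.25 × 0.25 = 0.6875 before blending.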


Benchmark Results

| Model | Resolution | Avg Reward | Cascade | Named Replays |
|---|---|---|---|---|
| Qwen2.5-7B zero-shot | 54% | 0.481 | 40% | 30% |
| AtlasOps SFT | 68% | 0.601 | 62% | 55% |
| AtlasOps GRPO (MI300X) | 82% | 0.729 | 78% | 72% |

+28 pp improvement in resolution rate from the zero-shot baseline to GRPO (54% → 82%). Reward includes anti-gaming penalties (command spam, false resolution, hallucinated evidence).

Run python scripts/release_gate.py to verify artifact presence. Results auto-update in the dashboard Benchmark tab.


Quick Start

Prerequisites

  • GCP project with container.googleapis.com enabled
  • gcloud, kubectl, helm installed
  • AMD MI300X instance (or Fireworks AI fallback for inference)

1. Provision GCP infrastructure

bash infra/setup.sh <YOUR_PROJECT_ID> us-central1 atlasops

2. Start the ops console

pip install -e ".[dev]"
python app.py          # http://localhost:7860

Hugging Face Space (use your trained 7B + judge on Router)

Set Space secrets: HF_TOKEN, ATLASOPS_USE_HF_INFERENCE=1, AGENT_MODEL, JUDGE_MODEL.
Paste your merged GRPO Hub id as AGENT_MODEL (merge locally with training/merge_lora_for_hub.py under .[train]).
Full checklist: docs/HF_SPACE_SETUP.md.

3. Inject a chaos scenario

make chaos SCENARIO=single_fault/sf-001          # pod-kill on cartservice
make chaos SCENARIO=named_replays/hist-cloudflare-2019
make chaos-reset

Or click a scenario button in the ops console; agents respond in real time.

4. Run the benchmark

python bench/runner.py --model checkpoints/grpo_v3 --tag grpo_v3
# Results → bench/results/comparison_table.md

5. Train on AMD MI300X

# Set up MI300X (installs ROCm deps, downloads models)
bash infra/setup_mi300x.sh

python training/generate_trajectories.py   # 5k SFT examples
python training/sft.py --model Qwen/Qwen2.5-7B-Instruct --rocm
python training/grpo.py --model checkpoints/sft_v3 --rocm

6. Run tests

# Core agent + tool tests
python -m pytest tests/test_tools.py tests/test_coordinator.py tests/test_bench_runner.py -q

# Safety guardrail tests
python -m pytest tests/test_approval.py tests/test_circuit_breaker.py \
                 tests/test_correlator.py tests/test_audit.py -q

# App endpoint smoke tests
python -m pytest tests/test_app_endpoints.py -q

7. Release readiness gate

python scripts/release_gate.py --strict
# Writes docs/RELEASE_READINESS.md; all checks must PASS before submission

Project Structure

atlasops/
├── agents/
│   ├── coordinator.py          # FastAPI + full agent chain
│   ├── approval.py             # Human-in-the-loop gate (P0/P1/P2/P3)
│   ├── circuit_breaker.py      # Hard limits on tool calls + mutations
│   ├── correlator.py           # Alert storm deduplication
│   ├── audit.py                # HMAC hash-chained audit trail
│   ├── adversarial_designer.py # 72B judge → infinite Chaos YAML
│   ├── judge.py                # Episode scoring
│   ├── stream.py               # SSE thought streaming
│   ├── prompts/                # triage / diagnosis / remediation / comms
│   └── tools/                  # 20 real SRE tool wrappers
├── bench/
│   ├── runner.py               # Benchmark harness (38 frozen scenarios)
│   └── chaos_manifests/        # sf-001..008 · cs-001..005 · mf-001..005 · named_replays/
├── config/
│   └── runtime.py              # Frozen scenarios · reward contract · CurriculumManager · StepRewardTracker
├── training/
│   ├── sft.py                  # QLoRA SFT (4-bit NF4, LoRA r=16)
│   ├── grpo.py                 # Online GRPO (DAPO loss, spaced-rep curriculum, dense rewards)
│   └── generate_trajectories.py
├── scripts/
│   └── release_gate.py         # Pre-submission readiness checker
├── static/
│   └── index.html              # Custom dark ops console (SSE + service topology + Slack feed)
├── tests/                      # 100+ tests across tools, coordinator, bench, safety
├── docs/                       # Postmortems · MI300X evidence · benchmarks
├── infra/                      # GCP provisioning · Helm values
├── app.py                      # FastAPI entry point (HF Spaces)
└── Dockerfile                  # HF Spaces container

Why AMD MI300X

  • 192 GB HBM3: fits all 5 models simultaneously. 4 × Qwen2.5-7B-4bit (4 GB each) + Qwen2.5-72B-4bit (37 GB) = ~53 GB of weights, leaving the rest of the memory for KV cache and training state. For comparison, the 72B judge at 16-bit precision (~144 GB of weights) would not fit on a single 80 GB A100.
  • Online GRPO needs low-latency inference: each training step fires 8 live GKE rollouts. MI300X throughput keeps step time under 5 minutes.
  • ROCm-native: all training scripts take a --rocm flag. Verified: BitsAndBytesConfig + paged_adamw_8bit on ROCm.

See docs/MI300X_EVIDENCE.md for rocm-smi snapshots and memory breakdown.


License

MIT. See LICENSE.