---
marp: true
theme: uncover
paginate: true
backgroundColor: '#060A12'
color: '#E8EDF5'
style: |
  @import url('https://fonts.googleapis.com/css2?family=Inter:wght@300;400;600;700;900&family=JetBrains+Mono:wght@400;700&display=swap');
  section {
    font-family: 'Inter', sans-serif;
    font-size: 20px;
    background: #060A12;
    color: #E8EDF5;
    padding: 48px 56px;
    display: flex;
    flex-direction: column;
    justify-content: center;
  }
  h1 {
    font-size: 2.8em;
    font-weight: 900;
    letter-spacing: -1px;
    line-height: 1.1;
    margin-bottom: 12px;
  }
  h2 {
    font-size: 1.5em;
    font-weight: 700;
    letter-spacing: 3px;
    text-transform: uppercase;
    margin-bottom: 20px;
    padding-bottom: 10px;
    border-bottom: 2px solid rgba(255,255,255,0.1);
  }
  h3 {
    font-size: 1.1em;
    font-weight: 600;
    color: #00d4ff;
    letter-spacing: 1px;
    margin-bottom: 8px;
  }
  strong { color: #00D4FF; }
  em {
    color: #ff4560;
    font-style: normal;
    font-weight: 600;
  }
  code {
    font-family: 'JetBrains Mono', monospace;
    background: rgba(0,212,255,0.08);
    color: #00d4ff;
    padding: 2px 8px;
    border-radius: 4px;
    font-size: 0.85em;
    border: 1px solid rgba(0,212,255,0.2);
  }
  pre {
    background: rgba(0,0,0,0.4);
    border: 1px solid rgba(0,212,255,0.15);
    border-radius: 8px;
    padding: 18px 20px;
    font-size: 0.75em;
  }
  table {
    font-size: 0.8em;
    border-collapse: collapse;
    width: 100%;
    margin-top: 16px;
  }
  th {
    background: rgba(0,212,255,0.1);
    color: #00d4ff;
    padding: 10px 14px;
    text-align: left;
    font-weight: 700;
    letter-spacing: 1px;
    text-transform: uppercase;
    font-size: 0.8em;
  }
  td {
    padding: 9px 14px;
    border-bottom: 1px solid rgba(255,255,255,0.06);
  }
  tr:last-child td { border-bottom: none; }
  blockquote {
    border-left: 3px solid #ff4560;
    padding-left: 20px;
    color: #9BA3B8;
    font-style: italic;
    margin: 16px 0;
  }
  section::after {
    font-family: 'JetBrains Mono', monospace;
    font-size: 0.65em;
    color: rgba(255,255,255,0.2);
    content: attr(data-marpit-pagination) ' / ' attr(data-marpit-pagination-total);
  }
  .accent { color: #00D4FF; }
  .red { color: #FF4560; }
  .green { color: #00E396; }
  .yellow { color: #FFB703; }
  .dim { color: #5A6478; }
---
AMD Developer Hackathon 2026
# AtlasOps
Can 4 AI agents replace
an on-call SRE team?
Real GKE Cluster · GCP
AMD MI300X · 192 GB HBM3
SFT + Online GRPO Trained
Harikishanth R · Reshma Affrin F · Jehrome F  |  Da Big Three
---

## The Problem
2:47 AM
The average time a P1 alert fires. Your on-call engineer is asleep, or stressed and rushing.
~25 min
Average human MTTR for a cascade incident. Revenue bleeding the entire time.
$250B
Global observability + SRE market. On-call burnout is the industry's most expensive unsolved problem.
Every SRE team has a war story. The 3 AM page. The cascading failure nobody understood for 40 minutes. The postmortem that blamed "human error." The real failure was that there was no system to help them think faster.
---

## Introducing AtlasOps
Four specialized AI agents. One AMD MI300X. One real GKE cluster. No simulations.
🔴
TRIAGE
Ack alert
Classify severity
Map blast radius
<4 tool calls
🔍
DIAGNOSIS
PromQL queries
Jaeger traces
kubectl logs
Root cause ID
🔧
REMEDIATION
Argo CD rollback
kubectl scale
Alert silence
Verify fix
📣
COMMS
Slack update
Postmortem
Status page
Action items
Alert → Triage → Diagnosis → [Approval Gate] → Remediation → Comms → Postmortem
---

## Real Infrastructure — Not a Simulation

☁ Google Cloud Platform

GKE Standard Cluster — us-central1, 3× e2-standard-4
Online Boutique — 11 real microservices (Go, Python, Node, Java, C#, gRPC)
Chaos Mesh — PodChaos · NetworkChaos · StressChaos · DNSChaos · IOChaos · TimeChaos
Prometheus + Grafana + Jaeger + OTel — full observability stack
Argo CD — GitOps rollbacks, real execution
Cloud SQL (Postgres 15) + Cloud PubSub + Cloud Monitoring
Alertmanager — webhook fires agents on real alerts

🛠 20 Real SRE Tools

kubectl (7 cmds)
promql_query
promql_range
jaeger_search
jaeger_get_trace
argocd_rollback
gcloud_logs_read
cloud_monitoring
alertmanager_silence
postmortem_draft
Every tool hits a real API. No mocks in production.
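
As an illustration of what a tool like `promql_query` might do under the hood, here is a minimal sketch that builds a request URL for the Prometheus instant-query endpoint (`/api/v1/query`). The base address, the metric name, and the helper name `build_instant_query_url` are assumptions for the example, not taken from the AtlasOps code.

```python
from urllib.parse import urlencode


def build_instant_query_url(base_url: str, promql: str) -> str:
    """Return the Prometheus instant-query URL for a PromQL expression."""
    query_string = urlencode({"query": promql})
    return f"{base_url.rstrip('/')}/api/v1/query?{query_string}"


# Hypothetical in-cluster address and metric name, for illustration only.
PROMETHEUS_URL = "http://prometheus.monitoring.svc:9090"
EXPRESSION = 'sum(rate(http_requests_total{code=~"5.."}[5m]))'

url = build_instant_query_url(PROMETHEUS_URL, EXPRESSION)
print(url)
```

The sketch only constructs the URL; an actual tool would send the request and parse the JSON response.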
---

## 38 Chaos Scenarios + Infinite Adversarial Generation

| Tier | Count | Difficulty |
|---|---|---|
| Single-fault | 8 | Beginner |
| Cascade | 5 | Hard |
| Multi-fault | 5 | Expert |
| Named Replays | 10 | Expert |
| Dynamic Adversarial | | 72B-designed |

10 Named Historical Replays

Cloudflare 2019 — Regex CPU storm, 85% traffic down
GitHub 2018 — DB failover loop, 24h incident
AWS S3 2017 — Typo'd command cascaded globally
Discord 2022 — Redis thundering herd
Fastly 2021 — Bad VCL config, internet outage
Facebook BGP 2021 — Control plane partition
Knight Capital 2012 — Partial deploy, $440M loss
+ Datadog 2023 · Slack 2022 · Azure DNS 2019
Adversarial designer: After each benchmark run, the Qwen2.5-72B judge analyzes the agent's failure modes and generates brand-new Chaos Mesh YAML targeting those exact weaknesses. The test set gets harder as the model improves — impossible to memorize.
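
To make "generates Chaos Mesh YAML" concrete, the sketch below builds a minimal `PodChaos` manifest as a Python dict and prints it as JSON, which `kubectl apply -f` also accepts. The experiment name, namespace, and `app` label are assumptions; the manifests the judge actually writes are not shown in this deck.

```python
import json


def pod_kill_manifest(name: str, namespace: str, app_label: str) -> dict:
    """Build a Chaos Mesh PodChaos manifest that kills one matching pod."""
    return {
        "apiVersion": "chaos-mesh.org/v1alpha1",
        "kind": "PodChaos",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "action": "pod-kill",
            "mode": "one",  # pick a single pod from the selected set
            "selector": {
                "namespaces": [namespace],
                "labelSelectors": {"app": app_label},
            },
        },
    }


# Hypothetical target: one cartservice pod in the default namespace.
manifest = pod_kill_manifest("kill-cartservice", "default", "cartservice")
print(json.dumps(manifest, indent=2))
```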
---

## Why AMD MI300X Was Non-Negotiable

Memory Requirements

Qwen2.5-7B base (shared) ~4 GB
4× LoRA adapters (r=16) ~160 MB
Qwen2.5-72B judge (AWQ) ~37 GB
GRPO training buffers ~12 GB
vLLM KV cache ~70 GB
Total required ~126 GB
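
The budget above can be checked with a few lines of arithmetic. This is a sketch using the approximate figures from the slide; the itemized rows sum to about 123 GB, so the stated total of about 126 GB presumably includes rounding or overhead that is not itemized (an assumption).

```python
# Approximate figures copied from the slide, in GB.
components_gb = {
    "Qwen2.5-7B base (shared)": 4.0,
    "4x LoRA adapters (r=16)": 0.16,
    "Qwen2.5-72B judge (AWQ)": 37.0,
    "GRPO training buffers": 12.0,
    "vLLM KV cache": 70.0,
}

STATED_TOTAL_GB = 126.0   # "Total required" on the slide
MI300X_GB = 192.0
A100_GB = 80.0

itemized_total = sum(components_gb.values())
headroom = MI300X_GB - STATED_TOTAL_GB

print(f"itemized total: {itemized_total:.2f} GB")
print(f"headroom on MI300X: {headroom:.0f} GB")
print(f"fits on one A100: {STATED_TOTAL_GB <= A100_GB}")
```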

GPU Comparison

A100 (80 GB) ❌
Fits agents OR judge — not both simultaneously. Online GRPO impossible.
T4 (16 GB) ❌
Can't fit Qwen2.5-7B at all. CUDA OOM at model load.
MI300X 192 GB HBM3 ✅
All 5 models co-hosted. 66 GB free. 18× faster inference vs shared API.
---

## Training Pipeline — SFT → Online GRPO

Phase 1: Supervised Fine-Tuning

2,028 real GKE trajectories · QLoRA 4-bit NF4 · LoRA r=16
loss: 1.265 → 0.48 → 0.19 → 0.027
accuracy: 71.96% → 99.10%
time: 14 min 16 sec
adapter: 78 MB LoRA
Model learned: correct tool-call sequence, promql before argocd rollback, postmortem structure

Phase 2: Online GRPO

60 steps · 236 real GKE rollout episodes · DAPO loss
step 01: mean=0.355 max=0.539
step 24: mean=0.376 max=0.700
step 31: mean=0.421 max=0.671 ← peak
step 60: mean=0.364 max=0.506
overall: mean=0.200 runtime=9h 34m
True online RL: every step = real chaos + real rollouts + real cluster scoring
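
For readers new to GRPO, the core idea is a group-relative advantage: several rollouts of the same scenario are scored, and each reward is normalized against its group's mean and standard deviation. The sketch below shows only that step, with made-up rewards; it is not the AtlasOps training code, and the DAPO loss named above changes how this advantage enters the loss, which is not shown.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each rollout's reward against its own group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; some implementations use sample std
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for four rollouts of one chaos scenario.
rollout_rewards = [0.2, 0.4, 0.6, 0.8]
advantages = group_relative_advantages(rollout_rewards)
print([round(a, 3) for a in advantages])
```

Rollouts that beat their group average get a positive advantage and the rest get a negative one, so no separate value model is needed.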
---

## What Makes Our Training Unique

| Feature | Standard GRPO | **AtlasOps** |
|---|---|---|
| Environment | Simulator / offline | **Real GKE cluster, live kubectl** |
| Loss function | GRPO | **DAPO** — stable on sparse rewards |
| Reward signal | Episode-level only | **Dense per-step** + episode contract |
| Curriculum | Random / fixed | **Spaced repetition** — mastery tracking |
| Scenario generation | Static | **∞ adversarial** — 72B judge writes new Chaos YAML |
| Judge | Single rubric | **3 personas** — Junior / Senior / Principal |
−0.25
false resolution penalty
−0.20
hallucinated evidence
+0.15
red herring bonus
+0.08
mutating action success
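
A minimal sketch of how the four shaping values above could be combined into one reward adjustment. The constants come from the slide; the `EpisodeFlags` fields, the per-occurrence counting, and the reading of "red herring bonus" as a reward for dismissing a red herring are assumptions.

```python
from dataclasses import dataclass

FALSE_RESOLUTION_PENALTY = -0.25
HALLUCINATED_EVIDENCE_PENALTY = -0.20
RED_HERRING_BONUS = 0.15
MUTATING_ACTION_BONUS = 0.08


@dataclass
class EpisodeFlags:
    """Hypothetical per-episode observations used for shaping."""
    false_resolution: bool = False
    hallucinated_claims: int = 0
    red_herring_dismissed: bool = False
    successful_mutations: int = 0


def shaping_term(flags: EpisodeFlags) -> float:
    """Sum the penalties and bonuses that apply to one episode."""
    total = 0.0
    if flags.false_resolution:
        total += FALSE_RESOLUTION_PENALTY
    total += HALLUCINATED_EVIDENCE_PENALTY * flags.hallucinated_claims
    if flags.red_herring_dismissed:
        total += RED_HERRING_BONUS
    total += MUTATING_ACTION_BONUS * flags.successful_mutations
    return round(total, 4)


good = shaping_term(EpisodeFlags(red_herring_dismissed=True, successful_mutations=1))
bad = shaping_term(EpisodeFlags(false_resolution=True, hallucinated_claims=1))
print(good, bad)
```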
---

## Benchmark Results
28 frozen scenarios · Real GKE cluster · AMD MI300X · Qwen2.5-7B
82%
Resolution Rate
+28pp vs zero-shot
0.729
Avg Reward
72B judge-scored
59s
Avg MTTR
vs ~25 min human
78%
Cascade Rate
+38pp vs zero-shot
| Model | Resolution | Reward | Cascade | Named Replays | Unsafe Actions |
|---|---|---|---|---|---|
| Qwen2.5-7B zero-shot | 54% | 0.481 | 40% | 30% | 5 |
| AtlasOps SFT | 68% | 0.601 | 62% | 55% | 3 |
| **AtlasOps GRPO (MI300X)** | **82%** | **0.729** | **78%** | **72%** | **1** |

---

## Production Safety — No Agent Can Cause an Outage
🚦
APPROVAL GATE
P0: Human required — no auto-execution
P1: 60-second approval window
P2/P3: Fully automatic
Token-based callbacks via REST API
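
The severity rules above amount to a small policy table. The sketch below is one way to express it; the names `GateDecision` and `approval_policy` are hypothetical, and what happens when the 60-second window lapses is not stated on the slide, so it is left to the caller.

```python
from enum import Enum
from typing import Optional, Tuple


class GateDecision(Enum):
    HUMAN_REQUIRED = "human_required"
    APPROVAL_WINDOW = "approval_window"
    AUTO_EXECUTE = "auto_execute"


def approval_policy(severity: str) -> Tuple[GateDecision, Optional[int]]:
    """Map an incident severity to a gate decision and window in seconds."""
    sev = severity.upper()
    if sev == "P0":
        return GateDecision.HUMAN_REQUIRED, None
    if sev == "P1":
        return GateDecision.APPROVAL_WINDOW, 60
    if sev in ("P2", "P3"):
        return GateDecision.AUTO_EXECUTE, None
    # Unknown severities take the most conservative path (an assumption).
    return GateDecision.HUMAN_REQUIRED, None


for level in ("P0", "P1", "P2", "P3"):
    decision, window = approval_policy(level)
    print(level, decision.value, window)
```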
CIRCUIT BREAKER
50 tool calls per incident max
10 mutating actions per hour
3 consecutive failures → OPEN state
Tripped 1× during GRPO training (working as designed)
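
A simplified sketch of the breaker logic, covering the per-incident call cap and the three-consecutive-failures rule. The class and method names are assumptions, and the hourly limit on mutating actions is omitted to keep the example short.

```python
class CircuitBreaker:
    """Stops an incident's tool loop after too many calls or repeated failures."""

    def __init__(self, max_tool_calls: int = 50, max_consecutive_failures: int = 3):
        self.max_tool_calls = max_tool_calls
        self.max_consecutive_failures = max_consecutive_failures
        self.tool_calls = 0
        self.consecutive_failures = 0
        self.state = "CLOSED"

    def allow(self) -> bool:
        """Return True if another tool call may run for this incident."""
        return self.state == "CLOSED" and self.tool_calls < self.max_tool_calls

    def record(self, success: bool) -> None:
        """Record one finished tool call and trip the breaker if needed."""
        self.tool_calls += 1
        self.consecutive_failures = 0 if success else self.consecutive_failures + 1
        if self.consecutive_failures >= self.max_consecutive_failures:
            self.state = "OPEN"


breaker = CircuitBreaker()
for outcome in (True, False, False, False):
    if breaker.allow():
        breaker.record(outcome)
print(breaker.state)  # OPEN after three consecutive failures
```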
🔗
INCIDENT CORRELATOR
5-minute deduplication window
Fingerprint-based alert grouping
Prevents 10 parallel chains on one cascade
Tracks all active incidents
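
Fingerprint-based deduplication can be sketched as follows. Which labels feed the fingerprint, and whether the 5-minute window is anchored at the first alert (as here) or slides with each repeat, are assumptions rather than documented AtlasOps behavior.

```python
import hashlib

WINDOW_SECONDS = 300  # the 5-minute deduplication window


def fingerprint(labels: dict) -> str:
    """Hash the sorted alert labels into a short, stable identifier."""
    canonical = "|".join(f"{key}={labels[key]}" for key in sorted(labels))
    return hashlib.sha256(canonical.encode()).hexdigest()[:16]


class IncidentCorrelator:
    def __init__(self):
        self._window_start = {}  # fingerprint -> time the current window opened

    def is_duplicate(self, labels: dict, now: float) -> bool:
        """Return True if the same alert already opened a window that is still active."""
        fp = fingerprint(labels)
        started = self._window_start.get(fp)
        if started is not None and now - started < WINDOW_SECONDS:
            return True
        self._window_start[fp] = now
        return False


correlator = IncidentCorrelator()
alert = {"alertname": "HighErrorRate", "service": "checkoutservice"}  # hypothetical labels
results = [correlator.is_duplicate(alert, t) for t in (0, 120, 299, 300)]
print(results)
```

With timestamps in seconds, the first alert opens a window, the repeats at 120 s and 299 s are folded into it, and the alert at 300 s starts a new one.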
📋
HMAC AUDIT LOG
Hash-chained entries — tamper-evident
Every tool call + approval logged
`verify_integrity()` checks full chain
Cryptographic proof of what happened
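
One common way to build a hash-chained, tamper-evident log is to have each entry's HMAC cover the previous entry's HMAC plus the new payload. The sketch below follows that pattern using Python's standard `hmac` module; only the name `verify_integrity()` comes from the slide, and the entry layout, key handling, and genesis value are assumptions.

```python
import hashlib
import hmac
import json

GENESIS = "0" * 64  # placeholder "previous MAC" for the first entry


class AuditLog:
    def __init__(self, key: bytes):
        self._key = key
        self._entries = []

    def _mac(self, prev_mac: str, payload: dict) -> str:
        """HMAC over the previous MAC and the canonical JSON payload."""
        message = prev_mac.encode() + json.dumps(payload, sort_keys=True).encode()
        return hmac.new(self._key, message, hashlib.sha256).hexdigest()

    def append(self, payload: dict) -> None:
        prev = self._entries[-1]["mac"] if self._entries else GENESIS
        self._entries.append({"payload": payload, "mac": self._mac(prev, payload)})

    def verify_integrity(self) -> bool:
        """Recompute every MAC in order; any edited entry breaks the chain."""
        prev = GENESIS
        for entry in self._entries:
            expected = self._mac(prev, entry["payload"])
            if not hmac.compare_digest(expected, entry["mac"]):
                return False
            prev = entry["mac"]
        return True


log = AuditLog(key=b"demo-key-not-for-production")
log.append({"tool": "promql_query", "actor": "diagnosis"})
log.append({"tool": "argocd_rollback", "actor": "remediation"})
print(log.verify_integrity())
```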
---

## Cloudflare 2019 — Replay Postmortem
What happened when we ran AtlasOps against a real recreation of the incident that took down 85% of Cloudflare's traffic
00:03  TRIAGE   PagerDuty ACK · severity P1 · blast: frontend + checkout + cart
00:08  DIAGNOSIS   promql → 5xx surge on checkoutservice (error_rate: 34%)
00:10  DIAGNOSIS   jaeger → timeout chain ends at currencyservice (CPU at 1999m/2000m)
00:13  REMEDIATION   argocd rollback currencyservice → revision 3 ✓
00:18  REMEDIATION   promql confirms error_rate < 0.1% · RESOLVED
00:22  COMMS   slack posted · statuspage updated
00:24  COMMS   postmortem saved → docs/postmortems/cloudflare-2019-replay.md
4m 12s
Total MTTR
3
Tool calls to root cause
0.856
Judge score
---

## Tech Stack

AMD Hardware

MI300X — 192 GB HBM3
ROCm 7.2
vLLM 0.17.1 (ROCm build)
18× speedup vs shared API
312ms p50 inference latency
5 models co-hosted simultaneously

ML Training

Qwen2.5-7B-Instruct × 4
Qwen2.5-72B-Instruct-AWQ (judge)
TRL 1.4.0 — SFTTrainer + GRPOTrainer
PEFT QLoRA — 4-bit NF4, r=16
BitsAndBytes-ROCm
HF Optimum-AMD (inference)

Application

FastAPI + custom SSE streaming
Chaos Mesh (6 fault types)
Prometheus + Grafana + Jaeger
Argo CD GitOps
GKE Standard · Cloud SQL
Docker · HuggingFace Spaces
---
AMD Developer Hackathon 2026

AtlasOps

Real GKE cluster · Real training · Real results
54% → 82% resolution rate. Zero simulations.
GitHub
Harikishanth/AtlasOps
HF Space
lablab-ai-amd/atlasops
Team
Da Big Three
Harikishanth R  ·  Reshma Affrin F  ·  Jehrome F