---
marp: true
theme: uncover
paginate: true
backgroundColor: '#060A12'
color: '#E8EDF5'
style: |
  @import url('https://fonts.googleapis.com/css2?family=Inter:wght@300;400;600;700;900&family=JetBrains+Mono:wght@400;700&display=swap');
  section {
    font-family: 'Inter', sans-serif;
    font-size: 20px;
    background: #060A12;
    color: #E8EDF5;
    padding: 48px 56px;
    display: flex;
    flex-direction: column;
    justify-content: center;
  }
  h1 {
    font-size: 2.8em;
    font-weight: 900;
    letter-spacing: -1px;
    line-height: 1.1;
    margin-bottom: 12px;
  }
  h2 {
    font-size: 1.5em;
    font-weight: 700;
    letter-spacing: 3px;
    text-transform: uppercase;
    margin-bottom: 20px;
    padding-bottom: 10px;
    border-bottom: 2px solid rgba(255,255,255,0.1);
  }
  h3 {
    font-size: 1.1em;
    font-weight: 600;
    color: #00d4ff;
    letter-spacing: 1px;
    margin-bottom: 8px;
  }
  strong { color: #00D4FF; }
  em { color: #ff4560; font-style: normal; font-weight: 600; }
  code {
    font-family: 'JetBrains Mono', monospace;
    background: rgba(0,212,255,0.08);
    color: #00d4ff;
    padding: 2px 8px;
    border-radius: 4px;
    font-size: 0.85em;
    border: 1px solid rgba(0,212,255,0.2);
  }
  pre {
    background: rgba(0,0,0,0.4);
    border: 1px solid rgba(0,212,255,0.15);
    border-radius: 8px;
    padding: 18px 20px;
    font-size: 0.75em;
  }
  table {
    font-size: 0.8em;
    border-collapse: collapse;
    width: 100%;
    margin-top: 16px;
  }
  th {
    background: rgba(0,212,255,0.1);
    color: #00d4ff;
    padding: 10px 14px;
    text-align: left;
    font-weight: 700;
    letter-spacing: 1px;
    text-transform: uppercase;
    font-size: 0.8em;
  }
  td {
    padding: 9px 14px;
    border-bottom: 1px solid rgba(255,255,255,0.06);
  }
  tr:last-child td { border-bottom: none; }
  blockquote {
    border-left: 3px solid #ff4560;
    padding-left: 20px;
    color: #9BA3B8;
    font-style: italic;
    margin: 16px 0;
  }
  section::after {
    font-family: 'JetBrains Mono', monospace;
    font-size: 0.65em;
    color: rgba(255,255,255,0.2);
    content: attr(data-marpit-pagination) ' / ' attr(data-marpit-pagination-total);
  }
  .accent { color: #00D4FF; }
  .red { color: #FF4560; }
  .green { color: #00E396; }
  .yellow { color: #FFB703; }
  .dim { color: #5A6478; }
---
AMD Developer Hackathon 2026
# AtlasOps
Can 4 AI agents replace
an on-call SRE team?
Real GKE Cluster · GCP
AMD MI300X · 192 GB HBM3
SFT + Online GRPO Trained
Harikishanth R · Reshma Affrin F · Jehrome F | Da Big Three
---
## The Problem
**2:47 AM**
The average time a P1 alert fires. Your on-call engineer is asleep, or stressed and rushing.

**~25 min**
Average human MTTR for a cascade incident. Revenue bleeds the entire time.

**$250B**
Global observability and SRE market. On-call burnout is the industry's most expensive unsolved problem.

Every SRE team has a war story. The 3 AM page. The cascading failure nobody understood for 40 minutes. The postmortem that blamed "human error." The real failure was that there was no system to help them think faster.
---
## Introducing AtlasOps
Four specialized AI agents. One AMD MI300X. One real GKE cluster. No simulations.

| 🔴 Triage | 🔍 Diagnosis | 🔧 Remediation | 📣 Comms |
|---|---|---|---|
| Ack alert | PromQL queries | Argo CD rollback | Slack update |
| Classify severity | Jaeger traces | kubectl scale | Postmortem |
| Map blast radius | kubectl logs | Alert silence | Status page |
| <4 tool calls | Root cause ID | Verify fix | Action items |

Alert → Triage → Diagnosis → [Approval Gate] → Remediation → Comms → Postmortem
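
The flow above can be sketched as a fixed stage sequence with a gate in front of remediation. This is a minimal illustration, not the AtlasOps orchestrator: `Incident`, `run_pipeline`, and the `approve` callback are hypothetical names, and what happens after a denied approval is an assumption.

```python
from dataclasses import dataclass, field
from typing import Callable, List

# Stage order taken from the slide; the gate sits before remediation.
STAGES = ["triage", "diagnosis", "approval_gate", "remediation", "comms", "postmortem"]


@dataclass
class Incident:
    alert: str
    history: List[str] = field(default_factory=list)


def run_pipeline(incident: Incident, approve: Callable[[Incident], bool]) -> Incident:
    """Walk the stages in order; stop before remediation if approval is denied."""
    for stage in STAGES:
        if stage == "approval_gate":
            if not approve(incident):
                # Assumption: a denied approval halts the chain entirely.
                incident.history.append("approval_denied")
                break
            incident.history.append("approved")
            continue
        incident.history.append(stage)
    return incident
```

Calling `run_pipeline(Incident("HighErrorRate"), approve=lambda i: True)` records every stage in order, while a callback that returns `False` stops the run after diagnosis.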
---
## Real Infrastructure — Not a Simulation
☁ Google Cloud Platform
▸GKE Standard Cluster — us-central1, 3× e2-standard-4
▸Online Boutique — 11 real microservices (Go, Python, Node, Java, C#, gRPC)
▸Chaos Mesh — PodChaos · NetworkChaos · StressChaos · DNSChaos · IOChaos · TimeChaos
▸Prometheus + Grafana + Jaeger + OTel — full observability stack
▸Argo CD — GitOps rollbacks, real execution
▸Cloud SQL (Postgres 15) + Cloud PubSub + Cloud Monitoring
▸Alertmanager — webhook fires agents on real alerts
🛠 20 Real SRE Tools
kubectl (7 cmds)
promql_query
promql_range
jaeger_search
jaeger_get_trace
argocd_rollback
gcloud_logs_read
cloud_monitoring
alertmanager_silence
postmortem_draft
Every tool hits a real API. No mocks in production.
---
## 38 Chaos Scenarios + Infinite Adversarial Generation
| Tier | Count | Difficulty |
|---|---|---|
| Single-fault | 8 | Beginner |
| Cascade | 5 | Hard |
| Multi-fault | 5 | Expert |
| Named Replays | 10 | Expert |
| Dynamic Adversarial | ∞ | 72B-designed |

10 Named Historical Replays
⚡ Cloudflare 2019 — Regex CPU storm, 85% traffic down
⚡ GitHub 2018 — DB failover loop, 24h incident
⚡ AWS S3 2017 — Typo'd command cascaded globally
⚡ Discord 2022 — Redis thundering herd
⚡ Fastly 2021 — Bad VCL config, internet outage
⚡ Facebook BGP 2021 — Control plane partition
⚡ Knight Capital 2012 — Partial deploy, $440M loss
+ Datadog 2023 · Slack 2022 · Azure DNS 2019
Adversarial designer: After each benchmark run, the Qwen2.5-72B judge analyzes the agent's failure modes and generates brand-new Chaos Mesh YAML targeting those exact weaknesses. The test set gets harder as the model improves — impossible to memorize.
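
For a sense of what a generated scenario might look like, the sketch below builds a Chaos Mesh `PodChaos` manifest as a Python dict. It is an illustration only: the helper name, the scenario name, and the `app` label selector are assumptions, and the judge's real output format is not shown in this deck. JSON is valid YAML, so the serialized result can be applied as-is.

```python
import json


def pod_kill_manifest(name: str, namespace: str, app_label: str) -> dict:
    """Build a PodChaos manifest that kills one pod matching the label."""
    return {
        "apiVersion": "chaos-mesh.org/v1alpha1",
        "kind": "PodChaos",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "action": "pod-kill",
            "mode": "one",  # pick a single matching pod
            "selector": {
                "namespaces": [namespace],
                # Assumption: target services carry an `app` label.
                "labelSelectors": {"app": app_label},
            },
        },
    }


manifest = pod_kill_manifest("kill-currency", "default", "currencyservice")
manifest_json = json.dumps(manifest, indent=2)
```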
---
## Why AMD MI300X Was Non-Negotiable
Memory Requirements

| Component | Memory |
|---|---|
| Qwen2.5-7B base (shared) | ~4 GB |
| 4× LoRA adapters (r=16) | ~160 MB |
| Qwen2.5-72B judge (AWQ) | ~37 GB |
| GRPO training buffers | ~12 GB |
| vLLM KV cache | ~70 GB |
| **Total required** | **~126 GB** |

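
The budget arithmetic can be checked in a few lines. The figures are the approximate ones listed above; the variable names are illustrative. The itemized rows add up to roughly 123 GB, slightly under the stated ~126 GB total, and the slide does not say where the difference comes from.

```python
# Approximate figures from the slide, in GB.
budget_gb = {
    "Qwen2.5-7B base (shared)": 4.0,
    "4x LoRA adapters (r=16)": 0.16,
    "Qwen2.5-72B judge (AWQ)": 37.0,
    "GRPO training buffers": 12.0,
    "vLLM KV cache": 70.0,
}

MI300X_HBM_GB = 192
A100_HBM_GB = 80
STATED_TOTAL_GB = 126  # total as stated on the slide

itemized_gb = sum(budget_gb.values())        # sum of the listed rows
headroom_gb = MI300X_HBM_GB - STATED_TOTAL_GB  # free memory on the MI300X
fits_on_a100 = STATED_TOTAL_GB <= A100_HBM_GB
```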
GPU Comparison

| GPU | Verdict | Why |
|---|---|---|
| A100 (80 GB) | ❌ | Fits agents OR judge, not both simultaneously. Online GRPO impossible. |
| T4 (16 GB) | ❌ | Can't fit Qwen2.5-7B at all. CUDA OOM at model load. |
| MI300X (192 GB HBM3) | ✅ | All 5 models co-hosted. 66 GB free. 18× faster inference vs shared API. |

---
## Training Pipeline — SFT → Online GRPO
Phase 1: Supervised Fine-Tuning
2,028 real GKE trajectories · QLoRA 4-bit NF4 · LoRA r=16
loss: 1.265 → 0.48 → 0.19 → 0.027
accuracy: 71.96% → 99.10%
time: 14 min 16 sec
adapter: 78 MB LoRA
Model learned: correct tool-call sequence, promql before argocd rollback, postmortem structure
Phase 2: Online GRPO
60 steps · 236 real GKE rollout episodes · DAPO loss
step 01: mean=0.355 max=0.539
step 24: mean=0.376 max=0.700
step 31: mean=0.421 max=0.671 ← peak
step 60: mean=0.364 max=0.506
overall: mean=0.200 runtime=9h 34m
True online RL: every step = real chaos + real rollouts + real cluster scoring
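
As background on the per-step `mean` and `max` values, GRPO-style methods score each rollout relative to the other rollouts in its group. The sketch below shows only that group-relative advantage step; it does not reproduce the DAPO loss or the AtlasOps training code, and the function name is illustrative.

```python
from statistics import mean, pstdev
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Normalize each rollout's reward against its group's mean and std."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps keeps the division finite when every rollout scored the same.
    return [(r - mu) / (sigma + eps) for r in rewards]
```

Rollouts that beat the group mean get a positive advantage, those below it get a negative one, and a group with identical rewards contributes no signal.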
---
## What Makes Our Training Unique
| Feature | Standard GRPO | **AtlasOps** |
|---|---|---|
| Environment | Simulator / offline | **Real GKE cluster, live kubectl** |
| Loss function | GRPO | **DAPO** — stable on sparse rewards |
| Reward signal | Episode-level only | **Dense per-step** + episode contract |
| Curriculum | Random / fixed | **Spaced repetition** — mastery tracking |
| Scenario generation | Static | **∞ adversarial** — 72B judge writes new Chaos YAML |
| Judge | Single rubric | **3 personas** — Junior / Senior / Principal |

**−0.25** · false resolution penalty
**−0.20** · hallucinated evidence
**+0.08** · mutating action success
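
One way these three terms could combine with a base score is sketched below. The values come from the slide; the additive composition, the per-occurrence counting, and the function signature are assumptions rather than the actual reward code.

```python
# Shaping terms from the slide.
FALSE_RESOLUTION_PENALTY = -0.25
HALLUCINATED_EVIDENCE_PENALTY = -0.20
MUTATING_ACTION_BONUS = 0.08


def shaped_reward(
    base: float,
    *,
    false_resolution: bool = False,
    hallucinated_evidence: int = 0,
    mutating_successes: int = 0,
) -> float:
    """Add shaping terms to a base score (additive form is an assumption)."""
    reward = base
    if false_resolution:
        reward += FALSE_RESOLUTION_PENALTY
    reward += hallucinated_evidence * HALLUCINATED_EVIDENCE_PENALTY
    reward += mutating_successes * MUTATING_ACTION_BONUS
    return round(reward, 3)
```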
---
## Benchmark Results
28 frozen scenarios · Real GKE cluster · AMD MI300X · Qwen2.5-7B
82%
Resolution Rate
+28pp vs zero-shot
0.729
Avg Reward
72B judge-scored
59s
Avg MTTR
vs ~25 min human
78%
Cascade Rate
+38pp vs zero-shot
| Model | Resolution | Reward | Cascade | Named Replays | Unsafe Actions |
|---|---|---|---|---|---|
| Qwen2.5-7B zero-shot | 54% | 0.481 | 40% | 30% | 5 |
| AtlasOps SFT | 68% | 0.601 | 62% | 55% | 3 |
| **AtlasOps GRPO (MI300X)** | **82%** | **0.729** | **78%** | **72%** | **1** |
---
## Production Safety — No Agent Can Cause an Outage
### 🚦 Approval Gate
- P0: human required, no auto-execution
- P1: 60-second approval window
- P2/P3: fully automatic
- Token-based callbacks via REST API
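
The severity rules above map naturally onto a small lookup. This is a hedged sketch: `ApprovalMode` and `approval_mode` are hypothetical names, and treating unknown severities like P0 is an assumption. The slide does not say what happens when the P1 window lapses, so that case is left out.

```python
from enum import Enum


class ApprovalMode(Enum):
    HUMAN_REQUIRED = "human_required"
    TIMED_WINDOW = "timed_window"
    AUTOMATIC = "automatic"


P1_WINDOW_SECONDS = 60  # approval window for P1, from the slide


def approval_mode(severity: str) -> ApprovalMode:
    """Map an incident severity to how remediation gets approved."""
    sev = severity.strip().upper()
    if sev == "P1":
        return ApprovalMode.TIMED_WINDOW
    if sev in {"P2", "P3"}:
        return ApprovalMode.AUTOMATIC
    # P0, plus anything unrecognized (assumption: fail closed).
    return ApprovalMode.HUMAN_REQUIRED
```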
### ⚡ Circuit Breaker
- 50 tool calls per incident max
- 10 mutating actions per hour
- 3 consecutive failures → OPEN state
- Tripped 1× during GRPO training (working as designed)
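
The three limits above could be enforced by a small stateful guard like the one below. It is a sketch under assumptions: one breaker instance per incident, a rolling one-hour window for mutating actions, and no recovery path from OPEN. The class and method names are illustrative.

```python
import time
from collections import deque


class CircuitBreaker:
    MAX_TOOL_CALLS = 50            # per incident
    MAX_MUTATIONS_PER_HOUR = 10
    MAX_CONSECUTIVE_FAILURES = 3

    def __init__(self, clock=time.monotonic):
        self.clock = clock
        self.tool_calls = 0
        self.mutations = deque()   # timestamps of recent mutating actions
        self.consecutive_failures = 0
        self.state = "CLOSED"

    def allow(self, mutating: bool = False) -> bool:
        """Return True and count the call if it is within all limits."""
        if self.state == "OPEN" or self.tool_calls >= self.MAX_TOOL_CALLS:
            return False
        now = self.clock()
        # Drop mutating actions older than one hour.
        while self.mutations and now - self.mutations[0] >= 3600:
            self.mutations.popleft()
        if mutating and len(self.mutations) >= self.MAX_MUTATIONS_PER_HOUR:
            return False
        self.tool_calls += 1
        if mutating:
            self.mutations.append(now)
        return True

    def record(self, success: bool) -> None:
        """Track outcomes; open the breaker after three failures in a row."""
        self.consecutive_failures = 0 if success else self.consecutive_failures + 1
        if self.consecutive_failures >= self.MAX_CONSECUTIVE_FAILURES:
            self.state = "OPEN"
```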
### 🔗 Incident Correlator
- 5-minute deduplication window
- Fingerprint-based alert grouping
- Prevents 10 parallel chains on one cascade
- Tracks all active incidents
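
Fingerprint-based deduplication can be sketched as follows. The label set used for the fingerprint (`alertname`, `namespace`) and the sliding-window behavior are assumptions; the slide only specifies the 5-minute window and fingerprint grouping.

```python
import hashlib
from typing import Dict

DEDUP_WINDOW_SECONDS = 300  # 5-minute window from the slide


def fingerprint(alert: dict) -> str:
    """Hash a few stable labels (hypothetical choice of labels)."""
    key = "|".join(str(alert.get(k, "")) for k in ("alertname", "namespace"))
    return hashlib.sha256(key.encode()).hexdigest()[:16]


class IncidentCorrelator:
    def __init__(self) -> None:
        self.last_seen: Dict[str, float] = {}  # fingerprint -> last alert time

    def is_new_incident(self, alert: dict, now: float) -> bool:
        """True if no alert with this fingerprint arrived within the window."""
        fp = fingerprint(alert)
        last = self.last_seen.get(fp)
        self.last_seen[fp] = now  # sliding window: each alert extends it
        return last is None or now - last > DEDUP_WINDOW_SECONDS
```

Only alerts for which `is_new_incident` returns `True` would start a new agent chain; repeats inside the window attach to the existing incident.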
### 📋 HMAC Audit Log
- Hash-chained entries, tamper-evident
- Every tool call + approval logged
- `verify_integrity()` checks full chain
- Cryptographic proof of what happened
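
A hash-chained HMAC log can be sketched with the standard library alone. Only the `verify_integrity()` name comes from the slide; the `AuditLog` class, the entry layout, and the key handling are assumptions for illustration.

```python
import hashlib
import hmac
import json
from typing import List


class AuditLog:
    def __init__(self, key: bytes) -> None:
        self.key = key
        self.entries: List[dict] = []

    def _mac(self, prev_mac: str, payload: dict) -> str:
        # Each MAC covers the previous MAC, which chains the entries together.
        msg = prev_mac.encode() + json.dumps(payload, sort_keys=True).encode()
        return hmac.new(self.key, msg, hashlib.sha256).hexdigest()

    def append(self, payload: dict) -> None:
        prev = self.entries[-1]["mac"] if self.entries else ""
        self.entries.append({"payload": payload, "mac": self._mac(prev, payload)})

    def verify_integrity(self) -> bool:
        """Recompute the whole chain; any edited or reordered entry fails."""
        prev = ""
        for entry in self.entries:
            expected = self._mac(prev, entry["payload"])
            if not hmac.compare_digest(entry["mac"], expected):
                return False
            prev = entry["mac"]
        return True
```

Because each MAC depends on the one before it, changing an early entry invalidates every later check unless the attacker also holds the key.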
---
## Cloudflare 2019 — Replay Postmortem
What happened when we ran AtlasOps against a real recreation of the incident that took down 85% of Cloudflare's traffic

| Time | Agent | Action |
|---|---|---|
| 00:03 | Triage | PagerDuty ACK · severity P1 · blast: frontend + checkout + cart |
| 00:08 | Diagnosis | promql → 5xx surge on checkoutservice (error_rate: 34%) |
| 00:10 | Diagnosis | jaeger → timeout chain ends at currencyservice (CPU at 1999m/2000m) |
| 00:13 | Remediation | argocd rollback currencyservice → revision 3 ✓ |
| 00:18 | Remediation | promql confirms error_rate < 0.1% · RESOLVED |
| 00:22 | Comms | slack posted · statuspage updated |
| 00:24 | Comms | postmortem saved → docs/postmortems/cloudflare-2019-replay.md |

3
Tool calls to root cause
---
## Tech Stack
AMD Hardware
MI300X — 192 GB HBM3
ROCm 7.2
vLLM 0.17.1 (ROCm build)
18× speedup vs shared API
312ms p50 inference latency
5 models co-hosted simultaneously
ML Training
Qwen2.5-7B-Instruct × 4
Qwen2.5-72B-Instruct-AWQ (judge)
TRL 1.4.0 — SFTTrainer + GRPOTrainer
PEFT QLoRA — 4-bit NF4, r=16
BitsAndBytes-ROCm
HF Optimum-AMD (inference)
Application
FastAPI + custom SSE streaming
Chaos Mesh (6 fault types)
Prometheus + Grafana + Jaeger
Argo CD GitOps
GKE Standard · Cloud SQL
Docker · HuggingFace Spaces
---
AMD Developer Hackathon 2026
AtlasOps
Real GKE cluster · Real training · Real results
54% → 82% resolution rate. Zero simulations.
GitHub
Harikishanth/AtlasOps
HF Space
lablab-ai-amd/atlasops
Harikishanth R · Reshma Affrin F · Jehrome F