---
marp: true
theme: uncover
paginate: true
backgroundColor: '#060A12'
color: '#E8EDF5'
style: |
  @import url('https://fonts.googleapis.com/css2?family=Inter:wght@300;400;600;700;900&family=JetBrains+Mono:wght@400;700&display=swap');
  section { font-family: 'Inter', sans-serif; font-size: 20px; background: #060A12; color: #E8EDF5; padding: 48px 56px; display: flex; flex-direction: column; justify-content: center; }
  h1 { font-size: 2.8em; font-weight: 900; letter-spacing: -1px; line-height: 1.1; margin-bottom: 12px; }
  h2 { font-size: 1.5em; font-weight: 700; letter-spacing: 3px; text-transform: uppercase; margin-bottom: 20px; padding-bottom: 10px; border-bottom: 2px solid rgba(255,255,255,0.1); }
  h3 { font-size: 1.1em; font-weight: 600; color: #00d4ff; letter-spacing: 1px; margin-bottom: 8px; }
  strong { color: #00D4FF; }
  em { color: #ff4560; font-style: normal; font-weight: 600; }
  code { font-family: 'JetBrains Mono', monospace; background: rgba(0,212,255,0.08); color: #00d4ff; padding: 2px 8px; border-radius: 4px; font-size: 0.85em; border: 1px solid rgba(0,212,255,0.2); }
  pre { background: rgba(0,0,0,0.4); border: 1px solid rgba(0,212,255,0.15); border-radius: 8px; padding: 18px 20px; font-size: 0.75em; }
  table { font-size: 0.8em; border-collapse: collapse; width: 100%; margin-top: 16px; }
  th { background: rgba(0,212,255,0.1); color: #00d4ff; padding: 10px 14px; text-align: left; font-weight: 700; letter-spacing: 1px; text-transform: uppercase; font-size: 0.8em; }
  td { padding: 9px 14px; border-bottom: 1px solid rgba(255,255,255,0.06); }
  tr:last-child td { border-bottom: none; }
  blockquote { border-left: 3px solid #ff4560; padding-left: 20px; color: #9BA3B8; font-style: italic; margin: 16px 0; }
  section::after { font-family: 'JetBrains Mono', monospace; font-size: 0.65em; color: rgba(255,255,255,0.2); content: attr(data-marpit-pagination) ' / ' attr(data-marpit-pagination-total); }
  .accent { color: #00D4FF; }
  .red { color: #FF4560; }
  .green { color: #00E396; }
  .yellow { color: #FFB703; }
  .dim { color: #5A6478; }
---

AMD Developer Hackathon 2026

# AtlasOps

Can 4 AI agents replace
an on-call SRE team?

Real GKE Cluster · GCP

AMD MI300X · 192 GB HBM3

SFT + Online GRPO Trained

Harikishanth R · Reshma Affrin F · Jehrome F | Da Big Three

---

## The Problem

2:47 AM

The average time a P1 alert fires. Your on-call engineer is asleep, or awake but stressed and rushing.

~25 min

Average human MTTR for a cascade incident. Revenue bleeding the entire time.

$250B

Global observability + SRE market. On-call burnout is the industry's most expensive unsolved problem.

Every SRE team has a war story. The 3 AM page. The cascading failure nobody understood for 40 minutes. The postmortem that blamed "human error." The real failure was that there was no system to help them think faster.

---

## Introducing AtlasOps

Four specialized AI agents. One AMD MI300X. One real GKE cluster. No simulations.

🔴

TRIAGE

Ack alert
Classify severity
Map blast radius
<4 tool calls

🔍

DIAGNOSIS

PromQL queries
Jaeger traces
kubectl logs
Root cause ID

🔧

REMEDIATION

Argo CD rollback
kubectl scale
Alert silence
Verify fix

📣

COMMS

Slack update
Postmortem
Status page
Action items

Alert → Triage → Diagnosis → [Approval Gate] → Remediation → Comms → Postmortem

---

## Real Infrastructure: Not a Simulation

☁ Google Cloud Platform

▸GKE Standard Cluster — us-central1, 3× e2-standard-4

▸Online Boutique — 11 real microservices (Go, Python, Node, Java, C#, gRPC)

▸Chaos Mesh — PodChaos · NetworkChaos · StressChaos · DNSChaos · IOChaos · TimeChaos

▸Prometheus + Grafana + Jaeger + OTel — full observability stack

▸Argo CD — GitOps rollbacks, real execution

▸Cloud SQL (Postgres 15) + Cloud PubSub + Cloud Monitoring

▸Alertmanager — webhook fires agents on real alerts

🛠 20 Real SRE Tools

kubectl (7 cmds)

promql_query

promql_range

jaeger_search

jaeger_get_trace

argocd_rollback

gcloud_logs_read

cloud_monitoring

alertmanager_silence

postmortem_draft

Every tool hits a real API. No mocks in production.
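
As a rough illustration of what a tool like `promql_query` might do, the sketch below builds an instant-query request for the Prometheus HTTP API (`/api/v1/query`) and parses its documented JSON response shape. The base URL, helper names, and sample labels are assumptions for illustration, not the AtlasOps implementation.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Hypothetical in-cluster address; the deck does not give the real one.
PROMETHEUS_URL = "http://prometheus.monitoring.svc:9090"


def build_query_url(expr: str, base_url: str = PROMETHEUS_URL) -> str:
    """Return the instant-query URL for a PromQL expression."""
    return f"{base_url}/api/v1/query?{urlencode({'query': expr})}"


def parse_instant_vector(payload: dict) -> dict:
    """Map each series' sorted label tuple to its float value (vector results only)."""
    if payload.get("status") != "success":
        raise RuntimeError(f"query failed: {payload.get('error', 'unknown error')}")
    data = payload["data"]
    if data["resultType"] != "vector":
        raise ValueError(f"expected vector, got {data['resultType']}")
    out = {}
    for series in data["result"]:
        labels = tuple(sorted(series["metric"].items()))
        _timestamp, value = series["value"]  # Prometheus sends the value as a string
        out[labels] = float(value)
    return out


def promql_query(expr: str, timeout: float = 10.0) -> dict:
    """Run the query against a live Prometheus (requires network access)."""
    with urlopen(build_query_url(expr), timeout=timeout) as resp:
        return parse_instant_vector(json.load(resp))
```

Keeping URL construction and response parsing separate from the network call lets both be checked without a cluster.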

---

## 38 Chaos Scenarios + Infinite Adversarial Generation

| Tier | Count | Difficulty |
|---|---|---|
| Single-fault | 8 | Beginner |
| Cascade | 5 | Hard |
| Multi-fault | 5 | Expert |
| Named Replays | 10 | Expert |
| Dynamic Adversarial | ∞ | 72B-designed |

10 Named Historical Replays

⚡ Cloudflare 2019 — Regex CPU storm, 85% traffic down

⚡ GitHub 2018 — DB failover loop, 24h incident

⚡ AWS S3 2017 — Typo'd command cascaded globally

⚡ Discord 2022 — Redis thundering herd

⚡ Fastly 2021 — Bad VCL config, internet outage

⚡ Facebook BGP 2021 — Control plane partition

⚡ Knight Capital 2012 — Partial deploy, $440M loss

Plus Datadog 2023 · Slack 2022 · Azure DNS 2019

Adversarial designer: After each benchmark run, the Qwen2.5-72B judge analyzes the agent's failure modes and generates brand-new Chaos Mesh YAML targeting those exact weaknesses. The test set gets harder as the model improves — impossible to memorize.
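
The deck does not show how the judge emits its scenarios, so the following is only a sketch of the kind of object it would need to produce: a Chaos Mesh `StressChaos` resource built as a plain dict. The function name and argument values are hypothetical; the output is rendered as JSON, which `kubectl apply -f` accepts alongside YAML.

```python
import json


def stress_chaos_manifest(name: str, namespace: str, app_label: str,
                          cpu_workers: int, cpu_load: int, duration: str) -> dict:
    """Build a StressChaos object that loads CPU on one pod matching app=<app_label>."""
    if not 0 <= cpu_load <= 100:
        raise ValueError("cpu_load is a percentage between 0 and 100")
    return {
        "apiVersion": "chaos-mesh.org/v1alpha1",
        "kind": "StressChaos",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "mode": "one",  # pick a single matching pod
            "selector": {
                "namespaces": [namespace],
                "labelSelectors": {"app": app_label},
            },
            "stressors": {"cpu": {"workers": cpu_workers, "load": cpu_load}},
            "duration": duration,
        },
    }


if __name__ == "__main__":
    # Illustrative values only: a CPU storm on one service for two minutes.
    manifest = stress_chaos_manifest("cpu-storm", "default", "currencyservice", 2, 90, "120s")
    print(json.dumps(manifest, indent=2))
```

Validating generated manifests in code before applying them is one way to keep a model-written scenario from reaching the cluster malformed.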

---

## Why AMD MI300X Was Non-Negotiable

Memory Requirements

| Component | Memory |
|---|---|
| Qwen2.5-7B base (shared) | ~4 GB |
| 4× LoRA adapters (r=16) | ~160 MB |
| Qwen2.5-72B judge (AWQ) | ~37 GB |
| GRPO training buffers | ~12 GB |
| vLLM KV cache | ~70 GB |
| **Total required** | ~126 GB |
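
The budget can be checked with a few lines of arithmetic. The itemized rows add up to roughly 123 GB, slightly under the ~126 GB total quoted here; the difference is presumably runtime overhead that is not itemized (an assumption). The sketch compares the itemized sum with the nominal GPU capacities named in this deck.

```python
# Figures copied from the memory table; capacities are the nominal ones on the slide.
budget_gb = {
    "Qwen2.5-7B base (shared)": 4.0,
    "4x LoRA adapters (r=16)": 0.16,
    "Qwen2.5-72B judge (AWQ)": 37.0,
    "GRPO training buffers": 12.0,
    "vLLM KV cache": 70.0,
}
gpus_gb = {"T4": 16.0, "A100": 80.0, "MI300X": 192.0}


def fits(capacity_gb: float, required_gb: float) -> bool:
    """True when the whole budget fits on a single device."""
    return required_gb <= capacity_gb


itemized_gb = sum(budget_gb.values())

if __name__ == "__main__":
    print(f"itemized total: {itemized_gb:.2f} GB")
    for name, capacity in gpus_gb.items():
        verdict = "fits" if fits(capacity, itemized_gb) else "does not fit"
        print(f"{name}: {verdict} ({capacity - itemized_gb:+.2f} GB headroom)")
```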

GPU Comparison

A100 (80 GB) ❌

Fits agents OR judge — not both simultaneously. Online GRPO impossible.

T4 (16 GB) ❌

Can't fit Qwen2.5-7B at all. CUDA OOM at model load.

MI300X 192 GB HBM3 ✅

All 5 models co-hosted. 66 GB free. 18× faster inference vs shared API.

---

## Training Pipeline: SFT → Online GRPO

Phase 1: Supervised Fine-Tuning

2,028 real GKE trajectories · QLoRA 4-bit NF4 · LoRA r=16

loss: 1.265 → 0.48 → 0.19 → 0.027
accuracy: 71.96% → 99.10%
time: 14 min 16 sec
adapter: 78 MB LoRA

Model learned: correct tool-call sequence, promql before argocd rollback, postmortem structure

Phase 2: Online GRPO

60 steps · 236 real GKE rollout episodes · DAPO loss

step 01: mean=0.355  max=0.539
step 24: mean=0.376  max=0.700
step 31: mean=0.421  max=0.671 ← peak
step 60: mean=0.364  max=0.506
overall: mean=0.200  runtime=9h 34m

True online RL: every step = real chaos + real rollouts + real cluster scoring
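
For readers new to GRPO, the core idea is that each rollout is scored relative to the other rollouts for the same incident, rather than against a learned value function. The sketch below shows only that group-relative normalization, in plain Python with made-up rewards. It is not the TRL implementation, and the DAPO-specific loss changes are not shown.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards: list, eps: float = 1e-6) -> list:
    """Normalize each rollout's reward against its own group (same incident prompt)."""
    if len(rewards) < 2:
        raise ValueError("need at least two rollouts per group")
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps keeps the division finite when every rollout scored the same
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    # Hypothetical rewards for four rollouts of one chaos scenario.
    for reward, adv in zip([0.2, 0.4, 0.6, 0.8], group_relative_advantages([0.2, 0.4, 0.6, 0.8])):
        print(f"reward={reward:.2f} advantage={adv:+.3f}")
```

Rollouts above the group mean get a positive advantage and are reinforced; a group where every rollout scores the same contributes no gradient signal.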

---

## What Makes Our Training Unique

| Feature | Standard GRPO | **AtlasOps** |
|---|---|---|
| Environment | Simulator / offline | **Real GKE cluster, live kubectl** |
| Loss function | GRPO | **DAPO**: stable on sparse rewards |
| Reward signal | Episode-level only | **Dense per-step** + episode contract |
| Curriculum | Random / fixed | **Spaced repetition** with mastery tracking |
| Scenario generation | Static | **∞ adversarial**: 72B judge writes new Chaos YAML |
| Judge | Single rubric | **3 personas**: Junior / Senior / Principal |

−0.25

false resolution penalty

−0.20

hallucinated evidence

+0.15

red herring bonus

+0.08

mutating action success
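
A minimal sketch of how the four shaping terms above could be combined. Only the numeric values come from this deck. The flag names, the choice to apply the hallucination penalty and mutating-action bonus per occurrence, and the reading of "red herring bonus" as a reward for correctly dismissing a decoy are all assumptions.

```python
from dataclasses import dataclass

FALSE_RESOLUTION_PENALTY = -0.25
HALLUCINATED_EVIDENCE_PENALTY = -0.20
RED_HERRING_BONUS = 0.15
MUTATING_ACTION_BONUS = 0.08


@dataclass
class EpisodeFlags:
    """Hypothetical per-episode signals produced by the scoring harness."""
    false_resolution: bool = False          # agent declared RESOLVED while still failing
    hallucinated_evidence_count: int = 0    # cited evidence absent from tool output
    red_herring_dismissed: bool = False     # agent correctly ignored a decoy signal
    successful_mutating_actions: int = 0    # e.g. rollback or scale that succeeded


def shaping_term(flags: EpisodeFlags) -> float:
    """Sum of bonuses and penalties added on top of the base episode reward."""
    total = 0.0
    if flags.false_resolution:
        total += FALSE_RESOLUTION_PENALTY
    total += HALLUCINATED_EVIDENCE_PENALTY * flags.hallucinated_evidence_count
    if flags.red_herring_dismissed:
        total += RED_HERRING_BONUS
    total += MUTATING_ACTION_BONUS * flags.successful_mutating_actions
    return total
```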

---

## Benchmark Results

28 frozen scenarios · Real GKE cluster · AMD MI300X · Qwen2.5-7B

82%

Resolution Rate

+28pp vs zero-shot

0.729

Avg Reward

72B judge-scored

59s

Avg MTTR

vs ~25 min human

78%

Cascade Rate

+38pp vs zero-shot

| Model | Resolution | Reward | Cascade | Named Replays | Unsafe Actions |
|---|---|---|---|---|---|
| Qwen2.5-7B zero-shot | 54% | 0.481 | 40% | 30% | 5 |
| AtlasOps SFT | 68% | 0.601 | 62% | 55% | 3 |
| **AtlasOps GRPO (MI300X)** | **82%** | **0.729** | **78%** | **72%** | **1** |

---

## Production Safety: No Agent Can Cause an Outage

🚦

APPROVAL GATE

P0: Human required — no auto-execution
P1: 60-second approval window
P2/P3: Fully automatic
Token-based callbacks via REST API
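
The severity policy above can be sketched as a small decision function. The deck does not say what happens when the 60-second P1 window expires without a response, so that outcome is an explicit parameter here with a conservative default (an assumption). All names are hypothetical.

```python
from enum import Enum
from typing import Optional

P1_WINDOW_SECONDS = 60.0


class Decision(Enum):
    EXECUTE = "execute"
    WAIT_FOR_HUMAN = "wait_for_human"
    REJECT = "reject"


def gate(severity: str, approved: Optional[bool], seconds_waited: float,
         p1_timeout_decision: Decision = Decision.REJECT) -> Decision:
    """Decide whether a remediation may run. approved=None means no human answer yet."""
    if severity in ("P2", "P3"):
        return Decision.EXECUTE          # fully automatic
    if severity not in ("P0", "P1"):
        raise ValueError(f"unknown severity: {severity}")
    if approved is True:
        return Decision.EXECUTE
    if approved is False:
        return Decision.REJECT
    if severity == "P0":
        return Decision.WAIT_FOR_HUMAN   # never auto-executes
    if seconds_waited < P1_WINDOW_SECONDS:
        return Decision.WAIT_FOR_HUMAN
    return p1_timeout_decision           # assumption: behaviour after the window
```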

⚡

CIRCUIT BREAKER

50 tool calls per incident max
10 mutating actions per hour
3 consecutive failures → OPEN state
Tripped 1× during GRPO training (working as designed)
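
A minimal sketch of a breaker enforcing the three limits above, assuming one breaker instance per incident and a rolling one-hour window for mutating actions. The class and method names are hypothetical, and the reset path out of the OPEN state is not shown because the deck does not describe it.

```python
import time
from collections import deque

MAX_CALLS_PER_INCIDENT = 50
MAX_MUTATIONS_PER_HOUR = 10
MAX_CONSECUTIVE_FAILURES = 3


class CircuitBreaker:
    def __init__(self, clock=time.monotonic):
        self._clock = clock              # injectable so the window can be tested
        self.calls = 0
        self.consecutive_failures = 0
        self.mutations = deque()         # timestamps of recent mutating actions
        self.open = False

    def allow(self, mutating: bool) -> bool:
        """Check limits before a tool call; does not record anything."""
        if self.open or self.calls >= MAX_CALLS_PER_INCIDENT:
            return False
        if mutating:
            cutoff = self._clock() - 3600
            while self.mutations and self.mutations[0] <= cutoff:
                self.mutations.popleft()  # drop actions older than one hour
            if len(self.mutations) >= MAX_MUTATIONS_PER_HOUR:
                return False
        return True

    def record(self, mutating: bool, success: bool) -> None:
        """Record the outcome of a tool call that was allowed to run."""
        self.calls += 1
        if mutating:
            self.mutations.append(self._clock())
        self.consecutive_failures = 0 if success else self.consecutive_failures + 1
        if self.consecutive_failures >= MAX_CONSECUTIVE_FAILURES:
            self.open = True
```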

🔗

INCIDENT CORRELATOR

5-minute deduplication window
Fingerprint-based alert grouping
Prevents 10 parallel chains on one cascade
Tracks all active incidents
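
Fingerprint-based grouping can be sketched as follows. Which labels feed the fingerprint is an assumption (here `alertname` and `namespace`, so alerts from different services in one cascade collapse together), as is the choice of a sliding window that extends each time a matching alert arrives.

```python
import hashlib

DEDUP_WINDOW_SECONDS = 300  # the 5-minute window from the slide


def fingerprint(labels: dict, keys=("alertname", "namespace")) -> str:
    """Stable short hash over the chosen label values."""
    material = "|".join(f"{k}={labels.get(k, '')}" for k in keys)
    return hashlib.sha256(material.encode()).hexdigest()[:16]


class IncidentCorrelator:
    def __init__(self):
        self._active = {}    # fingerprint -> (incident_id, last_seen)
        self._next_id = 1

    def ingest(self, labels: dict, now: float):
        """Return (incident_id, is_new). Duplicates inside the window reuse the id."""
        fp = fingerprint(labels)
        entry = self._active.get(fp)
        if entry is not None and now - entry[1] <= DEDUP_WINDOW_SECONDS:
            self._active[fp] = (entry[0], now)   # slide the window forward
            return entry[0], False
        incident_id = f"INC-{self._next_id}"
        self._next_id += 1
        self._active[fp] = (incident_id, now)
        return incident_id, True
```

Only alerts that return `is_new=True` would start an agent chain, which is what prevents parallel chains on a single cascade.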

📋

HMAC AUDIT LOG

Hash-chained entries — tamper-evident
Every tool call + approval logged
`verify_integrity()` checks full chain
Cryptographic proof of what happened
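
A hash-chained HMAC log can be sketched in a few lines of standard-library Python. Only the name `verify_integrity()` comes from this deck; the class layout, entry format, and genesis value are assumptions. Each entry's MAC covers the previous entry's MAC plus the canonical JSON of the event, so editing any earlier entry invalidates the chain from that point on.

```python
import hashlib
import hmac
import json

GENESIS = "0" * 64  # placeholder "previous MAC" for the first entry


class AuditLog:
    def __init__(self, key: bytes):
        self._key = key
        self.entries = []

    def _mac(self, prev_mac: str, event: dict) -> str:
        body = json.dumps(event, sort_keys=True, separators=(",", ":"))
        return hmac.new(self._key, (prev_mac + body).encode(), hashlib.sha256).hexdigest()

    def append(self, event: dict) -> None:
        prev = self.entries[-1]["mac"] if self.entries else GENESIS
        self.entries.append({"event": event, "prev": prev, "mac": self._mac(prev, event)})

    def verify_integrity(self) -> bool:
        """Recompute every MAC in order; any edit or reordering breaks the chain."""
        prev = GENESIS
        for entry in self.entries:
            expected = self._mac(prev, entry["event"])
            if entry["prev"] != prev or not hmac.compare_digest(entry["mac"], expected):
                return False
            prev = entry["mac"]
        return True
```

A chain on its own does not reveal entries truncated from the end; detecting that needs the latest MAC stored somewhere outside the log.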

---

## Cloudflare 2019: Replay Postmortem

What happened when we ran AtlasOps against a real recreation of the incident that took down 85% of Cloudflare's traffic

00:03 TRIAGE PagerDuty ACK · severity P1 · blast: frontend + checkout + cart

00:08 DIAGNOSIS promql → 5xx surge on checkoutservice (error_rate: 34%)

00:10 DIAGNOSIS jaeger → timeout chain ends at currencyservice (CPU at 1999m/2000m)

00:13 REMEDIATION argocd rollback currencyservice → revision 3 ✓

00:18 REMEDIATION promql confirms error_rate < 0.1% · RESOLVED

00:22 COMMS slack posted · statuspage updated

00:24 COMMS postmortem saved → docs/postmortems/cloudflare-2019-replay.md

4m 12s

Total MTTR

Tool calls to root cause

0.856

Judge score

---

## Tech Stack

AMD Hardware

MI300X — 192 GB HBM3

ROCm 7.2

vLLM 0.17.1 (ROCm build)

18× speedup vs shared API

312ms p50 inference latency

5 models co-hosted simultaneously

ML Training

Qwen2.5-7B-Instruct × 4

Qwen2.5-72B-Instruct-AWQ (judge)

TRL 1.4.0 — SFTTrainer + GRPOTrainer

PEFT QLoRA — 4-bit NF4, r=16

BitsAndBytes-ROCm

HF Optimum-AMD (inference)

Application

FastAPI + custom SSE streaming

Chaos Mesh (6 fault types)

Prometheus + Grafana + Jaeger

Argo CD GitOps

GKE Standard · Cloud SQL

Docker · HuggingFace Spaces

---

AMD Developer Hackathon 2026

AtlasOps

Real GKE cluster · Real training · Real results
54% → 82% resolution rate. Zero simulations.

GitHub

Harikishanth/AtlasOps

HF Space

lablab-ai-amd/atlasops

Team

Da Big Three

Harikishanth R · Reshma Affrin F · Jehrome F