metadata
marp: true
theme: uncover
paginate: true
backgroundColor: '#060A12'
color: '#E8EDF5'
style: >
@import
url('https://fonts.googleapis.com/css2?family=Inter:wght@300;400;600;700;900&family=JetBrains+Mono:wght@400;700&display=swap');
section {
font-family: 'Inter', sans-serif;
font-size: 20px;
background: #060A12;
color: #E8EDF5;
padding: 48px 56px;
display: flex;
flex-direction: column;
justify-content: center;
}
h1 {
font-size: 2.8em;
font-weight: 900;
letter-spacing: -1px;
line-height: 1.1;
margin-bottom: 12px;
}
h2 {
font-size: 1.5em;
font-weight: 700;
letter-spacing: 3px;
text-transform: uppercase;
margin-bottom: 20px;
padding-bottom: 10px;
border-bottom: 2px solid rgba(255,255,255,0.1);
}
h3 {
font-size: 1.1em;
font-weight: 600;
color: #00d4ff;
letter-spacing: 1px;
margin-bottom: 8px;
}
strong { color: #00D4FF; }
em { color: #ff4560; font-style: normal; font-weight: 600; }
code {
font-family: 'JetBrains Mono', monospace;
background: rgba(0,212,255,0.08);
color: #00d4ff;
padding: 2px 8px;
border-radius: 4px;
font-size: 0.85em;
border: 1px solid rgba(0,212,255,0.2);
}
pre {
background: rgba(0,0,0,0.4);
border: 1px solid rgba(0,212,255,0.15);
border-radius: 8px;
padding: 18px 20px;
font-size: 0.75em;
}
table {
font-size: 0.8em;
border-collapse: collapse;
width: 100%;
margin-top: 16px;
}
th {
background: rgba(0,212,255,0.1);
color: #00d4ff;
padding: 10px 14px;
text-align: left;
font-weight: 700;
letter-spacing: 1px;
text-transform: uppercase;
font-size: 0.8em;
}
td {
padding: 9px 14px;
border-bottom: 1px solid rgba(255,255,255,0.06);
}
tr:last-child td { border-bottom: none; }
blockquote {
border-left: 3px solid #ff4560;
padding-left: 20px;
color: #9BA3B8;
font-style: italic;
margin: 16px 0;
}
section::after {
font-family: 'JetBrains Mono', monospace;
font-size: 0.65em;
color: rgba(255,255,255,0.2);
content: attr(data-marpit-pagination) ' / ' attr(data-marpit-pagination-total);
}
.accent { color: #00D4FF; }
.red { color: #FF4560; }
.green { color: #00E396; }
.yellow { color: #FFB703; }
.dim { color: #5A6478; }
AMD Developer Hackathon 2026
AtlasOps
Can 4 AI agents replace
an on-call SRE team?
Real GKE Cluster · GCP
AMD MI300X · 192 GB HBM3
SFT + Online GRPO Trained
Harikishanth R · Reshma Affrin F · Jehrome F | Da Big Three
The Problem
2:47 AM
When P1 alerts fire, on average. Your on-call engineer is asleep, or stressed and rushing.
~25 min
Average human MTTR for a cascade incident. Revenue bleeding the entire time.
$250B
Global observability + SRE market. On-call burnout is the industry's most expensive unsolved problem.
Every SRE team has a war story. The 3 AM page. The cascading failure nobody understood for 40 minutes. The postmortem that blamed "human error." The real failure was that there was no system to help them think faster.
Introducing AtlasOps
Four specialized AI agents. One AMD MI300X. One real GKE cluster. No simulations.
TRIAGE
Ack alert
Classify severity
Map blast radius
<4 tool calls
DIAGNOSIS
PromQL queries
Jaeger traces
kubectl logs
Root cause ID
REMEDIATION
Argo CD rollback
kubectl scale
Alert silence
Verify fix
COMMS
Slack update
Postmortem
Status page
Action items
Alert → Triage → Diagnosis → [Approval Gate] → Remediation → Comms → Postmortem
Real Infrastructure: Not a Simulation
Google Cloud Platform
▸ GKE Standard Cluster: us-central1, 3× e2-standard-4
▸ Online Boutique: 11 real microservices (Go, Python, Node, Java, C#, gRPC)
▸ Chaos Mesh: PodChaos · NetworkChaos · StressChaos · DNSChaos · IOChaos · TimeChaos
▸ Prometheus + Grafana + Jaeger + OTel: full observability stack
▸ Argo CD: GitOps rollbacks, real execution
▸ Cloud SQL (Postgres 15) + Cloud PubSub + Cloud Monitoring
▸ Alertmanager: webhook fires agents on real alerts
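A minimal sketch of how an Alertmanager webhook body could be turned into work for the Triage agent. The payload fields (`alerts`, `status`, `labels`, `fingerprint`) follow Alertmanager's webhook format; the `Incident` type, the `to_incidents` helper, and the `severity` and `service` label names are hypothetical and not taken from the AtlasOps code.

```python
import json
from dataclasses import dataclass


@dataclass
class Incident:
    fingerprint: str
    alertname: str
    severity: str
    service: str


def to_incidents(payload: str) -> list[Incident]:
    """Convert an Alertmanager webhook body into incidents for triage."""
    body = json.loads(payload)
    incidents = []
    for alert in body.get("alerts", []):
        if alert.get("status") != "firing":
            continue  # resolved alerts do not start a new agent chain
        labels = alert.get("labels", {})
        incidents.append(
            Incident(
                fingerprint=alert.get("fingerprint", ""),
                alertname=labels.get("alertname", "unknown"),
                # Assumption: alert rules carry a `severity` label such as "P1".
                severity=labels.get("severity", "P3"),
                # Assumption: alert rules carry a `service` label.
                service=labels.get("service", "unknown"),
            )
        )
    return incidents


# Hypothetical payload with one firing and one resolved alert.
sample = json.dumps({
    "version": "4",
    "status": "firing",
    "alerts": [
        {"status": "firing", "fingerprint": "a1b2",
         "labels": {"alertname": "HighErrorRate", "severity": "P1",
                    "service": "checkoutservice"}},
        {"status": "resolved", "fingerprint": "c3d4",
         "labels": {"alertname": "HighLatency", "severity": "P2",
                    "service": "frontend"}},
    ],
})
incidents = to_incidents(sample)
```

Only the firing alert becomes an incident; the resolved one is skipped.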
20 Real SRE Tools
kubectl (7 cmds)
promql_query
promql_range
jaeger_search
jaeger_get_trace
argocd_rollback
gcloud_logs_read
cloud_monitoring
alertmanager_silence
postmortem_draft
Every tool hits a real API. No mocks in production.
38 Chaos Scenarios + Infinite Adversarial Generation
| Tier | Count | Difficulty |
|---|---|---|
| Single-fault | 8 | Beginner |
| Cascade | 5 | Hard |
| Multi-fault | 5 | Expert |
| Named Replays | 10 | Expert |
| Dynamic Adversarial | ∞ | 72B-designed |
10 Named Historical Replays
⚡ Cloudflare 2019: Regex CPU storm, 85% traffic down
⚡ GitHub 2018: DB failover loop, 24h incident
⚡ AWS S3 2017: Typo'd command cascaded globally
⚡ Discord 2022: Redis thundering herd
⚡ Fastly 2021: Bad VCL config, internet outage
⚡ Facebook BGP 2021: Control plane partition
⚡ Knight Capital 2012: Partial deploy, $440M loss
+ Datadog 2023 · Slack 2022 · Azure DNS 2019
Adversarial designer: After each benchmark run, the Qwen2.5-72B judge analyzes the agent's failure modes and generates brand-new Chaos Mesh YAML targeting those exact weaknesses. The test set gets harder as the model improves, so it cannot be memorized.
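To make the target of that generation concrete, here is a sketch of the kind of Chaos Mesh object such a scenario boils down to. The `PodChaos` fields (`action`, `mode`, `selector`) follow the Chaos Mesh `v1alpha1` API; the helper name, the experiment name, and the choice of `currencyservice` as the target are illustrative assumptions, and nothing here shows how the judge itself writes a scenario.

```python
import json


def pod_kill_manifest(name: str, namespace: str, app_label: str) -> dict:
    """Build a Chaos Mesh PodChaos object that kills one matching pod."""
    return {
        "apiVersion": "chaos-mesh.org/v1alpha1",
        "kind": "PodChaos",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "action": "pod-kill",
            "mode": "one",  # pick a single pod among the matches
            "selector": {
                "namespaces": [namespace],
                # Assumption: target pods carry an `app` label.
                "labelSelectors": {"app": app_label},
            },
        },
    }


manifest = pod_kill_manifest("kill-currency", "default", "currencyservice")
manifest_json = json.dumps(manifest, indent=2)
```

`kubectl apply -f` accepts JSON as well as YAML, so the serialized dict can be applied directly.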
Why AMD MI300X Was Non-Negotiable
Memory Requirements
Qwen2.5-7B base (shared) ~4 GB
4× LoRA adapters (r=16) ~160 MB
Qwen2.5-72B judge (AWQ) ~37 GB
GRPO training buffers ~12 GB
vLLM KV cache ~70 GB
Total required ~126 GB
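The budget can be checked with a few lines of arithmetic. The figures below are copied from the list above; the listed items add up to roughly 123 GB, so the stated ~126 GB total presumably includes a few GB of overhead that is not itemized (an assumption, the slide does not say).

```python
# Line items from the slide, in GB.
budget_gb = {
    "Qwen2.5-7B base (shared)": 4.0,
    "4x LoRA adapters (r=16)": 0.16,
    "Qwen2.5-72B judge (AWQ)": 37.0,
    "GRPO training buffers": 12.0,
    "vLLM KV cache": 70.0,
}

itemized_gb = sum(budget_gb.values())   # about 123.16 GB
stated_total_gb = 126.0                 # total quoted on the slide
mi300x_hbm_gb = 192.0
a100_hbm_gb = 80.0

headroom_gb = mi300x_hbm_gb - stated_total_gb
fits_on_a100 = stated_total_gb <= a100_hbm_gb

print(f"itemized: {itemized_gb:.2f} GB, stated: {stated_total_gb:.0f} GB")
print(f"MI300X headroom: {headroom_gb:.0f} GB, fits on A100: {fits_on_a100}")
```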
GPU Comparison
A100 (80 GB) ❌
Fits agents OR judge, not both simultaneously. Online GRPO impossible.
T4 (16 GB) ❌
Can't fit Qwen2.5-7B at all. CUDA OOM at model load.
MI300X 192 GB HBM3 ✅
All 5 models co-hosted. 66 GB free. 18× faster inference vs shared API.
Training Pipeline: SFT → Online GRPO
Phase 1: Supervised Fine-Tuning
2,028 real GKE trajectories · QLoRA 4-bit NF4 · LoRA r=16
loss: 1.265 → 0.48 → 0.19 → 0.027
accuracy: 71.96% → 99.10%
time: 14 min 16 sec
adapter: 78 MB LoRA
Model learned: correct tool-call sequence, promql before argocd rollback, postmortem structure
Phase 2: Online GRPO
60 steps · 236 real GKE rollout episodes · DAPO loss
step 01: mean=0.355 max=0.539
step 24: mean=0.376 max=0.700
step 31: mean=0.421 max=0.671 ← peak mean
step 60: mean=0.364 max=0.506
overall: mean=0.200 runtime=9h 34m
True online RL: every step = real chaos + real rollouts + real cluster scoring
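For readers new to GRPO, the core idea is that each rollout is scored relative to the other rollouts sampled for the same scenario. The sketch below shows only that group-relative advantage step, using population standard deviation as a simplifying assumption. It is not the AtlasOps training loop, it omits what DAPO adds on top, and the reward values are made up for illustration.

```python
from statistics import mean, pstdev


def group_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Normalize each rollout's reward against its own sampling group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps keeps the division defined when every rollout scores the same.
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for four rollouts of one chaos scenario.
advantages = group_advantages([0.2, 0.5, 0.7, 0.3])
```

Rollouts above the group mean get a positive advantage and are reinforced; a group where every rollout scores the same yields all zeros and carries no learning signal.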
What Makes Our Training Unique
| Feature | Standard GRPO | AtlasOps |
|---|---|---|
| Environment | Simulator / offline | Real GKE cluster, live kubectl |
| Loss function | GRPO | DAPO (stable on sparse rewards) |
| Reward signal | Episode-level only | Dense per-step + episode contract |
| Curriculum | Random / fixed | Spaced repetition with mastery tracking |
| Scenario generation | Static | ∞ adversarial: 72B judge writes new Chaos YAML |
| Judge | Single rubric | 3 personas: Junior / Senior / Principal |
−0.25
false resolution penalty
−0.20
hallucinated evidence
+0.15
red herring bonus
+0.08
mutating action success
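One way these four adjustments could combine with a judge score is sketched below. The four values come from the slide; the additive combination, the clamp to [0, 1], the event names, and the `shaped_reward` helper are assumptions made for illustration.

```python
# Shaping terms from the slide, keyed by hypothetical event names.
SHAPING = {
    "false_resolution": -0.25,        # declared RESOLVED while still broken
    "hallucinated_evidence": -0.20,   # cited data no tool returned
    "red_herring": 0.15,              # bonus tied to a planted red herring
    "mutating_action_success": 0.08,  # a state-changing action succeeded
}


def shaped_reward(base: float, events: list[str]) -> float:
    """Add shaping terms to a base score and clamp the result to [0, 1]."""
    total = base + sum(SHAPING[event] for event in events)
    return max(0.0, min(1.0, total))


good_episode = shaped_reward(0.5, ["red_herring", "mutating_action_success"])
bad_episode = shaped_reward(0.1, ["false_resolution"])
```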
Benchmark Results
28 frozen scenarios · Real GKE cluster · AMD MI300X · Qwen2.5-7B
82%
Resolution Rate
+28pp vs zero-shot
0.729
Avg Reward
72B judge-scored
59s
Avg MTTR
vs ~25 min human
78%
Cascade Rate
+38pp vs zero-shot
| Model | Resolution | Reward | Cascade | Named Replays | Unsafe Actions |
|---|---|---|---|---|---|
| Qwen2.5-7B zero-shot | 54% | 0.481 | 40% | 30% | 5 |
| AtlasOps SFT | 68% | 0.601 | 62% | 55% | 3 |
| AtlasOps GRPO (MI300X) | 82% | 0.729 | 78% | 72% | 1 |
Production Safety: No Agent Can Cause an Outage
APPROVAL GATE
P0: Human required, no auto-execution
P1: 60-second approval window
P2/P3: Fully automatic
Token-based callbacks via REST API
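The severity rules above can be sketched as a small policy function. The `GateDecision` type and `approval_policy` name are hypothetical, and the sketch deliberately leaves open what happens when the 60-second P1 window expires, since the slide does not say.

```python
import secrets
from dataclasses import dataclass
from typing import Optional


@dataclass
class GateDecision:
    auto_execute: bool
    wait_seconds: Optional[int]  # None means wait for a human with no timeout
    token: Optional[str]         # one-time token a REST callback would present


def approval_policy(severity: str) -> GateDecision:
    """Map incident severity to the approval behavior listed on the slide."""
    if severity == "P0":
        return GateDecision(False, None, secrets.token_urlsafe(16))
    if severity == "P1":
        return GateDecision(False, 60, secrets.token_urlsafe(16))
    if severity in ("P2", "P3"):
        return GateDecision(True, 0, None)
    raise ValueError(f"unknown severity: {severity}")
```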
CIRCUIT BREAKER
50 tool calls per incident max
10 mutating actions per hour
3 consecutive failures → OPEN state
Tripped 1× during GRPO training (working as designed)
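A minimal sketch of a breaker enforcing these three limits. The limits are from the slide; the class shape, the one-breaker-per-incident scoping, and the injectable clock are assumptions rather than the AtlasOps implementation.

```python
import time


class CircuitBreaker:
    MAX_TOOL_CALLS = 50            # per incident
    MAX_MUTATIONS_PER_HOUR = 10
    MAX_CONSECUTIVE_FAILURES = 3

    def __init__(self, clock=time.monotonic):
        self.clock = clock
        self.tool_calls = 0
        self.mutation_times = []   # timestamps of recent mutating actions
        self.consecutive_failures = 0
        self.state = "CLOSED"

    def allow(self, mutating: bool) -> bool:
        """Return True and count the call if it is within every limit."""
        if self.state == "OPEN":
            return False
        now = self.clock()
        # Keep only mutations from the last hour.
        self.mutation_times = [t for t in self.mutation_times if now - t < 3600]
        if self.tool_calls >= self.MAX_TOOL_CALLS:
            return False
        if mutating and len(self.mutation_times) >= self.MAX_MUTATIONS_PER_HOUR:
            return False
        self.tool_calls += 1
        if mutating:
            self.mutation_times.append(now)
        return True

    def record(self, success: bool) -> None:
        """Track tool outcomes; three failures in a row open the breaker."""
        if success:
            self.consecutive_failures = 0
            return
        self.consecutive_failures += 1
        if self.consecutive_failures >= self.MAX_CONSECUTIVE_FAILURES:
            self.state = "OPEN"
```

Once the state is OPEN, every later `allow()` call is refused; how and when the breaker resets is left out of this sketch.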
INCIDENT CORRELATOR
5-minute deduplication window
Fingerprint-based alert grouping
Prevents 10 parallel chains on one cascade
Tracks all active incidents
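A sketch of fingerprint-based deduplication with a 5-minute window. Which labels feed the fingerprint is not stated on the slide; grouping on `cluster` and `namespace` is an assumption chosen so that alerts from different services in one cascade collapse into a single incident. All names below are hypothetical.

```python
import hashlib

WINDOW_SECONDS = 5 * 60
GROUP_KEYS = ("cluster", "namespace")  # assumed grouping labels


def fingerprint(labels: dict) -> str:
    """Hash only the grouping labels so related alerts share a fingerprint."""
    basis = "|".join(f"{key}={labels.get(key, '')}" for key in GROUP_KEYS)
    return hashlib.sha256(basis.encode()).hexdigest()[:16]


class IncidentCorrelator:
    def __init__(self):
        self.active = {}   # fingerprint -> (incident_id, first_seen)
        self._counter = 0

    def correlate(self, labels: dict, now: float) -> tuple[str, bool]:
        """Return (incident_id, is_new) for an incoming alert."""
        fp = fingerprint(labels)
        entry = self.active.get(fp)
        if entry is not None and now - entry[1] <= WINDOW_SECONDS:
            return entry[0], False  # duplicate: reuse the running agent chain
        self._counter += 1
        incident_id = f"inc-{self._counter}"
        self.active[fp] = (incident_id, now)
        return incident_id, True
```

Only alerts that return `is_new=True` would start a new Triage chain.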
HMAC AUDIT LOG
Hash-chained entries, tamper-evident
Every tool call + approval logged
`verify_integrity()` checks full chain
Cryptographic proof of what happened
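A minimal sketch of a hash-chained HMAC log, assuming each entry's MAC covers the previous entry's MAC plus the new record. Only the `verify_integrity()` name comes from the slide; the `AuditLog` class, the record layout, and the key handling are illustrative.

```python
import hashlib
import hmac
import json


class AuditLog:
    def __init__(self, key: bytes):
        self._key = key
        self.entries = []  # each entry: {"record": dict, "mac": hex string}

    def _mac(self, prev_mac: str, record: dict) -> str:
        # Chaining: the previous MAC is part of the authenticated payload.
        payload = prev_mac.encode() + json.dumps(record, sort_keys=True).encode()
        return hmac.new(self._key, payload, hashlib.sha256).hexdigest()

    def append(self, record: dict) -> None:
        prev = self.entries[-1]["mac"] if self.entries else ""
        self.entries.append({"record": record, "mac": self._mac(prev, record)})

    def verify_integrity(self) -> bool:
        """Recompute every MAC in order; any edited entry breaks the chain."""
        prev = ""
        for entry in self.entries:
            expected = self._mac(prev, entry["record"])
            if not hmac.compare_digest(entry["mac"], expected):
                return False
            prev = entry["mac"]
        return True
```

Editing or reordering a stored entry makes `verify_integrity()` return `False`. Detecting entries dropped from the end would need an extra anchor, such as storing the latest MAC elsewhere.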
Cloudflare 2019: Replay Postmortem
What happened when we ran AtlasOps against a real recreation of the incident that took down 85% of Cloudflare's traffic
00:03 TRIAGE PagerDuty ACK · severity P1 · blast: frontend + checkout + cart
00:08 DIAGNOSIS promql → 5xx surge on checkoutservice (error_rate: 34%)
00:10 DIAGNOSIS jaeger → timeout chain ends at currencyservice (CPU at 1999m/2000m)
00:13 REMEDIATION argocd rollback currencyservice → revision 3 ✅
00:18 REMEDIATION promql confirms error_rate < 0.1% · RESOLVED
00:22 COMMS slack posted · statuspage updated
00:24 COMMS postmortem saved → docs/postmortems/cloudflare-2019-replay.md
4m 12s
Total MTTR
3
Tool calls to root cause
0.856
Judge score
Tech Stack
AMD Hardware
MI300X · 192 GB HBM3
ROCm 7.2
vLLM 0.17.1 (ROCm build)
18× speedup vs shared API
312ms p50 inference latency
5 models co-hosted simultaneously
ML Training
Qwen2.5-7B-Instruct × 4
Qwen2.5-72B-Instruct-AWQ (judge)
TRL 1.4.0: SFTTrainer + GRPOTrainer
PEFT QLoRA: 4-bit NF4, r=16
BitsAndBytes-ROCm
HF Optimum-AMD (inference)
Application
FastAPI + custom SSE streaming
Chaos Mesh (6 fault types)
Prometheus + Grafana + Jaeger
Argo CD GitOps
GKE Standard · Cloud SQL
Docker · HuggingFace Spaces
AMD Developer Hackathon 2026
AtlasOps
Real GKE cluster · Real training · Real results
54% → 82% resolution rate. Zero simulations.
GitHub
Harikishanth/AtlasOps
HF Space
lablab-ai-amd/atlasops
Team
Da Big Three
Harikishanth R · Reshma Affrin F · Jehrome F