marp: true
theme: uncover
paginate: true
backgroundColor: '#060A12'
color: '#E8EDF5'
style: >
  @import
  url('https://fonts.googleapis.com/css2?family=Inter:wght@300;400;600;700;900&family=JetBrains+Mono:wght@400;700&display=swap');


  section {
    font-family: 'Inter', sans-serif;
    font-size: 20px;
    background: #060A12;
    color: #E8EDF5;
    padding: 48px 56px;
    display: flex;
    flex-direction: column;
    justify-content: center;
  }


  h1 {
    font-size: 2.8em;
    font-weight: 900;
    letter-spacing: -1px;
    line-height: 1.1;
    margin-bottom: 12px;
  }


  h2 {
    font-size: 1.5em;
    font-weight: 700;
    letter-spacing: 3px;
    text-transform: uppercase;
    margin-bottom: 20px;
    padding-bottom: 10px;
    border-bottom: 2px solid rgba(255,255,255,0.1);
  }


  h3 {
    font-size: 1.1em;
    font-weight: 600;
    color: #00d4ff;
    letter-spacing: 1px;
    margin-bottom: 8px;
  }


  strong { color: #00D4FF; }

  em { color: #ff4560; font-style: normal; font-weight: 600; }


  code {
    font-family: 'JetBrains Mono', monospace;
    background: rgba(0,212,255,0.08);
    color: #00d4ff;
    padding: 2px 8px;
    border-radius: 4px;
    font-size: 0.85em;
    border: 1px solid rgba(0,212,255,0.2);
  }


  pre {
    background: rgba(0,0,0,0.4);
    border: 1px solid rgba(0,212,255,0.15);
    border-radius: 8px;
    padding: 18px 20px;
    font-size: 0.75em;
  }


  table {
    font-size: 0.8em;
    border-collapse: collapse;
    width: 100%;
    margin-top: 16px;
  }


  th {
    background: rgba(0,212,255,0.1);
    color: #00d4ff;
    padding: 10px 14px;
    text-align: left;
    font-weight: 700;
    letter-spacing: 1px;
    text-transform: uppercase;
    font-size: 0.8em;
  }


  td {
    padding: 9px 14px;
    border-bottom: 1px solid rgba(255,255,255,0.06);
  }


  tr:last-child td { border-bottom: none; }


  blockquote {
    border-left: 3px solid #ff4560;
    padding-left: 20px;
    color: #9BA3B8;
    font-style: italic;
    margin: 16px 0;
  }


  section::after {
    font-family: 'JetBrains Mono', monospace;
    font-size: 0.65em;
    color: rgba(255,255,255,0.2);
    content: attr(data-marpit-pagination) ' / ' attr(data-marpit-pagination-total);
  }


  .accent { color: #00D4FF; }

  .red { color: #FF4560; }

  .green { color: #00E396; }

  .yellow { color: #FFB703; }

  .dim { color: #5A6478; }
AMD Developer Hackathon 2026

AtlasOps

Can 4 AI agents replace
an on-call SRE team?
Real GKE Cluster · GCP
AMD MI300X · 192 GB HBM3
SFT + Online GRPO Trained
Harikishanth R · Reshma Affrin F · Jehrome F  |  Da Big Three

The Problem

2:47 AM
When P1 alerts fire, on average. Your on-call engineer is asleep, or awake but stressed and rushing.
~25 min
Average human MTTR for a cascade incident. Revenue bleeding the entire time.
$250B
Global observability + SRE market. On-call burnout is the industry's most expensive unsolved problem.
Every SRE team has a war story. The 3 AM page. The cascading failure nobody understood for 40 minutes. The postmortem that blamed "human error." The real failure was that there was no system to help them think faster.

Introducing AtlasOps

Four specialized AI agents. One AMD MI300X. One real GKE cluster. No simulations.
🔴
TRIAGE
Ack alert
Classify severity
Map blast radius
<4 tool calls
🔍
DIAGNOSIS
PromQL queries
Jaeger traces
kubectl logs
Root cause ID
🔧
REMEDIATION
Argo CD rollback
kubectl scale
Alert silence
Verify fix
📣
COMMS
Slack update
Postmortem
Status page
Action items
Alert → Triage → Diagnosis → [Approval Gate] → Remediation → Comms → Postmortem
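The stage order above can be sketched as a loop with a gate in front of remediation. This is a minimal illustration, not the project's orchestrator: `run_pipeline`, the handler callables, and the `approve` callback are hypothetical names, and letting Comms run after a blocked remediation is an assumption.

```python
from typing import Callable, Dict, List

# Stage order taken from the slide; everything else here is hypothetical.
STAGES: List[str] = ["triage", "diagnosis", "remediation", "comms"]


def run_pipeline(
    alert: Dict[str, str],
    handlers: Dict[str, Callable[[Dict[str, str]], str]],
    approve: Callable[[Dict[str, str]], bool],
) -> List[str]:
    """Run each stage in order, checking the approval gate before remediation."""
    log: List[str] = []
    for stage in STAGES:
        if stage == "remediation" and not approve(alert):
            # Assumption: a blocked remediation is logged and Comms still runs.
            log.append("remediation: blocked at approval gate")
            continue
        log.append(f"{stage}: {handlers[stage](alert)}")
    return log
```

Keeping the gate as a callback means the same loop can serve both automatic and human-approved incidents.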

Real Infrastructure: Not a Simulation

☁ Google Cloud Platform

▸ GKE Standard Cluster: us-central1, 3× e2-standard-4
▸ Online Boutique: 11 real microservices (Go, Python, Node, Java, C#, gRPC)
▸ Chaos Mesh: PodChaos · NetworkChaos · StressChaos · DNSChaos · IOChaos · TimeChaos
▸ Prometheus + Grafana + Jaeger + OTel: full observability stack
▸ Argo CD: GitOps rollbacks, real execution
▸ Cloud SQL (Postgres 15) + Cloud PubSub + Cloud Monitoring
▸ Alertmanager: webhook fires agents on real alerts

🛠 20 Real SRE Tools

kubectl (7 cmds)
promql_query
promql_range
jaeger_search
jaeger_get_trace
argocd_rollback
gcloud_logs_read
cloud_monitoring
alertmanager_silence
postmortem_draft
Every tool hits a real API. No mocks in production.

38 Chaos Scenarios + Infinite Adversarial Generation

| Tier | Count | Difficulty |
|---|---|---|
| Single-fault | 8 | Beginner |
| Cascade | 5 | Hard |
| Multi-fault | 5 | Expert |
| Named Replays | 10 | Expert |
| Dynamic Adversarial | ∞ | 72B-designed |

10 Named Historical Replays

⚑ Cloudflare 2019: Regex CPU storm, 85% traffic down
⚑ GitHub 2018: DB failover loop, 24h incident
⚑ AWS S3 2017: Typo'd command cascaded globally
⚑ Discord 2022: Redis thundering herd
⚑ Fastly 2021: Bad VCL config, internet outage
⚑ Facebook BGP 2021: Control plane partition
⚑ Knight Capital 2012: Partial deploy, $440M loss
+ Datadog 2023 · Slack 2022 · Azure DNS 2019
Adversarial designer: After each benchmark run, the Qwen2.5-72B judge analyzes the agent's failure modes and generates brand-new Chaos Mesh YAML targeting those exact weaknesses. The test set gets harder as the model improves, so it cannot be memorized.
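For a sense of what a generated scenario looks like, here is a minimal sketch that builds one Chaos Mesh `PodChaos` manifest. The helper name, the `adv-` naming scheme, and the assumption that pods carry an `app=<service>` label are illustrative only; in AtlasOps the manifests are written by the 72B judge, not by a template like this.

```python
import json
from typing import Any, Dict


def pod_failure_manifest(
    service: str, namespace: str = "default", duration: str = "60s"
) -> Dict[str, Any]:
    """Build a PodChaos manifest that makes one pod of `service` unavailable."""
    return {
        "apiVersion": "chaos-mesh.org/v1alpha1",
        "kind": "PodChaos",
        "metadata": {"name": f"adv-{service}-pod-failure", "namespace": namespace},
        "spec": {
            "action": "pod-failure",
            "mode": "one",  # affect a single matching pod
            "duration": duration,
            "selector": {
                "namespaces": [namespace],
                # Assumption: pods are labeled app=<service>.
                "labelSelectors": {"app": service},
            },
        },
    }


# JSON is valid YAML, so this output can be saved and applied as a manifest.
print(json.dumps(pod_failure_manifest("currencyservice"), indent=2))
```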

Why AMD MI300X Was Non-Negotiable

Memory Requirements

Qwen2.5-7B base (shared) ~4 GB
4× LoRA adapters (r=16) ~160 MB
Qwen2.5-72B judge (AWQ) ~37 GB
GRPO training buffers ~12 GB
vLLM KV cache ~70 GB
Total required ~126 GB
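The budget can be checked with a few lines of arithmetic. The listed items sum to about 123.2 GB, so the ~126 GB total presumably includes roughly 3 GB of overhead that is not itemized; that reading is an assumption, since the slide does not break it down.

```python
# Figures copied from the slide, in GB (160 MB written as 0.16 GB).
budget_gb = {
    "Qwen2.5-7B base (shared)": 4.0,
    "4x LoRA adapters (r=16)": 0.16,
    "Qwen2.5-72B judge (AWQ)": 37.0,
    "GRPO training buffers": 12.0,
    "vLLM KV cache": 70.0,
}
MI300X_GB = 192.0
STATED_TOTAL_GB = 126.0

itemized = sum(budget_gb.values())
unlisted = STATED_TOTAL_GB - itemized    # gap between stated total and listed items
headroom = MI300X_GB - STATED_TOTAL_GB   # free memory left on the MI300X
print(f"itemized={itemized:.2f} GB, unlisted={unlisted:.2f} GB, headroom={headroom:.0f} GB")
```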

GPU Comparison

A100 (80 GB) ❌
Fits agents OR judge, not both simultaneously. Online GRPO impossible.
T4 (16 GB) ❌
Can't fit Qwen2.5-7B at all. CUDA OOM at model load.
MI300X 192 GB HBM3 ✅
All 5 models co-hosted. 66 GB free. 18× faster inference vs shared API.

Training Pipeline: SFT → Online GRPO

Phase 1: Supervised Fine-Tuning

2,028 real GKE trajectories · QLoRA 4-bit NF4 · LoRA r=16
loss: 1.265 → 0.48 → 0.19 → 0.027
accuracy: 71.96% → 99.10%
time: 14 min 16 sec
adapter: 78 MB LoRA
Model learned: correct tool-call sequence, promql before argocd rollback, postmortem structure

Phase 2: Online GRPO

60 steps · 236 real GKE rollout episodes · DAPO loss
step 01: mean=0.355 max=0.539
step 24: mean=0.376 max=0.700
step 31: mean=0.421 max=0.671 ← peak
step 60: mean=0.364 max=0.506
overall: mean=0.200 runtime=9h 34m
True online RL: every step = real chaos + real rollouts + real cluster scoring
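GRPO-style methods score each rollout relative to the other rollouts sampled for the same scenario. Below is a minimal sketch of that group-relative normalization only, assuming the common mean/std form with a population standard deviation; the sample rewards are made up, and this is not the project's training code or its DAPO loss.

```python
from statistics import mean, pstdev
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Normalize rewards within one rollout group: (r - mean) / (std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std over the group (an assumption)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for four rollouts of the same incident.
advantages = group_relative_advantages([0.2, 0.5, 0.7, 0.4])
```

Rollouts above the group mean get a positive advantage and are reinforced; a group where every rollout scores the same contributes no gradient.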

What Makes Our Training Unique

| Feature | Standard GRPO | AtlasOps |
|---|---|---|
| Environment | Simulator / offline | Real GKE cluster, live kubectl |
| Loss function | GRPO | DAPO, stable on sparse rewards |
| Reward signal | Episode-level only | Dense per-step + episode contract |
| Curriculum | Random / fixed | Spaced repetition with mastery tracking |
| Scenario generation | Static | ∞ adversarial: 72B judge writes new Chaos YAML |
| Judge | Single rubric | 3 personas: Junior / Senior / Principal |
−0.25
false resolution penalty
−0.20
hallucinated evidence
+0.15
red herring bonus
+0.08
mutating action success
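One way these four terms could combine is as additive adjustments to a base score. This is a sketch under assumptions: the additive form, the per-action mutation bonus, reading "red herring bonus" as a reward for correctly dismissing a red herring, and all field names are hypothetical.

```python
from dataclasses import dataclass

# Shaping terms copied from the slide.
FALSE_RESOLUTION_PENALTY = -0.25
HALLUCINATED_EVIDENCE_PENALTY = -0.20
RED_HERRING_BONUS = 0.15
MUTATING_ACTION_BONUS = 0.08


@dataclass
class EpisodeFlags:
    """Hypothetical per-episode outcomes; field names are illustrative."""
    false_resolution: bool = False
    hallucinated_evidence: bool = False
    red_herring_dismissed: bool = False
    successful_mutations: int = 0


def shaped_reward(base: float, flags: EpisodeFlags) -> float:
    """Add the slide's shaping terms to a base judge score."""
    reward = base
    if flags.false_resolution:
        reward += FALSE_RESOLUTION_PENALTY
    if flags.hallucinated_evidence:
        reward += HALLUCINATED_EVIDENCE_PENALTY
    if flags.red_herring_dismissed:
        reward += RED_HERRING_BONUS
    # Assumption: the bonus applies once per successful mutating action.
    reward += MUTATING_ACTION_BONUS * flags.successful_mutations
    return reward
```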

Benchmark Results

28 frozen scenarios · Real GKE cluster · AMD MI300X · Qwen2.5-7B
82%
Resolution Rate
+28pp vs zero-shot
0.729
Avg Reward
72B judge-scored
59s
Avg MTTR
vs ~25 min human
78%
Cascade Rate
+38pp vs zero-shot
| Model | Resolution | Reward | Cascade | Named Replays | Unsafe Actions |
|---|---|---|---|---|---|
| Qwen2.5-7B zero-shot | 54% | 0.481 | 40% | 30% | 5 |
| AtlasOps SFT | 68% | 0.601 | 62% | 55% | 3 |
| AtlasOps GRPO (MI300X) | 82% | 0.729 | 78% | 72% | 1 |
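The headline deltas follow directly from the rows above. A quick check, using only numbers from this slide (the dictionary keys are shorthand, not names from the project):

```python
# (resolution %, cascade %) per model, copied from the results rows.
results = {
    "zero-shot": (54, 40),
    "sft": (68, 62),
    "grpo": (82, 78),
}

resolution_gain_pp = results["grpo"][0] - results["zero-shot"][0]
cascade_gain_pp = results["grpo"][1] - results["zero-shot"][1]

# MTTR comparison: ~25 min human average vs 59 s agent average.
mttr_speedup = (25 * 60) / 59
print(resolution_gain_pp, cascade_gain_pp, round(mttr_speedup, 1))
```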

Production Safety: No Agent Can Cause an Outage

🚦
APPROVAL GATE
P0: Human required, no auto-execution
P1: 60-second approval window
P2/P3: Fully automatic
Token-based callbacks via REST API
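The severity rules above map naturally onto a small policy function. A minimal sketch with hypothetical names; the slide does not say what happens when the P1 window expires, so the function only returns the window and leaves that choice to the caller.

```python
from enum import Enum
from typing import Optional, Tuple


class Decision(Enum):
    WAIT_FOR_HUMAN = "wait_for_human"
    WAIT_WITH_TIMEOUT = "wait_with_timeout"
    AUTO_EXECUTE = "auto_execute"


def approval_policy(severity: str) -> Tuple[Decision, Optional[int]]:
    """Map an incident severity to a gate decision and approval window in seconds."""
    if severity == "P0":
        return Decision.WAIT_FOR_HUMAN, None
    if severity == "P1":
        return Decision.WAIT_WITH_TIMEOUT, 60
    if severity in ("P2", "P3"):
        return Decision.AUTO_EXECUTE, None
    raise ValueError(f"unknown severity: {severity}")
```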
⚑
CIRCUIT BREAKER
50 tool calls per incident max
10 mutating actions per hour
3 consecutive failures → OPEN state
Tripped 1× during GRPO training (working as designed)
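The three limits can be enforced by a small stateful guard. This is a sketch, not the project's breaker: the class and method names are hypothetical, tripping to OPEN on any exceeded limit is an assumption, and recovery from OPEN is left out.

```python
import time
from collections import deque
from typing import Deque, Optional


class CircuitBreaker:
    """Hypothetical breaker enforcing the three limits from the slide."""

    MAX_TOOL_CALLS = 50            # per incident
    MAX_MUTATIONS_PER_HOUR = 10
    MAX_CONSECUTIVE_FAILURES = 3

    def __init__(self) -> None:
        self.tool_calls = 0
        self.consecutive_failures = 0
        self.mutation_times: Deque[float] = deque()
        self.state = "CLOSED"

    def allow(self, mutating: bool, now: Optional[float] = None) -> bool:
        """Return True if the next tool call may run; trip to OPEN otherwise."""
        now = time.time() if now is None else now
        # Drop mutation timestamps older than one hour.
        while self.mutation_times and now - self.mutation_times[0] >= 3600:
            self.mutation_times.popleft()
        over_limit = self.tool_calls >= self.MAX_TOOL_CALLS or (
            mutating and len(self.mutation_times) >= self.MAX_MUTATIONS_PER_HOUR
        )
        if self.state == "OPEN" or over_limit:
            self.state = "OPEN"
            return False
        self.tool_calls += 1
        if mutating:
            self.mutation_times.append(now)
        return True

    def record_result(self, success: bool) -> None:
        """Track consecutive failures and open the breaker at the threshold."""
        self.consecutive_failures = 0 if success else self.consecutive_failures + 1
        if self.consecutive_failures >= self.MAX_CONSECUTIVE_FAILURES:
            self.state = "OPEN"
```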
🔗
INCIDENT CORRELATOR
5-minute deduplication window
Fingerprint-based alert grouping
Prevents 10 parallel chains on one cascade
Tracks all active incidents
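Fingerprint-based deduplication can be sketched as a hash of the grouping labels plus a time window. The class, the label choice, and the ID scheme are hypothetical; grouping a whole cascade into one incident would depend on passing only cascade-level labels, which is an assumption here.

```python
import hashlib
from typing import Dict, Tuple

DEDUP_WINDOW_S = 5 * 60  # 5-minute window from the slide


def fingerprint(labels: Dict[str, str]) -> str:
    """Stable hash of the alert labels; which labels to include is an assumption."""
    canonical = "|".join(f"{k}={labels[k]}" for k in sorted(labels))
    return hashlib.sha256(canonical.encode()).hexdigest()[:16]


class IncidentCorrelator:
    def __init__(self) -> None:
        # fingerprint -> (incident id, time the incident was opened)
        self.active: Dict[str, Tuple[int, float]] = {}
        self._next_id = 1

    def correlate(self, labels: Dict[str, str], now: float) -> Tuple[int, bool]:
        """Return (incident_id, is_new). Repeats inside the window reuse the incident."""
        fp = fingerprint(labels)
        hit = self.active.get(fp)
        if hit is not None and now - hit[1] < DEDUP_WINDOW_S:
            return hit[0], False
        incident_id = self._next_id
        self._next_id += 1
        self.active[fp] = (incident_id, now)
        return incident_id, True
```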
📋
HMAC AUDIT LOG
Hash-chained entries, tamper-evident
Every tool call + approval logged
`verify_integrity()` checks full chain
Cryptographic proof of what happened
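A hash-chained HMAC log can be sketched in a few lines with the standard library. Only the name `verify_integrity()` comes from the slide; the class, entry layout, and key handling are assumptions, and this is not the project's implementation.

```python
import hashlib
import hmac
import json
from typing import Any, Dict, List

GENESIS = "0" * 64  # placeholder "previous MAC" for the first entry


class AuditLog:
    """Hypothetical hash-chained log; each entry's MAC covers the previous MAC."""

    def __init__(self, key: bytes) -> None:
        self._key = key
        self.entries: List[Dict[str, Any]] = []

    def _mac(self, prev: str, payload: Dict[str, Any]) -> str:
        body = prev + json.dumps(payload, sort_keys=True)
        return hmac.new(self._key, body.encode(), hashlib.sha256).hexdigest()

    def append(self, payload: Dict[str, Any]) -> None:
        prev = self.entries[-1]["mac"] if self.entries else GENESIS
        self.entries.append(
            {"payload": payload, "prev": prev, "mac": self._mac(prev, payload)}
        )

    def verify_integrity(self) -> bool:
        """Recompute every MAC in order; an edited or reordered entry breaks the chain."""
        prev = GENESIS
        for entry in self.entries:
            expected = self._mac(prev, entry["payload"])
            if entry["prev"] != prev or not hmac.compare_digest(entry["mac"], expected):
                return False
            prev = entry["mac"]
        return True
```

Because each MAC depends on the one before it, changing any earlier entry invalidates every later one, which is what makes the log tamper-evident.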

Cloudflare 2019: Replay Postmortem

What happened when we ran AtlasOps against a real recreation of the incident that took down 85% of Cloudflare's traffic:
00:03  TRIAGE   PagerDuty ACK · severity P1 · blast: frontend + checkout + cart
00:08  DIAGNOSIS   promql → 5xx surge on checkoutservice (error_rate: 34%)
00:10  DIAGNOSIS   jaeger → timeout chain ends at currencyservice (CPU at 1999m/2000m)
00:13  REMEDIATION   argocd rollback currencyservice → revision 3 ✓
00:18  REMEDIATION   promql confirms error_rate < 0.1% · RESOLVED
00:22  COMMS   slack posted · statuspage updated
00:24  COMMS   postmortem saved → docs/postmortems/cloudflare-2019-replay.md
4m 12s
Total MTTR
3
Tool calls to root cause
0.856
Judge score

Tech Stack

AMD Hardware

MI300X: 192 GB HBM3
ROCm 7.2
vLLM 0.17.1 (ROCm build)
18× speedup vs shared API
312ms p50 inference latency
5 models co-hosted simultaneously

ML Training

Qwen2.5-7B-Instruct × 4
Qwen2.5-72B-Instruct-AWQ (judge)
TRL 1.4.0: SFTTrainer + GRPOTrainer
PEFT QLoRA: 4-bit NF4, r=16
BitsAndBytes-ROCm
HF Optimum-AMD (inference)

Application

FastAPI + custom SSE streaming
Chaos Mesh (6 fault types)
Prometheus + Grafana + Jaeger
Argo CD GitOps
GKE Standard · Cloud SQL
Docker · HuggingFace Spaces

AMD Developer Hackathon 2026

AtlasOps

Real GKE cluster · Real training · Real results
54% → 82% resolution rate. Zero simulations.
GitHub
Harikishanth/AtlasOps
HF Space
lablab-ai-amd/atlasops
Team
Da Big Three
Harikishanth R  ·  Reshma Affrin F  ·  Jehrome F