
atlasops / docs /slides.md

Harikishanth R

fix: skip-kubectl + scroll + health — HF Space ready

7e9a520 11 days ago


marp: true
theme: uncover
paginate: true
backgroundColor: '#060A12'
color: '#E8EDF5'
style: >
  @import
  url('https://fonts.googleapis.com/css2?family=Inter:wght@300;400;600;700;900&family=JetBrains+Mono:wght@400;700&display=swap');


  section {
    font-family: 'Inter', sans-serif;
    font-size: 20px;
    background: #060A12;
    color: #E8EDF5;
    padding: 48px 56px;
    display: flex;
    flex-direction: column;
    justify-content: center;
  }


  h1 {
    font-size: 2.8em;
    font-weight: 900;
    letter-spacing: -1px;
    line-height: 1.1;
    margin-bottom: 12px;
  }


  h2 {
    font-size: 1.5em;
    font-weight: 700;
    letter-spacing: 3px;
    text-transform: uppercase;
    margin-bottom: 20px;
    padding-bottom: 10px;
    border-bottom: 2px solid rgba(255,255,255,0.1);
  }


  h3 {
    font-size: 1.1em;
    font-weight: 600;
    color: #00d4ff;
    letter-spacing: 1px;
    margin-bottom: 8px;
  }


  strong { color: #00D4FF; }

  em { color: #ff4560; font-style: normal; font-weight: 600; }


  code {
    font-family: 'JetBrains Mono', monospace;
    background: rgba(0,212,255,0.08);
    color: #00d4ff;
    padding: 2px 8px;
    border-radius: 4px;
    font-size: 0.85em;
    border: 1px solid rgba(0,212,255,0.2);
  }


  pre {
    background: rgba(0,0,0,0.4);
    border: 1px solid rgba(0,212,255,0.15);
    border-radius: 8px;
    padding: 18px 20px;
    font-size: 0.75em;
  }


  table {
    font-size: 0.8em;
    border-collapse: collapse;
    width: 100%;
    margin-top: 16px;
  }


  th {
    background: rgba(0,212,255,0.1);
    color: #00d4ff;
    padding: 10px 14px;
    text-align: left;
    font-weight: 700;
    letter-spacing: 1px;
    text-transform: uppercase;
    font-size: 0.8em;
  }


  td {
    padding: 9px 14px;
    border-bottom: 1px solid rgba(255,255,255,0.06);
  }


  tr:last-child td { border-bottom: none; }


  blockquote {
    border-left: 3px solid #ff4560;
    padding-left: 20px;
    color: #9BA3B8;
    font-style: italic;
    margin: 16px 0;
  }


  section::after {
    font-family: 'JetBrains Mono', monospace;
    font-size: 0.65em;
    color: rgba(255,255,255,0.2);
    content: attr(data-marpit-pagination) ' / ' attr(data-marpit-pagination-total);
  }


  .accent { color: #00D4FF; }

  .red { color: #FF4560; }

  .green { color: #00E396; }

  .yellow { color: #FFB703; }

  .dim { color: #5A6478; }

AMD Developer Hackathon 2026

AtlasOps

Can 4 AI agents replace
an on-call SRE team?

Real GKE Cluster · GCP

AMD MI300X · 192 GB HBM3

SFT + Online GRPO Trained

Harikishanth R · Reshma Affrin F · Jehrome F | Da Big Three

The Problem

2:47 AM

The average time a P1 alert fires. Your on-call engineer is asleep, or stressed and rushing.

~25 min

Average human MTTR for a cascade incident. Revenue bleeding the entire time.

$250B

Global observability + SRE market. On-call burnout is the industry's most expensive unsolved problem.

Every SRE team has a war story. The 3 AM page. The cascading failure nobody understood for 40 minutes. The postmortem that blamed "human error." The real failure was that there was no system to help them think faster.

Introducing AtlasOps

Four specialized AI agents. One AMD MI300X. One real GKE cluster. No simulations.

🔴

TRIAGE

Ack alert
Classify severity
Map blast radius
<4 tool calls

🔍

DIAGNOSIS

PromQL queries
Jaeger traces
kubectl logs
Root cause ID

🔧

REMEDIATION

Argo CD rollback
kubectl scale
Alert silence
Verify fix

📣

COMMS

Slack update
Postmortem
Status page
Action items

Alert → Triage → Diagnosis → [Approval Gate] → Remediation → Comms → Postmortem

Real Infrastructure — Not a Simulation

☁ Google Cloud Platform

▸ GKE Standard Cluster: us-central1, 3× e2-standard-4

▸ Online Boutique: 11 real microservices (Go, Python, Node, Java, C#, gRPC)

▸ Chaos Mesh: PodChaos · NetworkChaos · StressChaos · DNSChaos · IOChaos · TimeChaos

▸ Prometheus + Grafana + Jaeger + OTel: full observability stack

▸ Argo CD: GitOps rollbacks, real execution

▸ Cloud SQL (Postgres 15) + Cloud PubSub + Cloud Monitoring

▸ Alertmanager: webhook fires agents on real alerts

🛠 20 Real SRE Tools

kubectl (7 cmds)

promql_query

promql_range

jaeger_search

jaeger_get_trace

argocd_rollback

gcloud_logs_read

cloud_monitoring

alertmanager_silence

postmortem_draft

Every tool hits a real API. No mocks in production.

38 Chaos Scenarios + Infinite Adversarial Generation

| Tier | Count | Difficulty |
|---|---|---|
| Single-fault | 8 | Beginner |
| Cascade | 5 | Hard |
| Multi-fault | 5 | Expert |
| Named Replays | 10 | Expert |
| Dynamic Adversarial | ∞ | 72B-designed |

10 Named Historical Replays

⚡ Cloudflare 2019 — Regex CPU storm, 85% traffic down

⚡ GitHub 2018 — DB failover loop, 24h incident

⚡ AWS S3 2017 — Typo'd command cascaded globally

⚡ Discord 2022 — Redis thundering herd

⚡ Fastly 2021 — Bad VCL config, internet outage

⚡ Facebook BGP 2021 — Control plane partition

⚡ Knight Capital 2012 — Partial deploy, $440M loss

+ Datadog 2023 · Slack 2022 · Azure DNS 2019

Adversarial designer: After each benchmark run, the Qwen2.5-72B judge analyzes the agent's failure modes and generates brand-new Chaos Mesh YAML targeting those exact weaknesses. The test set gets harder as the model improves — impossible to memorize.
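For orientation, the kind of Chaos Mesh object such a generator would emit can be sketched in a few lines. This is a minimal illustration, not the AtlasOps generator: the helper name `pod_kill_manifest`, the experiment name, and the `default` namespace are assumptions, and the `app: currencyservice` label is assumed to match the Online Boutique deployment. The manifest is printed as JSON, which `kubectl apply -f` accepts as well as YAML.

```python
import json


def pod_kill_manifest(name: str, namespace: str, app_label: str) -> dict:
    """Build a Chaos Mesh PodChaos manifest that kills one matching pod."""
    return {
        "apiVersion": "chaos-mesh.org/v1alpha1",
        "kind": "PodChaos",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "action": "pod-kill",   # terminate the selected pod
            "mode": "one",          # pick a single pod among the matches
            "selector": {
                "namespaces": [namespace],
                "labelSelectors": {"app": app_label},
            },
        },
    }


# Hypothetical target: names and namespace are illustrative only.
manifest = pod_kill_manifest("kill-currencyservice", "default", "currencyservice")
print(json.dumps(manifest, indent=2))
```

A generator that targets observed weaknesses would vary `action`, `mode`, and the selector rather than hard-coding them as this sketch does.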

Why AMD MI300X Was Non-Negotiable

Memory Requirements

Qwen2.5-7B base (shared)  ~4 GB
4× LoRA adapters (r=16)   ~160 MB
Qwen2.5-72B judge (AWQ)   ~37 GB
GRPO training buffers     ~12 GB
vLLM KV cache             ~70 GB
Total required ~126 GB
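The budget above can be checked with a few lines of arithmetic. This is a minimal sketch that uses only the figures on this slide; the dictionary keys and the `fits` helper are illustrative names. The itemized lines sum to about 123 GB, so the stated ~126 GB total presumably includes overhead that is not itemized (an assumption).

```python
# Memory budget in GB, copied from the slide. Names are illustrative.
BUDGET_GB = {
    "Qwen2.5-7B base (shared)": 4.0,
    "4x LoRA adapters (r=16)": 0.16,
    "Qwen2.5-72B judge (AWQ)": 37.0,
    "GRPO training buffers": 12.0,
    "vLLM KV cache": 70.0,
}
STATED_TOTAL_GB = 126.0  # total quoted on the slide

# Device memory of the GPUs compared on the slide.
GPU_CAPACITY_GB = {"T4": 16, "A100": 80, "MI300X": 192}


def fits(required_gb: float, capacity_gb: float) -> bool:
    """True when the whole budget fits on one device."""
    return required_gb <= capacity_gb


itemized_gb = sum(BUDGET_GB.values())
print(f"itemized: {itemized_gb:.2f} GB, stated: {STATED_TOTAL_GB:.0f} GB")

for gpu, capacity in GPU_CAPACITY_GB.items():
    headroom = capacity - STATED_TOTAL_GB
    verdict = "fits" if fits(STATED_TOTAL_GB, capacity) else "does not fit"
    print(f"{gpu}: {verdict} (headroom {headroom:+.0f} GB)")
```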

GPU Comparison

A100 (80 GB) ❌

Fits agents OR judge — not both simultaneously. Online GRPO impossible.

T4 (16 GB) ❌

Can't fit Qwen2.5-7B at all. CUDA OOM at model load.

MI300X 192 GB HBM3 ✅

All 5 models co-hosted. 66 GB free. 18× faster inference vs shared API.

Training Pipeline — SFT → Online GRPO

Phase 1: Supervised Fine-Tuning

2,028 real GKE trajectories · QLoRA 4-bit NF4 · LoRA r=16

loss: 1.265 → 0.48 → 0.19 → 0.027
accuracy: 71.96% → 99.10%
time: 14 min 16 sec
adapter: 78 MB LoRA

Model learned: the correct tool-call order, running PromQL checks before any Argo CD rollback, and the postmortem structure

Phase 2: Online GRPO

60 steps · 236 real GKE rollout episodes · DAPO loss

step 01: mean=0.355  max=0.539
step 24: mean=0.376  max=0.700
step 31: mean=0.421  max=0.671 ← peak
step 60: mean=0.364  max=0.506
overall: mean=0.200  runtime=9h 34m

True online RL: every step = real chaos + real rollouts + real cluster scoring
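The core of a GRPO-style update is the group-relative advantage: each rollout's reward is normalized against the other rollouts sampled for the same scenario. The sketch below shows that normalization only. It is not the TRL implementation, the rewards are made-up illustrative values, and the use of the population standard deviation is a simplification chosen for this example.

```python
from statistics import mean, pstdev


def group_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Normalize each reward against its own rollout group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def has_learning_signal(rewards: list[float]) -> bool:
    """A group whose rewards are all equal yields zero advantages."""
    return pstdev(rewards) > 0


# Hypothetical rewards for four rollouts of one chaos scenario.
example_group = [0.2, 0.4, 0.6, 0.8]
advantages = group_advantages(example_group)
print([round(a, 3) for a in advantages])
```

Rollouts above the group mean get a positive advantage and those below get a negative one, which is why a group with identical rewards contributes no gradient.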

What Makes Our Training Unique

| Feature | Standard GRPO | AtlasOps |
|---|---|---|
| Environment | Simulator / offline | Real GKE cluster, live kubectl |
| Loss function | GRPO | DAPO (stable on sparse rewards) |
| Reward signal | Episode-level only | Dense per-step + episode contract |
| Curriculum | Random / fixed | Spaced repetition with mastery tracking |
| Scenario generation | Static | ∞ adversarial: 72B judge writes new Chaos YAML |
| Judge | Single rubric | 3 personas: Junior / Senior / Principal |

−0.25

false resolution penalty

−0.20

hallucinated evidence

+0.15

red herring bonus

+0.08

mutating action success
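One way to read the four shaping terms above is as additive adjustments to a base score. The sketch below assumes exactly that. The additive combination, the `EpisodeFlags` fields, and the reading of the red herring bonus as a reward for correctly ignoring a decoy signal are all assumptions, not the AtlasOps reward code.

```python
from dataclasses import dataclass

# Shaping terms copied from the slide.
FALSE_RESOLUTION_PENALTY = -0.25
HALLUCINATED_EVIDENCE_PENALTY = -0.20
RED_HERRING_BONUS = 0.15
MUTATING_ACTION_BONUS = 0.08


@dataclass
class EpisodeFlags:
    """Hypothetical per-episode outcomes used only in this sketch."""
    false_resolution: bool = False           # claimed RESOLVED while still broken
    hallucinated_evidence: bool = False      # cited data no tool returned
    ignored_red_herring: bool = False        # did not chase a decoy signal
    mutating_action_succeeded: bool = False  # e.g. a rollback that took effect


def shaped_reward(base: float, flags: EpisodeFlags) -> float:
    """Add the shaping terms to a base score (assumed additive)."""
    reward = base
    if flags.false_resolution:
        reward += FALSE_RESOLUTION_PENALTY
    if flags.hallucinated_evidence:
        reward += HALLUCINATED_EVIDENCE_PENALTY
    if flags.ignored_red_herring:
        reward += RED_HERRING_BONUS
    if flags.mutating_action_succeeded:
        reward += MUTATING_ACTION_BONUS
    return round(reward, 3)


print(shaped_reward(0.5, EpisodeFlags(ignored_red_herring=True,
                                      mutating_action_succeeded=True)))
```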

Benchmark Results

28 frozen scenarios · Real GKE cluster · AMD MI300X · Qwen2.5-7B

82%

Resolution Rate

+28pp vs zero-shot

0.729

Avg Reward

72B judge-scored

59s

Avg MTTR

vs ~25 min human

78%

Cascade Rate

+38pp vs zero-shot

| Model | Resolution | Reward | Cascade | Named Replays | Unsafe Actions |
|---|---|---|---|---|---|
| Qwen2.5-7B zero-shot | 54% | 0.481 | 40% | 30% | 5 |
| AtlasOps SFT | 68% | 0.601 | 62% | 55% | 3 |
| AtlasOps GRPO (MI300X) | 82% | 0.729 | 78% | 72% | 1 |
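The headline deltas follow directly from the table. A short sketch, using only the numbers on this slide, recomputes the percentage-point gains and the MTTR ratio; the dictionary layout and helper name are illustrative.

```python
# Benchmark rows copied from the table (percentages as whole numbers).
RESULTS = {
    "Qwen2.5-7B zero-shot": {"resolution": 54, "cascade": 40, "named_replays": 30},
    "AtlasOps SFT": {"resolution": 68, "cascade": 62, "named_replays": 55},
    "AtlasOps GRPO (MI300X)": {"resolution": 82, "cascade": 78, "named_replays": 72},
}
BASELINE = "Qwen2.5-7B zero-shot"


def delta_pp(metric: str, model: str, baseline: str = BASELINE) -> int:
    """Percentage-point gain of `model` over `baseline` on one metric."""
    return RESULTS[model][metric] - RESULTS[baseline][metric]


for metric in ("resolution", "cascade", "named_replays"):
    print(f"{metric}: +{delta_pp(metric, 'AtlasOps GRPO (MI300X)')}pp vs zero-shot")

# MTTR comparison: ~25 minutes for a human vs 59 seconds for the agents.
mttr_ratio = round(25 * 60 / 59, 1)
print(f"MTTR ratio: about {mttr_ratio}x")
```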

Production Safety — No Agent Can Cause an Outage

🚦

APPROVAL GATE

P0: Human required — no auto-execution
P1: 60-second approval window
P2/P3: Fully automatic
Token-based callbacks via REST API
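The severity policy above can be sketched as a small decision function. This is an illustration, not the AtlasOps gate: the names `Decision` and `gate` are hypothetical, and the slide does not say what happens when a P1 window lapses without an answer, so this sketch assumes the conservative choice of blocking the action.

```python
from enum import Enum
from typing import Optional


class Decision(Enum):
    EXECUTE = "execute"
    WAIT = "wait"
    BLOCK = "block"


# Approval window in seconds per severity, from the slide.
# None means wait for a human indefinitely; 0 means no approval needed.
APPROVAL_WINDOW_S = {"P0": None, "P1": 60, "P2": 0, "P3": 0}


def gate(severity: str, approved: Optional[bool], waited_s: float) -> Decision:
    """Decide whether a remediation action may run."""
    window = APPROVAL_WINDOW_S[severity]
    if window == 0:
        return Decision.EXECUTE          # P2/P3: fully automatic
    if approved is True:
        return Decision.EXECUTE
    if approved is False:
        return Decision.BLOCK
    if window is None:
        return Decision.WAIT             # P0: never auto-executes
    # P1 with no answer yet. Blocking after the window is an assumption.
    return Decision.WAIT if waited_s < window else Decision.BLOCK


print(gate("P1", approved=None, waited_s=30))
```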

⚡

CIRCUIT BREAKER

50 tool calls per incident max
10 mutating actions per hour
3 consecutive failures → OPEN state
Tripped 1× during GRPO training (working as designed)
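The three limits above map onto a small state machine. The following is a sketch under stated assumptions rather than the project's breaker: the class and method names are hypothetical, one instance is assumed per incident, and the hourly mutation limit is tracked inside that instance for simplicity.

```python
import time
from collections import deque


class CircuitBreaker:
    """Illustrative breaker enforcing the three limits from the slide."""

    MAX_TOOL_CALLS = 50            # per incident
    MAX_MUTATIONS_PER_HOUR = 10
    MAX_CONSECUTIVE_FAILURES = 3

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._mutations = deque()  # timestamps of recent mutating actions
        self.tool_calls = 0
        self.consecutive_failures = 0
        self.state = "CLOSED"

    def allow(self, mutating: bool = False) -> bool:
        """Return True and count the call if it is within all limits."""
        if self.state == "OPEN" or self.tool_calls >= self.MAX_TOOL_CALLS:
            return False
        if mutating:
            now = self._clock()
            # Drop mutations that are older than one hour.
            while self._mutations and now - self._mutations[0] >= 3600:
                self._mutations.popleft()
            if len(self._mutations) >= self.MAX_MUTATIONS_PER_HOUR:
                return False
            self._mutations.append(now)
        self.tool_calls += 1
        return True

    def record(self, success: bool) -> None:
        """Track outcomes; three failures in a row open the breaker."""
        if success:
            self.consecutive_failures = 0
            return
        self.consecutive_failures += 1
        if self.consecutive_failures >= self.MAX_CONSECUTIVE_FAILURES:
            self.state = "OPEN"
```

Passing the clock as a parameter keeps the hourly window testable without sleeping.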

🔗

INCIDENT CORRELATOR

5-minute deduplication window
Fingerprint-based alert grouping
Prevents 10 parallel chains on one cascade
Tracks all active incidents
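Fingerprint-based grouping with a deduplication window can be sketched as follows. The class name, the choice of `alertname` and `namespace` as fingerprint labels, the `INC-n` identifiers, and the sliding (rather than fixed) window are assumptions made for this example, not details taken from AtlasOps.

```python
import hashlib
import time

DEDUP_WINDOW_S = 5 * 60  # 5-minute window from the slide


class IncidentCorrelator:
    """Group alerts that share a fingerprint within the dedup window."""

    def __init__(self, clock=time.time):
        self._clock = clock
        self._active = {}   # fingerprint -> (incident_id, last_seen)
        self._next_id = 1

    @staticmethod
    def fingerprint(labels: dict, keys=("alertname", "namespace")) -> str:
        # Which labels identify "the same incident" is an assumption here.
        material = "|".join(f"{k}={labels.get(k, '')}" for k in keys)
        return hashlib.sha256(material.encode()).hexdigest()[:16]

    def correlate(self, labels: dict) -> tuple[str, bool]:
        """Return (incident_id, is_new) for an incoming alert."""
        now = self._clock()
        fp = self.fingerprint(labels)
        entry = self._active.get(fp)
        if entry is not None and now - entry[1] < DEDUP_WINDOW_S:
            self._active[fp] = (entry[0], now)  # refresh the sliding window
            return entry[0], False
        incident_id = f"INC-{self._next_id}"
        self._next_id += 1
        self._active[fp] = (incident_id, now)
        return incident_id, True

    def active_incidents(self) -> list[str]:
        return [incident_id for incident_id, _ in self._active.values()]
```

Only alerts for which `is_new` is true would start a new agent chain; duplicates attach to the existing incident.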

📋

HMAC AUDIT LOG

Hash-chained entries — tamper-evident
Every tool call + approval logged
`verify_integrity()` checks full chain
Cryptographic proof of what happened
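A hash-chained HMAC log of this kind can be sketched with the standard library. Only the method name `verify_integrity()` comes from the slide; the class layout, the genesis value, and the JSON serialization are assumptions for illustration.

```python
import hashlib
import hmac
import json

GENESIS = "0" * 64  # placeholder "previous MAC" for the first entry


class AuditLog:
    """Each entry's MAC covers its payload and the previous entry's MAC."""

    def __init__(self, key: bytes):
        self._key = key
        self.entries = []

    def _mac(self, prev_mac: str, payload: dict) -> str:
        message = prev_mac.encode() + json.dumps(payload, sort_keys=True).encode()
        return hmac.new(self._key, message, hashlib.sha256).hexdigest()

    def append(self, payload: dict) -> None:
        prev = self.entries[-1]["mac"] if self.entries else GENESIS
        self.entries.append(
            {"payload": payload, "prev": prev, "mac": self._mac(prev, payload)}
        )

    def verify_integrity(self) -> bool:
        """Recompute every MAC in order; any edit breaks the chain."""
        prev = GENESIS
        for entry in self.entries:
            expected = self._mac(prev, entry["payload"])
            if entry["prev"] != prev or not hmac.compare_digest(entry["mac"], expected):
                return False
            prev = entry["mac"]
        return True


log = AuditLog(key=b"demo-key-not-for-production")
log.append({"agent": "remediation", "tool": "argocd_rollback"})
log.append({"agent": "comms", "tool": "postmortem_draft"})
print(log.verify_integrity())
```

Because each MAC depends on the one before it, changing or deleting an earlier entry invalidates every later entry unless the attacker also holds the key.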

Cloudflare 2019 — Replay Postmortem

What happened when we ran AtlasOps against a real recreation of the incident that took down 85% of Cloudflare's traffic

00:03 TRIAGE PagerDuty ACK · severity P1 · blast: frontend + checkout + cart

00:08 DIAGNOSIS promql → 5xx surge on checkoutservice (error_rate: 34%)

00:10 DIAGNOSIS jaeger → timeout chain ends at currencyservice (CPU at 1999m/2000m)

00:13 REMEDIATION argocd rollback currencyservice → revision 3 ✓

00:18 REMEDIATION promql confirms error_rate < 0.1% · RESOLVED

00:22 COMMS slack posted · statuspage updated

00:24 COMMS postmortem saved → docs/postmortems/cloudflare-2019-replay.md

4m 12s

Total MTTR

Tool calls to root cause

0.856

Judge score

Tech Stack

AMD Hardware

MI300X — 192 GB HBM3

ROCm 7.2

vLLM 0.17.1 (ROCm build)

18× speedup vs shared API

312ms p50 inference latency

5 models co-hosted simultaneously

ML Training

Qwen2.5-7B-Instruct × 4

Qwen2.5-72B-Instruct-AWQ (judge)

TRL 1.4.0 — SFTTrainer + GRPOTrainer

PEFT QLoRA — 4-bit NF4, r=16

BitsAndBytes-ROCm

HF Optimum-AMD (inference)

Application

FastAPI + custom SSE streaming

Chaos Mesh (6 fault types)

Prometheus + Grafana + Jaeger

Argo CD GitOps

GKE Standard · Cloud SQL

Docker · HuggingFace Spaces

AMD Developer Hackathon 2026

AtlasOps

Real GKE cluster · Real training · Real results
54% → 82% resolution rate. Zero simulations.

GitHub

Harikishanth/AtlasOps

HF Space

lablab-ai-amd/atlasops

Team

Da Big Three

Harikishanth R · Reshma Affrin F · Jehrome F