| # GPU Goblin — Idea |
|
|
| > **An AI agent that hunts wasted compute on AMD MI300X.** |
| > Upload a fine-tuning config; the agent profiles a real run, diagnoses inefficiency, recommends ROCm-specific fixes, then re-runs and proves the speedup with hard numbers. |
|
|
| --- |
|
|
| ## The Problem |
|
|
| Most teams fine-tuning LLMs on AMD hardware leave **2–3× of throughput on the floor** without realizing it. The waste hides in plain sight: |
|
|
| - **HBM under-fed** — batch sizes copied from NVIDIA tutorials don't exploit MI300X's 192 GB. |
| - **Wrong precision** — `fp16` is the reflexive default; `bf16` matches throughput on CDNA3 with better numerics. |
| - **Naive attention** — teams ship without flash-attn ROCm fork or PyTorch SDPA enabled. |
| - **Generic kernels** — hipBLASLt tuning, MIOpen autotune, RCCL collective tweaks all left at defaults. |
| - **Wrong libraries** — `bitsandbytes` is the reflexive choice for 8-bit Adam, but it's **not officially supported on ROCm**. Optimum-AMD validates Flash Attention 2 + GPTQ/AWQ paths instead. |
| - **Distributed footguns** — one-process-driving-many-GPUs serializes kernel launches on ROCm; NUMA auto-balancing causes variance; `NCCL_MIN_NCHANNELS` left at default. |
| - **Knobs nobody knows about** — `HSA_FORCE_FINE_GRAIN_PCIE`, `MIOPEN_FIND_MODE`, hipBLASLt cache paths. |
|
|
| Engineers don't fix this not because they're lazy — it's because the knowledge is **scattered across ROCm docs, AMD blog posts, GitHub issues, and Discord threads**. Nobody has time to read it all before launching a run. |
|
|
| ## The Solution — GPU Goblin |
|
|
| A tool-using agent that does what an experienced AMD performance engineer would do, in one minute: |
|
|
| 1. **Read** the user's fine-tuning script or HF `TrainingArguments`. |
| 2. **Profile** a 10-step warm run on MI300X (torch.profiler + `rocprofv3` + amd-smi). |
| 3. **Diagnose** against a curated knowledge base of ROCm-specific optimizations. |
| 4. **Patch** the config — concrete diff, not vague advice. |
| 5. **Benchmark** the patched config on the same MI300X. |
| 6. **Report** before/after side-by-side with real numbers and citations. |
|
|
| The user keeps interacting in chat: *"why bf16?"*, *"what if I can't change the optimizer?"*, *"how much $ does this save per epoch?"* |
|
|
| ## Why This Wins |
|
|
| **Most teams build user-facing AI. We build AI for AI builders.** That alone is judge-bait. |
|
|
| But the deeper hook is that GPU Goblin is **provably useful** in a way most hackathon projects aren't. We don't have to convince judges the recommendations are good — we *show* them, on the same MI300X, with the same model, in the same demo. Tokens/sec: `142 → 318`. End of debate. |
|
|
| ### Track Fit (Track 1: AI Agents & Agentic Workflows) |
|
|
| - **Primary track:** AI Agents & Agentic Workflows. Real tool-using loop. The agent observes (profile), hypothesizes (KB query), tests (benchmark), refines (patch). Every step visible in chat. Not a one-shot LLM call dressed up as an agent. |
| - **Fine-tuning is what we *audit*, not the track we enter.** Canonical workload is **Qwen2.5-7B-Instruct + LoRA on MI300X**. Recommendations are *fine-tuning specific* — batch sizing for LoRA, gradient checkpointing thresholds, optimizer choice (8-bit Adam on ROCm), seq-length packing. |
|
|
| ### Qwen Technology Partner Challenge |
|
|
| GPU Goblin satisfies the Qwen partner challenge two ways: |
|
|
| 1. **Qwen as the agent brain.** The audit loop runs on `Qwen/Qwen2.5-7B-Instruct` served via Hugging Face Inference Providers. Every tool call, every recommendation, every chat answer is generated by Qwen. |
| 2. **Qwen as the audit target.** The canonical demo workload is a Qwen2.5-7B-Instruct LoRA fine-tune. The agent audits *the same model family it runs on*. |
|
|
| Result: end-to-end Qwen on AMD silicon. Directly answers the judging criterion *"How effectively the chosen model(s) are integrated into the solution."* |
|
|
| ### Hugging Face Integration |
|
|
| Hugging Face is the named Technology Partner — both **model hub** and **deployment layer**: |
|
|
| - **Model hub:** Qwen models pulled from HF Hub at audit time + at agent-spin-up time. |
| - **Optimum-AMD:** several KB rules cite Optimum-AMD validated paths (Flash Attention 2 on MI300, GPTQ/AWQ ROCm path). |
| - **Inference Providers:** the agent's Qwen calls go through HF's Inference Providers router (`provider="auto"` selects Together / Fireworks-AI / Nebius based on availability). |
| - **Deployment as HF Space:** the Streamlit UI ships as a Space within the AMD Developer Hackathon HF Organization, satisfying the "Demo Application Platform" + "Application URL" submission fields. The Space runs in offline-replay mode by default so it works without any live backend. |
|
|
| ### AMD Differentiation |
|
|
| Every single recommendation cites a **ROCm-specific rule**, not generic PyTorch advice. The knowledge base is the moat: |
|
|
| - ROCm env-var tuning (`HSA_*`, `MIOPEN_*`, `NCCL_MIN_NCHANNELS=112`) |
| - hipBLASLt hint logging + offline tuning files; MIOpen `MIOPEN_FIND_*` autotune |
| - Flash-attn ROCm fork (validated on MI300 via Optimum-AMD) |
| - RCCL topology + tensor-parallel placement within a single XGMI island |
| - bitsandbytes-on-ROCm gotchas (not officially supported — recommend alternatives) |
| - bf16 vs fp16 on CDNA3 matrix cores |
| - One-process-per-GPU vs one-process-many-GPUs (ROCm serializes launches in the latter) |
| - NUMA auto-balancing disable for stable benchmarks |
| - MI300X-specific: 304 CUs, 192 GB HBM3, ~5.3 TB/s peak bandwidth, native FP8 |
|
|
| This is the kind of insight you only get from someone who has actually shipped on MI300X. We bottle it. |
|
|
| ## Demo Narrative (3 minutes) |
|
|
| **Setup (15s):** "Most teams waste 50%+ of their MI300X. Watch GPU Goblin find that waste live." |
|
|
| **Live demo (2 min):** |
|
|
| 1. Drop in `train_qwen_lora.py` — batch=4, fp16, naive attention. (Looks normal.) |
| 2. Agent: *"Parsing config…"* → extracts hyperparams. |
| 3. Agent: *"Running 10-step profile on MI300X…"* → shows real metrics: HBM 38%, MFU 24%. |
| 4. Agent: *"Querying ROCm playbook…"* → 4 issues found, each with citation. |
| 5. Agent: *"Generating optimized config…"* → diff appears: bf16, batch=12, flash-attn ROCm, hipBLASLt env, packed sequences. |
| 6. Agent: *"Benchmarking new config — 50 steps on MI300X…"* → live progress. |
| 7. Final report: **142 → 318 tokens/sec (2.24×). MFU 24% → 51%. $X saved per epoch.** |
|
|
| **Why it works (45s):** Show the agent loop. Show the KB. Land the line: *"the agent runs Qwen on AMD silicon, the audit target is Qwen on AMD silicon, the optimizations are AMD-specific. End-to-end AMD."* |
|
|
| ## What This Is *Not* |
|
|
| - Not a generic PyTorch profiler GUI (`rocprofv3` + tensorboard already exist) |
| - Not a chatbot wrapping a static FAQ (the agent runs real benchmarks) |
| - Not a multi-cloud cost tool (focused on MI300X compute waste, not infra cost) |
| - Not a fine-tuning trainer itself (we *audit* others' runs) |
|
|
| ## One-Line Pitch |
|
|
| > **GPU Goblin is the AMD performance engineer you wish was on your team — except it costs five minutes and audits any fine-tuning run on MI300X.** |
|
|