---
title: GPU Goblin
emoji: 🧌
colorFrom: red
colorTo: red
sdk: streamlit
sdk_version: "1.32.0"
app_file: ui/auto_tune_ui.py
pinned: false
license: mit
short_description: AI auto-tuner for MI300X fine-tuning workloads.
tags:
- amd
- mi300x
- rocm
- qwen
- huggingface
- agent
- fine-tuning
- llm
---
# GPU Goblin
> An AI agent that hunts wasted compute on AMD MI300X. Powered by Qwen.
GPU Goblin profiles a fine-tuning run, diagnoses inefficiency against a
curated ROCm knowledge base, recommends MI300X-specific fixes, and
re-benchmarks to prove the speedup with real numbers. The agent itself runs
on a Qwen model via Hugging Face Inference Providers; the canonical demo
workload is `Qwen/Qwen2.5-7B-Instruct` LoRA fine-tuning on MI300X.
Submitted to the **AMD Developer Hackathon**, Track 1: AI Agents & Agentic
Workflows. Incorporates the Qwen Technology Partner challenge (Qwen as both
agent brain and audit target) and uses Hugging Face as the model hub +
deployment layer.
See [`brainstorming/idea.md`](brainstorming/idea.md),
[`brainstorming/architecture.md`](brainstorming/architecture.md), and
[`brainstorming/goals.md`](brainstorming/goals.md).
## Quick Start
```bash
pip install -e ".[dev]"
# Required for the live agent loop:
export HF_TOKEN=hf_... # Hugging Face Inference token
# Optional: override the default Qwen model / provider.
# export GOBLIN_QWEN_MODEL=Qwen/Qwen2.5-7B-Instruct
# export GOBLIN_QWEN_PROVIDER=auto # or together / fireworks-ai / nebius / ...
uvicorn agent.server:app --reload --port 8000
streamlit run ui/app.py
```
The Streamlit UI works **without `HF_TOKEN`** in offline-replay mode: it
plays a cached audit trajectory (`tests/fixtures/cached_audit.json`) so
judges can see the canonical `142 → 318 tok/s (2.24×)` demo without our
backend or any live LLM.
## Repo Layout
```
agent/
schemas.py # Shared pydantic models (RunMetrics, WorkloadConfig, ...)
backends/ # Pluggable LLM driver (Qwen via HF Inference Providers)
tools/ # 6 tools the agent can call
loop.py # Provider-agnostic tool-use loop
server.py # FastAPI + SSE
runner/ # GPU runner (rocprofv3 wrapper) + FakeRunner fallback
kb/ # ROCm knowledge base (22 curated rules, the moat)
ui/ # Streamlit chat UI
workloads/ # Canonical Qwen demo + synthetic corpus
tests/ # Pytest suite + fixtures
brainstorming/ # Design docs (idea / architecture / goals)
```
## Development
The agent loop is testable on a laptop without an MI300X via the `FakeRunner`
and the synthetic corpus in `workloads/synthetic/`. Real benchmarks require
ROCm + MI300X (the `LiveRunner` auto-falls-back to `FakeRunner` when
`rocprofv3` / `amd-smi` / a render device are missing).
```bash
python3 -m pytest tests/ -v # 86 tests, no GPU required
python3 -m agent workloads/train_qwen_lora.py # CLI driver, prints SSE events
```
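The fallback described above can be pictured as a simple prerequisite probe. This is a minimal sketch, not the repo's `LiveRunner` code: the function names and the `which` / `list_render` parameters are hypothetical, and only the three checked prerequisites (`rocprofv3`, `amd-smi`, a render device) come from this README.

```python
import glob
import shutil


def missing_gpu_prerequisites(which=shutil.which, list_render=None):
    """Return the names of the prerequisites that are absent on this machine."""
    if list_render is None:
        # Render nodes show up as /dev/dri/renderD128, renderD129, ...
        list_render = lambda: glob.glob("/dev/dri/renderD*")
    missing = [tool for tool in ("rocprofv3", "amd-smi") if which(tool) is None]
    if not list_render():
        missing.append("render device")
    return missing


def pick_runner_name(missing):
    """Hypothetical policy: go live only when nothing is missing."""
    return "FakeRunner" if missing else "LiveRunner"


missing = missing_gpu_prerequisites()
print(pick_runner_name(missing), "| missing:", ", ".join(missing) or "nothing")
```

On a laptop without ROCm this should report `FakeRunner` along with the prerequisites it could not find, which is a quick way to see why a run used cached metrics.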
## Running on AMD Developer Cloud (MI300X)
End-to-end recipe for the live demo path. Assumes you've got the $100 hackathon
credits. Plan on ~10-15 GPU-hours total (well under budget).
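To see what "well under budget" means in dollars, here is a small arithmetic sketch. It assumes the public reference price of ~$1.99/GPU-hour quoted in the cost-control checklist further down; actual billing may differ.

```python
PRICE_PER_GPU_HOUR = 1.99  # reference price from the cost-control checklist (assumption)
CREDITS_USD = 100.0        # hackathon credits


def spend(gpu_hours, price=PRICE_PER_GPU_HOUR):
    """Dollar cost of a number of GPU-hours, rounded to cents."""
    return round(gpu_hours * price, 2)


low, high = spend(10), spend(15)
print(f"10-15 GPU-hours: ${low:.2f} to ${high:.2f} of ${CREDITS_USD:.0f}")
print(f"Credits cover about {CREDITS_USD / PRICE_PER_GPU_HOUR:.0f} GPU-hours in total")
```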
### 1. Provision an MI300X instance
1. Sign in to [AMD AI Developer Program](https://www.amd.com/en/developer/resources/developer-program.html)
and join the AMD Developer Cloud waitlist (instant approval for hackathon
participants).
2. Spin up an **MI300X** instance. Pick the largest container disk you can
(the model weights cache is 15-30 GB).
3. SSH in. You should land in a Linux shell with a `/dev/dri/renderD*`
device visible; confirm with `ls /dev/dri`.
### 2. Pull the ROCm + PyTorch container
```bash
docker pull rocm/pytorch:rocm6.1_ubuntu22.04_py3.10_pytorch_2.3
```
Run it with the GPU exposed. The `--device` / `--group-add video` lines are
the only ones that matter for ROCm passthrough:
```bash
docker run -it --rm \
--device=/dev/kfd --device=/dev/dri \
--group-add video --ipc=host --shm-size=16g \
-v $HOME:/workspace \
-e HF_TOKEN=$HF_TOKEN \
-e GOBLIN_QWEN_MODEL=Qwen/Qwen2.5-7B-Instruct \
-e ROCM_IMAGE_TAG=rocm6.1_pytorch2.3 \
-p 8000:8000 -p 8501:8501 \
rocm/pytorch:rocm6.1_ubuntu22.04_py3.10_pytorch_2.3 bash
```
### 3. Verify the GPU is visible inside the container
```bash
amd-smi monitor # shows utilization, HBM, power per GPU
rocprofv3 --version # confirms the profiler is on PATH
python -c "import torch; print(torch.cuda.is_available(), torch.cuda.get_device_name(0))"
# → True MI300X
```
If any of these fail, `LiveRunner` will fall back to `FakeRunner`: the
agent loop still works, but you get cached metrics instead of a real
benchmark. Don't chase the bug; the demo lane is intact.
### 4. Clone, install, run
```bash
cd /workspace
git clone https://github.com/Manoj-Sri/amd-hackathon-rocm.git goblin
cd goblin
pip install -e ".[dev]"
python -m pytest tests/ -q # 86 tests pass without GPU; faster sanity check
# Live run on MI300X:
python -m agent workloads/train_qwen_lora.py
# Streams SSE events: thought, tool_call, tool_result, ..., final_report
```
### 5. Run the FastAPI server + UI
```bash
# Terminal 1 (inside the container):
uvicorn agent.server:app --host 0.0.0.0 --port 8000
# Terminal 2:
streamlit run ui/app.py --server.port 8501 --server.address 0.0.0.0
```
If the cloud instance gives you a public IP/URL, port-forward 8501 to your
laptop. If not, SSH-tunnel with `ssh -L 8501:localhost:8501 user@instance`,
then open `http://localhost:8501` locally.
### 6. Cost-control checklist
- Cache benchmark results (`bench_cache/` is content-addressed by config +
workload SHA + container tag, so identical configs are free).
- Day-1 baseline run is the only "must-burn-GPU" task; everything else can
use cached metrics or the FakeRunner.
- Stop the instance between work sessions. AMD Developer Cloud bills only
for running time.
- Public reference price: ~$1.99/GPU-hour for MI300X VMs. ~$8 of your $100
(about 4 GPU-hours) covers a full demo + dry-runs.
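
The content-addressed cache from the first bullet can be sketched as below. This is an illustration under assumptions, not the repo's implementation: the function name, the use of SHA-256, the canonical-JSON encoding, and the example config fields are all hypothetical. Only the three inputs (config, workload SHA, container tag) come from this README.

```python
import hashlib
import json


def bench_cache_key(config: dict, workload_source: bytes, container_tag: str) -> str:
    """Derive a stable key from config + workload SHA + container tag."""
    workload_sha = hashlib.sha256(workload_source).hexdigest()
    # sort_keys makes the key independent of dict insertion order.
    payload = json.dumps(
        {"config": config, "workload_sha": workload_sha, "container_tag": container_tag},
        sort_keys=True,
        separators=(",", ":"),
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


source = b"print('train')\n"  # stand-in for the workload script's bytes
key_a = bench_cache_key({"batch_size": 8, "dtype": "bf16"}, source, "rocm6.1_pytorch2.3")
key_b = bench_cache_key({"dtype": "bf16", "batch_size": 8}, source, "rocm6.1_pytorch2.3")
print(key_a == key_b)  # same config in a different key order hits the same cache entry
```

Changing any one of the three inputs yields a different key, which is why a new container tag forces a fresh benchmark even when the config is unchanged.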
## Integrating with HF Qwen
The agent runs on Qwen via Hugging Face **Inference Providers**, which
auto-routes your request to one of HF's serving partners (Together,
Fireworks-AI, Nebius, Replicate, ...). You do not run Qwen yourself (HF
does), and you authenticate with a single token.
### 1. Get a Hugging Face token
1. Sign up at [huggingface.co](https://huggingface.co/).
2. Create a token at [Settings → Access Tokens](https://huggingface.co/settings/tokens)
with **read** + **inference** scope.
3. Export it:
```bash
export HF_TOKEN=hf_yourtokenhere
```
### 2. Join the AMD Developer Hackathon HF Organization
The hackathon submission requires publishing your project as a Hugging Face
Space within the event organization. Click the "Join" link on the
[hackathon page](https://lablab.ai/ai-hackathons/amd-developer) (look for
"Join the AMD Developer Hackathon HF Organization") and accept the
invitation in your HF account.
### 3. Confirm Qwen reachability
Before running the full agent, smoke-test the HF Inference connection:
```bash
python - <<'PY'
import asyncio, os
from huggingface_hub import AsyncInferenceClient
async def go():
client = AsyncInferenceClient(token=os.environ["HF_TOKEN"])
resp = await client.chat_completion(
model="Qwen/Qwen2.5-7B-Instruct",
messages=[{"role": "user", "content": "Say hello in 5 words."}],
max_tokens=32,
)
print(resp.choices[0].message.content)
asyncio.run(go())
PY
```
Expect a 1-line Qwen response. If you get an auth error, your token is
missing the `inference` scope. If you get a 404, the chosen model isn't
served by any active provider; try `Qwen/Qwen2.5-32B-Instruct` or set
`provider="together"` explicitly.
### 4. Run the agent against Qwen
The agent picks Qwen automatically; no env var is needed beyond `HF_TOKEN`:
```bash
python -m agent workloads/train_qwen_lora.py
```
You should see SSE events streaming: `thought` blocks from Qwen, `tool_call`
events as it picks tools, `tool_result` events, and finally a `final_report`
with the canonical `142 → 318 tok/s (2.24×)` line.
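If you want to consume that stream from your own script, a minimal parser might look like the sketch below. It assumes the server emits standard Server-Sent Events framing (`event:` and `data:` fields, blank line between events) with JSON payloads; the payload field names in the sample are made up for illustration and are not the repo's schema.

```python
import json


def parse_sse(lines):
    """Yield (event_name, payload) pairs from an iterable of SSE text lines."""
    event, data = "message", []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if not line:                      # a blank line terminates one event
            if data:
                yield event, json.loads("\n".join(data))
            event, data = "message", []
        elif line.startswith(":"):        # comment / keep-alive line
            continue
        elif line.startswith("event:"):
            event = line[len("event:"):].strip()
        elif line.startswith("data:"):
            data.append(line[len("data:"):].lstrip(" "))
    if data:                              # lenient: flush an unterminated final event
        yield event, json.loads("\n".join(data))


sample = [
    "event: thought\n",
    'data: {"text": "GPU utilization looks low"}\n',
    "\n",
    "event: final_report\n",
    'data: {"baseline_tok_s": 142, "tuned_tok_s": 318}\n',
    "\n",
]
for name, payload in parse_sse(sample):
    print(name, payload)
```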
### 5. Switching the model or provider
Qwen has many variants. Override at process start:
```bash
# Bigger model, more reliable tool calls:
export GOBLIN_QWEN_MODEL=Qwen/Qwen2.5-32B-Instruct
# Pin to a specific provider (skip auto-routing):
export GOBLIN_QWEN_PROVIDER=together # or fireworks-ai / nebius / replicate
```
`Qwen/Qwen2.5-7B-Instruct` (the default) is the sweet spot: ~14 GB at
bf16, fast, supports tool calls. Bump to `-32B-Instruct` if 7B starts
emitting malformed tool arguments mid-audit.
### 6. Self-host Qwen on the same MI300X via vLLM (Path B)
The strongest "AMD-end-to-end" story: Qwen runs on the same MI300X that
GPU Goblin is auditing, served by vLLM behind an OpenAI-compatible
endpoint. Goblin already supports this; pick it with one env var. The
recipe below mirrors the [lablab vLLM-on-AMD-Developer-Cloud
tutorial](https://lablab.ai/ai-tutorials/amd-developer-cloud-host-llm-vllm).
#### 6a. Stand up vLLM on the MI300X
```bash
# Inside your MI300X cloud instance (rocm/pytorch container or bare host).
# Pull AMD's official rocm/vllm image, which has vLLM + ROCm + Qwen support baked in.
docker pull rocm/vllm:latest
# Run vLLM serving a Qwen tool-calling model. `--tool-call-parser hermes`
# is the critical flag: it tells vLLM to parse Qwen's Hermes-format
# <tool_call> tags into the OpenAI `tool_calls` shape the agent expects.
#
# Pick ONE of the model recipes below.
# (a) Qwen2.5-32B-Instruct: recommended for the AMD GPU path. ~64 GB at
# bf16 (well under MI300X's 192 GB). Tool calling is significantly
# more reliable than 7B and the 32K context fits any audit conversation
# comfortably. First run downloads ~64 GB, takes 5-10 minutes.
docker run -d --name qwen-vllm \
--device=/dev/kfd --device=/dev/dri --group-add video \
--ipc=host --shm-size=16g \
-p 8000:8000 \
-v $HOME/.cache/huggingface:/root/.cache/huggingface \
-e HF_TOKEN=$HF_TOKEN \
rocm/vllm:latest \
--model Qwen/Qwen2.5-32B-Instruct \
--dtype bfloat16 \
--max-model-len 32768 \
--gpu-memory-utilization 0.85 \
--enable-auto-tool-choice \
--tool-call-parser hermes
# (b) Qwen2.5-7B-Instruct: light/fast, fine for smoke tests, occasionally
# hallucinates rule ids on tool calls. Use --max-model-len 32768 (the
# model's native cap) to keep audits from exhausting context near the
# compare_runs step.
# docker run -d --name qwen-vllm \
# ...same flags as above except...
# --model Qwen/Qwen2.5-7B-Instruct \
# --max-model-len 32768 \
# ...
# Wait for "Application startup complete" in the logs, then verify:
docker logs -f qwen-vllm # ctrl-C once you see "Application startup complete"
curl http://localhost:8000/v1/models
# → JSON listing the model id you served
```
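The memory figures in the recipe follow from simple arithmetic, sketched here. It uses nominal parameter counts (32 and 7 billion) and counts weights only; real checkpoints are slightly larger, and the split of the remaining budget between KV cache and activations is decided by vLLM at startup.

```python
MI300X_HBM_GB = 192            # capacity quoted in the recipe above
GPU_MEMORY_UTILIZATION = 0.85  # value passed to --gpu-memory-utilization


def bf16_weights_gb(params_billion):
    """bf16 stores 2 bytes per parameter, so N billion params need about 2N GB."""
    return 2 * params_billion


def headroom_gb(params_billion):
    """Memory left inside vLLM's budget once the weights are loaded."""
    budget = MI300X_HBM_GB * GPU_MEMORY_UTILIZATION
    return budget - bf16_weights_gb(params_billion)


for label, params in (("Qwen2.5-32B-Instruct", 32), ("Qwen2.5-7B-Instruct", 7)):
    print(f"{label}: ~{bf16_weights_gb(params)} GB weights, "
          f"~{headroom_gb(params):.1f} GB left for KV cache and activations")
```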
Sanity check tool calling end-to-end:
```bash
curl -s http://localhost:8000/v1/chat/completions \
-H 'Content-Type: application/json' \
-d '{
"model": "Qwen/Qwen2.5-32B-Instruct",
"messages": [{"role": "user", "content": "Call get_weather for Paris."}],
"tools": [{"type":"function","function":{"name":"get_weather","description":"weather","parameters":{"type":"object","properties":{"city":{"type":"string"}},"required":["city"]}}}],
"tool_choice": "auto"
}' | python3 -c "import sys,json;r=json.load(sys.stdin);print(r['choices'][0]['message'])"
```
Expect a `tool_calls` array with `name=get_weather` and `arguments` mentioning Paris. If you get plain text instead, the `--tool-call-parser hermes` flag was dropped.
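
The same check can be done programmatically. The sketch below walks an OpenAI-compatible chat-completion response and pulls out each tool call; the `sample_response` dict is hand-written for illustration (not captured from a real run), and `extract_tool_calls` is a hypothetical helper name.

```python
import json


def extract_tool_calls(response: dict):
    """Return [(function_name, parsed_arguments), ...] from a chat completion."""
    message = response["choices"][0]["message"]
    calls = []
    for call in message.get("tool_calls") or []:
        fn = call["function"]
        # In the OpenAI response shape, arguments arrive as a JSON-encoded string.
        calls.append((fn["name"], json.loads(fn["arguments"])))
    return calls


sample_response = {
    "choices": [{
        "message": {
            "role": "assistant",
            "content": None,
            "tool_calls": [{
                "id": "call_0",
                "type": "function",
                "function": {"name": "get_weather", "arguments": "{\"city\": \"Paris\"}"},
            }],
        }
    }]
}
print(extract_tool_calls(sample_response))
```

An empty list from a response that only carries `content` is the programmatic equivalent of "plain text instead of a tool call".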
#### 6b. Point GPU Goblin at the local vLLM
```bash
export GOBLIN_AGENT_BACKEND=qwen-vllm
export GOBLIN_QWEN_VLLM_URL=http://localhost:8000/v1
export GOBLIN_QWEN_VLLM_MODEL=Qwen/Qwen2.5-32B-Instruct # match the model you served
# Optional: only if you fronted vLLM with auth (default vLLM ignores the key).
# export GOBLIN_QWEN_VLLM_KEY=<your-token>
python -m agent workloads/train_qwen_lora.py
```
Verify the agent picked up the right backend:
```bash
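# Query the Goblin server, not vLLM. Both default to port 8000 in this README,
# so if they share a host, start one of them on a different port and adjust the URL.
curl http://localhost:8000/healthz
# → {"backend": "qwen-vllm", "vllm_url": "http://localhost:8000/v1", ...}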
```
That's it: the agent loop, the tools, the system prompt, the SSE
streaming, and the offline-replay fallback all carry over unchanged. The
only thing that's different is which OpenAI-compatible endpoint
`QwenVLLMBackend` talks to.
#### 6c. Comparing the two backends
| Aspect | `qwen-hf` (default) | `qwen-vllm` |
|---|---|---|
| Auth | `HF_TOKEN` | none by default; optional `GOBLIN_QWEN_VLLM_KEY` |
| Compute | Together / Fireworks-AI / Nebius (HF routes) | Your MI300X |
| Latency | 200-500 ms / turn (network-bound) | 50-150 ms / turn (in-cluster) |
| Cost | HF Inference credits | Your AMD Developer Cloud GPU-hours |
| Demo story | "uses HF as the model hub" | "Qwen runs on the same MI300X it audits" |
| Setup time | 30 sec (just `HF_TOKEN`) | 2-3 min and up (model download + warmup; the first 32B run takes 5-10 min) |
| Best for | HF Space deployment | Pitch demo on MI300X |
Run **both** during the hackathon: `qwen-hf` for the Space (judges who
click the URL get a real audit without an MI300X); `qwen-vllm` for the
live pitch demo (the strongest "all AMD" story the judging criterion
"How effectively the chosen model is integrated" rewards).
## Deploying to Hugging Face Spaces
This repo is **already shaped to be a Hugging Face Space**: `README.md`
carries the YAML frontmatter HF needs, and `requirements.txt` at the root
is the deliberately-minimal Streamlit-only dependency set (no torch /
transformers / huggingface_hub at runtime). The deployed Space is the
**offline-replay lane**: judges interact with a Streamlit UI that streams
the canonical `142 → 318 tok/s (2.24×)` audit trajectory from
`tests/fixtures/cached_audit.json`, without a live LLM, without a backend,
and without our laptop. This is what satisfies the hackathon's "Demo
Application Platform + Application URL" submission fields.
### One-time setup
1. Create a Hugging Face account at [huggingface.co](https://huggingface.co/)
and accept the invite to the **AMD Developer Hackathon HF Organization**
(link is on the [hackathon page](https://lablab.ai/ai-hackathons/amd-developer)
under the Hugging Face section).
2. Create a token at [Settings → Access Tokens](https://huggingface.co/settings/tokens)
with **`write`** scope (you need write access to push to the Space repo).
Save it as `HF_PUSH_TOKEN`.
3. On the HF organization's page, click **"New Space"**:
- Owner: AMD Developer Hackathon org
- Space name: `gpu-goblin` (or your preferred slug)
- License: MIT
- SDK: **Streamlit**
- Hardware: **CPU basic** (free; the Space loads no GPU code path)
- Visibility: Public
4. Don't initialize the Space with anything β€” leave it empty so the first
push lands cleanly.
### Deploy
From the project root, push the existing `feat/scaffold` branch to the
Space's git remote:
```bash
# Add the Space remote (use HTTPS with your username + HF_PUSH_TOKEN as password):
git remote add space https://huggingface.co/spaces/<org-slug>/gpu-goblin
# Push (HF Spaces use 'main' as the default branch):
git push space feat/scaffold:main
```
You'll see a build log at `https://huggingface.co/spaces/<org-slug>/gpu-goblin`.
Cold-start takes 30-60 seconds (Streamlit + the pure-pydantic deps); once
up, the canonical demo trajectory replays in ~10 seconds when a judge
clicks **"Use sample workload"**.
### What the Space looks like to judges
When a judge opens the Space URL:
1. The lane toggle defaults to **Offline replay**, which is appropriate for
the Space since there's no MI300X behind it.
2. They click **"Use sample workload"** (which references
`workloads/train_qwen_lora.py`).
3. Streamlit attempts to reach `http://localhost:8000/audit`, fails with
`ConnectionError`, surfaces a one-line warning ("Backend unreachable -
running offline-replay demo from cached audit"), then plays the cached
audit trajectory event-by-event with `~0.4s` pauses between events.
4. Final report renders: `Tokens/sec: 142 → 318 (2.24×)` with the
side-by-side metrics table, waste-budget bar chart, diff viewer, and
per-rule citations.
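
The fallback in step 3 can be sketched as a small generator. This is an assumption-laden illustration rather than the code in the UI: `audit_events`, `fetch_live`, and the event dicts are hypothetical names and shapes, while the warning text and the 0.4 s pause come from the steps above.

```python
import time


def audit_events(fetch_live, cached_events, pause_s=0.4, sleep=time.sleep, warn=print):
    """Yield live events if the backend answers, else replay the cached ones."""
    try:
        live = fetch_live()
    except OSError:  # ConnectionError is a subclass of OSError
        warn("Backend unreachable - running offline-replay demo from cached audit")
    else:
        yield from live
        return
    for index, event in enumerate(cached_events):
        if index:            # pause between events, not before the first one
            sleep(pause_s)
        yield event


def backend_down():
    raise ConnectionError("connection refused")


cached = [{"type": "thought"}, {"type": "tool_call"}, {"type": "final_report"}]
for event in audit_events(backend_down, cached, pause_s=0.0):
    print(event["type"])
```

Passing `sleep` and `warn` as parameters keeps the replay testable without real delays or a Streamlit session.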
### Updating the Space
After any change to the main repo, redeploy:
```bash
git push space feat/scaffold:main
```
HF rebuilds the Space automatically on push.
### (Stretch) Live agent in the Space
The shipped Space is read-only: it doesn't reach a real LLM. If you want
judges to drive the agent live, there are two paths:
1. **Stand up the FastAPI backend somewhere reachable** (an MI300X on AMD
Developer Cloud, an HF Inference Endpoint, a small CPU box) and set the
Space's `GOBLIN_BACKEND_URL` secret to that URL. The Streamlit app will
stream real SSE from your backend instead of the cached replay.
2. **Embed the agent loop in-process** (refactor `ui/app.py` to call
`agent.loop.run_audit` directly via `asyncio.run`). This adds
`huggingface_hub` to `requirements.txt` and requires `HF_TOKEN` as a
Space secret. Larger cold-start, fully self-contained.
Both are post-MVP; the offline-replay Space is what satisfies the
submission requirement.
## Configuration Reference
| Env var | Default | Purpose |
|---|---|---|
| `GOBLIN_AGENT_BACKEND` | `qwen-hf` | Pick the LLM backend: `qwen-hf` (HF Inference Providers) or `qwen-vllm` (self-hosted vLLM on MI300X). |
| `HF_TOKEN` | *(none)* | Required when `qwen-hf` is active. Hugging Face Inference token. |
| `GOBLIN_QWEN_MODEL` | `Qwen/Qwen2.5-7B-Instruct` | Model id used by the `qwen-hf` backend. |
| `GOBLIN_QWEN_PROVIDER` | `auto` | HF Inference Provider routing (`auto` / `together` / `fireworks-ai` / `nebius` / ...). |
| `GOBLIN_QWEN_VLLM_URL` | `http://localhost:8000/v1` | Base URL of the self-hosted vLLM endpoint (only used when `qwen-vllm` is active). |
| `GOBLIN_QWEN_VLLM_MODEL` | `Qwen/Qwen2.5-7B-Instruct` | Model id served by your local vLLM. |
| `GOBLIN_QWEN_VLLM_KEY` | `EMPTY` | Optional auth token if you front vLLM with nginx/Caddy + auth. |
| `GOBLIN_BACKEND_URL` | `http://localhost:8000/audit` | UI's backend endpoint. |
| `ROCM_IMAGE_TAG` | `unknown` | Container tag mixed into the benchmark cache key. |
| `GOBLIN_GPU_ID` | `0` | Which `/dev/dri/renderD*` to bind in `goblin_runner.sh`. |
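
As a rough picture of how these variables fit together, the sketch below resolves them from an environment mapping using the defaults in the table. It is not the repo's settings code: `load_settings` and `DEFAULTS` are hypothetical names, and the `HF_TOKEN` check models the live agent loop only (the offline-replay UI runs without a token).

```python
import os

# Defaults copied from the table above.
DEFAULTS = {
    "GOBLIN_AGENT_BACKEND": "qwen-hf",
    "GOBLIN_QWEN_MODEL": "Qwen/Qwen2.5-7B-Instruct",
    "GOBLIN_QWEN_PROVIDER": "auto",
    "GOBLIN_QWEN_VLLM_URL": "http://localhost:8000/v1",
    "GOBLIN_QWEN_VLLM_MODEL": "Qwen/Qwen2.5-7B-Instruct",
    "GOBLIN_QWEN_VLLM_KEY": "EMPTY",
    "GOBLIN_BACKEND_URL": "http://localhost:8000/audit",
    "ROCM_IMAGE_TAG": "unknown",
    "GOBLIN_GPU_ID": "0",
}


def load_settings(env=None):
    """Resolve settings from an env mapping, validating the backend choice."""
    env = os.environ if env is None else env
    settings = {name: env.get(name, default) for name, default in DEFAULTS.items()}
    settings["HF_TOKEN"] = env.get("HF_TOKEN")  # no default
    backend = settings["GOBLIN_AGENT_BACKEND"]
    if backend not in ("qwen-hf", "qwen-vllm"):
        raise ValueError(f"unknown GOBLIN_AGENT_BACKEND: {backend!r}")
    if backend == "qwen-hf" and not settings["HF_TOKEN"]:
        raise ValueError("HF_TOKEN is required when the qwen-hf backend is active")
    return settings


vllm = load_settings({"GOBLIN_AGENT_BACKEND": "qwen-vllm"})
print(vllm["GOBLIN_QWEN_VLLM_URL"], vllm["GOBLIN_QWEN_VLLM_MODEL"])
```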