title: GPU Goblin
emoji: π§
colorFrom: red
colorTo: red
sdk: streamlit
sdk_version: 1.32.0
app_file: ui/auto_tune_ui.py
pinned: false
license: mit
short_description: AI auto-tuner for MI300X fine-tuning workloads.
tags:
- amd
- mi300x
- rocm
- qwen
- huggingface
- agent
- fine-tuning
- llm
GPU Goblin
An AI agent that hunts wasted compute on AMD MI300X. Powered by Qwen.
GPU Goblin profiles a fine-tuning run, diagnoses inefficiency against a
curated ROCm knowledge base, recommends MI300X-specific fixes, and
re-benchmarks to prove the speedup with real numbers. The agent itself runs
on a Qwen model via Hugging Face Inference Providers; the canonical demo
workload is Qwen/Qwen2.5-7B-Instruct LoRA fine-tuning on MI300X.
Submitted to the AMD Developer Hackathon, Track 1: AI Agents & Agentic Workflows. Incorporates the Qwen Technology Partner challenge (Qwen as both agent brain and audit target) and uses Hugging Face as the model hub + deployment layer.
See brainstorming/idea.md,
brainstorming/architecture.md, and
brainstorming/goals.md.
Quick Start
pip install -e ".[dev]"
# Required for the live agent loop:
export HF_TOKEN=hf_... # Hugging Face Inference token
# Optional: override the default Qwen model / provider.
# export GOBLIN_QWEN_MODEL=Qwen/Qwen2.5-7B-Instruct
# export GOBLIN_QWEN_PROVIDER=auto # or together / fireworks-ai / nebius / ...
uvicorn agent.server:app --reload --port 8000
streamlit run ui/app.py
The Streamlit UI works without HF_TOKEN in offline-replay mode: it
plays a cached audit trajectory (tests/fixtures/cached_audit.json) so
judges can see the canonical 142 → 318 tok/s (2.24×) demo without our
backend or any live LLM.
Repo Layout
agent/
  schemas.py      # Shared pydantic models (RunMetrics, WorkloadConfig, ...)
  backends/       # Pluggable LLM driver (Qwen via HF Inference Providers)
  tools/          # 6 tools the agent can call
  loop.py         # Provider-agnostic tool-use loop
  server.py       # FastAPI + SSE
runner/           # GPU runner (rocprofv3 wrapper) + FakeRunner fallback
kb/               # ROCm knowledge base (22 curated rules, the moat)
ui/               # Streamlit chat UI
workloads/        # Canonical Qwen demo + synthetic corpus
tests/            # Pytest suite + fixtures
brainstorming/    # Design docs (idea / architecture / goals)
Development
The agent loop is testable on a laptop without an MI300X via the FakeRunner
and the synthetic corpus in workloads/synthetic/. Real benchmarks require
ROCm + MI300X (the LiveRunner auto-falls-back to FakeRunner when
rocprofv3 / amd-smi / a render device are missing).
python3 -m pytest tests/ -v # 86 tests, no GPU required
python3 -m agent workloads/train_qwen_lora.py # CLI driver, prints SSE events
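The fallback rule described above can be sketched as a small prerequisite probe. This is an illustration only, not the repo's LiveRunner code: the names `missing_prereqs` and `pick_runner` and the injectable `which` / `glob_fn` parameters are hypothetical; only the three checks (rocprofv3, amd-smi, a render device) come from the text.

```python
import glob
import shutil
from typing import Callable, List


def missing_prereqs(
    which: Callable[[str], object] = shutil.which,
    glob_fn: Callable[[str], list] = glob.glob,
) -> List[str]:
    """Return the live-benchmark prerequisites that are absent on this host."""
    missing = []
    for tool in ("rocprofv3", "amd-smi"):
        # shutil.which returns None when the executable is not on PATH.
        if which(tool) is None:
            missing.append(tool)
    # A ROCm-visible GPU shows up as at least one /dev/dri/renderD* node.
    if not glob_fn("/dev/dri/renderD*"):
        missing.append("render device")
    return missing


def pick_runner(missing: List[str]) -> str:
    """Hypothetical selection rule: any missing prerequisite means FakeRunner."""
    return "FakeRunner" if missing else "LiveRunner"


if __name__ == "__main__":
    gaps = missing_prereqs()
    print(pick_runner(gaps), "missing:", gaps)
```

Passing `which` and `glob_fn` as parameters keeps the probe testable on a laptop, in the same spirit as the FakeRunner itself.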
Running on AMD Developer Cloud (MI300X)
End-to-end recipe for the live demo path. Assumes you've got the $100 hackathon credits. Plan on ~10-15 GPU-hours total (well under budget).
1. Provision an MI300X instance
- Sign in to AMD AI Developer Program and join the AMD Developer Cloud waitlist (instant approval for hackathon participants).
- Spin up an MI300X instance. Pick the largest container disk you can (the model weights cache is 15-30 GB).
- SSH in. You should land in a Linux shell with a /dev/dri/renderD* device visible; confirm with ls /dev/dri.
2. Pull the ROCm + PyTorch container
docker pull rocm/pytorch:rocm6.1_ubuntu22.04_py3.10_pytorch_2.3
Run it with the GPU exposed. The --device / --group-add video lines are
the only ones that matter for ROCm passthrough:
docker run -it --rm \
--device=/dev/kfd --device=/dev/dri \
--group-add video --ipc=host --shm-size=16g \
-v $HOME:/workspace \
-e HF_TOKEN=$HF_TOKEN \
-e GOBLIN_QWEN_MODEL=Qwen/Qwen2.5-7B-Instruct \
-e ROCM_IMAGE_TAG=rocm6.1_pytorch2.3 \
-p 8000:8000 -p 8501:8501 \
rocm/pytorch:rocm6.1_ubuntu22.04_py3.10_pytorch_2.3 bash
3. Verify the GPU is visible inside the container
amd-smi monitor # shows utilization, HBM, power per GPU
rocprofv3 --version # confirms the profiler is on PATH
python -c "import torch; print(torch.cuda.is_available(), torch.cuda.get_device_name(0))"
# → True MI300X
If any of these fail, LiveRunner will fall back to FakeRunner: the
agent loop still works, but you get cached metrics instead of a real
benchmark. Don't chase the bug; the demo lane is intact.
4. Clone, install, run
cd /workspace
git clone https://github.com/Manoj-Sri/amd-hackathon-rocm.git goblin
cd goblin
pip install -e ".[dev]"
python -m pytest tests/ -q # 86 tests pass without GPU; faster sanity check
# Live run on MI300X:
python -m agent workloads/train_qwen_lora.py
# Streams SSE events: thought, tool_call, tool_result, ..., final_report
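To consume that stream programmatically, a parser along these lines may help. It is a sketch under two assumptions that the README does not confirm: that the CLI emits standard SSE framing (`event:` and `data:` lines, with a blank line ending each event) and that each `data:` payload is JSON. The event names come from the comment above; the payload field names in `sample` are hypothetical.

```python
import json
from typing import Dict, Iterable, Iterator, Tuple


def parse_sse(lines: Iterable[str]) -> Iterator[Tuple[str, Dict]]:
    """Yield (event_name, payload) pairs from SSE-framed text lines."""
    event, data = "message", []
    for raw in lines:
        line = raw.rstrip("\n")
        if line.startswith("event:"):
            event = line[len("event:"):].strip()
        elif line.startswith("data:"):
            data.append(line[len("data:"):].strip())
        elif line == "" and data:
            # A blank line terminates one event; multi-line data is joined.
            yield event, json.loads("\n".join(data))
            event, data = "message", []


# Hypothetical sample stream; field names are illustrative only.
sample = [
    "event: thought",
    'data: {"text": "profiling the baseline run"}',
    "",
    "event: final_report",
    'data: {"baseline_tok_s": 142, "tuned_tok_s": 318}',
    "",
]
events = list(parse_sse(sample))
```

In practice the `lines` argument could be the stdout of the CLI or the body of a streaming HTTP response, read line by line.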
5. Run the FastAPI server + UI
# Terminal 1 (inside the container):
uvicorn agent.server:app --host 0.0.0.0 --port 8000
# Terminal 2:
streamlit run ui/app.py --server.port 8501 --server.address 0.0.0.0
If the cloud instance gives you a public IP/URL, port-forward 8501 to your
laptop. If not, SSH-tunnel with ssh -L 8501:localhost:8501 user@instance,
then open http://localhost:8501 locally.
6. Cost-control checklist
- Cache benchmark results (bench_cache/ is content-addressed by config + workload SHA + container tag, so identical configs are free).
- Day-1 baseline run is the only "must-burn-GPU" task; everything else can use cached metrics or the FakeRunner.
- Stop the instance between work sessions. AMD Developer Cloud bills only for running time.
- Public reference price: ~$1.99/GPU-hour for MI300X VMs. At that rate, ~$8 of your $100 buys about 4 GPU-hours (a full demo + dry-runs), and the 10-15 GPU-hour plan above comes to roughly $20-30.
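The content-addressed cache in the first bullet can be illustrated as follows. This is a minimal sketch, not the repo's bench_cache implementation: the function name `cache_key`, the use of SHA-256, and the example config fields are assumptions; only the three key ingredients (config, workload SHA, container tag) come from the checklist.

```python
import hashlib
import json


def cache_key(config: dict, workload_source: bytes, container_tag: str) -> str:
    """Derive a deterministic key from config + workload SHA + container tag."""
    workload_sha = hashlib.sha256(workload_source).hexdigest()
    # sort_keys makes the key independent of dict insertion order,
    # so logically identical configs always map to the same entry.
    canonical = json.dumps(
        {"config": config, "workload_sha": workload_sha, "tag": container_tag},
        sort_keys=True,
        separators=(",", ":"),
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Hypothetical config fields, for illustration only.
key_a = cache_key({"batch_size": 8, "lr": 1e-4}, b"print('train')", "rocm6.1_pytorch2.3")
key_b = cache_key({"lr": 1e-4, "batch_size": 8}, b"print('train')", "rocm6.1_pytorch2.3")
key_c = cache_key({"batch_size": 8, "lr": 1e-4}, b"print('train')", "unknown")
```

Here `key_a` and `key_b` are equal because only the key order differs, while `key_c` differs because the container tag changed, which is why a rebuilt image triggers a fresh benchmark instead of a cache hit.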
Integrating with HF Qwen
The agent runs on Qwen via Hugging Face Inference Providers, which auto-routes your request to one of HF's serving partners (Together, Fireworks-AI, Nebius, Replicate, ...). You do not run Qwen yourself (HF does), and you authenticate with a single token.
1. Get a Hugging Face token
- Sign up at huggingface.co.
- Create a token at Settings → Access Tokens with read + inference scope.
- Export it:
export HF_TOKEN=hf_yourtokenhere
2. Join the AMD Developer Hackathon HF Organization
The hackathon submission requires publishing your project as a Hugging Face Space within the event organization. Click the "Join" link on the hackathon page (look for "Join the AMD Developer Hackathon HF Organization") and accept the invitation in your HF account.
3. Confirm Qwen reachability
Before running the full agent, smoke-test the HF Inference connection:
python - <<'PY'
import asyncio, os
from huggingface_hub import AsyncInferenceClient

async def go():
    # Token is read from the environment; export HF_TOKEN first.
    client = AsyncInferenceClient(token=os.environ["HF_TOKEN"])
    resp = await client.chat_completion(
        model="Qwen/Qwen2.5-7B-Instruct",
        messages=[{"role": "user", "content": "Say hello in 5 words."}],
        max_tokens=32,
    )
    print(resp.choices[0].message.content)

asyncio.run(go())
PY
Expect a 1-line Qwen response. If you get an auth error, your token is
missing the inference scope. If you get a 404, the chosen model isn't
served by any active provider; try Qwen/Qwen2.5-32B-Instruct or set
provider="together" explicitly.
4. Run the agent against Qwen
The agent picks Qwen automatically; no env var is needed beyond HF_TOKEN:
python -m agent workloads/train_qwen_lora.py
You should see SSE events streaming: thought blocks from Qwen, tool_call
events as it picks tools, tool_result events, and finally a final_report
with the canonical 142 → 318 tok/s (2.24×) line.
5. Switching the model or provider
Qwen has many variants. Override at process start:
# Bigger model, more reliable tool calls:
export GOBLIN_QWEN_MODEL=Qwen/Qwen2.5-32B-Instruct
# Pin to a specific provider (skip auto-routing):
export GOBLIN_QWEN_PROVIDER=together # or fireworks-ai / nebius / replicate
Qwen/Qwen2.5-7B-Instruct (the default) is the sweet spot: ~14 GB at
bf16, fast, supports tool calls. Bump to -32B-Instruct if 7B starts
emitting malformed tool arguments mid-audit.
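The memory figures quoted for the two models follow from simple arithmetic: bf16 stores each parameter in 2 bytes. The sketch below uses the nominal parameter counts (7 and 32 billion), which is a rounding assumption, and it covers weights only, not KV cache or activations. The 192 GB figure is the MI300X HBM capacity quoted elsewhere in this README.

```python
def weights_gb(params_billion: float, bytes_per_param: int = 2) -> float:
    """Approximate weight memory in GB (1 GB = 1e9 bytes); bf16 is 2 bytes/param."""
    return params_billion * 1e9 * bytes_per_param / 1e9


MI300X_HBM_GB = 192  # HBM capacity as quoted in this README

for name, billions in (("Qwen2.5-7B-Instruct", 7), ("Qwen2.5-32B-Instruct", 32)):
    gb = weights_gb(billions)
    share = gb / MI300X_HBM_GB
    print(f"{name}: ~{gb:.0f} GB of weights ({share:.0%} of MI300X HBM)")
```

Both models leave most of the HBM free, which is what makes the larger model a cheap upgrade when tool-call reliability matters.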
6. Self-host Qwen on the same MI300X via vLLM (Path B)
The strongest "AMD-end-to-end" story: Qwen runs on the same MI300X that GPU Goblin is auditing, served by vLLM behind an OpenAI-compatible endpoint. Goblin already supports this; pick it with one env var. The recipe below mirrors the lablab vLLM-on-AMD-Developer-Cloud tutorial.
6a. Stand up vLLM on the MI300X
# Inside your MI300X cloud instance (rocm/pytorch container or bare host).
# Pull AMD's official rocm/vllm image, which has vLLM + ROCm + Qwen support baked in.
docker pull rocm/vllm:latest
# Run vLLM serving a Qwen tool-calling model. `--tool-call-parser hermes`
# is the critical flag: it tells vLLM to parse Qwen's Hermes-format
# <tool_call> tags into the OpenAI `tool_calls` shape the agent expects.
#
# Pick ONE of the model recipes below.
# (a) Qwen2.5-32B-Instruct: recommended for the AMD GPU path. ~64 GB at
# bf16 (well under MI300X's 192 GB). Tool calling is significantly
# more reliable than 7B and the 32K context fits any audit conversation
# comfortably. First run downloads ~64 GB, takes 5-10 minutes.
docker run -d --name qwen-vllm \
--device=/dev/kfd --device=/dev/dri --group-add video \
--ipc=host --shm-size=16g \
-p 8000:8000 \
-v $HOME/.cache/huggingface:/root/.cache/huggingface \
-e HF_TOKEN=$HF_TOKEN \
rocm/vllm:latest \
--model Qwen/Qwen2.5-32B-Instruct \
--dtype bfloat16 \
--max-model-len 32768 \
--gpu-memory-utilization 0.85 \
--enable-auto-tool-choice \
--tool-call-parser hermes
# (b) Qwen2.5-7B-Instruct: light/fast, fine for smoke tests, occasionally
# hallucinates rule ids on tool calls. Use --max-model-len 32768 (the
# model's native cap) to keep audits from exhausting context near the
# compare_runs step.
# docker run -d --name qwen-vllm \
# ...same flags as above except...
# --model Qwen/Qwen2.5-7B-Instruct \
# --max-model-len 32768 \
# ...
# Wait for "Application startup complete" in the logs, then verify:
docker logs -f qwen-vllm # ctrl-C once you see "Application startup complete"
curl http://localhost:8000/v1/models
# → JSON listing the model id you served
Sanity check tool calling end-to-end:
curl -s http://localhost:8000/v1/chat/completions \
-H 'Content-Type: application/json' \
-d '{
"model": "Qwen/Qwen2.5-32B-Instruct",
"messages": [{"role": "user", "content": "Call get_weather for Paris."}],
"tools": [{"type":"function","function":{"name":"get_weather","description":"weather","parameters":{"type":"object","properties":{"city":{"type":"string"}},"required":["city"]}}}],
"tool_choice": "auto"
}' | python3 -c "import sys,json;r=json.load(sys.stdin);print(r['choices'][0]['message'])"
Expect a tool_calls array with name=get_weather and arguments mentioning Paris. If you get plain text instead, the --tool-call-parser hermes flag was dropped.
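The same check can be done in code rather than by eye. The helper below is a sketch that relies only on the OpenAI chat-completions message shape, where `tool_calls[i].function.arguments` is a JSON-encoded string; the name `first_tool_call` and both sample messages are illustrative, not output captured from a real run.

```python
import json
from typing import Optional, Tuple


def first_tool_call(message: dict) -> Optional[Tuple[str, dict]]:
    """Return (function_name, parsed_arguments), or None for a plain-text reply."""
    calls = message.get("tool_calls") or []
    if not calls:
        return None
    fn = calls[0]["function"]
    # In the OpenAI chat format, arguments arrive as a JSON-encoded string.
    return fn["name"], json.loads(fn["arguments"])


# Illustrative messages: one parsed tool call, one plain-text fallback.
parsed_ok = {
    "role": "assistant",
    "content": None,
    "tool_calls": [
        {
            "id": "call_0",
            "type": "function",
            "function": {"name": "get_weather", "arguments": '{"city": "Paris"}'},
        }
    ],
}
plain_text = {"role": "assistant", "content": "<tool_call>...</tool_call>"}
```

A `None` result on a prompt that clearly asks for a tool is the programmatic equivalent of the plain-text symptom described above.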
6b. Point GPU Goblin at the local vLLM
export GOBLIN_AGENT_BACKEND=qwen-vllm
export GOBLIN_QWEN_VLLM_URL=http://localhost:8000/v1
export GOBLIN_QWEN_VLLM_MODEL=Qwen/Qwen2.5-32B-Instruct # match the model you served
# Optional: only if you fronted vLLM with auth (default vLLM ignores the key).
# export GOBLIN_QWEN_VLLM_KEY=<your-token>
python -m agent workloads/train_qwen_lora.py
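Verify the agent picked up the right backend. This queries the Goblin FastAPI server, so it must be running; the vLLM container above also publishes port 8000, so when both run on the same host, move one of them to a different port and adjust the URLs to match: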
curl http://localhost:8000/healthz # the Goblin server, not vLLM
# → {"backend": "qwen-vllm", "vllm_url": "http://localhost:8000/v1", ...}
That's it: the agent loop, the tools, the system prompt, the SSE
streaming, and the offline-replay fallback all carry over unchanged. The
only difference is which OpenAI-compatible endpoint QwenVLLMBackend
talks to.
6c. Comparing the two backends
| Aspect | qwen-hf (default) | qwen-vllm |
|---|---|---|
| Auth | HF_TOKEN | none by default; optional GOBLIN_QWEN_VLLM_KEY |
| Compute | Together / Fireworks-AI / Nebius (HF routes) | Your MI300X |
| Latency | 200-500 ms / turn (network-bound) | 50-150 ms / turn (in-cluster) |
| Cost | HF Inference credits | Your AMD Developer Cloud GPU-hours |
| Demo story | "uses HF as the model hub" | "Qwen runs on the same MI300X it audits" |
| Setup time | 30 sec (just HF_TOKEN) | 2-3 min (model download + warmup) |
| Best for | HF Space deployment | Pitch demo on MI300X |
Run both during the hackathon: qwen-hf for the Space (judges who
click the URL get a real audit without an MI300X); qwen-vllm for the
live pitch demo (the strongest "all AMD" story the judging criterion
"How effectively the chosen model is integrated" rewards).
Deploying to Hugging Face Spaces
This repo is already shaped to be a Hugging Face Space: README.md
carries the YAML frontmatter HF needs, and requirements.txt at the root
is the deliberately-minimal Streamlit-only dependency set (no torch /
transformers / huggingface_hub at runtime). The deployed Space is the
offline-replay lane: judges interact with a Streamlit UI that streams
the canonical 142 → 318 tok/s (2.24×) audit trajectory from
tests/fixtures/cached_audit.json, without a live LLM, without a backend,
and without our laptop. This is what satisfies the hackathon's "Demo
Application Platform + Application URL" submission fields.
One-time setup
- Create a Hugging Face account at huggingface.co and accept the invite to the AMD Developer Hackathon HF Organization (link is on the hackathon page under the Hugging Face section).
- Create a token at Settings → Access Tokens with write scope (you need write access to push to the Space repo). Save it as HF_PUSH_TOKEN.
- On the HF organization's page, click "New Space":
  - Owner: AMD Developer Hackathon org
  - Space name: gpu-goblin (or your preferred slug)
  - License: MIT
  - SDK: Streamlit
  - Hardware: CPU basic (free; the Space loads no GPU code path)
  - Visibility: Public
- Don't initialize the Space with anything; leave it empty so the first push lands cleanly.
Deploy
From the project root, push the existing feat/scaffold branch to the
Space's git remote:
# Add the Space remote (use HTTPS with your username + HF_PUSH_TOKEN as password):
git remote add space https://huggingface.co/spaces/<org-slug>/gpu-goblin
# Push (HF Spaces use 'main' as the default branch):
git push space feat/scaffold:main
You'll see a build log at https://huggingface.co/spaces/<org-slug>/gpu-goblin.
Cold-start takes 30-60 seconds (Streamlit + the pure-pydantic deps); once
up, the canonical demo trajectory replays in ~10 seconds when a judge
clicks "Use sample workload".
What the Space looks like to judges
When a judge opens the Space URL:
- The lane toggle defaults to Offline replay, which is appropriate for the Space since there's no MI300X behind it.
- They click "Use sample workload" (which references workloads/train_qwen_lora.py).
- Streamlit attempts to reach http://localhost:8000/audit, fails with ConnectionError, surfaces a one-line warning ("Backend unreachable - running offline-replay demo from cached audit"), then plays the cached audit trajectory event-by-event with ~0.4s pauses between events.
- Final report renders: Tokens/sec: 142 → 318 (2.24×) with the side-by-side metrics table, waste-budget bar chart, diff viewer, and per-rule citations.
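The fallback sequence in the list above can be sketched roughly as follows. This is not the code in the UI module: `replay` and `audit_events` are hypothetical names, the events are treated as plain dicts (the fixture's real schema is not shown here), and the built-in `ConnectionError` stands in for whatever exception the HTTP client actually raises.

```python
import time
from typing import Callable, Iterable, Iterator


def replay(
    events: Iterable[dict],
    pause_s: float = 0.4,
    sleep: Callable[[float], None] = time.sleep,
) -> Iterator[dict]:
    """Yield cached events one at a time, pausing between consecutive events."""
    first = True
    for event in events:
        if not first:
            sleep(pause_s)
        first = False
        yield event


def audit_events(
    fetch_live: Callable[[], Iterable[dict]],
    cached: Iterable[dict],
    sleep: Callable[[float], None] = time.sleep,
) -> Iterator[dict]:
    """Stream from the backend if reachable, otherwise replay the cached audit."""
    try:
        yield from fetch_live()
    except ConnectionError:
        print("Backend unreachable - running offline-replay demo from cached audit")
        yield from replay(cached, sleep=sleep)
```

Injecting `sleep` keeps the replay instant under test. A fuller version would also need to handle a backend that fails partway through a stream, which this sketch does not attempt.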
Updating the Space
After any change to the main repo, redeploy:
git push space feat/scaffold:main
HF rebuilds the Space automatically on push.
(Stretch) Live agent in the Space
The shipped Space is read-only: it doesn't reach a real LLM. If you want judges to drive the agent live, there are two paths:
- Stand up the FastAPI backend somewhere reachable (an MI300X on AMD Developer Cloud, an HF Inference Endpoint, a small CPU box) and set the Space's GOBLIN_BACKEND_URL secret to that URL. The Streamlit app will stream real SSE from your backend instead of the cached replay.
- Embed the agent loop in-process (refactor ui/app.py to call agent.loop.run_audit directly via asyncio.run). This adds huggingface_hub to requirements.txt and requires HF_TOKEN as a Space secret. Larger cold-start, fully self-contained.
Both are post-MVP; the offline-replay Space is what satisfies the submission requirement.
Configuration Reference
| Env var | Default | Purpose |
|---|---|---|
| GOBLIN_AGENT_BACKEND | qwen-hf | Pick the LLM backend: qwen-hf (HF Inference Providers) or qwen-vllm (self-hosted vLLM on MI300X). |
| HF_TOKEN | (none) | Required when qwen-hf is active. Hugging Face Inference token. |
| GOBLIN_QWEN_MODEL | Qwen/Qwen2.5-7B-Instruct | Model id used by the qwen-hf backend. |
| GOBLIN_QWEN_PROVIDER | auto | HF Inference Provider routing (auto / together / fireworks-ai / nebius / ...). |
| GOBLIN_QWEN_VLLM_URL | http://localhost:8000/v1 | Base URL of the self-hosted vLLM endpoint (only used when qwen-vllm is active). |
| GOBLIN_QWEN_VLLM_MODEL | Qwen/Qwen2.5-7B-Instruct | Model id served by your local vLLM. |
| GOBLIN_QWEN_VLLM_KEY | EMPTY | Optional auth token if you front vLLM with nginx/Caddy + auth. |
| GOBLIN_BACKEND_URL | http://localhost:8000/audit | UI's backend endpoint. |
| ROCM_IMAGE_TAG | unknown | Container tag mixed into the benchmark cache key. |
| GOBLIN_GPU_ID | 0 | Which /dev/dri/renderD* to bind in goblin_runner.sh. |
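For reference, the table can be read as a set of environment lookups with defaults. The sketch below mirrors the variable names and defaults listed above; the helper names `resolve_config` and `missing_required` are hypothetical, and the repo's actual configuration code may be organized differently.

```python
import os
from typing import List, Mapping, Optional

# Defaults copied from the configuration table.
DEFAULTS = {
    "GOBLIN_AGENT_BACKEND": "qwen-hf",
    "GOBLIN_QWEN_MODEL": "Qwen/Qwen2.5-7B-Instruct",
    "GOBLIN_QWEN_PROVIDER": "auto",
    "GOBLIN_QWEN_VLLM_URL": "http://localhost:8000/v1",
    "GOBLIN_QWEN_VLLM_MODEL": "Qwen/Qwen2.5-7B-Instruct",
    "GOBLIN_QWEN_VLLM_KEY": "EMPTY",
    "GOBLIN_BACKEND_URL": "http://localhost:8000/audit",
    "ROCM_IMAGE_TAG": "unknown",
    "GOBLIN_GPU_ID": "0",
}


def resolve_config(env: Optional[Mapping[str, str]] = None) -> dict:
    """Overlay the environment on the documented defaults."""
    env = os.environ if env is None else env
    cfg = {name: env.get(name, default) for name, default in DEFAULTS.items()}
    cfg["HF_TOKEN"] = env.get("HF_TOKEN")  # no default: must be supplied
    return cfg


def missing_required(cfg: dict) -> List[str]:
    """HF_TOKEN is only required while the qwen-hf backend is active."""
    if cfg["GOBLIN_AGENT_BACKEND"] == "qwen-hf" and not cfg["HF_TOKEN"]:
        return ["HF_TOKEN"]
    return []
```

For example, `missing_required(resolve_config({}))` reports `HF_TOKEN`, while setting `GOBLIN_AGENT_BACKEND=qwen-vllm` clears the requirement.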