---
title: GPU Goblin
emoji: 🧌
colorFrom: red
colorTo: red
sdk: streamlit
sdk_version: "1.32.0"
app_file: ui/auto_tune_ui.py
pinned: false
license: mit
short_description: AI auto-tuner for MI300X fine-tuning workloads.
tags:
  - amd
  - mi300x
  - rocm
  - qwen
  - huggingface
  - agent
  - fine-tuning
  - llm
---

# GPU Goblin

> An AI agent that hunts wasted compute on AMD MI300X. Powered by Qwen.

GPU Goblin profiles a fine-tuning run, diagnoses inefficiency against a curated ROCm knowledge base, recommends MI300X-specific fixes, and re-benchmarks to prove the speedup with real numbers. The agent itself runs on a Qwen model via Hugging Face Inference Providers; the canonical demo workload is `Qwen/Qwen2.5-7B-Instruct` LoRA fine-tuning on MI300X.

Submitted to the **AMD Developer Hackathon**, Track 1: AI Agents & Agentic Workflows. Incorporates the Qwen Technology Partner challenge (Qwen as both agent brain and audit target) and uses Hugging Face as the model hub + deployment layer.

See [`brainstorming/idea.md`](brainstorming/idea.md), [`brainstorming/architecture.md`](brainstorming/architecture.md), and [`brainstorming/goals.md`](brainstorming/goals.md).

## Quick Start

```bash
pip install -e ".[dev]"

# Required for the live agent loop:
export HF_TOKEN=hf_...   # Hugging Face Inference token

# Optional: override the default Qwen model / provider:
# export GOBLIN_QWEN_MODEL=Qwen/Qwen2.5-7B-Instruct
# export GOBLIN_QWEN_PROVIDER=auto   # or together / fireworks-ai / nebius / ...

# Terminal 1: backend
uvicorn agent.server:app --reload --port 8000

# Terminal 2: UI
streamlit run ui/app.py
```

The Streamlit UI works **without `HF_TOKEN`** in offline-replay mode: it plays a cached audit trajectory (`tests/fixtures/cached_audit.json`) so judges can see the canonical `142 → 318 tok/s (2.24×)` demo without our backend or any live LLM.

## Repo Layout

```
agent/
  schemas.py        # Shared pydantic models (RunMetrics, WorkloadConfig, ...)
  backends/         # Pluggable LLM driver (Qwen via HF Inference Providers)
  tools/            # 6 tools the agent can call
  loop.py           # Provider-agnostic tool-use loop
  server.py         # FastAPI + SSE
runner/             # GPU runner (rocprofv3 wrapper) + FakeRunner fallback
kb/                 # ROCm knowledge base (22 curated rules, the moat)
ui/                 # Streamlit chat UI
workloads/          # Canonical Qwen demo + synthetic corpus
tests/              # Pytest suite + fixtures
brainstorming/      # Design docs (idea / architecture / goals)
```

## Development

The agent loop is testable on a laptop without an MI300X via the `FakeRunner` and the synthetic corpus in `workloads/synthetic/`. Real benchmarks require ROCm + MI300X (the `LiveRunner` auto-falls-back to `FakeRunner` when `rocprofv3` / `amd-smi` / a render device are missing).

```bash
python3 -m pytest tests/ -v                      # 86 tests, no GPU required
python3 -m agent workloads/train_qwen_lora.py    # CLI driver, prints SSE events
```

## Running on AMD Developer Cloud (MI300X)

End-to-end recipe for the live demo path. Assumes you've got the $100 hackathon credits. Plan on ~10-15 GPU-hours total (well under budget).

### 1. Provision an MI300X instance

1. Sign in to [AMD AI Developer Program](https://www.amd.com/en/developer/resources/developer-program.html) and join the AMD Developer Cloud waitlist (instant approval for hackathon participants).
2. Spin up an **MI300X** instance. Pick the largest container disk you can (the model weights cache is 15-30 GB).
3. SSH in. You should land in a Linux shell with a `/dev/dri/renderD*` device visible; confirm with `ls /dev/dri`.

### 2. Pull the ROCm + PyTorch container

```bash
docker pull rocm/pytorch:rocm6.1_ubuntu22.04_py3.10_pytorch_2.3
```

Run it with the GPU exposed.
The `--device` / `--group-add video` flags are the only ones that matter for ROCm passthrough:

```bash
docker run -it --rm \
  --device=/dev/kfd --device=/dev/dri \
  --group-add video --ipc=host --shm-size=16g \
  -v $HOME:/workspace \
  -e HF_TOKEN=$HF_TOKEN \
  -e GOBLIN_QWEN_MODEL=Qwen/Qwen2.5-7B-Instruct \
  -e ROCM_IMAGE_TAG=rocm6.1_pytorch2.3 \
  -p 8000:8000 -p 8501:8501 \
  rocm/pytorch:rocm6.1_ubuntu22.04_py3.10_pytorch_2.3 bash
```

### 3. Verify the GPU is visible inside the container

```bash
amd-smi monitor        # shows utilization, HBM, power per GPU
rocprofv3 --version    # confirms the profiler is on PATH
python -c "import torch; print(torch.cuda.is_available(), torch.cuda.get_device_name(0))"
# → True, followed by a device name containing MI300X
```

If any of these fail, `LiveRunner` will fall back to `FakeRunner`: the agent loop still works, but you get cached metrics instead of a real benchmark. Don't chase the bug; the demo lane is intact.

### 4. Clone, install, run

```bash
cd /workspace
git clone https://github.com/Manoj-Sri/amd-hackathon-rocm.git goblin
cd goblin
pip install -e ".[dev]"
python -m pytest tests/ -q    # 86 tests pass without GPU; fast sanity check

# Live run on MI300X:
python -m agent workloads/train_qwen_lora.py
# Streams SSE events: thought, tool_call, tool_result, ..., final_report
```

### 5. Run the FastAPI server + UI

```bash
# Terminal 1 (inside the container):
uvicorn agent.server:app --host 0.0.0.0 --port 8000

# Terminal 2:
streamlit run ui/app.py --server.port 8501 --server.address 0.0.0.0
```

If the cloud instance gives you a public IP/URL, expose port 8501 and open it from your laptop. If not, use an SSH tunnel: run `ssh -L 8501:localhost:8501 user@instance`, then open `http://localhost:8501` locally.

### 6. Cost-control checklist

- Cache benchmark results (`bench_cache/` is content-addressed by config + workload SHA + container tag, so identical configs are free).
- Day-1 baseline run is the only "must-burn-GPU" task; everything else can use cached metrics or the FakeRunner.
- Stop the instance between work sessions. AMD Developer Cloud bills only for running time.
- Public reference price: ~$1.99/GPU-hour for MI300X VMs. At that rate, ~$8 of your $100 (about 4 GPU-hours) covers a full demo + dry-runs, and the full 10-15 GPU-hour plan comes to roughly $20-30.

## Integrating with HF Qwen

The agent runs on Qwen via Hugging Face **Inference Providers**, which auto-routes your request to one of HF's serving partners (Together, Fireworks-AI, Nebius, Replicate, ...). You do not run Qwen yourself (HF does), and you authenticate with a single token.

### 1. Get a Hugging Face token

1. Sign up at [huggingface.co](https://huggingface.co/).
2. Create a token at [Settings → Access Tokens](https://huggingface.co/settings/tokens) with **read** + **inference** scope.
3. Export it:

   ```bash
   export HF_TOKEN=hf_yourtokenhere
   ```

### 2. Join the AMD Developer Hackathon HF Organization

The hackathon submission requires publishing your project as a Hugging Face Space within the event organization. Click the "Join" link on the [hackathon page](https://lablab.ai/ai-hackathons/amd-developer) (look for "Join the AMD Developer Hackathon HF Organization") and accept the invitation in your HF account.

### 3. Confirm Qwen reachability

Before running the full agent, smoke-test the HF Inference connection:

```bash
python - <<'PY'
import asyncio, os
from huggingface_hub import AsyncInferenceClient

async def go():
    client = AsyncInferenceClient(token=os.environ["HF_TOKEN"])
    resp = await client.chat_completion(
        model="Qwen/Qwen2.5-7B-Instruct",
        messages=[{"role": "user", "content": "Say hello in 5 words."}],
        max_tokens=32,
    )
    print(resp.choices[0].message.content)

asyncio.run(go())
PY
```

Expect a 1-line Qwen response. If you get an auth error, your token is most likely invalid or missing the `inference` scope. If you get a 404, the chosen model isn't served by any active provider; try `Qwen/Qwen2.5-32B-Instruct`, or pass `provider="together"` to `AsyncInferenceClient` explicitly.

### 4. Run the agent against Qwen

The agent picks Qwen automatically; no env var is needed beyond `HF_TOKEN`:

```bash
python -m agent workloads/train_qwen_lora.py
```

You should see SSE events streaming: `thought` blocks from Qwen, `tool_call` events as it picks tools, `tool_result` events, and finally a `final_report` with the canonical `142 → 318 tok/s (2.24×)` line.

### 5. Switching the model or provider

Qwen has many variants. Override at process start:

```bash
# Bigger model, more reliable tool calls:
export GOBLIN_QWEN_MODEL=Qwen/Qwen2.5-32B-Instruct

# Pin to a specific provider (skip auto-routing):
export GOBLIN_QWEN_PROVIDER=together   # or fireworks-ai / nebius / replicate
```

`Qwen/Qwen2.5-7B-Instruct` (the default) is the sweet spot: ~14 GB at bf16, fast, supports tool calls. Bump to `-32B-Instruct` if 7B starts emitting malformed tool arguments mid-audit.

### 6. Self-host Qwen on the same MI300X via vLLM (Path B)

The strongest "AMD-end-to-end" story: Qwen runs on the same MI300X that GPU Goblin is auditing, served by vLLM behind an OpenAI-compatible endpoint. Goblin already supports this; select it with `GOBLIN_AGENT_BACKEND=qwen-vllm`. The recipe below mirrors the [lablab vLLM-on-AMD-Developer-Cloud tutorial](https://lablab.ai/ai-tutorials/amd-developer-cloud-host-llm-vllm).

#### 6a. Stand up vLLM on the MI300X

```bash
# Inside your MI300X cloud instance (rocm/pytorch container or bare host).
# Pull AMD's official rocm/vllm image; it has vLLM + ROCm + Qwen support baked in.
docker pull rocm/vllm:latest

# Run vLLM serving a Qwen tool-calling model. `--tool-call-parser hermes`
# is the critical flag: it tells vLLM to parse Qwen's Hermes-format
# <tool_call> tags into the OpenAI `tool_calls` shape the agent expects.
#
# Pick ONE of the model recipes below.

# (a) Qwen2.5-32B-Instruct: recommended for the AMD GPU path. ~64 GB at
#     bf16 (well under MI300X's 192 GB). Tool calling is significantly
#     more reliable than 7B and the 32K context fits any audit conversation
#     comfortably.
#     First run downloads ~64 GB, takes 5-10 minutes.
docker run -d --name qwen-vllm \
  --device=/dev/kfd --device=/dev/dri --group-add video \
  --ipc=host --shm-size=16g \
  -p 8000:8000 \
  -v $HOME/.cache/huggingface:/root/.cache/huggingface \
  -e HF_TOKEN=$HF_TOKEN \
  rocm/vllm:latest \
  --model Qwen/Qwen2.5-32B-Instruct \
  --dtype bfloat16 \
  --max-model-len 32768 \
  --gpu-memory-utilization 0.85 \
  --enable-auto-tool-choice \
  --tool-call-parser hermes

# (b) Qwen2.5-7B-Instruct: light/fast, fine for smoke tests, occasionally
#     hallucinates rule ids on tool calls. Use --max-model-len 32768 (the
#     model's native cap) to keep audits from exhausting context near the
#     compare_runs step.
# docker run -d --name qwen-vllm \
#   ...same flags as above except...
#   --model Qwen/Qwen2.5-7B-Instruct \
#   --max-model-len 32768 \
#   ...

# Wait for "Application startup complete" in the logs, then verify:
docker logs -f qwen-vllm        # ctrl-C once you see "Application startup complete"
curl http://localhost:8000/v1/models
# → JSON listing the model id you served
```

Sanity check tool calling end-to-end:

```bash
curl -s http://localhost:8000/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "Qwen/Qwen2.5-32B-Instruct",
    "messages": [{"role": "user", "content": "Call get_weather for Paris."}],
    "tools": [{"type":"function","function":{"name":"get_weather","description":"weather","parameters":{"type":"object","properties":{"city":{"type":"string"}},"required":["city"]}}}],
    "tool_choice": "auto"
  }' | python3 -c "import sys,json;r=json.load(sys.stdin);print(r['choices'][0]['message'])"
```

Expect a `tool_calls` array with `name=get_weather` and `arguments` mentioning Paris. If you get plain text instead, check that both `--enable-auto-tool-choice` and `--tool-call-parser hermes` were passed to vLLM.

#### 6b. Point GPU Goblin at the local vLLM

```bash
export GOBLIN_AGENT_BACKEND=qwen-vllm
export GOBLIN_QWEN_VLLM_URL=http://localhost:8000/v1
export GOBLIN_QWEN_VLLM_MODEL=Qwen/Qwen2.5-32B-Instruct   # match the model you served
# Optional: only if you fronted vLLM with auth (default vLLM ignores the key):
# export GOBLIN_QWEN_VLLM_KEY=

python -m agent workloads/train_qwen_lora.py
```

Verify the agent picked up the right backend:

```bash
# The Goblin server, not vLLM. Both default to port 8000 in this README, and
# two servers cannot share one host port: if they run on the same host, start
# one of them on a different port and adjust the URLs to match.
curl http://localhost:8000/healthz
# → {"backend": "qwen-vllm", "vllm_url": "http://localhost:8000/v1", ...}
```

That's it: the agent loop, the tools, the system prompt, the SSE streaming, and the offline-replay fallback all carry over unchanged. The only difference is which OpenAI-compatible endpoint `QwenVLLMBackend` talks to.

#### 6c. Comparing the two backends

| Aspect | `qwen-hf` (default) | `qwen-vllm` |
|---|---|---|
| Auth | `HF_TOKEN` | none by default; optional `GOBLIN_QWEN_VLLM_KEY` |
| Compute | Together / Fireworks-AI / Nebius (HF routes) | Your MI300X |
| Latency | 200-500 ms / turn (network-bound) | 50-150 ms / turn (in-cluster) |
| Cost | HF Inference credits | Your AMD Developer Cloud GPU-hours |
| Demo story | "uses HF as the model hub" | "Qwen runs on the same MI300X it audits" |
| Setup time | 30 sec (just `HF_TOKEN`) | 2-3 min (model download + warmup) |
| Best for | HF Space deployment | Pitch demo on MI300X |

Run **both** during the hackathon: `qwen-hf` for the Space (judges who click the URL get a real audit without an MI300X); `qwen-vllm` for the live pitch demo (the strongest "all AMD" story, which the judging criterion "How effectively the chosen model is integrated" rewards).

## Deploying to Hugging Face Spaces

This repo is **already shaped to be a Hugging Face Space**: `README.md` carries the YAML frontmatter HF needs, and `requirements.txt` at the root is the deliberately minimal Streamlit-only dependency set (no torch / transformers / huggingface_hub at runtime).
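
One way to keep that dependency set minimal is a small guard that flags heavy packages if they ever land in `requirements.txt`. The following is a sketch under assumptions, not part of the repo: the `find_heavy_requirements` helper and the `HEAVY_PACKAGES` set are hypothetical names, and only the three package names come from the paragraph above.

```python
import re
from typing import List

# Packages the Space is meant to avoid at runtime (taken from the README text).
HEAVY_PACKAGES = {"torch", "transformers", "huggingface_hub"}


def _normalize(name: str) -> str:
    """Treat '-', '_' and '.' as equivalent and ignore case, as pip does."""
    return re.sub(r"[-_.]+", "_", name).lower()


def find_heavy_requirements(requirements_text: str) -> List[str]:
    """Return the heavy package names that appear in a requirements file body."""
    found = []
    for raw_line in requirements_text.splitlines():
        line = raw_line.split("#", 1)[0].strip()  # drop comments
        if not line or line.startswith("-"):       # skip blanks and pip options
            continue
        match = re.match(r"[A-Za-z0-9][A-Za-z0-9._-]*", line)
        if match and _normalize(match.group(0)) in HEAVY_PACKAGES:
            found.append(match.group(0))
    return found
```

Called on the file contents, for example `find_heavy_requirements(open("requirements.txt").read())`, an empty list means the Space image stays Streamlit-only; a pytest assertion on that result would catch regressions before a push.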

The deployed Space is the **offline-replay lane**: judges interact with a Streamlit UI that streams the canonical `142 → 318 tok/s (2.24×)` audit trajectory from `tests/fixtures/cached_audit.json`, without a live LLM, without a backend, and without our laptop. This is what satisfies the hackathon's "Demo Application Platform + Application URL" submission fields.

### One-time setup

1. Create a Hugging Face account at [huggingface.co](https://huggingface.co/) and accept the invite to the **AMD Developer Hackathon HF Organization** (link is on the [hackathon page](https://lablab.ai/ai-hackathons/amd-developer) under the Hugging Face section).
2. Create a token at [Settings → Access Tokens](https://huggingface.co/settings/tokens) with **`write`** scope (you need write access to push to the Space repo). Save it as `HF_PUSH_TOKEN`.
3. On the HF organization's page, click **"New Space"**:
   - Owner: AMD Developer Hackathon org
   - Space name: `gpu-goblin` (or your preferred slug)
   - License: MIT
   - SDK: **Streamlit**
   - Hardware: **CPU basic** (free; the Space loads no GPU code path)
   - Visibility: Public
4. Don't initialize the Space with anything; leave it empty so the first push lands cleanly.

### Deploy

From the project root, push the existing `feat/scaffold` branch to the Space's git remote:

```bash
# Add the Space remote (use HTTPS with your username + HF_PUSH_TOKEN as password):
git remote add space https://huggingface.co/spaces//gpu-goblin

# Push (HF Spaces use 'main' as the default branch):
git push space feat/scaffold:main
```

You'll see a build log at `https://huggingface.co/spaces//gpu-goblin`. Cold-start takes 30-60 seconds (Streamlit + the pure-pydantic deps); once up, the canonical demo trajectory replays in ~10 seconds when a judge clicks **"Use sample workload"**.

### What the Space looks like to judges

When a judge opens the Space URL:

1. The lane toggle defaults to **Offline replay**, which is appropriate for the Space since there's no MI300X behind it.
2. They click **"Use sample workload"** (which references `workloads/train_qwen_lora.py`).
3. Streamlit attempts to reach `http://localhost:8000/audit`, fails with `ConnectionError`, surfaces a one-line warning ("Backend unreachable — running offline-replay demo from cached audit"), then plays the cached audit trajectory event-by-event with `~0.4s` pauses between events.
4. Final report renders: `Tokens/sec: 142 → 318 (2.24×)` with the side-by-side metrics table, waste-budget bar chart, diff viewer, and per-rule citations.

### Updating the Space

After any change to the main repo, redeploy:

```bash
git push space feat/scaffold:main
```

HF rebuilds the Space automatically on push.

### (Stretch) Live agent in the Space

The shipped Space is read-only: it doesn't reach a real LLM. If you want judges to drive the agent live, there are two paths:

1. **Stand up the FastAPI backend somewhere reachable** (an MI300X on AMD Developer Cloud, an HF Inference Endpoint, a small CPU box) and set the Space's `GOBLIN_BACKEND_URL` secret to that URL. The Streamlit app will stream real SSE from your backend instead of the cached replay.
2. **Embed the agent loop in-process** (refactor `ui/app.py` to call `agent.loop.run_audit` directly via `asyncio.run`). This adds `huggingface_hub` to `requirements.txt` and requires `HF_TOKEN` as a Space secret. Larger cold-start, fully self-contained.

Both are post-MVP; the offline-replay Space is what satisfies the submission requirement.

## Configuration Reference

| Env var | Default | Purpose |
|---|---|---|
| `GOBLIN_AGENT_BACKEND` | `qwen-hf` | Pick the LLM backend: `qwen-hf` (HF Inference Providers) or `qwen-vllm` (self-hosted vLLM on MI300X). |
| `HF_TOKEN` | *(none)* | Required when `qwen-hf` is active. Hugging Face Inference token. |
| `GOBLIN_QWEN_MODEL` | `Qwen/Qwen2.5-7B-Instruct` | Model id used by the `qwen-hf` backend. |
| `GOBLIN_QWEN_PROVIDER` | `auto` | HF Inference Provider routing (`auto` / `together` / `fireworks-ai` / `nebius` / ...). |
| `GOBLIN_QWEN_VLLM_URL` | `http://localhost:8000/v1` | Base URL of the self-hosted vLLM endpoint (only used when `qwen-vllm` is active). |
| `GOBLIN_QWEN_VLLM_MODEL` | `Qwen/Qwen2.5-7B-Instruct` | Model id served by your local vLLM. |
| `GOBLIN_QWEN_VLLM_KEY` | `EMPTY` | Optional auth token if you front vLLM with nginx/Caddy + auth. |
| `GOBLIN_BACKEND_URL` | `http://localhost:8000/audit` | UI's backend endpoint. |
| `ROCM_IMAGE_TAG` | `unknown` | Container tag mixed into the benchmark cache key. |
| `GOBLIN_GPU_ID` | `0` | Which `/dev/dri/renderD*` to bind in `goblin_runner.sh`. |
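
The table above maps naturally onto a small settings object. What follows is a minimal sketch of how these variables and their defaults could be read, not the project's actual loader: the `GoblinSettings` dataclass, its field names, and `load_settings` are hypothetical, and failing fast on a missing `HF_TOKEN` is an assumption (the real code may drop to offline replay instead). Only the variable names and default values are taken from the table.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class GoblinSettings:
    """Hypothetical container; field names are illustrative, not the repo's."""
    agent_backend: str
    hf_token: Optional[str]
    qwen_model: str
    qwen_provider: str
    vllm_url: str
    vllm_model: str
    vllm_key: str
    backend_url: str
    rocm_image_tag: str
    gpu_id: int


def load_settings(env: Mapping[str, str] = os.environ) -> GoblinSettings:
    """Build settings from `env`, using the defaults listed in the table."""
    backend = env.get("GOBLIN_AGENT_BACKEND", "qwen-hf")
    if backend not in ("qwen-hf", "qwen-vllm"):
        raise ValueError(f"unsupported GOBLIN_AGENT_BACKEND: {backend!r}")

    hf_token = env.get("HF_TOKEN") or None
    # Assumption: fail fast when the HF backend has no token.
    if backend == "qwen-hf" and hf_token is None:
        raise ValueError("HF_TOKEN is required when qwen-hf is active")

    return GoblinSettings(
        agent_backend=backend,
        hf_token=hf_token,
        qwen_model=env.get("GOBLIN_QWEN_MODEL", "Qwen/Qwen2.5-7B-Instruct"),
        qwen_provider=env.get("GOBLIN_QWEN_PROVIDER", "auto"),
        vllm_url=env.get("GOBLIN_QWEN_VLLM_URL", "http://localhost:8000/v1"),
        vllm_model=env.get("GOBLIN_QWEN_VLLM_MODEL", "Qwen/Qwen2.5-7B-Instruct"),
        vllm_key=env.get("GOBLIN_QWEN_VLLM_KEY", "EMPTY"),
        backend_url=env.get("GOBLIN_BACKEND_URL", "http://localhost:8000/audit"),
        rocm_image_tag=env.get("ROCM_IMAGE_TAG", "unknown"),
        gpu_id=int(env.get("GOBLIN_GPU_ID", "0")),
    )


if __name__ == "__main__":
    # Passing a plain dict instead of os.environ keeps the example reproducible.
    settings = load_settings({"GOBLIN_AGENT_BACKEND": "qwen-vllm"})
    print(settings.vllm_url)  # http://localhost:8000/v1
```

Taking the environment as a `Mapping` argument rather than reading `os.environ` inline keeps the loader easy to exercise in tests without mutating the process environment.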