ROCKET: An autonomous performance agent for AMD MI300X

Community Article Published May 10, 2026

The first question every AMD developer asks

Pick up a fresh PyTorch model on an AMD MI300X and your first instinct is almost always the same: how do I make this faster?

You start the usual loop. Run torch.profiler or rocprof. Read the trace. Try bf16. Try torch.compile. Try fused attention. Re-bench. Diff the outputs. Discard the change that broke correctness. Try again.

It's a loop you can absolutely run yourself. It's also a loop that begs to be automated — because it's bounded, measurable, and almost embarrassingly mechanical. The only judgment call is which knob to turn next given what the profile says.

That judgment call is exactly the thing modern LLMs are good at. So I built an agent that closes the loop.

ROCKET is an autonomous performance optimizer for AMD MI300X. You hand it a PyTorch model. It profiles, hypothesizes, applies one transformation at a time, validates correctness, re-benchmarks, and stops when no remaining tool beats the threshold. The output is a measured speedup, a JSONL research log, and a PR-ready diff.

On Qwen2.5-7B-Instruct (batch 8, prompt 256, generated 512) it took the model from 62.6 → 183.5 tok/s on a single MI300X — a 2.93× honest, end-to-end speedup. The agent tried 5 tools and kept 1. It rejected the 4 that didn't beat the validation threshold.

This post is about how it works, the methodology choice that made the build tractable in 24 hours, and the honest limits of what a result like this means.

The hardware

I built and benchmarked everything on the AMD Developer Cloud — single Instinct MI300X droplet, 192 GB HBM3, ROCm 7.0, PyTorch 2.6.0. The same MI300X serves the planner LLM (Qwen2.5-7B-Instruct via vLLM) AND runs the model under test. One node, end to end. No cross-cloud orchestration, no second machine. That mattered for the 24-hour clock.

If you've only ever tuned on NVIDIA, two things will surprise you. First, the developer ergonomics are closer than the marketing suggests — torch.profiler works, bf16 cast is one line, vLLM serves cleanly. Second, ROCm 7 + PyTorch 2.6 is mature enough that the same idioms transfer; you don't need a separate mental model. That's the part that makes an autonomous agent on AMD interesting now and not eighteen months ago.

The architecture: four agents in a tight evaluation loop

┌──────────────┐
│   Profiler   │  torch.profiler / rocprof — hot-spot summary
└──────┬───────┘
       │
       ▼
┌──────────────┐
│   Planner    │  Qwen2.5-7B-Instruct (vLLM) — picks ONE tool
└──────┬───────┘
       │
       ▼
┌──────────────┐
│ Implementer  │  applies one of 5 bounded transformations
└──────┬───────┘
       │
       ▼
┌──────────────┐
│  Validator   │  re-bench + correctness diff
└──────┬───────┘
       │
       └──── loop until no tool beats the threshold

Each component does one thing. The agent's only freedom is which tool the Planner picks on each pass through the loop. That constraint is doing a lot of work; more on that in a second.
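The loop in the diagram can be sketched in a few lines of Python. This is a minimal illustration, not ROCKET's actual orchestrator: `optimize`, `Result`, the callables passed in, and the 5% `min_gain` default are hypothetical names and values chosen for the example. It also assumes each tool is tried at most once, and that `apply_and_measure` applies the tool on top of the changes kept so far and reverts it if it is rejected.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class Result:
    tok_per_s: float      # measured throughput with the candidate applied
    outputs_match: bool   # did the correctness diff stay within tolerance?


def optimize(
    profile: Callable[[], str],
    plan: Callable[[str, List[str]], Optional[str]],
    apply_and_measure: Callable[[str], Result],
    tools: List[str],
    baseline_tok_per_s: float,
    min_gain: float = 0.05,  # hypothetical threshold, not ROCKET's real value
) -> Tuple[float, List[Tuple[str, str]]]:
    """Profile -> plan -> apply -> validate, one tool per iteration."""
    best = baseline_tok_per_s
    remaining = list(tools)
    history: List[Tuple[str, str]] = []

    while remaining:
        summary = profile()                 # Profiler: hot-spot summary
        tool = plan(summary, remaining)     # Planner: picks ONE tool
        if tool is None or tool not in remaining:
            break                           # planner has nothing valid left to try
        remaining.remove(tool)

        result = apply_and_measure(tool)    # Implementer + Validator
        gain = result.tok_per_s / best - 1.0
        if result.outputs_match and gain >= min_gain:
            best = result.tok_per_s
            history.append((tool, "kept"))
        else:
            history.append((tool, "rejected"))

    return best, history
```

In this sketch the Planner is just a callable that returns a tool name, so an LLM call, a fixed checklist, or a random pick can all be dropped in behind the same interface and compared.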

The bounded toolbox

ROCKET does not write arbitrary code. The Planner picks from a fixed, curated set of high-leverage transformations:

Tool	What it does	Why it might win
`dtype_cast`	cast model to bf16 / fp16	halves memory, ~2× throughput on MI300X
`torch_compile`	Inductor-fused kernels via `torch.compile`	autotuned graph for residual + RMSNorm + matmul
`sdpa_attention`	switch to PyTorch's fused scaled-dot-product attention	built-in fast path
`input_padding`	pad shapes to 128/256 multiples	GPU-friendly tile alignment
`kv_cache_config`	ensure KV-caching is enabled and well-shaped	2-4× on autoregressive generation

The agent's intelligence goes into which transformation to try, in what order, with which params, given the profile. That's the design choice the rest of this post is really about.
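One way to picture a bounded toolbox is a small registry that the Planner's output is checked against before anything touches the model. The sketch below is an assumption about how such a registry could look, not ROCKET's real schema: the `Tool` class, the `TOOLBOX` dict, `validate_choice`, and the `{"tool": ..., "params": ...}` shape of the planner output are all hypothetical. The tool names and the allowed values (bf16 / fp16, 128 / 256) come from the table above.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Tool:
    name: str
    description: str
    # Allowed values per parameter (hypothetical parameter names).
    params: Dict[str, List] = field(default_factory=dict)


TOOLBOX: Dict[str, Tool] = {
    t.name: t
    for t in [
        Tool("dtype_cast", "cast model to bf16 / fp16", {"dtype": ["bf16", "fp16"]}),
        Tool("torch_compile", "Inductor-fused kernels via torch.compile"),
        Tool("sdpa_attention", "switch to fused scaled-dot-product attention"),
        Tool("input_padding", "pad shapes to 128/256 multiples", {"multiple": [128, 256]}),
        Tool("kv_cache_config", "ensure KV-caching is enabled and well-shaped"),
    ]
}


def validate_choice(choice: dict) -> Tool:
    """Reject any planner output that falls outside the fixed action space."""
    name = choice.get("tool")
    if name not in TOOLBOX:
        raise ValueError(f"unknown tool: {name!r}")
    tool = TOOLBOX[name]
    for key, value in choice.get("params", {}).items():
        allowed = tool.params.get(key)
        if allowed is None:
            raise ValueError(f"{name} has no parameter {key!r}")
        if value not in allowed:
            raise ValueError(f"{name}.{key}={value!r} not in {allowed}")
    return tool
```

With a check like this, a hallucinated tool name or an out-of-range parameter fails fast with a `ValueError` instead of costing a benchmark run.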

Why bounded toolbox > free-form codegen (especially in 24 hours)

The obvious version of this project is "give the LLM the trace and let it write whatever PyTorch / Triton / HIP code it wants." I tried that path in hour two. It did not work, for three reasons.

1. The validation cost dominates. Free-form generated code fails to compile, fails correctness, or fails subtly (logits drift past tolerance). Each failure costs you a benchmark run. With a bounded toolbox, every candidate is a known, well-formed transformation that runs by construction; the only questions are whether it helps for this model and whether it passes the correctness gate. You convert "did the agent write valid code?" (an unbounded problem) into "did this transformation move tok/s?" (a measurable one).

2. The signal is in choice, not generation. The interesting expertise of a perf engineer is "given this trace, the next thing to try is X" — not "here is a novel kernel." Most of the wins in real perf work come from a small set of well-known transformations applied in the right order. A bounded toolbox lets the LLM put its IQ where it belongs.

3. It makes the agent auditable. The output isn't a black-box diff. It's a sequence of named transformations, each with a hypothesis, a measured Δ, and a kept/rejected decision. Anyone can read the JSONL trace and reproduce the run.
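To make the auditability point concrete, here is a minimal sketch of what one line of such a JSONL trace could look like. The `Step` fields mirror the items named above (tool, hypothesis, measured Δ, decision), but the exact field names, the `to_jsonl_line` helper, and `append_step` are assumptions for illustration; the real schema of `logs/run.jsonl` may differ.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class Step:
    step: int
    tool: str
    hypothesis: str
    delta_pct: float   # measured change in tok/s, in percent
    decision: str      # "kept" or "rejected"


def to_jsonl_line(step: Step) -> str:
    """Serialize one step as a single JSON object on one line."""
    return json.dumps(asdict(step), ensure_ascii=False)


def append_step(path: str, step: Step) -> None:
    """Append a step to the log; one JSON object per line."""
    with open(path, "a", encoding="utf-8") as f:
        f.write(to_jsonl_line(step) + "\n")


# First row of the run reported in this post, in the hypothetical schema.
first = Step(
    step=1,
    tool="dtype_cast",
    hypothesis="fp32 weights bottleneck memory bandwidth",
    delta_pct=193.0,
    decision="kept",
)
print(to_jsonl_line(first))
```

Because each line is self-describing, the log can be appended to during a run and read back line by line without loading the whole file.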

This is the methodology I want to defend hardest. The agent's freedom is the choice; the action space is bounded. That's what made a 24-hour build produce honest numbers instead of vibes.

Decision discipline: why the speedup is "honest"

I keep using the word honest. It's doing real work. Three rules.

Measured, not modeled. Every candidate triggers a full benchmark re-run. No back-of-envelope estimates, no "should be faster." If wall-clock didn't move, the change didn't happen. This is not as common in optimization papers as you'd hope.

Correctness gate. After each candidate, outputs are diffed against the baseline within a numerical tolerance. A change that's faster but produces drifted logits is automatically rejected. The bf16 cast, for example, is gated on this — if the diff exceeds tolerance for your model, ROCKET will refuse to keep it.

One variable at a time. Each iteration mutates one knob. Confounds are eliminated by construction. When ROCKET reports a 1.4× win, you know exactly which line caused it.
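The "measured, not modeled" rule comes down to timing a full generation and dividing tokens by wall-clock seconds. The harness below is a sketch under stated assumptions rather than ROCKET's benchmark code: `measure_tok_per_s`, the `generate` callable (which must run one full generation and return the number of tokens produced), the `sync` hook, and the warmup/repeat counts are all hypothetical. Taking the median of a few repeats is also an addition of this sketch; the run in this post is single-seed.

```python
import statistics
import time
from typing import Callable


def measure_tok_per_s(
    generate: Callable[[], int],
    warmup: int = 1,
    repeats: int = 3,
    sync: Callable[[], None] = lambda: None,
) -> float:
    """Return the median tokens/second over `repeats` timed runs."""
    # Warmup runs absorb one-off costs (allocator growth, kernel compilation).
    for _ in range(warmup):
        generate()
    sync()

    rates = []
    for _ in range(repeats):
        start = time.perf_counter()
        n_tokens = generate()
        # GPU work is asynchronous; on a real run `sync` would be something
        # like torch.cuda.synchronize so the timer sees finished work.
        sync()
        elapsed = time.perf_counter() - start
        rates.append(n_tokens / elapsed)
    return statistics.median(rates)
```

Keeping the same harness, the same prompt, and the same token budget for the baseline and for every candidate is what makes the per-tool deltas comparable.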

The result on Qwen2.5-7B was 4 rejections, 1 acceptance, 0 silent regressions.
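The correctness gate and the threshold can be combined into one keep/reject decision. The sketch below is illustrative only: `outputs_match`, `decide`, the tolerance values, and the 5% `min_gain` are hypothetical, since the post does not state the tolerance or threshold ROCKET uses. It compares plain Python floats; on real logits one would more likely compare tensors with something like `torch.allclose`.

```python
import math
from typing import Sequence


def outputs_match(
    baseline: Sequence[float],
    candidate: Sequence[float],
    rel_tol: float = 1e-2,   # hypothetical tolerance
    abs_tol: float = 1e-2,   # hypothetical tolerance
) -> bool:
    """Element-wise tolerance check of candidate outputs against the baseline."""
    if len(baseline) != len(candidate):
        return False
    return all(
        math.isclose(c, b, rel_tol=rel_tol, abs_tol=abs_tol)
        for b, c in zip(baseline, candidate)
    )


def decide(
    best_tok_per_s: float,
    candidate_tok_per_s: float,
    correct: bool,
    min_gain: float = 0.05,  # hypothetical threshold
) -> str:
    """Keep a change only if it is both correct and measurably faster."""
    if not correct:
        return "rejected"    # faster but drifted is still a rejection
    gain = candidate_tok_per_s / best_tok_per_s - 1.0
    return "kept" if gain >= min_gain else "rejected"


print(decide(62.6, 183.5, correct=True))            # kept
print(decide(183.5, 183.5 * 1.027, correct=True))   # rejected
```

The second call mirrors the `torch_compile` row of the run: a +2.7% gain is real but falls under the assumed 5% bar, so the change is not kept.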

The actual run

Here is what the trace looks like, edited for readability:

#	Tool	Hypothesis	Measured Δ	Decision
01	`dtype_cast`	fp32 weights bottleneck memory bandwidth	+193%	KEPT — 62.6 → 183.5 tok/s
02	`kv_cache_config`	inspect cache; verify enabled and well-shaped	+0.2%	rejected — already on by default
03	`sdpa_attention`	switch to fused SDPA	−1.8%	rejected — Qwen path already uses fused
04	`torch_compile`	Inductor will fuse residual + RMSNorm + matmul	+2.7%	rejected — below threshold
05	`input_padding`	pad seqlen to 256 multiple	+1.4%	rejected — below threshold

Final: 62.6 → 183.5 tok/s = 2.93× honest speedup, no human in the loop.
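The two headline figures are the same measurement expressed two ways, which is easy to check. The helper names `speedup` and `delta_pct` below are just for this example.

```python
def speedup(before: float, after: float) -> float:
    """Ratio of new throughput to old throughput."""
    return after / before


def delta_pct(before: float, after: float) -> float:
    """Relative change in percent, as used in the 'Measured Δ' column."""
    return (after - before) / before * 100.0


baseline, final = 62.6, 183.5  # tok/s, from the run above
print(f"{speedup(baseline, final):.2f}x")     # 2.93x
print(f"{delta_pct(baseline, final):+.0f}%")  # +193%
```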

I want to be clear-eyed about what this number does and doesn't show. It shows that the agent found and applied the right primary lever (bf16) and correctly rejected four plausible alternatives that would not have helped on this model — including changes that look promising on paper. That's the value: not "agents discover surprising new optimizations," but "agents reliably triage the obvious things in the right order, reject the dead ends, and ship a verified diff."

For a different model (say, one where bf16 drifts past tolerance, or where attention isn't already fused) the agent would make a different sequence of choices. That's the property worth defending.

What this is and isn't

Is: a working autopilot for the most common perf-tuning loop on AMD MI300X. Reproducible, replayable, and gated by real measurement. Solo build, 24 hours, on a single droplet.

Isn't: a general-purpose superhuman compiler. The toolbox is small (5 tools). The validation is single-seed (no statistical bands yet). Only one model family was tested end-to-end. The 2.93× is dominated by one transformation that an experienced engineer would also have tried first.

If you're an AMD developer, what ROCKET buys you today is the triage — the part of the work that's mechanical but tedious, where most of the time goes to "did I remember to try this?" and "did I measure it correctly?". The agent runs that loop honestly and hands you a diff.

The HF Space

A run that needs an MI300X can't execute on a CPU Space, so the HF Space ships a replay of an actual ROCKET run. The agent ran on the AMD Developer Cloud droplet, the trace was dumped to logs/run.jsonl, and the Space animates that trace: a live tok/s chart, the planner's reasoning at each step, and the kept/rejected decisions. Proof, not screenshots.

→ https://huggingface.co/spaces/lablab-ai-amd-developer-hackathon/ROCKET-mi300x

If you'd like to ❤️ the Space it helps with the community prize at the hackathon — but the code is Apache-2.0 either way, so what I'd really love is for someone to take it and run it on a different model.

Run it yourself

# On an MI300X droplet (AMD Developer Cloud, ROCm 7.0 + PyTorch 2.6.0 image)
git clone https://github.com/KMaruthi2002/rocket
cd rocket
pip install -r requirements.txt

# Start the planner brain
vllm serve Qwen/Qwen2.5-7B-Instruct --port 8000 &

# Run the orchestrator on a target model
python -m rocket.orchestrator \
    --model Qwen/Qwen2.5-1.5B-Instruct \
    --iterations 5

The trace lands in logs/run.jsonl. That file is the source of truth — drop it into the Space's replay viewer to animate any run.
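If you want a quick summary of a run without the replay viewer, a few lines of Python are enough. This sketch assumes each line of the trace is a JSON object with `tool` and `decision` fields, and that decisions are the strings `"kept"` and `"rejected"`. Those field names and values are guesses, so check them against your own `logs/run.jsonl` before relying on it; `summarize` is a hypothetical helper, not part of the repo.

```python
import json
from collections import Counter
from typing import Iterable


def summarize(lines: Iterable[str]) -> dict:
    """Count kept/rejected steps and list the tools that were kept."""
    counts: Counter = Counter()
    kept_tools = []
    for line in lines:
        line = line.strip()
        if not line:
            continue                      # tolerate blank lines
        record = json.loads(line)
        decision = record["decision"]     # assumed field name
        counts[decision] += 1
        if decision == "kept":
            kept_tools.append(record["tool"])  # assumed field name
    return {
        "kept": counts["kept"],
        "rejected": counts["rejected"],
        "kept_tools": kept_tools,
    }
```

Since a file object iterates line by line, usage would look like `with open("logs/run.jsonl") as f: print(summarize(f))`.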

What's next

A few directions I want to push if I keep going:

Wider toolbox. Add CK (Composable Kernel) matmul, FP8, paged-KV, and ROCm-native kernels. Each new tool widens the search space while the agent itself stays the same.
Multi-model benchmarks. Llama 3, Mixtral, FLUX. Same loop, same gates — comparable speedup numbers across families. This is where you start to see whether the agent's choice ordering matters or whether a fixed checklist would do as well.
PR mode. Output a clean diff + a trace + a rationale. Drop into any PyTorch repo as a CI optimization step.
Comparison vs. baselines. TVM Ansor, manual expert tuning, naive checklist. The interesting question isn't "did the agent win?" — it's "did the agent win where the alternatives don't?"

If those experiments hold up, this becomes more than a hackathon project. For now: it's an honest 2.93× on real AMD hardware, in 24 hours, with a methodology I'd defend in a code review.

Links

Code: github.com/KMaruthi2002/rocket (Apache-2.0)
Live demo (HF Space): lablab-ai-amd-developer-hackathon/ROCKET-mi300x
Hardware: AMD Developer Cloud · Instinct MI300X · ROCm 7

Built solo by Maruthi Kunchala for the AMD × lablab.ai Developer Hackathon, May 2026. Questions, ideas, or models you want me to throw at ROCKET? Open an issue on the repo or ping me on the HF Space discussion.
