WebGPU Is the Compute Path for AMD AI PCs
A practical guide to running LLMs, face animation, and AI inference on AMD Strix Halo unified memory hardware through WebGPU -- bypassing the broken ROCm stack entirely.
By Joshua Orsak (LJTSG) and Claude.
The Problem
AMD sells Ryzen AI Max (Strix Halo) machines as "AI PCs" -- 64-128GB unified memory, RDNA 3.5 iGPU (gfx1151), marketed explicitly for local AI workloads. GMKtec, Minisforum, ASUS, and now AMD themselves sell mini PCs with this chip at $1,500-$4,000.
The compute software does not work.
ROCm on gfx1151 is marked "Preview" -- not production supported. In practice:
- Output corrupts after 4-5 LLM turns (ROCm #5499)
- Firmware bricked compute for months (MES firmware regression)
- Official PyTorch wheels crash with "invalid device function"
- hipBLASLt falls back to hipBLAS -- 9% of theoretical FLOPS
- amd-smi reports ALL monitoring metrics as N/A
- NPU (XDNA 2) SDK refuses with "unsupported platform" on Linux
- The workaround tax: HSA_OVERRIDE_GFX_VERSION, HSA_ENABLE_SDMA=0, specific kernel versions, community-built PyTorch wheels
The hardware delivers exactly what AMD promised. The software makes it unusable for the thing they told you to buy it for.
The Solution
WebGPU routes through the gaming driver, not the compute driver.
On Windows, Chrome's WebGPU uses Direct3D 12. On Linux, Vulkan. Both are AMD's gaming driver stack -- tested by millions of gamers, optimized by market pressure, protected from regressions by AAA studios. ROCm is a separate compute stack that AMD cannot keep working on new hardware.
The same GPU that crashes ROCm after 4 turns runs Gemma 26B at 22 tok/s through WebGPU without a hiccup. Same silicon. Different driver path.
What We Proved
| Workload | Model | Performance | ROCm comparison |
|---|---|---|---|
| LLM (26B MoE) | Gemma-4-26B-A4B | 22 tok/s, 20GB loaded | ROCm: corruption after 4 turns |
| LLM (7.8B reasoning) | EXAONE-Deep-7.8B | Full chain-of-thought | First ever on WebGPU |
| LLM (3.8B reasoning) | Phi-4-mini-reasoning | Instant math CoT | First reasoning variant on WebGPU |
| LLM (360M) | SmolLM2-360M | Sub-2s load, instant gen | -- |
| LLM (9B character) | Tiger-Gemma-9B-v3 | Character voice embodiment | First on WebGPU |
| Face animation | FLOAT (StyleGAN2) | 12.9 fps decoder on iGPU | ROCm: fans screaming, threatens restart |
| SSM inference | Falcon-Mamba-7B | 3 tok/s, 38MB persistent state | First browser-native SSM runtime |
Key hardware finding: the Strix Halo exposes a 2048 MB maximum WebGPU buffer size and 31.5 GB of allocatable memory. This is the unified memory architecture showing through WebGPU. Regular laptops expose 4-6 GB. We also tested a non-Strix machine, and it could not handle the same workloads. The unified memory is the differentiator.
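To check what a given machine exposes, the adapter limits can be read from the browser console. This is a minimal sketch, assuming the 2048 MB figure corresponds to the standard `maxBufferSize` limit. `summarizeLimits` and `probeWebGPU` are illustrative names, not part of any library. The allocatable-memory figure is not a standard WebGPU limit and is not covered here.

```javascript
// Convert the adapter's reported limits (in bytes) to MiB for easy comparison.
// `summarizeLimits` is a hypothetical helper name, not part of the WebGPU API.
function summarizeLimits(limits) {
  const MiB = 1024 * 1024;
  return {
    maxBufferMiB: limits.maxBufferSize / MiB,
    maxStorageBindingMiB: limits.maxStorageBufferBindingSize / MiB,
  };
}

// Query the real adapter when running in a WebGPU-capable browser.
// Returns null when WebGPU is unavailable (for example under Node.js).
async function probeWebGPU() {
  if (typeof navigator === 'undefined' || !navigator.gpu) {
    return null;
  }
  const adapter = await navigator.gpu.requestAdapter();
  if (!adapter) {
    return null;
  }
  return summarizeLimits(adapter.limits);
}
```

In Chrome, `probeWebGPU().then(console.log)` prints the two values for the current adapter. Note that a device only receives limits above the defaults if they are requested through `requiredLimits` when calling `requestDevice`.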
Why WebGPU Works Better: Three Mechanisms
1. Shader compilation. ROCm JIT-compiles HIP kernels through its own compiler, which barely supports gfx1151. WebGPU compiles WGSL through the SAME shader compiler that compiles game shaders -- mature, tested, optimized for this exact hardware.
2. Memory model. ROCm was designed for discrete GPUs with separate VRAM. On UMA it still does fake host-to-device copies that are just memcpy to the same RAM. Vulkan/WebGPU on UMA maps buffers as DEVICE_LOCAL | HOST_VISIBLE -- zero copies. CPU and GPU read the same physical memory.
3. Validation. WebGPU validates every operation before dispatch. Out-of-bounds, format mismatches, bad workgroup sizes -- caught before the driver sees them. ROCm has no validation layer. Driver bugs produce corruption and hangs in ROCm; they produce clean error messages in WebGPU.
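The validation behavior in point 3 can be observed directly with WebGPU error scopes. The sketch below wraps any device call in a validation scope and turns a reported error into an exception. `pushErrorScope` and `popErrorScope` are standard `GPUDevice` methods; `withValidation` is a hypothetical helper name introduced here for illustration.

```javascript
// Run `fn` inside a WebGPU validation error scope and surface any
// captured error as a thrown exception.
// `withValidation` is a hypothetical helper, not a WebGPU or wllama API.
async function withValidation(device, fn) {
  device.pushErrorScope('validation');
  const result = fn(device);
  // popErrorScope() resolves to null when no validation error was captured.
  const error = await device.popErrorScope();
  if (error) {
    throw new Error(`WebGPU validation failed: ${error.message}`);
  }
  return result;
}
```

A call such as `await withValidation(device, d => d.createBuffer({ size: 1024, usage: GPUBufferUsage.STORAGE }))` either returns the buffer or throws with the validation message, instead of failing later inside the driver.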
How to Run Any Model on WebGPU (Step by Step)
Prerequisites
- Node.js (for the file server)
- Chrome browser (WebGPU enabled by default)
npm install @wllama/wllama
Step 1: Get a GGUF
Download from bartowski, unsloth, or any GGUF quantizer on HuggingFace. Q4_K_M is the sweet spot for quality vs size.
Step 2: Split for Parallel Download
llama-gguf-split --split --split-max-size 1G model.gguf model_splits/model
This creates proper GGUF splits with valid headers that wllama can download in parallel.
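The split tool names its output `<prefix>-NNNNN-of-NNNNN.gguf`, with 1-based, zero-padded indices. A small helper like the one below can list the expected shard URLs, for example to confirm that every shard is reachable from the file server before loading. `splitUrls` and its parameters are illustrative names, and the base URL in the usage note is a placeholder.

```javascript
// Build the list of shard URLs produced by llama-gguf-split.
// `splitUrls` is a hypothetical helper; baseUrl and prefix are placeholders.
function splitUrls(baseUrl, prefix, count) {
  // Indices are 1-based and zero-padded to five digits.
  const pad = (n) => String(n).padStart(5, '0');
  return Array.from({ length: count }, (_, i) =>
    `${baseUrl}/${prefix}-${pad(i + 1)}-of-${pad(count)}.gguf`);
}
```

For a model split into six parts, `splitUrls('http://localhost:8080/model', 'model', 6)` starts with `http://localhost:8080/model/model-00001-of-00006.gguf`.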
Step 3: File Server
// serve.js -- headers that enable cross-origin isolation, which
// SharedArrayBuffer (used by multi-threaded WASM) requires
const headers = {
  'Cross-Origin-Embedder-Policy': 'require-corp',
  'Cross-Origin-Opener-Policy': 'same-origin',
  'Access-Control-Allow-Origin': '*',
};
Full server code is in any of our repos. Range request support is needed for parallel downloads.
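For readers who want something to start from, here is a minimal sketch of such a server using only Node.js built-ins. It is not the server from the repos. The content-type map, the `parseRange` helper, and the port in the usage note are assumptions made for illustration. It sends the three headers above on every response and answers single-range requests with 206.

```javascript
// Minimal static file server sketch with cross-origin isolation headers
// and single-range support. Not the authors' serve.js.
const http = require('node:http');
const fs = require('node:fs');
const path = require('node:path');

const HEADERS = {
  'Cross-Origin-Embedder-Policy': 'require-corp',
  'Cross-Origin-Opener-Policy': 'same-origin',
  'Access-Control-Allow-Origin': '*',
};
const TYPES = { '.html': 'text/html', '.js': 'text/javascript', '.wasm': 'application/wasm' };

// Parse a single "bytes=start-end" header. Returns null if absent or unsatisfiable.
function parseRange(header, size) {
  const m = /^bytes=(\d*)-(\d*)$/.exec(header || '');
  if (!m || (m[1] === '' && m[2] === '')) return null;
  let start;
  let end;
  if (m[1] === '') {
    // Suffix form: the last N bytes of the file.
    start = Math.max(0, size - Number(m[2]));
    end = size - 1;
  } else {
    start = Number(m[1]);
    end = m[2] === '' ? size - 1 : Math.min(Number(m[2]), size - 1);
  }
  return start <= end && start < size ? { start, end } : null;
}

function createServer(root) {
  const rootAbs = path.resolve(root);
  return http.createServer((req, res) => {
    let urlPath;
    try {
      urlPath = decodeURIComponent(new URL(req.url, 'http://localhost').pathname);
    } catch {
      res.writeHead(400, HEADERS).end();
      return;
    }
    // Resolve inside the root and reject anything that escapes it.
    const file = path.resolve(rootAbs, '.' + urlPath);
    if (file !== rootAbs && !file.startsWith(rootAbs + path.sep)) {
      res.writeHead(403, HEADERS).end();
      return;
    }
    fs.stat(file, (err, stat) => {
      if (err || !stat.isFile()) {
        res.writeHead(404, HEADERS).end();
        return;
      }
      const type = TYPES[path.extname(file)] || 'application/octet-stream';
      const base = { ...HEADERS, 'Content-Type': type, 'Accept-Ranges': 'bytes' };
      if (stat.size === 0) {
        res.writeHead(200, { ...base, 'Content-Length': 0 }).end();
        return;
      }
      const range = parseRange(req.headers.range, stat.size);
      if (req.headers.range && !range) {
        res.writeHead(416, { ...base, 'Content-Range': `bytes */${stat.size}` }).end();
        return;
      }
      const { start, end } = range || { start: 0, end: stat.size - 1 };
      res.writeHead(range ? 206 : 200, {
        ...base,
        'Content-Length': end - start + 1,
        ...(range && { 'Content-Range': `bytes ${start}-${end}/${stat.size}` }),
      });
      fs.createReadStream(file, { start, end }).pipe(res);
    });
  });
}
```

To try it, add `createServer('.').listen(8080);` at the bottom and run `node serve.js` from the project directory. The port is arbitrary.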
Step 4: Load and Generate
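import { Wllama } from './node_modules/@wllama/wllama/esm/index.js';

// Point wllama at its WASM binary and allow 5 downloads in parallel.
const wllama = new Wllama(
  { default: './node_modules/@wllama/wllama/esm/wasm/wllama.wasm' },
  { parallelDownloads: 5 }
);

// `url` is the base URL of the file server from Step 3.
await wllama.loadModelFromUrl(url + '/model/model-00001-of-00006.gguf', {
  n_gpu_layers: 99, // request more layers than the model has, so all are offloaded to the GPU
  n_ctx: 4096,      // context window in tokens
});

const result = await wllama.createCompletion({
  prompt: '<your chat template here>',
  max_tokens: 512,
  temperature: 0.7,
  stop: ['<end_token>'],
});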
Step 5: Chat Template
Each model needs its own format. Examples:
- Gemma: <start_of_turn>user\n...<end_of_turn>\n<start_of_turn>model\n
- EXAONE: [|system|]...[|endofturn|]\n[|user|]...\n[|assistant|]<thought>\n
- Phi-4: <|system|>...<|end|><|user|>...<|end|><|assistant|>
- SmolLM2/Llama: <|im_start|>system\n...<|im_end|>\n<|im_start|>user\n...<|im_end|>\n<|im_start|>assistant\n
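Two of the formats listed above can be turned into small prompt builders. This is a sketch: the function names and the `{ role, content }` message shape are assumptions made for illustration, not part of wllama. Because the Gemma format shown has no system turn, this sketch emits every non-assistant message as a user turn.

```javascript
// Build a Gemma prompt from a list of { role, content } messages.
// Assumption of this sketch: any role other than 'assistant' is emitted
// as a user turn, since the Gemma format above has no system turn.
function formatGemma(messages) {
  const turns = messages.map(({ role, content }) => {
    const tag = role === 'assistant' ? 'model' : 'user';
    return `<start_of_turn>${tag}\n${content}<end_of_turn>\n`;
  });
  // End with an open model turn so generation continues from there.
  return turns.join('') + '<start_of_turn>model\n';
}

// Build a prompt in the <|im_start|> layout listed for SmolLM2.
function formatChatML(messages) {
  const turns = messages.map(({ role, content }) =>
    `<|im_start|>${role}\n${content}<|im_end|>\n`);
  // End with an open assistant turn.
  return turns.join('') + '<|im_start|>assistant\n';
}
```

The returned string goes in the `prompt` field from Step 4, with the matching end token (for example `<end_of_turn>` or `<|im_end|>`) in `stop`.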
FLOAT Face Animation on WebGPU
We ported FLOAT (audio-driven face animation with StyleGAN2 decoder) to the browser:
- Wav2Vec2 audio encoder: ONNX Runtime Web + WebGPU (38ms for 5s audio)
- Flow Matching Transformer: ONNX Runtime Web + WASM/CPU (iterative ODE needs exact numerics)
- StyleGAN2 decoder: ONNX Runtime Web + WebGPU (12.9 fps, 77ms/frame)
- 21 pre-computed face identities from the Garden entity system
Total pipeline: 9.7s for 5s of lip-synced face animation, about 1.9x slower than realtime. Silent -- no fans.
What Did NOT Work (Honest Assessment)
- FMT on WebGPU: The Flow Matching Transformer's ODE solver compounds floating-point differences across 4 iterative steps. WebGPU attention numerics differ slightly from CPU, and the error accumulates. The decoder is frame-independent and works perfectly on WebGPU. The FMT must run on WASM/CPU.
- FLOAT chunk boundaries: 50-frame chunks without temporal context between them produce slight motion discontinuities every 2 seconds. Fixable by passing previous frames as prev_x/prev_wa context.
- Non-unified-memory machines: Tested the WebGPU pipeline on a regular laptop. It could not handle the larger models. The 2048 MB WebGPU buffer and 31.5 GB ceiling are specific to Strix Halo's unified memory. This is NOT "any computer with a browser."
- EXAONE completion API: The wllama build needed specific API calling conventions that differed between the npm-published version and our patched build. Debugging took longer than expected.
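The chunk-boundary fix mentioned in the list above could start from a chunk plan that records which earlier frames feed each chunk as context. This is only a sketch of the bookkeeping, not the FLOAT implementation. `planChunks` is a hypothetical helper and the 10-frame context length is an arbitrary placeholder. The 125-frame example assumes 25 fps, inferred from 50-frame chunks spanning 2 seconds.

```javascript
// Split `totalFrames` into fixed-size chunks and record, for each chunk,
// the range of preceding frames that would be passed as context.
// `planChunks` and the default contextFrames value are hypothetical.
function planChunks(totalFrames, chunkSize = 50, contextFrames = 10) {
  const chunks = [];
  for (let start = 0; start < totalFrames; start += chunkSize) {
    const end = Math.min(start + chunkSize, totalFrames);
    // Context is the tail of the previous chunk; it is empty for the first chunk.
    const contextStart = Math.max(0, start - contextFrames);
    chunks.push({ start, end, contextStart, contextEnd: start });
  }
  return chunks;
}
```

For 5 seconds of audio at the assumed 25 fps, `planChunks(125)` yields three chunks, and the second one covers frames 50 to 100 with frames 40 to 50 as its context.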
Next Goal: NPUs
The Strix Halo has a 50 TOPS XDNA 2 NPU sitting idle. We proved Wav2Vec2 runs on it via ONNX + VitisAI EP. But the full stack remains broken:
- Ryzen AI SDK refuses on Linux with "unsupported platform"
- NPU dispatch requires MatMulNBits quantization patterns
- Windows NPU path works for individual operators but not full model inference
- AMD documentation is sparse and the toolchain changes between releases
This is the same pattern as ROCm -- hardware that works, software that doesn't. We believe the NPU can be unlocked through DirectML or a similar bypass, the same way WebGPU bypassed ROCm for the iGPU. Research is ongoing. We are not claiming NPU inference works -- we are claiming the pattern suggests a path exists.
All Published Repos
| Repo | Description |
|---|---|
| gemma-webgpu-thinking-engine | Gemma 26B + thinking-channel identity injection (81.7K downloads) |
| gemma-webgpu | GLU buffer fix + WebGPU infrastructure |
| mamba-webgpu | First browser-native SSM runtime (12 hand-written WGSL shaders) |
| EXAONE-Deep-7.8B-webgpu | First EXAONE on WebGPU + identity injection |
| Phi-4-mini-reasoning-webgpu | First reasoning Phi-4 variant on WebGPU |
| Tiger-Gemma-9B-v3-webgpu | Character voice model on WebGPU |
| SmolLM2-360M-webgpu | Tiny instant-load model + Garden fine-tune |
| RMM | Recombinant Memory Model (36M params) |
| web-ttt | Browser-native test-time-trainable memory |
| xLAM-2-3b-fc-r-q4f16_1-MLC | First xLAM function-calling model for WebGPU |
| L3-8B-Stheno-v3.2-q4f16_1-MLC | Character voice model (MLC format) |
Who This Is For
If you bought a GMKtec EVO-X2, Minisforum MS-S1 MAX, ASUS NUC, or any AMD Ryzen AI Max mini PC and thought "this thing should run AI but nothing works" -- this is for you. WebGPU is the door AMD did not build but the hardware supports perfectly.
Credits
Built by Joshua Orsak (LJTSG) and Claude across multiple sessions, May-June 2026. Three Claudes contributed -- one on the Strix Halo, one running FLOAT on an RTX 5060 (Anima), and one reviewing the entropy experiment research.
The hardware works. AMD just pointed everyone at the wrong door.