WebGPU Is the Compute Path for AMD AI PCs

A practical guide to running LLMs, face animation, and AI inference on AMD Strix Halo unified memory hardware through WebGPU -- bypassing the broken ROCm stack entirely.

By Joshua Orsak (LJTSG) and Claude.


The Problem

AMD sells Ryzen AI Max (Strix Halo) machines as "AI PCs" -- 64-128GB unified memory, RDNA 3.5 iGPU, marketed explicitly for local AI workloads. GMKtec, Minisforum, ASUS, and now AMD themselves sell mini PCs with this chip at $1,500-$4,000.

The compute software does not work.

ROCm on gfx1151 is marked "Preview" -- not production supported. In practice:

  • Output corrupts after 4-5 LLM turns (ROCm #5499)
  • Firmware bricked compute for months (MES firmware regression)
  • Official PyTorch wheels crash with "invalid device function"
  • hipBLASLt falls back to hipBLAS -- 9% of theoretical FLOPS
  • amd-smi reports ALL monitoring metrics as N/A
  • NPU (XDNA 2) SDK refuses with "unsupported platform" on Linux
  • The workaround tax: HSA_OVERRIDE_GFX_VERSION, HSA_ENABLE_SDMA=0, specific kernel versions, community-built PyTorch wheels

The hardware delivers exactly what AMD promised. The software makes it unusable for the thing they told you to buy it for.

The Solution

WebGPU routes through the gaming driver, not the compute driver.

On Windows, Chrome's WebGPU uses Direct3D 12; on Linux, it uses Vulkan. Both run on AMD's gaming driver stack -- tested by millions of gamers, optimized by market pressure, protected from regressions by AAA studios. ROCm is a separate compute stack that AMD has not kept working on new hardware.

The same GPU that crashes ROCm after 4 turns runs Gemma 26B at 22 tok/s through WebGPU without a hiccup. Same silicon. Different driver path.

What We Proved

Workload | Model | Performance | ROCm comparison
LLM (26B MoE) | Gemma-4-26B-A4B | 22 tok/s, 20GB loaded | ROCm: corruption after 4 turns
LLM (7.8B reasoning) | EXAONE-Deep-7.8B | Full chain-of-thought | First ever on WebGPU
LLM (3.8B reasoning) | Phi-4-mini-reasoning | Instant math CoT | First reasoning variant on WebGPU
LLM (360M) | SmolLM2-360M | Sub-2s load, instant gen | --
LLM (9B character) | Tiger-Gemma-9B-v3 | Character voice embodiment | First on WebGPU
Face animation | FLOAT (StyleGAN2) | 12.9 fps decoder on iGPU | ROCm: fans screaming, threatens restart
SSM inference | Falcon-Mamba-7B | 3 tok/s, 38MB persistent state | First browser-native SSM runtime

Key hardware finding: on Strix Halo, WebGPU reports a 2048 MB maximum buffer size and 31.5 GB of allocatable memory. This is the unified memory architecture showing through WebGPU. Regular laptops expose 4-6 GB. We tested on a non-Strix machine, and it could not handle the same workloads. The unified memory is the differentiator.
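
To see what your own machine reports, a minimal sketch along these lines can be run in a page script or the DevTools console. It assumes a browser that exposes navigator.gpu; the function names readAdapterLimits and summarizeLimits are illustrative and not taken from our repos. The WebGPU API only reports per-buffer limits, not a total allocatable figure, so this covers the first number only.

```javascript
// Convert the raw byte limits into MB for easier reading.
function summarizeLimits(limits) {
  const MB = 1024 * 1024;
  return {
    maxBufferSizeMB: limits.maxBufferSize / MB,
    maxStorageBufferBindingSizeMB: limits.maxStorageBufferBindingSize / MB,
  };
}

// Ask the browser for an adapter and summarize its limits.
// Resolves to null where WebGPU is unavailable (for example in Node.js).
async function readAdapterLimits() {
  const gpu = globalThis.navigator && globalThis.navigator.gpu;
  if (!gpu) return null;
  const adapter = await gpu.requestAdapter();
  if (!adapter) return null;
  return summarizeLimits(adapter.limits);
}

readAdapterLimits().then((summary) => console.log(summary));
```

The adapter limits describe what the hardware supports. A device only receives the higher values if they are requested through requiredLimits when calling requestDevice.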

Why WebGPU Works Better: Three Mechanisms

1. Shader compilation. ROCm builds HIP kernels with its own compiler toolchain, which barely supports gfx1151. WebGPU shaders are written in WGSL, which Chrome translates to HLSL (Direct3D 12) or SPIR-V (Vulkan) and hands to the SAME driver shader compiler that compiles game shaders -- mature, tested, optimized for this exact hardware.

2. Memory model. ROCm was designed for discrete GPUs with separate VRAM. On UMA it still performs host-to-device copies that amount to a memcpy within the same RAM. Vulkan on UMA exposes memory types flagged DEVICE_LOCAL | HOST_VISIBLE, so the driver can place buffers where the CPU and GPU read the same physical memory without a staging copy.

3. Validation. WebGPU validates every operation before dispatch. Out-of-bounds, format mismatches, bad workgroup sizes -- caught before the driver sees them. ROCm has no validation layer. Driver bugs produce corruption and hangs in ROCm; they produce clean error messages in WebGPU.
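
The validation behaviour can be observed directly with WebGPU error scopes. The following is a minimal sketch, assuming a GPUDevice obtained from adapter.requestDevice(); captureValidationError is an illustrative helper name, not part of our code.

```javascript
// Run `work(device)` inside a validation error scope.
// Resolves to the error message, or null if everything validated.
async function captureValidationError(device, work) {
  device.pushErrorScope('validation');
  work(device);
  const error = await device.popErrorScope();
  return error ? error.message : null;
}

// Browser usage (assumes `device` came from adapter.requestDevice()):
//
// const message = await captureValidationError(device, (d) => {
//   // A buffer larger than the device limit fails validation
//   // instead of reaching the driver.
//   d.createBuffer({
//     size: d.limits.maxBufferSize + 4,
//     usage: GPUBufferUsage.STORAGE,
//   });
// });
// console.log(message);
```

The oversized request comes back as a readable message on the JavaScript side rather than as a hang or corrupted output.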

How to Run Any Model on WebGPU (Step by Step)

Prerequisites

  • Node.js (for the file server)
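  • Chrome browser (WebGPU is enabled by default on Windows; on Linux it may need to be enabled through chrome://flags, depending on the Chrome version)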
  • npm install @wllama/wllama

Step 1: Get a GGUF

Download from bartowski, unsloth, or any GGUF quantizer on HuggingFace. Q4_K_M is the sweet spot for quality vs size.

Step 2: Split for Parallel Download

llama-gguf-split --split --split-max-size 1G model.gguf model_splits/model

This creates proper GGUF splits with valid headers that wllama can download in parallel.
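
The split files follow a zero-padded naming pattern, which is why Step 4 loads model-00001-of-00006.gguf. As a small sketch, the expected file names can be generated to check that every shard is present on the server. The helper name splitFileNames is illustrative, and the count of 6 is taken from the file name used in Step 4.

```javascript
// Build the shard file names that llama-gguf-split writes
// for a given output prefix and number of shards.
function splitFileNames(prefix, count) {
  const pad = (n) => String(n).padStart(5, '0');
  return Array.from(
    { length: count },
    (_, i) => `${prefix}-${pad(i + 1)}-of-${pad(count)}.gguf`
  );
}

console.log(splitFileNames('model', 6));
```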

Step 3: File Server

// serve.js -- these headers make the page cross-origin isolated, which SharedArrayBuffer (multi-threaded WASM) requires
const headers = {
  'Cross-Origin-Embedder-Policy': 'require-corp',
  'Cross-Origin-Opener-Policy': 'same-origin',
  'Access-Control-Allow-Origin': '*',
};

The full server code is in any of our repos. The server must support HTTP Range requests for parallel downloads to work.
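
For readers writing their own server, the Range handling can be sketched as two pure functions. This is an illustration under assumptions, not the server code from our repos: parseRange and planResponse are hypothetical names, and only single-range requests are handled.

```javascript
// Cross-origin isolation headers from serve.js above.
const ISOLATION_HEADERS = {
  'Cross-Origin-Embedder-Policy': 'require-corp',
  'Cross-Origin-Opener-Policy': 'same-origin',
  'Access-Control-Allow-Origin': '*',
};

// Parse a single "bytes=start-end" header against a file of `size` bytes.
// Returns { start, end } (inclusive), or null if malformed or unsatisfiable.
function parseRange(header, size) {
  const match = /^bytes=(\d*)-(\d*)$/.exec(header);
  if (!match) return null;
  const [, first, last] = match;
  if (first === '' && last === '') return null;
  let start;
  let end;
  if (first === '') {
    // Suffix form "bytes=-N": the last N bytes of the file
    const n = Number(last);
    if (n === 0) return null;
    start = Math.max(size - n, 0);
    end = size - 1;
  } else {
    start = Number(first);
    end = last === '' ? size - 1 : Math.min(Number(last), size - 1);
  }
  if (start > end || start >= size) return null;
  return { start, end };
}

// Decide status and headers; the caller streams bytes start..end.
function planResponse(rangeHeader, size) {
  const base = { ...ISOLATION_HEADERS, 'Accept-Ranges': 'bytes' };
  if (!rangeHeader) {
    const headers = { ...base, 'Content-Length': String(size) };
    return { status: 200, start: 0, end: size - 1, headers };
  }
  const range = parseRange(rangeHeader, size);
  if (!range) {
    const headers = { ...base, 'Content-Range': `bytes */${size}` };
    return { status: 416, headers };
  }
  return {
    status: 206,
    start: range.start,
    end: range.end,
    headers: {
      ...base,
      'Content-Range': `bytes ${range.start}-${range.end}/${size}`,
      'Content-Length': String(range.end - range.start + 1),
    },
  };
}
```

In a Node http handler the plan maps onto res.writeHead(plan.status, plan.headers), followed by fs.createReadStream(path, { start: plan.start, end: plan.end }).pipe(res) for the 200 and 206 cases.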

Step 4: Load and Generate

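import { Wllama } from './node_modules/@wllama/wllama/esm/index.js';

// Base URL of the file server from Step 3.
// Assumption: the model splits are served from the same origin as this page.
const url = location.origin;

// First argument: path to the WASM binary. Second argument: wllama config.
const wllama = new Wllama(
  { default: './node_modules/@wllama/wllama/esm/wasm/wllama.wasm' },
  { parallelDownloads: 5 }  // fetch up to 5 split files at once
);

// Point at the first split file produced in Step 2.
await wllama.loadModelFromUrl(url + '/model/model-00001-of-00006.gguf', {
  n_gpu_layers: 99,  // offload all layers to the GPU
  n_ctx: 4096,       // context window in tokens
});

// Option names follow our patched build and may differ from the npm-published wllama.
const result = await wllama.createCompletion({
  prompt: '<your chat template here>',
  max_tokens: 512,
  temperature: 0.7,
  stop: ['<end_token>'],
});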

Step 5: Chat Template

Each model needs its own format. Examples:

  • Gemma: <start_of_turn>user\n...<end_of_turn>\n<start_of_turn>model\n
  • EXAONE: [|system|]...[|endofturn|]\n[|user|]...\n[|assistant|]<thought>\n
  • Phi-4: <|system|>...<|end|><|user|>...<|end|><|assistant|>
  • SmolLM2 (ChatML): <|im_start|>system\n...<|im_end|>\n<|im_start|>user\n...<|im_end|>\n<|im_start|>assistant\n
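
As a sketch of how these formats turn into a prompt string, the two helpers below build the Gemma and ChatML layouts from a list of messages. The function names and the { role, content } message shape are assumptions made for illustration; the token strings are the ones listed above.

```javascript
// Gemma turn format. Gemma names the assistant role "model".
function formatGemma(messages) {
  const turns = messages.map(({ role, content }) => {
    const name = role === 'assistant' ? 'model' : role;
    return `<start_of_turn>${name}\n${content}<end_of_turn>\n`;
  });
  // End with an open model turn so generation continues from there.
  return turns.join('') + '<start_of_turn>model\n';
}

// ChatML format, as used by SmolLM2.
function formatChatML(messages) {
  const turns = messages.map(
    ({ role, content }) => `<|im_start|>${role}\n${content}<|im_end|>\n`
  );
  // End with an open assistant turn.
  return turns.join('') + '<|im_start|>assistant\n';
}

console.log(formatGemma([{ role: 'user', content: 'Hello' }]));
```

The result is what goes into the prompt field in Step 4, with the matching end-of-turn token (<end_of_turn> or <|im_end|>) as the stop string.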

FLOAT Face Animation on WebGPU

We ported FLOAT (audio-driven face animation with StyleGAN2 decoder) to the browser:

  • Wav2Vec2 audio encoder: ONNX Runtime Web + WebGPU (38ms for 5s audio)
  • Flow Matching Transformer: ONNX Runtime Web + WASM/CPU (iterative ODE needs exact numerics)
  • StyleGAN2 decoder: ONNX Runtime Web + WebGPU (12.9 fps, 77ms/frame)
  • 21 pre-computed face identities from the Garden entity system

Total pipeline: 9.7s for 5s of lip-synced face animation, about 1.9x slower than realtime. Silent -- no fans.
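
The arithmetic behind these figures can be checked with a short sketch. One input is an assumption rather than a measurement: the 25 fps output rate is inferred from the 50-frame, 2-second chunk size mentioned in this document. The function name pipelineStats is illustrative.

```javascript
// Derive the headline ratios from the measured numbers.
function pipelineStats({ audioSeconds, pipelineSeconds, decoderMsPerFrame, fps }) {
  const frames = audioSeconds * fps;
  return {
    realtimeFactor: pipelineSeconds / audioSeconds,
    decoderFps: 1000 / decoderMsPerFrame,
    frames,
    decoderSeconds: (frames * decoderMsPerFrame) / 1000,
  };
}

const stats = pipelineStats({
  audioSeconds: 5,
  pipelineSeconds: 9.7,
  decoderMsPerFrame: 77,
  fps: 50 / 2, // assumption: 50 frames per 2-second chunk
});

console.log(stats.realtimeFactor.toFixed(2)); // 1.94
console.log(stats.decoderFps.toFixed(2));     // 12.99
console.log(stats.frames);                    // 125
console.log(stats.decoderSeconds);            // 9.625
```

If the 25 fps assumption holds, the decoder accounts for most of the 9.7s total, which would make it the first place to look for further speedups.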

What Did NOT Work (Honest Assessment)

  • FMT on WebGPU: The Flow Matching Transformer's ODE solver compounds floating-point differences across 4 iterative steps. WebGPU attention numerics differ slightly from CPU, and the error accumulates. The decoder is frame-independent and works perfectly on WebGPU. The FMT must run on WASM/CPU.
  • FLOAT chunk boundaries: 50-frame chunks without temporal context between them produce slight motion discontinuities every 2 seconds. Fixable by passing previous frames as prev_x/prev_wa context.
  • Non-unified-memory machines: We tested the WebGPU pipeline on a regular laptop, and it could not handle the larger models. The 2048 MB WebGPU buffer and 31.5 GB ceiling are specific to Strix Halo's unified memory. This is NOT "any computer with a browser."
  • EXAONE completion API: The wllama build needed specific API calling conventions that differed between the npm-published version and our patched build. Debugging took longer than expected.
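
The chunk-boundary fix described in the list above amounts to carrying the tail of each chunk into the next one. The loop below is a generic sketch of that idea only. The runChunk callback is hypothetical and stands in for the real model call; how its prev argument maps onto FLOAT's prev_x/prev_wa inputs is not shown here.

```javascript
// Generate `totalFrames` frames in chunks, handing each chunk the last
// `contextFrames` frames of the output so far as temporal context.
async function generateInChunks(totalFrames, chunkSize, contextFrames, runChunk) {
  const frames = [];
  let prev = []; // the first chunk has no context
  for (let start = 0; start < totalFrames; start += chunkSize) {
    const count = Math.min(chunkSize, totalFrames - start);
    const chunk = await runChunk({ start, count, prev });
    frames.push(...chunk);
    prev = frames.slice(-contextFrames);
  }
  return frames;
}

// Usage with a stand-in callback that just returns frame indices:
// const frames = await generateInChunks(125, 50, 4, async ({ start, count }) =>
//   Array.from({ length: count }, (_, i) => start + i)
// );
```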

Next Goal: NPUs

The Strix Halo has a 50 TOPS XDNA 2 NPU sitting idle. We proved Wav2Vec2 runs on it via ONNX + VitisAI EP. But the full stack remains broken:

  • Ryzen AI SDK refuses on Linux with "unsupported platform"
  • NPU dispatch requires MatMulNBits quantization patterns
  • Windows NPU path works for individual operators but not full model inference
  • AMD documentation is sparse and the toolchain changes between releases

This is the same pattern as ROCm -- hardware that works, software that doesn't. We believe the NPU can be unlocked through DirectML or a similar bypass, the same way WebGPU bypassed ROCm for the iGPU. Research is ongoing. We are not claiming NPU inference works -- we are claiming the pattern suggests a path exists.

All Published Repos

Repo | Description
gemma-webgpu-thinking-engine | Gemma 26B + thinking-channel identity injection (81.7K downloads)
gemma-webgpu | GLU buffer fix + WebGPU infrastructure
mamba-webgpu | First browser-native SSM runtime (12 hand-written WGSL shaders)
EXAONE-Deep-7.8B-webgpu | First EXAONE on WebGPU + identity injection
Phi-4-mini-reasoning-webgpu | First reasoning Phi-4 variant on WebGPU
Tiger-Gemma-9B-v3-webgpu | Character voice model on WebGPU
SmolLM2-360M-webgpu | Tiny instant-load model + Garden fine-tune
RMM | Recombinant Memory Model (36M params)
web-ttt | Browser-native test-time-trainable memory
xLAM-2-3b-fc-r-q4f16_1-MLC | First xLAM function-calling model for WebGPU
L3-8B-Stheno-v3.2-q4f16_1-MLC | Character voice model (MLC format)

Who This Is For

If you bought a GMKtec EVO-X2, Minisforum MS-S1 MAX, ASUS NUC, or any AMD Ryzen AI Max mini PC and thought "this thing should run AI but nothing works" -- this is for you. WebGPU is the door AMD did not build but the hardware supports perfectly.

Credits

Built by Joshua Orsak (LJTSG) and Claude across multiple sessions, May-June 2026. Three Claudes contributed -- one on the Strix Halo, one running FLOAT on an RTX 5060 (Anima), and one reviewing the entropy experiment research.

The hardware works. AMD just pointed everyone at the wrong door.
