Moebius — ONNX (browser / WebGPU)

ONNX exports of the Moebius image-inpainting model (hustvl/Moebius, ECCV'26; 0.22B parameters), for running in a web browser with ONNX Runtime Web on the WebGPU backend.

Moebius conditions on a learned embedding table rather than a text encoder, so there is no tokenizer or text model to export. The export is three graphs — VAE encoder, UNet, VAE decoder — and the sampling loop (DDIM with classifier-free guidance) runs in JavaScript.

Live demo (runs in your browser): simonw.github.io/moebius-web/. The source code is on GitHub.

Files

File	Graph	Input → Output	Size (fp32)
`unet.onnx`	student denoiser (`RemovalModel`: embedding + lambda-DWConv UNet)	`latent (B,9,64,64)`, `timesteps (B,)`, `input_ids (B,10)` → `noise (B,4,64,64)`	~907 MB
`vae_encoder.onnx`	SD VAE encoder	`image (B,3,512,512)` → `moments (B,8,64,64)`	~137 MB
`vae_decoder.onnx`	SD VAE decoder	`latent (B,4,64,64)` → `image (B,3,512,512)`	~198 MB
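
Because every graph is exported at a fixed spatial size, it can help to check buffer lengths before building tensors. The sketch below restates the shapes from the table with the batch size as a parameter. The names `graphShapes`, `elementCount`, and `assertSize` are hypothetical helpers, not part of the export or of ONNX Runtime Web.

```typescript
// Shapes copied from the file table; batch size is left as a parameter.
type Shape = readonly number[];

const IMAGE_SIZE = 512; // static export resolution
const LATENT_SIZE = 64; // 512 / 8

function graphShapes(batch: number) {
  return {
    vaeEncoder: {
      image: [batch, 3, IMAGE_SIZE, IMAGE_SIZE] as Shape,
      moments: [batch, 8, LATENT_SIZE, LATENT_SIZE] as Shape,
    },
    unet: {
      latent: [batch, 9, LATENT_SIZE, LATENT_SIZE] as Shape,
      timesteps: [batch] as Shape,
      inputIds: [batch, 10] as Shape,
      noise: [batch, 4, LATENT_SIZE, LATENT_SIZE] as Shape,
    },
    vaeDecoder: {
      latent: [batch, 4, LATENT_SIZE, LATENT_SIZE] as Shape,
      image: [batch, 3, IMAGE_SIZE, IMAGE_SIZE] as Shape,
    },
  };
}

// Number of elements in a tensor of the given shape.
function elementCount(shape: Shape): number {
  return shape.reduce((n, d) => n * d, 1);
}

// Throws if a flat buffer does not match the expected shape.
function assertSize(data: ArrayLike<number>, shape: Shape, name: string): void {
  const expected = elementCount(shape);
  if (data.length !== expected) {
    throw new Error(`${name}: expected ${expected} elements, got ${data.length}`);
  }
}
```

For example, `assertSize(pixels, graphShapes(1).vaeEncoder.image, "image")` would reject any preprocessed image buffer that is not exactly 3×512×512 values.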

Exported at a static 512×512 resolution (64×64 latent). The model's cross-attention uses a relative-position embedding tied to the trained resolution, so spatial size is fixed.
The learned-embedding "prompt" conditioning stays inside unet.onnx as an nn.Embedding(20, 3072) gather. For classifier-free guidance, input_ids values 0..9 select the conditional embeddings and values 10..19 select the unconditional ones.
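
A minimal sketch of how the classifier-free guidance inputs and outputs might be handled, assuming a batch of two with the conditional row first. The row order, the helper names `buildCfgInputIds` and `applyGuidance`, and the guidance scale are assumptions for illustration; the graph's actual `input_ids` dtype should be checked (int64 is common for embedding indices, in which case the values would be copied into a `BigInt64Array`).

```typescript
const TOKENS_PER_PROMPT = 10;

// Flattened ids for shape (2, 10): row 0 = 0..9 (conditional),
// row 1 = 10..19 (unconditional). Row order is an assumption.
function buildCfgInputIds(): number[] {
  const ids: number[] = [];
  for (let i = 0; i < 2 * TOKENS_PER_PROMPT; i++) {
    ids.push(i);
  }
  return ids;
}

// Standard classifier-free guidance combination:
// guided = uncond + scale * (cond - uncond).
function applyGuidance(
  cond: Float32Array,
  uncond: Float32Array,
  scale: number,
): Float32Array {
  if (cond.length !== uncond.length) {
    throw new Error("cond and uncond predictions must have the same length");
  }
  const out = new Float32Array(cond.length);
  for (let i = 0; i < cond.length; i++) {
    out[i] = uncond[i] + scale * (cond[i] - uncond[i]);
  }
  return out;
}
```

With a scale of 1 the result equals the conditional prediction; larger scales push the prediction further from the unconditional one.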

Pipeline notes (must match for correct output)

VAE scaling_factor = 0.13025 (this is a custom VAE, not the usual SD 0.18215). Encode: latent = moments[:, :4] * 0.13025, where the first four channels of moments are the distribution mean. Decode: feed latent / 0.13025.
9-channel UNet input = concat([noisy_latent(4), mask(1), masked_image_latent(4)], dim=1).
Scheduler: DDIM, beta_start=0.00085, beta_end=0.012, scaled_linear, 1000 train steps, clip_sample=false. 20 steps with strength≈0.99 ⇒ 19 actual steps, since int(20 × 0.99) = 19.
VAE encoder source: hustvl/PixelHacker vae/.
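
The VAE scaling and the 9-channel assembly above can be sketched on flat NCHW `Float32Array` buffers as follows. This is an illustrative sketch, not the demo's code: the function names are hypothetical, and it assumes the first four `moments` channels hold the mean and that `hw` is the number of spatial positions per channel (64 × 64 = 4096 for this export).

```typescript
const SCALING_FACTOR = 0.13025;

// moments: (B, 8, H, W) flat. Returns the scaled mean as (B, 4, H, W).
function momentsToLatent(moments: Float32Array, batch: number, hw: number): Float32Array {
  const out = new Float32Array(batch * 4 * hw);
  for (let b = 0; b < batch; b++) {
    const src = b * 8 * hw; // start of this sample's 8 channels
    const dst = b * 4 * hw; // start of this sample's 4 output channels
    for (let i = 0; i < 4 * hw; i++) {
      out[dst + i] = moments[src + i] * SCALING_FACTOR;
    }
  }
  return out;
}

// Undo the scaling before feeding the decoder graph.
function latentToDecoderInput(latent: Float32Array): Float32Array {
  return latent.map((v) => v / SCALING_FACTOR);
}

// Channel-wise concat: [noisy_latent(4), mask(1), masked_image_latent(4)].
function assembleUnetInput(
  noisy: Float32Array,
  mask: Float32Array,
  maskedLatent: Float32Array,
  batch: number,
  hw: number,
): Float32Array {
  const out = new Float32Array(batch * 9 * hw);
  for (let b = 0; b < batch; b++) {
    const base = b * 9 * hw;
    out.set(noisy.subarray(b * 4 * hw, (b + 1) * 4 * hw), base);
    out.set(mask.subarray(b * hw, (b + 1) * hw), base + 4 * hw);
    out.set(maskedLatent.subarray(b * 4 * hw, (b + 1) * 4 * hw), base + 5 * hw);
  }
  return out;
}
```

The mask passed here is expected at latent resolution (one channel per sample); how the full-resolution mask is downsampled is left to the preprocessing step.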

A reference TypeScript implementation (DDIM loop, CFG, 9-channel assembly, pre/post-processing) that loads these files lives in the accompanying web demo.
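
For orientation, the scheduler settings in the pipeline notes can be sketched as below. This is not the demo's implementation. It assumes epsilon prediction, deterministic DDIM (eta = 0), "leading" timestep spacing with a configurable offset defaulting to 0, and a final alpha-cumprod of 1.0; any of these may differ in the reference pipeline, so check them against it before relying on the output.

```typescript
const TRAIN_STEPS = 1000;

// scaled_linear schedule: betas are linear in sqrt space, then squared.
function alphasCumprod(betaStart = 0.00085, betaEnd = 0.012): Float64Array {
  const out = new Float64Array(TRAIN_STEPS);
  const lo = Math.sqrt(betaStart);
  const hi = Math.sqrt(betaEnd);
  let product = 1;
  for (let i = 0; i < TRAIN_STEPS; i++) {
    const beta = (lo + ((hi - lo) * i) / (TRAIN_STEPS - 1)) ** 2;
    product *= 1 - beta;
    out[i] = product;
  }
  return out;
}

// Descending timesteps, truncated by strength: floor(numSteps * strength)
// steps are kept. The spacing and offset are assumptions.
function timestepsFor(numSteps: number, strength: number, offset = 0): number[] {
  const ratio = Math.floor(TRAIN_STEPS / numSteps);
  const all: number[] = [];
  for (let i = numSteps - 1; i >= 0; i--) {
    all.push(i * ratio + offset);
  }
  const kept = Math.min(Math.floor(numSteps * strength), numSteps);
  return all.slice(numSteps - kept);
}

// One deterministic DDIM update (eta = 0, no sample clipping).
// prevT < 0 marks the final step, where alpha_cumprod is taken as 1.0.
function ddimStep(
  sample: Float32Array,
  noisePred: Float32Array,
  t: number,
  prevT: number,
  ac: Float64Array,
): Float32Array {
  const aT = ac[t];
  const aPrev = prevT >= 0 ? ac[prevT] : 1;
  const out = new Float32Array(sample.length);
  for (let i = 0; i < sample.length; i++) {
    // Predicted clean latent from the noise estimate.
    const x0 = (sample[i] - Math.sqrt(1 - aT) * noisePred[i]) / Math.sqrt(aT);
    // Re-noise to the previous timestep along the predicted direction.
    out[i] = Math.sqrt(aPrev) * x0 + Math.sqrt(1 - aPrev) * noisePred[i];
  }
  return out;
}
```

Under these assumptions, `timestepsFor(20, 0.99)` drops the first of the 20 scheduled timesteps and keeps the remaining 19, which matches the step count stated in the pipeline notes.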

Precision

These are fp32 exports, for numeric parity with the reference pipeline. Parity vs PyTorch on the CPU execution provider: decoder max|Δ|≈5.7e-5, unet ≈3.6e-6. A full-pipeline check against the PyTorch reference (identical initial noise) gives a decoded-image mean|Δ|≈0.0022. fp16 halves the download size, but can reduce quality in the lambda layers and is numerically unstable for this VAE; validate before use.
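
When validating an fp16 or otherwise modified export, the same two metrics quoted above can be computed on flat output buffers. The helpers below are a hypothetical sketch of max|Δ| and mean|Δ|; they are not taken from the export scripts.

```typescript
// Largest absolute element-wise difference between two buffers.
function maxAbsDiff(a: Float32Array, b: Float32Array): number {
  if (a.length !== b.length) throw new Error("length mismatch");
  let max = 0;
  for (let i = 0; i < a.length; i++) {
    max = Math.max(max, Math.abs(a[i] - b[i]));
  }
  return max;
}

// Mean absolute element-wise difference between two buffers.
function meanAbsDiff(a: Float32Array, b: Float32Array): number {
  if (a.length !== b.length) throw new Error("length mismatch");
  if (a.length === 0) return 0;
  let sum = 0;
  for (let i = 0; i < a.length; i++) {
    sum += Math.abs(a[i] - b[i]);
  }
  return sum / a.length;
}
```

Comparing against the fp32 outputs with identical inputs and initial noise keeps the numbers comparable to the figures in this section.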

License & attribution

Licensed under Apache 2.0, inherited from the upstream hustvl/Moebius. These artifacts are a format conversion (PyTorch → ONNX) of the original weights; all model credit belongs to the original authors.

@misc{DuanAndXu2026Moebius,
  title  = {Moebius: 0.2B Lightweight Image Inpainting Framework with 10B-Level Performance},
  author = {Kangsheng Duan and Ziyang Xu and Wenyu Liu and Xiaohu Ruan and Xiaoxin Chen and Xinggang Wang},
  year   = {2026},
  eprint = {2606.19195},
  archivePrefix = {arXiv},
  primaryClass  = {cs.CV},
  url    = {https://arxiv.org/abs/2606.19195}
}
