Moebius: ONNX (browser / WebGPU)

ONNX exports of the Moebius image-inpainting model (hustvl/Moebius, ECCV'26; 0.22B parameters), for running in a web browser with ONNX Runtime Web on the WebGPU backend.

Moebius conditions on a learned embedding table rather than a text encoder, so there is no tokenizer or text model to export. The export is three graphs (VAE encoder, UNet, VAE decoder), and the sampling loop (DDIM with classifier-free guidance) runs in JavaScript.
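The guidance step inside that loop can be sketched as below. This is a minimal sketch of standard classifier-free guidance, not code taken from the demo: `applyGuidance` and `guidanceScale` are hypothetical names, and it assumes the UNet has already been run once for the conditional ids and once for the unconditional ids.

```typescript
// Standard classifier-free guidance:
//   eps = eps_uncond + s * (eps_cond - eps_uncond)
// applyGuidance and guidanceScale are illustrative names, not from the demo.
function applyGuidance(
  epsUncond: Float32Array,
  epsCond: Float32Array,
  guidanceScale: number,
): Float32Array {
  if (epsUncond.length !== epsCond.length) {
    throw new Error("noise predictions must have the same length");
  }
  const out = new Float32Array(epsCond.length);
  for (let i = 0; i < out.length; i++) {
    out[i] = epsUncond[i] + guidanceScale * (epsCond[i] - epsUncond[i]);
  }
  return out;
}
```

With `guidanceScale = 1` the result equals the conditional prediction; larger values push further away from the unconditional one.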

Live demo (runs in your browser): simonw.github.io/moebius-web/. The demo's source code is on GitHub.

Files

| File | Graph | Input → Output | Size (fp32) |
|---|---|---|---|
| unet.onnx | student denoiser (RemovalModel: embedding + lambda-DWConv UNet) | latent (B,9,64,64), timesteps (B,), input_ids (B,10) → noise (B,4,64,64) | ~907 MB |
| vae_encoder.onnx | SD VAE encoder | image (B,3,512,512) → moments (B,8,64,64) | ~137 MB |
| vae_decoder.onnx | SD VAE decoder | latent (B,4,64,64) → image (B,3,512,512) | ~198 MB |

  • Exported at a static 512×512 resolution (64×64 latent). The model's cross-attention uses a relative-position embedding tied to the trained resolution, so spatial size is fixed.
  • The learned-embedding "prompt" conditioning stays inside unet.onnx as an nn.Embedding(20, 3072) gather. For classifier-free guidance: input_ids rows [0..9] = conditional, [10..19] = unconditional.
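One way to read the id layout above is sketched here. It assumes each of the 10 prompt positions indexes one embedding row, with rows 0..9 used for the conditional pass and rows 10..19 for the unconditional pass; `buildInputIds` is a hypothetical helper, and the tensor dtype expected by unet.onnx (likely int64) is not stated here, so the ids are returned as plain numbers.

```typescript
// ASSUMPTION: position i reads embedding row i (conditional) or
// row i + 10 (unconditional). buildInputIds is an illustrative name.
const TOKENS_PER_PROMPT = 10;

function buildInputIds(conditional: boolean): number[] {
  const offset = conditional ? 0 : TOKENS_PER_PROMPT;
  const ids: number[] = [];
  for (let i = 0; i < TOKENS_PER_PROMPT; i++) {
    ids.push(i + offset);
  }
  return ids;
}

// One row of shape (10,) per guidance branch.
const condIds = buildInputIds(true); // 0..9
const uncondIds = buildInputIds(false); // 10..19
```

If the graph does expect int64 input, the numbers would have to be converted to a 64-bit integer buffer before building the tensor.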

Pipeline notes (must match for correct output)

  • VAE scaling_factor = 0.13025 (this is a custom VAE, not the usual SD 0.18215). Encode: take the mean half of the moments, latent = moments[:, :4] * 0.13025. Decode: feed latent / 0.13025.
  • 9-channel UNet input = concat([noisy_latent(4), mask(1), masked_image_latent(4)], dim=1).
  • Scheduler: DDIM, beta_start=0.00085, beta_end=0.012, scaled_linear, 1000 train steps, clip_sample=false. 20 steps with strength≈0.99 ⇒ 19 actual steps, since int(20 × 0.99) = 19.
  • VAE encoder source: hustvl/PixelHacker vae/.
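The scheduler settings above can be sketched as follows. This is a hedged sketch rather than the demo's code: the function names are made up, the "leading" timestep spacing and zero step offset are assumptions not stated in these notes, and the update rule is the deterministic (eta = 0) DDIM step for a noise-predicting model without sample clipping.

```typescript
const NUM_TRAIN_STEPS = 1000;

// scaled_linear: betas are linear in sqrt-space, then squared.
function alphasCumprod(betaStart = 0.00085, betaEnd = 0.012): Float64Array {
  const out = new Float64Array(NUM_TRAIN_STEPS);
  const lo = Math.sqrt(betaStart);
  const hi = Math.sqrt(betaEnd);
  let running = 1;
  for (let i = 0; i < NUM_TRAIN_STEPS; i++) {
    const root = lo + ((hi - lo) * i) / (NUM_TRAIN_STEPS - 1);
    running *= 1 - root * root;
    out[i] = running;
  }
  return out;
}

// ASSUMPTION: "leading" spacing, offset 0. Strength drops the earliest
// (noisiest) steps: floor(20 * 0.99) = 19 steps remain.
function ddimTimesteps(numSteps: number, strength: number, stepsOffset = 0): number[] {
  const stepRatio = Math.floor(NUM_TRAIN_STEPS / numSteps);
  const all: number[] = [];
  for (let i = numSteps - 1; i >= 0; i--) {
    all.push(i * stepRatio + stepsOffset);
  }
  const kept = Math.min(Math.floor(numSteps * strength), numSteps);
  return all.slice(Math.max(numSteps - kept, 0));
}

// Deterministic DDIM update (eta = 0, no clipping of the predicted x0).
function ddimStep(
  sample: Float32Array,
  eps: Float32Array,
  alphaT: number,
  alphaPrev: number,
): Float32Array {
  const out = new Float32Array(sample.length);
  const sqrtAT = Math.sqrt(alphaT);
  const sqrtOneMinusAT = Math.sqrt(1 - alphaT);
  const sqrtAPrev = Math.sqrt(alphaPrev);
  const sqrtOneMinusAPrev = Math.sqrt(1 - alphaPrev);
  for (let i = 0; i < sample.length; i++) {
    const x0 = (sample[i] - sqrtOneMinusAT * eps[i]) / sqrtAT;
    out[i] = sqrtAPrev * x0 + sqrtOneMinusAPrev * eps[i];
  }
  return out;
}
```

In a loop, `alphaT` would be looked up from `alphasCumprod()` at the current timestep and `alphaPrev` at the next one in the list; how the final step's `alphaPrev` is chosen depends on scheduler configuration not given here.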

A reference TypeScript implementation (DDIM loop, CFG, 9-channel assembly, pre/post-processing) that loads these files lives in the accompanying web demo.
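For orientation, the latent scaling and 9-channel assembly from the notes above might look like this on flat NCHW buffers. It is a sketch under assumptions, not the reference implementation: all function names are hypothetical, and it presumes the mask has already been resized to the 64×64 latent grid.

```typescript
const SCALING_FACTOR = 0.13025;
const LATENT_CHANNELS = 4;

// moments is a flat (B, 8, H, W) buffer; keep channels 0..3 and scale them.
function momentsToLatent(moments: Float32Array, batch: number, hw: number): Float32Array {
  const perItem = LATENT_CHANNELS * hw;
  const out = new Float32Array(batch * perItem);
  for (let b = 0; b < batch; b++) {
    const src = b * 2 * perItem;
    const dst = b * perItem;
    for (let i = 0; i < perItem; i++) {
      out[dst + i] = moments[src + i] * SCALING_FACTOR;
    }
  }
  return out;
}

// Undo the scaling before handing a latent to the decoder graph.
function latentToDecoderInput(latent: Float32Array): Float32Array {
  return latent.map((v) => v / SCALING_FACTOR);
}

// Concatenate flat NCHW tensors along the channel axis (dim=1).
function concatChannels(
  parts: { data: Float32Array; channels: number }[],
  batch: number,
  hw: number,
): Float32Array {
  const total = parts.reduce((n, p) => n + p.channels, 0);
  const out = new Float32Array(batch * total * hw);
  for (let b = 0; b < batch; b++) {
    let offset = b * total * hw;
    for (const p of parts) {
      const size = p.channels * hw;
      out.set(p.data.subarray(b * size, (b + 1) * size), offset);
      offset += size;
    }
  }
  return out;
}
```

The UNet input would then be `concatChannels` applied to the noisy latent (4 channels), the mask (1), and the masked-image latent (4), in that order, with `hw = 64 * 64`.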

Precision

These are fp32 exports, for numeric parity with the reference pipeline. Parity vs PyTorch on the CPU execution provider: decoder max|Δ|≈5.7e-5, unet ≈3.6e-6. A full-pipeline check against the PyTorch reference (identical initial noise) gives a decoded-image mean|Δ|≈0.0022. fp16 halves the download size, but can reduce quality in the lambda layers and is numerically unstable for this VAE; validate before use.
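When validating a converted or reduced-precision graph, the two metrics quoted above can be computed as in this small sketch. The helper names are hypothetical, and the outputs being compared are assumed to be flat buffers of equal length.

```typescript
// Element-wise |a - b| summaries over two equally sized outputs.
function maxAbsDiff(a: Float32Array, b: Float32Array): number {
  if (a.length !== b.length) throw new Error("length mismatch");
  let worst = 0;
  for (let i = 0; i < a.length; i++) {
    worst = Math.max(worst, Math.abs(a[i] - b[i]));
  }
  return worst;
}

function meanAbsDiff(a: Float32Array, b: Float32Array): number {
  if (a.length !== b.length) throw new Error("length mismatch");
  if (a.length === 0) return 0;
  let sum = 0;
  for (let i = 0; i < a.length; i++) {
    sum += Math.abs(a[i] - b[i]);
  }
  return sum / a.length;
}
```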

License & attribution

Licensed under Apache 2.0, inherited from the upstream hustvl/Moebius. These artifacts are a format conversion (PyTorch → ONNX) of the original weights; all model credit belongs to the original authors.

@misc{DuanAndXu2026Moebius,
  title  = {Moebius: 0.2B Lightweight Image Inpainting Framework with 10B-Level Performance},
  author = {Kangsheng Duan and Ziyang Xu and Wenyu Liu and Xiaohu Ruan and Xiaoxin Chen and Xinggang Wang},
  year   = {2026},
  eprint = {2606.19195},
  archivePrefix = {arXiv},
  primaryClass  = {cs.CV},
  url    = {https://arxiv.org/abs/2606.19195}
}