metadata
license: apache-2.0
library_name: mlx
pipeline_tag: text-to-video
tags:
  - mlx
  - text-to-video
  - video-editing
  - video-to-video
  - reference-to-video
  - wan2.1
  - bernini
base_model: ByteDance/Bernini-R-1.3B-Diffusers

Bernini-R-1.3B (MLX)

Apple MLX port of ByteDance/Bernini-R-1.3B, the 1.3B tier of ByteDance's Bernini Renderer: a Wan2.1-T2V-1.3B-derived video generator and editor that uses Segment-Aware 3D RoPE (SA-3D RoPE) for multi-reference and editing tasks. The small tier "performs close to the 14B variant on simple tasks such as style transfer, subtitle or watermark removal, and local editing."

It runs on Apple Silicon using MLX and the mlx-video Wan backbone.

This is the lowest-cost Bernini tier. For the higher-quality A14B renderer see mlx-community/Bernini-R-bf16 / -int4.

⚠️ Scope: renderer only

Only the Renderer ("-R") is open-sourced upstream. The MLLM semantic planner (the paper's headline "latent semantic planning", a Qwen2.5-VL-7B model) is not released. This port therefore runs with UMT5 text conditioning only: the planner-feature channel is absent, and the released checkpoint carries no weights for it.

Architecture

| Component | Value |
|---|---|
| Backbone | Wan2.1-T2V-1.3B, single expert (30L · dim 1536 · 12H · ffn 8960) |
| Experts | one (skip_transformer_2: true, switch_dit_boundary: 0) |
| VAE | 16-ch AutoencoderKLWan (stock Wan2.1) |
| Text encoder | UMT5-xxl |
| Bernini knobs | shift 3.0, use_src_id_rotary_emb (SA-3D RoPE, no extra parameters) |

This port differs from the A14B port only in its configuration: it uses a single 1.3B expert instead of the A14B high-noise/low-noise pair, so there is no expert-boundary switch.
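As a rough illustration of the settings listed above, the sketch below collects them in a plain Python dict. The Bernini knob names (`skip_transformer_2`, `switch_dit_boundary`, `shift`, `use_src_id_rotary_emb`) and all values come from this card; the backbone key names (`num_layers`, `dim`, `num_heads`, `ffn_dim`), the dict layout, and both helper functions are assumptions made for the example and may not match the wrapper config shipped with the port.

```python
# Hypothetical config layout: the values are taken from this card, but the
# backbone key names and the helpers below are illustrative assumptions.
BERNINI_R_1_3B = {
    "num_layers": 30,               # 30L
    "dim": 1536,                    # model width
    "num_heads": 12,                # 12H
    "ffn_dim": 8960,                # feed-forward width
    "skip_transformer_2": True,     # no second expert is loaded
    "switch_dit_boundary": 0,       # no expert-boundary switch
    "shift": 3.0,
    "use_src_id_rotary_emb": True,  # SA-3D RoPE, adds no parameters (assumed enabled)
}


def is_single_expert(cfg: dict) -> bool:
    """Return True when the config describes a one-expert renderer."""
    return bool(cfg.get("skip_transformer_2")) and cfg.get("switch_dit_boundary", 0) == 0


def head_dim(cfg: dict) -> int:
    """Per-head width implied by the model width and head count."""
    if cfg["dim"] % cfg["num_heads"] != 0:
        raise ValueError("dim must be divisible by num_heads")
    return cfg["dim"] // cfg["num_heads"]


if __name__ == "__main__":
    print("single expert:", is_single_expert(BERNINI_R_1_3B))
    print("head dim:", head_dim(BERNINI_R_1_3B))
```

With the numbers from the table, each attention head is 1536 / 12 = 128 wide.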

Tasks

| Task | Description |
|---|---|
| t2v / t2i | text-to-video / image |
| r2v | reference-to-video: generate a subject from up to K reference images (chained APG) |
| v2v | prompt-based video editing (source video injected as conditioning) |
| rv2v | reference + video editing |
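The task names above differ mainly in which conditioning inputs accompany the text prompt. The following sketch makes that mapping explicit. It is inferred from the task descriptions only; `TASK_INPUTS` and `check_task_inputs` are hypothetical names and are not part of the port's actual interface.

```python
from typing import Optional, Sequence

# Extra inputs each task expects besides the text prompt, as inferred from
# the task descriptions. Illustrative only; not the port's real API.
TASK_INPUTS = {
    "t2v": {"references": False, "video": False},
    "t2i": {"references": False, "video": False},
    "r2v": {"references": True, "video": False},
    "v2v": {"references": False, "video": True},
    "rv2v": {"references": True, "video": True},
}


def check_task_inputs(
    task: str,
    references: Optional[Sequence[str]] = None,
    video: Optional[str] = None,
) -> None:
    """Raise ValueError if the supplied inputs do not match the task."""
    if task not in TASK_INPUTS:
        raise ValueError(f"unknown task: {task!r}")
    needs = TASK_INPUTS[task]
    has_refs = bool(references)
    has_video = video is not None
    if needs["references"] != has_refs:
        raise ValueError(f"{task}: reference images required={needs['references']}")
    if needs["video"] != has_video:
        raise ValueError(f"{task}: source video required={needs['video']}")


if __name__ == "__main__":
    check_task_inputs("r2v", references=["subject.png"])
    check_task_inputs("rv2v", references=["subject.png"], video="clip.mp4")
    print("inputs are consistent with the chosen tasks")
```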

Variants

| Repo | Precision | Transformer | Shared components |
|---|---|---|---|
| …-1.3B-bf16 | bfloat16 | 2.6 GB | VAE 0.5 GB · UMT5 10.6 GB |
| …-1.3B-int4 | 4-bit (group 64) | 0.8 GB | VAE 0.5 GB · UMT5 10.6 GB |
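The transformer sizes can be sanity-checked with back-of-the-envelope arithmetic. The sketch below assumes a nominal 1.3e9 parameters and, for the 4-bit variant, affine group quantization that stores one 16-bit scale and one 16-bit bias per group of 64 weights. Both are assumptions for illustration, not figures read from the checkpoints.

```python
def weights_gb(num_params: float, bits_per_weight: float) -> float:
    """Storage in GB (1e9 bytes) for num_params weights."""
    return num_params * bits_per_weight / 8 / 1e9


def quantized_bits_per_weight(
    bits: int = 4, group_size: int = 64, scale_bits: int = 16, bias_bits: int = 16
) -> float:
    """Effective bits per weight, counting per-group scale and bias overhead."""
    return bits + (scale_bits + bias_bits) / group_size


NUM_PARAMS = 1.3e9   # nominal parameter count (assumption)
SHARED_GB = 0.5 + 10.6  # VAE + UMT5, from the Variants table

bf16_gb = weights_gb(NUM_PARAMS, 16)
int4_gb = weights_gb(NUM_PARAMS, quantized_bits_per_weight())

if __name__ == "__main__":
    print(f"bf16 transformer estimate: {bf16_gb:.2f} GB")
    print(f"int4 transformer estimate: {int4_gb:.2f} GB")
    print(f"bf16 total with listed sizes: {2.6 + SHARED_GB:.1f} GB")
    print(f"int4 total with listed sizes: {0.8 + SHARED_GB:.1f} GB")
```

Under these assumptions the bf16 estimate is 2.6 GB and the int4 estimate is about 0.73 GB, slightly below the listed 0.8 GB; one plausible explanation is that some tensors are kept unquantized. In both variants the UMT5 text encoder dominates the total footprint.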

Provenance & validation

  • Architecture: stock Wan2.1-T2V-1.3B. Verified: the checkpoint uses diffusers WanTransformer3DModel keys with no extra tensors, and it matches mlx-video wan21_t2v_1_3b exactly (30L/1536/12H/8960). The Bernini knobs (switch_dit_boundary 0, shift 3.0, use_src_id_rotary_emb) live in the wrapper config; SA-3D RoPE adds no parameters.
  • Conversion: fp32 → bf16 from ByteDance/Bernini-R-1.3B-Diffusers. The VAE and UMT5 are the shared stock Wan2.1 components (byte-identical to the A14B port).
  • Validation (on the CPU stream): the key contract is bijective (the 825 tensors in the file equal the model parameters minus the derived freqs RoPE buffer); the bf16 forward pass has cosine similarity 0.999983 against the fp32 source; the int4 per-pass cosine similarity is 0.9943 against bf16 (group size 64, gate ≥ 0.99).
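The two kinds of checks mentioned in the validation bullet can be sketched in plain Python. This is not the script used to validate the port: the function names are hypothetical, and matching the derived buffer by the last component of its key name (`freqs`) is an assumption.

```python
import math
from typing import Iterable, Sequence, Set, Tuple


def check_key_contract(
    file_keys: Iterable[str],
    model_keys: Iterable[str],
    derived: frozenset = frozenset({"freqs"}),
) -> Tuple[Set[str], Set[str]]:
    """Compare checkpoint keys with model keys, ignoring derived buffers.

    Returns (missing, unexpected). Both sets are empty when the mapping is
    one-to-one.
    """
    expected = {k for k in model_keys if k.split(".")[-1] not in derived}
    found = set(file_keys)
    return expected - found, found - expected


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two flattened outputs of equal length."""
    if len(a) != len(b):
        raise ValueError("inputs must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def passes_gate(a: Sequence[float], b: Sequence[float], threshold: float = 0.99) -> bool:
    """Apply a minimum-cosine gate such as the 0.99 threshold quoted above."""
    return cosine_similarity(a, b) >= threshold


if __name__ == "__main__":
    missing, unexpected = check_key_contract(
        ["blocks.0.attn.q.weight"], ["blocks.0.attn.q.weight", "freqs"]
    )
    print("missing:", missing, "unexpected:", unexpected)
    print("gate passed:", passes_gate([1.0, 2.0, 3.0], [1.01, 1.99, 3.02]))
```

In practice the two sequences would be the flattened outputs of the reference and converted models on the same input.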

License & attribution

Apache-2.0. Derived from ByteDance Bernini-R, Wan2.1 (Wan-AI), and mlx-video. See NOTICE.