
license: apache-2.0
library_name: mlx
pipeline_tag: text-to-video
tags:
  - mlx
  - text-to-video
  - video-editing
  - video-to-video
  - reference-to-video
  - wan2.1
  - bernini
base_model: ByteDance/Bernini-R-1.3B-Diffusers

Bernini-R-1.3B (MLX)

Apple MLX port of ByteDance/Bernini-R-1.3B — the 1.3B tier of ByteDance's Bernini Renderer: a Wan2.1-T2V-1.3B-derived video generator/editor with Segment-Aware 3D RoPE for multi-reference / editing tasks. The small tier "performs close to the 14B variant on simple tasks such as style transfer, subtitle or watermark removal, and local editing."

Runs on Apple Silicon via MLX + the mlx-video Wan backbone.

This is the lowest-cost Bernini tier. For the higher-quality A14B renderer see mlx-community/Bernini-R-bf16 / -int4.

⚠️ Scope: renderer only

Only the Renderer ("-R") is open-sourced upstream. The MLLM semantic planner (the paper's headline "latent semantic planning", a Qwen2.5-VL-7B model) is not released. This port therefore runs with UMT5 text conditioning only — the planner-feature channel is absent (and carries no weights in the released checkpoint).

Architecture


| Component | Details |
|---|---|
| Backbone | Wan2.1-T2V-1.3B, single expert (30L · dim 1536 · 12H · ffn 8960) |
| Experts | one (`skip_transformer_2: true`, `switch_dit_boundary: 0`) |
| VAE | 16-ch `AutoencoderKLWan` (stock Wan2.1) |
| Text encoder | UMT5-xxl |
| Bernini knobs | `shift 3.0`, `use_src_id_rotary_emb` (SA-3D RoPE, no extra parameters) |

Differs from the A14B port only by config: a single 1.3B expert instead of the high/low-noise A14B pair. There is no expert-boundary switch.
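
As a rough sanity check on the "1.3B" label, the backbone dimensions listed above can be turned into an approximate weight count. This is only a back-of-the-envelope sketch: it assumes each block holds a self-attention and a cross-attention (four `dim × dim` projections each) plus a two-layer feed-forward, and it ignores biases, norms, modulation tensors, and the patch/text/time embeddings. The function names are illustrative and are not part of mlx-video.

```python
# Approximate weight count for the transformer blocks only.
# Assumptions (not taken from the checkpoint): 4 projections per attention,
# self-attention + cross-attention per block, ungated two-layer feed-forward.

def approx_block_params(dim: int, ffn: int) -> int:
    attn = 4 * dim * dim            # q, k, v, o projection matrices
    feed_forward = 2 * dim * ffn    # up and down projection matrices
    return 2 * attn + feed_forward  # self-attn + cross-attn + feed-forward


def approx_transformer_params(layers: int, dim: int, ffn: int) -> int:
    return layers * approx_block_params(dim, ffn)


LAYERS, DIM, HEADS, FFN = 30, 1536, 12, 8960

head_dim = DIM // HEADS  # per-head width implied by dim and head count
total = approx_transformer_params(LAYERS, DIM, FFN)

print(f"head_dim = {head_dim}")
print(f"{total / 1e9:.2f}B weights in the blocks")  # prints "1.39B weights in the blocks"
```

The estimate lands near the nominal size; the omitted tensors account for the remainder.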

Tasks

| Task | Description |
|---|---|
| `t2v` / `t2i` | text-to-video / image |
| `r2v` | reference-to-video: generate a subject from up to K reference images (chained APG) |
| `v2v` | prompt-based video editing (source video injected as conditioning) |
| `rv2v` | reference + video editing |

Variants

| Repo | Precision | Transformer | + shared |
|---|---|---|---|
| `…-1.3B-bf16` | bfloat16 | 2.6 GB | VAE 0.5 GB · UMT5 10.6 GB |
| `…-1.3B-int4` | 4-bit (group 64) | 0.8 GB | VAE 0.5 GB · UMT5 10.6 GB |
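
To see why the 4-bit variant is not simply a quarter of the bf16 size, the sketch below estimates storage per weight for group-wise quantization. It assumes each group of 64 weights carries one 16-bit scale and one 16-bit bias, which is how MLX affine quantization is commonly described; `N_PARAMS` is a hypothetical round figure, not a count read from the checkpoint. Expect only ballpark agreement with the table, since not every tensor need be quantized and the table's GB convention is not stated.

```python
# Ballpark storage estimate for dense and group-quantized weights.
# Assumption: one 16-bit scale and one 16-bit bias stored per quantization group.

def quantized_bits(bits: int, group_size: int,
                   scale_bits: int = 16, bias_bits: int = 16) -> float:
    """Effective bits per weight, including per-group scale and bias."""
    return bits + (scale_bits + bias_bits) / group_size


def weights_gb(n_params: int, bits_per_weight: float) -> float:
    """Storage in decimal gigabytes for n_params weights."""
    return n_params * bits_per_weight / 8 / 1e9


N_PARAMS = 1_400_000_000  # hypothetical round figure for the 1.3B transformer

bf16_gb = weights_gb(N_PARAMS, 16)
int4_gb = weights_gb(N_PARAMS, quantized_bits(4, 64))

print(f"bf16: ~{bf16_gb:.1f} GB, int4 (group 64): ~{int4_gb:.1f} GB")
```

Under these assumptions the int4 weights cost 4.5 bits each rather than 4, so the per-group metadata adds roughly 12% on top of the raw 4-bit payload.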

Provenance & validation

- Architecture: stock Wan2.1-T2V-1.3B. Verified against the diffusers `WanTransformer3DModel` keys, with no extra tensors; matches mlx-video `wan21_t2v_1_3b` exactly (30L/1536/12H/8960). The Bernini knobs (`switch_dit_boundary: 0`, `shift 3.0`, `use_src_id_rotary_emb`) live in the wrapper config; SA-3D RoPE adds no parameters.
- Conversion: fp32 → bf16 from ByteDance/Bernini-R-1.3B-Diffusers. VAE and UMT5 are the shared stock Wan2.1 components (byte-identical to the A14B port).
- Validation (on the CPU stream): the key contract is bijective (825 file tensors = model params minus the derived `freqs` RoPE buffer); bf16 forward cosine is 0.999983 vs source fp32; int4 per-pass cosine is 0.9943 vs bf16 (group 64, gate ≥ 0.99).
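
The two validation checks described above (a one-to-one key match and a cosine-similarity gate) can be sketched in plain Python. This is an illustrative reconstruction, not the script used for this port: the function names, the toy key names, and the rule for spotting derived buffers by their final name component are all assumptions.

```python
import math

# Illustrative versions of the two checks; names and key layout are hypothetical.

def key_contract(file_keys, model_keys, derived=("freqs",)):
    """Return (missing, unexpected) key sets.

    Model keys whose last dotted component is a derived buffer (for example
    a RoPE frequency table) are excluded, since they are not stored on disk.
    """
    expected = {k for k in model_keys if k.split(".")[-1] not in derived}
    files = set(file_keys)
    return expected - files, files - expected


def cosine(a, b):
    """Cosine similarity between two equal-length sequences of floats."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def passes_gate(reference, candidate, threshold=0.99):
    """True when the candidate output stays within the cosine gate."""
    return cosine(reference, candidate) >= threshold


# Toy usage with made-up keys and outputs.
missing, unexpected = key_contract(
    file_keys=["blocks.0.attn.q.weight"],
    model_keys=["blocks.0.attn.q.weight", "rope.freqs"],
)
print("bijective:", not missing and not unexpected)
print("gate ok:", passes_gate([1.0, 2.0, 3.0], [1.0, 2.0, 3.1]))
```

In a real run the inputs would be flattened model outputs from the fp32, bf16, and int4 forward passes rather than short lists.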

License & attribution

Apache-2.0. Derived from ByteDance Bernini-R, Wan2.1 (Wan-AI), and mlx-video. See NOTICE.