	---
	license: apache-2.0
	library_name: mlx
	pipeline_tag: text-to-video
	tags:
	- mlx
	- text-to-video
	- video-editing
	- video-to-video
	- reference-to-video
	- wan2.1
	- bernini
	base_model: ByteDance/Bernini-R-1.3B-Diffusers
	---

	# Bernini-R-1.3B (MLX)

	Apple MLX port of [ByteDance/Bernini-R-1.3B-Diffusers](https://huggingface.co/ByteDance/Bernini-R-1.3B-Diffusers),
	the 1.3B tier of ByteDance's Bernini Renderer: a Wan2.1-T2V-1.3B-derived video
	generator/editor that uses Segment-Aware 3D RoPE (SA-3D RoPE) for multi-reference and editing tasks.
	The small tier "performs close to the 14B variant on simple tasks such as style transfer,
	subtitle or watermark removal, and local editing."

	Runs on Apple Silicon via [MLX](https://github.com/ml-explore/mlx) + the
	[mlx-video](https://github.com/Blaizzy/mlx-video) Wan backbone.

	> This is the lowest-cost Bernini tier. For the higher-quality A14B renderer see
	> [`mlx-community/Bernini-R-bf16`](https://huggingface.co/mlx-community/Bernini-R-bf16) /
	> [`-int4`](https://huggingface.co/mlx-community/Bernini-R-int4).

	## ⚠️ Scope: renderer only

	Only the Renderer ("-R") is open-sourced upstream. The MLLM semantic planner (the
	paper's headline "latent semantic planning", a Qwen2.5-VL-7B model) is not released.
	This port therefore runs with UMT5 text conditioning only: the planner-feature
	channel is absent, and the released checkpoint carries no weights for it.

	## Architecture

	| Component | Details |
	|---|---|
	| Backbone | Wan2.1-T2V-1.3B, single expert (30L · dim 1536 · 12H · ffn 8960) |
	| Experts | one (`skip_transformer_2: true`, `switch_dit_boundary: 0`) |
	| VAE | 16-ch `AutoencoderKLWan` (stock Wan2.1) |
	| Text encoder | UMT5-xxl |
	| Bernini knobs | `shift 3.0`, `use_src_id_rotary_emb` (SA-3D RoPE, no extra parameters) |

	This port differs from the A14B port only in configuration: it uses a single 1.3B expert
	instead of the high/low-noise A14B pair, so there is no expert-boundary switch.
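
As a rough illustration of the backbone shape in the table above, the following Python sketch records the listed dimensions and derives the per-head width and an approximate weight count. The class name `WanBackboneConfig` and its fields are hypothetical (they are not the `mlx-video` API). The estimate counts only the attention and feed-forward matrices, under the assumption that each block holds self-attention, cross-attention, and a two-layer feed-forward network.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WanBackboneConfig:
    """Hypothetical container for the dimensions listed in the table above."""

    num_layers: int = 30
    dim: int = 1536
    num_heads: int = 12
    ffn_dim: int = 8960

    @property
    def head_dim(self) -> int:
        # Attention splits the model width evenly across heads.
        assert self.dim % self.num_heads == 0
        return self.dim // self.num_heads

    def approx_matrix_params(self) -> int:
        # Assumption: each block has self-attention and cross-attention
        # (four dim x dim projections each) plus a two-layer feed-forward net.
        # Biases, norms, modulation, and embeddings are ignored.
        attn = 2 * 4 * self.dim * self.dim
        ffn = 2 * self.dim * self.ffn_dim
        return self.num_layers * (attn + ffn)


cfg = WanBackboneConfig()
print(cfg.head_dim)                                  # 128
print(f"{cfg.approx_matrix_params() / 1e9:.2f}B")    # 1.39B
```

Under these assumptions the matrices alone come to roughly 1.4 billion weights, which is in line with the tier name.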

	## Tasks

	| Task | Description |
	|---|---|
	| `t2v` / `t2i` | text-to-video / text-to-image |
	| `r2v` | reference-to-video: generate a subject from up to K reference images (chained APG) |
	| `v2v` | prompt-based video editing (source video injected as conditioning) |
	| `rv2v` | reference + video editing |

	## Variants

	| Repo | Precision | Transformer size | Shared components |
	|---|---|---|---|
	| `…-1.3B-bf16` | bfloat16 | 2.6 GB | VAE 0.5 GB · UMT5 10.6 GB |
	| `…-1.3B-int4` | 4-bit (group 64) | 0.8 GB | VAE 0.5 GB · UMT5 10.6 GB |
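
The size gap between the two variants follows from the effective bits per weight. The sketch below is a back-of-the-envelope estimate, not a measurement. It assumes the 4-bit quantizer stores one 16-bit scale and one 16-bit bias per group of 64 weights, and it uses a rough figure of 1.4 billion parameters. The function names and the `APPROX_PARAMS` constant are illustrative only.

```python
def bits_per_weight(bits: int, group_size: int,
                    scale_bits: int = 16, bias_bits: int = 16) -> float:
    """Effective storage per weight, including per-group scale and bias."""
    return bits + (scale_bits + bias_bits) / group_size


def size_gb(num_params: float, bpw: float) -> float:
    """Storage in decimal gigabytes for num_params weights at bpw bits each."""
    return num_params * bpw / 8 / 1e9


APPROX_PARAMS = 1.4e9  # assumption: rough transformer parameter count

int4_bpw = bits_per_weight(bits=4, group_size=64)
print(int4_bpw)                                    # 4.5
print(round(size_gb(APPROX_PARAMS, 16), 1))        # 2.8 (bf16)
print(round(size_gb(APPROX_PARAMS, int4_bpw), 1))  # 0.8 (4-bit, group 64)
```

These figures only approximate the table: any layers kept at higher precision raise the 4-bit total, and the reported sizes may use binary rather than decimal units.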

	## Provenance & validation

	- Architecture: stock Wan2.1-T2V-1.3B. Verified: the checkpoint uses diffusers `WanTransformer3DModel`
	keys with no extra tensors and matches `mlx-video` `wan21_t2v_1_3b` exactly (30L/1536/12H/8960).
	The Bernini knobs (`switch_dit_boundary 0`, `shift 3.0`, `use_src_id_rotary_emb`) live in
	the wrapper config; SA-3D RoPE adds no parameters.
	- Conversion: fp32 → bf16 from `ByteDance/Bernini-R-1.3B-Diffusers`; the VAE and UMT5 are the shared
	stock Wan2.1 components (byte-identical to the A14B port).
	- Validation (on the CPU stream): the key contract is bijective (the 825 file tensors equal the model
	parameters minus the derived `freqs` RoPE buffer); the bf16 forward pass has cosine similarity 0.999983
	against the source fp32; the int4 per-pass cosine similarity is 0.9943 against bf16 (group size 64, acceptance gate ≥ 0.99).
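
The two kinds of check described above can be sketched in plain Python. This is a minimal illustration, not the validation script used for this port: the function names, the toy vectors, and the key name `blocks.0.w` are hypothetical, and the real checks compare full model outputs and all 825 tensors.

```python
import math
from typing import Iterable, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two flattened output vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def passes_gate(reference: Sequence[float], candidate: Sequence[float],
                gate: float = 0.99) -> bool:
    """True when the candidate output stays within the similarity gate."""
    return cosine_similarity(reference, candidate) >= gate


def key_contract_is_bijective(file_keys: Iterable[str],
                              model_keys: Iterable[str],
                              derived: Iterable[str] = ("freqs",)) -> bool:
    """True when file tensors map one-to-one onto the model parameters,
    excluding buffers that are computed at load time instead of stored."""
    file_list = list(file_keys)
    expected = set(model_keys) - set(derived)
    no_duplicates = len(file_list) == len(set(file_list))
    return no_duplicates and set(file_list) == expected


# Toy stand-ins for a reference output and a lower-precision output.
reference = [0.10, -0.20, 0.30, 0.40]
candidate = [0.11, -0.19, 0.29, 0.41]
print(passes_gate(reference, candidate))
print(key_contract_is_bijective(["blocks.0.w"], ["blocks.0.w", "freqs"]))
```

Excluding derived buffers such as `freqs` from the comparison keeps the check strict for stored weights while allowing values that the model recomputes on its own.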

	## License & attribution

	Apache-2.0. Derived from ByteDance Bernini-R, Wan2.1 (Wan-AI), and mlx-video. See `NOTICE`.