

	---
	license: apache-2.0
	library_name: mlx
	pipeline_tag: text-to-video
	tags:
	- mlx
	- text-to-video
	- video-editing
	- video-to-video
	- reference-to-video
	- wan2.2
	- bernini
	base_model: ByteDance/Bernini-R-Diffusers
	---

	# Bernini-R (MLX)

	Apple MLX port of [ByteDance/Bernini-R-Diffusers](https://huggingface.co/ByteDance/Bernini-R-Diffusers),
	the open-sourced Renderer of ByteDance's Bernini: a Wan2.2-T2V-A14B-derived video
	generator/editor with Segment-Aware 3D (SA-3D) RoPE for multi-reference and editing tasks.

	Runs on Apple Silicon via [MLX](https://github.com/ml-explore/mlx) + the
	[mlx-video](https://github.com/Blaizzy/mlx-video) Wan2.2 backbone.

	## ⚠️ Scope: renderer only

	Only the Renderer ("-R") is open-sourced upstream. The MLLM semantic planner (the
	paper's headline "latent semantic planning", a Qwen2.5-VL-7B model) is not released.
	This port therefore runs with UMT5 text conditioning only — the planner-feature
	channel is absent (and carries no weights in the released checkpoint). You get the
	renderer's editing / reference-to-video / subject-consistency behavior, not the full
	planner-guided system.

	## Tasks

	| Task | Description |
	|---|---|
	| `t2v` / `t2i` | text-to-video / text-to-image |
	| `r2v` | reference-to-video: generate a subject from up to K reference images (chained APG) |
	| `v2v` | prompt-based video editing (source video injected as conditioning) |
	| `rv2v` | reference + video editing |
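
The table above effectively maps the available inputs to a task name. As a rough sketch, that choice can be written as a small dispatcher. `select_task` is a hypothetical helper for illustration, not part of `bernini_r_mlx`, and it does not distinguish `t2i` from `t2v`.

```python
from typing import Optional, Sequence


def select_task(reference_images: Optional[Sequence[str]] = None,
                source_video: Optional[str] = None) -> str:
    """Return the task name from the table that matches the given inputs.

    Hypothetical helper for illustration; not part of bernini_r_mlx.
    """
    has_refs = bool(reference_images)
    has_video = source_video is not None
    if has_refs and has_video:
        return "rv2v"  # reference + video editing
    if has_refs:
        return "r2v"   # reference-to-video
    if has_video:
        return "v2v"   # prompt-based video editing
    return "t2v"       # text conditioning only


print(select_task())
print(select_task(reference_images=["fox.png"]))
print(select_task(source_video="in.mp4"))
print(select_task(["fox.png"], "in.mp4"))
```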

	## Variants

	| Repo | Precision | Size per expert |
	|---|---|---|
	| `…-bf16` | bfloat16 | 28.6 GB |
	| `…-int4` | 4-bit (group size 64) | 8.4 GB |

	Each checkpoint comprises two experts (high-noise and low-noise), the 16-channel Wan2.2 VAE (0.5 GB), and the UMT5 text encoder (11 GB).
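
As a back-of-the-envelope check of the total download size, the figures above can be summed per variant. This sketch assumes the VAE and UMT5 sizes are the same for both variants, which the table does not state. `total_size_gb` is an illustrative helper, not part of the repo.

```python
# Sizes in GB, taken from the Variants table and the component list above.
EXPERT_SIZE_GB = {"bf16": 28.6, "int4": 8.4}
NUM_EXPERTS = 2   # high-noise and low-noise
VAE_GB = 0.5
UMT5_GB = 11.0    # assumption: same size in both variants


def total_size_gb(variant: str) -> float:
    """Approximate total checkpoint size for one variant, in GB."""
    return NUM_EXPERTS * EXPERT_SIZE_GB[variant] + VAE_GB + UMT5_GB


for name in EXPERT_SIZE_GB:
    print(f"{name}: ~{total_size_gb(name):.1f} GB")
```

Under that assumption the totals come to roughly 68.7 GB for bf16 and 28.3 GB for int4.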

	## Usage

	```python
	from bernini_r_mlx import pipeline_mlx as P

	# text-to-video
	P.t2v("path/to/ckpt", "a red fox in a snowy forest", num_frames=49, output_path="out.mp4")

	# reference-to-video (subject consistency)
	P.r2v("path/to/ckpt", "the fox running across a field",
	      reference_images=["fox.png"], output_path="r2v.mp4")

	# video editing
	P.v2v("path/to/ckpt", "... autumn forest ...", source_video="in.mp4", output_path="v2v.mp4")
	```
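
The `num_frames=49` used above fits the `4k + 1` pattern required by the upstream Wan pipelines, whose VAE compresses time by a factor of 4. Assuming this port inherits that constraint (the card does not say so), a requested frame count could be snapped to a valid value as sketched below. `snap_num_frames` and `latent_frames` are hypothetical helpers, not part of `bernini_r_mlx`.

```python
TEMPORAL_STRIDE = 4  # assumption: Wan VAE temporal compression factor


def snap_num_frames(requested: int) -> int:
    """Round a requested frame count to a nearby value of the form 4k + 1.

    Exact ties are resolved by Python's built-in round().
    """
    if requested < 1:
        raise ValueError("requested must be at least 1")
    k = round((requested - 1) / TEMPORAL_STRIDE)
    return TEMPORAL_STRIDE * k + 1


def latent_frames(num_frames: int) -> int:
    """Number of latent frames for a 4k + 1 frame count under this stride."""
    return (num_frames - 1) // TEMPORAL_STRIDE + 1


print(snap_num_frames(49), latent_frames(49))
print(snap_num_frames(50))
```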

	## Provenance & validation

	- Architecture: stock Wan2.2-T2V-A14B (verified: the checkpoint has diffusers `WanTransformer3DModel` keys
	  and no extra tensors). The Bernini knobs (`switch_dit_boundary 0.875`, `shift 3.0`,
	  `use_src_id_rotary_emb`) live in the wrapper config. SA-3D RoPE adds no parameters.
	- Converted fp32 → bf16 from `ByteDance/Bernini-R-Diffusers`; VAE/UMT5 from `Wan-AI/Wan2.2-T2V-A14B`.
	- Validated: SA-3D RoPE parity ~1e-7; VAE round-trip MAD 2.1/255; multi-segment forward
	  bit-exact vs t2v; int4 per-pass cosine similarity 0.9992 vs bf16; end-to-end t2v / r2v / v2v outputs coherent.
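
For readers unfamiliar with the two numeric knobs in the Architecture bullet, here is a minimal sketch of what they conventionally mean in the upstream Wan2.2 code: `shift` warps the flow-matching noise schedule, and the boundary decides which expert handles a given timestep. It assumes the wrapper follows the Wan2.2 convention with 1000 training timesteps. The function names are illustrative and not taken from this repo.

```python
NUM_TRAIN_TIMESTEPS = 1000   # assumption: Wan2.2 default
SWITCH_DIT_BOUNDARY = 0.875  # value quoted from the wrapper config
SHIFT = 3.0                  # value quoted from the wrapper config


def shift_sigma(sigma: float, shift: float = SHIFT) -> float:
    """Flow-matching time shift: shift * sigma / (1 + (shift - 1) * sigma)."""
    return shift * sigma / (1.0 + (shift - 1.0) * sigma)


def select_expert(timestep: float, boundary: float = SWITCH_DIT_BOUNDARY) -> str:
    """Pick the high-noise expert at or above the boundary, else the low-noise one."""
    threshold = boundary * NUM_TRAIN_TIMESTEPS
    return "high_noise" if timestep >= threshold else "low_noise"


# Walk a small linear schedule from pure noise (sigma = 1) toward clean data.
num_steps = 8
for i in range(num_steps):
    sigma = shift_sigma(1.0 - i / num_steps)
    t = sigma * NUM_TRAIN_TIMESTEPS
    print(f"step {i}: t = {t:6.1f} -> {select_expert(t)}")
```

With these values, a shifted timestep of 875 or above goes to the high-noise expert and everything below it to the low-noise expert.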

	## License & attribution

	Apache-2.0. Derived from ByteDance Bernini-R, Wan2.2 (Wan-AI), and mlx-video. See `NOTICE`.