Add Bernini-R MLX weights (renderer-only; Wan2.2-A14B derived)

Files changed (7)

NOTICE +22 -0
README.md +79 -0
config.json +55 -0
high_noise_model.safetensors +3 -0
low_noise_model.safetensors +3 -0
t5_encoder.safetensors +3 -0
vae.safetensors +3 -0

NOTICE ADDED

	@@ -0,0 +1,22 @@

+bernini-r-mlx
+Apple MLX port of ByteDance Bernini-R (the Bernini Renderer).
+This work is licensed under the Apache License, Version 2.0.
+It is derived from and depends on the following Apache-2.0 works; their notices
+and attributions are retained here:
+- ByteDance/Bernini-R — the Bernini Renderer (weights + reference inference code).
+  https://github.com/bytedance/Bernini  ·  https://huggingface.co/ByteDance/Bernini-R-Diffusers
+  Paper: "Bernini: Latent Semantic Planning for Video Diffusion" (arXiv:2605.22344).
+- Wan-AI/Wan2.2-T2V-A14B — the base DiT, 16-channel causal VAE, and UMT5 text encoder
+  that Bernini-R fine-tunes / reuses.  https://github.com/Wan-Video/Wan2.2
+- Qwen2.5-VL-7B-Instruct — the Bernini *planner* (NOT used here; not released as weights).
+- mlx-video (Blaizzy/mlx-video) — the MLX Wan2.2 backbone reused by this port.
+Scope note: only the Bernini *Renderer* is open-sourced upstream. The MLLM semantic
+planner (the paper's "latent semantic planning") is not released, so this port runs with
+UMT5 text conditioning only; the planner-feature channel is absent.

README.md ADDED

	@@ -0,0 +1,79 @@

+---
+license: apache-2.0
+library_name: mlx
+pipeline_tag: text-to-video
+tags:
+  - mlx
+  - text-to-video
+  - video-editing
+  - video-to-video
+  - reference-to-video
+  - wan2.2
+  - bernini
+base_model: ByteDance/Bernini-R-Diffusers
+---
+# Bernini-R (MLX)
+Apple MLX port of **[ByteDance/Bernini-R](https://huggingface.co/ByteDance/Bernini-R-Diffusers)** —
+the open-sourced *Renderer* of ByteDance's Bernini: a Wan2.2-T2V-A14B-derived video
+generator/editor with **Segment-Aware 3D RoPE** for multi-reference / editing tasks.
+Runs on Apple Silicon via [MLX](https://github.com/ml-explore/mlx) + the
+[mlx-video](https://github.com/Blaizzy/mlx-video) Wan2.2 backbone.
+## ⚠️ Scope: renderer only
+Only the Renderer ("-R") is open-sourced upstream. The MLLM semantic **planner** (the
+paper's headline "latent semantic planning", a Qwen2.5-VL-7B model) is **not released**.
+This port therefore runs with **UMT5 text conditioning only** — the planner-feature
+channel is absent (and carries no weights in the released checkpoint). You get the
+renderer's editing / reference-to-video / subject-consistency behavior, not the full
+planner-guided system.
+## Tasks
+| Task | Description |
+|---|---|
+| `t2v` / `t2i` | text-to-video / text-to-image |
+| `r2v` | reference-to-video — generate a subject from up to K reference images (chained APG) |
+| `v2v` | prompt-based video editing (source video injected as conditioning) |
+| `rv2v` | reference + video editing |
+## Variants
+| Repo | Precision | Size per expert |
+|---|---|---|
+| `…-bf16`  | bfloat16 | 28.6 GB |
+| `…-int4`  | 4-bit (group 64) | 8.4 GB |
+Each repo holds two experts (high-noise and low-noise) plus the 16-channel Wan2.2 VAE (0.5 GB) and the UMT5 text encoder (11 GB).
+## Usage
+```python
+from bernini_r_mlx import pipeline_mlx as P
+# text-to-video
+P.t2v("path/to/ckpt", "a red fox in a snowy forest", num_frames=49, output_path="out.mp4")
+# reference-to-video (subject consistency)
+P.r2v("path/to/ckpt", "the fox running across a field",
+      reference_images=["fox.png"], output_path="r2v.mp4")
+# video editing
+P.v2v("path/to/ckpt", "... autumn forest ...", source_video="in.mp4", output_path="v2v.mp4")
+```
+## Provenance & validation
+- Architecture: **stock Wan2.2-T2V-A14B** (verified — diffusers `WanTransformer3DModel` keys,
+  no extra tensors); Bernini knobs (`switch_dit_boundary 0.875`, `shift 3.0`,
+  `use_src_id_rotary_emb`) live in the wrapper config. SA-3D RoPE adds **no parameters**.
+- Converted fp32 → bf16 from `ByteDance/Bernini-R-Diffusers`; VAE/UMT5 from `Wan-AI/Wan2.2-T2V-A14B`.
+- Validated: SA-3D RoPE parity ~1e-7; VAE roundtrip MAD 2.1/255; multi-segment forward
+  bit-exact vs t2v; int4 per-pass cosine 0.9992 vs bf16; e2e t2v / r2v / v2v coherent.
+## License & attribution
+Apache-2.0. Derived from ByteDance Bernini-R, Wan2.2 (Wan-AI), and mlx-video. See `NOTICE`.
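
Note on the validation figures in the README: the two numeric checks (the int4 "per-pass cosine" and the VAE "roundtrip MAD .../255") can be reproduced along the following lines. This is a minimal sketch, not the port's actual test code. It assumes "MAD" means mean absolute difference, that pixel values are floats in [0, 1], and that tensors have already been flattened to plain Python sequences. The function names are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length flat sequences of floats."""
    if len(a) != len(b):
        raise ValueError("inputs must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def mean_abs_diff_255(original, reconstructed):
    """Mean absolute difference of [0, 1] pixel values, reported in 8-bit units.

    Assumption: 'MAD x/255' in the README is a mean absolute difference
    scaled to the 0..255 range; the source does not spell this out.
    """
    if len(original) != len(reconstructed):
        raise ValueError("inputs must have the same length")
    total = sum(abs(x - y) for x, y in zip(original, reconstructed))
    return 255.0 * total / len(original)
```

Under these assumptions, a quantized forward pass would be compared with `cosine_similarity(bf16_out, int4_out)`, and a VAE encode/decode round trip with `mean_abs_diff_255(frames, decoded_frames)`.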

config.json ADDED

	@@ -0,0 +1,55 @@

+{
+  "model_type": "t2v",
+  "model_version": "2.2",
+  "patch_size": [
+    1,
+    2,
+    2
+  ],
+  "text_len": 512,
+  "in_dim": 16,
+  "dim": 5120,
+  "ffn_dim": 13824,
+  "freq_dim": 256,
+  "text_dim": 4096,
+  "out_dim": 16,
+  "num_heads": 40,
+  "num_layers": 40,
+  "window_size": [
+    -1,
+    -1
+  ],
+  "qk_norm": true,
+  "cross_attn_norm": true,
+  "eps": 1e-06,
+  "vae_stride": [
+    4,
+    8,
+    8
+  ],
+  "vae_z_dim": 16,
+  "dual_model": true,
+  "boundary": 0.875,
+  "sample_shift": 3.0,
+  "sample_steps": 40,
+  "sample_guide_scale": [
+    3.0,
+    4.0
+  ],
+  "num_train_timesteps": 1000,
+  "sample_fps": 16,
+  "frame_num": 81,
+  "sample_neg_prompt": "\u8272\u8c03\u8273\u4e3d\uff0c\u8fc7\u66dd\uff0c\u9759\u6001\uff0c\u7ec6\u8282\u6a21\u7cca\u4e0d\u6e05\uff0c\u5b57\u5e55\uff0c\u98ce\u683c\uff0c\u4f5c\u54c1\uff0c\u753b\u4f5c\uff0c\u753b\u9762\uff0c\u9759\u6b62\uff0c\u6574\u4f53\u53d1\u7070\uff0c\u6700\u5dee\u8d28\u91cf\uff0c\u4f4e\u8d28\u91cf\uff0cJPEG\u538b\u7f29\u6b8b\u7559\uff0c\u4e11\u964b\u7684\uff0c\u6b8b\u7f3a\u7684\uff0c\u591a\u4f59\u7684\u624b\u6307\uff0c\u753b\u5f97\u4e0d\u597d\u7684\u624b\u90e8\uff0c\u753b\u5f97\u4e0d\u597d\u7684\u8138\u90e8\uff0c\u7578\u5f62\u7684\uff0c\u6bc1\u5bb9\u7684\uff0c\u5f62\u6001\u7578\u5f62\u7684\u80a2\u4f53\uff0c\u624b\u6307\u878d\u5408\uff0c\u9759\u6b62\u4e0d\u52a8\u7684\u753b\u9762\uff0c\u6742\u4e71\u7684\u80cc\u666f\uff0c\u4e09\u6761\u817f\uff0c\u80cc\u666f\u4eba\u5f88\u591a\uff0c\u5012\u7740\u8d70",
+  "max_area": 0,
+  "t5_vocab_size": 256384,
+  "t5_dim": 4096,
+  "t5_dim_attn": 4096,
+  "t5_dim_ffn": 10240,
+  "t5_num_heads": 64,
+  "t5_num_layers": 24,
+  "t5_num_buckets": 32,
+  "quantization": {
+    "group_size": 64,
+    "bits": 4
+  }
+}
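
Note on how these fields are typically consumed: the sketch below shows one plausible reading of `boundary`, `sample_guide_scale`, `vae_stride`, `patch_size`, and `frame_num`. It assumes the upstream Wan2.2 conventions, which this diff does not confirm for the MLX port: the high-noise expert handles timesteps at or above `boundary * num_train_timesteps`, `sample_guide_scale` is ordered (low-noise, high-noise), and the causal VAE maps `T` frames to `(T - 1) // 4 + 1` latent frames. `select_expert` and `token_grid` are hypothetical helper names.

```python
# Values copied from config.json above.
CONFIG = {
    "boundary": 0.875,
    "num_train_timesteps": 1000,
    "sample_guide_scale": [3.0, 4.0],
    "vae_stride": [4, 8, 8],
    "patch_size": [1, 2, 2],
    "frame_num": 81,
}


def select_expert(timestep, cfg=CONFIG):
    """Pick the expert and guidance scale for a diffusion timestep.

    Assumption: timesteps >= boundary * num_train_timesteps go to the
    high-noise expert, and sample_guide_scale is (low-noise, high-noise).
    """
    threshold = cfg["boundary"] * cfg["num_train_timesteps"]
    low_scale, high_scale = cfg["sample_guide_scale"]
    if timestep >= threshold:
        return "high_noise_model", high_scale
    return "low_noise_model", low_scale


def token_grid(height, width, cfg=CONFIG):
    """Return the (frames, rows, cols) token grid the DiT sees for one clip.

    Assumption: the causal VAE keeps the first frame and compresses the
    rest by the temporal stride, so T frames become (T - 1) // st + 1.
    """
    st, sh, sw = cfg["vae_stride"]
    pt, ph, pw = cfg["patch_size"]
    latent_frames = (cfg["frame_num"] - 1) // st + 1
    latent_h, latent_w = height // sh, width // sw
    return latent_frames // pt, latent_h // ph, latent_w // pw
```

For example, `token_grid(480, 832)` describes the grid for an 81-frame 480x832 clip. The resolution is an illustrative choice; the config itself leaves `max_area` at 0.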

high_noise_model.safetensors ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:328e793ce1e016bc9594584fe2803badf2e1923dff94d55d32dc7aa1528fc199
+size 8379507248

low_noise_model.safetensors ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:903e9a200b5da99ceb5d3020297eb8af937b7a578146289e38c6f25349d69def
+size 8379507248

t5_encoder.safetensors ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:e86ee4199903e00a88dcd43583a43a6eb898cef600e38670f222d7e37d163787
+size 11361845505

vae.safetensors ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:977530e453dbfabbab31e2972e1577d8d7e2840ba7410c81aa3fd421c0cd7414
+size 507591226
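
Note on the four pointer files above: each is a three-line Git LFS pointer (`version`, `oid`, `size`), so the download size of this commit can be checked before fetching any weights. The sketch below parses the pointer text exactly as it appears in this diff and sums the sizes. `parse_lfs_pointer` is a hypothetical helper, not part of any Git LFS tooling.

```python
def parse_lfs_pointer(text):
    """Parse a Git LFS pointer file into its version, hash, and byte size."""
    fields = {}
    for line in text.strip().splitlines():
        key, _, value = line.partition(" ")
        fields[key] = value
    algo, _, digest = fields["oid"].partition(":")
    return {
        "version": fields["version"],
        "algo": algo,
        "digest": digest,
        "size": int(fields["size"]),
    }


SPEC = "version https://git-lfs.github.com/spec/v1\n"

# Pointer contents copied from the diff above.
POINTERS = {
    "high_noise_model.safetensors": SPEC
    + "oid sha256:328e793ce1e016bc9594584fe2803badf2e1923dff94d55d32dc7aa1528fc199\n"
    + "size 8379507248\n",
    "low_noise_model.safetensors": SPEC
    + "oid sha256:903e9a200b5da99ceb5d3020297eb8af937b7a578146289e38c6f25349d69def\n"
    + "size 8379507248\n",
    "t5_encoder.safetensors": SPEC
    + "oid sha256:e86ee4199903e00a88dcd43583a43a6eb898cef600e38670f222d7e37d163787\n"
    + "size 11361845505\n",
    "vae.safetensors": SPEC
    + "oid sha256:977530e453dbfabbab31e2972e1577d8d7e2840ba7410c81aa3fd421c0cd7414\n"
    + "size 507591226\n",
}

sizes = {name: parse_lfs_pointer(text)["size"] for name, text in POINTERS.items()}
total_bytes = sum(sizes.values())
total_gb = total_bytes / 1e9  # decimal gigabytes

for name, size in sizes.items():
    print(f"{name}: {size / 1e9:.2f} GB")
print(f"total: {total_gb:.2f} GB")
```

The per-file figures line up with the README: about 8.4 GB per int4 expert, about 0.5 GB for the VAE, and about 11 GB for the text encoder.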