xocialize committed on
Commit 960f9e1 · verified · 1 Parent(s): 4352cdd

Add Bernini-R MLX weights (renderer-only; Wan2.2-A14B derived)

NOTICE ADDED
@@ -0,0 +1,22 @@
+ bernini-r-mlx
+ Apple MLX port of ByteDance Bernini-R (the Bernini Renderer).
+
+ This work is licensed under the Apache License, Version 2.0.
+
+ It is derived from and depends on the following Apache-2.0 works; their notices
+ and attributions are retained here:
+
+ - ByteDance/Bernini-R — the Bernini Renderer (weights + reference inference code).
+   https://github.com/bytedance/Bernini · https://huggingface.co/ByteDance/Bernini-R-Diffusers
+   Paper: "Bernini: Latent Semantic Planning for Video Diffusion" (arXiv:2605.22344).
+
+ - Wan-AI/Wan2.2-T2V-A14B — the base DiT, 16-channel causal VAE, and UMT5 text encoder
+   that Bernini-R fine-tunes / reuses. https://github.com/Wan-Video/Wan2.2
+
+ - Qwen2.5-VL-7B-Instruct — the Bernini *planner* (NOT used here; not released as weights).
+
+ - mlx-video (Blaizzy/mlx-video) — the MLX Wan2.2 backbone reused by this port.
+
+ Scope note: only the Bernini *Renderer* is open-sourced upstream. The MLLM semantic
+ planner (the paper's "latent semantic planning") is not released, so this port runs with
+ UMT5 text conditioning only; the planner-feature channel is absent.
README.md ADDED
@@ -0,0 +1,79 @@
+ ---
+ license: apache-2.0
+ library_name: mlx
+ pipeline_tag: text-to-video
+ tags:
+ - mlx
+ - text-to-video
+ - video-editing
+ - video-to-video
+ - reference-to-video
+ - wan2.2
+ - bernini
+ base_model: ByteDance/Bernini-R-Diffusers
+ ---
+
+ # Bernini-R (MLX)
+
+ Apple MLX port of **[ByteDance/Bernini-R](https://huggingface.co/ByteDance/Bernini-R-Diffusers)** —
+ the open-sourced *Renderer* of ByteDance's Bernini: a Wan2.2-T2V-A14B-derived video
+ generator/editor with **Segment-Aware 3D RoPE** for multi-reference / editing tasks.
+
+ Runs on Apple Silicon via [MLX](https://github.com/ml-explore/mlx) + the
+ [mlx-video](https://github.com/Blaizzy/mlx-video) Wan2.2 backbone.
+
+ ## ⚠️ Scope: renderer only
+
+ Only the Renderer ("-R") is open-sourced upstream. The MLLM semantic **planner** (the
+ paper's headline "latent semantic planning", a Qwen2.5-VL-7B model) is **not released**.
+ This port therefore runs with **UMT5 text conditioning only** — the planner-feature
+ channel is absent (and carries no weights in the released checkpoint). You get the
+ renderer's editing / reference-to-video / subject-consistency behavior, not the full
+ planner-guided system.
+
+ ## Tasks
+
+ | Task | Description |
+ |---|---|
+ | `t2v` / `t2i` | text-to-video / text-to-image |
+ | `r2v` | reference-to-video — generate a subject from up to K reference images (chained APG) |
+ | `v2v` | prompt-based video editing (source video injected as conditioning) |
+ | `rv2v` | reference + video editing |
+
+ ## Variants
+
+ | Repo | Precision | Size per expert |
+ |---|---|---|
+ | `…-bf16` | bfloat16 | 28.6 GB |
+ | `…-int4` | 4-bit (group 64) | 8.4 GB |
+
+ Two experts (high/low-noise) + 16-ch Wan2.2 VAE (0.5 GB) + UMT5 (11 GB).
+
+ ## Usage
+
+ ```python
+ from bernini_r_mlx import pipeline_mlx as P
+
+ # text-to-video
+ P.t2v("path/to/ckpt", "a red fox in a snowy forest", num_frames=49, output_path="out.mp4")
+
+ # reference-to-video (subject consistency)
+ P.r2v("path/to/ckpt", "the fox running across a field",
+       reference_images=["fox.png"], output_path="r2v.mp4")
+
+ # video editing
+ P.v2v("path/to/ckpt", "... autumn forest ...", source_video="in.mp4", output_path="v2v.mp4")
+ ```
+
+ ## Provenance & validation
+
+ - Architecture: **stock Wan2.2-T2V-A14B** (verified — diffusers `WanTransformer3DModel` keys,
+   no extra tensors); Bernini knobs (`switch_dit_boundary 0.875`, `shift 3.0`,
+   `use_src_id_rotary_emb`) live in the wrapper config. SA-3D RoPE adds **no parameters**.
+ - Converted fp32 → bf16 from `ByteDance/Bernini-R-Diffusers`; VAE/UMT5 from `Wan-AI/Wan2.2-T2V-A14B`.
+ - Validated: SA-3D RoPE parity ~1e-7; VAE roundtrip MAD 2.1/255; multi-segment forward
+   bit-exact vs t2v; int4 per-pass cosine 0.9992 vs bf16; e2e t2v / r2v / v2v coherent.
+
+ ## License & attribution
+
+ Apache-2.0. Derived from ByteDance Bernini-R, Wan2.2 (Wan-AI), and mlx-video. See `NOTICE`.
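Note on the README's validation figures: the commit quotes a cosine similarity (0.9992, int4 vs bf16) and a mean absolute difference (MAD 2.1/255, VAE roundtrip) but does not include the scripts that produced them. The following is a minimal sketch of how such metrics are commonly computed, assuming flat sequences of floats (activations) or 8-bit pixel values. The function names are hypothetical and are not part of `bernini_r_mlx`.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equally sized flat vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def mean_abs_diff(a: Sequence[float], b: Sequence[float]) -> float:
    """Mean absolute difference; on 0..255 pixel data this is 'MAD x/255'."""
    if len(a) != len(b) or len(a) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)
```

In practice the inputs would be a flattened model output from each precision (for the cosine check) or the original and decoded frames (for the roundtrip check).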
config.json ADDED
@@ -0,0 +1,51 @@
+ {
+   "model_type": "t2v",
+   "model_version": "2.2",
+   "patch_size": [
+     1,
+     2,
+     2
+   ],
+   "text_len": 512,
+   "in_dim": 16,
+   "dim": 5120,
+   "ffn_dim": 13824,
+   "freq_dim": 256,
+   "text_dim": 4096,
+   "out_dim": 16,
+   "num_heads": 40,
+   "num_layers": 40,
+   "window_size": [
+     -1,
+     -1
+   ],
+   "qk_norm": true,
+   "cross_attn_norm": true,
+   "eps": 1e-06,
+   "vae_stride": [
+     4,
+     8,
+     8
+   ],
+   "vae_z_dim": 16,
+   "dual_model": true,
+   "boundary": 0.875,
+   "sample_shift": 3.0,
+   "sample_steps": 40,
+   "sample_guide_scale": [
+     3.0,
+     4.0
+   ],
+   "num_train_timesteps": 1000,
+   "sample_fps": 16,
+   "frame_num": 81,
+   "sample_neg_prompt": "\u8272\u8c03\u8273\u4e3d\uff0c\u8fc7\u66dd\uff0c\u9759\u6001\uff0c\u7ec6\u8282\u6a21\u7cca\u4e0d\u6e05\uff0c\u5b57\u5e55\uff0c\u98ce\u683c\uff0c\u4f5c\u54c1\uff0c\u753b\u4f5c\uff0c\u753b\u9762\uff0c\u9759\u6b62\uff0c\u6574\u4f53\u53d1\u7070\uff0c\u6700\u5dee\u8d28\u91cf\uff0c\u4f4e\u8d28\u91cf\uff0cJPEG\u538b\u7f29\u6b8b\u7559\uff0c\u4e11\u964b\u7684\uff0c\u6b8b\u7f3a\u7684\uff0c\u591a\u4f59\u7684\u624b\u6307\uff0c\u753b\u5f97\u4e0d\u597d\u7684\u624b\u90e8\uff0c\u753b\u5f97\u4e0d\u597d\u7684\u8138\u90e8\uff0c\u7578\u5f62\u7684\uff0c\u6bc1\u5bb9\u7684\uff0c\u5f62\u6001\u7578\u5f62\u7684\u80a2\u4f53\uff0c\u624b\u6307\u878d\u5408\uff0c\u9759\u6b62\u4e0d\u52a8\u7684\u753b\u9762\uff0c\u6742\u4e71\u7684\u80cc\u666f\uff0c\u4e09\u6761\u817f\uff0c\u80cc\u666f\u4eba\u5f88\u591a\uff0c\u5012\u7740\u8d70",
+   "max_area": 0,
+   "t5_vocab_size": 256384,
+   "t5_dim": 4096,
+   "t5_dim_attn": 4096,
+   "t5_dim_ffn": 10240,
+   "t5_num_heads": 64,
+   "t5_num_layers": 24,
+   "t5_num_buckets": 32
+ }
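Note on how these config values are typically consumed: the sketch below illustrates `boundary`, `sample_guide_scale`, `vae_stride`, and `patch_size`. It assumes the upstream Wan2.2 conventions, which this commit does not confirm for the MLX port: timesteps at or above `boundary * num_train_timesteps` go to the high-noise expert, `sample_guide_scale` is ordered (low-noise, high-noise), and the causal VAE maps `F` frames to `(F - 1) / 4 + 1` latent frames. `pick_expert` and `latent_grid` are hypothetical helper names.

```python
import json

# Subset of the config.json values shown above.
CONFIG = json.loads("""
{
  "patch_size": [1, 2, 2],
  "vae_stride": [4, 8, 8],
  "boundary": 0.875,
  "num_train_timesteps": 1000,
  "sample_guide_scale": [3.0, 4.0],
  "frame_num": 81
}
""")


def pick_expert(timestep: float, cfg: dict = CONFIG) -> tuple:
    """Return (expert_name, guidance_scale) for one denoising timestep.

    Assumption: Wan2.2-style routing, with guide scales ordered (low, high).
    """
    switch_at = cfg["boundary"] * cfg["num_train_timesteps"]  # 875.0 here
    low_scale, high_scale = cfg["sample_guide_scale"]
    if timestep >= switch_at:
        return "high_noise_model", high_scale
    return "low_noise_model", low_scale


def latent_grid(frames: int, height: int, width: int, cfg: dict = CONFIG) -> tuple:
    """Return ((latent_frames, latent_h, latent_w), transformer_token_count)."""
    st, sh, sw = cfg["vae_stride"]
    pt, ph, pw = cfg["patch_size"]
    if (frames - 1) % st != 0:
        raise ValueError("frames must be of the form stride * n + 1")
    if height % (sh * ph) != 0 or width % (sw * pw) != 0:
        raise ValueError("height and width must divide evenly into patches")
    latent = ((frames - 1) // st + 1, height // sh, width // sw)
    tokens = (latent[0] // pt) * (latent[1] // ph) * (latent[2] // pw)
    return latent, tokens
```

Under these assumptions, the default `frame_num` of 81 at 480x832 gives a 21x60x104 latent and 32,760 transformer tokens.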
high_noise_model.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:2dd6546d39deefa6550af38b8b5c268f4c6eacf3bc91868776be423fcd8f03f8
+ size 28577096986
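Note on the Git LFS pointers: each `.safetensors` entry in this commit is a three-line pointer (`version`, `oid`, `size`), not the weights themselves. The sketch below shows one way to check a downloaded file against such a pointer using only the standard library. `parse_lfs_pointer` and `verify_download` are hypothetical helpers, not part of any tool in this repository.

```python
import hashlib
from pathlib import Path

# Pointer text copied from high_noise_model.safetensors above.
POINTER = """\
version https://git-lfs.github.com/spec/v1
oid sha256:2dd6546d39deefa6550af38b8b5c268f4c6eacf3bc91868776be423fcd8f03f8
size 28577096986
"""


def parse_lfs_pointer(text: str) -> dict:
    """Split 'key value' lines of an LFS pointer into a small dict."""
    fields = dict(line.split(" ", 1) for line in text.strip().splitlines())
    algo, _, digest = fields["oid"].partition(":")
    if algo != "sha256":
        raise ValueError(f"unsupported oid algorithm: {algo}")
    return {"version": fields["version"], "sha256": digest, "size": int(fields["size"])}


def verify_download(path, pointer: dict, chunk_size: int = 1 << 20) -> bool:
    """Compare a local file's byte size and SHA-256 against a parsed pointer."""
    path = Path(path)
    if path.stat().st_size != pointer["size"]:
        return False  # cheap check first; avoids hashing a truncated file
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest() == pointer["sha256"]
```

Reading in 1 MiB chunks keeps memory use flat, which matters for a 28.6 GB file.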
low_noise_model.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:e2163e6677a5e8b62dd7a4ca30af4a3535a9c6c575c5cae44b6f894390336841
+ size 28577096986
t5_encoder.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:e86ee4199903e00a88dcd43583a43a6eb898cef600e38670f222d7e37d163787
+ size 11361845505
vae.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:977530e453dbfabbab31e2972e1577d8d7e2840ba7410c81aa3fd421c0cd7414
+ size 507591226
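Note on total download size: the four LFS pointers give exact byte counts, so the footprint of this commit can be summed directly and compared with the rounded figures in the README (28.6 GB per expert, 0.5 GB VAE, 11 GB UMT5). The sketch below does that arithmetic. The parameter estimate assumes the expert files are stored in bf16 at 2 bytes per weight and ignores safetensors header overhead, so it is only approximate.

```python
# Byte sizes copied from the LFS pointers in this commit.
FILE_SIZES = {
    "high_noise_model.safetensors": 28_577_096_986,
    "low_noise_model.safetensors": 28_577_096_986,
    "t5_encoder.safetensors": 11_361_845_505,
    "vae.safetensors": 507_591_226,
}


def total_gb(sizes: dict) -> float:
    """Total size in decimal gigabytes (1 GB = 1e9 bytes)."""
    return sum(sizes.values()) / 1e9


def approx_params(file_bytes: int, bytes_per_weight: int = 2) -> float:
    """Rough parameter count, assuming a fixed width per weight (bf16 = 2)."""
    return file_bytes / bytes_per_weight


if __name__ == "__main__":
    print(f"total download: {total_gb(FILE_SIZES):.1f} GB")
    expert = FILE_SIZES["high_noise_model.safetensors"]
    print(f"params per expert: {approx_params(expert) / 1e9:.2f} B")
```

The sum comes to about 69.0 GB, and each expert works out to roughly 14.3 billion parameters under the bf16 assumption, in line with the "A14B" naming.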