---
license: apache-2.0
library_name: mlx
pipeline_tag: text-to-video
tags:
  - mlx
  - text-to-video
  - video-editing
  - video-to-video
  - reference-to-video
  - wan2.2
  - bernini
base_model: ByteDance/Bernini-R-Diffusers
---

# Bernini-R (MLX)

Apple MLX port of **[ByteDance/Bernini-R](https://huggingface.co/ByteDance/Bernini-R-Diffusers)** —
the open-sourced *Renderer* of ByteDance's Bernini: a Wan2.2-T2V-A14B-derived video
generator/editor with **Segment-Aware 3D RoPE** for multi-reference / editing tasks.

Runs on Apple Silicon via [MLX](https://github.com/ml-explore/mlx) + the
[mlx-video](https://github.com/Blaizzy/mlx-video) Wan2.2 backbone.

## ⚠️ Scope: renderer only

Only the Renderer ("-R") is open-sourced upstream. The MLLM semantic **planner** (the
paper's headline "latent semantic planning", a Qwen2.5-VL-7B model) is **not released**.
This port therefore runs with **UMT5 text conditioning only** — the planner-feature
channel is absent (and carries no weights in the released checkpoint). You get the
renderer's editing / reference-to-video / subject-consistency behavior, not the full
planner-guided system.

## Tasks

| Task | Description |
|---|---|
| `t2v` / `t2i` | text-to-video / text-to-image |
| `r2v` | reference-to-video — generate a subject from up to K reference images (chained APG) |
| `v2v` | prompt-based video editing (source video injected as conditioning) |
| `rv2v` | reference + video editing |

## Variants

| Repo | Precision | Size per expert |
|---|---|---|
| `…-bf16`  | bfloat16 | 28.6 GB |
| `…-int4`  | 4-bit (group 64) | 8.4 GB |

Each variant has two experts (high-noise and low-noise), plus the 16-channel Wan2.2 VAE (0.5 GB) and the UMT5 text encoder (11 GB).
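To estimate the total download per variant, the sizes above can be summed as in the sketch below. It assumes the VAE and UMT5 sizes are the same for both variants, which this card does not state explicitly; the function and constant names are illustrative only.

```python
# Per-expert sizes from the Variants table (GB).
EXPERT_GB = {"bf16": 28.6, "int4": 8.4}
NUM_EXPERTS = 2   # high-noise + low-noise
VAE_GB = 0.5      # 16-channel Wan2.2 VAE
UMT5_GB = 11.0    # assumption: UMT5 is the same size in every variant


def total_footprint_gb(variant: str) -> float:
    """Approximate total size in GB for one variant (hypothetical helper)."""
    total = NUM_EXPERTS * EXPERT_GB[variant] + VAE_GB + UMT5_GB
    return round(total, 1)


if __name__ == "__main__":
    for name in EXPERT_GB:
        print(f"{name}: ~{total_footprint_gb(name)} GB")
```

Under that assumption the bf16 variant comes to roughly 68.7 GB and the int4 variant to roughly 28.3 GB. Peak memory at run time may differ, for example if only one expert is resident at a time.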

## Usage

```python
from bernini_r_mlx import pipeline_mlx as P

# text-to-video
P.t2v("path/to/ckpt", "a red fox in a snowy forest", num_frames=49, output_path="out.mp4")

# reference-to-video (subject consistency)
P.r2v("path/to/ckpt", "the fox running across a field",
      reference_images=["fox.png"], output_path="r2v.mp4")

# video editing
P.v2v("path/to/ckpt", "... autumn forest ...", source_video="in.mp4", output_path="v2v.mp4")
```
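The `num_frames=49` above matches the upstream Wan convention of frame counts of the form 4k + 1, which follows from the VAE's 4x temporal compression. Whether this port enforces or auto-corrects that is not documented here, so the helper below is only a sketch for picking a safe value before calling the pipeline; both function names are hypothetical and not part of `bernini_r_mlx`.

```python
def snap_num_frames(requested: int, temporal_stride: int = 4) -> int:
    """Return the largest frame count of the form stride*k + 1 that is <= requested."""
    if requested < 1:
        raise ValueError("requested must be at least 1")
    return (requested - 1) // temporal_stride * temporal_stride + 1


def latent_frames(num_frames: int, temporal_stride: int = 4) -> int:
    """Number of latent frames, assuming the first frame is encoded on its own."""
    return (num_frames - 1) // temporal_stride + 1


if __name__ == "__main__":
    n = snap_num_frames(50)
    print(n, latent_frames(n))
```

For example, a request for 50 frames snaps down to 49, which corresponds to 13 latent frames under this assumption.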

## Provenance & validation

- Architecture: **stock Wan2.2-T2V-A14B** (verified — diffusers `WanTransformer3DModel` keys,
  no extra tensors); Bernini knobs (`switch_dit_boundary 0.875`, `shift 3.0`,
  `use_src_id_rotary_emb`) live in the wrapper config. SA-3D RoPE adds **no parameters**.
- Converted fp32 → bf16 from `ByteDance/Bernini-R-Diffusers`; VAE/UMT5 from `Wan-AI/Wan2.2-T2V-A14B`.
- Validated: SA-3D RoPE parity ~1e-7; VAE roundtrip MAD 2.1/255; multi-segment forward
  bit-exact vs t2v; int4 per-pass cosine 0.9992 vs bf16; e2e t2v / r2v / v2v coherent.
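To illustrate how `shift 3.0` and `switch_dit_boundary 0.875` could interact, here is a minimal sketch that assumes the port follows the upstream Wan2.2 convention: sigmas are warped by the flow-matching time shift, and the high-noise expert handles every step whose normalized timestep is at or above the boundary. The function names are hypothetical, and the real scheduler in this port may build its sigma grid differently.

```python
def shifted_sigmas(num_steps: int, shift: float = 3.0) -> list[float]:
    """Linear sigmas from 1.0 downward, warped by sigma' = shift*s / (1 + (shift-1)*s)."""
    linear = [1.0 - i / num_steps for i in range(num_steps)]
    return [shift * s / (1.0 + (shift - 1.0) * s) for s in linear]


def pick_expert(sigma: float, boundary: float = 0.875) -> str:
    """Assumed rule: high-noise expert at or above the boundary, low-noise below it."""
    return "high_noise" if sigma >= boundary else "low_noise"


if __name__ == "__main__":
    for step, sigma in enumerate(shifted_sigmas(num_steps=40)):
        print(step, round(sigma, 4), pick_expert(sigma))
```

With these values the shifted boundary of 0.875 corresponds to an unshifted sigma of 0.7, so under this sketch roughly the first 30% of a linear schedule would run on the high-noise expert and the rest on the low-noise one.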

## License & attribution

Apache-2.0. Derived from ByteDance Bernini-R, Wan2.2 (Wan-AI), and mlx-video. See `NOTICE`.