Add Bernini-R MLX weights (renderer-only; Wan2.2-A14B derived)
- NOTICE +22 -0
- README.md +79 -0
- config.json +55 -0
- high_noise_model.safetensors +3 -0
- low_noise_model.safetensors +3 -0
- t5_encoder.safetensors +3 -0
- vae.safetensors +3 -0
NOTICE
ADDED
@@ -0,0 +1,22 @@
+bernini-r-mlx
+Apple MLX port of ByteDance Bernini-R (the Bernini Renderer).
+
+This work is licensed under the Apache License, Version 2.0.
+
+It is derived from and depends on the following Apache-2.0 works; their notices
+and attributions are retained here:
+
+- ByteDance/Bernini-R: the Bernini Renderer (weights + reference inference code).
+  https://github.com/bytedance/Bernini · https://huggingface.co/ByteDance/Bernini-R-Diffusers
+  Paper: "Bernini: Latent Semantic Planning for Video Diffusion" (arXiv:2605.22344).
+
+- Wan-AI/Wan2.2-T2V-A14B: the base DiT, 16-channel causal VAE, and UMT5 text encoder
+  that Bernini-R fine-tunes / reuses. https://github.com/Wan-Video/Wan2.2
+
+- Qwen2.5-VL-7B-Instruct: the Bernini *planner* (NOT used here; not released as weights).
+
+- mlx-video (Blaizzy/mlx-video): the MLX Wan2.2 backbone reused by this port.
+
+Scope note: only the Bernini *Renderer* is open-sourced upstream. The MLLM semantic
+planner (the paper's "latent semantic planning") is not released, so this port runs with
+UMT5 text conditioning only; the planner-feature channel is absent.
README.md
ADDED
@@ -0,0 +1,79 @@
+---
+license: apache-2.0
+library_name: mlx
+pipeline_tag: text-to-video
+tags:
+- mlx
+- text-to-video
+- video-editing
+- video-to-video
+- reference-to-video
+- wan2.2
+- bernini
+base_model: ByteDance/Bernini-R-Diffusers
+---
+
+# Bernini-R (MLX)
+
+Apple MLX port of **[ByteDance/Bernini-R](https://huggingface.co/ByteDance/Bernini-R-Diffusers)**,
+the open-sourced *Renderer* of ByteDance's Bernini: a Wan2.2-T2V-A14B-derived video
+generator/editor with **Segment-Aware 3D RoPE** for multi-reference / editing tasks.
+
+Runs on Apple Silicon via [MLX](https://github.com/ml-explore/mlx) + the
+[mlx-video](https://github.com/Blaizzy/mlx-video) Wan2.2 backbone.
+
+## ⚠️ Scope: renderer only
+
+Only the Renderer ("-R") is open-sourced upstream. The MLLM semantic **planner** (the
+paper's headline "latent semantic planning", a Qwen2.5-VL-7B model) is **not released**.
+This port therefore runs with **UMT5 text conditioning only**: the planner-feature
+channel is absent (and carries no weights in the released checkpoint). You get the
+renderer's editing / reference-to-video / subject-consistency behavior, not the full
+planner-guided system.
+
+## Tasks
+
+| Task | Description |
+|---|---|
+| `t2v` / `t2i` | text-to-video / text-to-image |
+| `r2v` | reference-to-video: generate a subject from up to K reference images (chained APG) |
+| `v2v` | prompt-based video editing (source video injected as conditioning) |
+| `rv2v` | reference + video editing |
+
+## Variants
+
+| Repo | Precision | Size per expert |
+|---|---|---|
+| `…-bf16` | bfloat16 | 28.6 GB |
+| `…-int4` | 4-bit (group 64) | 8.4 GB |
+
+Two experts (high/low-noise) + 16-ch Wan2.2 VAE (0.5 GB) + UMT5 (11 GB).
+
+## Usage
+
+```python
+from bernini_r_mlx import pipeline_mlx as P
+
+# text-to-video
+P.t2v("path/to/ckpt", "a red fox in a snowy forest", num_frames=49, output_path="out.mp4")
+
+# reference-to-video (subject consistency)
+P.r2v("path/to/ckpt", "the fox running across a field",
+      reference_images=["fox.png"], output_path="r2v.mp4")
+
+# video editing
+P.v2v("path/to/ckpt", "... autumn forest ...", source_video="in.mp4", output_path="v2v.mp4")
+```
+
+## Provenance & validation
+
+- Architecture: **stock Wan2.2-T2V-A14B** (verified: diffusers `WanTransformer3DModel` keys,
+  no extra tensors); Bernini knobs (`switch_dit_boundary 0.875`, `shift 3.0`,
+  `use_src_id_rotary_emb`) live in the wrapper config. SA-3D RoPE adds **no parameters**.
+- Converted fp32 → bf16 from `ByteDance/Bernini-R-Diffusers`; VAE/UMT5 from `Wan-AI/Wan2.2-T2V-A14B`.
+- Validated: SA-3D RoPE parity ~1e-7; VAE roundtrip MAD 2.1/255; multi-segment forward
+  bit-exact vs t2v; int4 per-pass cosine 0.9992 vs bf16; e2e t2v / r2v / v2v coherent.
+
+## License & attribution
+
+Apache-2.0. Derived from ByteDance Bernini-R, Wan2.2 (Wan-AI), and mlx-video. See `NOTICE`.
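The Variants table in the README lists 28.6 GB per expert in bfloat16 and 8.4 GB at 4-bit with group size 64. A rough sanity check of that ratio is sketched below. It assumes that group-wise quantization stores one 16-bit scale and one 16-bit bias per group of 64 weights (this storage layout is an assumption, not something the README states), which gives an effective 4.5 bits per weight. The helper names are hypothetical and are not part of `bernini_r_mlx`.

```python
def effective_bits_per_weight(bits: int, group_size: int, meta_bits: int = 32) -> float:
    """Bits per weight including per-group metadata.

    meta_bits=32 assumes one 16-bit scale plus one 16-bit bias per group
    (an assumption about the quantized storage layout).
    """
    return bits + meta_bits / group_size


def estimate_quantized_gb(bf16_gb: float, bits: int, group_size: int) -> float:
    """Scale a bfloat16 size (16 bits per weight) to the quantized size."""
    return bf16_gb * effective_bits_per_weight(bits, group_size) / 16.0


bpw = effective_bits_per_weight(bits=4, group_size=64)
estimate = estimate_quantized_gb(bf16_gb=28.6, bits=4, group_size=64)
print(f"{bpw} bits/weight, about {estimate:.2f} GB per expert")
```

The estimate lands near 8.0 GB, slightly under the 8.4 GB reported per expert. One plausible explanation, not confirmed by the source, is that some tensors (norms, embeddings, small projections) are kept at higher precision.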
config.json
ADDED
@@ -0,0 +1,55 @@
+{
+  "model_type": "t2v",
+  "model_version": "2.2",
+  "patch_size": [
+    1,
+    2,
+    2
+  ],
+  "text_len": 512,
+  "in_dim": 16,
+  "dim": 5120,
+  "ffn_dim": 13824,
+  "freq_dim": 256,
+  "text_dim": 4096,
+  "out_dim": 16,
+  "num_heads": 40,
+  "num_layers": 40,
+  "window_size": [
+    -1,
+    -1
+  ],
+  "qk_norm": true,
+  "cross_attn_norm": true,
+  "eps": 1e-06,
+  "vae_stride": [
+    4,
+    8,
+    8
+  ],
+  "vae_z_dim": 16,
+  "dual_model": true,
+  "boundary": 0.875,
+  "sample_shift": 3.0,
+  "sample_steps": 40,
+  "sample_guide_scale": [
+    3.0,
+    4.0
+  ],
+  "num_train_timesteps": 1000,
+  "sample_fps": 16,
+  "frame_num": 81,
+  "sample_neg_prompt": "\u8272\u8c03\u8273\u4e3d\uff0c\u8fc7\u66dd\uff0c\u9759\u6001\uff0c\u7ec6\u8282\u6a21\u7cca\u4e0d\u6e05\uff0c\u5b57\u5e55\uff0c\u98ce\u683c\uff0c\u4f5c\u54c1\uff0c\u753b\u4f5c\uff0c\u753b\u9762\uff0c\u9759\u6b62\uff0c\u6574\u4f53\u53d1\u7070\uff0c\u6700\u5dee\u8d28\u91cf\uff0c\u4f4e\u8d28\u91cf\uff0cJPEG\u538b\u7f29\u6b8b\u7559\uff0c\u4e11\u964b\u7684\uff0c\u6b8b\u7f3a\u7684\uff0c\u591a\u4f59\u7684\u624b\u6307\uff0c\u753b\u5f97\u4e0d\u597d\u7684\u624b\u90e8\uff0c\u753b\u5f97\u4e0d\u597d\u7684\u8138\u90e8\uff0c\u7578\u5f62\u7684\uff0c\u6bc1\u5bb9\u7684\uff0c\u5f62\u6001\u7578\u5f62\u7684\u80a2\u4f53\uff0c\u624b\u6307\u878d\u5408\uff0c\u9759\u6b62\u4e0d\u52a8\u7684\u753b\u9762\uff0c\u6742\u4e71\u7684\u80cc\u666f\uff0c\u4e09\u6761\u817f\uff0c\u80cc\u666f\u4eba\u5f88\u591a\uff0c\u5012\u7740\u8d70",
+  "max_area": 0,
+  "t5_vocab_size": 256384,
+  "t5_dim": 4096,
+  "t5_dim_attn": 4096,
+  "t5_dim_ffn": 10240,
+  "t5_num_heads": 64,
+  "t5_num_layers": 24,
+  "t5_num_buckets": 32,
+  "quantization": {
+    "group_size": 64,
+    "bits": 4
+  }
+}
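Several fields in `config.json` only make sense in combination. The sketch below shows how `vae_stride`, `patch_size`, `boundary`, `num_train_timesteps`, and `sample_guide_scale` would interact under the conventions of the upstream Wan2.2 reference code: latent frames are `(frames - 1) // 4 + 1`, timesteps at or above `boundary * num_train_timesteps` go to the high-noise expert, and the guide-scale pair is ordered (low-noise, high-noise). Whether this MLX port follows exactly those conventions is an assumption, the 480x832 resolution is purely illustrative, and the function names are hypothetical.

```python
# Subset of the config.json values shown in the diff above.
CONFIG = {
    "patch_size": [1, 2, 2],
    "vae_stride": [4, 8, 8],
    "boundary": 0.875,
    "num_train_timesteps": 1000,
    "sample_guide_scale": [3.0, 4.0],
    "frame_num": 81,
}


def latent_grid(cfg, height, width, frames=None):
    """Latent (T, H, W) after the causal VAE (assumed Wan2.2 convention)."""
    frames = cfg["frame_num"] if frames is None else frames
    stride_t, stride_h, stride_w = cfg["vae_stride"]
    return ((frames - 1) // stride_t + 1, height // stride_h, width // stride_w)


def token_count(cfg, height, width, frames=None):
    """Number of DiT tokens after patchifying the latent grid."""
    t, h, w = latent_grid(cfg, height, width, frames)
    patch_t, patch_h, patch_w = cfg["patch_size"]
    return (t // patch_t) * (h // patch_h) * (w // patch_w)


def pick_expert(cfg, timestep):
    """Return (expert file stem, guidance scale) for a timestep in [0, 1000)."""
    threshold = cfg["boundary"] * cfg["num_train_timesteps"]  # 875.0
    low_scale, high_scale = cfg["sample_guide_scale"]  # assumed (low, high) order
    if timestep >= threshold:
        return "high_noise_model", high_scale
    return "low_noise_model", low_scale


print(latent_grid(CONFIG, 480, 832))   # (21, 60, 104)
print(token_count(CONFIG, 480, 832))   # 32760
print(pick_expert(CONFIG, 900))        # ('high_noise_model', 4.0)
print(pick_expert(CONFIG, 500))        # ('low_noise_model', 3.0)
```

Under these assumptions only the first eighth of the timestep range runs through `high_noise_model.safetensors`. The number of sampling steps that fall there also depends on `sample_shift`, which this sketch does not model.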
high_noise_model.safetensors
ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:328e793ce1e016bc9594584fe2803badf2e1923dff94d55d32dc7aa1528fc199
+size 8379507248
low_noise_model.safetensors
ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:903e9a200b5da99ceb5d3020297eb8af937b7a578146289e38c6f25349d69def
+size 8379507248
t5_encoder.safetensors
ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:e86ee4199903e00a88dcd43583a43a6eb898cef600e38670f222d7e37d163787
+size 11361845505
vae.safetensors
ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:977530e453dbfabbab31e2972e1577d8d7e2840ba7410c81aa3fd421c0cd7414
+size 507591226
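The four `.safetensors` entries in this commit are Git LFS pointer files: each records a spec version, a SHA-256 object ID, and a byte size instead of the weights themselves. A downloaded file can be checked against its pointer with the standard library alone. The sketch below is a minimal illustration; `parse_lfs_pointer` and `matches_pointer` are hypothetical helper names, and the local file path in the usage note is an assumption.

```python
import hashlib
from pathlib import Path


def parse_lfs_pointer(text: str) -> dict:
    """Parse the 'key value' lines of a Git LFS pointer file."""
    fields = dict(line.split(" ", 1) for line in text.strip().splitlines())
    algo, digest = fields["oid"].split(":", 1)
    return {
        "version": fields["version"],
        "algo": algo,
        "digest": digest,
        "size": int(fields["size"]),
    }


def matches_pointer(path, pointer: dict, chunk_size: int = 1 << 20) -> bool:
    """True if the file at `path` has the pointer's size and hash."""
    path = Path(path)
    if path.stat().st_size != pointer["size"]:
        return False  # cheap check first, avoids hashing a truncated download
    hasher = hashlib.new(pointer["algo"])
    with path.open("rb") as handle:
        # Stream in chunks so multi-gigabyte weights never sit in memory at once.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            hasher.update(chunk)
    return hasher.hexdigest() == pointer["digest"]


# Pointer contents copied from the vae.safetensors diff above.
VAE_POINTER = """version https://git-lfs.github.com/spec/v1
oid sha256:977530e453dbfabbab31e2972e1577d8d7e2840ba7410c81aa3fd421c0cd7414
size 507591226
"""

pointer = parse_lfs_pointer(VAE_POINTER)
print(pointer["algo"], pointer["size"])
```

With the weights downloaded to a local directory, `matches_pointer("Bernini-R-int4/vae.safetensors", pointer)` would report whether the file is complete and uncorrupted.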