---
license: apache-2.0
library_name: mlx
pipeline_tag: text-to-video
tags:
- mlx
- text-to-video
- video-editing
- video-to-video
- reference-to-video
- wan2.1
- bernini
base_model: ByteDance/Bernini-R-1.3B-Diffusers
---
# Bernini-R-1.3B (MLX)
Apple MLX port of **[ByteDance/Bernini-R-1.3B](https://huggingface.co/ByteDance/Bernini-R-1.3B-Diffusers)**,
the **1.3B** tier of ByteDance's Bernini *Renderer*: a Wan2.1-T2V-1.3B-derived video
generator/editor with **Segment-Aware 3D RoPE** for multi-reference / editing tasks.
The small tier "performs close to the 14B variant on simple tasks such as style transfer,
subtitle or watermark removal, and local editing."
Runs on Apple Silicon via [MLX](https://github.com/ml-explore/mlx) + the
[mlx-video](https://github.com/Blaizzy/mlx-video) Wan backbone.
> **This is the lowest-cost Bernini tier.** For the higher-quality A14B renderer see
> [`mlx-community/Bernini-R-bf16`](https://huggingface.co/mlx-community/Bernini-R-bf16) /
> [`-int4`](https://huggingface.co/mlx-community/Bernini-R-int4).
## ⚠️ Scope: renderer only
Only the Renderer ("-R") is open-sourced upstream. The MLLM semantic **planner** (the
paper's headline "latent semantic planning", a Qwen2.5-VL-7B model) is **not released**.
This port therefore runs with **UMT5 text conditioning only**: the planner-feature
channel is absent, and the released checkpoint carries no weights for it.
## Architecture
| Component | Details |
|---|---|
| Backbone | **Wan2.1-T2V-1.3B**, single expert (30L · dim 1536 · 12H · ffn 8960) |
| Experts | **one** (`skip_transformer_2: true`, `switch_dit_boundary: 0`) |
| VAE | 16-ch `AutoencoderKLWan` (stock Wan2.1) |
| Text encoder | UMT5-xxl |
| Bernini knobs | `shift 3.0`, `use_src_id_rotary_emb` (SA-3D RoPE — **no extra parameters**) |

Differs from the A14B port only by config: it uses a single 1.3B expert instead of the
high/low-noise A14B pair, so there is no expert-boundary switch.
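
As a rough illustration, the values in the table can be collected into a small config object. The field names `num_layers`, `dim`, `num_heads`, and `ffn_dim` are hypothetical; only `skip_transformer_2`, `switch_dit_boundary`, `shift`, and `use_src_id_rotary_emb` are named in this card. The `shifted_sigma` helper assumes the timestep-shift formula commonly used by flow-matching schedulers; whether the Bernini wrapper applies exactly this form is not confirmed here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RendererConfig:
    # Backbone shape from the architecture table (field names are hypothetical).
    num_layers: int = 30
    dim: int = 1536
    num_heads: int = 12
    ffn_dim: int = 8960
    # Knobs named in this card.
    skip_transformer_2: bool = True      # only one expert is present
    switch_dit_boundary: float = 0.0     # no high/low-noise expert switch
    shift: float = 3.0
    use_src_id_rotary_emb: bool = True   # SA-3D RoPE, adds no parameters

    @property
    def head_dim(self) -> int:
        # Per-head width implied by dim and head count.
        return self.dim // self.num_heads


def shifted_sigma(sigma: float, shift: float) -> float:
    # Assumed flow-matching timestep shift: maps sigma in [0, 1] to [0, 1],
    # spending more of the schedule at high noise when shift > 1.
    return shift * sigma / (1.0 + (shift - 1.0) * sigma)


cfg = RendererConfig()
print(cfg.head_dim)                   # 128
print(shifted_sigma(0.5, cfg.shift))  # 0.75
```

With `shift = 3.0`, the midpoint of a uniform schedule moves from 0.5 to 0.75, while the endpoints 0 and 1 stay fixed.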
## Tasks
| Task | Description |
|---|---|
| `t2v` / `t2i` | text-to-video / text-to-image |
| `r2v` | reference-to-video — generate a subject from up to K reference images (chained APG) |
| `v2v` | prompt-based video editing (source video injected as conditioning) |
| `rv2v` | reference + video editing |
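
The task table implies which conditioning inputs each mode consumes. The sketch below encodes that reading as a lookup plus a small input check. `TASK_INPUTS` and `check_inputs` are hypothetical names for illustration, not part of this port or of mlx-video, and treating the inputs as strictly required or forbidden is an assumption.

```python
# Hypothetical mapping from task name to the conditioning inputs it uses,
# read off the task table.
TASK_INPUTS = {
    "t2v": {"references": False, "source_video": False},
    "t2i": {"references": False, "source_video": False},
    "r2v": {"references": True, "source_video": False},
    "v2v": {"references": False, "source_video": True},
    "rv2v": {"references": True, "source_video": True},
}


def check_inputs(task, num_references=0, has_source_video=False):
    """Return a list of problems; an empty list means the inputs fit the task."""
    if task not in TASK_INPUTS:
        return [f"unknown task: {task}"]
    needs = TASK_INPUTS[task]
    problems = []
    if needs["references"] and num_references == 0:
        problems.append(f"{task} expects at least one reference image")
    if not needs["references"] and num_references > 0:
        problems.append(f"{task} does not use reference images")
    if needs["source_video"] and not has_source_video:
        problems.append(f"{task} expects a source video")
    if not needs["source_video"] and has_source_video:
        problems.append(f"{task} does not use a source video")
    return problems


print(check_inputs("rv2v", num_references=2, has_source_video=True))  # []
print(check_inputs("v2v"))  # ['v2v expects a source video']
```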
## Variants
| Repo | Precision | Transformer size | Shared components |
|---|---|---|---|
| `…-1.3B-bf16` | bfloat16 | 2.6 GB | VAE 0.5 GB · UMT5 10.6 GB |
| `…-1.3B-int4` | 4-bit (group 64) | 0.8 GB | VAE 0.5 GB · UMT5 10.6 GB |
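
The transformer sizes can be sanity-checked with back-of-the-envelope arithmetic. The sketch below assumes a nominal 1.3e9 parameters and, for the 4-bit variant, that each group of 64 weights also stores one 16-bit scale and one 16-bit bias; both are assumptions rather than figures taken from the checkpoint.

```python
def weight_gb(n_params: float, bits_per_weight: float) -> float:
    # Storage in decimal gigabytes for n_params weights.
    return n_params * bits_per_weight / 8 / 1e9


N_PARAMS = 1.3e9   # nominal parameter count (assumption)
GROUP_SIZE = 64    # quantization group size from the table

bf16_gb = weight_gb(N_PARAMS, 16)

# Assumed per-group overhead: one 16-bit scale plus one 16-bit bias.
int4_bits = 4 + (16 + 16) / GROUP_SIZE
int4_gb = weight_gb(N_PARAMS, int4_bits)

print(round(bf16_gb, 2))  # 2.6
print(int4_bits)          # 4.5
print(round(int4_gb, 2))  # 0.73
```

The bf16 estimate matches the 2.6 GB in the table. The 4-bit estimate lands slightly below the listed 0.8 GB, which would be consistent with some tensors being kept at higher precision, though this card does not say which ones.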
## Provenance & validation
- Architecture: **stock Wan2.1-T2V-1.3B** (verified — diffusers `WanTransformer3DModel`
keys, no extra tensors; matches `mlx-video` `wan21_t2v_1_3b` exactly: 30L/1536/12H/8960).
Bernini knobs (`switch_dit_boundary 0`, `shift 3.0`, `use_src_id_rotary_emb`) live in
the wrapper config; SA-3D RoPE adds **no parameters**.
- Converted fp32 → bf16 from `ByteDance/Bernini-R-1.3B-Diffusers`; VAE/UMT5 are the shared
stock Wan2.1 components (byte-identical to the A14B port).
- Validated on the CPU stream: key contract bijective (825 file tensors = model params
minus the derived `freqs` rope buffer); **bf16 forward cosine 0.999983 vs source fp32**;
int4 per-pass cosine **0.9943** vs bf16 (group 64, ≥0.99 gate).
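
The two validation checks above can be sketched in plain Python. This is a minimal illustration, not the script used for the port: `cosine_similarity`, `check_key_contract`, and the toy keys are hypothetical, and the real comparison would run over flattened model outputs and the actual checkpoint keys.

```python
import math


def cosine_similarity(a, b):
    # Cosine of the angle between two equal-length, flattened outputs.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def check_key_contract(file_keys, model_keys, derived=("freqs",)):
    # The file should hold every model parameter except derived buffers,
    # and nothing else. Returns (missing, unexpected) key sets.
    expected = {k for k in model_keys if k not in derived}
    found = set(file_keys)
    return expected - found, found - expected


def passes_gate(cosine, threshold=0.99):
    # Gate of the kind quoted above for the int4 comparison.
    return cosine >= threshold


# Toy example with made-up keys.
missing, unexpected = check_key_contract(
    file_keys=["blocks.0.attn.q.weight", "head.weight"],
    model_keys=["blocks.0.attn.q.weight", "head.weight", "freqs"],
)
print(missing, unexpected)  # set() set()
print(passes_gate(0.9943))  # True
```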
## License & attribution
Apache-2.0. Derived from ByteDance Bernini-R, Wan2.1 (Wan-AI), and mlx-video. See `NOTICE`.