---
license: apache-2.0
library_name: mlx
pipeline_tag: text-to-video
tags:
- mlx
- text-to-video
- video-editing
- video-to-video
- reference-to-video
- wan2.1
- bernini
base_model: ByteDance/Bernini-R-1.3B-Diffusers
---

# Bernini-R-1.3B (MLX)

Apple MLX port of **[ByteDance/Bernini-R-1.3B](https://huggingface.co/ByteDance/Bernini-R-1.3B-Diffusers)**, the **1.3B** tier of ByteDance's Bernini *Renderer*: a Wan2.1-T2V-1.3B-derived video generator/editor with **Segment-Aware 3D RoPE** for multi-reference / editing tasks. The small tier "performs close to the 14B variant on simple tasks such as style transfer, subtitle or watermark removal, and local editing."

Runs on Apple Silicon via [MLX](https://github.com/ml-explore/mlx) + the [mlx-video](https://github.com/Blaizzy/mlx-video) Wan backbone.

> **This is the lowest-cost Bernini tier.** For the higher-quality A14B renderer see
> [`mlx-community/Bernini-R-bf16`](https://huggingface.co/mlx-community/Bernini-R-bf16) /
> [`-int4`](https://huggingface.co/mlx-community/Bernini-R-int4).

## ⚠️ Scope: renderer only

Only the Renderer ("-R") is open-sourced upstream. The MLLM semantic **planner** (the paper's headline "latent semantic planning", a Qwen2.5-VL-7B model) is **not released**. This port therefore runs with **UMT5 text conditioning only**: the planner-feature channel is absent (and carries no weights in the released checkpoint).

## Architecture

| | |
|---|---|
| Backbone | **Wan2.1-T2V-1.3B**, single expert (30L · dim 1536 · 12H · ffn 8960) |
| Experts | **one** (`skip_transformer_2: true`, `switch_dit_boundary: 0`) |
| VAE | 16-ch `AutoencoderKLWan` (stock Wan2.1) |
| Text encoder | UMT5-xxl |
| Bernini knobs | `shift 3.0`, `use_src_id_rotary_emb` (SA-3D RoPE, **no extra parameters**) |

Differs from the A14B port only by config: a single 1.3B expert instead of the high/low-noise A14B pair. There is no expert-boundary switch.

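
As a rough sanity check on the backbone shape above (30L · dim 1536 · 12H · ffn 8960), the sketch below tallies only the large weight matrices in each block. It assumes every block holds one self-attention and one cross-attention, each with four `dim × dim` projections, plus a two-matrix feed-forward network. Biases, norms, modulation tensors, and embeddings are ignored, and the `BackboneShape` class is a hypothetical helper, not part of `mlx-video`.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BackboneShape:
    """Hypothetical helper holding the shape values from the Architecture table."""
    layers: int = 30
    dim: int = 1536
    heads: int = 12
    ffn_dim: int = 8960

    @property
    def head_dim(self) -> int:
        # Per-head width, assuming dim splits evenly across heads.
        return self.dim // self.heads

    def block_matrix_params(self) -> int:
        # Assumption: q, k, v, o projections, each dim x dim.
        attention = 4 * self.dim * self.dim
        # Assumption: one up projection and one down projection.
        ffn = 2 * self.dim * self.ffn_dim
        # Assumption: self-attention + cross-attention + FFN per block.
        return 2 * attention + ffn

    def total_matrix_params(self) -> int:
        return self.layers * self.block_matrix_params()


shape = BackboneShape()
params = shape.total_matrix_params()
print(f"head_dim = {shape.head_dim}")
print(f"approx. block matrix params = {params / 1e9:.2f}B")
```

Under these assumptions the tally comes out on the order of 1.4 billion, in line with the "1.3B" tier name. It is an estimate only, not a count of the released checkpoint's tensors.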
## Tasks

| Task | Description |
|---|---|
| `t2v` / `t2i` | text-to-video / text-to-image |
| `r2v` | reference-to-video: generate a subject from up to K reference images (chained APG) |
| `v2v` | prompt-based video editing (source video injected as conditioning) |
| `rv2v` | reference + video editing |

## Variants

| Repo | Precision | Transformer | Shared components |
|---|---|---|---|
| `…-1.3B-bf16` | bfloat16 | 2.6 GB | VAE 0.5 GB · UMT5 10.6 GB |
| `…-1.3B-int4` | 4-bit (group 64) | 0.8 GB | VAE 0.5 GB · UMT5 10.6 GB |

## Provenance & validation

- Architecture: **stock Wan2.1-T2V-1.3B** (verified: diffusers `WanTransformer3DModel` keys, no extra tensors; matches `mlx-video` `wan21_t2v_1_3b` exactly: 30L/1536/12H/8960). Bernini knobs (`switch_dit_boundary 0`, `shift 3.0`, `use_src_id_rotary_emb`) live in the wrapper config; SA-3D RoPE adds **no parameters**.
- Converted fp32 → bf16 from `ByteDance/Bernini-R-1.3B-Diffusers`; VAE/UMT5 are the shared stock Wan2.1 components (byte-identical to the A14B port).
- Validated on the CPU stream: key contract bijective (825 file tensors = model params minus the derived `freqs` rope buffer); **bf16 forward cosine 0.999983 vs source fp32**; int4 per-pass cosine **0.9943** vs bf16 (group 64, ≥0.99 gate).

## License & attribution

Apache-2.0. Derived from ByteDance Bernini-R, Wan2.1 (Wan-AI), and mlx-video. See `NOTICE`.
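
## Example: a cosine gate on group-wise 4-bit quantization

The validation notes quote a ≥0.99 cosine gate for the int4 (group 64) variant. The sketch below shows what such a gate can look like in plain Python. It is illustrative only: it quantizes a synthetic Gaussian vector rather than comparing model forward passes, and it uses a generic per-group affine (scale + offset) scheme that is assumed here, not taken from the conversion scripts. The names `fake_quantize` and `cosine` are hypothetical.

```python
import math
import random

GROUP_SIZE = 64  # group size quoted in the Variants table
BITS = 4         # 4-bit precision quoted in the Variants table
GATE = 0.99      # cosine gate quoted in the validation notes


def cosine(a, b):
    """Cosine similarity between two equal-length sequences of floats."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def fake_quantize(values, group_size=GROUP_SIZE, bits=BITS):
    """Quantize then dequantize with an assumed per-group affine scheme."""
    levels = (1 << bits) - 1  # 15 steps between the group min and max
    out = []
    for start in range(0, len(values), group_size):
        group = values[start:start + group_size]
        lo, hi = min(group), max(group)
        scale = (hi - lo) / levels or 1.0  # guard against a constant group
        out.extend(lo + round((v - lo) / scale) * scale for v in group)
    return out


random.seed(0)
# Synthetic stand-in for a weight tensor: 64 groups of 64 Gaussian values.
weights = [random.gauss(0.0, 1.0) for _ in range(64 * GROUP_SIZE)]
restored = fake_quantize(weights)
similarity = cosine(weights, restored)
passes_gate = similarity >= GATE
print(f"cosine={similarity:.4f} passes_gate={passes_gate}")
```

A real check would compare the outputs of the bf16 and int4 models on the same inputs instead of a synthetic vector; the gate logic itself stays the same.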