How to use mlx-community/Bernini-R-1.3B-bf16 with MLX:

```shell
# Download the model from the Hub
# (the extra is quoted so that zsh does not treat the brackets as a glob)
pip install "huggingface_hub[hf_xet]"
huggingface-cli download --local-dir Bernini-R-1.3B-bf16 mlx-community/Bernini-R-1.3B-bf16
```
---
license: apache-2.0
library_name: mlx
pipeline_tag: text-to-video
tags:
- mlx
- text-to-video
- video-editing
- video-to-video
- reference-to-video
- wan2.1
- bernini
base_model: ByteDance/Bernini-R-1.3B-Diffusers
---
# Bernini-R-1.3B (MLX)

Apple MLX port of **[ByteDance/Bernini-R-1.3B](https://huggingface.co/ByteDance/Bernini-R-1.3B-Diffusers)**:
the **1.3B** tier of ByteDance's Bernini *Renderer*, a Wan2.1-T2V-1.3B-derived video
generator/editor with **Segment-Aware 3D RoPE** for multi-reference and editing tasks.
The small tier "performs close to the 14B variant on simple tasks such as style transfer,
subtitle or watermark removal, and local editing."

Runs on Apple Silicon via [MLX](https://github.com/ml-explore/mlx) and the
[mlx-video](https://github.com/Blaizzy/mlx-video) Wan backbone.

> **This is the lowest-cost Bernini tier.** For the higher-quality A14B renderer see
> [`mlx-community/Bernini-R-bf16`](https://huggingface.co/mlx-community/Bernini-R-bf16) /
> [`-int4`](https://huggingface.co/mlx-community/Bernini-R-int4).
## ⚠️ Scope: renderer only

Only the Renderer ("-R") is open-sourced upstream. The MLLM semantic **planner** (the
paper's headline "latent semantic planning", a Qwen2.5-VL-7B model) is **not released**.
This port therefore runs with **UMT5 text conditioning only**: the planner-feature
channel is absent (and carries no weights in the released checkpoint).
## Architecture

| Component | Details |
|---|---|
| Backbone | **Wan2.1-T2V-1.3B**, single expert (30 layers · dim 1536 · 12 heads · FFN dim 8960) |
| Experts | **one** (`skip_transformer_2: true`, `switch_dit_boundary: 0`) |
| VAE | 16-channel `AutoencoderKLWan` (stock Wan2.1) |
| Text encoder | UMT5-xxl |
| Bernini knobs | `shift 3.0`, `use_src_id_rotary_emb` (SA-3D RoPE, which adds **no extra parameters**) |

Differs from the A14B port only by config: a single 1.3B expert instead of the
high/low-noise A14B pair. There is no expert-boundary switch.
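
As a rough sanity check on the dimensions above, the following sketch derives the per-head width and a back-of-the-envelope weight count. It assumes each transformer block holds a self-attention and a cross-attention module with four `dim × dim` projections each, plus a two-matrix feed-forward layer; biases, norms, modulation and embedding layers are ignored. The `BackboneDims` class is hypothetical and is not part of `mlx-video`.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BackboneDims:
    """Hypothetical container for the dimensions listed in the Architecture table."""

    num_layers: int = 30
    dim: int = 1536
    num_heads: int = 12
    ffn_dim: int = 8960

    @property
    def head_dim(self) -> int:
        # Width of a single attention head.
        return self.dim // self.num_heads

    def approx_block_params(self) -> int:
        # Assumption: self-attention and cross-attention each hold four
        # dim x dim projections (q, k, v, out).
        attention = 2 * 4 * self.dim * self.dim
        # Assumption: the feed-forward layer is two matrices (dim -> ffn_dim -> dim).
        feed_forward = 2 * self.dim * self.ffn_dim
        return attention + feed_forward

    def approx_total_params(self) -> int:
        # Matrix weights only; biases, norms and embeddings are left out.
        return self.num_layers * self.approx_block_params()


dims = BackboneDims()
print(dims.head_dim)
print(f"{dims.approx_total_params() / 1e9:.2f}B")
```

Under these assumptions the estimate comes to about 1.39 billion matrix weights with 128-wide heads, which is consistent with the "1.3B" tier name.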
## Tasks

| Task | Description |
|---|---|
| `t2v` / `t2i` | text-to-video / text-to-image |
| `r2v` | reference-to-video: generate a subject from up to K reference images (chained APG) |
| `v2v` | prompt-based video editing (source video injected as conditioning) |
| `rv2v` | reference + video editing |
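
The table implies which conditioning inputs each task needs in addition to the text prompt. A minimal sketch of that mapping follows; the `TASK_INPUTS` dictionary and `missing_inputs` helper are hypothetical names used for illustration and do not mirror any `mlx-video` API.

```python
# Extra inputs each task needs beyond the text prompt, read off the Tasks table.
# The names below are illustrative, not an actual mlx-video interface.
TASK_INPUTS = {
    "t2v": {"reference_images": False, "source_video": False},
    "t2i": {"reference_images": False, "source_video": False},
    "r2v": {"reference_images": True, "source_video": False},
    "v2v": {"reference_images": False, "source_video": True},
    "rv2v": {"reference_images": True, "source_video": True},
}


def missing_inputs(task: str, has_references: bool, has_video: bool) -> list[str]:
    """Return the names of required inputs that were not supplied for `task`."""
    if task not in TASK_INPUTS:
        raise ValueError(f"unknown task: {task!r}")
    provided = {"reference_images": has_references, "source_video": has_video}
    return [
        name
        for name, required in TASK_INPUTS[task].items()
        if required and not provided[name]
    ]


print(missing_inputs("rv2v", has_references=True, has_video=False))
```

A caller could run such a check before loading any weights, so that an `rv2v` request without a source video fails early.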
## Variants

| Repo | Precision | Transformer size | Shared components |
|---|---|---|---|
| `…-1.3B-bf16` | bfloat16 | 2.6 GB | VAE 0.5 GB · UMT5 10.6 GB |
| `…-1.3B-int4` | 4-bit (group 64) | 0.8 GB | VAE 0.5 GB · UMT5 10.6 GB |
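
The gap between the two transformer sizes can be reasoned about with a little arithmetic. The sketch below assumes group-wise affine quantization that stores one 16-bit scale and one 16-bit bias per group of 64 weights, and uses a round 1.4e9 parameter count purely for illustration; both helper functions are hypothetical.

```python
def effective_bits_per_weight(
    bits: int, group_size: int, scale_bits: int = 16, bias_bits: int = 16
) -> float:
    """Bits stored per weight once per-group scale and bias are amortized.

    Assumption: every group of `group_size` weights carries one scale and one bias.
    """
    return bits + (scale_bits + bias_bits) / group_size


def weight_size_gb(num_params: float, bits_per_weight: float) -> float:
    """Storage in decimal gigabytes for `num_params` weights."""
    return num_params * bits_per_weight / 8 / 1e9


ASSUMED_PARAMS = 1.4e9  # illustrative round number, not a measured count

int4_bits = effective_bits_per_weight(bits=4, group_size=64)
print(int4_bits)
print(weight_size_gb(ASSUMED_PARAMS, int4_bits))
print(weight_size_gb(ASSUMED_PARAMS, 16))
```

Under these assumptions 4-bit weights with group size 64 cost 4.5 bits each, which puts the estimate in the same range as the sizes listed above. Exact figures depend on the true parameter count, on which layers are left unquantized, and on whether sizes are reported in GB or GiB.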
## Provenance & validation

- Architecture: **stock Wan2.1-T2V-1.3B** (verified: diffusers `WanTransformer3DModel`
  keys, no extra tensors; matches `mlx-video` `wan21_t2v_1_3b` exactly: 30L/1536/12H/8960).
  Bernini knobs (`switch_dit_boundary 0`, `shift 3.0`, `use_src_id_rotary_emb`) live in
  the wrapper config; SA-3D RoPE adds **no parameters**.
- Converted fp32 → bf16 from `ByteDance/Bernini-R-1.3B-Diffusers`; VAE/UMT5 are the shared
  stock Wan2.1 components (byte-identical to the A14B port).
- Validated on the CPU stream: the key contract is bijective (the 825 file tensors equal the
  model params minus the derived `freqs` rope buffer); **bf16 forward cosine 0.999983 vs source fp32**;
  int4 per-pass cosine **0.9943** vs bf16 (group 64, ≥0.99 gate).
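
The two checks named in the last bullet can be sketched in plain Python. This is an illustration of the idea, not the validation script used for this port: the function names are hypothetical, outputs are treated as flat lists of floats, and the derived buffer is matched by the literal key `freqs`, which is an assumption about how the key is spelled.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two flattened output tensors."""
    if len(a) != len(b):
        raise ValueError("tensors must have the same number of elements")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def passes_gate(a: list[float], b: list[float], threshold: float = 0.99) -> bool:
    """True when the two outputs agree at or above the cosine threshold."""
    return cosine_similarity(a, b) >= threshold


def key_contract_ok(file_keys, model_keys, derived=frozenset({"freqs"})) -> bool:
    """True when checkpoint keys match model params one-to-one, derived buffers aside."""
    return set(file_keys) == set(model_keys) - set(derived)


reference = [1.0, 2.0, 3.0]
candidate = [1.0, 2.0, 3.01]
print(passes_gate(reference, candidate))
print(key_contract_ok({"w1", "w2"}, {"w1", "w2", "freqs"}))
```

In practice the comparison would run on the flattened outputs of one forward pass through each model, with the threshold set to the 0.99 gate mentioned above.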
## License & attribution

Apache-2.0. Derived from ByteDance Bernini-R, Wan2.1 (Wan-AI), and mlx-video. See `NOTICE`.