How to use mlx-community/Bernini-R-1.3B-bf16 with MLX:

    # Download the model from the Hub (the quotes keep zsh from expanding the brackets)
    pip install "huggingface_hub[hf_xet]"
    huggingface-cli download --local-dir Bernini-R-1.3B-bf16 mlx-community/Bernini-R-1.3B-bf16
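The same download can be scripted from Python. This is a minimal sketch, assuming the `huggingface_hub` package is installed; `download_model` is a hypothetical helper name used here for illustration, not part of any library.

```python
REPO_ID = "mlx-community/Bernini-R-1.3B-bf16"
# Mirror the CLI example: use the repo name (without the org) as the folder.
DEFAULT_DIR = REPO_ID.split("/", 1)[1]


def download_model(repo_id: str = REPO_ID, local_dir: str = DEFAULT_DIR) -> str:
    """Fetch every file of the repo into local_dir and return the local path."""
    # Imported lazily so the module can be imported without the dependency.
    from huggingface_hub import snapshot_download

    return snapshot_download(repo_id=repo_id, local_dir=local_dir)
```

Calling `download_model()` needs network access and returns the directory that holds the downloaded files.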
---
license: apache-2.0
library_name: mlx
pipeline_tag: text-to-video
tags:
- mlx
- text-to-video
- video-editing
- video-to-video
- reference-to-video
- wan2.1
- bernini
base_model: ByteDance/Bernini-R-1.3B-Diffusers
---
Bernini-R-1.3B (MLX)
Apple MLX port of ByteDance/Bernini-R-1.3B — the 1.3B tier of ByteDance's Bernini Renderer: a Wan2.1-T2V-1.3B-derived video generator/editor with Segment-Aware 3D RoPE for multi-reference / editing tasks. The small tier "performs close to the 14B variant on simple tasks such as style transfer, subtitle or watermark removal, and local editing."
Runs on Apple Silicon via MLX + the mlx-video Wan backbone.
This is the lowest-cost Bernini tier. For the higher-quality A14B renderer, see mlx-community/Bernini-R-bf16/-int4.
⚠️ Scope: renderer only
Only the Renderer ("-R") is open-sourced upstream. The MLLM semantic planner (the paper's headline "latent semantic planning", a Qwen2.5-VL-7B model) is not released. This port therefore runs with UMT5 text conditioning only — the planner-feature channel is absent (and carries no weights in the released checkpoint).
Architecture
| Component | Details |
|---|---|
| Backbone | Wan2.1-T2V-1.3B, single expert (30L · dim 1536 · 12H · ffn 8960) |
| Experts | one (skip_transformer_2: true, switch_dit_boundary: 0) |
| VAE | 16-ch AutoencoderKLWan (stock Wan2.1) |
| Text encoder | UMT5-xxl |
| Bernini knobs | shift 3.0, use_src_id_rotary_emb (SA-3D RoPE, no extra parameters) |
Differs from the A14B port only by config: a single 1.3B expert instead of the high/low-noise A14B pair. There is no expert-boundary switch.
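To make the table concrete, the values can be collected into a small config object. This is only an illustrative sketch: the class name `BerniniRConfig` and the field names `num_layers`, `dim`, `num_heads`, `ffn_dim`, and `vae_channels` are assumptions for this example; only `shift`, `skip_transformer_2`, `switch_dit_boundary`, and `use_src_id_rotary_emb` are names that appear in this card.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BerniniRConfig:  # hypothetical name, for illustration only
    num_layers: int = 30            # "30L"
    dim: int = 1536                 # model width
    num_heads: int = 12             # "12H"
    ffn_dim: int = 8960             # feed-forward width
    vae_channels: int = 16          # latent channels of the stock Wan2.1 VAE
    shift: float = 3.0              # Bernini knob from the card
    skip_transformer_2: bool = True
    switch_dit_boundary: float = 0
    use_src_id_rotary_emb: bool = True  # enables SA-3D RoPE

    @property
    def head_dim(self) -> int:
        # The width must split evenly across attention heads.
        if self.dim % self.num_heads:
            raise ValueError("dim must be divisible by num_heads")
        return self.dim // self.num_heads

    @property
    def num_experts(self) -> int:
        # With the second transformer skipped, only one expert remains.
        return 1 if self.skip_transformer_2 else 2


cfg = BerniniRConfig()
print(cfg.head_dim, cfg.num_experts)
```

With the values from the table, each attention head works on 1536 / 12 = 128 channels, and the single-expert setting means no boundary switch is needed during sampling.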
Tasks
| Task | Description |
|---|---|
| t2v / t2i | text-to-video / image |
| r2v | reference-to-video: generate a subject from up to K reference images (chained APG) |
| v2v | prompt-based video editing (source video injected as conditioning) |
| rv2v | reference + video editing |
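The task table implies which conditioning inputs each mode expects. The sketch below encodes that reading as a lookup plus a validator. `TASK_INPUTS` and `check_inputs` are hypothetical names invented for this example and are not part of mlx-video or the upstream code; the rule that text-only tasks take no extra inputs is likewise an assumption.

```python
# Which extra inputs each task uses, as read from the task table.
TASK_INPUTS = {
    "t2v":  {"references": False, "source_video": False},
    "t2i":  {"references": False, "source_video": False},
    "r2v":  {"references": True,  "source_video": False},
    "v2v":  {"references": False, "source_video": True},
    "rv2v": {"references": True,  "source_video": True},
}


def check_inputs(task: str, num_references: int = 0,
                 has_source_video: bool = False) -> list[str]:
    """Return a list of problems; an empty list means the inputs fit the task."""
    if task not in TASK_INPUTS:
        return [f"unknown task: {task}"]
    needs = TASK_INPUTS[task]
    problems = []
    if needs["references"] and num_references < 1:
        problems.append(f"{task} needs at least one reference image")
    if not needs["references"] and num_references > 0:
        problems.append(f"{task} does not use reference images")
    if needs["source_video"] and not has_source_video:
        problems.append(f"{task} needs a source video")
    if not needs["source_video"] and has_source_video:
        problems.append(f"{task} does not use a source video")
    return problems


print(check_inputs("rv2v", num_references=2, has_source_video=True))
print(check_inputs("v2v"))
```

Every task also takes a text prompt, since this port is conditioned through UMT5 only.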
Variants
| Repo | Precision | Transformer | + shared |
|---|---|---|---|
| …-1.3B-bf16 | bfloat16 | 2.6 GB | VAE 0.5 GB · UMT5 10.6 GB |
| …-1.3B-int4 | 4-bit (group 64) | 0.8 GB | VAE 0.5 GB · UMT5 10.6 GB |
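Adding up the columns shows how much of each variant's total size comes from the shared components. The arithmetic below uses only the sizes listed in the table; the names `TRANSFORMER_GB`, `SHARED_GB`, and `total_gb` are made up for this sketch.

```python
# Sizes in GB, copied from the variants table.
TRANSFORMER_GB = {"bf16": 2.6, "int4": 0.8}
SHARED_GB = {"vae": 0.5, "umt5": 10.6}


def total_gb(variant: str) -> float:
    """Transformer plus shared VAE and text encoder, rounded to 0.1 GB."""
    return round(TRANSFORMER_GB[variant] + sum(SHARED_GB.values()), 1)


for variant in TRANSFORMER_GB:
    print(variant, total_gb(variant), "GB")
```

By these numbers the UMT5 text encoder dominates both totals, so the int4 transformer reduces the overall footprint by about 1.8 GB, not by the 3x the transformer column alone might suggest.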
Provenance & validation
- Architecture: stock Wan2.1-T2V-1.3B (verified: diffusers `WanTransformer3DModel` keys, no extra tensors; matches `mlx-video` `wan21_t2v_1_3b` exactly: 30L/1536/12H/8960). Bernini knobs (`switch_dit_boundary 0`, `shift 3.0`, `use_src_id_rotary_emb`) live in the wrapper config; SA-3D RoPE adds no parameters.
- Converted fp32 → bf16 from `ByteDance/Bernini-R-1.3B-Diffusers`; VAE/UMT5 are the shared stock Wan2.1 components (byte-identical to the A14B port).
- Validated on the CPU stream: key contract bijective (825 file tensors = model params minus the derived `freqs` rope buffer); bf16 forward cosine 0.999983 vs source fp32; int4 per-pass cosine 0.9943 vs bf16 (group 64, ≥0.99 gate).
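The two validation checks named above, a one-to-one key contract and a cosine-similarity gate, can be sketched in plain Python. This is not the validation script used for this port. The function names, the toy key names, and the flattened-list inputs are assumptions for illustration; only the `freqs` buffer name and the 0.99 gate come from the text.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two flattened outputs of equal length."""
    if len(a) != len(b):
        raise ValueError("outputs must have the same number of elements")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def passes_gate(reference: list[float], candidate: list[float],
                gate: float = 0.99) -> bool:
    """True when the candidate output stays within the similarity gate."""
    return cosine_similarity(reference, candidate) >= gate


def key_contract(file_keys, model_keys, derived=("freqs",)):
    """Compare checkpoint keys with model parameters minus derived buffers.

    Returns (missing_from_file, unexpected_in_file); two empty lists mean
    the mapping is one-to-one.
    """
    expected = set(model_keys) - set(derived)
    found = set(file_keys)
    return sorted(expected - found), sorted(found - expected)


# Toy data standing in for real tensors and parameter names.
print(passes_gate([1.0, 2.0, 3.0], [1.0, 2.0, 3.1]))
print(key_contract(["a.weight", "b.weight"], ["a.weight", "b.weight", "freqs"]))
```

In a real run the inputs would be flattened forward-pass outputs from the source and converted models on identical inputs, and the key lists would come from the safetensors files and the instantiated model.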
License & attribution
Apache-2.0. Derived from ByteDance Bernini-R, Wan2.1 (Wan-AI), and mlx-video. See NOTICE.