---
license: apache-2.0
library_name: mlx
pipeline_tag: text-to-video
tags:
  - mlx
  - text-to-video
  - video-editing
  - video-to-video
  - reference-to-video
  - wan2.1
  - bernini
base_model: ByteDance/Bernini-R-1.3B-Diffusers
---

# Bernini-R-1.3B (MLX)

Apple MLX port of **[ByteDance/Bernini-R-1.3B-Diffusers](https://huggingface.co/ByteDance/Bernini-R-1.3B-Diffusers)**,
the **1.3B** tier of ByteDance's Bernini *Renderer*: a Wan2.1-T2V-1.3B-derived video
generator/editor with **Segment-Aware 3D RoPE** (SA-3D RoPE) for multi-reference and editing tasks.
The small tier "performs close to the 14B variant on simple tasks such as style transfer,
subtitle or watermark removal, and local editing."

Runs on Apple Silicon via [MLX](https://github.com/ml-explore/mlx) + the
[mlx-video](https://github.com/Blaizzy/mlx-video) Wan backbone.

> **This is the lowest-cost Bernini tier.** For the higher-quality A14B renderer see
> [`mlx-community/Bernini-R-bf16`](https://huggingface.co/mlx-community/Bernini-R-bf16) /
> [`-int4`](https://huggingface.co/mlx-community/Bernini-R-int4).

## ⚠️ Scope: renderer only

Only the Renderer ("-R") is open-sourced upstream. The MLLM semantic **planner** (the
paper's headline "latent semantic planning", a Qwen2.5-VL-7B model) is **not released**.
This port therefore runs with **UMT5 text conditioning only**: the planner-feature
channel is unused, and the released checkpoint carries no weights for it.

## Architecture

| Component | Details |
|---|---|
| Backbone | **Wan2.1-T2V-1.3B**, single expert (30L · dim 1536 · 12H · ffn 8960) |
| Experts | **one** (`skip_transformer_2: true`, `switch_dit_boundary: 0`) |
| VAE | 16-ch `AutoencoderKLWan` (stock Wan2.1) |
| Text encoder | UMT5-xxl |
| Bernini knobs | `shift 3.0`, `use_src_id_rotary_emb` (SA-3D RoPE — **no extra parameters**) |

This port differs from the A14B port only in configuration: it uses a single 1.3B expert
instead of the high-noise/low-noise A14B pair, so there is no expert-boundary switch.
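
The table values can be collected into a small config object to make the derived quantities explicit. This is a minimal sketch: the `RendererConfig` class and its field layout are hypothetical (not the port's actual schema), and deriving the expert count from `skip_transformer_2` is an assumption based on the table above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RendererConfig:
    """Hypothetical container for the values listed in the Architecture table."""

    num_layers: int = 30
    dim: int = 1536
    num_heads: int = 12
    ffn_dim: int = 8960
    vae_channels: int = 16
    shift: float = 3.0
    skip_transformer_2: bool = True
    switch_dit_boundary: float = 0.0  # listed as 0 in the table
    use_src_id_rotary_emb: bool = True  # SA-3D RoPE, adds no parameters

    @property
    def head_dim(self) -> int:
        # Per-head width; the model width must split evenly across heads.
        if self.dim % self.num_heads != 0:
            raise ValueError("dim must be divisible by num_heads")
        return self.dim // self.num_heads

    @property
    def num_experts(self) -> int:
        # Assumption: skipping the second transformer leaves a single expert.
        return 1 if self.skip_transformer_2 else 2


cfg = RendererConfig()
print(cfg.head_dim, cfg.num_experts)
```

With the listed width and head count, each attention head is 1536 / 12 = 128 channels wide.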

## Tasks

| Task | Description |
|---|---|
| `t2v` / `t2i` | text-to-video / text-to-image |
| `r2v` | reference-to-video — generate a subject from up to K reference images (chained APG) |
| `v2v` | prompt-based video editing (source video injected as conditioning) |
| `rv2v` | reference + video editing |
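
One way to read the task table is as a map from task name to the media inputs each task needs beyond the text prompt. The sketch below is illustrative only: `REQUIRED_MEDIA`, `missing_inputs`, and the input names `references` and `source_video` are hypothetical and do not describe the port's actual interface.

```python
# Hypothetical mapping inferred from the task table; not the port's real API.
REQUIRED_MEDIA = {
    "t2v": frozenset(),
    "t2i": frozenset(),
    "r2v": frozenset({"references"}),
    "v2v": frozenset({"source_video"}),
    "rv2v": frozenset({"references", "source_video"}),
}


def missing_inputs(task: str, provided: set[str]) -> set[str]:
    """Return the media inputs a task still needs, given what was provided."""
    if task not in REQUIRED_MEDIA:
        raise ValueError(f"unknown task: {task!r}")
    return set(REQUIRED_MEDIA[task]) - provided


# rv2v needs both reference images and a source video.
print(missing_inputs("rv2v", {"references"}))
```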

## Variants

| Repo | Precision | Transformer size | Shared components |
|---|---|---|---|
| `…-1.3B-bf16` | bfloat16 | 2.6 GB | VAE 0.5 GB · UMT5 10.6 GB |
| `…-1.3B-int4` | 4-bit (group 64) | 0.8 GB | VAE 0.5 GB · UMT5 10.6 GB |
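
The transformer sizes can be sanity-checked with back-of-the-envelope arithmetic. The sketch below assumes a nominal 1.3e9 parameters and, for the 4-bit variant, one 16-bit scale plus one 16-bit bias stored per group of 64 weights; both figures are assumptions for illustration, not values read from the checkpoint.

```python
def weight_gb(num_params: float, bits_per_weight: float) -> float:
    """Approximate storage in GB (1e9 bytes) for a given bit width."""
    return num_params * bits_per_weight / 8 / 1e9


NOMINAL_PARAMS = 1.3e9      # assumption: nominal "1.3B" parameter count
GROUP_SIZE = 64             # from the Variants table
GROUP_OVERHEAD_BITS = 32    # assumption: 16-bit scale + 16-bit bias per group

bf16_gb = weight_gb(NOMINAL_PARAMS, 16)

# Effective bits per weight once the per-group metadata is amortized.
int4_bits = 4 + GROUP_OVERHEAD_BITS / GROUP_SIZE
int4_gb = weight_gb(NOMINAL_PARAMS, int4_bits)

print(f"bf16 ~ {bf16_gb:.2f} GB, int4 ~ {int4_gb:.2f} GB")
```

Under these assumptions the bf16 estimate matches the listed 2.6 GB, while the int4 estimate comes out slightly below the listed 0.8 GB. Tensors kept at higher precision would plausibly explain the difference, though the source does not confirm this.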

## Provenance & validation

- Architecture: **stock Wan2.1-T2V-1.3B**. Verified: the checkpoint uses diffusers
  `WanTransformer3DModel` keys with no extra tensors, and it matches the `mlx-video`
  `wan21_t2v_1_3b` configuration exactly (30L / dim 1536 / 12H / ffn 8960).
  The Bernini knobs (`switch_dit_boundary 0`, `shift 3.0`, `use_src_id_rotary_emb`) live in
  the wrapper config; SA-3D RoPE adds **no parameters**.
- Converted fp32 → bf16 from `ByteDance/Bernini-R-1.3B-Diffusers`; VAE/UMT5 are the shared
  stock Wan2.1 components (byte-identical to the A14B port).
- Validated on the CPU stream:
  - The key contract is bijective: the 825 tensors in the file map one-to-one onto the
    model parameters, excluding the derived `freqs` RoPE buffer.
  - **bf16 forward cosine 0.999983** vs. the source fp32 model.
  - int4 per-pass cosine **0.9943** vs. bf16 (group size 64; passes the ≥0.99 gate).
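
The two checks above (a one-to-one key match and a cosine-similarity gate) can be sketched in plain Python. The function names, the flat-list tensor representation, and the way the `freqs` buffer is identified by its final key component are assumptions for illustration; the actual validation scripts are not part of this card.

```python
import math


def key_contract_ok(file_keys, model_keys, derived=("freqs",)):
    """True if file tensors match model params one-to-one, ignoring derived buffers."""
    expected = {k for k in model_keys if k.split(".")[-1] not in derived}
    no_duplicates = len(file_keys) == len(set(file_keys))
    return no_duplicates and set(file_keys) == expected


def cosine(a, b):
    """Cosine similarity between two flattened outputs of equal length."""
    if len(a) != len(b):
        raise ValueError("outputs must have the same number of elements")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine is undefined for an all-zero output")
    return dot / (norm_a * norm_b)


def passes_gate(reference, candidate, gate=0.99):
    """Apply the >= 0.99 gate quoted above for the quantized variant."""
    return cosine(reference, candidate) >= gate


# Toy data only; real checks would compare flattened forward-pass outputs.
print(key_contract_ok(["blocks.0.w"], ["blocks.0.w", "rope.freqs"]))
print(passes_gate([1.0, 2.0, 3.0], [1.0, 2.0, 3.01]))
```

In practice the outputs would be MLX or NumPy arrays rather than Python lists; the list form here only keeps the sketch dependency-free.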

## License & attribution

Apache-2.0. Derived from ByteDance Bernini-R, Wan2.1 (Wan-AI), and mlx-video. See `NOTICE`.