---
license: apache-2.0
pipeline_tag: image-text-to-video
---
## Installation
### Requirements
- **Python** 3.11.2.
- **CUDA GPU**: a Hopper GPU (H100/H800/H200) is recommended so FlashAttention-3
can be used; other CUDA GPUs fall back to FlashAttention-2 or PyTorch SDPA.
- **CUDA toolkit** 12.4 (matches the pinned `torch==2.5.1+cu124`; 12.3+ is the
minimum if you build FlashAttention-3).
- Pinned in `requirements.txt`: `torch==2.5.1+cu124`, `diffusers==0.35.2`,
`accelerate==0.34.2`, `transformers==4.57.3`.
Reference environment (Bernini-R is developed and tested on this setup):
| Component | Version |
|-----------|--------------|
| GPU | NVIDIA H100 |
| CUDA | 12.4 |
| Python | 3.11.2 |
| PyTorch | 2.5.1+cu124 |
### Install
```bash
git clone https://github.com/bytedance/Bernini.git bernini && cd bernini
pip install -r requirements.txt
```
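After installing, it can be useful to confirm that the environment matches the pins listed under Requirements. The following is a minimal sketch, not part of the repository: the helper names `installed_version` and `find_mismatches` are hypothetical, and the pinned versions are copied from the list above.

```python
from importlib import metadata

# Pins copied from the Requirements list above.
PINNED = {
    "torch": "2.5.1+cu124",
    "diffusers": "0.35.2",
    "accelerate": "0.34.2",
    "transformers": "4.57.3",
}


def installed_version(package):
    """Return the installed version string, or None if the package is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def find_mismatches(pinned, lookup=installed_version):
    """Map each package that differs from its pin to (expected, found)."""
    mismatches = {}
    for package, expected in pinned.items():
        found = lookup(package)
        if found != expected:
            mismatches[package] = (expected, found)
    return mismatches


if __name__ == "__main__":
    problems = find_mismatches(PINNED)
    for package, (expected, found) in problems.items():
        print(f"{package}: expected {expected}, found {found or 'not installed'}")
    if not problems:
        print("All pinned packages match.")
```

The check reads package metadata only, so it runs without importing `torch` or touching the GPU.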
Optional extras:
- **Multi-GPU sequence parallel** needs [Open-VeOmni](https://github.com/ByteDance-Seed/VeOmni)
(Apache-2.0, Python 3.11). Use `--no-deps` so VeOmni does not pull in a
different torch build and override the pinned `torch==2.5.1+cu124`:
`pip install --no-deps git+https://github.com/ByteDance-Seed/VeOmni.git@v0.1.10`.
Single-GPU inference does not need it.
- **Faster attention** (auto-detected if installed; otherwise PyTorch SDPA is used):
  - FlashAttention-2, for general CUDA GPUs (incl. A100/A800): `pip install flash-attn==2.8.3`.
  - FlashAttention-3, Hopper only (H100/H800/H200, CUDA ≥ 12.3, PyTorch ≥ 2.4).
    `flash_attn_interface` is not on PyPI; build it from the
    [flash-attention](https://github.com/Dao-AILab/flash-attention) repo's
    `hopper/` directory at tag `v2.8.3`:
```bash
git clone https://github.com/Dao-AILab/flash-attention.git
cd flash-attention && git checkout v2.8.3
cd hopper && MAX_JOBS=$(nproc) python3 setup.py install --user
```
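To see which attention backend an environment is likely to end up with, the fallback order described above (FlashAttention-3, then FlashAttention-2, then PyTorch SDPA) can be sketched as below. This is an illustration only and may differ from how Bernini-R performs its own detection: `pick_attention_backend` and the returned labels are hypothetical, and the sketch checks only whether the `flash_attn_interface` and `flash_attn` modules are importable, not whether the GPU is a Hopper part.

```python
import importlib.util


def pick_attention_backend(has_module=None):
    """Return a label for the fastest attention backend whose module is importable."""
    if has_module is None:
        # Default probe: a top-level module counts as present if a spec is found.
        has_module = lambda name: importlib.util.find_spec(name) is not None
    if has_module("flash_attn_interface"):  # FlashAttention-3 (built from hopper/)
        return "flash-attention-3"
    if has_module("flash_attn"):  # FlashAttention-2 (pip install flash-attn)
        return "flash-attention-2"
    return "sdpa"  # PyTorch scaled dot-product attention fallback


if __name__ == "__main__":
    print(pick_attention_backend())
```

Passing a custom `has_module` probe makes the ordering easy to exercise without installing either FlashAttention build.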
### Weights
Bernini-R uses two sets of weights:
1. **Wan2.2 base**: [`Wan-AI/Wan2.2-T2V-A14B-Diffusers`](https://huggingface.co/Wan-AI/Wan2.2-T2V-A14B-Diffusers) on Hugging Face. Supplies the
VAE, UMT5 text encoder, tokenizer, and the transformer architecture/base weights.
It is downloaded automatically on first run (configured by `wan22_base` in
`configs/bernini_renderer_wan22/config.json`).
2. **Bernini-R checkpoint**: the trained high-noise / low-noise transformer weights
(safetensors) from [Hugging Face](https://huggingface.co/ByteDance/Bernini), passed with
`--high_noise_ckpt` / `--low_noise_ckpt`. Both a local directory and a Hugging
Face repo id are accepted.
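One way a "local directory or repo id" argument can be resolved is sketched below. It is an assumption about the general pattern, not the repository's actual loader: `resolve_checkpoint` is a hypothetical helper that returns an existing directory unchanged and otherwise treats the value as a Hugging Face repo id, fetching it with `huggingface_hub.snapshot_download`.

```python
import os


def resolve_checkpoint(path_or_repo_id):
    """Return a local directory for a checkpoint given a path or a Hub repo id."""
    if os.path.isdir(path_or_repo_id):
        # Already on disk: use it as-is.
        return os.path.abspath(path_or_repo_id)
    # Not a local directory: assume it is a Hugging Face repo id and download it.
    # Imported lazily so the local-path case works without `huggingface_hub`.
    from huggingface_hub import snapshot_download

    return snapshot_download(repo_id=path_or_repo_id)
```

Checking for a local directory first means a path that happens to look like `owner/name` still resolves to the files on disk.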
## Usage
A run is described by a **case file**: a small JSON file under
[`assets/testcases/`](assets/testcases/) that bundles one task's routing and
inputs (`task_type`, `guidance_mode`, `prompt`, source media, `output`). This
keeps long prompts out of the command line. Each task has its own directory
there, holding one or more case files; see
[`assets/testcases/`](assets/testcases/) for the format and the bundled
`t2i` / `i2i` / `t2v` / `v2v` / `rv2v` / `r2v` examples.
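As a rough illustration of what such a case file carries, the sketch below parses a JSON string and checks the fields named above. Several things here are assumptions rather than the documented format: that all four keys are required, that `task_type` takes the same values as the bundled example names, and every value in the sample case (including the `guidance_mode` string). Source-media keys are omitted because their names are not given here; the files under `assets/testcases/` are the authoritative reference.

```python
import json

# Field names taken from the description above; treating them as required is an assumption.
REQUIRED_KEYS = ("task_type", "guidance_mode", "prompt", "output")
# Assumed to match the bundled example names.
KNOWN_TASKS = {"t2i", "i2i", "t2v", "v2v", "rv2v", "r2v"}


def load_case(text):
    """Parse a case file's JSON text and check the fields named above."""
    case = json.loads(text)
    missing = [key for key in REQUIRED_KEYS if key not in case]
    if missing:
        raise ValueError(f"case file is missing keys: {missing}")
    if case["task_type"] not in KNOWN_TASKS:
        raise ValueError(f"unknown task_type: {case['task_type']!r}")
    return case


# Hypothetical case; every value below is a placeholder.
example = """
{
  "task_type": "t2v",
  "guidance_mode": "placeholder-mode",
  "prompt": "A marble statue slowly turning its head.",
  "output": "outputs/example.mp4"
}
"""

case = load_case(example)
print(case["task_type"], case["output"])
```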
### Prompt enhancer (recommended)
`--use_pe` enhances the prompt through an OpenAI-compatible endpoint and is
recommended for best generation quality. The `openai` SDK is installed by
`requirements.txt`; configure the endpoint with environment variables:
```bash
export BERNINI_PE_API_KEY=... # or OPENAI_API_KEY
export BERNINI_PE_BASE_URL=... # or OPENAI_BASE_URL
export BERNINI_PE_MODEL=... # vision-capable chat model
```
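The fallback between the `BERNINI_PE_*` and `OPENAI_*` variables can be sketched as follows. This is a minimal illustration, not the repository's code: `resolve_pe_settings` and `missing_pe_settings` are hypothetical helpers, and giving the `BERNINI_PE_*` names precedence when both are set is an assumption. No fallback is listed for `BERNINI_PE_MODEL`, so none is applied.

```python
import os


def resolve_pe_settings(env=None):
    """Collect prompt-enhancer settings, preferring BERNINI_PE_* over OPENAI_*."""
    env = os.environ if env is None else env
    return {
        "api_key": env.get("BERNINI_PE_API_KEY") or env.get("OPENAI_API_KEY"),
        "base_url": env.get("BERNINI_PE_BASE_URL") or env.get("OPENAI_BASE_URL"),
        "model": env.get("BERNINI_PE_MODEL"),  # no OPENAI_* fallback is documented
    }


def missing_pe_settings(settings):
    """List the settings that are unset or empty."""
    return [name for name, value in settings.items() if not value]


if __name__ == "__main__":
    missing = missing_pe_settings(resolve_pe_settings())
    print("missing:", missing if missing else "none")
```

Running the check before a long `--use_pe` job surfaces an unset variable early instead of partway through generation.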
### Examples by task type
Unless an example specifies otherwise, inference outputs **480p / 16fps** (the
defaults: `--max_image_size 848`, `--fps 16`).
Each example runs a bundled case in
[`assets/testcases/`](assets/testcases/); replace `