---
license: other
gated: manual
tags:
- paris2
- text-to-video
- image-to-video
- mixture-of-experts
- decentralized-diffusion-model
extra_gated_heading: "Request access to Paris 2.0: A Decentralized Diffusion Model for Video Generation"
extra_gated_description: |
  <img src="https://pub-2c09ae97630f4932a23e622b450076e0.r2.dev/paris2/model-card/v1/bagel_labs_logo.png" alt="Bagel Labs" height="28" style="margin-bottom: 12px;"/>
  Access is granted on a per-request basis after manual review by Bagel Labs.
  Each request is reviewed individually. Typical turnaround is 2-3 business days.
extra_gated_button_content: "Acknowledge license and request access"
extra_gated_prompt: >
  By requesting access, you agree to abide by the license and Bagel Labs
  acceptable use policy. These model weights are released for research and
  evaluation. Commercial use is not granted by default and requires written
  agreement with Bagel Labs.
extra_gated_fields:
  Full name: text
  Affiliation: text
  Affiliation type:
    type: select
    options:
      - Academic / university
      - Industry research lab
      - Startup
      - Large company
      - Independent researcher
      - {label: Other, value: other}
  Company or institution website: text
  Job title or role: text
  Country: country
  Intended use (1-2 sentences): text
  Will the model be used in a commercial product or service?:
    type: select
    options:
      - "No, research and evaluation only"
      - "Possibly in the future"
      - "Yes"
  Are you following @bageldotcom on Hugging Face?: checkbox
  Email used for this request matches my official affiliation domain: checkbox
  I agree to use this model for non-commercial research and evaluation only unless I have a separate written agreement with Bagel Labs: checkbox
  I agree to the license and the acceptable use policy: checkbox
---
<p align="center">
<img src="https://pub-2c09ae97630f4932a23e622b450076e0.r2.dev/paris2/model-card/v1/bagel_labs_logo.png" alt="Bagel Labs">
</p>
<h1 align="center">Paris 2.0: A Decentralized Diffusion Model for Video Generation</h1>
<p align="center">
<a href="https://huggingface.co/bageldotcom/paris2" target="_blank">
<img src="https://img.shields.io/badge/🤗_DOWNLOAD_PARIS_2.0_WEIGHTS-FFD21E?style=for-the-badge&logoColor=000000" alt="Download Paris 2.0 Weights">
</a>
<a href="https://arxiv.org/abs/2605.26064" target="_blank">
<img src="https://img.shields.io/badge/📄_READ_PARIS_2.0_PAPER-FF6B6B?style=for-the-badge&logoColor=white" alt="Read Paris 2.0 Technical Report">
</a>
</p>
Paris 2.0 is a Decentralized Diffusion Model (DDM) for video generation,
extending the Paris 1.0 DDM recipe from image generation to temporally
coherent video. A DDM trains independent expert diffusion models without
gradient synchronization, parameter sharing, or activation exchange, then uses
a lightweight router to select experts during denoising.
# Generated Samples
<img src="https://pub-2c09ae97630f4932a23e622b450076e0.r2.dev/paris2/model-card/v1/paris2_sample_talking_head.png" alt="Paris 2.0 generated talking-head video frames" style="width: 100%; max-width: 960px; border-radius: 8px; margin: 14px 0 4px;"/>
<p style="color: #6b7280; font-size: 14px; margin-top: 0;"><em>Prompt: A woman with long, blond, wavy hair is speaking directly to the camera.</em></p>
<img src="https://pub-2c09ae97630f4932a23e622b450076e0.r2.dev/paris2/model-card/v1/paris2_sample_paper_craft.png" alt="Paris 2.0 generated paper-craft video frames" style="width: 100%; max-width: 960px; border-radius: 8px; margin: 14px 0 4px;"/>
<p style="color: #6b7280; font-size: 14px; margin-top: 0;"><em>Prompt: A person's hands perform a paper-folding craft on a green cutting mat.</em></p>
<img src="https://pub-2c09ae97630f4932a23e622b450076e0.r2.dev/paris2/model-card/v1/paris2_sample_slime.png" alt="Paris 2.0 generated slime video frames" style="width: 100%; max-width: 960px; border-radius: 8px; margin: 14px 0 4px;"/>
<p style="color: #6b7280; font-size: 14px; margin-top: 0;"><em>Prompt: A pair of hands interacts with translucent blue slime.</em></p>
# Results
In a low-resolution text-to-video study, Paris 2.0 is
compared against a monolithic model trained on the same data under a matched
total compute budget. The decentralized model reduces FVD from 561.04 to
279.01 and improves CLIP text-video similarity and aesthetic score under the
same generation protocol.
<img src="https://pub-2c09ae97630f4932a23e622b450076e0.r2.dev/paris2/model-card/v1/paris2_relative_improvement.png" alt="Paris 2.0 relative improvement over monolithic baseline" style="width: 100%; max-width: 960px; margin: 18px 0 8px;"/>
<p style="color: #6b7280; font-size: 14px; margin-top: 0;"><em>Relative improvement over the monolithic baseline. Each bar shows the gain over monolithic, so a taller bar means a larger improvement (for FVD this corresponds to a lower distance, for CLIP and aesthetic to a higher score). Motion is descriptive and has no preferred direction.</em></p>
| Metric | Paris 2.0 DDM | Monolithic baseline |
|---|---:|---:|
| FVD ↓ | 279.01 | 561.04 |
| CLIP text-video ↑ | 0.2178 ± 0.0012 | 0.2032 ± 0.0011 |
| Aesthetic ↑ | 3.9036 ± 0.0082 | 3.7950 ± 0.0077 |
| Motion (px/frame) | 0.712 ± 0.057 | 0.555 ± 0.043 |
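The relative gains can be recomputed from the mean values in the table. The sketch below assumes the usual definition of relative improvement (change divided by the baseline value, with the sign flipped for lower-is-better metrics); the plotted figure may use a different normalization, and the helper name `relative_improvement` is illustrative rather than part of this repository.

```python
def relative_improvement(ddm, baseline, lower_is_better=False):
    """Return the fractional gain of `ddm` over `baseline`.

    Assumed definition: (change in the preferred direction) / baseline.
    """
    change = (baseline - ddm) if lower_is_better else (ddm - baseline)
    return change / baseline


# Mean values copied from the results table (uncertainties omitted).
# Tuple layout: (Paris 2.0 DDM, monolithic baseline, lower_is_better).
RESULTS = {
    "FVD": (279.01, 561.04, True),
    "CLIP text-video": (0.2178, 0.2032, False),
    "Aesthetic": (3.9036, 3.7950, False),
}

for name, (ddm, base, lower) in RESULTS.items():
    gain = relative_improvement(ddm, base, lower)
    print(f"{name}: {100 * gain:.1f}% relative improvement")
```

Under this definition the FVD reduction is about 50%, the CLIP gain about 7%, and the aesthetic gain about 3%. Motion is left out because the table gives it no preferred direction.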
# Inference Pipeline
<img src="https://pub-2c09ae97630f4932a23e622b450076e0.r2.dev/paris2/model-card/v1/paris2_inference_pipeline.png" alt="Paris 2.0 inference pipeline" style="width: 100%; max-width: 960px; margin: 18px 0 8px;"/>
<p style="color: #6b7280; font-size: 14px; margin-top: 0;"><em>A lightweight router selects the top-K Flux MM-DiT experts at each denoising step, and the routed velocity is decoded into video through the Hunyuan Video VAE.</em></p>
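The routing step described in the caption can be sketched in plain Python. This is a toy illustration, not the released inference code: it assumes the router emits one score per expert, that the top-K velocities are combined with renormalized softmax weights, and that sampling uses fixed-size Euler steps with time running from 0 to 1. All function names here (`routed_velocity`, `denoise`) are hypothetical, and latents are flat lists of floats for brevity.

```python
import math
from typing import Callable, List, Sequence

Vector = List[float]
Expert = Callable[[Vector, float], Vector]           # (latent, t) -> velocity
Router = Callable[[Vector, float], Sequence[float]]  # (latent, t) -> scores


def softmax(scores: Sequence[float]) -> List[float]:
    """Numerically stable softmax over router scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def routed_velocity(x: Vector, t: float, experts: Sequence[Expert],
                    scores: Sequence[float], top_k: int = 1) -> Vector:
    """Combine the velocities of the top-K experts (assumed weighting)."""
    probs = softmax(scores)
    ranked = sorted(range(len(experts)), key=lambda i: probs[i], reverse=True)
    chosen = ranked[:top_k]
    norm = sum(probs[i] for i in chosen)
    v = [0.0] * len(x)
    for i in chosen:
        weight = probs[i] / norm  # renormalize over the selected experts
        v_i = experts[i](x, t)    # only the selected experts are evaluated
        v = [a + weight * b for a, b in zip(v, v_i)]
    return v


def denoise(x: Vector, experts: Sequence[Expert], router: Router,
            steps: int = 4, top_k: int = 1) -> Vector:
    """Integrate the routed velocity with fixed-size Euler steps."""
    dt = 1.0 / steps
    for step in range(steps):
        t = step * dt
        v = routed_velocity(x, t, experts, router(x, t), top_k)
        x = [a + dt * b for a, b in zip(x, v)]
    return x


# Toy usage: three constant-velocity "experts" and a router that prefers the first.
toy_experts = [
    lambda x, t: [1.0] * len(x),
    lambda x, t: [-1.0] * len(x),
    lambda x, t: [0.0] * len(x),
]
print(denoise([0.0, 0.0], toy_experts, lambda x, t: [5.0, 0.0, 0.0]))
```

Because the router is queried at every step, different experts can be selected at different noise levels; in the real pipeline the final latent would then be passed to the video VAE decoder.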
# Key Characteristics
- Three 11B Flux MM-DiT expert diffusion models
- Lightweight router selecting experts during denoising
- No gradient synchronization, parameter sharing, or activation exchange
between experts during training
- Supports text-to-video and image-to-video generation
- Multi-stage checkpoints at 256×256 and 768×768 video resolutions
---
# What This Repository Contains
This repository contains the Paris 2.0 expert pool and learned router. Each
expert includes Stage 2 and Stage 3 checkpoints for 256×256 and 768×768 video
resolutions.
```
expert1/ Expert 1
expert2/ Expert 2
expert3/ Expert 3
Router/ Routing model
model_index.json
```
Each checkpoint is provided in both unwrapped single-file
(`master.safetensors`) and sharded (`model/`) formats for compatibility with
different inference frameworks.
---
# Setup: Required External Components
Inference requires four third-party components that are **not bundled** in
this repository. Each is released by its original authors under its own
license, and you should fetch them directly from the upstream sources. After
downloading, place them in the working directory alongside the contents of
this repo using the layout below.
```bash
# 1. Hunyuan Video VAE (Tencent)
hf download tencent/HunyuanVideo hunyuan-video-t2v-720p/vae/pytorch_model.pt --local-dir ./hunyuan_vae
mv ./hunyuan_vae/hunyuan-video-t2v-720p/vae/pytorch_model.pt ./vae.pt
# 2. T5 text encoder, fp16, encoder-only (community-maintained Flux variant)
hf download comfyanonymous/flux_text_encoders t5xxl_fp16.safetensors --local-dir ./t5
mv ./t5/t5xxl_fp16.safetensors ./t5/model.safetensors
# 3. T5 tokenizer + config (Google)
hf download google/t5-v1_1-xxl config.json spiece.model special_tokens_map.json tokenizer_config.json --local-dir ./t5
# 4. CLIP ViT-L/14 (OpenAI)
hf download openai/clip-vit-large-patch14 --local-dir ./clip
```
Final layout after running the four commands above and downloading this repository:
```
.
├── expert1/ expert2/ expert3/ Router/    (this repo)
├── model_index.json                      (this repo)
├── vae.pt                                (Tencent HunyuanVideo)
├── t5/                                   (Google T5 + Flux encoder-only safetensors)
└── clip/                                 (OpenAI CLIP)
```
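Before running inference it can help to confirm that the working directory matches this layout. The following is a minimal sketch that only checks the top-level entries listed above; the helper name `missing_entries` is illustrative, and the check does not inspect what is inside each directory.

```python
from pathlib import Path

# Top-level entries copied from the layout listing above.
EXPECTED = {
    "expert1": "dir",
    "expert2": "dir",
    "expert3": "dir",
    "Router": "dir",
    "model_index.json": "file",
    "vae.pt": "file",
    "t5": "dir",
    "clip": "dir",
}


def missing_entries(root="."):
    """Return the expected entries that are absent (or of the wrong kind) under `root`."""
    root = Path(root)
    missing = []
    for name, kind in EXPECTED.items():
        path = root / name
        present = path.is_dir() if kind == "dir" else path.is_file()
        if not present:
            missing.append(name)
    return missing


if __name__ == "__main__":
    gaps = missing_entries(".")
    print("Layout looks complete." if not gaps else f"Missing: {', '.join(gaps)}")
```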
## Third-party components and licenses
| Component | Upstream | License |
|---|---|---|
| Hunyuan Video VAE | [`tencent/HunyuanVideo`](https://huggingface.co/tencent/HunyuanVideo) | [Tencent Hunyuan Community License](https://huggingface.co/tencent/HunyuanVideo/blob/main/LICENSE.txt) |
| T5 text encoder weights (encoder-only fp16) | [`comfyanonymous/flux_text_encoders`](https://huggingface.co/comfyanonymous/flux_text_encoders) | Apache 2.0 (derived from Google T5-v1.1) |
| T5 tokenizer and config | [`google/t5-v1_1-xxl`](https://huggingface.co/google/t5-v1_1-xxl) | Apache 2.0 |
| CLIP ViT-L/14 | [`openai/clip-vit-large-patch14`](https://huggingface.co/openai/clip-vit-large-patch14) | MIT |
Use of each component is governed by its own upstream license. The license
field on this repository applies only to the expert and router weights we
trained.
The null-conditioning tensors `null_clip.pt` and `null_t5.pt` referenced by
`model_index.json` for classifier-free guidance are produced by encoding an
empty string through CLIP and T5 respectively; once you have the encoders
above, you can regenerate them yourself with a few lines of code.
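One way to do this with the `transformers` library is sketched below. It rests on several assumptions that this README does not confirm: that the pipeline expects the pooled CLIP vector and the per-token T5 hidden states (as in Flux-style conditioning), that the T5 sequence is padded to 512 tokens, and that the `./clip` and `./t5` directories from the setup step load directly with `from_pretrained`. Check the shapes against what your inference code expects before relying on the output; the function names are illustrative.

```python
from pathlib import Path


def null_tensor_paths(out_dir="."):
    """Return the output paths for the two null-conditioning tensors."""
    out = Path(out_dir)
    return {"clip": out / "null_clip.pt", "t5": out / "null_t5.pt"}


def encode_empty_prompt(clip_dir="./clip", t5_dir="./t5", t5_max_length=512):
    """Encode the empty string with CLIP and T5 (tensor layout is an assumption)."""
    # Heavy imports stay local so the helpers above work without them installed.
    import torch
    from transformers import (CLIPTextModel, CLIPTokenizer,
                              T5EncoderModel, T5Tokenizer)

    clip_tok = CLIPTokenizer.from_pretrained(clip_dir)
    clip_enc = CLIPTextModel.from_pretrained(clip_dir).eval()
    t5_tok = T5Tokenizer.from_pretrained(t5_dir)
    # T5-XXL is large; loading it needs substantial memory.
    t5_enc = T5EncoderModel.from_pretrained(t5_dir).eval()

    with torch.no_grad():
        clip_in = clip_tok([""], padding="max_length", truncation=True,
                           return_tensors="pt")
        # Assumption: the pooled CLIP embedding is the one used for guidance.
        null_clip = clip_enc(input_ids=clip_in.input_ids).pooler_output

        t5_in = t5_tok([""], padding="max_length", max_length=t5_max_length,
                       truncation=True, return_tensors="pt")
        # Assumption: per-token hidden states, padded to t5_max_length.
        null_t5 = t5_enc(input_ids=t5_in.input_ids).last_hidden_state
    return null_clip, null_t5


def save_null_conditioning(out_dir="."):
    """Encode the empty prompt and write null_clip.pt and null_t5.pt."""
    import torch

    paths = null_tensor_paths(out_dir)
    null_clip, null_t5 = encode_empty_prompt()
    torch.save(null_clip, paths["clip"])
    torch.save(null_t5, paths["t5"])
    return paths
```

Calling `save_null_conditioning(".")` from the working directory writes both files next to `model_index.json`.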
---
# Architecture Details
| Component | Specification |
|---|---|
| Architecture | Flux MM-DiT |
| Parameters per Expert | 11B |
| Number of Experts | 3 |
| Routing Model | Lightweight transformer router |
| Text Conditioning | T5 + CLIP ViT-L/14 |
| Video VAE | Hunyuan Video VAE (4× temporal, 8× spatial) |
| Latent Resolution (stage 2) | 32×32 per frame |
| Latent Resolution (stage 3) | 96×96 per frame |
| Video Resolution (stage 2) | 256×256 |
| Video Resolution (stage 3) | 768×768 |
| Generation Modes | text-to-video, image-to-video |
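The latent resolutions in the table follow from dividing the video resolution by the VAE's 8× spatial factor. A small sketch of that arithmetic is shown here; the helper name `latent_size` is illustrative, and the temporal axis is left out because the exact frame-count mapping is not specified in this README.

```python
def latent_size(video_size, spatial_factor=8):
    """Map a (height, width) video resolution to its per-frame latent resolution."""
    height, width = video_size
    if height % spatial_factor or width % spatial_factor:
        raise ValueError("resolution must be divisible by the spatial factor")
    return height // spatial_factor, width // spatial_factor


for stage, size in {"stage 2": (256, 256), "stage 3": (768, 768)}.items():
    h, w = latent_size(size)
    print(f"{stage}: video {size[0]}x{size[1]} -> latent {h}x{w}")
```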
---
# Citation
<div style="height: 6px; background: #AE3E06; border-radius: 999px; margin: 12px 0 10px;"></div>
```bibtex
@misc{rouzbayani2026paris20decentralizeddiffusion,
  title={Paris 2.0: A Decentralized Diffusion Model for Video Generation},
  author={Ali Rouzbayani and Bidhan Roy and Marcos Villagra and Zhiying Jiang},
  year={2026},
  eprint={2605.26064},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2605.26064},
}
```
---
# License
See the `license` field above. Released for research and evaluation. By
requesting access you agree to the terms of the license and the acceptable
use policy.
---
<div style="display: flex; align-items: center; gap: 8px;">
<span>Made with ❤️ by</span>
<a href="https://twitter.com/bageldotcom" target="_blank">
<img src="https://img.shields.io/badge/Bagel_Labs-1DA1F2?style=for-the-badge&logo=twitter&logoColor=white" alt="Follow Bagel Labs on Twitter" height="28">
</a>
</div>