---
license: other
gated: manual
tags:
- paris2
- text-to-video
- image-to-video
- mixture-of-experts
- decentralized-diffusion-model
extra_gated_heading: "Request access to Paris 2.0: A Decentralized Diffusion Model for Video Generation"
extra_gated_description: |
  <img src="https://pub-2c09ae97630f4932a23e622b450076e0.r2.dev/paris2/model-card/v1/bagel_labs_logo.png" alt="Bagel Labs" height="28" style="margin-bottom: 12px;"/>

  Access is granted on a per-request basis after manual review by Bagel Labs.
  Each request is reviewed individually. Typical turnaround is 2-3 business days.
extra_gated_button_content: "Acknowledge license and request access"
extra_gated_prompt: >
  By requesting access, you agree to abide by the license and Bagel Labs
  acceptable use policy. These model weights are released for research and
  evaluation. Commercial use is not granted by default and requires written
  agreement with Bagel Labs.
extra_gated_fields:
  Full name: text
  Affiliation: text
  Affiliation type:
    type: select
    options:
      - Academic / university
      - Industry research lab
      - Startup
      - Large company
      - Independent researcher
      - {label: Other, value: other}
  Company or institution website: text
  Job title or role: text
  Country: country
  Intended use (1-2 sentences): text
  Will the model be used in a commercial product or service?:
    type: select
    options:
      - "No, research and evaluation only"
      - "Possibly in the future"
      - "Yes"
  Are you following @bageldotcom on Hugging Face?: checkbox
  Email used for this request matches my official affiliation domain: checkbox
  I agree to use this model for non-commercial research and evaluation only unless I have a separate written agreement with Bagel Labs: checkbox
  I agree to the license and the acceptable use policy: checkbox
---

<p align="center">
  <img src="https://pub-2c09ae97630f4932a23e622b450076e0.r2.dev/paris2/model-card/v1/bagel_labs_logo.png" alt="Bagel Labs">
</p>

<h1 align="center">Paris 2.0: A Decentralized Diffusion Model for Video Generation</h1>

<p align="center">
  <a href="https://huggingface.co/bageldotcom/paris2" target="_blank">
    <img src="https://img.shields.io/badge/🤗_DOWNLOAD_PARIS_2.0_WEIGHTS-FFD21E?style=for-the-badge&logoColor=000000" alt="Download Paris 2.0 Weights">
  </a>
  <a href="https://arxiv.org/abs/2605.26064" target="_blank">
    <img src="https://img.shields.io/badge/READ_PARIS_2.0_PAPER-FF6B6B?style=for-the-badge&logoColor=white" alt="Read Paris 2.0 Technical Report">
  </a>
</p>

Paris 2.0 is a Decentralized Diffusion Model (DDM) for video generation,
extending the Paris 1.0 DDM recipe from image generation to temporally
coherent video. A DDM trains independent expert diffusion models without
gradient synchronization, parameter sharing, or activation exchange, then uses
a lightweight router to select experts during denoising.
|
|
# Generated Samples

<img src="https://pub-2c09ae97630f4932a23e622b450076e0.r2.dev/paris2/model-card/v1/paris2_sample_talking_head.png" alt="Paris 2.0 generated talking-head video frames" style="width: 100%; max-width: 960px; border-radius: 8px; margin: 14px 0 4px;"/>

<p style="color: #6b7280; font-size: 14px; margin-top: 0;"><em>Prompt: A woman with long, blond, wavy hair is speaking directly to the camera.</em></p>

<img src="https://pub-2c09ae97630f4932a23e622b450076e0.r2.dev/paris2/model-card/v1/paris2_sample_paper_craft.png" alt="Paris 2.0 generated paper-craft video frames" style="width: 100%; max-width: 960px; border-radius: 8px; margin: 14px 0 4px;"/>

<p style="color: #6b7280; font-size: 14px; margin-top: 0;"><em>Prompt: A person's hands perform a paper-folding craft on a green cutting mat.</em></p>

<img src="https://pub-2c09ae97630f4932a23e622b450076e0.r2.dev/paris2/model-card/v1/paris2_sample_slime.png" alt="Paris 2.0 generated slime video frames" style="width: 100%; max-width: 960px; border-radius: 8px; margin: 14px 0 4px;"/>

<p style="color: #6b7280; font-size: 14px; margin-top: 0;"><em>Prompt: A pair of hands interacts with translucent blue slime.</em></p>
|
|
# Results

In a low-resolution text-to-video study, Paris 2.0 is
compared against a monolithic model trained on the same data under a matched
total compute budget. Under the same generation protocol, the decentralized
model reduces FVD from 561.04 to 279.01 (roughly half) and improves both CLIP
text-video similarity (0.2032 to 0.2178) and aesthetic score (3.7950 to
3.9036).

<img src="https://pub-2c09ae97630f4932a23e622b450076e0.r2.dev/paris2/model-card/v1/paris2_relative_improvement.png" alt="Paris 2.0 relative improvement over monolithic baseline" style="width: 100%; max-width: 960px; margin: 18px 0 8px;"/>

<p style="color: #6b7280; font-size: 14px; margin-top: 0;"><em>Relative improvement over the monolithic baseline. Each bar shows the gain over the monolithic baseline, so a taller bar means a larger improvement (for FVD this corresponds to a lower distance, for CLIP and aesthetic to a higher score). Motion is descriptive and has no preferred direction.</em></p>

| Metric | Paris 2.0 DDM | Monolithic baseline |
|---|---:|---:|
| FVD ↓ | 279.01 | 561.04 |
| CLIP text-video ↑ | 0.2178 ± 0.0012 | 0.2032 ± 0.0011 |
| Aesthetic ↑ | 3.9036 ± 0.0082 | 3.7950 ± 0.0077 |
| Motion (px/frame) | 0.712 ± 0.057 | 0.555 ± 0.043 |
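
The relative improvements can be recomputed from the table. The sketch below assumes the usual definition of relative change against the baseline, with the sign flipped for FVD because lower is better. The exact formula behind the figure is not stated in this card, and the names `RESULTS` and `relative_improvement` are illustrative. Motion is left out because it has no preferred direction.

```python
# Mean metric values copied from the results table:
# (Paris 2.0 DDM, monolithic baseline).
RESULTS = {
    "FVD": (279.01, 561.04),
    "CLIP text-video": (0.2178, 0.2032),
    "Aesthetic": (3.9036, 3.7950),
}
# FVD is a distance, so a lower value is better; the others are scores.
LOWER_IS_BETTER = {"FVD"}


def relative_improvement(metric: str) -> float:
    """Return the gain over the monolithic baseline as a percentage."""
    ddm, baseline = RESULTS[metric]
    change = (baseline - ddm) if metric in LOWER_IS_BETTER else (ddm - baseline)
    return 100.0 * change / baseline


for name in RESULTS:
    print(f"{name}: {relative_improvement(name):+.2f}%")
```

Under this definition the gains come out at roughly 50% for FVD, 7% for CLIP text-video similarity, and 3% for aesthetic score.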
|
|
# Inference Pipeline

<img src="https://pub-2c09ae97630f4932a23e622b450076e0.r2.dev/paris2/model-card/v1/paris2_inference_pipeline.png" alt="Paris 2.0 inference pipeline" style="width: 100%; max-width: 960px; margin: 18px 0 8px;"/>

<p style="color: #6b7280; font-size: 14px; margin-top: 0;"><em>A lightweight router selects the top-K Flux MM-DiT experts at each denoising step, and the routed velocity is decoded into video through the Hunyuan Video VAE.</em></p>
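
To make top-K routing concrete, here is a toy sketch of a single denoising step in plain Python. Only "select the top-K experts" comes from this card. The softmax over router logits, the renormalized weighted average of expert velocities, and all names (`routed_velocity`, `router_logits`, `top_k`) are assumptions for illustration, not the Paris 2.0 implementation.

```python
import math
from typing import Callable, List, Sequence

Vector = List[float]
Expert = Callable[[Vector, float], Vector]  # (latent, timestep) -> velocity


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a list of router logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def routed_velocity(latent: Vector, t: float, experts: Sequence[Expert],
                    router_logits: Sequence[float], top_k: int) -> Vector:
    """Blend the velocities of the top_k experts ranked by router probability."""
    probs = softmax(router_logits)
    ranked = sorted(range(len(experts)), key=lambda i: probs[i], reverse=True)
    chosen = ranked[:top_k]
    norm = sum(probs[i] for i in chosen)
    velocity = [0.0] * len(latent)
    for i in chosen:  # experts outside the top_k are never evaluated
        weight = probs[i] / norm
        for j, v in enumerate(experts[i](latent, t)):
            velocity[j] += weight * v
    return velocity


# Toy stand-ins for the three experts: each predicts a constant velocity.
experts = [lambda x, t, c=c: [c] * len(x) for c in (1.0, 2.0, 3.0)]
# Hypothetical router output for this step (probabilities 0.1, 0.3, 0.6).
logits = [math.log(1.0), math.log(3.0), math.log(6.0)]

print(routed_velocity([0.0, 0.0], t=0.5, experts=experts,
                      router_logits=logits, top_k=2))
```

With `top_k=2`, the sketch evaluates only the two highest-scoring experts and blends them with weights 2/3 and 1/3. The third expert is never called, so the per-step compute does not grow with the full expert pool.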
|
|
# Key Characteristics

- Three 11B Flux MM-DiT expert diffusion models
- Lightweight router selecting experts during denoising
- No gradient synchronization, parameter sharing, or activation exchange
  between experts during training
- Supports text-to-video and image-to-video generation
- Multi-stage checkpoints at 256×256 and 768×768 video resolutions
|
|
---

# What This Repository Contains

This repository contains the Paris 2.0 expert pool and learned router. Each
expert includes Stage 2 and Stage 3 checkpoints for 256×256 and 768×768 video
resolutions.

```
expert1/           Expert 1
expert2/           Expert 2
expert3/           Expert 3
Router/            Routing model
model_index.json
```

Each checkpoint is provided in both unwrapped single-file
(`master.safetensors`) and sharded (`model/`) formats for compatibility with
different inference frameworks.
|
|
---

# Setup: Required External Components

Inference requires four third-party components that are **not bundled** in
this repository. Each is released by its original authors under its own
license, and you should fetch them directly from the upstream sources. After
downloading, place them in the working directory alongside the contents of
this repo using the layout below.

```bash
# 1. Hunyuan Video VAE (Tencent)
hf download tencent/HunyuanVideo hunyuan-video-t2v-720p/vae/pytorch_model.pt --local-dir ./hunyuan_vae
mv ./hunyuan_vae/hunyuan-video-t2v-720p/vae/pytorch_model.pt ./vae.pt

# 2. T5 text encoder, fp16, encoder-only (community-maintained Flux variant)
hf download comfyanonymous/flux_text_encoders t5xxl_fp16.safetensors --local-dir ./t5
mv ./t5/t5xxl_fp16.safetensors ./t5/model.safetensors

# 3. T5 tokenizer + config (Google)
hf download google/t5-v1_1-xxl config.json spiece.model special_tokens_map.json tokenizer_config.json --local-dir ./t5

# 4. CLIP ViT-L/14 (OpenAI)
hf download openai/clip-vit-large-patch14 --local-dir ./clip
```

Final layout after completing the four steps above and downloading this
repository into the same directory:

```
.
├── expert1/ expert2/ expert3/ Router/   (this repo)
├── model_index.json                     (this repo)
├── vae.pt                               (Tencent HunyuanVideo)
├── t5/                                  (Google T5 + Flux encoder-only safetensors)
└── clip/                                (OpenAI CLIP)
```
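
Before running inference, it can help to confirm that the working directory matches this layout. The following is a small sketch using only the Python standard library. The path lists are transcribed from the listing and download commands above, and the function name `missing_entries` is illustrative.

```python
from pathlib import Path
from typing import List

# Directories and files taken from the layout and download steps above.
EXPECTED_DIRS = ["expert1", "expert2", "expert3", "Router", "t5", "clip"]
EXPECTED_FILES = [
    "model_index.json",
    "vae.pt",
    "t5/model.safetensors",
    "t5/config.json",
    "t5/spiece.model",
    "t5/special_tokens_map.json",
    "t5/tokenizer_config.json",
]


def missing_entries(root: str = ".") -> List[str]:
    """Return the expected paths that are absent under root."""
    base = Path(root)
    missing = [d + "/" for d in EXPECTED_DIRS if not (base / d).is_dir()]
    missing += [f for f in EXPECTED_FILES if not (base / f).is_file()]
    return missing


if __name__ == "__main__":
    gaps = missing_entries()
    print("layout OK" if not gaps else "missing: " + ", ".join(gaps))
```

The check only looks at names, not file contents, so it catches a skipped download or a forgotten `mv` but not a corrupted file.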
|
|
## Third-party components and licenses
|
|
| | Component | Upstream | License | |
| |---|---|---| |
| | Hunyuan Video VAE | [`tencent/HunyuanVideo`](https://huggingface.co/tencent/HunyuanVideo) | [Tencent Hunyuan Community License](https://huggingface.co/tencent/HunyuanVideo/blob/main/LICENSE.txt) | |
| | T5 text encoder weights (encoder-only fp16) | [`comfyanonymous/flux_text_encoders`](https://huggingface.co/comfyanonymous/flux_text_encoders) | Apache 2.0 (derived from Google T5-v1.1) | |
| | T5 tokenizer and config | [`google/t5-v1_1-xxl`](https://huggingface.co/google/t5-v1_1-xxl) | Apache 2.0 | |
| | CLIP ViT-L/14 | [`openai/clip-vit-large-patch14`](https://huggingface.co/openai/clip-vit-large-patch14) | MIT | |
|
|
Use of each component is governed by its own upstream license. The license
field on this repository applies only to the expert and router weights we
trained.

The null-conditioning tensors `null_clip.pt` and `null_t5.pt` referenced by
`model_index.json` for classifier-free guidance are produced by encoding an
empty string through CLIP and T5 respectively; once you have the encoders
above, you can regenerate them yourself with a few lines of code.
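
One way to regenerate the two tensors is sketched below using PyTorch and Hugging Face `transformers`. This card does not specify which encoder output, sequence length, dtype, or save location the pipeline expects, so several choices here are assumptions borrowed from common Flux practice and marked as such in the comments: the pooled CLIP embedding, the T5 last hidden state, a T5 length of 512, and writing to the working directory. Check tensor shapes against `model_index.json` before relying on the result.

```python
def encode_empty_prompt(tokenizer, encoder, max_length, output_field):
    """Tokenize "" padded to max_length and return one field of the encoder output."""
    tokens = tokenizer(
        "",
        padding="max_length",
        max_length=max_length,
        truncation=True,
        return_tensors="pt",
    )
    return getattr(encoder(tokens["input_ids"]), output_field)


def write_null_conditioning(clip_dir="./clip", t5_dir="./t5"):
    """Encode the empty prompt with both encoders and save the results."""
    # Heavy imports stay local so the helper above can be imported
    # without torch or transformers installed.
    import torch
    from transformers import (CLIPTextModel, CLIPTokenizer,
                              T5EncoderModel, T5Tokenizer)

    with torch.no_grad():
        null_clip = encode_empty_prompt(
            CLIPTokenizer.from_pretrained(clip_dir),
            CLIPTextModel.from_pretrained(clip_dir),
            max_length=77,                 # CLIP text context length
            output_field="pooler_output",  # ASSUMPTION: pooled embedding
        )
        null_t5 = encode_empty_prompt(
            T5Tokenizer.from_pretrained(t5_dir),  # needs sentencepiece
            # Loads in the default dtype, which needs a lot of memory for
            # T5-XXL; the dtype the pipeline expects is not stated here.
            T5EncoderModel.from_pretrained(t5_dir),
            max_length=512,                # ASSUMPTION: not stated in this card
            output_field="last_hidden_state",
        )
    # ASSUMPTION: files go in the working directory.
    torch.save(null_clip, "null_clip.pt")
    torch.save(null_t5, "null_t5.pt")
```

Calling `write_null_conditioning()` from the working directory described above writes both files. The helper takes the tokenizer and encoder as arguments so the output field or sequence length can be swapped if the pipeline expects something different.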
|
|
---

# Architecture Details

| Component | Specification |
|---|---|
| Architecture | Flux MM-DiT |
| Parameters per Expert | 11B |
| Number of Experts | 3 |
| Routing Model | Lightweight transformer router |
| Text Conditioning | T5 + CLIP ViT-L/14 |
| Video VAE | Hunyuan Video VAE (4× temporal, 8× spatial) |
| Latent Resolution (stage 2) | 32×32 per frame |
| Latent Resolution (stage 3) | 96×96 per frame |
| Video Resolution (stage 2) | 256×256 |
| Video Resolution (stage 3) | 768×768 |
| Generation Modes | text-to-video, image-to-video |
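
The latent resolutions in the table follow from the VAE compression factors: 256 / 8 = 32 and 768 / 8 = 96. The short sketch below does that arithmetic. The spatial part comes straight from the table. The temporal formula assumes causal compression, where the first frame is encoded on its own; that assumption and the frame count in the example are not stated in this card.

```python
from typing import Tuple


def latent_shape(frames: int, height: int, width: int,
                 temporal: int = 4, spatial: int = 8) -> Tuple[int, int, int]:
    """Map a pixel-space video size to (latent_frames, latent_h, latent_w)."""
    if height % spatial or width % spatial:
        raise ValueError("height and width must be multiples of the spatial factor")
    # ASSUMPTION: causal temporal compression, where the first frame gets its
    # own latent frame and each later group of `temporal` frames shares one.
    latent_frames = (frames - 1) // temporal + 1
    return latent_frames, height // spatial, width // spatial


# Hypothetical 17-frame clips at the two documented resolutions.
print(latent_shape(17, 256, 256))  # (5, 32, 32)
print(latent_shape(17, 768, 768))  # (5, 96, 96)
```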
|
|
---

# Citation

<div style="height: 6px; background: #AE3E06; border-radius: 999px; margin: 12px 0 10px;"></div>

```bibtex
@misc{rouzbayani2026paris20decentralizeddiffusion,
  title={Paris 2.0: A Decentralized Diffusion Model for Video Generation},
  author={Ali Rouzbayani and Bidhan Roy and Marcos Villagra and Zhiying Jiang},
  year={2026},
  eprint={2605.26064},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2605.26064},
}
```
|
|
---

# License

See the `license` field in the metadata above. The weights are released for
research and evaluation; commercial use requires a separate written agreement
with Bagel Labs. By requesting access you agree to the terms of the license
and the acceptable use policy.

---

<div style="display: flex; align-items: center; gap: 8px;">
  <span>Made with ❤️ by</span>
  <a href="https://twitter.com/bageldotcom" target="_blank">
    <img src="https://img.shields.io/badge/Bagel_Labs-1DA1F2?style=for-the-badge&logo=twitter&logoColor=white" alt="Follow Bagel Labs on Twitter" height="28">
  </a>
</div>