---
license: other
gated: manual
tags:
- paris2
- text-to-video
- image-to-video
- mixture-of-experts
- decentralized-diffusion-model
extra_gated_heading: "Request access to Paris 2.0: A Decentralized Diffusion Model for Video Generation"
extra_gated_description: |
  <img src="https://pub-2c09ae97630f4932a23e622b450076e0.r2.dev/paris2/model-card/v1/bagel_labs_logo.png" alt="Bagel Labs" height="28" style="margin-bottom: 12px;"/>

  Access is granted on a per-request basis after manual review by Bagel Labs.
  Each request is reviewed individually. Typical turnaround is 2-3 business days.
extra_gated_button_content: "Acknowledge license and request access"
extra_gated_prompt: >
  By requesting access, you agree to abide by the license and Bagel Labs
  acceptable use policy. These model weights are released for research and
  evaluation. Commercial use is not granted by default and requires written
  agreement with Bagel Labs.
extra_gated_fields:
  Full name: text
  Affiliation: text
  Affiliation type:
    type: select
    options:
      - Academic / university
      - Industry research lab
      - Startup
      - Large company
      - Independent researcher
      - {label: Other, value: other}
  Company or institution website: text
  Job title or role: text
  Country: country
  Intended use (1-2 sentences): text
  Will the model be used in a commercial product or service?:
    type: select
    options:
      - "No, research and evaluation only"
      - "Possibly in the future"
      - "Yes"
  Are you following @bageldotcom on Hugging Face?: checkbox
  Email used for this request matches my official affiliation domain: checkbox
  I agree to use this model for non-commercial research and evaluation only unless I have a separate written agreement with Bagel Labs: checkbox
  I agree to the license and the acceptable use policy: checkbox
---

	<p align="center">
	<img src="https://pub-2c09ae97630f4932a23e622b450076e0.r2.dev/paris2/model-card/v1/bagel_labs_logo.png" alt="Bagel Labs">
	</p>

	<h1 align="center">Paris 2.0: A Decentralized Diffusion Model for Video Generation</h1>

	<p align="center">
	<a href="https://huggingface.co/bageldotcom/paris2" target="_blank">
	<img src="https://img.shields.io/badge/🤗_DOWNLOAD_PARIS_2.0_WEIGHTS-FFD21E?style=for-the-badge&logoColor=000000" alt="Download Paris 2.0 Weights">
	</a>
	<a href="https://arxiv.org/abs/2605.26064" target="_blank">
	<img src="https://img.shields.io/badge/📄_READ_PARIS_2.0_PAPER-FF6B6B?style=for-the-badge&logoColor=white" alt="Read Paris 2.0 Technical Report">
	</a>
	</p>

	Paris 2.0 is a Decentralized Diffusion Model (DDM) for video generation,
	extending the Paris 1.0 DDM recipe from image generation to temporally
	coherent video. A DDM trains independent expert diffusion models without
	gradient synchronization, parameter sharing, or activation exchange, then uses
	a lightweight router to select experts during denoising.

	# Generated Samples

	<img src="https://pub-2c09ae97630f4932a23e622b450076e0.r2.dev/paris2/model-card/v1/paris2_sample_talking_head.png" alt="Paris 2.0 generated talking-head video frames" style="width: 100%; max-width: 960px; border-radius: 8px; margin: 14px 0 4px;"/>

	<p style="color: #6b7280; font-size: 14px; margin-top: 0;"><em>Prompt: A woman with long, blond, wavy hair is speaking directly to the camera.</em></p>

	<img src="https://pub-2c09ae97630f4932a23e622b450076e0.r2.dev/paris2/model-card/v1/paris2_sample_paper_craft.png" alt="Paris 2.0 generated paper-craft video frames" style="width: 100%; max-width: 960px; border-radius: 8px; margin: 14px 0 4px;"/>

	<p style="color: #6b7280; font-size: 14px; margin-top: 0;"><em>Prompt: A person's hands perform a paper-folding craft on a green cutting mat.</em></p>

	<img src="https://pub-2c09ae97630f4932a23e622b450076e0.r2.dev/paris2/model-card/v1/paris2_sample_slime.png" alt="Paris 2.0 generated slime video frames" style="width: 100%; max-width: 960px; border-radius: 8px; margin: 14px 0 4px;"/>

	<p style="color: #6b7280; font-size: 14px; margin-top: 0;"><em>Prompt: A pair of hands interacts with translucent blue slime.</em></p>

	# Results

	In a low-resolution text-to-video study, Paris 2.0 is
	compared against a monolithic model trained on the same data under a matched
	total compute budget. The decentralized model reduces FVD from 561.04 to
	279.01 and improves CLIP text-video similarity and aesthetic score under the
	same generation protocol.

	<img src="https://pub-2c09ae97630f4932a23e622b450076e0.r2.dev/paris2/model-card/v1/paris2_relative_improvement.png" alt="Paris 2.0 relative improvement over monolithic baseline" style="width: 100%; max-width: 960px; margin: 18px 0 8px;"/>

	<p style="color: #6b7280; font-size: 14px; margin-top: 0;"><em>Relative improvement over the monolithic baseline. Each bar shows the gain over monolithic, so a taller bar means a larger improvement (for FVD this corresponds to a lower distance, for CLIP and aesthetic to a higher score). Motion is descriptive and has no preferred direction.</em></p>

| Metric | Paris 2.0 DDM | Monolithic baseline |
|---|---:|---:|
| FVD ↓ | 279.01 | 561.04 |
| CLIP text-video ↑ | 0.2178 ± 0.0012 | 0.2032 ± 0.0011 |
| Aesthetic ↑ | 3.9036 ± 0.0082 | 3.7950 ± 0.0077 |
| Motion (px/frame) | 0.712 ± 0.057 | 0.555 ± 0.043 |
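
The relative gains can be recomputed from the table values. The sketch below assumes "relative improvement" means the change divided by the baseline value, with the sign flipped for lower-is-better metrics; the exact definition used for the figure is not stated here, and the function name is illustrative. Motion is left out because it has no preferred direction.

```python
# Values copied from the results table above.
RESULTS = {
    # metric: (Paris 2.0 DDM, monolithic baseline, higher_is_better)
    "FVD": (279.01, 561.04, False),
    "CLIP text-video": (0.2178, 0.2032, True),
    "Aesthetic": (3.9036, 3.7950, True),
}


def relative_improvement(ddm: float, baseline: float, higher_is_better: bool) -> float:
    """Return the gain over the baseline as a fraction of the baseline value.

    Assumed definition: (ddm - baseline) / baseline for higher-is-better
    metrics, (baseline - ddm) / baseline for lower-is-better metrics.
    """
    delta = ddm - baseline if higher_is_better else baseline - ddm
    return delta / baseline


for name, (ddm, baseline, higher_is_better) in RESULTS.items():
    gain = relative_improvement(ddm, baseline, higher_is_better)
    print(f"{name}: {gain:+.1%}")
```

Under this definition, FVD roughly halves (about 50%), while CLIP text-video similarity and aesthetic score improve by about 7% and 3% respectively.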

	# Inference Pipeline

	<img src="https://pub-2c09ae97630f4932a23e622b450076e0.r2.dev/paris2/model-card/v1/paris2_inference_pipeline.png" alt="Paris 2.0 inference pipeline" style="width: 100%; max-width: 960px; margin: 18px 0 8px;"/>

	<p style="color: #6b7280; font-size: 14px; margin-top: 0;"><em>A lightweight router selects top-K Flux MM-DiT experts at each denoising step, and the routed velocity is decoded into video through HunyuanVAE.</em></p>
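
The routing loop in the caption above can be illustrated with a toy sketch. Everything here is an assumption made for illustration and is not taken from the Paris 2.0 code: the softmax-renormalized mixing of the selected experts, the Euler integrator, the time grid running from t=1 to t=0, and all names (`top_k_weights`, `routed_velocity`, `sample`). The experts and router are stand-in functions on short lists, not the real models.

```python
import math
from typing import Callable, List, Sequence

Vector = List[float]
Expert = Callable[[Vector, float], Vector]           # (latent, t) -> velocity
Router = Callable[[Vector, float], Sequence[float]]  # (latent, t) -> one score per expert


def top_k_weights(scores: Sequence[float], k: int) -> List[float]:
    """Softmax over the k highest router scores; every other expert gets weight 0."""
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    peak = max(scores[i] for i in top)
    exp = {i: math.exp(scores[i] - peak) for i in top}
    total = sum(exp.values())
    return [exp[i] / total if i in exp else 0.0 for i in range(len(scores))]


def routed_velocity(x: Vector, t: float, experts: Sequence[Expert],
                    router: Router, k: int) -> Vector:
    """Weighted sum of the velocities predicted by the selected experts."""
    weights = top_k_weights(router(x, t), k)
    v = [0.0] * len(x)
    for w, expert in zip(weights, experts):
        if w == 0.0:
            continue  # experts with zero weight are never evaluated
        for i, vi in enumerate(expert(x, t)):
            v[i] += w * vi
    return v


def sample(x: Vector, experts: Sequence[Expert], router: Router,
           k: int = 1, steps: int = 4) -> Vector:
    """Euler integration of the routed velocity from t=1 to t=0 (assumed convention)."""
    x = list(x)
    for i in range(steps):
        t, t_next = 1.0 - i / steps, 1.0 - (i + 1) / steps
        v = routed_velocity(x, t, experts, router, k)
        x = [xi + (t_next - t) * vi for xi, vi in zip(x, v)]
    return x


# Toy stand-ins: three constant-velocity "experts" and a router with fixed scores.
toy_experts = [
    lambda x, t: [1.0] * len(x),
    lambda x, t: [-1.0] * len(x),
    lambda x, t: [0.0] * len(x),
]
toy_router = lambda x, t: [2.0, 1.0, 0.0]

print(sample([0.0, 0.0], toy_experts, toy_router, k=1))
```

In the real pipeline the final latent would then be decoded to video by the VAE; that step is omitted here.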

	# Key Characteristics

	- Three 11B Flux MM-DiT expert diffusion models
	- Lightweight router selecting experts during denoising
	- No gradient synchronization, parameter sharing, or activation exchange
	between experts during training
	- Supports text-to-video and image-to-video generation
	- Multi-stage checkpoints at 256×256 and 768×768 video resolutions

	---

	# What This Repository Contains

This repository contains the Paris 2.0 expert pool and the learned router. Each
expert includes a Stage 2 checkpoint (256×256 video) and a Stage 3 checkpoint
(768×768 video).

	```
	expert1/ Expert 1
	expert2/ Expert 2
	expert3/ Expert 3
	Router/ Routing model
	model_index.json
	```

	Each checkpoint is provided in both unwrapped single-file
	(`master.safetensors`) and sharded (`model/`) formats for compatibility with
	different inference frameworks.

	---

	# Setup — Required External Components

	Inference requires four third-party components that are not bundled in
	this repository. Each is released by its original authors under its own
	license, and you should fetch them directly from the upstream sources. After
	downloading, place them in the working directory alongside the contents of
	this repo using the layout below.

	```bash
	# 1. Hunyuan Video VAE (Tencent)
	hf download tencent/HunyuanVideo hunyuan-video-t2v-720p/vae/pytorch_model.pt --local-dir ./hunyuan_vae
	mv ./hunyuan_vae/hunyuan-video-t2v-720p/vae/pytorch_model.pt ./vae.pt

	# 2. T5 text encoder, fp16, encoder-only (community-maintained Flux variant)
	hf download comfyanonymous/flux_text_encoders t5xxl_fp16.safetensors --local-dir ./t5
	mv ./t5/t5xxl_fp16.safetensors ./t5/model.safetensors

	# 3. T5 tokenizer + config (Google)
	hf download google/t5-v1_1-xxl config.json spiece.model special_tokens_map.json tokenizer_config.json --local-dir ./t5

	# 4. CLIP ViT-L/14 (OpenAI)
	hf download openai/clip-vit-large-patch14 --local-dir ./clip
	```

Final layout after running the four download steps above and adding the contents of this repository:

	```
	.
	├── expert1/ expert2/ expert3/ Router/ (this repo)
	├── model_index.json (this repo)
	├── vae.pt (Tencent HunyuanVideo)
	├── t5/ (Google T5 + Flux encoder-only safetensors)
	└── clip/ (OpenAI CLIP)
	```
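
A quick way to confirm the layout before running inference is to check that each expected path exists. This is a minimal sketch: the paths are taken from the tree and download commands above, while the helper name `missing_components` is illustrative and not part of this repository.

```python
from pathlib import Path
from typing import List

# Paths taken from the layout above; entries ending in "/" are directories.
EXPECTED = [
    "expert1/", "expert2/", "expert3/", "Router/",
    "model_index.json",
    "vae.pt",
    "t5/model.safetensors", "t5/config.json", "t5/spiece.model",
    "t5/special_tokens_map.json", "t5/tokenizer_config.json",
    "clip/",
]


def missing_components(root: str = ".") -> List[str]:
    """Return the expected entries that are absent under `root`."""
    base = Path(root)
    missing = []
    for entry in EXPECTED:
        path = base / entry
        present = path.is_dir() if entry.endswith("/") else path.is_file()
        if not present:
            missing.append(entry)
    return missing


if __name__ == "__main__":
    gaps = missing_components(".")
    print("layout OK" if not gaps else "missing: " + ", ".join(gaps))
```

The check only verifies that files and directories are present; it does not validate their contents.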

	## Third-party components and licenses

| Component | Upstream | License |
|---|---|---|
| Hunyuan Video VAE | [`tencent/HunyuanVideo`](https://huggingface.co/tencent/HunyuanVideo) | [Tencent Hunyuan Community License](https://huggingface.co/tencent/HunyuanVideo/blob/main/LICENSE.txt) |
| T5 text encoder weights (encoder-only fp16) | [`comfyanonymous/flux_text_encoders`](https://huggingface.co/comfyanonymous/flux_text_encoders) | Apache 2.0 (derived from Google T5-v1.1) |
| T5 tokenizer and config | [`google/t5-v1_1-xxl`](https://huggingface.co/google/t5-v1_1-xxl) | Apache 2.0 |
| CLIP ViT-L/14 | [`openai/clip-vit-large-patch14`](https://huggingface.co/openai/clip-vit-large-patch14) | MIT |

	Use of each component is governed by its own upstream license. The license
	field on this repository applies only to the expert and router weights we
	trained.

	The null-conditioning tensors `null_clip.pt` and `null_t5.pt` referenced by
	`model_index.json` for classifier-free guidance are produced by encoding an
	empty string through CLIP and T5 respectively; once you have the encoders
	above, you can regenerate them yourself with a few lines of code.
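
One possible way to regenerate the two tensors is sketched below with the `transformers` library. Several details are assumptions that this README does not confirm: that `null_clip.pt` holds CLIP's pooled text embedding and `null_t5.pt` holds T5's per-token hidden states (the usual Flux convention), the padding lengths, the fp16 dtype for T5, and the output location next to `model_index.json`. The helper names are illustrative. It also assumes the `./t5` and `./clip` directories from the setup steps load directly with `from_pretrained`.

```python
from pathlib import Path
from typing import Dict

# Assumed padding lengths; the values used in training are not stated in this README.
CLIP_MAX_LENGTH = 77   # CLIP ViT-L/14 text context length
T5_MAX_LENGTH = 512    # hypothetical value, adjust to match the training setup


def null_tensor_paths(root: str = ".") -> Dict[str, Path]:
    """Output locations (assumed to sit next to model_index.json)."""
    base = Path(root)
    return {"clip": base / "null_clip.pt", "t5": base / "null_t5.pt"}


def make_null_embeddings(root: str = ".") -> None:
    """Encode the empty string with CLIP and T5 and save both results."""
    # Heavy dependencies are imported lazily, only when the function runs.
    import torch
    from transformers import AutoTokenizer, CLIPTextModel, T5EncoderModel

    base = Path(root)
    paths = null_tensor_paths(root)

    with torch.no_grad():
        # CLIP: assumed to contribute its pooled text embedding.
        clip_tok = AutoTokenizer.from_pretrained(str(base / "clip"))
        clip_enc = CLIPTextModel.from_pretrained(str(base / "clip")).eval()
        clip_in = clip_tok("", padding="max_length", max_length=CLIP_MAX_LENGTH,
                           truncation=True, return_tensors="pt")
        null_clip = clip_enc(input_ids=clip_in.input_ids).pooler_output

        # T5: assumed to contribute per-token hidden states.
        t5_tok = AutoTokenizer.from_pretrained(str(base / "t5"))
        t5_enc = T5EncoderModel.from_pretrained(
            str(base / "t5"), torch_dtype=torch.float16
        ).eval()
        t5_in = t5_tok("", padding="max_length", max_length=T5_MAX_LENGTH,
                       truncation=True, return_tensors="pt")
        null_t5 = t5_enc(input_ids=t5_in.input_ids).last_hidden_state

    torch.save(null_clip, paths["clip"])
    torch.save(null_t5, paths["t5"])
```

Calling `make_null_embeddings(".")` from the working directory writes both files. Check `model_index.json` for the paths and tensor shapes the pipeline actually expects before relying on the output.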

	---

	# Architecture Details

| Component | Specification |
|---|---|
| Architecture | Flux MM-DiT |
| Parameters per Expert | 11B |
| Number of Experts | 3 |
| Routing Model | Lightweight transformer router |
| Text Conditioning | T5 + CLIP ViT-L/14 |
| Video VAE | Hunyuan Video VAE (4× temporal, 8× spatial) |
| Latent Resolution (stage 2) | 32×32 per frame |
| Latent Resolution (stage 3) | 96×96 per frame |
| Video Resolution (stage 2) | 256×256 |
| Video Resolution (stage 3) | 768×768 |
| Generation Modes | text-to-video, image-to-video |
	---

	# Citation

	<div style="height: 6px; background: #AE3E06; border-radius: 999px; margin: 12px 0 10px;"></div>

	```bibtex
	@misc{rouzbayani2026paris20decentralizeddiffusion,
	title={Paris 2.0: A Decentralized Diffusion Model for Video Generation},
	author={Ali Rouzbayani and Bidhan Roy and Marcos Villagra and Zhiying Jiang},
	year={2026},
	eprint={2605.26064},
	archivePrefix={arXiv},
	primaryClass={cs.CV},
	url={https://arxiv.org/abs/2605.26064},
	}
	```

	---

	# License

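See the `license` field in the model card metadata. The weights are released
for research and evaluation; commercial use requires a separate written
agreement with Bagel Labs. By requesting access you agree to the terms of the
license and the acceptable use policy.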

	---

	<div style="display: flex; align-items: center; gap: 8px;">
	<span>Made with ❤️ by</span>
	<a href="https://twitter.com/bageldotcom" target="_blank">
	<img src="https://img.shields.io/badge/Bagel_Labs-1DA1F2?style=for-the-badge&logo=twitter&logoColor=white" alt="Follow Bagel Labs on Twitter" height="28">
	</a>
	</div>