# Cosmos-Predict2.5-2B (Diffusers Format)

This is the NVIDIA Cosmos Predict 2.5 2B model in a Diffusers-compatible format for use with FastVideo.

## Model Components

This model consists of the following components:

### 1. Transformer (DiT)

- **Class**: `Cosmos25Transformer3DModel`
- **Architecture**: 28 layers, 16 attention heads, 128 head dim
- **Parameters**: ~2B
- **Input channels**: 16 (latent space)
- **Patch size**: (1, 2, 2) across the temporal and spatial dimensions
- **Features**:
  - AdaLN-LoRA conditioning (dim=256)
  - RoPE positional embeddings with 3D scaling
  - Cross-attention projection for text conditioning
  - RMS normalization for Q/K
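As a sanity check on the ~2B figure, the hidden size and a rough parameter count follow from the numbers above. This is a back-of-envelope sketch only; the MLP ratio and the cross-attention projection shapes are assumptions, not values read from the config:

```python
# Back-of-envelope parameter estimate from the architecture figures above.
# Assumptions (not from the config): MLP ratio of 4, cross-attention
# projections the same shape as self-attention, biases/norms/embeddings ignored.
layers = 28
heads = 16
head_dim = 128
d = heads * head_dim  # hidden size: 16 * 128 = 2048

self_attn = 4 * d * d   # Q, K, V, and output projections
cross_attn = 4 * d * d  # assumed same shape as self-attention
mlp = 2 * d * (4 * d)   # up- and down-projection with assumed ratio 4
per_layer = self_attn + cross_attn + mlp

total = layers * per_layer
print(d, total)  # 2048, ~1.9e9 -- consistent with the ~2B figure
```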
### 2. VAE (Wan2.1)

- **Class**: `AutoencoderKLWan`
- **Latent channels**: 16
- **Compression**: 8x spatial, 4x temporal
- **Architecture**: 4-stage encoder/decoder with residual blocks
- **Features**:
  - Feature caching for efficiency
  - Configurable tiling support
  - Output clipped to [-1, 1]
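The compression factors above determine the latent shape for a given video. A minimal sketch, assuming the usual causal-video convention in which the first frame is encoded alone, so latent frames = (frames - 1) // 4 + 1:

```python
# Latent shape implied by the VAE's 16 channels and 8x spatial / 4x temporal
# compression. Assumption: first frame encoded alone (frames = 4k + 1).
def latent_shape(frames: int, height: int, width: int) -> tuple[int, int, int, int]:
    return (16, (frames - 1) // 4 + 1, height // 8, width // 8)

print(latent_shape(121, 480, 832))  # (16, 31, 60, 104)
```

With the transformer's (1, 2, 2) patch size, this latent would be split into 31 x 30 x 52 = 48,360 tokens.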
### 3. Scheduler

- **Class**: `FlowUniPCMultistepScheduler`
- **Type**: multi-step flow-matching solver (UniPC)
- **Order**: 2 (predictor-corrector)
- **Configuration**:
  - Training timesteps: 1000
  - Shift: 1
  - No dynamic shifting
  - Solver type: bh2 (recommended for >10 steps)
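For reference, the configuration above would correspond to a `scheduler_config.json` roughly like the following. The field names here follow common diffusers scheduler conventions and are an assumption; check the actual file in `scheduler/` for the authoritative keys:

```json
{
  "_class_name": "FlowUniPCMultistepScheduler",
  "num_train_timesteps": 1000,
  "shift": 1,
  "use_dynamic_shifting": false,
  "solver_order": 2,
  "solver_type": "bh2"
}
```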
### 4. Text Encoder & Tokenizer

- **Note**: the text encoder and tokenizer are not included in this directory
- **Official implementation**: uses Reason1 or the official TextEncoder from `cosmos_predict2`
- **Expected format**: text embeddings with shape (batch, 512, 100352)
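Since the encoder ships separately, a lightweight check at the pipeline boundary can catch mismatched embeddings early. A hypothetical helper (not part of FastVideo) enforcing the shape contract stated above:

```python
# Hypothetical guard for the embedding contract: (batch, 512, 100352).
EXPECTED_SEQ_LEN = 512
EXPECTED_EMBED_DIM = 100352

def check_text_embeddings(shape: tuple[int, ...]) -> None:
    """Raise if an embedding tensor shape does not match the expected contract."""
    if len(shape) != 3 or shape[1:] != (EXPECTED_SEQ_LEN, EXPECTED_EMBED_DIM):
        raise ValueError(
            f"expected (batch, {EXPECTED_SEQ_LEN}, {EXPECTED_EMBED_DIM}), got {shape}"
        )

check_text_embeddings((2, 512, 100352))  # passes silently
```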
## Directory Structure

```
models--nvidia--Cosmos-Predict2.5-2B-Diffusers/
├── model_index.json          # Pipeline component registry
├── README.md                 # This file
├── transformer/
│   ├── config.json           # Transformer configuration
│   └── 81edfebe-bd6a-4039-8c1d-737df1a790bf_ema_bf16.pt  # Model weights
├── vae/
│   ├── config.json           # VAE configuration
│   └── tokenizer.pth         # VAE weights
└── scheduler/
    └── scheduler_config.json # Scheduler configuration
```
## Usage with FastVideo

### Option 1: Using FastVideo Pipeline (Recommended)

```python
from fastvideo import FastVideoArgs
from fastvideo.pipelines.basic.cosmos.cosmos2_5_pipeline import Cosmos2_5Pipeline

# Initialize pipeline
args = FastVideoArgs.from_cli_args(model="nvidia/Cosmos-Predict2.5-2B-Diffusers")
pipeline = Cosmos2_5Pipeline(args)

# Generate video
output = pipeline(
    prompt="A robot welding in an industrial setting",
    height=480,
    width=832,
    num_frames=121,
    num_inference_steps=35,
    guidance_scale=7.0,
)
```
### Option 2: Manual Component Loading

```python
from fastvideo.models.dits.cosmos2_5 import Cosmos25Transformer3DModel
from fastvideo.models.vaes.wanvae import AutoencoderKLWan
from fastvideo.models.schedulers.scheduling_flow_unipc_multistep import FlowUniPCMultistepScheduler

# Load components
transformer = Cosmos25Transformer3DModel.from_pretrained(
    "nvidia/Cosmos-Predict2.5-2B-Diffusers",
    subfolder="transformer",
)
vae = AutoencoderKLWan.from_pretrained(
    "nvidia/Cosmos-Predict2.5-2B-Diffusers",
    subfolder="vae",
)
scheduler = FlowUniPCMultistepScheduler.from_pretrained(
    "nvidia/Cosmos-Predict2.5-2B-Diffusers",
    subfolder="scheduler",
)
```
## Key Differences from Official

1. **Scheduler**: uses `FlowUniPCMultistepScheduler` (multi-step), matching the official Cosmos 2.5 implementation, NOT the `FlowMatchEulerDiscreteScheduler` (single-step) used in some FastVideo examples.
2. **Weight format**: FastVideo-compatible weight format with proper key mapping.
3. **Configuration**: all hyperparameters match the official Cosmos 2.5 2B model.
## Inference Parameters

Recommended settings for best quality:

- **Resolution**: 480x832 (height and width should be multiples of 16)
- **Frames**: 121 (or any compatible length)
- **Steps**: 35 (with the UniPC scheduler)
- **Guidance scale**: 7.0
- **Scheduler shift**: 5.0 (dynamic, applied during inference)
- **FPS**: 24.0
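The resolution and frame-count constraints above are easy to enforce programmatically. A hypothetical helper (not part of FastVideo), assuming the frames = 4k + 1 convention implied by the VAE's 4x temporal compression:

```python
# Hypothetical helpers for the recommendations above: height/width rounded
# down to multiples of 16, frame count of the form 4k + 1.
def snap_resolution(height: int, width: int) -> tuple[int, int]:
    """Round each dimension down to the nearest multiple of 16."""
    return (height // 16) * 16, (width // 16) * 16

def is_valid_frame_count(frames: int) -> bool:
    """Check compatibility with 4x temporal compression (frames = 4k + 1)."""
    return frames >= 1 and (frames - 1) % 4 == 0

print(snap_resolution(480, 832))  # (480, 832) -- already aligned
print(is_valid_frame_count(121))  # True: 121 = 4 * 30 + 1
```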
## Model Information

- **Model size**: ~2B parameters (transformer only)
- **Precision**: BFloat16
- **Context**: trained for video prediction/generation
- **License**: see NVIDIA's official license for Cosmos models
## Citation

If you use this model, please cite:

```bibtex
@misc{cosmos2024,
  title={Cosmos: Foundation Models for Video Generation},
  author={NVIDIA},
  year={2024}
}
```
## Notes

1. This is a Diffusers-compatible format, but it uses FastVideo classes, not standard Diffusers classes.
2. The text encoder component must be loaded separately from the official `cosmos_predict2` package.
3. For best results, use the same scheduler (`FlowUniPCMultistepScheduler`) that the official model uses.
4. The model expects text embeddings of shape (batch, 512, 100352); make sure your text encoder produces this format.