This is a modular diffusion pipeline built with 🧨 Diffusers' modular pipeline framework.

Pipeline Type: WorldEngineBlocks

Description:

This pipeline uses a 5-block architecture that can be customized and extended.

Example Usage

[TODO]

Pipeline Architecture

This modular pipeline is composed of the following blocks:

  1. text_encoder (WorldEngineTextEncoderStep)
    • Text Encoder step that generates text embeddings to guide frame generation
  2. controller_encoder (WorldEngineControllerEncoderStep)
    • Controller Encoder step that encodes mouse, button, and scroll inputs for conditioning
  3. before_denoise (WorldEngineBeforeDenoiseStep)
    • Before denoise step that prepares inputs for denoising:
    • set_timesteps: WorldEngineSetTimestepsStep
      • Sets up scheduler sigmas for rectified flow denoising
    • setup_kv_cache: WorldEngineSetupKVCacheStep
      • Initializes or reuses KV cache for autoregressive frame generation
    • prepare_latents: WorldEnginePrepareLatentsStep
      • Prepares latents for frame generation. If an image is provided on the first frame, encodes it and caches it as context. Always creates fresh random noise for the actual denoising.
  4. denoise (WorldEngineDenoiseLoop)
    • Denoises latents using rectified flow (x = x + dsigma * v) and updates KV cache for autoregressive generation.
  5. decode (WorldEngineDecodeStep)
    • Decodes denoised latents to RGB image using the VAE decoder
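
The rectified-flow update in the denoise loop (x = x + dsigma * v) is Euler integration over the scheduler sigmas. A minimal NumPy sketch, using the default sigmas from the configuration below and a stand-in velocity field in place of the transformer:

```python
import numpy as np

def rectified_flow_denoise(x, velocity_fn, sigmas):
    # Euler integration of dx/dsigma = v: at each step, x <- x + dsigma * v.
    for i in range(len(sigmas) - 1):
        dsigma = sigmas[i + 1] - sigmas[i]  # negative: sigma decreases toward 0
        v = velocity_fn(x, sigmas[i])
        x = x + dsigma * v
    return x

# Default sigmas from the pipeline configuration.
sigmas = [1.0, 0.94921875, 0.83984375, 0.0]

# Stand-in velocity field v = x / sigma (pure noise toward zero data);
# the real pipeline calls the transformer here.
x0 = np.ones((1, 1, 16, 16, 16))  # [B, F, C, H, W] latent shape from the config
x_final = rectified_flow_denoise(x0, lambda x, s: x / s, sigmas)
```

With this stand-in velocity each step scales x by sigma[i+1] / sigma[i], so the product telescopes to sigma_last / sigma_first = 0 and the sample integrates exactly to zero.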

Model Components

  1. text_encoder (UMT5EncoderModel)
  2. tokenizer (AutoTokenizer)
  3. image_processor (VaeImageProcessor)
  4. transformer (AutoModel)
  5. vae (AutoModel)

Configuration Parameters

  • n_buttons (default: 256)
  • scheduler_sigmas (default: [1.0, 0.94921875, 0.83984375, 0.0])
  • channels (default: 16)
  • height (default: 16)
  • width (default: 16)
  • patch (default: [2, 2])
  • vae_scale_factor (default: 16)
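
Assuming the usual Diffusers convention that pixel resolution equals latent resolution times vae_scale_factor, and that the transformer patchifies the latent grid by the patch factors (both assumptions, not stated by this card), the defaults imply:

```python
# Latent shape defaults from the configuration.
channels, height, width = 16, 16, 16
vae_scale_factor = 16
patch = (2, 2)

# Pixel resolution of each decoded frame (assumed convention: latent * scale factor).
pixel_h, pixel_w = height * vae_scale_factor, width * vae_scale_factor

# Transformer tokens per frame after patchifying the latent grid (assumption).
tokens_per_frame = (height // patch[0]) * (width // patch[1])

print(pixel_h, pixel_w, tokens_per_frame)  # → 256 256 64
```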

Input/Output Specification

Inputs (all optional):

  • prompt (Any): The prompt or prompts to guide the frame generation
  • prompt_embeds (Tensor): Pre-computed text embeddings
  • prompt_pad_mask (Tensor): Padding mask for prompt embeddings
  • button (Set), default: set(): Set of pressed button IDs
  • mouse (Tuple), default: (0.0, 0.0): Mouse velocity (x, y)
  • scroll (int), default: 0: Scroll wheel direction (-1, 0, 1)
  • button_tensor (Tensor): One-hot encoded button tensor
  • mouse_tensor (Tensor): Mouse velocity tensor
  • scroll_tensor (Tensor): Scroll wheel sign tensor
  • scheduler_sigmas (List): Custom scheduler sigmas (overrides config)
  • frame_timestamp (Tensor): Current frame timestamp
  • kv_cache (Optional): Existing KV cache (will be reused if provided)
  • reset_cache (bool), default: False: If True, reset the KV cache even if one exists
  • image (Union): Input image (PIL Image or [H, W, 3] uint8 tensor), only used on first frame
  • latents (Tensor): Latent tensor for denoising [1, 1, C, H, W]. Only used if use_random_latents=False.
  • use_random_latents (bool), default: True: If True, always generate fresh random latents. If False, use provided latents.
  • generator (Generator): torch Generator for deterministic output
  • output_type (Any), default: pil: The output format for the generated images (pil, latent, pt, or np)
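
The raw controller inputs (button, mouse, scroll) correspond to the precomputed *_tensor inputs above. A plausible NumPy sketch of that preprocessing, using the n_buttons default from the configuration; the actual encoder step may differ in dtype or shape:

```python
import numpy as np

def encode_controller(button=frozenset(), mouse=(0.0, 0.0), scroll=0, n_buttons=256):
    """Encode raw controller state into conditioning tensors (illustrative sketch)."""
    # Multi-hot encoding of pressed button IDs over the n_buttons vocabulary.
    button_tensor = np.zeros(n_buttons, dtype=np.float32)
    for b in button:
        button_tensor[b] = 1.0
    mouse_tensor = np.asarray(mouse, dtype=np.float32)  # (x, y) velocity
    scroll_tensor = np.float32(np.sign(scroll))         # -1, 0, or 1
    return button_tensor, mouse_tensor, scroll_tensor

b, m, s = encode_controller(button={3, 7}, mouse=(0.5, -0.25), scroll=5)
```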

Outputs:

  • prompt_embeds (Tensor): Text embeddings used to guide frame generation

  • prompt_pad_mask (Tensor): Padding mask for prompt embeddings
  • button_tensor (Tensor): One-hot encoded button tensor
  • mouse_tensor (Tensor): Mouse velocity tensor
  • scroll_tensor (Tensor): Scroll wheel sign tensor
  • scheduler_sigmas (Tensor): Tensor of scheduler sigmas for denoising
  • frame_timestamp (Tensor): Current frame timestamp
  • kv_cache (StaticKVCache): KV cache for transformer attention
  • latents (Tensor): Latent tensor for denoising [1, 1, C, H, W]
  • images (Union): Decoded RGB image in requested output format