This is a modular diffusion pipeline built with 🧨 Diffusers' modular pipeline framework.

Pipeline Type: WorldEngineBlocks

Description:

This pipeline uses a 5-block architecture that can be customized and extended.

Example Usage

[TODO]

Pipeline Architecture

This modular pipeline is composed of the following blocks:

  1. text_encoder (WorldEngineTextEncoderStep)
    • Text Encoder step that generates text embeddings to guide frame generation
  2. controller_encoder (WorldEngineControllerEncoderStep)
    • Controller Encoder step that encodes mouse, button, and scroll inputs for conditioning
  3. before_denoise (WorldEngineBeforeDenoiseStep)
    • Before denoise step that prepares inputs for denoising:
    • set_timesteps: WorldEngineSetTimestepsStep
      • Sets up scheduler sigmas for rectified flow denoising
    • setup_kv_cache: WorldEngineSetupKVCacheStep
      • Initializes or reuses KV cache for autoregressive frame generation
    • prepare_latents: WorldEnginePrepareLatentsStep
      • Prepares latents for frame generation. If an image is provided on the first frame, encodes it and caches it as context. Always creates fresh random noise for the actual denoising.
  4. denoise (WorldEngineDenoiseLoop)
    • Denoises latents using rectified flow (x = x + dsigma * v) and updates KV cache for autoregressive generation.
  5. decode (WorldEngineDecodeStep)
    • Decodes denoised latents to RGB image using the VAE decoder
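
The rectified-flow update in the denoise loop (x = x + dsigma * v) is Euler integration over the scheduler sigmas. A minimal NumPy sketch, using the default sigmas from the configuration below and a stand-in velocity field in place of the transformer:

```python
import numpy as np

def rectified_flow_denoise(x, velocity_fn, sigmas):
    # Euler integration of dx/dsigma = v: at each step, x <- x + dsigma * v.
    for i in range(len(sigmas) - 1):
        dsigma = sigmas[i + 1] - sigmas[i]  # negative: sigma decreases toward 0
        v = velocity_fn(x, sigmas[i])
        x = x + dsigma * v
    return x

# Default sigmas from the pipeline configuration.
sigmas = [1.0, 0.94921875, 0.83984375, 0.0]

# Stand-in velocity field v = x / sigma (pure noise toward zero data);
# the real pipeline calls the transformer here.
x0 = np.ones((1, 1, 16, 16, 16))  # [B, F, C, H, W] latent shape from the config
x_final = rectified_flow_denoise(x0, lambda x, s: x / s, sigmas)
```

With this stand-in velocity each step scales x by sigma[i+1] / sigma[i], so the product telescopes to sigma_last / sigma_first = 0 and the sample integrates exactly to zero.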

Model Components

  1. text_encoder (UMT5EncoderModel)
  2. tokenizer (AutoTokenizer)
  3. image_processor (VaeImageProcessor)
  4. transformer (AutoModel)
  5. vae (AutoModel)

Configuration Parameters

  • n_buttons (default: 256)
  • scheduler_sigmas (default: [1.0, 0.94921875, 0.83984375, 0.0])
  • channels (default: 16)
  • height (default: 16)
  • width (default: 16)
  • patch (default: [2, 2])
  • vae_scale_factor (default: 16)
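
Assuming the usual Diffusers convention that pixel resolution equals latent resolution times vae_scale_factor, and that the transformer patchifies the latent grid by the patch factors (both assumptions, not stated by this card), the defaults imply:

```python
# Latent shape defaults from the configuration.
channels, height, width = 16, 16, 16
vae_scale_factor = 16
patch = (2, 2)

# Pixel resolution of each decoded frame (assumed convention: latent * scale factor).
pixel_h, pixel_w = height * vae_scale_factor, width * vae_scale_factor

# Transformer tokens per frame after patchifying the latent grid (assumption).
tokens_per_frame = (height // patch[0]) * (width // patch[1])

print(pixel_h, pixel_w, tokens_per_frame)  # → 256 256 64
```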

Input/Output Specification

Inputs (all optional):

  • prompt (Any): The prompt or prompts to guide the frame generation
  • prompt_embeds (Tensor): Pre-computed text embeddings
  • prompt_pad_mask (Tensor): Padding mask for prompt embeddings
  • button (Set), default: set(): Set of pressed button IDs
  • mouse (Tuple), default: (0.0, 0.0): Mouse velocity (x, y)
  • scroll (int), default: 0: Scroll wheel direction (-1, 0, 1)
  • button_tensor (Tensor): One-hot encoded button tensor
  • mouse_tensor (Tensor): Mouse velocity tensor
  • scroll_tensor (Tensor): Scroll wheel sign tensor
  • scheduler_sigmas (List): Custom scheduler sigmas (overrides config)
  • frame_timestamp (Tensor): Current frame timestamp
  • kv_cache (Optional): Existing KV cache (will be reused if provided)
  • reset_cache (bool), default: False: If True, reset the KV cache even if one exists
  • image (Union): Input image (PIL Image or [H, W, 3] uint8 tensor), only used on first frame
  • latents (Tensor): Latent tensor for denoising [1, 1, C, H, W]. Only used if use_random_latents=False.
  • use_random_latents (bool), default: True: If True, always generate fresh random latents. If False, use provided latents.
  • generator (Generator): torch Generator for deterministic output
  • output_type (Any), default: pil: The output format for the generated images (pil, latent, pt, or np)
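
The raw controller inputs (button, mouse, scroll) correspond to the precomputed *_tensor inputs above. A plausible NumPy sketch of that preprocessing, using the n_buttons default from the configuration; the actual encoder step may differ in dtype or shape:

```python
import numpy as np

def encode_controller(button=frozenset(), mouse=(0.0, 0.0), scroll=0, n_buttons=256):
    """Encode raw controller state into conditioning tensors (illustrative sketch)."""
    # Multi-hot encoding of pressed button IDs over the n_buttons vocabulary.
    button_tensor = np.zeros(n_buttons, dtype=np.float32)
    for b in button:
        button_tensor[b] = 1.0
    mouse_tensor = np.asarray(mouse, dtype=np.float32)  # (x, y) velocity
    scroll_tensor = np.float32(np.sign(scroll))         # -1, 0, or 1
    return button_tensor, mouse_tensor, scroll_tensor

b, m, s = encode_controller(button={3, 7}, mouse=(0.5, -0.25), scroll=5)
```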

Outputs:

  • prompt_embeds (Tensor): Text embeddings used to guide frame generation

  • prompt_pad_mask (Tensor): Padding mask for prompt embeddings
  • button_tensor (Tensor): One-hot encoded button tensor
  • mouse_tensor (Tensor): Mouse velocity tensor
  • scroll_tensor (Tensor): Scroll wheel sign tensor
  • scheduler_sigmas (Tensor): Tensor of scheduler sigmas for denoising
  • frame_timestamp (Tensor): Current frame timestamp
  • kv_cache (StaticKVCache): KV cache for transformer attention
  • latents (Tensor): Latent tensor for denoising [1, 1, C, H, W]
  • images (Union): Decoded RGB image in requested output format