This is a modular diffusion pipeline built with 🧨 Diffusers' modular pipeline framework.
Pipeline Type: WorldEngineBlocks
Description:
This pipeline uses a 5-block architecture that can be customized and extended.
Example Usage
[TODO]
Pipeline Architecture
This modular pipeline is composed of the following blocks:
- text_encoder (WorldEngineTextEncoderStep): Text Encoder step that generates text embeddings to guide frame generation
- controller_encoder (WorldEngineControllerEncoderStep): Controller Encoder step that encodes mouse, button, and scroll inputs for conditioning
- before_denoise (WorldEngineBeforeDenoiseStep): Before denoise step that prepares inputs for denoising:
  - set_timesteps (WorldEngineSetTimestepsStep): Sets up scheduler sigmas for rectified flow denoising
  - setup_kv_cache (WorldEngineSetupKVCacheStep): Initializes or reuses KV cache for autoregressive frame generation
  - prepare_latents (WorldEnginePrepareLatentsStep): Prepares latents for frame generation. If an image is provided on the first frame, encodes it and caches it as context. Always creates fresh random noise for the actual denoising.
- denoise (WorldEngineDenoiseLoop): Denoises latents using rectified flow (x = x + dsigma * v) and updates KV cache for autoregressive generation.
- decode (WorldEngineDecodeStep): Decodes denoised latents to RGB image using the VAE decoder
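The update rule quoted for the denoise block, x = x + dsigma * v, can be illustrated with a small standalone sketch. This is not the code of WorldEngineDenoiseLoop: it assumes that dsigma is the difference between the next and the current sigma (so it is negative when moving toward 0.0) and that the velocity points from the clean sample toward the noise. The names `denoise` and `oracle_velocity` are hypothetical, and the sigma values are the defaults listed under Configuration Parameters.

```python
from typing import Callable, List, Sequence

Vector = List[float]


def denoise(
    x: Sequence[float],
    sigmas: Sequence[float],
    velocity_fn: Callable[[Vector, float], Vector],
) -> Vector:
    """Euler integration of a rectified flow: x = x + dsigma * v per step.

    Assumption: dsigma = sigma_next - sigma, so it is negative when the
    schedule decreases from 1.0 to 0.0.
    """
    x = list(x)
    for sigma, sigma_next in zip(sigmas[:-1], sigmas[1:]):
        v = velocity_fn(x, sigma)   # stand-in for the transformer prediction
        dsigma = sigma_next - sigma
        x = [xi + dsigma * vi for xi, vi in zip(x, v)]
    return x


# Default schedule from the configuration: 4 sigmas, hence 3 denoising steps.
SIGMAS = [1.0, 0.94921875, 0.83984375, 0.0]

clean = [0.5, -0.25, 2.0]   # toy "clean latent"
noise = [1.0, 0.0, -1.0]    # toy starting noise at sigma = 1.0


def oracle_velocity(x: Vector, sigma: float) -> Vector:
    # Hypothetical perfect model: constant velocity from clean toward noise.
    return [n - c for n, c in zip(noise, clean)]


result = denoise(noise, SIGMAS, oracle_velocity)
print(result)
```

With a perfect velocity the three steps sum to a total dsigma of -1.0, so the toy example lands back on the clean vector. A real model only approximates this velocity, which is why the schedule matters in practice.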
Model Components
- text_encoder (UMT5EncoderModel)
- tokenizer (AutoTokenizer)
- image_processor (VaeImageProcessor)
- transformer (AutoModel)
- vae (AutoModel)
Configuration Parameters
- n_buttons (default: 256)
- scheduler_sigmas (default: [1.0, 0.94921875, 0.83984375, 0.0])
- channels (default: 16)
- height (default: 16)
- width (default: 16)
- patch (default: [2, 2])
- vae_scale_factor (default: 16)
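To see how these defaults might fit together, here is a rough sketch of the shape arithmetic. It rests on assumptions this card does not confirm: that height and width are latent dimensions (suggested by the [1, 1, C, H, W] latent layout in the specification), that patch describes non-overlapping patches, and that vae_scale_factor is the spatial upscaling applied by the VAE decoder. The helper names are hypothetical.

```python
from typing import Dict, Tuple

# Default values copied from the configuration parameters.
config: Dict[str, object] = {
    "channels": 16,
    "height": 16,
    "width": 16,
    "patch": [2, 2],
    "vae_scale_factor": 16,
}


def latent_shape(cfg: Dict[str, object]) -> Tuple[int, int, int, int, int]:
    # Layout [1, 1, C, H, W]: batch of one, one frame per step.
    return (1, 1, cfg["channels"], cfg["height"], cfg["width"])


def tokens_per_frame(cfg: Dict[str, object]) -> int:
    # Assumption: non-overlapping patches of size patch[0] x patch[1].
    patch_h, patch_w = cfg["patch"]
    return (cfg["height"] // patch_h) * (cfg["width"] // patch_w)


def pixel_size(cfg: Dict[str, object]) -> Tuple[int, int]:
    # Assumption: the VAE decoder upsamples each latent axis by vae_scale_factor.
    scale = cfg["vae_scale_factor"]
    return (cfg["height"] * scale, cfg["width"] * scale)


print(latent_shape(config))      # (1, 1, 16, 16, 16)
print(tokens_per_frame(config))  # 64
print(pixel_size(config))        # (256, 256)
```

If a checkpoint ships different height and width values, the same arithmetic applies with those numbers substituted.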
Input/Output Specification
Inputs
Optional:
- prompt (Any): The prompt or prompts to guide the frame generation
- prompt_embeds (Tensor): Pre-computed text embeddings
- prompt_pad_mask (Tensor): Padding mask for prompt embeddings
- button (Set), default: set(): Set of pressed button IDs
- mouse (Tuple), default: (0.0, 0.0): Mouse velocity (x, y)
- scroll (int), default: 0: Scroll wheel direction (-1, 0, 1)
- button_tensor (Tensor): One-hot encoded button tensor
- mouse_tensor (Tensor): Mouse velocity tensor
- scroll_tensor (Tensor): Scroll wheel sign tensor
- scheduler_sigmas (List): Custom scheduler sigmas (overrides config)
- frame_timestamp (Tensor): Current frame timestamp
- kv_cache (Optional): Existing KV cache (will be reused if provided)
- reset_cache (bool), default: False: If True, reset the KV cache even if one exists
- image (Union): Input image (PIL Image or [H, W, 3] uint8 tensor), only used on first frame
- latents (Tensor): Latent tensor for denoising [1, 1, C, H, W]. Only used if use_random_latents=False.
- use_random_latents (bool), default: True: If True, always generate fresh random latents. If False, use provided latents.
- generator (Generator): torch Generator for deterministic output
- output_type (Any), default: pil: The output format for the generated images (pil, latent, pt, or np)
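The relationship between the raw controller inputs (button, mouse, scroll) and their tensor counterparts can be sketched in plain Python. This is only an illustration of the encoding the descriptions suggest, not the code of WorldEngineControllerEncoderStep: the function name `encode_controller` is hypothetical, lists stand in for tensors, and the real step may use different shapes or dtypes.

```python
from typing import List, Set, Tuple


def encode_controller(
    button: Set[int],
    mouse: Tuple[float, float] = (0.0, 0.0),
    scroll: int = 0,
    n_buttons: int = 256,
) -> Tuple[List[float], List[float], float]:
    """Turn raw controller state into plain-list stand-ins for the tensors."""
    for button_id in button:
        if not 0 <= button_id < n_buttons:
            raise ValueError(f"button id {button_id} outside [0, {n_buttons})")

    # One slot per button ID; pressed buttons are set to 1.0.
    button_vec = [1.0 if i in button else 0.0 for i in range(n_buttons)]
    # Mouse velocity as (x, y).
    mouse_vec = [float(mouse[0]), float(mouse[1])]
    # Only the sign of the scroll input is kept: -1.0, 0.0, or 1.0.
    scroll_sign = float((scroll > 0) - (scroll < 0))
    return button_vec, mouse_vec, scroll_sign


button_vec, mouse_vec, scroll_sign = encode_controller(
    {3, 17}, mouse=(0.5, -1.0), scroll=-2
)
print(sum(button_vec), mouse_vec, scroll_sign)  # 2.0 [0.5, -1.0] -1.0
```

Passing the pre-encoded tensors directly is presumably useful when the same controller state is replayed across many frames.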
Outputs
- prompt_embeds (Tensor): Text embeddings used to guide frame generation
- prompt_pad_mask (Tensor): Padding mask for prompt embeddings
- button_tensor (Tensor): One-hot encoded button tensor
- mouse_tensor (Tensor): Mouse velocity tensor
- scroll_tensor (Tensor): Scroll wheel sign tensor
- scheduler_sigmas (Tensor): Tensor of scheduler sigmas for denoising
- frame_timestamp (Tensor): Current frame timestamp
- kv_cache (StaticKVCache): KV cache for transformer attention
- latents (Tensor): Latent tensor for denoising [1, 1, C, H, W]
- images (Union): Decoded RGB image in requested output format
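Because kv_cache appears both as an input and as an output, the cache returned for one frame can be passed back in for the next. The reuse-or-reset decision described for kv_cache and reset_cache can be sketched as follows. The helper `resolve_kv_cache` is hypothetical and a plain list stands in for StaticKVCache. The actual step may clear an existing cache in place rather than build a new one.

```python
from typing import Callable, Optional, TypeVar

CacheT = TypeVar("CacheT")


def resolve_kv_cache(
    kv_cache: Optional[CacheT],
    reset_cache: bool,
    make_cache: Callable[[], CacheT],
) -> CacheT:
    """Reuse the provided cache unless it is missing or a reset is requested."""
    if kv_cache is None or reset_cache:
        return make_cache()
    return kv_cache


# A list stands in for the real cache object in this toy example.
first = resolve_kv_cache(None, reset_cache=False, make_cache=list)
first.append("frame 0")  # pretend the denoise loop wrote to the cache

# Next frame: the previous output cache is fed back in and reused.
second = resolve_kv_cache(first, reset_cache=False, make_cache=list)

# Starting a new sequence: reset_cache=True discards the old context.
third = resolve_kv_cache(first, reset_cache=True, make_cache=list)

print(second is first, third)  # True []
```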