---
license: apache-2.0
tags:
- video-generation
- multi-reference
- LTX-2.3
base_model:
- Lightricks/LTX-2.3
github: https://github.com/liconstudio/ComfyUI-Licon-MSR
library_name: diffusers
---

## Overview

This model implements a novel approach to multi-reference video generation using **Multiple Subject Reference (MSR)**. Instead of introducing additional encoder branches or fusion modules, we transform multiple static reference images into a pseudo-video sequence that shares the same representation space as the target video.

## Usage

This LoRA requires the **[ComfyUI-Licon-MSR](https://github.com/liconstudio/ComfyUI-Licon-MSR)** plugin for ComfyUI. A sample workflow is included in the model files for easy testing and experimentation.

## Key Features

### Multi-Reference Visual Memory

- **Token-level reference preservation**: Multiple reference images are encoded as video latents, preserving fine-grained visual information at token level rather than compressing into a single embedding
- **Native self-attention retrieval**: The target video tokens directly access reference tokens through the model's existing self-attention mechanism, so no new architectural components are needed
- **In-context conditioning**: References serve as "visual memory" within the main token sequence, not as external conditioning inputs

### Flexible Reference Composition

- **2 to 5 reference images**: Supports varying numbers of reference inputs with increasing complexity
- **Complementary semantic roles**: Each reference image can carry different information:
  - Subject identity
  - Object/prop details
  - Scene/background
  - Local textures
  - Multiple viewpoints

## What It Can Do

### Identity Preservation Across References

Generate videos where multiple reference identities are simultaneously preserved:

- Multiple characters from different reference images
- Character + object combinations
- Object + scene compositions

### Relation-Based Composition

Beyond mere identity preservation, the model can
compose references based on textual relation descriptions:

- Action interactions (handing, picking up, pushing)
- Spatial relationships (left-right, foreground-background)
- Temporal event structures (start → process → result)

### Cross-Reference Attribute Selection

The model learns to selectively retrieve attributes from different references:

- Face from reference A, clothing from reference B
- Object identity from one reference, pose/position from another
- Background elements from scene references

## Usage Tips (V1 Version)

- **Prompt description**: Describe the reference images concisely but accurately. Both over-description and under-description degrade consistency.
- **High-motion scenes**: 50 fps is recommended for smooth, coherent motion.
- **Generation reliability**: Accurate results typically take 2-3 sampling runs.

## Results Showcase

### V1 Version

| Reference Images | Generated Video |
|:---:|:---:|
| | [▶ Play](validition_v1/01/video.mp4) |
| | [▶ Play](validition_v1/07/video.mp4) |
| | [▶ Play](validition_v1/05/video.mp4) |

---
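
The in-context conditioning idea from the Overview and Key Features (reference tokens placed in the same sequence as target video tokens and read back through ordinary self-attention) can be pictured with a toy example. The sketch below is not the model's implementation. It assumes hand-written 4-dimensional vectors in place of real video latents, identity query/key/value projections, no positional encoding, and references placed before the target tokens. The names `build_sequence` and `self_attention` are hypothetical and exist only for this illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def build_sequence(reference_tokens, target_tokens):
    """Flatten all reference tokens, then append the target tokens.

    Placing references first is an assumption of this sketch.
    Returns the joint sequence and the number of reference tokens in it.
    """
    sequence = []
    for ref in reference_tokens:
        sequence.extend(ref)
    n_ref = len(sequence)
    sequence.extend(target_tokens)
    return sequence, n_ref


def self_attention(tokens):
    """Single-head self-attention with identity projections (toy version)."""
    dim = len(tokens[0])
    scale = 1.0 / math.sqrt(dim)
    outputs, weights = [], []
    for query in tokens:
        scores = [scale * sum(q * k for q, k in zip(query, key)) for key in tokens]
        row = softmax(scores)
        weights.append(row)
        outputs.append(
            [sum(row[j] * tokens[j][d] for j in range(len(tokens))) for d in range(dim)]
        )
    return outputs, weights


# Two toy references, one token each (stand-ins for encoded reference images).
reference_tokens = [
    [[1.0, 0.0, 0.0, 0.0]],  # reference A
    [[0.0, 1.0, 0.0, 0.0]],  # reference B
]
# Two toy target tokens: the first resembles A, the second resembles B.
target_tokens = [[0.9, 0.1, 0.0, 0.0], [0.1, 0.9, 0.0, 0.0]]

sequence, n_ref = build_sequence(reference_tokens, target_tokens)
outputs, weights = self_attention(sequence)

# Share of attention each target token spends on reference tokens.
ref_mass = [sum(row[:n_ref]) for row in weights[n_ref:]]
print(ref_mass)
```

Because the references live in the same sequence, each target token's attention row already contains weights over the reference tokens, and no separate cross-attention branch appears anywhere in the sketch. In this toy data the first target token puts more weight on reference A than on reference B, which loosely mirrors the cross-reference attribute selection described above.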