---
license: apache-2.0
tags:
- video-generation
- multi-reference
- LTX-2.3
base_model:
- Lightricks/LTX-2.3
github: https://github.com/liconstudio/ComfyUI-Licon-MSR
library_name: diffusers
---
## Overview
This model implements **Multiple Subject Reference (MSR)**, an approach to multi-reference video generation. Instead of introducing additional encoder branches or fusion modules, we transform multiple static reference images into a pseudo-video sequence that shares the same representation space as the target video.
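The pseudo-video idea can be illustrated with a minimal sketch. It assumes each reference image becomes exactly one frame and that all images already share the same size. Neither assumption is confirmed by this model card, the real pipeline works on encoded video latents rather than raw pixels, and `build_pseudo_video` is a hypothetical helper, not part of the plugin.

```python
def build_pseudo_video(reference_images):
    """Stack same-sized images along a new time axis: N images -> N frames.

    Each image is a nested list of rows (height x width). This is an
    illustration only; the actual plugin may resize, reorder, or repeat frames.
    """
    if not reference_images:
        raise ValueError("at least one reference image is required")
    height = len(reference_images[0])
    width = len(reference_images[0][0])
    for i, image in enumerate(reference_images):
        if len(image) != height or any(len(row) != width for row in image):
            raise ValueError(f"reference {i} is not {height}x{width}")
    # Copy the rows so later edits to the clip do not alter the inputs.
    return [[list(row) for row in image] for image in reference_images]


# Two toy 2x3 "images" become a 2-frame clip.
image_a = [[0, 1, 2], [3, 4, 5]]
image_b = [[9, 9, 9], [8, 8, 8]]
clip = build_pseudo_video([image_a, image_b])
print(len(clip), len(clip[0]), len(clip[0][0]))
```

Once the references are laid out as frames, they can pass through the same video encoder as the target, which is what places them in the same representation space.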
## Usage
This LoRA requires the **[ComfyUI-Licon-MSR](https://github.com/liconstudio/ComfyUI-Licon-MSR)** plugin for ComfyUI. A sample workflow is included in the model files for easy testing and experimentation.
## Key Features
### Multi-Reference Visual Memory
- **Token-level reference preservation**: Multiple reference images are encoded as video latents, preserving fine-grained visual information at the token level rather than compressing it into a single embedding
- **Native self-attention retrieval**: The target video tokens directly access reference tokens through the model's existing self-attention mechanism, so no new architectural components are needed
- **In-context conditioning**: References serve as "visual memory" within the main token sequence, not as external conditioning inputs
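The retrieval mechanism described above can be sketched with a toy single-head self-attention over one combined sequence. This is a simplification under stated assumptions: the projections are the identity (Q = K = V), the token values are made up, and the ordering of reference tokens before target tokens is a guess. The real model uses learned projections, many heads, and positional information.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(tokens):
    """Toy scaled dot-product self-attention with identity projections."""
    dim = len(tokens[0])
    scale = 1.0 / math.sqrt(dim)
    weights = []
    for query in tokens:
        scores = [scale * sum(q * k for q, k in zip(query, key)) for key in tokens]
        weights.append(softmax(scores))
    # Each output token is a weighted average of all tokens in the sequence.
    output = [
        [sum(w * value[j] for w, value in zip(row, tokens)) for j in range(dim)]
        for row in weights
    ]
    return output, weights


# Hypothetical 2-dimensional tokens: two reference tokens, one target token.
reference_tokens = [[1.0, 0.0], [0.0, 1.0]]
target_tokens = [[1.0, 1.0]]

# References sit in the same sequence as the target (in-context conditioning).
sequence = reference_tokens + target_tokens
output, weights = self_attention(sequence)

# Share of the target token's attention that lands on reference tokens.
reference_mass = sum(weights[-1][: len(reference_tokens)])
print(f"attention on references: {reference_mass:.3f}")
```

Because the references are ordinary entries in the sequence, the target token reads from them through the same attention weights it uses for everything else, with no separate cross-attention branch.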
### Flexible Reference Composition
- **2 to 5 reference images**: Supports varying numbers of reference inputs with increasing complexity
- **Complementary semantic roles**: Each reference image can carry different information:
  - Subject identity
  - Object/prop details
  - Scene/background
  - Local textures
  - Multiple viewpoints
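When preparing inputs, a small check like the one below can catch an unsupported reference count before a workflow is run. It is only a sketch: the role labels and the `validate_references` helper are hypothetical bookkeeping for the user, and nothing here states that the plugin accepts role labels. The roles assigned to the sample files are guesses for illustration.

```python
# Hypothetical role labels mirroring the list above; not a plugin API.
ALLOWED_ROLES = {"subject", "object", "scene", "texture", "viewpoint"}


def validate_references(references):
    """Check a list of (path, role) pairs and group the paths by role."""
    if not 2 <= len(references) <= 5:
        raise ValueError(f"expected 2 to 5 references, got {len(references)}")
    by_role = {}
    for path, role in references:
        if role not in ALLOWED_ROLES:
            raise ValueError(f"unknown role {role!r} for {path}")
        by_role.setdefault(role, []).append(path)
    return by_role


# Paths taken from the showcase table; role assignments are assumed.
grouped = validate_references([
    ("validition_v1/01/1.jpg", "subject"),
    ("validition_v1/01/2.jpg", "subject"),
    ("validition_v1/01/bg.png", "scene"),
])
print(grouped)
```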
## What It Can Do
### Identity Preservation Across References
Generate videos where multiple reference identities are simultaneously preserved:
- Multiple characters from different reference images
- Character + object combinations
- Object + scene compositions
### Relation-Based Composition
Beyond mere identity preservation, the model can compose references based on textual relation descriptions:
- Action interactions (handing, picking up, pushing)
- Spatial relationships (left-right, foreground-background)
- Temporal event structures (start → process → result)
### Cross-Reference Attribute Selection
The model learns to selectively retrieve attributes from different references:
- Face from reference A, clothing from reference B
- Object identity from one reference, pose/position from another
- Background elements from scene references
## Usage Tips (V1 Version)
- **Prompt description**: Describe the reference images concisely but accurately. Both over-description and under-description degrade consistency
- **High-motion scenes**: 50 fps is recommended for smooth, coherent motion
- **Generation reliability**: Accurate results typically take 2-3 sampling runs
## Results Showcase
### V1 Version
| Reference Images | Generated Video |
|:---:|:---:|
| <img src="validition_v1/01/1.jpg" width="80"> <img src="validition_v1/01/2.jpg" width="80"> <img src="validition_v1/01/bg.png" width="80"> | [▶ Play](validition_v1/01/video.mp4) |
| <img src="validition_v1/07/1.jpg" width="80"> <img src="validition_v1/07/2.jpg" width="80"> <img src="validition_v1/07/bg.png" width="80"> | [▶ Play](validition_v1/07/video.mp4) |
| <img src="validition_v1/05/1.png" width="70"> <img src="validition_v1/05/2.png" width="70"> <img src="validition_v1/05/5.png" width="70"> <img src="validition_v1/05/bg.png" width="70"> | [▶ Play](validition_v1/05/video.mp4) |
---