	---
	license: apache-2.0
	tags:
	- video-generation
	- multi-reference
	- LTX-2.3
	base_model:
	- Lightricks/LTX-2.3
	github: https://github.com/liconstudio/ComfyUI-Licon-MSR
	library_name: diffusers
	---

	## Overview

	This model implements a novel approach to multi-reference video generation using Multiple Subject Reference (MSR). Instead of introducing additional encoder branches or fusion modules, we transform multiple static reference images into a pseudo-video sequence that shares the same representation space as the target video.
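The pseudo-video idea can be illustrated with a toy sketch. The names `build_pseudo_video` and `video_shape` are hypothetical, images are plain nested lists, and the one-frame-per-reference layout is an assumption made for illustration; the actual preprocessing in the plugin may resize, reorder, or repeat frames differently.

```python
from typing import List, Tuple

Frame = List[List[float]]  # one grayscale image as rows of pixel values


def build_pseudo_video(references: List[Frame]) -> List[Frame]:
    """Stack same-sized reference images along a new time axis.

    Assumption: each reference becomes exactly one frame of the sequence.
    """
    if not references:
        raise ValueError("at least one reference image is required")
    height, width = len(references[0]), len(references[0][0])
    for index, image in enumerate(references):
        if len(image) != height or any(len(row) != width for row in image):
            raise ValueError(f"reference {index} does not match {height}x{width}")
    # Copy the rows so the pseudo-video does not alias the caller's images.
    return [[list(row) for row in image] for image in references]


def video_shape(video: List[Frame]) -> Tuple[int, int, int]:
    """Return (frames, height, width) of a stacked sequence."""
    return (len(video), len(video[0]), len(video[0][0]))


# Three tiny 2x2 "images" standing in for subject, prop, and background.
subject = [[0.1, 0.2], [0.3, 0.4]]
prop = [[0.5, 0.6], [0.7, 0.8]]
background = [[0.9, 1.0], [0.0, 0.1]]

pseudo_video = build_pseudo_video([subject, prop, background])
print(video_shape(pseudo_video))  # (3, 2, 2)
```

Because the stacked references have the same layout as a short video clip, they can in principle pass through the same encoder as the target video, which is the point of sharing one representation space.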

	## Usage

	This LoRA requires the [ComfyUI-Licon-MSR](https://github.com/liconstudio/ComfyUI-Licon-MSR) plugin for ComfyUI. A sample workflow is included in the model files for easy testing and experimentation.

	## Key Features

	### Multi-Reference Visual Memory

	- Token-level reference preservation: Multiple reference images are encoded as video latents, preserving fine-grained visual information at the token level rather than compressing it into a single embedding
	- Native self-attention retrieval: The target video tokens directly access reference tokens through the model's existing self-attention mechanism, so no new architectural components are needed
	- In-context conditioning: References serve as "visual memory" within the main token sequence, not as external conditioning inputs
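As a rough illustration of in-context conditioning, the sketch below places reference tokens and target tokens in one sequence and lets a target token attend over all of them with plain scaled dot-product attention. It is a toy under stated assumptions, not the model's implementation: the tokens are two-dimensional, the learned query/key/value projections are omitted, and the helper names are hypothetical.

```python
import math
from typing import List

Vector = List[float]


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(score - peak) for score in scores]
    total = sum(exps)
    return [value / total for value in exps]


def attention_weights(query: Vector, keys: List[Vector]) -> List[float]:
    """Scaled dot-product attention weights for a single query token."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    return softmax(scores)


def attend(query: Vector, keys: List[Vector], values: List[Vector]) -> Vector:
    """Weighted sum of the values, using the weights computed from the keys."""
    weights = attention_weights(query, keys)
    dims = len(values[0])
    return [sum(w * value[d] for w, value in zip(weights, values)) for d in range(dims)]


# Assumption: tokens double as keys and values (no learned projections).
reference_tokens = [[1.0, 0.0], [0.0, 1.0]]  # stand-ins for reference latents
target_tokens = [[0.9, 0.1], [0.2, 0.8]]     # stand-ins for target video latents

# References sit inside the same sequence rather than in a separate branch.
sequence = reference_tokens + target_tokens

query = target_tokens[0]
weights = attention_weights(query, sequence)
reference_mass = sum(weights[: len(reference_tokens)])
mixed = attend(query, sequence, sequence)

print(f"attention mass on reference tokens: {reference_mass:.2f}")
```

In this toy setup the first target token puts more weight on the first reference token than on the second, because their vectors are more similar; retrieval falls out of ordinary attention with no extra module.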

	### Flexible Reference Composition

	- 2 to 5 reference images: Supports varying numbers of reference inputs with increasing complexity
	- Complementary semantic roles: Each reference image can carry different information:
	  - Subject identity
	  - Object/prop details
	  - Scene/background
	  - Local textures
	  - Multiple viewpoints
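A reference set can be checked before it is handed to a workflow. The sketch below is illustrative only: the `Reference` class, the role labels, the file names, and `validate_reference_set` are hypothetical and not part of the plugin; only the 2 to 5 limit comes from this card.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical role labels mirroring the list above.
ROLES = {"subject", "object", "scene", "texture", "viewpoint"}
MIN_REFERENCES, MAX_REFERENCES = 2, 5  # range stated in this card


@dataclass(frozen=True)
class Reference:
    path: str
    role: str


def validate_reference_set(references: List[Reference]) -> List[Reference]:
    """Reject sets outside the 2 to 5 range or with unknown roles."""
    count = len(references)
    if not MIN_REFERENCES <= count <= MAX_REFERENCES:
        raise ValueError(
            f"expected {MIN_REFERENCES} to {MAX_REFERENCES} references, got {count}"
        )
    for reference in references:
        if reference.role not in ROLES:
            raise ValueError(f"unknown role {reference.role!r} for {reference.path}")
    return references


# Placeholder file names, used only for illustration.
references = validate_reference_set(
    [
        Reference("hero.jpg", "subject"),
        Reference("umbrella.jpg", "object"),
        Reference("street.png", "scene"),
    ]
)
print(len(references))  # 3
```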


	## What It Can Do

	### Identity Preservation Across References

	Generate videos where multiple reference identities are simultaneously preserved:
	- Multiple characters from different reference images
	- Character + object combinations
	- Object + scene compositions

	### Relation-Based Composition

	Beyond mere identity preservation, the model can compose references based on textual relation descriptions:
	- Action interactions (handing, picking up, pushing)
	- Spatial relationships (left-right, foreground-background)
	- Temporal event structures (start → process → result)

	### Cross-Reference Attribute Selection

	The model learns to selectively retrieve attributes from different references:
	- Face from reference A, clothing from reference B
	- Object identity from one reference, pose/position from another
	- Background elements from scene references

	## Usage Tips (V1 Version)

	- Prompt description: Describe the reference images concisely but accurately. Both over-description and under-description degrade consistency
	- High-motion scenes: 50 fps is recommended for smooth, coherent motion
	- Generation reliability: Accurate results typically take 2-3 sampling runs
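The frame-rate and retry tips can be turned into a small planning helper. This is a sketch under assumptions: `plan_runs` and its dictionary keys are hypothetical, the seeds are simply consecutive integers, and any frame-count constraints the base model may impose are not modeled.

```python
from typing import Dict, List

HIGH_MOTION_FPS = 50  # recommended in the tips above for high-motion scenes


def plan_runs(
    duration_s: float, fps: int, attempts: int = 3, base_seed: int = 0
) -> List[Dict[str, int]]:
    """Return one settings dict per sampling run, each with its own seed."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    num_frames = round(duration_s * fps)
    return [
        {"seed": base_seed + offset, "fps": fps, "num_frames": num_frames}
        for offset in range(attempts)
    ]


# A 4-second high-motion clip, sampled three times with different seeds.
runs = plan_runs(duration_s=4, fps=HIGH_MOTION_FPS, attempts=3, base_seed=100)
for run in runs:
    print(run)
```

Running several seeds up front and keeping the best result matches the observation that a single run is often not enough.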


	## Results Showcase

	### V1 Version

	| Reference Images | Generated Video |
	|:---:|:---:|
	| <img src="validition_v1/01/1.jpg" width="80"> <img src="validition_v1/01/2.jpg" width="80"> <img src="validition_v1/01/bg.png" width="80"> | [▶ Play](validition_v1/01/video.mp4) |
	| <img src="validition_v1/07/1.jpg" width="80"> <img src="validition_v1/07/2.jpg" width="80"> <img src="validition_v1/07/bg.png" width="80"> | [▶ Play](validition_v1/07/video.mp4) |
	| <img src="validition_v1/05/1.png" width="70"> <img src="validition_v1/05/2.png" width="70"> <img src="validition_v1/05/5.png" width="70"> <img src="validition_v1/05/bg.png" width="70"> | [▶ Play](validition_v1/05/video.mp4) |

	---