---
library_name: diffusers
tags:
- ltx
- video-generation
- audio-to-video
- video-conditioning
license: apache-2.0
---

# LTX-2 Audio-to-Video Pipeline with Video Conditioning

A custom diffusers pipeline for LTX-2 that extends audio-to-video generation with **video conditioning** support.

## Features

- Audio-conditioned video generation (lip-sync)
- **Video conditioning** for motion/pose guidance
- Configurable conditioning strength and start frame
- Compatible with LTX-2 LoRAs (face-swap, camera control, etc.)

## Installation

```bash
pip install diffusers transformers torch torchaudio av
```

## Usage

```python
import torch
from diffusers import DiffusionPipeline
from diffusers.utils import load_image

# Load pipeline with custom video conditioning support
pipe = DiffusionPipeline.from_pretrained(
    "Lightricks/LTX-2",
    custom_pipeline="linoyts/ltx2-audio-video-conditioning",
    torch_dtype=torch.bfloat16,
)
pipe.to("cuda")

# Optional: load a LoRA (e.g., face swap)
# pipe.load_lora_weights(
#     "Alissonerdx/BFS-Best-Face-Swap-Video",
#     weight_name="ltx-2/head_swap_v1_13500_first_frame.safetensors",
# )
# pipe.fuse_lora(lora_scale=1.1)

# Load inputs
image = load_image("input_face.png")

# Generate with video conditioning
video, audio = pipe(
    image=image,                       # frame 0 appearance
    video="reference_motion.mp4",      # video for motion conditioning
    video_conditioning_strength=1.0,   # how strongly to follow the motion (0-1)
    video_conditioning_frame_idx=1,    # start video conditioning at frame 1
    audio="audio.wav",                 # audio for lip-sync
    prompt="a person speaking naturally, smooth animation",
    negative_prompt="low quality, blurry, distorted",
    width=512,
    height=768,
    num_frames=121,
    frame_rate=24.0,
    num_inference_steps=40,
    guidance_scale=4.0,
    return_dict=False,
)
```

## Parameters

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `image` | `PIL.Image` | `None` | Input image for frame 0 conditioning |
| `video` | `str`/`List`/`Tensor` | `None` | Reference video for motion conditioning |
| `video_conditioning_strength` | `float` | `1.0` | Strength of video conditioning (0.0-1.0) |
| `video_conditioning_frame_idx` | `int` | `1` | Frame index where video conditioning starts |
| `audio` | `str`/`Tensor` | `None` | Audio input for lip-sync |

### Video Conditioning Frame Index

- `0`: Video conditioning replaces all frames
- `1` (default): Frame 0 = image, frames 1+ = video motion
- `N`: Frames 0 to N-1 = image/noise, frames N+ = video conditioning
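
The mapping above can be sketched as a boolean mask over frames. This is an illustration of the semantics only; `conditioned_frames` is a hypothetical helper, not part of the pipeline API:

```python
def conditioned_frames(num_frames: int, frame_idx: int) -> list[bool]:
    """True where the reference video drives the frame,
    False where the image/noise does (frames before frame_idx)."""
    return [i >= frame_idx for i in range(num_frames)]

# frame_idx=1 (default): frame 0 comes from the image, the rest follow the video
print(conditioned_frames(5, 1))  # [False, True, True, True, True]

# frame_idx=0: the reference video conditions every frame
print(conditioned_frames(5, 0))  # [True, True, True, True, True]
```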

## Distilled Model (8-step)

For faster generation with the distilled model:

```python
pipe = DiffusionPipeline.from_pretrained(
    "rootonchair/LTX-2-19b-distilled",
    custom_pipeline="linoyts/ltx2-audio-video-conditioning",
    torch_dtype=torch.bfloat16,
)
pipe.to("cuda")

DISTILLED_SIGMAS = [1.0, 0.99375, 0.9875, 0.98125, 0.975, 0.909375, 0.725, 0.421875]

video, audio = pipe(
    image=image,
    video="reference.mp4",
    audio="audio.wav",
    prompt="...",
    num_inference_steps=8,
    sigmas=DISTILLED_SIGMAS,
    guidance_scale=1.0,
    return_dict=False,
)
```
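
If you edit the sigma schedule, two properties are worth preserving: the values should be strictly decreasing from 1.0, and the list length should match `num_inference_steps`. A quick sanity check (a sketch, not part of the pipeline):

```python
DISTILLED_SIGMAS = [1.0, 0.99375, 0.9875, 0.98125, 0.975, 0.909375, 0.725, 0.421875]
num_inference_steps = 8

# One sigma per denoising step, strictly decreasing from 1.0
assert len(DISTILLED_SIGMAS) == num_inference_steps
assert DISTILLED_SIGMAS[0] == 1.0
assert all(a > b for a, b in zip(DISTILLED_SIGMAS, DISTILLED_SIGMAS[1:]))
```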

## License

Apache 2.0