---
library_name: diffusers
tags:
- ltx
- video-generation
- audio-to-video
- video-conditioning
license: apache-2.0
---

# LTX-2 Audio-to-Video Pipeline with Video Conditioning

A custom diffusers pipeline for LTX-2 that extends audio-to-video generation with **video conditioning** support.

## Features

- Audio-conditioned video generation (lip-sync)
- **Video conditioning** for motion/pose guidance
- Configurable conditioning strength and start frame
- Compatible with LTX-2 LoRAs (face-swap, camera control, etc.)

## Installation

```bash
pip install diffusers transformers torch torchaudio av
```

## Usage

```python
import torch
from diffusers import DiffusionPipeline
from diffusers.utils import load_image

# Load pipeline with custom video conditioning support
pipe = DiffusionPipeline.from_pretrained(
    "Lightricks/LTX-2",
    custom_pipeline="linoyts/ltx2-audio-video-conditioning",
    torch_dtype=torch.bfloat16,
)
pipe.to("cuda")

# Optional: load a LoRA (e.g., face swap)
# pipe.load_lora_weights(
#     "Alissonerdx/BFS-Best-Face-Swap-Video",
#     weight_name="ltx-2/head_swap_v1_13500_first_frame.safetensors",
# )
# pipe.fuse_lora(lora_scale=1.1)

# Load inputs
image = load_image("input_face.png")

# Generate with video conditioning
video, audio = pipe(
    image=image,                       # frame 0 appearance
    video="reference_motion.mp4",      # video for motion conditioning
    video_conditioning_strength=1.0,   # how strongly to follow the motion (0-1)
    video_conditioning_frame_idx=1,    # start video conditioning at frame 1
    audio="audio.wav",                 # audio for lip-sync
    prompt="a person speaking naturally, smooth animation",
    negative_prompt="low quality, blurry, distorted",
    width=512,
    height=768,
    num_frames=121,
    frame_rate=24.0,
    num_inference_steps=40,
    guidance_scale=4.0,
    return_dict=False,
)
```

## Parameters

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `image` | `PIL.Image` | `None` | Input image for frame 0 conditioning |
| `video` | `str`/`List`/`Tensor` | `None` | Reference video for motion conditioning |
| `video_conditioning_strength` | `float` | `1.0` | Strength of video conditioning (0.0-1.0) |
| `video_conditioning_frame_idx` | `int` | `1` | Frame index where video conditioning starts |
| `audio` | `str`/`Tensor` | `None` | Audio input for lip-sync |

### Video Conditioning Frame Index

- `0`: Video conditioning replaces all frames
- `1` (default): Frame 0 = image, frames 1+ = video motion
- `N`: Frames 0 to N-1 = image/noise, frames N+ = video conditioning
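
The mapping above can be sketched as a boolean mask over frames. This is an illustration of the semantics only; `conditioned_frames` is a hypothetical helper, not part of the pipeline API:

```python
def conditioned_frames(num_frames: int, frame_idx: int) -> list[bool]:
    """True where the reference video drives the frame,
    False where the image/noise does (frames before frame_idx)."""
    return [i >= frame_idx for i in range(num_frames)]

# frame_idx=1 (default): frame 0 comes from the image, the rest follow the video
print(conditioned_frames(5, 1))  # [False, True, True, True, True]

# frame_idx=0: the reference video conditions every frame
print(conditioned_frames(5, 0))  # [True, True, True, True, True]
```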

## Distilled Model (8-step)

For faster generation with the distilled model:

```python
pipe = DiffusionPipeline.from_pretrained(
    "rootonchair/LTX-2-19b-distilled",
    custom_pipeline="linoyts/ltx2-audio-video-conditioning",
    torch_dtype=torch.bfloat16,
)
pipe.to("cuda")

DISTILLED_SIGMAS = [1.0, 0.99375, 0.9875, 0.98125, 0.975, 0.909375, 0.725, 0.421875]

video, audio = pipe(
    image=image,
    video="reference.mp4",
    audio="audio.wav",
    prompt="...",
    num_inference_steps=8,
    sigmas=DISTILLED_SIGMAS,
    guidance_scale=1.0,
    return_dict=False,
)
```
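
If you edit the sigma schedule, two properties are worth preserving: the values should be strictly decreasing from 1.0, and the list length should match `num_inference_steps`. A quick sanity check (a sketch, not part of the pipeline):

```python
DISTILLED_SIGMAS = [1.0, 0.99375, 0.9875, 0.98125, 0.975, 0.909375, 0.725, 0.421875]
num_inference_steps = 8

# One sigma per denoising step, strictly decreasing from 1.0
assert len(DISTILLED_SIGMAS) == num_inference_steps
assert DISTILLED_SIGMAS[0] == 1.0
assert all(a > b for a, b in zip(DISTILLED_SIGMAS, DISTILLED_SIGMAS[1:]))
```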

## License

Apache 2.0