---
license: apache-2.0
tags:
- comfyui
- ltx-video
- ltx-2.3
- foley
- video-to-audio
- audio-generation
- workflow
- foley-lora
---
# LTX 2.3 Foley V2A ComfyUI Workflow
This repository contains ready-to-test ComfyUI workflows for the
[`FuzzPuppy/LTX-2.3-Foley-LoRA`](https://huggingface.co/FuzzPuppy/LTX-2.3-Foley-LoRA)
LoRA. The LoRA adds Foley sound effects to a silent input video using LTX-2.3:
given a video and a prompt describing the visible action, the loop workflow
generates matching non-speech, non-music sound effects and saves a new MP4.
There are two workflows provided:
1. `foley-sliding-window.json`: long-video workflow with overlapping audio windows and stitching.
2. `ltx_23_foley_v2a.json`: original short-clip workflow.
If you want to run a quick short test, use `ltx_23_foley_v2a.json`. Otherwise, use `foley-sliding-window.json` so you can generate longer audio while keeping memory under control.
## Tutorial
[![Watch the tutorial: using the LTX-2.3 Foley LoRA in ComfyUI](https://img.youtube.com/vi/qnHFDlrySmw/hqdefault.jpg)](https://youtu.be/qnHFDlrySmw)
[Watch the tutorial on YouTube](https://youtu.be/qnHFDlrySmw)
## What Is Included
- `foley-sliding-window.json`: long-video workflow with overlapping audio windows and stitching.
- `ltx_23_foley_v2a.json`: original short-clip workflow.
- `setup_runpod_ltx_foley.sh`: one-command RunPod setup script.
- `ltx_foley_v2a`: small helper-node package.
- `tennis-no-sound.mp4`: default silent test video for RunPod setup.
Both workflows require the `ltx_foley_v2a` helper-node package. If ComfyUI shows
missing nodes named `LTXFoleyForLoopOpen`, `LTXFoleyWindowSelect`,
`LTXFoleyVideoToAudioLatent`, or `LTXFoleyAudioVAEDecode`, the workflow JSON was
loaded before these helper nodes were installed into `ComfyUI/custom_nodes`.
The helper-node package handles the workflow-specific pieces that stock ComfyUI
does not currently cover cleanly:
- plans the window count from the uploaded video
- provides a small local ComfyUI for-loop so no external loop-node pack is needed
- splits longer videos into overlapping windows
- freezes each source window as LTX video latents while leaving matching audio
latents empty for Foley generation
- decodes each audio window into the Comfy audio tensor layout expected by
current video saving nodes
- writes each raw decoded window as a WAV before stitching so artifacts can be
checked before the final crossfade
- crossfades and stitches generated audio windows into one final track
Prompt text, model loading, LoRA loading, video creation, and MP4 saving use
normal ComfyUI/LTXVideo nodes.
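The crossfade-and-stitch step can be pictured with a minimal sketch. This is not the helper package's actual code: the function name `crossfade_stitch`, the linear fade shape, and the use of plain Python lists of mono samples are assumptions made only for illustration.

```python
def crossfade_stitch(a, b, overlap):
    """Join two mono sample lists whose last/first `overlap` samples cover the same time span.

    Hypothetical illustration: `a` fades out linearly while `b` fades in.
    """
    if overlap < 0 or overlap > min(len(a), len(b)):
        raise ValueError("overlap must fit inside both windows")
    if overlap == 0:
        return list(a) + list(b)

    head = list(a[:-overlap])   # part of `a` before the shared region
    tail = list(b[overlap:])    # part of `b` after the shared region
    mixed = []
    for i in range(overlap):
        w = (i + 1) / (overlap + 1)  # weight of `b` rises across the overlap
        mixed.append((1.0 - w) * a[len(a) - overlap + i] + w * b[i])
    return head + mixed + tail
```

Two windows of length `n` with an overlap of `k` samples produce `2n - k` output samples, which is why a shorter overlap also means a shorter crossfade.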
## Fastest RunPod Test
Use the official RunPod **ComfyUI - CUDA 12.8** template:
https://console.runpod.io/deploy?template=cw3nka7d08&ref=k7b1cgii
1. In RunPod, under "Additional Filters", set the CUDA version filter to CUDA 12.8.
2. Select a 48 GB GPU: A40, RTX A6000, L40/L40S, or A100.
3. Make sure the `ComfyUI - CUDA 12.8` template is selected.
4. The template's default volume disk is `50 GB`, which is enough for the core workflow files, but tight once caches and reruns accumulate. Change the volume disk to `100 GB` if you want more breathing room.
5. Start the pod and open a terminal.
6. Run:
```bash
cd /workspace
curl -L https://huggingface.co/FuzzPuppy/LTX-2.3-Foley-Workflow/resolve/main/setup_runpod_ltx_foley.sh -o setup_runpod_ltx_foley.sh
bash setup_runpod_ltx_foley.sh
```
The setup script installs the nodes and models, downloads the tennis test video as `input.mp4`, restarts ComfyUI without stopping the pod (with `--cache-classic`, see the Manual ComfyUI Install notes), and waits until the UI responds on port `8188`.
By default the script installs ComfyUI `v0.27.0`. To test another ComfyUI release, set `COMFYUI_CORE_REF`:
```bash
COMFYUI_CORE_REF=v0.19.0 bash setup_runpod_ltx_foley.sh
```
To install workflow files from a different Hugging Face branch, set
`WORKFLOW_REVISION`:
```bash
WORKFLOW_REVISION=windows bash setup_runpod_ltx_foley.sh
```
After the script finishes:
1. Open ComfyUI from the RunPod web UI.
2. Under workflows, select `foley-sliding-window.json`.
3. Hit `Run`.
The default input video and prompt are already set:
```text
Two men are playing tennis. No speech is present. No music is present.
```
## What The Script Installs
The script assumes the official CUDA 12.8 template layout from
`runpod-workers/comfyui-base`:
- ComfyUI: `/workspace/runpod-slim/ComfyUI`
- Python environment: `/workspace/runpod-slim/ComfyUI/.venv-cu128`
- ComfyUI port: `8188`
It installs or refreshes:
- `Lightricks/ComfyUI-LTXVideo`
- `ltx_foley_v2a` helper nodes
- `foley-sliding-window.json`
- `ltx_23_foley_v2a.json`
- `tennis-no-sound.mp4`
The script also applies a small compatibility patch to the installed
`ComfyUI-LTXVideo/pyramid_blending.py` file so current Kornia builds can import
the node pack on fresh ComfyUI installs.
It downloads these model files:
- Base checkpoint:
[`Lightricks/LTX-2.3-fp8/ltx-2.3-22b-dev-fp8.safetensors`](https://huggingface.co/Lightricks/LTX-2.3-fp8/blob/main/ltx-2.3-22b-dev-fp8.safetensors)
- Text encoder:
[`Comfy-Org/ltx-2/split_files/text_encoders/gemma_3_12B_it_fp8_scaled.safetensors`](https://huggingface.co/Comfy-Org/ltx-2/blob/main/split_files/text_encoders/gemma_3_12B_it_fp8_scaled.safetensors)
- Foley LoRA:
[`FuzzPuppy/LTX-2.3-Foley-LoRA/ltx-2.3-foley-400-steps.safetensors`](https://huggingface.co/FuzzPuppy/LTX-2.3-Foley-LoRA/blob/main/ltx-2.3-foley-400-steps.safetensors)
Large model downloads are SHA-256 verified. Completed files are skipped on
rerun, interrupted downloads resume from `*.part` files, and corrupt partials
are retried once from scratch.
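If you download the model files by hand and want a similar integrity check, a minimal sketch looks like the following. It is not the setup script's implementation; the helper names `sha256_of` and `is_verified` are hypothetical, and you must supply the expected hash yourself (for example from the file's Hugging Face page).

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Return the hex SHA-256 of a file, reading in chunks so large models fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def is_verified(path, expected_hex):
    """True only if the file exists and its SHA-256 matches `expected_hex` (case-insensitive)."""
    p = Path(path)
    return p.is_file() and sha256_of(p) == expected_hex.lower()
```

A file that fails this check should be treated as incomplete and downloaded again.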
## Manual ComfyUI Install
If you are not using the RunPod script:
1. Install or update ComfyUI.
2. Install the official LTXVideo custom nodes:
`https://github.com/Lightricks/ComfyUI-LTXVideo`
3. Install the Foley helper nodes by placing the workflow repo's
`ltx_foley_v2a` folder into:
`ComfyUI/custom_nodes/`
4. Copy either `foley-sliding-window.json` or `ltx_23_foley_v2a.json` into your ComfyUI user workflows folder. In a standard ComfyUI install this is:
`ComfyUI/user/default/workflows`.
5. Put the model files in:
- checkpoint:
[`ltx-2.3-22b-dev-fp8.safetensors`](https://huggingface.co/Lightricks/LTX-2.3-fp8/blob/main/ltx-2.3-22b-dev-fp8.safetensors)
in `ComfyUI/models/checkpoints`
- text encoder:
[`gemma_3_12B_it_fp8_scaled.safetensors`](https://huggingface.co/Comfy-Org/ltx-2/blob/main/split_files/text_encoders/gemma_3_12B_it_fp8_scaled.safetensors)
in `ComfyUI/models/text_encoders`
- Foley LoRA:
[`ltx-2.3-foley-400-steps.safetensors`](https://huggingface.co/FuzzPuppy/LTX-2.3-Foley-LoRA/blob/main/ltx-2.3-foley-400-steps.safetensors)
in `ComfyUI/models/loras`
6. Restart ComfyUI, starting it with the `--cache-classic` flag:
```bash
python main.py --cache-classic
```
On newer ComfyUI versions (`v0.27.0`+) the default caching mode is RAM-pressure
caching, which can evict node outputs in the middle of a run while the large
LTX models load. For `foley-sliding-window.json` that forces the window plan,
video decode, and model loaders to re-execute between windows, making long
runs much slower. `--cache-classic` keeps those outputs cached for the whole
run. The flag also exists on older releases such as `v0.19.0`, where it is
harmless.
7. Under workflows, select `foley-sliding-window.json` or `ltx_23_foley_v2a.json`.
8. Hit `Run`.
## Workflow Defaults
- Input video: `input.mp4`
- Prompt: `Two men are playing tennis. No speech is present. No music is present.`
- Negative prompt: anti-music/anti-vocal prompt
- Conditioning size: `576x576`
- Frame window: `89` frames
- Window overlap: `1.0` second
- Maximum windows: `16`
- Random ID: `42`
- Sampling steps: `30`
- Guidance: `4.0`
- Save window audio: `true`
- Window audio prefix: `ltx_foley_window`
- LoRA strength: `1.0`
Advanced sampler/STG settings are visible nodes in the loop body:
sampler `euler_ancestral_cfg_pp`, STG scale `1.0`, rescale `0.7`, STG blocks
`14, 19`, max shift `2.05`, base shift `0.95`, terminal `0.1`.
The `foley-sliding-window.json` workflow uses the full uploaded video. Videos longer than the selected
window are processed as overlapping windows and stitched into one generated
audio track. Shorter videos are padded internally by repeating the last frame.
The saved MP4 uses the source frames plus the stitched generated audio.
Raw generated window WAVs are saved under ComfyUI's output directory in
`ltx_foley_windows/` and their paths are listed in the manifest output.
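To get a feel for how many windows a given video needs, here is a rough sketch of one way a window plan could be computed. The helper nodes may plan windows differently: the function name `plan_windows`, the fixed stride of `window_frames - overlap_frames`, and the choice to let the last window run past the end of the video (to be padded) are all assumptions.

```python
import math


def plan_windows(total_frames, fps, window_frames=89, overlap_seconds=1.0, max_windows=16):
    """Return hypothetical window start frames for a video of `total_frames` frames."""
    if total_frames <= 0 or fps <= 0:
        raise ValueError("total_frames and fps must be positive")
    overlap_frames = int(round(overlap_seconds * fps))
    stride = window_frames - overlap_frames
    if stride <= 0:
        raise ValueError("overlap must be shorter than the window")
    if total_frames <= window_frames:
        return [0]  # a single window; shorter clips would be padded
    count = math.ceil((total_frames - window_frames) / stride) + 1
    if count > max_windows:
        raise ValueError(f"{count} windows needed, but max_windows is {max_windows}")
    return [i * stride for i in range(count)]
```

Under these assumptions, a 240-frame clip at 24 fps with the default settings gives starts `[0, 65, 130, 195]`, so four windows are sampled.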
## VRAM Notes
Sampling is the VRAM peak.
If you need to reduce memory use, try these changes in order:
- reduce frames from `89` to `57`, `41`, or `25`
- reduce conditioning size from `576x576` to `448x448` or `384x384`
- reduce sampling steps from `30` to `20`
Frame counts should stay one more than a multiple of 8:
```text
9, 17, 25, 33, 41, 49, 57, ..., 89, ..., 257
```
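The rule above is `frames = 8 * k + 1`. A small sketch for checking a value, or snapping an arbitrary number down to the nearest valid count, is shown here. The function names are hypothetical, and the lower bound of `9` is simply the first value in the list above, not a confirmed model limit.

```python
def is_valid_frame_count(n):
    """True if n is one more than a multiple of 8 and at least 9."""
    return n >= 9 and (n - 1) % 8 == 0


def snap_down_frame_count(n, minimum=9):
    """Return the largest valid frame count that does not exceed n."""
    if n < minimum:
        raise ValueError(f"frame count must be at least {minimum}")
    return 8 * ((n - 1) // 8) + 1
```

For example, a request for 60 frames would snap down to 57, which is one of the suggested lower-memory settings.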
For `foley-sliding-window.json`, the default `max_windows` is `16` so accidental very long inputs
fail clearly instead of running for hours. Increase it only when you expect the
extra runtime.
## Troubleshooting
### Missing Nodes
If ComfyUI reports missing `LTXFoley...` nodes after manual setup, verify that
these files exist and then restart ComfyUI:
```text
ComfyUI/custom_nodes/ltx_foley_v2a/__init__.py
ComfyUI/custom_nodes/ltx_foley_v2a/nodes.py
```
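The same check can be scripted. This sketch only tests that the two files listed above exist; the function name `missing_helper_files` is hypothetical, and the ComfyUI root path is whatever your install uses.

```python
from pathlib import Path

# Files named in the troubleshooting step above.
REQUIRED_FILES = ("__init__.py", "nodes.py")


def missing_helper_files(comfyui_root):
    """Return the expected helper-node files that are absent under `comfyui_root`."""
    package_dir = Path(comfyui_root) / "custom_nodes" / "ltx_foley_v2a"
    return [
        str(package_dir / name)
        for name in REQUIRED_FILES
        if not (package_dir / name).is_file()
    ]
```

An empty list means both files are in place; restart ComfyUI afterwards so the nodes are registered.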
### Models Reload Or Nodes Re-Execute Between Windows
If the log shows `planned N windows` repeating, or the checkpoint/text-encoder
reloading before every window of `foley-sliding-window.json`, ComfyUI is running
with its default RAM-pressure caching and is evicting node outputs mid-run.
Start ComfyUI with `--cache-classic` (the RunPod script already does this). The
generated audio is still correct either way; the re-execution only costs time.
### Duplicate Sounds At Window Boundaries
In `foley-sliding-window.json`, neighboring windows overlap (default `1.0`
second) and each window generates its audio independently. If a distinct sound
event (a door close, a footstep) falls inside an overlap region, both windows
may render it slightly out of alignment, and you can hear the event twice
around a window boundary. The run log's `planned N windows starts=[...]` line
shows where the boundaries are (`start_frame / fps` seconds).
If you hear this, reduce the **Window overlap** (`overlap_seconds` on the
window-plan node), for example from `1.0` to `0.5`. A smaller overlap makes it
less likely an event lands in the shared region, at the cost of a shorter
crossfade between windows. Avoid large overlaps: the bigger the overlap, the
more of the video is generated twice, which increases the chance of doubled
sounds.
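To locate where doubled sounds can occur, you can convert the logged start frames into overlap regions in seconds. This is a sketch under two assumptions: the values in `starts=[...]` are frame indices, and every window spans the same `window_frames` frames (`89` by default). The function name `overlap_regions_seconds` is hypothetical.

```python
def overlap_regions_seconds(starts, fps, window_frames=89):
    """Return (begin, end) times in seconds where consecutive windows overlap."""
    regions = []
    for prev, nxt in zip(starts, starts[1:]):
        begin = nxt / fps                    # next window starts here
        end = (prev + window_frames) / fps   # previous window ends here
        if end > begin:
            regions.append((begin, end))
    return regions
```

With starts `[0, 65, 130]` at 24 fps, the regions are roughly 2.71 s to 3.71 s and 5.42 s to 6.42 s. A doubled sound heard inside one of those ranges points to the overlap as the cause.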
### Audio Artifacts On Some ComfyUI Versions
The workflows have been tested on ComfyUI `v0.27.0` and run
successfully there. However, on `v0.27.0` and newer ComfyUI versions, we have noticed that LTX-2.3 video-to-audio can produce a high-pitched squeak or other artifacts in some generated audio.
If you notice these artifacts in a generation, roll back to ComfyUI `v0.19.0`.
If you are using the RunPod setup, you can roll back by running:
```bash
cd /workspace
COMFYUI_CORE_REF=v0.19.0 bash setup_runpod_ltx_foley.sh
```
Then reload `foley-sliding-window.json` and run it again.
### RunPod Setup
#### Rerunning Setup
To rerun setup after a workflow or node update:
```bash
cd /workspace && bash setup_runpod_ltx_foley.sh
```
The script will skip verified model files, refresh the workflow/helper nodes,
and restart ComfyUI.
#### Model Downloads
If model downloads fail with authorization errors, accept the relevant Hugging
Face model terms and rerun with `HF_TOKEN` set.
#### Logs
Logs from the script-managed ComfyUI restart are written to:
```text
/workspace/runpod-slim/comfyui-restart.log
```
## License Scope
The files in this workflow repository are released under the Apache-2.0 license.
That applies to the workflow JSON, setup script, helper-node code, README/model
card text, and bundled test assets in this repository.
This workflow downloads and uses third-party model files that are governed by
their own licenses and terms, including LTX-2.3, the Gemma text encoder, and the
`FuzzPuppy/LTX-2.3-Foley-LoRA` weights.