---
library_name: pytorch
license: mit
pipeline_tag: text-to-audio
---

# Foley-Omni

[GitHub Code](https://github.com/NJU-Speech/Foley-Omni) | [arXiv](https://arxiv.org/abs/2606.03672) | [Demo](https://ty0402.github.io/Foley-omni-Web/)

## Overview

This repository contains the public inference checkpoints for **Foley-Omni**.
The release focuses on **Video-to-Soundtrack (V2ST)** generation, where the model jointly generates synchronized **speech**, **sound effects**, and **music** from a video and an optional text prompt.

## Model Size

5.5B parameters

## Repository Contents

```text
ckpts/
├── Foley-Omni/
│   └── v2st.pth
├── Wan2.2-TI2V-5B/
│   ├── models_t5_umt5-xxl-enc-bf16.pth
│   └── google/
│       └── umt5-xxl/
│           ├── special_tokens_map.json
│           ├── spiece.model
│           ├── tokenizer.json
│           └── tokenizer_config.json
└── mmaudio/
    └── ext_weights/
        ├── v1-16.pth
        ├── best_netG.pt
        └── synchformer_state_dict.pth
```

What each part is used for:

- `ckpts/Foley-Omni/v2st.pth`: released inference-only Foley-Omni weights
- `ckpts/Wan2.2-TI2V-5B/*`: text encoder and tokenizer for text conditioning
- `ckpts/mmaudio/ext_weights/v1-16.pth`: audio VAE for the 16 kHz inference path
- `ckpts/mmaudio/ext_weights/best_netG.pt`: vocoder for waveform decoding
- `ckpts/mmaudio/ext_weights/synchformer_state_dict.pth`: Synchformer weights for online visual (Sync) feature extraction

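Before running inference, it can help to confirm that every file in the layout above is in place. The following is a minimal sketch using only the Python standard library. The helper name `find_missing_checkpoints` and the default location `ckpts/` relative to the working directory are illustrative assumptions, not part of the Foley-Omni code.

```python
from pathlib import Path

# Relative paths taken from the repository layout listed above.
EXPECTED_FILES = [
    "Foley-Omni/v2st.pth",
    "Wan2.2-TI2V-5B/models_t5_umt5-xxl-enc-bf16.pth",
    "Wan2.2-TI2V-5B/google/umt5-xxl/special_tokens_map.json",
    "Wan2.2-TI2V-5B/google/umt5-xxl/spiece.model",
    "Wan2.2-TI2V-5B/google/umt5-xxl/tokenizer.json",
    "Wan2.2-TI2V-5B/google/umt5-xxl/tokenizer_config.json",
    "mmaudio/ext_weights/v1-16.pth",
    "mmaudio/ext_weights/best_netG.pt",
    "mmaudio/ext_weights/synchformer_state_dict.pth",
]


def find_missing_checkpoints(ckpt_root="ckpts"):
    """Hypothetical helper: list expected files that are absent under ckpt_root."""
    root = Path(ckpt_root)
    return [rel for rel in EXPECTED_FILES if not (root / rel).is_file()]


if __name__ == "__main__":
    missing = find_missing_checkpoints()
    if missing:
        print("Missing checkpoint files:")
        for rel in missing:
            print(f"  ckpts/{rel}")
    else:
        print("All expected checkpoint files are present.")
```

An empty result means the directory matches the listing. It does not validate file contents or checksums.
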
## Online Feature Extraction

This release supports two V2ST inference modes:

- direct inference with pre-extracted features, supplied through `clip_feature_path` and `sync_feature_path`
- inference without pre-extracted features, using online visual feature extraction

Notes:

- `synchformer_state_dict.pth` is included in this repository because it is required for online Sync feature extraction.
- The CLIP image encoder is loaded by `open_clip` from `apple/DFN5B-CLIP-ViT-H-14-384` on first use. The current code path does not use a separate local CLIP checkpoint file.

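The two inference modes described above differ only in where the visual features come from. As a rough sketch of that decision, the hypothetical helper below (not part of the Foley-Omni code) picks a mode from the two path arguments. Treating a half-specified pair as an error is an assumption made for illustration; the actual code may handle that case differently.

```python
from typing import Optional


def select_feature_mode(
    clip_feature_path: Optional[str] = None,
    sync_feature_path: Optional[str] = None,
) -> str:
    """Hypothetical helper: choose between pre-extracted and online features."""
    if clip_feature_path and sync_feature_path:
        # Both feature files are supplied, so features are read from disk.
        return "pre-extracted"
    if clip_feature_path or sync_feature_path:
        # Assumption for this sketch: a partial pair is rejected outright.
        raise ValueError(
            "Provide both clip_feature_path and sync_feature_path, or neither."
        )
    # Neither path is supplied: features are extracted online, which needs
    # synchformer_state_dict.pth and the CLIP encoder loaded via open_clip.
    return "online"
```

For example, calling `select_feature_mode()` with no arguments selects the online path, which is the case that needs network access on first use so that `open_clip` can fetch the CLIP encoder.
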
## Source Attribution

This repository redistributes a small subset of files from the following upstream releases for convenience:

- **Wan2.2-TI2V-5B**: text encoder and tokenizer files
- **MMAudio**: audio VAE, vocoder, and Synchformer files

Please refer to the original upstream repositories for their licenses, usage terms, and project details.