---
library_name: pytorch
license: mit
pipeline_tag: text-to-audio
---
# Foley-Omni
[GitHub Code](https://github.com/NJU-Speech/Foley-Omni) | [arXiv](https://arxiv.org/abs/2606.03672) | [Demo](https://ty0402.github.io/Foley-omni-Web/)
## Overview
This repository packages the public inference checkpoint set for **Foley-Omni**.
The release focuses on **Video-to-Soundtrack (V2ST)** generation, where the model jointly generates synchronized **speech**, **sound effects**, and **music** from a video and an optional text prompt.
## Model Size
5.5B parameters
## Repository Contents
```text
ckpts/
├── Foley-Omni/
│   └── v2st.pth
├── Wan2.2-TI2V-5B/
│   ├── models_t5_umt5-xxl-enc-bf16.pth
│   └── google/
│       └── umt5-xxl/
│           ├── special_tokens_map.json
│           ├── spiece.model
│           ├── tokenizer.json
│           └── tokenizer_config.json
└── mmaudio/
    └── ext_weights/
        ├── v1-16.pth
        ├── best_netG.pt
        └── synchformer_state_dict.pth
```
What each part is used for:
- `ckpts/Foley-Omni/v2st.pth`: released inference-only Foley-Omni weights
- `ckpts/Wan2.2-TI2V-5B/*`: text encoder and tokenizer for text conditioning
- `ckpts/mmaudio/ext_weights/v1-16.pth`: audio VAE for the 16 kHz inference path
- `ckpts/mmaudio/ext_weights/best_netG.pt`: vocoder for waveform decoding
- `ckpts/mmaudio/ext_weights/synchformer_state_dict.pth`: Synchformer weights for online visual (Sync) feature extraction
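
Before running inference, it can help to confirm that the download is complete. The following is a minimal sketch that checks the layout listed above. The helper name `find_missing_checkpoints` and the default `ckpts` root are assumptions made for illustration; they are not part of the Foley-Omni code.

```python
from pathlib import Path

# Relative paths copied from the repository layout above.
EXPECTED_FILES = [
    "Foley-Omni/v2st.pth",
    "Wan2.2-TI2V-5B/models_t5_umt5-xxl-enc-bf16.pth",
    "Wan2.2-TI2V-5B/google/umt5-xxl/special_tokens_map.json",
    "Wan2.2-TI2V-5B/google/umt5-xxl/spiece.model",
    "Wan2.2-TI2V-5B/google/umt5-xxl/tokenizer.json",
    "Wan2.2-TI2V-5B/google/umt5-xxl/tokenizer_config.json",
    "mmaudio/ext_weights/v1-16.pth",
    "mmaudio/ext_weights/best_netG.pt",
    "mmaudio/ext_weights/synchformer_state_dict.pth",
]


def find_missing_checkpoints(ckpt_root="ckpts"):
    """Return the expected files that are absent under ckpt_root.

    Hypothetical helper: it only checks that each file exists,
    not that its contents are valid.
    """
    root = Path(ckpt_root)
    return [rel for rel in EXPECTED_FILES if not (root / rel).is_file()]


if __name__ == "__main__":
    missing = find_missing_checkpoints()
    if missing:
        print("Missing files:")
        for rel in missing:
            print(f"  ckpts/{rel}")
    else:
        print("All expected checkpoint files are present.")
```

An empty result only means the files are in place; it says nothing about whether a download was truncated or corrupted.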
## Online Feature Extraction
This release supports both:
- direct V2ST inference with pre-extracted `clip_feature_path` and `sync_feature_path`
- V2ST inference without pre-extracted features, using online visual feature extraction
Notes:
- `synchformer_state_dict.pth` is included in this repository because it is required for online Sync feature extraction.
- The CLIP image encoder is loaded by `open_clip` from `apple/DFN5B-CLIP-ViT-H-14-384` on first use. The current code path does not use a separate local CLIP checkpoint file.
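
The choice between the two modes can be sketched as a small dispatch on whether feature paths were supplied. This is an illustration only: `select_feature_mode` is a hypothetical helper, and the rule that both paths must be given together is an assumption rather than documented Foley-Omni behavior. Only the argument names `clip_feature_path` and `sync_feature_path` come from this README.

```python
from pathlib import Path


def select_feature_mode(clip_feature_path=None, sync_feature_path=None):
    """Decide which V2ST inference mode applies (hypothetical helper).

    Returns "pre-extracted" when both feature files are supplied and exist,
    and "online" when neither is supplied.
    """
    supplied = [p for p in (clip_feature_path, sync_feature_path) if p is not None]
    if not supplied:
        # No pre-extracted features: visual features are computed online.
        return "online"
    if len(supplied) == 1:
        # Assumption: the two feature files are only meaningful as a pair.
        raise ValueError("Provide both clip_feature_path and sync_feature_path, or neither.")
    for path in supplied:
        if not Path(path).is_file():
            raise FileNotFoundError(f"Feature file not found: {path}")
    return "pre-extracted"
```

In online mode, the notes above still apply: `synchformer_state_dict.pth` must be present locally, and the CLIP encoder is fetched by `open_clip` on first use.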
## Source Attribution
This repository redistributes a small subset of files from the following upstream releases for convenience:
- **Wan2.2-TI2V-5B**: text encoder and tokenizer files
- **MMAudio**: audio VAE, vocoder, and Synchformer files
Please refer to the original upstream repositories for their licenses, usage terms, and project details.