metadata
library_name: pytorch
license: mit
pipeline_tag: text-to-audio
Foley-Omni
GitHub Code | arXiv | Demo
Overview
This repository packages the public inference checkpoints for Foley-Omni. The release focuses on Video-to-Soundtrack (V2ST) generation, where the model jointly generates synchronized speech, sound effects, and music from a video and an optional text prompt.
Model Size
5.5B
Repository Contents
ckpts/
├── Foley-Omni/
│   └── v2st.pth
├── Wan2.2-TI2V-5B/
│   ├── models_t5_umt5-xxl-enc-bf16.pth
│   └── google/
│       └── umt5-xxl/
│           ├── special_tokens_map.json
│           ├── spiece.model
│           ├── tokenizer.json
│           └── tokenizer_config.json
└── mmaudio/
    └── ext_weights/
        ├── v1-16.pth
        ├── best_netG.pt
        └── synchformer_state_dict.pth
What each part is used for:
- ckpts/Foley-Omni/v2st.pth: released inference-only Foley-Omni weights
- ckpts/Wan2.2-TI2V-5B/*: text encoder and tokenizer for text conditioning
- ckpts/mmaudio/ext_weights/v1-16.pth: audio VAE for the 16 kHz inference path
- ckpts/mmaudio/ext_weights/best_netG.pt: vocoder for waveform decoding
- ckpts/mmaudio/ext_weights/synchformer_state_dict.pth: online visual feature extraction
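As a quick sanity check after downloading, the layout above can be verified with a short script. This is a minimal sketch, assuming the files sit under a local ckpts/ directory exactly as listed; the helper name missing_checkpoints is hypothetical and not part of the Foley-Omni code.

```python
from pathlib import Path

# Relative paths taken from the repository layout listed above.
EXPECTED_FILES = [
    "Foley-Omni/v2st.pth",
    "Wan2.2-TI2V-5B/models_t5_umt5-xxl-enc-bf16.pth",
    "Wan2.2-TI2V-5B/google/umt5-xxl/special_tokens_map.json",
    "Wan2.2-TI2V-5B/google/umt5-xxl/spiece.model",
    "Wan2.2-TI2V-5B/google/umt5-xxl/tokenizer.json",
    "Wan2.2-TI2V-5B/google/umt5-xxl/tokenizer_config.json",
    "mmaudio/ext_weights/v1-16.pth",
    "mmaudio/ext_weights/best_netG.pt",
    "mmaudio/ext_weights/synchformer_state_dict.pth",
]


def missing_checkpoints(root="ckpts"):
    """Return the expected files that are absent under `root` (hypothetical helper)."""
    root = Path(root)
    return [rel for rel in EXPECTED_FILES if not (root / rel).is_file()]


if __name__ == "__main__":
    missing = missing_checkpoints()
    if missing:
        print("Missing files:")
        for rel in missing:
            print("  " + rel)
    else:
        print("All expected checkpoint files are present.")
```

The check only confirms that each file exists; it does not validate file contents or checksums.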
Online Feature Extraction
This release supports both:
- direct V2ST inference with pre-extracted clip_feature_path and sync_feature_path
- V2ST inference without pre-extracted features, using online visual feature extraction
Notes:
- synchformer_state_dict.pth is included in this repository because it is required for online Sync feature extraction.
- The CLIP image encoder is loaded by open_clip from apple/DFN5B-CLIP-ViT-H-14-384 on first use. The current code path does not use a separate local CLIP checkpoint file.
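The choice between the two modes, and the CLIP loading described in the notes, might look roughly like the sketch below. It is an illustration under assumptions, not the repository's actual inference code: select_feature_mode and load_clip_image_encoder are hypothetical helper names, and the rule that both feature paths must be supplied together is an assumption. The hf-hub: prefix follows open_clip's convention for models hosted on the Hugging Face Hub.

```python
def select_feature_mode(clip_feature_path=None, sync_feature_path=None):
    """Return "precomputed" when both feature paths are given, else "online".

    Hypothetical helper; assumes the two paths must be provided together.
    """
    if clip_feature_path and sync_feature_path:
        return "precomputed"
    if clip_feature_path or sync_feature_path:
        raise ValueError(
            "Provide both clip_feature_path and sync_feature_path, or neither."
        )
    return "online"


def load_clip_image_encoder():
    """Load the CLIP model through open_clip (weights are fetched on first use)."""
    # Imported lazily so the precomputed path does not need open_clip installed.
    import open_clip

    model, preprocess = open_clip.create_model_from_pretrained(
        "hf-hub:apple/DFN5B-CLIP-ViT-H-14-384"
    )
    return model.eval(), preprocess
```

Calling select_feature_mode() with no arguments selects the online path, in which case load_clip_image_encoder() would be needed and requires network access the first time it runs.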
Source Attribution
This repository redistributes a small subset of files from the following upstream releases for convenience:
- Wan2.2-TI2V-5B: text encoder and tokenizer files
- MMAudio: audio VAE, vocoder, and Synchformer files
Please refer to the original upstream repositories for their licenses, usage terms, and project details.