	---
	library_name: pytorch
	license: mit
	pipeline_tag: text-to-audio
	---

	# Foley-Omni


	[GitHub Code](https://github.com/NJU-Speech/Foley-Omni) \| [arXiv](https://arxiv.org/abs/2606.03672) \| [Demo](https://ty0402.github.io/Foley-omni-Web/)

	## Overview

	This repository packages the public inference checkpoint set for Foley-Omni.
	The release focuses on Video-to-Soundtrack (V2ST) generation, in which the model jointly generates synchronized speech, sound effects, and music from a video and an optional text prompt.

	## Model Size

	5.5B parameters

	## Repository Contents

	```text
	ckpts/
	├── Foley-Omni/
	│   └── v2st.pth
	├── Wan2.2-TI2V-5B/
	│   ├── models_t5_umt5-xxl-enc-bf16.pth
	│   └── google/
	│       └── umt5-xxl/
	│           ├── special_tokens_map.json
	│           ├── spiece.model
	│           ├── tokenizer.json
	│           └── tokenizer_config.json
	└── mmaudio/
	    └── ext_weights/
	        ├── v1-16.pth
	        ├── best_netG.pt
	        └── synchformer_state_dict.pth
	```

	What each part is used for:

	- `ckpts/Foley-Omni/v2st.pth`: released inference-only Foley-Omni weights
	- `ckpts/Wan2.2-TI2V-5B/*`: text encoder and tokenizer for text conditioning
	- `ckpts/mmaudio/ext_weights/v1-16.pth`: audio VAE for the 16 kHz inference path
	- `ckpts/mmaudio/ext_weights/best_netG.pt`: vocoder for waveform decoding
	- `ckpts/mmaudio/ext_weights/synchformer_state_dict.pth`: Synchformer weights for online Sync feature extraction
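
As a quick sanity check before inference, the layout listed above can be verified with a short script. This is a minimal sketch, not part of the released code: the helper name `missing_checkpoint_files` is hypothetical, and it assumes `ckpts/` sits in the current working directory unless another root is passed.

```python
from pathlib import Path

# Relative paths copied from the repository tree in this README.
EXPECTED_FILES = [
    "Foley-Omni/v2st.pth",
    "Wan2.2-TI2V-5B/models_t5_umt5-xxl-enc-bf16.pth",
    "Wan2.2-TI2V-5B/google/umt5-xxl/special_tokens_map.json",
    "Wan2.2-TI2V-5B/google/umt5-xxl/spiece.model",
    "Wan2.2-TI2V-5B/google/umt5-xxl/tokenizer.json",
    "Wan2.2-TI2V-5B/google/umt5-xxl/tokenizer_config.json",
    "mmaudio/ext_weights/v1-16.pth",
    "mmaudio/ext_weights/best_netG.pt",
    "mmaudio/ext_weights/synchformer_state_dict.pth",
]


def missing_checkpoint_files(ckpt_root="ckpts"):
    """Return the expected files that are absent under ckpt_root (hypothetical helper)."""
    root = Path(ckpt_root)
    return [rel for rel in EXPECTED_FILES if not (root / rel).is_file()]


if __name__ == "__main__":
    missing = missing_checkpoint_files()
    if missing:
        print("Missing files:")
        for rel in missing:
            print(f"  {rel}")
    else:
        print("All expected checkpoint files are present.")
```

An empty result means every file from the tree is in place. The check only confirms that the files exist, not that their contents are valid.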

	## Online Feature Extraction

	This release supports two V2ST inference modes:

	- direct V2ST inference with pre-extracted `clip_feature_path` and `sync_feature_path`
	- V2ST inference without pre-extracted features, using online visual feature extraction

	Notes:

	- `synchformer_state_dict.pth` is included in this repository because it is required for online Sync feature extraction.
	- The CLIP image encoder is loaded by `open_clip` from `apple/DFN5B-CLIP-ViT-H-14-384` on first use. The current code path does not use a separate local CLIP checkpoint file.
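
The choice between the two modes can be sketched as follows. This is a hedged illustration rather than the project's actual code path: `select_feature_mode` and `load_clip_encoder` are hypothetical names, and the rule that both feature paths must be supplied in order to skip online extraction is an assumption. The `hf-hub:` prefix is how `open_clip` addresses models hosted on the Hugging Face Hub.

```python
from typing import Optional


def select_feature_mode(clip_feature_path: Optional[str] = None,
                        sync_feature_path: Optional[str] = None) -> str:
    """Pick 'pre-extracted' or 'online' (hypothetical helper).

    Assumption: pre-extracted features are used only when BOTH paths are given;
    supplying just one is treated as a configuration error.
    """
    has_clip = clip_feature_path is not None
    has_sync = sync_feature_path is not None
    if has_clip and has_sync:
        return "pre-extracted"
    if has_clip or has_sync:
        raise ValueError(
            "Provide both clip_feature_path and sync_feature_path, or neither."
        )
    return "online"


def load_clip_encoder():
    """Load the CLIP image encoder via open_clip (downloads weights on first use)."""
    import open_clip  # imported lazily; only needed for the online path

    model, preprocess = open_clip.create_model_from_pretrained(
        "hf-hub:apple/DFN5B-CLIP-ViT-H-14-384"
    )
    return model.eval(), preprocess
```

In this sketch the `open_clip` import is deferred, so a run that uses pre-extracted features never has to install the package or download the CLIP weights.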

	## Source Attribution

	This repository redistributes a small subset of files from the following upstream releases for convenience:

	- Wan2.2-TI2V-5B: text encoder and tokenizer files
	- MMAudio: audio VAE, vocoder, and Synchformer files

	Please refer to the original upstream repositories for their licenses, usage terms, and project details.