---
library_name: pytorch
license: mit
pipeline_tag: text-to-audio
---

# Foley-Omni


[GitHub Code](https://github.com/NJU-Speech/Foley-Omni) | [arXiv](https://arxiv.org/abs/2606.03672) | [Demo](https://ty0402.github.io/Foley-omni-Web/)

## Overview

This repository packages the public inference checkpoint set for **Foley-Omni**.
The release focuses on **Video-to-Soundtrack (V2ST)** generation, where the model jointly generates synchronized **speech**, **sound effects**, and **music** from a video and an optional text prompt.

## Model Size
5.5B

## Repository Contents

```text
ckpts/
├── Foley-Omni/
│   └── v2st.pth
├── Wan2.2-TI2V-5B/
│   ├── models_t5_umt5-xxl-enc-bf16.pth
│   └── google/
│       └── umt5-xxl/
│           ├── special_tokens_map.json
│           ├── spiece.model
│           ├── tokenizer.json
│           └── tokenizer_config.json
└── mmaudio/
    └── ext_weights/
        ├── v1-16.pth
        ├── best_netG.pt
        └── synchformer_state_dict.pth
```

What each part is used for:

- `ckpts/Foley-Omni/v2st.pth`: released inference-only Foley-Omni weights
- `ckpts/Wan2.2-TI2V-5B/*`: text encoder and tokenizer for text conditioning
- `ckpts/mmaudio/ext_weights/v1-16.pth`: audio VAE for the 16 kHz inference path
- `ckpts/mmaudio/ext_weights/best_netG.pt`: vocoder for waveform decoding
- `ckpts/mmaudio/ext_weights/synchformer_state_dict.pth`: online visual feature extraction
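
Before running inference, it can help to confirm that the download matches the layout above. The following is a minimal sketch, not part of the Foley-Omni code base: the helper name `missing_checkpoints` and the default `ckpts` root are assumptions made for illustration, and the file list is copied from the tree in this README.

```python
from pathlib import Path

# Relative paths copied from the directory tree listed in this README.
EXPECTED_FILES = [
    "Foley-Omni/v2st.pth",
    "Wan2.2-TI2V-5B/models_t5_umt5-xxl-enc-bf16.pth",
    "Wan2.2-TI2V-5B/google/umt5-xxl/special_tokens_map.json",
    "Wan2.2-TI2V-5B/google/umt5-xxl/spiece.model",
    "Wan2.2-TI2V-5B/google/umt5-xxl/tokenizer.json",
    "Wan2.2-TI2V-5B/google/umt5-xxl/tokenizer_config.json",
    "mmaudio/ext_weights/v1-16.pth",
    "mmaudio/ext_weights/best_netG.pt",
    "mmaudio/ext_weights/synchformer_state_dict.pth",
]


def missing_checkpoints(ckpt_root="ckpts"):
    """Return the expected files that are absent under ckpt_root (hypothetical helper)."""
    root = Path(ckpt_root)
    return [rel for rel in EXPECTED_FILES if not (root / rel).is_file()]


if __name__ == "__main__":
    missing = missing_checkpoints()
    if missing:
        print("Missing files:")
        for rel in missing:
            print(f"  ckpts/{rel}")
    else:
        print("All expected checkpoint files are present.")
```

The check only tests for file presence; it does not validate file sizes or hashes.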

## Online Feature Extraction

This release supports two V2ST inference modes:

- direct V2ST inference with pre-extracted `clip_feature_path` and `sync_feature_path`
- V2ST inference without pre-extracted features, using online visual feature extraction

Notes:

- `synchformer_state_dict.pth` is included in this repository because it is required for online Sync feature extraction.
- The CLIP image encoder is loaded by `open_clip` from `apple/DFN5B-CLIP-ViT-H-14-384` on first use. The current code path does not use a separate local CLIP checkpoint file.
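
The choice between the two modes can be sketched as a small decision helper. This is an illustration only: `clip_feature_path` and `sync_feature_path` are the names used above, but the function `select_feature_mode`, its return values, and the rule that both paths must be given together are assumptions, not the behavior of the released inference script.

```python
import os


def select_feature_mode(clip_feature_path=None, sync_feature_path=None):
    """Pick an inference mode from the two optional feature paths.

    Hypothetical helper: returns "pre-extracted" when both paths are given
    and exist on disk, and "online" when neither is given.
    """
    paths = (clip_feature_path, sync_feature_path)
    if all(paths):
        missing = [p for p in paths if not os.path.isfile(p)]
        if missing:
            raise FileNotFoundError(f"Feature file(s) not found: {missing}")
        return "pre-extracted"
    if any(paths):
        # Assumption: a single feature file is treated as a configuration error.
        raise ValueError(
            "Provide both clip_feature_path and sync_feature_path, or neither."
        )
    # No features supplied: online extraction would be used (Synchformer + CLIP).
    return "online"
```

In online mode, expect a network download the first time the CLIP image encoder is loaded, since no local CLIP checkpoint is shipped in this repository.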

## Source Attribution

This repository redistributes a small subset of files from the following upstream releases for convenience:

- **Wan2.2-TI2V-5B**: text encoder and tokenizer files
- **MMAudio**: audio VAE, vocoder, and Synchformer files

Please refer to the original upstream repositories for their licenses, usage terms, and project details.