---
license: cc-by-nc-4.0
language:
- en
pipeline_tag: text-to-audio
tags:
- t2a
- v2a
- text-to-audio
- video-to-audio
- woosh
- comfyui
- diffusion
- audio
- flow-matching
---

# Woosh: Sound Effect Generative Models

Inference code and open weights for sound effect generative models developed at Sony AI.

[GitHub: SonyResearch/Woosh](https://github.com/SonyResearch/Woosh)
[ComfyUI-Woosh](https://github.com/Saganaki22/ComfyUI-Woosh)
[arXiv:2502.07359](https://arxiv.org/abs/2502.07359)

<video controls width="100%">
<source src="https://huggingface.co/drbaph/Woosh/resolve/main/ComfyUI-Woosh-example.mp4" type="video/mp4">
Your browser does not support the video tag.
</video>

## Models

| Model | Task | Steps | CFG | Description |
|-------|------|-------|-----|-------------|
| **Woosh-Flow** | Text-to-Audio | 50 | 4.5 | Base model, best quality |
| **Woosh-DFlow** | Text-to-Audio | 4 | 1.0 | Distilled Flow, fast generation |
| **Woosh-VFlow** | Video-to-Audio | 50 | 4.5 | Base video-to-audio model |
| **Woosh-DVFlow** | Video-to-Audio | 4 | 1.0 | Distilled VFlow, fast video-to-audio |

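When scripting generation, the recommended settings in the table above can be kept in a small lookup. The following is a minimal sketch, not part of the Woosh or ComfyUI-Woosh code: `SamplingDefaults`, `WOOSH_DEFAULTS`, and `defaults_for` are hypothetical names, and the values are copied directly from the table.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SamplingDefaults:
    """Recommended sampling settings for one model (values from the table)."""
    task: str
    steps: int
    cfg: float


# Hypothetical lookup table; keys are the model names used in this README.
WOOSH_DEFAULTS = {
    "Woosh-Flow": SamplingDefaults("text-to-audio", 50, 4.5),
    "Woosh-DFlow": SamplingDefaults("text-to-audio", 4, 1.0),
    "Woosh-VFlow": SamplingDefaults("video-to-audio", 50, 4.5),
    "Woosh-DVFlow": SamplingDefaults("video-to-audio", 4, 1.0),
}


def defaults_for(model_name: str) -> SamplingDefaults:
    """Return the recommended settings, or raise ValueError for unknown names."""
    try:
        return WOOSH_DEFAULTS[model_name]
    except KeyError:
        known = ", ".join(sorted(WOOSH_DEFAULTS))
        raise ValueError(f"Unknown model {model_name!r}; expected one of: {known}") from None
```

For example, `defaults_for("Woosh-DFlow")` gives 4 steps at CFG 1.0, while the base models use 50 steps at CFG 4.5.
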
### Components

- **Woosh-AE**: High-quality latent encoder/decoder. Provides latents for generative modeling and decodes audio from generated latents.
- **Woosh-CLAP (TextConditionerA/V)**: Multimodal text-audio alignment model. Provides token latents for diffusion model conditioning. TextConditionerA for T2A, TextConditionerV for V2A.
- **Woosh-Flow / Woosh-DFlow**: Original and distilled LDMs for text-to-audio generation.
- **Woosh-VFlow**: Multimodal LDM generating audio from video with optional text prompts.

## ComfyUI Nodes

Use these models in [ComfyUI](https://github.com/comfyanonymous/ComfyUI) with [ComfyUI-Woosh](https://github.com/Saganaki22/ComfyUI-Woosh):

```bash
# Via ComfyUI Manager: search "Woosh" and click Install
# Or manually:
cd ComfyUI/custom_nodes
git clone https://github.com/Saganaki22/ComfyUI-Woosh.git
pip install -r ComfyUI-Woosh/requirements.txt
```

Place downloaded model folders in `ComfyUI/models/woosh/`. See the [ComfyUI-Woosh README](https://github.com/Saganaki22/ComfyUI-Woosh) for full setup and workflow examples.

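The placement step can also be scripted. The sketch below assumes the third-party `huggingface_hub` package is installed and that the weights live in the `drbaph/Woosh` repository (inferred from the example video URL above, not confirmed by this README). `woosh_models_dir` and `download_woosh` are hypothetical helper names, and whether the downloaded layout matches what ComfyUI-Woosh expects should be checked against its README.

```python
from pathlib import Path


def woosh_models_dir(comfyui_root) -> Path:
    """Return the folder this README says model folders belong in."""
    return Path(comfyui_root) / "models" / "woosh"


def download_woosh(comfyui_root, repo_id: str = "drbaph/Woosh") -> Path:
    """Download a snapshot of the repository into ComfyUI/models/woosh/.

    Assumption: repo_id is inferred from the example video URL in this README.
    """
    # Imported lazily so the path helper works without huggingface_hub installed.
    from huggingface_hub import snapshot_download

    target = woosh_models_dir(comfyui_root)
    target.mkdir(parents=True, exist_ok=True)
    snapshot_download(repo_id=repo_id, local_dir=str(target))
    return target
```

Calling `download_woosh("ComfyUI")` fetches every file in the repository, including the example video; `snapshot_download` also accepts an `allow_patterns` argument if only some folders are needed.
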
> **Note:** Set the Woosh TextConditioning node to **T2A** for Flow/DFlow models and **V2A** for VFlow/DVFlow models.

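The rule in the note is easy to get wrong when switching models, so a small guard can help in scripts that build workflows. This is only an illustrative sketch: `conditioner_mode` is a hypothetical helper, not a ComfyUI-Woosh function, and it encodes nothing beyond the note above.

```python
def conditioner_mode(model_name: str) -> str:
    """Map a Woosh model name to the TextConditioning mode from the note."""
    name = model_name.strip().lower()
    # Accept both "Woosh-DFlow" and the short form "DFlow".
    if name.startswith("woosh-"):
        name = name[len("woosh-"):]
    if name in {"flow", "dflow"}:
        return "T2A"  # text-to-audio models
    if name in {"vflow", "dvflow"}:
        return "V2A"  # video-to-audio models
    raise ValueError(f"Unknown Woosh model: {model_name!r}")
```
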
## Inference

See the [official Woosh repository](https://github.com/SonyResearch/Woosh) for standalone inference code and training details.

## VRAM Requirements

| Configuration | VRAM (approx.) |
|---------------|----------------|
| Flow / VFlow | ~8-12 GB |
| DFlow / DVFlow | ~4-6 GB |
| With CPU offload | ~2-4 GB |

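As a rough guide, the table can be turned into a helper that suggests a configuration for a given VRAM budget. This is a sketch under stated assumptions: `suggest_setup` is a hypothetical name, and the thresholds are the upper ends of the approximate ranges above, chosen conservatively rather than measured.

```python
def suggest_setup(free_vram_gb: float) -> str:
    """Suggest a configuration using the upper bounds of the approximate ranges."""
    if free_vram_gb >= 12:
        return "Flow / VFlow"      # base models, listed at ~8-12 GB
    if free_vram_gb >= 6:
        return "DFlow / DVFlow"    # distilled models, listed at ~4-6 GB
    if free_vram_gb >= 4:
        return "CPU offload"       # listed at ~2-4 GB
    return "below listed ranges"   # may still work with offload; untested here
```

Because the figures are approximate, a GPU inside a range (for example 10 GB for the base models) may still work; the helper simply errs on the safe side.
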
## Citation

```bibtex
@article{saghibakshi2025woosh,
  title={Woosh: Enhancing Text-to-Audio Generation with Flow Matching and FlowMap Distillation},
  author={Saghibakshi, Ali and Bakshi, Soroosh and Tagliasacchi, Antonio and Wang, Shaojie and Choi, Jongmin and Kawakami, Kazuhiro and Gu, Yuxuan},
  journal={arXiv preprint arXiv:2502.07359},
  year={2025}
}
```

## License

- **Code:** Apache 2.0
- **Model Weights:** [CC BY-NC 4.0](https://creativecommons.org/licenses/by-nc/4.0/)