---
license: cc-by-nc-4.0
language:
- en
pipeline_tag: text-to-audio
tags:
- t2a
- v2a
- text-to-audio
- video-to-audio
- woosh
- comfyui
- diffusion
- audio
- flow-matching
---
# Woosh: Sound Effect Generative Models
Inference code and open weights for sound effect generative models developed at Sony AI.
[![GitHub](https://img.shields.io/badge/GitHub-SonyResearch%2FWoosh-black)](https://github.com/SonyResearch/Woosh)
[![ComfyUI Node](https://img.shields.io/badge/ComfyUI-ComfyUI--Woosh-blue)](https://github.com/Saganaki22/ComfyUI-Woosh)
[![arXiv](https://img.shields.io/badge/arXiv-2502.07359-b31b1b)](https://arxiv.org/abs/2502.07359)
![ComfyUI-Woosh workflow screenshot](https://cdn-uploads.huggingface.co/production/uploads/63473b59e5c0717e6737b872/kafOo1f9eZYfyyHgcbzPj.png)
<video controls width="100%">
<source src="https://huggingface.co/drbaph/Woosh/resolve/main/ComfyUI-Woosh-example.mp4" type="video/mp4">
Your browser does not support the video tag.
</video>
## Models
| Model | Task | Steps | CFG | Description |
|-------|------|-------|-----|-------------|
| **Woosh-Flow** | Text-to-Audio | 50 | 4.5 | Base model, best quality |
| **Woosh-DFlow** | Text-to-Audio | 4 | 1.0 | Distilled Flow, fast generation |
| **Woosh-VFlow** | Video-to-Audio | 50 | 4.5 | Base video-to-audio model |
| **Woosh-DVFlow** | Video-to-Audio | 4 | 1.0 | Distilled VFlow, fast video-to-audio |
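The recommended sampler settings from the table can be captured in a small lookup, which is handy when wiring up batch scripts. This is an illustrative sketch only; the model names follow the table, but `SAMPLER_SETTINGS` and `settings_for` are hypothetical helpers, not part of any Woosh API:

```python
# Recommended sampler settings from the table above (illustrative helper).
SAMPLER_SETTINGS = {
    "Woosh-Flow":   {"task": "t2a", "steps": 50, "cfg": 4.5},
    "Woosh-DFlow":  {"task": "t2a", "steps": 4,  "cfg": 1.0},
    "Woosh-VFlow":  {"task": "v2a", "steps": 50, "cfg": 4.5},
    "Woosh-DVFlow": {"task": "v2a", "steps": 4,  "cfg": 1.0},
}

def settings_for(model: str) -> dict:
    """Look up recommended steps/CFG, falling back to the base model."""
    return SAMPLER_SETTINGS.get(model, SAMPLER_SETTINGS["Woosh-Flow"])

print(settings_for("Woosh-DFlow"))  # {'task': 't2a', 'steps': 4, 'cfg': 1.0}
```

Note that the distilled models (DFlow/DVFlow) use CFG 1.0, i.e. classifier-free guidance is effectively disabled, since guidance is baked in during distillation.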
### Components
- **Woosh-AE**: high-quality latent encoder/decoder. Provides latents for generative modeling and decodes audio from generated latents.
- **Woosh-CLAP (TextConditionerA/V)**: multimodal text-audio alignment model. Provides token latents for diffusion model conditioning; TextConditionerA for T2A, TextConditionerV for V2A.
- **Woosh-Flow / Woosh-DFlow**: original and distilled LDMs for text-to-audio generation.
- **Woosh-VFlow**: multimodal LDM generating audio from video with optional text prompts.
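The components above compose into one pipeline: prompt → conditioner tokens → flow-model latents → decoded audio. A minimal sketch of that data flow, using hypothetical stand-in functions (not the real Woosh API) so the shapes and hand-offs are concrete:

```python
# Illustrative data flow through the Woosh components.
# All three functions are stand-ins, not the actual model interfaces.

def text_conditioner(prompt: str, dim: int = 8) -> list[list[float]]:
    """Stand-in for Woosh-CLAP: one token latent per word."""
    return [[float(len(w))] * dim for w in prompt.split()]

def flow_model(tokens, steps: int = 50, latent_len: int = 16) -> list[float]:
    """Stand-in for Woosh-Flow: iteratively refine a latent over `steps`."""
    latent = [0.0] * latent_len
    for _ in range(steps):
        latent = [x + sum(t[0] for t in tokens) / steps for x in latent]
    return latent

def autoencoder_decode(latent, upsample: int = 4) -> list[float]:
    """Stand-in for Woosh-AE: decode latents to a (longer) waveform."""
    return [x for x in latent for _ in range(upsample)]

tokens = text_conditioner("glass shattering on concrete")
latent = flow_model(tokens, steps=50)
audio = autoencoder_decode(latent)
print(len(audio))  # 64: latent_len * upsample
```

For V2A, the same flow applies with TextConditionerV and video features feeding the flow model instead.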
## ComfyUI Nodes
Use these models in [ComfyUI](https://github.com/comfyanonymous/ComfyUI) with
[ComfyUI-Woosh](https://github.com/Saganaki22/ComfyUI-Woosh):
```bash
# Via ComfyUI Manager: search "Woosh" and click Install
# Or manually:
cd ComfyUI/custom_nodes
git clone https://github.com/Saganaki22/ComfyUI-Woosh.git
pip install -r ComfyUI-Woosh/requirements.txt
```
Place downloaded model folders in `ComfyUI/models/woosh/`. See the [ComfyUI-Woosh README](https://github.com/Saganaki22/ComfyUI-Woosh) for full setup and workflow examples.
> **Note:** Set the Woosh TextConditioning node to **T2A** for Flow/DFlow models and **V2A** for VFlow/DVFlow models.
## Inference
See the [official Woosh repository](https://github.com/SonyResearch/Woosh) for standalone inference code and training
details.
## VRAM Requirements
| Model | VRAM (Approx) |
|-------|---------------|
| Flow / VFlow | ~8-12 GB |
| DFlow / DVFlow | ~4-6 GB |
| With CPU offload | ~2-4 GB |
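A quick way to sanity-check a GPU against the table is to compare available memory with the upper bound of each range. The figures below come from the table; the helper itself is illustrative, not part of any Woosh tooling:

```python
# Approximate VRAM upper bounds (GB) from the table above.
VRAM_GB = {"flow": 12, "dflow": 6, "offload": 4}

def fits(model: str, available_gb: float, cpu_offload: bool = False) -> bool:
    """True if `available_gb` meets the approximate upper-bound requirement."""
    key = "offload" if cpu_offload else model
    return available_gb >= VRAM_GB[key]

print(fits("dflow", 8))                   # True: 8 GB >= ~6 GB
print(fits("flow", 8))                    # False: 8 GB < ~12 GB
print(fits("flow", 8, cpu_offload=True))  # True with CPU offload
```

In practice an 8 GB card comfortably runs the distilled models, while the base Flow/VFlow models need either a larger card or CPU offload.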
## Citation
```bibtex
@article{saghibakshi2025woosh,
  title={Woosh: Enhancing Text-to-Audio Generation with Flow Matching and FlowMap Distillation},
  author={Saghibakshi, Ali and Bakshi, Soroosh and Tagliasacchi, Antonio and Wang, Shaojie and Choi, Jongmin and Kawakami, Kazuhiro and Gu, Yuxuan},
  journal={arXiv preprint arXiv:2502.07359},
  year={2025}
}
```
## License
- **Code**: Apache 2.0
- **Model Weights**: [CC BY-NC 4.0](https://creativecommons.org/licenses/by-nc/4.0/)