---
license: cc-by-nc-4.0
language:
- en
pipeline_tag: text-to-audio
tags:
- t2a
- v2a
- text-to-audio
- video-to-audio
- woosh
- comfyui
- diffusion
- audio
- flow-matching
---
# Woosh — Sound Effect Generative Models
Inference code and open weights for sound effect generative models developed at Sony AI.
## Models
| Model | Task | Steps | CFG | Description |
|---|---|---|---|---|
| Woosh-Flow | Text-to-Audio | 50 | 4.5 | Base model, best quality |
| Woosh-DFlow | Text-to-Audio | 4 | 1.0 | Distilled Flow, fast generation |
| Woosh-VFlow | Video-to-Audio | 50 | 4.5 | Base video-to-audio model |
| Woosh-DVFlow | Video-to-Audio | 4 | 1.0 | Distilled VFlow, fast video-to-audio |
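The CFG column is the classifier-free guidance scale applied at sampling time. As a rough illustration (not the repository's implementation), a guided prediction is typically formed by extrapolating from the unconditional prediction toward the conditional one:

```python
def apply_cfg(cond_pred, uncond_pred, cfg_scale):
    """Classifier-free guidance: extrapolate from the unconditional
    prediction toward the conditional one by cfg_scale."""
    return [u + cfg_scale * (c - u) for c, u in zip(cond_pred, uncond_pred)]

# cfg_scale = 1.0 (the distilled models): guidance is a no-op,
# so the extra unconditional forward pass can be skipped.
assert apply_cfg([2.0], [1.0], 1.0) == [2.0]

# cfg_scale = 4.5 (the base models): the conditional signal is amplified.
print(apply_cfg([2.0], [1.0], 4.5))  # [5.5]
```

This is also why the distilled models list a CFG of 1.0: guidance is typically baked into the student during distillation, so no separate unconditional pass is needed at inference.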
## Components
- Woosh-AE — High-quality latent encoder/decoder. Provides latents for generative modeling and decodes audio from generated latents.
- Woosh-CLAP (TextConditionerA/V) — Multimodal text-audio alignment model. Provides token latents for diffusion model conditioning. TextConditionerA for T2A, TextConditionerV for V2A.
- Woosh-Flow / Woosh-DFlow — Original and distilled LDMs for text-to-audio generation.
- Woosh-VFlow — Multimodal LDM generating audio from video with optional text prompts.
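The Flow models are flow-matching LDMs: sampling numerically integrates a learned velocity field from noise to a clean latent, and the step counts in the table above are the number of integration steps. A minimal fixed-step Euler sketch with a toy velocity field (the real models predict velocities in the Woosh-AE latent space; `velocity` here is a stand-in):

```python
def sample_flow(velocity, x, num_steps):
    """Integrate dx/dt = velocity(x, t) from t=0 (noise) to t=1 (data)
    with fixed-step Euler, as in typical flow-matching samplers."""
    dt = 1.0 / num_steps
    t = 0.0
    for _ in range(num_steps):
        x = x + dt * velocity(x, t)
        t += dt
    return x

# Toy velocity field that transports any starting point to 3.0 at t=1:
target = 3.0
v = lambda x, t: (target - x) / (1.0 - t) if t < 1.0 else 0.0

x_base = sample_flow(v, 0.0, 50)      # many steps, as in Flow/VFlow
x_distilled = sample_flow(v, 0.0, 4)  # few steps, as in DFlow/DVFlow
```

With a toy field both step counts land on the target; for a real model, few-step sampling is only accurate because distillation trains the student's 4-step trajectory to match the teacher's 50-step one.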
## ComfyUI Nodes
Use these models in ComfyUI with ComfyUI-Woosh:
```shell
# Via ComfyUI Manager — search "Woosh" and click Install
# Or manually:
cd ComfyUI/custom_nodes
git clone https://github.com/Saganaki22/ComfyUI-Woosh.git
pip install -r ComfyUI-Woosh/requirements.txt
```
Place downloaded model folders in `ComfyUI/models/woosh/`. See the ComfyUI-Woosh README for full setup and workflow examples.
**Note:** Set the Woosh TextConditioning node to T2A for Flow/DFlow models and V2A for VFlow/DVFlow models.
## Inference
See the official Woosh repository for standalone inference code and training details.
## VRAM Requirements
| Model | VRAM (approx.) |
|---|---|
| Flow / VFlow | ~8-12 GB |
| DFlow / DVFlow | ~4-6 GB |
| With CPU offload | ~2-4 GB |
## Citation
```bibtex
@article{saghibakshi2025woosh,
  title={Woosh: Enhancing Text-to-Audio Generation with Flow Matching and FlowMap Distillation},
  author={Saghibakshi, Ali and Bakshi, Soroosh and Tagliasacchi, Antonio and Wang, Shaojie and Choi, Jongmin and Kawakami, Kazuhiro and Gu, Yuxuan},
  journal={arXiv preprint arXiv:2502.07359},
  year={2025}
}
```
## License
- Code — Apache 2.0
- Model Weights — CC BY-NC 4.0
