+---
+license: cc-by-nc-4.0
+language:
+- en
+pipeline_tag: text-to-audio
+tags:
+- t2a
+- v2a
+- text-to-audio
+- video-to-audio
+- woosh
+- comfyui
+- diffusion
+- audio
+- flow-matching
+---
+ # Woosh — Sound Effect Generative Models
+  Inference code and open weights for sound effect generative models developed at Sony AI.
+  [![GitHub](https://img.shields.io/badge/GitHub-SonyResearch%2FWoosh-black)](https://github.com/SonyResearch/Woosh)
+  [![ComfyUI Node](https://img.shields.io/badge/ComfyUI-ComfyUI--Woosh-blue)](https://github.com/Saganaki22/ComfyUI-Woosh)
+  [![arXiv](https://img.shields.io/badge/arXiv-2502.07359-b31b1b)](https://arxiv.org/abs/2502.07359)
+![Screenshot 2026-04-12 013347](https://cdn-uploads.huggingface.co/production/uploads/63473b59e5c0717e6737b872/kafOo1f9eZYfyyHgcbzPj.png)
+<video controls width="100%">
+  <source src="https://huggingface.co/drbaph/Woosh/resolve/main/ComfyUI-Woosh-example.mp4" type="video/mp4">
+  Your browser does not support the video tag.
+</video>
+  ## Models
+  | Model | Task | Steps | CFG | Description |
+  |-------|------|-------|-----|-------------|
+  | **Woosh-Flow** | Text-to-Audio | 50 | 4.5 | Base model, best quality |
+  | **Woosh-DFlow** | Text-to-Audio | 4 | 1.0 | Distilled Flow, fast generation |
+  | **Woosh-VFlow** | Video-to-Audio | 50 | 4.5 | Base video-to-audio model |
+  | **Woosh-DVFlow** | Video-to-Audio | 4 | 1.0 | Distilled VFlow, fast video-to-audio |
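The Steps and CFG columns can be read as the number of sampler iterations and the classifier-free guidance scale. The sketch below is a generic illustration in plain Python, not Woosh's actual sampler: `guided_velocity`, `euler_sample`, and `toy_velocity` are hypothetical names, the guidance formula is one common convention, and the toy function stands in for the network. If the CFG column follows this convention, a value of 1.0 means guidance is effectively switched off and only the conditional prediction is used.

```python
from typing import Callable, List


def guided_velocity(v_cond: List[float], v_uncond: List[float],
                    cfg_scale: float) -> List[float]:
    """Blend conditional and unconditional predictions.

    Uses v = v_uncond + scale * (v_cond - v_uncond), one common
    classifier-free guidance convention (an assumption here).
    """
    return [u + cfg_scale * (c - u) for c, u in zip(v_cond, v_uncond)]


def euler_sample(x0: List[float],
                 velocity_fn: Callable[[List[float], float, bool], List[float]],
                 steps: int, cfg_scale: float) -> List[float]:
    """Integrate from t=0 to t=1 with `steps` equal Euler updates."""
    x = list(x0)
    dt = 1.0 / steps
    for i in range(steps):
        t = i * dt
        v_cond = velocity_fn(x, t, True)     # prediction with the prompt
        v_uncond = velocity_fn(x, t, False)  # prediction without the prompt
        v = guided_velocity(v_cond, v_uncond, cfg_scale)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


def toy_velocity(x: List[float], t: float, conditioned: bool) -> List[float]:
    """Toy stand-in for the network: a constant velocity field."""
    value = 1.0 if conditioned else 0.0
    return [value for _ in x]


if __name__ == "__main__":
    # Settings taken from the table above: (steps, CFG).
    print(euler_sample([0.0], toy_velocity, steps=50, cfg_scale=4.5))
    print(euler_sample([0.0], toy_velocity, steps=4, cfg_scale=1.0))
```

Each step costs two network evaluations when guidance is active, so under these assumptions the 4-step, CFG 1.0 settings need far fewer evaluations than the 50-step, CFG 4.5 settings.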
+  ### Components
+  - **Woosh-AE** — High-quality latent encoder/decoder. Provides latents for generative modeling and decodes audio from
+  generated latents.
+  - **Woosh-CLAP (TextConditionerA/V)** — Multimodal text-audio alignment model. Provides token latents for diffusion
+  model conditioning. Use TextConditionerA for text-to-audio (T2A) and TextConditionerV for video-to-audio (V2A).
+  - **Woosh-Flow / Woosh-DFlow** — Original and distilled LDMs for text-to-audio generation.
+  - **Woosh-VFlow** — Multimodal LDM generating audio from video with optional text prompts.
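The way the components above fit together can be sketched as a three-stage data flow: the conditioner turns the prompt into conditioning tokens, the flow model produces latents, and the autoencoder decodes them to audio. This is only a structural illustration; `PipelineSketch`, its field names, and the toy stand-in functions are hypothetical and do not reflect the real Woosh classes or signatures. `video_features` is a placeholder for whatever video representation VFlow consumes.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional

Vector = List[float]


@dataclass
class PipelineSketch:
    """Hypothetical wiring of the three component roles."""
    conditioner: Callable[[str], Vector]                     # Woosh-CLAP role
    generator: Callable[[Vector, Optional[Vector]], Vector]  # Flow / VFlow role
    decoder: Callable[[Vector], Vector]                      # Woosh-AE decode role

    def generate(self, prompt: str,
                 video_features: Optional[Vector] = None) -> Vector:
        conditioning = self.conditioner(prompt)
        latents = self.generator(conditioning, video_features)
        return self.decoder(latents)


# Toy stand-ins so the sketch runs without any model weights.
def toy_conditioner(prompt: str) -> Vector:
    return [float(len(prompt))]


def toy_generator(conditioning: Vector,
                  video_features: Optional[Vector]) -> Vector:
    offset = sum(video_features) if video_features else 0.0
    return [c + offset for c in conditioning]


def toy_decoder(latents: Vector) -> Vector:
    return [2.0 * z for z in latents]


pipeline = PipelineSketch(toy_conditioner, toy_generator, toy_decoder)
text_only = pipeline.generate("rain")               # text-to-audio path
with_video = pipeline.generate("rain", [1.0, 0.5])  # video-to-audio path
```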
+  ## ComfyUI Nodes
+  Use these models in [ComfyUI](https://github.com/comfyanonymous/ComfyUI) with
+  [ComfyUI-Woosh](https://github.com/Saganaki22/ComfyUI-Woosh):
+  ```bash
+  # Via ComfyUI Manager — search "Woosh" and click Install
+  # Or manually:
+  cd ComfyUI/custom_nodes
+  git clone https://github.com/Saganaki22/ComfyUI-Woosh.git
+  pip install -r ComfyUI-Woosh/requirements.txt
+  ```
+  Place downloaded model folders in `ComfyUI/models/woosh/`. See the [ComfyUI-Woosh
+  README](https://github.com/Saganaki22/ComfyUI-Woosh) for full setup and workflow examples.
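As one possible way to fetch the weights into that folder, the sketch below uses `snapshot_download` from the `huggingface_hub` package. The helper names `woosh_models_dir` and `download_woosh` are hypothetical, the `repo_id` must be supplied by the reader, and the sketch assumes the repository already contains one folder per model; check the ComfyUI-Woosh README for the layout the nodes actually expect.

```python
from pathlib import Path


def woosh_models_dir(comfyui_root: str) -> Path:
    """Return the folder this README names for Woosh models."""
    return Path(comfyui_root) / "models" / "woosh"


def download_woosh(repo_id: str, comfyui_root: str) -> Path:
    """Download a Hugging Face repo into ComfyUI/models/woosh/.

    Assumption: the repo's top-level folders are the model folders,
    so mirroring the repo into the target directory is sufficient.
    """
    # Imported lazily so the path helper works without the package.
    from huggingface_hub import snapshot_download

    target = woosh_models_dir(comfyui_root)
    target.mkdir(parents=True, exist_ok=True)
    snapshot_download(repo_id=repo_id, local_dir=str(target))
    return target
```

Calling `download_woosh("<owner>/<repo>", "ComfyUI")` with the repository you want would mirror its files under `ComfyUI/models/woosh/`.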
+  > **Note:** Set the Woosh TextConditioning node to **T2A** for Flow/DFlow models and **V2A** for VFlow/DVFlow models.
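The rule in the note can be written down as a small lookup, which may help when scripting workflows. `conditioner_mode` is a hypothetical helper, not part of ComfyUI-Woosh; it only encodes the mapping stated above.

```python
# Mapping from model family to the TextConditioning mode named in the note.
CONDITIONER_MODE = {
    "flow": "T2A",
    "dflow": "T2A",
    "vflow": "V2A",
    "dvflow": "V2A",
}


def conditioner_mode(model_name: str) -> str:
    """Return "T2A" or "V2A" for names like "Woosh-DFlow" or "vflow"."""
    key = model_name.strip().lower()
    if key.startswith("woosh-"):
        key = key[len("woosh-"):]
    if key not in CONDITIONER_MODE:
        raise ValueError(f"Unknown Woosh model: {model_name!r}")
    return CONDITIONER_MODE[key]
```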
+  ## Inference
+  See the [official Woosh repository](https://github.com/SonyResearch/Woosh) for standalone inference code and training
+  details.
+  ## VRAM Requirements
+  | Configuration | VRAM (approx.) |
+  |-------|---------------|
+  | Flow / VFlow | ~8-12 GB |
+  | DFlow / DVFlow | ~4-6 GB |
+  | With CPU offload | ~2-4 GB |
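To turn the table into a quick check, the sketch below lists which rows should fit a given amount of free VRAM. It is a conservative reading that uses the upper end of each approximate range as the requirement; `viable_setups` and `UPPER_BOUND_GB` are hypothetical names, and real usage will vary with audio length and other settings.

```python
from typing import List, Tuple

# Upper end of each approximate range from the table above, in GB.
UPPER_BOUND_GB: List[Tuple[str, float]] = [
    ("Flow / VFlow", 12.0),
    ("DFlow / DVFlow", 6.0),
    ("With CPU offload", 4.0),
]


def viable_setups(free_vram_gb: float) -> List[str]:
    """Return table rows whose upper-bound estimate fits in free VRAM."""
    return [name for name, need in UPPER_BOUND_GB if free_vram_gb >= need]
```

With PyTorch installed, the free amount can be read from `torch.cuda.mem_get_info()`, which returns free and total memory in bytes.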
+  ## Citation
+  ```bibtex
+  @article{saghibakshi2025woosh,
+        title={Woosh: Enhancing Text-to-Audio Generation with Flow Matching and FlowMap Distillation},
+        author={Saghibakshi, Ali and Bakshi, Soroosh and Tagliasacchi, Antonio and Wang, Shaojie and Choi, Jongmin and
+  Kawakami, Kazuhiro and Gu, Yuxuan},
+        journal={arXiv preprint arXiv:2502.07359},
+        year={2025}
+  }
+  ```
+  ## License
+  - **Code** — Apache 2.0
+  - **Model Weights** — [CC BY-NC 4.0](https://creativecommons.org/licenses/by-nc/4.0/)