LanguaMan
/

woosh-model

Model card Files Files and versions

woosh-model / README.md

LanguaMan's picture

Upload README.md

0c6b5b7 verified 1 day ago

|

history blame contribute delete

3.53 kB

	---
	license: cc-by-nc-4.0
	language:
	- en
	pipeline_tag: text-to-audio
	tags:
	- t2a
	- v2a
	- text-to-audio
	- video-to-audio
	- woosh
	- comfyui
	- diffusion
	- audio
	- flow-matching
	---

	# Woosh — Sound Effect Generative Models

	Inference code and open weights for sound effect generative models developed at Sony AI.

	[![GitHub](https://img.shields.io/badge/GitHub-SonyResearch%2FWoosh-black)](https://github.com/SonyResearch/Woosh)
	[![ComfyUI
	Node](https://img.shields.io/badge/ComfyUI-ComfyUI--Woosh-blue)](https://github.com/Saganaki22/ComfyUI-Woosh)
	[![arXiv](https://img.shields.io/badge/arXiv-2502.07359-b31b1b)](https://arxiv.org/abs/2502.07359)



	![Screenshot 2026-04-12 013347](https://cdn-uploads.huggingface.co/production/uploads/63473b59e5c0717e6737b872/kafOo1f9eZYfyyHgcbzPj.png)


	<video controls width="100%">
	<source src="https://huggingface.co/drbaph/Woosh/resolve/main/ComfyUI-Woosh-example.mp4" type="video/mp4">
	Your browser does not support the video tag.
	</video>

	## Models

	\| Model \| Task \| Steps \| CFG \| Description \|
	\|-------\|------\|-------\|-----\|-------------\|
	\| Woosh-Flow \| Text-to-Audio \| 50 \| 4.5 \| Base model, best quality \|
	\| Woosh-DFlow \| Text-to-Audio \| 4 \| 1.0 \| Distilled Flow, fast generation \|
	\| Woosh-VFlow \| Video-to-Audio \| 50 \| 4.5 \| Base video-to-audio model \|
	\| Woosh-DVFlow \| Video-to-Audio \| 4 \| 1.0 \| Distilled VFlow, fast video-to-audio \|

	### Components

	- Woosh-AE — High-quality latent encoder/decoder. Provides latents for generative modeling and decodes audio from
	generated latents.
	- Woosh-CLAP (TextConditionerA/V) — Multimodal text-audio alignment model. Provides token latents for diffusion
	model conditioning. TextConditionerA for T2A, TextConditionerV for V2A.
	- Woosh-Flow / Woosh-DFlow — Original and distilled LDMs for text-to-audio generation.
	- Woosh-VFlow — Multimodal LDM generating audio from video with optional text prompts.

	## ComfyUI Nodes

	Use these models in [ComfyUI](https://github.com/comfyanonymous/ComfyUI) with
	[ComfyUI-Woosh](https://github.com/Saganaki22/ComfyUI-Woosh):

	```bash
	# Via ComfyUI Manager — search "Woosh" and click Install
	# Or manually:
	cd ComfyUI/custom_nodes
	git clone https://github.com/Saganaki22/ComfyUI-Woosh.git
	pip install -r ComfyUI-Woosh/requirements.txt
	```

	Place downloaded model folders in `ComfyUI/models/woosh/`. See the [ComfyUI-Woosh
	README](https://github.com/Saganaki22/ComfyUI-Woosh) for full setup and workflow examples.

	> Note: Set the Woosh TextConditioning node to T2A for Flow/DFlow models and V2A for VFlow/DVFlow models.

	## Inference

	See the [official Woosh repository](https://github.com/SonyResearch/Woosh) for standalone inference code and training
	details.

	## VRAM Requirements

	\| Model \| VRAM (Approx) \|
	\|-------\|---------------\|
	\| Flow / VFlow \| ~8-12 GB \|
	\| DFlow / DVFlow \| ~4-6 GB \|
	\| With CPU offload \| ~2-4 GB \|

	## Citation

	```bibtex
	@article{saghibakshi2025woosh,
	title={Woosh: Enhancing Text-to-Audio Generation with Flow Matching and FlowMap Distillation},
	author={Saghibakshi, Ali and Bakshi, Soroosh and Tagliasacchi, Antonio and Wang, Shaojie and Choi, Jongmin and
	Kawakami, Kazuhiro and Gu, Yuxuan},
	journal={arXiv preprint arXiv:2502.07359},
	year={2025}
	}
	```

	## License

	- Code — Apache 2.0
	- Model Weights — [CC BY-NC 4.0](https://creativecommons.org/licenses/by-nc/4.0/)