Papers
arxiv:2603.07865

SoundWeaver: Semantic Warm-Starting for Text-to-Audio Diffusion Serving

Published on Mar 9
· Submitted by
Ayush Barik
on Mar 13
Authors:
,
,
,
,
,

Abstract

SoundWeaver accelerates text-to-audio diffusion generation by caching semantically similar audio and dynamically skipping function evaluations, achieving significant latency reduction with minimal quality loss.

AI-generated summary

Text-to-audio diffusion models produce high-fidelity audio but require tens of function evaluations (NFEs), incurring multi-second latency and limited throughput. We present SoundWeaver, the first training-free, model-agnostic serving system that accelerates text-to-audio diffusion by warm-starting from semantically similar cached audio. SoundWeaver introduces three components: a Reference Selector that retrieves and temporally aligns cached candidates via semantic and duration-aware gating; a Skip Gater that dynamically determines the percentage of NFEs to skip; and a lightweight Cache Manager that maintains cache utility through quality-aware eviction and refinement. On real-world audio traces, SoundWeaver achieves 1.8--3.0 times latency reduction with a cache of only {sim}1K entries while preserving or improving perceptual quality.

Community

Paper author Paper submitter

Screenshot 2026-03-12 at 9.33.26 PM

Tired of multi-second waits for stunning AI audio? We introduce SoundWeaver, the first training-free, model-agnostic serving system that revolutionizes text-to-audio diffusion by semantically warm-starting from a tiny cache of similar audio clips! With just ~1K cached entries, it delivers massive 1.8–3.0× latency reduction while actually improving perceptual quality! Additionally, the first Text-To-Audio paper to supplement quality analysis with a fine-crafted LLM-as-judge evaluation scheme (prompt available in paper)!

Paper author Paper submitter

Hello everyone!! Please have a read, let me know if you wish to access code. We see amazing results with little overhead, very very easy to integrate into your workflow. Warm-starting really needs to be explored more within the diffusion audio space.

Furthermore we are the FIRST paper to use LLM-as-judge for text to audio, I highly recommend using this as a supplementary metric in addition to the usual CLAP, FD, KL etc. Feel free to use our prompt!

Sign up or log in to comment

Models citing this paper 0

No model linking this paper

Cite arxiv.org/abs/2603.07865 in a model README.md to link it from this page.

Datasets citing this paper 0

No dataset linking this paper

Cite arxiv.org/abs/2603.07865 in a dataset README.md to link it from this page.

Spaces citing this paper 0

No Space linking this paper

Cite arxiv.org/abs/2603.07865 in a Space README.md to link it from this page.

Collections including this paper 0

No Collection including this paper

Add this paper to a collection to link it from this page.