arxiv:2603.07865

SoundWeaver: Semantic Warm-Starting for Text-to-Audio Diffusion Serving

Published on Mar 9

Submitted by Ayush Barik on Mar 13

University of Illinois at Urbana-Champaign

Authors:

Ayush Barik

AI-generated summary

SoundWeaver accelerates text-to-audio diffusion generation by caching semantically similar audio and dynamically skipping function evaluations, achieving significant latency reduction with minimal quality loss.

Abstract

Text-to-audio diffusion models produce high-fidelity audio but require tens of function evaluations (NFEs), incurring multi-second latency and limited throughput. We present SoundWeaver, the first training-free, model-agnostic serving system that accelerates text-to-audio diffusion by warm-starting from semantically similar cached audio. SoundWeaver introduces three components: a Reference Selector that retrieves and temporally aligns cached candidates via semantic and duration-aware gating; a Skip Gater that dynamically determines the percentage of NFEs to skip; and a lightweight Cache Manager that maintains cache utility through quality-aware eviction and refinement. On real-world audio traces, SoundWeaver achieves 1.8–3.0× latency reduction with a cache of only ~1K entries while preserving or improving perceptual quality.
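To make the Reference Selector and Skip Gater ideas concrete, here is a minimal sketch of how such gating could look. It is not the paper's implementation: the `CacheEntry` type, the function names, the cosine-similarity measure, the thresholds (`min_sim`, `max_duration_ratio`, `max_skip`), and the linear skip schedule are all assumptions chosen for illustration. Temporal alignment and the Cache Manager are not modelled.

```python
import math
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class CacheEntry:
    """Hypothetical cache record: a prompt, its text embedding, and clip length."""
    prompt: str
    embedding: List[float]
    duration_s: float  # assumed to be positive


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def select_reference(
    query_emb: List[float],
    query_duration_s: float,
    cache: List[CacheEntry],
    min_sim: float = 0.7,             # assumed semantic gate
    max_duration_ratio: float = 1.5,  # assumed duration gate
) -> Tuple[Optional[CacheEntry], float]:
    """Return the most similar entry that passes both gates, or (None, 0.0)."""
    best, best_sim = None, min_sim
    for entry in cache:
        longer = max(entry.duration_s, query_duration_s)
        shorter = min(entry.duration_s, query_duration_s)
        if longer / shorter > max_duration_ratio:
            continue  # duration gate: clip lengths differ too much
        sim = cosine(query_emb, entry.embedding)
        if sim >= best_sim:
            best, best_sim = entry, sim
    return best, (best_sim if best is not None else 0.0)


def skip_fraction(sim: float, min_sim: float = 0.7, max_skip: float = 0.6) -> float:
    """Assumed linear schedule: skip nothing at min_sim, max_skip at sim == 1."""
    sim = min(sim, 1.0)
    if sim < min_sim:
        return 0.0
    return max_skip * (sim - min_sim) / (1.0 - min_sim)


def steps_to_run(total_nfe: int, sim: float) -> int:
    """Number of function evaluations left after skipping the gated fraction."""
    return total_nfe - int(total_nfe * skip_fraction(sim))
```

In this sketch a cache miss (no entry passes both gates) yields a similarity of 0.0, so `steps_to_run` falls back to the full NFE budget, which corresponds to ordinary generation from noise.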


Community

jormungandr2017

Paper author Paper submitter about 10 hours ago

Tired of multi-second waits for stunning AI audio? We introduce SoundWeaver, the first training-free, model-agnostic serving system that revolutionizes text-to-audio diffusion by semantically warm-starting from a tiny cache of similar audio clips! With just ~1K cached entries, it delivers a massive 1.8–3.0× latency reduction while preserving or even improving perceptual quality! It is also the first text-to-audio paper to supplement quality analysis with a carefully crafted LLM-as-judge evaluation scheme (prompt available in the paper)!

jormungandr2017

Paper author Paper submitter about 9 hours ago

Hello everyone!! Please have a read, and let me know if you would like access to the code. We see amazing results with little overhead, and it is very easy to integrate into your workflow. Warm-starting really needs to be explored more within the diffusion audio space.

Furthermore, we are the FIRST paper to use LLM-as-judge for text-to-audio. I highly recommend using it as a supplementary metric alongside the usual CLAP, FD, KL, etc. Feel free to use our prompt!
