SwiftAudio

SwiftAudio: One-step Text-to-Audio Diffusion-Based Generation with an Audio-Free Distillation Technique

SwiftAudio is a one-step text-to-audio diffusion model. It distills a pretrained multi-step text-to-audio teacher into a one-step student using text captions only, without requiring paired audio data during distillation. The method adapts Variational Score Distillation (VSD) to the audio domain.
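
For orientation, the general VSD update from the text-to-3D literature can be written as below. This is the generic form, with the one-step student written as a generator g_θ that maps noise z and a caption c to a latent. The symbol names and the weighting w(t) are assumptions for illustration, and the exact objective SwiftAudio uses is not given in this card.

```latex
\nabla_\theta \mathcal{L}_{\mathrm{VSD}}
  = \mathbb{E}_{t,\,\epsilon,\,z}\!\left[
      w(t)\,\bigl(\epsilon_{\mathrm{teacher}}(x_t, t, c) - \epsilon_\phi(x_t, t, c)\bigr)\,
      \frac{\partial g_\theta(z, c)}{\partial \theta}
    \right],
\qquad
x_t = \sqrt{\bar\alpha_t}\, g_\theta(z, c) + \sqrt{1 - \bar\alpha_t}\,\epsilon
```

Here ε_teacher is the frozen multi-step teacher and ε_φ is an auxiliary network trained on the student's own outputs. Only captions and noise are sampled in the expectation, which is consistent with the audio-free setup described above.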

This repository contains the checkpoint used by the public SwiftAudio Gradio demo.

Highlights

  • Generates approximately 10 seconds of audio at 16 kHz.
  • Uses a single student UNet forward pass with no iterative denoising.
  • Uses caption-only, audio-free distillation.
  • Includes the text encoder, tokenizer, VAE, UNet, condition adapter, scheduler, and vocoder required by the demo.
  • Built on an Auffusion-compatible latent audio backbone.

Model Details

Property             Value
Task                 Text-to-audio generation
Output               Mono waveform, 16 kHz, approximately 10 seconds
Inference            One-step latent prediction
Text encoder         CLIP text encoder
Backbone             Conditional UNet
Audio decoder        Spectrogram VAE and neural vocoder
Distillation method  Variational Score Distillation

Although the checkpoint uses Diffusers' StableDiffusionPipeline container, SwiftAudio generates mel spectrograms rather than natural images. The generated spectrogram is converted to a waveform by the included vocoder.

Quick Start

Requirements

  • Python 3.10 or newer
  • CUDA-capable GPU recommended
  • Approximately 10 GB of GPU memory for FP16 inference
  • Approximately 5 GB of disk space for the checkpoint

Clone the model repository and install its dependencies:

git clone https://huggingface.co/dinhhung1508/SwiftAudio
cd SwiftAudio

python -m venv .venv
source .venv/bin/activate
pip install --upgrade pip
pip install -r requirements.txt

Generate Audio From The Command Line

python inference.py \
  --prompt "Ocean waves with seabirds" \
  --seed 42 \
  --output ocean.wav

inference.py accepts the following options:

Option    Description                                 Default
--prompt  Text description of the requested audio     Required
--output  Output WAV file path                        output.wav
--seed    Random seed for reproducible generation     42
--model   Hugging Face model ID or local model path   dinhhung1508/SwiftAudio
--device  Inference device: cuda or cpu               Automatically detected
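
The options above could be declared roughly as follows. This is a hypothetical reconstruction based only on the documented flags and defaults, not the actual contents of inference.py. The names build_parser and resolve_device are illustrative, and the automatic device detection is assumed to fall back to the CPU when CUDA is unavailable.

```python
import argparse

DEFAULT_MODEL = "dinhhung1508/SwiftAudio"


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the documented options of inference.py."""
    parser = argparse.ArgumentParser(description="One-step text-to-audio generation")
    parser.add_argument("--prompt", required=True, help="Text description of the requested audio")
    parser.add_argument("--output", default="output.wav", help="Output WAV file path")
    parser.add_argument("--seed", type=int, default=42, help="Random seed")
    parser.add_argument("--model", default=DEFAULT_MODEL, help="Model ID or local path")
    # None means "detect automatically" (see resolve_device below).
    parser.add_argument("--device", choices=["cuda", "cpu"], default=None)
    return parser


def resolve_device(requested):
    """Return the requested device, or pick one when none was given."""
    if requested is not None:
        return requested
    try:
        import torch  # optional here so the sketch also runs without PyTorch
    except ImportError:
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"


args = build_parser().parse_args(["--prompt", "Ocean waves with seabirds"])
print(args.output, args.seed, args.model, resolve_device(args.device))
```

Keeping the device default as None, rather than hard-coding "cuda", lets the same command run unchanged on machines with and without a GPU.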

Launch The Gradio Demo

python app.py

Open the local URL printed by Gradio, usually http://127.0.0.1:7860.

Use From Python

import soundfile as sf

from inference import SAMPLE_RATE, generate_audio, load_model


pipeline, vocoder, device, dtype = load_model("dinhhung1508/SwiftAudio")
audio = generate_audio(
    pipeline,
    vocoder,
    prompt="Rain and thunder",
    seed=42,
    device=device,
    dtype=dtype,
)
sf.write("rain.wav", audio, SAMPLE_RATE)
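
To render several prompts in one session, the model can be loaded once and reused. The sketch below is one possible way to organize that loop. The helpers prompt_to_filename and generate_batch are hypothetical and not part of inference.py, and the generation and file-writing steps are passed in as functions so the loop can be tried without loading the checkpoint.

```python
import re


def prompt_to_filename(prompt: str) -> str:
    """Turn a prompt into a safe WAV file name."""
    slug = re.sub(r"[^a-z0-9]+", "_", prompt.lower()).strip("_")
    return f"{slug or 'audio'}.wav"


def generate_batch(prompts, generate_fn, write_fn, seed=42):
    """Generate one clip per prompt and return the written file names.

    generate_fn(prompt, seed) -> audio and write_fn(path, audio) are injected
    so the loop can be exercised without loading the model.
    """
    written = []
    for prompt in prompts:
        path = prompt_to_filename(prompt)
        write_fn(path, generate_fn(prompt, seed))
        written.append(path)
    return written


# With the real model (names taken from the example above):
#   pipeline, vocoder, device, dtype = load_model("dinhhung1508/SwiftAudio")
#   generate_batch(
#       ["Rain and thunder", "A train whistles"],
#       lambda p, s: generate_audio(pipeline, vocoder, prompt=p, seed=s,
#                                   device=device, dtype=dtype),
#       lambda path, audio: sf.write(path, audio, SAMPLE_RATE),
#   )
```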

CPU Inference

CPU inference is supported but is significantly slower and requires more system memory:

python inference.py \
  --device cpu \
  --prompt "A train whistles" \
  --output train.wav

How Inference Works

  1. Encode the text prompt with the included CLIP tokenizer and text encoder.
  2. Sample a latent noise tensor with shape (1, 4, 32, 128).
  3. Run the student UNet once at the final diffusion timestep.
  4. Recover and decode the predicted clean latent into a mel spectrogram.
  5. Convert the spectrogram into a 16 kHz waveform with the included vocoder.
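
Step 4 can be illustrated with a small numeric sketch. It assumes the student UNet predicts noise (epsilon parameterization), so that the clean latent follows from inverting the forward-noising relation. That parameterization, the helper name recover_clean_latent, and the value of alpha_bar_T are assumptions for illustration. The real values come from the scheduler configuration, and a v-prediction or direct x0-prediction model would use a different formula.

```python
import math
import random


def recover_clean_latent(noisy, eps_pred, alpha_bar):
    """Invert x_t = sqrt(a) * x0 + sqrt(1 - a) * eps for x0, elementwise."""
    scale = math.sqrt(alpha_bar)
    sigma = math.sqrt(1.0 - alpha_bar)
    return [(x - sigma * e) / scale for x, e in zip(noisy, eps_pred)]


# Toy check on a flattened latent of shape (1, 4, 32, 128).
rng = random.Random(0)
size = 1 * 4 * 32 * 128
alpha_bar_T = 0.0047  # hypothetical final-timestep value, not read from the scheduler
x0 = [rng.gauss(0.0, 1.0) for _ in range(size)]
eps = [rng.gauss(0.0, 1.0) for _ in range(size)]
x_T = [
    math.sqrt(alpha_bar_T) * a + math.sqrt(1.0 - alpha_bar_T) * b
    for a, b in zip(x0, eps)
]

# A perfect noise prediction recovers x0 up to floating-point error.
x0_hat = recover_clean_latent(x_T, eps, alpha_bar_T)
max_err = max(abs(a - b) for a, b in zip(x0, x0_hat))
print(len(x0_hat), max_err < 1e-9)
```

In the actual pipeline the same arithmetic would run on tensors, with the UNet output in place of the known noise used in this toy check.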

The checkpoint uses Diffusers' StableDiffusionPipeline as a component container, but it generates spectrograms rather than natural images. Loading it as a standard image-generation pipeline without SwiftAudio's post-processing and vocoder will not produce audio.
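
The latent shape in step 2 is also what fixes the output length. The arithmetic below is a back-of-the-envelope check under two assumptions that this card does not state: an 8x spatial downsampling factor in the VAE (as in the Stable Diffusion VAE) and a hop length of 160 samples per spectrogram frame. It also assumes the latent height maps to mel bins and the width to time frames.

```python
LATENT_SHAPE = (1, 4, 32, 128)   # from step 2 above
SAMPLE_RATE = 16_000             # Hz, from the model card

# Assumptions, not stated in this card:
VAE_DOWNSAMPLE = 8               # spatial factor of the Stable Diffusion VAE
HOP_LENGTH = 160                 # waveform samples per spectrogram frame

_, _, latent_h, latent_w = LATENT_SHAPE
mel_bins = latent_h * VAE_DOWNSAMPLE      # spectrogram height
frames = latent_w * VAE_DOWNSAMPLE        # spectrogram width (time axis)
num_samples = frames * HOP_LENGTH         # waveform length in samples
duration_s = num_samples / SAMPLE_RATE    # waveform length in seconds

print(mel_bins, frames, num_samples, duration_s)
```

Under these assumptions the result is 10.24 seconds, which would account for the "approximately 10 seconds" figure quoted throughout.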

Repository Structure

SwiftAudio/
├── app.py                         # Local Gradio demo
├── inference.py                   # CLI and Python inference API
├── requirements.txt
├── auffusion/                     # Vocoder and spectrogram utilities
├── unet/                          # One-step student UNet
├── vae/                           # Spectrogram VAE
├── vocoder/                       # Spectrogram-to-waveform vocoder
├── text_encoder/ and tokenizer/   # CLIP text conditioning
└── scheduler/                     # Diffusion scheduler configuration

Troubleshooting

CUDA out of memory

Close other GPU workloads or use a GPU with at least 10 GB of free memory. To run on the CPU instead, pass --device cpu.

The first run takes longer

The first run downloads approximately 5 GB of model files and initializes all pipeline components. Later runs reuse the Hugging Face cache.

The output is an image instead of audio

Use the provided inference.py or app.py. A bare StableDiffusionPipeline(...) call does not run the included vocoder.

Reproducibility

To reproduce an output, use the same prompt, seed, device type, and dependency versions. Results can still vary slightly across hardware and library versions.
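
Because small numeric differences are expected, it can be more practical to compare two runs within a tolerance than to require bit-identical output. The sketch below shows one way to do that. The helper names and the tolerance of 1e-4 are arbitrary choices for illustration, not values taken from the project.

```python
def max_abs_diff(a, b):
    """Largest per-sample difference between two equal-length waveforms."""
    if len(a) != len(b):
        raise ValueError("waveforms must have the same number of samples")
    return max((abs(x - y) for x, y in zip(a, b)), default=0.0)


def matches(a, b, atol=1e-4):
    """True when two runs agree within a tolerance (atol is an arbitrary choice)."""
    return max_abs_diff(a, b) <= atol


# Tiny stand-ins for waveforms from three runs of the same prompt and seed.
reference = [0.0, 0.25, -0.5, 0.125]
rerun_same_machine = [0.0, 0.25, -0.5, 0.125]
rerun_other_gpu = [0.0, 0.25001, -0.50002, 0.125]

print(matches(reference, rerun_same_machine))
print(matches(reference, rerun_other_gpu))
```

The same functions accept the arrays returned by generate_audio, since they only rely on iteration and len.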

Example Prompts

  • Rain and thunder
  • Ocean waves with seabirds
  • A train whistles
  • Dishes clattering in a kitchen
  • Adult female is speaking and a young child is crying

Intended Use

SwiftAudio is intended for research and demonstration of fast text-conditioned sound generation, including sound effects, environmental audio, and acoustic scenes.

Limitations

  • The model generates fixed-length audio of approximately 10 seconds.
  • It is designed for acoustically plausible sound events, not intelligible or controllable speech synthesis.
  • Spoken content may not correspond to a specific language or transcript.
  • Complex temporal sequences and fine-grained event timing may not always follow the prompt.
  • Generated audio can inherit biases and limitations from the pretrained teacher and its training data.

Users are responsible for evaluating generated audio and ensuring that their use complies with applicable laws, policies, and rights.

Citation

The paper is currently under double-blind review:

@article{anonymous2026swiftaudio,
  title={SwiftAudio: One-step Text-to-Audio Diffusion-Based Generation
         with an Audio-Free Distillation Technique},
  author={Anonymous Authors},
  journal={Under Review},
  year={2026}
}

Acknowledgements

SwiftAudio uses an Auffusion-compatible latent audio architecture and builds on the Diffusers and Transformers ecosystems.
