SwiftAudio

SwiftAudio: One-step Text-to-Audio Diffusion-Based Generation with an Audio-Free Distillation Technique

SwiftAudio is a one-step text-to-audio diffusion model. It distills a pretrained multi-step text-to-audio teacher into a one-step student using text captions only, without requiring paired audio data during distillation. The method adapts Variational Score Distillation (VSD) to the audio domain.

This repository contains the checkpoint used by the public SwiftAudio Gradio demo.

Highlights

Generates approximately 10 seconds of audio at 16 kHz.
Uses a single student UNet forward pass with no iterative denoising.
Uses caption-only, audio-free distillation.
Includes the text encoder, tokenizer, VAE, UNet, condition adapter, scheduler, and vocoder required by the demo.
Built on an Auffusion-compatible latent audio backbone.

Model Details

Property	Value
Task	Text-to-audio generation
Output	Mono waveform, 16 kHz, approximately 10 seconds
Inference	One-step latent prediction
Text encoder	CLIP text encoder
Backbone	Conditional UNet
Audio decoder	Spectrogram VAE and neural vocoder
Distillation method	Variational Score Distillation

Although the checkpoint uses Diffusers' StableDiffusionPipeline container, SwiftAudio generates mel spectrograms rather than natural images. The generated spectrogram is converted to a waveform by the included vocoder.
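
The "approximately 10 seconds" figure can be sanity-checked from the latent shape (1, 4, 32, 128) given under How Inference Works. The sketch below assumes an 8x spatial downsampling factor in the spectrogram VAE (the usual Stable Diffusion VAE setting) and a vocoder hop size of 160 samples per mel frame. Neither value is stated in this card, so treat both as assumptions to verify against the configs in vae/ and vocoder/.

```python
# Back-of-the-envelope output duration from the latent shape.
# VAE_DOWNSAMPLE and HOP_SIZE are assumptions, not values stated in this card.
SAMPLE_RATE = 16_000            # Hz, from the Model Details table
LATENT_SHAPE = (1, 4, 32, 128)  # (batch, channels, height, width)
VAE_DOWNSAMPLE = 8              # assumed spatial factor of the spectrogram VAE
HOP_SIZE = 160                  # assumed vocoder hop, in samples per mel frame


def estimate_duration(latent_shape, downsample, hop_size, sample_rate):
    """Return (mel_bins, mel_frames, seconds) implied by the latent shape."""
    _, _, height, width = latent_shape
    mel_bins = height * downsample    # frequency axis of the spectrogram
    mel_frames = width * downsample   # time axis of the spectrogram
    seconds = mel_frames * hop_size / sample_rate
    return mel_bins, mel_frames, seconds


mel_bins, mel_frames, seconds = estimate_duration(
    LATENT_SHAPE, VAE_DOWNSAMPLE, HOP_SIZE, SAMPLE_RATE
)
print(f"{mel_bins} mel bins x {mel_frames} frames -> {seconds:.2f} s")
```

Under these assumptions the clip length comes out slightly above 10 seconds, which is consistent with the "approximately" in the table.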

Quick Start

Requirements

Python 3.10 or newer
CUDA-capable GPU recommended
Approximately 10 GB of GPU memory for FP16 inference
Approximately 5 GB of disk space for the checkpoint
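
A small preflight check can catch two of these requirements before anything is downloaded. This is a minimal sketch using only the standard library; the helper names python_is_supported and has_free_disk are hypothetical and not part of this repository. It does not check GPU memory, which would need a CUDA-aware library.

```python
import shutil
import sys

MIN_PYTHON = (3, 10)  # from the requirements list above
CHECKPOINT_GB = 5     # approximate checkpoint size from the requirements list


def python_is_supported(version_info=sys.version_info):
    """True if the interpreter is at least MIN_PYTHON."""
    return tuple(version_info[:2]) >= MIN_PYTHON


def has_free_disk(path=".", required_gb=CHECKPOINT_GB):
    """True if the filesystem holding `path` has at least `required_gb` free."""
    free_bytes = shutil.disk_usage(path).free
    return free_bytes >= required_gb * 1024**3


if __name__ == "__main__":
    print("Python OK:", python_is_supported())
    print("Disk OK:  ", has_free_disk())
```

If the Hugging Face cache lives on a different volume than the working directory, pass that location as `path` instead.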

Clone the model repository and install its dependencies:

git clone https://huggingface.co/dinhhung1508/SwiftAudio
cd SwiftAudio

python -m venv .venv
source .venv/bin/activate
pip install --upgrade pip
pip install -r requirements.txt

Generate Audio From The Command Line

python inference.py \
  --prompt "Ocean waves with seabirds" \
  --seed 42 \
  --output ocean.wav

inference.py accepts the following options:

Option	Description	Default
`--prompt`	Text description of the requested audio	Required
`--output`	Output WAV file path	`output.wav`
`--seed`	Random seed for reproducible generation	`42`
`--model`	Hugging Face model ID or local model path	`dinhhung1508/SwiftAudio`
`--device`	Inference device: `cuda` or `cpu`	Automatically detected
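
For readers who want to wrap the CLI in their own script, the option table can be mirrored with argparse. The sketch below is an illustration of the documented interface, not the actual contents of inference.py. The names build_parser and resolve_device are hypothetical, and the automatic device detection is assumed to be a simple CUDA availability check.

```python
import argparse


def build_parser():
    """Argument parser mirroring the documented options (illustrative only)."""
    parser = argparse.ArgumentParser(description="One-step text-to-audio generation")
    parser.add_argument("--prompt", required=True,
                        help="Text description of the requested audio")
    parser.add_argument("--output", default="output.wav",
                        help="Output WAV file path")
    parser.add_argument("--seed", type=int, default=42,
                        help="Random seed for reproducible generation")
    parser.add_argument("--model", default="dinhhung1508/SwiftAudio",
                        help="Hugging Face model ID or local model path")
    parser.add_argument("--device", choices=["cuda", "cpu"], default=None,
                        help="Inference device; detected automatically if omitted")
    return parser


def resolve_device(requested):
    """Return the requested device, or pick one (assumed detection logic)."""
    if requested is not None:
        return requested
    try:
        import torch  # imported lazily so parsing works without torch installed
    except ImportError:
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"
```

Leaving the `--device` default as `None` keeps "not specified" distinguishable from an explicit choice, so detection only runs when the user did not pick a device.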

Launch The Gradio Demo

python app.py

Open the local URL printed by Gradio, usually http://127.0.0.1:7860.

Use From Python

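Run this from the repository root so that inference.py is importable:

import soundfile as sf

from inference import SAMPLE_RATE, generate_audio, load_model

# Load every pipeline component once and reuse it across prompts.
pipeline, vocoder, device, dtype = load_model("dinhhung1508/SwiftAudio")

# Generate the waveform for a single prompt.
audio = generate_audio(
    pipeline,
    vocoder,
    prompt="Rain and thunder",
    seed=42,
    device=device,
    dtype=dtype,
)

# SAMPLE_RATE is exported by inference.py (16 kHz for this checkpoint).
sf.write("rain.wav", audio, SAMPLE_RATE)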
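
To render several prompts in one session, the model only needs to be loaded once. The sketch below shows one way to structure that loop. The helpers slugify and render_prompts are hypothetical, and the generation and writing functions are passed in as arguments so the loop can be exercised without loading the checkpoint.

```python
import re


def slugify(prompt, max_len=40):
    """Turn a prompt into a filesystem-friendly stem (hypothetical helper)."""
    stem = re.sub(r"[^a-z0-9]+", "_", prompt.lower()).strip("_")
    return stem[:max_len].strip("_") or "audio"


def render_prompts(prompts, generate, write, base_seed=42):
    """Generate one clip per prompt and return the list of written file names.

    `generate(prompt, seed)` must return the waveform and
    `write(path, audio)` must save it; both are injected by the caller.
    """
    written = []
    for offset, prompt in enumerate(prompts):
        path = f"{slugify(prompt)}.wav"
        # Each prompt gets its own seed so clips stay individually reproducible.
        write(path, generate(prompt, base_seed + offset))
        written.append(path)
    return written
```

With the objects from the snippet above, `generate` could be `lambda p, s: generate_audio(pipeline, vocoder, prompt=p, seed=s, device=device, dtype=dtype)` and `write` could be `lambda path, audio: sf.write(path, audio, SAMPLE_RATE)`.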

CPU Inference

CPU inference is supported but is significantly slower and requires more system memory:

python inference.py \
  --device cpu \
  --prompt "A train whistles" \
  --output train.wav

How Inference Works

Encode the text prompt with the included CLIP tokenizer and text encoder.
Sample a latent noise tensor with shape (1, 4, 32, 128).
Run the student UNet once at the final diffusion timestep.
Recover the predicted clean latent from the UNet output and decode it with the VAE into a mel spectrogram.
Convert the spectrogram into a 16 kHz waveform with the included vocoder.
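
The "run once, then recover the clean latent" steps above can be illustrated with the standard diffusion forward relation x_t = sqrt(a) * x0 + sqrt(1 - a) * eps, where a is the cumulative alpha product at the chosen timestep. The sketch below assumes the student UNet predicts the noise eps (epsilon parameterization). This card does not state the prediction type, so check the scheduler/ config before relying on it. The function name recover_clean_latent and the alpha_bar value are illustrative only.

```python
import math
import random


def recover_clean_latent(noisy, predicted_noise, alpha_bar):
    """Invert x_t = sqrt(a)*x0 + sqrt(1-a)*eps for x0, element by element."""
    scale = math.sqrt(alpha_bar)
    sigma = math.sqrt(1.0 - alpha_bar)
    return [(x_t - sigma * eps) / scale for x_t, eps in zip(noisy, predicted_noise)]


# Toy check on a flattened 4-element "latent": build x_t from a known x0,
# then confirm that a perfect noise prediction recovers x0.
rng = random.Random(42)
alpha_bar = 0.0047  # illustrative late-timestep value, not the checkpoint's
x0 = [0.5, -1.0, 0.25, 2.0]
eps = [rng.gauss(0.0, 1.0) for _ in x0]
x_t = [
    math.sqrt(alpha_bar) * a + math.sqrt(1.0 - alpha_bar) * e
    for a, e in zip(x0, eps)
]

recovered = recover_clean_latent(x_t, eps, alpha_bar)
assert all(math.isclose(a, b, abs_tol=1e-9) for a, b in zip(x0, recovered))
```

In actual inference the input is pure sampled noise rather than a noised version of a known latent; the toy check only verifies the algebra of the recovery step.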

As noted under Model Details, StableDiffusionPipeline serves only as a component container here. Loading the checkpoint as a standard image-generation pipeline, without SwiftAudio's post-processing and vocoder, will not produce audio.

Repository Structure

SwiftAudio/
├── app.py                         # Local Gradio demo
├── inference.py                   # CLI and Python inference API
├── requirements.txt
├── auffusion/                     # Vocoder and spectrogram utilities
├── unet/                          # One-step student UNet
├── vae/                           # Spectrogram VAE
├── vocoder/                       # Spectrogram-to-waveform vocoder
├── text_encoder/ and tokenizer/   # CLIP text conditioning
└── scheduler/                     # Diffusion scheduler configuration

Troubleshooting

CUDA out of memory

Close other GPU workloads, or use a GPU with about 10 GB or more of free memory. Alternatively, select CPU inference with --device cpu.

The first run takes longer

The first run downloads approximately 5 GB of model files and initializes all pipeline components. Later runs reuse the Hugging Face cache.

The output is an image instead of audio

Use the provided inference.py or app.py. A bare StableDiffusionPipeline(...) call returns the decoded spectrogram as an image and does not run the included vocoder.

Reproducibility

To reproduce a result, use the same prompt, seed, device type, and dependency versions. Outputs can still vary slightly across hardware and library versions.
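
The role of the seed can be seen in isolation by sampling the initial latent noise twice. This is a minimal sketch in which the standard library's random module stands in for the framework's random generator; the repository's own sampling code is assumed to follow the same seed-in, noise-out pattern but is not reproduced here. The helper name sample_latent_noise is hypothetical.

```python
import math
import random

LATENT_SHAPE = (1, 4, 32, 128)  # from "How Inference Works"


def sample_latent_noise(seed, shape=LATENT_SHAPE):
    """Return a flat list of standard-normal samples for the given seed."""
    count = math.prod(shape)
    rng = random.Random(seed)  # private generator; global state is untouched
    return [rng.gauss(0.0, 1.0) for _ in range(count)]


same_a = sample_latent_noise(42)
same_b = sample_latent_noise(42)
other = sample_latent_noise(43)
print(len(same_a), same_a == same_b, same_a == other)
```

The same seed yields identical noise within one library and version. A different random number implementation, or different floating-point kernels on another device, can produce different values from the same seed, which is why the device type and dependency versions matter.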

Example Prompts

Rain and thunder
Ocean waves with seabirds
A train whistles
Dishes clattering in a kitchen
Adult female is speaking and a young child is crying

Intended Use

SwiftAudio is intended for research and demonstration of fast text-conditioned sound generation, including sound effects, environmental audio, and acoustic scenes.

Limitations

The model generates fixed-length audio of approximately 10 seconds.
It is designed for acoustically plausible sound events, not intelligible or controllable speech synthesis.
Spoken content may not correspond to a specific language or transcript.
Complex temporal sequences and fine-grained event timing may not always follow the prompt.
Generated audio can inherit biases and limitations from the pretrained teacher and its training data.

Users are responsible for evaluating generated audio and ensuring that their use complies with applicable laws, policies, and rights.

Citation

The paper is currently under double-blind review:

@article{anonymous2026swiftaudio,
  title={SwiftAudio: One-step Text-to-Audio Diffusion-Based Generation
         with an Audio-Free Distillation Technique},
  author={Anonymous Authors},
  journal={Under Review},
  year={2026}
}

Acknowledgements

SwiftAudio uses an Auffusion-compatible latent audio architecture and builds on the Diffusers and Transformers ecosystems.

Base model: auffusion/auffusion-full-no-adapter (SwiftAudio is a fine-tune of this checkpoint).