SwiftAudio
SwiftAudio: One-step Text-to-Audio Diffusion-Based Generation with an Audio-Free Distillation Technique
SwiftAudio is a one-step text-to-audio diffusion model. It distills a pretrained multi-step text-to-audio teacher into a one-step student using text captions only, without requiring paired audio data during distillation. The method adapts Variational Score Distillation (VSD) to the audio domain.
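For readers unfamiliar with VSD, its gradient can be sketched in the general form introduced by the ProlificDreamer work, written here for a one-step generator. The notation ($G_\theta$, $\epsilon_\psi$, $\epsilon_\phi$, $w(t)$) is chosen for illustration only, and the exact objective used by SwiftAudio may differ.

```latex
\nabla_\theta \mathcal{L}_{\mathrm{VSD}}
  = \mathbb{E}_{t,\,\epsilon,\,z,\,y}\!\left[
      w(t)\,\bigl(\epsilon_\psi(x_t;\, t,\, y) - \epsilon_\phi(x_t;\, t,\, y)\bigr)\,
      \frac{\partial G_\theta(z, y)}{\partial \theta}
    \right],
\qquad
x_t = \alpha_t\, G_\theta(z, y) + \sigma_t\, \epsilon
```

Here $y$ is a text caption, $z$ and $\epsilon$ are Gaussian noise, $G_\theta$ is the one-step student, $\epsilon_\psi$ is the frozen teacher, and $\epsilon_\phi$ is an auxiliary network trained to track the student's output distribution. The expectation samples only captions and noise, which is why this kind of distillation needs no audio.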
This repository contains the checkpoint used by the public SwiftAudio Gradio demo.
Highlights
- Generates approximately 10 seconds of audio at 16 kHz.
- Uses a single student UNet forward pass with no iterative denoising.
- Uses caption-only, audio-free distillation.
- Includes the text encoder, tokenizer, VAE, UNet, condition adapter, scheduler, and vocoder required by the demo.
- Built on an Auffusion-compatible latent audio backbone.
Model Details
| Property | Value |
|---|---|
| Task | Text-to-audio generation |
| Output | Mono waveform, 16 kHz, approximately 10 seconds |
| Inference | One-step latent prediction |
| Text encoder | CLIP text encoder |
| Backbone | Conditional UNet |
| Audio decoder | Spectrogram VAE and neural vocoder |
| Distillation method | Variational Score Distillation |
Although the checkpoint uses Diffusers' StableDiffusionPipeline container,
SwiftAudio generates mel spectrograms rather than natural images. The generated
spectrogram is converted to a waveform by the included vocoder.
Quick Start
Requirements
- Python 3.10 or newer
- CUDA-capable GPU recommended
- Approximately 10 GB of GPU memory for FP16 inference
- Approximately 5 GB of disk space for the checkpoint
Clone the model repository and install its dependencies:
git clone https://huggingface.co/dinhhung1508/SwiftAudio
cd SwiftAudio
python -m venv .venv
source .venv/bin/activate
pip install --upgrade pip
pip install -r requirements.txt
Generate Audio From The Command Line
python inference.py \
  --prompt "Ocean waves with seabirds" \
  --seed 42 \
  --output ocean.wav
inference.py accepts the following options:
| Option | Description | Default |
|---|---|---|
| --prompt | Text description of the requested audio | Required |
| --output | Output WAV file path | output.wav |
| --seed | Random seed for reproducible generation | 42 |
| --model | Hugging Face model ID or local model path | dinhhung1508/SwiftAudio |
| --device | Inference device: cuda or cpu | Automatically detected |
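The documented options map naturally onto an argparse definition. The sketch below is a hypothetical reconstruction from the table above, not the code in inference.py. The function name build_parser is invented for illustration, and None is used as a stand-in for automatic device detection because the actual detection logic is not described here.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the documented options (not the repository's code)."""
    parser = argparse.ArgumentParser(description="One-step text-to-audio generation (sketch)")
    parser.add_argument("--prompt", required=True,
                        help="Text description of the requested audio")
    parser.add_argument("--output", default="output.wav",
                        help="Output WAV file path")
    parser.add_argument("--seed", type=int, default=42,
                        help="Random seed for reproducible generation")
    parser.add_argument("--model", default="dinhhung1508/SwiftAudio",
                        help="Hugging Face model ID or local model path")
    # None stands in for "detect automatically"; the real detection logic is an unknown.
    parser.add_argument("--device", choices=["cuda", "cpu"], default=None,
                        help="Inference device")
    return parser


# Parse an explicit argument list so the sketch runs without a real command line.
args = build_parser().parse_args(["--prompt", "Ocean waves with seabirds", "--output", "ocean.wav"])
print(args.prompt, args.output, args.seed, args.model, args.device)
```

Only --prompt is required. Every other option falls back to the default listed in the table.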
Launch The Gradio Demo
python app.py
Open the local URL printed by Gradio, usually
http://127.0.0.1:7860.
Use From Python
import soundfile as sf

from inference import SAMPLE_RATE, generate_audio, load_model

pipeline, vocoder, device, dtype = load_model("dinhhung1508/SwiftAudio")

audio = generate_audio(
    pipeline,
    vocoder,
    prompt="Rain and thunder",
    seed=42,
    device=device,
    dtype=dtype,
)

sf.write("rain.wav", audio, SAMPLE_RATE)
CPU Inference
CPU inference is supported but is significantly slower and requires more system memory:
python inference.py \
  --device cpu \
  --prompt "A train whistles" \
  --output train.wav
How Inference Works
- Encode the text prompt with the included CLIP tokenizer and text encoder.
- Sample a latent noise tensor with shape (1, 4, 32, 128).
- Run the student UNet once at the final diffusion timestep.
- Recover and decode the predicted clean latent into a mel spectrogram.
- Convert the spectrogram into a 16 kHz waveform with the included vocoder.
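The "recover the predicted clean latent" step in the list above can be illustrated with the standard diffusion identity. This is a minimal sketch that assumes an epsilon-prediction parameterization, which this model card does not confirm. The helper name recover_clean_latent and the value of alpha_bar_final are illustrative and are not taken from the checkpoint's scheduler.

```python
import numpy as np


def recover_clean_latent(noisy_latent, predicted_noise, alpha_bar_t):
    """Solve z_t = sqrt(a) * x0 + sqrt(1 - a) * eps for x0 (epsilon-prediction assumed)."""
    return (noisy_latent - np.sqrt(1.0 - alpha_bar_t) * predicted_noise) / np.sqrt(alpha_bar_t)


rng = np.random.default_rng(42)
latent_shape = (1, 4, 32, 128)   # shape quoted in the steps above
alpha_bar_final = 0.005          # illustrative value only, not read from the scheduler

# Synthetic check: noise a known "clean" latent, then invert using the true noise.
clean = rng.standard_normal(latent_shape)
noise = rng.standard_normal(latent_shape)
noisy = np.sqrt(alpha_bar_final) * clean + np.sqrt(1.0 - alpha_bar_final) * noise

recovered = recover_clean_latent(noisy, noise, alpha_bar_final)
print(np.allclose(recovered, clean))
```

In the actual pipeline the noisy latent would be the sampled noise tensor and the predicted noise would come from the single student UNet forward pass. The synthetic check here only confirms that the algebra inverts correctly.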
The checkpoint uses Diffusers' StableDiffusionPipeline as a component
container, but it generates spectrograms rather than natural images. Loading it
as a standard image-generation pipeline without SwiftAudio's post-processing
and vocoder will not produce audio.
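The latent shape (1, 4, 32, 128) listed in the steps above can be related to the roughly 10-second output with some quick arithmetic. Two values in this sketch are assumptions that the model card does not state: a VAE downsampling factor of 8 (the usual Stable Diffusion value) and a hop length of 160 samples per spectrogram frame. It also assumes that the longer latent axis is time.

```python
LATENT_HEIGHT, LATENT_WIDTH = 32, 128  # from the latent shape (1, 4, 32, 128)
VAE_SCALE = 8          # assumption: Stable Diffusion-style VAE downsampling factor
HOP_LENGTH = 160       # assumption: waveform samples per spectrogram frame
SAMPLE_RATE = 16_000   # stated in the model card

mel_bins = LATENT_HEIGHT * VAE_SCALE          # frequency axis of the spectrogram
frames = LATENT_WIDTH * VAE_SCALE             # time axis of the spectrogram
num_samples = frames * HOP_LENGTH             # waveform length in samples
duration_seconds = num_samples / SAMPLE_RATE  # waveform length in seconds

print(mel_bins, frames, num_samples, duration_seconds)
```

Under these assumptions the spectrogram has 256 mel bins and 1024 frames, giving 163,840 samples, or about 10.24 seconds. This is consistent with the stated "approximately 10 seconds".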
Repository Structure
SwiftAudio/
├── app.py                        # Local Gradio demo
├── inference.py                  # CLI and Python inference API
├── requirements.txt
├── auffusion/                    # Vocoder and spectrogram utilities
├── unet/                         # One-step student UNet
├── vae/                          # Spectrogram VAE
├── vocoder/                      # Spectrogram-to-waveform vocoder
├── text_encoder/ and tokenizer/  # CLIP text conditioning
└── scheduler/                    # Diffusion scheduler configuration
Troubleshooting
CUDA out of memory
Close other GPU workloads or use a GPU with roughly 10 GB or more of
available memory. CPU inference can be selected with --device cpu.
The first run takes longer
The first run downloads approximately 5 GB of model files and initializes all pipeline components. Later runs reuse the Hugging Face cache.
The output is an image instead of audio
Use the provided inference.py or app.py. A bare
StableDiffusionPipeline(...) call does not run the included vocoder.
Reproducibility
Use the same prompt, seed, device type, and dependency versions. Results can vary slightly across hardware and library versions.
Example Prompts
- Rain and thunder
- Ocean waves with seabirds
- A train whistles
- Dishes clattering in a kitchen
- Adult female is speaking and a young child is crying
Intended Use
SwiftAudio is intended for research and demonstration of fast text-conditioned sound generation, including sound effects, environmental audio, and acoustic scenes.
Limitations
- The model generates fixed-length audio of approximately 10 seconds.
- It is designed for acoustically plausible sound events, not intelligible or controllable speech synthesis.
- Spoken content may not correspond to a specific language or transcript.
- Complex temporal sequences and fine-grained event timing may not always follow the prompt.
- Generated audio can inherit biases and limitations from the pretrained teacher and its training data.
Users are responsible for evaluating generated audio and ensuring that their use complies with applicable laws, policies, and rights.
Citation
The paper is currently under double-blind review:
@article{anonymous2026swiftaudio,
  title={SwiftAudio: One-step Text-to-Audio Diffusion-Based Generation
         with an Audio-Free Distillation Technique},
  author={Anonymous Authors},
  journal={Under Review},
  year={2026}
}
Acknowledgements
SwiftAudio uses an Auffusion-compatible latent audio architecture and builds on the Diffusers and Transformers ecosystems.
Base model: auffusion/auffusion-full-no-adapter