AudioX-Turbo GGUF
A GGUF conversion of AudioX-Turbo, a text-to-audio diffusion model that uses the MMDiT (Multi-Modal Diffusion Transformer) architecture with 4-step DMD-distilled sampling.
Models
| File | Quantization | Size |
|---|---|---|
| mmdit-f16.gguf | FP16 | 4.25 GB |
| mmdit-q8.gguf | Q8_0 | 2.2 GB |
| mmdit-q4.gguf | Q4_0 | 1.2 GB |
| vae-f16.gguf | FP16 | 360 MB |
| vae-q8.gguf | Q8_0 | ~200 MB |
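The quantized sizes listed above can be roughly cross-checked from the FP16 sizes. The sketch below uses the standard GGML block layouts (Q8_0 and Q4_0 each pack 32 weights with one 2-byte FP16 scale). It assumes every tensor is quantized and ignores metadata, so real files will differ slightly. The function name `estimate_size` is illustrative and not part of this repository.

```python
# Bytes per weight for each storage type. Q8_0 and Q4_0 pack 32 weights
# per block together with one 2-byte FP16 scale.
BYTES_PER_WEIGHT = {
    "F16": 2.0,
    "Q8_0": 34 / 32,  # 32 int8 values + 2-byte scale
    "Q4_0": 18 / 32,  # 32 4-bit values (16 bytes) + 2-byte scale
}


def estimate_size(f16_size, quant):
    """Estimate a quantized file size from the FP16 size (same unit in and out).

    Assumption: all tensors are quantized and metadata overhead is ignored.
    """
    n_weights = f16_size / BYTES_PER_WEIGHT["F16"]
    return n_weights * BYTES_PER_WEIGHT[quant]


if __name__ == "__main__":
    cases = [
        ("mmdit", 4.25, "GB", "Q8_0"),
        ("mmdit", 4.25, "GB", "Q4_0"),
        ("vae", 360, "MB", "Q8_0"),
    ]
    for name, f16_size, unit, quant in cases:
        print(f"{name} {quant}: ~{estimate_size(f16_size, quant):.2f} {unit}")
```

Under these assumptions the estimates come out near 2.26 GB, 1.20 GB, and 191 MB, which is in line with the listed sizes.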
Architecture
- MMDiT Backbone: 24-layer Multi-Modal Diffusion Transformer
- Embed dim: 1536
- Attention heads: 64 (head_dim = 24)
- Conditioning: T5-base text encoder (768-dim)
- VAE: Oobleck-style autoencoder with 2048x downsampling
- Diffusion: Rectified flow, 4-step DMD distillation
- Output: 44.1kHz stereo audio
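As a quick consistency check, the hyperparameters above can be collected in a small config object. This is only a sketch: the class name `AudioXTurboConfig` and its field names are hypothetical and do not come from the checkpoint or from audiox.cpp, and the floor division in `latent_frames` is an assumption about how partial frames are handled.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AudioXTurboConfig:
    """Hypothetical container for the values listed in the Architecture section."""

    n_layers: int = 24
    embed_dim: int = 1536
    n_heads: int = 64
    text_dim: int = 768          # T5-base hidden size
    vae_downsample: int = 2048   # audio samples per latent frame
    sample_rate: int = 44100
    channels: int = 2            # stereo
    sampling_steps: int = 4      # DMD-distilled sampler

    @property
    def head_dim(self) -> int:
        # Per-head width: embed_dim split evenly across heads.
        return self.embed_dim // self.n_heads

    def latent_frames(self, seconds: float) -> int:
        # Assumption: partial frames are dropped (floor division).
        return int(seconds * self.sample_rate) // self.vae_downsample


if __name__ == "__main__":
    cfg = AudioXTurboConfig()
    print("head_dim:", cfg.head_dim)
    print("latent frames for 10 s:", cfg.latent_frames(10))
```

With 2048x downsampling at 44.1 kHz, the latent sequence runs at roughly 21.5 frames per second of audio.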
Python Usage
Dependencies
pip install torch numpy scipy gguf
pip install transformers # for T5 text encoder
Full text-to-audio generation
import torch
import numpy as np
import scipy.io.wavfile as wavfile
import gguf
# Load T5 text encoder for conditioning
from transformers import T5EncoderModel, T5Tokenizer
device = "cpu"
model_name = "t5-base" # Must match training config
tokenizer = T5Tokenizer.from_pretrained(model_name)
t5 = T5EncoderModel.from_pretrained(model_name).eval()
# Load GGUF model (gguf is already imported above)
reader = gguf.GGUFReader("mmdit-f16.gguf")
# Note: Full C++ inference pipeline is in audiox.cpp
# For Python inference, use the original checkpoint directly:
# python3 scripts/reference_inference.py --ckpt audiox_turbo.ckpt ...
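Although full inference is not available in Python, the reader can still be used to inspect a converted file. The sketch below summarizes tensors by storage type using the `name`, `tensor_type`, and `n_bytes` fields that `gguf.GGUFReader` exposes on each entry of `reader.tensors`. The helper names `summarize_tensors` and `load_and_summarize` are illustrative and not part of this repository.

```python
from collections import Counter


def summarize_tensors(tensors):
    """Count tensors per storage type and sum their byte sizes.

    `tensors` is any iterable of objects with `tensor_type.name` and
    `n_bytes` attributes, such as `gguf.GGUFReader(path).tensors`.
    """
    counts = Counter()
    total_bytes = 0
    for tensor in tensors:
        counts[tensor.tensor_type.name] += 1
        total_bytes += int(tensor.n_bytes)
    return counts, total_bytes


def load_and_summarize(path):
    # Imported lazily so the summary helper can be used without gguf installed.
    import gguf

    reader = gguf.GGUFReader(path)
    return summarize_tensors(reader.tensors)


if __name__ == "__main__":
    import sys

    if len(sys.argv) > 1:
        counts, total_bytes = load_and_summarize(sys.argv[1])
        for type_name, count in sorted(counts.items()):
            print(f"{type_name}: {count} tensors")
        print(f"total tensor data: {total_bytes / 1e9:.2f} GB")
```

Running this against a quantized file should show whether any tensors were kept at higher precision than the rest.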
Text embedding export
import numpy as np
import torch
from transformers import T5EncoderModel, T5Tokenizer
tokenizer = T5Tokenizer.from_pretrained("t5-base")
encoder = T5EncoderModel.from_pretrained("t5-base").eval()
prompt = "A woman speaking clearly and slowly"
inputs = tokenizer(prompt, return_tensors="pt", padding=True, truncation=True, max_length=512)
# Run the encoder without tracking gradients
with torch.no_grad():
    embeddings = encoder(**inputs).last_hidden_state  # [1, seq_len, 768]
np.save("text_embedding.npy", embeddings.numpy())
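Before handing the exported file to another tool, it can help to confirm that it has the `[1, seq_len, 768]` layout noted above. The following is a minimal sketch; the helper name `check_text_embedding` is illustrative, and the finiteness check is an extra precaution rather than a documented requirement.

```python
import numpy as np


def check_text_embedding(path, expected_dim=768):
    """Load a saved embedding and verify it looks like [1, seq_len, expected_dim]."""
    emb = np.load(path)
    if emb.ndim != 3 or emb.shape[0] != 1 or emb.shape[2] != expected_dim:
        raise ValueError(
            f"expected shape [1, seq_len, {expected_dim}], got {list(emb.shape)}"
        )
    if not np.isfinite(emb).all():
        raise ValueError("embedding contains NaN or infinite values")
    return tuple(int(n) for n in emb.shape)


if __name__ == "__main__":
    import os

    if os.path.exists("text_embedding.npy"):
        print(check_text_embedding("text_embedding.npy"))
```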
GGUF Conversion Details
QKV Weight Reordering
The conversion script (convert/convert_mmdit.py) applies QKV weight reordering to match the C++ inference layout:
Self-attention QKV: The original PyTorch checkpoint stores Q, K, V interleaved per attention head as (Q0, K0, V0, Q1, K1, V1, ...) with stride 3 between consecutive per-head dims. The GGUF conversion reorders this to contiguous blocks: [Q(1536) | K(1536) | V(1536)].
Cross-attention KV: Similarly, the cross-attention to_kv weights are reordered from (K0, V0, K1, V1, ...) at stride 2 to contiguous [K(1536) | V(1536)].
This reordering is applied to all 24 layers and verified to match the original weights after transformation.
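The reordering described above can be sketched with NumPy. This is not the code from convert/convert_mmdit.py; the function names are illustrative. The exact interleaving granularity is an assumption, so the sketch exposes it as a `group` parameter: `group=1` treats consecutive rows as Q, K, V in turn (stride 3), while `group=head_dim` treats each head as a block of Q rows, then K rows, then V rows. The inverse function gives one way to check that a reordered tensor still matches the original.

```python
import numpy as np


def deinterleave(weight, n_parts, group=1):
    """Reorder rows from interleaved (P0, P1, ..., P0, P1, ...) to contiguous blocks.

    n_parts: 3 for fused QKV, 2 for fused KV.
    group:   number of consecutive rows belonging to the same projection
             (assumption: 1 for per-row interleaving, head_dim for per-head).
    """
    rows = weight.shape[0]
    if rows % (n_parts * group) != 0:
        raise ValueError("row count is not divisible by n_parts * group")
    tail = weight.shape[1:]
    w = weight.reshape(rows // (n_parts * group), n_parts, group, *tail)
    w = np.swapaxes(w, 0, 1)  # bring the projection axis to the front
    return w.reshape(rows, *tail)


def reinterleave(weight, n_parts, group=1):
    """Inverse of deinterleave, used to verify the transformation."""
    rows = weight.shape[0]
    tail = weight.shape[1:]
    w = weight.reshape(n_parts, rows // (n_parts * group), group, *tail)
    w = np.swapaxes(w, 0, 1)
    return w.reshape(rows, *tail)


if __name__ == "__main__":
    embed_dim = 1536
    rng = np.random.default_rng(0)
    qkv = rng.standard_normal((3 * embed_dim, 8)).astype(np.float32)
    reordered = deinterleave(qkv, n_parts=3)
    q, k, v = np.split(reordered, 3, axis=0)  # [Q(1536) | K(1536) | V(1536)]
    print(q.shape, k.shape, v.shape)
    print("round trip ok:", np.array_equal(reinterleave(reordered, 3), qkv))
```

For cross-attention, the same helper would be called with `n_parts=2` to produce the contiguous K and V blocks.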
Conversion command
python3 convert/convert_mmdit.py audiox_turbo.ckpt mmdit-f16.gguf --dtype f16
python3 convert/convert_vae.py vae.ckpt vae-f16.gguf --dtype f16
For quantized versions:
python3 convert/convert_mmdit.py audiox_turbo.ckpt mmdit-q8.gguf --dtype q8_0
python3 convert/convert_mmdit.py audiox_turbo.ckpt mmdit-q4.gguf --dtype q4_0
Notes
- The model was trained with t5-base as the text encoder, not google/flan-t5-base
- DMD distillation produces high-quality audio in just 4 sampling steps
- The C++ port is under active development at audiox.cpp