AudioX-Turbo GGUF

A GGUF-format conversion of AudioX-Turbo, a text-to-audio diffusion model built on the MMDiT (Multi-Modal Diffusion Transformer) architecture and distilled with DMD for 4-step sampling.

Models

File	Quantization	Size
`mmdit-f16.gguf`	FP16	4.25 GB
`mmdit-q8.gguf`	Q8_0	2.2 GB
`mmdit-q4.gguf`	Q4_0	1.2 GB
`vae-f16.gguf`	FP16	360 MB
`vae-q8.gguf`	Q8_0	~200 MB
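
The MMDiT sizes above can be sanity-checked against ggml's block formats, which store 32 weights per block plus one fp16 scale (34 bytes for Q8_0, 18 bytes for Q4_0). The sketch below is a rough estimate only: the parameter count is back-derived from the 4.25 GB FP16 file, and it assumes every tensor is stored in the same type, which may not hold for the actual files.

```python
# Bytes per block of 32 weights in ggml's block formats.
BLOCK_SIZE = 32
BYTES_PER_BLOCK = {
    "F16": 2 * BLOCK_SIZE,        # 64 bytes: 32 fp16 values, no scale
    "Q8_0": 2 + BLOCK_SIZE,       # 34 bytes: fp16 scale + 32 int8 values
    "Q4_0": 2 + BLOCK_SIZE // 2,  # 18 bytes: fp16 scale + 32 packed 4-bit values
}


def estimate_size_gb(n_params: float, quant: str) -> float:
    """Estimate size in GB (10^9 bytes), assuming all tensors use `quant`."""
    return n_params / BLOCK_SIZE * BYTES_PER_BLOCK[quant] / 1e9


# Assumption: parameter count inferred from the FP16 file size (2 bytes/weight).
n_params = 4.25e9 / 2

for quant in ("F16", "Q8_0", "Q4_0"):
    print(f"{quant}: {estimate_size_gb(n_params, quant):.2f} GB")
```

Under these assumptions the estimates come out near 2.26 GB for Q8_0 and 1.20 GB for Q4_0, in line with the table.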

Architecture

MMDiT Backbone: 24-layer Multi-Modal Diffusion Transformer
Embed dim: 1536
Attention heads: 64 (head_dim = 24)
Conditioning: T5-base text encoder (768-dim)
VAE: Oobleck-style autoencoder with 2048x downsampling
Diffusion: Rectified flow, 4-step DMD distillation
Output: 44.1 kHz stereo audio
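
A few derived quantities follow directly from the numbers above. The sketch below is illustrative: it assumes "2048x downsampling" means 2048 audio samples per latent frame, and the round-up in `latent_frames` (a hypothetical helper) is a guess at how partial frames would be handled.

```python
EMBED_DIM = 1536
NUM_HEADS = 64
SAMPLE_RATE = 44_100  # Hz
DOWNSAMPLE = 2048     # assumed: audio samples per latent frame

# Per-head dimension: the embedding is split evenly across heads.
head_dim = EMBED_DIM // NUM_HEADS
assert head_dim * NUM_HEADS == EMBED_DIM

# Latent frames per second of audio.
latent_rate_hz = SAMPLE_RATE / DOWNSAMPLE


def latent_frames(seconds: float) -> int:
    """Latent frames needed to cover `seconds` of audio, rounded up."""
    n_samples = int(round(seconds * SAMPLE_RATE))
    return -(-n_samples // DOWNSAMPLE)  # ceiling division


print(f"head_dim = {head_dim}")
print(f"latent rate = {latent_rate_hz:.2f} Hz")
print(f"10 s of audio = {latent_frames(10.0)} latent frames")
```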

Python Usage

Dependencies

pip install torch numpy scipy gguf
pip install transformers  # for T5 text encoder

Full text-to-audio generation

import gguf
import numpy as np
import scipy.io.wavfile as wavfile
import torch
from transformers import T5EncoderModel, T5Tokenizer

# Load the T5 text encoder used for conditioning
device = "cpu"
model_name = "t5-base"  # Must match the training config
tokenizer = T5Tokenizer.from_pretrained(model_name)
t5 = T5EncoderModel.from_pretrained(model_name).eval().to(device)

# Open the GGUF model file. This only gives access to its metadata and tensors;
# the full C++ inference pipeline is in audiox.cpp.
reader = gguf.GGUFReader("mmdit-f16.gguf")

# For Python inference, use the original checkpoint directly:
# python3 scripts/reference_inference.py --ckpt audiox_turbo.ckpt ...
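
Since the Python side only opens the GGUF file, a quick way to confirm that a download or conversion looks sane is to list its tensors. The sketch below uses the `tensors` list exposed by `gguf.GGUFReader`; `load_tensors`, `summarize`, and `print_summary` are hypothetical helper names, not part of this repository.

```python
from collections import Counter


def load_tensors(path: str):
    """Return the tensor list of a GGUF file (requires the `gguf` package)."""
    import gguf  # imported lazily so summarize() works without it

    return gguf.GGUFReader(path).tensors


def summarize(tensors) -> dict:
    """Map each quantization type name to (tensor count, total bytes)."""
    counts, sizes = Counter(), Counter()
    for t in tensors:
        qtype = t.tensor_type.name  # e.g. "F16", "Q8_0"
        counts[qtype] += 1
        sizes[qtype] += int(t.n_bytes)
    return {qtype: (counts[qtype], sizes[qtype]) for qtype in counts}


def print_summary(path: str) -> None:
    """Print one line per quantization type found in the file."""
    for qtype, (count, n_bytes) in sorted(summarize(load_tensors(path)).items()):
        print(f"{qtype}: {count} tensors, {n_bytes / 1e6:.1f} MB")


# Example call (needs the file on disk):
# print_summary("mmdit-f16.gguf")
```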

Text embedding export

import numpy as np
import torch
from transformers import T5EncoderModel, T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-base")
encoder = T5EncoderModel.from_pretrained("t5-base").eval()

prompt = "A woman speaking clearly and slowly"
inputs = tokenizer(prompt, return_tensors="pt", padding=True, truncation=True, max_length=512)

# Run the encoder without tracking gradients
with torch.no_grad():
    embeddings = encoder(**inputs).last_hidden_state  # [1, seq_len, 768]

# Save as a NumPy array for use outside Python
np.save("text_embedding.npy", embeddings.numpy())
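
Before handing the exported file to another tool, it can help to confirm that it has the `[1, seq_len, 768]` layout noted above. This is a minimal sketch; `validate_embedding` and `check_embedding_file` are hypothetical helpers, and the checks are assumptions about what a consumer would expect.

```python
import numpy as np

T5_BASE_DIM = 768  # hidden size of t5-base


def validate_embedding(emb: np.ndarray) -> tuple:
    """Check that `emb` looks like a single-prompt T5-base embedding."""
    if emb.ndim != 3 or emb.shape[0] != 1 or emb.shape[2] != T5_BASE_DIM:
        raise ValueError(f"expected shape [1, seq_len, {T5_BASE_DIM}], got {emb.shape}")
    if not np.isfinite(emb).all():
        raise ValueError("embedding contains NaN or Inf values")
    return emb.shape


def check_embedding_file(path: str) -> tuple:
    """Load a .npy file written by np.save and validate it."""
    return validate_embedding(np.load(path))


# Example call (needs the file on disk):
# print(check_embedding_file("text_embedding.npy"))
```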

GGUF Conversion Details

QKV Weight Reordering

The conversion script (`convert/convert_mmdit.py`) applies QKV weight reordering to match the C++ inference layout:

Self-attention QKV: The original PyTorch checkpoint stores Q, K, V interleaved per attention head as (Q0, K0, V0, Q1, K1, V1, ...) with stride 3 between consecutive per-head dims. The GGUF conversion reorders this to contiguous blocks: [Q(1536) | K(1536) | V(1536)].

Cross-attention KV: Similarly, the cross-attention `to_kv` weights are reordered from (K0, V0, K1, V1, ...) at stride 2 to contiguous [K(1536) | V(1536)].

This reordering is applied to all 24 layers and verified to match the original weights after transformation.
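
The reordering described above can be sketched with a reshape and a transpose. This is an illustration, not the code in `convert/convert_mmdit.py`: it assumes the interleaving runs along the output (row) dimension of the weight, and it leaves the interleaving granularity as a parameter (`group`), since the description admits either per-head blocks (`group = 24`) or element-wise striding (`group = 1`). `deinterleave` and `interleave` are hypothetical names.

```python
import numpy as np


def deinterleave(weight: np.ndarray, n_parts: int, group: int) -> np.ndarray:
    """Reorder rows from (P0, P1, ..., P0, P1, ...) to contiguous [P0 | P1 | ...].

    n_parts: 3 for fused QKV, 2 for fused KV.
    group:   number of consecutive rows belonging to one part (assumed).
    """
    rows, cols = weight.shape
    if rows % (n_parts * group) != 0:
        raise ValueError("row count is not a multiple of n_parts * group")
    blocks = weight.reshape(-1, n_parts, group, cols)  # (block, part, row, col)
    return blocks.transpose(1, 0, 2, 3).reshape(rows, cols)


def interleave(weight: np.ndarray, n_parts: int, group: int) -> np.ndarray:
    """Inverse of deinterleave, useful for a round-trip check."""
    rows, cols = weight.shape
    parts = weight.reshape(n_parts, -1, group, cols)  # (part, block, row, col)
    return parts.transpose(1, 0, 2, 3).reshape(rows, cols)


# Toy example: 2 heads, head_dim 2, labels encode part*100 + head*10 + row.
toy = np.array(
    [p * 100 + h * 10 + g for h in range(2) for p in range(3) for g in range(2)]
).reshape(-1, 1)
reordered = deinterleave(toy, n_parts=3, group=2)
print(reordered.ravel().tolist())

# Round trip recovers the original layout.
assert np.array_equal(interleave(reordered, n_parts=3, group=2), toy)
```

For a full-size fused QKV weight of shape (4608, 1536), the call under the per-head reading would be `deinterleave(w, n_parts=3, group=24)`, and `deinterleave(w, n_parts=2, group=24)` for a (3072, in_dim) `to_kv` weight.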

Conversion command

python3 convert/convert_mmdit.py audiox_turbo.ckpt mmdit-f16.gguf --dtype f16
python3 convert/convert_vae.py vae.ckpt vae-f16.gguf --dtype f16

For quantized versions:

python3 convert/convert_mmdit.py audiox_turbo.ckpt mmdit-q8.gguf --dtype q8_0
python3 convert/convert_mmdit.py audiox_turbo.ckpt mmdit-q4.gguf --dtype q4_0
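
For readers unfamiliar with the `q8_0` type, the sketch below illustrates the scheme as ggml defines it: weights are split into blocks of 32, each block stores one fp16 scale `d = max|x| / 127`, and each weight becomes a signed 8-bit integer. It is a NumPy illustration only and is not taken from the conversion scripts; in particular, `np.round` rounds halves to even, which can differ from the C reference on exact ties.

```python
import numpy as np

QK8_0 = 32  # weights per Q8_0 block


def quantize_q8_0(x: np.ndarray):
    """Quantize a flat float array into (int8 values, fp16 per-block scales)."""
    x = np.asarray(x, dtype=np.float32)
    if x.size % QK8_0 != 0:
        raise ValueError("array length must be a multiple of 32")
    blocks = x.reshape(-1, QK8_0)
    amax = np.abs(blocks).max(axis=1, keepdims=True)
    d = amax / 127.0
    # Inverse scale, with all-zero blocks mapped to 0 instead of dividing by zero.
    inv_d = np.divide(1.0, d, out=np.zeros_like(d), where=d > 0)
    q = np.round(blocks * inv_d).astype(np.int8)
    return q, d.astype(np.float16)


def dequantize_q8_0(q: np.ndarray, scales: np.ndarray) -> np.ndarray:
    """Reconstruct float32 weights from int8 values and fp16 scales."""
    return (q.astype(np.float32) * scales.astype(np.float32)).reshape(-1)


x = np.linspace(-1.0, 1.0, 64, dtype=np.float32)
q, scales = quantize_q8_0(x)
x_hat = dequantize_q8_0(q, scales)
print("max abs error:", float(np.abs(x - x_hat).max()))
```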

Notes

The model was trained with t5-base as the text encoder, not google/flan-t5-base.
DMD distillation lets the model generate audio in 4 sampling steps.
The C++ port is under active development at audiox.cpp.
