AudioX-Turbo GGUF

A GGUF conversion of AudioX-Turbo, a text-to-audio diffusion model built on the MMDiT (Multi-Modal Diffusion Transformer) architecture and distilled with DMD for 4-step sampling.

Models

File             Quantization   Size
mmdit-f16.gguf   FP16           4.25 GB
mmdit-q8.gguf    Q8_0           2.2 GB
mmdit-q4.gguf    Q4_0           1.2 GB
vae-f16.gguf     FP16           360 MB
vae-q8.gguf      Q8_0           ~200 MB
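
The MMDiT sizes above are roughly what ggml's block formats predict. The sketch below is a back-of-the-envelope check, not output from the conversion script. It assumes every tensor is stored in the named format, ignores metadata, and derives the parameter count from the FP16 file size. The helper names are illustrative.

```python
# Bytes used by one block of 32 weights in ggml's block formats.
# Assumption: the conversion script emits standard Q8_0 / Q4_0 blocks.
BLOCK_SIZE = 32
BLOCK_BYTES = {
    "f16": 2 * BLOCK_SIZE,        # 32 fp16 values
    "q8_0": 2 + BLOCK_SIZE,       # one fp16 scale + 32 int8 values
    "q4_0": 2 + BLOCK_SIZE // 2,  # one fp16 scale + 32 packed 4-bit values
}


def bits_per_weight(fmt: str) -> float:
    """Average storage cost of one weight, in bits."""
    return BLOCK_BYTES[fmt] * 8 / BLOCK_SIZE


def estimate_size_gb(n_params: float, fmt: str) -> float:
    """Estimated tensor data size in GB (10^9 bytes), ignoring metadata."""
    return n_params * bits_per_weight(fmt) / 8 / 1e9


# Parameter count inferred from the 4.25 GB FP16 file (2 bytes per weight).
n_params = 4.25e9 / 2

for fmt in ("f16", "q8_0", "q4_0"):
    print(f"{fmt}: {bits_per_weight(fmt)} bits/weight, "
          f"~{estimate_size_gb(n_params, fmt):.2f} GB")
```

Real files can deviate from these estimates if some tensors (for example norms or biases) are kept at higher precision.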

Architecture

  • MMDiT Backbone: 24-layer Multi-Modal Diffusion Transformer
  • Embed dim: 1536
  • Attention heads: 64 (head_dim = 24)
  • Conditioning: T5-base text encoder (768-dim)
  • VAE: Oobleck-style autoencoder with 2048x downsampling
  • Diffusion: Rectified flow, 4-step DMD distillation
  • Output: 44.1 kHz stereo audio
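
The numbers in this list fix a few derived quantities. The following is a minimal sketch of that arithmetic. The function names are illustrative, and rounding the latent length up (padding the final partial frame) is an assumption, not something the model card states.

```python
import math

# Values taken from the architecture list above.
EMBED_DIM = 1536
NUM_HEADS = 64
SAMPLE_RATE = 44_100      # Hz
VAE_DOWNSAMPLE = 2048     # audio samples per latent frame


def head_dim(embed_dim: int = EMBED_DIM, num_heads: int = NUM_HEADS) -> int:
    """Per-head width; the embed dim must split evenly across heads."""
    if embed_dim % num_heads:
        raise ValueError("embed_dim must be divisible by num_heads")
    return embed_dim // num_heads


def latent_frames(seconds: float,
                  sample_rate: int = SAMPLE_RATE,
                  downsample: int = VAE_DOWNSAMPLE) -> int:
    """Latent sequence length for a clip, assuming the tail is padded."""
    n_samples = round(seconds * sample_rate)
    return math.ceil(n_samples / downsample)


print(head_dim())                    # 1536 / 64
print(SAMPLE_RATE / VAE_DOWNSAMPLE)  # latent frames per second of audio
print(latent_frames(10.0))           # latent length of a 10-second clip
```

At 2048x downsampling the transformer sees about 21.5 latent frames per second of audio.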

Python Usage

Dependencies

pip install torch numpy scipy gguf
pip install transformers sentencepiece  # T5 text encoder; T5Tokenizer needs sentencepiece

Full text-to-audio generation

import gguf
from transformers import T5EncoderModel, T5Tokenizer

# Load the T5 text encoder used for conditioning
device = "cpu"
model_name = "t5-base"  # must match the training config
tokenizer = T5Tokenizer.from_pretrained(model_name)
t5 = T5EncoderModel.from_pretrained(model_name).to(device).eval()

# Open the GGUF model file for reading
reader = gguf.GGUFReader("mmdit-f16.gguf")

# Note: the full inference pipeline is implemented in C++ in audiox.cpp.
# For Python inference, use the original checkpoint directly:
# python3 scripts/reference_inference.py --ckpt audiox_turbo.ckpt ...
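
Once the file is open, `GGUFReader` exposes its metadata as `reader.fields` and its tensors as `reader.tensors`. The sketch below lists the first few tensors as a quick sanity check on a download. The helper names `summarize_tensors` and `inspect_gguf` are illustrative and not part of this repository.

```python
def summarize_tensors(tensors, limit=10):
    """Return (name, shape, n_bytes) for the first `limit` tensors.

    Works on any objects exposing .name, .shape and .n_bytes, which is
    what gguf's reader tensors provide.
    """
    rows = []
    for t in list(tensors)[:limit]:
        shape = tuple(int(d) for d in t.shape)
        rows.append((t.name, shape, int(t.n_bytes)))
    return rows


def inspect_gguf(path, limit=10):
    """Print a short overview of a GGUF file."""
    import gguf  # imported lazily so summarize_tensors has no hard dependency

    reader = gguf.GGUFReader(path)
    print(f"{len(reader.fields)} metadata fields, {len(reader.tensors)} tensors")
    for name, shape, n_bytes in summarize_tensors(reader.tensors, limit):
        print(f"{name:50s} {str(shape):20s} {n_bytes / 1e6:8.2f} MB")


if __name__ == "__main__":
    inspect_gguf("mmdit-f16.gguf")
```

Shapes are reported as stored in the file, so the dimension order may appear reversed relative to the PyTorch checkpoint.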

Text embedding export

import numpy as np
import torch
from transformers import T5EncoderModel, T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-base")
encoder = T5EncoderModel.from_pretrained("t5-base")

prompt = "A woman speaking clearly and slowly"
inputs = tokenizer(prompt, return_tensors="pt", padding=True, truncation=True, max_length=512)

with torch.no_grad():
    embeddings = encoder(**inputs).last_hidden_state  # [1, seq_len, 768]

np.save("text_embedding.npy", embeddings.numpy())

GGUF Conversion Details

QKV Weight Reordering

The conversion script (convert/convert_mmdit.py) applies QKV weight reordering to match the C++ inference layout:

Self-attention QKV: The original PyTorch checkpoint stores Q, K, V interleaved per attention head as (Q0, K0, V0, Q1, K1, V1, ...) with stride 3 between consecutive per-head dims. The GGUF conversion reorders this to contiguous blocks: [Q(1536) | K(1536) | V(1536)].

Cross-attention KV: Similarly, the cross-attention to_kv weights are reordered from (K0, V0, K1, V1, ...) at stride 2 to contiguous [K(1536) | V(1536)].

This reordering is applied to all 24 layers and verified to match the original weights after transformation.
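
The reordering described above can be sketched in NumPy as follows. This is an illustration, not the code in convert/convert_mmdit.py. It assumes the fused weight has the PyTorch `nn.Linear` layout `[out_features, in_features]` with the interleaving along the output rows. The `group` parameter is hypothetical: `group=1` means a strict stride-3 (or stride-2) row interleave, while `group=head_dim` would mean whole per-head blocks are interleaved.

```python
import numpy as np


def deinterleave(weight: np.ndarray, n_parts: int, group: int = 1) -> np.ndarray:
    """Turn (P0, P1, ..., P0, P1, ...) rows into contiguous [P0 | P1 | ...].

    weight:  [n_parts * dim, in_features], interleaved in chunks of `group` rows
    n_parts: 3 for fused QKV, 2 for fused KV
    """
    rows, in_features = weight.shape
    if rows % (n_parts * group):
        raise ValueError("row count must be a multiple of n_parts * group")
    n_groups = rows // (n_parts * group)
    w = weight.reshape(n_groups, n_parts, group, in_features)
    w = w.transpose(1, 0, 2, 3)  # bring the part axis (Q/K/V) to the front
    return np.ascontiguousarray(w).reshape(rows, in_features)


def interleave(weight: np.ndarray, n_parts: int, group: int = 1) -> np.ndarray:
    """Inverse of deinterleave, useful for verifying the round trip."""
    rows, in_features = weight.shape
    if rows % (n_parts * group):
        raise ValueError("row count must be a multiple of n_parts * group")
    n_groups = rows // (n_parts * group)
    w = weight.reshape(n_parts, n_groups, group, in_features)
    w = w.transpose(1, 0, 2, 3)
    return np.ascontiguousarray(w).reshape(rows, in_features)


# Toy check: 4 output dims per part, rows labelled by their original index.
qkv = np.arange(12, dtype=np.float32).reshape(12, 1)
blocked = deinterleave(qkv, n_parts=3)
print(blocked[:, 0])  # Q rows first, then K rows, then V rows
assert np.array_equal(interleave(blocked, n_parts=3), qkv)
```

For the model described here, the self-attention weight would have 3 x 1536 rows and the cross-attention to_kv weight 2 x 1536 rows. The round-trip assertion mirrors the kind of verification mentioned above.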

Conversion command

python3 convert/convert_mmdit.py audiox_turbo.ckpt mmdit-f16.gguf --dtype f16
python3 convert/convert_vae.py vae.ckpt vae-f16.gguf --dtype f16

For quantized versions:

python3 convert/convert_mmdit.py audiox_turbo.ckpt mmdit-q8.gguf --dtype q8_0
python3 convert/convert_mmdit.py audiox_turbo.ckpt mmdit-q4.gguf --dtype q4_0
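
For readers unfamiliar with Q8_0, the sketch below shows the general scheme in pure Python: weights are split into blocks of 32, and each block stores one scale plus 32 signed 8-bit integers. It illustrates the format only and is not taken from the conversion script. Real files store the scale as fp16, which this sketch skips.

```python
BLOCK_SIZE = 32  # weights per Q8_0 block


def quantize_q8_0_block(values):
    """Quantize one block of 32 floats to (scale, list of int8 values)."""
    if len(values) != BLOCK_SIZE:
        raise ValueError(f"expected {BLOCK_SIZE} values, got {len(values)}")
    amax = max(abs(v) for v in values)
    scale = amax / 127.0
    if scale == 0.0:
        return 0.0, [0] * BLOCK_SIZE
    # Python's round() breaks exact ties to even; ggml may round ties differently.
    quants = [max(-127, min(127, round(v / scale))) for v in values]
    return scale, quants


def dequantize_q8_0_block(scale, quants):
    """Reconstruct approximate floats from a quantized block."""
    return [scale * q for q in quants]


# Example: a block of evenly spaced values in [-0.5, 0.5].
block = [i / 31 - 0.5 for i in range(BLOCK_SIZE)]
scale, quants = quantize_q8_0_block(block)
restored = dequantize_q8_0_block(scale, quants)
max_err = max(abs(a - b) for a, b in zip(block, restored))
print(f"scale={scale:.6f}, max abs error={max_err:.6f}")
```

The reconstruction error per weight is bounded by half the block scale, which is why Q8_0 usually stays close to FP16 quality while roughly halving file size.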

Notes

  • The model was trained with t5-base as the text encoder, not google/flan-t5-base
  • DMD distillation produces high-quality audio in just 4 sampling steps
  • The C++ port is under active development at audiox.cpp
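
To give a feel for what 4 sampling steps means for a rectified-flow model, here is a generic fixed-step Euler integrator over a toy velocity field. It is not the sampler used by AudioX-Turbo or audiox.cpp, whose DMD schedule may differ. The time convention (t = 1 is noise, t = 0 is data) and the name `velocity_fn` are assumptions made for illustration.

```python
def euler_sample(velocity_fn, x_start, num_steps=4):
    """Integrate dx/dt = v(x, t) from t = 1 (noise) down to t = 0 (data)."""
    x = list(x_start)
    dt = 1.0 / num_steps
    for i in range(num_steps):
        t = 1.0 - i * dt
        v = velocity_fn(x, t)  # in a real model this is one network call
        x = [xi - dt * vi for xi, vi in zip(x, v)]
    return x


# Toy setup: on a perfectly straight path x_t = (1 - t) * data + t * noise,
# the velocity is the constant (noise - data).
data = [1.0, -2.0]
noise = [0.5, 0.25]


def toy_velocity(x, t):
    return [n - d for n, d in zip(noise, data)]


result = euler_sample(toy_velocity, noise, num_steps=4)
print(result)  # lands on `data` because the toy path is exactly straight
```

With a learned velocity field the path is only approximately straight, which is why distillation is needed to make so few steps sufficient.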