metadata
license: cc-by-nc-4.0
library_name: mlx
pipeline_tag: text-to-audio
base_model: facebook/audiogen-medium
tags:
  - audio-generation
  - text-to-audio
  - audiogen
  - mlx
  - encodec

AudioGen Medium (MLX)

This is an MLX-native port of facebook/audiogen-medium, a 1.5B-parameter autoregressive transformer for text-to-audio generation.

Model Details

  • Architecture: Autoregressive Transformer LM over EnCodec discrete tokens
  • Parameters: ~1.5B (LM) + EnCodec compression model
  • Sampling rate: 16 kHz
  • Frame rate: 50 Hz (4 codebooks, interleaved with a delay pattern)
  • Text encoder: T5-large (d_model=1024, 24 layers, 16 heads)
  • Max duration: 10 seconds (configurable)
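
The figures in the list above determine how many tokens a clip needs. The following is a minimal sketch of that arithmetic, not code from this repository. It assumes the delay pattern shifts codebook k right by k steps (delays 0, 1, 2, 3 for the 4 codebooks), and all function names are hypothetical.

```python
# Constants taken from the model details above.
SAMPLE_RATE_HZ = 16_000
FRAME_RATE_HZ = 50
NUM_CODEBOOKS = 4


def frames_for(duration_s: float) -> int:
    """Number of EnCodec frames for a clip of the given length."""
    return int(round(duration_s * FRAME_RATE_HZ))


def samples_for(duration_s: float) -> int:
    """Number of waveform samples for a clip of the given length."""
    return int(round(duration_s * SAMPLE_RATE_HZ))


def delayed_steps(num_frames: int, num_codebooks: int = NUM_CODEBOOKS) -> int:
    """Decoding steps when codebook k is delayed by k steps (assumption)."""
    return num_frames + num_codebooks - 1


def apply_delay(codes, pad=-1):
    """Shift row k of a [K][T] token grid right by k positions, padding with `pad`."""
    num_codebooks = len(codes)
    num_frames = len(codes[0])
    out = []
    for k, row in enumerate(codes):
        # k pads on the left, the remaining pads on the right, so every row
        # ends up with num_frames + num_codebooks - 1 entries.
        out.append([pad] * k + list(row) + [pad] * (num_codebooks - 1 - k))
    return out
```

Under these assumptions, the default 10-second maximum corresponds to 500 frames per codebook, 503 decoding steps, and 160,000 output samples.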

Files

  • config.json — Model configuration (includes t5_model_name reference)
  • model.safetensors — LM + EnCodec weights
  • model.safetensors.index.json — Weight index (for sharded variants)

T5 Conditioner (extracted separately)

The T5-large text encoder weights are not included in this repository. Use extract_t5.py to extract them from the original facebook/audiogen-medium checkpoint:

python extract_t5.py --output /path/to/audiogen-mlx/t5

This produces a t5/ directory with config.json, model.safetensors, and tokenizer files.

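Note: The T5 safetensors keys use MLX-compatible naming (.layer_0. / .layer_1. instead of HuggingFace's .layer.0. / .layer.1.). This is required because MLX's ModuleParameters.unflattened() splits keys on every dot, so a HuggingFace-style key would be split into separate layer and 0 components instead of resolving to a single layer_0 entry.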
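
The renaming described in the note can be sketched as a single regular-expression substitution. This is an illustration only: the function names are hypothetical, and whatever else extract_t5.py does to the keys is not shown here.

```python
import re

# Matches ".layer.<n>." in HuggingFace-style T5 parameter names.
_LAYER_INDEX = re.compile(r"\.layer\.(\d+)\.")


def sanitize_t5_key(key: str) -> str:
    """Rewrite '.layer.N.' as '.layer_N.' and leave everything else unchanged."""
    return _LAYER_INDEX.sub(r".layer_\1.", key)


def sanitize_t5_weights(weights: dict) -> dict:
    """Apply the key rewrite to every entry of a flat name-to-tensor mapping."""
    return {sanitize_t5_key(name): tensor for name, tensor in weights.items()}
```

For example, a key such as encoder.block.0.layer.0.SelfAttention.q.weight becomes encoder.block.0.layer_0.SelfAttention.q.weight, while keys without a .layer.N. segment pass through untouched.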

Usage (Swift/MLX)

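import MLXAudioGen

// modelURL and t5URL are assumed to be local folder URLs supplied by the caller:
// the model folder from this repository and the extracted t5/ directory.
let model = try await AudioGenModel.fromPretrained(
    modelFolder: modelURL,
    t5Folder: t5URL
)

// Generate EnCodec tokens for a 5-second clip.
let tokens = try await model.generate(
    descriptions: ["dog barking"],
    duration: 5.0,      // seconds
    cfgCoef: 3.0,       // classifier-free guidance coefficient
    temperature: 1.0,   // sampling temperature
    topK: 250           // top-k sampling cutoff
)

// Decode the tokens to audio with EnCodec.
let audio = model.decode(tokens: tokens)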
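
For readers unfamiliar with the cfgCoef, temperature, and topK parameters, the sketch below shows the standard form of classifier-free guidance followed by top-k sampling. It is written in plain Python for illustration, with hypothetical function names. Whether the Swift port implements these steps in exactly this way is an assumption, not something this README states.

```python
import math
import random


def cfg_logits(cond, uncond, cfg_coef):
    """Classifier-free guidance: start at the unconditional logits and move
    cfg_coef times the distance toward the text-conditioned logits."""
    return [u + cfg_coef * (c - u) for c, u in zip(cond, uncond)]


def sample_top_k(logits, temperature=1.0, top_k=250, rng=random):
    """Sample one index from the top_k highest logits after temperature scaling."""
    scaled = [x / temperature for x in logits]
    k = min(top_k, len(scaled))
    # Indices of the k largest scaled logits.
    top = sorted(range(len(scaled)), key=lambda i: scaled[i], reverse=True)[:k]
    peak = max(scaled[i] for i in top)
    # Unnormalized softmax weights; random.choices normalizes them internally.
    weights = [math.exp(scaled[i] - peak) for i in top]
    return rng.choices(top, weights=weights, k=1)[0]
```

A larger cfgCoef pushes generation harder toward the text description, while a smaller topK or lower temperature makes sampling more conservative.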

T5 Attention

T5's self-attention intentionally does not scale scores by 1/sqrt(d_k). This is a deliberate design choice in the T5 architecture (the original implementation folds the factor into the weight initialization), so do not add scaling in the inference code.
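
To make the difference concrete, here is a minimal pure-Python sketch of attention weights computed the T5 way, from raw dot products with an optional additive position bias. The function names are hypothetical and the code is illustrative rather than taken from this port.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of scores."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def t5_attention_weights(queries, keys, position_bias=None):
    """Attention weights from raw q.k dot products, with no 1/sqrt(d_k) factor."""
    weights = []
    for i, q in enumerate(queries):
        # Plain dot product; note the absence of any division by sqrt(len(q)).
        scores = [sum(a * b for a, b in zip(q, k)) for k in keys]
        if position_bias is not None:
            scores = [s + position_bias[i][j] for j, s in enumerate(scores)]
        weights.append(softmax(scores))
    return weights
```

Dividing the scores by sqrt(d_k) here would flatten the softmax relative to what the pretrained weights expect, which is why the scaling must stay out of the inference path.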

License

This model is published under the CC-BY-NC 4.0 license (non-commercial use only), following the original AudioGen license.