How to use mlx-community/audiogen-medium-mlx with MLX:
# Download the model from the Hub
pip install "huggingface_hub[hf_xet]"
huggingface-cli download --local-dir audiogen-medium-mlx mlx-community/audiogen-medium-mlx
license: cc-by-nc-4.0
library_name: mlx
pipeline_tag: text-to-audio
base_model: facebook/audiogen-medium
tags:
- audio-generation
- text-to-audio
- audiogen
- mlx
- encodec
AudioGen Medium (MLX)
This is the MLX-native port of facebook/audiogen-medium, a 1.5B-parameter autoregressive transformer for text-to-audio generation.
Model Details
- Architecture: Autoregressive Transformer LM over EnCodec discrete tokens
- Parameters: ~1.5B (LM) + EnCodec compression model
- Sampling rate: 16 kHz
- Frame rate: 50 Hz (4 codebooks, delayed pattern)
- Text encoder: T5-large (d_model=1024, 24 layers, 16 heads)
- Max duration: 10 seconds (configurable)
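The figures above fix the token budget for a clip. The following is a minimal Python sketch of that arithmetic, not code from this repository. It assumes the delay pattern offsets codebook k by k steps, as in the upstream AudioGen/MusicGen interleaving; the helper names are hypothetical.

```python
from typing import List, Optional

SAMPLE_RATE_HZ = 16_000  # EnCodec output sampling rate (from the model details)
FRAME_RATE_HZ = 50       # EnCodec frames per second
NUM_CODEBOOKS = 4        # parallel token streams per frame


def sample_count(duration_s: float) -> int:
    """Audio samples in a clip of the given length."""
    return round(duration_s * SAMPLE_RATE_HZ)


def frame_count(duration_s: float) -> int:
    """EnCodec frames in a clip of the given length."""
    return round(duration_s * FRAME_RATE_HZ)


def decoding_steps(duration_s: float) -> int:
    """Autoregressive steps, ASSUMING codebook k lags codebook 0 by k steps."""
    return frame_count(duration_s) + NUM_CODEBOOKS - 1


def delay_pattern(num_frames: int) -> List[List[Optional[int]]]:
    """Row k lists, per step, the frame index emitted by codebook k (None = padding)."""
    steps = num_frames + NUM_CODEBOOKS - 1
    return [
        [t - k if 0 <= t - k < num_frames else None for t in range(steps)]
        for k in range(NUM_CODEBOOKS)
    ]


if __name__ == "__main__":
    print(sample_count(10.0), frame_count(10.0), decoding_steps(10.0))
    for row in delay_pattern(3):
        print(row)
```

Under these assumptions each frame covers 16000 / 50 = 320 samples, and a 10-second clip needs 500 frames of 4 tokens each.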
Files
- config.json: Model configuration (includes t5_model_name reference)
- model.safetensors: LM + EnCodec weights
- model.safetensors.index.json: Weight index (for sharded variants)
T5 Conditioner (extracted separately)
The T5-large text encoder weights are not included in this repository. Use extract_t5.py to extract them from the original facebook/audiogen-medium checkpoint:
python extract_t5.py --output /path/to/audiogen-mlx/t5
This produces a t5/ directory with config.json, model.safetensors, and tokenizer files.
Note: The T5 safetensors keys use MLX-compatible naming (.layer_0. / .layer_1. instead of Hugging Face's .layer.0. / .layer.1.). This is required because MLX's ModuleParameters.unflattened() splits on all dots.
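The renaming in the note can be sketched as a single regular-expression substitution over the checkpoint keys. This is an illustration only: to_mlx_key is a hypothetical helper and is not necessarily how extract_t5.py implements it. The sample keys follow the Hugging Face T5 naming scheme.

```python
import re

# Matches ".layer.<digits>." so that only the numeric sub-layer index is rewritten.
# Other dotted indices, such as "block.0", are left untouched.
_LAYER_INDEX = re.compile(r"\.layer\.(\d+)\.")


def to_mlx_key(hf_key: str) -> str:
    """Rewrite '.layer.N.' to '.layer_N.' in a Hugging Face T5 parameter key."""
    return _LAYER_INDEX.sub(r".layer_\1.", hf_key)


if __name__ == "__main__":
    for key in (
        "encoder.block.0.layer.0.SelfAttention.q.weight",
        "encoder.block.0.layer.1.DenseReluDense.wi.weight",
        "encoder.final_layer_norm.weight",
    ):
        print(key, "->", to_mlx_key(key))
```

Keys without a ".layer.N." segment pass through unchanged, so the function can be applied to every key in the state dict.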
Usage (Swift/MLX)
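import MLXAudioGen

// modelURL: local folder containing config.json and model.safetensors
// t5URL: the t5/ folder produced by extract_t5.py
let model = try await AudioGenModel.fromPretrained(
    modelFolder: modelURL,
    t5Folder: t5URL
)

// Generate EnCodec tokens for a 5-second clip
let tokens = try await model.generate(
    descriptions: ["dog barking"],
    duration: 5.0,
    cfgCoef: 3.0,
    temperature: 1.0,
    topK: 250
)

// Decode the tokens to audio
let audio = model.decode(tokens: tokens)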
T5 Attention
T5's self-attention does not scale scores by 1/sqrt(d_k). This is a deliberate design choice in the T5 architecture, and the pretrained weights assume it, so do not add scaling in the inference code.
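To show what the missing factor changes, here is a minimal single-head attention sketch in plain Python. It is not the repository's inference code, and it omits T5's relative position bias for brevity. The scale_scores flag is hypothetical and exists only to contrast the two behaviours.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(q, k, v, scale_scores=False):
    """Single-head attention over lists of vectors.

    scale_scores=False is the T5 behaviour: scores are raw dot products.
    scale_scores=True applies the usual 1/sqrt(d_k) factor, which T5 does not use.
    """
    d_k = len(q[0])
    scale = 1.0 / math.sqrt(d_k) if scale_scores else 1.0
    out = []
    for qi in q:
        scores = [scale * sum(a * b for a, b in zip(qi, kj)) for kj in k]
        weights = softmax(scores)
        out.append(
            [sum(w * vj[c] for w, vj in zip(weights, v)) for c in range(len(v[0]))]
        )
    return out


if __name__ == "__main__":
    q = [[2.0, 0.0]]
    k = [[1.0, 0.0], [0.0, 1.0]]
    v = [[1.0, 0.0], [0.0, 1.0]]
    print(attention(q, k, v))                     # T5 style, unscaled
    print(attention(q, k, v, scale_scores=True))  # what T5 must not do
```

Scaling flattens the softmax relative to what the pretrained weights expect, so the two calls return different attention outputs for the same inputs.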
License
This model is published under the CC-BY-NC 4.0 license (non-commercial use only), following the original AudioGen license.