---
license: cc-by-nc-4.0
library_name: mlx
pipeline_tag: text-to-audio
base_model: facebook/audiogen-medium
tags:
- audio-generation
- text-to-audio
- audiogen
- mlx
- encodec
---

# AudioGen Medium (MLX)

This is the MLX-native port of [facebook/audiogen-medium](https://huggingface.co/facebook/audiogen-medium), a 1.5B parameter autoregressive transformer for text-to-audio generation.

## Model Details

- **Architecture**: Autoregressive Transformer LM over EnCodec discrete tokens
- **Parameters**: ~1.5B (LM) + EnCodec compression model
- **Sampling rate**: 16 kHz
- **Frame rate**: 50 Hz (4 codebooks, delayed pattern)
- **Text encoder**: T5-large (d_model=1024, 24 layers, 16 heads)
- **Max duration**: 10 seconds (configurable)
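
The figures above fix how many tokens the LM has to produce for a clip. The following is a minimal sketch of that arithmetic, not code from this repository. It assumes the delay pattern offsets codebook `k` by `k` steps, as in the MusicGen/AudioGen family; the function names are hypothetical, and the real implementation may add special start steps on top of the count shown here.

```swift
import Foundation

// Values taken from the model details above.
let sampleRate = 16_000   // audio samples per second
let frameRate = 50        // EnCodec frames per second
let numCodebooks = 4      // parallel token streams per frame

/// Number of EnCodec frames needed for a clip of `seconds` seconds.
func frameCount(seconds: Double) -> Int {
    return Int((seconds * Double(frameRate)).rounded())
}

/// Total number of discrete tokens (frames x codebooks) for the clip.
func tokenCount(seconds: Double) -> Int {
    return frameCount(seconds: seconds) * numCodebooks
}

/// Assumed delay pattern: codebook `k` emits the token for frame `t` at step `t + k`.
func decodingStep(frame t: Int, codebook k: Int) -> Int {
    return t + k
}

/// Autoregressive steps needed under that assumed layout: the last codebook
/// finishes `numCodebooks - 1` steps after the first one.
func decodingSteps(seconds: Double) -> Int {
    return frameCount(seconds: seconds) + (numCodebooks - 1)
}

let samplesPerFrame = sampleRate / frameRate
print(samplesPerFrame)              // 320 audio samples per frame
print(frameCount(seconds: 10))      // 500 frames at the 10-second maximum
print(tokenCount(seconds: 10))      // 2000 tokens across 4 codebooks
print(decodingSteps(seconds: 10))   // 503 steps under the assumed delay layout
```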

## Files

- `config.json`: Model configuration (includes `t5_model_name` reference)
- `model.safetensors`: LM + EnCodec weights
- `model.safetensors.index.json`: Weight index (for sharded variants)

### T5 Conditioner (extracted separately)

The T5-large text encoder weights are not included in this repository. Use `extract_t5.py` to extract them from the original `facebook/audiogen-medium` checkpoint:

```bash
python extract_t5.py --output /path/to/audiogen-mlx/t5
```

This produces a `t5/` directory with `config.json`, `model.safetensors`, and tokenizer files.

> **Note**: The T5 safetensors keys use MLX-compatible naming (`.layer_0.` / `.layer_1.`
> instead of HuggingFace's `.layer.0.` / `.layer.1.`). This is required because MLX's
> `ModuleParameters.unflattened()` splits on all dots.
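
To illustrate the renaming described in the note, here is a small sketch of the key mapping. It is not the logic of `extract_t5.py`; the function name `mlxCompatibleKey` is hypothetical, and the sketch assumes only the `.layer.N.` segment needs rewriting while other dotted segments (such as `block.0`) are left as they are.

```swift
import Foundation

/// Rewrites a HuggingFace-style T5 key so the sub-layer index becomes part of
/// the segment name (`.layer.0.` -> `.layer_0.`). All other segments are untouched.
func mlxCompatibleKey(_ key: String) -> String {
    return key.replacingOccurrences(
        of: #"\.layer\.(\d+)\."#,   // a literal ".layer.", digits, then "."
        with: ".layer_$1.",         // $1 re-inserts the captured digits
        options: .regularExpression
    )
}

let hfKey = "encoder.block.0.layer.0.SelfAttention.q.weight"
print(mlxCompatibleKey(hfKey))
// encoder.block.0.layer_0.SelfAttention.q.weight
```

With the index folded into the name, splitting the key on dots yields `layer_0` as a single path component rather than `layer` followed by `0`.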
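
## Usage (Swift/MLX)

```swift
import Foundation
import MLXAudioGen

// Placeholder paths: point these at the downloaded model folder
// and at the `t5/` directory produced by `extract_t5.py`.
let modelURL = URL(fileURLWithPath: "/path/to/audiogen-mlx")
let t5URL = URL(fileURLWithPath: "/path/to/audiogen-mlx/t5")

// Load the LM + EnCodec weights together with the T5 conditioner.
let model = try await AudioGenModel.fromPretrained(
    modelFolder: modelURL,
    t5Folder: t5URL
)

// Generate EnCodec tokens for a 5-second clip from a text description.
let tokens = try await model.generate(
    descriptions: ["dog barking"],
    duration: 5.0,
    cfgCoef: 3.0,
    temperature: 1.0,
    topK: 250
)

// Decode the generated tokens into audio.
let audio = model.decode(tokens: tokens)
```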
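
For readers unfamiliar with the `cfgCoef` and `topK` parameters, the sketch below shows what such parameters conventionally mean in AudioCraft-style samplers. It is an assumption that this port follows the same convention; `guidedLogits` and `topKFiltered` are hypothetical helper names, not part of `MLXAudioGen`.

```swift
/// Classifier-free guidance: move the conditional logits away from the
/// unconditional ones by `cfgCoef`. With `cfgCoef == 1` the result equals `cond`.
func guidedLogits(cond: [Float], uncond: [Float], cfgCoef: Float) -> [Float] {
    precondition(cond.count == uncond.count, "logit vectors must have equal length")
    var result: [Float] = []
    for (c, u) in zip(cond, uncond) {
        result.append(u + (c - u) * cfgCoef)
    }
    return result
}

/// Top-k filtering: keep the `k` largest logits and mask the rest with -infinity
/// so they receive zero probability after softmax. Ties at the threshold are kept.
func topKFiltered(_ logits: [Float], k: Int) -> [Float] {
    guard k > 0, k < logits.count else { return logits }
    let threshold = logits.sorted(by: >)[k - 1]
    return logits.map { $0 >= threshold ? $0 : -Float.infinity }
}

let guided = guidedLogits(cond: [2, 0], uncond: [1, 1], cfgCoef: 3)
print(guided)                               // [4.0, -2.0]
print(topKFiltered([1, 3, 2], k: 2))        // [-inf, 3.0, 2.0]
```

Under this convention, a larger `cfgCoef` makes the output follow the text description more closely, and a smaller `topK` restricts sampling to fewer candidate tokens per step.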

## T5 Attention

T5's self-attention does **not** scale scores by `1/sqrt(d_k)`. This is a deliberate design choice in the T5 architecture, so do not add scaling in the inference code.
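
As an illustration of the point above, here is a single-head attention sketch on plain Swift arrays that omits the `1/sqrt(d_k)` factor. It is not the repository's MLX implementation; the function names are hypothetical, and the `bias` argument stands in for T5's relative position bias (pass zeros to ignore it). Because the checkpoint was trained without the factor, adding it at inference time would change the attention weights.

```swift
import Foundation

/// Numerically stable softmax over one row of scores.
func softmax(_ row: [Float]) -> [Float] {
    let maxValue = row.max() ?? 0
    let exps = row.map { exp($0 - maxValue) }
    let total = exps.reduce(0, +)
    return exps.map { $0 / total }
}

/// Dot product of two equal-length vectors.
func dot(_ a: [Float], _ b: [Float]) -> Float {
    var sum: Float = 0
    for i in 0..<a.count { sum += a[i] * b[i] }
    return sum
}

/// T5-style attention for one head: softmax(Q K^T + bias) V.
/// Note that the scores are NOT divided by sqrt(d_k).
func t5Attention(q: [[Float]], k: [[Float]], v: [[Float]], bias: [[Float]]) -> [[Float]] {
    var output: [[Float]] = []
    for (i, qi) in q.enumerated() {
        // Raw, unscaled scores of query i against every key, plus position bias.
        var scores: [Float] = []
        for (j, kj) in k.enumerated() {
            scores.append(dot(qi, kj) + bias[i][j])
        }
        let weights = softmax(scores)
        // Weighted sum of the value rows.
        var row = [Float](repeating: 0, count: v[0].count)
        for (j, w) in weights.enumerated() {
            for d in 0..<row.count { row[d] += w * v[j][d] }
        }
        output.append(row)
    }
    return output
}

// A zero query gives equal scores, so the output is the mean of the value rows.
let attended = t5Attention(
    q: [[0, 0]],
    k: [[1, 0], [0, 1]],
    v: [[2, 0], [0, 4]],
    bias: [[0, 0]]
)
print(attended)   // [[1.0, 2.0]]
```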

## License

This model is published under the [CC-BY-NC 4.0](https://creativecommons.org/licenses/by-nc/4.0/) license (non-commercial use only), following the original [AudioGen license](https://huggingface.co/facebook/audiogen-medium).