---
license: cc-by-nc-4.0
library_name: mlx
pipeline_tag: text-to-audio
base_model: facebook/audiogen-medium
tags:
- audio-generation
- text-to-audio
- audiogen
- mlx
- encodec
---

# AudioGen Medium (MLX)

This is the MLX-native port of [facebook/audiogen-medium](https://huggingface.co/facebook/audiogen-medium), a 1.5B parameter autoregressive transformer for text-to-audio generation.

## Model Details

- **Architecture**: Autoregressive Transformer LM over EnCodec discrete tokens
- **Parameters**: ~1.5B (LM) + EnCodec compression model
- **Sampling rate**: 16 kHz
- **Frame rate**: 50 Hz (4 codebooks, delayed pattern)
- **Text encoder**: T5-large (d_model=1024, 24 layers, 16 heads)
- **Max duration**: 10 seconds (configurable)

## Files

- `config.json`: Model configuration (includes `t5_model_name` reference)
- `model.safetensors`: LM + EnCodec weights
- `model.safetensors.index.json`: Weight index (for sharded variants)

### T5 Conditioner (extracted separately)

The T5-large text encoder weights are not included in this repository. Use `extract_t5.py` to extract them from the original `facebook/audiogen-medium` checkpoint:

```bash
python extract_t5.py --output /path/to/audiogen-mlx/t5
```

This produces a `t5/` directory with `config.json`, `model.safetensors`, and tokenizer files.

> **Note**: The T5 safetensors keys use MLX-compatible naming (`.layer_0.` / `.layer_1.`
> instead of HuggingFace's `.layer.0.` / `.layer.1.`). This is required because MLX's
> `ModuleParameters.unflattened()` splits on all dots.

## Usage (Swift/MLX)

```swift
import MLXAudioGen

let model = try await AudioGenModel.fromPretrained(
    modelFolder: modelURL,
    t5Folder: t5URL
)

let tokens = try await model.generate(
    descriptions: ["dog barking"],
    duration: 5.0,
    cfgCoef: 3.0,
    temperature: 1.0,
    topK: 250
)

let audio = model.decode(tokens: tokens)
```

## T5 Attention

T5's self-attention intentionally does **not** scale scores by `1/sqrt(d_k)`.
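
For readers comparing against other Transformer implementations, the difference can be illustrated with a small sketch. The Python below is not taken from this repository (the actual inference code is Swift/MLX); the `attention_weights` helper and its `scale` parameter are hypothetical names chosen for illustration, and the sketch leaves out T5's relative position bias and the multi-head projections.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention_weights(q, k, scale=1.0):
    """Attention weights for query vectors q and key vectors k (lists of lists).

    Hypothetical helper: scale=1.0 leaves the dot-product scores unscaled
    (T5-style), while scale=1/sqrt(d_k) applies the usual Transformer scaling.
    """
    scores = [
        [scale * sum(qi * ki for qi, ki in zip(q_vec, k_vec)) for k_vec in k]
        for q_vec in q
    ]
    return [softmax(row) for row in scores]


# Toy inputs: one query, two keys, d_k = 4.
q = [[1.0, 0.0, 2.0, 0.0]]
k = [[1.0, 0.0, 2.0, 0.0], [0.0, 1.0, 0.0, 1.0]]
d_k = len(q[0])

t5_style = attention_weights(q, k)                      # unscaled scores
scaled = attention_weights(q, k, 1.0 / math.sqrt(d_k))  # scaled, for comparison only
```

With identical inputs the two calls differ only in the scale factor, so the unscaled weights come out more sharply peaked than the scaled ones.
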
Omitting the scale factor is a deliberate design choice in the T5 architecture; do not add scaling in the inference code.

## License

This model is published under the [CC-BY-NC 4.0](https://creativecommons.org/licenses/by-nc/4.0/) license (non-commercial use only), following the original [AudioGen license](https://huggingface.co/facebook/audiogen-medium).