	---
	license: cc-by-nc-4.0
	library_name: mlx
	pipeline_tag: text-to-audio
	base_model: facebook/audiogen-medium
	tags:
	- audio-generation
	- text-to-audio
	- audiogen
	- mlx
	- encodec
	---

	# AudioGen Medium (MLX)

	This is the MLX-native port of [facebook/audiogen-medium](https://huggingface.co/facebook/audiogen-medium), a 1.5B-parameter autoregressive transformer for text-to-audio generation.

	## Model Details

	- Architecture: Autoregressive Transformer LM over EnCodec discrete tokens
	- Parameters: ~1.5B (LM) + EnCodec compression model
	- Sampling rate: 16 kHz
	- Frame rate: 50 Hz (4 codebooks, delayed pattern)
	- Text encoder: T5-large (d_model=1024, 24 layers, 16 heads)
	- Max duration: 10 seconds (configurable)
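
The figures above imply a few simple sizes. As a rough worked example (the function name `generation_sizes` is illustrative, and the step count assumes a simple delay pattern in which codebook k is shifted by k steps, which this card does not spell out):

```python
def generation_sizes(duration_s, sample_rate=16_000, frame_rate=50, codebooks=4):
    """Derive basic sizes from the model details listed above."""
    frames = int(duration_s * frame_rate)          # EnCodec frames to generate
    samples_per_frame = sample_rate // frame_rate  # audio samples covered by one frame
    return {
        "frames": frames,
        "tokens": frames * codebooks,              # one token per codebook per frame
        "samples": frames * samples_per_frame,
        "samples_per_frame": samples_per_frame,
        # Assumption: shifting codebook k by k steps adds (codebooks - 1) extra steps.
        "decode_steps": frames + (codebooks - 1),
    }


sizes = generation_sizes(10.0)
print(sizes)
```

For the 10-second maximum this gives 500 frames, 2,000 tokens, and 160,000 audio samples, with 320 samples per frame.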

	## Files

	- `config.json` — Model configuration (includes `t5_model_name` reference)
	- `model.safetensors` — LM + EnCodec weights
	- `model.safetensors.index.json` — Weight index (for sharded variants)
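
For sharded variants, the index tells a loader which files to open. A minimal sketch, assuming the index follows the usual Hugging Face layout with a top-level `weight_map` from parameter name to shard file name (the helper name and the sample entries are made up for illustration and are not taken from this repository):

```python
import json


def shard_files(index_text):
    """Return the sorted, de-duplicated shard file names listed in a weight index."""
    index = json.loads(index_text)
    # Assumption: "weight_map" maps each parameter key to the file that stores it.
    return sorted(set(index["weight_map"].values()))


# Illustrative index contents only.
example = json.dumps({
    "metadata": {"total_size": 0},
    "weight_map": {
        "lm.example_a.weight": "model-00001-of-00002.safetensors",
        "lm.example_b.weight": "model-00002-of-00002.safetensors",
        "lm.example_c.weight": "model-00001-of-00002.safetensors",
    },
})
print(shard_files(example))
```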

	### T5 Conditioner (extracted separately)

	The T5-large text encoder weights are not included in this repository. Use `extract_t5.py` to extract them from the original `facebook/audiogen-medium` checkpoint:

	```bash
	python extract_t5.py --output /path/to/audiogen-mlx/t5
	```

	This produces a `t5/` directory with `config.json`, `model.safetensors`, and tokenizer files.
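
Before loading, it can help to confirm that the extraction produced the expected files. The following is a small sketch, not part of `extract_t5.py`; the helper name `missing_t5_files` is hypothetical, and only the two file names stated above are checked because the tokenizer file names are not listed here:

```python
import tempfile
from pathlib import Path

# Only the file names named in this README; tokenizer files are not checked.
REQUIRED_T5_FILES = ("config.json", "model.safetensors")


def missing_t5_files(t5_dir):
    """Return the required files that are absent from t5_dir."""
    root = Path(t5_dir)
    return [name for name in REQUIRED_T5_FILES if not (root / name).is_file()]


# Demo on a temporary directory that contains only config.json.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "config.json").write_text("{}")
    demo_missing = missing_t5_files(tmp)
print(demo_missing)
```

An empty result from `missing_t5_files("/path/to/audiogen-mlx/t5")` would indicate that both files are present.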

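	> Note: The T5 safetensors keys use MLX-compatible naming (`.layer_0.` / `.layer_1.`
	> instead of Hugging Face's `.layer.0.` / `.layer.1.`). The rename is required because MLX's
	> `ModuleParameters.unflattened()` splits keys on every dot, so `.layer.0.` would be broken
	> into separate nested path components instead of being kept as a single `layer_0` name.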
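
The key rename described in the note can be sketched as follows. This is an illustration of the naming rule only, not the actual code in `extract_t5.py`; the function names are hypothetical, and the sample keys follow the standard Hugging Face T5 naming:

```python
import re

# Matches ".layer.<digits>." anywhere in a parameter key.
_LAYER_INDEX = re.compile(r"\.layer\.(\d+)\.")


def sanitize_t5_key(key):
    """Rewrite '.layer.N.' as '.layer_N.' and leave everything else untouched."""
    return _LAYER_INDEX.sub(r".layer_\1.", key)


def sanitize_t5_keys(weights):
    """Apply the rename to every key of a flat {name: tensor} mapping."""
    return {sanitize_t5_key(name): value for name, value in weights.items()}


print(sanitize_t5_key("encoder.block.0.layer.0.SelfAttention.q.weight"))
print(sanitize_t5_key("encoder.final_layer_norm.weight"))
```

Keys without a `.layer.N.` segment, such as the final layer norm, pass through unchanged.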

	## Usage (Swift/MLX)

	```swift
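	import MLXAudioGen

	// modelURL and t5URL are file URLs of the local model folder and the extracted t5/ folder.
	let model = try await AudioGenModel.fromPretrained(
	    modelFolder: modelURL,
	    t5Folder: t5URL
	)

	// Generate EnCodec tokens for the text prompt.
	let tokens = try await model.generate(
	    descriptions: ["dog barking"],
	    duration: 5.0,
	    cfgCoef: 3.0,
	    temperature: 1.0,
	    topK: 250
	)

	// Decode the tokens to an audio waveform.
	let audio = model.decode(tokens: tokens)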
	```
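
This card does not say how `MLXAudioGen` applies `cfgCoef`, `temperature`, and `topK` internally. As a sketch of one common formulation (classifier-free guidance followed by temperature-scaled top-k sampling), with all function names being illustrative rather than part of the library:

```python
import math
import random


def apply_cfg(cond_logits, uncond_logits, cfg_coef):
    """Push logits away from the unconditional prediction by a factor of cfg_coef."""
    return [u + cfg_coef * (c - u) for c, u in zip(cond_logits, uncond_logits)]


def sample_top_k(logits, temperature=1.0, top_k=250, rng=random):
    """Sample one index from the top_k highest logits (temperature must be > 0)."""
    k = min(top_k, len(logits))
    # Indices of the k largest logits.
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    scaled = [logits[i] / temperature for i in top]
    peak = max(scaled)
    # Unnormalized softmax weights; random.choices normalizes them.
    weights = [math.exp(s - peak) for s in scaled]
    return rng.choices(top, weights=weights, k=1)[0]


guided = apply_cfg([2.0, 0.0], [1.0, 1.0], cfg_coef=3.0)
token = sample_top_k([0.1, 5.0, 4.0, -3.0], top_k=2, rng=random.Random(0))
print(guided, token)
```

Under this formulation, a larger `cfg_coef` weights the text-conditioned prediction more heavily, and a smaller `top_k` restricts sampling to fewer candidates.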

	## T5 Attention

	T5's self-attention intentionally does not scale scores by `1/sqrt(d_k)`. This is a deliberate design choice in the T5 architecture (the scaling is absorbed into the weight initialization), so do not add scaling in the inference code.
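
To make the difference concrete, here is a minimal single-head sketch in plain Python. It is not the port's implementation: the function names are illustrative, and the optional `position_bias` argument stands in for T5's relative position bias, which is added to the raw scores before the softmax.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def unscaled_attention(q, k, v, position_bias=None):
    """Single-head attention with raw dot-product scores (no 1/sqrt(d_k) factor).

    q: [Tq][d], k: [Tk][d], v: [Tk][dv], position_bias: optional [Tq][Tk].
    """
    out = []
    for i, q_i in enumerate(q):
        # Raw dot products; note that nothing is divided by sqrt(d_k) here.
        scores = [sum(a * b for a, b in zip(q_i, k_j)) for k_j in k]
        if position_bias is not None:
            scores = [s + position_bias[i][j] for j, s in enumerate(scores)]
        weights = softmax(scores)
        out.append([
            sum(weights[j] * v[j][c] for j in range(len(v)))
            for c in range(len(v[0]))
        ])
    return out


identity = [[1.0, 0.0], [0.0, 1.0]]
print(unscaled_attention([[1.0, 0.0]], identity, identity))
```

With standard scaled attention the same inputs would give a flatter weight distribution, which is why adding the factor changes the model's outputs.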

	## License

	This model is published under the [CC-BY-NC 4.0](https://creativecommons.org/licenses/by-nc/4.0/) license (non-commercial use only), following the original [AudioGen license](https://huggingface.co/facebook/audiogen-medium).