	---
	license: cc-by-nc-4.0
	language:
	- en
	tags:
	- mlx
	- audio
	- music
	- text-to-music
	- music-editing
	- instruct-musicgen
	- lora-merged
	- quantized
	- int4
	base_model: facebook/musicgen-large
	library_name: mlx
	pipeline_tag: text-to-audio
	---

	# Instruct-MusicGen-MLX-4bit

	- [speech-swift](https://github.com/soniqo/speech-swift) — Apple SDK
	- [soniqo.audio](https://soniqo.audio) — website
	- [blog](https://soniqo.audio/blog) — blog

	MLX port of [Instruct-MusicGen](https://arxiv.org/abs/2405.18386) for
	text-instructed music editing. The model builds on MusicGen-large (3.3B params,
	48-layer autoregressive transformer over EnCodec 32 kHz tokens). Cross-attention
	weights come from the upstream checkpoint, with the LoRA deltas on the Q/V
	projections merged in (α/r = 2.0). A 48-layer CPTransformer adapter injects the
	input audio's per-layer Q/K/V via prefix-attention into every self-attention block.
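
Merging LoRA folds the low-rank update into the base projection, so inference needs no separate adapter matmul. The sketch below applies the standard LoRA merge rule, W' = W + (α/r)·B·A, with the α/r = 2.0 scale quoted above. The toy shapes and values and the helper names `matmul` and `merge_lora` are illustrative assumptions, not the conversion script used for this bundle.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def merge_lora(weight, lora_b, lora_a, scale=2.0):
    """Return weight + scale * (lora_b @ lora_a), where scale = alpha / r."""
    delta = matmul(lora_b, lora_a)
    return [
        [w + scale * d for w, d in zip(w_row, d_row)]
        for w_row, d_row in zip(weight, delta)
    ]


# Toy 2x2 projection with a rank-1 update (hypothetical values).
weight = [[1.0, 0.0], [0.0, 1.0]]
lora_b = [[0.5], [0.0]]   # shape (out_dim, r)
lora_a = [[1.0, 1.0]]     # shape (r, in_dim)
merged = merge_lora(weight, lora_b, lora_a)
print(merged)  # [[2.0, 1.0], [0.0, 1.0]]
```

After the merge only `merged` needs to be stored, which is why the bundle can quantize the Q/V projections like any other linear layer.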

	## Inputs / Outputs

	- Input: text instruction (e.g. `"Music piece. Instruct: Only Drums."`) +
	input audio (mono float32 @ 32 kHz, ≤ 10 s window)
	- Output: edited audio (mono float32 @ 32 kHz, matches input length)
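
The input contract above (mono, float32, 32 kHz, at most 10 s) can be checked before calling the model. This is a standard-library sketch; the helper name `prepare_input` and the choice to crop clips longer than 10 s (rather than reject them) are assumptions of this example, not behaviour documented for the bundle. In practice you would likely do the same with NumPy or MLX arrays.

```python
from array import array

SAMPLE_RATE = 32_000   # Hz, from the card
MAX_SECONDS = 10       # editing window, from the card


def prepare_input(channels, sample_rate):
    """Downmix per-channel sample lists to mono float32, cropped to 10 s.

    Cropping long clips is an assumption of this sketch.
    """
    if sample_rate != SAMPLE_RATE:
        raise ValueError(f"expected {SAMPLE_RATE} Hz audio, got {sample_rate} Hz")
    n_samples = min(min(len(ch) for ch in channels), SAMPLE_RATE * MAX_SECONDS)
    n_channels = len(channels)
    # array("f") stores 32-bit floats, matching the float32 requirement.
    return array(
        "f",
        (sum(ch[i] for ch in channels) / n_channels for i in range(n_samples)),
    )


stereo = [[0.5] * 400_000, [-0.25] * 400_000]  # 12.5 s of fake stereo audio
mono = prepare_input(stereo, 32_000)
print(len(mono) / SAMPLE_RATE, mono.itemsize)
```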

	## Performance (Apple Silicon, INT4)

	| Metric | Value |
	|---|---|
	| Bundle size | ~2.2 GB on disk |
	| RTF (wall / audio) | ~1.21 (for 5 s output @ 250 AR steps) |
	| Peak RSS | ~3 GB |
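
To make the RTF row concrete: 250 autoregressive steps for 5 s of audio implies a 50 Hz token rate, and an RTF of about 1.21 then implies roughly 6 s of wall-clock time. The helper names below are illustrative, and the wall time is derived from the reported RTF rather than measured separately.

```python
FRAME_RATE_HZ = 50  # token rate implied by 250 AR steps for 5 s of audio


def audio_seconds(ar_steps, frame_rate_hz=FRAME_RATE_HZ):
    """Duration of audio covered by a number of autoregressive steps."""
    return ar_steps / frame_rate_hz


def real_time_factor(wall_seconds, audio_secs):
    """RTF = wall-clock time / audio duration; below 1.0 is faster than real time."""
    return wall_seconds / audio_secs


duration = audio_seconds(250)   # 5.0 s of audio
wall = 1.21 * duration          # wall time implied by the reported RTF
print(duration, round(wall, 2), round(real_time_factor(wall, duration), 2))
```

An RTF above 1.0 means generation is slightly slower than playback on the measured setup.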

	## Quality (CLAP score vs instruction)

	Mean CLAP score (laion/clap-htsat-unfused) across 4 edit instructions on a
	MusicGen-generated input clip — output vs the instruction text:

	| Variant | mean CLAP | "Only Drums" | "Only Piano" | "Remove Drums" | "Only Bass" |
	|---|---|---|---|---|---|
	| FP16 | +0.352 | +0.40 | +0.36 | +0.42 | +0.22 |
	| INT4 (this bundle) | +0.311 | +0.45 | +0.17 | +0.40 | +0.21 |
	| INT8 | +0.311 | +0.44 | +0.20 | +0.39 | +0.21 |

	INT4 ≈ INT8 in mean CLAP, both ~12 % below FP16. The "Only X" instructions
	generally produce a positive Δ vs the input clip's CLAP score, i.e. the edit
	moves the audio toward the instruction. "Only Bass" is the hardest case at FP16
	(+0.22), while quantization hits "Only Piano" hardest (+0.36 → +0.17 at INT4,
	+0.20 at INT8).
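
The ~12 % figure can be reproduced from the table. The sketch below recomputes each variant's mean from the per-instruction columns and its relative gap to FP16 from the reported means. The recomputed means differ slightly from the reported ones, presumably because the columns are rounded to two decimals. The helper name `relative_gap` is an illustrative choice.

```python
scores = {
    "FP16": [0.40, 0.36, 0.42, 0.22],
    "INT4": [0.45, 0.17, 0.40, 0.21],
    "INT8": [0.44, 0.20, 0.39, 0.21],
}
reported_mean = {"FP16": 0.352, "INT4": 0.311, "INT8": 0.311}


def relative_gap(value, reference):
    """Fractional shortfall of `value` relative to `reference`."""
    return 1.0 - value / reference


for name, values in scores.items():
    column_mean = sum(values) / len(values)
    gap = relative_gap(reported_mean[name], reported_mean["FP16"])
    print(
        f"{name}: column mean {column_mean:.3f}, "
        f"reported {reported_mean[name]:.3f}, gap to FP16 {gap:.1%}"
    )
```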

	## Usage (sketch)

	```python
	import json
	from pathlib import Path

	from huggingface_hub import snapshot_download

	# Download the bundle (or reuse the local Hugging Face cache).
	bundle = Path(snapshot_download("aufklarer/Instruct-MusicGen-MLX-4bit"))

	# Production loader: https://github.com/soniqo/speech-swift
	# Minimal MLX sketch:
	# 1. Read bundle/config.json (HF MusicGen config + cp_transformer metadata)
	config = json.loads((bundle / "config.json").read_text())
	# 2. Construct InstructMusicGen MLX class
	# 3. Replay quantization on linear projections
	#    (mlx.nn.quantize, group_size=64, bits=4)
	# 4. Load weights from bundle/model.safetensors
	# 5. audio = model.generate(text, input_audio, max_steps=250)
	```

	## Architecture details

	```
	text instruction ── T5-base ── [LoRA-merged] cross-attn ──────┐
	                                                              │
	input audio ─ EnCodec encode ─ CPTransformer ─ prefix Q/K/V ──┤
	                                                              │
	                                       ┌──────────────────────┘
	                                       ▼
	                MusicGen-large LM (48 AR layers, delay pattern)
	                                       │
	                                       ▼
	                         EnCodec decoder → 32 kHz wav
	```

	- CPTransformer: shares the base LM's transformer blocks (norm/self-attn/FFN)
	but adds learned `pos_emb` (49, 501, 2048), `merge_linear[i]` per layer
	(2048 → 2048), and a zero-init `gate[i]` scalar.
	- Prefix injection (per self-attn): second SDPA over the input audio's
	K/V, with `dt_q = prefix_q[step] + main_q`, gated add before `out_proj`:
	`attn = main_attn + dt_attn × gate[i]`.
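
The gated prefix injection described above can be sketched for a single head and a single decoding step. This is a simplified illustration using plain Python lists: it omits multi-head splitting, masking, and `out_proj`, and the function names and toy values are assumptions rather than the port's actual code. With the zero-init gate, the block reproduces plain self-attention, so the adapter starts as a no-op.

```python
import math


def sdpa(query, keys, values):
    """Single-query scaled dot-product attention over lists of vectors."""
    scale = 1.0 / math.sqrt(len(query))
    logits = [scale * sum(q * k for q, k in zip(query, key)) for key in keys]
    peak = max(logits)  # subtract the max for a numerically stable softmax
    weights = [math.exp(x - peak) for x in logits]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) / total for d in range(dim)]


def gated_prefix_attention(main_q, main_k, main_v, prefix_q, prefix_k, prefix_v, gate):
    """attn = main_attn + dt_attn * gate, computed before out_proj."""
    main_attn = sdpa(main_q, main_k, main_v)
    dt_q = [p + m for p, m in zip(prefix_q, main_q)]  # dt_q = prefix_q[step] + main_q
    dt_attn = sdpa(dt_q, prefix_k, prefix_v)          # second SDPA over input-audio K/V
    return [m + gate * d for m, d in zip(main_attn, dt_attn)]


# Toy tensors (hypothetical values): two main positions, one prefix position.
main_q, main_k, main_v = [1.0, 0.0], [[1.0, 0.0], [0.0, 1.0]], [[1.0, 2.0], [3.0, 4.0]]
prefix_q, prefix_k, prefix_v = [0.0, 1.0], [[1.0, 1.0]], [[10.0, 20.0]]

closed = gated_prefix_attention(main_q, main_k, main_v, prefix_q, prefix_k, prefix_v, gate=0.0)
opened = gated_prefix_attention(main_q, main_k, main_v, prefix_q, prefix_k, prefix_v, gate=0.5)
print(closed, opened)
```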

	## Files

	- `model.safetensors` — quantized LM (INT4 affine, group size 64) + adapter weights
	- `config.json` — architecture + quantization + instruct metadata
	- `compression_state_dict.bin` — passthrough of upstream EnCodec for offline init
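
For readers unfamiliar with "INT4 affine, group size 64": each run of 64 weights shares one scale and one bias, and every weight is stored as a 4-bit code. The sketch below shows the generic affine scheme in plain Python. It is not MLX's implementation, whose rounding and edge-case handling may differ, and the function names are illustrative.

```python
def quantize_group(weights, bits=4):
    """Affine-quantize one group: w ~ q * scale + bias with q in [0, 2**bits - 1]."""
    levels = (1 << bits) - 1
    lo, hi = min(weights), max(weights)
    scale = (hi - lo) / levels or 1.0  # fall back to 1.0 for constant groups
    codes = [min(levels, max(0, round((w - lo) / scale))) for w in weights]
    return codes, scale, lo


def dequantize_group(codes, scale, bias):
    """Reconstruct approximate weights from 4-bit codes."""
    return [q * scale + bias for q in codes]


def quantize_row(row, group_size=64, bits=4):
    """Quantize a weight row in consecutive groups of `group_size` values."""
    return [
        quantize_group(row[i:i + group_size], bits)
        for i in range(0, len(row), group_size)
    ]


row = [i / 127 - 0.5 for i in range(128)]  # two groups of 64 toy weights
groups = quantize_row(row)
restored = [w for group in groups for w in dequantize_group(*group)]
max_err = max(abs(a - b) for a, b in zip(row, restored))
print(len(groups), max_err)
```

The reconstruction error per weight is bounded by half a quantization step, which is why smaller groups (tighter min/max ranges) trade extra scale/bias storage for accuracy.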

	## Source

	- Upstream: [ldzhangyx/instruct-MusicGen](https://huggingface.co/ldzhangyx/instruct-MusicGen)
	(CC-BY-NC, re-trained on public datasets)
	- Paper: [Instruct-MusicGen (arxiv 2405.18386)](https://arxiv.org/abs/2405.18386)
	- Base: [facebook/musicgen-large](https://huggingface.co/facebook/musicgen-large)

	## License

	CC-BY-NC 4.0 — inherited from MusicGen + the upstream checkpoint. Non-commercial use only.