---
license: cc-by-nc-4.0
language:
- en
tags:
- mlx
- audio
- music
- text-to-music
- music-editing
- instruct-musicgen
- lora-merged
- quantized
- int4
base_model: facebook/musicgen-large
library_name: mlx
pipeline_tag: text-to-audio
---

# Instruct-MusicGen-MLX-4bit

- [speech-swift](https://github.com/soniqo/speech-swift): Apple SDK
- [soniqo.audio](https://soniqo.audio): website
- [blog](https://soniqo.audio/blog): blog

MLX port of [Instruct-MusicGen](https://arxiv.org/abs/2405.18386) for text-instructed music editing. Built on **MusicGen-large** (3.3B params, 48-layer autoregressive transformer over EnCodec 32 kHz tokens) with cross-attention base weights from the upstream checkpoint, LoRA-merged on Q/V (α/r = 2.0), plus a 48-layer **CPTransformer** adapter that injects the input audio's per-layer Q/K/V via prefix-attention into every self-attention block.

## Inputs / Outputs

- **Input**: text instruction (e.g. `"Music piece. Instruct: Only Drums."`) + input audio (mono float32 @ 32 kHz, ≤ 10 s window)
- **Output**: edited audio (mono float32 @ 32 kHz, matches input length)

## Performance (Apple Silicon, INT4)

| Metric | Value |
|---|---|
| Bundle size | ~2.2 GB on disk |
| RTF (wall / audio) | ~1.21 (for 5 s output @ 250 AR steps) |
| Peak RSS | ~3 GB |

## Quality (CLAP score vs instruction)

Mean CLAP score (laion/clap-htsat-unfused) between the output audio and the *instruction text*, across 4 edit instructions on a MusicGen-generated input clip:

| Variant | mean CLAP | "Only Drums" | "Only Piano" | "Remove Drums" | "Only Bass" |
|---|---|---|---|---|---|
| FP16 | +0.352 | +0.40 | +0.36 | +0.42 | +0.22 |
| INT4 (this bundle) | +0.311 | +0.45 | +0.17 | +0.40 | +0.21 |
| INT8 | +0.311 | +0.44 | +0.20 | +0.39 | +0.21 |

INT4 and INT8 reach the same mean CLAP score (+0.311), about 12 % below FP16 (+0.352). The "Only X" instructions generally produce a positive Δ vs the input clip's CLAP score, i.e. the edit moves the audio toward the instruction. "Only Bass" remains the hardest case.
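
The ~12 % gap quoted above can be re-derived from the mean CLAP column of the table. The snippet below is only an arithmetic check; the dictionary and function names are illustrative and not part of the bundle.

```python
# Mean CLAP scores copied from the Quality table (names are illustrative).
MEAN_CLAP = {"FP16": 0.352, "INT4": 0.311, "INT8": 0.311}


def relative_drop(variant: str, baseline: str = "FP16") -> float:
    """Fractional drop in mean CLAP score relative to the baseline variant."""
    base = MEAN_CLAP[baseline]
    return (base - MEAN_CLAP[variant]) / base


for name in ("INT4", "INT8"):
    # Prints "INT4: 11.6% below FP16" and "INT8: 11.6% below FP16".
    print(f"{name}: {relative_drop(name):.1%} below FP16")
```

Both quantized variants land at roughly 11.6 %, which is where the rounded ~12 % figure comes from.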
## Usage (sketch)

```python
from huggingface_hub import snapshot_download

# Download the bundle and get the local directory path.
bundle = snapshot_download("aufklarer/Instruct-MusicGen-MLX-4bit")

# Production loader: https://github.com/soniqo/speech-swift
# Minimal MLX sketch:
#   1. Read bundle/config.json (HF MusicGen config + cp_transformer metadata)
#   2. Construct InstructMusicGen MLX class
#   3. Replay quantization on linear projections (mlx.nn.quantize, bits=4)
#   4. Load weights from bundle/model.safetensors
#   5. audio = model.generate(text, input_audio, max_steps=250)
```

## Architecture details

```
text instruction ── T5-base ── [LoRA-merged] cross-attn ────────┐
                                                                │
input audio ─ EnCodec encode ─ CPTransformer ─ prefix Q/K/V ──┐ │
                        ┌─────────────────────────────────────┴─┘
                        ▼
 MusicGen-large LM (48 AR layers, delay pattern)
                        │
                        ▼
          EnCodec decoder → 32 kHz wav
```

- **CPTransformer**: shares the base LM's transformer blocks (norm/self-attn/FFN) but adds learned `pos_emb` (49, 501, 2048), `merge_linear[i]` per layer (2048 → 2048), and a zero-init `gate[i]` scalar.
- **Prefix injection** (per self-attn): second SDPA over the input audio's K/V, with `dt_q = prefix_q[step] + main_q`, gated add before `out_proj`: `attn = main_attn + dt_attn × gate[i]`.

## Files

- `model.safetensors`: quantized LM (INT4 affine, group size 64) + adapter weights
- `config.json`: architecture + quantization + instruct metadata
- `compression_state_dict.bin`: passthrough of upstream EnCodec for offline init

## Source

- Upstream: [ldzhangyx/instruct-MusicGen](https://huggingface.co/ldzhangyx/instruct-MusicGen) (CC-BY-NC, re-trained on public datasets)
- Paper: [Instruct-MusicGen (arxiv 2405.18386)](https://arxiv.org/abs/2405.18386)
- Base: [facebook/musicgen-large](https://huggingface.co/facebook/musicgen-large)

## License

**CC-BY-NC 4.0**, inherited from MusicGen and the upstream checkpoint. **Non-commercial use only.**
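
## Appendix: prefix-injection sketch

The gated prefix injection described under Architecture details (`dt_q = prefix_q[step] + main_q`, then `attn = main_attn + dt_attn × gate[i]`) can be illustrated with a dependency-free toy. This is a sketch under stated assumptions, not the bundle's implementation: it covers a single head and a single decoding step, assumes the usual `1/sqrt(d)` scaling for both attention passes, and uses plain Python lists instead of MLX arrays. The helper names `softmax`, `attend`, and `gated_prefix_attention` are hypothetical.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attend(q, keys, values):
    """Single-query scaled dot-product attention (assumed 1/sqrt(d) scale)."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
    weights = softmax(scores)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]


def gated_prefix_attention(main_q, main_k, main_v,
                           prefix_q_step, prefix_k, prefix_v, gate):
    """Base self-attention plus a gated second pass over the input audio's K/V."""
    main_attn = attend(main_q, main_k, main_v)
    # dt_q = prefix_q[step] + main_q
    dt_q = [p + m for p, m in zip(prefix_q_step, main_q)]
    dt_attn = attend(dt_q, prefix_k, prefix_v)
    # attn = main_attn + dt_attn * gate[i]  (applied before out_proj)
    return [a + gate * b for a, b in zip(main_attn, dt_attn)]


# Toy values, chosen only for illustration.
main_q = [1.0, 0.0]
main_k = [[1.0, 0.0], [0.0, 1.0]]
main_v = [[1.0, 2.0], [3.0, 4.0]]
prefix_q_step = [0.5, 0.5]
prefix_k = [[0.0, 1.0], [1.0, 1.0]]
prefix_v = [[10.0, 0.0], [0.0, 10.0]]

baseline = attend(main_q, main_k, main_v)
untrained = gated_prefix_attention(main_q, main_k, main_v,
                                   prefix_q_step, prefix_k, prefix_v, gate=0.0)
# With the zero-init gate, the adapter leaves the base attention unchanged.
assert untrained == baseline
```

The zero-initialized gate means the adapter starts as an identity on the base model's attention output; the input audio only influences generation as the gate moves away from zero.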