---
license: cc-by-nc-4.0
language:
- en
tags:
- mlx
- audio
- music
- text-to-music
- music-editing
- instruct-musicgen
- lora-merged
- quantized
- int4
base_model: facebook/musicgen-large
library_name: mlx
pipeline_tag: text-to-audio
---
# Instruct-MusicGen-MLX-4bit
- [speech-swift](https://github.com/soniqo/speech-swift): Apple SDK
- [soniqo.audio](https://soniqo.audio): website
- [blog](https://soniqo.audio/blog): blog

MLX port of [Instruct-MusicGen](https://arxiv.org/abs/2405.18386) for text-instructed
music editing. Built on **MusicGen-large** (3.3B params, 48-layer autoregressive
transformer over EnCodec 32 kHz tokens) with cross-attention base weights from
the upstream checkpoint, LoRA-merged on Q/V (α/r = 2.0), plus a 48-layer
**CPTransformer** adapter that injects the input audio's per-layer Q/K/V via
prefix-attention into every self-attention block.
## Inputs / Outputs
- **Input**: text instruction (e.g. `"Music piece. Instruct: Only Drums."`) +
  input audio (mono float32 @ 32 kHz, ≤ 10 s window)
- **Output**: edited audio (mono float32 @ 32 kHz, matches input length)
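
The input constraints above can be enforced before calling the model. The following is a minimal sketch, not part of the bundle: `prepare_input` is a hypothetical helper, the `(channels, samples)` layout is an assumption, and resampling is deliberately left out, so audio at any other rate is rejected rather than converted.

```python
import numpy as np

SAMPLE_RATE = 32_000   # from the card: 32 kHz
MAX_SECONDS = 10       # from the card: window of at most 10 s


def prepare_input(audio: np.ndarray, sample_rate: int) -> np.ndarray:
    """Hypothetical helper: downmix to mono float32 and crop to the 10 s window."""
    if sample_rate != SAMPLE_RATE:
        raise ValueError(f"expected {SAMPLE_RATE} Hz audio, got {sample_rate} Hz")
    audio = np.asarray(audio, dtype=np.float32)
    if audio.ndim == 2:
        # Assumed layout: (channels, samples); average the channels to get mono.
        audio = audio.mean(axis=0)
    elif audio.ndim != 1:
        raise ValueError("expected 1-D mono or 2-D (channels, samples) audio")
    # Keep at most the first 10 s.
    return audio[: SAMPLE_RATE * MAX_SECONDS]


stereo = np.zeros((2, 12 * SAMPLE_RATE))   # 12 s of stereo silence
mono = prepare_input(stereo, SAMPLE_RATE)
print(mono.shape, mono.dtype)
```

Cropping to the first 10 s is only one possible policy; a caller may prefer to pick a different window.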
## Performance (Apple Silicon, INT4)
| Metric | Value |
|---|---|
| Bundle size | ~2.2 GB on disk |
| RTF (wall / audio) | ~1.21 (for 5 s output @ 250 AR steps) |
| Peak RSS | ~3 GB |
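
To relate the RTF figure to per-step latency, the arithmetic can be sketched as below. The 50 Hz token rate is an assumption (it is the frame rate commonly quoted for MusicGen's 32 kHz EnCodec and is consistent with 250 steps producing 5 s); the other numbers come from the table.

```python
FRAME_RATE_HZ = 50   # assumption: EnCodec 32 kHz token rate used by MusicGen
ar_steps = 250       # from the table
rtf = 1.21           # from the table: wall time / audio duration

audio_seconds = ar_steps / FRAME_RATE_HZ        # seconds of audio produced
wall_seconds = rtf * audio_seconds              # implied wall-clock time
ms_per_step = 1000 * wall_seconds / ar_steps    # implied latency per AR step

print(f"{audio_seconds:.1f} s audio, {wall_seconds:.2f} s wall, {ms_per_step:.1f} ms/step")
```

Under that assumption, an RTF of about 1.21 corresponds to roughly 6 s of wall time for a 5 s clip, or about 24 ms per autoregressive step.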
## Quality (CLAP score vs instruction)
Mean CLAP score (laion/clap-htsat-unfused) across 4 edit instructions on a
MusicGen-generated input clip (output vs the *instruction text*):
| Variant | mean CLAP | "Only Drums" | "Only Piano" | "Remove Drums" | "Only Bass" |
|---|---|---|---|---|---|
| FP16 | +0.352 | +0.40 | +0.36 | +0.42 | +0.22 |
| INT4 (this bundle) | +0.311 | +0.45 | +0.17 | +0.40 | +0.21 |
| INT8 | +0.311 | +0.44 | +0.20 | +0.39 | +0.21 |

INT4 ≈ INT8 in CLAP, both within ~12 % of FP16. The "Only X" instructions
generally produce a positive Δ vs the input clip's CLAP score, i.e. the edit
moves the audio toward the instruction. "Only Bass" remains the hardest case.
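
The "within ~12 %" figure can be checked from the table. This sketch recomputes the means from the per-instruction columns; since those columns are rounded to two decimals, the recomputed means differ slightly from the reported ones (for example 0.350 rather than 0.352 for FP16).

```python
# Per-instruction CLAP scores copied from the table above, in column order:
# "Only Drums", "Only Piano", "Remove Drums", "Only Bass".
scores = {
    "FP16": [0.40, 0.36, 0.42, 0.22],
    "INT4": [0.45, 0.17, 0.40, 0.21],
    "INT8": [0.44, 0.20, 0.39, 0.21],
}

means = {name: sum(vals) / len(vals) for name, vals in scores.items()}

# Relative drop of each quantized variant against the FP16 mean.
rel_drop = {name: 1 - means[name] / means["FP16"] for name in ("INT4", "INT8")}

for name, drop in rel_drop.items():
    print(f"{name}: mean {means[name]:.3f}, {100 * drop:.1f} % below FP16")
```

Using the reported means instead (0.311 vs 0.352) gives a drop of about 11.6 %, which matches the rounded claim.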
## Usage (sketch)
```python
import json
from pathlib import Path

from huggingface_hub import snapshot_download

# Download the bundle and read its config (HF MusicGen config + cp_transformer metadata).
bundle = Path(snapshot_download("aufklarer/Instruct-MusicGen-MLX-4bit"))
config = json.loads((bundle / "config.json").read_text())

# Production loader: https://github.com/soniqo/speech-swift
# Remaining steps of a minimal MLX sketch (not implemented here):
#   1. Construct the InstructMusicGen MLX class from `config`
#   2. Replay quantization on linear projections (mlx.nn.quantize, group_size=64, bits=4)
#   3. Load weights from bundle/model.safetensors
#   4. audio = model.generate(text, input_audio, max_steps=250)
```
## Architecture details
```
text instruction ── T5-base ── [LoRA-merged] cross-attn ─────────┐
                                                                 │
input audio ── EnCodec encode ── CPTransformer ── prefix Q/K/V ──┤
                                                                 │
                       ┌─────────────────────────────────────────┘
                       ▼
MusicGen-large LM (48 AR layers, delay pattern)
                       │
                       ▼
         EnCodec decoder → 32 kHz wav
```
- **CPTransformer**: shares the base LM's transformer blocks (norm/self-attn/FFN)
  but adds learned `pos_emb` (49, 501, 2048), `merge_linear[i]` per layer
  (2048 → 2048), and a zero-init `gate[i]` scalar.
- **Prefix injection** (per self-attn): second SDPA over the input audio's
  K/V, with `dt_q = prefix_q[step] + main_q`, gated add before `out_proj`:
  `attn = main_attn + dt_attn × gate[i]`.
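
The prefix-injection formula above can be illustrated with a single-head, single-step NumPy sketch. It is not the bundle's implementation: head splitting, `out_proj`, and the real 2048-wide dimensions are omitted, `sdpa` and `gated_prefix_attention` are hypothetical names, and the tensor shapes are assumptions chosen for the example.

```python
import numpy as np


def sdpa(q, k, v):
    """Scaled dot-product attention for q: (Tq, d) and k, v: (Tk, d)."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores -= scores.max(axis=-1, keepdims=True)   # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v


def gated_prefix_attention(main_q, main_k, main_v,
                           prefix_q, prefix_k, prefix_v, gate, step):
    """attn = main_attn + dt_attn * gate, for one decode step of one layer."""
    main_attn = sdpa(main_q, main_k, main_v)    # ordinary self-attention
    dt_q = prefix_q[step] + main_q              # query shifted by the prefix query
    dt_attn = sdpa(dt_q, prefix_k, prefix_v)    # second SDPA over the input audio's K/V
    return main_attn + dt_attn * gate


rng = np.random.default_rng(0)
d, t_main, t_prefix, step = 8, 5, 7, 3
main_q = rng.standard_normal((1, d))                  # query for the current step
main_k, main_v = rng.standard_normal((2, t_main, d))  # cached keys/values so far
prefix_q, prefix_k, prefix_v = rng.standard_normal((3, t_prefix, d))

closed = gated_prefix_attention(main_q, main_k, main_v,
                                prefix_q, prefix_k, prefix_v, gate=0.0, step=step)
opened = gated_prefix_attention(main_q, main_k, main_v,
                                prefix_q, prefix_k, prefix_v, gate=1.0, step=step)
print(np.allclose(closed, sdpa(main_q, main_k, main_v)))
```

With the gate at its zero initial value the layer reduces to plain self-attention, so the adapter starts as a no-op and only contributes once the gate has been trained away from zero.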
## Files
- `model.safetensors`: quantized LM (INT4 affine, group size 64) + adapter weights
- `config.json`: architecture + quantization + instruct metadata
- `compression_state_dict.bin`: passthrough of upstream EnCodec for offline init
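
To make "INT4 affine, group size 64" concrete, here is a NumPy sketch of per-group affine quantization under the usual min/max formulation. It illustrates the idea only and is not MLX's kernel: `quantize_affine` and `dequantize_affine` are hypothetical names, and bit packing is skipped (each 4-bit code is stored in a `uint8`).

```python
import numpy as np


def quantize_affine(w, group_size=64, bits=4):
    """Per-group affine quantization so that w is approximately q * scale + bias."""
    levels = 2**bits - 1                        # 15 for 4-bit codes
    groups = w.reshape(-1, group_size)          # consecutive weights share a group
    w_min = groups.min(axis=1, keepdims=True)
    w_max = groups.max(axis=1, keepdims=True)
    scale = (w_max - w_min) / levels
    scale = np.where(scale == 0, 1.0, scale)    # guard against constant groups
    q = np.clip(np.round((groups - w_min) / scale), 0, levels).astype(np.uint8)
    return q, scale, w_min                      # w_min acts as the per-group bias


def dequantize_affine(q, scale, bias, shape):
    """Reconstruct approximate weights from codes, scales, and biases."""
    return (q * scale + bias).reshape(shape)


rng = np.random.default_rng(0)
w = rng.standard_normal((8, 128))               # toy weight matrix, 16 groups of 64
q, scale, bias = quantize_affine(w)
w_hat = dequantize_affine(q, scale, bias, w.shape)
print("max abs error:", float(np.max(np.abs(w - w_hat))))
```

Each group stores one scale and one bias next to its 64 codes. If both are kept as 16-bit values (an assumption about the storage format), that adds 32 / 64 = 0.5 bit per weight, for roughly 4.5 bits per quantized weight.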
## Source
- Upstream: [ldzhangyx/instruct-MusicGen](https://huggingface.co/ldzhangyx/instruct-MusicGen)
(CC-BY-NC, re-trained on public datasets)
- Paper: [Instruct-MusicGen (arxiv 2405.18386)](https://arxiv.org/abs/2405.18386)
- Base: [facebook/musicgen-large](https://huggingface.co/facebook/musicgen-large)
## License
**CC-BY-NC 4.0**, inherited from MusicGen and the upstream checkpoint. **Non-commercial use only.**