aufklarer/Instruct-MusicGen-MLX-4bit

@@ -24,47 +24,91 @@ pipeline_tag: text-to-audio
 - [blog](https://soniqo.audio/blog) — blog
 MLX port of [Instruct-MusicGen](https://arxiv.org/abs/2405.18386) — text-instructed
-music editing. Built on **MusicGen-large** (3.3B) with cross-attention base weights
-from the upstream checkpoint, LoRA-merged (q,v at α/r=2.0), plus a 48-layer
-CPTransformer adapter that injects input-audio Q/K/V via prefix-attention into
-every self-attention layer.
-## Usage
 ```python
 from huggingface_hub import snapshot_download
 bundle = snapshot_download("aufklarer/Instruct-MusicGen-MLX-4bit")
-# Then: model.generate(text="Music piece. Instruct: Only Drums.", audio=<32kHz wav>)
-# See https://github.com/soniqo/speech-swift for production loader.
 ```
-## Model
-| | |
-|---|---|
-| Base | facebook/musicgen-large (3.3B) |
-| Quantization | INT4 weight-only (group 64) |
-| Sample rate | 32 kHz mono |
-| Max input window | 10 s (500 EnCodec frames @ 50 Hz) |
-| Adapter | CPTransformer over 48 layers, ~264 M extra params |
-| Cross-attn LoRA | r=32, α=64 → scale 2.0 (q,v projections) |
-| Inputs | text instruction + input audio @ 32 kHz |
-| Output | edited audio @ 32 kHz |
-| Bundle size | 2501 MB |
-## Performance (Apple Silicon, 5 s audio)
-| Metric | Value |
-|---|---|
-| RTF | 1.21 |
 ## Source
-- Upstream checkpoint: [ldzhangyx/instruct-MusicGen](https://huggingface.co/ldzhangyx/instruct-MusicGen)
 - Paper: [Instruct-MusicGen (arxiv 2405.18386)](https://arxiv.org/abs/2405.18386)
-- Base architecture: [facebook/musicgen-large](https://huggingface.co/facebook/musicgen-large)
 ## License
-**CC-BY-NC 4.0** — inherited from MusicGen-large + the ldzhangyx checkpoint.
-Non-commercial use only.

 - [blog](https://soniqo.audio/blog) — blog
 MLX port of [Instruct-MusicGen](https://arxiv.org/abs/2405.18386) — text-instructed
+music editing. Built on **MusicGen-large** (3.3B params, 48-layer autoregressive
+transformer over EnCodec 32 kHz tokens) with cross-attention base weights from
+the upstream checkpoint, LoRA-merged on Q/V (α/r = 2.0), plus a 48-layer
+**CPTransformer** adapter that injects the input audio's per-layer Q/K/V via
+prefix-attention into every self-attention block.
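As an illustration of what "LoRA-merged on Q/V (α/r = 2.0)" means, the following is a minimal NumPy sketch of folding a low-rank update into a projection weight. The function name `merge_lora` and the matrix layout (`w` as out × in, `a` as r × in, `b` as out × r) are assumptions for illustration, not the conversion script used for this bundle; r = 32 and α = 64 come from the previous revision of the card.

```python
import numpy as np

def merge_lora(w: np.ndarray, a: np.ndarray, b: np.ndarray,
               alpha: float, r: int) -> np.ndarray:
    """Fold a LoRA update into a dense weight: W' = W + (alpha / r) * B @ A.

    Shapes (assumed layout): w (out, in), a (r, in), b (out, r).
    """
    scale = alpha / r  # 64 / 32 = 2.0 for this model
    return w + scale * (b @ a)

# Toy example with small dimensions (the real projections are 2048-wide).
rng = np.random.default_rng(0)
w = rng.standard_normal((8, 8)).astype(np.float32)
a = rng.standard_normal((2, 8)).astype(np.float32)
b = rng.standard_normal((8, 2)).astype(np.float32)
merged = merge_lora(w, a, b, alpha=64, r=32)
```

After merging, the LoRA matrices are no longer needed at inference time, which is why the bundle ships only merged cross-attention weights.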
+## Inputs / Outputs
+- **Input**: text instruction (e.g. `"Music piece. Instruct: Only Drums."`) +
+  input audio (mono float32 @ 32 kHz, ≤ 10 s window)
+- **Output**: edited audio (mono float32 @ 32 kHz, matches input length)
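The input constraints above can be sketched as a small preprocessing helper. This is a hedged example, not the production loader: `prepare_input` and `num_frames` are hypothetical names, the audio is assumed to be already resampled to 32 kHz, and the 50 Hz EnCodec frame rate is taken from the previous revision of the card (500 frames for 10 s).

```python
import numpy as np

SAMPLE_RATE = 32_000  # from the card: mono float32 @ 32 kHz
MAX_SECONDS = 10      # from the card: input window of at most 10 s
FRAME_RATE = 50       # EnCodec frames per second (500 frames per 10 s)

def prepare_input(audio: np.ndarray) -> np.ndarray:
    """Downmix to mono, cast to float32, and truncate to the 10 s window.

    `audio` is assumed to be sampled at 32 kHz already, shaped either
    (samples,) or (channels, samples).
    """
    audio = np.asarray(audio, dtype=np.float32)
    if audio.ndim == 2:
        audio = audio.mean(axis=0)  # average channels into mono
    return audio[: SAMPLE_RATE * MAX_SECONDS]

def num_frames(audio: np.ndarray) -> int:
    """Number of EnCodec frames covering `audio` (ceiling division)."""
    hop = SAMPLE_RATE // FRAME_RATE  # 640 samples per frame
    return -(-len(audio) // hop)
```

A 12 s stereo clip, for example, comes out as 320,000 mono samples, which corresponds to the 500-frame maximum.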
+## Performance (Apple Silicon, INT4)
+| Metric | Value |
+|---|---|
+| Bundle size | ~2.2 GB on disk |
+| RTF (wall / audio) | ~1.21 (for 5 s output @ 250 AR steps) |
+| Peak RSS | ~3 GB |
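To make the RTF row concrete, here is a short arithmetic sketch. It assumes the 50 Hz EnCodec frame rate (one autoregressive step per frame) and treats RTF as wall-clock time divided by audio duration, as the table header states; the helper names are illustrative only.

```python
FRAME_RATE_HZ = 50  # EnCodec frames (AR steps) per second of audio

def ar_steps(audio_seconds: float) -> int:
    """Autoregressive steps needed for a clip of the given length."""
    return round(audio_seconds * FRAME_RATE_HZ)

def wall_time(audio_seconds: float, rtf: float) -> float:
    """Expected wall-clock seconds, given RTF = wall / audio."""
    return audio_seconds * rtf

steps = ar_steps(5.0)            # 250 steps for 5 s of output
seconds = wall_time(5.0, 1.21)   # about 6.05 s of wall-clock time
```

An RTF above 1 means generation is slower than real time: at ~1.21, a 5 s edit takes roughly 6 s.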
+## Quality (CLAP score vs instruction)
+Mean CLAP similarity (laion/clap-htsat-unfused) between each edited output and
+its *instruction text*, over 4 edit instructions applied to a MusicGen-generated
+input clip:
+| Variant | mean CLAP | "Only Drums" | "Only Piano" | "Remove Drums" | "Only Bass" |
+|---|---|---|---|---|---|
+| FP16 | +0.352 | +0.40 | +0.36 | +0.42 | +0.22 |
+| INT4 (this bundle) | +0.311 | +0.45 | +0.17 | +0.40 | +0.21 |
+| INT8 | +0.311 | +0.44 | +0.20 | +0.39 | +0.21 |
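+INT4 and INT8 reach the same mean CLAP score (+0.311), about 12 % below FP16
+(+0.352). The "Only X" instructions generally produce a positive Δ vs the input
+clip's CLAP score, i.e. the edit moves the audio toward the instruction.
+"Only Bass" is the lowest-scoring instruction at FP16 (+0.22); under
+quantization, "Only Piano" drops the most (+0.36 → +0.17 at INT4).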
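The comparison between variants can be reproduced from the table with a few lines of Python. The numbers are copied from the table above; the helper names are illustrative. Because the per-instruction values are rounded to two decimals, the gap is computed from the reported means rather than re-averaged.

```python
reported_mean = {"FP16": 0.352, "INT4": 0.311, "INT8": 0.311}

per_instruction = {
    "FP16": {"Only Drums": 0.40, "Only Piano": 0.36, "Remove Drums": 0.42, "Only Bass": 0.22},
    "INT4": {"Only Drums": 0.45, "Only Piano": 0.17, "Remove Drums": 0.40, "Only Bass": 0.21},
    "INT8": {"Only Drums": 0.44, "Only Piano": 0.20, "Remove Drums": 0.39, "Only Bass": 0.21},
}

def relative_gap(variant: str, baseline: str = "FP16") -> float:
    """Fractional drop in mean CLAP score relative to the baseline."""
    base = reported_mean[baseline]
    return (base - reported_mean[variant]) / base

def weakest_instruction(variant: str) -> str:
    """Instruction with the lowest CLAP score for a given variant."""
    scores = per_instruction[variant]
    return min(scores, key=scores.get)

gap_int4 = relative_gap("INT4")  # roughly 0.116, i.e. about 12 %
```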
+## Usage (sketch)
 ```python
 from huggingface_hub import snapshot_download
 bundle = snapshot_download("aufklarer/Instruct-MusicGen-MLX-4bit")
+# Production loader: https://github.com/soniqo/speech-swift
+# Minimal MLX sketch:
+#   1. Read bundle/config.json (HF MusicGen config + cp_transformer metadata)
+#   2. Construct InstructMusicGen MLX class
+#   3. Replay quantization on linear projections (mlx.nn.quantize, group_size=64, bits=4)
+#   4. Load weights from bundle/model.safetensors
+#   5. audio = model.generate(text, input_audio, max_steps=250)
 ```
+## Architecture details
+```
+text instruction ── T5-base ── [LoRA-merged] cross-attn ──┐
+                                                          │
+input audio ─ EnCodec encode ─ CPTransformer ─ prefix Q/K/V ──┐
+                                                              │
+                                       ┌──────────────────────┘
+                                       ▼
+              MusicGen-large LM (48 AR layers, delay pattern)
+                                       │
+                                       ▼
+                              EnCodec decoder → 32 kHz wav
+```
+- **CPTransformer**: shares the base LM's transformer blocks (norm/self-attn/FFN)
+  but adds learned `pos_emb` (49, 501, 2048), `merge_linear[i]` per layer
+  (2048 → 2048), and a zero-init `gate[i]` scalar.
+- **Prefix injection** (per self-attn): second SDPA over the input audio's
+  K/V, with `dt_q = prefix_q[step] + main_q`, gated add before `out_proj`:
+  `attn = main_attn + dt_attn × gate[i]`.
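The prefix-injection formula above can be sketched in NumPy for a single head and a single decoding step. This is an illustrative reading of the two bullets, not the bundle's MLX code: the function names, the single-head layout, and the tensor shapes are assumptions, and the real model runs this per layer with its own `gate[i]`.

```python
import numpy as np

def sdpa(q: np.ndarray, k: np.ndarray, v: np.ndarray) -> np.ndarray:
    """Scaled dot-product attention for one head: q (n, d), k/v (m, d)."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v

def prefix_attention_step(main_q, main_k, main_v,
                          prefix_q, prefix_k, prefix_v,
                          gate: float, step: int) -> np.ndarray:
    """attn = main_attn + dt_attn * gate, computed before out_proj.

    main_q: (1, d) query for the current step; main_k/main_v: (t, d) cache.
    prefix_q/prefix_k/prefix_v: (T, d) projections of the input audio.
    """
    main_attn = sdpa(main_q, main_k, main_v)
    dt_q = prefix_q[step] + main_q            # query shifted by the prefix query
    dt_attn = sdpa(dt_q, prefix_k, prefix_v)  # second SDPA over input-audio K/V
    return main_attn + dt_attn * gate
```

With a zero-initialized gate the second term vanishes, so the layer starts out identical to the base MusicGen self-attention and the adapter's influence is learned gradually.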
+## Files
+- `model.safetensors` — quantized LM (INT4 affine, group size 64) + adapter weights
+- `config.json` — architecture + quantization + instruct metadata
+- `compression_state_dict.bin` — passthrough of upstream EnCodec for offline init
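To check what `model.safetensors` contains without loading any weights, the header can be read with the standard library alone. This sketch relies only on the safetensors file layout (an 8-byte little-endian header length followed by a JSON header); the function name is illustrative, and the tensor names in this particular bundle are not assumed here.

```python
import json
import struct

def read_safetensors_header(path: str) -> dict:
    """Return {tensor_name: {"dtype", "shape", "data_offsets"}} for a file."""
    with open(path, "rb") as f:
        (header_len,) = struct.unpack("<Q", f.read(8))  # little-endian u64
        header = json.loads(f.read(header_len))
    header.pop("__metadata__", None)  # optional free-form metadata entry
    return header

# Usage, assuming `bundle` is the directory returned by snapshot_download:
# for name, info in read_safetensors_header(f"{bundle}/model.safetensors").items():
#     print(name, info["dtype"], info["shape"])
```

Listing dtypes and shapes this way is a quick way to see which tensors were quantized and which adapter weights were kept at higher precision.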
 ## Source
+- Upstream: [ldzhangyx/instruct-MusicGen](https://huggingface.co/ldzhangyx/instruct-MusicGen)
+  (CC-BY-NC, re-trained on public datasets)
 - Paper: [Instruct-MusicGen (arxiv 2405.18386)](https://arxiv.org/abs/2405.18386)
+- Base: [facebook/musicgen-large](https://huggingface.co/facebook/musicgen-large)
 ## License
+**CC-BY-NC 4.0** — inherited from MusicGen + the upstream checkpoint. **Non-commercial use only.**