Committed by aufklarer
Commit 2ab25fe · verified · 1 Parent(s): ad7ffc9

enrich model card: CLAP scores, RTF, architecture diagram

  1. README.md +72 -28
README.md CHANGED
@@ -24,47 +24,91 @@ pipeline_tag: text-to-audio
  - [blog](https://soniqo.audio/blog) - blog

  MLX port of [Instruct-MusicGen](https://arxiv.org/abs/2405.18386) - text-instructed
- music editing. Built on **MusicGen-large** (3.3B) with cross-attention base weights
- from the upstream checkpoint, LoRA-merged (q,v at α/r=2.0), plus a 48-layer
- CPTransformer adapter that injects input-audio Q/K/V via prefix-attention into
- every self-attention layer.

- ## Usage

  ```python
  from huggingface_hub import snapshot_download
  bundle = snapshot_download("aufklarer/Instruct-MusicGen-MLX-4bit")
- # Then: model.generate(text="Music piece. Instruct: Only Drums.", audio=<32kHz wav>)
- # See https://github.com/soniqo/speech-swift for production loader.
  ```

- ## Model

- | | |
- |---|---|
- | Base | facebook/musicgen-large (3.3B) |
- | Quantization | INT4 weight-only (group 64) |
- | Sample rate | 32 kHz mono |
- | Max input window | 10 s (500 EnCodec frames @ 50 Hz) |
- | Adapter | CPTransformer over 48 layers, ~264 M extra params |
- | Cross-attn LoRA | r=32, α=64 → scale 2.0 (q,v projections) |
- | Inputs | text instruction + input audio @ 32 kHz |
- | Output | edited audio @ 32 kHz |
- | Bundle size | 2501 MB |
-
- ## Performance (Apple Silicon, 5 s audio)

- | Metric | Value |
- |---|---|
- | RTF | 1.21 |

  ## Source

- - Upstream checkpoint: [ldzhangyx/instruct-MusicGen](https://huggingface.co/ldzhangyx/instruct-MusicGen)
  - Paper: [Instruct-MusicGen (arxiv 2405.18386)](https://arxiv.org/abs/2405.18386)
- - Base architecture: [facebook/musicgen-large](https://huggingface.co/facebook/musicgen-large)

  ## License

- **CC-BY-NC 4.0** - inherited from MusicGen-large + the ldzhangyx checkpoint.
- Non-commercial use only.

  - [blog](https://soniqo.audio/blog) - blog

  MLX port of [Instruct-MusicGen](https://arxiv.org/abs/2405.18386) - text-instructed
+ music editing. Built on **MusicGen-large** (3.3B params, 48-layer autoregressive
+ transformer over EnCodec 32 kHz tokens) with cross-attention base weights from
+ the upstream checkpoint, LoRA-merged on Q/V (α/r = 2.0), plus a 48-layer
+ **CPTransformer** adapter that injects the input audio's per-layer Q/K/V via
+ prefix-attention into every self-attention block.

+ ## Inputs / Outputs
+
+ - **Input**: text instruction (e.g. `"Music piece. Instruct: Only Drums."`) +
+ input audio (mono float32 @ 32 kHz, ≤ 10 s window)
+ - **Output**: edited audio (mono float32 @ 32 kHz, matches input length)
+
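Reviewer sketch of the input window constraint described above. It assumes the 50 Hz EnCodec frame rate quoted in the previous revision of the card (10 s = 500 frames) and that a trailing partial frame is padded to a full one. The helper name `input_budget` is hypothetical and is not part of the bundle or the production loader.

```python
SAMPLE_RATE_HZ = 32_000  # from the card: mono float32 @ 32 kHz
FRAME_RATE_HZ = 50       # assumption: frame rate quoted in the earlier card revision
MAX_WINDOW_S = 10        # from the card: input window of at most 10 s


def input_budget(num_samples: int) -> tuple[int, int]:
    """Return (samples kept, EnCodec frames) after cropping to the 10 s window."""
    if num_samples < 0:
        raise ValueError("num_samples must be non-negative")
    kept = min(num_samples, MAX_WINDOW_S * SAMPLE_RATE_HZ)
    samples_per_frame = SAMPLE_RATE_HZ // FRAME_RATE_HZ  # 640 samples per frame
    # Ceiling division: assume a trailing partial frame is padded, so it still counts.
    frames = -(-kept // samples_per_frame)
    return kept, frames
```

Under these assumptions a 5 s clip (160,000 samples) maps to 250 frames, and anything longer than 10 s is cropped to 320,000 samples and 500 frames.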
+ ## Performance (Apple Silicon, INT4)
+
+ | Metric | Value |
+ |---|---|
+ | Bundle size | ~2.2 GB on disk |
+ | RTF (wall / audio) | ~1.21 (for 5 s output @ 250 AR steps) |
+ | Peak RSS | ~3 GB |
+
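Reviewer sketch of how the RTF row reads, using the card's own definition (wall time divided by audio duration). The 50 Hz step rate is an assumption inferred from "5 s output @ 250 AR steps", and the function names are hypothetical.

```python
FRAME_RATE_HZ = 50  # assumption: 250 AR steps for 5 s of output implies 50 steps per second


def audio_seconds(ar_steps: int, frame_rate_hz: int = FRAME_RATE_HZ) -> float:
    """Duration of generated audio for a given number of autoregressive steps."""
    return ar_steps / frame_rate_hz


def real_time_factor(wall_seconds: float, audio_s: float) -> float:
    """RTF as defined in the table: wall-clock time divided by audio duration."""
    return wall_seconds / audio_s


def expected_wall_seconds(rtf: float, audio_s: float) -> float:
    """Invert the RTF definition to estimate wall-clock time."""
    return rtf * audio_s
```

With the reported value, `expected_wall_seconds(1.21, audio_seconds(250))` comes to roughly 6 s of wall time for 5 s of audio, so generation runs slightly slower than real time.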
+ ## Quality (CLAP score vs instruction)
+
+ Mean CLAP score (laion/clap-htsat-unfused) across 4 edit instructions on a
+ MusicGen-generated input clip - output vs the *instruction text*:
+
+ | Variant | mean CLAP | "Only Drums" | "Only Piano" | "Remove Drums" | "Only Bass" |
+ |---|---|---|---|---|---|
+ | FP16 | +0.352 | +0.40 | +0.36 | +0.42 | +0.22 |
+ | INT4 (this bundle) | +0.311 | +0.45 | +0.17 | +0.40 | +0.21 |
+ | INT8 | +0.311 | +0.44 | +0.20 | +0.39 | +0.21 |
+
+ INT4 ≈ INT8 in CLAP, both within ~12 % of FP16. The "Only X" instructions
+ generally produce a positive Δ vs the input clip's CLAP score, i.e. the edit
+ moves the audio toward the instruction. "Only Bass" remains the hardest case.
+
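Reviewer sketch that re-derives the "~12 %" figure from the table values. The numbers are copied from the table; the per-instruction scores are rounded to two decimals there, so the recomputed means only approximate the reported ones. The names `scores`, `reported_mean`, and `relative_drop` are illustrative.

```python
# Per-instruction CLAP scores as printed in the table
# (Only Drums, Only Piano, Remove Drums, Only Bass).
scores = {
    "FP16": [0.40, 0.36, 0.42, 0.22],
    "INT4": [0.45, 0.17, 0.40, 0.21],
    "INT8": [0.44, 0.20, 0.39, 0.21],
}
reported_mean = {"FP16": 0.352, "INT4": 0.311, "INT8": 0.311}


def mean(values):
    """Arithmetic mean of a non-empty list."""
    return sum(values) / len(values)


def relative_drop(baseline, value):
    """Fractional drop of `value` relative to `baseline`."""
    return (baseline - value) / baseline


int4_drop = relative_drop(reported_mean["FP16"], reported_mean["INT4"])
recomputed = {name: mean(vals) for name, vals in scores.items()}
```

Here `int4_drop` is about 0.116, which matches the "within ~12 %" statement, and each recomputed mean lands within 0.005 of the reported one.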
+ ## Usage (sketch)

  ```python
  from huggingface_hub import snapshot_download
  bundle = snapshot_download("aufklarer/Instruct-MusicGen-MLX-4bit")
+
+ # Production loader: https://github.com/soniqo/speech-swift
+ # Minimal MLX sketch:
+ # 1. Read bundle/config.json (HF MusicGen config + cp_transformer metadata)
+ # 2. Construct InstructMusicGen MLX class
+ # 3. Replay quantization on linear projections (mlx.nn.quantize, bits=4)
+ # 4. Load weights from bundle/model.safetensors
+ # 5. audio = model.generate(text, input_audio, max_steps=250)
  ```

+ ## Architecture details

+ ```
+ text instruction ── T5-base ── [LoRA-merged] cross-attn ──────┐
+                                                               │
+ input audio ─ EnCodec encode ─ CPTransformer ─ prefix Q/K/V ──┤
+                                                               │
+                         ┌─────────────────────────────────────┘
+                         ▼
+  MusicGen-large LM (48 AR layers, delay pattern)
+                         │
+                         ▼
+           EnCodec decoder → 32 kHz wav
+ ```

+ - **CPTransformer**: shares the base LM's transformer blocks (norm/self-attn/FFN)
+ but adds learned `pos_emb` (49, 501, 2048), `merge_linear[i]` per layer
+ (2048 → 2048), and a zero-init `gate[i]` scalar.
+ - **Prefix injection** (per self-attn): second SDPA over the input audio's
+ K/V, with `dt_q = prefix_q[step] + main_q`, gated add before `out_proj`:
+ `attn = main_attn + dt_attn × gate[i]`.
+
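Reviewer sketch of the gated prefix injection formula above, written in plain Python for a single head and a single decoding step. It is only an illustration of `attn = main_attn + dt_attn × gate[i]`, not the bundle's MLX implementation; the function names, argument layout, and the use of one query vector at a time are assumptions.

```python
import math


def sdpa(q, keys, values):
    """Single-query scaled dot-product attention over plain lists of floats."""
    scale = 1.0 / math.sqrt(len(q))
    logits = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
    peak = max(logits)  # subtract the max for a numerically stable softmax
    weights = [math.exp(x - peak) for x in logits]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) / total for d in range(dim)]


def gated_prefix_attention(main_q, main_k, main_v, prefix_q, prefix_k, prefix_v, gate):
    """Combine regular self-attention with a second, gated pass over the prefix K/V.

    `prefix_q` stands for `prefix_q[step]`, the adapter's query for the current step.
    """
    main_attn = sdpa(main_q, main_k, main_v)
    dt_q = [p + m for p, m in zip(prefix_q, main_q)]  # dt_q = prefix_q[step] + main_q
    dt_attn = sdpa(dt_q, prefix_k, prefix_v)          # second SDPA over input-audio K/V
    return [m + gate * d for m, d in zip(main_attn, dt_attn)]
```

With `gate = 0.0` (the zero-init value) the result equals plain self-attention, so under this reading the adapter leaves the base model's output unchanged until the gate is trained away from zero.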
+ ## Files
+
+ - `model.safetensors`: quantized LM (INT4 affine, group size 64) + adapter weights
+ - `config.json`: architecture + quantization + instruct metadata
+ - `compression_state_dict.bin`: passthrough of upstream EnCodec for offline init
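Reviewer sketch of a sanity check on a downloaded bundle, based only on the three file names listed above. The helper `missing_bundle_files` is hypothetical; it takes the directory path returned by `snapshot_download` and reports which of the listed files are absent.

```python
from pathlib import Path

# File names taken from the "Files" list in the card.
EXPECTED_FILES = ("model.safetensors", "config.json", "compression_state_dict.bin")


def missing_bundle_files(bundle_dir):
    """Return the expected file names that are not present in `bundle_dir`."""
    root = Path(bundle_dir)
    return [name for name in EXPECTED_FILES if not (root / name).is_file()]
```

An empty list means all three files are in place; a caller could raise or re-download otherwise.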

  ## Source

+ - Upstream: [ldzhangyx/instruct-MusicGen](https://huggingface.co/ldzhangyx/instruct-MusicGen)
+ (CC-BY-NC, re-trained on public datasets)
  - Paper: [Instruct-MusicGen (arxiv 2405.18386)](https://arxiv.org/abs/2405.18386)
+ - Base: [facebook/musicgen-large](https://huggingface.co/facebook/musicgen-large)

  ## License

+ **CC-BY-NC 4.0** - inherited from MusicGen + the upstream checkpoint. **Non-commercial use only.**