Text-to-Audio
MLX
Safetensors
English
musicgen
audio
music
text-to-music
music-editing
instruct-musicgen
lora-merged
quantized
int4
How to use aufklarer/Instruct-MusicGen-MLX-4bit with MLX:
# Download the model from the Hub
pip install huggingface_hub[hf_xet]
huggingface-cli download --local-dir Instruct-MusicGen-MLX-4bit aufklarer/Instruct-MusicGen-MLX-4bit
enrich model card: CLAP scores, RTF, architecture diagram
README.md
CHANGED
@@ -24,47 +24,91 @@ pipeline_tag: text-to-audio
 - [blog](https://soniqo.audio/blog) - blog

 MLX port of [Instruct-MusicGen](https://arxiv.org/abs/2405.18386) - text-instructed
-music editing. Built on **MusicGen-large** (3.3B
-
-
-

-##

 ```python
 from huggingface_hub import snapshot_download
 bundle = snapshot_download("aufklarer/Instruct-MusicGen-MLX-4bit")
-
-#
 ```

-##

-
-
-
-
-
-
-
-
-
-
-
-
-## Performance (Apple Silicon, 5 s audio)

-
-
-

 ## Source

-- Upstream
 - Paper: [Instruct-MusicGen (arxiv 2405.18386)](https://arxiv.org/abs/2405.18386)
-- Base

 ## License

-**CC-BY-NC 4.0** - inherited from MusicGen
-Non-commercial use only.
 - [blog](https://soniqo.audio/blog) - blog

 MLX port of [Instruct-MusicGen](https://arxiv.org/abs/2405.18386) - text-instructed
+music editing. Built on **MusicGen-large** (3.3B params, 48-layer autoregressive
+transformer over EnCodec 32 kHz tokens) with cross-attention base weights from
+the upstream checkpoint, LoRA-merged on Q/V (α/r = 2.0), plus a 48-layer
+**CPTransformer** adapter that injects the input audio's per-layer Q/K/V via
+prefix-attention into every self-attention block.
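
To illustrate what "LoRA-merged on Q/V (α/r = 2.0)" could mean in practice, here is a minimal pure-Python sketch. It assumes the standard LoRA formulation, where the low-rank update `B @ A` is scaled by α/r and folded into the base weight. The helper names `matmul` and `merge_lora` and the toy matrices are hypothetical; only the scale 2.0 comes from the card.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def merge_lora(weight, lora_b, lora_a, scale=2.0):
    """Return weight + scale * (lora_b @ lora_a), i.e. the merged projection."""
    delta = matmul(lora_b, lora_a)
    return [[w + scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(weight, delta)]


# Toy example: 2x2 base weight with a rank-1 adapter (B is 2x1, A is 1x2).
W = [[1.0, 0.0], [0.0, 1.0]]
B = [[0.5], [1.0]]
A = [[1.0, 2.0]]
merged = merge_lora(W, B, A)
```

After merging, the adapter matrices are no longer needed at inference time, which is presumably why the bundle ships a single set of Q/V weights.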

+## Inputs / Outputs
+
+- **Input**: text instruction (e.g. `"Music piece. Instruct: Only Drums."`) +
+  input audio (mono float32 @ 32 kHz, ≤ 10 s window)
+- **Output**: edited audio (mono float32 @ 32 kHz, matches input length)
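
The input contract above (mono float32 at 32 kHz, at most 10 s) can be enforced before calling the model. The following is a sketch only: `prepare_input` is a hypothetical helper, and cropping to the window (rather than rejecting longer clips) is an assumption, since the card does not say how over-long input is handled. Resampling is left out on purpose.

```python
from array import array

SAMPLE_RATE = 32_000   # Hz, from the card
MAX_SECONDS = 10       # window limit, from the card


def prepare_input(channels, sample_rate):
    """Downmix to mono float32 and crop to the 10 s window.

    `channels` is a list of equal-length sample sequences, one per channel.
    """
    if sample_rate != SAMPLE_RATE:
        raise ValueError(
            f"expected {SAMPLE_RATE} Hz audio, got {sample_rate} Hz; resample first"
        )
    if not channels:
        raise ValueError("no channels given")
    n = min(len(channels[0]), SAMPLE_RATE * MAX_SECONDS)
    # Average the channels sample by sample, then store as 32-bit floats.
    mono = (sum(ch[i] for ch in channels) / len(channels) for i in range(n))
    return array("f", mono)
```

A stereo clip passed as two lists comes back as a single `array("f")` of at most 320,000 samples.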
+
+## Performance (Apple Silicon, INT4)
+
+| Metric | Value |
+|---|---|
+| Bundle size | ~2.2 GB on disk |
+| RTF (wall / audio) | ~1.21 (for 5 s output @ 250 AR steps) |
+| Peak RSS | ~3 GB |
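
The RTF row can be unpacked with a little arithmetic. This sketch uses only the three numbers from the table; the variable names are illustrative, and the per-step figure is an average over the whole run, so it also absorbs any non-LM work such as text encoding and EnCodec encode/decode.

```python
audio_seconds = 5.0   # output length, from the table
ar_steps = 250        # autoregressive steps, from the table
rtf = 1.21            # wall-clock time / audio duration, from the table

steps_per_second = ar_steps / audio_seconds    # token frames per second of audio
wall_seconds = rtf * audio_seconds             # roughly 6 s of compute for 5 s of audio
ms_per_step = 1000 * wall_seconds / ar_steps   # average wall time per AR step

print(f"{steps_per_second:.0f} steps/s, {wall_seconds:.2f} s wall, {ms_per_step:.1f} ms/step")
```

An RTF above 1 means generation is slightly slower than real time; the 50 steps per second of audio is consistent with the 50 Hz frame rate of the 32 kHz EnCodec model used by MusicGen.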
+
+## Quality (CLAP score vs instruction)
+
+Mean CLAP score (laion/clap-htsat-unfused) across 4 edit instructions on a
+MusicGen-generated input clip - output vs the *instruction text*:
+
+| Variant | mean CLAP | "Only Drums" | "Only Piano" | "Remove Drums" | "Only Bass" |
+|---|---|---|---|---|---|
+| FP16 | +0.352 | +0.40 | +0.36 | +0.42 | +0.22 |
+| INT4 (this bundle) | +0.311 | +0.45 | +0.17 | +0.40 | +0.21 |
+| INT8 | +0.311 | +0.44 | +0.20 | +0.39 | +0.21 |
+
+INT4 ≈ INT8 in CLAP, both within ~12 % of FP16. The "Only X" instructions
+generally produce a positive Δ vs the input clip's CLAP score - i.e. the edit
+moves the audio toward the instruction. "Only Bass" remains the hardest case.
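
The "within ~12 %" figure follows directly from the reported means. A short check, using only values from the table (variable names are illustrative):

```python
fp16_mean = 0.352
quantized_mean = 0.311   # reported for both INT4 and INT8

# Relative drop of the quantized variants against the FP16 baseline.
relative_gap = (fp16_mean - quantized_mean) / fp16_mean
print(f"{relative_gap:.1%}")   # prints 11.6%

# Averaging the rounded per-instruction cells gives nearly the same means;
# small differences from the reported values come from two-decimal rounding.
rows = {
    "FP16": [0.40, 0.36, 0.42, 0.22],
    "INT4": [0.45, 0.17, 0.40, 0.21],
    "INT8": [0.44, 0.20, 0.39, 0.21],
}
row_means = {name: sum(vals) / len(vals) for name, vals in rows.items()}
```

Note that the mean hides a per-instruction spread: in the table, the quantized variants score higher than FP16 on "Only Drums" and noticeably lower on "Only Piano".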
+
+## Usage (sketch)

 ```python
 from huggingface_hub import snapshot_download
 bundle = snapshot_download("aufklarer/Instruct-MusicGen-MLX-4bit")
+
+# Production loader: https://github.com/soniqo/speech-swift
+# Minimal MLX sketch:
+# 1. Read bundle/config.json (HF MusicGen config + cp_transformer metadata)
+# 2. Construct InstructMusicGen MLX class
+# 3. Replay quantization on linear projections (mlx.nn.quantize, bits=4)
+# 4. Load weights from bundle/model.safetensors
+# 5. audio = model.generate(text, input_audio, max_steps=250)
 ```
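
Step 1 of the sketch above can be made concrete without any MLX code. The helper below is hypothetical (`inspect_bundle` is not part of the bundle or of any library): it checks that a downloaded directory contains the three files named in the card's Files section and parses `config.json`. The keys inside the config are not documented in the card, so none are assumed here.

```python
import json
from pathlib import Path

# File names as listed in the card's "Files" section.
EXPECTED_FILES = ("model.safetensors", "config.json", "compression_state_dict.bin")


def inspect_bundle(bundle_dir):
    """Return (config dict, list of missing expected files) for a downloaded bundle."""
    root = Path(bundle_dir)
    missing = [name for name in EXPECTED_FILES if not (root / name).is_file()]
    config = {}
    if "config.json" not in missing:
        config = json.loads((root / "config.json").read_text(encoding="utf-8"))
    return config, missing
```

With the `bundle` path returned by `snapshot_download`, a call such as `config, missing = inspect_bundle(bundle)` gives a quick sanity check before constructing the model.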

+## Architecture details

+```
+text instruction ─→ T5-base ─→ [LoRA-merged] cross-attn ──────┐
+                                                              │
+input audio → EnCodec encode → CPTransformer → prefix Q/K/V ──┤
+                                                              │
+                                       ┌──────────────────────┘
+                                       ▼
+                MusicGen-large LM (48 AR layers, delay pattern)
+                                       │
+                                       ▼
+                         EnCodec decoder → 32 kHz wav
+```

+- **CPTransformer**: shares the base LM's transformer blocks (norm/self-attn/FFN)
+  but adds learned `pos_emb` (49, 501, 2048), `merge_linear[i]` per layer
+  (2048 → 2048), and a zero-init `gate[i]` scalar.
+- **Prefix injection** (per self-attn): second SDPA over the input audio's
+  K/V, with `dt_q = prefix_q[step] + main_q`, gated add before `out_proj`:
+  `attn = main_attn + dt_attn × gate[i]`.
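
The prefix-injection formula above can be sketched for a single head and a single query position. This is an illustration under simplifying assumptions, not the bundle's implementation: it ignores multi-head splitting, masking, and batching, and the function names are hypothetical. Only the two formulas, `dt_q = prefix_q[step] + main_q` and `attn = main_attn + dt_attn × gate[i]`, come from the card.

```python
import math


def sdpa(q, keys, values):
    """Scaled dot-product attention for one query vector."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
    peak = max(scores)                              # subtract the max for stability
    weights = [math.exp(s - peak) for s in scores]
    total = sum(weights)
    return [sum(w * v[d] for w, v in zip(weights, values)) / total
            for d in range(len(values[0]))]


def gated_prefix_attention(main_q, main_k, main_v, prefix_q, prefix_k, prefix_v, gate):
    """Combine the usual self-attention with a gated second pass over the prefix."""
    main_attn = sdpa(main_q, main_k, main_v)
    dt_q = [p + m for p, m in zip(prefix_q, main_q)]   # dt_q = prefix_q[step] + main_q
    dt_attn = sdpa(dt_q, prefix_k, prefix_v)           # second SDPA over the input audio's K/V
    return [m + gate * d for m, d in zip(main_attn, dt_attn)]
```

With `gate = 0.0` the result equals plain self-attention, which is what a zero-initialised gate gives at the start of adapter training; a non-zero gate mixes in information from the input audio.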
+
+## Files
+
+- `model.safetensors` - quantized LM (INT4 affine, group size 64) + adapter weights
+- `config.json` - architecture + quantization + instruct metadata
+- `compression_state_dict.bin` - passthrough of upstream EnCodec for offline init
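
To make "INT4 affine, group size 64" concrete, here is a minimal sketch of affine quantization for one group of weights. It illustrates the general technique (each weight is approximated as `scale * q + bias` with a 4-bit code `q`), and is an assumption about the scheme rather than a reproduction of MLX's kernels or packing format. The function names are hypothetical.

```python
def quantize_group(weights, bits=4):
    """Affine-quantize one group: w ~ scale * q + bias, with q in [0, 2**bits - 1]."""
    levels = (1 << bits) - 1
    lo, hi = min(weights), max(weights)
    scale = (hi - lo) / levels or 1.0   # fall back to 1.0 for a constant group
    codes = [round((w - lo) / scale) for w in weights]
    return codes, scale, lo


def dequantize_group(codes, scale, bias):
    """Reconstruct approximate weights from codes plus the group's scale and bias."""
    return [scale * q + bias for q in codes]


GROUP_SIZE = 64
group = [i / (GROUP_SIZE - 1) - 0.5 for i in range(GROUP_SIZE)]   # toy weights in [-0.5, 0.5]
codes, scale, bias = quantize_group(group)
restored = dequantize_group(codes, scale, bias)
max_error = max(abs(a - b) for a, b in zip(group, restored))
```

Each group of 64 weights is stored as 64 small integer codes plus one scale and one bias, and the reconstruction error per weight is bounded by half a quantization step.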

 ## Source

+- Upstream: [ldzhangyx/instruct-MusicGen](https://huggingface.co/ldzhangyx/instruct-MusicGen)
+  (CC-BY-NC, re-trained on public datasets)
 - Paper: [Instruct-MusicGen (arxiv 2405.18386)](https://arxiv.org/abs/2405.18386)
+- Base: [facebook/musicgen-large](https://huggingface.co/facebook/musicgen-large)

 ## License

+**CC-BY-NC 4.0** - inherited from MusicGen + the upstream checkpoint. **Non-commercial use only.**