---
license: cc-by-nc-4.0
language:
- en
tags:
- mlx
- audio
- music
- text-to-music
- music-editing
- instruct-musicgen
- lora-merged
- quantized
- int4
base_model: facebook/musicgen-large
library_name: mlx
pipeline_tag: text-to-audio
---

# Instruct-MusicGen-MLX-4bit

- [speech-swift](https://github.com/soniqo/speech-swift): Apple SDK
- [soniqo.audio](https://soniqo.audio): website
- [blog](https://soniqo.audio/blog): blog

MLX port of [Instruct-MusicGen](https://arxiv.org/abs/2405.18386) for text-instructed
music editing. Built on **MusicGen-large** (3.3B params, 48-layer autoregressive
transformer over EnCodec 32 kHz tokens) with cross-attention base weights from
the upstream checkpoint, LoRA-merged on Q/V (α/r = 2.0), plus a 48-layer
**CPTransformer** adapter that injects the input audio's per-layer Q/K/V via
prefix attention into every self-attention block.
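
"LoRA-merged" means the low-rank update has been folded into the base projection weights, so no separate adapter runs at inference. The following is a minimal NumPy sketch of that fold, assuming the standard LoRA formulation W' = W + (α/r)·B·A. Only the scale α/r = 2.0 comes from this card; the function name, the toy shapes, and the rank are illustrative assumptions, not values from the upstream checkpoint.

```python
import numpy as np

def merge_lora(w, lora_a, lora_b, scale=2.0):
    """Fold a LoRA update into a base weight: W' = W + scale * (B @ A).

    w:      (out_dim, in_dim) base projection weight (e.g. a Q or V projection)
    lora_a: (rank, in_dim)    LoRA down-projection
    lora_b: (out_dim, rank)   LoRA up-projection
    scale:  alpha / r (2.0 per this card)
    """
    return w + scale * (lora_b @ lora_a)

# Toy shapes for illustration only (hypothetical hidden size 64, rank 4).
rng = np.random.default_rng(0)
w = rng.standard_normal((64, 64))
a = rng.standard_normal((4, 64))
b = rng.standard_normal((64, 4))
w_merged = merge_lora(w, a, b)

# Base + adapter applied separately gives the same output as the merged weight.
x = rng.standard_normal((3, 64))
y_adapter = x @ w.T + 2.0 * (x @ a.T) @ b.T
y_merged = x @ w_merged.T
print(np.allclose(y_adapter, y_merged))
```

Because the merge happens before quantization, the INT4 weights in this bundle already contain the LoRA contribution.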

## Inputs / Outputs

- **Input**: text instruction (e.g. `"Music piece. Instruct: Only Drums."`) +
  input audio (mono float32 @ 32 kHz, ≤ 10 s window)
- **Output**: edited audio (mono float32 @ 32 kHz, matches input length)
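
The input constraints above (mono, float32, 32 kHz, at most 10 s) can be enforced before calling the model. This is a hedged NumPy sketch: `prepare_input_audio` is a hypothetical helper, the `(num_samples, channels)` layout is an assumption, and resampling is deliberately left out, so audio at another rate is rejected rather than converted.

```python
import numpy as np

SAMPLE_RATE = 32_000  # Hz, per the card
MAX_SECONDS = 10      # upper bound on the input window, per the card

def prepare_input_audio(samples, sample_rate):
    """Return a mono float32 array of at most MAX_SECONDS at 32 kHz.

    samples: (num_samples,) or (num_samples, channels) array (assumed layout).
    """
    if sample_rate != SAMPLE_RATE:
        raise ValueError(f"expected {SAMPLE_RATE} Hz audio, got {sample_rate} Hz")
    audio = np.asarray(samples)
    if np.issubdtype(audio.dtype, np.integer):
        # Scale integer PCM into [-1, 1) using the full-scale value of its dtype.
        full_scale = float(np.iinfo(audio.dtype).max) + 1.0
        audio = audio.astype(np.float32) / full_scale
    else:
        audio = audio.astype(np.float32)
    if audio.ndim == 2:
        audio = audio.mean(axis=1)  # downmix channels to mono
    # Keep only the first 10 s window.
    return audio[: SAMPLE_RATE * MAX_SECONDS]

# Example: 12 s of stereo int16 silence becomes 10 s of mono float32.
stereo = np.zeros((SAMPLE_RATE * 12, 2), dtype=np.int16)
clip = prepare_input_audio(stereo, SAMPLE_RATE)
print(clip.shape, clip.dtype)
```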

## Performance (Apple Silicon, INT4)

| Metric | Value |
|---|---|
| Bundle size | ~2.2 GB on disk |
| RTF (wall / audio) | ~1.21 (for 5 s output @ 250 AR steps) |
| Peak RSS | ~3 GB |
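
To make the RTF row concrete, the arithmetic below derives wall-clock time and an average per-step cost from the table's figures. It is only a back-of-the-envelope calculation: the per-step number averages over everything included in the measured wall time, which this card does not break down.

```python
rtf = 1.21           # wall-clock seconds per second of audio (from the table)
audio_seconds = 5.0  # output length used for the measurement
ar_steps = 250       # autoregressive steps for that output

wall_seconds = rtf * audio_seconds                  # total wall-clock time
steps_per_audio_second = ar_steps / audio_seconds   # token frames per second of audio
ms_per_step = 1000.0 * wall_seconds / ar_steps      # average wall time per AR step

print(f"wall time        ~ {wall_seconds:.2f} s")
print(f"steps per second of audio: {steps_per_audio_second:.0f}")
print(f"average per step ~ {ms_per_step:.1f} ms")
```

An RTF above 1 means generation is slightly slower than real time: about 6 s of compute for 5 s of audio.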

## Quality (CLAP score vs instruction)

Mean CLAP score (laion/clap-htsat-unfused) across 4 edit instructions on a
MusicGen-generated input clip, comparing the output against the *instruction text*:

| Variant | mean CLAP | "Only Drums" | "Only Piano" | "Remove Drums" | "Only Bass" |
|---|---|---|---|---|---|
| FP16 | +0.352 | +0.40 | +0.36 | +0.42 | +0.22 |
| INT4 (this bundle) | +0.311 | +0.45 | +0.17 | +0.40 | +0.21 |
| INT8 | +0.311 | +0.44 | +0.20 | +0.39 | +0.21 |

INT4 and INT8 reach about the same mean CLAP score, both within ~12 % of FP16.
The "Only X" instructions generally produce a positive Δ vs the input clip's
CLAP score, i.e. the edit moves the audio toward the instruction. "Only Bass"
remains the hardest case.
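
The "~12 %" figure can be checked directly from the table. The snippet below recomputes each row's mean from the rounded per-instruction scores and the relative drop of each variant against FP16. It uses only numbers printed above, and the dictionary layout is purely illustrative.

```python
scores = {
    "FP16": {"mean": 0.352, "per_instruction": [0.40, 0.36, 0.42, 0.22]},
    "INT4": {"mean": 0.311, "per_instruction": [0.45, 0.17, 0.40, 0.21]},
    "INT8": {"mean": 0.311, "per_instruction": [0.44, 0.20, 0.39, 0.21]},
}

fp16_mean = scores["FP16"]["mean"]
summary = {}
for name, row in scores.items():
    # Mean recomputed from the two-decimal columns; small deviations from the
    # reported mean are expected because the columns are rounded.
    recomputed = sum(row["per_instruction"]) / len(row["per_instruction"])
    rel_drop = (fp16_mean - row["mean"]) / fp16_mean
    summary[name] = {"recomputed_mean": recomputed, "rel_drop_vs_fp16": rel_drop}
    print(f"{name}: recomputed mean {recomputed:.4f}, drop vs FP16 {100 * rel_drop:.1f} %")
```

Note that the drop is not uniform across instructions: in the table, most of the INT4 gap comes from "Only Piano", while "Only Drums" scores slightly higher than FP16.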

## Usage (sketch)

```python
from huggingface_hub import snapshot_download

# Download the bundle from the Hub (cached locally after the first call).
bundle = snapshot_download("aufklarer/Instruct-MusicGen-MLX-4bit")

# Production loader: https://github.com/soniqo/speech-swift
# Minimal MLX sketch:
# 1. Read bundle/config.json (HF MusicGen config + cp_transformer metadata)
# 2. Construct InstructMusicGen MLX class
# 3. Replay quantization on linear projections (mlx.nn.quantize, bits=4, group_size=64)
# 4. Load weights from bundle/model.safetensors
# 5. audio = model.generate(text, input_audio, max_steps=250)
```
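
Steps 1 and 4 can be explored without MLX by reading the bundle's metadata. The sketch below parses `config.json` and the JSON header of `model.safetensors`; the safetensors format stores an 8-byte little-endian header length followed by a JSON header. `inspect_bundle` and `read_safetensors_header` are hypothetical helper names, and no particular keys in `config.json` are assumed.

```python
import json
import struct
from pathlib import Path

def read_safetensors_header(path):
    """Return the JSON header of a .safetensors file (tensor name -> dtype/shape/offsets)."""
    with open(path, "rb") as f:
        (header_len,) = struct.unpack("<Q", f.read(8))  # 8-byte little-endian length
        return json.loads(f.read(header_len))

def inspect_bundle(bundle_dir):
    """Load config.json and list tensor metadata without reading any weight data."""
    bundle_dir = Path(bundle_dir)
    config = json.loads((bundle_dir / "config.json").read_text())
    header = read_safetensors_header(bundle_dir / "model.safetensors")
    # "__metadata__" is an optional free-form entry, not a tensor.
    tensors = {name: meta for name, meta in header.items() if name != "__metadata__"}
    return config, tensors
```

With the `bundle` path returned by `snapshot_download`, `config, tensors = inspect_bundle(bundle)` gives the configuration dictionary and a mapping of tensor names to their dtype and shape, which is a quick way to see which projections are stored in quantized form.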

## Architecture details

```
text instruction ─→ T5-base ─→ [LoRA-merged] cross-attn ──────┐
                                                              │
input audio → EnCodec encode → CPTransformer → prefix Q/K/V ──┤
                                                              │
                                       ┌──────────────────────┘
                                       ▼
                MusicGen-large LM (48 AR layers, delay pattern)
                                       │
                                       ▼
                         EnCodec decoder → 32 kHz wav
```

- **CPTransformer**: shares the base LM's transformer blocks (norm/self-attn/FFN)
  but adds learned `pos_emb` (49, 501, 2048), `merge_linear[i]` per layer
  (2048 → 2048), and a zero-init `gate[i]` scalar.
- **Prefix injection** (per self-attn): second SDPA over the input audio's
  K/V, with `dt_q = prefix_q[step] + main_q`, gated add before `out_proj`:
  `attn = main_attn + dt_attn × gate[i]`.
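
The prefix-injection formula above can be sketched for a single decoding step in NumPy. This is an illustration under simplifying assumptions, not the bundle's implementation: it uses one attention head, no masking, and toy dimensions, and all function and argument names are hypothetical. The CPTransformer that produces `prefix_q`, `prefix_k`, and `prefix_v` is not shown.

```python
import numpy as np

def sdpa(q, k, v):
    """Scaled dot-product attention for one head. q: (Tq, d), k and v: (Tk, d)."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v

def gated_prefix_attention(main_q, main_k, main_v, prefix_q_step, prefix_k, prefix_v, gate):
    """attn = main_attn + dt_attn * gate, computed before the output projection."""
    main_attn = sdpa(main_q, main_k, main_v)   # regular self-attention
    dt_q = prefix_q_step + main_q              # dt_q = prefix_q[step] + main_q
    dt_attn = sdpa(dt_q, prefix_k, prefix_v)   # second SDPA over the input audio's K/V
    return main_attn + dt_attn * gate

rng = np.random.default_rng(0)
d, t, p = 8, 5, 7                              # toy head dim, cached steps, prefix length
main_q = rng.standard_normal((1, d))           # query for the current step
main_k = rng.standard_normal((t, d))
main_v = rng.standard_normal((t, d))
prefix_q = rng.standard_normal((p, d))
prefix_k = rng.standard_normal((p, d))
prefix_v = rng.standard_normal((p, d))
step = 3

out_zero_gate = gated_prefix_attention(main_q, main_k, main_v, prefix_q[step : step + 1],
                                       prefix_k, prefix_v, gate=0.0)
out_open_gate = gated_prefix_attention(main_q, main_k, main_v, prefix_q[step : step + 1],
                                       prefix_k, prefix_v, gate=1.0)
print(np.allclose(out_zero_gate, sdpa(main_q, main_k, main_v)))
```

The zero-initialised gate is the key design choice: with `gate = 0` the block reproduces plain self-attention exactly, so the adapter starts from the unmodified MusicGen behaviour and only departs from it as the gate is learned.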

## Files

- `model.safetensors`: quantized LM (INT4 affine, group size 64) + adapter weights
- `config.json`: architecture + quantization + instruct metadata
- `compression_state_dict.bin`: passthrough of upstream EnCodec for offline init
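
"INT4 affine, group size 64" means each run of 64 weights shares one scale and one offset, and each weight is stored as a 4-bit integer. The NumPy sketch below illustrates the idea. It is not MLX's packing or exact rounding scheme, and the function names are hypothetical. The bits-per-weight estimate at the end assumes a 16-bit scale and a 16-bit bias per group, which is an assumption about storage rather than something this card states.

```python
import numpy as np

def quantize_affine(w, bits=4, group_size=64):
    """Per-group affine quantization: w ~ q * scale + bias, q in [0, 2**bits - 1]."""
    levels = 2 ** bits - 1
    groups = w.reshape(-1, group_size)
    w_min = groups.min(axis=1, keepdims=True)
    w_max = groups.max(axis=1, keepdims=True)
    scale = (w_max - w_min) / levels
    scale = np.where(scale == 0, 1.0, scale)  # guard groups with constant values
    q = np.clip(np.round((groups - w_min) / scale), 0, levels).astype(np.uint8)
    return q, scale, w_min

def dequantize_affine(q, scale, bias, shape):
    """Reconstruct approximate weights from integer codes, scales, and biases."""
    return (q * scale + bias).reshape(shape)

rng = np.random.default_rng(0)
w = rng.standard_normal((8, 128)).astype(np.float32)
q, scale, bias = quantize_affine(w)
w_hat = dequantize_affine(q, scale, bias, w.shape)
max_err = float(np.abs(w - w_hat).max())

# Storage estimate under the 16-bit scale/bias assumption.
bits_per_weight = 4 + 2 * 16 / 64
print(f"max abs error {max_err:.4f}, about {bits_per_weight} bits per weight")
```

Under that assumption, each weight costs about 4.5 bits, so the quantized projections take a little over half a byte per parameter. The rest of the ~2.2 GB bundle is accounted for by whatever is stored unquantized, plus the adapter weights and the EnCodec state dict listed above.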

## Source

- Upstream: [ldzhangyx/instruct-MusicGen](https://huggingface.co/ldzhangyx/instruct-MusicGen)
  (CC-BY-NC, re-trained on public datasets)
- Paper: [Instruct-MusicGen (arxiv 2405.18386)](https://arxiv.org/abs/2405.18386)
- Base: [facebook/musicgen-large](https://huggingface.co/facebook/musicgen-large)

## License

**CC-BY-NC 4.0**, inherited from MusicGen and the upstream checkpoint. **Non-commercial use only.**