metadata
license: cc-by-nc-4.0
language:
  - en
tags:
  - mlx
  - audio
  - music
  - text-to-music
  - music-editing
  - instruct-musicgen
  - lora-merged
  - quantized
  - int4
base_model: facebook/musicgen-large
library_name: mlx
pipeline_tag: text-to-audio

Instruct-MusicGen-MLX-4bit

MLX port of Instruct-MusicGen, a model for text-instructed music editing. It is built on MusicGen-large (3.3B params, 48-layer autoregressive transformer over EnCodec 32 kHz tokens), with cross-attention base weights taken from the upstream checkpoint and LoRA-merged on Q/V (α/r = 2.0). A 48-layer CPTransformer adapter injects the input audio's per-layer Q/K/V into every self-attention block via prefix-attention.

Inputs / Outputs

  • Input: text instruction (e.g. "Music piece. Instruct: Only Drums.") + input audio (mono float32 @ 32 kHz, ≤ 10 s window)
  • Output: edited audio (mono float32 @ 32 kHz, matches input length)
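
The input contract above (mono, float32, 32 kHz, at most 10 s) can be enforced before calling the model. The following is a minimal sketch using NumPy only. `prepare_input` is a hypothetical helper, not part of the bundle or of any loader; it assumes the samples are already floating point in [-1, 1], and the linear-interpolation resampler is a crude stand-in for a proper band-limited resampler.

```python
import numpy as np

TARGET_SR = 32_000   # sample rate stated in the model card
MAX_SECONDS = 10     # upper bound on the input window


def prepare_input(audio: np.ndarray, sr: int) -> np.ndarray:
    """Return mono float32 audio at 32 kHz, truncated to at most 10 s.

    `audio` has shape (samples,) or (samples, channels).
    Hypothetical helper for illustration only.
    """
    x = np.asarray(audio, dtype=np.float32)
    if x.ndim == 2:
        # Downmix by averaging the channels.
        x = x.mean(axis=1)
    if sr != TARGET_SR and len(x) > 0:
        # Crude linear-interpolation resampling (assumption: good enough
        # for a sketch; a real pipeline would use a band-limited resampler).
        n_out = int(round(len(x) * TARGET_SR / sr))
        t_in = np.arange(len(x)) / sr
        t_out = np.arange(n_out) / TARGET_SR
        x = np.interp(t_out, t_in, x)
    # Keep only the first 10 s window.
    x = x[: TARGET_SR * MAX_SECONDS]
    return x.astype(np.float32)
```

Truncating to the first 10 s is one possible policy; choosing a different window of the clip would work equally well under the same contract.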

Performance (Apple Silicon, INT4)

Metric                Value
Bundle size           ~2.2 GB on disk
RTF (wall / audio)    ~1.21 (for 5 s output @ 250 AR steps)
Peak RSS              ~3 GB
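
To make the RTF figure concrete, the arithmetic below converts it into wall-clock time and an average cost per autoregressive step. It uses only the numbers reported above; `rtf_breakdown` is a hypothetical helper name. Since the measured wall time may also include audio encoding and decoding, the per-step value is best read as an upper bound on the average AR step time.

```python
def rtf_breakdown(rtf: float, audio_seconds: float, ar_steps: int) -> dict:
    """Derive wall time and average per-step cost from a real-time factor.

    RTF is defined here as wall-clock seconds per second of output audio.
    """
    wall_seconds = rtf * audio_seconds
    return {
        "wall_seconds": wall_seconds,
        "ms_per_step": 1000.0 * wall_seconds / ar_steps,
        "steps_per_audio_second": ar_steps / audio_seconds,
    }


# Numbers from the table: RTF ~1.21 for 5 s of output at 250 AR steps.
stats = rtf_breakdown(rtf=1.21, audio_seconds=5.0, ar_steps=250)
print(stats)  # roughly 6.05 s wall time, 24.2 ms per step, 50 steps per audio second
```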

Quality (CLAP score vs instruction)

Mean CLAP score (laion/clap-htsat-unfused) across 4 edit instructions on a MusicGen-generated input clip, scoring the output against the instruction text:

Variant               mean CLAP   "Only Drums"   "Only Piano"   "Remove Drums"   "Only Bass"
FP16                  +0.352      +0.40          +0.36          +0.42            +0.22
INT4 (this bundle)    +0.311      +0.45          +0.17          +0.40            +0.21
INT8                  +0.311      +0.44          +0.20          +0.39            +0.21

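INT4 ≈ INT8 in mean CLAP (+0.311 each), both within ~12 % of FP16 (+0.352). The "Only X" instructions generally produce a positive Δ vs the input clip's CLAP score, i.e. the edit moves the audio toward the instruction. "Only Bass" remains the hardest case at FP16 (+0.22); under quantization, "Only Piano" drops the furthest (+0.36 at FP16 to +0.17 at INT4).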
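
The "~12 %" figure and the per-instruction picture can be reproduced from the table with a few lines of arithmetic. This is a sketch over the reported (rounded) values only; the helper names are hypothetical. Note that averaging the rounded per-instruction scores gives slightly different means than the reported ones, so the relative gap is computed from the reported means.

```python
INSTRUCTIONS = ["Only Drums", "Only Piano", "Remove Drums", "Only Bass"]

# Per-instruction CLAP scores, copied from the table above.
scores = {
    "FP16": [0.40, 0.36, 0.42, 0.22],
    "INT4": [0.45, 0.17, 0.40, 0.21],
    "INT8": [0.44, 0.20, 0.39, 0.21],
}
# Mean CLAP as reported in the table.
reported_mean = {"FP16": 0.352, "INT4": 0.311, "INT8": 0.311}


def relative_gap(variant: str, baseline: str = "FP16") -> float:
    """Fractional drop in mean CLAP relative to the baseline."""
    base = reported_mean[baseline]
    return (base - reported_mean[variant]) / base


def per_instruction_delta(variant: str, baseline: str = "FP16") -> dict:
    """Score difference (variant minus baseline) for each instruction."""
    return {
        name: round(q - b, 2)
        for name, q, b in zip(INSTRUCTIONS, scores[variant], scores[baseline])
    }


print(f"INT4 gap vs FP16: {relative_gap('INT4'):.1%}")
print(per_instruction_delta("INT4"))
```

The per-instruction deltas show that the mean hides an uneven effect: most instructions move by a few hundredths, while "Only Piano" accounts for most of the gap.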

Usage (sketch)

from huggingface_hub import snapshot_download
bundle = snapshot_download("aufklarer/Instruct-MusicGen-MLX-4bit")

# Production loader: https://github.com/soniqo/speech-swift
# Minimal MLX sketch:
#   1. Read bundle/config.json (HF MusicGen config + cp_transformer metadata)
#   2. Construct InstructMusicGen MLX class
#   3. Replay quantization on linear projections (mlx.nn.quantize, bits=4)
#   4. Load weights from bundle/model.safetensors
#   5. audio = model.generate(text, input_audio, max_steps=250)
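
Before writing a loader along the lines of steps 1 and 4, it can help to inspect the bundle without loading any tensors. The sketch below relies only on the standard library and on the safetensors file layout (an 8-byte little-endian header length followed by a JSON header). `read_safetensors_header` and `summarize_bundle` are hypothetical helper names, and nothing here assumes specific keys inside `config.json`.

```python
import json
import struct
from pathlib import Path


def read_safetensors_header(path) -> dict:
    """Return the JSON header of a .safetensors file without reading tensor data."""
    with open(path, "rb") as f:
        # First 8 bytes: unsigned little-endian length of the JSON header.
        (header_len,) = struct.unpack("<Q", f.read(8))
        return json.loads(f.read(header_len).decode("utf-8"))


def summarize_bundle(bundle_dir) -> dict:
    """Summarize config keys and tensor dtypes for a downloaded bundle."""
    bundle = Path(bundle_dir)
    config = json.loads((bundle / "config.json").read_text())
    header = read_safetensors_header(bundle / "model.safetensors")
    # "__metadata__" is an optional free-form entry, not a tensor.
    tensors = {k: v for k, v in header.items() if k != "__metadata__"}
    dtype_counts = {}
    for info in tensors.values():
        dtype_counts[info["dtype"]] = dtype_counts.get(info["dtype"], 0) + 1
    return {
        "config_keys": sorted(config),
        "num_tensors": len(tensors),
        "dtype_counts": dtype_counts,
    }
```

Calling `summarize_bundle(bundle)` on the path returned by `snapshot_download` gives a quick check that the download is complete and shows which tensor names a loader would need to map.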

Architecture details

text instruction ── T5-base ── [LoRA-merged] cross-attn ──┐
                                                          │
input audio ─ EnCodec encode ─ CPTransformer ─ prefix Q/K/V ──┐
                                                              │
                                       ┌──────────────────────┘
                                       ▼
              MusicGen-large LM (48 AR layers, delay pattern)
                                       │
                                       ▼
                              EnCodec decoder → 32 kHz wav

  • CPTransformer: shares the base LM's transformer blocks (norm/self-attn/FFN) but adds learned pos_emb (49, 501, 2048), merge_linear[i] per layer (2048 → 2048), and a zero-init gate[i] scalar.
  • Prefix injection (per self-attn): second SDPA over the input audio's K/V, with dt_q = prefix_q[step] + main_q, gated add before out_proj: attn = main_attn + dt_attn × gate[i].
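
The prefix-injection formula above can be sketched for a single decoding step in NumPy. This illustrates the gated add only and is not the bundle's implementation: the tensor layout (heads, time, head_dim), the function names, and the omission of masking, the KV cache, and out_proj are all assumptions made for brevity.

```python
import numpy as np


def sdpa(q, k, v):
    """Scaled dot-product attention. q: (H, Tq, D); k, v: (H, Tk, D)."""
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(q.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v


def gated_prefix_attention(main_q, main_k, main_v,
                           prefix_q_step, prefix_k, prefix_v, gate):
    """One decoding step of self-attention with a gated prefix branch.

    main_q:        (H, 1, D) query for the current step
    main_k/main_v: (H, T, D) keys/values of the tokens generated so far
    prefix_q_step: (H, 1, D) the input audio's query at this step
    prefix_k/v:    (H, P, D) the input audio's keys/values
    gate:          scalar, zero-initialized in the adapter
    """
    main_attn = sdpa(main_q, main_k, main_v)
    dt_q = prefix_q_step + main_q             # dt_q = prefix_q[step] + main_q
    dt_attn = sdpa(dt_q, prefix_k, prefix_v)  # second SDPA over the prefix K/V
    return main_attn + dt_attn * gate         # gated add, before out_proj
```

With `gate = 0`, the result equals plain self-attention, which is why a zero-initialized gate leaves the base LM's behavior unchanged at the start of adapter training.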

Files

  • model.safetensors: quantized LM (INT4 affine, group size 64) + adapter weights
  • config.json: architecture + quantization + instruct metadata
  • compression_state_dict.bin: passthrough of upstream EnCodec for offline init
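
"INT4 affine, group size 64" means each run of 64 weights shares one scale and one offset, and each weight is stored as a 4-bit integer. The NumPy sketch below shows the general technique. It is not MLX's kernel or packing format, and the rounding and min/max conventions used here are assumptions; the function names are hypothetical.

```python
import numpy as np


def quantize_affine(w: np.ndarray, group_size: int = 64, bits: int = 4):
    """Group-wise affine quantization: w ≈ q * scale + bias per group."""
    levels = 2**bits - 1                      # 15 for 4-bit
    groups = w.reshape(-1, group_size)
    w_min = groups.min(axis=1, keepdims=True)
    w_max = groups.max(axis=1, keepdims=True)
    scale = (w_max - w_min) / levels
    scale = np.where(scale == 0, 1.0, scale)  # avoid division by zero for flat groups
    q = np.clip(np.round((groups - w_min) / scale), 0, levels).astype(np.uint8)
    return q, scale, w_min                    # w_min acts as the per-group bias


def dequantize_affine(q, scale, bias, shape):
    """Reconstruct approximate weights from integer codes."""
    return (q * scale + bias).reshape(shape)


# Round-trip a random weight matrix whose size is a multiple of 64.
rng = np.random.default_rng(0)
w = rng.normal(size=(8, 128)).astype(np.float32)
q, scale, bias = quantize_affine(w)
w_hat = dequantize_affine(q, scale, bias, w.shape)
max_err = float(np.abs(w - w_hat).max())
```

In this scheme the reconstruction error per weight is bounded by half the group's scale, so smaller groups trade extra scale/bias storage for lower error.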

Source

License

CC-BY-NC 4.0, inherited from MusicGen and the upstream checkpoint. Non-commercial use only.