clap-music-coreml / README.md
metadata
license: apache-2.0
tags:
  - audio
  - audio-embedding
  - clap
  - laion-clap
  - coreml
  - on-device
library_name: coreml
base_model: laion/larger_clap_music

LAION-CLAP (Music) → Core ML

On-device audio-embedding model for Apple Silicon Macs. Converted from laion/larger_clap_music (HTSAT-base audio encoder + audio projection) to a self-contained Core ML .mlpackage, int8-quantized.

Used by Gridshift for sample similarity search — "find samples that sound like this kick" — and (in a later phase) text-to-sample retrieval.

Input / output contract

audio:       fp32 tensor [1, 480000]   10 s mono @ 48 kHz, peak-normalized to [-1, 1]
embedding:   fp32 tensor [1, 512]      L2-normalized, cosine = dot product
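
Because the output is L2-normalized, ranking samples by similarity needs only dot products. The following is a minimal sketch in plain Python, not code from Gridshift: the function names `l2_normalize`, `cosine_similarity`, and `rank_by_similarity` are hypothetical, and the embeddings are assumed to be already available as lists of floats.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit Euclidean length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return list(vec) if norm == 0.0 else [x / norm for x in vec]

def cosine_similarity(a, b):
    """Cosine similarity of two unit-length embeddings, computed as a plain dot product."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimension")
    return sum(x * y for x, y in zip(a, b))

def rank_by_similarity(query, library):
    """Return (name, score) pairs sorted from most to least similar to the query.

    `library` is a hypothetical dict mapping sample names to unit-length embeddings.
    """
    scores = [(name, cosine_similarity(query, emb)) for name, emb in library.items()]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)
```

In a real library the embeddings would be the 512-d model outputs; the model already normalizes them, so `l2_normalize` is only needed for vectors produced elsewhere (for example, after mean-pooling).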

Mel-spectrogram preprocessing is baked into the model graph (via convmelspec STFT), so the client performs no spectral preprocessing: it supplies raw audio samples in the format above.

Accuracy vs PyTorch reference (5 synthetic signals)

| Signal        | cos(ref, coreml) |
|---------------|------------------|
| sine 440 Hz   | 0.99851          |
| sine 220 Hz   | 0.99746          |
| white noise   | 0.99977          |
| silence       | 0.99986          |
| clipped noise | 0.99977          |

Pairwise distance structure between signals is preserved with max drift 0.004 (threshold ≤ 0.02), so relative similarity rankings between samples remain intact through the int8 quantization.
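
One way to compute such a drift figure is sketched below. This is an assumption about the metric, since the card does not define "drift" precisely: here it is taken as the largest absolute difference between the pairwise cosine-similarity matrices of the reference and the converted model. The function names are hypothetical.

```python
import math

def cosine(a, b):
    """Cosine similarity of two vectors; returns 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return 0.0 if na == 0.0 or nb == 0.0 else dot / (na * nb)

def pairwise_cosine(embeddings):
    """Square matrix of cosine similarities between all pairs of embeddings."""
    n = len(embeddings)
    return [[cosine(embeddings[i], embeddings[j]) for j in range(n)] for i in range(n)]

def max_pairwise_drift(ref_embeddings, test_embeddings):
    """Largest absolute change in any pairwise similarity between the two models."""
    if len(ref_embeddings) != len(test_embeddings):
        raise ValueError("both models must embed the same set of signals")
    ref = pairwise_cosine(ref_embeddings)
    test = pairwise_cosine(test_embeddings)
    n = len(ref)
    return max(abs(ref[i][j] - test[i][j]) for i in range(n) for j in range(n))
```

Under this definition, a check against the stated threshold would be `max_pairwise_drift(ref, coreml) <= 0.02`, where `ref` and `coreml` hold the embeddings of the same signals in the same order.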

Handling audio of different lengths

The Core ML graph accepts exactly 480 000 samples (10 s), so the client must fit every clip to that length before inference:

  • < 200 ms (short one-shots): repeat-pad to ~500 ms, then zero-pad on the right. This keeps the zero padding from dominating the embedding.
  • 200 ms to 10 s: zero-pad on the right to 480 000 samples.
  • > 10 s (loops): take three 10 s windows with 50 % overlap, embed each window, mean-pool the three 512-d embeddings, and re-normalize the result to unit length.
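
The three cases above can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the Gridshift client code: the names `make_windows` and `mean_pool` are hypothetical, the three windows are assumed to start at 0 s, 5 s, and 10 s (so audio past 20 s is ignored), and a trailing partial window is zero-padded.

```python
import math

SAMPLE_RATE = 48_000
WINDOW = 480_000                    # 10 s at 48 kHz, the fixed model input length
SHORT_LIMIT = SAMPLE_RATE // 5      # 200 ms
REPEAT_TARGET = SAMPLE_RATE // 2    # ~500 ms

def zero_pad(samples, length=WINDOW):
    """Append zeros on the right until the clip has `length` samples."""
    return list(samples) + [0.0] * (length - len(samples))

def repeat_pad(samples, target=REPEAT_TARGET):
    """Tile a short clip until it reaches `target` samples, trimming any excess."""
    reps = -(-target // len(samples))  # ceiling division
    return (list(samples) * reps)[:target]

def make_windows(samples):
    """Return one or three model-ready windows of exactly WINDOW samples each."""
    n = len(samples)
    if n == 0:
        raise ValueError("empty clip")
    if n > WINDOW:
        hop = WINDOW // 2  # 50 % overlap (assumed starts: 0 s, 5 s, 10 s)
        return [zero_pad(samples[s:s + WINDOW]) for s in (0, hop, 2 * hop)]
    if n < SHORT_LIMIT:
        samples = repeat_pad(samples)
    return [zero_pad(samples)]

def mean_pool(embeddings):
    """Average per-window embeddings and re-normalize to unit length."""
    dim = len(embeddings[0])
    mean = [sum(e[i] for e in embeddings) / len(embeddings) for i in range(dim)]
    norm = math.sqrt(sum(x * x for x in mean))
    return mean if norm == 0.0 else [x / norm for x in mean]
```

Each window returned by `make_windows` would be passed to the model separately; `mean_pool` is only needed when three windows come back.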

License and attribution

Apache-2.0, inherited from upstream LAION-CLAP. Please cite:

Wu et al., "Large-scale Contrastive Language-Audio Pretraining with Feature
Fusion and Keyword-to-Caption Augmentation", 2022.
https://arxiv.org/abs/2211.06687

Conversion details

Conversion was done with the script at app/ml/clap/convert_to_coreml.py in the Gridshift source tree, using:

  • PyTorch 2.11 + torch.export
  • coremltools 9.0 MLProgram backend
  • int8 symmetric weight quantization
  • bicubic → bilinear interpolation swap for Core ML compatibility (minimal accuracy impact)
  • CLAP window-size patch for torch.jit.is_tracing branch divergence
  • Fixed input shape [1, 480000] baked into the graph
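
For readers unfamiliar with the term, symmetric int8 weight quantization can be illustrated with the toy sketch below. It is not the coremltools implementation and does not reflect the granularity (per-tensor, per-channel, or per-block) used in this conversion, which the list above does not specify; it assumes a single scale per weight list and a clipping range of [-127, 127].

```python
def quantize_symmetric_int8(weights):
    """Map floats to int8 codes with one shared scale and a zero point of 0."""
    max_abs = max((abs(w) for w in weights), default=0.0)
    scale = max_abs / 127.0 if max_abs > 0.0 else 1.0
    codes = [max(-127, min(127, round(w / scale))) for w in weights]
    return codes, scale

def dequantize(codes, scale):
    """Recover approximate float weights from int8 codes."""
    return [c * scale for c in codes]
```

Because the mapping is symmetric around zero, a weight of 0.0 is represented exactly, and the round-trip error of any weight is at most half the scale.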

Target: macOS 14+.