---
license: apache-2.0
tags:
- audio
- audio-embedding
- clap
- laion-clap
- coreml
- on-device
library_name: coreml
base_model: laion/larger_clap_music
---
# LAION-CLAP (Music) → Core ML
On-device audio-embedding model for Apple Silicon Macs. Converted from
[laion/larger_clap_music](https://huggingface.co/laion/larger_clap_music)
(HTSAT-base audio encoder + audio projection) to a self-contained Core ML
`.mlpackage`, int8-quantized.
Used by [Gridshift](https://gridshift.studio) for sample similarity search
("find samples that sound like this kick") and, in a later phase, text-to-sample
retrieval.
## Input / output contract
```
audio: fp32 tensor [1, 480000] 10 s mono @ 48 kHz, peak-normalized to [-1, 1]
embedding: fp32 tensor [1, 512] L2-normalized, cosine = dot product
```
Mel-spectrogram preprocessing is baked into the model graph (via
[convmelspec](https://github.com/adobe-research/convmelspec) STFT), so the
client does no DSP preprocessing and just supplies raw audio samples.
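As an illustration of the contract above, the following NumPy sketch prepares a clip that is already 10 s of mono audio at 48 kHz and compares two embeddings. The function names are hypothetical, and reading "peak-normalized" as "scaled so the largest absolute sample is 1" is an assumption.

```python
import numpy as np

NUM_SAMPLES = 480_000  # 10 s at 48 kHz, per the contract above


def to_model_input(samples):
    """Peak-normalize mono audio and shape it as a [1, 480000] fp32 tensor."""
    x = np.asarray(samples, dtype=np.float32)
    if x.shape != (NUM_SAMPLES,):
        raise ValueError(f"expected {NUM_SAMPLES} mono samples, got shape {x.shape}")
    peak = np.max(np.abs(x))
    if peak > 0:  # leave pure silence untouched to avoid dividing by zero
        x = x / peak
    return x[np.newaxis, :]


def similarity(a, b):
    """Cosine similarity of two L2-normalized embeddings, i.e. their dot product."""
    return float(np.dot(np.ravel(a), np.ravel(b)))
```

On macOS, the resulting array could then be passed to coremltools' `MLModel.predict` as a dictionary entry, assuming the input feature is named `audio` as in the contract listing.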
## Accuracy vs PyTorch reference (5 synthetic signals)
| signal | cos(ref, coreml) |
|---------------|-----------------:|
| sine 440 Hz | 0.99851 |
| sine 220 Hz | 0.99746 |
| white noise | 0.99977 |
| silence | 0.99986 |
| clipped noise | 0.99977 |
Pairwise distance structure between signals is preserved with max drift
0.004 (threshold ≤ 0.02), so relative similarity rankings between samples
remain intact through the int8 quantization.
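One way such a check could be computed is sketched below. It assumes that "drift" means the largest absolute difference between the two pairwise cosine-similarity matrices; the model card does not define the metric, and `conversion_report` is a hypothetical name.

```python
import numpy as np


def conversion_report(ref, converted):
    """Compare reference and converted embeddings, both shaped [n_signals, dim].

    Returns the per-signal cosine similarity and the pairwise drift
    (assumed definition: max |cos_ref(i, j) - cos_converted(i, j)|).
    """
    ref = np.asarray(ref, dtype=np.float64)
    conv = np.asarray(converted, dtype=np.float64)
    # Normalize rows so that dot products are cosine similarities.
    ref = ref / np.linalg.norm(ref, axis=1, keepdims=True)
    conv = conv / np.linalg.norm(conv, axis=1, keepdims=True)
    per_signal = np.sum(ref * conv, axis=1)
    drift = float(np.max(np.abs(ref @ ref.T - conv @ conv.T)))
    return per_signal, drift
```

Under this definition, a conversion would pass when every per-signal cosine is close to 1 and the drift stays below the 0.02 threshold quoted above.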
## Handling audio of different lengths
The Core ML graph is shape-rigid at 480 000 samples (10 s). The client is
expected to preprocess:
- **≤ 10 s**: zero-pad on the right.
- **< 200 ms** (short one-shots): repeat-pad to ~500 ms, then zero-pad. This keeps
the padding from dominating the embedding.
- **> 10 s** (loops): take 3 sliding windows with 50 % overlap, mean-pool the
three 512-d embeddings, and re-normalize the result to unit length.
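The rules above could be implemented roughly as follows. This is a sketch, not Gridshift's actual client code: the function names are hypothetical, and it assumes the three windows start at 0 s, 5 s, and 10 s, with the last ones zero-padded if the clip ends early and anything past 20 s ignored.

```python
import numpy as np

SR = 48_000
TARGET = 10 * SR       # 480 000 samples, the fixed model input length
SHORT = SR // 5        # 200 ms
REPEAT_TO = SR // 2    # ~500 ms target for repeat-padding


def zero_pad(x, n=TARGET):
    """Right-pad (or truncate) a 1-D signal to exactly n samples."""
    out = np.zeros(n, dtype=np.float32)
    out[: min(len(x), n)] = x[:n]
    return out


def make_windows(samples):
    """Return model-ready windows shaped [k, 480000], with k = 1 or 3."""
    x = np.asarray(samples, dtype=np.float32)
    if len(x) == 0:
        raise ValueError("empty audio")
    if len(x) < SHORT:
        # Repeat-pad short one-shots to ~500 ms before zero-padding.
        reps = -(-REPEAT_TO // len(x))  # ceiling division
        x = np.tile(x, reps)[:REPEAT_TO]
    if len(x) <= TARGET:
        return zero_pad(x)[np.newaxis, :]
    hop = TARGET // 2  # 50 % overlap (assumed window placement)
    return np.stack([zero_pad(x[i * hop : i * hop + TARGET]) for i in range(3)])


def pool_embeddings(embeddings):
    """Mean-pool [k, 512] embeddings and re-normalize to unit length."""
    mean = np.mean(np.asarray(embeddings, dtype=np.float32), axis=0)
    return mean / np.linalg.norm(mean)
```

Each row returned by `make_windows` would be run through the model separately, and `pool_embeddings` applied to the stacked outputs.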
## License and attribution
Apache-2.0, inherited from upstream LAION-CLAP. Please cite:
```
Wu et al., "Large-scale Contrastive Language-Audio Pretraining with Feature
Fusion and Keyword-to-Caption Augmentation", 2022.
https://arxiv.org/abs/2211.06687
```
## Conversion details
Conversion was done with the script at
`app/ml/clap/convert_to_coreml.py` in the Gridshift source tree, using:
- PyTorch 2.11 + torch.export
- coremltools 9.0 MLProgram backend
- int8 symmetric weight quantization
- bicubic → bilinear interp swap for Core ML compat (minimal accuracy impact)
- CLAP window-size patch for `torch.jit.is_tracing` branch divergence
- Fixed input shape [1, 480000] baked into the graph
Target: macOS 14+.
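For readers unfamiliar with the int8 symmetric weight quantization listed above, here is a minimal NumPy illustration of the idea. It is not the conversion script's code: the single per-tensor scale is an assumption made for brevity, and the actual conversion may use a different granularity.

```python
import numpy as np


def quantize_symmetric_int8(weights):
    """Map float weights to int8 with a zero-centered (symmetric) scale."""
    w = np.asarray(weights, dtype=np.float32)
    max_abs = float(np.max(np.abs(w)))
    # Symmetric: no zero-point; the range [-max_abs, max_abs] maps to [-127, 127].
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale


def dequantize(q, scale):
    """Recover approximate float weights from the int8 codes."""
    return q.astype(np.float32) * scale
```

The round-trip error per weight is at most about half the scale, which is consistent with the small cosine deviations reported in the accuracy table.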