---
license: apache-2.0
tags:
- audio
- audio-embedding
- clap
- laion-clap
- coreml
- on-device
library_name: coreml
base_model: laion/larger_clap_music
---
# LAION-CLAP (Music) → Core ML
On-device audio-embedding model for Apple Silicon Macs. Converted from
[laion/larger_clap_music](https://huggingface.co/laion/larger_clap_music)
(HTSAT-base audio encoder + audio projection) to a self-contained Core ML
`.mlpackage`, int8-quantized.
Used by [Gridshift](https://gridshift.studio) for sample similarity search —
"find samples that sound like this kick" — and (in a later phase) text-to-sample
retrieval.
## Input / output contract
```
audio: fp32 tensor [1, 480000] 10 s mono @ 48 kHz, peak-normalized to [-1, 1]
embedding: fp32 tensor [1, 512] L2-normalized, cosine = dot product
```
Mel-spectrogram preprocessing is baked into the model graph (via
[convmelspec](https://github.com/adobe-research/convmelspec) STFT), so the
client does zero DSP preprocessing — just supply raw audio samples.
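The contract can be illustrated with a small dependency-free sketch. It assumes that "peak-normalized" means scaling the clip so its largest absolute sample is 1.0, and the helper names (`peak_normalize`, `l2_normalize`, `dot`) are hypothetical, not part of the model or of Gridshift.

```python
import math


def peak_normalize(samples):
    """Scale samples so the largest absolute value is 1.0.

    Assumption: this is what "peak-normalized to [-1, 1]" means.
    Pure silence is returned unchanged to avoid dividing by zero.
    """
    peak = max((abs(s) for s in samples), default=0.0)
    if peak == 0.0:
        return list(samples)
    return [s / peak for s in samples]


def l2_normalize(vector):
    """Scale a vector to unit Euclidean length (zero vectors are returned as-is)."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector] if norm > 0.0 else list(vector)


def dot(a, b):
    """Dot product; equals cosine similarity when both inputs are unit length."""
    return sum(x * y for x, y in zip(a, b))
```

Because the model output is already L2-normalized, ranking candidates by `dot(query, candidate)` gives the same order as ranking by cosine similarity, with no extra normalization step at query time.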
## Accuracy vs PyTorch reference (5 synthetic signals)
| signal | cos(ref, coreml) |
|---------------|-----------------:|
| sine 440 Hz | 0.99851 |
| sine 220 Hz | 0.99746 |
| white noise | 0.99977 |
| silence | 0.99986 |
| clipped noise | 0.99977 |

Pairwise distance structure between the signals is preserved: the maximum
drift is 0.004 (acceptance threshold ≤ 0.02), so relative similarity rankings
between samples remain intact after int8 quantization.
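One way such a drift figure could be computed is sketched below. It assumes that "drift" means the largest absolute change in pairwise cosine similarity between the reference embeddings and the Core ML embeddings; the exact definition used for the numbers above is not stated, and `max_pairwise_drift` is a hypothetical name.

```python
import math
from itertools import combinations


def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    num = sum(x * y for x, y in zip(a, b))
    den = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return num / den


def max_pairwise_drift(ref_embeddings, coreml_embeddings):
    """Largest absolute change in pairwise cosine similarity.

    Both arguments are lists of embeddings in the same signal order.
    Assumption: this is one plausible reading of "max drift".
    """
    drift = 0.0
    for i, j in combinations(range(len(ref_embeddings)), 2):
        ref_sim = cosine(ref_embeddings[i], ref_embeddings[j])
        new_sim = cosine(coreml_embeddings[i], coreml_embeddings[j])
        drift = max(drift, abs(ref_sim - new_sim))
    return drift
```

A check of this kind complements the per-signal cosine table: the table shows each embedding stays close to its reference, while the pairwise comparison shows that the relationships between embeddings also hold.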
## Handling audio of different lengths
The Core ML graph is shape-rigid at 480 000 samples (10 s). The client is
expected to fit each clip to that length before inference:
- **≤ 10 s**: zero-pad on the right to 480 000 samples.
- **< 200 ms** (short one-shots): repeat-pad to ~500 ms, then zero-pad. This
keeps the zero padding from dominating the embedding.
- **> 10 s** (loops): take three sliding windows with 50 % overlap, embed each,
then mean-pool the three 512-d embeddings and re-normalize the result to unit
length.
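The three cases above can be sketched as follows. This is a minimal illustration, not the Gridshift client code. It assumes that repeat-padding uses whole repetitions of the clip until at least 500 ms is reached, and that the three windows start at 0 s, 5 s and 10 s with any window running past the end zero-padded. `embed` is a hypothetical callable standing in for the Core ML model.

```python
import math

SAMPLE_RATE = 48_000
WINDOW = 480_000                  # 10 s, the fixed model input length
SHORT_LIMIT = SAMPLE_RATE // 5    # 200 ms
REPEAT_TARGET = SAMPLE_RATE // 2  # ~500 ms


def zero_pad(samples, length=WINDOW):
    """Right-pad with zeros up to `length` samples."""
    return list(samples) + [0.0] * (length - len(samples))


def fit_short(samples):
    """Prepare a clip of at most 10 s for the model."""
    samples = list(samples)
    if 0 < len(samples) < SHORT_LIMIT:
        # Assumption: repeat whole copies until ~500 ms is covered.
        repeats = math.ceil(REPEAT_TARGET / len(samples))
        samples = samples * repeats
    return zero_pad(samples)


def split_long(samples, count=3, hop=WINDOW // 2):
    """Cut `count` 10 s windows with 50 % overlap, starting at sample 0."""
    windows = []
    for k in range(count):
        chunk = samples[k * hop : k * hop + WINDOW]
        windows.append(zero_pad(chunk))
    return windows


def mean_pool(embeddings):
    """Average embeddings element-wise, then re-normalize to unit length."""
    dim = len(embeddings[0])
    mean = [sum(e[d] for e in embeddings) / len(embeddings) for d in range(dim)]
    norm = math.sqrt(sum(x * x for x in mean))
    return [x / norm for x in mean] if norm > 0.0 else mean


def embed_any_length(samples, embed):
    """`embed` (hypothetical) maps 480 000 samples to a 512-d embedding."""
    if len(samples) <= WINDOW:
        return embed(fit_short(samples))
    return mean_pool([embed(w) for w in split_long(samples)])
```

Re-normalizing after the mean matters because the average of unit vectors is generally shorter than unit length, which would otherwise break the "cosine = dot product" property of the output.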
## License and attribution
Apache-2.0, inherited from upstream LAION-CLAP. Please cite:
```
Wu et al., "Large-scale Contrastive Language-Audio Pretraining with Feature
Fusion and Keyword-to-Caption Augmentation", 2022.
https://arxiv.org/abs/2211.06687
```
## Conversion details
Conversion was done with the script at
`app/ml/clap/convert_to_coreml.py` in the Gridshift source tree, using:
- PyTorch 2.11 + torch.export
- coremltools 9.0 MLProgram backend
- int8 symmetric weight quantization
- bicubic → bilinear interpolation swap for Core ML compatibility (minimal accuracy impact)
- CLAP window-size patch for `torch.jit.is_tracing` branch divergence
- Fixed input shape [1, 480000] baked into the graph

Target: macOS 14+.
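For readers unfamiliar with the int8 symmetric weight quantization listed above, the arithmetic can be sketched in a few lines. This illustrates the general technique only: it uses a single per-tensor scale for simplicity, whereas the granularity chosen in the conversion script is not stated here, and the function names are hypothetical rather than coremltools API.

```python
def quantize_symmetric_int8(weights):
    """Map floats to integer codes in [-127, 127] with zero-point 0.

    Symmetric means the range is centered on zero, so only a scale
    factor has to be stored alongside the codes.
    """
    max_abs = max((abs(w) for w in weights), default=0.0)
    scale = max_abs / 127.0 if max_abs > 0.0 else 1.0
    codes = [max(-127, min(127, round(w / scale))) for w in weights]
    return codes, scale


def dequantize(codes, scale):
    """Recover approximate float weights from the integer codes."""
    return [c * scale for c in codes]
```

Each reconstructed weight differs from the original by at most half a quantization step (`scale / 2`), which is consistent with the small cosine differences reported in the accuracy table.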