| --- |
| license: apache-2.0 |
| tags: |
| - audio |
| - audio-embedding |
| - clap |
| - laion-clap |
| - coreml |
| - on-device |
| library_name: coreml |
| base_model: laion/larger_clap_music |
| --- |
| |
| # LAION-CLAP (Music) - Core ML |
|
|
| On-device audio-embedding model for Apple Silicon Macs. Converted from |
| [laion/larger_clap_music](https://huggingface.co/laion/larger_clap_music) |
| (HTSAT-base audio encoder + audio projection) to a self-contained Core ML |
| `.mlpackage`, int8-quantized. |
|
|
| Used by [Gridshift](https://gridshift.studio) for sample similarity search
| ("find samples that sound like this kick") and, in a later phase,
| text-to-sample retrieval.
|
|
| ## Input / output contract |
|
|
| ``` |
| audio: fp32 tensor [1, 480000] 10 s mono @ 48 kHz, peak-normalized to [-1, 1] |
| embedding: fp32 tensor [1, 512] L2-normalized, cosine = dot product |
| ``` |
|
|
| Mel-spectrogram preprocessing is baked into the model graph (via |
| [convmelspec](https://github.com/adobe-research/convmelspec) STFT), so the |
| client does no DSP preprocessing and only supplies raw audio samples.
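The contract above leaves two small jobs to the caller: peak-normalizing the waveform before inference and comparing embeddings afterwards. A minimal sketch of both, using plain Python lists to stay dependency-free (the function names are illustrative and not part of the Gridshift code; a real client would likely use a vectorized array type):

```python
import math


def peak_normalize(samples):
    """Scale samples so the largest absolute value is 1.0.

    All-zero input (silence) is returned unchanged to avoid dividing by zero.
    """
    peak = max((abs(x) for x in samples), default=0.0)
    if peak == 0.0:
        return list(samples)
    return [x / peak for x in samples]


def l2_normalize(vec):
    """Scale a vector to unit length.

    The model output is already L2-normalized per the contract; this helper
    is only here to build stand-in vectors for the example.
    """
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0.0 else list(vec)


def cosine_similarity(a, b):
    """For unit-length embeddings, cosine similarity is just the dot product."""
    return sum(x * y for x, y in zip(a, b))
```

Because the embeddings arrive unit-length, similarity search needs no per-query normalization: ranking by dot product is ranking by cosine similarity.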
|
|
| ## Accuracy vs PyTorch reference (5 synthetic signals) |
|
|
| | signal | cos(ref, coreml) | |
| |---------------|-----------------:| |
| | sine 440 Hz | 0.99851 | |
| | sine 220 Hz | 0.99746 | |
| | white noise | 0.99977 | |
| | silence | 0.99986 | |
| | clipped noise | 0.99977 | |
|
|
| Pairwise distance structure between signals is preserved with max drift |
| 0.004 (threshold ≤ 0.02), so relative similarity rankings between samples
| remain intact through the int8 quantization. |
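The card does not spell out how "max drift" is computed. One plausible reading, shown here purely as an assumption, is the largest absolute change in pairwise cosine similarity between the reference embeddings and the Core ML embeddings of the same signals:

```python
def pairwise_cosine(embeddings):
    """Cosine similarity matrix for unit-length embeddings (dot products)."""
    n = len(embeddings)
    return [
        [sum(x * y for x, y in zip(embeddings[i], embeddings[j])) for j in range(n)]
        for i in range(n)
    ]


def max_pairwise_drift(ref_embeddings, test_embeddings):
    """Largest |cos_ref(i, j) - cos_test(i, j)| over all distinct pairs i < j.

    Assumed definition of drift; the conversion script may measure it differently.
    """
    if len(ref_embeddings) != len(test_embeddings):
        raise ValueError("both embedding sets must cover the same signals")
    ref_sim = pairwise_cosine(ref_embeddings)
    test_sim = pairwise_cosine(test_embeddings)
    n = len(ref_embeddings)
    return max(
        (abs(ref_sim[i][j] - test_sim[i][j]) for i in range(n) for j in range(i + 1, n)),
        default=0.0,
    )
```

Under this definition, a small drift means every pair of signals keeps nearly the same similarity after quantization, which is what ranking-based search depends on.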
|
|
| ## Handling audio of different lengths |
|
|
| The Core ML graph is shape-rigid at 480 000 samples (10 s). The client is |
| expected to preprocess: |
|
|
| - **≤ 10 s**: zero-pad on the right.
| - **< 200 ms** (short one-shots): repeat-pad to ~500 ms, then zero-pad. This
| prevents the zero padding from dominating the embedding.
| - **> 10 s** (loops): take 3 sliding windows with 50 % overlap, mean-pool the
| three 512-d embeddings, then re-normalize the result to unit length.
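The three rules above can be sketched as a single windowing helper plus a pooling step. This is a hedged illustration in plain Python, not the Gridshift client code: the function names are hypothetical, and placing the three windows at 0 s, 5 s, and 10 s (so that audio beyond 20 s is ignored) is an assumption the card does not confirm.

```python
import math

SAMPLE_RATE = 48_000
WINDOW = 480_000                         # 10 s, the fixed model input length
SHORT_CLIP = SAMPLE_RATE * 200 // 1000   # 200 ms threshold for one-shots
REPEAT_TARGET = SAMPLE_RATE // 2         # ~500 ms repeat-pad target


def zero_pad(samples, length=WINDOW):
    """Right-pad with zeros up to `length` samples."""
    return list(samples) + [0.0] * (length - len(samples))


def repeat_pad(samples, min_len):
    """Tile the clip until it spans `min_len` samples, then truncate."""
    reps = -(-min_len // len(samples))  # ceiling division
    return (list(samples) * reps)[:min_len]


def make_windows(samples):
    """Return one or three model-sized windows for a mono clip."""
    n = len(samples)
    if n == 0:
        raise ValueError("empty clip")
    if n > WINDOW:
        hop = WINDOW // 2  # 50 % overlap; window starts are an assumption
        return [zero_pad(samples[k * hop : k * hop + WINDOW]) for k in range(3)]
    if n < SHORT_CLIP:
        samples = repeat_pad(samples, REPEAT_TARGET)
    return [zero_pad(samples)]


def mean_pool(embeddings):
    """Average the embeddings, then re-normalize to unit length."""
    dim = len(embeddings[0])
    mean = [sum(e[i] for e in embeddings) / len(embeddings) for i in range(dim)]
    norm = math.sqrt(sum(x * x for x in mean))
    return [x / norm for x in mean] if norm > 0.0 else mean
```

Each returned window would be fed to the model separately; `mean_pool` is only needed when `make_windows` returns three of them.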
|
|
| ## License and attribution |
|
|
| Apache-2.0, inherited from upstream LAION-CLAP. Please cite: |
|
|
| ``` |
| Wu et al., "Large-scale Contrastive Language-Audio Pretraining with Feature |
| Fusion and Keyword-to-Caption Augmentation", 2022. |
| https://arxiv.org/abs/2211.06687 |
| ``` |
|
|
| ## Conversion details |
|
|
| Conversion was done with the script at |
| `app/ml/clap/convert_to_coreml.py` in the Gridshift source tree, using: |
|
|
| - PyTorch 2.11 + torch.export |
| - coremltools 9.0 MLProgram backend |
| - int8 symmetric weight quantization |
| - bicubic → bilinear interpolation swap for Core ML compatibility (minimal accuracy impact)
| - CLAP window-size patch for `torch.jit.is_tracing` branch divergence |
| - Fixed input shape [1, 480000] baked into the graph |
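For readers unfamiliar with the term, "int8 symmetric weight quantization" maps each weight to an integer in [-127, 127] using a single scale and no zero-point offset. The sketch below illustrates the idea on a flat list of weights with one per-tensor scale. It is a conceptual illustration only: it is not the conversion script and does not call coremltools, and the scale granularity actually used (per-tensor or per-channel) is not stated in this card.

```python
def quantize_symmetric_int8(weights):
    """Quantize weights to int8 with a single symmetric scale (no zero point).

    Per-tensor scaling is an assumption made for this illustration.
    """
    max_abs = max((abs(w) for w in weights), default=0.0)
    scale = max_abs / 127.0 if max_abs > 0.0 else 1.0
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate float weights from int8 values."""
    return [q * scale for q in quantized]
```

The round-trip error per weight is bounded by half the scale, which is consistent with the small cosine deviations reported in the accuracy table.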
|
|
| Target: macOS 14+. |
|
|