---
license: apache-2.0
tags:
  - audio
  - audio-embedding
  - clap
  - laion-clap
  - coreml
  - on-device
library_name: coreml
base_model: laion/larger_clap_music
---

# LAION-CLAP (Music) → Core ML

On-device audio-embedding model for Apple Silicon Macs. Converted from
[laion/larger_clap_music](https://huggingface.co/laion/larger_clap_music)
(HTSAT-base audio encoder + audio projection) to a self-contained Core ML
`.mlpackage`, int8-quantized.

Used by [Gridshift](https://gridshift.studio) for sample similarity search —
"find samples that sound like this kick" — and (in a later phase) text-to-sample
retrieval.

## Input / output contract

```
audio:       fp32 tensor [1, 480000]   10 s mono @ 48 kHz, peak-normalized to [-1, 1]
embedding:   fp32 tensor [1, 512]      L2-normalized, cosine = dot product
```

Mel-spectrogram preprocessing is baked into the model graph (via
[convmelspec](https://github.com/adobe-research/convmelspec) STFT), so the
client does no spectral DSP: it only supplies raw audio samples, peak-normalized
and fitted to the fixed 10 s length.
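
As a minimal sketch of the contract above, the NumPy snippet below peak-normalizes a mono buffer into the expected fp32 `[1, 480000]` shape and compares two unit-length embeddings with a plain dot product. The helper names `to_model_input` and `cosine` are hypothetical, and the snippet assumes the audio is already mono at 48 kHz and no longer than 10 s.

```python
import numpy as np

SAMPLE_RATE = 48_000
NUM_SAMPLES = 480_000  # 10 s at 48 kHz, the fixed model input length


def to_model_input(samples: np.ndarray) -> np.ndarray:
    """Peak-normalize mono 48 kHz audio and shape it as fp32 [1, 480000]."""
    x = np.asarray(samples, dtype=np.float32).reshape(-1)
    if x.size > NUM_SAMPLES:
        raise ValueError("clip longer than 10 s; window it first")
    peak = float(np.max(np.abs(x))) if x.size else 0.0
    if peak > 0.0:
        x = x / peak  # scale so the loudest sample reaches +/-1
    out = np.zeros((1, NUM_SAMPLES), dtype=np.float32)
    out[0, : x.size] = x  # zero-pad on the right
    return out


def cosine(a: np.ndarray, b: np.ndarray) -> float:
    """For L2-normalized embeddings, cosine similarity is the dot product."""
    return float(np.dot(a.reshape(-1), b.reshape(-1)))
```

Silent input is left as all zeros rather than divided by a zero peak.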

## Accuracy vs PyTorch reference (5 synthetic signals)

| signal        | cos(ref, coreml) |
|---------------|-----------------:|
| sine 440 Hz   |          0.99851 |
| sine 220 Hz   |          0.99746 |
| white noise   |          0.99977 |
| silence       |          0.99986 |
| clipped noise |          0.99977 |

Pairwise distance structure between signals is preserved: the maximum drift is
0.004, against an acceptance threshold of 0.02, so relative similarity rankings
between samples remain intact through the int8 quantization.
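
One plausible way to compute a drift figure like this is to build the pairwise cosine matrix for the reference embeddings and for the Core ML embeddings, then take the largest absolute difference between the two. This is an assumption about the metric, since the exact definition is not spelled out here, and the function name `max_pairwise_drift` is hypothetical.

```python
import numpy as np


def max_pairwise_drift(ref: np.ndarray, test: np.ndarray) -> float:
    """Largest absolute change in pairwise cosine similarity.

    ref, test: arrays of shape [n, d]; row i of each embeds the same signal.
    """
    def unit(m: np.ndarray) -> np.ndarray:
        m = np.asarray(m, dtype=np.float64)
        return m / np.linalg.norm(m, axis=1, keepdims=True)

    ref_sim = unit(ref) @ unit(ref).T     # [n, n] cosine matrix, reference
    test_sim = unit(test) @ unit(test).T  # [n, n] cosine matrix, converted model
    return float(np.max(np.abs(ref_sim - test_sim)))
```

A check in this style would pass when `max_pairwise_drift(ref, coreml) <= 0.02`.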

## Handling audio of different lengths

The Core ML graph is shape-rigid at 480 000 samples (10 s). The client is
expected to fit other lengths to that shape itself:

- **200 ms to 10 s**: zero-pad on the right.
- **< 200 ms** (short one-shots): repeat-pad to ~500 ms, then zero-pad on the
  right. This prevents padding from dominating the embedding.
- **> 10 s** (loops): take three 10 s sliding windows with 50 % overlap,
  mean-pool the three 512-d embeddings, then re-normalize to unit length.
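
The three cases above could be sketched in NumPy as follows. The names `make_windows` and `embed_clip` are hypothetical, and `embed` stands in for whatever callable runs the Core ML model (fp32 `[1, 480000]` in, fp32 `[1, 512]` out). The window placement is an assumption: three windows starting at 0 s, 5 s and 10 s, with audio past 20 s ignored. Input is assumed to be mono, 48 kHz and already peak-normalized.

```python
import numpy as np

SAMPLE_RATE = 48_000
NUM_SAMPLES = 480_000             # fixed model input: 10 s
SHORT_LIMIT = SAMPLE_RATE // 5    # 200 ms
REPEAT_TARGET = SAMPLE_RATE // 2  # ~500 ms
HOP = NUM_SAMPLES // 2            # 50 % overlap between windows


def zero_pad(x: np.ndarray) -> np.ndarray:
    """Right-pad a clip of at most 480000 samples with zeros."""
    out = np.zeros(NUM_SAMPLES, dtype=np.float32)
    out[: x.size] = x
    return out


def make_windows(samples: np.ndarray) -> np.ndarray:
    """Return fp32 windows of shape [n, 480000], where n is 1 or 3."""
    x = np.asarray(samples, dtype=np.float32).reshape(-1)
    if x.size == 0:
        raise ValueError("empty clip")
    if x.size < SHORT_LIMIT:
        reps = -(-REPEAT_TARGET // x.size)  # ceiling division
        x = np.tile(x, reps)[:REPEAT_TARGET]  # repeat-pad to ~500 ms
    if x.size <= NUM_SAMPLES:
        return zero_pad(x)[None, :]
    # Assumption: windows start at 0 s, 5 s and 10 s; later audio is dropped.
    return np.stack(
        [zero_pad(x[i * HOP : i * HOP + NUM_SAMPLES]) for i in range(3)]
    )


def embed_clip(samples: np.ndarray, embed) -> np.ndarray:
    """Embed a clip of any length; `embed` is a hypothetical model wrapper."""
    vecs = np.concatenate(
        [embed(w[None, :]) for w in make_windows(samples)], axis=0
    )
    pooled = vecs.mean(axis=0)      # mean-pool the window embeddings
    norm = np.linalg.norm(pooled)
    return pooled / norm if norm > 0 else pooled  # back to unit length
```

Re-normalizing after pooling matters because the mean of unit vectors is generally shorter than unit length, which would break the cosine-as-dot-product property.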

## License and attribution

Apache-2.0, inherited from upstream LAION-CLAP. Please cite:

```
Wu et al., "Large-scale Contrastive Language-Audio Pretraining with Feature
Fusion and Keyword-to-Caption Augmentation", 2022.
https://arxiv.org/abs/2211.06687
```

## Conversion details

Conversion was done with the script at
`app/ml/clap/convert_to_coreml.py` in the Gridshift source tree, using:

- PyTorch 2.11 + torch.export
- coremltools 9.0 MLProgram backend
- int8 symmetric weight quantization
- bicubic → bilinear interp swap for Core ML compat (minimal accuracy impact)
- CLAP window-size patch for `torch.jit.is_tracing` branch divergence
- Fixed input shape [1, 480000] baked into the graph

Target: macOS 14+.