---
license: apache-2.0
library_name: coremltools
base_model: laion/larger_clap_general
pipeline_tag: feature-extraction
tags:
  - audio
  - audio-embedding
  - clap
  - htsat
  - core-ml
  - onnx
  - apple-silicon
---

# larger-clap-general-coreml

Two artifacts derived from [`laion/larger_clap_general`](https://huggingface.co/laion/larger_clap_general), kept in the same embedding space so they can be used as a pair:

- **`clap_audio_encoder.mlpackage`**: native Core ML build of the audio encoder + projection head. Runs accelerated on Apple GPU via `MLComputeUnits.cpuAndGPU`.
- **`text_model.onnx`**: ONNX build of the text encoder + projection head. Standard ORT-compatible, cross-platform.

Both take their respective inputs and return an L2-normalized 512-d embedding in the joint CLAP space (cosine similarity == dot product).

`larger_clap_general` is trained on **general audio, music and speech**; use the pair for zero-shot audio classification or open-vocabulary retrieval.
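
As a rough illustration of the zero-shot use case, the sketch below ranks candidate labels by dot product against one audio embedding. The embeddings are random placeholders standing in for real model outputs, and `l2_normalize` and `rank_labels` are hypothetical helpers, not part of this repo.

```python
import numpy as np

def l2_normalize(x: np.ndarray, axis: int = -1) -> np.ndarray:
    """Scale vectors to unit length along `axis`."""
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def rank_labels(audio_emb: np.ndarray, text_embs: np.ndarray,
                labels: list[str]) -> list[tuple[str, float]]:
    """Return (label, score) pairs sorted by descending similarity.

    audio_emb: (512,) unit vector; text_embs: (N, 512) unit vectors.
    Because both sides are unit length, the dot product is the cosine.
    """
    scores = text_embs @ audio_emb
    order = np.argsort(-scores)
    return [(labels[i], float(scores[i])) for i in order]

# Placeholder data: random unit vectors instead of real model outputs.
rng = np.random.default_rng(0)
labels = ["a dog barking", "808 kick drum", "voice saying hello"]
text_embs = l2_normalize(rng.standard_normal((3, 512)).astype(np.float32))
# Fake an audio embedding as a slightly noisy copy of the first label's vector.
noise = 0.02 * rng.standard_normal(512).astype(np.float32)
audio_emb = l2_normalize(text_embs[0] + noise)

ranking = rank_labels(audio_emb, text_embs, labels)
print(ranking[0][0])
```

With real embeddings, `text_embs` would be the stacked `text_embeds` rows for each label prompt and `audio_emb` the flattened `embedding` output.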

## Why this repo exists

- **Audio side**: `ort`'s CoreML execution provider can't accelerate HTSAT. Reflect-pad, 5-D reshapes, relative-position-bias gather, and dynamic shapes shred the graph into CPU partitions, so the EP "registers" but every node runs on CPU. Loading the `.mlpackage` directly via Core ML (skipping ORT entirely) runs the full graph on the Apple GPU.
- **Text side**: this `text_model.onnx` is re-exported directly from LAION's PyTorch with no `optimum` graph fusion. Xenova's matching `larger_clap_general` ONNX export of the text encoder is in a *slightly* different numerical subspace than LAION's PyTorch (graph fusions + quantization add up), so pairing Xenova-text with our LAION-derived audio model collapses text→audio cosine to ~0.2. Re-exporting text from the same PyTorch source recovers ~0.7+ on good matches.

## Inputs / Outputs

### Audio (`clap_audio_encoder.mlpackage`)

| | name | shape | dtype | notes |
|---|---|---|---|---|
| Input | `audio` | `[1, 480000]` | float32 | 10 s mono @ 48 kHz, peak-normalized to `[-1, 1]` |
| Output | `embedding` | `[1, 512]` | float32 | L2-normalized; cosine == dot product |

The mel-spectrogram extraction (STFT, Slaney mel filterbank, log) is **baked into the model graph**: you pass raw audio, not features.
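
A minimal sketch of preparing a clip for the `audio` input, assuming the samples are already mono and resampled to 48 kHz. `prepare_waveform` is a hypothetical helper, and scaling by the clip's own peak is one reading of "peak-normalized"; adjust if your pipeline normalizes differently.

```python
import numpy as np

TARGET_SAMPLES = 480_000  # 10 s at 48 kHz, per the input table above

def prepare_waveform(samples: np.ndarray) -> np.ndarray:
    """Fit a mono 48 kHz waveform to shape [1, 480000], peak-normalized.

    Resampling and channel mixdown are out of scope for this sketch.
    """
    x = np.asarray(samples, dtype=np.float32).reshape(-1)
    x = x[:TARGET_SAMPLES]                       # truncate clips longer than 10 s
    peak = float(np.max(np.abs(x))) if x.size else 0.0
    if peak > 0.0:
        x = x / peak                             # scale into [-1, 1]
    x = np.pad(x, (0, TARGET_SAMPLES - x.size))  # zero-pad shorter clips
    return x[np.newaxis, :]                      # add the batch dimension

# Example: a 3 s, 440 Hz tone at half amplitude.
t = np.arange(3 * 48_000, dtype=np.float32) / 48_000
tone = 0.5 * np.sin(2 * np.pi * 440.0 * t)
batch = prepare_waveform(tone)
```

Truncation here is only a fallback; clips longer than 10 s are better handled with sliding windows.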

### Text (`text_model.onnx`)

| | name | shape | dtype | notes |
|---|---|---|---|---|
| Input | `input_ids` | `[B, T]` | int64 | RoBERTa tokenizer output |
| Input | `attention_mask` | `[B, T]` | int64 | 1 for real tokens, 0 for padding |
| Output | `text_embeds` | `[B, 512]` | float32 | L2-normalized; cosine == dot product |

Both batch and sequence length are dynamic. Use the tokenizer from `Xenova/larger_clap_general` (or any `larger_clap_general` mirror with the standard RoBERTa tokenizer config); vocab + special tokens are identical across exports.

## Variable-length audio

The graph has a fixed 10 s input shape. For arbitrary-length audio, the recommended recipe is:

| Duration | Strategy |
|---|---|
| ≤ 10 s | Zero-pad to 480_000 samples, single forward pass. |
| > 10 s | Sliding 10 s windows with 50 % overlap, embed each window, **mean-pool the embeddings, re-L2-normalize.** |

For very long files, cap the window count to bound runtime: uniformly spacing N windows across `[0, T-10s]` gives full-file coverage without per-window blow-up.
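
The recipe above can be sketched as follows. `window_starts`, `embed_long` and `fake_embed` are hypothetical names; `fake_embed` is a stand-in so the sketch runs without the Core ML model, and in practice you would pass a function that calls the real audio encoder.

```python
import numpy as np
from typing import Callable, Optional

WINDOW = 480_000      # 10 s at 48 kHz
HOP = WINDOW // 2     # 50 % overlap

def window_starts(n_samples: int, max_windows: Optional[int] = None) -> list[int]:
    """Start offsets (in samples) of the 10 s windows to embed."""
    if n_samples <= WINDOW:
        return [0]
    last = n_samples - WINDOW
    starts = list(range(0, last + 1, HOP))
    if starts[-1] != last:
        starts.append(last)           # make sure the tail is covered
    if max_windows is not None and len(starts) > max_windows:
        # Cap the count: spread max_windows starts uniformly over [0, last].
        starts = [int(round(s)) for s in np.linspace(0, last, max_windows)]
    return starts

def embed_long(audio: np.ndarray, embed_fn: Callable[[np.ndarray], np.ndarray],
               max_windows: Optional[int] = None) -> np.ndarray:
    """Mean-pool per-window embeddings and re-normalize to unit length."""
    embs = []
    for start in window_starts(audio.size, max_windows):
        chunk = audio[start:start + WINDOW]
        chunk = np.pad(chunk, (0, WINDOW - chunk.size))   # only pads short clips
        embs.append(embed_fn(chunk[np.newaxis, :]).reshape(-1))
    pooled = np.mean(embs, axis=0)
    return pooled / np.linalg.norm(pooled)

# Stand-in encoder: fixed random projection of a coarse window summary.
_proj = np.random.default_rng(0).standard_normal((8, 512)).astype(np.float32)

def fake_embed(batch: np.ndarray) -> np.ndarray:
    feats = batch.reshape(8, -1).mean(axis=1)
    v = feats @ _proj
    return (v / np.linalg.norm(v))[np.newaxis, :]

audio = np.random.default_rng(1).standard_normal(25 * 48_000).astype(np.float32)
embedding = embed_long(audio, fake_embed, max_windows=3)
```

Appending a final window at `T-10s` is a design choice of this sketch, so the end of the file is always covered even when the duration is not a multiple of the hop.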

## Usage

### Swift (Core ML)
```swift
import CoreML

let config = MLModelConfiguration()
config.computeUnits = .cpuAndGPU
let model = try MLModel(contentsOf: compiledURL, configuration: config)

let audio = try MLMultiArray(shape: [1, 480_000], dataType: .float32)
// copy your normalized waveform into audio.dataPointer ...

let provider = try MLDictionaryFeatureProvider(dictionary: ["audio": audio])
let out = try model.prediction(from: provider)
let embedding = out.featureValue(for: "embedding")!.multiArrayValue!
```

### Rust (objc2-core-ml)
The `objc2`/`objc2-core-ml` crates give direct Rust bindings to Core ML. Sketch:

```rust
use objc2_core_ml::{MLModel, MLModelConfiguration, MLMultiArray, MLMultiArrayDataType,
                    MLDictionaryFeatureProvider, MLFeatureValue, MLComputeUnits};

// Core ML wants a compiled .mlmodelc: compile the .mlpackage once,
// then load with cpuAndGPU compute units.
let compiled = unsafe { MLModel::compileModelAtURL_error(&mlpackage_url) }?;
let config = unsafe { MLModelConfiguration::new() };
unsafe { config.setComputeUnits(MLComputeUnits::CPUAndGPU) };
let model = unsafe { MLModel::modelWithContentsOfURL_configuration_error(&compiled, &config) }?;

// Build [1, 480000] float32 input, copy waveform via dataPointer,
// wrap in MLFeatureValue + MLDictionaryFeatureProvider, run prediction.
```

### Python (audio via coremltools, text via onnxruntime)
```python
import coremltools as ct
import numpy as np
import onnxruntime as ort
from transformers import AutoTokenizer

# --- audio: Core ML ---
audio_model = ct.models.MLModel("clap_audio_encoder.mlpackage")
waveform = np.zeros((1, 480_000), dtype=np.float32)
audio_emb = audio_model.predict({"audio": waveform})["embedding"]  # (1, 512)

# --- text: ONNX ---
tok = AutoTokenizer.from_pretrained("Xenova/larger_clap_general")
text_sess = ort.InferenceSession("text_model.onnx", providers=["CPUExecutionProvider"])
encoded = tok("a dog barking", return_tensors="np", padding=True)
text_emb = text_sess.run(["text_embeds"], {
    "input_ids": encoded["input_ids"].astype(np.int64),
    "attention_mask": encoded["attention_mask"].astype(np.int64),
})[0]  # (1, 512)

# Joint-embedding similarity:
similarity = float(np.dot(audio_emb.flatten(), text_emb.flatten()))
```

## How it was built

### Audio (`clap_audio_encoder.mlpackage`)

`coremltools` 8 + `torch.export` from `laion/larger_clap_general`'s PyTorch weights, then `convert_to="mlprogram"` + int8 linear weight quantization. The conversion is non-trivial: `ct.convert` rejects the model out of the box. Patches applied:

1. **`F.interpolate(mode='bicubic')` → `'bilinear'`**: CoreML's MIL backend lacks bicubic upsampling. Used by HTSAT's positional-embedding resize. Accuracy delta is negligible.
2. **`torch.jit.is_tracing()` → `True`**: forces HF's CLAP code onto the static-shape path during conversion.
3. **`ClapAudioLayer.set_shift_and_window_size` → no-op**: the dynamic window adjustment hits a "data-dependent guard" error in `torch.export`. For our fixed `[1, 1, 1001, 64]` input the `__init__` values are already correct, so neutralizing is safe.
4. **Custom STFT**: `torch.stft`'s op signature drifts across torch versions and the coremltools handler unpacks the wrong arity; implemented as strided conv1d with pre-baked cos/sin Hann bases instead.
5. **Custom `fmod` MIL lowering**: HTSAT's relative-position arithmetic uses float modulo; coremltools has no built-in handler. Registered as `x - trunc(x/y) * y`.
6. **`slice_scatter` override**: HTSAT's attention-mask builder generates empty-slice `slice_scatter` calls at deeper Swin stages (e.g. `slice(0, -window_size)` evaluates to `slice(0, 0)`). The built-in handler's shape check rejects these; registered override that no-ops empty slices and reduces non-empty ones to `slice_by_index + concat`.
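
To make patch 4 concrete, here is a NumPy sketch of the idea: an STFT computed as strided dot products against Hann-windowed cos/sin bases, which is the operation a strided conv1d performs. This is not the code in the conversion script. The function names and the small `n_fft`/`hop` values are illustrative assumptions, and no centering or padding is applied.

```python
import numpy as np

def stft_bases(n_fft: int) -> tuple[np.ndarray, np.ndarray]:
    """Hann-windowed cosine and sine bases, one row per frequency bin."""
    n = np.arange(n_fft)
    k = np.arange(n_fft // 2 + 1)[:, np.newaxis]
    hann = 0.5 - 0.5 * np.cos(2 * np.pi * n / n_fft)   # periodic Hann
    angle = 2 * np.pi * k * n / n_fft
    return hann * np.cos(angle), -hann * np.sin(angle)

def stft_as_conv(x: np.ndarray, n_fft: int, hop: int) -> np.ndarray:
    """STFT via two strided correlations against fixed bases.

    Each output frame is the dot product of a length-n_fft slice of `x`
    with every basis row, stepping `hop` samples between slices.
    """
    cos_b, sin_b = stft_bases(n_fft)
    frames = np.lib.stride_tricks.sliding_window_view(x, n_fft)[::hop]
    real = frames @ cos_b.T
    imag = frames @ sin_b.T
    return real + 1j * imag            # shape: (n_frames, n_fft // 2 + 1)

# Compare against a windowed FFT on the same frames.
rng = np.random.default_rng(0)
signal = rng.standard_normal(1024)
n_fft, hop = 64, 16                    # illustrative sizes, not CLAP's settings
spec = stft_as_conv(signal, n_fft, hop)
hann = 0.5 - 0.5 * np.cos(2 * np.pi * np.arange(n_fft) / n_fft)
frames = np.lib.stride_tricks.sliding_window_view(signal, n_fft)[::hop]
reference = np.fft.rfft(frames * hann, axis=-1)
max_err = float(np.max(np.abs(spec - reference)))
```

Because the bases are constants, the whole transform reduces to convolution weights that can be baked into the graph, with no dedicated STFT op needed at conversion time.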

A full conversion script that applies all six patches is included in this repo: [`convert-clap-to-coreml.py`](./convert-clap-to-coreml.py). Run with `pip install "coremltools>=8,<9" "torch>=2.6,<2.10" "transformers>=4.40" "numpy>=1.24,<2"` (the quotes stop the shell from treating `>` and `<` as redirections), then `python convert-clap-to-coreml.py --output clap_audio_encoder.mlpackage`. Validation (cosine vs PyTorch reference) runs automatically.
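
The `fmod` lowering from patch 5 is easy to sanity-check outside the converter. The sketch below uses only the standard library and compares the `x - trunc(x/y) * y` form against `math.fmod` on a few exactly representable values; `fmod_lowered` is a hypothetical name.

```python
import math

def fmod_lowered(x: float, y: float) -> float:
    """Float modulo written as x - trunc(x / y) * y.

    The result keeps the sign of x, matching C-style fmod rather than
    Python's % operator (which follows the sign of y).
    """
    return x - math.trunc(x / y) * y

cases = [(7.5, 2.0), (-7.5, 2.0), (7.5, -2.0), (5.0, 5.0), (0.25, 1.0)]
results = [(fmod_lowered(x, y), math.fmod(x, y)) for x, y in cases]
for (x, y), (ours, ref) in zip(cases, results):
    print(f"fmod({x}, {y}): lowered={ours}, math.fmod={ref}")
```

For very large quotients the division can round and the two forms may differ in the last bits, so treat the identity as exact only for moderate magnitudes.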

### Text (`text_model.onnx`)

Plain `torch.onnx.export` from the same PyTorch source: no `optimum`, no graph fusion, no quantization. RoBERTa exports cleanly so no per-op patches are needed. Recent `torch.onnx.export` writes weights to a sidecar `.onnx.data` file by default; the conversion script consolidates them back into a single ~500 MB `.onnx` so distribution is one file. Opset 17.

Companion script: [`convert-clap-text-to-onnx.py`](./convert-clap-text-to-onnx.py). Same dependencies as the audio script plus `pip install onnx onnxruntime`.

## Validation

### Audio
Cosine similarity vs the PyTorch reference, on random `[1, 480000]` peak-normalized inputs:

| Trial | Cosine |
|---|---|
| 1 | 0.999393 |
| 2 | 0.998725 |
| 3 | 0.998992 |

Drift is dominated by int8 weight quantization. For full fp32 weights, re-run the audio conversion with `--quantize none` (~3× larger file, ~1.0 cosine).
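
For intuition about why int8 weights cost a little cosine, the sketch below quantizes a random stand-in weight matrix with symmetric per-row linear quantization and measures the cosine between the fp32 and dequantized outputs. It is a toy under stated assumptions: a single random layer, not the real encoder and not coremltools' implementation.

```python
import numpy as np

def quantize_int8_per_row(w: np.ndarray) -> tuple[np.ndarray, np.ndarray]:
    """Symmetric linear quantization: one float scale per output row."""
    scale = np.max(np.abs(w), axis=1, keepdims=True) / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale.astype(np.float32)

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two 1-D vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

rng = np.random.default_rng(0)
weights = rng.standard_normal((512, 512)).astype(np.float32)   # stand-in layer
x = rng.standard_normal(512).astype(np.float32)

q, scale = quantize_int8_per_row(weights)
dequantized = q.astype(np.float32) * scale
drift = cosine(weights @ x, dequantized @ x)
print(f"cosine after int8 round trip: {drift:.6f}")
```

A single layer stays very close to 1.0; in a deep network the per-layer rounding errors compound, which is consistent with end-to-end cosines slightly below 1.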

### Text
Cosine similarity vs the PyTorch reference, on five sample queries:

| Query | Cosine |
|---|---|
| `"a dog barking"` | 1.000000 |
| `"808 kick drum"` | 1.000000 |
| `"lo-fi piano loop with vinyl crackle"` | 1.000000 |
| `"ambient pad with reverb"` | 1.000000 |
| `"voice saying hello"` | 1.000000 |

No quantization on the text side, so the outputs match PyTorch to within fp32 rounding noise.

## Performance

Apple M-series, `MLComputeUnits.cpuAndGPU`:

| | Latency per 10 s window |
|---|---|
| Cold start (first forward pass) | ~5 s (Core ML graph compile + GPU upload) |
| Steady state | ~30 ms |

Compared to running the original `.onnx` via `ort` on Apple Silicon CPU, that's a roughly 10× speedup for the steady state. ANE (`MLComputeUnits.all`) was not attempted: `cpuAndGPU` was the sweet spot during testing, and the strictest backend often rejects whole-graph compilation for transformer audio models.

## Limitations

- **No `logit_scale`.** The original CLAP model's learnable temperature isn't included here (projection heads only). For zero-shot classification you can either ignore it (cosine alone usually ranks correctly) or pull it from the original `laion/larger_clap_general` checkpoint.
- **Fixed audio input shape.** Audio shorter than 10 s must be zero-padded; longer requires the sliding-window recipe above.
- **int8 audio quantization.** ~99.9 % cosine is sufficient for retrieval / search use cases; if you're using these embeddings as inputs to downstream training, re-run audio conversion with `--quantize none`.
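
On the missing `logit_scale`: the sketch below shows why skipping it usually does not change rankings. A positive multiplier sharpens the softmax but preserves the order of the scores. The embeddings are random placeholders, `zero_shot_probs` is a hypothetical helper, and `20.0` is a made-up scale rather than the checkpoint's value.

```python
import numpy as np

def zero_shot_probs(audio_emb: np.ndarray, text_embs: np.ndarray,
                    scale: float = 1.0) -> np.ndarray:
    """Softmax over scaled cosine similarities.

    audio_emb: (512,) unit vector; text_embs: (N, 512) unit vectors.
    `scale` plays the role of the temperature multiplier; 1.0 means
    "no logit_scale", which changes the probabilities but not the ranking.
    """
    logits = scale * (text_embs @ audio_emb)
    logits = logits - logits.max()          # numerical stability
    exp = np.exp(logits)
    return exp / exp.sum()

# Placeholder embeddings: the audio vector is a noisy copy of label 1.
rng = np.random.default_rng(0)
text_embs = rng.standard_normal((3, 512))
text_embs /= np.linalg.norm(text_embs, axis=1, keepdims=True)
audio_emb = text_embs[1] + 0.02 * rng.standard_normal(512)
audio_emb /= np.linalg.norm(audio_emb)

plain = zero_shot_probs(audio_emb, text_embs)               # cosine only
scaled = zero_shot_probs(audio_emb, text_embs, scale=20.0)  # placeholder scale
```

If calibrated probabilities matter, read the scale from the original checkpoint and check how it is stored there (for example, as a log value that needs exponentiating) before multiplying.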

## Credits

- [LAION](https://laion.ai) for [`larger_clap_general`](https://huggingface.co/laion/larger_clap_general).
- [gridshiftstudio/clap-music-coreml](https://huggingface.co/gridshiftstudio/clap-music-coreml) for the first public proof that this conversion is viable + the two key patches.

## Citation

If you use this model in your work, please cite the original CLAP paper ([arXiv:2211.06687](https://arxiv.org/abs/2211.06687)):

```bibtex
@misc{https://doi.org/10.48550/arxiv.2211.06687,
  doi = {10.48550/ARXIV.2211.06687},
  url = {https://arxiv.org/abs/2211.06687},
  author = {Wu, Yusong and Chen, Ke and Zhang, Tianyu and Hui, Yuchen and Berg-Kirkpatrick, Taylor and Dubnov, Shlomo},
  title = {Large-scale Contrastive Language-Audio Pretraining with Feature Fusion and Keyword-to-Caption Augmentation},
  publisher = {arXiv},
  year = {2022},
  copyright = {Creative Commons Attribution 4.0 International}
}
```

## License

This artifact inherits the source model's license: **Apache 2.0**.