---
license: apache-2.0
library_name: coremltools
base_model: laion/larger_clap_general
pipeline_tag: feature-extraction
tags:
- audio
- audio-embedding
- clap
- htsat
- core-ml
- onnx
- apple-silicon
---
# larger-clap-general-coreml
Two artifacts derived from [`laion/larger_clap_general`](https://huggingface.co/laion/larger_clap_general), kept in the same embedding space so they can be used as a pair:
- **`clap_audio_encoder.mlpackage`**: native Core ML build of the audio encoder + projection head. Runs accelerated on Apple GPU via `MLComputeUnits.cpuAndGPU`.
- **`text_model.onnx`**: ONNX build of the text encoder + projection head. Standard ORT-compatible, cross-platform.
Both take their respective inputs and return an L2-normalized 512-d embedding in the joint CLAP space (cosine similarity == dot product).
`larger_clap_general` is trained on **general audio, music and speech**; use the pair for zero-shot audio classification or open-vocabulary retrieval.
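Because both encoders emit unit-length vectors, zero-shot classification reduces to a matrix product followed by a sort. A minimal NumPy sketch, using random unit vectors as stand-ins for real embeddings; `l2_normalize` and `rank_labels` are hypothetical helper names, not part of this repo:

```python
import numpy as np

def l2_normalize(x, axis=-1):
    """Scale vectors to unit length along `axis`."""
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def rank_labels(audio_emb, text_embs):
    """Rank candidate labels for one audio clip.

    audio_emb: (512,) unit vector; text_embs: (num_labels, 512) unit vectors.
    Returns (label indices sorted best-first, raw similarity scores).
    """
    scores = text_embs @ audio_emb      # dot product == cosine for unit vectors
    order = np.argsort(-scores)         # descending similarity
    return order, scores

# Stand-in data (assumption): three random "label" embeddings and an
# "audio" embedding built to sit close to label 1.
rng = np.random.default_rng(0)
text_embs = l2_normalize(rng.standard_normal((3, 512)))
audio_emb = l2_normalize(text_embs[1] + 0.01 * rng.standard_normal(512))

order, scores = rank_labels(audio_emb, text_embs)
```

With real embeddings, `text_embs` would be the stacked `text_embeds` outputs for each candidate label prompt and `audio_emb` the flattened `embedding` output.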
## Why this repo exists
- **Audio side**: `ort`'s CoreML execution provider can't accelerate HTSAT: reflect-pad, 5-D reshapes, relative-position-bias gather, and dynamic shapes shred the graph into CPU partitions, so the EP "registers" but every node runs on CPU. Loading the `.mlpackage` directly via Core ML (skipping ORT entirely) runs the full graph on the Apple GPU.
- **Text side**: this `text_model.onnx` is re-exported directly from LAION's PyTorch weights with no `optimum` graph fusion. Xenova's matching `larger_clap_general` ONNX export of the text encoder sits in a *slightly* different numerical subspace than LAION's PyTorch model (graph fusions + quantization add up), so pairing Xenova's text encoder with our LAION-derived audio model collapses text-audio cosine to ~0.2. Re-exporting text from the same PyTorch source recovers ~0.7+ on good matches.
## Inputs / Outputs
### Audio (`clap_audio_encoder.mlpackage`)
| | name | shape | dtype | notes |
|---|---|---|---|---|
| Input | `audio` | `[1, 480000]` | float32 | 10 s mono @ 48 kHz, peak-normalized to `[-1, 1]` |
| Output | `embedding` | `[1, 512]` | float32 | L2-normalized; cosine == dot product |
The mel-spectrogram extraction (STFT, Slaney mel filterbank, log) is **baked into the model graph**: you pass raw audio, not features.
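Since the model takes raw samples, the only preprocessing left to the caller is getting the waveform into the shape and range listed in the table. A minimal sketch, assuming the audio is already resampled to 48 kHz and is at most 10 s long; `to_model_input` is a hypothetical helper name, not something shipped in this repo:

```python
import numpy as np

TARGET_SAMPLES = 480_000  # 10 s at 48 kHz, the model's fixed input length

def to_model_input(samples):
    """Downmix to mono, peak-normalize to [-1, 1], zero-pad to (1, 480000).

    Assumes `samples` is at 48 kHz, shaped (n,) or (n, channels).
    """
    x = np.asarray(samples, dtype=np.float32)
    if x.ndim == 2:
        x = x.mean(axis=1)                      # average channels -> mono
    if x.shape[0] > TARGET_SAMPLES:
        raise ValueError("clip is longer than 10 s; split it into windows first")
    peak = float(np.max(np.abs(x))) if x.size else 0.0
    if peak > 0.0:
        x = x / peak                            # loudest sample becomes +/-1
    x = np.pad(x, (0, TARGET_SAMPLES - x.shape[0]))  # zero-pad the tail
    return x[np.newaxis, :].astype(np.float32)
```

Silent input (peak of zero) is passed through unscaled rather than divided by zero.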
### Text (`text_model.onnx`)
| | name | shape | dtype | notes |
|---|---|---|---|---|
| Input | `input_ids` | `[B, T]` | int64 | RoBERTa tokenizer output |
| Input | `attention_mask` | `[B, T]` | int64 | 1 for real tokens, 0 for padding |
| Output | `text_embeds` | `[B, 512]` | float32 | L2-normalized; cosine == dot product |
Both batch and sequence length are dynamic. Use the tokenizer from `Xenova/larger_clap_general` (or any `larger_clap_general` mirror with the standard RoBERTa tokenizer config); vocab + special tokens are identical across exports.
## Variable-length audio
The graph has a fixed 10 s input shape. For arbitrary-length audio, recommended recipe:
| Duration | Strategy |
|---|---|
| ≤ 10 s | Zero-pad to 480_000 samples, single forward pass. |
| > 10 s | Sliding 10 s windows with 50 % overlap, embed each window, **mean-pool the embeddings, re-L2-normalize.** |
For very long files, cap the window count to bound runtime: uniformly spacing N window starts across `[0, T - 10 s]` gives full-file coverage while keeping the number of forward passes fixed.
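The recipe above can be sketched in NumPy as follows. `window_starts` and `embed_audio` are hypothetical helper names, and `embed_fn` stands in for whatever runs the Core ML model on a `(1, 480000)` array. One assumption beyond the table: when the 50 % hop does not land exactly on the end of the file, an extra end-aligned window is added so the tail is covered.

```python
import numpy as np

SAMPLE_RATE = 48_000
WINDOW = 10 * SAMPLE_RATE      # 480_000 samples per forward pass
HOP = WINDOW // 2              # 50 % overlap

def window_starts(num_samples, max_windows=None):
    """Start offsets (in samples) of the 10 s windows for a clip."""
    if num_samples <= WINDOW:
        return [0]                                  # single zero-padded pass
    last = num_samples - WINDOW                     # latest valid start
    starts = list(range(0, last + 1, HOP))
    if starts[-1] != last:
        starts.append(last)                         # assumption: cover the tail
    if max_windows is not None and len(starts) > max_windows:
        # Cap runtime: N starts spaced uniformly across [0, T - 10 s].
        starts = [int(s) for s in np.linspace(0, last, max_windows)]
    return starts

def embed_audio(waveform, embed_fn, max_windows=None):
    """waveform: (n,) mono at 48 kHz; embed_fn maps (1, 480000) -> (1, 512)."""
    embs = []
    for start in window_starts(len(waveform), max_windows):
        chunk = waveform[start:start + WINDOW]
        chunk = np.pad(chunk, (0, WINDOW - len(chunk)))   # only pads short clips
        embs.append(np.asarray(embed_fn(chunk[np.newaxis, :])).reshape(-1))
    pooled = np.mean(embs, axis=0)                  # mean-pool window embeddings
    return pooled / np.linalg.norm(pooled)          # re-L2-normalize
```

Re-normalizing after pooling matters because the mean of unit vectors is generally shorter than unit length, which would otherwise break the cosine == dot product property.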
## Usage
### Swift (Core ML)
```swift
import CoreML
let config = MLModelConfiguration()
config.computeUnits = .cpuAndGPU
let model = try MLModel(contentsOf: compiledURL, configuration: config)
let audio = try MLMultiArray(shape: [1, 480_000], dataType: .float32)
// copy your normalized waveform into audio.dataPointer ...
let provider = try MLDictionaryFeatureProvider(dictionary: ["audio": audio])
let out = try model.prediction(from: provider)
let embedding = out.featureValue(for: "embedding")!.multiArrayValue!
```
### Rust (objc2-core-ml)
The `objc2`/`objc2-core-ml` crates give direct Rust bindings to Core ML. Sketch:
```rust
use objc2_core_ml::{MLModel, MLModelConfiguration, MLMultiArray, MLMultiArrayDataType,
MLDictionaryFeatureProvider, MLFeatureValue, MLComputeUnits};
// Core ML wants a compiled .mlmodelc: compile the .mlpackage once,
// then load with cpuAndGPU compute units.
let compiled = unsafe { MLModel::compileModelAtURL_error(&mlpackage_url) }?;
let config = unsafe { MLModelConfiguration::new() };
unsafe { config.setComputeUnits(MLComputeUnits::CPUAndGPU) };
let model = unsafe { MLModel::modelWithContentsOfURL_configuration_error(&compiled, &config) }?;
// Build [1, 480000] float32 input, copy waveform via dataPointer,
// wrap in MLFeatureValue + MLDictionaryFeatureProvider, run prediction.
```
### Python (audio via coremltools, text via onnxruntime)
```python
import coremltools as ct
import numpy as np
import onnxruntime as ort
from transformers import AutoTokenizer
# --- audio: Core ML ---
audio_model = ct.models.MLModel("clap_audio_encoder.mlpackage")
waveform = np.zeros((1, 480_000), dtype=np.float32)
audio_emb = audio_model.predict({"audio": waveform})["embedding"] # (1, 512)
# --- text: ONNX ---
tok = AutoTokenizer.from_pretrained("Xenova/larger_clap_general")
text_sess = ort.InferenceSession("text_model.onnx", providers=["CPUExecutionProvider"])
encoded = tok("a dog barking", return_tensors="np", padding=True)
text_emb = text_sess.run(["text_embeds"], {
"input_ids": encoded["input_ids"].astype(np.int64),
"attention_mask": encoded["attention_mask"].astype(np.int64),
})[0] # (1, 512)
# Joint-embedding similarity:
similarity = float(np.dot(audio_emb.flatten(), text_emb.flatten()))
```
## How it was built
### Audio (`clap_audio_encoder.mlpackage`)
`coremltools` 8 + `torch.export` from `laion/larger_clap_general`'s PyTorch weights, then `convert_to="mlprogram"` + int8 linear weight quantization. The conversion is non-trivial: `ct.convert` rejects the model out of the box. Patches applied:
1. **`F.interpolate(mode='bicubic')` → `'bilinear'`**: Core ML's MIL backend lacks bicubic upsampling, which HTSAT's positional-embedding resize uses. Accuracy delta is negligible.
2. **`torch.jit.is_tracing()` → `True`**: forces HF's CLAP code onto the static-shape path during conversion.
3. **`ClapAudioLayer.set_shift_and_window_size` → no-op**: the dynamic window adjustment hits a "data-dependent guard" error in `torch.export`. For our fixed `[1, 1, 1001, 64]` input the `__init__` values are already correct, so neutralizing is safe.
4. **Custom STFT**: `torch.stft`'s op signature drifts across torch versions and the coremltools handler unpacks the wrong arity; implemented as strided conv1d with pre-baked cos/sin Hann bases instead.
5. **Custom `fmod` MIL lowering**: HTSAT's relative-position arithmetic uses float modulo; coremltools has no built-in handler. Registered as `x - trunc(x/y) * y`.
6. **`slice_scatter` override**: HTSAT's attention-mask builder generates empty-slice `slice_scatter` calls at deeper Swin stages (e.g. `slice(0, -window_size)` evaluates to `slice(0, 0)`). The built-in handler's shape check rejects these; registered override that no-ops empty slices and reduces non-empty ones to `slice_by_index + concat`.
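To illustrate the idea behind patch 4, the sketch below builds cos/sin DFT bases with a Hann window baked in and slides them over a signal with a fixed stride, which is what a strided conv1d with those bases as kernels computes. This is a NumPy illustration only, not the code in the conversion script: the `n_fft` and `hop` values used in the example are arbitrary rather than the model's, the window is assumed to be periodic Hann, and padding at the signal edges is omitted.

```python
import numpy as np

def hann_dft_bases(n_fft):
    """Cos/sin DFT bases with a periodic Hann window baked in.

    Each returned array has shape (n_fft // 2 + 1, n_fft): one kernel per bin.
    """
    n = np.arange(n_fft)
    window = 0.5 - 0.5 * np.cos(2 * np.pi * n / n_fft)
    k = np.arange(n_fft // 2 + 1)[:, None]
    angle = 2 * np.pi * k * n / n_fft
    return np.cos(angle) * window, -np.sin(angle) * window

def stft_as_conv(signal, n_fft, hop):
    """Real and imaginary STFT parts via sliding dot products (stride = hop)."""
    cos_b, sin_b = hann_dft_bases(n_fft)
    # Every hop-th length-n_fft slice of the signal: shape (num_frames, n_fft).
    frames = np.lib.stride_tricks.sliding_window_view(signal, n_fft)[::hop]
    real = frames @ cos_b.T     # equivalent to conv1d with cos kernels
    imag = frames @ sin_b.T     # equivalent to conv1d with -sin kernels
    return real, imag

# Example with arbitrary sizes (assumption, not the model's STFT settings).
rng = np.random.default_rng(0)
signal = rng.standard_normal(2048)
real, imag = stft_as_conv(signal, n_fft=64, hop=16)
```

Because the bases are constants, the whole transform becomes a fixed-weight convolution, which avoids depending on the converter's handling of `torch.stft`.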
A full conversion script that applies all six patches is included in this repo: [`convert-clap-to-coreml.py`](./convert-clap-to-coreml.py). Run with `pip install "coremltools>=8,<9" "torch>=2.6,<2.10" "transformers>=4.40" "numpy>=1.24,<2"` (the quotes keep the shell from treating `<` and `>` as redirections), then `python convert-clap-to-coreml.py --output clap_audio_encoder.mlpackage`. Validation (cosine vs PyTorch reference) runs automatically.
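The identity used for the `fmod` lowering in patch 5 can be checked in plain Python against `math.fmod`. This is a standalone sanity check of the formula, not the MIL registration itself; `fmod_via_trunc` is a hypothetical name:

```python
import math

def fmod_via_trunc(x, y):
    """Float modulo whose result takes the sign of x: x - trunc(x / y) * y."""
    return x - math.trunc(x / y) * y

# Sample operands covering both signs; values are exactly representable
# in binary floating point so the comparison is not muddied by rounding.
cases = [(7.5, 2.0), (-7.5, 2.0), (7.5, -2.0), (-7.5, -2.0), (0.25, 1.0)]
results = [(fmod_via_trunc(x, y), math.fmod(x, y)) for x, y in cases]
```

The truncation matters: Python's `%` operator rounds toward negative infinity instead, so `-7.5 % 2.0` is `0.5`, while `fmod` semantics give `-1.5`.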
### Text (`text_model.onnx`)
Plain `torch.onnx.export` from the same PyTorch source: no `optimum`, no graph fusion, no quantization. RoBERTa exports cleanly so no per-op patches are needed. Recent `torch.onnx.export` writes weights to a sidecar `.onnx.data` file by default; the conversion script consolidates them back into a single ~500 MB `.onnx` so distribution is one file. Opset 17.
Companion script: [`convert-clap-text-to-onnx.py`](./convert-clap-text-to-onnx.py). Same dependencies as the audio script plus `pip install onnx onnxruntime`.
## Validation
### Audio
Cosine similarity vs the PyTorch reference, on random `[1, 480000]` peak-normalized inputs:
| Trial | Cosine |
|---|---|
| 1 | 0.999393 |
| 2 | 0.998725 |
| 3 | 0.998992 |
Drift is dominated by int8 weight quantization. For full fp32 weights, re-run the audio conversion with `--quantize none` (~3× larger file, ~1.0 cosine).
### Text
Cosine similarity vs the PyTorch reference, on five sample queries:
| Query | Cosine |
|---|---|
| `"a dog barking"` | 1.000000 |
| `"808 kick drum"` | 1.000000 |
| `"lo-fi piano loop with vinyl crackle"` | 1.000000 |
| `"ambient pad with reverb"` | 1.000000 |
| `"voice saying hello"` | 1.000000 |
No quantization on the text side, so the ONNX output matches PyTorch to within fp32 noise.
## Performance
Apple M-series, `MLComputeUnits.cpuAndGPU`:
| | Latency per 10 s window |
|---|---|
| Cold start (first forward pass) | ~5 s (Core ML graph compile + GPU upload) |
| Steady state | ~30 ms |
Compared to running the original `.onnx` via `ort` on Apple Silicon CPU, that's a roughly 10× speedup for the steady state. ANE was not attempted (`MLComputeUnits.all`); `cpuAndGPU` was the sweet spot during testing, and the strictest backend often rejects whole-graph compilation for transformer audio models.
## Limitations
- **No `logit_scale`.** The original CLAP model's learnable temperature isn't included here; only the projection heads are. For zero-shot classification you can either ignore it (cosine alone usually ranks correctly) or pull it from the original `laion/larger_clap_general` checkpoint.
- **Fixed audio input shape.** Audio shorter than 10 s must be zero-padded; longer requires the sliding-window recipe above.
- **int8 audio quantization.** ~99.9 % cosine is sufficient for retrieval / search use cases; if you're using these embeddings as inputs to downstream training, re-run audio conversion with `--quantize none`.
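To make the `logit_scale` point concrete: multiplying cosines by a positive temperature never changes their order, only how peaked the softmax probabilities are. A small sketch with made-up cosine values and a placeholder scale (the value `20.0` is arbitrary, not the checkpoint's actual temperature):

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax over a 1-D array."""
    z = logits - np.max(logits)
    e = np.exp(z)
    return e / e.sum()

cosines = np.array([0.71, 0.35, 0.12])   # made-up audio-vs-label cosines
logit_scale = 20.0                       # placeholder, NOT the real CLAP value

probs_unscaled = softmax(cosines)                # nearly flat distribution
probs_scaled = softmax(logit_scale * cosines)    # same ranking, much sharper
```

So for ranking or retrieval the scale can be skipped, while calibrated class probabilities need the value from the original checkpoint.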
## Credits
- [LAION](https://laion.ai) for [`larger_clap_general`](https://huggingface.co/laion/larger_clap_general).
- [gridshiftstudio/clap-music-coreml](https://huggingface.co/gridshiftstudio/clap-music-coreml) for the first public proof that this conversion is viable + the two key patches.
## Citation
If you use this model in your work, please cite the original CLAP paper ([arXiv:2211.06687](https://arxiv.org/abs/2211.06687)):
```bibtex
@misc{https://doi.org/10.48550/arxiv.2211.06687,
doi = {10.48550/ARXIV.2211.06687},
url = {https://arxiv.org/abs/2211.06687},
author = {Wu, Yusong and Chen, Ke and Zhang, Tianyu and Hui, Yuchen and Berg-Kirkpatrick, Taylor and Dubnov, Shlomo},
title = {Large-scale Contrastive Language-Audio Pretraining with Feature Fusion and Keyword-to-Caption Augmentation},
publisher = {arXiv},
year = {2022},
copyright = {Creative Commons Attribution 4.0 International}
}
```
## License
This artifact inherits the source model's license: **Apache 2.0**.