	---
	license: apache-2.0
	tags:
	- audio
	- audio-embedding
	- clap
	- laion-clap
	- coreml
	- on-device
	library_name: coreml
	base_model: laion/larger_clap_music
	---

	# LAION-CLAP (Music) → Core ML

	On-device audio-embedding model for Apple Silicon Macs. Converted from
	[laion/larger_clap_music](https://huggingface.co/laion/larger_clap_music)
	(HTSAT-base audio encoder + audio projection) to a self-contained Core ML
	`.mlpackage`, int8-quantized.

	Used by [Gridshift](https://gridshift.studio) for sample similarity search —
	"find samples that sound like this kick" — and (in a later phase) text-to-sample
	retrieval.

	## Input / output contract

	```
	audio:     fp32 tensor [1, 480000]   10 s mono @ 48 kHz, peak-normalized to [-1, 1]
	embedding: fp32 tensor [1, 512]      L2-normalized, cosine = dot product
	```

	Mel-spectrogram preprocessing is baked into the model graph (via
	[convmelspec](https://github.com/adobe-research/convmelspec) STFT), so the
	client computes no STFT or mel features itself: it only supplies raw audio
	samples in the format above.
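The two client-side parts of this contract, peak normalization on the way in and dot-product similarity on the way out, can be sketched as follows. This is a minimal illustration in plain Python; a real client would more likely use Accelerate or NumPy. The names `peak_normalize` and `cosine_similarity` are hypothetical, and returning silent input unscaled is an assumption rather than something this card specifies.

```python
def peak_normalize(samples, eps=1e-9):
    """Scale samples so the largest absolute value becomes 1.0.

    Assumption: (near-)silent input is returned unscaled, to avoid
    dividing by a peak of ~0.
    """
    peak = max((abs(s) for s in samples), default=0.0)
    if peak < eps:
        return list(samples)
    return [s / peak for s in samples]


def cosine_similarity(a, b):
    """Dot product of two embeddings.

    This equals cosine similarity only because the model output is
    already L2-normalized.
    """
    return sum(x * y for x, y in zip(a, b))
```

Because the embeddings come out unit-length, ranking a sample library against a query needs only one dot product per candidate.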

	## Accuracy vs PyTorch reference (5 synthetic signals)

	| signal        | cos(ref, coreml) |
	|---------------|-----------------:|
	| sine 440 Hz   |          0.99851 |
	| sine 220 Hz   |          0.99746 |
	| white noise   |          0.99977 |
	| silence       |          0.99986 |
	| clipped noise |          0.99977 |

	Pairwise distance structure between signals is preserved: the maximum drift
	is 0.004, well within the 0.02 acceptance threshold, so relative similarity
	rankings between samples remain intact through the int8 quantization.
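One plausible way to compute such a drift figure is sketched below. It assumes "drift" means the largest absolute change in pairwise cosine similarity between the PyTorch and Core ML embeddings; the card does not give the exact definition, and `max_pairwise_drift` is a hypothetical helper name.

```python
from itertools import combinations


def dot(a, b):
    """Dot product; equals cosine similarity for unit-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def max_pairwise_drift(ref_embeddings, coreml_embeddings):
    """Largest |cos_ref(i, j) - cos_coreml(i, j)| over all pairs i < j.

    Both arguments are lists of L2-normalized vectors, one per test
    signal, in the same order.
    """
    if len(ref_embeddings) != len(coreml_embeddings):
        raise ValueError("need one Core ML embedding per reference embedding")
    drift = 0.0
    for i, j in combinations(range(len(ref_embeddings)), 2):
        ref_sim = dot(ref_embeddings[i], ref_embeddings[j])
        cml_sim = dot(coreml_embeddings[i], coreml_embeddings[j])
        drift = max(drift, abs(ref_sim - cml_sim))
    return drift
```

Under this reading, a conversion check would compare the returned value against the 0.02 threshold.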

	## Handling audio of different lengths

	The Core ML graph is shape-rigid at 480 000 samples (10 s). The client is
	expected to preprocess:

	- ≤ 10 s: zero-pad on the right to 480 000 samples.
	- < 200 ms (short one-shots): first repeat-pad to ~500 ms, then zero-pad as
	above. This prevents padding from dominating the embedding.
	- > 10 s (loops): take three sliding windows with 50 % overlap, mean-pool the
	three 512-d embeddings, and re-normalize the result to unit length.
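The three cases above can be sketched in plain Python as follows. All function and constant names here are hypothetical. The sketch also assumes the three windows start at 0 s, 5 s and 10 s, so audio past 20 s is ignored, and that a window running past the end of the clip is zero-padded; the card does not say how windows are placed on longer loops.

```python
import math

SAMPLE_RATE = 48_000
WINDOW = 480_000                    # 10 s, the fixed model input length
SHORT_THRESHOLD = SAMPLE_RATE // 5  # 200 ms
REPEAT_TARGET = SAMPLE_RATE // 2    # ~500 ms


def zero_pad(samples, length=WINDOW):
    """Right-pad with zeros up to `length` (never truncates)."""
    return list(samples) + [0.0] * max(0, length - len(samples))


def repeat_pad(samples, target=REPEAT_TARGET):
    """Tile a short clip until it reaches exactly `target` samples."""
    if not samples:
        return []
    reps = -(-target // len(samples))  # ceiling division
    return (list(samples) * reps)[:target]


def make_windows(samples):
    """Return one model input for clips up to 10 s, three for longer clips."""
    n = len(samples)
    if n <= WINDOW:
        if 0 < n < SHORT_THRESHOLD:
            samples = repeat_pad(samples)
        return [zero_pad(samples)]
    hop = WINDOW // 2  # 50 % overlap between consecutive windows
    return [zero_pad(samples[k * hop:k * hop + WINDOW]) for k in range(3)]


def pool_embeddings(embeddings):
    """Mean-pool equal-length vectors, then rescale to unit L2 norm."""
    dim = len(embeddings[0])
    mean = [sum(e[d] for e in embeddings) / len(embeddings) for d in range(dim)]
    norm = math.sqrt(sum(v * v for v in mean))
    return mean if norm == 0.0 else [v / norm for v in mean]
```

Each window returned by `make_windows` would be passed to the model separately, and `pool_embeddings` applied to the resulting vectors. With a single window the pooling step leaves the already-normalized embedding unchanged, up to rounding.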

	## License and attribution

	Apache-2.0, inherited from upstream LAION-CLAP. Please cite:

	```
	Wu et al., "Large-scale Contrastive Language-Audio Pretraining with Feature
	Fusion and Keyword-to-Caption Augmentation", 2022.
	https://arxiv.org/abs/2211.06687
	```

	## Conversion details

	Conversion was done with the script at
	`app/ml/clap/convert_to_coreml.py` in the Gridshift source tree, using:

	- PyTorch 2.11 + torch.export
	- coremltools 9.0 MLProgram backend
	- int8 symmetric weight quantization
	- bicubic → bilinear interp swap for Core ML compat (minimal accuracy impact)
	- CLAP window-size patch for `torch.jit.is_tracing` branch divergence
	- Fixed input shape [1, 480000] baked into the graph

	Target: macOS 14+.
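For readers unfamiliar with the term, symmetric int8 weight quantization can be illustrated with the sketch below. It is not the conversion script. It assumes a single per-tensor scale and a code range of [-127, 127]; the actual granularity and range used by the conversion are not stated here, and the function names are hypothetical.

```python
def quantize_symmetric_int8(weights):
    """Map floats to integer codes in [-127, 127] with one shared scale.

    "Symmetric" means the zero point is fixed at 0, so a weight of 0.0
    always maps to code 0.
    """
    max_abs = max((abs(w) for w in weights), default=0.0)
    scale = max_abs / 127.0 if max_abs > 0.0 else 1.0
    codes = [max(-127, min(127, round(w / scale))) for w in weights]
    return codes, scale


def dequantize(codes, scale):
    """Recover approximate float weights from codes and the shared scale."""
    return [c * scale for c in codes]
```

In this scheme the round-trip error per weight is at most half a quantization step (`scale / 2`), which is the kind of small perturbation the accuracy table above measures end to end.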