---
license: apache-2.0
tags:
- audio
- audio-embedding
- clap
- laion-clap
- coreml
- on-device
library_name: coreml
base_model: laion/larger_clap_music
---

# LAION-CLAP (Music) → Core ML

On-device audio-embedding model for Apple Silicon Macs. Converted from [laion/larger_clap_music](https://huggingface.co/laion/larger_clap_music) (HTSAT-base audio encoder + audio projection) to a self-contained Core ML `.mlpackage`, int8-quantized.

Used by [Gridshift](https://gridshift.studio) for sample similarity search ("find samples that sound like this kick") and, in a later phase, text-to-sample retrieval.

## Input / output contract

```
audio:     fp32 tensor [1, 480000]   10 s mono @ 48 kHz, peak-normalized to [-1, 1]
embedding: fp32 tensor [1, 512]      L2-normalized, cosine = dot product
```

Mel-spectrogram preprocessing is baked into the model graph (via [convmelspec](https://github.com/adobe-research/convmelspec) STFT), so the client does no DSP preprocessing: it supplies raw audio samples only.

## Accuracy vs PyTorch reference (5 synthetic signals)

| signal        | cos(ref, coreml) |
|---------------|-----------------:|
| sine 440 Hz   |          0.99851 |
| sine 220 Hz   |          0.99746 |
| white noise   |          0.99977 |
| silence       |          0.99986 |
| clipped noise |          0.99977 |

Pairwise distance structure between signals is preserved with a maximum drift of 0.004 (threshold ≤ 0.02), so relative similarity rankings between samples remain intact through the int8 quantization.

## Handling audio of different lengths

The Core ML graph is shape-rigid at 480 000 samples (10 s). The client is expected to fit each clip to that length before inference:

- **≤ 10 s**: zero-pad on the right.
- **< 200 ms** (short one-shots): repeat-pad to ~500 ms, then zero-pad. This prevents the padding from dominating the embedding.
- **> 10 s** (loops): take three sliding windows with 50 % overlap, mean-pool the three 512-d embeddings, then re-normalize the result to unit length.

## License and attribution

Apache-2.0, inherited from upstream LAION-CLAP.
Please cite:

```
Wu et al., "Large-scale Contrastive Language-Audio Pretraining with Feature Fusion and Keyword-to-Caption Augmentation", 2022.
https://arxiv.org/abs/2211.06687
```

## Conversion details

Conversion was done with the script at `app/ml/clap/convert_to_coreml.py` in the Gridshift source tree, using:

- PyTorch 2.11 + torch.export
- coremltools 9.0 MLProgram backend
- int8 symmetric weight quantization
- bicubic → bilinear interp swap for Core ML compat (minimal accuracy impact)
- CLAP window-size patch for `torch.jit.is_tracing` branch divergence
- Fixed input shape [1, 480000] baked into the graph

Target: macOS 14+.
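
## Client-side preprocessing sketch

The rules under "Handling audio of different lengths" can be sketched in plain Python as below. This is an illustration under stated assumptions, not the Gridshift implementation. `embed_fn` is a hypothetical callable standing in for the Core ML model call (fixed-length samples in, embedding out). All function names are made up for this example. The source does not say where the three windows sit for clips that are not exactly 20 s, so the sketch assumes window starts at 0 s, 5 s, and 10 s, zero-padding any window that runs past the end and ignoring audio beyond 20 s. Repeat-padding is truncated to exactly 500 ms, which is one reading of "~500 ms".

```python
import math
from typing import Callable, List, Sequence

SAMPLE_RATE = 48_000
WINDOW = 480_000  # 10 s at 48 kHz, the model's fixed input length


def fit_to_window(samples: Sequence[float], window: int = WINDOW,
                  sample_rate: int = SAMPLE_RATE) -> List[float]:
    """Pad a clip of at most `window` samples to exactly `window` samples."""
    clip = list(samples)
    if len(clip) > window:
        raise ValueError("clip is longer than the window; use three_windows()")
    if not clip:
        return [0.0] * window  # assumption: an empty clip is treated as silence
    short_limit = int(0.2 * sample_rate)  # 200 ms
    repeat_to = int(0.5 * sample_rate)    # 500 ms
    if len(clip) < short_limit:
        # Repeat-pad short one-shots so zero padding does not dominate.
        reps = math.ceil(repeat_to / len(clip))
        clip = (clip * reps)[:repeat_to]
    return clip + [0.0] * (window - len(clip))


def three_windows(samples: Sequence[float], window: int = WINDOW) -> List[List[float]]:
    """Three windows with 50 % overlap (assumed starts: 0, window/2, window)."""
    clip = list(samples)
    hop = window // 2
    out = []
    for k in range(3):
        chunk = clip[k * hop:k * hop + window]
        out.append(chunk + [0.0] * (window - len(chunk)))
    return out


def mean_pool_normalized(embeddings: Sequence[Sequence[float]]) -> List[float]:
    """Average the embeddings element-wise, then rescale to unit L2 norm."""
    dim = len(embeddings[0])
    mean = [sum(e[i] for e in embeddings) / len(embeddings) for i in range(dim)]
    norm = math.sqrt(sum(x * x for x in mean))
    return mean if norm == 0 else [x / norm for x in mean]


def embed_clip(samples: Sequence[float],
               embed_fn: Callable[[List[float]], Sequence[float]],
               window: int = WINDOW,
               sample_rate: int = SAMPLE_RATE) -> List[float]:
    """Embed a clip of any length using the padding and pooling rules above."""
    if len(samples) <= window:
        return list(embed_fn(fit_to_window(samples, window, sample_rate)))
    return mean_pool_normalized([embed_fn(w) for w in three_windows(samples, window)])
```

Because the embeddings are unit length, similarity between two clips is the dot product of their `embed_clip` outputs. The input is assumed to be already mono, 48 kHz, and peak-normalized to [-1, 1], as the input contract requires.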