TinyCLAP CoreML
CoreML-converted tinyCLAP models for on-device semantic audio search on macOS and iOS.
What's inside
This repository contains the student tinyCLAP encoder pair converted to CoreML:
| File | Size | Purpose |
|---|---|---|
| `TinyCLAP_AudioEncoder.mlpackage` | ~17 MB | PhiNet-based audio encoder. Consumes log-mel spectrograms and emits 1024-dim L2-normalized embeddings. |
| `TinyCLAP_TextEncoder.mlpackage` | ~422 MB | BERT-base text encoder. Consumes tokenized text and emits 1024-dim L2-normalized embeddings. |
| `mel_filter_bank.json` | ~170 KB | Pre-computed 64-bin mel filter bank (Slaney, 50–14000 Hz) for audio preprocessing. |
| `text_embeddings.json` | ~220 KB | Sample text embeddings for sanity-checking the model output. |
Architecture
Audio file (44.1 kHz mono)
↓
STFT (n_fft=1024, hop=320, win=1024, Hann)
↓
Log-mel spectrogram (64 bins, 50–14000 Hz)
↓
TinyCLAP_AudioEncoder.mlpackage
↓
1024-dim L2-normalised embedding
Text query
↓
BERT tokenization (max_length=100)
↓
TinyCLAP_TextEncoder.mlpackage
↓
1024-dim L2-normalised embedding
Cosine similarity between the two embeddings yields semantic relevance scores.
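Because both encoders emit L2-normalized vectors, the cosine similarity reduces to a plain dot product. The following is a minimal sketch of ranking clips against a text query; the function name `rank_by_cosine` and the toy 4-dim vectors are illustrative assumptions, not part of this repository.

```python
import numpy as np

def rank_by_cosine(text_embedding, audio_embeddings):
    """Return (indices, scores) of audio clips, most relevant first."""
    text = np.asarray(text_embedding, dtype=np.float32).reshape(-1)
    audio = np.asarray(audio_embeddings, dtype=np.float32)
    # Re-normalize defensively; unit-length inputs are unchanged up to rounding.
    text = text / np.linalg.norm(text)
    audio = audio / np.linalg.norm(audio, axis=1, keepdims=True)
    scores = audio @ text  # cosine similarity as a dot product
    order = np.argsort(-scores)
    return order, scores[order]

# Toy 4-dim vectors stand in for the real 1024-dim embeddings.
query = np.array([1.0, 0.0, 0.0, 0.0])
clips = np.array([[0.0, 1.0, 0.0, 0.0],
                  [0.6, 0.8, 0.0, 0.0],
                  [1.0, 0.0, 0.0, 0.0]])
order, scores = rank_by_cosine(query, clips)
```

For a library of pre-computed audio embeddings, this is a single matrix-vector product per query, so only the text encoder has to run at search time.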
Preprocessing spec (must match exactly)
| Parameter | Value |
|---|---|
| Sample rate | 44100 Hz |
| Channels | Mono |
| FFT size | 1024 |
| Hop length | 320 |
| Window length | 1024 |
| Window | Hann |
| Center | true |
| Pad mode | reflect |
| Mel bins | 64 |
| Mel fmin | 50 Hz |
| Mel fmax | 14000 Hz |
| Ref | 1.0 |
| Amin | 1e-10 |
| Max text length | 100 tokens |
| Log-mel output shape (audio encoder input) | [1, 1, 640, 64] (batch, channel, time, freq) |
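The parameters above can be sketched in NumPy as follows. This is a hedged illustration, not the reference implementation: it assumes a power spectrogram, a periodic Hann window, and a filter bank laid out as a (64, 513) matrix. The JSON layout of `mel_filter_bank.json` is not documented here, so the demo uses a random placeholder matrix instead of loading that file.

```python
import numpy as np

N_FFT, HOP, N_MELS = 1024, 320, 64
REF, AMIN = 1.0, 1e-10

def log_mel(waveform, mel_basis):
    """waveform: 1-D mono signal at 44.1 kHz.
    mel_basis: (64, 513) filter bank (assumed layout).
    Returns a float32 array shaped [1, 1, frames, 64]."""
    x = np.asarray(waveform, dtype=np.float32)
    # center=true with reflect padding: n_fft // 2 samples on each side.
    x = np.pad(x, N_FFT // 2, mode="reflect")
    n_frames = 1 + (len(x) - N_FFT) // HOP
    # Periodic Hann window (assumption).
    window = 0.5 - 0.5 * np.cos(2.0 * np.pi * np.arange(N_FFT) / N_FFT)
    # Index matrix that slices the signal into overlapping frames.
    idx = np.arange(N_FFT)[None, :] + HOP * np.arange(n_frames)[:, None]
    frames = x[idx] * window
    power = np.abs(np.fft.rfft(frames, n=N_FFT, axis=1)) ** 2  # (frames, 513)
    mel = power @ np.asarray(mel_basis).T                      # (frames, 64)
    # Decibel conversion using the Ref and Amin values from the table.
    log_spec = 10.0 * np.log10(np.maximum(mel, AMIN))
    log_spec -= 10.0 * np.log10(max(AMIN, REF))
    return log_spec[None, None, :, :].astype(np.float32)

# Demo with synthetic data: 639 hops of noise and a placeholder filter bank.
rng = np.random.default_rng(0)
demo_wave = rng.standard_normal(HOP * 639).astype(np.float32)
demo_basis = np.abs(rng.standard_normal((N_MELS, N_FFT // 2 + 1)))
features = log_mel(demo_wave, demo_basis)
print(features.shape)  # (1, 1, 640, 64)
```

With centered framing, a clip of 639 × 320 = 204480 samples (about 4.64 s at 44.1 kHz) yields exactly the nominal 640 frames.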
CoreML conversion details
- Backend: `mlprogram` (`convert_to="mlprogram"`)
- Compute units: `ALL` (CPU + GPU + Apple Neural Engine)
- Precision: `FLOAT32`
- Minimum deployment target: macOS 14 / iOS 17
- Audio encoder input: flexible time dimension (640 frames nominal)
- Text encoder input: fixed `[1, 100]` for both `input_ids` and `attention_mask`
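Since the text encoder accepts only the fixed `[1, 100]` shape, token IDs have to be padded or truncated before inference. Below is a minimal sketch of that step; the helper name `pad_tokens`, the `int32` dtype, and the pad ID of 0 (the `[PAD]` token in the standard `bert-base-uncased` vocabulary) are assumptions, since the exact tokenizer variant is not stated here.

```python
import numpy as np

MAX_LEN = 100  # fixed sequence length expected by the text encoder

def pad_tokens(token_ids, pad_id=0):
    """Pad or truncate token IDs to MAX_LEN and build the attention mask."""
    # Naive truncation: a real tokenizer would keep the final [SEP] token.
    ids = list(token_ids)[:MAX_LEN]
    n = len(ids)
    input_ids = np.full((1, MAX_LEN), pad_id, dtype=np.int32)
    attention_mask = np.zeros((1, MAX_LEN), dtype=np.int32)
    input_ids[0, :n] = ids
    attention_mask[0, :n] = 1  # 1 marks real tokens, 0 marks padding
    return input_ids, attention_mask

# 101 and 102 are [CLS] and [SEP] in bert-base-uncased; 2000 and 3000 are placeholders.
input_ids, attention_mask = pad_tokens([101, 2000, 3000, 102])
```

In practice a BERT tokenizer configured with `max_length=100` and padding to the maximum length produces both arrays directly; the sketch only makes the expected shapes explicit.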
The `neuralnetwork` backend was used initially, but internal shape-validation warnings forced CPU-only execution. Switching to `mlprogram` with the `fuse_transpose_matmul` optimization disabled and FLOAT32 precision achieves numerical parity with PyTorch to within float32 rounding error while unlocking ANE/GPU acceleration.
Numerical verification:
- Audio encoder: cosine similarity ≈ 1.00000036 vs PyTorch
- Text encoder: cosine similarity ≈ 1.00000012 vs PyTorch
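A parity check of this kind can be sketched as below. The function name `parity_report` and the synthetic vectors are illustrative assumptions; in a real check, `reference` would be the PyTorch output and `candidate` the CoreML output for the same input. Values marginally above 1, like those reported above, are an artifact of float32 rounding in the dot product and norms.

```python
import numpy as np

def parity_report(reference, candidate):
    """Return (cosine similarity, max absolute difference) of two embeddings."""
    ref = np.asarray(reference, dtype=np.float32).reshape(-1)
    cand = np.asarray(candidate, dtype=np.float32).reshape(-1)
    cosine = float(np.dot(ref, cand) / (np.linalg.norm(ref) * np.linalg.norm(cand)))
    max_abs_diff = float(np.max(np.abs(ref - cand)))
    return cosine, max_abs_diff

# Synthetic stand-ins: a random 1024-dim vector and a slightly perturbed copy.
rng = np.random.default_rng(0)
reference = rng.standard_normal(1024).astype(np.float32)
noise = rng.standard_normal(1024).astype(np.float32)
candidate = reference + np.float32(1e-6) * noise
cosine, max_abs_diff = parity_report(reference, candidate)
```

Reporting the maximum absolute difference alongside the cosine similarity is useful because cosine similarity alone is insensitive to a uniform scale error.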
Original work
This is a CoreML port of the tinyCLAP architecture (student-teacher CLAP distillation with PhiNet audio encoder and BERT text encoder). The conversion was performed using coremltools 9.0 and PyTorch 2.12.
License
This repository contains CoreML-converted artifacts derived from tinyCLAP, which is licensed under the Apache License 2.0. See LICENSE for the full text.
The original model weights, training code, and architecture remain the intellectual property of the original authors (Francesco Paissan et al.). This repository only provides the converted CoreML models and does not claim ownership over the original work.