TinyCLAP CoreML

CoreML-converted tinyCLAP models for on-device semantic audio search on macOS and iOS.

What's inside

This repository contains the student tinyCLAP encoder pair converted to CoreML:

| File | Size | Purpose |
| --- | --- | --- |
| TinyCLAP_AudioEncoder.mlpackage | ~17 MB | PhiNet-based audio encoder. Consumes log-mel spectrograms and emits 1024-dim L2-normalized embeddings. |
| TinyCLAP_TextEncoder.mlpackage | ~422 MB | BERT-base text encoder. Consumes tokenized text and emits 1024-dim L2-normalized embeddings. |
| mel_filter_bank.json | ~170 KB | Pre-computed 64-bin mel filter bank (Slaney, 50–14000 Hz) for audio preprocessing. |
| text_embeddings.json | ~220 KB | Sample text embeddings for sanity-checking the model output. |

Architecture

Audio file (44.1 kHz mono)
    ↓
STFT (n_fft=1024, hop=320, win=1024, Hann)
    ↓
Log-mel spectrogram (64 bins, 50–14000 Hz)
    ↓
TinyCLAP_AudioEncoder.mlpackage
    ↓
1024-dim L2-normalized embedding

Text query
    ↓
BERT tokenization (max_length=100)
    ↓
TinyCLAP_TextEncoder.mlpackage
    ↓
1024-dim L2-normalized embedding

Cosine similarity between the two embeddings yields semantic relevance scores.
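As a rough illustration of the scoring step, the sketch below ranks a set of audio embeddings against one text embedding. The function name `rank_by_similarity` and the toy 4-dim vectors are illustrative assumptions, not part of this repository; real embeddings are 1024-dim. Because both encoders emit L2-normalized vectors, cosine similarity reduces to a dot product, and the re-normalization here is only a safeguard.

```python
import numpy as np

def rank_by_similarity(text_embedding, audio_embeddings, top_k=3):
    """Return (index, score) pairs for the top_k clips most similar to the query."""
    query = np.asarray(text_embedding, dtype=np.float32).reshape(-1)
    clips = np.asarray(audio_embeddings, dtype=np.float32)
    # Re-normalize defensively; for unit-length inputs this changes nothing
    # beyond rounding.
    query = query / np.linalg.norm(query)
    clips = clips / np.linalg.norm(clips, axis=1, keepdims=True)
    scores = clips @ query                 # one cosine similarity per clip
    order = np.argsort(-scores)[:top_k]    # highest score first
    return [(int(i), float(scores[i])) for i in order]

# Toy 4-dim vectors standing in for real 1024-dim embeddings.
clips = np.array([[1, 0, 0, 0], [0, 1, 0, 0], [0.6, 0.8, 0, 0]])
query = np.array([0, 1, 0, 0])
print(rank_by_similarity(query, clips, top_k=2))
```

In this toy case clip 1 ranks first (identical direction to the query) and clip 2 second.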

Preprocessing spec (must match exactly)

| Parameter | Value |
| --- | --- |
| Sample rate | 44100 Hz |
| Channels | Mono |
| FFT size | 1024 |
| Hop length | 320 |
| Window length | 1024 |
| Window | Hann |
| Center | true |
| Pad mode | reflect |
| Mel bins | 64 |
| Mel fmin | 50 Hz |
| Mel fmax | 14000 Hz |
| Ref | 1.0 |
| Amin | 1e-10 |
| Max text length | 100 tokens |
| Output shape | [1, 1, 640, 64] (batch, channel, time, freq) |
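The parameters above can be sketched in NumPy as follows. This is a minimal sketch under several assumptions that the spec does not state: a power (magnitude-squared) spectrogram, a periodic Hann window, a `10 * log10` dB scaling using `Ref` and `Amin`, and a filter bank laid out as a `(513, 64)` matrix. The layout of `mel_filter_bank.json` is not documented here, so the function takes the matrix as an argument; `log_mel` is a hypothetical name. Verify the output against the reference preprocessing before relying on it.

```python
import numpy as np

SAMPLE_RATE = 44100
N_FFT = 1024
HOP = 320
REF = 1.0
AMIN = 1e-10

def log_mel(waveform, mel_fb):
    """waveform: 1-D mono samples at 44.1 kHz; mel_fb: (N_FFT // 2 + 1, 64) matrix."""
    x = np.asarray(waveform, dtype=np.float32)
    # center=true with reflect padding: N_FFT // 2 samples on each side.
    x = np.pad(x, N_FFT // 2, mode="reflect")
    n_frames = 1 + (len(x) - N_FFT) // HOP
    # Periodic Hann window (assumption; the spec only says "Hann").
    window = 0.5 - 0.5 * np.cos(2.0 * np.pi * np.arange(N_FFT) / N_FFT)
    # Gather overlapping frames: row t starts at sample t * HOP.
    idx = np.arange(N_FFT)[None, :] + HOP * np.arange(n_frames)[:, None]
    frames = x[idx] * window
    # Power spectrogram (assumption), shape (frames, 513).
    power = np.abs(np.fft.rfft(frames, n=N_FFT, axis=1)) ** 2
    mel = power @ mel_fb                      # (frames, 64)
    log_spec = 10.0 * np.log10(np.maximum(mel, AMIN))
    log_spec -= 10.0 * np.log10(max(AMIN, REF))
    return log_spec[None, None, :, :].astype(np.float32)  # (1, 1, frames, 64)

# Stand-in filter bank; in practice the matrix comes from mel_filter_bank.json.
demo_fb = np.ones((N_FFT // 2 + 1, 64), dtype=np.float32)
spec = log_mel(np.zeros(639 * HOP), demo_fb)
print(spec.shape)
```

With centering, the frame count is `1 + n_samples // 320`, so the nominal 640 frames correspond to 204480 samples, or about 4.64 s of audio at 44.1 kHz.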

CoreML conversion details

  • Backend: mlprogram (convert_to="mlprogram")
  • Compute units: ALL (CPU + GPU + Apple Neural Engine)
  • Precision: FLOAT32
  • Minimum deployment target: macOS 14 / iOS 17
  • Audio encoder input: flexible time dimension (640 frames nominal)
  • Text encoder input: fixed [1, 100] for both input_ids and attention_mask
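Because the text encoder input is fixed at [1, 100], token ids need to be padded or truncated before inference. The sketch below shows one way to do that, assuming the ids already include the special tokens, that the pad id is 0 (as in the standard bert-base-uncased vocabulary), and that the model expects int32 arrays. `to_fixed_inputs` is a hypothetical helper, and the two middle token ids in the example are made up.

```python
import numpy as np

MAX_LENGTH = 100
PAD_ID = 0  # assumption: pad token id of the tokenizer in use

def to_fixed_inputs(token_ids):
    """Pad or truncate token ids to shape (1, 100) and build the attention mask."""
    ids = list(token_ids)
    if len(ids) > MAX_LENGTH:
        # Keep the final token (normally [SEP]) when truncating.
        ids = ids[:MAX_LENGTH - 1] + ids[-1:]
    n = len(ids)
    input_ids = np.full((1, MAX_LENGTH), PAD_ID, dtype=np.int32)
    attention_mask = np.zeros((1, MAX_LENGTH), dtype=np.int32)
    input_ids[0, :n] = ids
    attention_mask[0, :n] = 1  # 1 marks real tokens, 0 marks padding
    return input_ids, attention_mask

# 101 and 102 are [CLS] and [SEP] in bert-base-uncased; the rest is illustrative.
input_ids, attention_mask = to_fixed_inputs([101, 2001, 2002, 102])
print(input_ids.shape, int(attention_mask.sum()))
```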

The neuralnetwork backend was used initially, but internal shape-validation warnings forced CPU-only execution. Switching to mlprogram, with the fuse_transpose_matmul optimization disabled and FLOAT32 compute precision, matches PyTorch to within float32 rounding (see the cosine similarities below) while unlocking ANE/GPU acceleration.
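For reference, a conversion call with these settings might look roughly like the sketch below. It is not the script used to produce the files in this repository: the function name `convert_text_encoder`, the assumption that the input is a `torch.jit.trace`d module, and the int32 input dtype are all illustrative. The imports sit inside the function so the snippet can be loaded on machines without coremltools installed.

```python
def convert_text_encoder(traced_model, out_path="TinyCLAP_TextEncoder.mlpackage"):
    """Convert a traced text encoder taking (input_ids, attention_mask)."""
    import numpy as np
    import coremltools as ct

    # Start from the default pass pipeline and drop the one optimization
    # that is described above as disabled.
    pipeline = ct.PassPipeline.DEFAULT
    pipeline.remove_passes({"common::fuse_transpose_matmul"})

    mlmodel = ct.convert(
        traced_model,
        convert_to="mlprogram",
        inputs=[
            # Fixed [1, 100] shapes; int32 is an assumption.
            ct.TensorType(name="input_ids", shape=(1, 100), dtype=np.int32),
            ct.TensorType(name="attention_mask", shape=(1, 100), dtype=np.int32),
        ],
        compute_units=ct.ComputeUnit.ALL,
        compute_precision=ct.precision.FLOAT32,
        minimum_deployment_target=ct.target.macOS14,
        pass_pipeline=pipeline,
    )
    mlmodel.save(out_path)
    return mlmodel
```

The audio encoder's flexible time dimension would presumably be declared with `ct.RangeDim` in place of the fixed 640; the bounds used for this repository are not stated.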

Numerical verification:

  • Audio encoder: cosine similarity ≈ 1.00000036 vs PyTorch
  • Text encoder: cosine similarity ≈ 1.00000012 vs PyTorch
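Values marginally above 1 are expected when the cosine is computed in float32: machine epsilon is about 1.19e-7, so 1.00000012 and 1.00000036 are one and three rounding steps above 1. A check along these lines can be reproduced with the sketch below, where `parity_report` is a hypothetical helper. It also reports the maximum absolute difference, since cosine similarity alone cannot distinguish identical outputs from nearly identical ones.

```python
import numpy as np

def parity_report(reference, candidate):
    """Compare two embeddings; returns (cosine similarity, max abs difference)."""
    ref = np.asarray(reference, dtype=np.float32).reshape(-1)
    cand = np.asarray(candidate, dtype=np.float32).reshape(-1)
    cosine = float(np.dot(ref, cand) / (np.linalg.norm(ref) * np.linalg.norm(cand)))
    max_abs_diff = float(np.max(np.abs(ref - cand)))
    return cosine, max_abs_diff

# Random vectors standing in for PyTorch and CoreML outputs.
rng = np.random.default_rng(0)
torch_out = rng.standard_normal(1024).astype(np.float32)
coreml_out = torch_out + np.float32(1e-6) * rng.standard_normal(1024).astype(np.float32)
print(parity_report(torch_out, coreml_out))
```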

Original work

This is a CoreML port of the tinyCLAP architecture (student-teacher CLAP distillation with a PhiNet audio encoder and a BERT text encoder). The conversion was performed with coremltools 9.0 and PyTorch 2.12.

License

This repository contains CoreML-converted artifacts derived from tinyCLAP, which is licensed under the Apache License 2.0. See LICENSE for the full text.

The original model weights, training code, and architecture remain the intellectual property of the original authors (Francesco Paissan et al.). This repository only provides the converted CoreML models and does not claim ownership over the original work.
