TinyCLAP CoreML

CoreML-converted tinyCLAP models for on-device semantic audio search on macOS and iOS.

What's inside

This repository contains the student tinyCLAP encoder pair converted to CoreML:

| File | Size | Purpose |
| --- | --- | --- |
| `TinyCLAP_AudioEncoder.mlpackage` | ~17 MB | PhiNet-based audio encoder. Consumes log-mel spectrograms and emits 1024-dim L2-normalized embeddings. |
| `TinyCLAP_TextEncoder.mlpackage` | ~422 MB | BERT-base text encoder. Consumes tokenized text and emits 1024-dim L2-normalized embeddings. |
| `mel_filter_bank.json` | ~170 KB | Pre-computed 64-bin mel filter bank (Slaney, 50–14000 Hz) for audio preprocessing. |
| `text_embeddings.json` | ~220 KB | Sample text embeddings for sanity-checking the model output. |

Architecture

Audio file (44.1 kHz mono)
    ↓
STFT (n_fft=1024, hop=320, win=1024, Hann)
    ↓
Log-mel spectrogram (64 bins, 50–14000 Hz)
    ↓
TinyCLAP_AudioEncoder.mlpackage
    ↓
1024-dim L2-normalized embedding

Text query
    ↓
BERT tokenization (max_length=100)
    ↓
TinyCLAP_TextEncoder.mlpackage
    ↓
1024-dim L2-normalized embedding

Cosine similarity between the two embeddings yields semantic relevance scores.
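
Because both encoders emit L2-normalized vectors, the cosine similarity reduces to a plain dot product. The sketch below ranks a few clips against one query under that assumption. The function names, file names, and 3-dim toy vectors are illustrative stand-ins for real 1024-dim embeddings, not part of this repository.

```python
from typing import Dict, List, Tuple


def dot(a: List[float], b: List[float]) -> float:
    """Dot product; equals cosine similarity when both inputs are unit-length."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimension")
    return sum(x * y for x, y in zip(a, b))


def rank_clips(query: List[float], clips: Dict[str, List[float]]) -> List[Tuple[str, float]]:
    """Return (clip name, score) pairs sorted from most to least relevant."""
    scored = [(name, dot(query, emb)) for name, emb in clips.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Toy 3-dim unit vectors standing in for real 1024-dim embeddings.
query = [1.0, 0.0, 0.0]
clips = {
    "dog_bark.wav": [0.6, 0.8, 0.0],
    "rain.wav": [0.0, 1.0, 0.0],
    "siren.wav": [0.8, 0.0, 0.6],
}
ranking = rank_clips(query, clips)
```

In a search index, the audio embeddings would typically be computed once and cached, so only the text query has to be encoded at search time.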

Preprocessing spec (must match exactly)

| Parameter | Value |
| --- | --- |
| Sample rate | 44100 Hz |
| Channels | Mono |
| FFT size | 1024 |
| Hop length | 320 |
| Window length | 1024 |
| Window | Hann |
| Center | true |
| Pad mode | reflect |
| Mel bins | 64 |
| Mel fmin | 50 Hz |
| Mel fmax | 14000 Hz |
| Log-mel ref | 1.0 |
| Log-mel amin | 1e-10 |
| Max text length | 100 tokens |
| Log-mel output shape | `[1, 1, 640, 64]` (batch, channel, time, freq) |
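
As a rough reference for porting the preprocessing, the parameters above can be sketched in NumPy. Several details are assumptions rather than facts from this repository: the spectrogram is taken to be a power spectrogram, the decibel scaling follows the librosa `power_to_db` convention, the Hann window is periodic, and the filter bank is passed in as an array of shape `(513, 64)` because the layout of `mel_filter_bank.json` is not documented here. The function name `log_mel` and the random filter bank are placeholders.

```python
import numpy as np

SAMPLE_RATE = 44100
N_FFT = 1024
HOP = 320
N_MELS = 64
REF = 1.0
AMIN = 1e-10


def log_mel(waveform: np.ndarray, mel_fb: np.ndarray) -> np.ndarray:
    """Return a [1, 1, frames, 64] float32 log-mel spectrogram.

    waveform: mono samples at 44.1 kHz, shape (n_samples,).
    mel_fb:   mel filter bank, assumed shape (N_FFT // 2 + 1, N_MELS).
    """
    # center=True with reflect padding: pad N_FFT // 2 samples on each side.
    padded = np.pad(waveform.astype(np.float64), N_FFT // 2, mode="reflect")
    # Periodic Hann window (assumed; the default in torch and librosa).
    window = 0.5 - 0.5 * np.cos(2.0 * np.pi * np.arange(N_FFT) / N_FFT)
    n_frames = 1 + (len(padded) - N_FFT) // HOP
    frames = np.stack([padded[i * HOP : i * HOP + N_FFT] for i in range(n_frames)])
    power = np.abs(np.fft.rfft(frames * window, axis=1)) ** 2  # (frames, 513)
    mel = power @ mel_fb                                        # (frames, 64)
    # Decibel scaling in the librosa power_to_db convention (assumed).
    log_spec = 10.0 * np.log10(np.maximum(mel, AMIN))
    log_spec -= 10.0 * np.log10(max(AMIN, REF))
    return log_spec[np.newaxis, np.newaxis].astype(np.float32)


rng = np.random.default_rng(0)
dummy_fb = rng.random((N_FFT // 2 + 1, N_MELS))  # placeholder, not the real filter bank
clip = rng.standard_normal(639 * HOP)            # about 4.64 s of noise at 44.1 kHz
spec = log_mel(clip, dummy_fb)                   # shape (1, 1, 640, 64)
```

With `center=true`, a clip of `n` samples produces `1 + n // 320` frames, so the nominal 640 frames correspond to roughly 4.64 s of audio.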

CoreML conversion details

Backend: mlprogram (convert_to="mlprogram")
Compute units: ALL (CPU + GPU + Apple Neural Engine)
Precision: FLOAT32
Minimum deployment target: macOS 14 / iOS 17
Audio encoder input: flexible time dimension (640 frames nominal)
Text encoder input: fixed [1, 100] for both input_ids and attention_mask
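
Since the text encoder takes fixed `[1, 100]` inputs, shorter queries have to be padded and longer ones truncated before inference. A minimal sketch, assuming the standard BERT padding id of 0; the helper name and the example token ids are hypothetical, and real ids should come from the tokenizer.

```python
from typing import List, Tuple

MAX_LENGTH = 100
PAD_ID = 0  # assumed: the [PAD] id in the standard BERT vocabulary


def pad_for_text_encoder(token_ids: List[int]) -> Tuple[List[List[int]], List[List[int]]]:
    """Truncate or pad token ids to MAX_LENGTH and build the attention mask.

    Returns (input_ids, attention_mask), each shaped [1, MAX_LENGTH].
    Note: naive truncation can cut a trailing [SEP]; a real tokenizer
    keeps the special tokens in place when it truncates.
    """
    kept = token_ids[:MAX_LENGTH]
    n_pad = MAX_LENGTH - len(kept)
    input_ids = kept + [PAD_ID] * n_pad
    attention_mask = [1] * len(kept) + [0] * n_pad
    return [input_ids], [attention_mask]


# Hypothetical ids for a short query; real ids come from the tokenizer.
input_ids, attention_mask = pad_for_text_encoder([101, 3899, 19372, 102])
```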

An initial conversion with the neuralnetwork backend produced internal shape-validation warnings and fell back to CPU-only execution. Switching to mlprogram, with the fuse_transpose_matmul optimization pass disabled and FLOAT32 compute precision, matches PyTorch to within float32 rounding error while allowing the models to run on the GPU and ANE.
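
The settings described here could be passed to coremltools roughly as follows. This is a sketch, not the script used for this repository: the function name, the input name `log_mel`, and the `RangeDim` bounds are illustrative assumptions, and `traced_model` stands for a `torch.jit.trace` result of the audio encoder.

```python
def convert_audio_encoder(traced_model, out_path="TinyCLAP_AudioEncoder.mlpackage"):
    """Convert a traced PyTorch audio encoder to an ML Program package."""
    import coremltools as ct  # imported lazily; a heavy, optional dependency

    # Start from the default pass pipeline and drop the transpose/matmul fusion.
    pipeline = ct.PassPipeline.DEFAULT
    pipeline.remove_passes(["common::fuse_transpose_matmul"])

    # Flexible time axis; the 64..2048 bounds are illustrative, not from the source.
    frames = ct.RangeDim(lower_bound=64, upper_bound=2048, default=640)
    spec_input = ct.TensorType(name="log_mel", shape=ct.Shape(shape=(1, 1, frames, 64)))

    mlmodel = ct.convert(
        traced_model,
        inputs=[spec_input],
        convert_to="mlprogram",
        compute_precision=ct.precision.FLOAT32,
        compute_units=ct.ComputeUnit.ALL,
        minimum_deployment_target=ct.target.macOS14,
        pass_pipeline=pipeline,
    )
    mlmodel.save(out_path)
    return mlmodel
```

The text encoder would follow the same pattern with two fixed-shape `[1, 100]` integer inputs instead of the flexible spectrogram input.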

Numerical verification (values slightly above 1 are float32 rounding artifacts):

Audio encoder: cosine similarity ≈ 1.00000036 vs PyTorch
Text encoder: cosine similarity ≈ 1.00000012 vs PyTorch
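
A parity check of this kind can be sketched in a few lines. The helper names and tolerances are assumptions chosen for illustration; the tolerances sit loosely above float32 machine epsilon (about 1.19e-7) and are not the thresholds used for the figures above.

```python
import math
from typing import Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def check_parity(
    coreml_out: Sequence[float],
    torch_out: Sequence[float],
    cos_tol: float = 1e-5,   # hypothetical tolerance on |cosine - 1|
    abs_tol: float = 1e-4,   # hypothetical tolerance on per-element error
) -> Tuple[float, float, bool]:
    """Return (cosine, max_abs_diff, passed) for two embedding vectors."""
    cos = cosine_similarity(coreml_out, torch_out)
    max_diff = max(abs(x - y) for x, y in zip(coreml_out, torch_out))
    passed = abs(cos - 1.0) <= cos_tol and max_diff <= abs_tol
    return cos, max_diff, passed


# Toy vectors standing in for CoreML and PyTorch outputs on the same input.
reference = [0.6, 0.8, 0.0]
candidate = [0.6000001, 0.7999999, 0.0]
cos, max_diff, ok = check_parity(candidate, reference)
```

Reporting the maximum absolute difference alongside the cosine similarity helps catch per-element drift that the cosine alone can hide.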

Original work

This is a CoreML port of the tinyCLAP architecture (student-teacher CLAP distillation with PhiNet audio encoder and BERT text encoder). The conversion was performed using coremltools 9.0 and PyTorch 2.12.

License

This repository contains CoreML-converted artifacts derived from tinyCLAP, which is licensed under the Apache License 2.0. See LICENSE for the full text.

The original model weights, training code, and architecture remain the intellectual property of the original authors (Francesco Paissan et al.). This repository only provides the converted CoreML models and does not claim ownership over the original work.
