---
tags:
  - coreml
  - audio
  - clap
  - semantic-search
  - onnx  # TODO: remove once CoreML-only is verified
license: apache-2.0
---

# TinyCLAP CoreML

CoreML-converted tinyCLAP models for on-device semantic audio search on macOS and iOS.

## What's inside

This repository contains the student tinyCLAP encoder pair converted to CoreML:

| File | Size | Purpose |
|------|------|---------|
| `TinyCLAP_AudioEncoder.mlpackage` | ~17 MB | PhiNet-based audio encoder. Consumes log-mel spectrograms and emits 1024-dim L2-normalized embeddings. |
| `TinyCLAP_TextEncoder.mlpackage` | ~422 MB | BERT-base text encoder. Consumes tokenized text and emits 1024-dim L2-normalized embeddings. |
| `mel_filter_bank.json` | ~170 KB | Pre-computed 64-bin mel filter bank (Slaney, 50–14000 Hz) for audio preprocessing. |
| `text_embeddings.json` | ~220 KB | Sample text embeddings for sanity-checking the model output. |

## Architecture

```
Audio file (44.1 kHz mono)
    ↓
STFT (n_fft=1024, hop=320, win=1024, Hann)
    ↓
Log-mel spectrogram (64 bins, 50–14000 Hz)
    ↓
TinyCLAP_AudioEncoder.mlpackage
    ↓
1024-dim L2-normalized embedding
```

```
Text query
    ↓
BERT tokenization (max_length=100)
    ↓
TinyCLAP_TextEncoder.mlpackage
    ↓
1024-dim L2-normalized embedding
```

Cosine similarity between the two embeddings yields semantic relevance scores.
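Because both encoders emit L2-normalized vectors, the cosine similarity reduces to a plain dot product. The following is a minimal sketch of ranking clips against a query, assuming the embeddings have already been computed and stacked into NumPy arrays; `rank_clips` and its parameters are illustrative names, not part of this repository.

```python
import numpy as np

def rank_clips(text_embedding, audio_embeddings, top_k=5):
    """Return (index, score) pairs for the top_k clips most similar to the query.

    text_embedding:   shape [1024], L2-normalized
    audio_embeddings: shape [n_clips, 1024], each row L2-normalized
    """
    # For unit-length vectors the dot product equals the cosine similarity.
    scores = audio_embeddings @ text_embedding
    # Sort descending by score and keep the best top_k entries.
    order = np.argsort(-scores)[:top_k]
    return [(int(i), float(scores[i])) for i in order]
```

If the embeddings come from another source and are not guaranteed to be unit length, normalize them first; otherwise the scores are not cosine similarities.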

## Preprocessing spec (must match exactly)

| Parameter | Value |
|-----------|-------|
| Sample rate | 44100 Hz |
| Channels | Mono |
| FFT size | 1024 |
| Hop length | 320 |
| Window length | 1024 |
| Window | Hann |
| Center | true |
| Pad mode | reflect |
| Mel bins | 64 |
| Mel fmin | 50 Hz |
| Mel fmax | 14000 Hz |
| Log reference (`ref`) | 1.0 |
| Log floor (`amin`) | 1e-10 |
| Max text length | 100 tokens |
| Log-mel output shape (audio encoder input) | `[1, 1, 640, 64]` (batch, channel, time, freq) |
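The table above can be sketched in NumPy as follows. This is a sketch under several assumptions that the table does not state: the mel filter bank is applied to the power spectrogram, the log is `10 * log10` with `ref` and `amin` used as in a power-to-dB conversion, no `top_db` clipping is applied, and the Hann window is the periodic variant. The filter bank is passed in as an array of shape `[513, 64]`; the layout of `mel_filter_bank.json` is not documented here, so loading it is left to the caller. The function name `log_mel_spectrogram` is hypothetical.

```python
import numpy as np

def log_mel_spectrogram(waveform, mel_fb, n_fft=1024, hop=320, ref=1.0, amin=1e-10):
    """waveform: 1-D float array (44.1 kHz mono); mel_fb: [n_fft // 2 + 1, 64]."""
    # center=True with reflect padding: n_fft // 2 samples on each side.
    padded = np.pad(waveform, n_fft // 2, mode="reflect")
    n_frames = 1 + (len(padded) - n_fft) // hop
    # Periodic Hann window (assumption, see lead-in).
    window = 0.5 - 0.5 * np.cos(2.0 * np.pi * np.arange(n_fft) / n_fft)
    # Index matrix that slices the padded signal into overlapping frames.
    idx = np.arange(n_fft)[None, :] + hop * np.arange(n_frames)[:, None]
    frames = padded[idx] * window
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2   # [T, 513]
    mel = power @ mel_fb                                # [T, 64]
    # Power-to-dB with a floor at amin and a reference level of ref (assumption).
    db = 10.0 * np.log10(np.maximum(mel, amin))
    db -= 10.0 * np.log10(max(amin, ref))
    return db[None, None].astype(np.float32)            # [1, 1, T, 64]
```

With `center=true`, a clip of `N` samples yields `1 + floor(N / 320)` frames, so the nominal 640 frames correspond to roughly 4.64 s of audio at 44.1 kHz.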

## CoreML conversion details

- **Backend**: `mlprogram` (`convert_to="mlprogram"`)
- **Compute units**: `ALL` (CPU + GPU + Apple Neural Engine)
- **Precision**: `FLOAT32`
- **Minimum deployment target**: macOS 14 / iOS 17
- **Audio encoder input**: flexible time dimension (640 frames nominal)
- **Text encoder input**: fixed `[1, 100]` for both `input_ids` and `attention_mask`
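Since the text encoder input is fixed at `[1, 100]`, every query has to be truncated or padded to that length before inference. The following is a minimal sketch, assuming the token ids come from a BERT tokenizer run elsewhere, that the padding id is 0, and that the model expects `int32` inputs; `pad_token_ids` is a hypothetical helper, and none of these three details is confirmed by this repository.

```python
import numpy as np

MAX_LENGTH = 100  # fixed sequence length of the converted text encoder

def pad_token_ids(token_ids, pad_id=0, max_length=MAX_LENGTH):
    """Truncate or right-pad token ids and build the matching attention mask."""
    # Naive truncation: a real tokenizer would also keep the final [SEP] token.
    ids = list(token_ids)[:max_length]
    n_real = len(ids)
    input_ids = np.full((1, max_length), pad_id, dtype=np.int32)
    input_ids[0, :n_real] = ids
    # 1 marks real tokens, 0 marks padding.
    attention_mask = np.zeros((1, max_length), dtype=np.int32)
    attention_mask[0, :n_real] = 1
    return input_ids, attention_mask
```

For example, `pad_token_ids([101, 7, 8, 102])` (placeholder ids) returns two `[1, 100]` arrays in which only the first four mask positions are set.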

The `neuralnetwork` backend was tried first, but internal shape-validation warnings forced CPU-only execution. Switching to `mlprogram`, with the `fuse_transpose_matmul` pass disabled and `FLOAT32` compute precision, gives numerical parity with PyTorch to within float32 rounding while making ANE/GPU execution available.
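The settings described in this section map onto `coremltools` roughly as sketched below. This is not the conversion script used for this repository: the function name, the input name `mel`, and the `min_frames`/`max_frames` bounds of the flexible time dimension are hypothetical, and `traced_model` is assumed to be a `torch.jit.trace` result produced elsewhere.

```python
import numpy as np

def convert_audio_encoder(traced_model, min_frames=64, max_frames=4096):
    """Convert a traced PyTorch audio encoder with the settings listed above."""
    # Imported here so this module can be loaded without coremltools installed.
    import coremltools as ct

    # Start from the default pass pipeline and drop the transpose/matmul fusion.
    pipeline = ct.PassPipeline.DEFAULT
    pipeline.remove_passes(["common::fuse_transpose_matmul"])

    # Flexible time axis, 640 frames nominal; bounds are hypothetical.
    time_dim = ct.RangeDim(lower_bound=min_frames, upper_bound=max_frames, default=640)
    mel_input = ct.TensorType(name="mel", shape=(1, 1, time_dim, 64), dtype=np.float32)

    return ct.convert(
        traced_model,
        inputs=[mel_input],
        convert_to="mlprogram",
        compute_precision=ct.precision.FLOAT32,
        compute_units=ct.ComputeUnit.ALL,
        minimum_deployment_target=ct.target.iOS17,  # same level as macOS 14
        pass_pipeline=pipeline,
    )
```

The returned `MLModel` can then be written out with `.save("TinyCLAP_AudioEncoder.mlpackage")`.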

Numerical verification (values marginally above 1 are float32 rounding error):
- Audio encoder: cosine similarity ≈ 1.00000036 vs PyTorch
- Text encoder: cosine similarity ≈ 1.00000012 vs PyTorch
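A check of this kind can be reproduced with a few lines of NumPy. The sketch below assumes you already hold the PyTorch and CoreML embeddings for the same input as arrays; `parity_report` is a hypothetical helper, and how the figures above were actually produced is not documented here. Computing in float64 keeps the cosine from drifting above 1, and the maximum absolute difference is a useful second signal because cosine similarity alone hides small per-element errors.

```python
import numpy as np

def parity_report(reference, converted):
    """Compare a PyTorch embedding with its CoreML counterpart."""
    # Promote to float64 so the comparison itself adds negligible rounding error.
    ref = np.asarray(reference, dtype=np.float64).ravel()
    out = np.asarray(converted, dtype=np.float64).ravel()
    cosine = float(ref @ out / (np.linalg.norm(ref) * np.linalg.norm(out)))
    max_abs_diff = float(np.max(np.abs(ref - out)))
    return cosine, max_abs_diff
```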

## Original work

This is a CoreML port of the tinyCLAP architecture (student-teacher CLAP distillation with a PhiNet audio encoder and a BERT text encoder). The conversion was performed with `coremltools` 9.0 and PyTorch 2.12.

## License

This repository contains CoreML-converted artifacts derived from [tinyCLAP](https://github.com/fpaissan/tinyclap), which is licensed under the **Apache License 2.0**. See [LICENSE](LICENSE) for the full text.

The original model weights, training code, and architecture remain the intellectual property of the original authors (Francesco Paissan et al.). This repository only provides the converted CoreML models and does not claim ownership over the original work.