---
license: bsd-3-clause
tags:
- audio
- audio-classification
- sample-tagging
- clap
- htsat
- onnx
library_name: onnxruntime
---

# MAGDA Sample Tagger

ONNX exports of [LAION's CLAP HTSAT-unfused model](https://huggingface.co/laion/clap-htsat-unfused)
plus the RoBERTa tokenizer, packaged for the
[MAGDA DAW](https://github.com/Conceptual-Machines/magda-core)'s sample
library (issue #768).

## What's in this repo

| File | Size | SHA-256 |
|------|------|---------|
| `clap_audio.onnx` | 111.8 MB | `3f42f71e555b62709910b6efa66fa5879f00d9571874b12b0fa674f82dbfe332` |
| `clap_text.onnx` | 478.2 MB | `c07b27204836877d5b615c103685b66ea8f21bc6b5b70a572be356125423a8bf` |
| `tokenizer.json` | 3.4 MB | `4fd1d86b4f5b53f40a609fcd11c1f34024b735f870a07439d70202b98493661a` |

- `clap_audio.onnx` — audio encoder. Takes a mono 48 kHz waveform,
  produces a 512-d normalised embedding suitable for cosine similarity
  search.
- `clap_text.onnx` — text encoder. Takes RoBERTa token ids + attention
  mask, produces a 512-d normalised embedding in the same space as the
  audio encoder so a text query can rank audio files by similarity.
- `tokenizer.json` — the RoBERTa BPE tokenizer that pairs with the
  text encoder. MAGDA's C++ tokenizer reads this file directly.
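
The SHA-256 digests in the table above can be used to check a download before loading the models. The following is a minimal sketch, not MAGDA's own verification code; the function names `sha256_of` and `verify_models` and the `model_dir` layout (all three files in one directory) are assumptions made for illustration.

```python
import hashlib
from pathlib import Path

# Digests copied from the file table in this README.
EXPECTED_SHA256 = {
    "clap_audio.onnx": "3f42f71e555b62709910b6efa66fa5879f00d9571874b12b0fa674f82dbfe332",
    "clap_text.onnx": "c07b27204836877d5b615c103685b66ea8f21bc6b5b70a572be356125423a8bf",
    "tokenizer.json": "4fd1d86b4f5b53f40a609fcd11c1f34024b735f870a07439d70202b98493661a",
}


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in 1 MiB chunks so large models are never fully in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_models(model_dir, expected=EXPECTED_SHA256):
    """Return the names of files that are missing or whose digest differs.

    `model_dir` is a hypothetical directory holding the downloaded files.
    """
    bad = []
    for name, want in expected.items():
        path = Path(model_dir) / name
        if not path.is_file() or sha256_of(path) != want:
            bad.append(name)
    return bad
```

An empty list from `verify_models("path/to/models")` means all three files match the published digests.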

## How MAGDA uses these

MAGDA's media database (a SQLite catalogue of audio samples) uses
these encoders to:

- Compute an embedding per indexed sample at index time, stored in the
  `media_embedding` table.
- Encode the user's free-text search query at query time and rank
  samples by cosine similarity to the query embedding.
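
The two steps above can be sketched with Python's standard library. This is an illustration under stated assumptions, not MAGDA's implementation: only the table name `media_embedding` comes from this README, while the column names (`path`, `embedding`), the float32 BLOB encoding, and the helper names are hypothetical. Because both encoders emit normalised vectors, the dot product equals cosine similarity.

```python
import sqlite3
from array import array


def pack(vec):
    """Serialise a sequence of floats to a float32 BLOB (assumed storage format)."""
    return array("f", vec).tobytes()


def unpack(blob):
    """Inverse of pack(): decode a float32 BLOB back into an array of floats."""
    out = array("f")
    out.frombytes(blob)
    return out


def rank_samples(conn, query_vec, limit=10):
    """Return (path, score) pairs, best match first.

    For unit-length vectors the dot product is the cosine similarity.
    """
    scored = []
    # Hypothetical schema: one row per sample, embedding stored as a BLOB.
    for path, blob in conn.execute("SELECT path, embedding FROM media_embedding"):
        score = sum(q * s for q, s in zip(query_vec, unpack(blob)))
        scored.append((path, score))
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:limit]


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE media_embedding (path TEXT PRIMARY KEY, embedding BLOB)")
# Toy 3-d unit vectors stand in for the real 512-d embeddings.
conn.executemany(
    "INSERT INTO media_embedding VALUES (?, ?)",
    [
        ("kick.wav", pack([1.0, 0.0, 0.0])),
        ("snare.wav", pack([0.0, 1.0, 0.0])),
        ("hat.wav", pack([0.6, 0.8, 0.0])),
    ],
)

# In MAGDA the query vector would come from the text encoder.
results = rank_samples(conn, [0.0, 1.0, 0.0])
```

A brute-force scan like this is linear in the number of indexed samples; whether MAGDA uses a scan or an index is not stated here.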

Without these models, MAGDA falls back to full-text search over
filenames and tags, which still works but offers no semantic similarity.

## Export procedure

ONNX exports are generated from `laion/clap-htsat-unfused` via the
export script in MAGDA's prototype:

```
prototypes/media_db/src/media_db/embeddings/onnx_export.py
```

Notes:

- Run the export on CPU: the MPS backend does not support float64,
  which the audio encoder's mel filterbank uses.
- Requires `transformers` 5.x or later. The audio keyword argument was
  renamed from `audios=` to `audio=` between 4.x and 5.x; passing the
  old kwarg silently returns wrong shapes.
- `tokenizer.json` is the unmodified file from the upstream HF repo,
  fetched via `AutoTokenizer.from_pretrained(...).save_pretrained(...)`.

## License

BSD-3-Clause — same as the upstream LAION CLAP weights. See the
upstream repo for the original notice and attribution.