---
license: bsd-3-clause
tags:
- audio
- audio-classification
- sample-tagging
- clap
- htsat
- onnx
library_name: onnxruntime
---
# MAGDA Sample Tagger
ONNX exports of [LAION's CLAP HTSAT-unfused model](https://huggingface.co/laion/clap-htsat-unfused)
plus the RoBERTa tokenizer, packaged for the
[MAGDA DAW](https://github.com/Conceptual-Machines/magda-core)'s sample
library (issue #768).
## What's in this repo
| File | Size | SHA-256 |
|------|------|---------|
| `clap_audio.onnx` | 111.8 MB | `3f42f71e555b62709910b6efa66fa5879f00d9571874b12b0fa674f82dbfe332` |
| `clap_text.onnx` | 478.2 MB | `c07b27204836877d5b615c103685b66ea8f21bc6b5b70a572be356125423a8bf` |
| `tokenizer.json` | 3.4 MB | `4fd1d86b4f5b53f40a609fcd11c1f34024b735f870a07439d70202b98493661a` |
- `clap_audio.onnx`: audio encoder. Takes a mono 48 kHz waveform,
produces a 512-d normalised embedding suitable for cosine similarity
search.
- `clap_text.onnx`: text encoder. Takes RoBERTa token ids + attention
mask, produces a 512-d normalised embedding in the same space as the
audio encoder so a text query can rank audio files by similarity.
- `tokenizer.json`: the RoBERTa BPE tokenizer that pairs with the
text encoder. MAGDA's C++ tokenizer reads this file directly.
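
After downloading, the files can be checked against the SHA-256 digests in the table above. The following is a minimal sketch using only the Python standard library; the helper names `sha256_of` and `verify_downloads` are illustrative and not part of MAGDA.

```python
import hashlib
from pathlib import Path

# Expected digests, copied from the table in this README.
EXPECTED_SHA256 = {
    "clap_audio.onnx": "3f42f71e555b62709910b6efa66fa5879f00d9571874b12b0fa674f82dbfe332",
    "clap_text.onnx": "c07b27204836877d5b615c103685b66ea8f21bc6b5b70a572be356125423a8bf",
    "tokenizer.json": "4fd1d86b4f5b53f40a609fcd11c1f34024b735f870a07439d70202b98493661a",
}


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in 1 MiB chunks so large models are not loaded into memory at once."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_downloads(directory):
    """Return {filename: True | False | None}; None means the file is missing."""
    results = {}
    for name, expected in EXPECTED_SHA256.items():
        path = Path(directory) / name
        results[name] = (sha256_of(path) == expected) if path.is_file() else None
    return results
```

Calling `verify_downloads("path/to/models")` returns one entry per file, so a caller can distinguish a corrupted download (`False`) from one that has not been fetched yet (`None`).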
## How MAGDA uses these
MAGDA's media database (a SQLite catalogue of audio samples) uses
these encoders to:
- Compute an embedding per indexed sample at index time, stored in the
`media_embedding` table.
- Encode the user's free-text search query at query time and rank
samples by cosine similarity to the query embedding.

Without these models, MAGDA falls back to full-text search over
filenames and tags, which is still useful but has no semantic similarity.
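
The index-time and query-time steps above can be sketched with the standard library alone. This is an illustration, not MAGDA's implementation: only the table name `media_embedding` comes from this README, while the `path` and `embedding` columns, the float32 BLOB layout, and the 3-d toy vectors (standing in for the 512-d encoder outputs) are assumptions.

```python
import math
import sqlite3
from array import array


def to_blob(vector):
    """Pack a sequence of floats as raw float32 bytes (assumed storage format)."""
    return array("f", vector).tobytes()


def from_blob(blob):
    """Unpack raw float32 bytes back into an array of floats."""
    values = array("f")
    values.frombytes(blob)
    return values


def cosine(a, b):
    """Cosine similarity; for already-normalised embeddings this equals the dot product."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def rank_samples(conn, query_embedding, limit=10):
    """Return up to `limit` (score, path) pairs, best match first."""
    rows = conn.execute("SELECT path, embedding FROM media_embedding").fetchall()
    scored = [(cosine(query_embedding, from_blob(blob)), path) for path, blob in rows]
    scored.sort(reverse=True)
    return scored[:limit]


# Index time: store one embedding per sample (hypothetical schema).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE media_embedding (path TEXT PRIMARY KEY, embedding BLOB)")
conn.executemany(
    "INSERT INTO media_embedding VALUES (?, ?)",
    [
        ("kick.wav", to_blob([1.0, 0.0, 0.0])),
        ("snare.wav", to_blob([0.0, 1.0, 0.0])),
        ("hat.wav", to_blob([0.6, 0.8, 0.0])),
    ],
)

# Query time: the vector would come from the text encoder; here it is hard-coded.
ranking = rank_samples(conn, [1.0, 0.0, 0.0])
```

A linear scan like this is adequate for a small library; a large catalogue would likely need a vector index, which is outside the scope of this sketch.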
## Export procedure
ONNX exports are generated from `laion/clap-htsat-unfused` via the
export script in MAGDA's prototype:
```
prototypes/media_db/src/media_db/embeddings/onnx_export.py
```
Notes:
- Run the export on CPU: MPS does not support the float64 used by the
audio encoder's mel filterbank.
- Requires `transformers` 5.x or later. The audio input keyword argument
was renamed from `audios=` to `audio=` between 4.x and 5.x; passing the
old kwarg silently returns wrong shapes.
- `tokenizer.json` is the unmodified file from the upstream HF repo,
fetched via `AutoTokenizer.from_pretrained(...).save_pretrained(...)`.
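
Because the old kwarg fails silently, it may be worth failing fast on an unsuitable `transformers` version before exporting. The guard below is a hedged sketch and is not taken from the export script; `major_version` and `check_transformers` are hypothetical helper names.

```python
import re
from importlib.metadata import PackageNotFoundError, version


def major_version(version_string):
    """Extract the leading integer of a version string such as '5.0.0rc1'."""
    match = re.match(r"\s*(\d+)", version_string)
    if match is None:
        raise ValueError(f"unrecognised version string: {version_string!r}")
    return int(match.group(1))


def check_transformers(installed=None, minimum_major=5):
    """Raise RuntimeError if transformers is missing or older than minimum_major.

    `installed` can be passed explicitly (useful for testing); otherwise the
    version of the installed package is looked up.
    """
    if installed is None:
        try:
            installed = version("transformers")
        except PackageNotFoundError as exc:
            raise RuntimeError("transformers is not installed") from exc
    if major_version(installed) < minimum_major:
        raise RuntimeError(
            f"transformers {installed} found; the export expects "
            f"{minimum_major}.x or later (audio kwarg rename)"
        )
    return installed
```

Calling `check_transformers()` at the top of an export run turns a silent shape mismatch into an explicit error.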
## License
BSD-3-Clause, the same as the upstream LAION CLAP weights. See the
upstream repo for the original notice and attribution.