| --- |
| license: bsd-3-clause |
| tags: |
| - audio |
| - audio-classification |
| - sample-tagging |
| - clap |
| - htsat |
| - onnx |
| library_name: onnxruntime |
| --- |
| |
| # MAGDA Sample Tagger |
|
|
| ONNX exports of [LAION's CLAP HTSAT-unfused model](https://huggingface.co/laion/clap-htsat-unfused) |
| plus the RoBERTa tokenizer, packaged for the |
| [MAGDA DAW](https://github.com/Conceptual-Machines/magda-core)'s sample |
| library (issue #768). |
|
|
| ## What's in this repo |
|
|
| | File | Size | SHA-256 | |
| |------|------|---------| |
| | `clap_audio.onnx` | 111.8 MB | `3f42f71e555b62709910b6efa66fa5879f00d9571874b12b0fa674f82dbfe332` | |
| | `clap_text.onnx` | 478.2 MB | `c07b27204836877d5b615c103685b66ea8f21bc6b5b70a572be356125423a8bf` | |
| | `tokenizer.json` | 3.4 MB | `4fd1d86b4f5b53f40a609fcd11c1f34024b735f870a07439d70202b98493661a` | |
|
|
| - `clap_audio.onnx`: audio encoder. Takes a mono 48 kHz waveform, |
| produces a 512-d normalised embedding suitable for cosine similarity |
| search. |
| - `clap_text.onnx`: text encoder. Takes RoBERTa token ids + attention |
| mask, produces a 512-d normalised embedding in the same space as the |
| audio encoder so a text query can rank audio files by similarity. |
| - `tokenizer.json`: the RoBERTa BPE tokenizer that pairs with the |
| text encoder. MAGDA's C++ tokenizer reads this file directly. |
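
One way to confirm that a download matches the SHA-256 values in the table above is to recompute the digests locally. The following is a minimal sketch, not part of MAGDA: the helper names `sha256_of` and `verify_files`, and the assumption that all three files sit in one directory, are illustrative only.

```python
import hashlib
from pathlib import Path

# Digests copied from the table above.
EXPECTED_SHA256 = {
    "clap_audio.onnx": "3f42f71e555b62709910b6efa66fa5879f00d9571874b12b0fa674f82dbfe332",
    "clap_text.onnx": "c07b27204836877d5b615c103685b66ea8f21bc6b5b70a572be356125423a8bf",
    "tokenizer.json": "4fd1d86b4f5b53f40a609fcd11c1f34024b735f870a07439d70202b98493661a",
}


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in 1 MiB chunks so large models are never loaded whole."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_files(directory, expected=EXPECTED_SHA256):
    """Return the names of files that are missing or whose digest differs."""
    bad = []
    for name, want in expected.items():
        path = Path(directory) / name
        if not path.is_file() or sha256_of(path) != want:
            bad.append(name)
    return bad
```

Calling `verify_files("path/to/models")` returns an empty list when every file is present and intact; the directory path here is a placeholder.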
|
|
| ## How MAGDA uses these |
|
|
| MAGDA's media database (a SQLite catalogue of audio samples) uses |
| these encoders to: |
|
|
| - Compute an embedding per indexed sample at index time, stored in the |
| `media_embedding` table. |
| - Encode the user's free-text search query at query time and rank |
| samples by cosine similarity to the query embedding. |
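
The two steps above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not MAGDA's implementation: only the table name `media_embedding` comes from this README, while the `path` and `embedding` columns, the float32 BLOB layout, and the helper names are hypothetical. Toy 3-d vectors stand in for the real 512-d embeddings.

```python
import math
import sqlite3
import struct


def pack_embedding(vec):
    """Serialise a list of floats as little-endian float32 bytes."""
    return struct.pack(f"<{len(vec)}f", *vec)


def unpack_embedding(blob):
    """Inverse of pack_embedding (4 bytes per float32)."""
    return list(struct.unpack(f"<{len(blob) // 4}f", blob))


def l2_normalise(vec):
    """Scale a vector to unit length; zero vectors are returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)


def rank_samples(conn, query_vec, limit=10):
    """Return (path, score) pairs, best match first.

    For unit-length vectors, cosine similarity reduces to the dot product.
    """
    query = l2_normalise(query_vec)
    scored = []
    # Hypothetical schema: one row per sample, embedding stored as a BLOB.
    for path, blob in conn.execute("SELECT path, embedding FROM media_embedding"):
        emb = l2_normalise(unpack_embedding(blob))
        scored.append((path, sum(q * e for q, e in zip(query, emb))))
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:limit]


# Index time: store one embedding per sample.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE media_embedding (path TEXT PRIMARY KEY, embedding BLOB)")
conn.executemany(
    "INSERT INTO media_embedding VALUES (?, ?)",
    [
        ("kick.wav", pack_embedding([1.0, 0.0, 0.0])),
        ("snare.wav", pack_embedding([0.0, 1.0, 0.0])),
        ("tom.wav", pack_embedding([0.6, 0.8, 0.0])),
    ],
)

# Query time: the query vector would come from the text encoder.
results = rank_samples(conn, [1.0, 0.0, 0.0], limit=2)
```

A linear scan like this is the simplest option; whether MAGDA scans, caches, or uses an index is not stated here.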
|
|
| Without these models MAGDA falls back to filename / tag full-text |
| search, which still works but offers no semantic similarity. |
|
|
| ## Export procedure |
|
|
| ONNX exports are generated from `laion/clap-htsat-unfused` via the |
| export script in MAGDA's prototype: |
|
|
| ``` |
| prototypes/media_db/src/media_db/embeddings/onnx_export.py |
| ``` |
|
|
| Notes: |
|
|
| - Run the export on CPU: MPS does not support the float64 used by the |
| audio encoder's mel filterbank. |
| - Requires `transformers >= 5.x`. The audio input keyword argument was |
| renamed from `audios=` to `audio=` between 4.x and 5.x; passing the old |
| kwarg silently returns wrong shapes. |
| - `tokenizer.json` is the unmodified file from the upstream HF repo, |
| fetched via `AutoTokenizer.from_pretrained(...).save_pretrained(...)`. |
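
Because the wrong `transformers` version fails silently, a guard at the top of an export run can turn the problem into an explicit error. The following is a minimal sketch and is not taken from `onnx_export.py`: the function names `major_version` and `check_transformers` are hypothetical, and the check only compares the major version number.

```python
from importlib import metadata


def major_version(version_string):
    """Return the leading integer of a version string such as '5.0.0rc1'."""
    return int(version_string.split(".")[0])


def check_transformers(min_major=5):
    """Raise RuntimeError unless transformers >= min_major is installed."""
    try:
        installed = metadata.version("transformers")
    except metadata.PackageNotFoundError as exc:
        raise RuntimeError("transformers is not installed") from exc
    if major_version(installed) < min_major:
        raise RuntimeError(
            f"transformers {installed} found, but {min_major}.x or newer is "
            "required (the `audios=` kwarg was renamed to `audio=`)"
        )
    return installed
```

Calling `check_transformers()` before loading the model makes the version requirement fail fast instead of producing embeddings with the wrong shape.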
|
|
| ## License |
|
|
| BSD-3-Clause, same as the upstream LAION CLAP weights. See the |
| upstream repo for the original notice and attribution. |
|
|