model card
README.md
ADDED
---
license: bsd-3-clause
tags:
- audio
- audio-classification
- sample-tagging
- clap
- htsat
- onnx
library_name: onnxruntime
---

# MAGDA Sample Tagger

ONNX exports of [LAION's CLAP HTSAT-unfused model](https://huggingface.co/laion/clap-htsat-unfused)
plus the RoBERTa tokenizer, packaged for the
[MAGDA DAW](https://github.com/Conceptual-Machines/magda-core)'s sample
library (issue #768).

## What's in this repo

| File | Size | SHA-256 |
|------|------|---------|
| `clap_audio.onnx` | 111.8 MB | `3f42f71e555b62709910b6efa66fa5879f00d9571874b12b0fa674f82dbfe332` |
| `clap_text.onnx` | 478.2 MB | `c07b27204836877d5b615c103685b66ea8f21bc6b5b70a572be356125423a8bf` |
| `tokenizer.json` | 3.4 MB | `4fd1d86b4f5b53f40a609fcd11c1f34024b735f870a07439d70202b98493661a` |

- `clap_audio.onnx`: audio encoder. Takes a mono 48 kHz waveform and
  produces a 512-d normalised embedding suitable for cosine similarity
  search.
- `clap_text.onnx`: text encoder. Takes RoBERTa token ids + attention
  mask and produces a 512-d normalised embedding in the same space as the
  audio encoder, so a text query can rank audio files by similarity.
- `tokenizer.json`: the RoBERTa BPE tokenizer that pairs with the
  text encoder. MAGDA's C++ tokenizer reads this file directly.
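
The SHA-256 column above can be used to confirm that a download is intact before loading the models. The following is a minimal sketch, not part of MAGDA: the function names `sha256_of` and `verify_models` and the idea of a local model directory are assumptions made for illustration; only the file names and digests come from the table.

```python
import hashlib
from pathlib import Path

# Digests copied from the table in this card.
EXPECTED_SHA256 = {
    "clap_audio.onnx": "3f42f71e555b62709910b6efa66fa5879f00d9571874b12b0fa674f82dbfe332",
    "clap_text.onnx": "c07b27204836877d5b615c103685b66ea8f21bc6b5b70a572be356125423a8bf",
    "tokenizer.json": "4fd1d86b4f5b53f40a609fcd11c1f34024b735f870a07439d70202b98493661a",
}


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in 1 MiB chunks so large models are never fully in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_models(model_dir: Path, expected: dict = EXPECTED_SHA256) -> dict:
    """Return {filename: bool}; a missing file counts as a failed check.

    `model_dir` is a hypothetical download location, not a path MAGDA defines.
    """
    results = {}
    for name, want in expected.items():
        path = model_dir / name
        results[name] = path.is_file() and sha256_of(path) == want
    return results
```

Calling `verify_models(Path("models"))` returns a per-file pass/fail map, so a caller can report exactly which file needs to be downloaded again.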

## How MAGDA uses these

MAGDA's media database (a SQLite catalogue of audio samples) uses
these encoders to:

- Compute an embedding per indexed sample at index time, stored in the
  `media_embedding` table.
- Encode the user's free-text search query at query time and rank
  samples by cosine similarity to the query embedding.
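
The two steps above can be sketched with Python's standard library. This is an illustration under stated assumptions, not MAGDA's implementation: only the table name `media_embedding` and the 512-d normalised embeddings come from this card. The column layout, the float32 BLOB encoding, and the helper names are hypothetical, and the embeddings are passed in directly rather than produced by the ONNX encoders.

```python
import sqlite3
from array import array

DIM = 512  # embedding width stated in this card


def pack(vec):
    """Serialise a float vector as a float32 BLOB (assumed storage format)."""
    return array("f", vec).tobytes()


def unpack(blob):
    """Inverse of pack(): decode a float32 BLOB back into an array of floats."""
    out = array("f")
    out.frombytes(blob)
    return out


def open_db():
    # Hypothetical schema: only the table name appears in this card.
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE media_embedding (path TEXT PRIMARY KEY, embedding BLOB NOT NULL)"
    )
    return db


def index_sample(db, path, embedding):
    """Index time: store one audio embedding per sample."""
    if len(embedding) != DIM:
        raise ValueError(f"expected {DIM} values, got {len(embedding)}")
    db.execute(
        "INSERT OR REPLACE INTO media_embedding VALUES (?, ?)", (path, pack(embedding))
    )


def rank(db, query_embedding, limit=10):
    """Query time: score every stored sample against the text-query embedding.

    Both vectors are assumed to be unit length, so the dot product equals
    the cosine similarity.
    """
    scored = []
    for path, blob in db.execute("SELECT path, embedding FROM media_embedding"):
        score = sum(a * b for a, b in zip(unpack(blob), query_embedding))
        scored.append((score, path))
    scored.sort(reverse=True)
    return [(path, score) for score, path in scored[:limit]]
```

A linear scan like this is the simplest possible design. Whether MAGDA scans, caches, or uses an index for larger libraries is not stated in this card.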

Without these models MAGDA falls back to filename / tag full-text
search, which is still useful but offers no semantic similarity.

## Export procedure

The ONNX exports are generated from `laion/clap-htsat-unfused` by the
export script in MAGDA's prototype:

```
prototypes/media_db/src/media_db/embeddings/onnx_export.py
```

Notes:

- Run the export on CPU: MPS does not support float64, which the audio
  encoder's mel filterbank uses.
- Requires `transformers` 5.x or later. The audio-feature keyword argument
  was renamed from `audios=` to `audio=` between 4.x and 5.x; passing the
  old name silently returns wrong shapes.
- `tokenizer.json` is the unmodified file from the upstream HF repo,
  fetched via `AutoTokenizer.from_pretrained(...).save_pretrained(...)`.
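
Because the wrong `transformers` major version fails silently according to the notes above, an export script may want to fail loudly instead. The guard below is a hedged sketch: `major_of` and `require_transformers_5` are hypothetical helpers, not functions from `onnx_export.py`, and the check relies only on the standard-library `importlib.metadata`.

```python
from importlib import metadata
from typing import Optional


def major_of(version: str) -> int:
    """Leading integer of a version string such as '5.0.0' or '5.1.0rc1'."""
    return int(version.split(".", 1)[0])


def require_transformers_5(installed: Optional[str] = None) -> str:
    """Raise before exporting if the installed transformers predates 5.x.

    Pass `installed` explicitly to check a known version string; otherwise the
    installed package is queried, which raises PackageNotFoundError if
    transformers is absent.
    """
    if installed is None:
        installed = metadata.version("transformers")
    if major_of(installed) < 5:
        raise RuntimeError(
            f"transformers {installed} found; the export expects 5.x or later "
            "(the audio keyword argument is `audio=`, not `audios=`)"
        )
    return installed
```

Calling `require_transformers_5()` at the top of an export run turns a silent shape mismatch into an immediate, explained error.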

## License

BSD-3-Clause, the same as the upstream LAION CLAP weights. See the
upstream repo for the original notice and attribution.