ConceptualMachines / magda-sample-tagger

	---
	license: bsd-3-clause
	tags:
	- audio
	- audio-classification
	- sample-tagging
	- clap
	- htsat
	- onnx
	library_name: onnxruntime
	---

	# MAGDA Sample Tagger

	ONNX exports of [LAION's CLAP HTSAT-unfused model](https://huggingface.co/laion/clap-htsat-unfused)
	plus the RoBERTa tokenizer, packaged for the
	[MAGDA DAW](https://github.com/Conceptual-Machines/magda-core)'s sample
	library (issue #768).

	## What's in this repo

	| File | Size | SHA-256 |
	|------|------|---------|
	| `clap_audio.onnx` | 111.8 MB | `3f42f71e555b62709910b6efa66fa5879f00d9571874b12b0fa674f82dbfe332` |
	| `clap_text.onnx` | 478.2 MB | `c07b27204836877d5b615c103685b66ea8f21bc6b5b70a572be356125423a8bf` |
	| `tokenizer.json` | 3.4 MB | `4fd1d86b4f5b53f40a609fcd11c1f34024b735f870a07439d70202b98493661a` |

	- `clap_audio.onnx` — audio encoder. Takes a mono 48 kHz waveform,
	produces a 512-d normalised embedding suitable for cosine similarity
	search.
	- `clap_text.onnx` — text encoder. Takes RoBERTa token ids + attention
	mask, produces a 512-d normalised embedding in the same space as the
	audio encoder so a text query can rank audio files by similarity.
	- `tokenizer.json` — the RoBERTa BPE tokenizer that pairs with the
	text encoder. MAGDA's C++ tokenizer reads this file directly.
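
The checksums in the table can be checked after download. The following is a minimal sketch using only the Python standard library; the helper names `sha256_of` and `verify_models` are illustrative and are not part of MAGDA.

```python
import hashlib
from pathlib import Path

# Digests copied from the table above.
EXPECTED_SHA256 = {
    "clap_audio.onnx": "3f42f71e555b62709910b6efa66fa5879f00d9571874b12b0fa674f82dbfe332",
    "clap_text.onnx": "c07b27204836877d5b615c103685b66ea8f21bc6b5b70a572be356125423a8bf",
    "tokenizer.json": "4fd1d86b4f5b53f40a609fcd11c1f34024b735f870a07439d70202b98493661a",
}


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in 1 MiB chunks so large models are never fully in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_models(model_dir):
    """Return {filename: bool}; a missing file counts as a failed check."""
    results = {}
    for name, expected in EXPECTED_SHA256.items():
        path = Path(model_dir) / name
        results[name] = path.is_file() and sha256_of(path) == expected
    return results
```

Calling `verify_models("path/to/models")` returns a dictionary that maps each file name to `True` only when the file exists and its digest matches.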

	## How MAGDA uses these

	MAGDA's media database (a SQLite catalogue of audio samples) uses
	these encoders to:

	- Compute an embedding per indexed sample at index time, stored in the
	`media_embedding` table.
	- Encode the user's free-text search query at query time and rank
	samples by cosine similarity to the query embedding.

	Without these models MAGDA falls back to filename / tag full-text
	search — still useful, just no semantic similarity.
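
The index-time and query-time steps above can be sketched with the standard-library `sqlite3` module. This is an illustration, not MAGDA's implementation: only the table name `media_embedding` comes from this card, while the column names, the float32 BLOB encoding, and the helper functions are assumptions. Because both encoders emit normalised embeddings, the dot product equals the cosine similarity.

```python
import sqlite3
import struct


def pack(vec):
    """Serialise an embedding as little-endian float32 (assumed storage format)."""
    return struct.pack(f"<{len(vec)}f", *vec)


def unpack(blob):
    """Inverse of pack(): 4 bytes per float32 value."""
    return struct.unpack(f"<{len(blob) // 4}f", blob)


def open_catalogue(path=":memory:"):
    db = sqlite3.connect(path)
    # Hypothetical schema; the real media_embedding table may differ.
    db.execute(
        "CREATE TABLE IF NOT EXISTS media_embedding ("
        "sample_path TEXT PRIMARY KEY, embedding BLOB NOT NULL)"
    )
    return db


def index_sample(db, sample_path, embedding):
    """Index time: store one audio embedding per sample."""
    db.execute(
        "INSERT OR REPLACE INTO media_embedding VALUES (?, ?)",
        (sample_path, pack(embedding)),
    )


def rank(db, query_embedding, limit=10):
    """Query time: score every sample against the text-query embedding."""
    scored = []
    rows = db.execute("SELECT sample_path, embedding FROM media_embedding")
    for sample_path, blob in rows:
        # Unit-length vectors: dot product == cosine similarity.
        score = sum(q * v for q, v in zip(query_embedding, unpack(blob)))
        scored.append((score, sample_path))
    scored.sort(reverse=True)
    return [(path, score) for score, path in scored[:limit]]
```

A brute-force scan like this is linear in the number of indexed samples, which is usually acceptable for a personal sample library; a larger catalogue would likely want a vector index instead.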

	## Export procedure

	ONNX exports are generated from `laion/clap-htsat-unfused` via the
	export script in MAGDA's prototype:

	```
	prototypes/media_db/src/media_db/embeddings/onnx_export.py
	```

	Notes:

	- Run the export on CPU: PyTorch's MPS backend does not support float64,
	which the audio encoder's mel filterbank uses.
	- Requires `transformers` 5.x or later. The audio-input keyword argument
	was renamed from `audios=` to `audio=` between 4.x and 5.x; passing the
	old kwarg silently returns wrong shapes.
	- `tokenizer.json` is the unmodified file from the upstream HF repo,
	fetched via `AutoTokenizer.from_pretrained(...).save_pretrained(...)`.
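
Because the kwarg rename fails silently, an export script may want to fail loudly instead. The sketch below is one hedged way to do that and is not taken from `onnx_export.py`: it assumes the processor exposes `audio` or `audios` as an explicit parameter of its call signature, and the helper names `audio_kwarg_name` and `extract_audio_features` are hypothetical.

```python
import inspect


def audio_kwarg_name(processor):
    """Return the audio keyword the callable declares, or raise if neither exists."""
    params = inspect.signature(processor).parameters
    if "audio" in params:
        return "audio"  # newer name described in the notes above
    if "audios" in params:
        return "audios"  # older name described in the notes above
    raise TypeError("callable accepts neither 'audio' nor 'audios'")


def extract_audio_features(processor, waveform, sampling_rate=48_000):
    """Call the processor with whichever audio keyword it actually declares."""
    kwargs = {
        audio_kwarg_name(processor): waveform,
        "sampling_rate": sampling_rate,  # 48 kHz mono, as the audio encoder expects
        "return_tensors": "pt",
    }
    return processor(**kwargs)
```

Inspecting the signature avoids hard-coding a version check, at the cost of relying on the parameter being declared explicitly rather than swallowed by `**kwargs`.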

	## License

	BSD-3-Clause — same as the upstream LAION CLAP weights. See the
	upstream repo for the original notice and attribution.