	---
	language:
	- en
	license: apache-2.0
	tags:
	- multimodal
	- embedding
	- trimodal
	- dual-audio
	- retrieval
	- cross-modal
	- image-text-audio
	- feature-extraction
	library_name: pytorch
	pipeline_tag: feature-extraction
	datasets:
	- custom
	---

	# AIST-95M

	`AIST-95M` is the dual-audio Trimodal Embeddings teacher checkpoint built on:

	- text: `MongoDB/mdbr-leaf-ir`
	- image: `mobilenetv4_conv_medium.e180_r384_in12k`
	- audio: EfficientAT `mn20_as` + `openai/whisper-tiny` encoder

	Its canonical Augmem name follows the repo standard:

	- `AIST` = `audio + image + speech + text`, alphabetized and reduced to first letters
	- `95M` = exact loaded parameter count rounded to integer millions

	It maps text, image, and audio into a shared 1280-dimensional embedding space with Matryoshka truncation support at `[1280, 768, 512, 256, 128]`.
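
As a rough illustration of Matryoshka truncation, the sketch below keeps the leading coordinates of a full 1280-d embedding and re-normalizes the result. Re-normalizing to unit L2 norm is an assumption made here (it is the usual choice for cosine retrieval); the card does not state how truncated vectors are post-processed. The function name `truncate_embedding` and the random stand-in vectors are hypothetical.

```python
import numpy as np

MATRYOSHKA_DIMS = (1280, 768, 512, 256, 128)  # supported sizes, from this card


def truncate_embedding(emb: np.ndarray, dim: int) -> np.ndarray:
    """Keep the first `dim` coordinates and rescale each row to unit L2 norm."""
    if dim not in MATRYOSHKA_DIMS:
        raise ValueError(f"dim must be one of {MATRYOSHKA_DIMS}, got {dim}")
    head = emb[..., :dim]
    norm = np.linalg.norm(head, axis=-1, keepdims=True)
    # Assumption: re-normalize after truncation so cosine similarity stays meaningful.
    return head / np.clip(norm, 1e-12, None)


# Stand-in for real model outputs: four random 1280-d vectors.
rng = np.random.default_rng(0)
full = rng.standard_normal((4, 1280)).astype(np.float32)
small = truncate_embedding(full, 256)
print(small.shape)  # (4, 256)
```

Smaller sizes trade retrieval quality for index size and speed; this card reports gate metrics only at `1280d`.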

	This repo publishes the dual-audio teacher as a safetensors release artifact plus the exact local gate baseline used for later teacher-recovery experiments.

	## Parameter Count

	Exact loaded parameter count in the deployed evaluation path:

	| Component | Params |
	|---|---:|
	| Text encoder (`MongoDB/mdbr-leaf-ir`) | 22,861,056 |
	| Image encoder (`mobilenetv4_conv_medium.e180_r384_in12k`) | 8,434,512 |
	| Audio encoder (`mn20_as`, full loaded module) | 17,909,287 |
	| Audio encoder (`openai/whisper-tiny`, encoder only) | 8,208,384 |
	| Image projection head | 12,306,560 |
	| Audio projection head | 14,272,640 |
	| Text projection head | 11,323,520 |
	| Total exact loaded params | 95,315,959 |

	For continuity with older notes:

	- historical shorthand: `TE-86M Dual Audio`
	- `89,048,552` params if you exclude the EfficientAT classifier head (`6,267,407` params) from the `mn20_as` module
	- `37,902,720` params are trainable checkpoint weights in the three projection heads
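
The figures above can be cross-checked with plain arithmetic. The short script below only re-adds the numbers printed in this card; the dictionary keys are labels chosen for this sketch, not names from the checkpoint.

```python
# Per-component counts copied from the Parameter Count table.
PARAMS = {
    "text_encoder": 22_861_056,
    "image_encoder": 8_434_512,
    "audio_encoder_mn20_as": 17_909_287,
    "audio_encoder_whisper_tiny": 8_208_384,
    "image_head": 12_306_560,
    "audio_head": 14_272_640,
    "text_head": 11_323_520,
}
TOTAL = 95_315_959               # "Total exact loaded params"
WITHOUT_CLASSIFIER = 89_048_552  # total without the EfficientAT classifier head

total = sum(PARAMS.values())
heads = sum(v for k, v in PARAMS.items() if k.endswith("_head"))
classifier_head = TOTAL - WITHOUT_CLASSIFIER

print(total)               # 95315959
print(heads)               # 37902720
print(classifier_head)     # 6267407
print(round(total / 1e6))  # 95, the "95M" in the model name
```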

	## Architecture

	The dual-audio teacher uses a frozen-encoder / trained-head setup:

	```text
	Text  -> mdbr-leaf-ir (768-d) ----------------> DeepProjectionHead-d2 -> 1280
	Image -> MobileNetV4-Medium (1280-d) ---------> DeepProjectionHead-d2 -> 1280
	Audio -> EfficientAT mn20_as (1920-d) --\
	                                         +--> concat (2304-d) -> DeepProjectionHead-d2 -> 1280
	         Whisper-Tiny encoder (384-d) --/
	```

	The audio path is dual-encoder by construction. EfficientAT contributes the environmental / general audio branch; Whisper-Tiny contributes the speech-sensitive branch.
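
A minimal sketch of the fusion step, using random arrays in place of real encoder outputs. The card specifies only the branch widths (1920 and 384) and the concatenated width (2304); mean-pooling the Whisper frames over time and the function name `fuse_audio_features` are assumptions made for illustration.

```python
import numpy as np

EFFICIENTAT_DIM = 1920  # mn20_as feature width, from the diagram
WHISPER_DIM = 384       # Whisper-Tiny encoder width, from the diagram


def fuse_audio_features(eat_feat: np.ndarray, whisper_frames: np.ndarray) -> np.ndarray:
    """Concatenate one EfficientAT vector with time-pooled Whisper frames.

    Assumption: Whisper frames are mean-pooled over time before concatenation.
    """
    assert eat_feat.shape == (EFFICIENTAT_DIM,)
    assert whisper_frames.ndim == 2 and whisper_frames.shape[1] == WHISPER_DIM
    whisper_feat = whisper_frames.mean(axis=0)
    return np.concatenate([eat_feat, whisper_feat])


rng = np.random.default_rng(0)
fused = fuse_audio_features(
    rng.standard_normal(EFFICIENTAT_DIM),
    rng.standard_normal((1500, WHISPER_DIM)),  # Whisper encoders emit 1500 frames per 30 s window
)
print(fused.shape)  # (2304,)
```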

	Core training config:

	- projection hidden dim: `1920`
	- projection output dim: `1280`
	- projection depth: `2`
	- loss: InfoNCE
	- audio encoder dim after concat: `2304`
	- Matryoshka dims: `[1280, 768, 512, 256, 128]`
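
The `DeepProjectionHead` implementation is not included in this card. As a sanity check on the config values, the sketch below counts parameters for one hypothetical layout: an input linear layer plus LayerNorm, `depth` hidden blocks of linear plus LayerNorm, and an output linear layer, all with biases. This layout reproduces the three published head counts exactly, but other layouts could match too, and parameter-free parts (activations, residual connections, dropout) cannot be inferred this way.

```python
HIDDEN, OUT, DEPTH = 1920, 1280, 2  # from the training config above


def linear(n_in: int, n_out: int) -> int:
    return n_in * n_out + n_out  # weight matrix + bias


def layer_norm(n: int) -> int:
    return 2 * n  # scale + shift


def head_params(n_in: int, hidden: int = HIDDEN, out: int = OUT, depth: int = DEPTH) -> int:
    """Parameter count for the hypothetical head layout described in the lead-in."""
    stem = linear(n_in, hidden) + layer_norm(hidden)
    blocks = depth * (linear(hidden, hidden) + layer_norm(hidden))
    return stem + blocks + linear(hidden, out)


for name, n_in in [("text", 768), ("image", 1280), ("audio", 2304)]:
    print(name, head_params(n_in))
# text 11323520
# image 12306560
# audio 14272640
```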

	Published config file: `te_mn20_whisper_d2_validaudio.yaml`
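
The config names InfoNCE as the loss but this card does not give its exact form. The NumPy sketch below shows a common symmetric variant over a batch of paired embeddings. The symmetric averaging, the temperature value `0.07`, and the function name are all assumptions; the published YAML is the place to look for the real settings.

```python
import numpy as np


def info_nce(a: np.ndarray, b: np.ndarray, temperature: float = 0.07) -> float:
    """Symmetric InfoNCE for paired rows: row i of `a` matches row i of `b`."""
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    logits = a @ b.T / temperature  # cosine similarities scaled by temperature

    def cross_entropy_diag(z: np.ndarray) -> float:
        # Numerically stable log-softmax per row, then pick the matching (diagonal) entry.
        z = z - z.max(axis=1, keepdims=True)
        log_probs = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
        return float(-np.mean(np.diag(log_probs)))

    # Average the a->b and b->a directions.
    return 0.5 * (cross_entropy_diag(logits) + cross_entropy_diag(logits.T))


rng = np.random.default_rng(0)
x = rng.standard_normal((8, 1280))
aligned = info_nce(x, x)           # every row paired with itself
misaligned = info_nce(x, x[::-1])  # every row paired with a different row
print(aligned < misaligned)  # True
```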

	## Local Gate Baseline

	The attached JSON `teacher_dual_mn20whisper_exact_gate_baseline_20260424T155324Z.json` is the canonical local gate baseline used for later teacher continuation experiments.

	Seeded, split-excluded baseline at the full `1280`-d embedding size:

	| Slice | Value |
	|---|---:|
	| Speech holdout A->T R@1 | 0.5652 |
	| Speech holdout T->A R@1 | 0.5992 |
	| Speech holdout avg R@1 | 0.5822 |
	| WavCaps FSD A->T R@1 | 0.1078 |
	| WavCaps FSD T->A R@1 | 0.1030 |
	| WavCaps FSD avg R@1 | 0.1054 |
	| SALT A->I R@1 | 0.1692 |
	| SALT I->A R@1 | 0.1261 |

	Important scope note:

	- These are the exact local gate numbers used for bounded recovery experiments.
	- They are not a claim of broad public benchmark superiority.
	- The external 4-task audio smoke baseline was not packaged into this release.
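
For readers unfamiliar with the metric, R@1 (recall at 1) is the fraction of queries whose single nearest neighbour in the other modality is the correct match. The sketch below assumes one-to-one pairs and cosine similarity; the exact gate protocol (candidate pool, tie handling, seeds) is not described in this card, so this is an illustration rather than the evaluation script. The "avg" rows in the table are the mean of the two directions, for example (0.5652 + 0.5992) / 2 = 0.5822.

```python
import numpy as np


def recall_at_1(queries: np.ndarray, gallery: np.ndarray) -> float:
    """Fraction of queries whose top-1 cosine neighbour in `gallery` has the same index."""
    q = queries / np.linalg.norm(queries, axis=1, keepdims=True)
    g = gallery / np.linalg.norm(gallery, axis=1, keepdims=True)
    top1 = (q @ g.T).argmax(axis=1)
    return float((top1 == np.arange(len(q))).mean())


# Toy data: "text" vectors are slightly perturbed copies of the "audio" vectors.
rng = np.random.default_rng(0)
audio = rng.standard_normal((5, 16))
text = audio + 0.01 * rng.standard_normal((5, 16))

a2t = recall_at_1(audio, text)  # audio -> text direction
t2a = recall_at_1(text, audio)  # text -> audio direction
avg = (a2t + t2a) / 2
print(a2t, t2a, avg)
```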

	## Files

	| File | Purpose |
	|---|---|
	| `AIST-95M.safetensors` | Self-contained dual-audio teacher release artifact |
	| `te_mn20_whisper_d2_validaudio.yaml` | Training config for the teacher line |
	| `teacher_dual_mn20whisper_exact_gate_baseline_20260424T155324Z.json` | Canonical exact-gate baseline |
	| `parameter_breakdown.json` | Exact parameter accounting used in this card |

	## Loading

	This release is a self-contained safetensors artifact containing:

	- text encoder weights
	- image encoder weights
	- EfficientAT audio encoder weights
	- Whisper-Tiny encoder weights
	- text / image / audio projection heads
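
This card does not ship model classes or document the tensor key names, so the sketch below only inspects the file: it reads every tensor with `safetensors.torch.load_file` and sums parameter counts per top-level key prefix, which can then be compared with the Parameter Count table. The helper names and the demo keys are hypothetical; `load_shapes` needs `safetensors` and `torch` installed and a local copy of the file.

```python
from collections import defaultdict
from math import prod


def load_shapes(path: str = "AIST-95M.safetensors") -> dict:
    """Read tensor names and shapes from the release file."""
    # Imported lazily so params_by_prefix can be used without safetensors/torch.
    from safetensors.torch import load_file

    return {name: tuple(t.shape) for name, t in load_file(path, device="cpu").items()}


def params_by_prefix(shapes: dict) -> dict:
    """Sum parameter counts per top-level key prefix (text before the first dot)."""
    totals = defaultdict(int)
    for name, shape in shapes.items():
        totals[name.split(".", 1)[0]] += prod(shape)
    return dict(totals)


# Made-up key names, only to show the output format; real keys may differ.
demo = {
    "text_head.0.weight": (1920, 768),
    "text_head.0.bias": (1920,),
    "image_head.0.weight": (1920, 1280),
}
print(params_by_prefix(demo))
# {'text_head': 1476480, 'image_head': 2457600}
```

With the real file, `params_by_prefix(load_shapes())` gives a quick view of how the weights are grouped before wiring them into encoder and head modules.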

	## Caveats

	- This release uses the canonical Augmem name `AIST-95M`.
	- Older `TE-86M Dual Audio` references are legacy aliases for the same artifact line.
	- The existing older `augmem/TE-86M` release on Hugging Face is a different artifact line.