AIST-95M
AIST-95M is the dual-audio Trimodal Embeddings teacher checkpoint built on:
- text: MongoDB/mdbr-leaf-ir
- image: mobilenetv4_conv_medium.e180_r384_in12k
- audio: mn20_as + whisper-tiny encoder
Its canonical Augmem name follows the repo standard:
- AIST = audio + image + speech + text, alphabetized and reduced to first letters
- 95M = exact loaded parameter count rounded to integer millions
It maps text, image, and audio into a shared 1280-dimensional embedding space with Matryoshka truncation support at [1280, 768, 512, 256, 128].
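As a minimal sketch of what Matryoshka truncation means in practice: keep the leading coordinates of a 1280-d embedding and compare vectors at the smaller size. The helper names `truncate_embedding` and `cosine` are hypothetical, and re-normalizing after truncation is an assumption, not something this card specifies.

```python
import math

# Matryoshka dims listed in this card, largest first.
MATRYOSHKA_DIMS = [1280, 768, 512, 256, 128]


def truncate_embedding(vec, dim):
    """Keep the first `dim` coordinates of `vec` and L2-normalize the result."""
    if dim not in MATRYOSHKA_DIMS:
        raise ValueError(f"dim must be one of {MATRYOSHKA_DIMS}, got {dim}")
    if len(vec) < dim:
        raise ValueError("embedding is shorter than the requested dim")
    head = list(vec[:dim])
    norm = math.sqrt(sum(x * x for x in head))
    # Assumption: re-normalize after truncation so a dot product is a cosine.
    return [x / norm for x in head] if norm > 0 else head


def cosine(a, b):
    """Dot product of two vectors that are already L2-normalized."""
    return sum(x * y for x, y in zip(a, b))
```

Both sides of a comparison should be truncated to the same dim, for example `cosine(truncate_embedding(q, 256), truncate_embedding(d, 256))`.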
This repo publishes the dual-audio teacher as a safetensors release artifact plus the exact local gate baseline used for later teacher-recovery experiments.
Parameter Count
Exact loaded parameter count in the deployed evaluation path:
| Component | Params |
|---|---|
| Text encoder (MongoDB/mdbr-leaf-ir) | 22,861,056 |
| Image encoder (mobilenetv4_conv_medium.e180_r384_in12k) | 8,434,512 |
| Audio encoder (mn20_as, full loaded module) | 17,909,287 |
| Audio encoder (openai/whisper-tiny, encoder only) | 8,208,384 |
| Image projection head | 12,306,560 |
| Audio projection head | 14,272,640 |
| Text projection head | 11,323,520 |
| Total exact loaded params | 95,315,959 |
For continuity with older notes:
- historical shorthand: TE-86M Dual Audio
- 89,048,552 params if you exclude the EfficientAT classifier head from the mn20_as module
- 37,902,720 params are trainable checkpoint weights in the three projection heads
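The accounting above can be checked with a few lines of arithmetic. This sketch only re-adds the figures quoted in this card. The dictionary keys are made-up labels, and the classifier-head size is derived here by subtraction rather than stated anywhere in the release.

```python
# Component counts copied from the parameter table in this card.
PARAMS = {
    "text_encoder": 22_861_056,
    "image_encoder": 8_434_512,
    "audio_encoder_mn20_as": 17_909_287,
    "audio_encoder_whisper_tiny": 8_208_384,
    "image_head": 12_306_560,
    "audio_head": 14_272_640,
    "text_head": 11_323_520,
}

total = sum(PARAMS.values())
# Only the three projection heads are trained; the encoders stay frozen.
trainable = sum(n for name, n in PARAMS.items() if name.endswith("_head"))

WITHOUT_CLASSIFIER = 89_048_552  # figure quoted in this card
# Derived by subtraction, not quoted in the card.
classifier_head = total - WITHOUT_CLASSIFIER

print(f"total loaded params : {total:,}")
print(f"trainable (heads)   : {trainable:,}")
print(f"mn20_as classifier  : {classifier_head:,}")
print(f"name suffix         : {round(total / 1e6)}M")
```

The total and the trainable subtotal reproduce the 95,315,959 and 37,902,720 figures, and the rounded total gives the 95M in the model name.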
Architecture
The dual-audio teacher uses a frozen-encoder / trained-head setup:
Text  -> mdbr-leaf-ir (768-d) ----------------> DeepProjectionHead-d2 -> 1280
Image -> MobileNetV4-Medium (1280-d) ---------> DeepProjectionHead-d2 -> 1280
Audio -> EfficientAT mn20_as (1920-d) ----\
                                           +--> concat(2304-d) -> DeepProjectionHead-d2 -> 1280
         Whisper-Tiny encoder (384-d) ----/
The audio path is dual-encoder by construction. EfficientAT contributes the environmental / general audio branch; Whisper-Tiny contributes the speech-sensitive branch.
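To make the fusion step concrete, here is a minimal sketch of the concatenation that feeds the audio head. It assumes each branch has already been reduced to one vector per clip (how the Whisper-Tiny frame sequence is pooled is not described in this card) and that the EfficientAT features come first, as drawn in the diagram. The function name `fuse_audio_features` is hypothetical.

```python
EFFICIENTAT_DIM = 1920  # mn20_as feature size from the diagram
WHISPER_DIM = 384       # Whisper-Tiny encoder feature size from the diagram
FUSED_DIM = EFFICIENTAT_DIM + WHISPER_DIM  # 2304, the audio head's input size


def fuse_audio_features(efficientat_feat, whisper_feat):
    """Concatenate the two per-clip branch vectors into one 2304-d vector."""
    if len(efficientat_feat) != EFFICIENTAT_DIM:
        raise ValueError(f"expected {EFFICIENTAT_DIM} EfficientAT features")
    if len(whisper_feat) != WHISPER_DIM:
        raise ValueError(f"expected {WHISPER_DIM} Whisper features")
    # Assumption: EfficientAT first, then Whisper, matching the diagram order.
    return list(efficientat_feat) + list(whisper_feat)
```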
Core training config:
- projection hidden dim: 1920
- projection output dim: 1280
- projection depth: 2
- loss: InfoNCE
- audio encoder dim after concat: 2304
- Matryoshka dims: [1280, 768, 512, 256, 128]
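For readers unfamiliar with the loss named above, the following is a generic, dependency-free sketch of a symmetric InfoNCE objective over a batch of paired embeddings. It is not taken from the training code: the symmetric form, the temperature of 0.07, and the function names are all assumptions made for illustration. The published YAML config is the authority on the actual settings.

```python
import math


def _diag_cross_entropy(logits):
    """Mean cross-entropy where row i's correct class is column i."""
    losses = []
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for a numerically stable log-sum-exp
        lse = m + math.log(sum(math.exp(x - m) for x in row))
        losses.append(lse - row[i])
    return sum(losses) / len(losses)


def info_nce(a, b, temperature=0.07):
    """Symmetric InfoNCE for L2-normalized pairs a[i] <-> b[i].

    `temperature` is a hypothetical default, not a value from this release.
    """
    logits = [
        [sum(x * y for x, y in zip(ai, bj)) / temperature for bj in b]
        for ai in a
    ]
    logits_t = [list(col) for col in zip(*logits)]  # b -> a direction
    return 0.5 * (_diag_cross_entropy(logits) + _diag_cross_entropy(logits_t))
```

The loss is low when each `a[i]` scores highest against its own `b[i]` and rises when pairs are mismatched.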
Published config file: te_mn20_whisper_d2_validaudio.yaml
Local Gate Baseline
The attached JSON teacher_dual_mn20whisper_exact_gate_baseline_20260424T155324Z.json is the canonical local gate baseline used for later teacher continuation experiments.
Seeded split-excluded baseline at 1280d:
| Slice | Metric |
|---|---|
| Speech holdout A->T R@1 | 0.5652 |
| Speech holdout T->A R@1 | 0.5992 |
| Speech holdout avg R@1 | 0.5822 |
| WavCaps FSD A->T R@1 | 0.1078 |
| WavCaps FSD T->A R@1 | 0.1030 |
| WavCaps FSD avg R@1 | 0.1054 |
| SALT A->I R@1 | 0.1692 |
| SALT I->A R@1 | 0.1261 |
Important scope note:
- These are the exact local gate numbers used for bounded recovery experiments.
- They are not a claim of broad public benchmark superiority.
- The external 4-task audio smoke baseline was not packaged into this release.
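For reference, R@1 in the baseline table can be read as the fraction of queries whose top-ranked candidate is the correct pair. The sketch below assumes one-to-one pairing (query `i` matches gallery item `i`) and dot-product scoring on normalized embeddings. The actual gate evaluation may differ in details such as tie handling or multiple captions per clip. `recall_at_1` is a hypothetical helper, not part of the release.

```python
def recall_at_1(query_embs, gallery_embs):
    """Fraction of queries whose best-scoring gallery item has the same index."""
    hits = 0
    for i, q in enumerate(query_embs):
        scores = [sum(x * y for x, y in zip(q, g)) for g in gallery_embs]
        # On ties, the earliest gallery index wins.
        best = max(range(len(scores)), key=scores.__getitem__)
        hits += int(best == i)
    return hits / len(query_embs)


# The "avg" rows in the table are the mean of the two retrieval directions.
speech_avg = (0.5652 + 0.5992) / 2
wavcaps_avg = (0.1078 + 0.1030) / 2
```

Calling `recall_at_1(audio, text)` gives the A->T direction and `recall_at_1(text, audio)` gives T->A.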
Files
| File | Purpose |
|---|---|
| AIST-95M.safetensors | Self-contained dual-audio teacher release artifact |
| te_mn20_whisper_d2_validaudio.yaml | Training config for the teacher line |
| teacher_dual_mn20whisper_exact_gate_baseline_20260424T155324Z.json | Canonical exact-gate baseline |
| parameter_breakdown.json | Exact parameter accounting used in this card |
Loading
This release is a self-contained safetensors artifact containing:
- text encoder weights
- image encoder weights
- EfficientAT audio encoder weights
- Whisper-Tiny encoder weights
- text / image / audio projection heads
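One way to inspect the artifact without loading any weights is to read the safetensors header directly: the file starts with an 8-byte little-endian length, followed by a JSON header that maps tensor names to dtype, shape, and byte offsets. The sketch below uses only the standard library. Grouping by the text before the first dot in each tensor name is an assumption about how keys are laid out in this release, and both function names are hypothetical.

```python
import json
import struct
from collections import defaultdict


def read_safetensors_header(path):
    """Return the JSON header of a .safetensors file without reading tensor data."""
    with open(path, "rb") as f:
        (header_len,) = struct.unpack("<Q", f.read(8))  # little-endian u64
        header = json.loads(f.read(header_len).decode("utf-8"))
    header.pop("__metadata__", None)  # optional free-form metadata entry
    return header


def params_by_prefix(header):
    """Sum element counts per top-level key prefix (text before the first dot)."""
    counts = defaultdict(int)
    for name, info in header.items():
        n = 1
        for d in info["shape"]:
            n *= d
        counts[name.split(".", 1)[0]] += n
    return dict(counts)


# Example usage, assuming the file has been downloaded locally:
#   header = read_safetensors_header("AIST-95M.safetensors")
#   for prefix, n in sorted(params_by_prefix(header).items()):
#       print(f"{prefix}: {n:,}")
```

Element counts obtained this way include any non-trainable buffers stored in the file, so they may not match the parameter table exactly.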
Caveats
- This release uses the canonical Augmem name AIST-95M.
- Older TE-86M Dual Audio references are legacy aliases for the same artifact line.
- The existing older augmem/TE-86M release on Hugging Face is a different artifact line.