---
language:
- en
license: apache-2.0
tags:
- multimodal
- embedding
- trimodal
- dual-audio
- retrieval
- cross-modal
- image-text-audio
- feature-extraction
library_name: pytorch
pipeline_tag: feature-extraction
datasets:
- custom
---
# AIST-95M
`AIST-95M` is the dual-audio Trimodal Embeddings teacher checkpoint built on:
- text: `MongoDB/mdbr-leaf-ir`
- image: `mobilenetv4_conv_medium.e180_r384_in12k`
- audio: `mn20_as` (EfficientAT) + `openai/whisper-tiny` encoder
Its canonical Augmem name follows the repo standard:
- `AIST` = `audio + image + speech + text`, alphabetized and reduced to first letters
- `95M` = exact loaded parameter count rounded to integer millions
It maps text, image, and audio into a shared 1280-dimensional embedding space with Matryoshka truncation support at `[1280, 768, 512, 256, 128]`.
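
As a rough illustration of what Matryoshka truncation means for a consumer of these embeddings, the sketch below keeps the leading components of a vector and rescales the result to unit length. It assumes embeddings are compared by cosine similarity, which this card does not state; `truncate_embedding` is a hypothetical helper, and the input vector is a synthetic stand-in rather than real model output.

```python
import math

# Supported truncation sizes, as listed in this card.
MATRYOSHKA_DIMS = (1280, 768, 512, 256, 128)


def truncate_embedding(vec, dim):
    """Keep the first `dim` components and L2-renormalize.

    Assumption: embeddings are compared by cosine similarity, so the
    truncated prefix is rescaled to unit length.
    """
    if dim not in MATRYOSHKA_DIMS:
        raise ValueError(f"dim must be one of {MATRYOSHKA_DIMS}, got {dim}")
    if len(vec) < dim:
        raise ValueError(f"embedding has {len(vec)} dims, cannot keep {dim}")
    prefix = list(vec[:dim])
    norm = math.sqrt(sum(x * x for x in prefix))
    if norm == 0.0:
        return prefix  # all-zero prefix: nothing to rescale
    return [x / norm for x in prefix]


# Synthetic stand-in for a 1280-d embedding (not real model output).
full = [math.sin(i + 1) for i in range(1280)]
small = truncate_embedding(full, 256)
print(len(small))
```

Truncated vectors are only comparable with other vectors truncated to the same size, so an index would typically be built at a single chosen dimension.
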
This repo publishes the dual-audio teacher as a safetensors release artifact plus the exact local gate baseline used for later teacher-recovery experiments.
## Parameter Count
Exact loaded parameter count in the deployed evaluation path:
| Component | Params |
|---|---:|
| Text encoder (`MongoDB/mdbr-leaf-ir`) | 22,861,056 |
| Image encoder (`mobilenetv4_conv_medium.e180_r384_in12k`) | 8,434,512 |
| Audio encoder (`mn20_as`, full loaded module) | 17,909,287 |
| Audio encoder (`openai/whisper-tiny`, encoder only) | 8,208,384 |
| Image projection head | 12,306,560 |
| Audio projection head | 14,272,640 |
| Text projection head | 11,323,520 |
| **Total exact loaded params** | **95,315,959** |
For continuity with older notes:
- historical shorthand: `TE-86M Dual Audio`
- `89,048,552` params if the EfficientAT classifier head is excluded from the `mn20_as` module
- `37,902,720` of the total are trainable checkpoint weights, all in the three projection heads
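
The figures in this section can be cross-checked with a few lines of arithmetic. The sketch below copies the per-component counts from the table; the dictionary keys are illustrative labels, not names taken from the checkpoint, and the classifier-head size is only implied by the difference between the two totals.

```python
# Per-component counts copied from the table in this card.
# The keys are illustrative labels, not checkpoint key names.
PARAMS = {
    "text_encoder": 22_861_056,
    "image_encoder": 8_434_512,
    "audio_encoder_mn20_as": 17_909_287,
    "audio_encoder_whisper_tiny": 8_208_384,
    "image_head": 12_306_560,
    "audio_head": 14_272_640,
    "text_head": 11_323_520,
}

total = sum(PARAMS.values())
heads = sum(v for k, v in PARAMS.items() if k.endswith("_head"))
encoders = total - heads
# Size of the EfficientAT classifier head implied by the 89,048,552 figure.
classifier_head = total - 89_048_552

print(f"total loaded params     {total:>12,}")
print(f"projection heads        {heads:>12,}")
print(f"encoders                {encoders:>12,}")
print(f"implied classifier head {classifier_head:>12,}")
print(f"rounded to millions     {round(total / 1e6)}M")
```
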
## Architecture
The dual-audio teacher uses a frozen-encoder / trained-head setup:
```text
Text  -> mdbr-leaf-ir (768-d) ----------------> DeepProjectionHead-d2 -> 1280
Image -> MobileNetV4-Medium (1280-d) ---------> DeepProjectionHead-d2 -> 1280
Audio -> EfficientAT mn20_as (1920-d) \
                                       +--> concat(2304-d) -> DeepProjectionHead-d2 -> 1280
         Whisper-Tiny encoder (384-d) /
```
```
The audio path is dual-encoder by construction. EfficientAT contributes the environmental / general audio branch; Whisper-Tiny contributes the speech-sensitive branch.
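
The fusion step in the diagram can be sketched as a plain concatenation. This is a minimal illustration under two assumptions the card does not confirm: each branch has already been pooled to one vector per clip, and the EfficientAT features come first in the concatenated vector. `fuse_audio_features` is a hypothetical helper, not a function from the training code.

```python
EFFICIENTAT_DIM = 1920  # mn20_as branch width, from the diagram
WHISPER_DIM = 384       # Whisper-Tiny encoder width, from the diagram
FUSED_DIM = EFFICIENTAT_DIM + WHISPER_DIM  # input width of the audio head


def fuse_audio_features(effat_vec, whisper_vec):
    """Concatenate one pooled vector per branch into the audio head input.

    Assumptions: both inputs are already pooled per clip, and the
    EfficientAT block precedes the Whisper block.
    """
    if len(effat_vec) != EFFICIENTAT_DIM:
        raise ValueError(f"expected {EFFICIENTAT_DIM} EfficientAT dims, got {len(effat_vec)}")
    if len(whisper_vec) != WHISPER_DIM:
        raise ValueError(f"expected {WHISPER_DIM} Whisper dims, got {len(whisper_vec)}")
    return list(effat_vec) + list(whisper_vec)


# Placeholder feature vectors, not real encoder output.
effat = [0.1] * EFFICIENTAT_DIM
whisper = [0.2] * WHISPER_DIM
fused = fuse_audio_features(effat, whisper)
print(len(fused), FUSED_DIM)
```
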
Core training config:
- projection hidden dim: `1920`
- projection output dim: `1280`
- projection depth: `2`
- loss: InfoNCE
- audio encoder dim after concat: `2304`
- Matryoshka dims: `[1280, 768, 512, 256, 128]`
Published config file: `te_mn20_whisper_d2_validaudio.yaml`
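
For readers unfamiliar with the loss named above, here is a small pure-Python sketch of a symmetric InfoNCE objective over a batch of paired embeddings. The symmetric form and the temperature of `0.07` are assumptions chosen for illustration; the card does not state either, and the published YAML is the authority on the actual settings.

```python
import math


def _log_softmax(row):
    """Numerically stable log-softmax of a list of floats."""
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]


def info_nce(a, b, temperature=0.07):
    """Symmetric InfoNCE for paired, L2-normalized embeddings.

    a[i] and b[i] form a positive pair; every other b[j] acts as a
    negative for a[i], and vice versa. The temperature is an assumed value.
    """
    n = len(a)
    logits = [
        [sum(x * y for x, y in zip(a[i], b[j])) / temperature for j in range(n)]
        for i in range(n)
    ]
    # Direction a -> b: each row should put its mass on the diagonal entry.
    a_to_b = -sum(_log_softmax(logits[i])[i] for i in range(n)) / n
    # Direction b -> a: same check on the columns.
    columns = [list(col) for col in zip(*logits)]
    b_to_a = -sum(_log_softmax(columns[j])[j] for j in range(n)) / n
    return 0.5 * (a_to_b + b_to_a)


# Toy batch: three orthogonal unit vectors paired with themselves.
anchors = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
aligned = info_nce(anchors, anchors)
shuffled = info_nce(anchors, anchors[1:] + anchors[:1])
print(aligned < shuffled)
```

Correctly paired batches should score a lower loss than mismatched ones, which is the property the toy batch exercises.
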
## Local Gate Baseline
The attached JSON `teacher_dual_mn20whisper_exact_gate_baseline_20260424T155324Z.json` is the canonical local gate baseline used for later teacher continuation experiments.
Seeded split-excluded baseline at `1280d` (A = audio, T = text, I = image; R@1 = recall at rank 1):
| Metric | Value |
|---|---:|
| Speech holdout A->T R@1 | 0.5652 |
| Speech holdout T->A R@1 | 0.5992 |
| Speech holdout avg R@1 | 0.5822 |
| WavCaps FSD A->T R@1 | 0.1078 |
| WavCaps FSD T->A R@1 | 0.1030 |
| WavCaps FSD avg R@1 | 0.1054 |
| SALT A->I R@1 | 0.1692 |
| SALT I->A R@1 | 0.1261 |
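
To make the metric concrete, the sketch below computes R@1 in both directions from a query-by-candidate similarity matrix and averages them. It assumes exactly one correct candidate per query, stored at the same index, with ties resolved to the lowest index; the gate's actual evaluation code is not part of this card and may differ. Averaging the two directions as a simple mean is consistent with the `avg R@1` rows in the table.

```python
def recall_at_1(sim):
    """Fraction of queries whose top-scoring candidate is the paired one.

    sim[i][j] is the similarity of query i to candidate j. Assumption:
    candidate i is the single correct match for query i.
    """
    hits = 0
    for i, row in enumerate(sim):
        best = max(range(len(row)), key=row.__getitem__)
        if best == i:
            hits += 1
    return hits / len(sim)


def transpose(sim):
    """Swap the query and candidate roles."""
    return [list(col) for col in zip(*sim)]


# Toy 3x3 similarity matrix (rows: audio clips, columns: captions).
sim = [
    [0.9, 0.1, 0.2],
    [0.3, 0.2, 0.8],
    [0.1, 0.4, 0.7],
]
a_to_t = recall_at_1(sim)
t_to_a = recall_at_1(transpose(sim))
avg = (a_to_t + t_to_a) / 2
print(a_to_t, t_to_a, avg)

# The published averages match a simple mean of the two directions.
speech_avg = round((0.5652 + 0.5992) / 2, 4)
wavcaps_avg = round((0.1078 + 0.1030) / 2, 4)
print(speech_avg, wavcaps_avg)
```
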
Important scope note:
- These are the exact local gate numbers used for bounded recovery experiments.
- They are not a claim of broad public benchmark superiority.
- The external 4-task audio smoke baseline was not packaged into this release.
## Files
| File | Purpose |
|---|---|
| `AIST-95M.safetensors` | Self-contained dual-audio teacher release artifact |
| `te_mn20_whisper_d2_validaudio.yaml` | Training config for the teacher line |
| `teacher_dual_mn20whisper_exact_gate_baseline_20260424T155324Z.json` | Canonical exact-gate baseline |
| `parameter_breakdown.json` | Exact parameter accounting used in this card |
## Loading
This release is a self-contained safetensors artifact containing:
- text encoder weights
- image encoder weights
- EfficientAT audio encoder weights
- Whisper-Tiny encoder weights
- text / image / audio projection heads
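
This card does not document the key layout inside the file, so a reasonable first step is to list the tensor names and group them by prefix. The sketch below uses `safe_open` from the `safetensors` package for that; `summarize_keys` and `list_checkpoint_keys` are hypothetical helpers, and the local path assumes the file has already been downloaded next to the script.

```python
from collections import Counter


def summarize_keys(keys):
    """Count tensors per top-level prefix (the text before the first dot)."""
    return Counter(key.split(".", 1)[0] for key in keys)


def list_checkpoint_keys(path="AIST-95M.safetensors"):
    """Return every tensor name in the file without loading the weights.

    Imported lazily so summarize_keys works even when safetensors
    is not installed.
    """
    from safetensors import safe_open

    with safe_open(path, framework="pt", device="cpu") as f:
        return list(f.keys())


# Usage, once the file is available locally:
#   for prefix, count in summarize_keys(list_checkpoint_keys()).items():
#       print(prefix, count)
```

The printed prefixes should make it clear how the five weight groups listed above are named, which is needed before mapping them onto encoder and head modules.
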
## Caveats
- This release uses the canonical Augmem name `AIST-95M`.
- Older `TE-86M Dual Audio` references are legacy aliases for the same artifact line.
- The older `augmem/TE-86M` release already on Hugging Face is a different artifact line.