| --- |
| language: |
| - en |
| license: apache-2.0 |
| tags: |
| - multimodal |
| - embedding |
| - trimodal |
| - dual-audio |
| - retrieval |
| - cross-modal |
| - image-text-audio |
| - feature-extraction |
| library_name: pytorch |
| pipeline_tag: feature-extraction |
| datasets: |
| - custom |
| --- |
| |
| # AIST-95M |
|
|
| `AIST-95M` is the dual-audio Trimodal Embeddings teacher checkpoint built on: |
|
|
| - text: `MongoDB/mdbr-leaf-ir` |
| - image: `mobilenetv4_conv_medium.e180_r384_in12k` |
| - audio: `mn20_as` (EfficientAT) + `openai/whisper-tiny` (encoder only) |
|
|
| Its canonical Augmem name follows the repo standard: |
|
|
| - `AIST` = `audio + image + speech + text`, alphabetized and reduced to first letters |
| - `95M` = exact loaded parameter count rounded to integer millions |
|
|
| It maps text, image, and audio into a shared 1280-dimensional embedding space with Matryoshka truncation support at `[1280, 768, 512, 256, 128]`. |
|
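As a minimal sketch of what Matryoshka truncation looks like on the consumer side, the helper below keeps the leading components of an embedding and rescales the result to unit length. The function name `truncate_embedding` is hypothetical, and the L2 renormalization step is an assumption (it is the usual practice for cosine retrieval, but this card does not state it).

```python
import math

# Truncation sizes listed in this card.
MATRYOSHKA_DIMS = (1280, 768, 512, 256, 128)


def truncate_embedding(embedding, dim):
    """Keep the first `dim` components and L2-renormalize.

    Renormalizing after truncation is an assumption, not something
    this card specifies.
    """
    if dim not in MATRYOSHKA_DIMS:
        raise ValueError(f"dim must be one of {MATRYOSHKA_DIMS}, got {dim}")
    if len(embedding) < dim:
        raise ValueError("embedding is shorter than the requested dim")
    head = list(embedding[:dim])
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0.0:
        return head  # leave an all-zero vector untouched
    return [x / norm for x in head]


def dot(a, b):
    """Dot product; equals cosine similarity for unit-length inputs."""
    return sum(x * y for x, y in zip(a, b))
```

Queries and candidates should be truncated to the same size before comparison, for example `dot(truncate_embedding(q, 256), truncate_embedding(c, 256))`.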
|
| This repo publishes the dual-audio teacher as a safetensors release artifact plus the exact local gate baseline used for later teacher-recovery experiments. |
|
|
| ## Parameter Count |
|
|
| Exact loaded parameter count in the deployed evaluation path: |
|
|
| | Component | Params | |
| |---|---:| |
| | Text encoder (`MongoDB/mdbr-leaf-ir`) | 22,861,056 | |
| | Image encoder (`mobilenetv4_conv_medium.e180_r384_in12k`) | 8,434,512 | |
| | Audio encoder (`mn20_as`, full loaded module) | 17,909,287 | |
| | Audio encoder (`openai/whisper-tiny`, encoder only) | 8,208,384 | |
| | Image projection head | 12,306,560 | |
| | Audio projection head | 14,272,640 | |
| | Text projection head | 11,323,520 | |
| | **Total exact loaded params** | **95,315,959** | |
|
|
| For continuity with older notes: |
|
|
| - historical shorthand: `TE-86M Dual Audio` |
| - `89,048,552` params if you exclude the EfficientAT classifier head (`6,267,407` params) from the `mn20_as` module |
| - `37,902,720` params are trainable checkpoint weights in the three projection heads; the remaining `57,413,239` sit in the frozen encoders |
|
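The figures above can be cross-checked with a few lines of arithmetic. The sketch below only re-adds the numbers from the table; the dictionary keys are hypothetical labels and are not the schema of `parameter_breakdown.json`.

```python
# Per-component counts copied from the table in this card.
# Key names are illustrative only.
COMPONENT_PARAMS = {
    "text_encoder": 22_861_056,
    "image_encoder": 8_434_512,
    "audio_encoder_mn20_as": 17_909_287,
    "audio_encoder_whisper_tiny": 8_208_384,
    "image_head": 12_306_560,
    "audio_head": 14_272_640,
    "text_head": 11_323_520,
}

total = sum(COMPONENT_PARAMS.values())
trainable = sum(v for k, v in COMPONENT_PARAMS.items() if k.endswith("_head"))
frozen = total - trainable

# Count quoted in this card with the EfficientAT classifier head excluded.
TOTAL_WITHOUT_CLASSIFIER = 89_048_552
classifier_head = total - TOTAL_WITHOUT_CLASSIFIER

# Naming rule from this card: parameter count rounded to integer millions.
size_suffix = f"{round(total / 1_000_000)}M"

print(f"total:           {total:,}")
print(f"trainable heads: {trainable:,}")
print(f"frozen encoders: {frozen:,}")
print(f"classifier head: {classifier_head:,}")
print(f"name suffix:     {size_suffix}")
```

The total comes out to 95,315,959, which rounds to the `95M` suffix in the model name.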
|
| ## Architecture |
|
|
| The dual-audio teacher uses a frozen-encoder / trained-head setup: |
|
|
| ```text |
| Text -> mdbr-leaf-ir (768-d) ----------------> DeepProjectionHead-d2 -> 1280 |
| Image -> MobileNetV4-Medium (1280-d) ---------> DeepProjectionHead-d2 -> 1280 |
| Audio -> EfficientAT mn20_as (1920-d) \ |
| +--> concat(2304-d) -> DeepProjectionHead-d2 -> 1280 |
| Whisper-Tiny encoder (384-d) / |
| ``` |
|
|
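| The audio path is dual-encoder by construction. EfficientAT contributes the environmental / general audio branch (1920-d); Whisper-Tiny contributes the speech-sensitive branch (384-d). The two outputs are concatenated into a single 2304-d vector before the audio projection head. |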
|
|
| Core training config: |
|
|
| - projection hidden dim: `1920` |
| - projection output dim: `1280` |
| - projection depth: `2` |
| - loss: InfoNCE |
| - audio encoder dim after concat: `2304` |
| - Matryoshka dims: `[1280, 768, 512, 256, 128]` |
|
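The implementation of `DeepProjectionHead-d2` is not included in this card. As a hedged illustration, the sketch below counts parameters for one hypothetical layout (an input layer, `depth` hidden layers, and an output layer, with a two-vector normalization after every hidden-width layer) that happens to reproduce the three published head sizes from the config values above. Other layouts could give the same totals, so treat this as a consistency check and not as the actual architecture.

```python
def head_param_count(in_dim, hidden_dim=1920, out_dim=1280, depth=2):
    """Parameter count for one ASSUMED projection-head layout.

    Hypothetical layout (not taken from the training code):
      Linear(in_dim, hidden) + norm(hidden)
      depth x [Linear(hidden, hidden) + norm(hidden)]
      Linear(hidden, out_dim)
    Every Linear has a bias; every norm has a weight and a bias vector.
    """
    norm = 2 * hidden_dim
    stem = in_dim * hidden_dim + hidden_dim + norm
    block = hidden_dim * hidden_dim + hidden_dim + norm
    out = hidden_dim * out_dim + out_dim
    return stem + depth * block + out


# Encoder output widths from the architecture diagram in this card.
ENCODER_DIMS = {"text": 768, "image": 1280, "audio": 1920 + 384}

for name, dim in ENCODER_DIMS.items():
    print(f"{name:>5}: {head_param_count(dim):,}")
```

Under these assumptions the counts are 11,323,520 (text), 12,306,560 (image), and 14,272,640 (audio), matching the Parameter Count table.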
|
| Published config file: `te_mn20_whisper_d2_validaudio.yaml` |
|
|
| ## Local Gate Baseline |
|
|
| The attached JSON `teacher_dual_mn20whisper_exact_gate_baseline_20260424T155324Z.json` is the canonical local gate baseline used for later teacher continuation experiments. |
|
|
| Seeded, split-excluded baseline at the full `1280`-d embedding size: |
|
|
| | Slice | Score | |
| |---|---:| |
| | Speech holdout A->T R@1 | 0.5652 | |
| | Speech holdout T->A R@1 | 0.5992 | |
| | Speech holdout avg R@1 | 0.5822 | |
| | WavCaps FSD A->T R@1 | 0.1078 | |
| | WavCaps FSD T->A R@1 | 0.1030 | |
| | WavCaps FSD avg R@1 | 0.1054 | |
| | SALT A->I R@1 | 0.1692 | |
| | SALT I->A R@1 | 0.1261 | |
|
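For readers unfamiliar with the metric, here is a minimal sketch of how R@1 can be computed in both directions from a similarity matrix. It assumes a one-to-one pairing where query `i` matches candidate `i`; the exact gate protocol (seeding, split exclusion, candidate pool) is defined by the baseline JSON and is not reproduced here. The function names are hypothetical.

```python
def recall_at_1(sim):
    """Fraction of queries whose top-scoring candidate is the paired one.

    sim[i][j] is the similarity of query i to candidate j, and the
    correct candidate for query i is assumed to be index i.
    """
    hits = 0
    for i, row in enumerate(sim):
        best = max(range(len(row)), key=row.__getitem__)
        hits += int(best == i)
    return hits / len(sim)


def transpose(sim):
    """Swap query and candidate roles (e.g. A->T becomes T->A)."""
    return [list(col) for col in zip(*sim)]


# Toy audio-by-text similarity matrix, made up for illustration.
sim = [
    [0.9, 0.1, 0.2],
    [0.3, 0.8, 0.1],
    [0.7, 0.2, 0.6],
]

a2t = recall_at_1(sim)
t2a = recall_at_1(transpose(sim))
avg = (a2t + t2a) / 2
print(f"A->T R@1 = {a2t:.4f}, T->A R@1 = {t2a:.4f}, avg = {avg:.4f}")
```

The "avg" rows in the table are consistent with a plain mean of the two directions: (0.5652 + 0.5992) / 2 = 0.5822 and (0.1078 + 0.1030) / 2 = 0.1054.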
|
| Important scope note: |
|
|
| - These are the exact local gate numbers used for bounded recovery experiments. |
| - They are not a claim of broad public benchmark superiority. |
| - The external 4-task audio smoke baseline was not packaged into this release. |
|
|
| ## Files |
|
|
| | File | Purpose | |
| |---|---| |
| | `AIST-95M.safetensors` | Self-contained dual-audio teacher release artifact | |
| | `te_mn20_whisper_d2_validaudio.yaml` | Training config for the teacher line | |
| | `teacher_dual_mn20whisper_exact_gate_baseline_20260424T155324Z.json` | Canonical exact-gate baseline | |
| | `parameter_breakdown.json` | Exact parameter accounting used in this card | |
|
|
| ## Loading |
|
|
| This release is a self-contained safetensors artifact containing: |
|
|
| - text encoder weights |
| - image encoder weights |
| - EfficientAT audio encoder weights |
| - Whisper-Tiny encoder weights |
| - text / image / audio projection heads |
|
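This card does not document a loader, so the following is only a sketch of how the file could be inspected with the `safetensors` library. The helper names are hypothetical, and the assumption that the first dotted segment of each tensor name identifies a component has not been checked against the actual key layout of `AIST-95M.safetensors`.

```python
import math
from collections import Counter


def load_release(path="AIST-95M.safetensors"):
    """Load every tensor in the release file onto the CPU.

    Requires the `safetensors` and `torch` packages.
    """
    from safetensors.torch import load_file

    return load_file(path, device="cpu")


def summarize_by_prefix(shapes):
    """Sum element counts per top-level key prefix.

    `shapes` maps tensor name -> shape tuple. Grouping on the first
    dotted segment is an assumption about the key naming.
    """
    totals = Counter()
    for name, shape in shapes.items():
        totals[name.split(".", 1)[0]] += math.prod(shape)
    return dict(totals)
```

A typical use would be `state = load_release()` followed by `summarize_by_prefix({k: tuple(v.shape) for k, v in state.items()})`. The resulting totals can be compared with `parameter_breakdown.json`, keeping in mind that non-parameter buffers stored in the file would also be counted.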
|
| ## Caveats |
|
|
| - This release uses the canonical Augmem name `AIST-95M`. |
| - Older `TE-86M Dual Audio` references are legacy aliases for the same artifact line. |
| - The older `augmem/TE-86M` release already on Hugging Face is a different artifact line. |
|
|