---
language:
- en
license: apache-2.0
tags:
- multimodal
- embedding
- trimodal
- dual-audio
- retrieval
- cross-modal
- image-text-audio
- feature-extraction
library_name: pytorch
pipeline_tag: feature-extraction
datasets:
- custom
---

# AIST-95M

`AIST-95M` is the dual-audio Trimodal Embeddings teacher checkpoint built on:

- text: `MongoDB/mdbr-leaf-ir`
- image: `mobilenetv4_conv_medium.e180_r384_in12k`
- audio: `mn20_as + whisper-tiny encoder`

Its canonical Augmem name follows the repo standard:

- `AIST` = `audio + image + speech + text`, alphabetized and reduced to first letters
- `95M` = exact loaded parameter count rounded to integer millions

It maps text, image, and audio into a shared 1280-dimensional embedding space with Matryoshka truncation support at `[1280, 768, 512, 256, 128]`.

This repo publishes the dual-audio teacher as a safetensors release artifact plus the exact local gate baseline used for later teacher-recovery experiments.

## Parameter Count

Exact loaded parameter count in the deployed evaluation path:

| Component | Params |
|---|---:|
| Text encoder (`MongoDB/mdbr-leaf-ir`) | 22,861,056 |
| Image encoder (`mobilenetv4_conv_medium.e180_r384_in12k`) | 8,434,512 |
| Audio encoder (`mn20_as`, full loaded module) | 17,909,287 |
| Audio encoder (`openai/whisper-tiny`, encoder only) | 8,208,384 |
| Image projection head | 12,306,560 |
| Audio projection head | 14,272,640 |
| Text projection head | 11,323,520 |
| **Total exact loaded params** | **95,315,959** |

For continuity with older notes:

- historical shorthand: `TE-86M Dual Audio`
- `89,048,552` params if you exclude the EfficientAT classifier head from the `mn20_as` module
- `37,902,720` params are trainable checkpoint weights in the three projection heads

## Architecture

The dual-audio teacher uses a frozen-encoder / trained-head setup:

```text
Text  -> mdbr-leaf-ir (768-d) ----------------> DeepProjectionHead-d2 -> 1280
Image -> MobileNetV4-Medium (1280-d) ---------> DeepProjectionHead-d2 -> 1280
Audio -> EfficientAT mn20_as (1920-d) \
                                       +--> concat(2304-d) -> DeepProjectionHead-d2 -> 1280
         Whisper-Tiny encoder (384-d) /
```

The audio path is dual-encoder by construction. EfficientAT contributes the environmental / general audio branch; Whisper-Tiny contributes the speech-sensitive branch.

Core training config:

- projection hidden dim: `1920`
- projection output dim: `1280`
- projection depth: `2`
- loss: InfoNCE
- audio encoder dim after concat: `2304`
- Matryoshka dims: `[1280, 768, 512, 256, 128]`

Published config file: `te_mn20_whisper_d2_validaudio.yaml`

## Local Gate Baseline

The attached JSON `teacher_dual_mn20whisper_exact_gate_baseline_20260424T155324Z.json` is the canonical local gate baseline used for later teacher continuation experiments.

Seeded split-excluded baseline at `1280d`:

| Slice | Metric |
|---|---:|
| Speech holdout A->T R@1 | 0.5652 |
| Speech holdout T->A R@1 | 0.5992 |
| Speech holdout avg R@1 | 0.5822 |
| WavCaps FSD A->T R@1 | 0.1078 |
| WavCaps FSD T->A R@1 | 0.1030 |
| WavCaps FSD avg R@1 | 0.1054 |
| SALT A->I R@1 | 0.1692 |
| SALT I->A R@1 | 0.1261 |

Important scope note:

- These are the exact local gate numbers used for bounded recovery experiments.
- They are not a claim of broad public benchmark superiority.
- The external 4-task audio smoke baseline was not packaged into this release.

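
The R@1 figures above are reported at `1280d`, and the card lists shorter Matryoshka dims as well. As a rough illustration of what such a number measures, the sketch below truncates paired embeddings to a chosen dim, re-normalizes them, and counts how often the top cosine match is the paired item. It is dependency-free for readability, and the re-normalization step, the function names, and the toy 4-d vectors are all assumptions for illustration, not the evaluation code behind the gate baseline.

```python
import math


def truncate_and_normalize(vec, dim):
    """Keep the first `dim` coordinates and rescale to unit L2 norm."""
    head = list(vec[:dim])
    norm = math.sqrt(sum(x * x for x in head))
    return [x / norm for x in head] if norm > 0 else head


def recall_at_1(queries, candidates, dim):
    """Fraction of queries whose best cosine match is the same-index candidate."""
    if not queries or len(queries) != len(candidates):
        raise ValueError("queries and candidates must be non-empty and paired")
    q = [truncate_and_normalize(v, dim) for v in queries]
    c = [truncate_and_normalize(v, dim) for v in candidates]
    hits = 0
    for i, qv in enumerate(q):
        # Dot product of unit vectors equals cosine similarity.
        scores = [sum(a * b for a, b in zip(qv, cv)) for cv in c]
        best = max(range(len(scores)), key=scores.__getitem__)
        hits += int(best == i)
    return hits / len(q)


# Toy 4-d vectors standing in for 1280-d audio/text embeddings (made-up values).
audio = [[1.0, 0.0, 0.2, 0.0], [0.0, 1.0, 0.0, 0.3], [0.7, 0.7, 0.0, 0.9]]
text = [[0.9, 0.1, 0.0, 0.1], [0.1, 0.9, 0.1, 0.0], [0.6, 0.6, 0.1, 1.0]]

full = recall_at_1(audio, text, dim=4)   # A->T at the full toy width
short = recall_at_1(audio, text, dim=2)  # A->T after Matryoshka-style truncation
```

Swapping the two arguments gives the reverse direction (T->A), and the "avg" rows in the table are the mean of the two directions.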
## Files

| File | Purpose |
|---|---|
| `AIST-95M.safetensors` | Self-contained dual-audio teacher release artifact |
| `te_mn20_whisper_d2_validaudio.yaml` | Training config for the teacher line |
| `teacher_dual_mn20whisper_exact_gate_baseline_20260424T155324Z.json` | Canonical exact-gate baseline |
| `parameter_breakdown.json` | Exact parameter accounting used in this card |

## Loading

This release is a self-contained safetensors artifact containing:

- text encoder weights
- image encoder weights
- EfficientAT audio encoder weights
- Whisper-Tiny encoder weights
- text / image / audio projection heads

## Caveats

- This release uses the canonical Augmem name `AIST-95M`.
- Older `TE-86M Dual Audio` references are legacy aliases for the same artifact line.
- The existing older `augmem/TE-86M` release on Hugging Face is a different artifact line.
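
The Loading section lists which components the artifact contains, but this card does not document the tensor key layout or the model class needed to rebuild the network. One way to explore the file before writing any loading code is to list its tensor names and group them by top-level prefix. The sketch below assumes the `safetensors` package is installed and that keys are dot-separated; `summarize_keys` and `inspect_checkpoint` are hypothetical helper names, and the example keys are made up.

```python
from collections import Counter


def summarize_keys(keys):
    """Count tensors per top-level key prefix (the text before the first dot)."""
    return Counter(key.split(".", 1)[0] for key in keys)


def inspect_checkpoint(path):
    """Summarize tensor names in a safetensors file without reading the weights."""
    # Imported lazily so summarize_keys stays usable without safetensors installed.
    from safetensors import safe_open

    with safe_open(path, framework="pt") as f:
        return summarize_keys(f.keys())


# Hypothetical key names, only to show the shape of the summary;
# the real prefixes in AIST-95M.safetensors are not documented here.
example = summarize_keys(["head_a.0.weight", "head_a.0.bias", "head_b.0.weight"])
```

Calling `inspect_checkpoint("AIST-95M.safetensors")` on a local copy would then show how many tensors sit under each prefix, which can be compared against the five components listed under Loading.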