TE-86M Dual Audio (Legacy Alias for AIST-95M)
TE-86M Dual Audio is the dual-audio Trimodal Embeddings teacher checkpoint built on:
- text: MongoDB/mdbr-leaf-ir
- image: mobilenetv4_conv_medium.e180_r384_in12k
- audio: mn20_as + whisper-tiny encoder
It maps text, image, and audio into a shared 1280-dimensional embedding space with Matryoshka truncation support at [1280, 768, 512, 256, 128].
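Matryoshka truncation is usually applied by keeping the leading coordinates of an embedding and re-normalizing. The sketch below assumes that convention; the helper names `truncate_embedding` and `cosine` are illustrative and are not part of triembed.

```python
import math

# Supported truncation sizes, copied from this card.
MATRYOSHKA_DIMS = (1280, 768, 512, 256, 128)


def truncate_embedding(vec, dim):
    """Keep the first `dim` coordinates and L2-normalize the result.

    Assumption: the leading-prefix convention common to Matryoshka-style
    embeddings. The project's own helper may differ.
    """
    if dim not in MATRYOSHKA_DIMS:
        raise ValueError(f"dim must be one of {MATRYOSHKA_DIMS}, got {dim}")
    if len(vec) < dim:
        raise ValueError("embedding is shorter than the requested dim")
    head = list(vec[:dim])
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0.0:
        return head  # an all-zero prefix cannot be normalized; return it unchanged
    return [x / norm for x in head]


def cosine(a, b):
    """Dot product, which equals cosine similarity for unit-length inputs."""
    return sum(x * y for x, y in zip(a, b))
```

Both sides of a comparison should be truncated to the same size before scoring, for example `cosine(truncate_embedding(text_vec, 256), truncate_embedding(audio_vec, 256))`.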
This repo publishes the base dual-audio teacher checkpoint and the exact local gate baseline used for later teacher-recovery experiments.
Canonical name for this artifact line: augmem/AIST-95M
This repo is retained as a legacy TE alias for continuity with older notes and links.
Parameter Count
Historical project shorthand called this the 86M teacher line. The exact loaded dual-audio stack is larger.
Exact loaded parameter count in the deployed evaluation path:
| Component | Params |
|---|---|
| Text encoder (MongoDB/mdbr-leaf-ir) | 22,861,056 |
| Image encoder (mobilenetv4_conv_medium.e180_r384_in12k) | 8,434,512 |
| Audio encoder (mn20_as, full loaded module) | 17,909,287 |
| Audio encoder (openai/whisper-tiny, encoder only) | 8,208,384 |
| Image projection head | 12,306,560 |
| Audio projection head | 14,272,640 |
| Text projection head | 11,323,520 |
| Total exact loaded params | 95,315,959 |
For continuity with earlier notes:
- 89,048,552 params if you exclude the EfficientAT classifier head from the mn20_as module.
- 37,902,720 params are trainable checkpoint weights in the three projection heads.
The checkpoint file stores a combined projection_state_dict plus per-head state dicts; those are duplicate representations of the same projection weights and are not double-counted above.
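The accounting above can be re-derived with plain arithmetic. The snippet below uses only the figures printed in this card; the classifier-head size is obtained by subtraction and is not stated here directly.

```python
# Per-component counts copied from the table in this card.
PARAMS = {
    "text_encoder": 22_861_056,
    "image_encoder": 8_434_512,
    "audio_encoder_mn20_as": 17_909_287,
    "audio_encoder_whisper_tiny": 8_208_384,
    "image_projection_head": 12_306_560,
    "audio_projection_head": 14_272_640,
    "text_projection_head": 11_323_520,
}

# Total over every loaded component.
total = sum(PARAMS.values())

# Trainable weights: the three projection heads only.
heads = sum(v for k, v in PARAMS.items() if k.endswith("_projection_head"))

# Derived, not quoted from the card: total minus the "excluding the
# EfficientAT classifier head" figure gives the implied classifier-head size.
classifier_head = total - 89_048_552

print(f"total loaded:       {total:,}")
print(f"projection heads:   {heads:,}")
print(f"implied classifier: {classifier_head:,}")
```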
Architecture
The dual-audio teacher uses a frozen-encoder / trained-head setup:
Text -> mdbr-leaf-ir (768-d) ----------------> DeepProjectionHead-d2 -> 1280
Image -> MobileNetV4-Medium (1280-d) ---------> DeepProjectionHead-d2 -> 1280
Audio -> EfficientAT mn20_as (1920-d) \
                                       +--> concat(2304-d) -> DeepProjectionHead-d2 -> 1280
         Whisper-Tiny encoder (384-d) /
The audio path is dual-encoder by construction. EfficientAT contributes the environmental / general audio branch; Whisper-Tiny contributes the speech-sensitive branch.
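As a rough illustration of the audio fusion step, the sketch below concatenates one feature vector per branch and checks the widths from the diagram. The function name, the EfficientAT-first ordering, and the assumption that each branch has already been pooled to a single vector per clip are illustrative choices; the card does not specify them.

```python
EFFICIENTAT_DIM = 1920  # mn20_as feature width, from the diagram
WHISPER_DIM = 384       # Whisper-Tiny encoder width, from the diagram
FUSED_DIM = EFFICIENTAT_DIM + WHISPER_DIM  # 2304, the projection head input


def fuse_audio_features(effat_vec, whisper_vec):
    """Concatenate the two branch features into one 2304-d vector.

    Assumptions: both inputs are already one vector per clip, and the
    EfficientAT features come first (the order drawn in the diagram).
    """
    if len(effat_vec) != EFFICIENTAT_DIM:
        raise ValueError(f"expected {EFFICIENTAT_DIM}-d EfficientAT features")
    if len(whisper_vec) != WHISPER_DIM:
        raise ValueError(f"expected {WHISPER_DIM}-d Whisper features")
    return list(effat_vec) + list(whisper_vec)
```

Whatever ordering is used when features are cached has to match the ordering the projection head was trained on, since the head's first layer is tied to input positions.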
Core training config:
- projection hidden dim: 1920
- projection output dim: 1280
- projection depth: 2
- loss: InfoNCE
- audio encoder dim after concat: 2304
- Matryoshka dims: [1280, 768, 512, 256, 128]
Published config file: te_mn20_whisper_d2_validaudio.yaml
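For readers unfamiliar with the loss named in the config, here is a minimal pure-Python sketch of InfoNCE over a batch of paired, unit-norm embeddings. The temperature of 0.07, the symmetric averaging, and the function names are assumptions for illustration; the card only states that the loss is InfoNCE, and the published YAML is the authority on actual settings.

```python
import math


def info_nce(anchors, positives, temperature=0.07):
    """One-directional InfoNCE over a batch of unit-norm embedding pairs.

    anchors[i] and positives[i] are the matching pair; every other
    positives[j] in the batch serves as a negative for anchors[i].
    """
    n = len(anchors)
    total = 0.0
    for i in range(n):
        logits = [
            sum(a * p for a, p in zip(anchors[i], positives[j])) / temperature
            for j in range(n)
        ]
        # Numerically stable log-sum-exp for the softmax denominator.
        m = max(logits)
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        # Cross-entropy with the matching index as the target class.
        total += log_denom - logits[i]
    return total / n


def symmetric_info_nce(a, b, temperature=0.07):
    """Average of both retrieval directions (an assumed, common choice)."""
    return 0.5 * (info_nce(a, b, temperature) + info_nce(b, a, temperature))
```

With Matryoshka training, a loss like this is typically evaluated on each truncated prefix and combined; how the prefixes are weighted here is not stated in this card.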
Local Gate Baseline
The attached JSON teacher_dual_mn20whisper_exact_gate_baseline_20260424T155324Z.json is the canonical local gate baseline used for later teacher continuation experiments.
Seeded split-excluded baseline at 1280d:
| Slice and metric | Value |
|---|---|
| Speech holdout A->T R@1 | 0.5652 |
| Speech holdout T->A R@1 | 0.5992 |
| Speech holdout avg R@1 | 0.5822 |
| WavCaps FSD A->T R@1 | 0.1078 |
| WavCaps FSD T->A R@1 | 0.1030 |
| WavCaps FSD avg R@1 | 0.1054 |
| SALT A->I R@1 | 0.1692 |
| SALT I->A R@1 | 0.1261 |
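The R@1 figures above are retrieval accuracies. The sketch below shows one common way to compute them, assuming a one-to-one pairing between queries and gallery items and dot-product scoring on unit-norm embeddings. The function names are illustrative, and the actual gate script may handle ties or multiple references differently.

```python
def recall_at_1(queries, gallery):
    """Fraction of queries whose top-scoring gallery item has the same index.

    Assumption: queries[i] is paired with gallery[i]. Ties go to the
    lowest gallery index.
    """
    hits = 0
    for i, q in enumerate(queries):
        scores = [sum(a * b for a, b in zip(q, g)) for g in gallery]
        best = max(range(len(scores)), key=scores.__getitem__)
        hits += int(best == i)
    return hits / len(queries)


def average_r1(a_to_b, b_to_a):
    """Mean of the two retrieval directions."""
    return (a_to_b + b_to_a) / 2
```

The "avg" rows in the table are consistent with this mean: (0.5652 + 0.5992) / 2 = 0.5822 and (0.1078 + 0.1030) / 2 = 0.1054.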
Important scope note:
- These are the exact local gate numbers used for bounded recovery experiments.
- They are not a claim of broad public benchmark superiority.
- The external 4-task audio smoke baseline was not packaged into this release.
Files
| File | Purpose |
|---|---|
| TE-86M-dual-audio-best_model.pt | Base dual-audio teacher checkpoint |
| te_mn20_whisper_d2_validaudio.yaml | Training config for the teacher line |
| teacher_dual_mn20whisper_exact_gate_baseline_20260424T155324Z.json | Canonical exact-gate baseline |
| parameter_breakdown.json | Exact parameter accounting used in this card |
Loading
This release is a native PyTorch checkpoint in the format used by triembed.
Example local evaluation:
uv run python scripts/eval_teacher_exact_gate.py \
--teacher TE-86M-dual-audio-best_model.pt \
--cache-dir cache \
--audio-suffix mn20_whisper_audio_features \
--output-json teacher_gate_eval.json
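To inspect the checkpoint outside the evaluation script, something along these lines may help. It is a sketch under stated assumptions: the `projection_state_dict` key is taken from this card, but the rest of the checkpoint layout is not documented here, and the helper names are hypothetical. Because `weights_only=False` unpickles arbitrary objects, use it only on files you trust.

```python
from math import prod


def count_params(shapes):
    """Sum element counts over a mapping of parameter name -> shape tuple."""
    return sum(prod(shape) for shape in shapes.values())


def checkpoint_shapes(path, key="projection_state_dict"):
    """Load a checkpoint on CPU and return {name: shape} for one state dict.

    Assumptions: the checkpoint is a dict, `key` exists in it, and every
    value under `key` is a tensor. Only `key` itself comes from this card.
    """
    import torch  # imported lazily so count_params works without PyTorch

    ckpt = torch.load(path, map_location="cpu", weights_only=False)
    return {name: tuple(t.shape) for name, t in ckpt[key].items()}
```

If the accounting in this card holds, `count_params(checkpoint_shapes("TE-86M-dual-audio-best_model.pt"))` should come to 37,902,720. Counting the combined dict and the per-head dicts together would double-count the same weights.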
Caveats
- Canonical repo for this artifact line: augmem/AIST-95M.
- This repo is retained as a historical TE alias for continuity with older links.
- The exact loaded dual-audio stack is 95,315,959 params, not 86M.
- The existing older augmem/TE-86M release on Hugging Face is a different artifact line; this repo is the dual-audio teacher checkpoint specifically.