TE-86M Dual Audio (Legacy Alias for AIST-95M)

TE-86M Dual Audio is the dual-audio Trimodal Embeddings teacher checkpoint built on:

text: MongoDB/mdbr-leaf-ir
image: mobilenetv4_conv_medium.e180_r384_in12k
audio: mn20_as + whisper-tiny encoder

It maps text, image, and audio into a shared 1280-dimensional embedding space with Matryoshka truncation support at [1280, 768, 512, 256, 128].
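
Matryoshka truncation usually means keeping the leading components of an embedding and re-normalizing the result. The sketch below illustrates that convention in plain Python. It assumes that embeddings are compared by cosine similarity and that re-normalizing after truncation is appropriate for this checkpoint; neither is stated in this card. `truncate_embedding` is an illustrative helper, not part of triembed.

```python
import math

MATRYOSHKA_DIMS = (1280, 768, 512, 256, 128)  # truncation sizes listed in this card

def truncate_embedding(vec, dim):
    """Keep the first `dim` components and L2-normalize them (assumed convention)."""
    if dim not in MATRYOSHKA_DIMS:
        raise ValueError(f"unsupported dim {dim}; expected one of {MATRYOSHKA_DIMS}")
    if len(vec) < dim:
        raise ValueError("embedding is shorter than the requested dim")
    head = list(vec[:dim])
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0.0:
        return head  # leave an all-zero prefix untouched instead of dividing by zero
    return [x / norm for x in head]

full = [1.0] * 1280  # placeholder vector, not a real model output
small = truncate_embedding(full, 256)
print(len(small))
```

Smaller dims trade retrieval quality for storage and search speed; the gate numbers in this card are reported at 1280d only.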

This repo publishes the base dual-audio teacher checkpoint and the exact local gate baseline used for later teacher-recovery experiments.

Canonical name for this artifact line:

AIST-95M

This repo is retained as a legacy TE alias for continuity with older notes and links.

Parameter Count

Project shorthand historically called this the 86M teacher line, but the dual-audio stack as actually loaded is larger.

Exact loaded parameter count in the deployed evaluation path:

Component	Params
Text encoder (`MongoDB/mdbr-leaf-ir`)	22,861,056
Image encoder (`mobilenetv4_conv_medium.e180_r384_in12k`)	8,434,512
Audio encoder (`mn20_as`, full loaded module)	17,909,287
Audio encoder (`openai/whisper-tiny`, encoder only)	8,208,384
Image projection head	12,306,560
Audio projection head	14,272,640
Text projection head	11,323,520
Total exact loaded params	95,315,959

For continuity with earlier notes:

89,048,552 params if you exclude the EfficientAT classifier head from the mn20_as module.
37,902,720 params are trainable checkpoint weights in the three projection heads.

The checkpoint file stores a combined projection_state_dict plus per-head state dicts; those are duplicate representations of the same projection weights and are not double-counted above.
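
The accounting above can be re-checked with a few lines of arithmetic. The figures are copied from the table; the dictionary keys are illustrative labels, not names taken from `parameter_breakdown.json`.

```python
# Per-component counts copied from the table in this card.
COMPONENT_PARAMS = {
    "text_encoder_mdbr_leaf_ir": 22_861_056,
    "image_encoder_mobilenetv4": 8_434_512,
    "audio_encoder_mn20_as_full": 17_909_287,
    "audio_encoder_whisper_tiny": 8_208_384,
    "image_projection_head": 12_306_560,
    "audio_projection_head": 14_272_640,
    "text_projection_head": 11_323_520,
}

total = sum(COMPONENT_PARAMS.values())
trainable = sum(
    count for name, count in COMPONENT_PARAMS.items()
    if name.endswith("projection_head")
)
# Size of the EfficientAT classifier head implied by the 89,048,552 figure.
classifier_head = total - 89_048_552

print(total, trainable, classifier_head)  # 95315959 37902720 6267407
```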

Architecture

The dual-audio teacher uses a frozen-encoder / trained-head setup:

Text   -> mdbr-leaf-ir (768-d) ----------------> DeepProjectionHead-d2 -> 1280
Image  -> MobileNetV4-Medium (1280-d) ---------> DeepProjectionHead-d2 -> 1280
Audio  -> EfficientAT mn20_as (1920-d) \
                                           +--> concat(2304-d) -> DeepProjectionHead-d2 -> 1280
         Whisper-Tiny encoder (384-d)    /

The audio path is dual-encoder by construction. EfficientAT contributes the environmental / general audio branch; Whisper-Tiny contributes the speech-sensitive branch.
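
The fusion step in the diagram amounts to a per-clip feature concatenation. The sketch below only does the dimension bookkeeping. It assumes that each branch has already been pooled to a single vector per clip and that the EfficientAT features come first in the concatenation; the card does not confirm either detail. `fuse_audio_features` is a hypothetical helper.

```python
EFFAT_DIM = 1920    # EfficientAT mn20_as feature width, from the diagram
WHISPER_DIM = 384   # Whisper-Tiny encoder feature width, from the diagram

def fuse_audio_features(effat_vec, whisper_vec):
    """Concatenate the two per-clip feature vectors into one projection-head input."""
    if len(effat_vec) != EFFAT_DIM or len(whisper_vec) != WHISPER_DIM:
        raise ValueError("unexpected feature width")
    return list(effat_vec) + list(whisper_vec)  # order is an assumption

fused = fuse_audio_features([0.0] * EFFAT_DIM, [0.0] * WHISPER_DIM)
print(len(fused))  # 2304
```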

Core training config:

projection hidden dim: 1920
projection output dim: 1280
projection depth: 2
loss: InfoNCE
audio encoder dim after concat: 2304
Matryoshka dims: [1280, 768, 512, 256, 128]
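
For readers unfamiliar with the loss, here is a minimal plain-Python sketch of InfoNCE over a batch similarity matrix. The card names only the loss; the temperature value, the symmetric two-direction form, and the use of in-batch negatives are assumptions made for illustration, not the published training recipe.

```python
import math

def info_nce(sim, temperature=0.07):
    """One-directional InfoNCE: for row i, column i is the positive pair."""
    losses = []
    for i, row in enumerate(sim):
        logits = [s / temperature for s in row]
        peak = max(logits)  # subtract the max for a numerically stable log-sum-exp
        log_z = peak + math.log(sum(math.exp(l - peak) for l in logits))
        losses.append(log_z - logits[i])
    return sum(losses) / len(losses)

def symmetric_info_nce(sim, temperature=0.07):
    """Average the row-wise and column-wise losses (assumed symmetric variant)."""
    transposed = [list(col) for col in zip(*sim)]
    return 0.5 * (info_nce(sim, temperature) + info_nce(transposed, temperature))

aligned = [[1.0, 0.0], [0.0, 1.0]]     # matched pairs score highest
misaligned = [[0.0, 1.0], [1.0, 0.0]]  # matched pairs score lowest
print(symmetric_info_nce(aligned) < symmetric_info_nce(misaligned))  # True
```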

Published config file: te_mn20_whisper_d2_validaudio.yaml
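
The three head sizes in the parameter table are consistent with one plausible layout: an input linear layer with a normalization layer, two hidden linear layers each followed by a normalization layer, and a final linear layer to 1280. This is an inference from the published counts, not the confirmed definition of DeepProjectionHead; any other layout with the same number of weights would fit equally well.

```python
def linear_params(n_in, n_out):
    return n_in * n_out + n_out  # weight matrix plus bias

def norm_params(width):
    return 2 * width  # per-feature scale and shift

def head_params(n_in, hidden=1920, n_out=1280, depth=2):
    """Parameter count for the assumed layout; hidden, n_out, depth come from the config above."""
    total = linear_params(n_in, hidden) + norm_params(hidden)
    total += depth * (linear_params(hidden, hidden) + norm_params(hidden))
    total += linear_params(hidden, n_out)
    return total

for name, n_in in [("text", 768), ("image", 1280), ("audio", 2304)]:
    print(name, head_params(n_in))
# text 11323520
# image 12306560
# audio 14272640
```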

Local Gate Baseline

The attached JSON file teacher_dual_mn20whisper_exact_gate_baseline_20260424T155324Z.json is the canonical local gate baseline used for later teacher continuation experiments.

Seeded split-excluded baseline at 1280d:

Slice and direction	R@1
Speech holdout A->T R@1	0.5652
Speech holdout T->A R@1	0.5992
Speech holdout avg R@1	0.5822
WavCaps FSD A->T R@1	0.1078
WavCaps FSD T->A R@1	0.1030
WavCaps FSD avg R@1	0.1054
SALT A->I R@1	0.1692
SALT I->A R@1	0.1261
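
As a reading aid, R@1 is the fraction of queries whose top-ranked candidate is the correct one, and the "avg" rows are the mean of the two directions. The sketch below assumes one paired candidate per query at the same index and breaks ties in favor of the first maximum; the exact protocol used by the gate script is not described in this card.

```python
def recall_at_1(sim):
    """Fraction of queries (rows) whose best-scoring candidate (column) is the paired one."""
    hits = 0
    for i, row in enumerate(sim):
        best = max(range(len(row)), key=row.__getitem__)
        hits += int(best == i)
    return hits / len(sim)

def bidirectional_avg(forward, backward):
    """Mean of the two retrieval directions, as in the 'avg' rows."""
    return (forward + backward) / 2

toy = [[0.9, 0.1, 0.0],
       [0.2, 0.3, 0.8],   # this query retrieves the wrong item
       [0.1, 0.2, 0.7]]
print(recall_at_1(toy))
print(round(bidirectional_avg(0.5652, 0.5992), 4))  # 0.5822
print(round(bidirectional_avg(0.1078, 0.1030), 4))  # 0.1054
```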

Important scope note:

These are the exact local gate numbers used for bounded recovery experiments.
They are not a claim of broad public benchmark superiority.
The external 4-task audio smoke baseline was not packaged into this release.

Files

File	Purpose
`TE-86M-dual-audio-best_model.pt`	Base dual-audio teacher checkpoint
`te_mn20_whisper_d2_validaudio.yaml`	Training config for the teacher line
`teacher_dual_mn20whisper_exact_gate_baseline_20260424T155324Z.json`	Canonical exact-gate baseline
`parameter_breakdown.json`	Exact parameter accounting used in this card

Loading

This release is a native PyTorch checkpoint in the format used by triembed.

Example local evaluation:

uv run python scripts/eval_teacher_exact_gate.py \
  --teacher TE-86M-dual-audio-best_model.pt \
  --cache-dir cache \
  --audio-suffix mn20_whisper_audio_features \
  --output-json teacher_gate_eval.json
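
To inspect the checkpoint outside the evaluation script, something like the following may work. It assumes the file is a pickled dictionary with a top-level `projection_state_dict` key, as the parameter section suggests; the helper names are hypothetical. `weights_only=False` lets `torch.load` unpickle arbitrary objects, so use it only on files you trust.

```python
def count_params(state_dict):
    """Sum element counts over a mapping of name -> tensor-like objects."""
    return sum(t.numel() for t in state_dict.values())

def load_projection_state(path):
    """Load the checkpoint on CPU and return its combined projection state dict."""
    import torch  # imported lazily so count_params stays dependency-free
    ckpt = torch.load(path, map_location="cpu", weights_only=False)
    return ckpt["projection_state_dict"]  # key name taken from this card

# Example (requires PyTorch and the checkpoint file):
# state = load_projection_state("TE-86M-dual-audio-best_model.pt")
# print(count_params(state))
```

If the state dict also holds non-trainable buffers, the count will not match the 37,902,720 trainable figure exactly.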

Caveats

Canonical repo for this artifact line: augmem/AIST-95M.
This repo is retained as a historical TE alias for continuity with older links.
The exact loaded dual-audio stack is 95,315,959 params, not 86M.
The older augmem/TE-86M release on Hugging Face belongs to a different artifact line; this repo contains only the dual-audio teacher checkpoint.
