TE-86M Dual Audio (Legacy Alias for AIST-95M)

TE-86M Dual Audio is the dual-audio Trimodal Embeddings teacher checkpoint built on:

  • text: MongoDB/mdbr-leaf-ir
  • image: mobilenetv4_conv_medium.e180_r384_in12k
  • audio: mn20_as + whisper-tiny encoder

It maps text, image, and audio into a shared 1280-dimensional embedding space with Matryoshka truncation support at [1280, 768, 512, 256, 128].
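As a rough illustration of what Matryoshka truncation means for a consumer of these embeddings, the sketch below keeps the leading components of a vector and rescales the result to unit length. The helper name `truncate_embedding` is hypothetical, and the L2 renormalization step is an assumption; this card does not state how triembed handles truncated vectors.

```python
import math

# Truncation widths listed in this card.
MATRYOSHKA_DIMS = (1280, 768, 512, 256, 128)


def truncate_embedding(vec, dim):
    """Keep the first `dim` components, then L2-renormalize.

    Renormalizing after truncation is an assumption made for this sketch.
    """
    if dim not in MATRYOSHKA_DIMS:
        raise ValueError(f"dim must be one of {MATRYOSHKA_DIMS}, got {dim}")
    if len(vec) < dim:
        raise ValueError(f"vector has {len(vec)} components, need at least {dim}")
    head = list(vec[:dim])
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0.0:
        return head  # an all-zero prefix cannot be rescaled
    return [x / norm for x in head]


# Usage: shrink a placeholder 1280-d vector to 256-d.
full = [float((i % 7) + 1) for i in range(1280)]
small = truncate_embedding(full, 256)
print(len(small))
```

Both sides of a comparison need to be truncated to the same width before computing similarity.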

This repo publishes the base dual-audio teacher checkpoint and the exact local gate baseline used for later teacher-recovery experiments.

Canonical name for this artifact line: AIST-95M (repo: augmem/AIST-95M).

This repo is retained as a legacy TE alias for continuity with older notes and links.

Parameter Count

Historical project shorthand called this the 86M teacher line. The exact loaded dual-audio stack is larger.

Exact loaded parameter count in the deployed evaluation path:

Component                                                    Params
Text encoder (MongoDB/mdbr-leaf-ir)                      22,861,056
Image encoder (mobilenetv4_conv_medium.e180_r384_in12k)   8,434,512
Audio encoder (mn20_as, full loaded module)              17,909,287
Audio encoder (openai/whisper-tiny, encoder only)         8,208,384
Image projection head                                    12,306,560
Audio projection head                                    14,272,640
Text projection head                                     11,323,520
Total exact loaded params                                95,315,959

For continuity with earlier notes:

  • 89,048,552 params if you exclude the EfficientAT classifier head from the mn20_as module.
  • 37,902,720 params are trainable checkpoint weights in the three projection heads.

The checkpoint file stores a combined projection_state_dict plus per-head state dicts; those are duplicate representations of the same projection weights and are not double-counted above.
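The accounting above can be re-checked with a few lines of arithmetic. This sketch only re-adds the numbers printed in this card; the dictionary keys are illustrative labels, not names taken from parameter_breakdown.json.

```python
# Parameter counts copied from the table in this card.
ENCODERS = {
    "text_encoder_mdbr_leaf_ir": 22_861_056,
    "image_encoder_mobilenetv4_conv_medium": 8_434_512,
    "audio_encoder_mn20_as_full": 17_909_287,
    "audio_encoder_whisper_tiny_encoder_only": 8_208_384,
}
PROJECTION_HEADS = {
    "image_projection_head": 12_306_560,
    "audio_projection_head": 14_272_640,
    "text_projection_head": 11_323_520,
}

trainable = sum(PROJECTION_HEADS.values())
total = sum(ENCODERS.values()) + trainable

# Size of the EfficientAT classifier head implied by the two totals in the card.
classifier_head = total - 89_048_552

print(f"total loaded params:     {total:,}")            # 95,315,959
print(f"projection head params:  {trainable:,}")        # 37,902,720
print(f"mn20_as classifier head: {classifier_head:,}")  # 6,267,407
```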

Architecture

The dual-audio teacher uses a frozen-encoder / trained-head setup:

Text   -> mdbr-leaf-ir (768-d) ----------------> DeepProjectionHead-d2 -> 1280
Image  -> MobileNetV4-Medium (1280-d) ---------> DeepProjectionHead-d2 -> 1280
Audio  -> EfficientAT mn20_as (1920-d) \
                                           +--> concat(2304-d) -> DeepProjectionHead-d2 -> 1280
         Whisper-Tiny encoder (384-d)    /

The audio path is dual-encoder by construction. EfficientAT contributes the environmental / general audio branch; Whisper-Tiny contributes the speech-sensitive branch.
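A minimal sketch of the fusion step in the diagram, assuming each encoder has already been reduced to one fixed-size vector per clip. The function name `fuse_audio_features` is hypothetical, the EfficientAT-first ordering is read off the diagram rather than confirmed, and the pooling that turns Whisper-Tiny frame outputs into a single 384-d vector is not described in this card.

```python
MN20_DIM = 1920     # EfficientAT mn20_as feature width, from the diagram
WHISPER_DIM = 384   # Whisper-Tiny encoder feature width, from the diagram
FUSED_DIM = MN20_DIM + WHISPER_DIM  # 2304, the projection head's input width


def fuse_audio_features(mn20_vec, whisper_vec):
    """Concatenate the two per-clip feature vectors (assumed order: mn20, whisper)."""
    if len(mn20_vec) != MN20_DIM:
        raise ValueError(f"expected {MN20_DIM} mn20_as features, got {len(mn20_vec)}")
    if len(whisper_vec) != WHISPER_DIM:
        raise ValueError(f"expected {WHISPER_DIM} Whisper features, got {len(whisper_vec)}")
    return list(mn20_vec) + list(whisper_vec)


# Usage with placeholder features.
fused = fuse_audio_features([0.0] * MN20_DIM, [1.0] * WHISPER_DIM)
print(len(fused))  # 2304
```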

Core training config:

  • projection hidden dim: 1920
  • projection output dim: 1280
  • projection depth: 2
  • loss: InfoNCE
  • audio encoder dim after concat: 2304
  • Matryoshka dims: [1280, 768, 512, 256, 128]
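For readers unfamiliar with the loss named in the config, here is a generic InfoNCE sketch over a batch similarity matrix. It is not the triembed implementation: the temperature value, the symmetric averaging, and the function names are all assumptions, and this card does not say whether the loss is also applied at each Matryoshka width.

```python
import math


def info_nce(sim, temperature=0.07):
    """One-directional InfoNCE over a square similarity matrix.

    sim[i][j] is the similarity between anchor i and candidate j, and the
    positive for anchor i is candidate i. temperature=0.07 is a placeholder.
    """
    losses = []
    for i, row in enumerate(sim):
        logits = [s / temperature for s in row]
        peak = max(logits)  # subtract the max for a stable log-sum-exp
        log_denom = peak + math.log(sum(math.exp(z - peak) for z in logits))
        losses.append(log_denom - logits[i])
    return sum(losses) / len(losses)


def symmetric_info_nce(sim, temperature=0.07):
    """Average of both retrieval directions (assumed, not confirmed by this card)."""
    transposed = [list(col) for col in zip(*sim)]
    return 0.5 * (info_nce(sim, temperature) + info_nce(transposed, temperature))
```

When every candidate is equally similar to every anchor, the loss equals log(batch size); it approaches zero as the matched pairs dominate their rows and columns.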

Published config file: te_mn20_whisper_d2_validaudio.yaml

Local Gate Baseline

The attached JSON file, teacher_dual_mn20whisper_exact_gate_baseline_20260424T155324Z.json, is the canonical local gate baseline used for later teacher continuation experiments.

Seeded split-excluded baseline at 1280d:

Slice           Metric     Value
Speech holdout  A->T R@1   0.5652
Speech holdout  T->A R@1   0.5992
Speech holdout  avg R@1    0.5822
WavCaps FSD     A->T R@1   0.1078
WavCaps FSD     T->A R@1   0.1030
WavCaps FSD     avg R@1    0.1054
SALT            A->I R@1   0.1692
SALT            I->A R@1   0.1261
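For reference, R@1 in a paired retrieval setting can be computed as in the sketch below. It assumes one ground-truth candidate per query, stored at the same index, and breaks ties toward the lowest index; the actual gate script may differ on both points. The last lines confirm that the avg rows in the table are the mean of the two directions.

```python
def recall_at_1(sim):
    """Fraction of queries whose top-scoring candidate is the paired one.

    sim[i][j] is the similarity of query i to candidate j; the correct
    candidate for query i is assumed to sit at index i.
    """
    hits = 0
    for i, row in enumerate(sim):
        best = max(range(len(row)), key=row.__getitem__)
        if best == i:
            hits += 1
    return hits / len(sim)


def transpose(sim):
    """Swap query and candidate roles, e.g. A->T becomes T->A."""
    return [list(col) for col in zip(*sim)]


# Toy 3x3 audio-to-text similarity matrix (made-up numbers).
toy = [
    [0.9, 0.1, 0.0],
    [0.2, 0.8, 0.1],
    [0.7, 0.2, 0.3],
]
a2t = recall_at_1(toy)             # 2 of 3 queries hit
t2a = recall_at_1(transpose(toy))  # 3 of 3 queries hit

# The avg rows in the table are the mean of the two directions.
speech_avg = round((0.5652 + 0.5992) / 2, 4)   # 0.5822
wavcaps_avg = round((0.1078 + 0.1030) / 2, 4)  # 0.1054
```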

Important scope note:

  • These are the exact local gate numbers used for bounded recovery experiments.
  • They are not a claim of broad public benchmark superiority.
  • The external 4-task audio smoke baseline was not packaged into this release.

Files

File                                                                Purpose
TE-86M-dual-audio-best_model.pt                                     Base dual-audio teacher checkpoint
te_mn20_whisper_d2_validaudio.yaml                                  Training config for the teacher line
teacher_dual_mn20whisper_exact_gate_baseline_20260424T155324Z.json  Canonical exact-gate baseline
parameter_breakdown.json                                            Exact parameter accounting used in this card

Loading

This release is a native PyTorch checkpoint in the format used by triembed.

Example local evaluation:

uv run python scripts/eval_teacher_exact_gate.py \
  --teacher TE-86M-dual-audio-best_model.pt \
  --cache-dir cache \
  --audio-suffix mn20_whisper_audio_features \
  --output-json teacher_gate_eval.json

Caveats

  • Canonical repo for this artifact line: augmem/AIST-95M.
  • This repo is retained as a historical TE alias for continuity with older links.
  • The exact loaded dual-audio stack is 95,315,959 params, not 86M.
  • The existing older augmem/TE-86M release on Hugging Face is a different artifact line; this repo is the dual-audio teacher checkpoint specifically.