AIST-87M
AIST-87M is a compact audio + image + speech + text embedding model for
human-memory augmentation workloads.
It is the single-audio evolution of the earlier dual-audio tower line: the
runtime audio path uses one merged native mn20_as EfficientAT encoder instead
of separate EfficientAT and Whisper branches. The LoRA training weights are
merged into the native audio encoder in this release artifact, so no separate
LoRA pass is needed at inference time.
Core stack:
- text: MongoDB/mdbr-leaf-ir
- image: mobilenetv4_conv_medium.e180_r384_in12k
- audio: native merged mn20_as EfficientAT encoder
- projection output: 1280d
- Matryoshka slices: [1280, 768, 512, 256, 128]
- exact loaded params: 87,118,774
The canonical name follows the Augmem naming standard:
- AIST = audio + image + speech + text
- 87M = exact loaded parameter count rounded to integer millions
Runtime Contract
This model returns L2-normalized embeddings in a shared 1280-dimensional space. For smaller runtime profiles, truncate to a Matryoshka slice and renormalize:
z1280 = l2norm(model(input))
z768 = l2norm(z1280[0:768])
z512 = l2norm(z1280[0:512])
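The truncate-and-renormalize step above can be sketched in dependency-free Python. This is a minimal illustration, not the release inference code: `l2norm`, `matryoshka_slice`, and `dot` are hypothetical helper names, and a random unit vector stands in for `l2norm(model(input))`.

```python
import math
import random
from typing import Sequence


def l2norm(vec: Sequence[float]) -> list[float]:
    """Scale a vector to unit L2 length; an all-zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        return list(vec)
    return [x / norm for x in vec]


def matryoshka_slice(z_full: Sequence[float], dim: int) -> list[float]:
    """Keep the leading `dim` coordinates of a full embedding and renormalize."""
    if not 0 < dim <= len(z_full):
        raise ValueError(f"dim must be in 1..{len(z_full)}, got {dim}")
    return l2norm(z_full[:dim])


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    """Dot product; equals cosine similarity when both inputs are unit length."""
    return sum(x * y for x, y in zip(a, b))


# Hypothetical stand-in for l2norm(model(input)): a random unit vector.
rng = random.Random(0)
z1280 = l2norm([rng.gauss(0.0, 1.0) for _ in range(1280)])

z768 = matryoshka_slice(z1280, 768)
z512 = matryoshka_slice(z1280, 512)
```

Because every slice is renormalized, a plain dot product between two embeddings of the same width is their cosine similarity. Queries and stored items should be compared at the same slice width.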
The release safetensors file is self-contained and includes the text encoder, image encoder, merged native audio encoder, and the three projection heads.
Evaluation Scope
This release uses a human-memory evaluation slice rather than a broad leaderboard sweep. The slice is chosen to match practical memory augmentation surfaces:
- text continuity: duplicate-question and semantic textual similarity tasks
- image recall: Flickr30k text-image and image-text retrieval
- audio recall: speech/general-audio text-audio retrieval tasks
Primary metrics:
- text continuity: main_score
- image recall: NDCG@10
- audio recall: NDCG@10
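For readers unfamiliar with the retrieval metrics, the sketch below shows one common binary-relevance definition of NDCG@k and Recall@k for a single query. It is an assumption that this matches the evaluation harness used for the report; benchmark toolkits may use graded relevance or a hit-rate variant of R@k, and the function names here are illustrative.

```python
import math
from typing import Sequence, Set


def recall_at_k(ranked_ids: Sequence[str], relevant_ids: Set[str], k: int) -> float:
    """Fraction of the relevant items that appear in the top-k results."""
    if not relevant_ids:
        return 0.0
    hits = sum(1 for doc_id in ranked_ids[:k] if doc_id in relevant_ids)
    return hits / len(relevant_ids)


def ndcg_at_k(ranked_ids: Sequence[str], relevant_ids: Set[str], k: int = 10) -> float:
    """Binary-relevance NDCG: discounted gain of the ranking over the ideal ranking."""
    # A hit at 0-based position `rank` contributes 1 / log2(rank + 2).
    dcg = sum(
        1.0 / math.log2(rank + 2)
        for rank, doc_id in enumerate(ranked_ids[:k])
        if doc_id in relevant_ids
    )
    # The ideal ranking places all relevant items first.
    ideal_hits = min(len(relevant_ids), k)
    idcg = sum(1.0 / math.log2(rank + 2) for rank in range(ideal_hits))
    return dcg / idcg if idcg > 0 else 0.0


# One query whose single relevant item is retrieved at position 2.
ranked = ["d3", "d1", "d7"]
relevant = {"d1"}
score = ndcg_at_k(ranked, relevant, k=10)
```

A task-level score is then the mean of the per-query values.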
Human-Memory Slice
Source: aist87m_memory_slice_release_report.md and
aist87m_memory_slice_release_report.json.
| Dim | Tasks | Text continuity | Image recall | Audio recall | Overall |
|---|---|---|---|---|---|
| 1280 | 8 / 8 | 0.763 | 0.425 | 0.104 | 0.349 |
| 768 | 8 / 8 | 0.762 | 0.424 | 0.104 | 0.349 |
| 512 | 8 / 8 | 0.762 | 0.424 | 0.104 | 0.349 |
Selected 1280d task scores:
| Task | Family | Metric | Score | R@1 | R@10 |
|---|---|---|---|---|---|
| SprintDuplicateQuestions | Text continuity | main_score | 0.875 | - | - |
| STSBenchmark | Text continuity | main_score | 0.651 | - | - |
| Flickr30kT2IRetrieval | Image recall | NDCG@10 | 0.469 | 0.296 | 0.672 |
| Flickr30kI2TRetrieval | Image recall | NDCG@10 | 0.381 | 0.082 | 0.407 |
| CommonVoiceMini21T2ARetrieval | Audio recall | NDCG@10 | 0.028 | 0.006 | 0.062 |
| MACST2ARetrieval | Audio recall | NDCG@10 | 0.110 | 0.033 | 0.214 |
| UrbanSound8KT2ARetrieval | Audio recall | NDCG@10 | 0.009 | 0.002 | 0.018 |
| ClothoT2ARetrieval | Audio recall | NDCG@10 | 0.269 | 0.128 | 0.443 |
Task-Aligned Comparisons
Comparisons below are only for locally available, task-aligned runs from the same raw AIST line and its audio baselines.
| Comparison | Dim | Paired tasks | Read |
|---|---|---|---|
| vs native mn20_as audio baseline | 768 | 4 | slightly lower selected audio recall on average; UrbanSound8K is flat |
| vs dual-audio tower | 768 | 6 | smaller single-audio runtime, but lower paired text/image/audio scores |
| vs AIST-95M | 1280 | 2 | only paired Flickr tasks are available locally; AIST-95M remains stronger on that pair |
This release is not presented as a generic MTEB/MIEB/MAEB leaderboard model. Broad diagnostic runs contain many task families that are not part of this release gate.
Runtime Footprint vs Dual-Audio Tower
AIST-87M replaces the dual-audio tower's separate EfficientAT + Whisper-Tiny
audio branches with one merged native mn20_as EfficientAT encoder. The result
is a smaller deployed path with the same 1280d output contract.
| Runtime surface | AIST-87M | AIST-95M dual-audio tower | Delta |
|---|---|---|---|
| Loaded parameters | 87,118,774 | 95,315,959 | -8.6% |
| Safetensors artifact | 348.9 MB | 381.9 MB | -8.6% |
| Audio encoders | 1 | 2 | removes Whisper branch |
| Audio encoder parameters | 19,886,566 | 26,117,671 | -23.9% |
| Audio path parameters incl. projection | 32,193,126 | 40,390,311 | -20.3% |
| Audio projection input width | 1,280 | 2,304 | -44.4% |
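The Delta column above appears to be the relative change from the dual-audio tower to AIST-87M. The short check below reproduces the percentages from the table values under that assumption; `relative_delta_pct` is an illustrative helper, not part of the release.

```python
def relative_delta_pct(new: float, old: float) -> float:
    """Relative change from `old` to `new`, in percent."""
    return (new - old) / old * 100.0


# (AIST-87M value, dual-audio tower value) pairs copied from the table above.
rows = {
    "Loaded parameters": (87_118_774, 95_315_959),
    "Safetensors artifact (MB)": (348.9, 381.9),
    "Audio encoder parameters": (19_886_566, 26_117_671),
    "Audio path parameters incl. projection": (32_193_126, 40_390_311),
    "Audio projection input width": (1_280, 2_304),
}

deltas = {name: round(relative_delta_pct(new, old), 1) for name, (new, old) in rows.items()}
for name, delta in deltas.items():
    print(f"{name}: {delta:+.1f}%")
```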
Exact-gate tradeoff against the same dual-audio local baseline:
| 1280d exact-gate slice | AIST-87M | AIST-95M dual-audio tower | Delta |
|---|---|---|---|
| Speech holdout audio-text R@1 avg | 0.724 | 0.582 | +0.142 |
| WavCaps FSD audio-text R@1 avg | 0.097 | 0.105 | -0.009 |
| SALT audio-text R@1 avg | 0.008 | 0.007 | flat |
| SALT image-audio R@1 avg | 0.138 | 0.148 | -0.010 |
Measured PyTorch audio-stack throughput on an NVIDIA L4, using synthetic 10s 32 kHz CPU waveforms passed through waveform -> audio encoder -> projection -> normalized embedding. Median wall time is over 50 timed iterations after 20 warmup iterations. This excludes audio file decode, dataset download, and MTEB result serialization.
| Batch | AIST-87M median ms | AIST-87M throughput | AIST-95M median ms | AIST-95M throughput | Speedup |
|---|---|---|---|---|---|
| 1 | 5.36 | 186.7 clips/s; 1,867 audio-s/s | 10.50 | 95.2 clips/s; 952 audio-s/s | 1.96x |
| 8 | 16.46 | 486.0 clips/s; 4,860 audio-s/s | 60.29 | 132.7 clips/s; 1,327 audio-s/s | 3.66x |
| 16 | 41.19 | 388.5 clips/s; 3,885 audio-s/s | 133.95 | 119.4 clips/s; 1,194 audio-s/s | 3.25x |
Projection-only throughput at feature batch 2048 is also higher for the
single-audio path: 314k features/s for AIST-87M vs 282k features/s for the
dual-audio tower. Raw benchmark output is included as
aist87m_vs_dual_audio_throughput_l4_20260504.json.
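The measurement protocol described above (warmup iterations, then the median over timed iterations) can be sketched as follows. This is a generic harness under stated assumptions, not the script that produced the JSON file: `median_wall_ms` and `throughput` are hypothetical names, and `fn` stands in for one forward pass of waveform -> audio encoder -> projection -> normalized embedding.

```python
import statistics
import time
from typing import Callable, Tuple


def median_wall_ms(fn: Callable[[], object], warmup: int = 20, iters: int = 50) -> float:
    """Median wall-clock time of `fn` in milliseconds, after untimed warmup calls."""
    for _ in range(warmup):
        fn()
    samples = []
    for _ in range(iters):
        start = time.perf_counter()
        fn()
        # For GPU work, `fn` should block until the device is idle
        # (for example by calling torch.cuda.synchronize()) so the
        # timer covers the full computation.
        samples.append((time.perf_counter() - start) * 1000.0)
    return statistics.median(samples)


def throughput(batch_size: int, median_ms: float, clip_seconds: float = 10.0) -> Tuple[float, float]:
    """Convert a median batch latency into (clips per second, audio-seconds per second)."""
    clips_per_s = batch_size / (median_ms / 1000.0)
    return clips_per_s, clips_per_s * clip_seconds


# Reproduce the batch-8 AIST-87M row from the reported median latency.
clips_per_s, audio_s_per_s = throughput(batch_size=8, median_ms=16.46)
print(f"{clips_per_s:.1f} clips/s; {audio_s_per_s:,.0f} audio-s/s")
```

The speedup column is the ratio of the two median latencies at the same batch size, for example 60.29 / 16.46 at batch 8.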
Architecture
Text -> mdbr-leaf-ir (768-d) -----------------------> DeepProjectionHead-d2 -> 1280
Image -> MobileNetV4-Medium (1280-d) ----------------> DeepProjectionHead-d2 -> 1280
Audio -> merged native EfficientAT mn20_as (1280-d) -> DeepProjectionHead-d2 -> 1280
The audio encoder in this artifact is the merged native checkpoint:
mn20_native_merged_aistmix_audioheavy100k175k175k_continue_from_balanced_20260426T143137Z/latest_model.pt
Parameter Count
| Component | Params |
|---|---|
| Text encoder (MongoDB/mdbr-leaf-ir) | 22,861,056 |
| Image encoder (mobilenetv4_conv_medium.e180_r384_in12k) | 8,434,512 |
| Audio encoder (merged native mn20_as) | 19,886,566 |
| Image projection head | 12,306,560 |
| Audio projection head | 12,306,560 |
| Text projection head | 11,323,520 |
| Total exact loaded params | 87,118,774 |
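As a quick consistency check, the component counts in the table sum to the stated total, and rounding that total to integer millions yields the `87M` suffix in the model name. The dictionary keys below are illustrative labels, not identifiers from `parameter_breakdown.json`.

```python
# Component parameter counts copied from the table above.
COMPONENT_PARAMS = {
    "text_encoder": 22_861_056,
    "image_encoder": 8_434_512,
    "audio_encoder": 19_886_566,
    "image_projection_head": 12_306_560,
    "audio_projection_head": 12_306_560,
    "text_projection_head": 11_323_520,
}

total_params = sum(COMPONENT_PARAMS.values())

# Name suffix: exact loaded parameter count rounded to integer millions.
size_label = f"{round(total_params / 1_000_000)}M"

# Audio path as reported in the runtime footprint table: encoder plus its projection head.
audio_path_params = (
    COMPONENT_PARAMS["audio_encoder"] + COMPONENT_PARAMS["audio_projection_head"]
)

print(total_params, size_label, audio_path_params)
```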
Files
| File | Purpose |
|---|---|
| AIST-87M.safetensors | Self-contained release artifact |
| aist_81m_raw_mn20_lora.yaml | Training recipe for the source run |
| manifest.json | Release manifest with checksums and eval coverage |
| parameter_breakdown.json | Exact parameter accounting |
| aist87m_memory_slice_release_report.md | Human-memory slice report |
| aist87m_memory_slice_release_report.json | Machine-readable evaluation summary |
| aist87m_vs_dual_audio_throughput_l4_20260504.json | L4 throughput benchmark vs dual-audio tower |
Caveats
- The model is optimized and reported for memory-relevant embedding surfaces, not broad leaderboard coverage.
- The single-audio path is smaller and simpler than the dual-audio tower, but it does not dominate the dual-audio tower on paired diagnostic scores.
- 1280d, 768d, and 512d human-memory slices are complete for the release checkpoint.