AIST-87M

AIST-87M is a compact audio + image + speech + text embedding model for human-memory augmentation workloads.

It is the single-audio evolution of the earlier dual-audio tower line: the runtime audio path uses one merged native mn20_as EfficientAT encoder instead of separate EfficientAT and Whisper branches. The LoRA weights from training are merged into the native audio encoder in this release artifact, so inference needs no separate LoRA pass.

Core stack:

  • text: MongoDB/mdbr-leaf-ir
  • image: mobilenetv4_conv_medium.e180_r384_in12k
  • audio: native merged mn20_as EfficientAT encoder
  • projection output: 1280d
  • Matryoshka slices: [1280, 768, 512, 256, 128]
  • exact loaded params: 87,118,774

The canonical name follows the Augmem naming standard:

  • AIST = audio + image + speech + text
  • 87M = exact loaded parameter count rounded to integer millions

Runtime Contract

This model returns L2-normalized embeddings in a shared 1280-dimensional space. For smaller runtime profiles, truncate to a Matryoshka slice and renormalize:

z1280 = l2norm(model(input))
z768  = l2norm(z1280[0:768])
z512  = l2norm(z1280[0:512])
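The truncate-and-renormalize step above can be sketched in NumPy. This is a minimal illustration, not the release code: the random matrix stands in for real model output, and `l2norm` and `matryoshka_slice` are hypothetical helper names.

```python
import numpy as np

# Slice widths listed in the card's core stack.
MATRYOSHKA_DIMS = (1280, 768, 512, 256, 128)

def l2norm(x, eps=1e-12):
    """L2-normalize along the last axis (eps guards against zero vectors)."""
    x = np.asarray(x, dtype=np.float32)
    norm = np.linalg.norm(x, axis=-1, keepdims=True)
    return x / np.maximum(norm, eps)

def matryoshka_slice(z1280, dim):
    """Keep the first `dim` coordinates of a 1280-d embedding and renormalize."""
    if dim not in MATRYOSHKA_DIMS:
        raise ValueError(f"unsupported slice width: {dim}")
    return l2norm(np.asarray(z1280)[..., :dim])

# Stand-in for model(input): 4 random embeddings, normalized like the model output.
rng = np.random.default_rng(0)
z1280 = l2norm(rng.standard_normal((4, 1280)))

z512 = matryoshka_slice(z1280, 512)
# After renormalization, a dot product is a cosine similarity in the 512-d slice.
sims = z512 @ z512.T
```

Renormalizing after truncation matters because the first 512 coordinates of a unit vector generally have norm below 1, so dot products on the raw slice would no longer be cosine similarities.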

The release safetensors file is self-contained and includes the text encoder, image encoder, merged native audio encoder, and the three projection heads.

Evaluation Scope

This release uses a human-memory evaluation slice rather than a broad leaderboard sweep. The slice is chosen to match practical memory augmentation surfaces:

  • text continuity: duplicate-question and semantic textual similarity tasks
  • image recall: Flickr30k text-image and image-text retrieval
  • audio recall: speech/general-audio text-audio retrieval tasks

Primary metrics:

  • text continuity: main_score
  • image recall: NDCG@10
  • audio recall: NDCG@10
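For readers unfamiliar with the retrieval metrics, the sketch below shows the textbook definitions of NDCG@k and recall@k for a single query. It assumes binary or graded relevance labels given in ranked order; the benchmark harness behind the reported numbers may differ in details such as tie handling, so treat this as an illustration only. The function names are hypothetical.

```python
import math

def ndcg_at_k(relevances, k=10):
    """NDCG@k for one query.

    `relevances` holds the relevance label of every candidate, in the order
    the model ranked them (best first).
    """
    def dcg(labels):
        # Rank i (0-based) is discounted by log2(i + 2).
        return sum(rel / math.log2(i + 2) for i, rel in enumerate(labels[:k]))

    ideal = sorted(relevances, reverse=True)
    idcg = dcg(ideal)
    return dcg(relevances) / idcg if idcg > 0 else 0.0

def recall_at_k(relevances, k):
    """Fraction of all relevant candidates that appear in the top k."""
    total = sum(1 for rel in relevances if rel > 0)
    if total == 0:
        return 0.0
    return sum(1 for rel in relevances[:k] if rel > 0) / total

# One query whose single relevant item was ranked third out of twelve.
ranked = [0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
print(ndcg_at_k(ranked, k=10))   # 1 / log2(4) = 0.5
print(recall_at_k(ranked, k=1))  # 0.0
print(recall_at_k(ranked, k=10)) # 1.0
```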

Human-Memory Slice

Source: aist87m_memory_slice_release_report.md and aist87m_memory_slice_release_report.json.

Dim   Tasks  Text continuity  Image recall  Audio recall  Overall
1280  8 / 8  0.763            0.425         0.104         0.349
768   8 / 8  0.762            0.424         0.104         0.349
512   8 / 8  0.762            0.424         0.104         0.349

Selected 1280d task scores:

Task                           Family           Metric      Score  R@1    R@10
SprintDuplicateQuestions       Text continuity  main_score  0.875  -      -
STSBenchmark                   Text continuity  main_score  0.651  -      -
Flickr30kT2IRetrieval          Image recall     NDCG@10     0.469  0.296  0.672
Flickr30kI2TRetrieval          Image recall     NDCG@10     0.381  0.082  0.407
CommonVoiceMini21T2ARetrieval  Audio recall     NDCG@10     0.028  0.006  0.062
MACST2ARetrieval               Audio recall     NDCG@10     0.110  0.033  0.214
UrbanSound8KT2ARetrieval       Audio recall     NDCG@10     0.009  0.002  0.018
ClothoT2ARetrieval             Audio recall     NDCG@10     0.269  0.128  0.443
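The 1280d summary row can be reproduced from the eight task scores. The aggregation rule is not stated in this card; the sketch below assumes per-family unweighted means and an Overall that is the unweighted mean over all eight tasks, which is the reading that matches the published numbers.

```python
from collections import defaultdict
from statistics import mean

# (family, primary score) per task, copied from the 1280d task table.
task_scores = {
    "SprintDuplicateQuestions": ("Text continuity", 0.875),
    "STSBenchmark": ("Text continuity", 0.651),
    "Flickr30kT2IRetrieval": ("Image recall", 0.469),
    "Flickr30kI2TRetrieval": ("Image recall", 0.381),
    "CommonVoiceMini21T2ARetrieval": ("Audio recall", 0.028),
    "MACST2ARetrieval": ("Audio recall", 0.110),
    "UrbanSound8KT2ARetrieval": ("Audio recall", 0.009),
    "ClothoT2ARetrieval": ("Audio recall", 0.269),
}

# Group scores by family, then average within each family.
by_family = defaultdict(list)
for family, score in task_scores.values():
    by_family[family].append(score)
family_means = {family: round(mean(vals), 3) for family, vals in by_family.items()}

# Assumed rule: Overall is the mean over tasks, not the mean of family means.
overall = round(mean(score for _, score in task_scores.values()), 3)

print(family_means)  # Text continuity 0.763, Image recall 0.425, Audio recall 0.104
print(overall)       # 0.349
```

Because Overall averages tasks rather than families, the four audio tasks carry half of its weight, which explains why it sits well below the text and image family means.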

Task-Aligned Comparisons

Comparisons below are only for locally available, task-aligned runs from the same raw AIST line and its audio baselines.

Comparison                        Dim   Paired tasks  Read
vs native mn20_as audio baseline  768   4             slightly lower selected audio recall on average; UrbanSound8K is flat
vs dual-audio tower               768   6             smaller single-audio runtime, but lower paired text/image/audio scores
vs AIST-95M                       1280  2             only paired Flickr tasks are available locally; AIST-95M remains stronger on that pair

This release is not presented as a generic MTEB/MIEB/MAEB leaderboard model. Broad diagnostic runs contain many task families that are not part of this release gate.

Runtime Footprint vs Dual-Audio Tower

AIST-87M replaces the dual-audio tower's separate EfficientAT + Whisper-Tiny audio branches with one merged native mn20_as EfficientAT encoder. The result is a smaller deployed path with the same 1280d output contract.

Runtime surface                         AIST-87M    AIST-95M dual-audio tower  Delta
Loaded parameters                       87,118,774  95,315,959                 -8.6%
Safetensors artifact                    348.9 MB    381.9 MB                   -8.6%
Audio encoders                          1           2                          removes Whisper branch
Audio encoder parameters                19,886,566  26,117,671                 -23.9%
Audio path parameters incl. projection  32,193,126  40,390,311                 -20.3%
Audio projection input width            1,280       2,304                      -44.4%
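The Delta column is a relative change against the dual-audio tower. A short check, assuming the usual (new - old) / old definition, recomputes the percentages from the table values:

```python
def rel_delta_pct(new, old):
    """Relative change of `new` versus `old`, in percent."""
    return 100.0 * (new - old) / old

# (AIST-87M value, dual-audio tower value) per row, copied from the table.
footprint = {
    "Loaded parameters": (87_118_774, 95_315_959),
    "Safetensors artifact (MB)": (348.9, 381.9),
    "Audio encoder parameters": (19_886_566, 26_117_671),
    "Audio path parameters incl. projection": (32_193_126, 40_390_311),
    "Audio projection input width": (1_280, 2_304),
}

deltas = {name: round(rel_delta_pct(new, old), 1) for name, (new, old) in footprint.items()}
for name, delta in deltas.items():
    print(f"{name}: {delta:+.1f}%")
```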

Exact-gate tradeoff against the same dual-audio local baseline:

1280d exact-gate slice             AIST-87M  AIST-95M dual-audio tower  Delta
Speech holdout audio-text R@1 avg  0.724     0.582                      +0.142
WavCaps FSD audio-text R@1 avg     0.097     0.105                      -0.009
SALT audio-text R@1 avg            0.008     0.007                      flat
SALT image-audio R@1 avg           0.138     0.148                      -0.010

PyTorch audio-stack throughput was measured on an NVIDIA L4 using synthetic 10 s, 32 kHz CPU waveforms passed through waveform -> audio encoder -> projection -> normalized embedding. Median wall time is taken over 50 timed iterations after 20 warmup iterations. The measurement excludes audio file decode, dataset download, and MTEB result serialization.
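A timing loop with this warmup/timed split could look like the sketch below. It is not the benchmark script used for this release: `median_ms` and its parameters are hypothetical names, and the workload shown is a CPU stand-in for the audio stack. For GPU work, the `sync` hook would typically be `torch.cuda.synchronize`, so that asynchronous kernels finish before the clock is read.

```python
import statistics
import time

def median_ms(fn, warmup=20, iters=50, sync=None):
    """Median wall time of `fn()` in milliseconds.

    Runs `warmup` untimed calls, then `iters` timed calls. `sync`, if given,
    is called before each clock read to wait for asynchronous device work.
    """
    for _ in range(warmup):
        fn()
    if sync is not None:
        sync()

    samples = []
    for _ in range(iters):
        start = time.perf_counter()
        fn()
        if sync is not None:
            sync()
        samples.append((time.perf_counter() - start) * 1000.0)
    return statistics.median(samples)

# Stand-in workload; a real run would call the audio encoder and projection head.
elapsed = median_ms(lambda: sum(range(10_000)))
print(f"median: {elapsed:.3f} ms")
```

Using the median rather than the mean keeps occasional slow iterations, such as allocator or scheduler hiccups, from skewing the result.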

Batch  AIST-87M median ms  AIST-87M throughput             AIST-95M median ms  AIST-95M throughput             Speedup
1      5.36                186.7 clips/s; 1,867 audio-s/s  10.50               95.2 clips/s; 952 audio-s/s     1.96x
8      16.46               486.0 clips/s; 4,860 audio-s/s  60.29               132.7 clips/s; 1,327 audio-s/s  3.66x
16     41.19               388.5 clips/s; 3,885 audio-s/s  133.95              119.4 clips/s; 1,194 audio-s/s  3.25x
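The throughput and speedup columns follow from batch size, median latency, and the 10 s clip length. The sketch below recomputes them from the rounded medians in the table, so the last digit can differ slightly from the published figures (for example, about 186.6 rather than 186.7 clips/s at batch 1), which presumably come from unrounded timings.

```python
CLIP_SECONDS = 10  # synthetic clip length used in the benchmark

def throughput(batch, median_ms):
    """Return (clips per second, audio seconds per second) for one batch size."""
    clips_per_s = batch / (median_ms / 1000.0)
    return clips_per_s, clips_per_s * CLIP_SECONDS

# (batch, AIST-87M median ms, AIST-95M dual-audio median ms) from the table.
rows = [(1, 5.36, 10.50), (8, 16.46, 60.29), (16, 41.19, 133.95)]

results = []
for batch, single_ms, dual_ms in rows:
    clips, audio_s = throughput(batch, single_ms)
    results.append({
        "batch": batch,
        "clips_per_s": clips,
        "audio_s_per_s": audio_s,
        # Speedup is the ratio of median latencies at the same batch size.
        "speedup": dual_ms / single_ms,
    })

for r in results:
    print(f"batch {r['batch']}: {r['clips_per_s']:.1f} clips/s, "
          f"{r['audio_s_per_s']:.0f} audio-s/s, {r['speedup']:.2f}x")
```

In these numbers, AIST-87M throughput peaks at batch 8 and drops at batch 16, so larger batches are not automatically faster on this hardware.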

Projection-only throughput at feature batch 2048 is also higher for the single-audio path: 314k features/s for AIST-87M vs 282k features/s for the dual-audio tower. Raw benchmark output is included as aist87m_vs_dual_audio_throughput_l4_20260504.json.

Architecture

Text   -> mdbr-leaf-ir (768-d) -----------------------> DeepProjectionHead-d2 -> 1280
Image  -> MobileNetV4-Medium (1280-d) ----------------> DeepProjectionHead-d2 -> 1280
Audio  -> merged native EfficientAT mn20_as (1280-d) -> DeepProjectionHead-d2 -> 1280

The audio encoder in this artifact is the merged native checkpoint:

mn20_native_merged_aistmix_audioheavy100k175k175k_continue_from_balanced_20260426T143137Z/latest_model.pt

Parameter Count

Component                                                Params
Text encoder (MongoDB/mdbr-leaf-ir)                      22,861,056
Image encoder (mobilenetv4_conv_medium.e180_r384_in12k)  8,434,512
Audio encoder (merged native mn20_as)                    19,886,566
Image projection head                                    12,306,560
Audio projection head                                    12,306,560
Text projection head                                     11,323,520
Total exact loaded params                                87,118,774
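As a consistency check, the component counts above sum to the stated total, and the audio encoder plus its projection head gives the audio-path figure quoted in the runtime footprint table. The dictionary keys below are illustrative labels, not identifiers from the release files.

```python
# Parameter counts copied from the table above.
components = {
    "text_encoder": 22_861_056,
    "image_encoder": 8_434_512,
    "audio_encoder": 19_886_566,
    "image_projection_head": 12_306_560,
    "audio_projection_head": 12_306_560,
    "text_projection_head": 11_323_520,
}

total = sum(components.values())
audio_path = components["audio_encoder"] + components["audio_projection_head"]

print(f"total: {total:,}")            # total: 87,118,774
print(f"audio path: {audio_path:,}")  # audio path: 32,193,126
print(f"name suffix: {round(total / 1_000_000)}M")  # name suffix: 87M
```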

Files

File                                               Purpose
AIST-87M.safetensors                               Self-contained release artifact
aist_81m_raw_mn20_lora.yaml                        Training recipe for the source run
manifest.json                                      Release manifest with checksums and eval coverage
parameter_breakdown.json                           Exact parameter accounting
aist87m_memory_slice_release_report.md             Human-memory slice report
aist87m_memory_slice_release_report.json           Machine-readable evaluation summary
aist87m_vs_dual_audio_throughput_l4_20260504.json  L4 throughput benchmark vs dual-audio tower

Caveats

  • The model is optimized and reported for memory-relevant embedding surfaces, not broad leaderboard coverage.
  • The single-audio path is smaller and simpler than the dual-audio tower, but it does not dominate the dual-audio tower on paired diagnostic scores.
  • 1280d, 768d, and 512d human-memory slices are complete for the release checkpoint; the 256d and 128d Matryoshka slices are not reported here.