AIST-87M

AIST-87M is a compact audio + image + speech + text embedding model for human-memory augmentation workloads.

It is the single-audio evolution of the earlier dual-audio tower line: the runtime audio path uses one merged native mn20_as EfficientAT encoder instead of separate EfficientAT and Whisper-Tiny branches. The LoRA weights from training are already merged into the native audio encoder in this release artifact, so inference needs no separate LoRA pass.

Core stack:

text: MongoDB/mdbr-leaf-ir
image: mobilenetv4_conv_medium.e180_r384_in12k
audio: native merged mn20_as EfficientAT encoder
projection output: 1280d
Matryoshka slices: [1280, 768, 512, 256, 128]
exact loaded params: 87,118,774

The canonical name follows the Augmem naming standard:

AIST = audio + image + speech + text
87M = exact loaded parameter count rounded to integer millions

Runtime Contract

This model returns L2-normalized embeddings in a shared 1280-dimensional space. For smaller runtime profiles, truncate to a Matryoshka slice and renormalize:

z1280 = l2norm(model(input))
z768  = l2norm(z1280[0:768])
z512  = l2norm(z1280[0:512])
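The truncate-and-renormalize step above can be sketched in plain Python. This is a minimal illustration rather than the release's own inference code: `l2norm`, `matryoshka_slice`, and `cosine` are hypothetical helper names, and the sine-based vector is only a stand-in for `model(input)`.

```python
import math
from typing import List, Sequence

# Slice widths listed under "Matryoshka slices" in the core stack.
MATRYOSHKA_DIMS = (1280, 768, 512, 256, 128)


def l2norm(vec: Sequence[float], eps: float = 1e-12) -> List[float]:
    """Scale a vector to unit L2 length (eps guards against division by zero)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / max(norm, eps) for x in vec]


def matryoshka_slice(z_full: Sequence[float], dim: int) -> List[float]:
    """Keep the first `dim` coordinates of a full embedding and renormalize."""
    if dim not in MATRYOSHKA_DIMS:
        raise ValueError(f"unsupported slice width: {dim}")
    if len(z_full) < dim:
        raise ValueError("embedding is shorter than the requested slice")
    return l2norm(z_full[:dim])


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """For unit vectors, cosine similarity reduces to a dot product."""
    return sum(x * y for x, y in zip(a, b))


if __name__ == "__main__":
    # Stand-in for model(input): any 1280-d vector works for the demonstration.
    raw = [math.sin(i + 1) for i in range(1280)]
    z1280 = l2norm(raw)
    z512 = matryoshka_slice(z1280, 512)
    print(len(z512), round(cosine(z512, z512), 6))
```

Queries and stored items should be compared at the same slice width. A truncated vector is no longer unit length, which is why the slice is renormalized before any dot-product scoring.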

The release safetensors file is self-contained and includes the text encoder, image encoder, merged native audio encoder, and the three projection heads.
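One way to inspect a self-contained safetensors artifact without loading any weights is to read its JSON header, which the format stores after an 8-byte little-endian length prefix. The sketch below uses only the standard library; `read_safetensors_header` and `count_elements` are hypothetical helper names, and the demo writes a tiny synthetic file rather than touching the real artifact. The element total from a real file may differ from the loaded parameter count if the file also stores non-parameter buffers, so treat it as a sanity check and not as the official accounting.

```python
import json
import struct
import tempfile
from math import prod
from pathlib import Path


def read_safetensors_header(path) -> dict:
    """Return the JSON header of a safetensors file without reading tensor data."""
    with open(path, "rb") as fh:
        (header_len,) = struct.unpack("<Q", fh.read(8))  # little-endian u64
        return json.loads(fh.read(header_len).decode("utf-8"))


def count_elements(header: dict) -> int:
    """Sum element counts over all tensors, skipping the optional metadata entry."""
    return sum(
        prod(entry["shape"])
        for name, entry in header.items()
        if name != "__metadata__"
    )


if __name__ == "__main__":
    # Synthetic file: one 2x3 float32 tensor (24 bytes of zeros).
    header = {"demo.weight": {"dtype": "F32", "shape": [2, 3], "data_offsets": [0, 24]}}
    blob = json.dumps(header).encode("utf-8")
    with tempfile.TemporaryDirectory() as tmp:
        demo = Path(tmp) / "demo.safetensors"
        demo.write_bytes(struct.pack("<Q", len(blob)) + blob + bytes(24))
        print(count_elements(read_safetensors_header(demo)))  # prints 6
```

Grouping the header keys by prefix would show which tensors belong to each encoder and projection head, but the key naming inside `AIST-87M.safetensors` is not documented here, so no prefixes are assumed.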

Evaluation Scope

This release uses a human-memory evaluation slice rather than a broad leaderboard sweep. The slice is chosen to match practical memory augmentation surfaces:

text continuity: duplicate-question and semantic textual similarity tasks
image recall: Flickr30k text-image and image-text retrieval
audio recall: speech/general-audio text-audio retrieval tasks

Primary metrics:

text continuity: main_score
image recall: NDCG@10
audio recall: NDCG@10
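For readers unfamiliar with the retrieval metrics, the sketch below shows the textbook binary-relevance definitions of NDCG@k and recall@k for a single query. It is an illustration under that assumption, not the evaluation harness used for this release; `ndcg_at_k`, `recall_at_k`, and the document ids are hypothetical.

```python
import math
from typing import Iterable, Sequence


def recall_at_k(ranked: Sequence[str], relevant: Iterable[str], k: int) -> float:
    """Fraction of the relevant items that appear in the top-k results."""
    rel = set(relevant)
    if not rel:
        return 0.0
    hits = sum(1 for doc in ranked[:k] if doc in rel)
    return hits / len(rel)


def ndcg_at_k(ranked: Sequence[str], relevant: Iterable[str], k: int = 10) -> float:
    """Binary-relevance NDCG: discounted gain divided by the best achievable gain."""
    rel = set(relevant)
    # Position 0 is rank 1, so the discount is log2(rank + 1) = log2(index + 2).
    dcg = sum(
        1.0 / math.log2(index + 2)
        for index, doc in enumerate(ranked[:k])
        if doc in rel
    )
    ideal_hits = min(len(rel), k)
    idcg = sum(1.0 / math.log2(index + 2) for index in range(ideal_hits))
    return dcg / idcg if idcg > 0 else 0.0


if __name__ == "__main__":
    ranked = ["d3", "d1", "d7"]   # retrieval order for one query
    relevant = {"d1"}             # the single correct match
    print(ndcg_at_k(ranked, relevant), recall_at_k(ranked, relevant, 1))
```

With one relevant item per query, recall@k is simply a hit rate, while NDCG@10 also rewards placing the match nearer the top. Per-task scores are then averaged over queries.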

Human-Memory Slice

Source: aist87m_memory_slice_release_report.md and aist87m_memory_slice_release_report.json.

Dim	Tasks	Text continuity	Image recall	Audio recall	Overall
1280	8 / 8	0.763	0.425	0.104	0.349
768	8 / 8	0.762	0.424	0.104	0.349
512	8 / 8	0.762	0.424	0.104	0.349

Selected 1280d task scores:

Task	Family	Metric	Score	R@1	R@10
SprintDuplicateQuestions	Text continuity	main_score	0.875	-	-
STSBenchmark	Text continuity	main_score	0.651	-	-
Flickr30kT2IRetrieval	Image recall	NDCG@10	0.469	0.296	0.672
Flickr30kI2TRetrieval	Image recall	NDCG@10	0.381	0.082	0.407
CommonVoiceMini21T2ARetrieval	Audio recall	NDCG@10	0.028	0.006	0.062
MACST2ARetrieval	Audio recall	NDCG@10	0.110	0.033	0.214
UrbanSound8KT2ARetrieval	Audio recall	NDCG@10	0.009	0.002	0.018
ClothoT2ARetrieval	Audio recall	NDCG@10	0.269	0.128	0.443
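The aggregate columns in the dimension table can be cross-checked from the per-task scores. The reported 1280d figures are consistent with each family value being the unweighted mean of its tasks and Overall being the unweighted mean over all eight tasks, not a mean of the three family values; that reading is inferred from the numbers and is not stated in the release report. The dictionary layout below is purely illustrative.

```python
from statistics import mean

# 1280d per-task scores copied from the table above.
SCORES = {
    "Text continuity": {"SprintDuplicateQuestions": 0.875, "STSBenchmark": 0.651},
    "Image recall": {"Flickr30kT2IRetrieval": 0.469, "Flickr30kI2TRetrieval": 0.381},
    "Audio recall": {
        "CommonVoiceMini21T2ARetrieval": 0.028,
        "MACST2ARetrieval": 0.110,
        "UrbanSound8KT2ARetrieval": 0.009,
        "ClothoT2ARetrieval": 0.269,
    },
}

# Mean within each family, then mean over every task regardless of family.
family_means = {family: mean(tasks.values()) for family, tasks in SCORES.items()}
overall = mean(score for tasks in SCORES.values() for score in tasks.values())

for family, value in family_means.items():
    print(f"{family}: {value:.3f}")
print(f"Overall (task mean): {overall:.3f}")
```

Because audio contributes four of the eight tasks, the low audio scores weigh more heavily on Overall than they would under a per-family average.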

Task-Aligned Comparisons

The comparisons below cover only locally available, task-aligned runs from the same raw AIST line and its audio baselines.

Comparison	Dim	Paired tasks	Takeaway
vs native `mn20_as` audio baseline	768	4	slightly lower selected audio recall on average; UrbanSound8K is flat
vs dual-audio tower	768	6	smaller single-audio runtime, but lower paired text/image/audio scores
vs `AIST-95M`	1280	2	only paired Flickr tasks are available locally; `AIST-95M` remains stronger on that pair

This release is not presented as a generic MTEB/MIEB/MAEB leaderboard model. Broad diagnostic runs contain many task families that are not part of this release gate.

Runtime Footprint vs Dual-Audio Tower

AIST-87M replaces the dual-audio tower's separate EfficientAT + Whisper-Tiny audio branches with one merged native mn20_as EfficientAT encoder. The result is a smaller deployed path with the same 1280d output contract.

Runtime surface	AIST-87M	AIST-95M dual-audio tower	Delta
Loaded parameters	87,118,774	95,315,959	-8.6%
Safetensors artifact	348.9 MB	381.9 MB	-8.6%
Audio encoders	1	2	removes Whisper branch
Audio encoder parameters	19,886,566	26,117,671	-23.9%
Audio path parameters incl. projection	32,193,126	40,390,311	-20.3%
Audio projection input width	1,280	2,304	-44.4%

Exact-gate tradeoff against the same dual-audio local baseline:

1280d exact-gate slice	AIST-87M	AIST-95M dual-audio tower	Delta
Speech holdout audio-text R@1 avg	0.724	0.582	+0.142
WavCaps FSD audio-text R@1 avg	0.097	0.105	-0.009
SALT audio-text R@1 avg	0.008	0.007	flat
SALT image-audio R@1 avg	0.138	0.148	-0.010

PyTorch audio-stack throughput was measured on an NVIDIA L4 using synthetic 10 s, 32 kHz CPU waveforms passed through waveform -> audio encoder -> projection -> normalized embedding. Median wall time is taken over 50 timed iterations after 20 warmup iterations. The timing excludes audio file decode, dataset download, and MTEB result serialization.

Batch	AIST-87M median ms	AIST-87M throughput	AIST-95M median ms	AIST-95M throughput	Speedup
1	5.36	186.7 clips/s; 1,867 audio-s/s	10.50	95.2 clips/s; 952 audio-s/s	1.96x
8	16.46	486.0 clips/s; 4,860 audio-s/s	60.29	132.7 clips/s; 1,327 audio-s/s	3.66x
16	41.19	388.5 clips/s; 3,885 audio-s/s	133.95	119.4 clips/s; 1,194 audio-s/s	3.25x

Projection-only throughput at feature batch 2048 is also higher for the single-audio path: 314k features/s for AIST-87M vs 282k features/s for the dual-audio tower. Raw benchmark output is included as aist87m_vs_dual_audio_throughput_l4_20260504.json.
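The measurement protocol described above (20 warmup iterations, 50 timed iterations, median wall time) and the conversion from median latency to the throughput columns can be sketched as follows. This is an assumed reconstruction, not the benchmark script that produced the JSON file; `median_wall_ms` and `throughput` are hypothetical names. On a GPU, the timed callable would also need to block until the device has finished (in PyTorch, by calling `torch.cuda.synchronize()`), otherwise only the kernel launch is timed.

```python
import statistics
import time
from typing import Callable, Tuple


def median_wall_ms(fn: Callable[[], object], warmup: int = 20, iters: int = 50) -> float:
    """Median wall-clock time of fn() in milliseconds, after untimed warmup calls."""
    for _ in range(warmup):
        fn()
    samples = []
    for _ in range(iters):
        start = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - start) * 1000.0)
    return statistics.median(samples)


def throughput(batch: int, median_ms: float, clip_seconds: float = 10.0) -> Tuple[float, float]:
    """Convert a median batch latency into (clips/s, audio-seconds/s)."""
    clips_per_s = batch / (median_ms / 1000.0)
    return clips_per_s, clips_per_s * clip_seconds


if __name__ == "__main__":
    # Batch-8 medians from the table: 16.46 ms (AIST-87M) and 60.29 ms (AIST-95M).
    clips, audio_s = throughput(8, 16.46)
    print(f"{clips:.1f} clips/s, {audio_s:.0f} audio-s/s, speedup {60.29 / 16.46:.2f}x")
    # Timing a trivial CPU callable, just to exercise the harness.
    print(f"{median_wall_ms(lambda: sum(range(1000))):.4f} ms")
```

The median is less sensitive than the mean to occasional slow iterations, which is a common reason to report it for latency measurements.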

Architecture

Text   -> mdbr-leaf-ir (768-d) -----------------------> DeepProjectionHead-d2 -> 1280
Image  -> MobileNetV4-Medium (1280-d) ----------------> DeepProjectionHead-d2 -> 1280
Audio  -> merged native EfficientAT mn20_as (1280-d) -> DeepProjectionHead-d2 -> 1280

The audio encoder in this artifact is the merged native checkpoint:

mn20_native_merged_aistmix_audioheavy100k175k175k_continue_from_balanced_20260426T143137Z/latest_model.pt

Parameter Count

Component	Params
Text encoder (`MongoDB/mdbr-leaf-ir`)	22,861,056
Image encoder (`mobilenetv4_conv_medium.e180_r384_in12k`)	8,434,512
Audio encoder (merged native `mn20_as`)	19,886,566
Image projection head	12,306,560
Audio projection head	12,306,560
Text projection head	11,323,520
Total exact loaded params	87,118,774
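The table can be checked with a few lines of arithmetic, which also shows how the `87M` suffix and the audio-path figure in the runtime footprint table follow from the component counts. The dictionary keys are illustrative labels, not names taken from `parameter_breakdown.json`.

```python
# Component parameter counts copied from the table above.
COMPONENTS = {
    "text_encoder": 22_861_056,
    "image_encoder": 8_434_512,
    "audio_encoder": 19_886_566,
    "image_projection": 12_306_560,
    "audio_projection": 12_306_560,
    "text_projection": 11_323_520,
}

total = sum(COMPONENTS.values())
name = f"AIST-{round(total / 1_000_000)}M"
audio_path = COMPONENTS["audio_encoder"] + COMPONENTS["audio_projection"]

print(f"{total:,}")       # 87,118,774
print(name)               # AIST-87M
print(f"{audio_path:,}")  # 32,193,126
```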

Files

File	Purpose
`AIST-87M.safetensors`	Self-contained release artifact
`aist_81m_raw_mn20_lora.yaml`	Training recipe for the source run
`manifest.json`	Release manifest with checksums and eval coverage
`parameter_breakdown.json`	Exact parameter accounting
`aist87m_memory_slice_release_report.md`	Human-memory slice report
`aist87m_memory_slice_release_report.json`	Machine-readable evaluation summary
`aist87m_vs_dual_audio_throughput_l4_20260504.json`	L4 throughput benchmark vs dual-audio tower

Caveats

The model is optimized and reported for memory-relevant embedding surfaces, not broad leaderboard coverage.
The single-audio path is smaller and simpler than the dual-audio tower, but it does not dominate the dual-audio tower on paired diagnostic scores.
The human-memory slice is complete (8 / 8 tasks) at 1280d, 768d, and 512d for the release checkpoint; 256d and 128d results are not reported here.
